2019-11-18  scsi: sd_zbc: Remove set but not used variable 'buflen'  (YueHaibing, 1 file, -2/+0)
Fixes gcc '-Wunused-but-set-variable' warning:

    drivers/scsi/sd_zbc.c: In function 'sd_zbc_check_zones':
    drivers/scsi/sd_zbc.c:341:9: warning: variable 'buflen' set but not used [-Wunused-but-set-variable]

It is not used since commit d9dd73087a8b ("block: Enhance blk_revalidate_disk_zones()"). Reported-by: Hulk Robot <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-11-18  dm thin: wakeup worker only when deferred bios exist  (Jeffle Xu, 1 file, -1/+4)
A single-thread fio test (read, bs=4k, ioengine=libaio, iodepth=128, numjobs=1) over a dm-thin device has poor performance compared to a bare nvme device. Further investigation with perf indicates that queue_work_on() consumes over 20% of CPU time when doing IO over the dm-thin device. The call stack is as follows:

    - 40.57% thin_map
      + 22.07% queue_work_on
      + 9.95% dm_thin_find_block
      + 2.80% cell_defer_no_holder
        1.91% inc_all_io_entry.isra.33.part.34
      + 1.78% bio_detain.isra.35

In cell_defer_no_holder(), wake_worker() is always called, no matter whether the tc->deferred_bio_list list is empty or not. In the single-thread IO model this list is most likely empty, so skip waking up the worker thread when tc->deferred_bio_list is empty. Single-thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%) once the needless wake_worker() calls are properly skipped. Signed-off-by: Jeffle Xu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
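A minimal sketch of the resulting check, following the dm-thin names used in the message (the surrounding locking context is simplified):

    static void cell_defer_no_holder(struct thin_c *tc,
                                     struct dm_bio_prison_cell *cell)
    {
            struct pool *pool = tc->pool;
            unsigned long flags;
            bool has_work;

            spin_lock_irqsave(&tc->lock, flags);
            cell_release_no_holder(pool, cell, &tc->deferred_bio_list);
            /* only wake the worker if there is actually deferred work */
            has_work = !bio_list_empty(&tc->deferred_bio_list);
            spin_unlock_irqrestore(&tc->lock, flags);

            if (has_work)
                    wake_worker(pool);
    }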
2019-11-18  block: Don't disable interrupts in trigger_softirq()  (Sebastian Andrzej Siewior, 1 file, -4/+0)
trigger_softirq() is always invoked as a SMP-function call, which always runs with interrupts disabled. Don't disable interrupts in trigger_softirq() because they are already disabled. Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
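A sketch of the resulting function, assuming the pre-existing blk-softirq completion-list layout; the change is simply that the local_irq_save()/local_irq_restore() pair is gone:

    static void trigger_softirq(void *data)
    {
            struct request *rq = data;
            struct list_head *list;

            /*
             * This runs as a SMP function call, so interrupts are already
             * disabled here; no local_irq_save()/restore() needed.
             */
            list = this_cpu_ptr(&blk_cpu_done);
            list_add_tail(&rq->ipi_list, list);

            if (list->next == &rq->ipi_list)
                    raise_softirq_irqoff(BLOCK_SOFTIRQ);
    }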
2019-11-18  affs: fix a memory leak in affs_remount  (Navid Emamdoost, 1 file, -6/+0)
In affs_remount, if data is provided it is duplicated into new_opts, but the allocated memory for new_opts is only released if parse_options fails. There's a bit of history behind new_opts: originally there was save/replace of options on the VFS layer, so the 'data' passed must not change (thus the strdup); this got cleaned up in later patches, but not completely. There's no reason to do the strdup in cases where the filesystem does not need to reuse 'data' again, because strsep would modify it directly. Fixes: c8f33d0bec99 ("affs: kstrdup() memory handling") Signed-off-by: Navid Emamdoost <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2019-11-18  affs: Replace binary semaphores with mutexes  (Davidlohr Bueso, 2 files, -10/+10)
At a slight footprint cost (24 vs 32 bytes), mutexes are more optimal than semaphores; they are also a nicer interface for mutual exclusion, which is why they are encouraged over binary semaphores when possible. For both i_link_lock and i_ext_lock (and hence i_hash_lock, which I annotated for the hash lock mapping hackery for lockdep), the semantics imply traditional lock ownership; that is, the lock owner is the same for both the lock and unlock operations and does not run in irq context. Therefore it is safe to convert. Signed-off-by: Davidlohr Bueso <[email protected]> Signed-off-by: David Sterba <[email protected]>
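The shape of such a conversion, as a hedged sketch (the field and accessor names follow the message; affs wraps these calls in its own helpers):

    /* before: binary semaphore */
    struct semaphore i_link_lock;
    sema_init(&AFFS_I(inode)->i_link_lock, 1);
    down(&AFFS_I(inode)->i_link_lock);
    up(&AFFS_I(inode)->i_link_lock);

    /* after: mutex with clear owner semantics */
    struct mutex i_link_lock;
    mutex_init(&AFFS_I(inode)->i_link_lock);
    mutex_lock(&AFFS_I(inode)->i_link_lock);
    mutex_unlock(&AFFS_I(inode)->i_link_lock);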
2019-11-18  Merge tag 'kvm-s390-next-5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD  (Paolo Bonzini, 5 files, -24/+38)
KVM: s390: small fixes and enhancements

- selftest improvements
- yield improvements
- cleanups
2019-11-18  btrfs: get bdev from latest_dev for dio bh_result  (David Sterba, 1 file, -2/+4)
To remove use of extent_map::bdev we need to find a replacement, and the latest_bdev is the only one we can use here, because inode::i_bdev and superblock::s_bdev are NULL. The DIO code uses bdev in two places:

* to read the blocksize to perform alignment checks in do_blockdev_direct_IO, but we do them in btrfs code before any call to DIO

* in the following call chain:

    do_direct_IO
      get_more_blocks
        sdio->get_block() <-- this is btrfs_get_blocks_direct

  subsequently the map_bh->b_bdev member is used in clean_bdev_aliases and dio_new_bio to set the bio's bdev to that of the buffer_head. However, because we have provided a submit function, dio_bio_submit calls our submission function and ignores the bdev.

So it's safe to pass any valid bdev that's used within the filesystem. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: assert extent_map bdevs and lookup_map and split  (David Sterba, 2 files, -2/+9)
This is a preparatory patch for removing extent_map::bdev. There's some history behind the code, so this is only a precaution to catch if things break before the actual removal happens. Logically, comparing a raw low-level block device (bdev) does not make sense for extent maps (high-level objects). This had no effect in practice but was quite confusing in the code. The lookup_map is set iff EXTENT_FLAG_FS_MAPPING is set. The two pointers were stored in the same bytes and potentially used in two meanings. Now they're split, so the asserts are in place to check that the condition will not change. The lookup map pointer misused bdev; this was changed in commit 95617d69326c ("btrfs: cleanup, stop casting for extent_map->lookup everywhere") to the explicit type. But the semantics haven't changed and bdev was not actually used to decide if maps are mergeable. Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: remove pointless indentation in btrfs_read_sys_array()  (Johannes Thumshirn, 1 file, -35/+36)
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and then processing it, we can just bail out early if the key read from disk isn't a BTRFS_CHUNK_ITEM_KEY. This removes a level of indentation and makes the code nicer to read. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: reduce indentation in btrfs_may_alloc_data_chunk  (Johannes Thumshirn, 1 file, -20/+21)
In btrfs_may_alloc_data_chunk() we check if the chunk type is BTRFS_BLOCK_GROUP_DATA and, if it is, process it. Instead, we can negate the check and bail out early if it isn't a BTRFS_BLOCK_GROUP_DATA chunk. This makes the code a bit more readable. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
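The general shape of this family of indentation cleanups, as a sketch rather than the exact diff:

    /* before: the whole body nested under the positive check */
    if (map->type & BTRFS_BLOCK_GROUP_DATA) {
            /* ... long body processing the data chunk ... */
    }
    return 0;

    /* after: negate the condition and bail out early */
    if (!(map->type & BTRFS_BLOCK_GROUP_DATA))
            return 0;
    /* ... same body, one indentation level shallower ... */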
2019-11-18  btrfs: remove pointless local variable in lock_stripe_add()  (Johannes Thumshirn, 1 file, -2/+3)
In lock_stripe_add() we cache the bucket for the stripe hash table just for a single call to dereference the stripe hash. If we directly call rbio_bucket() we can drop the pointless local variable. Also move the dereferencing of the stripe hash outside of the variable declaration block to not break the 80 character limit. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: raid56: reduce indentation in lock_stripe_add  (Johannes Thumshirn, 1 file, -47/+44)
In lock_stripe_add() we traverse the stripe hash list and check if the current list element's raid_map is equal to the raid bio's raid_map. If both are equal we continue processing. If we check for inequality instead of equality we can reduce one level of indentation. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: tracepoints: constify all pointers  (David Sterba, 1 file, -26/+26)
We don't modify the data passed to tracepoints; some of the declarations are already const, add it to the rest. Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: tracepoints: drop typecasts from printk  (David Sterba, 1 file, -15/+13)
Remove typecasts from trace printk, adjust types and move the typecast to the assignment if necessary. When assigning, the types are more obvious compared to matching the variables to the format strings. Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Return offset from find_desired_extent  (Nikolay Borisov, 1 file, -15/+19)
Instead of using an input pointer parameter as the return value and having an int as the return type of find_desired_extent, rework the function to directly return the found offset. Doing that, the 'ret' variable in btrfs_llseek_file can be removed. An additional (subjective) benefit is that btrfs' llseek function now resembles those of the other major filesystems. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Simplify btrfs_file_llseek  (Nikolay Borisov, 1 file, -7/+3)
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly returning from generic_file_llseek. This makes the 'out' label redundant. Finally, directly return the value from vfs_setpos. No semantic changes. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Speed up btrfs_file_llseek  (Nikolay Borisov, 1 file, -16/+10)
Modifying the file position is done on a per-file basis. This renders holding the inode lock for writing useless and makes the performance of concurrent llseek's abysmal. Fix this by holding the inode lock for reading. This provides protection against concurrent truncates, and find_desired_extent already includes proper extent locking for the range, which ensures proper locking against concurrent writes. SEEK_CUR and SEEK_END can be done locklessly. The former is synchronized by the file::f_lock spinlock. SEEK_END is not synchronized, but it is atomic, and that's OK since there is no guarantee that SEEK_END will always be at the end of the file in the face of tail modifications. This change brings ~82% performance improvement when doing a lot of parallel fseeks. The workload essentially does:

    for (d = 0; d < num_seek_read; d++) {
            /* offset %= 16777216; */
            fseek(f, 256 * d % 16777216, SEEK_SET);
            fread(buffer, 64, 1, f);
    }

Without patch:

    num workprocesses = 16
    num fseek/fread = 8000000
    step = 256
    fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    real    0m41.412s
    user    0m28.777s
    sys     2m16.510s

With patch:

    num workprocesses = 16
    num fseek/fread = 8000000
    step = 256
    fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    real    0m11.479s
    user    0m27.629s
    sys     0m21.040s

Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
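Taken together with the two preceding llseek patches, the resulting function plausibly looks like the sketch below, a hedged reconstruction from the three commit messages rather than the verbatim diff:

    static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
    {
            struct inode *inode = file->f_mapping->host;

            switch (whence) {
            default:
                    /* SEEK_SET/SEEK_CUR/SEEK_END need no extent locking */
                    return generic_file_llseek(file, offset, whence);
            case SEEK_DATA:
            case SEEK_HOLE:
                    /* shared lock: only protects against concurrent truncate */
                    inode_lock_shared(inode);
                    offset = find_desired_extent(inode, offset, whence);
                    inode_unlock_shared(inode);
                    break;
            }

            if (offset < 0)
                    return offset;

            return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
    }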
2019-11-18  btrfs: compression: remove ops pointer from workspace_manager  (David Sterba, 2 files, -5/+2)
We can infer the ops from the type that is now passed to all functions that would need it; this makes workspace_manager::ops redundant and it can be removed. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline free_workspace  (David Sterba, 5 files, -8/+18)
Replace indirect calls to free_workspace with a switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: pass type to btrfs_put_workspace  (David Sterba, 2 files, -8/+7)
We can infer the workspace_manager from the type, and the type will be used in the following patch to call a common helper for free_workspace. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline alloc_workspace  (David Sterba, 5 files, -8/+18)
Replace indirect calls to alloc_workspace with a switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: pass type to btrfs_get_workspace  (David Sterba, 3 files, -10/+7)
We can infer the workspace_manager from the type, and the type will be used in the following patch to call a common helper for alloc_workspace. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline put_workspace  (David Sterba, 5 files, -24/+15)
Similar to get_workspace, the majority of the callbacks are trivial and we don't gain anything by the indirection, so replace them with a switch function. Trivial callback implementations use the helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline get_workspace  (David Sterba, 5 files, -18/+15)
The majority of the callbacks are trivial and we don't gain anything by the indirection, so replace them with a switch function. ZLIB needs to adjust the level in the callback and ZSTD workspace management is complex; the rest are calls to the helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: export alloc/free/get/put callbacks of all algos  (David Sterba, 4 files, -15/+26)
The indirect calls will be replaced by a switch in compression.c. (A switch is faster than indirect calls when Spectre mitigations are enabled.) Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline cleanup_workspace_manager  (David Sterba, 5 files, -28/+7)
Replace the loop calling into all algos with a list of direct calls to the cleanup manager callback. Where that callback becomes trivial, it is replaced by a direct call to the helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: let workspace manager cleanup take only the type  (David Sterba, 4 files, -5/+7)
With access to the workspace structures, we can look them up together with the compression ops inside the workspace manager cleanup helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: inline init_workspace_manager  (David Sterba, 5 files, -28/+7)
Replace the loop calling into all algos with a list of direct calls to the init manager callback. Where that callback becomes trivial, it is replaced by a direct call to the helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: let workspace manager init take only the type  (David Sterba, 4 files, -7/+7)
With access to the workspace structures, we can look them up together with the compression ops inside the workspace manager init helper. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: compression: attach workspace manager to the ops  (David Sterba, 5 files, -0/+6)
There's a lot of indirection when the generic code calls into algo-specific callbacks to reach the private workspace manager structure and back to the generic code. To simplify that, export the workspace manager for heuristic, LZO and ZLIB, while ZSTD is going to use its own manager. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: switch compression callbacks to direct calls  (David Sterba, 5 files, -34/+69)
The indirect calls bring some overhead due to spectre vulnerability mitigations. The number of cases is small and below the threshold (10-20) where an indirect call would be better. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
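A sketch of the direct-call dispatch this series converges on, using the per-algo callback names the series exports (argument lists abbreviated to the usual compress_pages signature):

    switch (type) {
    case BTRFS_COMPRESS_ZLIB:
            return zlib_compress_pages(ws, mapping, start, pages,
                                       out_pages, total_in, total_out);
    case BTRFS_COMPRESS_LZO:
            return lzo_compress_pages(ws, mapping, start, pages,
                                      out_pages, total_in, total_out);
    case BTRFS_COMPRESS_ZSTD:
            return zstd_compress_pages(ws, mapping, start, pages,
                                       out_pages, total_in, total_out);
    default:
            /*
             * The type is validated long before we get here, so a stray
             * value is a logic bug.
             */
            BUG();
    }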
2019-11-18  btrfs: export compression and decompression callbacks  (David Sterba, 4 files, -36/+45)
Export compress_pages, decompress_bio and decompress callbacks for all compression algos. The indirect calls will be replaced by a switch. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: use btrfs_block_group_cache_done in update_block_group  (Josef Bacik, 1 file, -1/+1)
When freeing extents in a block group we check to see if the block group is not cached, and then cache it if we need to. However we'll just carry on as long as we're loading the cache. This is problematic because we are dirtying the block group here. If we are fast enough we could do a transaction commit and clear the free space cache while we're still loading the space cache in another thread. This truncates the free space inode, which will keep it from loading the space cache. Fix this by using the btrfs_block_group_cache_done helper so that we try to load the space cache unconditionally here, which will result in the caller waiting for the fast caching to complete and keep us from truncating the free space inode. CC: [email protected] # 4.4+ Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: check page->mapping when loading free space cache  (Josef Bacik, 1 file, -0/+6)
While testing 5.2 we ran into the following panic:

    [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
    [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
    [52238.304051] Call Trace:
    [52238.308958]  try_to_free_buffers+0x15b/0x1b0
    [52238.317503]  shrink_page_list+0x1164/0x1780
    [52238.325877]  shrink_inactive_list+0x18f/0x3b0
    [52238.334596]  shrink_node_memcg+0x23e/0x7d0
    [52238.342790]  ? do_shrink_slab+0x4f/0x290
    [52238.350648]  shrink_node+0xce/0x4a0
    [52238.357628]  balance_pgdat+0x2c7/0x510
    [52238.365135]  kswapd+0x216/0x3e0
    [52238.371425]  ? wait_woken+0x80/0x80
    [52238.378412]  ? balance_pgdat+0x510/0x510
    [52238.386265]  kthread+0x111/0x130
    [52238.392727]  ? kthread_create_on_node+0x60/0x60
    [52238.401782]  ret_from_fork+0x1f/0x30

The page we were trying to drop had a page->private, but had no page->mapping and so called drop_buffers, assuming that we had a buffer_head on the page, and then panic'ed trying to deref 1, which is our page->private for data pages. This is happening because we're truncating the free space cache while we're trying to load the free space cache. This isn't supposed to happen, and I'll fix that in a followup patch. However we still shouldn't allow those sorts of mistakes to result in messing with pages that do not belong to us. So add the page->mapping check to verify that we still own this page after dropping and re-acquiring the page lock. This page is being unlocked as:

    btrfs_readpage
      extent_read_full_page
        __extent_read_full_page
          __do_readpage
            if (!nr)
              unlock_page  <-- nr can be 0 only if submit_extent_page returns an error

CC: [email protected] # 4.4+ Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> [ add callchain ] Signed-off-by: David Sterba <[email protected]>
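A hedged sketch of the added check, assuming it sits in the free space cache's page preparation loop right after the page is re-locked; the error-path cleanup here is illustrative:

    lock_page(page);
    /*
     * The page was unlocked during readpage and could have been
     * truncated away in the meantime: verify we still own it.
     */
    if (page->mapping != inode->i_mapping) {
            btrfs_err(BTRFS_I(inode)->root->fs_info,
                      "free space cache page truncated");
            unlock_page(page);
            put_page(page);
            return -EIO;
    }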
2019-11-18  Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc  (Filipe Manana, 1 file, -1/+5)
In the fixup worker, if we fail to mark the range as delalloc in the io tree, we must release the previously reserved metadata, as well as update the outstanding extents counter for the inode, otherwise we leak metadata space. In practice we can't return an error from btrfs_set_extent_delalloc(), which is just a wrapper around __set_extent_bit(), as for most errors __set_extent_bit() does a BUG_ON() (or panics, which hits a BUG_ON() as well), and returning an -EEXIST error doesn't happen in this case since the exclusive bits parameter always has a value of 0 through this code path. Nevertheless, just fix the error handling in the fixup worker, in case one day __set_extent_bit() can return an error to this code path. Fixes: f3038ee3a3f101 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker") CC: [email protected] # 4.19+ Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  Btrfs: fix negative subv_writers counter and data space leak after buffered write  (Filipe Manana, 1 file, -1/+1)
When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting, with a negative value. This happens in an exceptional case like the following:

1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group;

2) Because of that failure, we try to go to NOCOW mode, which succeeds, and therefore we set the local variable 'only_release_metadata' to true and set the root's subv_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow();

3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible;

4) No pages are copied because btrfs_copy_from_user() returned zero;

5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0;

6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero;

7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile);

8) After this, if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true';

9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space().

As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration, as in the sketch below. Fixes: 8257b2dc3c1a ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: 7ee9e4405f26 ("Btrfs: check if we can nocow if we don't have data space") CC: [email protected] # 4.4+ Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
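A minimal sketch of the fix in the buffered write's main loop (the rest of the iteration is elided; the loop condition follows the usual iov_iter pattern):

    while (iov_iter_count(i) > 0) {
            /*
             * Reset on every iteration: a previous iteration may have
             * taken the NOCOW path and bailed out before clearing it.
             */
            only_release_metadata = false;

            /* ... data/metadata reservation and page copying ... */
    }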
2019-11-18  btrfs: ioctl: Try to use btrfs_fs_info instead of *file  (Marcos Paulo de Souza, 1 file, -21/+15)
Some functions go through unnecessary indirection to reach the btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info struct instead of a *file. Reviewed-by: Anand Jain <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Marcos Paulo de Souza <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: use bool argument in free_root_pointers()  (Anand Jain, 1 file, -5/+5)
We don't need an int argument in free_root_pointers(), a bool will do. Also rename the argument, as it confused two people. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: use better definition of number of compression type  (Chengguang Xu, 4 files, -5/+7)
The compression type upper limit constant is the same as the last value and this is confusing. In order to keep the coding style consistent, use BTRFS_NR_COMPRESS_TYPES as the total number, following the idiom of 'NR' being one more than the last value. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
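The resulting definition plausibly looks like the following, a sketch matching the 'NR' idiom the message describes:

    enum btrfs_compression_type {
            BTRFS_COMPRESS_NONE  = 0,
            BTRFS_COMPRESS_ZLIB  = 1,
            BTRFS_COMPRESS_LZO   = 2,
            BTRFS_COMPRESS_ZSTD  = 3,
            /* one past the last valid type, usable as an array size */
            BTRFS_NR_COMPRESS_TYPES = 4,
    };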
2019-11-18  btrfs: use enum for extent type defines  (Chengguang Xu, 2 files, -6/+8)
Use enum to replace macro definitions of extent types. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: props: remove unnecessary hash_init()  (Chengguang Xu, 1 file, -2/+0)
DEFINE_HASHTABLE itself already includes initialization code; we don't have to call hash_init() again, so remove it. Signed-off-by: Chengguang Xu <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
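For reference, the kernel's DEFINE_HASHTABLE macro (include/linux/hashtable.h) initializes every bucket at definition time, which is why a later hash_init() is redundant for statically defined tables:

    #define DEFINE_HASHTABLE(name, bits)                                    \
            struct hlist_head name[1 << (bits)] =                           \
                    { [0 ... ((1 << (bits)) - 1)] = HLIST_HEAD_INIT }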
2019-11-18  btrfs: Rename btrfs_join_transaction_nolock  (Nikolay Borisov, 3 files, -8/+8)
This function is used only during the final phase of freespace cache writeout. This is necessary since using the plain btrfs_join_transaction api is deadlock prone. The deadlock looks like:

    T1:
      btrfs_commit_transaction
        commit_cowonly_roots
          btrfs_write_dirty_block_groups
            btrfs_wait_cache_io
              __btrfs_wait_cache_io
                btrfs_wait_ordered_range <-- Triggers ordered IO for the freespace
                                             inode and blocks transaction commit
                                             until freespace cache writeout

    T2: <-- after T1 has triggered the writeout
      finish_ordered_fn
        btrfs_finish_ordered_io
          btrfs_join_transaction <-- this would block waiting for the current
                                     transaction to commit, but the transaction
                                     commit is waiting for this writeout to finish

The special-purpose function prevents this by simply skipping the "wait for writeout", since it's guaranteed the transaction won't proceed until we are done. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Use assert to document transaction requirement  (Nikolay Borisov, 1 file, -3/+2)
Using an ASSERT in btrfs_pin_extent allows us to more stringently observe whether the function is called under a transaction or not. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: opencode extent_buffer_get  (David Sterba, 7 files, -19/+14)
The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Avoid getting stuck during cyclic writebacks  (Tejun Heo, 1 file, -11/+1)
During a cyclic writeback, extent_write_cache_pages() uses done_index to update the writeback_index after the current run is over. However, instead of the current index + 1, it gets set to the current index itself. Unfortunately, this, combined with returning on EOF instead of looping back, can lead to the following pathological behavior:

1. There is a single file which has accumulated enough dirty pages to trigger balance_dirty_pages() and the writer is appending to the file with a series of short writes.

2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.

3. Writeback kicks in and the cursor is on the last page of the dirty file. Writeback is started, or skipped if already in progress. As it's EOF, extent_write_cache_pages() returns and the cursor is set to done_index, which is pointing to the last page.

4. Writeback is done. Nothing happens till balance_dirty_pages finishes, at which point we go back to #1.

This can almost completely stall out writing back of the file and keep the system over the dirty threshold for a long time, which can mess up the whole system. We encountered this issue in production with a package handling application which can reliably reproduce the issue when running under tight memory limits. Reading the comment in the error handling section, this seems to be there to avoid accidentally skipping a page in case the write attempt on the page doesn't succeed. However, this concern seems bogus. On each page, the code either:

* Skips and moves onto the next page.

* Fails issue and sets done_index to index + 1.

* Successfully issues and continues to the next page if budget allows and it's not EOF.

IOW, as long as it's not EOF and there's budget, the code never retries writing back the same page. Only when a page happens to be the last page of a particular run do we end up retrying the page, which can't possibly guarantee anything data integrity related. Besides, cyclic writes are only used for non-syncing writebacks, meaning that there's no data integrity implication to begin with. Fix it by always setting done_index past the current page being processed. Note that this problem exists in other writepages too. CC: [email protected] # 4.19+ Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
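The fix itself amounts to a one-liner in extent_write_cache_pages(); a hedged sketch in diff form, matching the "set done_index past the current page" description:

    -                       done_index = page->index;
    +                       done_index = page->index + 1;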
2019-11-18  btrfs: block-group: Rework documentation of check_system_chunk function  (Marcos Paulo de Souza, 1 file, -3/+1)
Commit 4617ea3a52cf ("Btrfs: fix necessary chunk tree space calculation when allocating a chunk") removed the is_allocation argument from check_system_chunk, since the formula for reserving the necessary space for allocating or for removing a chunk is the same. So, rework the comment by removing the mention of the is_allocation argument. Signed-off-by: Marcos Paulo de Souza <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: Enhance error output for write time tree checker  (Qu Wenruo, 1 file, -0/+2)
Unlike read time tree checker errors, write time errors can't be inspected by "btrfs inspect dump-tree", so we need extra information to determine what's going wrong. The patch adds the following output for write time tree checker errors:

- The content of the offending tree block, to help determine if it's a false alert.

- A kernel WARN_ON() for debug builds. This is helpful for us to detect unexpected write time tree checker errors, especially since fstests could catch the dmesg. Since the WARN_ON() is only triggered for the write time tree checker, test cases utilizing dm-error won't trigger this WARN_ON(), thus no extra noise.

Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: tree-checker: Refactor prev_key check for ino into a function  (Qu Wenruo, 1 file, -41/+72)
Refactor the check for prev_key->objectid of the following key types into one function, check_prev_ino():

- EXTENT_DATA
- INODE_REF
- DIR_INDEX
- DIR_ITEM
- XATTR_ITEM

Also add the check of prev_key for INODE_REF. A sketch of the helper follows. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
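A hedged sketch of the shared helper, assuming the tree-checker's usual generic_err() reporting helper and -EUCLEAN convention:

    static int check_prev_ino(struct extent_buffer *leaf, struct btrfs_key *key,
                              int slot, struct btrfs_key *prev_key)
    {
            /* No prev key, skip the check */
            if (slot == 0)
                    return 0;

            /* Only these key types need the cross-item objectid check */
            ASSERT(key->type == BTRFS_XATTR_ITEM_KEY ||
                   key->type == BTRFS_INODE_REF_KEY ||
                   key->type == BTRFS_DIR_INDEX_KEY ||
                   key->type == BTRFS_DIR_ITEM_KEY ||
                   key->type == BTRFS_EXTENT_DATA_KEY);

            if (key->objectid == prev_key->objectid)
                    return 0;

            /* Error found */
            generic_err(leaf, slot,
                    "invalid previous key objectid, have %llu expect %llu",
                    prev_key->objectid, key->objectid);
            return -EUCLEAN;
    }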
2019-11-18  Btrfs: extent_write_locked_range() should attach inode->i_wb  (Chris Mason, 1 file, -4/+6)
extent_write_locked_range() is used when we're falling back to buffered IO from inside of compression. It allocates its own wbc and should associate it with the inode's i_wb to make sure the IO goes down from the correct cgroup. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]> Signed-off-by: David Sterba <[email protected]>
2019-11-18  Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios  (Chris Mason, 5 files, -5/+46)
Async CRCs and compression submit IO through helper threads, which means they have IO priority inversions when cgroup IO controllers are in use. This flags all of the writes submitted by btrfs helper threads as REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg work items to avoid the priority inversion. For the compression code, we take a reference on the wbc's blkg css and pass it down to the async workers. For the async CRCs, the bio already has the correct css, we just need to tell the block layer to use REQ_CGROUP_PUNT. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Chris Mason <[email protected]> Modified-and-reviewed-by: Tejun Heo <[email protected]> Signed-off-by: David Sterba <[email protected]>
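A minimal sketch of flagging a helper-thread bio, assuming the async path carries the originating blkcg css in a variable named blkcg_css (the name here is illustrative):

    if (blkcg_css) {
            /*
             * Punt submission to a dedicated per-blkcg worker so the
             * helper thread doesn't suffer a priority inversion.
             */
            bio->bi_opf |= REQ_CGROUP_PUNT;
            bio_associate_blkg_from_css(bio, blkcg_css);
    }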