path: root/fs/btrfs/delayed-inode.c
2020-03-23  btrfs: add wrapper for transaction abort predicate  (David Sterba, 1 file, -1/+1)

The status of an aborted transaction can change between calls, so it needs to be accessed by READ_ONCE. Add a helper that also wraps the unlikely hint.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
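In rough shape such a wrapper is a one-liner; the sketch below assumes the transaction handle keeps the abort status in an 'aborted' field (the macro and field names are illustrative, not quoted from this log):

  /* read the racy status exactly once and hint the cold branch */
  #define TRANS_ABORTED(trans)  (unlikely(READ_ONCE((trans)->aborted)))

  /* usage sketch */
  if (TRANS_ABORTED(trans))
          return -EIO;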
2020-03-23  btrfs: use the file extent tree infrastructure  (Josef Bacik, 1 file, -0/+3)

We want to use this everywhere we modify the file extent items permanently. These include:

1) Inserting new file extents for writes and prealloc extents.
2) Truncating inode items.
3) btrfs_cont_expand().
4) Insert inline extents.
5) Insert new extents from log replay.
6) Insert a new extent for clone, as it could be past i_size.
7) Hole punching.

For hole punching in particular it might seem it's not necessary, because anybody extending the file would use btrfs_cont_expand(); however there is a corner case that can still give us trouble. Start with an empty file and

  fallocate KEEP_SIZE 1M-2M

We now have a 0-length file, a hole file extent from 0-1M, and a prealloc extent from 1M-2M. Now punch 1M-1.5M. Because this is past i_size we have

  [HOLE EXTENT][ NOTHING ][PREALLOC]
  [0        1M][1M  1.5M][1.5M   2M]

with an i_size of 0. Now if we pwrite 0-1.5M we'll increase our i_size to 1.5M, but our disk_i_size is still 0 until the ordered extent completes. However if we now immediately truncate the file to 2M we'll just call btrfs_cont_expand(inode, 1.5M, 2M), since our old i_size is 1.5M. If we commit the transaction here and crash, we'll expose the gap.

To fix this we need to clear the file extent mapping for the range that we punched but didn't insert a corresponding file extent for. This way the truncate will only get a disk_i_size set to 1M if we crash before the ordered io finishes.

I've written an xfstest to reproduce the problem and validate this fix.

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
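A userspace sketch of the reproducer sequence described above; the mount path, buffer handling and the omitted error checks are assumptions (the actual test is the xfstest mentioned in the message):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define MiB (1024 * 1024)

  int main(void)
  {
          int fd = open("/mnt/btrfs/file", O_RDWR | O_CREAT, 0644);
          char buf[4096] = { 0 };
          off_t off;

          /* prealloc 1M-2M with KEEP_SIZE: i_size stays 0 */
          fallocate(fd, FALLOC_FL_KEEP_SIZE, 1 * MiB, 1 * MiB);
          /* punch 1M-1.5M: an unmapped gap entirely past i_size */
          fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    1 * MiB, MiB / 2);
          /* write 0-1.5M: i_size becomes 1.5M, disk_i_size still 0
           * until the ordered extent finishes */
          for (off = 0; off < 3 * MiB / 2; off += sizeof(buf))
                  pwrite(fd, buf, sizeof(buf), off);
          /* truncate to 2M goes through btrfs_cont_expand(); a crash
           * after a transaction commit here could expose the gap */
          ftruncate(fd, 2 * MiB);
          close(fd);
          return 0;
  }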
2019-11-18  btrfs: use refcount_inc_not_zero in kill_all_nodes  (Josef Bacik, 1 file, -3/+10)

We hit the following warning while running down a different problem:

  [ 6197.175850] ------------[ cut here ]------------
  [ 6197.185082] refcount_t: underflow; use-after-free.
  [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
  [ 6197.521792] Call Trace:
  [ 6197.526687]  __btrfs_release_delayed_node+0x76/0x1c0
  [ 6197.536615]  btrfs_kill_all_delayed_nodes+0xec/0x130
  [ 6197.546532]  ? __btrfs_btree_balance_dirty+0x60/0x60
  [ 6197.556482]  btrfs_clean_one_deleted_snapshot+0x71/0xd0
  [ 6197.566910]  cleaner_kthread+0xfa/0x120
  [ 6197.574573]  kthread+0x111/0x130
  [ 6197.581022]  ? kthread_create_on_node+0x60/0x60
  [ 6197.590086]  ret_from_fork+0x1f/0x30
  [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]---

This is because the free side drops the ref without the lock, and then takes the lock if our refcount is 0. So you can have nodes on the tree that have a refcount of 0. Fix this by zeroing out that element in our temporary array so we don't try to kill it again.

CC: [email protected] # 4.14+
Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
[ add comment ]
Signed-off-by: David Sterba <[email protected]>
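A sketch of the fixed pass over the temporary array; the names follow the call trace above, but the exact surrounding code is an assumption:

  spin_lock(&root->inode_lock);
  n = radix_tree_gang_lookup(&root->delayed_nodes_tree,
                             (void **)delayed_nodes, inode_id,
                             ARRAY_SIZE(delayed_nodes));
  for (i = 0; i < n; i++) {
          /*
           * The free side drops its last ref without inode_lock, so a
           * node still on the tree may already have refs == 0.  If
           * inc_not_zero fails, NULL the slot so the kill loop below
           * skips it instead of underflowing the refcount.
           */
          if (!refcount_inc_not_zero(&delayed_nodes[i]->refs))
                  delayed_nodes[i] = NULL;
  }
  spin_unlock(&root->inode_lock);

  for (i = 0; i < n; i++) {
          if (!delayed_nodes[i])
                  continue;
          __btrfs_kill_delayed_node(delayed_nodes[i]);
          btrfs_release_delayed_node(delayed_nodes[i]);
  }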
2019-11-18  btrfs: move btrfs_unlock_up_safe to other locking functions  (David Sterba, 1 file, -0/+1)

The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2019-11-18  btrfs: get rid of unique workqueue helper functions  (Omar Sandoval, 1 file, -2/+2)

Commit 9e0af2376434 ("Btrfs: fix task hang under heavy compressed write") worked around the issue that a recycled work item could get a false dependency on the original work item due to how the workqueue code guarantees non-reentrancy. It did so by giving different work functions to different types of work.

However, the fixes in the previous few patches are more complete, as they prevent a work item from being recycled at all (except for a tiny window that the kernel workqueue code handles for us). This obsoletes the previous fix, so we don't need the unique helpers for correctness. The only other reason to keep them would be so they show up in stack traces, but they always seem to be optimized to a tail call, so they don't show up anyway. So, let's just get rid of the extra indirection.

While we're here, rename normal_work_helper() to the more informative btrfs_work_helper().

Reviewed-by: Nikolay Borisov <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2019-09-09  btrfs: move cond_wake_up functions out of ctree  (David Sterba, 1 file, -0/+1)

The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific.

Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2019-09-09  btrfs: only reserve metadata_size for inodes  (Josef Bacik, 1 file, -1/+1)

Historically we reserved worst case for every btree operation, and generally speaking we want to do that in cases where it could be the worst case. However for updating inodes we know the inode items are already in the tree, so it will only be an update operation and never an insert operation. This allows us to always reserve only the metadata_size amount for inode updates rather than the insert_metadata_size amount.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2019-09-09  btrfs: rename the btrfs_calc_*_metadata_size helpers  (Josef Bacik, 1 file, -2/+2)

btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However, changing existing items will never split nodes either, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also, btrfs_calc_trans_metadata_size is purely for inserting items, so rename it to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
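Roughly what the two renamed helpers compute; the constants below reflect the usual worst-case reasoning (one full tree path per item, doubled when splits are possible) and are an assumption, not quoted from this log:

  /* insertions can split a node at every level: double the path cost */
  static inline u64 btrfs_calc_insert_metadata_size(struct btrfs_fs_info *fs_info,
                                                    unsigned int num_items)
  {
          return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
  }

  /* updates and truncates never split: one path's worth per item */
  static inline u64 btrfs_calc_metadata_size(struct btrfs_fs_info *fs_info,
                                             unsigned int num_items)
  {
          return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
  }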
2019-09-09  btrfs: delayed-inode: Kill the BUG_ON() in btrfs_delete_delayed_dir_index()  (Qu Wenruo, 1 file, -2/+11)

There is one report of a fuzzed image which leads to BUG_ON() in btrfs_delete_delayed_dir_index(). Although that fuzzed image can already be addressed by the enhanced extent-tree error handler, it's still better to hunt down more BUG_ON()s.

This patch will hunt down two BUG_ON()s in btrfs_delete_delayed_dir_index():

- One for an error from btrfs_delayed_item_reserve_metadata(). Instead of BUG_ON(), we output an error message, free the item, and return the error. All callers of this function handle the error by aborting the current transaction.

- One for a possible EEXIST from __btrfs_add_delayed_deletion_item(). That function can return -EEXIST. We already have a good enough error message for that; we only need to clean up the reserved metadata space and the allocated item.

To help the above cleanup, also modify __btrfs_remove_delayed_item(), called in btrfs_release_delayed_item(), to skip unassociated items.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203253
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
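A sketch of the first conversion; the message text and the cleanup path are illustrative assumptions, not the patch itself:

  ret = btrfs_delayed_item_reserve_metadata(trans, root, item);
  if (ret < 0) {
          /* was BUG_ON(ret): report and hand the error back so the
           * caller can abort the transaction instead of crashing */
          btrfs_err(trans->fs_info,
                    "metadata reservation failed %d for delayed dir index deletion",
                    ret);
          btrfs_release_delayed_item(item);
          goto end;
  }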
2019-04-29  btrfs: get fs_info from eb in btrfs_leaf_free_space  (David Sterba, 1 file, -2/+1)

We can read fs_info from the extent buffer and can drop it from the parameters.

Signed-off-by: David Sterba <[email protected]>
2019-04-29  btrfs: use common file type conversion  (Phillip Potter, 1 file, -1/+1)

Deduplicate the btrfs file type conversion implementation: file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions declared in fs_types.h and implemented in fs_types.c. The common implementation can be found via commit bbe7449e2599 ("fs: common implementation of file type").

Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Phillip Potter <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-10-15  Btrfs: kill btrfs_clear_path_blocking  (Liu Bo, 1 file, -3/+0)

Btrfs's btree locking has two modes, spinning mode and blocking mode. While searching the btree, locking is always acquired in spinning mode and then converted to blocking mode if necessary, and in some hot paths we may switch the locking back to spinning mode by btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for blocking readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch nodes in the path to blocking mode first (by btrfs_set_path_blocking) to make lockdep happy before doing its actual job of clearing the blocking state. Switching to blocking mode from spinning mode consists of step 1) bumping up the blocking readers counter and step 2) read_unlock()/write_unlock(). This has caused a serious ping-pong effect if there is a great amount of concurrent readers/writers, as waiters are woken up only to go back to sleep immediately.

1) Killing this kind of ping-pong results in a big improvement in my 1600k files creation script:

  MNT=/mnt/btrfs
  mkfs.btrfs -f /dev/sdf
  mount /dev/sdf $MNT
  time fsmark -D 10000 -S0 -n 100000 -s 0 -L 1 -l /tmp/fs_log.txt \
       -d $MNT/0  -d $MNT/1  -d $MNT/2  -d $MNT/3 \
       -d $MNT/4  -d $MNT/5  -d $MNT/6  -d $MNT/7 \
       -d $MNT/8  -d $MNT/9  -d $MNT/10 -d $MNT/11 \
       -d $MNT/12 -d $MNT/13 -d $MNT/14 -d $MNT/15

  w/o patch:
  real    2m27.307s
  user    0m12.839s
  sys     13m42.831s

  w/ patch:
  real    1m2.273s
  user    0m15.802s
  sys     8m16.495s

1.1) Latency histogram from funclatency [1]. Overall with the patch there is ~50% less write lock acquisition, and the 95% max latency that the write lock takes drops to ~100ms from >500ms.

  w/o patch:

  Function = btrfs_tree_lock
      msecs        : count
      0 -> 1       : 2385222
      2 -> 3       : 37147
      4 -> 7       : 20452
      8 -> 15      : 13131
      16 -> 31     : 3877
      32 -> 63     : 3900
      64 -> 127    : 2612
      128 -> 255   : 974
      256 -> 511   : 165
      512 -> 1023  : 13

  Function = btrfs_tree_read_lock
      msecs        : count
      0 -> 1       : 6743860
      2 -> 3       : 2146
      4 -> 7       : 190
      8 -> 15      : 38
      16 -> 31     : 4

  w/ patch:

  Function = btrfs_tree_lock
      msecs        : count
      0 -> 1       : 1318454
      2 -> 3       : 6800
      4 -> 7       : 3664
      8 -> 15      : 2145
      16 -> 31     : 809
      32 -> 63     : 219
      64 -> 127    : 10

  Function = btrfs_tree_read_lock
      msecs        : count
      0 -> 1       : 6854317
      2 -> 3       : 2383
      4 -> 7       : 601
      8 -> 15      : 92

2) dbench also proves the improvement (dbench -t 120 -D /mnt/btrfs 16):

  w/o patch: Throughput 158.363 MB/sec
  w/ patch:  Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set path->leave_spinning to have all nodes in the path stay in spinning mode, which means callers are ready to not sleep before releasing the path, but it won't cause problems if they don't want to sleep in blocking mode.

[1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

Signed-off-by: Liu Bo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-10-15  Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root  (Liu Bo, 1 file, -13/+16)

rb_first_cached() trades an extra "leftmost" pointer for doing the same job as rb_first() but in O(1). Functions manipulating delayed_item need to get the first entry, so this converts them to use rb_first_cached(). For more details about the optimization see the patch "Btrfs: delayed-refs: use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <[email protected]>
Signed-off-by: Liu Bo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
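A generic sketch of the rb_root_cached API this conversion moves to; struct my_item is a stand-in for the delayed item, not the real type:

  #include <linux/rbtree.h>

  struct my_item {
          struct rb_node node;
          u64 key;
  };

  static void my_insert(struct rb_root_cached *root, struct my_item *item)
  {
          struct rb_node **p = &root->rb_root.rb_node, *parent = NULL;
          bool leftmost = true;

          while (*p) {
                  parent = *p;
                  if (item->key < rb_entry(parent, struct my_item, node)->key) {
                          p = &parent->rb_left;
                  } else {
                          p = &parent->rb_right;
                          leftmost = false;  /* not the new minimum */
                  }
          }
          rb_link_node(&item->node, parent, p);
          /* the extra 'leftmost' argument keeps the cached pointer fresh */
          rb_insert_color_cached(&item->node, root, leftmost);
  }

  /*
   * rb_first_cached(root) then returns the smallest entry in O(1);
   * removals must use rb_erase_cached(&item->node, root) so the cached
   * leftmost pointer stays valid.
   */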
2018-10-15  btrfs: Remove 'objectid' member from struct btrfs_root  (Misono Tomohiro, 1 file, -2/+3)

There are two members in struct btrfs_root which indicate the root's objectid: objectid and root_key.objectid. They are both set to the same value in __setup_root():

  static void __setup_root(struct btrfs_root *root,
                           struct btrfs_fs_info *fs_info,
                           u64 objectid)
  {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objectid;
          ...
  }

and not changed to another value after initialization. grep in the btrfs directory shows both are used in many places:

  $ grep -rI "root->root_key.objectid" | wc -l
  133
  $ grep -rI "root->objectid" | wc -l
  55

(4.17, incl. some noise)

It is confusing to have two similar variable names, and it seems that there is no rule about which should be used in a certain case. Since ->root_key itself is needed for the tree reloc tree, let's remove the 'objectid' member and unify the code to use ->root_key.objectid in all places.

Signed-off-by: Misono Tomohiro <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-10-15  btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes  (Lu Fengqi, 1 file, -2/+2)

Using true and false here is closer to the expected semantic than using 0 and 1. No functional change.

Signed-off-by: Lu Fengqi <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-08-06  btrfs: Remove fs_info from btrfs_delete_delayed_dir_index  (Lu Fengqi, 1 file, -3/+3)

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-08-06  btrfs: Remove fs_info from btrfs_insert_delayed_dir_index  (Lu Fengqi, 1 file, -3/+1)

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-08-06  btrfs: simplify pointer chasing of local fs_info variables  (David Sterba, 1 file, -2/+2)

Functions that get a btrfs inode can simply reach the fs_info by dereferencing the root, and this looks a bit more straightforward compared to the btrfs_sb(...) indirection. If the transaction handle is available and not NULL, it's used instead.

Signed-off-by: David Sterba <[email protected]>
2018-05-28  btrfs: replace waitqueue_active with cond_wake_up  (David Sterba, 1 file, -6/+3)

Use the wrappers and reduce the amount of low-level details about the waitqueue management.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
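A minimal sketch of what such a wrapper can look like; the exact barrier discussion in the real helper is elided and this shape is an assumption:

  /* Only take the wake_up slow path when someone is actually waiting.
   * wq_has_sleeper() implies the memory barrier that a bare
   * waitqueue_active() check would otherwise need. */
  static inline void cond_wake_up(struct wait_queue_head *wq)
  {
          if (wq_has_sleeper(wq))
                  wake_up(wq);
  }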
2018-04-18  btrfs: delayed-inode: Remove wrong qgroup meta reservation calls  (Qu Wenruo, 1 file, -4/+16)

Commit 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item") as merged into mainline was not the latest version submitted to the mailing list in Dec 2017, and it lacks the following fixes:

1) Remove the btrfs_qgroup_convert_reserved_meta() call in btrfs_delayed_item_release_metadata()
2) Remove the btrfs_qgroup_reserve_meta_prealloc() call in btrfs_delayed_inode_reserve_metadata()

These fixes resolve unexpected EDQUOT problems.

Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-04-12  btrfs: replace GPL boilerplate by SPDX -- sources  (David Sterba, 1 file, -14/+1)

Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header.

Signed-off-by: David Sterba <[email protected]>
2018-03-31  btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item  (Qu Wenruo, 1 file, -16/+30)

Quite similar to delalloc, with some modification to delayed-inode and delayed-item reservation. It also needs an extra parameter for the release case, to distinguish a normal release from an error release.

Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-03-26  btrfs: add more __cold annotations  (David Sterba, 1 file, -1/+1)

The __cold functions are placed into a special section, as they're expected to be called rarely. This could help i-cache prefetches or help the compiler decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (which is also implied by __exit). That way the following function categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <[email protected]>
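A hedged illustration of the annotation; the helper below is hypothetical, not one of the functions touched by this patch:

  /* Hypothetical error-path helper: __cold moves it into a rarely
   * executed section, shrinking the hot path's i-cache footprint and
   * hinting the compiler that branches leading here are unlikely. */
  static void __cold report_tree_corruption(struct btrfs_fs_info *fs_info,
                                            u64 bytenr)
  {
          btrfs_err(fs_info, "corrupt node at %llu", bytenr);
  }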
2018-03-26  btrfs: Don't pass fs_info to btrfs_run_delayed_items/_nr  (Nikolay Borisov, 1 file, -4/+2)

We already pass the transaction, which has a reference to the fs_info, so use that. No functional changes.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-03-26  btrfs: Don't pass fs_info to __btrfs_run_delayed_items  (Nikolay Borisov, 1 file, -4/+4)

We already pass the transaction handle, which contains a reference to the fs_info, so grab it from there. No functional changes.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-01-29  Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux  (Linus Torvalds, 1 file, -30/+29)

Pull btrfs updates from David Sterba:

"Features or user visible changes:
 - fallocate: implement zero range mode
 - avoid losing data raid profile when deleting a device
 - tree item checker: more checks for directory items and xattrs

Notable fixes:
 - raid56 recovery: don't use cached stripes, that could be potentially changed and a later RMW or recovery would lead to corruptions or failures
 - let raid56 try harder to rebuild damaged data, reading from all stripes if necessary
 - fix scrub to repair raid56 in a similar way as in the case above

Other:
 - cleanups: device freeing, removed some call indirections, redundant bio_put/_get, unused parameters, refactorings and renames
 - RCU list traversal fixups
 - simplify mount callchain, remove recursing back when mounting a subvolume
 - plug for fsync, may improve bio merging on multiple devices
 - compression heuristic: replace heap sort with radix sort, gains some performance
 - add extent map selftests, buffered write vs dio"

* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
  btrfs: drop devid as device_list_add() arg
  btrfs: get device pointer from device_list_add()
  btrfs: set the total_devices in device_list_add()
  btrfs: move pr_info into device_list_add
  btrfs: make btrfs_free_stale_devices() to match the path
  btrfs: rename btrfs_free_stale_devices() arg to skip_dev
  btrfs: make btrfs_free_stale_devices() argument optional
  btrfs: make btrfs_free_stale_device() to iterate all stales
  btrfs: no need to check for btrfs_fs_devices::seeding
  btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
  Btrfs: noinline merge_extent_mapping
  Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: add extent map selftests
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add helper for em merge logic
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: fix incorrect block_len in merge_extent_mapping
  btrfs: Remove unused readahead spinlock
  ...
2018-01-29  Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux  (Linus Torvalds, 1 file, -2/+5)

Pull inode->i_version rework from Jeff Layton:

"This pile of patches is a rework of the inode->i_version field. We have traditionally incremented that field on every inode data or metadata change. Typically this increment needs to be logged on disk even when nothing else has changed, which is rather expensive.

It turns out though that none of the consumers of that field actually require this behavior. The only real requirement for all of them is that it be different iff the inode has changed since the last time the field was checked.

Given that, we can optimize away most of the i_version increments and avoid dirtying inode metadata when the only change is to the i_version and no one is querying it. Queries of the i_version field are rather rare, so we can help write performance under many common workloads.

This patch series converts existing accesses of the i_version field to a new API, and then converts all of the in-kernel filesystems to use it. The last patch in the series then converts the backend implementation to a scheme that optimizes away a large portion of the metadata updates when no one is looking at it.

In my own testing this series significantly helps performance with small I/O sizes. I also got this email for Christmas this year from the kernel test robot (a 244% r/w bandwidth improvement with XFS over DAX, with 4k writes): https://lkml.org/lkml/2017/12/25/8

A few of the earlier patches in this pile are also flowing to you via other trees (mm, integrity, and nfsd trees in particular)."

* tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits)
  fs: handle inode->i_version more efficiently
  btrfs: only dirty the inode in btrfs_update_time if something was changed
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  fs: only set S_VERSION when updating times if necessary
  IMA: switch IMA over to new i_version API
  xfs: convert to new i_version API
  ufs: use new i_version API
  ocfs2: convert to new i_version API
  nfsd: convert to new i_version API
  nfs: convert to new i_version API
  ext4: convert to new i_version API
  ext2: convert to new i_version API
  exofs: switch to new i_version API
  btrfs: convert to new i_version API
  afs: convert to new i_version API
  affs: convert to new i_version API
  fat: convert to new i_version API
  fs: don't take the i_lock in inode_inc_iversion
  fs: new API for handling inode->i_version
  ntfs: remove i_version handling
  ...
2018-01-29  btrfs: convert to new i_version API  (Jeff Layton, 1 file, -2/+5)

Signed-off-by: Jeff Layton <[email protected]>
Acked-by: David Sterba <[email protected]>
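A hedged sketch of the conversion pattern from <linux/iversion.h>; the wrapper functions here are illustrative, not the btrfs call sites:

  #include <linux/fs.h>
  #include <linux/iversion.h>

  static void note_inode_change(struct inode *inode)
  {
          /* before the rework this was a raw inode->i_version++, which
           * had to be logged even if nobody ever read the value */
          inode_inc_iversion(inode);
  }

  static u64 snapshot_change_cookie(struct inode *inode)
  {
          /* readers use the query helper; it flags the value as
           * observed so later increments know they must be persisted */
          return inode_query_iversion(inode);
  }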
2018-01-24  Btrfs: fix stale entries in readdir  (Josef Bacik, 1 file, -18/+8)

In fixing the readdir+pagefault deadlock I accidentally introduced a stale entry regression in readdir. If we get close to full for the temporary buffer, then skip a few delayed deletions, and then try to add another entry that won't fit, we will emit the entries we found and retry. Unfortunately we delete entries from our del_list as we find them, assuming we won't need them. However our pos will be wherever our last entry was, which could be before the delayed deletions we skipped, so the next search will add the deleted entries back into our readdir buffer.

So instead don't delete entries we find in our del_list, so we can make sure we always find our delayed deletions. This is a slight perf hit for readdir with lots of pending deletions, but hopefully this isn't a common occurrence. If it is, we can revisit this and optimize it.

cc: [email protected]
Fixes: 23b5ec74943f ("btrfs: fix readdir deadlock with pagefault")
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-01-22  btrfs: Move checks from btrfs_wq_run_delayed_node to btrfs_balance_delayed_items  (Nikolay Borisov, 1 file, -5/+2)

btrfs_balance_delayed_items is the sole caller of btrfs_wq_run_delayed_node and already includes one of the checks for whether the delayed inodes should be run. On the other hand, btrfs_wq_run_delayed_node duplicates that check and performs an additional one for wq congestion. Let's remove the duplicate check and move the congestion one into btrfs_balance_delayed_items, leaving btrfs_wq_run_delayed_node to only care about setting up the wq run. No functional changes.

Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-01-22  btrfs: Make btrfs_async_run_delayed_root use a loop rather than multiple labels  (Nikolay Borisov, 1 file, -25/+27)

Currently btrfs_async_run_delayed_root's implementation uses 3 goto labels to mimic the functionality of a simple do {} while loop. Refactor the function to use a do {} while construct, making the intention clear and the code easier to follow. No functional changes.

Signed-off-by: Nikolay Borisov <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2018-01-02  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes  (Chris Mason, 1 file, -11/+34)

refcounts have a generic implementation and an asm optimized one. The generic version has extra debugging to make sure that once a refcount goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping over the warnings when the generic refcounts are used. We ended up with this race:

  Process A                              Process B
  btrfs_get_delayed_node()
    spin_lock(root->inode_lock)
    radix_tree_lookup()
                                         __btrfs_release_delayed_node()
                                           refcount_dec_and_test(&delayed_node->refs)
                                           our refcount is now zero
    refcount_add(2) <---
      warning here, refcount unchanged
                                           spin_lock(root->inode_lock)
                                           radix_tree_delete()

With the generic refcounts, we actually warn again when process B above tries to release his refcount, because refcount_add() turned into a no-op.

We saw this in production on older kernels without the asm optimized refcounts.

The fix used here is to use refcount_inc_not_zero() to detect when the object is in the middle of being freed and return NULL. This is almost always the right answer anyway, since we usually end up pitching the delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra check for zero refcounts before radix tree deletion. btrfs_get_delayed_node() was the only path that was allowing refcounts to go from zero to one.

Fixes: 6de5f18e7b0da ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
CC: <[email protected]> # 4.12+
Signed-off-by: Chris Mason <[email protected]>
Reviewed-by: Liu Bo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
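A sketch of the lookup-side fix described above; the structure names follow the race diagram, but the exact code is an assumption:

  spin_lock(&root->inode_lock);
  node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
  if (node) {
          /* If the last ref was already dropped by a concurrent
           * __btrfs_release_delayed_node(), inc_not_zero fails and we
           * behave as if the node was never found. */
          if (!refcount_inc_not_zero(&node->refs))
                  node = NULL;
  }
  spin_unlock(&root->inode_lock);
  return node;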
2017-11-01  btrfs: make the delalloc block rsv per inode  (Josef Bacik, 1 file, -45/+1)

The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand.

Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying.

Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2017-08-18  btrfs: increase ctx->pos for delayed dir index  (Josef Bacik, 1 file, -0/+1)

Our dir_context->pos is supposed to hold the next position we're supposed to look at. If we successfully emit a delayed dir index we could end up with a duplicate entry, because we don't increase ctx->pos after doing the dir_emit.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: Liu Bo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
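A hedged sketch of the emit path after the fix; the variable names are illustrative:

  /* emit one delayed dir index entry to the readdir buffer */
  if (!dir_emit(ctx, name, name_len, location.objectid, d_type))
          return 1;  /* buffer full, stop and let readdir retry */

  /*
   * Advance ctx->pos past the entry just emitted; without this the
   * next search would re-emit the same delayed index and produce a
   * duplicate entry.
   */
  ctx->pos++;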
2017-04-18  btrfs: convert btrfs_delayed_item.refs from atomic_t to refcount_t  (Elena Reshetova, 1 file, -9/+9)

The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Hans Liljestrand <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: David Windsor <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2017-04-18  btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t  (Elena Reshetova, 1 file, -14/+14)

The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova <[email protected]>
Signed-off-by: Hans Liljestrand <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: David Windsor <[email protected]>
Signed-off-by: David Sterba <[email protected]>
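The mechanical shape of such a conversion, sketched with a hypothetical struct (not the real btrfs_delayed_node):

  #include <linux/refcount.h>
  #include <linux/slab.h>

  struct delayed_thing {
          refcount_t refs;  /* was: atomic_t refs; */
  };

  static void thing_init(struct delayed_thing *t)
  {
          refcount_set(&t->refs, 1);  /* was: atomic_set(&t->refs, 1) */
  }

  static void thing_get(struct delayed_thing *t)
  {
          refcount_inc(&t->refs);     /* was: atomic_inc(&t->refs) */
  }

  static void thing_put(struct delayed_thing *t)
  {
          /* was: atomic_dec_and_test(); refcount_t additionally traps
           * increments from zero and over/underflows */
          if (refcount_dec_and_test(&t->refs))
                  kfree(t);
  }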
2017-02-28  btrfs: Make btrfs_i_size_write take btrfs_inode  (Nikolay Borisov, 1 file, -1/+1)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2017-02-14  btrfs: fix over-80 lines introduced by previous cleanups  (David Sterba, 1 file, -2/+3)

This goes as a separate patch because fixing it inside the patches caused too many conflicts.

Signed-off-by: David Sterba <[email protected]>
2017-02-14  btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode  (Nikolay Borisov, 1 file, -2/+2)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_remove_delayed_node take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode  (Nikolay Borisov, 1 file, -2/+2)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode  (Nikolay Borisov, 1 file, -3/+3)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode  (Nikolay Borisov, 1 file, -8/+8)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2017-02-14  btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode  (Nikolay Borisov, 1 file, -6/+5)

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2017-02-14  btrfs: Make btrfs_get_delayed_node take btrfs_inode  (Nikolay Borisov, 1 file, -9/+8)

This function is internal to btrfs and doesn't really deal with any VFS members; as such it needn't take a struct inode reference but a btrfs_inode.

Signed-off-by: Nikolay Borisov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2017-02-14  btrfs: Make btrfs_ino take a struct btrfs_inode  (Nikolay Borisov, 1 file, -7/+7)

Currently btrfs_ino takes a struct inode, and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs, it's first necessary to eliminate all uses of struct inode for the purpose of obtaining the inode number. This patch does that by using BTRFS_I to convert a struct inode to a btrfs_inode. With this problem eliminated, subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code.

Signed-off-by: Nikolay Borisov <[email protected]>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: David Sterba <[email protected]>
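A sketch of the conversion helper and the resulting call pattern; the container_of layout is how such an embedding is typically recovered and is stated here as an assumption, not quoted from this log:

  /* btrfs_inode embeds the VFS inode; BTRFS_I() recovers the container */
  static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
  {
          return container_of(inode, struct btrfs_inode, vfs_inode);
  }

  static u64 example_ino(struct inode *inode)
  {
          /* convert once at the VFS boundary, stay btrfs_inode inside */
          return btrfs_ino(BTRFS_I(inode));
  }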