path: root/fs/btrfs/relocation.c
Age | Commit message | Author | Files | Lines
2021-08-23 | btrfs: subpage: fix relocation potentially overwriting last page data | Qu Wenruo | 1 | -0/+60
[BUG]
When using the following script, btrfs will report data corruption after one data balance with subpage support:

    mkfs.btrfs -f -s 4k $dev
    mount $dev -o nospace_cache $mnt
    $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
    sync
    btrfs balance start -d $mnt
    btrfs scrub start -B $mnt

A similar problem can be easily observed in the btrfs/028 test case, where there will be tons of balance failures with -EIO.

[CAUSE]
The above fsstress will result in the following data extent layout in the extent tree:

    item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
        refs 2 gen 7 flags DATA
        extent data backref root FS_TREE objectid 259 offset 1339392 count 1
        extent data backref root FS_TREE objectid 259 offset 647168 count 1
    item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
        block group used 102400 chunk_objectid 256 flags DATA
    item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
        refs 1 gen 7 flags DATA
        extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look like this:

    0       32K     64K     96K 100K        104K
    |<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we set up the data reloc inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the i_size is just 96K, we fill range [96K, 128K) with 0 and set it uptodate.

Then when we come to extent B, we update i_size to 104K, then try to read page [64K, 128K). Then we find the page is already uptodate, so we skip the read. But range [96K, 128K) is filled with 0, not the real data.

Then we write back the data reloc inode to disk, with 0 filling range [96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do a full page read for the subpage case.

The bug won't really happen for regular sectorsize, as one page only contains one sector.

[FIX]
This patch will fix the problem by invalidating range [i_size, PAGE_END] in prealloc_file_extent_cluster().

So that if the above example happens, when we preallocate the file extent for extent B, we will clear the uptodate bits for range [96K, 128K), allowing a later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling the real btrfs_invalidatepage(), but just clearing the subpage and page uptodate bits, we can leave a page half dirty and half out of date.

Reading such a page can cause a deadlock, as we normally expect a dirty page to be fully uptodate.

Thus here we flush and wait on the data reloc inode before doing the hacked invalidating. This won't cause extra overhead, as we're going to write back the data later anyway.

Reported-by: Ritesh Harjani <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
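The stale-zero problem and the effect of clearing the uptodate bits can be modeled in userspace. The following is a minimal sketch, not the kernel code: `struct sim_page`, `sim_read_page()` and `sim_invalidate_tail()` are hypothetical names, the "disk" is assumed to hold 0xAB in every byte, and i_size is assumed to be sector aligned.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PAGE_SIZE   (64u * 1024u)  /* 64K page, as in the example above */
#define SIM_SECTOR_SIZE (4u * 1024u)   /* 4K sector (mkfs.btrfs -s 4k) */
#define SIM_NSECTORS    (SIM_PAGE_SIZE / SIM_SECTOR_SIZE)

/* Hypothetical stand-in for one page and its per-sector uptodate bitmap. */
struct sim_page {
	uint64_t start;                 /* file offset of the page */
	uint32_t uptodate;              /* bit s set: sector s is uptodate */
	uint8_t data[SIM_PAGE_SIZE];
};

/*
 * Read all sectors that are not uptodate yet. Every on-disk byte is modeled
 * as 0xAB; bytes at or beyond i_size are zero filled.
 */
static void sim_read_page(struct sim_page *p, uint64_t i_size)
{
	for (uint32_t s = 0; s < SIM_NSECTORS; s++) {
		if (p->uptodate & (1u << s))
			continue;       /* already uptodate: the read is skipped */
		for (uint32_t b = 0; b < SIM_SECTOR_SIZE; b++) {
			uint32_t idx = s * SIM_SECTOR_SIZE + b;

			p->data[idx] = (p->start + idx < i_size) ? 0xAB : 0;
		}
		p->uptodate |= 1u << s;
	}
}

/* Model of the fix: clear the uptodate bits of sectors in [i_size, page end). */
static void sim_invalidate_tail(struct sim_page *p, uint64_t i_size)
{
	assert(i_size % SIM_SECTOR_SIZE == 0);  /* sector aligned in this model */
	for (uint32_t s = 0; s < SIM_NSECTORS; s++) {
		if (p->start + (uint64_t)s * SIM_SECTOR_SIZE >= i_size)
			p->uptodate &= ~(1u << s);
	}
}
```

Reading the page at 64K with i_size 96K and then again with i_size 104K leaves zeros at offset 96K; calling `sim_invalidate_tail()` with the old i_size before the second read makes that range get re-read.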
2021-08-23 | btrfs: make relocate_one_page() handle subpage case | Qu Wenruo | 1 | -29/+77
For the subpage case, one page of the data reloc inode can contain several file extents, like this:

    |<--- File extent A --->| FE B | FE C |<--- File extent D -->|
                    |<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch will make relocate_one_page() handle the subpage case by:

- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster extent boundary.
  Now we introduce a loop to go extent by extent of a page, until we have either finished the last extent or reached the page end.

  By this, regular sectorsize == PAGE_SIZE can still work as usual, since we will do that loop only once.

- Iteration starts from max(page_start, extent_start)
  Since we can have the following case:

            | FE B | FE C |<--- File extent D -->|
    |<--------- Page --------->|

  Thus we can't always start from page_start, but use max(page_start, extent_start).

- Iteration ends when the cluster is exhausted
  Similar to the previous case, the last file extent can end before the page end:

    |<--- File extent A --->| FE B | FE C |
                    |<--------- Page --------->|

  In this case, we need to manually exit the loop after we have finished the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve metadata for each range, not simply PAGE_SIZE.

Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
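The per-page iteration described above can be sketched as a clamping loop over the extents of a cluster. This is an illustration under assumptions, not the kernel implementation: `struct sim_extent` and `sim_page_ranges()` are hypothetical, and the extents are assumed to be sorted by start offset and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cluster extent: [start, start + len) in the data reloc inode. */
struct sim_extent {
	uint64_t start;
	uint64_t len;
};

static uint64_t sim_max(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t sim_min(uint64_t a, uint64_t b) { return a < b ? a : b; }

/*
 * Collect the ranges of the page [page_start, page_start + page_size) that are
 * covered by cluster extents. Each output range is where a real implementation
 * would reserve metadata and mark the range delalloc and dirty.
 * Returns the number of ranges written to out[].
 */
static size_t sim_page_ranges(const struct sim_extent *ext, size_t nr,
			      uint64_t page_start, uint64_t page_size,
			      struct sim_extent *out, size_t out_cap)
{
	uint64_t page_end = page_start + page_size;     /* exclusive */
	size_t n = 0;

	assert(ext && out);
	for (size_t i = 0; i < nr && n < out_cap; i++) {
		uint64_t ext_end = ext[i].start + ext[i].len;

		if (ext_end <= page_start)
			continue;       /* extent lies entirely before the page */
		if (ext[i].start >= page_end)
			break;          /* reached the page end */

		/* Start from max(page_start, extent_start), stop at the page end. */
		out[n].start = sim_max(page_start, ext[i].start);
		out[n].len = sim_min(page_end, ext_end) - out[n].start;
		n++;
	}
	return n;       /* the loop also ends once the cluster is exhausted */
}
```

With sectorsize == PAGE_SIZE the loop body runs once per page, which matches the remark that the regular case keeps working as before.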
2021-08-23 | btrfs: reloc: factor out relocation page read and dirty part | Qu Wenruo | 1 | -105/+93
In function relocate_file_extent_cluster(), we have a big loop for marking all involved pages delalloc. That part is long enough to be contained in one function, so this patch will move that code chunk into a new function, relocate_one_page(). This also provides enough space for later subpage work. Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-06-22 | btrfs: ensure relocation never runs while we have send operations running | Filipe Manana | 1 | -0/+13
Relocation and send do not play well together because while send is running a block group can be relocated, a transaction committed and the respective disk extents get re-allocated and written to or discarded while send is about to do something with the extents. This was explained in commit 9e967495e0e0ae ("Btrfs: prevent send failures and crashes due to concurrent relocation"), which prevented balance and send from running in parallel but it did not address one remaining case where chunk relocation can happen: shrinking a device (and device deletion which shrinks a device's size to 0 before deleting the device). We also have now one more case where relocation is triggered: on zoned filesystems partially used block groups get relocated by a background thread, introduced in commit 18bb8bbf13c183 ("btrfs: zoned: automatically reclaim zones"). So make sure that instead of preventing balance from running when there are ongoing send operations, we prevent relocation from happening. This uses the infrastructure recently added by a patch that has the subject: "btrfs: add cancellable chunk relocation support". Also it adds a spinlock used exclusively for the exclusivity between send and relocation, as before fs_info->balance_mutex was used, which would make an attempt to run send to block waiting for balance to finish, which can take a lot of time on large filesystems. Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
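The mutual exclusion described above, where neither side blocks on the other, can be sketched with a small lock-protected state. This is a userspace model under assumptions: `struct sim_fs_info` and the `sim_*` helpers are hypothetical, a C11 `atomic_flag` spin loop stands in for the kernel spinlock, and returning -EAGAIN on conflict is an assumption of the sketch rather than something the commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sim_fs_info {
	atomic_flag excl_lock;  /* stand-in for the dedicated spinlock */
	int send_in_progress;   /* number of running send operations */
	bool reloc_running;     /* a chunk relocation is in progress */
};

static void sim_fs_init(struct sim_fs_info *fs)
{
	atomic_flag_clear(&fs->excl_lock);
	fs->send_in_progress = 0;
	fs->reloc_running = false;
}

static void sim_lock(struct sim_fs_info *fs)
{
	while (atomic_flag_test_and_set_explicit(&fs->excl_lock,
						 memory_order_acquire))
		;       /* spin: the critical sections are a few instructions */
}

static void sim_unlock(struct sim_fs_info *fs)
{
	atomic_flag_clear_explicit(&fs->excl_lock, memory_order_release);
}

/* Send fails fast instead of waiting for a long-running relocation. */
static int sim_send_begin(struct sim_fs_info *fs)
{
	int ret = 0;

	sim_lock(fs);
	if (fs->reloc_running)
		ret = -EAGAIN;
	else
		fs->send_in_progress++;
	sim_unlock(fs);
	return ret;
}

static void sim_send_end(struct sim_fs_info *fs)
{
	sim_lock(fs);
	assert(fs->send_in_progress > 0);
	fs->send_in_progress--;
	sim_unlock(fs);
}

/* Relocation of any kind (balance, shrink, zone reclaim) checks for send. */
static int sim_reloc_begin(struct sim_fs_info *fs)
{
	int ret = 0;

	sim_lock(fs);
	if (fs->send_in_progress > 0)
		ret = -EAGAIN;
	else
		fs->reloc_running = true;
	sim_unlock(fs);
	return ret;
}

static void sim_reloc_end(struct sim_fs_info *fs)
{
	sim_lock(fs);
	fs->reloc_running = false;
	sim_unlock(fs);
}
```

Because the check sits at the relocation entry point rather than in balance, every caller that relocates a chunk is covered by the same test.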
2021-06-21 | btrfs: add cancellable chunk relocation support | David Sterba | 1 | -2/+60
Add support code that will allow canceling relocation on the chunk granularity. This is different and independent of balance, that also uses relocation but is a higher level operation and manages its own state and pause/cancellation requests. Relocation is used for resize (shrink) and device deletion so this will be a common point to implement cancellation for both. The context is entirely in btrfs_relocate_block_group and btrfs_recover_relocation, enclosing one chunk relocation. The status bit is set and unset between the chunks. As relocation can take long, the effects may not be immediate and the request and actual action can slightly race. The fs_info::reloc_cancel_req is only supposed to be increased and does not pair with decrement like fs_info::balance_cancel_req. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
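A rough model of the request counter and the per-chunk status bit follows. It is a sketch under assumptions: the `sim_*` names are hypothetical, C11 atomics replace the kernel primitives, and the choice of -EINPROGRESS and -ECANCELED as return values, as well as resetting the counter when the chunk ends, are assumptions of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sim_fs_info {
	atomic_bool reloc_running;      /* status bit, set for one chunk */
	atomic_int reloc_cancel_req;    /* requesters only ever increment */
};

static void sim_fs_init(struct sim_fs_info *fs)
{
	atomic_init(&fs->reloc_running, false);
	atomic_init(&fs->reloc_cancel_req, 0);
}

/* A cancel request is just an increment; there is no paired decrement. */
static void sim_request_cancel(struct sim_fs_info *fs)
{
	atomic_fetch_add(&fs->reloc_cancel_req, 1);
}

static int sim_chunk_start(struct sim_fs_info *fs)
{
	if (atomic_exchange(&fs->reloc_running, true))
		return -EINPROGRESS;    /* another chunk relocation is running */
	return 0;
}

static void sim_chunk_end(struct sim_fs_info *fs)
{
	/* Drop all pending requests, then clear the status bit. */
	atomic_store(&fs->reloc_cancel_req, 0);
	atomic_store(&fs->reloc_running, false);
}

/* Relocate one chunk in nr_steps units of work, polling for cancellation. */
static int sim_relocate_chunk(struct sim_fs_info *fs, int nr_steps, int *done)
{
	int ret = sim_chunk_start(fs);

	assert(done);
	*done = 0;
	if (ret)
		return ret;
	for (int i = 0; i < nr_steps; i++) {
		if (atomic_load(&fs->reloc_cancel_req) > 0) {
			ret = -ECANCELED;
			break;
		}
		(*done)++;
	}
	sim_chunk_end(fs);
	return ret;
}
```

Since the request is only polled between units of work, a cancel that arrives mid-step takes effect at the next poll, which is the slight race the message mentions.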
2021-04-19 | btrfs: check return value of btrfs_commit_transaction in relocation | Josef Bacik | 1 | -4/+5
There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: do proper error handling in merge_reloc_roots | Josef Bacik | 1 | -2/+23
We have a BUG_ON() if we get an error back from btrfs_get_fs_root(). This honestly should never fail, as at this point we have a solid coordination of fs root to reloc root, and these roots will all be in memory. But in the name of killing BUG_ON()'s remove these and handle the error condition properly, ASSERT()'ing for developers. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle extent corruption with select_one_root properly | Josef Bacik | 1 | -3/+16
In corruption cases we could have paths from a block up to no root at all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by adding an ASSERT() for developers, and returning an error for normal users. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: cleanup error handling in prepare_to_merge | Josef Bacik | 1 | -2/+12
This probably can't happen even with a corrupt file system, because we would have failed much earlier on than here. However there's no reason we can't just check and bail out as appropriate, so do that and convert the correctness BUG_ON() to an ASSERT(). Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment ] Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: do not panic in __add_reloc_root | Josef Bacik | 1 | -1/+5
If we have a duplicate entry for a reloc root then we could have fs corruption that resulted in a double allocation. Since this shouldn't happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all of the callers of __add_reloc_root() to catch any logic mistakes for developers, otherwise normal error handling will happen for normal users. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
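The "ASSERT for developers, error return for users" pattern used throughout this series can be sketched as follows. Everything here is hypothetical illustration: `SIM_ASSERT`, `SIM_DEBUG`, `struct sim_reloc_tree` and both functions are made-up names, and a flat array stands in for the real lookup structure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Debug builds trap on logic mistakes; normal builds compile the check out. */
#ifdef SIM_DEBUG
#define SIM_ASSERT(cond) assert(cond)
#else
#define SIM_ASSERT(cond) ((void)0)
#endif

#define SIM_MAX_ROOTS 8

struct sim_reloc_tree {
	unsigned long bytenr[SIM_MAX_ROOTS];
	size_t nr;
};

/* Returns 0, -EEXIST for a duplicate entry, or -ENOMEM when full. */
static int sim_add_reloc_root(struct sim_reloc_tree *t, unsigned long bytenr)
{
	if (t->nr == SIM_MAX_ROOTS)
		return -ENOMEM;
	for (size_t i = 0; i < t->nr; i++) {
		if (t->bytenr[i] == bytenr)
			return -EEXIST; /* previously a panic */
	}
	t->bytenr[t->nr++] = bytenr;
	return 0;
}

/* Caller: assert in debug builds, otherwise pass the error up. */
static int sim_init_reloc_root(struct sim_reloc_tree *t, unsigned long bytenr)
{
	int ret = sim_add_reloc_root(t, bytenr);

	SIM_ASSERT(ret != -EEXIST);
	return ret;
}
```

Built without `SIM_DEBUG`, a duplicate insertion simply returns -EEXIST to the caller instead of bringing the process down.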
2021-04-19 | btrfs: handle __add_reloc_root failures in btrfs_recover_relocation | Josef Bacik | 1 | -2/+11
We can already handle errors appropriately from this function, deal with an error coming from __add_reloc_root appropriately. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment ] Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: do proper error handling in create_reloc_inode | Josef Bacik | 1 | -2/+37
We already handle some errors in this function, and the callers do the correct error handling, so clean up the rest of the function to do the appropriate error handling. There's a little extra work that needs to be done here, as we create the inode item before we create the orphan item. We could potentially add the orphan item, but if we failed to create the inode item we would have to abort the transaction. Instead add a helper to delete the inode item we created in the case that we're unable to look up the inode (this would likely be caused by an ENOMEM), which if it succeeds means we can avoid a transaction abort in this particular error case. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: remove the extent item sanity checks in relocate_block_group | Josef Bacik | 1 | -28/+1
These checks are all taken care of for us by the tree checker code: - the flags don't change or are updated consistently - the v0 extent item format is invalid and caught in many other places too Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle extent reference errors in do_relocation | Josef Bacik | 1 | -4/+5
We can already deal with errors appropriately from do_relocation, simply handle any errors that come from changing the refs at this point cleanly. We have to abort the transaction if we fail here as we've modified metadata at this point. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle errors in reference count manipulation in replace_path | Josef Bacik | 1 | -4/+16
If any of the reference count manipulation stuff fails in replace_path we need to abort the transaction, as we've modified the blocks already. We can simply break at this point and everything will be cleaned up. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle btrfs_search_slot failure in replace_path | Josef Bacik | 1 | -1/+5
The search can fail for various reasons, in case of errors there's no cleanup to be done so we can pass the error to the caller, adjusting for the case where the key is not found and search slot returns 1. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
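The return-value adjustment mentioned above (a positive "not found" result turned into a real error) looks roughly like this. The sketch is hypothetical: `sim_search_slot()` only imitates the 0 / 1 / negative convention over a plain array, and mapping "not found" to -ENOENT is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical lookup with the convention described above:
 * 0 if the key is found, 1 if it is not, negative errno on failure.
 */
static int sim_search_slot(const int *keys, size_t nr, int key, size_t *slot)
{
	assert(slot);
	if (!keys)
		return -EIO;    /* models a failed read */
	for (size_t i = 0; i < nr; i++) {
		if (keys[i] == key) {
			*slot = i;
			return 0;
		}
	}
	return 1;
}

/* Caller that must find the key: "not found" becomes an error as well. */
static int sim_lookup(const int *keys, size_t nr, int key, size_t *slot)
{
	int ret = sim_search_slot(keys, nr, key, slot);

	if (ret > 0)
		ret = -ENOENT;
	return ret;     /* 0 on success, negative errno otherwise */
}
```

After the adjustment the caller's callers only ever see 0 or a negative errno, so a stray positive value cannot be mistaken for success further up.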
2021-04-19 | btrfs: handle btrfs_cow_block errors in replace_path | Josef Bacik | 1 | -2/+10
If we error out COWing the root node when doing a replace_path then we simply unlock and free the buffer and return the error. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s | Josef Bacik | 1 | -3/+3
A few BUG_ON()'s in replace_path are purely to keep us from making logical mistakes, so replace them with ASSERT()'s. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: do proper error handling in btrfs_update_reloc_root | Josef Bacik | 1 | -4/+2
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for all sorts of reasons, including IO errors. Instead of panicking the box, let's return the error, now that all callers properly handle those errors. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle btrfs_update_reloc_root failure in prepare_to_merge | Josef Bacik | 1 | -1/+12
btrfs_update_reloc_root will return errors in the future, so handle an error properly in prepare_to_merge. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle btrfs_update_reloc_root failure in insert_dirty_subvol | Josef Bacik | 1 | -1/+4
btrfs_update_reloc_root will return errors in the future, so handle the error properly in insert_dirty_subvol. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: change insert_dirty_subvol to return errors | Josef Bacik | 1 | -5/+10
This will be able to return errors in the future, so change it to return an error and handle the errors appropriately. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: validate root::reloc_root after recording root in trans | Josef Bacik | 1 | -0/+15
If we fail to setup a root->reloc_root in a different thread that path will error out, however it still leaves root->reloc_root NULL but would still appear set up in the transaction. Subsequent calls to btrfs_record_root_in_transaction would succeed without attempting to create the reloc root, as the transid has already been updated. Handle this case by making sure we have a root->reloc_root set after a btrfs_record_root_in_transaction call so we don't end up dereferencing a NULL pointer. Reported-by: Zygo Blaxell <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
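The failure mode above, where a later call succeeds without a reloc root because the transid was already updated, can be reproduced in a small model. This is a sketch with hypothetical names (`struct sim_root`, `sim_record_root_in_trans()`, `sim_get_reloc_root()`); the `fail_create` flag simulates the failing thread, and returning -ENOENT from the added check is an assumption of the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sim_root {
	uint64_t last_trans;    /* transid this root was last recorded in */
	const char *reloc_root; /* NULL until the reloc root is created */
};

/* Records the root in the transaction; creates the reloc root on first use. */
static int sim_record_root_in_trans(struct sim_root *root, uint64_t transid,
				    bool fail_create)
{
	if (root->last_trans == transid)
		return 0;       /* already recorded: nothing is attempted */

	root->last_trans = transid;     /* updated before the creation below */
	if (fail_create)
		return -ENOMEM; /* reloc_root stays NULL */
	root->reloc_root = "reloc root";
	return 0;
}

/* The added validation: never hand out a NULL reloc root. */
static int sim_get_reloc_root(struct sim_root *root, uint64_t transid,
			      bool fail_create, const char **out)
{
	int ret = sim_record_root_in_trans(root, transid, fail_create);

	assert(out);
	if (ret < 0)
		return ret;
	if (!root->reloc_root)
		return -ENOENT; /* an earlier setup attempt failed */
	*out = root->reloc_root;
	return 0;
}
```

The second caller in the same transaction gets an error instead of dereferencing a NULL pointer.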
2021-04-19 | btrfs: do proper error handling in create_reloc_root | Josef Bacik | 1 | -6/+28
We do memory allocations here, read blocks from disk, all sorts of operations that could easily fail at any given point. Instead of panicking the box, simply return the error back up the chain; all callers at this point have proper error handling. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: have proper error handling in btrfs_init_reloc_root | Josef Bacik | 1 | -1/+7
create_reloc_root will return errors in the future, and __add_reloc_root can return ENOMEM or EEXIST, so handle these errors properly. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add comment ] Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle btrfs_record_root_in_trans failure in relocate_tree_block | Josef Bacik | 1 | -1/+3
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in relocate_tree_block. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: do proper error handling in record_reloc_root_in_trans | Josef Bacik | 1 | -2/+21
Generally speaking this shouldn't ever fail, the corresponding fs root for the reloc root will already be in memory, so we won't get ENOMEM here. However if there is no corresponding root for the reloc root then we could get ENOMEM when we try to allocate it or we could get ENOENT when we look it up and see that it doesn't exist. Convert these BUG_ON()'s into ASSERT()'s and add proper error handling for the case of corruption. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: check record_root_in_trans related failures in select_reloc_root | Josef Bacik | 1 | -2/+7
We will record the fs root or the reloc root in the trans in select_reloc_root. These will actually return errors in the following patches, so check their return value here and return it up the stack. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors | Josef Bacik | 1 | -4/+43
We have several BUG_ON()'s in select_reloc_root() that can be tripped if there is an extent tree corruption. Convert these to ASSERT()'s, because if we hit it during testing it really is bad, or could indicate a problem with the backref walking code. However if users hit these problems it generally indicates corruption, I've hit a few machines in the fleet that trip over these with clearly corrupted extent trees, so be nice and print out an error message and return an error instead of bringing the whole box down. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: handle errors from select_reloc_root() | Josef Bacik | 1 | -3/+12
Currently select_reloc_root() doesn't return an error, but followup patches will make it possible for it to return an error. We do have proper error recovery in do_relocation however, so handle the possibility of select_reloc_root() having an error properly instead of BUG_ON(!root). I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we don't find a root, instead of NULL, to make the error case easier to deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this case as it indicates we messed up the backref walking code, but it could also indicate corruption. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
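Returning an error pointer instead of NULL lets the caller propagate a specific errno. The following userspace sketch re-creates that idiom; `sim_err_ptr()`, `sim_ptr_err()` and `sim_is_err()` are hypothetical re-implementations modeled on the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()` helpers, and the root lookup itself is invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_MAX_ERRNO 4095

/* Encode a negative errno in the top, never-valid range of pointer values. */
static void *sim_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long sim_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static bool sim_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-SIM_MAX_ERRNO);
}

struct sim_root {
	int id;
};

/* Returns the matching root, or an error pointer carrying -ENOENT. */
static struct sim_root *sim_select_root(struct sim_root *roots, size_t nr,
					int id)
{
	for (size_t i = 0; i < nr; i++) {
		if (roots[i].id == id)
			return &roots[i];
	}
	return sim_err_ptr(-ENOENT);
}

/* Caller: one check covers every failure, and the errno is preserved. */
static int sim_do_relocation(struct sim_root *roots, size_t nr, int id)
{
	struct sim_root *root = sim_select_root(roots, nr, id);

	if (sim_is_err(root))
		return (int)sim_ptr_err(root);
	assert(root != NULL);   /* NULL is no longer a possible result */
	return 0;
}
```

Compared with a NULL return, the caller does not have to invent an error code of its own.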
2021-04-19 | btrfs: convert BUG_ON()'s in relocate_tree_block | Josef Bacik | 1 | -2/+22
We have a couple of BUG_ON()'s in relocate_tree_block() that can be tripped if we have file system corruption. Convert these to ASSERT()'s so developers still get yelled at when they break the backref code, but error out nicely for users so the whole box doesn't go down. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation | Josef Bacik | 1 | -3/+16
A few of these are checking for correctness, and won't be triggered by corrupted file systems, so convert them to ASSERT() instead of BUG_ON() and add a comment explaining their existence. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-04-19 | btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers | Josef Bacik | 1 | -2/+2
In a few places we intermix btrfs_inode_lock with an inode_unlock, and in some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-02-09 | btrfs: zoned: enable relocation on a zoned filesystem | Naohiro Aota | 1 | -2/+32
Currently fallocate() is disabled on a zoned filesystem. Since current relocation process relies on preallocation to move file data extents, it must be handled differently. On a zoned filesystem, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocations and submit IOs to the underlying layers. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-02-08 | btrfs: introduce btrfs_subpage for data inodes | Qu Wenruo | 1 | -2/+9
To support subpage sector size, data also needs extra info to record which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get a btrfs_subpage structure attached, and detached when the page is freed.

This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after the page is mapped to an inode.

  One call site attaches pages to the inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simplify the error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases.

Currently the plan is to switch to iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page; data balance requires this feature).

So we will stick to the btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-02-08 | btrfs: fix reloc root leak with 0 ref reloc roots on recovery | Josef Bacik | 1 | -3/+1
When recovering a relocation, if we run into a reloc root that has 0 refs we simply add it to the reloc_control->reloc_roots list, and then clean it up later. The problem with this is __del_reloc_root() doesn't do anything if the root isn't in the radix tree, which in this case it won't be because we never call __add_reloc_root() on the reloc_root. This exit condition simply isn't correct really. During normal operation we can remove ourselves from the rb tree and then we're meant to clean up later at merge_reloc_roots() time, and this happens correctly. During recovery we're depending on free_reloc_roots() to drop our references, but we're short-circuiting. Fix this by continuing to check if we're on the list and dropping ourselves from the reloc_control root list and dropping our reference appropriately. Change the corresponding BUG_ON() to an ASSERT() that does the correct thing if we aren't in the rb tree. CC: [email protected] # 4.4+ Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-02-08 | btrfs: keep track of the root owner for relocation reads | Josef Bacik | 1 | -3/+43
While testing the error paths in relocation, I hit the following lockdep splat:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.10.0-rc3+ #206 Not tainted
    ------------------------------------------------------
    btrfs-balance/1571 is trying to acquire lock:
    ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0

    but task is already holding lock:
    ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (btrfs-tree-00){++++}-{3:3}:
           down_write_nested+0x43/0x80
           __btrfs_tree_lock+0x27/0x100
           btrfs_search_slot+0x248/0x890
           relocate_tree_blocks+0x490/0x650
           relocate_block_group+0x1ba/0x5d0
           kretprobe_trampoline+0x0/0x50

    -> #1 (btrfs-csum-01){++++}-{3:3}:
           down_read_nested+0x43/0x130
           __btrfs_tree_read_lock+0x27/0x100
           btrfs_read_lock_root_node+0x31/0x40
           btrfs_search_slot+0x5ab/0x890
           btrfs_del_csums+0x10b/0x3c0
           __btrfs_free_extent+0x49d/0x8e0
           __btrfs_run_delayed_refs+0x283/0x11f0
           btrfs_run_delayed_refs+0x86/0x220
           btrfs_start_dirty_block_groups+0x2ba/0x520
           kretprobe_trampoline+0x0/0x50

    -> #0 (&head_ref->mutex){+.+.}-{3:3}:
           __lock_acquire+0x1167/0x2150
           lock_acquire+0x116/0x3e0
           __mutex_lock+0x7e/0x7b0
           btrfs_lookup_extent_info+0x156/0x3b0
           walk_down_proc+0x1c3/0x280
           walk_down_tree+0x64/0xe0
           btrfs_drop_subtree+0x182/0x260
           do_relocation+0x52e/0x660
           relocate_tree_blocks+0x2ae/0x650
           relocate_block_group+0x1ba/0x5d0
           kretprobe_trampoline+0x0/0x50

    other info that might help us debug this:

    Chain exists of:
      &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(btrfs-tree-00);
                                   lock(btrfs-csum-01);
                                   lock(btrfs-tree-00);
      lock(&head_ref->mutex);

     *** DEADLOCK ***

    5 locks held by btrfs-balance/1571:
     #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40
     #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300
     #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0
     #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100
     #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100

    stack backtrace:
    CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
     dump_stack+0x8b/0xb0
     check_noncircular+0xcf/0xf0
     ? trace_call_bpf+0x139/0x260
     __lock_acquire+0x1167/0x2150
     lock_acquire+0x116/0x3e0
     ? btrfs_lookup_extent_info+0x156/0x3b0
     __mutex_lock+0x7e/0x7b0
     ? btrfs_lookup_extent_info+0x156/0x3b0
     ? btrfs_lookup_extent_info+0x156/0x3b0
     ? release_extent_buffer+0x124/0x170
     ? _raw_spin_unlock+0x1f/0x30
     ? release_extent_buffer+0x124/0x170
     btrfs_lookup_extent_info+0x156/0x3b0
     walk_down_proc+0x1c3/0x280
     walk_down_tree+0x64/0xe0
     btrfs_drop_subtree+0x182/0x260
     do_relocation+0x52e/0x660
     relocate_tree_blocks+0x2ae/0x650
     ? add_tree_block+0x149/0x1b0
     relocate_block_group+0x1ba/0x5d0
     elfcorehdr_read+0x40/0x40
     ? elfcorehdr_read+0x40/0x40
     ? btrfs_balance+0x796/0xf40
     ? __kthread_parkme+0x66/0x90
     ? btrfs_balance+0xf40/0xf40
     ? balance_kthread+0x37/0x50
     ? kthread+0x137/0x150
     ? __kthread_bind_mask+0x60/0x60
     ? ret_from_fork+0x1f/0x30

As you can see this is bogus, we never take another tree's lock under the csum lock. This happens because sometimes we have to read tree blocks from disk without knowing which root they belong to during relocation. We defaulted to an owner of 0, which translates to an fs tree. This is fine as all fs trees have the same class, but obviously isn't fine if the block belongs to a COW only tree.

Thankfully COW only trees only have their owner's root as a reference to them, and since we already look up the extent information during relocation, go ahead and check and see if this block might belong to a COW only tree, and if so save the owner in the tree_block struct. This allows us to read_tree_block with the proper owner, which gets rid of this lockdep splat.

Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-02-08 | btrfs: noinline btrfs_should_cancel_balance | Josef Bacik | 1 | -1/+1
I was attempting to reproduce a problem that Zygo hit, but my error injection wasn't firing for a few of the common calls to btrfs_should_cancel_balance. This is because the compiler decided to inline it at these spots. Keep this from happening by explicitly marking the function as noinline so that error injection will always work. Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
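Marking a small predicate as non-inlinable keeps it a real call target, which is what hooks such as error injection need. A minimal sketch, assuming a GCC or Clang compatible compiler: `sim_noinline`, `sim_should_cancel()` and `sim_work()` are hypothetical names, and the attribute is spelled out directly rather than through the kernel's `noinline` macro.

```c
#include <assert.h>

/* GCC/Clang attribute that forbids inlining of the annotated function. */
#define sim_noinline __attribute__((noinline))

/*
 * Small enough that a compiler would normally inline it into every caller,
 * leaving no single function entry for a probe to attach to.
 */
static sim_noinline int sim_should_cancel(const int *cancel_req)
{
	assert(cancel_req);
	return *cancel_req > 0;
}

/* Work loop that polls the predicate; returns the number of steps done. */
static int sim_work(const int *cancel_req, int nr_steps)
{
	int done = 0;

	while (done < nr_steps && !sim_should_cancel(cancel_req))
		done++;
	return done;
}
```

The behavior of the code is unchanged by the attribute; only the generated call structure differs.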
2021-02-08 | btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid | Nikolay Borisov | 1 | -1/+1
This better reflects the semantics of the function, i.e. no search is performed whatsoever. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-01-07btrfs: reloc: fix wrong file extent type check to avoid false ENOENTQu Wenruo1-1/+6
[BUG]
There are several bug reports about recent kernels being unable to relocate certain data block groups. Sometimes the error just goes away, but there is one reporter who can reproduce it reliably.

The dmesg would look like:

  [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
  [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
  [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
  [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

[CAUSE]
The ENOENT error is returned from the following call chain:

  add_data_references()
  |- delete_v1_space_cache();
     |- if (!found)
          return -ENOENT;

The variable @found is set to true if we find a data extent whose disk bytenr matches parameter @data_bytenr.

With extra debugging, the offending tree block looks like this:

  leaf bytenr = 42676709441536, data_bytenr = 34626327621632

    ctime 1567904822.739884119 (2019-09-08 03:07:02)
    mtime 0.0 (1970-01-01 01:00:00)
    otime 0.0 (1970-01-01 01:00:00)
  item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
    generation 1517381 type 2 (prealloc)
    prealloc data disk byte 34626327621632 nr 262144 <<<
    prealloc data offset 0 nr 262144
  item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
    generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
    lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
    uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

Although item 27 has disk bytenr 34626327621632, which matches the data_bytenr, its type is prealloc, not reg. This makes the existing code skip that item and return ENOENT.

[FIX]
The code was modified in commit 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves"). Before that commit, we used something like "if (type == BTRFS_FILE_EXTENT_INLINE) continue;". But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG), ignoring BTRFS_FILE_EXTENT_PREALLOC.

Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.

Reported-by: Stéphane Lesimple <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]
Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: [email protected] # 5.6+
Tested-By: Stéphane Lesimple <[email protected]>
Reviewed-by: Su Yue <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
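The difference between the old and the fixed type check in the commit above can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: the enum values follow the type numbers shown in the dump (type 2 is prealloc), while `struct file_extent`, `match_old()` and `match_fixed()` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* File extent types; the dump above shows "type 2 (prealloc)". */
enum file_extent_type {
	FILE_EXTENT_INLINE = 0,
	FILE_EXTENT_REG = 1,
	FILE_EXTENT_PREALLOC = 2,
};

/* Hypothetical, simplified stand-in for a file extent item. */
struct file_extent {
	enum file_extent_type type;
	uint64_t disk_bytenr;
};

/* Check as described for the offending commit: only REG extents match. */
static bool match_old(const struct file_extent *fe, uint64_t data_bytenr)
{
	return fe->type == FILE_EXTENT_REG && fe->disk_bytenr == data_bytenr;
}

/* Fixed check: PREALLOC extents also point at on-disk data, so accept them. */
static bool match_fixed(const struct file_extent *fe, uint64_t data_bytenr)
{
	return (fe->type == FILE_EXTENT_REG ||
		fe->type == FILE_EXTENT_PREALLOC) &&
	       fe->disk_bytenr == data_bytenr;
}
```

With item 27 from the dump (prealloc, disk bytenr 34626327621632), `match_old()` rejects the item, which corresponds to the false ENOENT, while `match_fixed()` accepts it. Inline extents are rejected by both.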
2020-12-09  btrfs: remove inode number cache feature  Nikolay Borisov  1  -1/+0
It's been deprecated since commit b547a88ea577 ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons. A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps; that data stays on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the leftover data can be implemented in btrfs-progs if needed. [1] https://lore.kernel.org/linux-btrfs/[email protected]/ Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: remove err variable from do_relocation  Nikolay Borisov  1  -21/+12
The variable is simply assigned the value of 'ret' in case of errors. The flow of the while loop is not changed by this commit, since the few call sites that used 'goto next' will simply break from the loop. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: eliminate err variable from merge_reloc_root  Nikolay Borisov  1  -17/+7
In most cases, when an error is returned from a function, 'ret' is simply assigned to 'err'. There is only one case where walk_up_reloc_tree can return a positive value. In this case the code breaks from the loop, and ret then gets its return value from btrfs_cow_block, which is either 0 or negative. This retains the old logic of how 'err' used to be set at this call site. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
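The reasoning in the commit above, that a single 'ret' is enough because the positive "done" value is always overwritten before returning, can be illustrated with a small userspace sketch. Everything here is hypothetical: `walk_up()`, `cow_block()` and `merge_sketch()` are toy stand-ins and do not match the signatures of walk_up_reloc_tree or btrfs_cow_block.

```c
#include <errno.h>

/* Toy stand-in: returns 1 once there is nothing left to walk, else 0. */
static int walk_up(int step, int last_step)
{
	return step >= last_step ? 1 : 0;
}

/* Toy stand-in: returns 0 on success or a negative errno on failure. */
static int cow_block(int fail)
{
	return fail ? -EIO : 0;
}

/* Single-variable version: no separate 'err' is needed. */
static int merge_sketch(int last_step, int cow_fails)
{
	int ret;
	int step = 0;

	while (1) {
		ret = walk_up(step, last_step);
		if (ret > 0)	/* positive value: leave the loop */
			break;
		step++;
	}

	/* ret is positive here; it is overwritten with 0 or a negative errno. */
	ret = cow_block(cow_fails);
	return ret;
}
```

The positive value never leaks to the caller, so `merge_sketch()` returns only 0 or a negative error code, which matches the behaviour the commit message describes for the real function.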
2020-12-08  btrfs: pass the owner_root and level to alloc_extent_buffer  Josef Bacik  1  -1/+2
Now that we've plumbed all of the callers to have the owner root and the level, plumb it down into alloc_extent_buffer(). Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: pass root owner to read_tree_block  Josef Bacik  1  -2/+2
In order to properly set the lockdep class of a newly allocated block we need to know the owner of the block. For non-refcounted trees this is straightforward: we always know in advance what tree we're reading from. For refcounted trees we don't necessarily know; however, all refcounted trees share the same lockdep class name, tree-<level>. Fix all the callers of read_tree_block() to pass in the root objectid we're using. In places like relocation and backref we could probably unconditionally use 0, but just in case, use the root when we have it, and otherwise use 0 in the cases where we don't have the root, as it's going to be a refcounted tree anyway. This is a preparation patch for further changes. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: use btrfs_read_node_slot in replace_path  Josef Bacik  1  -9/+1
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: use btrfs_read_node_slot in do_relocation  Josef Bacik  1  -11/+1
We're open-coding btrfs_read_node_slot() in do_relocation, replace this with the proper helper. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: use btrfs_read_node_slot in walk_down_reloc_tree  Josef Bacik  1  -13/+2
We do not need to call read_tree_block() here, simply use the btrfs_read_node_slot helper. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
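The three btrfs_read_node_slot conversions above (replace_path, do_relocation, walk_down_reloc_tree) share one idea: instead of each caller collecting the child block pointer, generation and level from the parent by hand before reading, one helper derives them from the parent node and a slot index. The sketch below models only that idea in userspace C. All names (`struct sketch_node`, `struct sketch_read_args`, `sketch_read_node_slot()`) are hypothetical, and no I/O is performed; the real helper returns an extent buffer and is not reproduced here.

```c
#include <stdint.h>

#define SKETCH_MAX_SLOTS 4

/* Hypothetical, simplified stand-in for an interior tree node. */
struct sketch_node {
	int level;
	int nritems;
	uint64_t blockptr[SKETCH_MAX_SLOTS];	/* child block addresses */
	uint64_t ptr_gen[SKETCH_MAX_SLOTS];	/* child generations */
};

/* What a caller would otherwise have to assemble before reading a child. */
struct sketch_read_args {
	uint64_t bytenr;
	uint64_t gen;
	int level;
	int valid;	/* 0 if the slot is out of range */
};

/* Gather everything needed to read the child at @slot of @parent. */
static struct sketch_read_args
sketch_read_node_slot(const struct sketch_node *parent, int slot)
{
	struct sketch_read_args args = { 0 };

	if (slot < 0 || slot >= parent->nritems)
		return args;	/* invalid slot: args.valid stays 0 */

	args.bytenr = parent->blockptr[slot];
	args.gen = parent->ptr_gen[slot];
	args.level = parent->level - 1;	/* child is one level below */
	args.valid = 1;
	return args;
}
```

Centralising the lookup means that later changes to what a read needs (such as the owner root and level mentioned in the commits above) touch one helper rather than every call site.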
2020-12-08  btrfs: cleanup extent buffer readahead  Josef Bacik  1  -1/+1
We're going to pass around more information when we allocate extent buffers. To make that cleaner, clean up how we do readahead. Most of the callers have the parent node that we're getting our blockptr from, with the sole exception of relocation, which simply has the bytenr it wants to read. Add a helper that takes the current arguments that we need (bytenr and gen), and add another helper for simply reading the slot out of a node. In followup patches the helper that takes all the extra arguments will be expanded, and the simpler helper won't need to have its arguments adjusted. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2020-12-08  btrfs: locking: remove all the blocking helpers  Josef Bacik  1  -4/+0
Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>