aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-04-19btrfs: fix direct I/O writes for split bios on zoned devicesChristoph Hellwig1-2/+3
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_end_dio_bio to record the write location for zone devices is incorrect for subsequent bios. CC: [email protected] # 5.16+ Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Naohiro Aota <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-19btrfs: fix direct I/O read repair for split biosChristoph Hellwig3-8/+9
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add a file_offset field to struct btrfs_bio to pass along the correct offset. Given that check_data_csum only uses start of an error message this means problems with this miscalculation will only show up when I/O fails or checksums mismatch. The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical member") but we need it due to the bio splitting. CC: [email protected] # 5.16+ Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Naohiro Aota <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-19btrfs: fix and document the zoned device choice in alloc_new_bioChristoph Hellwig1-15/+28
Zone Append bios only need a valid block device in struct bio, but not the device in the btrfs_bio. Use the information from btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on multi-device file system with non-homogeneous capabilities and remove the pointless btrfs_bio.device assignment. Add big fat comments explaining what is going on here. Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Naohiro Aota <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-19btrfs: fix leaked plug after failure syncing log on zoned filesystemsFilipe Manana1-0/+1
On a zoned filesystem, if we fail to allocate the root node for the log root tree while syncing the log, we end up returning without finishing the IO plug we started before, resulting in leaking resources as we have started writeback for extent buffers of a log tree before. That allocation failure, which typically is either -ENOMEM or -ENOSPC, is not fatal and the fsync can safely fallback to a full transaction commit. So release the IO plug if we fail to allocate the extent buffer for the root of the log root tree when syncing the log on a zoned filesystem. Fixes: 3ddebf27fcd3a9 ("btrfs: zoned: reorder log node allocation on zoned filesystem") CC: [email protected] # 5.15+ Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: fix btrfs_submit_compressed_write cgroup attributionDennis Zhou1-0/+8
This restores the logic from commit 46bcff2bfc5e ("btrfs: fix compressed write bio blkcg attribution") which added cgroup attribution to btrfs writeback. It also adds back the REQ_CGROUP_PUNT flag for these ios. Fixes: 91507240482e ("btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write") CC: [email protected] # 5.16+ Signed-off-by: Dennis Zhou <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: fix root ref counts in error handling in btrfs_get_root_refJia-Ju Bai1-2/+3
In btrfs_get_root_ref(), when btrfs_insert_fs_root() fails, btrfs_put_root() can happen for two reasons: - the root already exists in the tree, in that case it returns the reference obtained in btrfs_lookup_fs_root() - another error so the cleanup is done in the fail label Calling btrfs_put_root() unconditionally would lead to double decrement of the root reference possibly freeing it in the second case. Reported-by: TOTE Robot <[email protected]> Fixes: bc44d7c4b2b1 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root") CC: [email protected] # 5.10+ Signed-off-by: Jia-Ju Bai <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: zoned: activate block group only for extent allocationNaohiro Aota3-9/+21
In btrfs_make_block_group(), we activate the allocated block group, expecting that the block group is soon used for allocation. However, the chunk allocation from flush_space() context broke the assumption. There can be a large time gap between the chunk allocation time and the extent allocation time from the chunk. Activating the empty block groups pre-allocated from flush_space() context can exhaust the active zone counter of a device. Once we use all the active zone counts for empty pre-allocated block groups, we cannot activate new block group for the other things: metadata, tree-log, or data relocation block group. That failure results in a fake -ENOSPC. This patch introduces CHUNK_ALLOC_FORCE_FOR_EXTENT to distinguish the chunk allocation from find_free_extent(). Now, the new block group is activated only in that context. Fixes: eb66a010d518 ("btrfs: zoned: activate new block group") CC: [email protected] # 5.16+ Reviewed-by: Johannes Thumshirn <[email protected]> Tested-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: return allocated block group from do_chunk_alloc()Naohiro Aota1-3/+13
Return the allocated block group from do_chunk_alloc(). This is a preparation patch for the next patch. CC: [email protected] # 5.16+ Reviewed-by: Johannes Thumshirn <[email protected]> Tested-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: mark resumed async balance as writingNaohiro Aota1-0/+2
When btrfs balance is interrupted with umount, the background balance resumes on the next mount. There is a potential deadlock with FS freezing here like as described in commit 26559780b953 ("btrfs: zoned: mark relocation as writing"). Mark the process as sb_writing to avoid it. Reviewed-by: Filipe Manana <[email protected]> CC: [email protected] # 4.9+ Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: remove support of balance v1 ioctlNikolay Borisov1-2/+0
It was scheduled for removal in kernel v5.18 commit 6c405b24097c ("btrfs: deprecate BTRFS_IOC_BALANCE ioctl") thus its time has come. Reviewed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: release correct delalloc amount in direct IO write pathNaohiro Aota1-3/+3
Running generic/406 causes the following WARNING in btrfs_destroy_inode() which tells there are outstanding extents left. In btrfs_get_blocks_direct_write(), we reserve a temporary outstanding extents with btrfs_delalloc_reserve_metadata() (or indirectly from btrfs_delalloc_reserve_space(()). We then release the outstanding extents with btrfs_delalloc_release_extents(). However, the "len" can be modified in the COW case, which releases fewer outstanding extents than expected. Fix it by calling btrfs_delalloc_release_extents() for the original length. To reproduce the warning, the filesystem should be 1 GiB. It's triggering a short-write, due to not being able to allocate a large extent and instead allocating a smaller one. WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram zsmalloc CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014 RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs] RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000 RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78 RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8 R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800 R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68 FS: 00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> destroy_inode+0x33/0x70 dispose_list+0x43/0x60 evict_inodes+0x161/0x1b0 generic_shutdown_super+0x2d/0x110 kill_anon_super+0xf/0x20 btrfs_kill_super+0xd/0x20 [btrfs] deactivate_locked_super+0x27/0x90 cleanup_mnt+0x12c/0x180 task_work_run+0x54/0x80 exit_to_user_mode_prepare+0x152/0x160 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x42/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f854a000fb7 Fixes: f0bfa76a11e9 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range") CC: [email protected] # 5.17 Reviewed-by: Johannes Thumshirn <[email protected]> Tested-by: Johannes Thumshirn <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()Nathan Chancellor1-4/+0
Clang's version of -Wunused-but-set-variable recently gained support for unary operations, which reveals two unused variables: fs/btrfs/block-group.c:2949:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ fs/btrfs/block-group.c:3116:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable] int num_started = 0; ^ 2 errors generated. These variables appear to be unused from their introduction, so just remove them to silence the warnings. Fixes: c9dc4c657850 ("Btrfs: two stage dirty block group writeout") Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") CC: [email protected] # 5.4+ Link: https://github.com/ClangBuiltLinux/linux/issues/1614 Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-04-06btrfs: zoned: remove redundant condition in btrfs_run_delalloc_rangeHaowen Bai1-2/+1
The logic !A || A && B is equivalent to !A || B. so we can make code clear. Note: though it's preferred to be in the more human readable form, there have been repeated reports and patches as the expression is detected by tools so apply it to reduce the load. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Haowen Bai <[email protected]> Reviewed-by: David Sterba <[email protected]> [ add note ] Signed-off-by: David Sterba <[email protected]>
2022-03-24btrfs: prevent subvol with swapfile from being deletedKaiwen Hu1-0/+22
A subvolume with an active swapfile must not be deleted otherwise it would not be possible to deactivate it. After the subvolume is deleted, we cannot swapoff the swapfile in this deleted subvolume because the path is unreachable. The swapfile is still active and holding references, the filesystem cannot be unmounted. The test looks like this: mkfs.btrfs -f $dev > /dev/null mount $dev $mnt btrfs sub create $mnt/subvol touch $mnt/subvol/swapfile chmod 600 $mnt/subvol/swapfile chattr +C $mnt/subvol/swapfile dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096 mkswap $mnt/subvol/swapfile swapon $mnt/subvol/swapfile btrfs sub delete $mnt/subvol swapoff $mnt/subvol/swapfile # failed: No such file or directory swapoff --all unmount $mnt # target is busy. To prevent above issue, we simply check that whether the subvolume contains any active swapfile, and stop the deleting process. This behavior is like snapshot ioctl dealing with a swapfile. CC: [email protected] # 5.4+ Reviewed-by: Robbie Ko <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Kaiwen Hu <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-24btrfs: do not warn for free space inode in cow_file_rangeJosef Bacik1-1/+0
This is a long time leftover from when I originally added the free space inode, the point was to catch cases where we weren't honoring the NOCOW flag. However there exists a race with relocation, if we allocate our free space inode in a block group that is about to be relocated, we could trigger the COW path before the relocation has the opportunity to find the extents and delete the free space cache. In production where we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around 5k times in a 2 week period, so not super common but enough that it's at the top of our metrics. We're properly handling the error here, and with us phasing out v1 space cache anyway just drop the WARN_ON_ONCE. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-24btrfs: avoid defragging extents whose next extents are not targetsQu Wenruo1-6/+14
[BUG] There is a report that autodefrag is defragging single sector, which is completely waste of IO, and no help for defragging: btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096 [CAUSE] In defrag_collect_targets(), we check if the current range (A) can be merged with next one (B). If mergeable, we will add range A into target for defrag. However there is a catch for autodefrag, when checking mergeability against range B, we intentionally pass 0 as @newer_than, hoping to get a higher chance to merge with the next extent. But in the next iteration, range B will looked up by defrag_lookup_extent(), with non-zero @newer_than. And if range B is not really newer, it will rejected directly, causing only range A being defragged, while we expect to defrag both range A and B. [FIX] Since the root cause is the difference in check condition of defrag_check_next_extent() and defrag_collect_targets(), we fix it by: 1. Pass @newer_than to defrag_check_next_extent() 2. Pass @extent_thresh to defrag_check_next_extent() This makes the check between defrag_collect_targets() and defrag_check_next_extent() more consistent. While there is still some minor difference, the remaining checks are focus on runtime flags like writeback/delalloc, which are mostly transient and safe to be checked only in defrag_collect_targets(). Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856 CC: [email protected] # 5.16+ Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-24btrfs: fix fallocate to use file_modified to update permissions consistentlyDarrick J. Wong1-2/+11
Since the initial introduction of (posix) fallocate back at the turn of the century, it has been possible to use this syscall to change the user-visible contents of files. This can happen by extending the file size during a preallocation, or through any of the newer modes (punch, zero range). Because the call can be used to change file contents, we should treat it like we do any other modification to a file -- update the mtime, and drop set[ug]id privileges/capabilities. The VFS function file_modified() does all this for us if pass it a locked inode, so let's make fallocate drop permissions correctly. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-24btrfs: remove device item and update super block in the same transactionQu Wenruo1-37/+28
[BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() |- btrfs_rm_dev_item(device) | |- trans = btrfs_start_transaction() | | Now we got transaction X | | | |- btrfs_del_item() | | Now device item is removed from chunk tree | | | |- btrfs_commit_transaction() | Transaction X got committed, super num devs untouched, | but device item removed from chunk tree. | (AKA, super num devs is already incorrect) | |- cur_devices->num_devices--; |- cur_devices->total_devices--; |- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. [FIX] Instead of starting and committing a transaction inside btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and pass it to btrfs_rm_dev_item(). And only commit the transaction after everything is done. Reported-by: Luca Béla Palkovics <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: [email protected] # 4.14+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-23btrfs: fix qgroup reserve overflow the qgroup limitEthan Lien1-1/+1
We use extent_changeset->bytes_changed in qgroup_reserve_data() to record how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the bytes_changed is set as "unsigned int", and it will overflow if we try to fallocate a range larger than 4GiB. The result is we reserve less bytes and eventually break the qgroup limit. Unlike regular buffered/direct write, which we use one changeset for each ordered extent, which can never be larger than 256M. For fallocate, we use one changeset for the whole range, thus it no longer respects the 256M per extent limit, and caused the problem. The following example test script reproduces the problem: $ cat qgroup-overflow.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f $DEV mount $DEV $MNT # Set qgroup limit to 2GiB. btrfs quota enable $MNT btrfs qgroup limit 2G $MNT # Try to fallocate a 3GiB file. This should fail. echo echo "Try to fallocate a 3GiB file..." fallocate -l 3G $MNT/3G.file # Try to fallocate a 5GiB file. echo echo "Try to fallocate a 5GiB file..." fallocate -l 5G $MNT/5G.file # See we break the qgroup limit. echo sync btrfs qgroup show -r $MNT umount $MNT When running the test: $ ./qgroup-overflow.sh (...) Try to fallocate a 3GiB file... fallocate: fallocate failed: Disk quota exceeded Try to fallocate a 5GiB file... qgroupid         rfer         excl     max_rfer --------         ----         ----     -------- 0/5           5.00GiB      5.00GiB      2.00GiB Since we have no control of how bytes_changed is used, it's better to set it to u64. CC: [email protected] # 4.14+ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Ethan Lien <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-23btrfs: zoned: remove left over ASSERT checking for single profileJohannes Thumshirn1-4/+0
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") we started allowing DUP on metadata block groups, so the ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are no longer valid and in fact even harmful. Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") CC: [email protected] # 5.17 Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-23btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zoneJohannes Thumshirn1-4/+5
btrfs_can_activate_zone() can be called with the device_list_mutex already held, which will lead to a deadlock: insert_dev_extents() // Takes device_list_mutex `-> insert_dev_extent() `-> btrfs_insert_empty_item() `-> btrfs_insert_empty_items() `-> btrfs_search_slot() `-> btrfs_cow_block() `-> __btrfs_cow_block() `-> btrfs_alloc_tree_block() `-> btrfs_reserve_extent() `-> find_free_extent() `-> find_free_extent_update_loop() `-> can_allocate_chunk() `-> btrfs_can_activate_zone() // Takes device_list_mutex again Instead of using the RCU on fs_devices->device_list we can use fs_devices->alloc_list, protected by the chunk_mutex to traverse the list of active devices. We are in the chunk allocation thread. The newer chunk allocation happens from the devices in the fs_device->alloc_list protected by the chunk_mutex. btrfs_create_chunk() lockdep_assert_held(&info->chunk_mutex); gather_device_info list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) Also, a device that reappears after the mount won't join the alloc_list yet and, it will be in the dev_list, which we don't want to consider in the context of the chunk alloc. [15.166572] WARNING: possible recursive locking detected [15.167117] 5.17.0-rc6-dennis #79 Not tainted [15.167487] -------------------------------------------- [15.167733] kworker/u8:3/146 is trying to acquire lock: [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs] [15.167733] [15.167733] but task is already holding lock: [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs] [15.167733] [15.167733] other info that might help us debug this: [15.167733] Possible unsafe locking scenario: [15.167733] [15.171834] CPU0 [15.171834] ---- [15.171834] lock(&fs_devs->device_list_mutex); [15.171834] lock(&fs_devs->device_list_mutex); [15.171834] [15.171834] *** DEADLOCK *** [15.171834] [15.171834] May be due to missing lock nesting notation [15.171834] [15.171834] 5 locks held by kworker/u8:3/146: [15.171834] #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0 [15.171834] #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0 [15.176244] #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs] [15.176244] #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs] [15.176244] #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs] [15.179641] [15.179641] stack backtrace: [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79 [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014 [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] [15.179641] Call Trace: [15.179641] <TASK> [15.179641] dump_stack_lvl+0x45/0x59 [15.179641] __lock_acquire.cold+0x217/0x2b2 [15.179641] lock_acquire+0xbf/0x2b0 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] __mutex_lock+0x8e/0x970 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? lock_is_held_type+0xd7/0x130 [15.183838] ? find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] find_free_extent+0x15a/0x14f0 [btrfs] [15.183838] ? _raw_spin_unlock+0x24/0x40 [15.183838] ? btrfs_get_alloc_profile+0x106/0x230 [btrfs] [15.187601] btrfs_reserve_extent+0x131/0x260 [btrfs] [15.187601] btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs] [15.187601] __btrfs_cow_block+0x138/0x600 [btrfs] [15.187601] btrfs_cow_block+0x10f/0x230 [btrfs] [15.187601] btrfs_search_slot+0x55f/0xbc0 [btrfs] [15.187601] ? lock_is_held_type+0xd7/0x130 [15.187601] btrfs_insert_empty_items+0x2d/0x60 [btrfs] [15.187601] btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs] [15.187601] __btrfs_end_transaction+0x36/0x2a0 [btrfs] [15.192037] flush_space+0x374/0x600 [btrfs] [15.192037] ? find_held_lock+0x2b/0x80 [15.192037] ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs] [15.192037] ? lock_release+0x131/0x2b0 [15.192037] btrfs_async_reclaim_data_space+0x70/0x180 [btrfs] [15.192037] process_one_work+0x24c/0x5a0 [15.192037] worker_thread+0x4a/0x3d0 Fixes: a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space") CC: [email protected] # 5.16+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: zoned: put block group after final usageNikolay Borisov1-3/+4
It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: [email protected] # 5.16+ Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: don't access possibly stale fs_info data in device_list_addDongliang Mu1-7/+6
Syzbot reported a possible use-after-free in printing information in device_list_add. Very similar with the bug fixed by commit 0697d9a61099 ("btrfs: don't access possibly stale fs_info data for printing duplicate device"), but this time the use occurs in btrfs_info_in_rcu. Call Trace: kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 btrfs_printk+0x395/0x425 fs/btrfs/super.c:244 device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957 btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387 btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fix this by modifying device->fs_info to NULL too. Reported-and-tested-by: [email protected] CC: [email protected] # 4.19+ Signed-off-by: Dongliang Mu <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: add lockdep_assert_held to need_preemptive_reclaimNiels Dossche1-0/+2
In a previous patch ("btrfs: extend locking to all space_info members accesses") the locking for the space_info members was extended in btrfs_preempt_reclaim_metadata_space because not all the member accesses that needed locks were actually locked (bytes_pinned et al). It was then suggested to also add a call to lockdep_assert_held to need_preemptive_reclaim. This function also works with space_info members. As of now, it has only two call sites which both hold the lock. Suggested-by: Johannes Thumshirn <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Niels Dossche <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: verify the tranisd of the to-be-written dirty extent bufferQu Wenruo1-6/+20
[BUG] There is a bug report that a bitflip in the transid part of an extent buffer makes btrfs to reject certain tree blocks: BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22 [CAUSE] Note the failed transid check, hex(262166) = 0x40016, while hex(22) = 0x16. It's an obvious bitflip. Furthermore, the reporter also confirmed the bitflip is from the hardware, so it's a real hardware caused bitflip, and such problem can not be detected by the existing tree-checker framework. As tree-checker can only verify the content inside one tree block, while generation of a tree block can only be verified against its parent. So such problem remain undetected. [FIX] Although tree-checker can not verify it at write-time, we still have a quick (but not the most accurate) way to catch such obvious corruption. Function csum_one_extent_buffer() is called before we submit metadata write. Thus it means, all the extent buffer passed in should be dirty tree blocks, and should be newer than last committed transaction. Using that we can catch the above bitflip. Although it's not a perfect solution, as if the corrupted generation is higher than the correct value, we have no way to catch it at all. Reported-by: Christoph Anton Mitterer <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ CC: [email protected] # 5.15+ Signed-off-by: Qu Wenruo <wqu@sus,ree.com> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: unify the error handling of btrfs_read_buffer()Qu Wenruo1-6/+6
There is one oddball error handling of btrfs_read_buffer(): ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key); if (!ret) { *eb_ret = tmp; return 0; } free_extent_buffer(tmp); btrfs_release_path(p); return -EIO; While all other call sites check the error first. Unify the behavior. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: unify the error handling pattern for read_tree_block()Qu Wenruo5-27/+34
We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. */ } else if (!extent_buffer_uptodate(eb)) { /* * Different error handling here * Normally also ended up with return or goto out; */ } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here */ return eb or goto out; } if (!extent_buffer_uptodate(eb)) { /* * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: factor out do_free_extent_accounting helperJosef Bacik1-22/+30
__btrfs_free_extent() does all of the hard work of updating the extent ref items, and then at the end if we dropped the extent completely it does the cleanup accounting work. We're going to only want to do that work for metadata with extent tree v2, so extract this bit into its own helper. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: remove last_ref from the extent freeing codeJosef Bacik1-23/+11
This is a remnant of the work I did for qgroups a long time ago to only run for a block when we had dropped the last ref. We haven't done that for years, but the code remains. Drop this remnant. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: add a alloc_reserved_extent helperJosef Bacik1-32/+24
We duplicate this logic for both data and metadata, at this point we've already done our type specific extent root operations, this is just doing the accounting and removing the space from the free space tree. Extract this common logic out into a helper. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: remove BUG_ON(ret) in alloc_reserved_tree_blockJosef Bacik1-1/+2
Switch this to an ASSERT() and return the error in the normal case. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: add and use helper for unlinking inode during log replayFilipe Manana1-48/+29
During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: extend locking to all space_info members accessesNiels Dossche1-1/+2
bytes_pinned is always accessed under space_info->lock, except in btrfs_preempt_reclaim_metadata_space, however the other members are accessed under that lock. The reserved member of the rsv's are also partially accessed under a lock and partially not. Move all these accesses into the same lock to ensure consistency. This could potentially race and lead to a flush instead of a commit but it's not a big problem as it's only for preemptive flush. CC: [email protected] # 5.15+ Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Niels Dossche <[email protected]> Signed-off-by: Niels Dossche <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: zoned: mark relocation as writingNaohiro Aota2-1/+10
There is a hung_task issue with running generic/068 on an SMR device. The hang occurs while a process is trying to thaw the filesystem. The process is trying to take sb->s_umount to thaw the FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is waiting for an ordered extent to finish. However, as the FS is frozen, the ordered extents never finish. Having an ordered extent while the FS is frozen is the root cause of the hang. The ordered extent is initiated from btrfs_relocate_chunk() which is called from btrfs_reclaim_bgs_work(). This commit adds sb_*_write() around btrfs_relocate_chunk() call site. For the usual "btrfs balance" command, we already call it with mnt_want_file() in btrfs_ioctl_balance(). Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones") CC: [email protected] # 5.13+ Link: https://github.com/naota/linux/issues/56 Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14fs: allow cross-vfsmount reflink/dedupeJosef Bacik2-10/+1
Currently we disallow reflink and dedupe if the two files aren't on the same vfsmount. However we really only need to disallow it if they're not on the same super block. It is very common for btrfs to have a main subvolume that is mounted and then different subvolumes mounted at different locations. It's allowed to reflink between these volumes, but the vfsmount check disallows this. Instead fix dedupe to check for the same superblock, and simply remove the vfsmount check for reflink as it already does the superblock check. Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: remove the cross file system checks from remapJosef Bacik1-3/+1
The sb check is already done in do_clone_file_range, and the mnt check (which will hopefully go away in a subsequent patch) is done in ioctl_file_clone(). Remove the check in our code and put an ASSERT() to make sure it doesn't change underneath us. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: pass btrfs_fs_info to btrfs_recover_relocationJosef Bacik3-5/+4
We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: pass btrfs_fs_info for deleting snapshots and cleanerJosef Bacik3-7/+6
We're passing a root around here, but we only really need the fs_info, so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead, and then fix up all the callers appropriately. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: add filesystems state details to error messagesSweet Tea Dorminy2-8/+62
When a filesystem goes read-only due to an error, multiple errors tend to be reported, some of which are knock-on failures. Logging fs_states, in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the first error from subsequent messages which may only exist due to an error state. Under the new format, most initial errors will look like: `BTRFS: error (device loop0) in ...` while subsequent errors will begin with: `error (device loop0: state E) in ...` An initial transaction abort error will look like `error (device loop0: state A) in ...` and subsequent messages will contain `(device loop0: state EA) in ...` In addition to the error states we can also print other states that are temporary, like remounting, device replace, or indicate a global state that may affect functionality. Now implemented: E - filesystem error detected A - transaction aborted L - log tree errors M - remounting in progress R - device replace in progress C - data checksums not verified (mounted with ignoredatacsums) Signed-off-by: Sweet Tea Dorminy <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: deal with unexpected extent type during reflinkingFilipe Manana1-2/+4
Smatch complains about a possible dereference of a pointer that was not initialized: CC [M] fs/btrfs/reflink.o CHECK fs/btrfs/reflink.c fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'. This is because we are not dealing with the case where the type of a file extent has an unexpected value (not regular, not prealloc and not inline), in which case the transaction handle pointer is not initialized. Such unexpected type should be impossible, except in case of some memory corruption caused either by bad hardware or some software bug causing something like a buffer overrun. So ASSERT that if the extent type is neither regular nor prealloc, then it must be inline. Bail out with -EUCLEAN and a warning in case it is not. This silences smatch. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: fix unexpected error path when reflinking an inline extentFilipe Manana1-2/+5
When reflinking an inline extent, we assert that its file offset is 0 and that its uncompressed length is not greater than the sector size. We then return an error if one of those conditions is not satisfied. However we use a return statement, which results in returning from btrfs_clone() without freeing the path and buffer that were allocated before, as well as not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination inode. Fix that by jumping to the 'out' label instead, and also add a WARN_ON() for each condition so that in case assertions are disabled, we get to known which of the unexpected conditions triggered the error. Fixes: a61e1e0df9f321 ("Btrfs: simplify inline extent handling when doing reflinks") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: reset last_reflink_trans after fsyncing inodeFilipe Manana5-14/+48
When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit 40e046acbd2f36 ("Btrfs: fix missing data checksums after replaying a log tree") and commit e289f03ea79bbc ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")). We also need to make sure a full fsync will copy all old file extent items it finds in modified leaves, because they might have been copied from some other inode. However once we fsync an inode, we don't need to keep paying the price of that extra special care in future fsyncs done in the same transaction, unless the inode is used for another reflink operation or the full sync flag is set on it (truncate, failure to allocate extent maps for holes, and other exceptional and infrequent cases). So after we fsync an inode reset its last_unlink_trans to zero. In case another reflink happens, we continue to update the last_reflink_trans of the inode, just as before. Also set last_reflink_trans to the generation of the last transaction that modified the inode whenever we need to set the full sync flag on the inode, just like when we need to load an inode from disk after eviction. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: voluntarily relinquish cpu when doing a full fsyncFilipe Manana1-0/+7
Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: hold on to less memory when logging checksums during full fsyncFilipe Manana1-17/+12
When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: stop copying old file extents when doing a full fsyncFilipe Manana2-73/+148
When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit ea7036de0d36c4 ("btrfs: fix fsync failure and transaction abort after writes to prealloc extents")). In that particular case, we had large files that had random writes and were often truncated, which made the next fsync be a full sync. $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G # FILE_SIZE=$((512 * 1024 * 1024)) # 512M mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT # Create a file with many extents. Use direct IO to make it faster # to create the file - using buffered IO we would have to fsync # after each write (terribly slow). echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..." xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar # Commit the transaction, so every extent after this is from an # old generation. sync # Now rewrite only a few extents, which are all far spread apart from # each other (e.g. 1G / 32M = 32 extents). # After this only a few extents have a new generation, while all other # ones have an old generation. echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..." for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null done # Fsync, the inode logged in full sync mode since it was never fsynced # before. echo "Fsyncing file..." xfs_io -c "fsync" $MNT/foobar umount $MNT And the following bpftrace program was running when executing the test script: $ cat bpf-script.sh #!/usr/bin/bpftrace k:btrfs_log_inode { @start_log_inode[tid] = nsecs; } kr:btrfs_log_inode /@start_log_inode[tid]/ { @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000; delete(@start_log_inode[tid]); } k:btrfs_sync_log { @start_sync_log[tid] = nsecs; } kr:btrfs_sync_log /@start_sync_log[tid]/ { $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000; printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]); printf("btrfs_sync_log() took %llu us\n", $sync_log_dur); delete(@start_sync_log[tid]); delete(@log_inode_dur[tid]); exit(); } With 512M test file, before this patch: btrfs_log_inode() took 15218 us btrfs_sync_log() took 1328 us Log tree has 17 leaves and 1 node, its total size is 294912 bytes. With 512M test file, after this patch: btrfs_log_inode() took 14760 us btrfs_sync_log() took 588 us Log tree has a single leaf, its total size is 16K. With 1G test file, before this patch: btrfs_log_inode() took 27301 us btrfs_sync_log() took 1767 us Log tree has 33 leaves and 1 node, its total size is 557056 bytes. With 1G test file, after this patch: btrfs_log_inode() took 26166 us btrfs_sync_log() took 593 us Log tree has a single leaf, its total size is 16K With 2G test file, before this patch: btrfs_log_inode() took 50892 us btrfs_sync_log() took 3127 us Log tree has 65 leaves and 1 node, its total size is 1081344 bytes. With 2G test file, after this patch: btrfs_log_inode() took 50126 us btrfs_sync_log() took 586 us Log tree has a single leaf, its total size is 16K. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: do not clean up repair bio if submit failsJosef Bacik1-8/+7
The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: do not try to repair bio that has no mirror setJosef Bacik1-0/+9
If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: do not double complete bio on errors during compressed readsJosef Bacik2-7/+21
I hit some weird panics while fixing up the error handling from btrfs_lookup_bio_sums(). Turns out the compression path will complete the bio we use if we set up any of the compression bios and then return an error, and then btrfs_submit_data_bio() will also call bio_endio() on the bio. Fix this by making btrfs_submit_compressed_read() responsible for calling bio_endio() on the bio if there are any errors. Currently it was only doing it if we created the compression bios, otherwise it was depending on btrfs_submit_data_bio() to do the right thing. This creates the above problem, so fix up btrfs_submit_compressed_read() to always call bio_endio() in case of an error, and then simply return from btrfs_submit_data_bio() if we had to call btrfs_submit_compressed_read(). Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: track compressed bio errors as blk_status_tJosef Bacik2-12/+14
Right now we just have a binary "errors" flag, so any error we get on the compressed bio's gets translated to EIO. This isn't necessarily a bad thing, but if we get an ENOMEM it may be nice to know that's what happened instead of an EIO. Track our errors as a blk_status_t, and do the appropriate setting of the errors accordingly. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-14btrfs: remove the bio argument from finish_compressed_bio_readJosef Bacik1-5/+3
This bio is usually one of the compressed bio's, and we don't actually need it in this function, so remove the argument and stop passing it around. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>