aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/transaction.c
AgeCommit message (Collapse)AuthorFilesLines
2024-09-10btrfs: qgroup: use xarray to track dirty extents in transactionJunchao Sun1-3/+2
Use xarray to track dirty extents to reduce the size of the struct btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is more cache line friendly, it also reduces the complexity of insertion and search code compared to rb tree. Another change introduced is about error handling. Before this patch, the result of btrfs_qgroup_trace_extent_nolock() is always a success. In this patch, because of this function calls the function xa_store() which has the possibility to fail, so mark qgroup as inconsistent if error happened and then free preallocated memory. Also we preallocate memory before spin_lock(), if memory preallcation failed, error handling is the same the existing code. Suggested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: fix data race when accessing the last_trans field of a rootFilipe Manana1-4/+4
KCSAN complains about a data race when accessing the last_trans field of a root: [ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs] [ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1: [ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs] [ 199.555999] start_transaction+0x154/0xcd8 [btrfs] [ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.559123] touch_atime+0x16c/0x1e0 [ 199.559151] pipe_read+0x6a8/0x7d0 [ 199.559179] vfs_read+0x466/0x498 [ 199.559204] ksys_read+0x108/0x150 [ 199.559230] __s390x_sys_read+0x68/0x88 [ 199.559257] do_syscall+0x1c6/0x210 [ 199.559286] __do_syscall+0xc8/0xf0 [ 199.559318] system_call+0x70/0x98 [ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0: [ 199.559464] record_root_in_trans+0x196/0x228 [btrfs] [ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs] [ 199.561097] start_transaction+0x154/0xcd8 [btrfs] [ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.564277] file_update_time+0xb8/0xf0 [ 199.564301] pipe_write+0x8ac/0xab8 [ 199.564326] vfs_write+0x33c/0x588 [ 199.564349] ksys_write+0x108/0x150 [ 199.564372] __s390x_sys_write+0x68/0x88 [ 199.564397] do_syscall+0x1c6/0x210 [ 199.564424] __do_syscall+0xc8/0xf0 [ 199.564452] system_call+0x70/0x98 This is because we update and read last_trans concurrently without any type of synchronization. This should be generally harmless and in the worst case it can make us do extra locking (btrfs_record_root_in_trans()) trigger some warnings at ctree.c or do extra work during relocation - this would probably only happen in case of load or store tearing. So fix this by always reading and updating the field using READ_ONCE() and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: switch btrfs_pending_snapshot::dir to btrfs_inodeDavid Sterba1-1/+1
The structure is internal so we should use struct btrfs_inode for that. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: add and use helper to commit the current transactionFilipe Manana1-0/+19
We have several places that attach to the current transaction with btrfs_attach_transaction_barrier() and then commit the transaction if there is one. Add a helper and use it to deduplicate this pattern. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: simplify range parameters of btrfs_wait_ordered_roots()David Sterba1-1/+1
The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter. Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename werr and err to ret in __btrfs_wait_marked_extents()Anand Jain1-13/+8
Rename the function's local return variables err and werr to ret. Also, align the variable declarations with the other declarations in the function for better function space alignment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename werr and err to ret in btrfs_write_marked_extents()Anand Jain1-13/+10
Rename the function's local variable werr and err to ret. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: report filemap_fdata<write|wait>_range() errorAnand Jain1-0/+4
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents() return the actual error if when filemap_fdata<write|wait>_range() fails. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik1-11/+11
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename err to ret in __btrfs_end_transaction()Anand Jain1-4/+4
Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: remove pointless BUG_ON() when creating snapshotFilipe Manana1-2/+0
When creating a snapshot we first check with btrfs_lookup_dir_item() if there is a name collision in the parent directory and then return an error if there's a collision. Then later on when trying to insert a dir item for the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW: static noinline int create_pending_snapshot(...) { (...) /* check if there is a file/dir which has the same name. */ dir_item = btrfs_lookup_dir_item(...); (...) ret = btrfs_insert_dir_item(...); /* We have check then name at the beginning, so it is impossible. */ BUG_ON(ret == -EEXIST || ret == -EOVERFLOW); if (ret) { btrfs_abort_transaction(trans, ret); goto fail; } (...) } It's impossible to get the -EEXIST because we previously checked for a potential collision with btrfs_lookup_dir_item() and we know that after that no one could have added a colliding name because at this point the transaction is in its critical section, state TRANS_STATE_COMMIT_DOING, so no one can join this transaction to add a colliding name and neither can anyone start a new transaction to do that. As for the -EOVERFLOW, that can't happen as long as we have the extended references feature enabled, which is a mkfs default for many years now. In either case, the BUG_ON() is excessive as we can properly deal with any error and can abort the transaction and jump to the 'fail' label, in which case we'll also get the useful stack trace (just like a BUG_ON()) from the abort if the error is either -EEXIST or -EOVERFLOW. So remove the BUG_ON(). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02btrfs: always clear PERTRANS metadata during commitBoris Burkov1-1/+1
It is possible to clear a root's IN_TRANS tag from the radix tree, but not clear its PERTRANS, if there is some error in between. Eliminate that possibility by moving the free up to where we clear the tag. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_transBoris Burkov1-9/+8
The transaction is only able to free PERTRANS reservations for a root once that root has been recorded with the TRANS tag on the roots radix tree. Therefore, until we are sure that this root will get tagged, it isn't safe to convert. Generally, this is not an issue as *some* transaction will likely tag the root before long and this reservation will get freed in that transaction, but technically it could stick around until unmount and result in a warning about leaked metadata reservation space. This path is most exercised by running the generic/269 fstest with CONFIG_BTRFS_DEBUG. Fixes: a6496849671a ("btrfs: fix start transaction qgroup rsv double free") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: remove SLAB_MEM_SPREAD flag useChengming Zhou1-2/+1
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use KMEM_CACHE() to create btrfs_trans_handle cacheKunwu Chan1-3/+2
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches when the default values are used. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove no longer used btrfs_transaction_in_commit()Filipe Manana1-13/+0
The function btrfs_transaction_in_commit() is no longer used, its last use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily slow down delalloc if we're committing"), so just remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: remove unused included headersDavid Sterba1-2/+0
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-29btrfs: fix double free of anonymous device after snapshot creation failureFilipe Manana1-1/+1
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem. The steps that lead to this: 1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev; 2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot(); 3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev; 4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking; 5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1. Recently syzbot ran into this and reported the following trace: ------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK> Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot. To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free. CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-13btrfs: don't refill whole delayed refs block reserve when starting transactionFilipe Manana1-36/+2
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs when starting transaction") we started not only to reserve metadata space for the delayed refs a caller of btrfs_start_transaction() might generate but also to try to fully refill the delayed refs block reserve, because there are several case where we generate delayed refs and haven't reserved space for them, relying on the global block reserve. Relying too much on the global block reserve is not always safe, and can result in hitting -ENOSPC during transaction commits or worst, in rare cases, being unable to mount a filesystem that needs to do orphan cleanup or anything that requires modifying the filesystem during mount, and has no more unallocated space and the metadata space is nearly full. This was explained in detail in that commit's change log. However the gap between the reserved amount and the size of the delayed refs block reserve can be huge, so attempting to reserve space for such a gap can result in allocating many metadata block groups that end up not being used. After a recent patch, with the subject: "btrfs: add new unused block groups to the list of unused block groups" We started to add new block groups that are unused to the list of unused block groups, to avoid having them around for a very long time in case they are never used, because a block group is only added to the list of unused block groups when we deallocate the last extent or when mounting the filesystem and the block group has 0 bytes used. This is not a problem introduced by the commit mentioned earlier, it always existed as our metadata space reservations are, most of the time, pessimistic and end up not using all the space they reserved, so we can occasionally end up with one or two unused metadata block groups for a long period. However after that commit mentioned earlier, we are just more pessimistic in the metadata space reservations when starting a transaction and therefore the issue is more likely to happen. This however is not always enough because we might create unused metadata block groups when reserving metadata space at a high rate if there's always a gap in the delayed refs block reserve and the cleaner kthread isn't triggered often enough or is busy with other work (running delayed iputs, cleaning deleted roots, etc), not to mention the block group's allocated space is only usable for a new block group after the transaction used to remove it is committed. A user reported that he's getting a lot of allocated metadata block groups but the usage percentage of metadata space was very low compared to the total allocated space, specially after running a series of block group relocations. So for now stop trying to refill the gap in the delayed refs block reserve and reserve space only for the delayed refs we are expected to generate when starting a transaction. CC: stable@vger.kernel.org # 6.7+ Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/ Tested-by: Ivan Shapovalov <intelfx@intelfx.name> Reported-by: Heddxh <g311571057@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-14Merge tag 'for-6.7-rc5-tag' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Some fixes to quota accounting code, mostly around error handling and correctness: - free reserves on various error paths, after IO errors or transaction abort - don't clear reserved range at the folio release time, it'll be properly cleared after final write - fix integer overflow due to int used when passing around size of freed reservations - fix a regression in squota accounting that missed some cases with delayed refs" * tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ensure releasing squota reserve on head refs btrfs: don't clear qgroup reserved bit in release_folio btrfs: free qgroup pertrans reserve on transaction abort btrfs: fix qgroup_free_reserved_data int overflow btrfs: free qgroup reserve when ORDERED_IOERR is set
2023-12-06btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov1-2/+0
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-28Merge tag 'for-6.7-rc3-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few fixes and message updates: - for simple quotas, handle the case when a snapshot is created and the target qgroup already exists - fix a warning when file descriptor given to send ioctl is not writable - fix off-by-one condition when checking chunk maps - free pages when page array allocation fails during compression read, other cases were handled - fix memory leak on error handling path in ref-verify debugging feature - copy missing struct member 'version' in 64/32bit compat send ioctl - tree-checker verifies inline backref ordering - print messages to syslog on first mount and last unmount - update error messages when reading chunk maps" * tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: ensure send_fd is writable btrfs: free the allocated memory if btrfs_alloc_page_array() fails btrfs: fix 64bit compat send ioctl arguments not initializing version member btrfs: make error messages more clear when getting a chunk map btrfs: fix off-by-one when checking chunk map includes logical address btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod() btrfs: add dmesg output for first mount and last unmount of a filesystem btrfs: do not abort transaction if there is already an existing qgroup btrfs: tree-checker: add type and sequence check for inline backrefs
2023-11-15btrfs: do not abort transaction if there is already an existing qgroupQu Wenruo1-1/+1
[BUG] Syzbot reported a regression that after commit 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") we can trigger transaction abort during snapshot creation: BTRFS: Transaction aborted (error -17) WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778 Modules linked in: CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778 Call Trace: <TASK> create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967 btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440 create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041 __btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294 btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321 btrfs_ioctl+0xbbf/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b RIP: 0033:0x7f2f791127b9 </TASK> [CAUSE] The error number is -EEXIST, which can happen for qgroup if there is already an existing qgroup and then we're trying to create a snapshot for it. [FIX] In that case, we can continue creating the snapshot, although it may lead to qgroup inconsistency, it's not so critical to abort the current transaction. So in this case, we can just ignore the non-critical errors, mostly -EEXIST (there is already a qgroup). Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-30Merge tag 'for-6.7-tag' of ↵Linus Torvalds1-89/+140
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
2023-10-18btrfs: convert to new timestamp accessorsJeff Layton1-1/+2
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-12btrfs: add and use helpers for reading and writing last_trans_committedFilipe Manana1-3/+3
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing fs_info->generationFilipe Manana1-1/+1
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop reserving excessive space for block group item insertionsFilipe Manana1-1/+1
Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_insert_metadata_size(), since we only need to insert a block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_insert_metadata_size() we will need to reserve 2 times less space when using the free space tree, putting less pressure on space reservation. So use helpers to reserve and release space for block group item insertions that use btrfs_calc_insert_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: move btrfs_defrag_root() to defrag.{c,h}Filipe Manana1-39/+0
The btrfs_defrag_root() function does not really belong in the transaction.{c,h} module and as we have a defrag.{c,h} nowadays, move it to there instead. This also allows to stop exporting btrfs_defrag_leaves(), so we can make it static. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename info to fs_info for consistency ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_update_inode_fallback()Filipe Manana1-1/+1
The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: simple quota auto hierarchy for nested subvolumesBoris Burkov1-5/+9
Consider the following sequence: - enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/ With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100. Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in. In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes. I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: create qgroup earlier in snapshot creationBoris Burkov1-0/+6
Pull creating the qgroup earlier in the snapshot. This allows simple quotas qgroups to see all the metadata writes related to the snapshot being created and to be born with the root node accounted. Note this has an impact on transaction commit where the qgroup creation can do a lot of work, allocate memory and take locks. The change is done for correctness, potential performance issues will be fixed in the future. Signed-off-by: Boris Burkov <boris@bur.io> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: add new quota mode for simple quotasBoris Burkov1-4/+3
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: rename errno identifiers to errorDavid Sterba1-5/+5
We sync the kernel files to userspace and the 'errno' symbol is defined by standard library, which does not matter in kernel but the parameters or local variables could clash. Rename them all. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: always reserve space for delayed refs when starting transactionFilipe Manana1-27/+107
When starting a transaction (or joining an existing one with btrfs_start_transaction()), we reserve space for the number of items we want to insert in a btree, but we don't do it for the delayed refs we will generate while using the transaction to modify (COW) extent buffers in a btree or allocate new extent buffers. Basically how it works: 1) When we start a transaction we reserve space for the number of items the caller wants to be inserted/modified/deleted in a btree. This space goes to the transaction block reserve; 2) If the delayed refs block reserve is not full, its size is greater than the amount of its reserved space, and the flush method is BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for it corresponding to the number of items the caller wants to insert/modify/delete in a btree; 3) The size of the delayed refs block reserve is increased when a task creates delayed refs after COWing an extent buffer, allocating a new one or deleting (freeing) an extent buffer. This happens after the the task started or joined a transaction, whenever it calls btrfs_update_delayed_refs_rsv(); 4) The delayed refs block reserve is then refilled by anyone calling btrfs_delayed_refs_rsv_refill(), either during unlink/truncate operations or when someone else calls btrfs_start_transaction() with a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL; 5) As a task COWs or allocates extent buffers, it consumes space from the transaction block reserve. When the task releases its transaction handle (btrfs_end_transaction()) or it attempts to commit the transaction, it releases any remaining space in the transaction block reserve that it did not use, as not all space may have been used (due to pessimistic space calculation) by calling btrfs_block_rsv_release() which will try to add that unused space to the delayed refs block reserve (if its current size is greater than its reserved space). That transferred space may not be enough to completely fulfill the delayed refs block reserve. Plus we have some tasks that will attempt do modify as many leaves as they can before getting -ENOSPC (and then reserving more space and retrying), such as hole punching and extent cloning which call btrfs_replace_file_extents(). Such tasks can generate therefore a high number of delayed refs, for both metadata and data (we can't know in advance how many file extent items we will find in a range and therefore how many delayed refs for dropping references on data extents we will generate); 6) If a transaction starts its commit before the delayed refs block reserve is refilled, for example by the transaction kthread or by someone who called btrfs_join_transaction() before starting the commit, then when running delayed references if we don't have enough reserved space in the delayed refs block reserve, we will consume space from the global block reserve. Now this doesn't make a lot of sense because: 1) We should reserve space for delayed references when starting the transaction, since we have no guarantees the delayed refs block reserve will be refilled; 2) If no refill happens then we will consume from the global block reserve when running delayed refs during the transaction commit; 3) If we have a bunch of tasks calling btrfs_start_transaction() with a number of items greater than zero and at the time the delayed refs reserve is full, then we don't reserve any space at btrfs_start_transaction() for the delayed refs that will be generated by a task, and we can therefore end up using a lot of space from the global reserve when running the delayed refs during a transaction commit; 4) There are also other operations that result in bumping the size of the delayed refs reserve, such as creating and deleting block groups, as well as the need to update a block group item because we allocated or freed an extent from the respective block group; 5) If we have a significant gap between the delayed refs reserve's size and its reserved space, two very bad things may happen: 1) The reserved space of the global reserve may not be enough and we fail the transaction commit with -ENOSPC when running delayed refs; 2) If the available space in the global reserve is enough it may result in nearly exhausting it. If the fs has no more unallocated device space for allocating a new block group and all the available space in existing metadata block groups is not far from the global reserve's size before we started the transaction commit, we may end up in a situation where after the transaction commit we have too little available metadata space, and any future transaction commit will fail with -ENOSPC, because although we were able to reserve space to start the transaction, we were not able to commit it, as running delayed refs generates some more delayed refs (to update the extent tree for example) - this includes not even being able to commit a transaction that was started with the goal of unlinking a file, removing an empty data block group or doing reclaim/balance, so there's no way to release metadata space. In the worst case the next time we mount the filesystem we may also fail with -ENOSPC due to failure to commit a transaction to cleanup orphan inodes. This later case was reported and hit by someone running a SLE (SUSE Linux Enterprise) distribution for example - where the fs had no more unallocated space that could be used to allocate a new metadata block group, and the available metadata space was about 1.5M, not enough to commit a transaction to cleanup an orphan inode (or do relocation of data block groups that were far from being full). So improve on this situation by always reserving space for delayed refs when calling start_transaction(), and if the flush method is BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block reserve if it's not full. The space reserved for the delayed refs is added to a local block reserve that is part of the transaction handle, and when a task updates the delayed refs block reserve size, after creating a delayed ref, the space is transferred from that local reserve to the global delayed refs reserve (fs_info->delayed_refs_rsv). In case the local reserve does not have enough space, which may happen for tasks that generate a variable and potentially large number of delayed refs (such as the hole punching and extent cloning cases mentioned before), we transfer any available space and then rely on the current behaviour of hoping some other task refills the delayed refs reserve or fallback to the global block reserve. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: stop doing excessive space reservation for csum deletionFilipe Manana1-1/+1
Currently when reserving space for deleting the csum items for a data extent, when adding or updating a delayed ref head, we determine how many leaves of csum items we can have and then pass that number to the helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating space for all tree modifications we need when running delayed references, however the amount of space it computes is excessive for deleting csum items because: 1) It uses btrfs_calc_insert_metadata_size() which is excessive because we only need to delete csum items from the csum tree, we don't need to insert any items, so btrfs_calc_metadata_size() is all we need (as it computes space needed to delete an item); 2) If the free space tree is enabled, it doubles the amount of space, which is pointless for csum deletion since we don't need to touch the free space tree or any other tree other than the csum tree. So improve on this by tracking how many csum deletions we have and using a new helper to calculate space for csum deletions (just a wrapper around btrfs_calc_metadata_size() with a comment). This reduces the amount of space we need to reserve for csum deletions by a factor of 4, and it helps reduce the number of times we have to block space reservations and have the reclaim task enter the space flushing algorithm (flush delayed items, flush delayed refs, etc) in order to satisfy tickets. For example this results in a total time decrease when unlinking (or truncating) files with many extents, as we end up having to block on space metadata reservations less often. Example test: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/test umount $DEV &> /dev/null mkfs.btrfs -f $DEV # Use compression to quickly create files with a lot of extents # (each with a size of 128K). mount -o compress=lzo $DEV $MNT # 100G gives at least 983040 extents with a size of 128K. xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar # Flush all delalloc and clear all metadata from memory. umount $MNT mount -o compress=lzo $DEV $MNT start=$(date +%s%N) rm -f $MNT/foobar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "rm took $dur milliseconds" umount $MNT Before this change rm took: 7504 milliseconds After this change rm took: 6574 milliseconds (-12.4%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: allow to run delayed refs by bytes to be released instead of countFilipe Manana1-4/+4
When running delayed references, through btrfs_run_delayed_refs(), we can specify how many to run, run all existing delayed references and keep running delayed references while we can find any. This is controlled with the value of the 'count' argument, where a value of 0 means to run all delayed references that exist by the time btrfs_run_delayed_refs() is called, (unsigned long)-1 means to keep running delayed references while we are able find any, and any other value to run that exact number of delayed references. Typically a specific value other than 0 or -1 is used when flushing space to try to release a certain amount of bytes for a ticket. In this case we just simply calculate how many delayed reference heads correspond to a specific amount of bytes, with calc_delayed_refs_nr(). However that only takes into account the space reserved for the reference heads themselves, and does not account for the space reserved for deleting checksums from the csum tree (see add_delayed_ref_head() and update_existing_head_ref()) in case we are going to delete a data extent. This means we may end up running more delayed references than necessary in case we process delayed references for deleting a data extent. So change the logic of btrfs_run_delayed_refs() to take a bytes argument to specify how many bytes of delayed references to run/release, using the special values of 0 to mean all existing delayed references and U64_MAX (or (u64)-1) to keep running delayed references while we can find any. This prevents running more delayed references than necessary, when we have delayed references for deleting data extents, but also makes the upcoming changes/patches simpler and it's preparatory work for them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()Filipe Manana1-1/+2
We are passing a block reserve argument to btrfs_reserve_metadata_bytes() which is not really used, all we need is to pass the space_info associated to the block reserve, we don't change the block reserve at all. Not only it's pointless to pass the block reserve, it's also confusing as one might think that the reserved bytes will end up being added to the passed block reserve, when that's not the case. The pattern for reserving space and adding it to a block reserve is to first reserve space with btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to a block reserve by calling btrfs_block_rsv_add_bytes(). Also the reverse of btrfs_reserve_metadata_bytes(), which is btrfs_space_info_free_bytes_may_use(), takes a space_info argument and not a block reserve, so one more reason to pass a space_info and not a block reserve to btrfs_reserve_metadata_bytes(). So change btrfs_reserve_metadata_bytes() and its callers to pass a space_info argument instead of a block reserve argument. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba1-2/+2
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-26Merge tag 'for-6.6-rc3-tag' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - delayed refs fixes: - fix race when refilling delayed refs block reserve - prevent transaction block reserve underflow when starting transaction - error message and value adjustments - fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and -Wmaybe-uninitialized - fix for smatch report where uninitialized data from invalid extent buffer range could be returned to the caller - fix numeric overflow in statfs when calculating lower threshold for a full filesystem * tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize start_slot in btrfs_log_prealloc_extents btrfs: make sure to initialize start and len in find_free_dev_extent btrfs: reset destination buffer when read_extent_buffer() gets invalid range btrfs: properly report 0 avail for very full file systems btrfs: log message if extent item not found when running delayed extent op btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 btrfs: prevent transaction block reserve underflow when starting transaction btrfs: fix race when refilling delayed refs block reserve
2023-09-20btrfs: prevent transaction block reserve underflow when starting transactionFilipe Manana1-3/+3
When starting a transaction, with a non-zero number of items, we reserve metadata space for that number of items and for delayed refs by doing a call to btrfs_block_rsv_add(), with the transaction block reserve passed as the block reserve argument. This reserves metadata space and adds it to the transaction block reserve. Later we migrate the space we reserved for delayed references from the transaction block reserve into the delayed refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv(). btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to migrate from the source block reserve, and this however may result in an underflow in case the space added to the transaction block reserve ended up being used by another task that has not reserved enough space for its own use - examples are tasks doing reflinks or hole punching because they end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and may need to modify/COW a variable number of leaves/paths, so they keep trying to use space from the transaction block reserve when they need to COW an extent buffer, and may end up trying to use more space then they have reserved (1 unit/path only for removing file extent items). This can be avoided by simply reserving space first without adding it to the transaction block reserve, then add the space for delayed refs to the delayed refs block reserve and finally add the remaining reserved space to the transaction block reserve. This also makes the code a bit shorter and simpler. So just do that. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-12Merge tag 'for-6.6-rc1-tag' of ↵Linus Torvalds1-15/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - several fixes for handling directory item (inserting, removing, iteration, error handling) - fix transaction commit stalls when auto relocation is running and blocks other tasks that want to commit - fix a build error when DEBUG is enabled - fix lockdep warning in inode number lookup ioctl - fix race when finishing block group creation - remove link to obsolete wiki in several files * tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org btrfs: assert delayed node locked when removing delayed item btrfs: remove BUG() after failure to insert delayed dir index item btrfs: improve error message after failure to add delayed dir index item btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio btrfs: check for BTRFS_FS_ERROR in pending ordered assert btrfs: fix lockdep splat and potential deadlock after failure running delayed items btrfs: do not block starts waiting on previous transaction commit btrfs: release path before inode lookup during the ino lookup ioctl btrfs: fix race between finishing block group creation and its item update
2023-09-08btrfs: do not block starts waiting on previous transaction commitJosef Bacik1-15/+24
Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The reporter used the 'bpf offcputime' tracer to show that we would get stuck in start_transaction for 5 to 30 seconds, and were always being woken up by the transaction commit. Using my timing-everything script, which times how long a function takes and what percentage of that total time is taken up by its children, I saw several traces like this 1083 took 32812902424 ns 29929002926 ns 91.2110% wait_for_commit_duration 25568 ns 7.7920e-05% commit_fs_roots_duration 1007751 ns 0.00307% commit_cowonly_roots_duration 446855602 ns 1.36182% btrfs_run_delayed_refs_duration 271980 ns 0.00082% btrfs_run_delayed_items_duration 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration 9656 ns 2.9427e-05% switch_commit_roots_duration 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration Here I was only tracing functions that happen where we are between START_COMMIT and UNBLOCKED in order to see what would be keeping us blocked for so long. The wait_for_commit() we do is where we wait for a previous transaction that hasn't completed it's commit. This can include all of the unpin work and other cleanups, which tends to be the longest part of our transaction commit. There is no reason we should be blocking new things from entering the transaction at this point, it just adds to random latency spikes for no reason. Fix this by adding a PREP stage. This allows us to properly deal with multiple committers coming in at the same time, we retain the behavior that the winner waits on the previous transaction and the losers all wait for this transaction commit to occur. Nothing else is blocked during the PREP stage, and then once the wait is complete we switch to COMMIT_START and all of the same behavior as before is maintained. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28Merge tag 'for-6.6-tag' of ↵Linus Torvalds1-11/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-21btrfs: update comment for btrfs_join_transaction_nostart()Filipe Manana1-1/+4
Update the comment for btrfs_join_transaction_nostart() to be more clear about how it works and how it's different from btrfs_attach_transaction(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't start transaction when joining with TRANS_JOIN_NOSTARTFilipe Manana1-3/+4
When joining a transaction with TRANS_JOIN_NOSTART, if we don't find a running transaction we end up creating one. This goes against the purpose of TRANS_JOIN_NOSTART which is to join a running transaction if its state is at or below the state TRANS_STATE_COMMIT_START, otherwise return an -ENOENT error and don't start a new transaction. So fix this to not create a new transaction if there's no running transaction at or below that state. CC: stable@vger.kernel.org # 4.14+ Fixes: a6d155d2e363 ("Btrfs: fix deadlock between fiemap and transaction commits") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: fix start transaction qgroup rsv double freeBoris Burkov1-3/+16
btrfs_start_transaction reserves metadata space of the PERTRANS type before it identifies a transaction to start/join. This allows flushing when reserving that space without a deadlock. However, it results in a race which temporarily breaks qgroup rsv accounting. T1 T2 start_transaction do_stuff start_transaction qgroup_reserve_meta_pertrans commit_transaction qgroup_free_meta_all_pertrans hit an error starting txn goto reserve_fail qgroup_free_meta_pertrans (already freed!) The basic issue is that there is nothing preventing another commit from committing before start_transaction finishes (in fact sometimes we intentionally wait for it) so any error path that frees the reserve is at risk of this race. While this exact space was getting freed anyway, and it's not a huge deal to double free it (just a warning, the free code catches this), it can result in incorrectly freeing some other pertrans reservation in this same reservation, which could then lead to spuriously granting reservations we might not have the space for. Therefore, I do believe it is worth fixing. To fix it, use the existing prealloc->pertrans conversion mechanism. When we first reserve the space, we reserve prealloc space and only when we are sure we have a transaction do we convert it to pertrans. This way any racing commits do not blow away our reservation, but we still get a pertrans reservation that is freed when _this_ transaction gets committed. This issue can be reproduced by running generic/269 with either qgroups or squotas enabled via mkfs on the scratch device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: make find_first_extent_bit() return a booleanFilipe Manana1-4/+4
Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-26btrfs: check for commit error at btrfs_attach_transaction_barrier()Filipe Manana1-2/+7
btrfs_attach_transaction_barrier() is used to get a handle pointing to the current running transaction if the transaction has not started its commit yet (its state is < TRANS_STATE_COMMIT_START). If the transaction commit has started, then we wait for the transaction to commit and finish before returning - however we completely ignore if the transaction was aborted due to some error during its commit, we simply return ERR_PT(-ENOENT), which makes the caller assume everything is fine and no errors happened. This could make an fsync return success (0) to user space when in fact we had a transaction abort and the target inode changes were therefore not persisted. Fix this by checking for the return value from btrfs_wait_for_commit(), and if it returned an error, return it back to the caller. Fixes: d4edf39bd5db ("Btrfs: fix uncompleted transaction") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>