aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/qgroup.c
AgeCommit message (Collapse)AuthorFilesLines
2024-10-22btrfs: qgroup: set a more sane default value for subtree drop thresholdQu Wenruo1-1/+1
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to that that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-11btrfs: use sector numbers as keys for the dirty extents xarrayFilipe Manana1-5/+16
We are using the logical address ("bytenr") of an extent as the key for qgroup records in the dirty extents xarray. This is a problem because the xarrays use "unsigned long" for keys/indices, meaning that on a 32 bits platform any extent starting at or beyond 4G is truncated, which is a too low limitation as virtually everyone is using storage with more than 4G of space. This means a "bytenr" of 4G gets truncated to 0, and so does 8G and 16G for example, resulting in incorrect qgroup accounting. Fix this by using sector numbers as keys instead, that is, using keys that match the logical address right shifted by fs_info->sectorsize_bits, which is what we do for the fs_info->buffer_radix that tracks extent buffers (radix trees also use an "unsigned long" type for keys). This also makes the index space more dense which helps optimize the xarray (as mentioned at Documentation/core-api/xarray.rst). Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: qgroup: use xarray to track dirty extents in transactionJunchao Sun1-34/+32
Use xarray to track dirty extents to reduce the size of the struct btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is more cache line friendly, it also reduces the complexity of insertion and search code compared to rb tree. Another change introduced is about error handling. Before this patch, the result of btrfs_qgroup_trace_extent_nolock() is always a success. In this patch, because of this function calls the function xa_store() which has the possibility to fail, so mark qgroup as inconsistent if error happened and then free preallocated memory. Also we preallocate memory before spin_lock(), if memory preallcation failed, error handling is the same the existing code. Suggested-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02btrfs: qgroup: don't use extent changeset when not neededFedor Pchelkin1-2/+1
The local extent changeset is passed to clear_record_extent_bits() where it may have some additional memory dynamically allocated for ulist. When qgroup is disabled, the memory is leaked because in this case the changeset is not released upon __btrfs_qgroup_release_data() return. Since the recorded contents of the changeset are not used thereafter, just don't pass it. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Reported-by: syzbot+81670362c283f3dd889c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/000000000000aa8c0c060ade165e@google.com Fixes: af0e2aab3b70 ("btrfs: qgroup: flush reservations during quota disable") CC: stable@vger.kernel.org # 6.10+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-25btrfs: run delayed iputs when flushing delallocJosef Bacik1-0/+2
We have transient failures with btrfs/301, specifically in the part where we do for i in $(seq 0 10); do write 50m to file rm -f file done Sometimes this will result in a transient quota error, and it's because sometimes we start writeback on the file which results in a delayed iput, and thus the rm doesn't actually clean the file up. When we're flushing the quota space we need to run the delayed iputs to make sure all the unlinks that we think have completed have actually completed. This removes the small window where we could fail to find enough space in our quota. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: qgroup: preallocate memory before adding a relationDavid Sterba1-17/+8
There's a transaction joined in the qgroup relation add/remove ioctl and any error will lead to abort/error. We could lift the allocation from btrfs_add_qgroup_relation() and move it outside of the transaction context. The relation deletion does not need that. The ownership of the structure is moved to the add relation handler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: do not BUG_ON() when freeing tree block after errorFilipe Manana1-2/+4
When freeing a tree block, at btrfs_free_tree_block(), if we fail to create a delayed reference we don't deal with the error and just do a BUG_ON(). The error most likely to happen is -ENOMEM, and we have a comment mentioning that only -ENOMEM can happen, but that is not true, because in case qgroups are enabled any error returned from btrfs_qgroup_trace_extent_post() (can be -EUCLEAN or anything returned from btrfs_search_slot() for example) can be propagated back to btrfs_free_tree_block(). So stop doing a BUG_ON() and return the error to the callers and make them abort the transaction to prevent leaking space. Syzbot was triggering this, likely due to memory allocation failure injection. Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: constify pointer parameters where applicableDavid Sterba1-11/+11
We can add const to many parameters, this is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. This patch does not cover all possible conversions. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: add and use helper to commit the current transactionFilipe Manana1-28/+5
We have several places that attach to the current transaction with btrfs_attach_transaction_barrier() and then commit the transaction if there is one. Add a helper and use it to deduplicate this pattern. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: qgroup: avoid start/commit empty transaction when flushing reservationsFilipe Manana1-5/+7
When flushing reservations we are using btrfs_join_transaction() to get a handle for the current transaction and then commit it to try to release space. However btrfs_join_transaction() has some undesirable consequences: 1) If there's no running transaction, it will create one, and we will commit it right after. This is unnecessary because it will not release any space, and it will result in unnecessary IO and rotation of backup roots in the superblock; 2) If there's a current transaction and that transaction is committing (its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that transaction to almost finish its commit (for its state to be >= TRANS_STATE_UNBLOCKED) and then start and return a new transaction. We will then commit that new transaction, which is pointless because all we wanted was to wait for the current (previous) transaction to fully finish its commit (state == TRANS_STATE_COMPLETED), and by starting and committing a new transaction we are wasting IO too and causing unnecessary rotation of backup roots in the superblock. So improve this by using btrfs_attach_transaction_barrier() instead, which does not create a new transaction if there's none running, and if there's a current transaction that is committing, it will wait for it to fully commit and not create a new transaction. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: simplify range parameters of btrfs_wait_ordered_roots()David Sterba1-2/+2
The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: automatically remove the subvolume qgroupQu Wenruo1-0/+35
Currently if we fully clean a subvolume (not only delete its directory, but fully clean all it's related data and root item), the associated qgroup would not be removed. We have "btrfs qgroup clear-stale" to handle such 0 level qgroups. Change the behavior to automatically removie the qgroup of a fully cleaned subvolume when possible: - Full qgroup but still consistent We can and should remove the qgroup. The qgroup numbers should be 0, without any rsv. - Full qgroup but inconsistent Can happen with drop_subtree_threshold feature (skip accounting and mark qgroup inconsistent). We can and should remove the qgroup. Higher level qgroup numbers will be incorrect, but since qgroup is already inconsistent, it should not be a problem. - Squota mode This is the special case, we can only drop the qgroup if its numbers are all 0. This would be handled by can_delete_qgroup(), so we only need to check the return value and ignore the -EBUSY error. Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: slightly loosen the requirement for qgroup removalQu Wenruo1-7/+80
[BUG] Currently if one is utilizing "qgroups/drop_subtree_threshold" sysfs, and a snapshot with level higher than that value is dropped, we will not be able to delete the qgroup until next qgroup rescan: uuid=ffffffff-eeee-dddd-cccc-000000000000 wipefs -fa $dev mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid mount $dev $mnt btrfs subvolume create $mnt/subv1/ for (( i = 0; i < 1024; i++ )); do xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null done sync btrfs subvolume snapshot $mnt/subv1 $mnt/snapshot btrfs quota enable $mnt btrfs quota rescan -w $mnt sync echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold btrfs subvolume delete $mnt/snapshot btrfs subvolume sync $mnt btrfs qgroup show -prce --sync $mnt btrfs qgroup destroy 0/257 $mnt umount $mnt The final qgroup removal would fail with the following error: ERROR: unable to destroy quota group: Device or resource busy [CAUSE] The above script would generate a subvolume of level 2, then snapshot it, enable qgroup, set the drop_subtree_threshold, then drop the snapshot. Since the subvolume drop would meet the threshold, qgroup would be marked inconsistent and skip accounting to avoid hanging the system at transaction commit. But currently we do not allow a qgroup with any rfer/excl numbers to be dropped, and this is not really compatible with the new drop_subtree_threshold behavior. [FIX] Only require the strict zero rfer/excl/rfer_cmpr/excl_cmpr for squota mode. This is due to the fact that squota can never go inconsistent, and it can have dropped subvolume but with non-zero qgroup numbers for future accounting. For full qgroup mode, we only check if there is a subvolume for it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: use for-local variables that shadow function variablesDavid Sterba1-6/+5
We've started to use for-loop local variables and in a few places this shadows a function variable. Convert a few cases reported by 'make W=2'. If applicable also change the style to post-increment, that's the preferred one. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-02btrfs: always do the basic checks for btrfs_qgroup_inherit structureQu Wenruo1-2/+8
[BUG] Syzbot reports the following regression detected by KASAN: BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171 CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 print_address_description mm/kasan/report.c:377 [inline] print_report+0x169/0x550 mm/kasan/report.c:488 kasan_report+0x143/0x180 mm/kasan/report.c:601 btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277 create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922 btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382 create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875 btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340 btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422 btrfs_ioctl+0x99e/0xc60 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fcbf1992509 RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509 RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660 R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e </TASK> And it also pinned it down to commit b5357cb268c4 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled"). [CAUSE] That offending commit skips the whole qgroup inherit check if qgroup is not enabled. But that also skips the very basic checks like num_ref_copies/num_excl_copies and the structure size checks. Meaning if a qgroup enable/disable race is happening at the background, and we pass a btrfs_qgroup_inherit structure when the qgroup is disabled, the check would be completely skipped. Then at the time of transaction commitment, qgroup is re-enabled and btrfs_qgroup_inherit() is going to use the incorrect structure and causing the above KASAN error. [FIX] Make btrfs_qgroup_check_inherit() only skip the source qgroup checks. So that even if invalid btrfs_qgroup_inherit structure is passed in, we can still reject invalid ones no matter if qgroup is enabled or not. Furthermore we do already have an extra safety inside btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources, so even if we only skip the qgroup source check we're still safe. Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com Fixes: b5357cb268c4 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-25btrfs: qgroup: fix quota root leak after quota disable failureFilipe Manana1-2/+2
If during the quota disable we fail when cleaning the quota tree or when deleting the root from the root tree, we jump to the 'out' label without ever dropping the reference on the quota root, resulting in a leak of the root since fs_info->quota_root is no longer pointing to the root (we have set it to NULL just before those steps). Fix this by always doing a btrfs_put_root() call under the 'out' label. This is a problem that exists since qgroups were first added in 2012 by commit bed92eae26cc ("Btrfs: qgroup implementation and prototypes"), but back then we missed a kfree on the quota root and free_extent_buffer() calls on its root and commit root nodes, since back then roots were not yet reference counted. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-15btrfs: qgroup: fix qgroup id collision across mountsBoris Burkov1-0/+20
If we delete subvolumes whose ID is the largest in the filesystem, then unmount and mount again, then btrfs_init_root_free_objectid on the tree_root will select a subvolid smaller than that one and thus allow reusing it. If we are also using qgroups (and particularly squotas) it is possible to delete the subvol without deleting the qgroup. In that case, we will be able to create a new subvol whose id already has a level 0 qgroup. This will result in re-using that qgroup which would then lead to incorrect accounting. Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-15btrfs: qgroup: update rescan message levels and error codesDavid Sterba1-7/+5
On filesystems without enabled quotas there's still a warning message in the logs when rescan is called. In that case it's not a problem that should be reported, rescan can be called unconditionally. Change the error code to ENOTCONN which is used for 'quotas not enabled' elsewhere. Remove message (also a warning) when rescan is called during an ongoing rescan, this brings no useful information and the error code is sufficient. Change message levels to debug for now, they can be removed eventually. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: qgroup: fix initialization of auto inherit arrayDan Carpenter1-1/+1
The "i++" was accidentally left out so it just sets qgids[0] over and over. This can lead to unexpected problems, as the groups[1:] would be all 0, leading to later find_qgroup_rb() unable to find a qgroup and cause snapshot creation failure. Fixes: 5343cd9364ea ("btrfs: qgroup: simple quota auto hierarchy for nested subvolumes") CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: rename return variables in btrfs_qgroup_rescan_worker()Anand Jain1-19/+19
Rename ret to ret2 compile and then err to ret. Also, new ret2 is found to be localized within the 'if (trans)' statement, so move its declaration there. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: drop variable err in quick_update_accounting()Anand Jain1-6/+3
In quick_update_accounting() err is used as 2nd return value, which could be achieved just with ret. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik1-14/+13
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: qgroup: delete unnecessary check in btrfs_qgroup_check_inherit()Dan Carpenter1-3/+0
This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and unnecessary. The problem with the check is that static checkers flag it as a potential mixup of between units of bytes vs number of elements. Fortunately, the check can safely be deleted because the next check is correct and applies an even stricter limit: if (size != struct_size(inherit, qgroups, inherit->num_qgroups)) return -EINVAL; The "inherit" struct ends in a variable array of __u64 and "inherit->num_qgroups" is the number of elements in the array. At the start of the function we check that: if (size < sizeof(*inherit) || size > PAGE_SIZE) return -EINVAL; Thus, since we verify that the whole struct fits within one page, that means that the number of elements in the inherit->qgroups[] array must be less than PAGE_SIZE. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02btrfs: qgroup: do not check qgroup inherit if qgroup is disabledQu Wenruo1-0/+2
[BUG] After kernel commit 86211eea8ae1 ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter"), user space tool snapper will fail to create snapshot using its timeline feature. [CAUSE] It turns out that, if using timeline snapper would unconditionally pass btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0) for snapshot creation. In that case, since qgroup is disabled there would be no qgroup 1/0, and btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole snapshot creation. [FIX] Just skip the check if qgroup is not enabled. This is to keep the older behavior for user space tools, as if the kernel behavior changed for user space, it is a regression of kernel. Thankfully snapper is also fixing the behavior by detecting if qgroup is running in the first place, so the effect should not be that huge. Link: https://github.com/openSUSE/snapper/issues/894 Fixes: 86211eea8ae1 ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-25btrfs: take the cleaner_mutex earlier in qgroup disableJosef Bacik1-13/+8
One of my CI runs popped the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 6.9.0-rc4+ #1 Not tainted ------------------------------------------------------ btrfs/471533 is trying to acquire lock: ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0 but task is already holding lock: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->subvol_sem){++++}-{3:3}: down_read+0x42/0x170 btrfs_rename+0x607/0xb00 btrfs_rename2+0x2e/0x70 vfs_rename+0xaf8/0xfc0 do_renameat2+0x586/0x600 __x64_sys_rename+0x43/0x50 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}: down_write+0x3f/0xc0 btrfs_inode_lock+0x40/0x70 prealloc_file_extent_cluster+0x1b0/0x370 relocate_file_extent_cluster+0xb2/0x720 relocate_data_extent+0x107/0x160 relocate_block_group+0x442/0x550 btrfs_relocate_block_group+0x2cb/0x4b0 btrfs_relocate_chunk+0x50/0x1b0 btrfs_balance+0x92f/0x13d0 btrfs_ioctl+0x1abf/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}: __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 __mutex_lock+0xbe/0xc00 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->subvol_sem); lock(&sb->s_type->i_mutex_key#16); lock(&fs_info->subvol_sem); lock(&fs_info->cleaner_mutex); *** DEADLOCK *** 2 locks held by btrfs/471533: #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600 #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600 stack backtrace: CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1 Call Trace: <TASK> dump_stack_lvl+0x77/0xb0 check_noncircular+0x148/0x160 ? lock_acquire+0xcb/0x2e0 __lock_acquire+0x13e7/0x2180 lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? lock_is_held_type+0x9a/0x110 __mutex_lock+0xbe/0xc00 ? btrfs_quota_disable+0x54/0x4c0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? btrfs_quota_disable+0x54/0x4c0 ? btrfs_quota_disable+0x54/0x4c0 btrfs_quota_disable+0x54/0x4c0 btrfs_ioctl+0x206b/0x2600 ? srso_return_thunk+0x5/0x5f ? __do_sys_statfs+0x61/0x70 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x95/0x180 ? srso_return_thunk+0x5/0x5f ? reacquire_held_locks+0xd1/0x1f0 ? do_user_addr_fault+0x307/0x8a0 ? srso_return_thunk+0x5/0x5f ? lock_acquire+0xcb/0x2e0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? find_held_lock+0x2b/0x80 ? srso_return_thunk+0x5/0x5f ? lock_release+0xca/0x2a0 ? srso_return_thunk+0x5/0x5f ? do_user_addr_fault+0x35c/0x8a0 ? srso_return_thunk+0x5/0x5f ? trace_hardirqs_off+0x4b/0xc0 ? srso_return_thunk+0x5/0x5f ? lockdep_hardirqs_on_prepare+0xde/0x190 ? srso_return_thunk+0x5/0x5f This happens because when we call rename we already have the inode mutex held, and then we acquire the subvol_sem if we are a subvolume. This makes the dependency inode lock -> subvol sem When we're running data relocation we will preallocate space for the data relocation inode, and we always run the relocation under the ->cleaner_mutex. This now creates the dependency of cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem Qgroup delete is doing this in the opposite order, it is acquiring the subvol_sem and then it is acquiring the cleaner_mutex, which results in this lockdep splat. This deadlock can't happen in reality, because we won't ever rename the data reloc inode, nor is the data reloc inode a subvolume. However this is fairly easy to fix, simply take the cleaner mutex in the case where we are disabling qgroups before we take the subvol_sem. This resolves the lockdep splat. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02btrfs: qgroup: correctly model root qgroup rsv in convertBoris Burkov1-0/+2
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and pertrans reservations for subvolumes when quotas are enabled. The convert function does not properly increment pertrans after decrementing prealloc, so the count is not accurate. Note: we check that the fs is not read-only to mirror the logic in qgroup_convert_meta, which checks that before adding to the pertrans rsv. Fixes: 8287475a2055 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: allow quick inherit if snapshot is created and added to the ↵Qu Wenruo1-7/+72
same parent Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the qgroup inconsistent. This can be annoying if the fs has a lot of snapshots, and needs qgroup to get the accounting for the amount of bytes it can free for each snapshot. Although we have the new simple quote as a solution, there is also a case where we can skip the full scan, if all the following conditions are met: - The source subvolume belongs to a higher level parent qgroup - The parent qgroup already owns all its bytes exclusively - The new snapshot is also added to the same parent qgroup In that case, we only need to add nodesize to the parent qgroup and avoid a full rescan. This patch would add the extra quick accounting update for such inherit. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: validate btrfs_qgroup_inherit parameterQu Wenruo1-0/+51
[BUG] Currently btrfs can create subvolume with an invalid qgroup inherit without triggering any error: # mkfs.btrfs -O quota -f $dev # mount $dev $mnt # btrfs subvolume create -i 2/0 $mnt/subv1 # btrfs qgroup show -prce --sync $mnt Qgroupid Referenced Exclusive Path -------- ---------- --------- ---- 0/5 16.00KiB 16.00KiB <toplevel> 0/256 16.00KiB 16.00KiB subv1 [CAUSE] We only do a very basic size check for btrfs_qgroup_inherit structure, but never really verify if the values are correct. Thus in btrfs_qgroup_inherit() function, we have to skip non-existing qgroups, and never return any error. [FIX] Fix the behavior and introduce extra checks: - Introduce early check for btrfs_qgroup_inherit structure Not only the size, but also all the qgroup ids would be verified. And the timing is very early, so we can return error early. This early check is very important for snapshot creation, as snapshot is delayed to transaction commit. - Drop support for btrfs_qgroup_inherit::num_ref_copies and num_excl_copies Those two members are used to specify to copy refr/excl numbers from other qgroups. This would definitely mark qgroup inconsistent, and btrfs-progs has dropped the support for them for a long time. It's time to drop the support for kernel. - Verify the supported btrfs_qgroup_inherit::flags Just in case we want to add extra flags for btrfs_qgroup_inherit. Now above subvolume creation would fail with -ENOENT other than silently ignore the non-existing qgroup. CC: stable@vger.kernel.org # 6.7+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: qgroup: always free reserved space for extent recordsQu Wenruo1-5/+5
[BUG] If qgroup is marked inconsistent (e.g. caused by operations needing full subtree rescan, like creating a snapshot and assign to a higher level qgroup), btrfs would immediately start leaking its data reserved space. The following script can easily reproduce it: mkfs.btrfs -O quota -f $dev mount $dev $mnt btrfs subvolume create $mnt/subv1 btrfs qgroup create 1/0 $mnt # This snapshot creation would mark qgroup inconsistent, # as the ownership involves different higher level qgroup, thus # we have to rescan both source and snapshot, which can be very # time consuming, thus here btrfs just choose to mark qgroup # inconsistent, and let users to determine when to do the rescan. btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1 # Now this write would lead to qgroup rsv leak. xfs_io -f -c "pwrite 0 64k" $mnt/file1 # And at unmount time, btrfs would report 64K DATA rsv space leaked. umount $mnt And we would have the following dmesg output for the unmount: BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3 BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536 [CAUSE] Since commit e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"), we introduce a mode for btrfs qgroup to skip the timing consuming backref walk, if the qgroup is already inconsistent. But this skip also covered the data reserved freeing, thus the qgroup reserved space for each newly created data extent would not be freed, thus cause the leakage. [FIX] Make the data extent reserved space freeing mandatory. The qgroup reserved space handling is way cheaper compared to the backref walking part, and we always have the super sensitive leak detector, thus it's definitely worth to always free the qgroup reserved data space. Reported-by: Fabian Vogt <fvogt@suse.com> Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting") CC: stable@vger.kernel.org # 6.1+ Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle transaction commit errors in flush_reservations()David Sterba1-1/+1
Other errors in flush_reservations() are handled and also in the caller. Ignoring commit might make some sense as it's called right after join so it's to poke the whole commit machinery to free space. However for consistency return the error. The caller btrfs_quota_disable() would try to start the transaction which would in turn fail too so there's no effective change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: delete pointless BUG_ON check on quota root in ↵David Sterba1-2/+0
btrfs_qgroup_account_extent() The BUG_ON is deep in the qgroup code where we can expect that it exists. A NULL pointer would cause a crash. It was added long ago in 550d7a2ed5db35 ("btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents()."). It maybe made sense back then as the quota enable/disable state machine was not that robust as it is nowadays, so we can just delete it. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ONs to assertions in btrfs_qgroup_trace_subtree()David Sterba1-2/+2
The only caller do_walk_down() of btrfs_qgroup_trace_subtree() validates the value of level and uses it several times before it's passed as an argument. Same for root_eb that's called 'next' in the caller. Change both BUG_ONs to assertions as this is to assure proper interface use rather than real errors. Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-31btrfs: forbid deleting live subvol qgroupBoris Burkov1-0/+14
If a subvolume still exists, forbid deleting its qgroup 0/subvolid. This behavior generally leads to incorrect behavior in squotas and doesn't have a legitimate purpose. Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: fix typos found by codespellDavid Sterba1-1/+1
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: ensure releasing squota reserve on head refsBoris Burkov1-3/+11
A reservation goes through a 3 step lifetime: - generated during delalloc - released/counted by ordered_extent allocation - freed by running delayed ref That third step depends on must_insert_reserved on the head ref, so the head ref with that field set owns the reservation. Once you prepare to run the head ref, must_insert_reserved is unset, which means that running the ref must free the reservation, whether or not it succeeds, or else the reservation is leaked. That results in either a risk of spurious ENOSPC if the fs stays writeable or a warning on unmount if it is readonly. The existing squota code was aware of these invariants, but missed a few cases. Improve it by adding a helper function to use in the cleanup paths and call it from the existing early returns in running delayed refs. This also simplifies btrfs_record_squota_delta and struct btrfs_quota_delta. This fixes (or at least improves the reliability of) generic/475 with "mkfs -O squota". On my machine, that test failed ~4/10 times without this patch and passed 100/100 times with it. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: free qgroup pertrans reserve on transaction abortBoris Burkov1-2/+3
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov1-10/+15
The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-09btrfs: fix qgroup record leaks when using simple quotasFilipe Manana1-1/+1
When using simple quotas we are not supposed to allocate qgroup records when adding delayed references. However we allocate them if either mode of quotas is enabled (the new simple one or the old one), but then we never free them because running the accounting, which frees the records, is only run when using the old quotas (at btrfs_qgroup_account_extents()), resulting in a memory leak of the records allocated when adding delayed references. Fix this by allocating the records only if the old quotas mode is enabled. Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas mode is not enabled - meaning the caller has to free the record. Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas") Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-09btrfs: fix race between accounting qgroup extents and removing a qgroupFilipe Manana1-1/+7
When doing qgroup accounting for an extent, we take the spinlock fs_info->qgroup_lock and then add qgroups to the local list (iterator) named "qgroups". These qgroups are found in the fs_info->qgroup_tree rbtree. After we're done, we unlock fs_info->qgroup_lock and then call qgroup_iterator_nested_clean(), which will iterate over all the qgroups added to the local list "qgroups" and then delete them from the list. Deleting a qgroup from the list can however result in a use-after-free if a qgroup remove operation happens after we unlock fs_info->qgroup_lock and before or while we are at qgroup_iterator_nested_clean(). Fix this by calling qgroup_iterator_nested_clean() while still holding the lock fs_info->qgroup_lock - we don't need it under the 'out' label since before taking the lock the "qgroups" list is always empty. This guarantees safety because btrfs_remove_qgroup() takes that lock before removing a qgroup from the rbtree fs_info->qgroup_tree. This was reported by syzbot with the following stack traces: BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49 Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48 CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Workqueue: btrfs-qgroup-rescan btrfs_work_helper Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x163/0x540 mm/kasan/report.c:475 kasan_report+0x175/0x1b0 mm/kasan/report.c:588 __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49 __list_del_entry_valid include/linux/list.h:124 [inline] __list_del_entry include/linux/list.h:215 [inline] list_del_init include/linux/list.h:287 [inline] qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline] btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883 qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline] btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604 btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703 worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 </TASK> Allocated by task 6355: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4f/0x70 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:721 [inline] btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209 btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Freed by task 6355: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4f/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522 ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1800 [inline] slab_free_freelist_hook mm/slub.c:1826 [inline] slab_free mm/slub.c:3809 [inline] __kmem_cache_free+0x263/0x3a0 mm/slub.c:3822 btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787 btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2667 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781 kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 Second to last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2667 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781 kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 The buggy address belongs to the object at ffff888027e42000 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 176 bytes inside of freed 512-byte region [ffff888027e42000, ffff888027e42200) The buggy address belongs to the physical page: page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40 head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536 prep_new_page mm/page_alloc.c:1543 [inline] get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170 __alloc_pages+0x255/0x670 mm/page_alloc.c:4426 alloc_slab_page+0x6a/0x160 mm/slub.c:1870 allocate_slab mm/slub.c:2017 [inline] new_slab+0x84/0x2f0 mm/slub.c:2070 ___slab_alloc+0xc85/0x1310 mm/slub.c:3223 __slab_alloc mm/slub.c:3322 [inline] __slab_alloc_node mm/slub.c:3375 [inline] slab_alloc_node mm/slub.c:3468 [inline] __kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517 kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:721 [inline] kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670 do_dentry_open+0x8fd/0x1590 fs/open.c:948 do_open fs/namei.c:3622 [inline] path_openat+0x2845/0x3280 fs/namei.c:3779 do_filp_open+0x234/0x490 fs/namei.c:3809 do_sys_openat2+0x13e/0x1d0 fs/open.c:1440 do_sys_open fs/open.c:1455 [inline] __do_sys_openat fs/open.c:1471 [inline] __se_sys_openat fs/open.c:1466 [inline] __x64_sys_openat+0x247/0x290 fs/open.c:1466 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1136 [inline] free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312 free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405 discard_slab mm/slub.c:2116 [inline] __unfreeze_partials+0x1dc/0x220 mm/slub.c:2655 put_cpu_partial+0x17b/0x250 mm/slub.c:2731 __slab_free+0x2b6/0x390 mm/slub.c:3679 qlink_free mm/kasan/quarantine.c:166 [inline] qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185 kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292 __kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762 slab_alloc_node mm/slub.c:3478 [inline] slab_alloc mm/slub.c:3486 [inline] __kmem_cache_alloc_lru mm/slub.c:3493 [inline] kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502 getname_flags+0xbc/0x4f0 fs/namei.c:140 do_sys_openat2+0xd2/0x1d0 fs/open.c:1434 do_sys_open fs/open.c:1455 [inline] __do_sys_openat fs/open.c:1471 [inline] __se_sys_openat fs/open.c:1466 [inline] __x64_sys_openat+0x247/0x290 fs/open.c:1466 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Memory state around the buggy address: ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com Fixes: dce28769a33a ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()") CC: stable@vger.kernel.org # 6.6 Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: only set QUOTA_ENABLED when done reading qgroupsBoris Burkov1-8/+7
In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a quota_root, as opposed to after we are done setting up the qgroup structures. In the quota_enable path, we wait until after the structures are set up. Likewise, in disable, we clear the bit before tearing down the structures. I feel that this organization is less surprising for the open_ctree path. I don't believe this fixes any actual bug, but avoids potential confusion when using btrfs_qgroup_mode in an intermediate state where we are enabled but haven't yet setup the qgroup status flags. It also avoids any risk of calling a qgroup function and attempting to use the qgroup rbtrees before they exist/are setup. This all occurs before we do rw setup, so I believe it should be mostly a no-op. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: check generation when recording simple quota deltaBoris Burkov1-6/+22
Simple quotas count extents only from the moment the feature is enabled. Therefore, if we do something like: 1. create subvol S 2. write F in S 3. enable quotas 4. remove F 5. write G in S then after 3. and 4. we would expect the simple quota usage of S to be 0 (putting aside some metadata extents that might be written) and after 5., it should be the size of G plus metadata. Therefore, we need to be able to determine whether a particular quota delta we are processing predates simple quota enablement. To do this, store the transaction id when quotas were enabled. In fs_info for immediate use and in the quota status item to make it recoverable on mount. When we see a delta, check if the generation of the extent item is less than that of quota enablement. If so, we should ignore the delta from this extent. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: simple quota auto hierarchy for nested subvolumesBoris Burkov1-3/+54
Consider the following sequence: - enable quotas - create subvol S id 256 at dir outer/ - create a qgroup 1/100 - add 0/256 (S's auto qgroup) to 1/100 - create subvol T id 257 at dir outer/inner/ With full qgroups, there is no relationship between 0/257 and either of 0/256 or 1/100. There is an inherit feature that the creator of inner/ can use to specify it ought to be in 1/100. Simple quotas are targeted at container isolation, where such automatic inheritance for not necessarily trusted/controlled nested subvol creation would be quite helpful. Therefore, add a new default behavior for simple quotas: when you create a nested subvol, automatically inherit as parents any parents of the qgroup of the subvol the new inode is going in. In our example, 257/0 would also be under 1/100, allowing easy control of a total quota over an arbitrary hierarchy of subvolumes. I think this _might_ be a generally useful behavior, so it could be interesting to put it behind a new inheritance flag that simple quotas always use while traditional quotas let the user specify, but this is a minimally intrusive change to start. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add helper for recording simple quota deltasBoris Burkov1-0/+46
Rather than re-computing shared/exclusive ownership based on backrefs and walking roots for implicit backrefs, simple quotas does an increment when creating an extent and a decrement when deleting it. Add the API for the extent item code to use to track those events. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: create qgroup earlier in snapshot creationBoris Burkov1-0/+3
Pull creating the qgroup earlier in the snapshot. This allows simple quotas qgroups to see all the metadata writes related to the snapshot being created and to be born with the root node accounted. Note this has an impact on transaction commit where the qgroup creation can do a lot of work, allocate memory and take locks. The change is done for correctness, potential performance issues will be fixed in the future. Signed-off-by: Boris Burkov <boris@bur.io> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: flush reservations during quota disableBoris Burkov1-3/+44
The following sequence: enable simple quotas do some writes reserve space create ordered_extent release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits) disable quotas enable simple quotas set qgroup rsv to 0 on all subvolumes ordered_extent finishes create delayed ref with rsv_bytes from before run delayed ref record_simple_quota_delta free rsv_bytes (0 -> -rsv_delta) results in us reliably underflowing the subvolume's qgroup rsv counter, because disabling/re-enabling quotas toggles reservation counters down to 0, but does not remove other file system state which represents successful acquisition of qgroup rsv space. Specifically metadata rsv counters on the root object and rsv_bytes on ordered_extent objects that have released their reservation as well as the corresponding QGROUP_RESERVED extent bits. Normal qgroups gets away with this, I believe because it forces more work to happen on transaction commit, but I am not certain it is totally safe from the ordered_extent/leaked extent bit variant. Simple quotas hits this reliably. The intent of the fix is to make disable take the time to clear that external to qgroups state as well: after flipping off the quota bit on fs_info, flush delalloc and ordered extents, clearing the extent bits along the way. This makes it so there are no ordered extents or meta prealloc hanging around from the first enablement period during the second. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: add new quota mode for simple quotasBoris Burkov1-30/+80
Add a new quota mode called "simple quotas". It can be enabled by the existing quota enable ioctl via a new command, and sets an incompat bit, as the implementation of simple quotas will make backwards incompatible changes to the disk format of the extent tree. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: qgroup: introduce quota modeBoris Burkov1-0/+7
In preparation for introducing simple quotas, change from a binary setting for quotas to an enum based mode. Initially, the possible modes are disabled/full. Full quotas is normal btrfs qgroups. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: merge ordered work callbacks in btrfs_work into oneDavid Sterba1-1/+1
There are two callbacks defined in btrfs_work but only two actually make use of them, otherwise there are NULLs. We can get rid of the freeing callback making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout: struct btrfs_work { btrfs_func_t func; /* 0 8 */ btrfs_ordered_func_t ordered_func; /* 8 8 */ struct work_struct normal_work; /* 16 32 */ struct list_head ordered_list; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btrfs_workqueue * wq; /* 64 8 */ long unsigned int flags; /* 72 8 */ /* size: 80, cachelines: 2, members: 6 */ /* last cacheline: 16 bytes */ }; This in turn reduces size of other structures (on a release config): - async_chunk 160 -> 152 - async_submit_bio 152 -> 144 - btrfs_async_delayed_work 104 -> 96 - btrfs_caching_control 176 -> 168 - btrfs_delalloc_work 144 -> 136 - btrfs_fs_info 3608 -> 3600 - btrfs_ordered_extent 440 -> 424 - btrfs_writepage_fixup 104 -> 96 Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana1-7/+7
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: move functions comments from qgroup.h to qgroup.cDavid Sterba1-0/+71
We keep the comments next to the implementation, there were some left to move. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>