aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2022-01-03btrfs: consolidate device_list_mutex in prepare_sprout to its parentAnand Jain1-18/+51
btrfs_prepare_sprout() splices seed devices into its own struct fs_devices, so that its parent function btrfs_init_new_device() can add the new sprout device to fs_info->fs_devices. Both btrfs_prepare_sprout() and btrfs_init_new_device() need device_list_mutex. But they are holding it separately, thus create a small race window. Close it and hold device_list_mutex across both functions btrfs_init_new_device() and btrfs_prepare_sprout(). Split btrfs_prepare_sprout() into btrfs_init_sprout() and btrfs_setup_sprout(). This split is essential because device_list_mutex must not be held for allocations in btrfs_init_sprout() but must be held for btrfs_setup_sprout(). So now a common device_list_mutex can be used between btrfs_init_new_device() and btrfs_setup_sprout(). Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: switch seeding_dev in init_new_device to boolAnand Jain1-2/+2
Declare int seeding_dev as a bool. Also, move its declaration a line below to adjust packing. Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: send: remove unused type parameter to iterate_inode_ref_tOmar Sandoval1-19/+13
Again, I don't think this was ever used since iterate_dir_item() is only used for xattrs. No functional change. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: send: remove unused found_type parameter to lookup_dir_item_inode()Omar Sandoval1-7/+3
As far as I can tell, this was never used. No functional change. Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: rename btrfs_item_end_nr to btrfs_item_data_endJosef Bacik3-10/+10
The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually the offset of the end of the data the item points to. In fact all of the helpers that we use btrfs_item_end_nr() use data in their name, like BTRFS_LEAF_DATA_SIZE() and leaf_data(). Rename to btrfs_item_data_end() to make it clear what this helper is giving us. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: remove the btrfs_item_end() helperJosef Bacik1-11/+5
We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can be collapsed. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: drop the _nr from the item helpersJosef Bacik24-218/+191
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: introduce item_nr token variant helpersJosef Bacik2-40/+57
The last remaining place where we have the pattern of item = btrfs_item_nr(slot) <do something with the item> are the token helpers. Handle this by introducing token helpers that will do the btrfs_item_nr() work inside of the helper itself, and then convert all users of the btrfs_item token helpers to the new _nr() variants. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: make btrfs_file_extent_inline_item_len take a slotJosef Bacik2-4/+3
Instead of getting the btrfs_item for this, simply pass in the slot of the item and then use the btrfs_item_size_nr() helper inside of btrfs_file_extent_inline_item_len(). Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: add btrfs_set_item_*_nr() helpersJosef Bacik2-15/+21
We have the pattern of item = btrfs_item_nr(slot); btrfs_set_item_*(leaf, item); in a bunch of places in our code. Fix this by adding btrfs_set_item_*_nr() helpers which will do the appropriate work, and replace those calls with btrfs_set_item_*_nr(leaf, slot); Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhereJosef Bacik8-49/+19
We have this pattern in a lot of places item = btrfs_item_nr(slot); btrfs_item_size(leaf, item); when we could simply use btrfs_item_size(leaf, slot); Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the _nr variation of the helpers. Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: remove no longer needed logic for replaying directory deletesFilipe Manana1-89/+69
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items. So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: only copy dir index keys when logging a directoryFilipe Manana2-231/+182
Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: remove spurious unlock/lock of unused_bgs_lockNikolay Borisov1-2/+0
Since both unused block groups and reclaim bgs lists are protected by unused_bgs_lock then free them in the same critical section without doing an extra unlock/lock pair. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: fix deadlock between quota enable and other quota operationsFilipe Manana1-0/+19
When enabling quotas, we attempt to commit a transaction while holding the mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other quota operations such as: - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE; - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN. This is because these operations join a transaction and after that they attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex after joining or starting a transaction is a pattern followed everywhere in qgroups, so the quota enablement operation is the one at fault here, and should not commit a transaction while holding that mutex. Fix this by making the transaction commit while not holding the mutex. We are safe from two concurrent tasks trying to enable quotas because we are serialized by the rw semaphore fs_info->subvol_sem at btrfs_ioctl_quota_ctl(), which is the only call site for enabling quotas. When this deadlock happens, it produces a trace like the following: INFO: task syz-executor:25604 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201 btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120 btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline] btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f86920b2c4d RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008 RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0 R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0 INFO: task syz-executor:25628 blocked for more than 143 seconds. Not tainted 5.15.0-rc6 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:4940 [inline] __schedule+0xcd9/0x2530 kernel/sched/core.c:6287 schedule+0xd3/0x270 kernel/sched/core.c:6366 schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425 __mutex_lock_common kernel/locking/mutex.c:669 [inline] __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729 btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548 btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline] btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: Hao Sun <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/ Fixes: 340f1aa27f36 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable") CC: [email protected] # 4.19 Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-01-03btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW rangeFilipe Manana1-64/+78
When doing a direct IO write against a file range that either has preallocated extents in that range or has regular extents and the file has the NOCOW attribute set, the write fails with -ENOSPC when all of the following conditions are met: 1) There are no data blocks groups with enough free space matching the size of the write; 2) There's not enough unallocated space for allocating a new data block group; 3) The extents in the target file range are not shared, neither through snapshots nor through reflinks. This is wrong because a NOCOW write can be done in such case, and in fact it's possible to do it using a buffered IO write, since when failing to allocate data space, the buffered IO path checks if a NOCOW write is possible. The failure in direct IO write path comes from the fact that early on, at btrfs_dio_iomap_begin(), we try to allocate data space for the write and if it that fails we return the error and stop - we never check if we can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check if we can do a NOCOW write into the range, or a subset of the range, and then release the previously reserved data space. Fix this by doing the data reservation only if needed, when we must COW, at btrfs_get_blocks_direct_write() instead of doing it at btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes the inneficiency of doing unnecessary data reservations. The following example test script reproduces the problem: $ cat dio-nocow-enospc.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj # Use a small fixed size (1G) filesystem so that it's quick to fill # it up. # Make sure the mixed block groups feature is not enabled because we # later want to not have more space available for allocating data # extents but still have enough metadata space free for the file writes. mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV mount $DEV $MNT # Create our test file with the NOCOW attribute set. touch $MNT/foobar chattr +C $MNT/foobar # Now fill in all unallocated space with data for our test file. # This will allocate a data block group that will be full and leave # no (or a very small amount of) unallocated space in the device, so # that it will not be possible to allocate a new block group later. echo echo "Creating test file with initial data..." xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar # Now try a direct IO write against file range [0, 10M[. # This should succeed since this is a NOCOW file and an extent for the # range was previously allocated. echo echo "Trying direct IO write over allocated space..." xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar umount $MNT When running the test: $ ./dio-nocow-enospc.sh (...) Creating test file with initial data... wrote 943718400/943718400 bytes at offset 0 900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec) Trying direct IO write over allocated space... pwrite: No space left on device A test case for fstests will follow, testing both this direct IO write scenario as well as the buffered IO write scenario to make it less likely to get future regressions on the buffered IO case. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-30fs/mount_setattr: always cleanup mount_kattrChristian Brauner1-5/+4
Make sure that finish_mount_kattr() is called after mount_kattr was succesfully built in both the success and failure case to prevent leaking any references we took when we built it. We returned early if path lookup failed thereby risking to leak an additional reference we took when building mount_kattr when an idmapped mount was requested. Cc: [email protected] Cc: [email protected] Fixes: 9caccd41541a ("fs: introduce MOUNT_ATTR_IDMAP") Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-12-23Merge tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds3-8/+26
Pull ksmbd fixes from Steve French: "Three ksmbd fixes, all for stable as well. Two fix potential unitialized memory and one fixes a security problem where encryption is unitentionally disabled from some clients" * tag '5.16-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1 ksmbd: fix uninitialized symbol 'pntsd_size' ksmbd: fix error code in ndr_read_int32()
2021-12-23Merge tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+7
Pull io_uring fix from Jens Axboe: "Single fix for not clearing kiocb->ki_pos back to 0 for a stream, destined for stable as well" * tag 'io_uring-5.16-2021-12-23' of git://git.kernel.dk/linux-block: io_uring: zero iocb->ki_pos for stream file types
2021-12-22io_uring: zero iocb->ki_pos for stream file typesJens Axboe1-3/+7
io_uring supports using offset == -1 for using the current file position, and we read that in as part of read/write command setup. For the non-iter read/write types we pass in NULL for the position pointer, but for the iter types we should not be passing any anything but 0 for the position for a stream. Clear kiocb->ki_pos if the file is a stream, don't leave it as -1. If we do, then the request will error with -ESPIPE. Fixes: ba04291eb66e ("io_uring: allow use of offset == -1 to mean file position") Link: https://github.com/axboe/liburing/discussions/501 Reported-by: Samuel Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-12-21Merge tag 'nfsd-5.16-3' of ↵Linus Torvalds2-11/+8
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: "Address a buffer overrun reported by Anatoly Trosinenko" * tag 'nfsd-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix READDIR buffer overflow
2021-12-19Merge tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds3-14/+44
Pull cifs fixes from Steve French: "Two cifs/smb3 fixes, one fscache related, and one mount parsing related for stable" * tag '5.16-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: sanitize multiple delimiters in prepath cifs: ignore resource_id while getting fscache super cookie
2021-12-18NFSD: Fix READDIR buffer overflowChuck Lever2-11/+8
If a client sends a READDIR count argument that is too small (say, zero), then the buffer size calculation in the new init_dirlist helper functions results in an underflow, allowing the XDR stream functions to write beyond the actual buffer. This calculation has always been suspect. NFSD has never sanity- checked the READDIR count argument, but the old entry encoders managed the problem correctly. With the commits below, entry encoding changed, exposing the underflow to the pointer arithmetic in xdr_reserve_space(). Modern NFS clients attempt to retrieve as much data as possible for each READDIR request. Also, we have no unit tests that exercise the behavior of READDIR at the lower bound of @count values. Thus this case was missed during testing. Reported-by: Anatoly Trosinenko <[email protected]> Fixes: f5dcccd647da ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream") Fixes: 7f87fc2d34d4 ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream") Signed-off-by: Chuck Lever <[email protected]>
2021-12-17Merge tag 'zonefs-5.16-rc6' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fixes from Damien Le Moal: "One fix and one trivial update for rc6: - Add MODULE_ALIAS_FS to get automatic module loading on mount (Naohiro) - Update Damien's email address in the MAINTAINERS file (me)" * tag 'zonefs-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: MAITAINERS: Change zonefs maintainer email address zonefs: add MODULE_ALIAS_FS
2021-12-17ksmbd: disable SMB2_GLOBAL_CAP_ENCRYPTION for SMB 3.1.1Marcos Del Sol Vives2-7/+21
According to the official Microsoft MS-SMB2 document section 3.3.5.4, this flag should be used only for 3.0 and 3.0.2 dialects. Setting it for 3.1.1 is a violation of the specification. This causes my Windows 10 client to detect an anomaly in the negotiation, and disable encryption entirely despite being explicitly enabled in ksmbd, causing all data transfers to go in plain text. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: [email protected] # v5.15 Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Marcos Del Sol Vives <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-12-17cifs: sanitize multiple delimiters in prepathThiago Rafael Becker1-1/+37
mount.cifs can pass a device with multiple delimiters in it. This will cause rename(2) to fail with ENOENT. V2: - Make sanitize_path more readable. - Fix multiple delimiters between UNC and prepath. - Avoid a memory leak if a bad user starts putting a lot of delimiters in the path on purpose. BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2031200 Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Cc: [email protected] # 5.11+ Acked-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Thiago Rafael Becker <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-12-17cifs: ignore resource_id while getting fscache super cookieShyam Prasad N2-13/+7
We have a cyclic dependency between fscache super cookie and root inode cookie. The super cookie relies on tcon->resource_id, which gets populated from the root inode number. However, fetching the root inode initializes inode cookie as a child of super cookie, which is yet to be populated. resource_id is only used as auxdata to check the validity of super cookie. We can completely avoid setting resource_id to remove the circular dependency. Since vol creation time and vol serial numbers are used for auxdata, we should be fine. Additionally, there will be auxiliary data check for each inode cookie as well. Fixes: 5bf91ef03d98 ("cifs: wait for tcon resource_id before getting fscache super") CC: David Howells <[email protected]> Signed-off-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-12-17Merge tag 'for-5.16-rc5-tag' of ↵Linus Torvalds10-24/+54
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes, almost all error handling one-liners and for stable. - regression fix in directory logging items - regression fix of extent buffer status bits handling after an error - fix memory leak in error handling path in tree-log - fix freeing invalid anon device number when handling errors during subvolume creation - fix warning when freeing leaf after subvolume creation failure - fix missing blkdev put in device scan error handling - fix invalid delayed ref after subvolume creation failure" * tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() btrfs: fix warning when freeing leaf after subvolume creation failure btrfs: fix invalid delayed ref after subvolume creation failure btrfs: check WRITE_ERR when trying to read an extent buffer btrfs: fix missing last dir item offset update when logging directory btrfs: fix double free of anon_dev after failure to create subvolume btrfs: fix memory leak in __add_inode_ref()
2021-12-17Merge tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+2
Pull io_uring fix from Jens Axboe: "Just a single fix, fixing an issue with the worker creation change that was merged last week" * tag 'io_uring-5.16-2021-12-17' of git://git.kernel.dk/linux-block: io-wq: drop wqe lock before creating new worker
2021-12-17zonefs: add MODULE_ALIAS_FSNaohiro Aota1-0/+1
Add MODULE_ALIAS_FS() to load the module automatically when you do "mount -t zonefs". Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Cc: stable <[email protected]> # 5.6+ Signed-off-by: Naohiro Aota <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Damien Le Moal <[email protected]>
2021-12-16ksmbd: fix uninitialized symbol 'pntsd_size'Namjae Jeon1-0/+4
No check for if "rc" is an error code for build_sec_desc(). This can cause problems with using uninitialized pntsd_size. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: [email protected] # v5.15 Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-12-16ksmbd: fix error code in ndr_read_int32()Dan Carpenter1-1/+1
This is a failure path and it should return -EINVAL instead of success. Otherwise it could result in the caller using uninitialized memory. Fixes: 303fff2b8c77 ("ksmbd: add validation for ndr read/write functions") Cc: [email protected] # v5.15 Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: Steve French <[email protected]>
2021-12-16afs: Fix mmapDavid Howells2-2/+4
Fix afs_add_open_map() to check that the vnode isn't already on the list when it adds it. It's possible that afs_drop_open_mmap() decremented the cb_nr_mmap counter, but hadn't yet got into the locked section to remove it. Also vnode->cb_mmap_link should be initialised, so fix that too. Fixes: 6e0e99d58a65 ("afs: Fix mmap coherency vs 3rd-party changes") Reported-by: [email protected] Suggested-by: Marc Dionne <[email protected]> Signed-off-by: David Howells <[email protected]> Tested-by: [email protected] cc: [email protected] Link: https://lore.kernel.org/r/[email protected]/ # v1 Signed-off-by: Linus Torvalds <[email protected]>
2021-12-15Merge tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-clientLinus Torvalds3-14/+25
Pull ceph fixes from Ilya Dryomov: "An SGID directory handling fix (marked for stable), a metrics accounting fix and two fixups to appease static checkers" * tag 'ceph-for-5.16-rc6' of git://github.com/ceph/ceph-client: ceph: fix up non-directory creation in SGID directories ceph: initialize pathlen variable in reconnect_caps_cb ceph: initialize i_size variable in ceph_sync_read ceph: fix duplicate increment of opened_inodes metric
2021-12-15btrfs: fix missing blkdev_put() call in btrfs_scan_one_device()Shin'ichiro Kawasaki1-2/+4
The function btrfs_scan_one_device() calls blkdev_get_by_path() and blkdev_put() to get and release its target block device. However, when btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the block device is left without clean up. This triggered failure of fstests generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to call blkdev_put(). Fixes: 12659251ca5df ("btrfs: implement log-structured superblock for ZONED mode") CC: [email protected] # 5.15+ Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Shin'ichiro Kawasaki <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-15btrfs: fix warning when freeing leaf after subvolume creation failureFilipe Manana1-3/+4
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the root item for the new subvolume into the root tree, we can trigger the following warning: [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.743344] Modules linked in: [78961.749440] dm_snapshot dm_thin_pool (...) [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1 [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs] [78961.778398] Code: 17 00 48 85 (...) [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202 [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000 [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000 [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8 [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000 [78961.786620] FS: 00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000 [78961.787717] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0 [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [78961.791684] Call Trace: [78961.792082] <TASK> [78961.792359] create_subvol+0x5d1/0x9a0 [btrfs] [78961.793054] btrfs_mksubvol+0x447/0x4c0 [btrfs] [78961.794009] ? preempt_count_add+0x49/0xa0 [78961.794705] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [78961.795712] ? _copy_from_user+0x66/0xa0 [78961.796382] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [78961.797392] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [78961.798172] ? __slab_free+0x10a/0x360 [78961.798820] ? rcu_read_lock_sched_held+0x12/0x60 [78961.799664] ? lock_release+0x223/0x4a0 [78961.800321] ? lock_acquired+0x19f/0x420 [78961.800992] ? rcu_read_lock_sched_held+0x12/0x60 [78961.801796] ? trace_hardirqs_on+0x1b/0xe0 [78961.802495] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [78961.803358] ? kmem_cache_free+0x321/0x3c0 [78961.804071] ? __x64_sys_ioctl+0x83/0xb0 [78961.804711] __x64_sys_ioctl+0x83/0xb0 [78961.805348] do_syscall_64+0x3b/0xc0 [78961.805969] entry_SYSCALL_64_after_hwframe+0x44/0xae [78961.806830] RIP: 0033:0x7f31284bc957 [78961.807517] Code: 3c 1c 48 f7 d8 (...) This is because we are calling btrfs_free_tree_block() on an extent buffer that is dirty. Fix that by cleaning the extent buffer, with btrfs_clean_tree_block(), before freeing it. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: [email protected] # 4.4+ Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-15btrfs: fix invalid delayed ref after subvolume creation failureFilipe Manana6-22/+31
When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: [email protected] # 4.4+ Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-15btrfs: check WRITE_ERR when trying to read an extent bufferJosef Bacik1-0/+8
Filipe reported a hang when we have errors on btrfs. This turned out to be a side-effect of my fix c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") which made it so we clear EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out. Below is a paste of Filipe's analysis he got from using drgn to debug the hang """ btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to a value while writeback of all pages has not yet completed: --> writeback for the first 3 pages finishes, we clear EXTENT_BUFFER_UPTODATE from eb on the first page when we get an error. --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the first 3 pages --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so it continues, it's able to lock the pages since we obviously don't hold the pages locked during writeback --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets eb->io_pages to 3, since only the first page does not have Uptodate bit set at this point --> writeback for the remaining page completes, we ended decrementing eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore never calling end_extent_buffer_writeback(), so EXTENT_BUFFER_WRITEBACK remains in the eb's flags --> of course, when the read bio completes, it doesn't and shouldn't call end_extent_buffer_writeback() --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of the eb finished writeback? or maybe make the read pages code wait for writeback of all pages of the eb to complete before checking which pages need to be read, touch ->io_pages, submit read bio, etc writeback bit never cleared means we can hang when aborting a transaction, at: btrfs_cleanup_one_transaction() btrfs_destroy_marked_extents() wait_on_extent_buffer_writeback() """ This is a problem because our writes are not synchronized with reads in any way. We clear the UPTODATE flag and then we can easily come in and try to read the EB while we're still waiting on other bio's to complete. We have two options here, we could lock all the pages, and then check to see if eb->io_pages != 0 to know if we've already got an outstanding write on the eb. Or we can simply check to see if we have WRITE_ERR set on this extent buffer. We set this bit _before_ we clear UPTODATE, so if the read gets triggered because we aren't UPTODATE because of a write error we're guaranteed to have WRITE_ERR set, and in this case we can simply return -EIO. This will fix the reported hang. Reported-by: Filipe Manana <[email protected]> Fixes: c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it") CC: [email protected] # 5.4+ Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-14btrfs: fix missing last dir item offset update when logging directoryFilipe Manana1-0/+1
When logging a directory, once we finish processing a leaf that is full of dir items, if we find the next leaf was not modified in the current transaction, we grab the first key of that next leaf and log it as to mark the end of a key range boundary. However we did not update the value of ctx->last_dir_item_offset, which tracks the offset of the last logged key. This can result in subsequent logging of the same directory in the current transaction to not realize that key was already logged, and then add it to the middle of a batch that starts with a lower key, resulting later in a leaf with one key that is duplicated and at non-consecutive slots. When that happens we get an error later when writing out the leaf, reporting that there is a pair of keys in wrong order. The report is something like the following: Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=118444032 slot=21, bad key order, prev (704687 84 4146773349) current (704687 84 1063561078) Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen 91449 total ptrs 39 free space 546 owner 18446744073709551610 Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835 itemsize 160 Dec 13 21:44:50 kernel: inode generation 35532 size 1026 mode 40755 Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff 3822 itemsize 13 Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667) itemoff 3736 itemsize 86 Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8 Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8 Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108) itemoff 3666 itemsize 54 Dec 13 21:44:50 kernel: dir oid 704793 type 1 Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632) itemoff 3599 itemsize 67 Dec 13 21:44:50 kernel: dir oid 707849 type 2 Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658) itemoff 3532 itemsize 67 Dec 13 21:44:50 kernel: dir oid 707901 type 2 Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382) itemoff 3471 itemsize 61 Dec 13 21:44:50 kernel: dir oid 704694 type 2 Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665) itemoff 3403 itemsize 68 Dec 13 21:44:50 kernel: dir oid 707841 type 1 Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418) itemoff 3331 itemsize 72 Dec 13 21:44:50 kernel: dir oid 2167736 type 1 Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078) itemoff 3278 itemsize 53 Dec 13 21:44:50 kernel: dir oid 704799 type 2 Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010) itemoff 3225 itemsize 53 Dec 13 21:44:50 kernel: dir oid 704696 type 1 Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574) itemoff 3173 itemsize 52 Dec 13 21:44:50 kernel: dir oid 704704 type 2 Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432) itemoff 3112 itemsize 61 Dec 13 21:44:50 kernel: dir oid 704738 type 1 Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190) itemoff 3046 itemsize 66 Dec 13 21:44:50 kernel: dir oid 2167729 type 1 Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192) itemoff 2986 itemsize 60 Dec 13 21:44:50 kernel: dir oid 704745 type 2 Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265) itemoff 2929 itemsize 57 Dec 13 21:44:50 kernel: dir oid 2167734 type 1 Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089) itemoff 2857 itemsize 72 Dec 13 21:44:50 kernel: dir oid 2167730 type 1 Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308) itemoff 2795 itemsize 62 Dec 13 21:44:50 kernel: dir oid 704688 type 1 Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349) itemoff 2727 itemsize 68 Dec 13 21:44:50 kernel: dir oid 707892 type 1 Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078) itemoff 2674 itemsize 53 Dec 13 21:44:50 kernel: dir oid 704799 type 2 Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612 itemsize 62 Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551 itemsize 61 Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498 itemsize 53 Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff 2446 itemsize 52 Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff 2385 itemsize 61 Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff 2325 itemsize 60 Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff 2271 itemsize 54 Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff 2218 itemsize 53 Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff 2150 itemsize 68 Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff 2083 itemsize 67 Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff 2015 itemsize 68 Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff 1948 itemsize 67 Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff 1882 itemsize 66 Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff 1810 itemsize 72 Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff 1753 itemsize 57 Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff 1681 itemsize 72 Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521 itemsize 160 Dec 13 21:44:50 kernel: inode generation 35534 size 30 mode 40755 Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032 write time tree block corruption detected So fix that by adding the missing update of ctx->last_dir_item_offset with the offset of the boundary key. Reported-by: Chris Murphy <[email protected]> Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/ Fixes: dc2872247ec0ca ("btrfs: keep track of the last logged keys when logging a directory") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-14btrfs: fix double free of anon_dev after failure to create subvolumeFilipe Manana1-0/+8
When creating a subvolume, at create_subvol(), we allocate an anonymous device and later call btrfs_get_new_fs_root(), which in turn just calls btrfs_get_root_ref(). There we call btrfs_init_fs_root() which assigns the anonymous device to the root, but if after that call there's an error, when we jump to 'fail' label, we call btrfs_put_root(), which frees the anonymous device and then returns an error that is propagated back to create_subvol(). Than create_subvol() frees the anonymous device again. When this happens, if the anonymous device was not reallocated after the first time it was freed with btrfs_put_root(), we get a kernel message like the following: (...) [13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure [13950.283027] ida_free called for id=65 which is not allocated. [13950.285974] BTRFS info (device dm-0): forced readonly (...) If the anonymous device gets reallocated by another btrfs filesystem or any other kernel subsystem, then bad things can happen. So fix this by setting the root's anonymous device to 0 at btrfs_get_root_ref(), before we call btrfs_put_root(), if an error happened. Fixes: 2dfb1e43f57dd3 ("btrfs: preallocate anon block device at first phase of snapshot creation") CC: [email protected] # 5.10+ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-14btrfs: fix memory leak in __add_inode_ref()Jianglei Nie1-0/+1
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(), but when the function returns in line 1184 (#4) victim_name allocated by line 1169 (#3) is not freed, which will lead to a memory leak. There is a similar snippet of code in this function as allocating a memory chunk for victim_name in line 1104 (#1) as well as releasing the memory in line 1116 (#2). We should kfree() victim_name when the return value of backref_in_log() is less than zero and before the function returns in line 1184 (#4). 1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans, 1058 struct btrfs_root *root, 1059 struct btrfs_path *path, 1060 struct btrfs_root *log_root, 1061 struct btrfs_inode *dir, 1062 struct btrfs_inode *inode, 1063 u64 inode_objectid, u64 parent_objectid, 1064 u64 ref_index, char *name, int namelen, 1065 int *search_done) 1066 { 1104 victim_name = kmalloc(victim_name_len, GFP_NOFS); // #1: kmalloc (victim_name-1) 1105 if (!victim_name) 1106 return -ENOMEM; 1112 ret = backref_in_log(log_root, &search_key, 1113 parent_objectid, victim_name, 1114 victim_name_len); 1115 if (ret < 0) { 1116 kfree(victim_name); // #2: kfree (victim_name-1) 1117 return ret; 1118 } else if (!ret) { 1169 victim_name = kmalloc(victim_name_len, GFP_NOFS); // #3: kmalloc (victim_name-2) 1170 if (!victim_name) 1171 return -ENOMEM; 1180 ret = backref_in_log(log_root, &search_key, 1181 parent_objectid, victim_name, 1182 victim_name_len); 1183 if (ret < 0) { 1184 return ret; // #4: missing kfree (victim_name-2) 1185 } else if (!ret) { 1241 return 0; 1242 } Fixes: d3316c8233bb ("btrfs: Properly handle backref_in_log retval") CC: [email protected] # 5.10+ Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Jianglei Nie <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-12-13fget: clarify and improve __fget_files() implementationLinus Torvalds1-16/+56
Commit 054aa8d439b9 ("fget: check that the fd still exists after getting a ref to it") fixed a race with getting a reference to a file just as it was being closed. It was a fairly minimal patch, and I didn't think re-checking the file pointer lookup would be a measurable overhead, since it was all right there and cached. But I was wrong, as pointed out by the kernel test robot. The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed quite noticeably. Admittedly it seems to be a very artificial test: doing "poll()" system calls on regular files in a very tight loop in multiple threads. That means that basically all the time is spent just looking up file descriptors without ever doing anything useful with them (not that doing 'poll()' on a regular file is useful to begin with). And as a result it shows the extra "re-check fd" cost as a sore thumb. Happily, the regression is fixable by just writing the code to loook up the fd to be better and clearer. There's still a cost to verify the file pointer, but now it's basically in the noise even for that benchmark that does nothing else - and the code is more understandable and has better comments too. [ Side note: this patch is also a classic case of one that looks very messy with the default greedy Myers diff - it's much more legible with either the patience of histogram diff algorithm ] Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/ Link: https://lore.kernel.org/lkml/[email protected]/ Reported-by: kernel test robot <[email protected]> Tested-by: Carel Si <[email protected]> Cc: Jann Horn <[email protected]> Cc: Miklos Szeredi <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-12-13io-wq: drop wqe lock before creating new workerJens Axboe1-0/+2
We have two io-wq creation paths: - On queue enqueue - When a worker goes to sleep The latter invokes worker creation with the wqe->lock held, but that can run into problems if we end up exiting and need to cancel the queued work. syzbot caught this: ============================================ WARNING: possible recursive locking detected 5.16.0-rc4-syzkaller #0 Not tainted -------------------------------------------- iou-wrk-6468/6471 is trying to acquire lock: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187 but task is already holding lock: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&wqe->lock); lock(&wqe->lock); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by iou-wrk-6468/6471: #0: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700 stack backtrace: CPU: 1 PID: 6471 Comm: iou-wrk-6468 Not tainted 5.16.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1dc/0x2d8 lib/dump_stack.c:106 print_deadlock_bug kernel/locking/lockdep.c:2956 [inline] check_deadlock kernel/locking/lockdep.c:2999 [inline] validate_chain+0x5984/0x8240 kernel/locking/lockdep.c:3788 __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027 lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5637 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154 io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187 io_wq_cancel_tw_create fs/io-wq.c:1220 [inline] io_queue_worker_create+0x3cf/0x4c0 fs/io-wq.c:372 io_wq_worker_sleeping+0xbe/0x140 fs/io-wq.c:701 sched_submit_work kernel/sched/core.c:6295 [inline] schedule+0x67/0x1f0 kernel/sched/core.c:6323 schedule_timeout+0xac/0x300 kernel/time/timer.c:1857 wait_woken+0xca/0x1b0 kernel/sched/wait.c:460 unix_msg_wait_data net/unix/unix_bpf.c:32 [inline] unix_bpf_recvmsg+0x7f9/0xe20 net/unix/unix_bpf.c:77 unix_stream_recvmsg+0x214/0x2c0 net/unix/af_unix.c:2832 sock_recvmsg_nosec net/socket.c:944 [inline] sock_recvmsg net/socket.c:962 [inline] sock_read_iter+0x3a7/0x4d0 net/socket.c:1035 call_read_iter include/linux/fs.h:2156 [inline] io_iter_do_read fs/io_uring.c:3501 [inline] io_read fs/io_uring.c:3558 [inline] io_issue_sqe+0x144c/0x9590 fs/io_uring.c:6671 io_wq_submit_work+0x2d8/0x790 fs/io_uring.c:6836 io_worker_handle_work+0x808/0xdd0 fs/io-wq.c:574 io_wqe_worker+0x395/0x870 fs/io-wq.c:630 ret_from_fork+0x1f/0x30 We can safely drop the lock before doing work creation, making the two contexts the same in that regard. Reported-by: [email protected] Fixes: 71a85387546e ("io-wq: check for wq exit after adding new worker task_work") Signed-off-by: Jens Axboe <[email protected]>
2021-12-11Merge tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-3/+11
Pull xfs fix from Darrick Wong: "This fixes a race between a readonly remount process and other processes that hold a file IOLOCK on files that previously experienced copy on write, that could result in severe filesystem corruption if the filesystem is then remounted rw. I think this is fairly rare (since the only reliable reproducer I have that fits the second criteria is the experimental xfs_scrub program), but the race is clear, so we still need to fix this. Summary: - Fix a data corruption vector that can result from the ro remount process failing to clear all speculative preallocations from files and the rw remount process not noticing the incomplete cleanup" * tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove all COW fork extents when remounting readonly
2021-12-11Merge tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-blockLinus Torvalds2-8/+27
Pull io_uring fixes from Jens Axboe: "A few fixes that are all bound for stable: - Two syzbot reports for io-wq that turned out to be separate fixes, but ultimately very closely related - io_uring task_work running on cancelations" * tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block: io-wq: check for wq exit after adding new worker task_work io_uring: ensure task_work gets run as part of cancelations io-wq: remove spurious bit clear on task_work addition
2021-12-10Merge tag 'for-5.16-rc4-tag' of ↵Linus Torvalds7-10/+35
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more regression fixes and stable patches, mostly one-liners. Regression fixes: - fix pointer/ERR_PTR mismatch returned from memdup_user - reset dedicated zoned mode relocation block group to avoid using it and filling it without any recourse Fixes: - handle a case to FITRIM range (also to make fstests/generic/260 work) - fix warning when extent buffer state and pages get out of sync after an IO error - fix transaction abort when syncing due to missing mapping error set on metadata inode after inlining a compressed file - fix transaction abort due to tree-log and zoned mode interacting in an unexpected way - fix memory leak of additional extent data when qgroup reservation fails - do proper handling of slot search call when deleting root refs" * tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling btrfs: zoned: clear data relocation bg on zone finish btrfs: free exchange changeset on failures btrfs: fix re-dirty process of tree-log nodes btrfs: call mapping_set_error() on btree inode with a write error btrfs: clear extent buffer uptodate when we fail to write it btrfs: fail if fstrim_range->start == U64_MAX btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
2021-12-10Merge tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-31/+36
Pull cifs fixes from Steve French: "Two cifs/smb3 fixes - one for stable, the other fixes a recently reported NTLMSSP auth problem" * tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix ntlmssp auth when there is no key exchange cifs: Fix crash on unload of cifs_arc4.ko
2021-12-10Merge tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds3-9/+15
Pull nfsd fixes from Bruce Fields: "Fix a race on startup and another in the delegation code. The latter has been around for years, but I suspect recent changes may have widened the race window a little, so I'd like to go ahead and get it in" * tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux: nfsd: fix use-after-free due to delegation race nfsd: Fix nsfd startup race (again)
2021-12-10Merge tag 'trace-v5.16-rc4' of ↵Linus Torvalds1-0/+76
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Tracing, ftrace and tracefs fixes: - Have tracefs honor the gid mount option - Have new files in tracefs inherit the parent ownership - Have direct_ops unregister when it has no more functions - Properly clean up the ops when unregistering multi direct ops - Add a sample module to test the multiple direct ops - Fix memory leak in error path of __create_synth_event()" * tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix possible memory leak in __create_synth_event() error path ftrace/samples: Add module to test multi direct modify interface ftrace: Add cleanup to unregister_ftrace_direct_multi ftrace: Use direct_ops hash in unregister_ftrace_direct tracefs: Set all files to the same group ownership as the mount option tracefs: Have new files inherit the ownership of their parent
2021-12-10Merge tag 'aio-poll-for-linus' of ↵Linus Torvalds2-45/+153
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull aio poll fixes from Eric Biggers: "Fix three bugs in aio poll, and one issue with POLLFREE more broadly: - aio poll didn't handle POLLFREE, causing a use-after-free. - aio poll could block while the file is ready. - aio poll called eventfd_signal() when it isn't allowed. - POLLFREE didn't handle multiple exclusive waiters correctly. This has been tested with the libaio test suite, as well as with test programs I wrote that reproduce the first two bugs. I am sending this pull request myself as no one seems to be maintaining this code" * tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: aio: Fix incorrect usage of eventfd_signal_allowed() aio: fix use-after-free due to missing POLLFREE handling aio: keep poll requests on waitqueue until completed signalfd: use wake_up_pollfree() binder: use wake_up_pollfree() wait: add wake_up_pollfree()