path: root/fs
2021-08-23  btrfs: avoid unnecessary log mutex contention when syncing log  (Filipe Manana, 1 file, -4/+10)

When syncing the log we acquire the root's log mutex just to update the root's last_log_commit. This is unnecessary because:

1) At this point there can only be one task updating this value, which is the task committing the current log transaction. Any task that enters btrfs_sync_log() has to wait for the previous log transaction to commit, and wait for the current log transaction to commit if someone else already started it (in this case it never reaches the point of updating last_log_commit, as that is done by the committing task);

2) All readers of the root's last_log_commit don't acquire the root's log mutex. This avoids blocking the readers, potentially for too long, and getting a stale value of last_log_commit does not cause any functional problem; in the worst case a stale value results in logging an inode unnecessarily. It's actually very rare for a stale value to have that effect.

So in order to avoid unnecessary contention on the root's log mutex, which is used for several different purposes, like starting/joining a log transaction and starting writeback of a log transaction, stop acquiring the log mutex for updating the root's last_log_commit.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
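As a rough sketch (not the exact kernel diff; field accessors are simplified), the change amounts to dropping the lock/unlock pair around a store that has exactly one possible writer:

  /* Before: serializing a store that only the committing task performs. */
  mutex_lock(&root->log_mutex);
  root->last_log_commit = root->log_transid;
  mutex_unlock(&root->log_mutex);

  /* After: plain store; readers already tolerate a stale value. */
  root->last_log_commit = root->log_transid;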
2021-08-23  btrfs: remove racy and unnecessary inode transaction update when using no-holes  (Filipe Manana, 1 file, -7/+5)

When using the NO_HOLES feature and expanding the size of an inode, we update the inode's last_trans, last_sub_trans and last_log_commit fields at maybe_insert_hole() so that an fsync knows the inode needs to be logged (by making sure that btrfs_inode_in_log() returns false). This happens for expanding truncate operations, buffered writes, direct IO writes and when cloning extents to an offset greater than the inode's i_size.

However the way we do it is racy, because in between setting the inode's last_sub_trans and last_log_commit fields, the log transaction ID that was assigned to last_sub_trans might be committed before we read the root's last_log_commit and assign that value to last_log_commit. If that happens it would make a future call to btrfs_inode_in_log() return true. This race should be extremely unlikely to hit in practice, and it is the same one described by commit bc0939fcfab0d7 ("btrfs: fix race between marking inode needs to be logged and log syncing").

The fix would simply be to set last_log_commit to the value we assigned to last_sub_trans minus 1, as was done in that commit. However updating these two fields plus the last_trans field is pointless here, because all callers of btrfs_cont_expand() (the only caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans() or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either of those guarantees that the next fsync will log the inode, as it makes btrfs_inode_in_log() return false.

So just remove the code that explicitly sets the inode's last_trans, last_sub_trans and last_log_commit fields.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool  (Filipe Manana, 2 files, -17/+5)

In commit 351cbf6e4410e7 ("btrfs: use nofs allocations for running delayed items") we wrapped all btree updates when running delayed items with memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion detected by lockdep involving reclaim and the mutex of delayed nodes.

The problem is that the ref verify tool does some memory allocations with GFP_KERNEL, which can trigger reclaim, and reclaim can trigger inode eviction, which requires locking the mutex of an inode's delayed node. On the other hand the ref verify tool is called when allocating metadata extents as part of operations that modify a btree, which is a problem when running delayed nodes, where we do btree updates while holding the mutex of a delayed node. This is what caused the lockdep warning.

Instead of wrapping every btree update when running delayed nodes, change the ref verify tool to never do GFP_KERNEL allocations, because:

1) We get less repeated code, which at the moment does not even have a comment mentioning why we need to set up the NOFS context, which is a recommended good practice as mentioned at Documentation/core-api/gfp_mask-from-fs-io.rst;

2) The ref verify tool is meant only for debugging and not something that should be enabled on non-debug / non-development kernels;

3) We may have yet more places outside delayed-inode.c with a similar problem: doing btree updates while holding some lock, and then having the GFP_KERNEL memory allocations from the ref verify tool trigger reclaim that tries to acquire the same lock through the reclaim path. Or we could get more such cases in the future, so this change prevents getting into similar situations when using the ref verify tool.

Curiously most of the memory allocations done by the ref verify tool were already using GFP_NOFS, except a few for no apparent reason.

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
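For illustration, here is the NOFS-context wrapper pattern the delayed items code needed, versus allocating with GFP_NOFS directly at the call site as this commit does (a sketch; names other than the allocator APIs are placeholders):

  /* Wrapper approach: force NOFS for everything in the section. */
  unsigned int nofs_flag = memalloc_nofs_save();
  ret = run_btree_update(root);          /* placeholder for the btree work */
  memalloc_nofs_restore(nofs_flag);

  /* Call-site approach used here: the ref verify tool simply does */
  ref = kmalloc(sizeof(*ref), GFP_NOFS);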
2021-08-23  btrfs: improve the batch insertion of delayed items  (Filipe Manana, 1 file, -133/+79)

When we insert the delayed items of an inode, which correspond to the directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we do the following:

1) Pick the first delayed item from the rbtree and insert it into the fs/subvolume btree, using btrfs_insert_empty_item() for that;

2) Without releasing the path returned by btrfs_insert_empty_item(), keep collecting as many consecutive delayed items from the rbtree as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the immediate successor of the previously picked item and as long as they fit in the available space of the leaf the path points to;

3) Then insert all the collected items into the leaf;

4) Release the reserved metadata space for each collected item and release each item (which implies deleting it from the rbtree);

5) Unlock the path.

While this is much better than inserting items one by one, it can be improved in a few aspects:

1) Instead of adding items based on the remaining free space of the leaf, collect as many items as can fit in a leaf and bulk insert them. This results in fewer and larger batches, reducing the total amount of time needed to insert the delayed items. For example, when adding 100K files to a directory, we ended up creating 1658 batches with very variable sizes, ranging from 1 item to 118 items, on a filesystem with a node/leaf size of 16K. After this change, we end up with 839 batches, the vast majority of them having exactly 120 items;

2) We do the search for more items to batch, by iterating the rbtree, while holding a write lock on the leaf;

3) While still holding the leaf locked, we release the reserved metadata for each item and then delete each item, keeping a write lock on the leaf for longer than necessary. Releasing the delayed items one by one can take a significant amount of time, because deleting them from the rbtree can often be slow when the deletion results in rebalancing the rbtree.

So change this so that we try to create larger batches, with a total item size up to the maximum a leaf can support, and unlock the leaf immediately after inserting the items, releasing the reserved metadata space of each item and releasing each item without holding the write lock on the leaf.

The following script that runs fs_mark was used to test this change:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"
  FILES=1000000
  THREADS=16
  FILE_SIZE=0

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

It was run on a machine with 12 cores, 64G of RAM, a NVMe device and a non-debug kernel config (Debian's default config).

Results before this change:

  FSUse%      Count     Size  Files/sec         App Overhead
       1   16000000        0    76182.1             72223046
       3   32000000        0    62746.9             80776528
       5   48000000        0    77029.0             93022381
       6   64000000        0    73691.6             95251075
       8   80000000        0    66288.0             85089634

Results after this change:

  FSUse%      Count     Size  Files/sec         App Overhead
       1   16000000        0    79049.5 (+3.7%)     69700824
       3   32000000        0    65248.9 (+3.9%)     80583693
       5   48000000        0    77991.4 (+1.2%)     90040908
       6   64000000        0    75096.8 (+1.9%)     89862241
       8   80000000        0    66926.8 (+1.0%)     84429169

Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
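A sketch of the batching idea (simplified, with hypothetical helper names; only struct btrfs_item and BTRFS_LEAF_DATA_SIZE() are the real btrfs symbols): keep adding consecutive delayed items while their total size still fits in an empty leaf, and only then take the leaf lock and bulk insert.

  u32 total = first->data_len + sizeof(struct btrfs_item);

  while ((next = next_delayed_item(curr)) != NULL) {
          u32 next_size = next->data_len + sizeof(struct btrfs_item);

          /* Stop at a key gap or when the batch would overflow a leaf. */
          if (!keys_consecutive(curr, next) ||
              total + next_size > BTRFS_LEAF_DATA_SIZE(fs_info))
                  break;
          add_to_batch(next);
          total += next_size;
          curr = next;
  }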
2021-08-23  btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items  (Qu Wenruo, 1 file, -0/+19)

When the extent tree gets corrupted, normally it's not the extent tree root but one toasted tree leaf/node. In that case, the rescue=ibadroots mount option won't help, as it can only handle corruption of the extent tree root. This patch enhances the behavior by:

- Allowing fill_dummy_bgs() to ignore -EEXIST errors. This means we may have some block group items read from disk, but then hit some error halfway through;

- Falling back to fill_dummy_bgs() if any error gets hit in btrfs_read_block_groups(). Of course, this still needs the rescue=ibadroots mount option.

With that, rescue=ibadroots can handle extent tree corruption more gracefully and allow a better chance of recovery.

Reported-by: Zhenyu Wu <[email protected]>
Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html
Reviewed-by: Su Yue <[email protected]>
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
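For reference, the mount invocation this applies to looks like the following (the device path is an example):

  mount -o rescue=ibadroots /dev/sdX /mnt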
2021-08-23  btrfs: pass NULL as trans to btrfs_search_slot if we only want to search  (Marcos Paulo de Souza, 2 files, -2/+2)

Using a transaction in btrfs_search_slot is only useful when we are searching in order to add or modify the tree. When the function is used purely for searching, the insert length and mod arguments are 0 and there is no need to use a transaction.

No functional changes, changed for consistency.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Signed-off-by: David Sterba <[email protected]>
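A read-only lookup then looks like this (a sketch; error handling and path release omitted):

  struct btrfs_path *path = btrfs_alloc_path();

  /* No transaction, ins_len == 0, cow == 0: a pure search. */
  ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);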
2021-08-23  btrfs: continue readahead of siblings even if target node is in memory  (Filipe Manana, 1 file, -5/+8)

At reada_for_search(), when attempting to readahead a node or leaf's siblings, we skip the readahead of the siblings if the node/leaf is already in memory. That is probably fine for the READA_FORWARD and READA_BACK readahead types, as they are used in contexts where we end up reading some consecutive leaves, but usually not the whole btree.

However for the READA_FORWARD_ALWAYS mode, currently only used for full send operations, it does not make sense to skip the readahead if the target node or leaf is already loaded in memory, since we know the caller is visiting every node and leaf of the btree in ascending order. So change the behaviour to not skip the readahead when the target node is already in memory and the readahead mode is READA_FORWARD_ALWAYS.

The following test script was used to measure the improvement on a box with an average, consumer grade, spinning disk, 32GiB of RAM and a non-debug kernel config (Debian's default config):

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  file_count=2000000
  add_files $file_count 0 4

  echo
  echo "Creating snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

The duration of the full send operations, in seconds, was the following:

  Before this change:  85 seconds
  After this change:   76 seconds (-11.2%)

Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: check-integrity: drop kmap/kunmap for block pages  (David Sterba, 1 file, -8/+3)
The pages in block_ctx have never been allocated from highmem (in btrfsic_read_block) so the mapping is pointless and can be removed. Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: compression: drop kmap/kunmap from generic helpers  (David Sterba, 2 files, -4/+2)
The pages in compressed_pages are not from highmem anymore so we can drop the mapping for checksum calculation and inline extent. Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: compression: drop kmap/kunmap from zstd  (David Sterba, 1 file, -18/+9)
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <[email protected]>
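The same pattern applies to the zlib and lzo changes below; as a sketch (the buffer name is illustrative, taken from the zstd workspace):

  /* Before: map the page in case it came from highmem. */
  workspace->out_buf.dst = kmap(out_page);
  /* ... compress into the buffer ... */
  kunmap(out_page);

  /* After: lowmem pages always have a direct kernel address. */
  workspace->out_buf.dst = page_address(out_page);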
2021-08-23  btrfs: compression: drop kmap/kunmap from zlib  (David Sterba, 1 file, -25/+11)
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: compression: drop kmap/kunmap from lzo  (David Sterba, 1 file, -28/+10)
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is simply page_address and kunmap is a no-op. Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: drop __GFP_HIGHMEM from all allocations  (David Sterba, 5 files, -15/+14)

The highmem flag is used for allocating pages for compression and for raid56 pages. The high memory makes sense on 32bit systems but is not without problems; on 64bit systems it's just another layer of wrappers.

The time the pages are allocated for compression or raid56 is relatively short (about a transaction commit), so the pages are not blocked indefinitely. As the number of pages depends on the amount of data being written/read, there's a theoretical problem: a fast device on a 32bit system could use most of the low memory pool, while with the highmem allocation that would not happen. This was possibly the original idea a long time ago, but nowadays we optimize for 64bit systems.

This patch removes all usage of the __GFP_HIGHMEM flag for page allocation; the kmap/kunmap calls are still in place and will be removed in followup patches. What remains is masking out the bit in alloc_extent_state and __lookup_free_space_inode, which can safely stay.

Signed-off-by: David Sterba <[email protected]>
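The resulting change at each allocation site is mechanical, for example:

  /* Before */
  page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);

  /* After */
  page = alloc_page(GFP_NOFS);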
2021-08-23  btrfs: cleanup fs_devices pointer usage in btrfs_trim_fs  (Anand Jain, 1 file, -5/+5)

Drop the variable 'devices' (used only once) and add a new variable for the fs_devices, so that it is used at two locations within the btrfs_trim_fs() function and also helps to access fs_devices->devices.

Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: remove max argument from generic_bin_search  (Marcos Paulo de Souza, 1 file, -11/+7)

Both callers use btrfs_header_nritems to feed the max argument. Remove the argument and let generic_bin_search compute it itself.

Reviewed-by: Nikolay Borisov <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
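A sketch of the interface change (argument lists abbreviated; the exact signature is assumed):

  /* Before: every caller passed the item count. */
  generic_bin_search(eb, ..., key, btrfs_header_nritems(eb), slot);

  /* After: the function fetches the item count itself. */
  generic_bin_search(eb, ..., key, slot);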
2021-08-23  btrfs: make btrfs_finish_chunk_alloc private to block-group.c  (Nikolay Borisov, 3 files, -96/+91)
One of the final things that must be done to add a new chunk is inserting its device extent items in the device tree. They describe the portion of allocated device physical space during phase 1 of chunk allocation. This is currently done in btrfs_finish_chunk_alloc whose name isn't very informative. What's more, this function is only used in block-group.c but is defined as public. There isn't anything special about it that would warrant it being defined in volumes.c. Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to block-group.c, make the former static and rename both functions to insert_dev_extents and insert_dev_extent respectively. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: check-integrity: drop unnecessary function prototypes  (Anand Jain, 1 file, -49/+0)
The function prototypes below aren't necessary as the functions are first defined before called. Remove them. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23  btrfs: add special case to setget helpers for 64k pages  (David Sterba, 1 file, -4/+4)

On 64K pages the size of the extent_buffer::pages array is 1, and compilation with -Warray-bounds warns due to

  kaddr = page_address(eb->pages[idx + 1]);

when reading a byte range crossing a page boundary. This never actually overflows the array, because on 64K pages all the data fits in one page and bounds are checked by check_setget_bounds. To fix the reported overflows and warnings, add a compile-time condition that allows the compiler to eliminate the dead code that reads from the idx + 1 page.

Link: https://lore.kernel.org/lkml/[email protected]/
CC: Gustavo A. R. Silva <[email protected]>
Signed-off-by: David Sterba <[email protected]>
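A sketch of the shape of the fix (heavily simplified; only INLINE_EXTENT_BUFFER_PAGES is the real btrfs constant): when the array has a single page, the branch that would index pages[idx + 1] becomes provably dead and the compiler can drop it.

  if (INLINE_EXTENT_BUFFER_PAGES == 1) {
          /* All data fits in one page; pages[idx + 1] cannot exist. */
          kaddr = page_address(eb->pages[idx]);
  } else {
          kaddr = page_address(eb->pages[idx + 1]);
  }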
2021-08-23  btrfs: zoned: remove max_zone_append_size logic  (Johannes Thumshirn, 4 files, -24/+0)
There used to be a patch in the original series for zoned support which limited the extent size to max_zone_append_size, but this patch has been dropped somewhere around v9. We've decided to go the opposite direction, instead of limiting extents in the first place we split them before submission to comply with the device's limits. Remove the related code, btrfs_fs_info::max_zone_append_size and btrfs_zoned_device_info::max_zone_append_size. This also removes the workaround for dm-crypt introduced in 1d68128c107a ("btrfs: zoned: fail mount if the device does not support zone append") because the fix has been merged as f34ee1dce642 ("dm crypt: Fix zoned block device support"). Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2021-08-23  fs: remove mandatory file locking support  (Jeff Layton, 15 files, -240/+15)

We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it off in Fedora and RHEL8. Several other distros have followed suit.

I've heard of one problem in all that time: someone migrated from an older distro that supported "-o mand" to one that didn't, and the host had an fstab entry with "mand" in it which broke on reboot. They didn't actually _use_ mandatory locking, so they just removed the mount option and moved on.

This patch rips out mandatory locking support wholesale from the kernel, along with the Kconfig option and the Documentation file. It also changes the mount code to ignore the "mand" mount option instead of erroring out, and to throw a big, ugly warning.

Signed-off-by: Jeff Layton <[email protected]>
2021-08-23  fs: simplify get_filesystem_list / get_all_fs_names  (Christoph Hellwig, 1 file, -10/+17)

Just output the '\0'-separated list of supported file systems for block devices directly, rather than going through a pointless round of string manipulation.

Based on an earlier patch from Al Viro <[email protected]>.

Vivek: Modified list_bdev_fs_names() and split_fs_names() to return the number of null-terminated strings to the caller. Callers now use that information to loop through all the strings instead of relying on one extra null char being present at the end.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Vivek Goyal <[email protected]>
Signed-off-by: Al Viro <[email protected]>
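A sketch of how a caller walks the '\0'-separated list once the count is known (the helper name follows the commit; its exact signature and the per-name action are assumptions):

  int count = list_bdev_fs_names(fs_names, size);
  char *p = fs_names;
  int i;

  for (i = 0; i < count; i++, p += strlen(p) + 1)
          try_fs_type(p);   /* hypothetical per-name action */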
2021-08-21  fcntl: fix potential deadlock for &fasync_struct.fa_lock  (Desmond Cheong Zhi Xi, 1 file, -2/+3)

There is an existing lock hierarchy of

  &dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock

from the following call chain:

  input_inject_event():
    spin_lock_irqsave(&dev->event_lock, ...);
    input_handle_event():
      input_pass_values():
        input_to_handler():
          evdev_events():
            evdev_pass_values():
              spin_lock(&client->buffer_lock);
              __pass_event():
                kill_fasync():
                  kill_fasync_rcu():
                    read_lock(&fa->fa_lock);
                    send_sigio():
                      read_lock_irqsave(&fown->lock, ...);

&dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled while grabbing &fasync_struct.fa_lock, otherwise we invert the lock hierarchy. However, since kill_fasync, which calls kill_fasync_rcu, is an exported symbol, it may not necessarily be called with interrupts disabled.

As kill_fasync_rcu may be called with interrupts disabled (for example, in the call chain above), we replace calls to read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu with read_lock_irqsave/read_unlock_irqrestore.

Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
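The replacement pattern in kill_fasync_rcu is the standard irq-saving reader lock (a sketch; the work under the lock is elided):

  unsigned long flags;

  read_lock_irqsave(&fa->fa_lock, flags);
  /* ... send_sigio() etc. under the lock ... */
  read_unlock_irqrestore(&fa->fa_lock, flags);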
2021-08-21  fcntl: fix potential deadlocks for &fown_struct.lock  (Desmond Cheong Zhi Xi, 1 file, -6/+7)

Syzbot reports a potential deadlock in do_fcntl:

  ========================================================
  WARNING: possible irq lock inversion dependency detected
  5.12.0-syzkaller #0 Not tainted
  --------------------------------------------------------
  syz-executor132/8391 just changed the state of lock:
  ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: f_getown_ex fs/fcntl.c:211 [inline]
  ffff888015967bf8 (&f->f_owner.lock){.+..}-{2:2}, at: do_fcntl+0x8b4/0x1200 fs/fcntl.c:395
  but this lock was taken by another, HARDIRQ-safe lock in the past:
   (&dev->event_lock){-...}-{2:2}

  and interrupts could create inverse lock ordering between them.

  other info that might help us debug this:
   Chain exists of:
     &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock

   Possible interrupt unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&f->f_owner.lock);
                                 local_irq_disable();
                                 lock(&dev->event_lock);
                                 lock(&new->fa_lock);
    <Interrupt>
      lock(&dev->event_lock);

   *** DEADLOCK ***

This happens because there is a lock hierarchy of

  &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock

from the following call chain:

  input_inject_event():
    spin_lock_irqsave(&dev->event_lock, ...);
    input_handle_event():
      input_pass_values():
        input_to_handler():
          evdev_events():
            evdev_pass_values():
              spin_lock(&client->buffer_lock);
              __pass_event():
                kill_fasync():
                  kill_fasync_rcu():
                    read_lock(&fa->fa_lock);
                    send_sigio():
                      read_lock_irqsave(&fown->lock, ...);

However, since &dev->event_lock is HARDIRQ-safe, interrupts have to be disabled while grabbing &f->f_owner.lock, otherwise we invert the lock hierarchy. Hence, we replace calls to read_lock/read_unlock on &f->f_owner.lock with read_lock_irq/read_unlock_irq.

Reported-and-tested-by: [email protected]
Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
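Here the affected functions run in process context with interrupts enabled, so the non-saving variants suffice (a sketch):

  read_lock_irq(&filp->f_owner.lock);
  /* ... f_getown_ex() etc. under the lock ... */
  read_unlock_irq(&filp->f_owner.lock);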
2021-08-21  Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux  (Linus Torvalds, 1 file, -1/+5)

Pull mandatory file locking deprecation warning from Jeff Layton:

 "As discussed on the list, this patch just adds a new warning for folks who still have mandatory locking enabled and actually mount with '-o mand'. I'd like to get this in for v5.14 so we can push this out into stable kernels and hopefully reach folks who have mounts with -o mand.

  For now, I'm operating under the assumption that we'll fully remove this support in v5.15, but we can move that out if any legitimate users of this facility speak up between now and then"

* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fs: warn about impending deprecation of mandatory locks
2021-08-21  lockd: lockd server-side shouldn't set fl_ops  (J. Bruce Fields, 1 file, -18/+12)

Locks have two sets of op arrays: fl_lmops for the lock manager (lockd or nfsd), and fl_ops for the filesystem. The server-side lockd code has been setting its own fl_ops, which leads to confusion (and crashes) in the reexport case, where the filesystem expects to be the only one setting fl_ops. And there's no reason for it that I can see: the lm_get/put_owner ops do the same job.

Reported-by: Daire Byrne <[email protected]>
Tested-by: Daire Byrne <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
2021-08-21  Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -6/+10)

Pull io_uring fixes from Jens Axboe:

 "A few small fixes that should go into this release:

  - Fix never re-assigning an initial error value for io_uring_enter() for SQPOLL, if asked to do nothing

  - Fix xa_alloc_cycle() return value checking, for cases where we have wrapped around

  - Fix for a ctx pin issue introduced in this cycle (Pavel)"

* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
  io_uring: fix xa_alloc_cycle() error return value check
  io_uring: pin ctx on fallback execution
  io_uring: only assign io_uring_enter() SQPOLL error in actual error case
2021-08-21  ksmbd: fix permission check issue on chown and chmod  (Namjae Jeon, 2 files, -6/+23)

When chmod and chown are issued against ksmbd from a cifs client, ksmbd allows them without checking file permissions. There is code to check this in setattr_prepare. Instead of setting the inode directly, update the mode and uid/gid through notify_change.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
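A sketch of the notify_change-based path for a chmod-style update (field values illustrative; assumes the 5.14-era signature, which takes a user namespace as the first argument):

  struct iattr attr = {
          .ia_valid = ATTR_MODE,
          .ia_mode  = mode,
  };
  int ret;

  inode_lock(inode);
  /* notify_change runs setattr_prepare, enforcing permission checks. */
  ret = notify_change(&init_user_ns, dentry, &attr, NULL);
  inode_unlock(inode);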
2021-08-21  fs: warn about impending deprecation of mandatory locks  (Jeff Layton, 1 file, -1/+5)
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros have disabled it. Warn the stragglers that still use "-o mand" that we'll be dropping support for that mount option. Cc: [email protected] Signed-off-by: Jeff Layton <[email protected]>
2021-08-20  io_uring: fix xa_alloc_cycle() error return value check  (Jens Axboe, 1 file, -4/+5)
We currently check for ret != 0 to indicate error, but '1' is a valid return and just indicates that the allocation succeeded with a wrap. Correct the check to be for < 0, like it was before the xarray conversion. Cc: [email protected] Fixes: 61cf93700fe6 ("io_uring: Convert personality_idr to XArray") Signed-off-by: Jens Axboe <[email protected]>
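The corrected check, sketched with illustrative arguments (the xarray API calls are real; the io_uring field names are taken from the personality registration path and assumed here):

  ret = xa_alloc_cyclic(&ctx->personalities, &id, creds,
                        XA_LIMIT(0, USHRT_MAX), &ctx->pers_next,
                        GFP_KERNEL);
  if (ret < 0)    /* 0 and 1 are both success; 1 just means a wrap */
          return ret;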
2021-08-20  xfs: fix perag structure refcounting error when scrub fails  (Darrick J. Wong, 4 files, -4/+7)

The kernel test robot found the following bug when running xfs/355 to scrub a bmap btree:

  XFS: Assertion failed: !sa->pag, file: fs/xfs/scrub/common.c, line: 412
  ------------[ cut here ]------------
  kernel BUG at fs/xfs/xfs_message.c:110!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 2 PID: 1415 Comm: xfs_scrub Not tainted 5.14.0-rc4-00021-g48c6615cc557 #1
  Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
  RIP: 0010:assfail+0x23/0x28 [xfs]
  RSP: 0018:ffffc9000aacb890 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: ffffc9000aacbcc8 RCX: 0000000000000000
  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc09e7dcd
  RBP: ffffc9000aacbc80 R08: ffff8881fdf17d50 R09: 0000000000000000
  R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
  R13: ffff88820c7ed000 R14: 0000000000000001 R15: ffffc9000aacb980
  FS:  00007f185b955700(0000) GS:ffff8881fdf00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f7f6ef43000 CR3: 000000020de38002 CR4: 00000000001706e0
  Call Trace:
   xchk_ag_read_headers+0xda/0x100 [xfs]
   xchk_ag_init+0x15/0x40 [xfs]
   xchk_btree_check_block_owner+0x76/0x180 [xfs]
   xchk_btree_get_block+0xd0/0x140 [xfs]
   xchk_btree+0x32e/0x440 [xfs]
   xchk_bmap_btree+0xd4/0x140 [xfs]
   xchk_bmap+0x1eb/0x3c0 [xfs]
   xfs_scrub_metadata+0x227/0x4c0 [xfs]
   xfs_ioc_scrub_metadata+0x50/0xc0 [xfs]
   xfs_file_ioctl+0x90c/0xc40 [xfs]
   __x64_sys_ioctl+0x83/0xc0
   do_syscall_64+0x3b/0xc0

The unusual handling of errors while initializing struct xchk_ag is the root cause here. Since the beginning of xfs_scrub, the goal of xchk_ag_read_headers has been to read all three AG header buffers and attach them both to the xchk_ag structure and the scrub transaction. Corruption errors on any of the three headers don't necessarily trigger an immediate return to userspace, because xfs_scrub can also tell us to /fix/ the problem. In other words, it's possible for the xchk_ag init functions to return an error code and a partially filled out structure so that scrub can use however much information it managed to pull.

Before 5.15, it was sufficient to cancel (or commit) the scrub transaction on the way out of the scrub code to release the buffers. Commit 48c6615cc557 added a reference to the perag structure to struct xchk_ag. Since perag structures are not attached to transactions like buffers are, this adds the requirement that the perag ref be released explicitly. The scrub teardown function xchk_teardown was amended to do this for the xchk_ag embedded in struct xfs_scrub.

Unfortunately, I forgot that certain parts of the scrub code probe multiple AGs and therefore handle the initialization and cleanup on their own. Specifically, the bmbt scrubber will initialize it long enough to cross-reference AG metadata for btree blocks and for the extent mappings in the bmbt. If one of the AG headers is corrupt, the init function returns with a live perag structure reference and some of the AG header buffers. If an error occurs, the cross referencing will be noted as XCORRUPTion and skipped, but the main scrub process will move on to the next record. It is now necessary to release the perag reference before we try to analyze something from a different AG, or else we'll trip over the assertion noted above.

Fixes: 48c6615cc557 ("xfs: grab active perag ref when reading AG headers")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
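The teardown rule the fix enforces, sketched (xfs_perag_put is the real helper; placement within the scrubber is assumed):

  /* Release the perag reference before probing a different AG. */
  if (sa->pag) {
          xfs_perag_put(sa->pag);
          sa->pag = NULL;
  }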
2021-08-20  erofs: support reading chunk-based uncompressed files  (Gao Xiang, 3 files, -11/+102)
Add runtime support for chunk-based uncompressed files described in the previous patch. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Liu Bo <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
2021-08-20  erofs: introduce chunk-based file on-disk format  (Gao Xiang, 1 file, -2/+45)
Currently, uncompressed data except for tail-packing inline is consecutive on disk. In order to support chunk-based data deduplication, add a new corresponding inode data layout. In the future, the data source of chunks can be either (un)compressed. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Liu Bo <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]>
2021-08-20  gfs2: Remove redundant check from gfs2_glock_dq  (Bob Peterson, 1 file, -6/+5)
In function gfs2_glock_dq, it checks to see if this is the fast path. Before this patch, it checked both "find_first_holder(gl) == NULL" and list_empty(&gl->gl_holders), which is redundant. If gl_holders is empty then find_first_holder must return NULL. This patch removes the redundancy. Signed-off-by: Bob Peterson <[email protected]>
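A sketch of the simplification (surrounding code elided): since list_empty() implies find_first_holder() == NULL, the conjunction collapses to the list_empty() check alone.

  /* Before: both checks, the first implied by the second. */
  if (find_first_holder(gl) == NULL && list_empty(&gl->gl_holders))
          fast_path = 1;

  /* After: list_empty() alone is sufficient. */
  if (list_empty(&gl->gl_holders))
          fast_path = 1;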
2021-08-20  gfs2: Delay withdraw from atomic context  (Bob Peterson, 1 file, -1/+1)

Before this patch, if function __gfs2_ail_flush detected an error syncing the ail list, it called gfs2_ail_error, which called gfs2_withdraw. Since __gfs2_ail_flush deals with a specific glock, we shouldn't withdraw immediately, because the withdraw code (signal_our_withdraw) uses glocks in its processing. This patch changes the call from gfs2_withdraw to gfs2_withdraw_delayed, which defers the withdraw until a more appropriate context, such as the logd daemon, discovers the intent to withdraw.

Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: Don't call dlm after protocol is unmounted  (Bob Peterson, 1 file, -0/+5)

In the gfs2 withdraw sequence, the dlm protocol is unmounted with a call to lm_unmount. After a withdraw, users are allowed to unmount the withdrawn file system. But at that point we may still have glocks left over that we need to free via unmount's call to gfs2_gl_hash_clear. These glocks may have never been completed because of whatever problem caused the withdraw (IO errors or whatever).

Before this patch, function gdlm_put_lock would still try to call into dlm to unlock these leftover glocks, which resulted in dlm returning -EINVAL because the lock space was abandoned. These glocks were never freed because there was no mechanism after that to free them.

This patch adds a check to gdlm_put_lock to see if the locking protocol is inactive (the DFL_UNMOUNT flag) and, if so, frees the glock rather than making the invalid call into dlm. I could have combined this "if" with the one that follows, related to leftover glock LVBs, but I felt the code was more readable with its own if clause.

Signed-off-by: Bob Peterson <[email protected]>
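A sketch of the added early exit in gdlm_put_lock (flag and helper names as described above; exact placement assumed):

  /* Lockspace already abandoned: dlm would only return -EINVAL. */
  if (test_bit(DFL_UNMOUNT, &ls->ls_recover_flags)) {
          gfs2_glock_free(gl);
          return;
  }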
2021-08-20  gfs2: don't stop reads while withdraw in progress  (Bob Peterson, 2 files, -4/+8)

When gfs2 withdraws a file system, it calls signal_our_withdraw, which triggers another node to replay the withdrawing node's journal. Then it waits until it knows the journal has been replayed. Part of this wait is to repeatedly call check_journal_clean, which calls gfs2_jdesc_check, which checks to see if the journal is sane. As part of its sanity checks it needs to re-read its journal's metadata. But with today's code, any attempt to re-read the metadata results in -EIO because of a check for the file system withdraw in function gfs2_meta_wait.

This patch adds an additional check for SDF_WITHDRAW_IN_PROG, to tell if the read is being done while the withdraw is in progress. In that case we allow the metadata read to not be rejected. Therefore the metadata check is done properly, and the withdraw sequence can finish normally.

Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: Mark journal inodes as "don't cache"  (Bob Peterson, 2 files, -0/+2)
Before this patch, journal inodes were considered regular inodes, which meant that instead of evicting them, function iput_final would just put them on the lru for later processing. If the file system withdrew for whatever reason, the withdraw would never be seen until the inode was evicted, which could be indefinitely. This patch marks all journal inodes as "don't cache" which means function iput_final will evict them immediately, allowing us to properly recover the journal on other cluster nodes. Signed-off-by: Bob Peterson <[email protected]>
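The marking itself is essentially a one-liner at journal-inode lookup time (a sketch; d_mark_dontcache is the generic VFS helper for this purpose, and jd->jd_inode is the gfs2 journal descriptor's inode):

  d_mark_dontcache(jd->jd_inode);  /* evict on final iput instead of LRU-ing */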
2021-08-20  gfs2: nit: gfs2_drop_inode shouldn't return bool  (Bob Peterson, 1 file, -1/+1)

Today, gfs2_drop_inode returns "false" from a function declared to return int. I'm sure this was just an oversight. Change it to return an int value.

Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: Eliminate vestigial HIF_FIRST  (Bob Peterson, 2 files, -3/+0)
Holder flag HIF_FIRST is no longer used or needed, so remove it. Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: Make recovery error more readable  (Bob Peterson, 1 file, -1/+1)
Before this patch, withdraws could cause an error that looked like: Journal recovery skipped for 0 until next mount. This patch changes it to a more readable: Journal recovery skipped for jid 0 until next mount. Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: Don't release and reacquire local statfs bh  (Bob Peterson, 5 files, -41/+25)

Before this patch, several functions in gfs2 related to updating the statfs file used a newly acquired/read buffer_head for the local statfs file. This is completely unnecessary, because other nodes should never update it; recreating the buffer is a waste of time. This patch allows gfs2 to read in the local statfs buffer_head at mount time and keep it around until unmount time.

Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  gfs2: init system threads before freeze lock  (Bob Peterson, 2 files, -55/+48)

Patch 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions") changed the gfs2 mount sequence so that it holds the freeze lock before calling gfs2_make_fs_rw. Before this patch, gfs2_make_fs_rw called init_threads to initialize the quotad and logd threads. That is a problem if the system needs to withdraw due to IO errors early in the mount sequence, for example while initializing the system statfs inode:

1. An IO error causes the statfs glock to not sync properly after recovery, and leaves items on the ail list.

2. The leftover items on the ail list cause its do_xmote call to fail, which makes it want to withdraw. But since the glock code cannot withdraw (because the withdraw sequence uses glocks), it relies upon the logd daemon to initiate the withdraw.

3. The withdraw can never be performed by the logd daemon, because all this takes place before the logd daemon is started.

This patch moves function init_threads from super.c to ops_fstype.c, and it changes gfs2_fill_super to start its threads before holding the freeze lock and, if there's an error, stop its threads after releasing it. This allows logd to run unblocked by the freeze lock. Thus, the logd daemon can perform its withdraw sequence properly.

Fixes: 96b1454f2e8e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions")
Signed-off-by: Bob Peterson <[email protected]>
2021-08-20  ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation  (Arnd Bergmann, 1 file, -3/+2)

The epoll_wait() system call wrapper is one of the remaining users of the set_fs() infrastructure for Arm. Changing it to not require set_fs() is rather complex, unfortunately.

The approach I'm taking here is to allow architectures to override the code that copies the output to user space, and let the oabi-compat implementation check whether it is getting called from an EABI or OABI system call based on the thread_info->syscall value. The in_oabi_syscall() check here mirrors the in_compat_syscall() and in_x32_syscall() helpers for 32-bit compat implementations on other architectures.

Overall, the amount of code goes down, at least with the newly added sys_oabi_epoll_pwait() helper getting removed again. The downside is added complexity in the source code for the native implementation. There should be no difference in runtime performance except for Arm kernels with CONFIG_OABI_COMPAT enabled, which now have to go through an external function call to check which of the two variants to use.

Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Russell King (Oracle) <[email protected]>
2021-08-20  ksmbd: don't set FILE_DELETE and FILE_DELETE_CHILD in access mask by default  (Namjae Jeon, 1 file, -2/+0)

When there is no DACL in a request, ksmbd sends a DACL converted from the file permissions. This patch stops setting FILE_DELETE and FILE_DELETE_CHILD in the access mask by default.

Signed-off-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
2021-08-19  gfs2: tiny cleanup in gfs2_log_reserve  (Bob Peterson, 1 file, -1/+1)
Function gfs2_log_reserve was setting revoke_blks to 0. There's no need because it calculates it shortly thereafter. This patch removes the unnecessary set. Signed-off-by: Bob Peterson <[email protected]>
2021-08-19  gfs2: trivial clean up of gfs2_ail_error  (Bob Peterson, 1 file, -4/+6)

This patch does not change functionality. It adds a variable, sdp, to clean up function gfs2_ail_error and make it more readable.

Signed-off-by: Bob Peterson <[email protected]>
2021-08-19  gfs2: be more verbose replaying invalid rgrp blocks  (Bob Peterson, 1 file, -15/+29)

This patch adds some crucial information when journal replay detects a replay of an obsolete rgrp block. For example, it wasn't printing the journal id or the generation number being replayed. This just supplements what is logged in this unusual case. The function that actually complains about the replaying of an obsolete rgrp block has been split off to avoid long lines and sparse warnings.

Signed-off-by: Bob Peterson <[email protected]>
2021-08-19  xfs: rename buffer cache index variable b_bn  (Dave Chinner, 2 files, -25/+12)
To stop external users from using b_bn as the disk address of the buffer, rename it to b_rhash_key to indicate that it is the buffer cache index, not the block number of the buffer. Code that needs the disk address should use xfs_buf_daddr() to obtain it. Do the rename and clean up any of the remaining internal b_bn users. Also clean up any remaining b_bn cruft that is now unused. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-19  xfs: convert bp->b_bn references to xfs_buf_daddr()  (Dave Chinner, 21 files, -61/+58)
Stop directly referencing b_bn in code outside the buffer cache, as b_bn is supposed to be used only as an internal cache index. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
2021-08-19  xfs: introduce xfs_buf_daddr()  (Dave Chinner, 15 files, -22/+26)
Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro. This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
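Consistent with the description, the helper is essentially a one-line accessor over the first buffer map (a sketch):

  static inline xfs_daddr_t
  xfs_buf_daddr(struct xfs_buf *bp)
  {
          return bp->b_maps[0].bm_bn;
  }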