path: root/fs
Age | Commit message | Author | Files | Lines
2022-03-25 | Merge tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild | Linus Torvalds | 1 | -0/+1

Pull Kbuild update for C11 language base from Masahiro Yamada:
 "Kbuild -std=gnu11 updates for v5.18

  Linus pointed out the benefits of C99 some years ago, especially
  variable declarations in loops [1]. At that time, we were not ready
  for the migration due to old compilers.

  Recently, Jakob Koschel reported a bug in list_for_each_entry(),
  which leaks the invalid pointer out of the loop [2]. In the
  discussion, we agreed that the time had come. Now that GCC 5.1 is the
  minimum compiler version, there is nothing to prevent us from going
  to -std=gnu99, or even straight to -std=gnu11.

  Discussions for a better list iterator implementation are ongoing,
  but this patch set must land first"

[1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/

* tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS
  Kbuild: move to -std=gnu11
  Kbuild: use -Wdeclaration-after-statement
  Kbuild: add -Wno-shift-negative-value where -Wextra is used
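To illustrate the "variable declarations in loops" benefit referenced above, a minimal sketch (illustrative only; buf, len and total are placeholder names, not code from the patch set):

  /* Before gnu11, the iterator had to be declared at the top of the
   * enclosing block.  With -std=gnu11 it can be scoped to the loop: */
  for (size_t i = 0; i < len; i++)
          total += buf[i];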
2022-03-25 | [smb3] move more common protocol header definitions to smbfs_common | Steve French | 5 | -206/+112

We have duplicated definitions for various SMB3 PDUs in fs/ksmbd and
fs/cifs. Some had already been moved to fs/smbfs_common/smb2pdu.h

Move definitions for
 - error response
 - query info and various related protocol flags
 - various lease handling flags and the create lease context
to smbfs_common/smb2pdu.h to reduce code duplication

Reviewed-by: Namjae Jeon <[email protected]>
Signed-off-by: Steve French <[email protected]>
2022-03-25 | fs/iomap: Fix buffered write page prefaulting | Andreas Gruenbacher | 1 | -1/+1
When part of the user buffer passed to generic_perform_write() or iomap_file_buffered_write() cannot be faulted in for reading, the entire write currently fails. The correct behavior would be to write all the data that can be written, up to the point of failure. Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable") gave us the information needed, so fix the page prefaulting in generic_perform_write() and iomap_write_iter() to only bail out when no pages could be faulted in. We already factor in that pages that are faulted in may no longer be resident by the time they are accessed. Paging out pages has the same effect as not faulting in those pages in the first place, so the code can already deal with that. Signed-off-by: Andreas Gruenbacher <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
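A hedged sketch of the bail-out condition the fix describes, modeled on the commit text; the variable names (i, bytes, status) are assumptions, not a verbatim copy of generic_perform_write():

  /*
   * fault_in_iov_iter_readable() returns the number of bytes that could
   * not be faulted in.  Only give up when that is the whole range, i.e.
   * when no forward progress is possible at all.
   */
  if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
          status = -EFAULT;
          break;
  }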
2022-03-25 | io_uring: fix put_kbuf without proper locking | Pavel Begunkov | 1 | -1/+12
io_put_kbuf_comp() should only be called while holding ->completion_lock, however there is no such assumption in io_clean_op() and thus it can corrupt ->io_buffer_comp. Take the lock there, and workaround the only user of io_clean_op() calling it with locks. Not the prettiest solution, but it's easier to refactor it for-next. Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-25 | io_uring: fix invalid flags for io_put_kbuf() | Pavel Begunkov | 1 | -1/+3
io_req_complete_failed() doesn't require callers to hold ->uring_lock, use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected place is the fail path of io_apoll_task_func(). Also add a lockdep annotation to catch such bugs in the future. Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-25 | io_uring: improve req fields comments | Pavel Begunkov | 1 | -1/+2
Move a misplaced comment about req->creds and add a line with assumptions about req->link. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-03-25 | io_uring: enable EPOLLEXCLUSIVE for accept poll | Dylan Yudaken | 1 | -1/+4
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful when multiple accept SQEs are submitted. For O_NONBLOCK sockets multiple queued SQEs would previously have all completed at once, but most with -EAGAIN as the result. Now only one wakes up and completes. For sockets without O_NONBLOCK there is no user facing change, but internally the extra requests would previously be queued onto a worker thread as they would wake up with no connection waiting, and be punted. Now they do not wake up unnecessarily. Co-developed-by: Jens Axboe <[email protected]> Signed-off-by: Dylan Yudaken <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
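For comparison, this is the same wake-one semantic userspace already gets from epoll itself; a small illustration (not io_uring code; epfd and listen_fd are assumed to already exist):

  #include <sys/epoll.h>
  #include <stdio.h>

  /* Only one of the tasks blocked in epoll_wait() on epfd is woken per
   * incoming connection, instead of a thundering herd. */
  struct epoll_event ev = {
          .events  = EPOLLIN | EPOLLEXCLUSIVE,
          .data.fd = listen_fd,
  };
  if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
          perror("epoll_ctl");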
2022-03-24 | Merge tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client | Linus Torvalds | 16 | -368/+569

Pull ceph updates from Ilya Dryomov:
 "The highlights are:

  - several changes to how snap context and snap realms are tracked
    (Xiubo Li). In particular, this should resolve a long-standing
    issue of high kworker CPU usage and various stalls caused by
    needless iteration over all inodes in the snap realm.

  - async create fixes to address hangs in some edge cases (Jeff
    Layton)

  - support for getvxattr MDS op for querying server-side xattrs, such
    as file/directory layouts and ephemeral pins (Milind Changire)

  - average latency is now maintained for all metrics (Venky Shankar)

  - some tweaks around handling inline data to make it fit better with
    netfs helper library (David Howells)

  Also a couple of memory leaks got plugged along with a few assorted
  fixups. Last but not least, Xiubo has stepped up to serve as a CephFS
  co-maintainer"

* tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client: (27 commits)
  ceph: fix memory leak in ceph_readdir when note_last_dentry returns error
  ceph: uninitialized variable in debug output
  ceph: use tracked average r/w/m latencies to display metrics in debugfs
  ceph: include average/stdev r/w/m latency in mds metrics
  ceph: track average r/w/m latency
  ceph: use ktime_to_timespec64() rather than jiffies_to_timespec64()
  ceph: assign the ci only when the inode isn't NULL
  ceph: fix inode reference leakage in ceph_get_snapdir()
  ceph: misc fix for code style and logs
  ceph: allocate capsnap memory outside of ceph_queue_cap_snap()
  ceph: do not release the global snaprealm until unmounting
  ceph: remove incorrect and unused CEPH_INO_DOTDOT macro
  MAINTAINERS: add Xiubo Li as cephfs co-maintainer
  ceph: eliminate the recursion when rebuilding the snap context
  ceph: do not update snapshot context when there is no new snapshot
  ceph: zero the dir_entries memory when allocating it
  ceph: move to a dedicated slabcache for ceph_cap_snap
  ceph: add getvxattr op
  libceph: drop else branches in prepare_read_data{,_cont}
  ceph: fix comments mentioning i_mutex
  ...
2022-03-24 | Merge tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds | 26 | -211/+343

Pull xfs updates from Darrick Wong:
 "The biggest change this cycle is bringing XFS' inode attribute
  setting code back towards alignment with what the VFS does. IOWs,
  setgid bit handling should be a closer match with ext4 and btrfs
  behavior.

  The rest of the branch is bug fixes around the filesystem -- patching
  gaps in quota enforcement, removing bogus selinux audit messages, and
  fixing log corruption and problems with log recovery. There will be a
  second pull request later on in the merge window with more bug fixes.

  Dave Chinner will be taking over as XFS maintainer for one release
  cycle, starting from the day 5.18-rc1 drops until 5.19-rc1 is tagged
  so that I can focus on starting a massive design review for the
  (feature complete after five years) online repair feature.

  Summary:

   - Fix some incorrect mapping state being passed to iomap during COW

   - Don't create bogus selinux audit messages when deciding to degrade
     gracefully due to lack of privilege

   - Fix setattr implementation to use VFS helpers so that we drop
     setgid consistently with the other filesystems

   - Fix link/unlink/rename to check quota limits

   - Constify xfs_name_dotdot to prevent abuse of in-kernel symbols

   - Fix log livelock between the AIL and inodegc threads during
     recovery

   - Fix a log stall when the AIL races with pushers

   - Fix stalls in CIL flushes due to pinned inode cluster buffers
     during recovery

   - Fix log corruption due to incorrect usage of xfs_is_shutdown vs
     xlog_is_shutdown because during an induced fs shutdown, AIL
     writeback must continue until the log is shut down, even if the
     filesystem has already shut down"

* tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
  xfs: AIL should be log centric
  xfs: log items should have a xlog pointer, not a mount
  xfs: async CIL flushes need pending pushes to be made stable
  xfs: xfs_ail_push_all_sync() stalls when racing with updates
  xfs: check buffer pin state after locking in delwri_submit
  xfs: log worker needs to start before intent/unlink recovery
  xfs: constify xfs_name_dotdot
  xfs: constify the name argument to various directory functions
  xfs: reserve quota for target dir expansion when renaming files
  xfs: reserve quota for dir expansion when linking/unlinking files
  xfs: refactor user/group quota chown in xfs_setattr_nonsize
  xfs: use setattr_copy to set vfs inode attributes
  xfs: don't generate selinux audit messages for capability testing
  xfs: add missing cmap->br_state = XFS_EXT_NORM update
2022-03-24 | Merge tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm | Linus Torvalds | 1 | -1/+1

Pull DAX updates from Dan Williams:
 "Andrew has been shepherding major dax features that touch the core
  -mm through his tree, but I still collect the dax updates that are
  core-mm independent.

  - Fix a crash due to a missing rcu_barrier() in dax_fs_exit()

  - Fix two miscellaneous doc issues"

* tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix missing kdoc for dax_device
  dax: make sure inodes are flushed before destroy cache
  fsdax: fix function description
2022-03-24 | io_uring: improve task work cache utilization | Jens Axboe | 1 | -1/+5
While profiling task_work intensive workloads, I noticed that most of
the time in tctx_task_work() is spent stalled on loading 'req'. This is
one of the unfortunate side effects of using linked lists, particularly
when they end up being passed around.

Prefetch the next request, if there is one. There's a sufficient amount
of work in between that this makes it available for the next loop.

While fiddling with the cache layout, move the link outside of the hot
completion cacheline. It's rarely used in hot workloads, so better to
bring in kbuf which is used for networked loads with provided buffers.

This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my
testing.

Signed-off-by: Jens Axboe <[email protected]>
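A hedged sketch of the prefetch idea described above; the io_kiocb/io_task_work field names follow io_uring naming conventions but are assumptions here, not a copy of the actual patch:

  #include <linux/prefetch.h>

  /* Before processing the current node, start pulling the next
   * request's cacheline in so the load overlaps with useful work. */
  struct llist_node *next = node->next;

  if (next)
          prefetch(container_of(next, struct io_kiocb, io_task_work.node));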
2022-03-24 | gfs2: Make sure not to return short direct writes | Andreas Gruenbacher | 1 | -1/+1
When direct writes fail with -ENOTBLK because we're writing into a hole (gfs2_iomap_begin()) or because of a page invalidation failure (iomap_dio_rw()), we're falling back to buffered writes. In that case, when we lose the inode glock in gfs2_file_buffered_write(), we want to re-acquire it instead of returning a short write. Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-24 | gfs2: Remove dead code in gfs2_file_read_iter | Andreas Gruenbacher | 1 | -6/+3
Function iomap_dio_rw() only returns -ENOTBLK for write requests and gfs2_file_direct_read() no longer returns -ENOTBLK since commit 1d45bb7f9d2a5 ("gfs2: Use iomap for stuffed direct I/O reads"), so there is no need to check for -ENOTBLK in gfs2_file_read_iter() anymore. Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-24 | gfs2: Fix gfs2_file_buffered_write endless loop workaround | Andreas Gruenbacher | 1 | -0/+1
Since commit 554c577cee95b, gfs2_file_buffered_write() can accidentally return a truncated iov_iter, which might confuse callers. Fix that. Fixes: 554c577cee95b ("gfs2: Prevent endless loops in gfs2_file_buffered_write") Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-24 | io_uring: fix async accept on O_NONBLOCK sockets | Dylan Yudaken | 1 | -3/+0
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this causes the accept to immediately post a CQE with EAGAIN, which means you cannot perform an accept SQE on a NONBLOCK socket asynchronously. By removing the flag if there is no pending accept then poll is armed as usual and when a connection comes in the CQE is posted. Signed-off-by: Dylan Yudaken <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-03-24 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds | 5 | -33/+36

Merge more updates from Andrew Morton:
 "Various misc subsystems, before getting into the post-linux-next
  material.

  41 patches.

  Subsystems affected by this patch series: procfs, misc, core-kernel,
  lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump,
  taskstats, panic, kcov, resource, and ubsan"

* emailed patches from Andrew Morton <[email protected]>: (41 commits)
  Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang"
  kernel/resource: fix kfree() of bootmem memory again
  kcov: properly handle subsequent mmap calls
  kcov: split ioctl handling into locked and unlocked parts
  panic: move panic_print before kmsg dumpers
  panic: add option to dump all CPUs backtraces in panic_print
  docs: sysctl/kernel: add missing bit to panic_print
  taskstats: remove unneeded dead assignment
  kasan: no need to unset panic_on_warn in end_report()
  ubsan: no need to unset panic_on_warn in ubsan_epilogue()
  panic: unset panic_on_warn inside panic()
  docs: kdump: add scp example to write out the dump file
  docs: kdump: update description about sysfs file system support
  arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef
  kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible
  cgroup: use irqsave in cgroup_rstat_flush_locked().
  fat: use pointer to simple type in put_user()
  minix: fix bug when opening a file with O_DIRECT
  ...
2022-03-24 | Merge tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux | Linus Torvalds | 8 | -14/+14

Pull flexible-array transformations from Gustavo Silva:
 "Treewide patch that replaces zero-length arrays with flexible-array
  members. This has been baking in linux-next for a whole development
  cycle"

* tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  treewide: Replace zero-length arrays with flexible-array members
2022-03-24 | Merge tag 'fs.rt.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux | Linus Torvalds | 1 | -2/+18

Pull mount attributes PREEMPT_RT update from Christian Brauner:
 "This contains Sebastian's fix to make changing mount
  attributes/getting write access compatible with CONFIG_PREEMPT_RT.
  The change only applies when users explicitly opt-in to real-time via
  CONFIG_PREEMPT_RT otherwise things are exactly as before. We've
  waited quite a long time with this to make sure folks could take a
  good look"

* tag 'fs.rt.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.
2022-03-24 | Merge tag 'fs.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux | Linus Torvalds | 1 | -70/+78

Pull mount_setattr updates from Christian Brauner:
 "This contains a few more patches to massage the mount_setattr()
  codepaths and one minor fix to reuse a helper we added some time
  back.

  The final two patches do similar cleanups in different ways. One
  patch is mine and the other is Al's who was nice enough to give me a
  branch for it. Since his came in later and my branch had been sitting
  in -next for quite some time we just put his on top instead of swap
  them"

* tag 'fs.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  mount_setattr(): clean the control flow and calling conventions
  fs: clean up mount_setattr control flow
  fs: don't open-code mnt_hold_writers()
  fs: simplify check in mount_setattr_commit()
  fs: add mnt_allow_writers() and simplify mount_setattr_prepare()
2022-03-24 | btrfs: prevent subvol with swapfile from being deleted | Kaiwen Hu | 1 | -0/+22

A subvolume with an active swapfile must not be deleted otherwise it
would not be possible to deactivate it.

After the subvolume is deleted, we cannot swapoff the swapfile in this
deleted subvolume because the path is unreachable. The swapfile is
still active and holding references, so the filesystem cannot be
unmounted.

The test looks like this:

  mkfs.btrfs -f $dev > /dev/null
  mount $dev $mnt

  btrfs sub create $mnt/subvol
  touch $mnt/subvol/swapfile
  chmod 600 $mnt/subvol/swapfile
  chattr +C $mnt/subvol/swapfile
  dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
  mkswap $mnt/subvol/swapfile
  swapon $mnt/subvol/swapfile

  btrfs sub delete $mnt/subvol
  swapoff $mnt/subvol/swapfile   # failed: No such file or directory
  swapoff --all

  unmount $mnt                   # target is busy.

To prevent the above issue, we simply check whether the subvolume
contains any active swapfile, and stop the deleting process. This
behavior is like snapshot ioctl dealing with a swapfile.

CC: [email protected] # 5.4+
Reviewed-by: Robbie Ko <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Kaiwen Hu <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2022-03-24 | btrfs: do not warn for free space inode in cow_file_range | Josef Bacik | 1 | -1/+0
This is a long time leftover from when I originally added the free space inode, the point was to catch cases where we weren't honoring the NOCOW flag. However there exists a race with relocation, if we allocate our free space inode in a block group that is about to be relocated, we could trigger the COW path before the relocation has the opportunity to find the extents and delete the free space cache. In production where we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around 5k times in a 2 week period, so not super common but enough that it's at the top of our metrics. We're properly handling the error here, and with us phasing out v1 space cache anyway just drop the WARN_ON_ONCE. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-24 | btrfs: avoid defragging extents whose next extents are not targets | Qu Wenruo | 1 | -6/+14

[BUG]
There is a report that autodefrag is defragging a single sector, which
is a complete waste of IO, and no help for defragging:

  btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096

[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be
merged with next one (B). If mergeable, we will add range A into the
target for defrag.

However there is a catch for autodefrag: when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get
a higher chance to merge with the next extent.

But in the next iteration, range B will be looked up by
defrag_lookup_extent(), with non-zero @newer_than. And if range B is
not really newer, it will be rejected directly, causing only range A
being defragged, while we expect to defrag both range A and B.

[FIX]
Since the root cause is the difference in the check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:

1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()

This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.

While there is still some minor difference, the remaining checks focus
on runtime flags like writeback/delalloc, which are mostly transient
and safe to be checked only in defrag_collect_targets().

Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: [email protected] # 5.16+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2022-03-24 | btrfs: fix fallocate to use file_modified to update permissions consistently | Darrick J. Wong | 1 | -2/+11
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files. This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero range).

Because the call can be used to change file contents, we should treat
it like we do any other modification to a file -- update the mtime, and
drop set[ug]id privileges/capabilities. The VFS function file_modified()
does all this for us if we pass it a locked inode, so let's make
fallocate drop permissions correctly.

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: David Sterba <[email protected]>
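A hedged sketch of the pattern described: call the VFS helper with the inode lock held before modifying the file. The label and return handling are placeholders, not the actual btrfs diff:

  /* Called with the inode lock held, before any blocks are allocated:
   * updates the timestamps and strips set[ug]id bits, like a write would. */
  ret = file_modified(file);
  if (ret)
          goto out_unlock;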
2022-03-24 | btrfs: remove device item and update super block in the same transaction | Qu Wenruo | 1 | -37/+28

[BUG]
There is a report that a btrfs has a bad super block num devices.

This makes btrfs reject the fs completely.

  BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
  BTRFS error (device sdd3): failed to read chunk tree: -22
  BTRFS error (device sdd3): open_ctree failed

[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:

  btrfs_rm_device()
  |- btrfs_rm_dev_item(device)
  |  |- trans = btrfs_start_transaction()
  |  |  Now we got transaction X
  |  |
  |  |- btrfs_del_item()
  |  |  Now device item is removed from chunk tree
  |  |
  |  |- btrfs_commit_transaction()
  |     Transaction X got committed, super num devs untouched,
  |     but device item removed from chunk tree.
  |     (AKA, super num devs is already incorrect)
  |
  |- cur_devices->num_devices--;
  |- cur_devices->total_devices--;
  |- btrfs_set_super_num_devices()
     All those operations are not in transaction X, thus it will
     only be written back to disk in next transaction.

So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happens, then
we get the super num mismatch.

[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction inside btrfs_rm_device() and
pass it to btrfs_rm_dev_item(). And only commit the transaction after
everything is done.

Reported-by: Luca Béla Palkovics <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: [email protected] # 4.14+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2022-03-24 | NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error | Olga Kornievskaia | 1 | -0/+1
There is no reason to retry the operation if a session error had
occurred; in such a case the result structure isn't filled out.

Fixes: dff58530c4ca ("NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION")
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
2022-03-24 | NFS: replace usage of found with dedicated list iterator variable | Jakob Koschel | 1 | -7/+6
To move the list iterator variable into the list_for_each_entry_*()
macro in the future, use of the list iterator variable after the loop
body should be avoided. To *never* use the list iterator variable after
the loop, it was concluded to use a separate iterator variable instead
of a found boolean [1].

This removes the need for a found variable: simply checking whether the
dedicated variable was set determines if the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
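A generic sketch of the transformation being applied across the tree (illustrative only; the struct and helper names are made up, not NFS code):

  struct item *iter, *found = NULL;

  /* 'iter' is never used after the loop; the result lives in 'found'. */
  list_for_each_entry(iter, &head, list) {
          if (iter->id == id) {
                  found = iter;
                  break;
          }
  }
  if (!found)
          return -ENOENT;

  return process(found);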
2022-03-24 | io_uring: remove IORING_CQE_F_MSG | Jens Axboe | 1 | -2/+1
This was introduced with the message ring opcode, but isn't strictly required for the request itself. The sender can encode what is needed in user_data, which is passed to the receiver. It's unclear if having a separate flag that essentially says "This CQE did not originate from an SQE on this ring" provides any real utility to applications. While we can always re-introduce a flag to provide this information, we cannot take it away at a later point in time. Remove the flag while we still can, before it's in a released kernel. Signed-off-by: Jens Axboe <[email protected]>
2022-03-23 | fat: use pointer to simple type in put_user() | Helge Deller | 1 | -1/+1

The put_user(val,ptr) macro wants a pointer to a simple type, but in
fat_ioctl_filldir() the d_name field references an "array of chars". Be
more accurate and explicitly give the pointer to the first character of
the d_name[] array.

I noticed that issue while trying to optimize the parisc put_user()
macro and used an intermediate variable to store the pointer. In that
case I got this error:

  In file included from include/linux/uaccess.h:11,
                   from include/linux/compat.h:17,
                   from fs/fat/dir.c:18:
  fs/fat/dir.c: In function `fat_ioctl_filldir':
  fs/fat/dir.c:725:33: error: invalid initializer
    725 |   if (put_user(0, d2->d_name) || \
        |                   ^~
  include/asm/uaccess.h:152:33: note: in definition of macro `__put_user'
    152 |   __typeof__(ptr) __ptr = ptr; \
        |                   ^~~
  fs/fat/dir.c:759:1: note: in expansion of macro `FAT_IOCTL_FILLDIR_FUNC'
    759 | FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent)

Andreas Schwab <[email protected]> suggested to use
__typeof__(&*(ptr)) __ptr = ptr; instead. This works, but nevertheless
it's probably reasonable to fix the original caller too.

Link: https://lkml.kernel.org/r/Ygo+A9MREmC1H3kr@p100
Signed-off-by: Helge Deller <[email protected]>
Acked-by: OGAWA Hirofumi <[email protected]>
Cc: David Laight <[email protected]>
Cc: Andreas Schwab <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
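The shape of the fix, as a hedged sketch (the error handling is a placeholder; the real call sits inside the FAT_IOCTL_FILLDIR_FUNC macro):

  /* d_name is an array of chars; pass put_user() a pointer to its
   * first element rather than the array itself. */
  if (put_user(0, &d2->d_name[0]))
          return -EFAULT;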
2022-03-23 | minix: fix bug when opening a file with O_DIRECT | Qinghua Jin | 1 | -1/+2

Testcase:
1. create a minix file system and mount it
2. open a file on the file system with O_RDWR|O_CREAT|O_TRUNC|O_DIRECT
3. open fails with -EINVAL but leaves an empty file behind. All other
   open() failures don't leave the failed open files behind.

It is hard to check the direct_IO op before creating the inode. Just as
ext4 and btrfs do, this patch will resolve the issue by allowing to
create the file with O_DIRECT but returning error when writing the
file.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Qinghua Jin <[email protected]>
Reported-by: Colin Ian King <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2022-03-23 | fs/pipe.c: local vars have to match types of proper pipe_inode_info fields | Andrei Vagin | 1 | -2/+2
head, tail, ring_size are declared as unsigned int, so all local variables that operate with these fields have to be unsigned to avoid signed integer overflow. Right now, it isn't an issue because the maximum pipe size is limited by 1U<<31. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrei Vagin <[email protected]> Suggested-by: Dmitry Safonov <[email protected]> Acked-by: Christian Brauner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2022-03-23 | fs/pipe: use kvcalloc to allocate a pipe_buffer array | Andrei Vagin | 1 | -5/+4
Right now, kcalloc is used to allocate a pipe_buffer array. The size of
the pipe_buffer struct is 40 bytes. kcalloc allows allocating reliably
chunks with sizes less or equal to PAGE_ALLOC_COSTLY_ORDER (3). It means
that the maximum pipe size is 3.2MB in this case.

In CRIU, we use pipes to dump processes memory. CRIU freezes a target
process, injects a parasite code into it and then this code splices
memory into pipes. If a maximum pipe size is small, we need to do many
iterations or create many pipes.

kvcalloc attempts to allocate physically contiguous memory, but upon
failure, falls back to non-contiguous (vmalloc) allocation and so it
isn't limited by PAGE_ALLOC_COSTLY_ORDER.

The maximum pipe size for non-root users is limited by the
/proc/sys/fs/pipe-max-size sysctl that is 1MB by default, so only the
root user will be able to trigger vmalloc allocations.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrei Vagin <[email protected]>
Reviewed-by: Dmitry Safonov <[email protected]>
Cc: Alexander Viro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
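A hedged sketch of the allocation change described above (the GFP flags and surrounding code are assumptions, not the literal fs/pipe.c diff):

  struct pipe_buffer *bufs;

  /* kvcalloc() tries a physically contiguous allocation first and
   * transparently falls back to vmalloc for large arrays. */
  bufs = kvcalloc(nr_slots, sizeof(*bufs), GFP_KERNEL_ACCOUNT);
  if (!bufs)
          return -ENOMEM;

  /* ... use the buffer ring ... */

  kvfree(bufs);   /* works for both kmalloc- and vmalloc-backed memory */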
2022-03-23 | proc/vmcore: fix vmcore_alloc_buf() kernel-doc comment | Yang Li | 1 | -1/+1

Fix a spelling problem to remove warnings found by running
scripts/kernel-doc, which is caused by using 'make W=1'.

  fs/proc/vmcore.c:492: warning: Function parameter or member 'size' not described in 'vmcore_alloc_buf'
  fs/proc/vmcore.c:492: warning: Excess function parameter 'sizez' description in 'vmcore_alloc_buf'

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yang Li <[email protected]>
Reported-by: Abaci Robot <[email protected]>
Acked-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2022-03-23 | proc/vmcore: fix possible deadlock on concurrent mmap and read | David Hildenbrand | 1 | -19/+22

Lockdep noticed that there is chance for a deadlock if we have
concurrent mmap, concurrent read, and the addition/removal of a
callback.

As nicely explained by Boqun:
 "Lockdep warned about the above sequences because rw_semaphore is a
  fair read-write lock, and the following can cause a deadlock:

        TASK 1                  TASK 2                  TASK 3
        ======                  ======                  ======
        down_write(mmap_lock);
                                down_read(vmcore_cb_rwsem)
                                                        down_write(vmcore_cb_rwsem); // blocked
        down_read(vmcore_cb_rwsem); // cannot get the lock because of the fairness
                                down_read(mmap_lock); // blocked

  IOW, a reader can block another read if there is a writer queued by
  the second reader and the lock is fair"

To fix this, convert to srcu to make this deadlock impossible. We need
srcu as our callbacks can sleep. With this change, I cannot trigger any
lockdep warnings.

  ======================================================
  WARNING: possible circular locking dependency detected
  5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1 Not tainted
  ------------------------------------------------------
  makedumpfile/542 is trying to acquire lock:
  ffffffff832d2eb8 (vmcore_cb_rwsem){.+.+}-{3:3}, at: mmap_vmcore+0x340/0x580

  but task is already holding lock:
  ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&mm->mmap_lock#2){++++}-{3:3}:
         lock_acquire+0xc3/0x1a0
         __might_fault+0x4e/0x70
         _copy_to_user+0x1f/0x90
         __copy_oldmem_page+0x72/0xc0
         read_from_oldmem+0x77/0x1e0
         read_vmcore+0x2c2/0x310
         proc_reg_read+0x47/0xa0
         vfs_read+0x101/0x340
         __x64_sys_pread64+0x5d/0xa0
         do_syscall_64+0x43/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae

  -> #0 (vmcore_cb_rwsem){.+.+}-{3:3}:
         validate_chain+0x9f4/0x2670
         __lock_acquire+0x8f7/0xbc0
         lock_acquire+0xc3/0x1a0
         down_read+0x4a/0x140
         mmap_vmcore+0x340/0x580
         proc_reg_mmap+0x3e/0x90
         mmap_region+0x504/0x880
         do_mmap+0x38a/0x520
         vm_mmap_pgoff+0xc1/0x150
         ksys_mmap_pgoff+0x178/0x200
         do_syscall_64+0x43/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae

  other info that might help us debug this:

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_lock#2);
                                 lock(vmcore_cb_rwsem);
                                 lock(&mm->mmap_lock#2);
    lock(vmcore_cb_rwsem);

   *** DEADLOCK ***

  1 lock held by makedumpfile/542:
   #0: ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150

  stack backtrace:
  CPU: 0 PID: 542 Comm: makedumpfile Not tainted 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
   __lock_acquire+0x8f7/0xbc0
   lock_acquire+0xc3/0x1a0
   down_read+0x4a/0x140
   mmap_vmcore+0x340/0x580
   proc_reg_mmap+0x3e/0x90
   mmap_region+0x504/0x880
   do_mmap+0x38a/0x520
   vm_mmap_pgoff+0xc1/0x150
   ksys_mmap_pgoff+0x178/0x200
   do_syscall_64+0x43/0x90

Link: https://lkml.kernel.org/r/[email protected]
Fixes: cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: Baoquan He <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2022-03-23 | proc: alloc PATH_MAX bytes for /proc/${pid}/fd/ symlinks | Hao Lee | 1 | -4/+4
It's not a standard approach to use __get_free_page() to alloc a path
buffer directly. We'd better use kmalloc and PATH_MAX.

PAGE_SIZE is different on different archs. An unlinked file with a very
long canonical pathname will readlink differently because "(deleted)"
eats into the buffer. --adobriyan

[[email protected]: remove now-unneeded cast]

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Hao Lee <[email protected]>
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: James Morris <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
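A hedged sketch of the swap described above (the error handling and the readlink work itself are placeholders):

  char *pathname;

  /* A fixed PATH_MAX buffer behaves the same on every architecture,
   * unlike a PAGE_SIZE allocation from __get_free_page(). */
  pathname = kmalloc(PATH_MAX, GFP_KERNEL);
  if (!pathname)
          return -ENOMEM;

  /* ... build/copy the symlink target here ... */

  kfree(pathname);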
2022-03-23 | Merge tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic | Linus Torvalds | 1 | -6/+0

Pull asm-generic updates from Arnd Bergmann:
 "There are three sets of updates for 5.18 in the asm-generic tree:

  - The set_fs()/get_fs() infrastructure gets removed for good. This
    was already gone from all major architectures, but now we can
    finally remove it everywhere, which loses some particularly tricky
    and error-prone code. There is a small merge conflict against a
    parisc cleanup, the solution is to use their new version.

  - The nds32 architecture ends its tenure in the Linux kernel. The
    hardware is still used and the code is in reasonable shape, but the
    mainline port is not actively maintained any more, as all remaining
    users are thought to run vendor kernels that would never be updated
    to a future release.

  - A series from Masahiro Yamada cleans up some of the uapi header
    files to pass the compile-time checks"

* tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (27 commits)
  nds32: Remove the architecture
  uaccess: remove CONFIG_SET_FS
  ia64: remove CONFIG_SET_FS support
  sh: remove CONFIG_SET_FS support
  sparc64: remove CONFIG_SET_FS support
  lib/test_lockup: fix kernel pointer check for separate address spaces
  uaccess: generalize access_ok()
  uaccess: fix type mismatch warnings from access_ok()
  arm64: simplify access_ok()
  m68k: fix access_ok for coldfire
  MIPS: use simpler access_ok()
  MIPS: Handle address errors for accesses above CPU max virtual user address
  uaccess: add generic __{get,put}_kernel_nofault
  nios2: drop access_ok() check from __put_user()
  x86: use more conventional access_ok() definition
  x86: remove __range_not_ok()
  sparc64: add __{get,put}_kernel_nofault()
  nds32: fix access_ok() checks in get/put_user
  uaccess: fix nios2 and microblaze get_user_8()
  sparc64: fix building assembly files
  ...
2022-03-23 | io_uring: add flag for disabling provided buffer recycling | Jens Axboe | 1 | -0/+8
If we need to continue doing this IO, then we don't want a potentially selected buffer recycled. Add a flag for that. Set this for recv/recvmsg if they do partial IO. Signed-off-by: Jens Axboe <[email protected]>
2022-03-23 | io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly | Jens Axboe | 1 | -0/+28
We currently don't attempt to get the full asked for length even if MSG_WAITALL is set, if we get a partial receive. If we do see a partial receive, then just note how many bytes we did and return -EAGAIN to get it retried. The iov is advanced appropriately for the vector based case, and we manually bump the buffer and remainder for the non-vector case. Cc: [email protected] Reported-by: Constantine Gavrilov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
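A hedged sketch of the retry bookkeeping described above; the field names (e.g. done_io) and the length variables are illustrative assumptions rather than the exact io_uring structures:

  /* Partial receive with MSG_WAITALL set: remember how far we got and
   * ask to be retried instead of completing short. */
  if ((msg_flags & MSG_WAITALL) && ret > 0 && ret < (ssize_t)want) {
          sr->done_io += ret;   /* hypothetical progress counter */
          return -EAGAIN;       /* retried later to fetch the rest */
  }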
2022-03-23 | btrfs: fix qgroup reserve overflow the qgroup limit | Ethan Lien | 1 | -1/+1

We use extent_changeset->bytes_changed in qgroup_reserve_data() to
record how many bytes we set for EXTENT_QGROUP_RESERVED state.
Currently the bytes_changed is set as "unsigned int", and it will
overflow if we try to fallocate a range larger than 4GiB. The result is
we reserve less bytes and eventually break the qgroup limit.

Unlike regular buffered/direct write, where we use one changeset for
each ordered extent, which can never be larger than 256M, fallocate
uses one changeset for the whole range, thus it no longer respects the
256M per extent limit, and caused the problem.

The following example test script reproduces the problem:

  $ cat qgroup-overflow.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Set qgroup limit to 2GiB.
  btrfs quota enable $MNT
  btrfs qgroup limit 2G $MNT

  # Try to fallocate a 3GiB file. This should fail.
  echo
  echo "Try to fallocate a 3GiB file..."
  fallocate -l 3G $MNT/3G.file

  # Try to fallocate a 5GiB file.
  echo
  echo "Try to fallocate a 5GiB file..."
  fallocate -l 5G $MNT/5G.file

  # See we break the qgroup limit.
  echo
  sync
  btrfs qgroup show -r $MNT

  umount $MNT

When running the test:

  $ ./qgroup-overflow.sh
  (...)

  Try to fallocate a 3GiB file...
  fallocate: fallocate failed: Disk quota exceeded

  Try to fallocate a 5GiB file...

  qgroupid         rfer         excl     max_rfer
  --------         ----         ----     --------
  0/5           5.00GiB      5.00GiB      2.00GiB

Since we have no control of how bytes_changed is used, it's better to
set it to u64.

CC: [email protected] # 4.14+
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Ethan Lien <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
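The type change being described, as a hedged sketch of the structure (the other member shown is an assumption, not a copy of the btrfs header):

  struct extent_changeset {
          /* Was "unsigned int", which wraps past 4 GiB and under-reserves
           * qgroup space for large fallocate ranges. */
          u64 bytes_changed;

          struct ulist range_changed;
  };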
2022-03-23 | btrfs: zoned: remove left over ASSERT checking for single profile | Johannes Thumshirn | 1 | -4/+0
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") we started allowing DUP on metadata block groups, so the ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are no longer valid and in fact even harmful. Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups") CC: [email protected] # 5.17 Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2022-03-23 | btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone | Johannes Thumshirn | 1 | -4/+5

btrfs_can_activate_zone() can be called with the device_list_mutex
already held, which will lead to a deadlock:

  insert_dev_extents() // Takes device_list_mutex
  `-> insert_dev_extent()
   `-> btrfs_insert_empty_item()
    `-> btrfs_insert_empty_items()
     `-> btrfs_search_slot()
      `-> btrfs_cow_block()
       `-> __btrfs_cow_block()
        `-> btrfs_alloc_tree_block()
         `-> btrfs_reserve_extent()
          `-> find_free_extent()
           `-> find_free_extent_update_loop()
            `-> can_allocate_chunk()
             `-> btrfs_can_activate_zone() // Takes device_list_mutex again

Instead of using the RCU on fs_devices->device_list we can use
fs_devices->alloc_list, protected by the chunk_mutex to traverse the
list of active devices.

We are in the chunk allocation thread. The newer chunk allocation
happens from the devices in the fs_device->alloc_list protected by the
chunk_mutex.

  btrfs_create_chunk()
    lockdep_assert_held(&info->chunk_mutex);
    gather_device_info
      list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)

Also, a device that reappears after the mount won't join the alloc_list
yet and, it will be in the dev_list, which we don't want to consider in
the context of the chunk alloc.

  [15.166572] WARNING: possible recursive locking detected
  [15.167117] 5.17.0-rc6-dennis #79 Not tainted
  [15.167487] --------------------------------------------
  [15.167733] kworker/u8:3/146 is trying to acquire lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
  [15.167733]
  [15.167733] but task is already holding lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.167733]
  [15.167733] other info that might help us debug this:
  [15.167733]  Possible unsafe locking scenario:
  [15.167733]
  [15.171834]        CPU0
  [15.171834]        ----
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]
  [15.171834]  *** DEADLOCK ***
  [15.171834]
  [15.171834]  May be due to missing lock nesting notation
  [15.171834]
  [15.171834] 5 locks held by kworker/u8:3/146:
  [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
  [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
  [15.179641]
  [15.179641] stack backtrace:
  [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
  [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
  [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  [15.179641] Call Trace:
  [15.179641]  <TASK>
  [15.179641]  dump_stack_lvl+0x45/0x59
  [15.179641]  __lock_acquire.cold+0x217/0x2b2
  [15.179641]  lock_acquire+0xbf/0x2b0
  [15.179641]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  __mutex_lock+0x8e/0x970
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? lock_is_held_type+0xd7/0x130
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? _raw_spin_unlock+0x24/0x40
  [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
  [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
  [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
  [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
  [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
  [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
  [15.187601]  ? lock_is_held_type+0xd7/0x130
  [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
  [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
  [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
  [15.192037]  flush_space+0x374/0x600 [btrfs]
  [15.192037]  ? find_held_lock+0x2b/0x80
  [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
  [15.192037]  ? lock_release+0x131/0x2b0
  [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
  [15.192037]  process_one_work+0x24c/0x5a0
  [15.192037]  worker_thread+0x4a/0x3d0

Fixes: a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
CC: [email protected] # 5.16+
Reviewed-by: Anand Jain <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
2022-03-23 | cifs: fix incorrect use of list iterator after the loop | Xiaomeng Tong | 1 | -2/+4

The bug is here:

	if (!tcon) {
		resched = true;
		list_del_init(&ses->rlist);
		cifs_put_smb_ses(ses);

Because the list_for_each_entry() never exits early (without any
break/goto/return inside the loop), the iterator 'ses' after the loop
will always be a pointer to an invalid struct containing the HEAD
(&pserver->smb_ses_list). As a result, the uses of 'ses' above will
lead to an invalid memory access.

The original intention should have been to walk each entry 'ses' in
'&tmp_ses_list', delete '&ses->rlist' and put 'ses'. So fix it with a
list_for_each_entry_safe().

Cc: [email protected] # 5.17
Fixes: 3663c9045f51a ("cifs: check reconnects for channels of active tcons too")
Signed-off-by: Xiaomeng Tong <[email protected]>
Reviewed-by: Shyam Prasad N <[email protected]>
Signed-off-by: Steve French <[email protected]>
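A generic sketch of the safe-iteration pattern the fix switches to, for walking a list while deleting entries (the struct and helper names here are illustrative, not the cifs code):

  struct session *ses, *next;

  /* The _safe variant caches the next entry up front, so the current
   * one can be deleted and released inside the loop body. */
  list_for_each_entry_safe(ses, next, &tmp_ses_list, rlist) {
          list_del_init(&ses->rlist);
          put_session(ses);
  }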
2022-03-23 | ksmbd: store fids as opaque u64 integers | Paulo Alcantara | 2 | -72/+56
There is no need to store the fids as le64 integers as they are opaque to the client and only used for equality. Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Reviewed-by: Tom Talpey <[email protected]> Acked-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-03-23 | cifs: fix bad fids sent over wire | Paulo Alcantara | 4 | -53/+46
The client used to partially convert the fids to le64, while storing or sending them by using host endianness. This broke the client on big-endian machines. Instead of converting them to le64, store them as opaque integers and then avoid byteswapping when sending them over wire. Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Reviewed-by: Namjae Jeon <[email protected]> Reviewed-by: Tom Talpey <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-03-23 | cifs: change smb2_query_info_compound to use a cached fid, if available | Ronnie Sahlberg | 1 | -10/+35
This will reduce the number of Open/Close we send on the wire and
replace an Open/GetInfo/Close compound with just a simple GetInfo
request IF we have a cached handle for the object.

Signed-off-by: Ronnie Sahlberg <[email protected]>
Reviewed-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
2022-03-23 | cifs: convert the path to utf16 in smb2_query_info_compound | Ronnie Sahlberg | 2 | -12/+13
and not in the callers. Signed-off-by: Ronnie Sahlberg <[email protected]> Reviewed-by: Paulo Alcantara (SUSE) <[email protected]> Signed-off-by: Steve French <[email protected]>
2022-03-23 | gfs2: Minor retry logic cleanup | Andreas Gruenbacher | 1 | -18/+16
Clean up the retry logic in the read and write functions somewhat. Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-23 | gfs2: Disable page faults during lockless buffered reads | Andreas Gruenbacher | 1 | -1/+3
During lockless buffered reads, filemap_read() holds page cache page references while trying to copy data to the user-space buffer. The calling process isn't holding the inode glock, but the page references it holds prevent those pages from being removed from the page cache, and that prevents the underlying inode glock from being moved to another node. Thus, we can end up in the same kinds of distributed deadlock situations as with normal (non-lockless) buffered reads. Fix that by disabling page faults during lockless reads as well. Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-23 | gfs2: Fix should_fault_in_pages() logic | Andreas Gruenbacher | 1 | -8/+9

Fix the fault-in window size logic:

 * Use a maximum window size of 1 MiB instead of BIO_MAX_VECS *
   PAGE_SIZE. The previous window size was always one page because the
   pages variable was accidentally being defined and then redefined in
   should_fault_in_pages().

 * The nr_dirtied heuristic for guessing when there might be memory
   pressure often results in very small window sizes. Don't let
   nr_dirtied drop below 8 pages (as btrfs does).

 * Compute the window size in units of bytes, not pages.

 * Account for page overlap (unaligned iterators).

Signed-off-by: Andreas Gruenbacher <[email protected]>
2022-03-23 | fs: do not pass __GFP_HIGHMEM to bio_alloc in do_mpage_readpage | Christoph Hellwig | 1 | -4/+2
The mpage bio alloc cleanup accidentally removed clearing ~GFP_KERNEL
bits from the mask passed to bio_alloc. Fix this up in a slightly less
obfuscated way that mirrors what iomap does in its readpage code.

Fixes: 77c436de01c0 ("mpage: pass the operation to bio_alloc")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Tested-by: Ryusuke Konishi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
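A hedged sketch of the kind of constraint being restored, assuming the mapping_gfp_constraint() helper; this illustrates the idea rather than the exact do_mpage_readpage diff:

  /* The page cache mapping may restrict which GFP bits are acceptable
   * for the bio allocation (e.g. no highmem); apply that constraint. */
  gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);

  bio = bio_alloc(bdev, nr_pages, opf, gfp);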
2022-03-23 | io_uring: don't recycle provided buffer if punted to async worker | Jens Axboe | 1 | -3/+8
We only really need to recycle the buffer when going async for a file
type that has an indefinite response time (eg non-file/bdev). And for
files that need to arm poll, the async worker will arm poll anyway and
the buffer will get recycled there.

In that latter case, we're not holding ctx->uring_lock. Ensure we take
the issue_flags into account and acquire it if we need to.

Fixes: b1c62645758e ("io_uring: recycle provided buffers if request goes async")
Reported-by: Stefan Roesch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>