aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-08-28Merge tag 'filelock-v6.6' of ↵Linus Torvalds1-5/+22
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: - new functionality for F_OFD_GETLK: requesting a type of F_UNLCK will find info about whatever lock happens to be first in the given range, regardless of type. - an OFD lock selftest - bugfix involving a UAF in a tracepoint - comment typo fix * tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock fs/locks: Fix typo selftests: add OFD lock tests fs/locks: F_UNLCK extension for F_OFD_GETLK
2023-08-28Merge tag 'v6.6-fs.proc.uapi' of ↵Linus Torvalds2-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs fixes from Christian Brauner: "Mode changes to files under /proc/<pid>/ aren't supported ever since commit 6d76fa58b050 ("Don't allow chmod() on the /proc/<pid>/ files"). Due to an oversight in commit 1b3044e39a89 ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, mode changes on /proc/thread-self/comm were accidently allowed. Similar, mode changes for all files beneath /proc/<pid>/net/ are blocked but mode changes on /proc/<pid>/net itself were accidently allowed. Both issues come down to not using the generic proc_setattr() helper which blocks all mode changes. This is rectified with this pull request. This also removes a strange nolibc test that abused /proc/<pid>/net for testing mode changes. Using procfs for this test never made a lot of sense given procfs has special semantics for almost everything anway. Both changes are minor user-visible changes. It is however very unlikely that mode changes on proc/<pid>/net and /proc/thread-self/comm are something that userspace relies on" * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: procfs: block chmod on /proc/thread-self/comm proc: use generic setattr() for /proc/$PID/net selftests/nolibc: drop test chmod_net
2023-08-28Merge tag 'v6.6-vfs.autofs' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull autofs fixes from Christian Brauner: "This fixes a memory leak in autofs reported by syzkaller and a missing conversion from uninterruptible to interruptible wake up when autofs is in catatonic mode" * tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: autofs: use wake_up() instead of wake_up_interruptible(() autofs: fix memory leak of waitqueues in autofs_catatonic_mode
2023-08-28Merge tag 'v6.6-vfs.fchmodat2' of ↵Linus Torvalds1-4/+19
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fchmodat2 system call from Christian Brauner: "This adds the fchmodat2() system call. It is a revised version of the fchmodat() system call, adding a missing flag argument. Support for both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included. Adding this system call revision has been a longstanding request but so far has always fallen through the cracks. While the kernel implementation of fchmodat() does not have a flag argument the libc provided POSIX-compliant fchmodat(3) version does. Both glibc and musl have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW (see [1] and [2]). The workaround is brittle because it relies not just on O_PATH and O_NOFOLLOW semantics and procfs magic links but also on our rather inconsistent symlink semantics. This gives userspace a proper fchmodat2() system call that libcs can use to properly implement fchmodat(3) and allows them to get rid of their hacks. In this case it will immediately benefit them as the current workaround is already defunct because of aformentioned inconsistencies. In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use AT_EMPTY_PATH with fchmodat2(). This is already possible with fchownat() so there's no reason to not also support it for fchmodat2(). The implementation is simple and comes with selftests. Implementation of the system call and wiring up the system call are done as separate patches even though they could arguably be one patch. But in case there are merge conflicts from other system call additions it can be beneficial to have separate patches" Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2] * tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: fchmodat2: remove duplicate unneeded defines fchmodat2: add support for AT_EMPTY_PATH selftests: Add fchmodat2 selftest arch: Register fchmodat2, usually as syscall 452 fs: Add fchmodat2() Non-functional cleanup of a "__user * filename"
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds25-520/+935
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds31-198/+240
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-28Merge tag 'v6.6-vfs.tmpfs' of ↵Linus Torvalds6-53/+344
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull libfs and tmpfs updates from Christian Brauner: "This cycle saw a lot of work for tmpfs that required changes to the vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this cycle. Things will go back to mm next cycle. Features ======== - By far the biggest work is the quota support for tmpfs. New tmpfs quota infrastructure is added to support it and a new QFMT_SHMEM uapi option is exposed. This offers user and group quotas to tmpfs (project quotas will be added later). Similar to other filesystems tmpfs quota are not supported within user namespaces yet. - Add support for user xattrs. While tmpfs already supports security xattrs (security.*) and POSIX ACLs for a long time it lacked support for user xattrs (user.*). With this pull request tmpfs will be able to support a limited number of user xattrs. This is accompanied by a fix (see below) to limit persistent simple xattr allocations. - Add support for stable directory offsets. Currently tmpfs relies on the libfs provided cursor-based mechanism for readdir. This causes issues when a tmpfs filesystem is exported via NFS. NFS clients do not open directories. Instead, each server-side readdir operation opens the directory, reads it, and then closes it. Since the cursor state for that directory is associated with the opened file it is discarded after each readdir operation. Such directory offsets are not just cached by NFS clients but also various userspace libraries based on these clients. As it stands there is no way to invalidate the caches when directory offsets have changed and the whole application depends on unchanging directory offsets. At LSFMM we discussed how to solve this problem and decided to support stable directory offsets. libfs now allows filesystems like tmpfs to use an xarrary to map a directory offset to a dentry. This mechanism is currently only used by tmpfs but can be supported by others as well. Fixes ===== - Change persistent simple xattrs allocations in libfs from GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory cgroup limits. Since this is a change to libfs it affects both tmpfs and kernfs. - Correctly verify {g,u}id mount options. A new filesystem context is created via fsopen() which records the namespace that becomes the owning namespace of the superblock when fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are mountable in namespaces. However, fsconfig() calls can occur in a namespace different from the namespace where fsopen() has been called. Currently, when fsconfig() is called to set {g,u}id mount options the requested {g,u}id is mapped into a k{g,u}id according to the namespace where fsconfig() was called from. The resulting k{g,u}id is not guaranteed to be resolvable in the namespace of the filesystem (the one that fsopen() was called in). This means it's possible for an unprivileged user to create files owned by any group in a tmpfs mount since it's possible to set the setid bits on the tmpfs directory. The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, tmpfs has been doing the correct thing. But since tmpfs is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock to avoid such bugs. The new mount api's cross-namespace delegation abilities are already widely used. Having talked to a bunch of userspace this is the most faithful solution with minimal regression risks" * tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs mm: invalidation check mapping before folio_contains tmpfs: trivial support for direct IO tmpfs,xattr: enable limited user extended attributes tmpfs: track free_ispace instead of free_inodes xattr: simple_xattr_set() return old_xattr to be freed tmpfs: verify {g,u}id mount options correctly shmem: move spinlock into shmem_recalc_inode() to fix quota support libfs: Remove parent dentry locking in offset_iterate_dir() libfs: Add a lock class for the offset map's xa_lock shmem: stable directory offsets shmem: Refactor shmem_symlink() libfs: Add directory operations for stable offsets shmem: fix quota lock nesting in huge hole handling shmem: Add default quota limit mount options shmem: quota support shmem: prepare shmem quota infrastructure quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks shmem: make shmem_get_inode() return ERR_PTR instead of NULL shmem: make shmem_inode_acct_block() return error
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds233-977/+1107
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-28Merge tag 'v6.6-vfs.fs_context' of ↵Linus Torvalds3-67/+104
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mount API updates from Christian Brauner: "This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to implement something like $ mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist instead of silently reusing an existing superblock. Without it, in the sequence $ move-mount -f xfs -o source=/dev/sda4 /A $ move-mount -f xfs -o noacl,source=/dev/sda4 /B the initial mounter will create a superblock. The second mounter will reuse the existing superblock, creating a bind-mount (see [1] for the source of the move-mount binary). The problem is that reusing an existing superblock means all mount options other than read-only and read-write will be silently ignored even if they are incompatible requests. For example, the second mount has requested no POSIX ACL support but since the existing superblock is reused POSIX ACL support will remain enabled. Such silent superblock reuse can easily become a security issue. After adding support for FSCONFIG_CMD_CREATE_EXCL to mount(8) in util-linux this can be fixed: $ move-mount -f xfs --exclusive -o source=/dev/sda4 /A $ move-mount -f xfs --exclusive -o noacl,source=/dev/sda4 /B Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api" Link: https://github.com/brauner/move-mount-beneath [1] Link: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Link: https://lore.kernel.org/lkml/20230824-anzog-allheilmittel-e8c63e429a79@brauner/ * tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add FSCONFIG_CMD_CREATE_EXCL fs: add vfs_cmd_reconfigure() fs: add vfs_cmd_create() super: remove get_tree_single_reconf()
2023-08-27ext4: fix slab-use-after-free in ext4_es_insert_extent()Baokun Li1-14/+30
Yikebaer reported an issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438 CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1 Call Trace: [...] kasan_report+0xba/0xf0 mm/kasan/report.c:588 ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Allocated by task 8438: [...] kmem_cache_zalloc include/linux/slab.h:693 [inline] __es_alloc_extent fs/ext4/extents_status.c:469 [inline] ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Freed by task 8438: [...] kmem_cache_free+0xec/0x490 mm/slub.c:3823 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline] __es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802 ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] ================================================================== The flow of issue triggering is as follows: 1. remove es raw es es removed es1 |-------------------| -> |----|.......|------| 2. insert es es insert es1 merge with es es1 merge with es and free es1 |----|.......|------| -> |------------|------| -> |-------------------| es merges with newes, then merges with es1, frees es1, then determines if es1->es_len is 0 and triggers a UAF. The code flow is as follows: ext4_es_insert_extent es1 = __es_alloc_extent(true); es2 = __es_alloc_extent(true); __es_remove_extent(inode, lblk, end, NULL, es1) __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree __es_insert_extent(inode, &newes, es2) ext4_es_try_to_merge_right ext4_es_free_extent(inode, es1) ---> es1 is freed if (es1 && !es1->es_len) // Trigger UAF by determining if es1 is used. We determine whether es1 or es2 is used immediately after calling __es_remove_extent() or __es_insert_extent() to avoid triggering a UAF if es1 or es2 is freed. Reported-by: Yikebaer Aizezi <[email protected]> Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com Fixes: 2a69c450083d ("ext4: using nofail preallocation in ext4_es_insert_extent()") Cc: [email protected] Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27libfs: remove redundant checks of s_encodingEric Biggers1-12/+2
Now that neither ext4 nor f2fs allows inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: remove redundant checks of s_encodingEric Biggers2-4/+4
Now that ext4 does not allow inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: reject casefold inode flag without casefold featureEric Biggers1-1/+4
It is invalid for the casefold inode flag to be set without the casefold superblock feature flag also being set. e2fsck already considers this case to be invalid and handles it by offering to clear the casefold flag on the inode. __ext4_iget() also already considered this to be invalid, sort of, but it only got so far as logging an error message; it didn't actually reject the inode. Make it reject the inode so that other code doesn't have to handle this case. This matches what f2fs does. Note: we could check 's_encoding != NULL' instead of ext4_has_feature_casefold(). This would make the check robust against the casefold feature being enabled by userspace writing to the page cache of the mounted block device. However, it's unsolvable in general for filesystems to be robust against concurrent writes to the page cache of the mounted block device. Though this very particular scenario involving the casefold feature is solvable, we should not pretend that we can support this model, so let's just check the casefold feature. tune2fs already forbids enabling casefold on a mounted filesystem. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: use LIST_HEAD() to initialize the list_head in mballoc.cRuan Jinjie1-13/+5
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: do not mark inode dirty every time when appending using delallocLiu Song1-26/+62
In the delalloc append write scenario, if inode's i_size is extended due to buffer write, there are delalloc writes pending in the range up to i_size, and no need to touch i_disksize since writeback will push i_disksize up to i_size eventually. Offers significant performance improvement in high-frequency append write scenarios. I conducted tests in my 32-core environment by launching 32 concurrent threads to append write to the same file. Each write operation had a length of 1024 bytes and was repeated 100000 times. Without using this patch, the test was completed in 7705 ms. However, with this patch, the test was completed in 5066 ms, resulting in a performance improvement of 34%. Moreover, in test scenarios of Kafka version 2.6.2, using packet size of 2K, with this patch resulted in a 10% performance improvement. Signed-off-by: Liu Song <[email protected]> Suggested-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: rename s_error_work to s_sb_upd_workTheodore Ts'o2-19/+22
The most common use that s_error_work will get scheduled is now the periodic update of the superblock. So rename it to s_sb_upd_work. Also rename the function flush_stashed_error_work() to update_super_work(). Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: add periodic superblock update checkVitaliy Kuznetsov1-1/+61
This patch introduces a mechanism to periodically check and update the superblock within the ext4 file system. The main purpose of this patch is to keep the disk superblock up to date. The update will be performed if more than one hour has passed since the last update, and if more than 16MB of data have been written to disk. This check and update is performed within the ext4_journal_commit_callback function, ensuring that the superblock is written while the disk is active, rather than based on a timer that may trigger during disk idle periods. Discussion https://www.spinics.net/lists/linux-ext4/msg85865.html Signed-off-by: Vitaliy Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: drop dio overwrite only flag and associated warningBrian Foster1-15/+10
The commit referenced below opened up concurrent unaligned dio under shared locking for pure overwrites. In doing so, it enabled use of the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected -EAGAIN returns as an extra precaution, since ext4 does not retry writes in such cases. The flag itself is advisory in this case since ext4 checks for unaligned I/Os and uses appropriate locking up front, rather than on a retry in response to -EAGAIN. As it turns out, the warning check is susceptible to false positives because there are scenarios where -EAGAIN can be expected from lower layers without necessarily having IOCB_NOWAIT set on the iocb. For example, one instance of the warning has been seen where io_uring sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on the bio. This results in -EAGAIN if the block layer is unable to allocate a request, etc. [Note that there is an outstanding patch to untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on IOCB_NOWAIT, which would also address this instance of the warning.] Another instance of the warning has been reproduced by syzbot. A dio write is interrupted down in __get_user_pages_locked() waiting on the mm lock and returns -EAGAIN up the stack. If the iomap dio iteration layer has made no progress on the write to this point, -EAGAIN returns up to the filesystem and triggers the warning. This use of the overwrite flag in ext4 is precautionary and half-baked. I.e., ext4 doesn't actually implement overwrite checking in the iomap callbacks when the flag is set, so the only extra verification it provides are i_size checks in the generic iomap dio layer. Combined with the tendency for false positives, the added verification is not worth the extra trouble. Remove the flag, associated warning, and update the comments to document when concurrent unaligned dio writes are allowed and why said flag is not used. Cc: [email protected] Reported-by: [email protected] Reported-by: Pengfei Xu <[email protected]> Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: add correct group descriptors and reserved GDT blocks to system zoneWang Jianjian3-8/+17
When setup_system_zone, flex_bg is not initialized so it is always 1. Use a new helper function, ext4_num_base_meta_blocks() which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: [email protected] Signed-off-by: Wang Jianjian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: remove unused function declarationCai Xinchen1-6/+0
These functions do not have its function implementation. So those function declaration is useless. Remove these Signed-off-by: Cai Xinchen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: mballoc: avoid garbage value from errSu Hui1-1/+1
clang's static analysis warning: fs/ext4/mballoc.c line 4178, column 6, Branch condition evaluates to a garbage value. err is uninitialized and will be judged when 'len <= 0' or it first enters the loop while the condition "!ext4_sb_block_valid()" is true. Although this can't make problems now, it's better to correct it. Signed-off-by: Su Hui <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple()Lu Hongfei1-1/+1
Signed-off-by: Lu Hongfei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: change the type of blocksize in ext4_mb_init_cache()Lu Hongfei1-1/+1
The return value type of i_blocksize() is 'unsigned int', so the type of blocksize has been modified from 'int' to 'unsigned int' to ensure data type consistency. Signed-off-by: Lu Hongfei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: fix unttached inode after power cut with orphan file feature enabledZhihao Cheng1-0/+3
Running generic/475(filesystem consistent tests after power cut) could easily trigger unattached inode error while doing fsck: Unattached zero-length inode 39405. Clear? no Unattached inode 39405 Connect to /lost+found? no Above inconsistence is caused by following process: P1 P2 ext4_create inode = ext4_new_inode_start_handle // itable records nlink=1 ext4_add_nondir err = ext4_add_entry // ENOSPC ext4_append ext4_bread ext4_getblk ext4_map_blocks // returns ENOSPC drop_nlink(inode) // won't be updated into disk inode ext4_orphan_add(handle, inode) ext4_orphan_file_add ext4_journal_stop(handle) jbd2_journal_commit_transaction // commit success >> power cut << ext4_fill_super ext4_load_and_init_journal // itable records nlink=1 ext4_orphan_cleanup ext4_process_orphan if (inode->i_nlink) // true, inode won't be deleted Then, allocated inode will be reserved on disk and corresponds to no dentries, so e2fsck reports 'unattached inode' problem. The problem won't happen if orphan file feature is disabled, because ext4_orphan_add() will update disk inode in orphan list mode. There are several places not updating disk inode while putting inode into orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout in ext4_rename(). Fix it by updating inode into disk in all error branches of these places. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605 Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Zhihao Cheng <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27jbd2: correct the end of the journal recovery scan rangeZhang Yi1-9/+3
We got a filesystem inconsistency issue below while running generic/475 I/O failure pressure test with fast_commit feature enabled. Symlink /p3/d3/d1c/d6c/dd6/dce/l101 (inode #132605) is invalid. If fast_commit feature is enabled, a special fast_commit journal area is appended to the end of the normal journal area. The journal->j_last point to the first unused block behind the normal journal area instead of the whole log area, and the journal->j_fc_last point to the first unused block behind the fast_commit journal area. While doing journal recovery, do_one_pass(PASS_SCAN) should first scan the normal journal area and turn around to the first block once it meet journal->j_last, but the wrap() macro misuse the journal->j_fc_last, so the recovering could not read the next magic block (commit block perhaps) and would end early mistakenly and missing tN and every transaction after it in the following example. Finally, it could lead to filesystem inconsistency. | normal journal area | fast commit area | +-------------------------------------------------+------------------+ | tN(rere) | tN+1 |~| tN-x |...| tN-1 | tN(front) | .... | +-------------------------------------------------+------------------+ / / / start journal->j_last journal->j_fc_last This patch fix it by use the correct ending journal->j_last. Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path") Cc: [email protected] Reported-by: Theodore Ts'o <[email protected]> Link: https://lore.kernel.org/linux-ext4/[email protected]/ Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: ext4_get_{dev}_journal return proper error valueZhang Yi1-22/+31
ext4_get_journal() and ext4_get_dev_journal() return NULL if they failed to init journal, making them return proper error value instead, also rename them to ext4_open_{inode,dev}_journal(). [ Folded fix to ext4_calculate_overhead() to check for an ERR_PTR instead of NULL. ] Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reported-by: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-25Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of ↵Linus Torvalds2-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport" * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: shmem: fix smaps BUG sleeping while atomic selftests: cachestat: catch failing fsync test on tmpfs selftests: cachestat: test for cachestat availability maple_tree: disable mas_wr_append() when other readers are possible madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check mm: multi-gen LRU: don't spin during memcg release mm: memory-failure: fix unexpected return value in soft_offline_page() radix tree: remove unused variable mm: add a call to flush_cache_vmap() in vmap_pfn() selftests/mm: FOLL_LONGTERM need to be updated to 0x100 nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast selftests: cgroup: fix test_kmem_basic less than error mm: enable page walking API to lock vmas during the walk smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT
2023-08-25f2fs: use finish zone command when closing a zoneDaeho Jeong1-6/+13
Use the finish zone command first when a zone should be closed. Signed-off-by: Daeho Jeong <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2023-08-25dlm: fix plock lookup when using multiple lockspacesAlexander Aring1-3/+3
All posix lock ops, for all lockspaces (gfs2 file systems) are sent to userspace (dlm_controld) through a single misc device. The dlm_controld daemon reads the ops from the misc device and sends them to other cluster nodes using separate, per-lockspace cluster api communication channels. The ops for a single lockspace are ordered at this level, so that the results are received in the same sequence that the requests were sent. When the results are sent back to the kernel via the misc device, they are again funneled through the single misc device for all lockspaces. When the dlm code in the kernel processes the results from the misc device, these results will be returned in the same sequence that the requests were sent, on a per-lockspace basis. A recent change in this request/reply matching code missed the "per-lockspace" check (fsid comparison) when matching request and reply, so replies could be incorrectly matched to requests from other lockspaces. Cc: [email protected] Reported-by: Barry Marson <[email protected]> Fixes: 57e2c2f2d94c ("fs: dlm: fix mismatch of plock results from userspace") Signed-off-by: Alexander Aring <[email protected]> Signed-off-by: David Teigland <[email protected]>
2023-08-24[SMB3] send channel sequence number in SMB3 requests after reconnectsSteve French5-1/+45
The ChannelSequence field in the SMB3 header is supposed to be increased after reconnect to allow the server to distinguish requests from before and after the reconnect. We had always been setting it to zero. There are cases where incrementing ChannelSequence on requests after network reconnects can reduce the chance of data corruptions. See MS-SMB2 3.2.4.1 and 3.2.7.1 Signed-off-by: Steve French <[email protected]> Cc: [email protected] # 5.16+
2023-08-24document while_each_thread(), change first_tid() to use for_each_thread()Oleg Nesterov1-3/+2
Add the comment to explain that while_each_thread(g,t) is not rcu-safe unless g is stable (e.g. current). Even if g is a group leader and thus can't exit before t, t or another sub-thread can exec and remove g from the thread_group list. The only lockless user of while_each_thread() is first_tid() and it is fine in that it can't loop forever, yet for_each_thread() looks better and I am going to change while_each_thread/next_thread. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24mm: remove enum page_entry_sizeMatthew Wilcox (Oracle)7-69/+44
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [[email protected]: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24mm: move PMD_ORDER to pgtable.hMatthew Wilcox (Oracle)1-3/+0
Patch series "Change calling convention for ->huge_fault", v2. There are two unrelated changes to the calling convention for ->huge_fault. I've bundled them together to help people notice the change. The first is to improve scalability of DAX page faults by allowing them to be handled under the VMA lock. The second is to remove enum page_entry_size since it's really unnecessary. The changelogs and documentation updates hopefully work to that end. This patch (of 3): Allow this to be used in generic code. Also add PUD_ORDER. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24mm: userfaultfd: remove stale comment about core dump lockingJann Horn1-5/+1
Since commit 7f3bfab52cab ("mm/gup: take mmap_lock in get_dump_page()"), which landed in v5.10, core dumping doesn't enter fault handling without holding the mmap_lock anymore. Remove the stale parts of the comments, but leave the behavior as-is - letting core dumping block on userfault handling would be a bad idea and could lead to deadlocks if the dumping process was handling its own userfaults. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24minmax: add in_range() macroMatthew Wilcox (Oracle)4-12/+0
Patch series "New page table range API", v6. This patchset changes the API used by the MM to set up page table entries. The four APIs are: set_ptes(mm, addr, ptep, pte, nr) update_mmu_cache_range(vma, addr, ptep, nr) flush_dcache_folio(folio) flush_icache_pages(vma, page, nr) flush_dcache_folio() isn't technically new, but no architecture implemented it, so I've done that for them. The old APIs remain around but are mostly implemented by calling the new interfaces. The new APIs are based around setting up N page table entries at once. The N entries belong to the same PMD, the same folio and the same VMA, so ptep++ is a legitimate operation, and locking is taken care of for you. Some architectures can do a better job of it than just a loop, but I have hesitated to make too deep a change to architectures I don't understand well. One thing I have changed in every architecture is that PG_arch_1 is now a per-folio bit instead of a per-page bit when used for dcache clean/dirty tracking. This was something that would have to happen eventually, and it makes sense to do it now rather than iterate over every page involved in a cache flush and figure out if it needs to happen. The point of all this is better performance, and Fengwei Yin has measured improvement on x86. I suspect you'll see improvement on your architecture too. Try the new will-it-scale test mentioned here: https://lore.kernel.org/linux-mm/[email protected]/ You'll need to run it on an XFS filesystem and have CONFIG_TRANSPARENT_HUGEPAGE set. This patchset is the basis for much of the anonymous large folio work being done by Ryan, so it's received quite a lot of testing over the last few months. This patch (of 38): Determine if a value lies within a range more efficiently (subtraction + comparison vs two comparisons and an AND). It also has useful (under some circumstances) behaviour if the range exceeds the maximum value of the type. Convert all the conflicting definitions of in_range() within the kernel; some can use the generic definition while others need their own definition. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24mm: handle userfaults under VMA lockSuren Baghdasaryan1-20/+14
Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [[email protected]: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hillf Danton <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jan Kara <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24Merge tag 'nfsd-6.5-5' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Two last-minute one-liners for v6.5-rc. One got lost in the shuffle, and the other was reported just this morning" - Close race window when handling FREE_STATEID operations - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc" * tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix a thinko introduced by recent trace point changes nfsd: Fix race to FREE_STATEID and cl_revoked
2023-08-24NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia2-0/+7
After receiving the location(s) of the DS server(s) in the GETDEVINCEINFO, create the request for the clientid to such server and indicate that the client is connecting to a DS. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24NFS/pNFS: Set the connect timeout for the pNFS flexfiles driverTrond Myklebust4-0/+10
Ensure that the connect timeout for the pNFS flexfiles driver is of the same order as the I/O timeout, so that we can fail over quickly when trying to read from a data server that is down. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24NFS: Fix a potential data corruptionTrond Myklebust1-1/+19
We must ensure that the subrequests are joined back into the head before we can retransmit a request. If the head was not on the commit lists, because the server wrote it synchronously, we still need to add it back to the retransmission list. Add a call that mirrors the effect of nfs_cancel_remove_inode() for O_DIRECT. Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: [email protected] Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24nfs: fix redundant readdir request after get eofKinglong Mee1-4/+11
When a directory contains 17 files (except . and ..), nfs client sends a redundant readdir request after get eof. A simple reproduce, At NFS server, create a directory with 17 files under exported directory. # mkdir test # cd test # for i in {0..16} ; do touch $i; done At NFS client, no matter mounting through nfsv3 or nfsv4, does ls (or ll) at the created test directory. A tshark output likes following (for nfsv4), # tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4 srcip dstip SEQUENCE, PUTFH, READDIR 0 dstip srcip SEQUENCE PUTFH READDIR 909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807 srcip dstip srcip dstip SEQUENCE, PUTFH, READDIR 9223372036854775807 dstip srcip SEQUENCE PUTFH READDIR The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant. Reviewed-by: Benjamin Coddington <[email protected]> Signed-off-by: Kinglong Mee <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24nfs/blocklayout: Use the passed in gfp flagsDan Carpenter1-2/+2
This allocation should use the passed in GFP_ flags instead of GFP_KERNEL. One places where this matters is in filelayout_pg_init_write() which uses GFP_NOFS as the allocation flags. Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24filemap: Fix errors in file.c[email protected]1-1/+1
The following checkpatch errors are removed: ERROR: "foo * bar" should be "foo *bar" "foo * bar" should be "foo *bar" Signed-off-by: ZhiHu <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_infoFedor Pchelkin1-1/+1
It is an almost improbable error case but when page allocating loop in nfs4_get_device_info() fails then we should only free the already allocated pages, as __free_page() can't deal with NULL arguments. Found by Linux Verification Center (linuxtesting.org). Cc: [email protected] Signed-off-by: Fedor Pchelkin <[email protected]> Reviewed-by: Benjamin Coddington <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24NFS: Move common includes outside ifdefGUO Zihua1-7/+5
module.h, clnt.h, addr.h and dns_resolve.h is always included whether CONFIG_NFS_USE_KERNEL_DNS is set or not and their order does not seems to matter. Move them outside the ifdef to simplify code and avoid checkincludes message. Signed-off-by: GUO Zihua <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-08-24NFSD: Fix a thinko introduced by recent trace point changesChuck Lever1-0/+1
The fixed commit erroneously removed a call to nfsd_end_grace(), which makes calls to write_v4_end_grace() a no-op. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-lkp/[email protected] Fixes: 39d432fc7630 ("NFSD: trace nfsctl operations") Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2023-08-24locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lockWill Shiu1-1/+1
As following backtrace, the struct file_lock request , in posix_lock_inode is free before ftrace function using. Replace the ftrace function ahead free flow could fix the use-after-free issue. [name:report&]=============================================== BUG:KASAN: use-after-free in trace_event_raw_event_filelock_lock+0x80/0x12c [name:report&]Read at addr f6ffff8025622620 by task NativeThread/16753 [name:report_hw_tags&]Pointer tag: [f6], memory tag: [fe] [name:report&] BT: Hardware name: MT6897 (DT) Call trace: dump_backtrace+0xf8/0x148 show_stack+0x18/0x24 dump_stack_lvl+0x60/0x7c print_report+0x2c8/0xa08 kasan_report+0xb0/0x120 __do_kernel_fault+0xc8/0x248 do_bad_area+0x30/0xdc do_tag_check_fault+0x1c/0x30 do_mem_abort+0x58/0xbc el1_abort+0x3c/0x5c el1h_64_sync_handler+0x54/0x90 el1h_64_sync+0x68/0x6c trace_event_raw_event_filelock_lock+0x80/0x12c posix_lock_inode+0xd0c/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 ... [name:report&] [name:report&]Allocated by task 16752: ... slab_post_alloc_hook+0x74/0x340 kmem_cache_alloc+0x1b0/0x2f0 posix_lock_inode+0xb0/0xd60 ... [name:report&] [name:report&]Freed by task 16752: ... kmem_cache_free+0x274/0x5b0 locks_dispose_list+0x3c/0x148 posix_lock_inode+0xc40/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 do_fcntl+0x150/0xc18 ... Signed-off-by: Will Shiu <[email protected]> Signed-off-by: Jeff Layton <[email protected]>
2023-08-24fs/locks: Fix typoJakub Wilk1-1/+1
Signed-off-by: Jakub Wilk <[email protected]> Signed-off-by: Jeff Layton <[email protected]>
2023-08-24ceph: switch ceph_lookup/atomic_open() to use new fscrypt helperLuís Henriques2-11/+10
Instead of setting the no-key dentry, use the new fscrypt_prepare_lookup_partial() helper. We still need to mark the directory as incomplete if the directory was just unlocked. In ceph_atomic_open() this fixes a bug where a dentry is incorrectly set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is still available (for example, where there's a drop_caches). Signed-off-by: Luís Henriques <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-24ceph: fix updating i_truncate_pagecache_size for fscryptXiubo Li2-13/+26
When fscrypt is enabled we will align the truncate size up to the CEPH_FSCRYPT_BLOCK_SIZE always, so if we truncate the size in the same block more than once, the latter ones will be skipped being invalidated from the page caches. This will force invalidating the page caches by using the smaller size than the real file size. At the same time add more debug log and fix the debug log for truncate code. Link: https://tracker.ceph.com/issues/58834 Signed-off-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>