aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-08-23erofs: refine warning messages for zdata I/OsFerry Meng1-14/+9
Don't warn users since -EINTR can be returned due to user interruption. Also suppress warning messages of readmore. Signed-off-by: Ferry Meng <[email protected]> Reviewed-by: Gao Xiang <[email protected]> Reviewed-by: Yue Hu <[email protected]> Reviewed-by: Chao Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Gao Xiang <[email protected]>
2023-08-23Merge tag 'vfs-6.6-merge-3' of ↵Christian Brauner4-38/+183
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs online fsck update from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. * Use kernel-initated fsfreeze to fix some longstanding false negatives in online fsck of the free space and inode counters. Signed-off-by: Darrick J. Wong <[email protected]> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <[email protected]>
2023-08-23Merge tag 'vfs-6.6-merge-2' of ↵Christian Brauner15-86/+184
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull filesystem freezing updates from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. Signed-off-by: Darrick J. Wong <[email protected]> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <[email protected]>
2023-08-23ext4: cleanup ext4_get_dev_journal() and ext4_get_journal()Zhang Yi1-60/+49
Factor out a new helper form ext4_get_dev_journal() to get external journal bdev and check validation of this device, drop ext4_blkdev_get() helper, and also remove duplicate check of journal feature. It makes ext4_get_dev_journal() more clear than before. Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-23jbd2: jbd2_journal_init_{dev,inode} return proper error return valueZhang Yi3-16/+15
Current jbd2_journal_init_{dev,inode} return NULL if some error happens, make them to pass out proper error return value. [ Fix from Yang Yingliang folded in. ] Signed-off-by: Zhang Yi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-22debugfs: Add write support to debugfs_create_str()Mike Tipton1-2/+46
Currently, debugfs_create_str() only supports reading strings from debugfs. Add support for writing them as well. Based on original implementation by Peter Zijlstra [0]. Write support was present in the initial patch version, but dropped in v2 due to lack of users. We have a user now, so reintroduce it. [0] https://lore.kernel.org/all/YF3Hv5zXb%[email protected]/ Signed-off-by: Mike Tipton <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Georgi Djakov <[email protected]>
2023-08-22Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds4-18/+31
Pull NFS client fixes from Trond Myklebust: - fix a use after free in nfs_direct_join_group() (Cc: stable) - fix sysfs server name memory leak - fix lock recovery hang in NFSv4.0 - fix page free in the error path for nfs42_proc_getxattr() and __nfs4_get_acl_uncached() - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect * tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: xprtrdma: Remap Receive buffers after a reconnect NFSv4: fix out path in __nfs4_get_acl_uncached NFSv4.2: fix error handling in nfs42_proc_getxattr NFS: Fix sysfs server name memory leak NFS: Fix a use after free in nfs_direct_join_group() NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22cifs: update desired access while requesting for directory leaseBharath SM1-1/+1
We read and cache directory contents when we get directory lease, so we should ask for read permission to read contents of directory. Signed-off-by: Bharath SM <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Cc: [email protected] Signed-off-by: Steve French <[email protected]>
2023-08-22tracefs: Remove kerneldoc from struct eventfs_fileSteven Rostedt (Google)1-4/+10
The struct eventfs_file is a local structure and should not be parsed by kernel doc. It also does not fully follow the kerneldoc format and is causing kerneldoc to spit out errors. Replace the /** to /* so that kerneldoc no longer processes this structure. Also format the comments of the delete union of the structure to be a bit better. Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Reported-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22affs: rename local toupper() to fn() to avoid confusionAndy Shevchenko1-10/+10
A compiler may see the collision with the toupper() defined in ctype.h: fs/affs/namei.c:159:19: warning: unused variable 'toupper' [-Wunused-variable] 159 | toupper_t toupper = affs_get_toupper(sb); To prevent this from happening, rename toupper local variable to fn. Initially this had been introduced by 24579a881513 ("v2.4.3.5 -> v2.4.3.6") in the history.git by history group. Reported-by: kernel test robot <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22affs: remove writepage implementationMatthew Wilcox (Oracle)1-5/+9
If the filesystem implements migrate_folio and writepages, there is no need for a writepage implementation. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22btrfs: zoned: skip splitting and logical rewriting on pre-alloc writeNaohiro Aota1-4/+15
When doing a relocation, there is a chance that at the time of btrfs_reloc_clone_csums(), there is no checksum for the corresponding region. In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item and so ordered_extent's logical is set to some invalid value. Then, btrfs_lookup_block_group() in btrfs_zone_finish_endio() failed to find a block group and will hit an assert or a null pointer dereference as following. This can be reprodcued by running btrfs/028 several times (e.g, 4 to 16 times) with a null_blk setup. The device's zone size and capacity is set to 32 MB and the storage size is set to 5 GB on my setup. KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f] CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G W 6.5.0-rc6-kts+ #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00 > 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00 RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827 R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000 R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0 Call Trace: <TASK> ? die_addr+0x3c/0xa0 ? exc_general_protection+0x148/0x220 ? asm_exc_general_protection+0x22/0x30 ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] ? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs] btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs] ? rcu_is_watching+0x11/0xb0 ? lock_release+0x47a/0x620 ? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs] ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs] ? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs] ? __smp_call_single_queue+0x124/0x350 ? rcu_is_watching+0x11/0xb0 btrfs_work_helper+0x19f/0xc60 [btrfs] ? __pfx_try_to_wake_up+0x10/0x10 ? _raw_spin_unlock_irq+0x24/0x50 ? rcu_is_watching+0x11/0xb0 process_one_work+0x8c1/0x1430 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? _raw_spin_lock_irq+0x52/0x60 worker_thread+0x100/0x12c0 ? __kthread_parkme+0xc1/0x1f0 ? __pfx_worker_thread+0x10/0x10 kthread+0x2ea/0x3c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> On the zoned mode, writing to pre-allocated region means data relocation write. Such write always uses WRITE command so there is no need of splitting and rewriting logical address. Thus, we can just skip the function for the case. Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes") Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22super: use higher-level helper for {freeze,thaw}Christian Brauner1-3/+12
It's not necessary to use low-level locking helpers here. Use the higher-level locking helpers and log if the superblock is dying. Since the caller is assumed to already hold an active reference it isn't possible to observe a dying superblock. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-22tracefs: Avoid changing i_mode to a temp valueSishuai Gong1-2/+4
Right now inode->i_mode is updated twice to reach the desired value in tracefs_apply_options(). Because there is no lock protecting the two writes, other threads might read the intermediate value of inode->i_mode. Thread-1 Thread-2 // tracefs_apply_options() //e.g., acl_permission_check inode->i_mode &= ~S_IALLUGO; unsigned int mode = inode->i_mode; inode->i_mode |= opts->mode; I think there is no need to introduce a lock but it is better to only update inode->i_mode ONCE, so the readers will either see the old or latest value, rather than an intermediate/temporary value. Note, the race is not a security concern as the intermediate value is more locked down than either the start or end version. This is more just to do the conversion cleanly. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Signed-off-by: Sishuai Gong <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrsHugh Dickins1-2/+2
It is particularly important for the userns mount case (when a sensible nr_inodes maximum may not be enforced) that tmpfs user xattrs be subject to memory cgroup limiting. Leave temporary buffer allocations as is, but change the persistent simple xattr allocations from GFP_KERNEL to GFP_KERNEL_ACCOUNT. This limits kernfs's cgroupfs too, but that's good. (I had intended to send this change earlier, but had been confused by shmem_alloc_inode() using GFP_KERNEL, and thought a discussion would be needed to change that too: no, I was forgetting the SLAB_ACCOUNT on that kmem_cache, which implicitly adds __GFP_ACCOUNT to all its allocations.) Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-22ceph: add base64 endcoding routines for encrypted namesLuís Henriques2-0/+92
The base64url encoding used by fscrypt includes the '_' character, which may cause problems in snapshot names (if the name starts with '_'). Thus, use the base64 encoding defined for IMAP mailbox names (RFC 3501), which uses '+' and ',' instead of '-' and '_'. Signed-off-by: Luís Henriques <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: make ioctl cmds more readable in debug logXiubo Li1-1/+38
ioctl file 0000000004e6b054 cmd 2148296211 arg 824635143532 The numerical cmd value in the ioctl debug log message is too hard to understand even when you look at it in the code. Make it more readable. [ idryomov: add missing _ in ceph_ioctl_cmd_name() ] Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: add fscrypt ioctls and ceph.fscrypt.auth vxattrJeff Layton2-0/+111
We gate most of the ioctls on MDS feature support. The exception is the key removal and status functions that we still want to work if the MDS's were to (inexplicably) lose the feature. For the set_policy ioctl, we take Fs caps to ensure that nothing can create files in the directory while the ioctl is running. That should be enough to ensure that the "empty_dir" check is reliable. The vxattr is read-only, added mostly for future debugging purposes. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: implement -o test_dummy_encryption mount optionJeff Layton6-6/+186
Add support for the test_dummy_encryption mount option. This allows us to test the encrypted codepaths in ceph without having to manually set keys, etc. [ lhenriques: fix potential fsc->fsc_dummy_enc_policy memory leak in ceph_real_mount() ] Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: fscrypt_auth handling for cephJeff Layton10-27/+372
Most fscrypt-enabled filesystems store the crypto context in an xattr, but that's problematic for ceph as xatts are governed by the XATTR cap, but we really want the crypto context as part of the AUTH cap. Because of this, the MDS has added two new inode metadata fields: fscrypt_auth and fscrypt_file. The former is used to hold the crypto context, and the latter is used to track the real file size. Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now, we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt context. Allow the client to use a setattr request for setting the fscrypt_auth field. Since this is not a standard setattr request from the VFS, we add a new field to __ceph_setattr that carries ceph-specific inode attrs. Have the set_context op do a setattr that sets the fscrypt_auth value, and get_context just return the contents of that field (since it should always be available). Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: use osd_req_op_extent_osd_iter for netfs readsJeff Layton1-18/+1
The netfs layer has already pinned the pages involved before calling issue_op, so we can just pass down the iter directly instead of calling iov_iter_get_pages_alloc. Instead of having to allocate a page array, use CEPH_MSG_DATA_ITER and pass it the iov_iter directly to clone. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: make ceph_msdc_build_path use ref-walkJeff Layton1-16/+19
Encryption potentially requires allocation, at which point we'll need to be in a non-atomic context. Convert ceph_msdc_build_path to take dentry spinlocks and references instead of using rcu_read_lock to walk the path. This is slightly less efficient, and we may want to eventually allow using RCU when the leaf dentry isn't encrypted. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: preallocate inode for ops that may create oneJeff Layton6-67/+179
When creating a new inode, we need to determine the crypto context before we can transmit the RPC. The fscrypt API has a routine for getting a crypto context before a create occurs, but it requires an inode. Change the ceph code to preallocate an inode in advance of a create of any sort (open(), mknod(), symlink(), etc). Move the existing code that generates the ACL and SELinux blobs into this routine since that's mostly common across all the different codepaths. In most cases, we just want to allow ceph_fill_trace to use that inode after the reply comes in, so add a new field to the MDS request for it (r_new_inode). The async create codepath is a bit different though. In that case, we want to hash the inode in advance of the RPC so that it can be used before the reply comes in. If the call subsequently fails with -EJUKEBOX, then just put the references and clean up the as_ctx. Note that with this change, we now need to regenerate the as_ctx when this occurs, but it's quite rare for it to happen. Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-22ceph: add new mount option to enable sparse readsJeff Layton4-11/+72
Add a new mount option that has the client issue sparse reads instead of normal ones. The callers now preallocate an sparse extent buffer that the libceph receive code can populate and hand back after the operation completes. After a successful sparse read, we can't use the req->r_result value to determine the amount of data "read", so instead we set the received length to be from the end of the last extent in the buffer. Any interstitial holes will have been filled by the receive code. [ xiubli: fix a double free on req reported by Ilya ] Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Reviewed-and-tested-by: Luís Henriques <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-08-21merge mm-hotfixes-stable into mm-stable to pick up depended-upon changesAndrew Morton5-5/+50
2023-08-21kill do_each_thread()Oleg Nesterov1-2/+2
Eric has pointed out that we still have 3 users of do_each_thread(). Change them to use for_each_process_thread() and kill this helper. There is a subtle change, after do_each_thread/while_each_thread g == t == &init_task, while after for_each_process_thread() they both point to nowhere, but this doesn't matter. > Why is for_each_process_thread() better than do_each_thread()? Say, for_each_process_thread() is rcu safe, do_each_thread() is not. And certainly for_each_process_thread(p, t) { do_something(p, t); } looks better than do_each_thread(p, t) { do_something(p, t); } while_each_thread(p, t); And again, there are only 3 users of this awkward helper left. It should have been killed years ago and in fact I thought it had already been killed. It uses while_each_thread() which needs some changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: "Christian Brauner (Microsoft)" <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jiri Slaby <[email protected]> # tty/serial Signed-off-by: Andrew Morton <[email protected]>
2023-08-21nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuseRyusuke Konishi2-3/+7
A syzbot stress test using a corrupted disk image reported that mark_buffer_dirty() called from __nilfs_mark_inode_dirty() or nilfs_palloc_commit_alloc_entry() may output a kernel warning, and can panic if the kernel is booted with panic_on_warn. This is because nilfs2 keeps buffer pointers in local structures for some metadata and reuses them, but such buffers may be forcibly discarded by nilfs_clear_dirty_page() in some critical situations. This issue is reported to appear after commit 28a65b49eb53 ("nilfs2: do not write dirty data after degenerating to read-only"), but the issue has potentially existed before. Fix this issue by checking the uptodate flag when attempting to reuse an internally held buffer, and reloading the metadata instead of reusing the buffer if the flag was lost. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Closes: https://lkml.kernel.org/r/[email protected] Fixes: 8c26c4e2694a ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption") Tested-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> # 3.10+ Signed-off-by: Andrew Morton <[email protected]>
2023-08-21kernel/fork: stop playing lockless games for exe_file replacementMateusz Guzik1-2/+2
xchg originated in 6e399cd144d8 ("prctl: avoid using mmap_sem for exe_file serialization"). While the commit message does not explain *why* the change, I found the original submission [1] which ultimately claims it cleans things up by removing dependency of exe_file on the semaphore. However, fe69d560b5bd ("kernel/fork: always deny write access to current MM exe_file") added a semaphore up/down cycle to synchronize the state of exe_file against fork, defeating the point of the original change. This is on top of semaphore trips already present both in the replacing function and prctl (the only consumer). Normally replacing exe_file does not happen for busy processes, thus write-locking is not an impediment to performance in the intended use case. If someone keeps invoking the routine for a busy processes they are trying to play dirty and that's another reason to avoid any trickery. As such I think the atomic here only adds complexity for no benefit. Just write-lock around the replacement. I also note that replacement races against the mapping check loop as nothing synchronizes actual assignment with with said checks but I am not addressing it in this patch. (Is the loop of any use to begin with?) Link: https://lore.kernel.org/linux-mm/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Christian Brauner (Microsoft)" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mateusz Guzik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21adfs: delete unused "union adfs_dirtail" definitionAlexey Dobriyan1-5/+0
union adfs_dirtail::new stands in the way if Linux++ project: "new" can't be used as member's name because it is a keyword in C++. Link: https://lkml.kernel.org/r/43b0a4c8-a7cf-4ab1-98f7-0f65c096f9e8@p183 Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm,thp: fix smaps THPeligible output alignmentHugh Dickins1-1/+1
Extract from current /proc/self/smaps output: Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 0 ProtectionKey: 0 That's not the alignment shown in Documentation/filesystems/proc.rst: it's an ugly artifact from missing out the %8 other fields are using; but there's even one selftest which expects it to look that way. Hoping no other smaps parsers depend on THPeligible to look so ugly, fix these. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Cc: Alexey Dobriyan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: memtest: convert to memtest_report_meminfo()Kefeng Wang1-11/+1
It is better to not expose too many internal variables of memtest, add a helper memtest_report_meminfo() to show memtest results. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Tomas Mudrunka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_onceSuren Baghdasaryan1-0/+6
Implicit vma locking inside vm_flags_reset() and vm_flags_reset_once() is not obvious and makes it hard to understand where vma locking is happening. Also in some cases (like in dup_userfaultfd()) vma should be locked earlier than vma_flags modification. To make locking more visible, change these functions to assert that the vma write lock is taken and explicitly lock the vma beforehand. Fix userfaultfd functions which should lock the vma earlier. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Cc: Jann Horn <[email protected]> Cc: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: factor out VMA stack and heap checksKefeng Wang2-34/+5
Patch series "mm: convert to vma_is_initial_heap/stack()", v3. Add vma_is_initial_stack() and vma_is_initial_heap() helpers and use them to simplify code. This patch (of 4): Factor out VMA stack and heap checks and name them vma_is_initial_stack() and vma_is_initial_heap() for general use. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Christian Göttsche <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Christian Göttsche <[email protected]> Cc: Christian König <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Eric Paris <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Stephen Smalley <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: kill frontswapJohannes Weiner1-0/+1
The only user of frontswap is zswap, and has been for a long time. Have swap call into zswap directly and remove the indirection. [[email protected]: remove obsolete comment, per Yosry] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: don't warn if none swapcache folio is passed to zswap_load] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Yin Fengwei <[email protected]> Acked-by: Konrad Rzeszutek Wilk <[email protected]> Acked-by: Nhat Pham <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()Ryusuke Konishi1-0/+5
A syzbot stress test reported that create_empty_buffers() called from nilfs_lookup_dirty_data_buffers() can cause a general protection fault. Analysis using its reproducer revealed that the back reference "mapping" from a page/folio has been changed to NULL after dirty page/folio gang lookup in nilfs_lookup_dirty_data_buffers(). Fix this issue by excluding pages/folios from being collected if, after acquiring a lock on each page/folio, its back reference "mapping" differs from the pointer to the address space struct that held the page/folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Closes: https://lkml.kernel.org/r/[email protected] Tested-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan1-0/+5
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Suggested-by: Jann Horn <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()David Hildenbrand1-2/+1
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context. Link: https://lkml.kernel.org/r/[email protected] Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: liubo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21f2fs: avoid false alarm of circular lockingJaegeuk Kim2-10/+17
====================================================== WARNING: possible circular locking dependency detected 6.5.0-rc5-syzkaller-00353-gae545c3283dc #0 Not tainted ------------------------------------------------------ syz-executor273/5027 is trying to acquire lock: ffff888077fe1fb0 (&fi->i_sem){+.+.}-{3:3}, at: f2fs_down_write fs/f2fs/f2fs.h:2133 [inline] ffff888077fe1fb0 (&fi->i_sem){+.+.}-{3:3}, at: f2fs_add_inline_entry+0x300/0x6f0 fs/f2fs/inline.c:644 but task is already holding lock: ffff888077fe07c8 (&fi->i_xattr_sem){.+.+}-{3:3}, at: f2fs_down_read fs/f2fs/f2fs.h:2108 [inline] ffff888077fe07c8 (&fi->i_xattr_sem){.+.+}-{3:3}, at: f2fs_add_dentry+0x92/0x230 fs/f2fs/dir.c:783 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fi->i_xattr_sem){.+.+}-{3:3}: down_read+0x9c/0x470 kernel/locking/rwsem.c:1520 f2fs_down_read fs/f2fs/f2fs.h:2108 [inline] f2fs_getxattr+0xb1e/0x12c0 fs/f2fs/xattr.c:532 __f2fs_get_acl+0x5a/0x900 fs/f2fs/acl.c:179 f2fs_acl_create fs/f2fs/acl.c:377 [inline] f2fs_init_acl+0x15c/0xb30 fs/f2fs/acl.c:420 f2fs_init_inode_metadata+0x159/0x1290 fs/f2fs/dir.c:558 f2fs_add_regular_entry+0x79e/0xb90 fs/f2fs/dir.c:740 f2fs_add_dentry+0x1de/0x230 fs/f2fs/dir.c:788 f2fs_do_add_link+0x190/0x280 fs/f2fs/dir.c:827 f2fs_add_link fs/f2fs/f2fs.h:3554 [inline] f2fs_mkdir+0x377/0x620 fs/f2fs/namei.c:781 vfs_mkdir+0x532/0x7e0 fs/namei.c:4117 do_mkdirat+0x2a9/0x330 fs/namei.c:4140 __do_sys_mkdir fs/namei.c:4160 [inline] __se_sys_mkdir fs/namei.c:4158 [inline] __x64_sys_mkdir+0xf2/0x140 fs/namei.c:4158 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (&fi->i_sem){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x2e3d/0x5de0 kernel/locking/lockdep.c:5144 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726 down_write+0x93/0x200 kernel/locking/rwsem.c:1573 f2fs_down_write fs/f2fs/f2fs.h:2133 [inline] f2fs_add_inline_entry+0x300/0x6f0 fs/f2fs/inline.c:644 f2fs_add_dentry+0xa6/0x230 fs/f2fs/dir.c:784 f2fs_do_add_link+0x190/0x280 fs/f2fs/dir.c:827 f2fs_add_link fs/f2fs/f2fs.h:3554 [inline] f2fs_mkdir+0x377/0x620 fs/f2fs/namei.c:781 vfs_mkdir+0x532/0x7e0 fs/namei.c:4117 ovl_do_mkdir fs/overlayfs/overlayfs.h:196 [inline] ovl_mkdir_real+0xb5/0x370 fs/overlayfs/dir.c:146 ovl_workdir_create+0x3de/0x820 fs/overlayfs/super.c:309 ovl_make_workdir fs/overlayfs/super.c:711 [inline] ovl_get_workdir fs/overlayfs/super.c:864 [inline] ovl_fill_super+0xdab/0x6180 fs/overlayfs/super.c:1400 vfs_get_super+0xf9/0x290 fs/super.c:1152 vfs_get_tree+0x88/0x350 fs/super.c:1519 do_new_mount fs/namespace.c:3335 [inline] path_mount+0x1492/0x1ed0 fs/namespace.c:3662 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount fs/namespace.c:3861 [inline] __x64_sys_mount+0x293/0x310 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(&fi->i_xattr_sem); lock(&fi->i_sem); lock(&fi->i_xattr_sem); lock(&fi->i_sem); Cc: <[email protected]> Reported-and-tested-by: [email protected] Fixes: 5eda1ad1aaff "f2fs: fix deadlock in i_xattr_sem and inode page lock" Tested-by: Guenter Roeck <[email protected]> Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2023-08-21ext2: Fix kernel-doc warningsMatthew Wilcox (Oracle)2-61/+56
Document a few parameters of ext2_alloc_blocks(). Redo the alloc_new_reservation() and find_next_reservable_window() kernel-doc entirely. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Jan Kara <[email protected]> Message-Id: <[email protected]>
2023-08-21super: wait until we passed kill superChristian Brauner1-7/+64
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again. That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet. We can now do this as we can make iterators through @fs_super and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead so until sb->kill_sb() has been called and SB_DEAD been set. Concurrent iterators can already discard any SB_DYING superblock. Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21super: wait for nascent superblocksChristian Brauner1-51/+153
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects that we get to learn about while developing this. Superblock allocators such as sget{_fc}() return with s_umount of the new superblock held and lock ordering currently requires that block level locks such as bdev_lock and open_mutex rank above s_umount. Before aca740cecbe5 ("fs: open block device after superblock creation") ordering was guaranteed to be correct as block devices were opened prior to superblock allocation and thus s_umount wasn't held. But now s_umount must be dropped before opening block devices to avoid locking violations. This has consequences. The main one being that iterators over @super_blocks and @fs_supers that grab a temporary reference to the superblock can now also grab s_umount before the caller has managed to open block devices and called fill_super(). So whereas before such iterators or concurrent mounts would have simply slept on s_umount until SB_BORN was set or the superblock was discard due to initalization failure they can now needlessly spin through sget{_fc}(). If the caller is sleeping on bdev_lock or open_mutex one caller waiting on SB_BORN will always spin somewhere and potentially this can go on for quite a while. It should be possible to drop s_umount while allowing iterators to wait on a nascent superblock to either be born or discarded. This patch implements a wait_var_event() mechanism allowing iterators to sleep until they are woken when the superblock is born or discarded. This also allows us to avoid relooping through @fs_supers and @super_blocks if a superblock isn't yet born or dying. Link: aca740cecbe5 ("fs: open block device after superblock creation") Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21cachefiles: use kiocb_{start,end}_write() helpersAmir Goldstein1-13/+3
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21ovl: use kiocb_{start,end}_write() helpersAmir Goldstein1-8/+2
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21aio: use kiocb_{start,end}_write() helpersAmir Goldstein1-17/+3
Use helpers instead of the open coded dance to silence lockdep warnings. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21splice: Convert page_cache_pipe_buf_confirm() to use a folioMatthew Wilcox (Oracle)1-11/+9
Convert buf->page to a folio once instead of five times. There's only one uptodate bit per folio, not per page, so we lose nothing here. Signed-off-by: "Matthew Wilcox (Oracle)" <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21libfs: Convert simple_write_begin and simple_write_end to use a folioMatthew Wilcox (Oracle)1-20/+20
Remove a number of implicit calls to compound_head() and various calls to compatibility functions. This is not sufficient to enable support for large folios; generic_perform_write() must be converted first. Signed-off-by: "Matthew Wilcox (Oracle)" <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-21btrfs: tests: test invalid splitting when skipping pinned drop extent_mapJosef Bacik1-0/+138
This reproduces the bug fixed by "btrfs: fix incorrect splitting in btrfs_drop_extent_map_range", we were improperly calculating the range for the split extent. Add a test that exercises this scenario and validates that we get the correct resulting extent_maps in our tree. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21btrfs: tests: add a test for btrfs_add_extent_mappingJosef Bacik1-0/+56
This helper is different from the normal add_extent_mapping in that it will stuff an em into a gap that exists between overlapping em's in the tree. It appeared there was a bug so I wrote a self test to validate it did the correct thing when it worked with two side by side ems. Thankfully it is correct, but more testing is better. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21btrfs: tests: add extent_map tests for dropping with odd layoutsJosef Bacik1-0/+218
While investigating weird problems with the extent_map I wrote a self test testing the various edge cases of btrfs_drop_extent_map_range. This can split in different ways and behaves different in each case, so test the various edge cases to make sure everything is functioning properly. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-21btrfs: scrub: move write back of repaired sectors to ↵Qu Wenruo1-47/+25
scrub_stripe_read_repair_worker() Currently the scrub_stripe_read_repair_worker() only does reads to rebuild the corrupted sectors, it doesn't do any writeback. The design is mostly to put writeback into a more ordered manner, to co-operate with dev-replace with zoned mode, which requires every write to be submitted in their bytenr order. However the writeback for repaired sectors into the original mirror doesn't need such strong sync requirement, as it can only happen for non-zoned devices. This patch would move the writeback for repaired sectors into scrub_stripe_read_repair_worker(), which removes two calls sites for repaired sectors writeback. (one from flush_scrub_stripes(), one from scrub_raid56_parity_stripe()) Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>