aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-10-11bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALIDKent Overstreet1-0/+5
This fixes a kasan splat in the ec device removal tests. Signed-off-by: Kent Overstreet <[email protected]>
2024-10-11Merge tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds12-15/+25
Pull NFS client fixes from Anna Schumaker: "Localio Bugfixes: - remove duplicated include in localio.c - fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() - fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT - fix nfsd_file tracepoints to handle NULL rqstp pointers Other Bugfixes: - fix program selection loop in svc_process_common - fix integer overflow in decode_rc_list() - prevent NULL-pointer dereference in nfs42_complete_copies() - fix CB_RECALL performance issues when using a large number of delegations" * tag 'nfs-for-6.12-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: remove revoked delegation from server's delegation list nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put() NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies() SUNRPC: Fix integer overflow in decode_rc_list() sunrpc: fix prog selection loop in svc_process_common nfs: Remove duplicated include in localio.c
2024-10-11btrfs: fix uninitialized pointer free on read_alloc_one_name() errorRoi Martin1-2/+2
The function read_alloc_one_name() does not initialize the name field of the passed fscrypt_str struct if kmalloc fails to allocate the corresponding buffer. Thus, it is not guaranteed that fscrypt_str.name is initialized when freeing it. This is a follow-up to the linked patch that fixes the remaining instances of the bug introduced by commit e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs"). Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs") CC: [email protected] # 6.1+ Reviewed-by: Anand Jain <[email protected]> Signed-off-by: Roi Martin <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-11btrfs: send: cleanup unneeded return variable in changed_verity()Christian Heusel1-3/+1
As all changed_* functions need to return something, just return 0 directly here, as the verity status is passed via the context. Reported by LKP: fs/btrfs/send.c:6877:5-8: Unneeded variable: "ret". Return "0" on line 6883 Reported-by: kernel test robot <[email protected]> Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Christian Heusel <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-11btrfs: fix uninitialized pointer free in add_inode_ref()Roi Martin1-1/+1
The add_inode_ref() function does not initialize the "name" struct when it is declared. If any of the following calls to "read_one_inode() returns NULL, dir = read_one_inode(root, parent_objectid); if (!dir) { ret = -ENOENT; goto out; } inode = read_one_inode(root, inode_objectid); if (!inode) { ret = -EIO; goto out; } then "name.name" would be freed on "out" before being initialized. out: ... kfree(name.name); This issue was reported by Coverity with CID 1526744. Fixes: e43eec81c516 ("btrfs: use struct qstr instead of name and namelen pairs") CC: [email protected] # 6.6+ Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Roi Martin <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-11btrfs: use sector numbers as keys for the dirty extents xarrayFilipe Manana3-13/+33
We are using the logical address ("bytenr") of an extent as the key for qgroup records in the dirty extents xarray. This is a problem because the xarrays use "unsigned long" for keys/indices, meaning that on a 32 bits platform any extent starting at or beyond 4G is truncated, which is a too low limitation as virtually everyone is using storage with more than 4G of space. This means a "bytenr" of 4G gets truncated to 0, and so does 8G and 16G for example, resulting in incorrect qgroup accounting. Fix this by using sector numbers as keys instead, that is, using keys that match the logical address right shifted by fs_info->sectorsize_bits, which is what we do for the fs_info->buffer_radix that tracks extent buffers (radix trees also use an "unsigned long" type for keys). This also makes the index space more dense which helps optimize the xarray (as mentioned at Documentation/core-api/xarray.rst). Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-11ksmbd: add support for supplementary groupsNamjae Jeon7-17/+137
Even though system user has a supplementary group, It gets NT_STATUS_ACCESS_DENIED when attempting to create file or directory. This patch add KSMBD_EVENT_LOGIN_REQUEST_EXT/RESPONSE_EXT netlink events to get supplementary groups list. The new netlink event doesn't break backward compatibility when using old ksmbd-tools. Co-developed-by: Atte Heikkilä <[email protected]> Signed-off-by: Atte Heikkilä <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2024-10-11f2fs: allow parallel DIO readsJaegeuk Kim1-1/+2
This fixes a regression which prevents parallel DIO reads. Fixes: 0cac51185e65 ("f2fs: fix to avoid racing in between read and OPU dio write") Reviewed-by: Daeho Jeong <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2024-10-11xfs: fix integer overflow in xrep_bmapDarrick J. Wong1-1/+1
The variable declaration in this function predates the merge of the nrext64 (aka 64-bit extent counters) feature, which means that the variable declaration type is insufficient to avoid an integer overflow. Fix that by redeclaring the variable to be xfs_extnum_t. Coverity-id: 1630958 Fixes: 8f71bede8efd ("xfs: repair inode fork block mapping data structures") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
2024-10-11erofs: get rid of kaddr in `struct z_erofs_maprecorder`Gao Xiang1-20/+12
`kaddr` becomes useless after switching to metabuf. Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-10-11erofs: get rid of z_erofs_try_to_claim_pcluster()Gao Xiang1-20/+9
Just fold it into the caller for simplicity. Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-10-11erofs: ensure regular inodes for file-backed mountsGao Xiang1-3/+10
Only regular inodes are allowed for file-backed mounts, not directories (as seen in the original syzbot case) or special inodes. Also ensure that .read_folio() is implemented on the underlying fs for the primary device. Fixes: fb176750266a ("erofs: add file-backed mount support") Reported-by: [email protected] Closes: https://lore.kernel.org/r/[email protected] Tested-by: [email protected] Reviewed-by: Chao Yu <[email protected]> Signed-off-by: Gao Xiang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-10-10Merge tag 'for-6.12-rc2-tag' of ↵Linus Torvalds7-24/+76
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - update fstrim loop and add more cancellation points, fix reported delayed or blocked suspend if there's a huge chunk queued - fix error handling in recent qgroup xarray conversion - in zoned mode, fix warning printing device path without RCU protection - again fix invalid extent xarray state (6252690f7e1b), lost due to refactoring * tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix clear_dirty and writeback ordering in submit_one_sector() btrfs: zoned: fix missing RCU locking in error message when loading zone info btrfs: fix missing error handling when adding delayed ref with qgroups enabled btrfs: add cancellation points to trim loops btrfs: split remaining space to discard in chunks
2024-10-10Merge tag 'nfsd-6.12-1' of ↵Linus Torvalds3-4/+7
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix NFSD bring-up / shutdown - Fix a UAF when releasing a stateid * tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix possible badness in FREE_STATEID nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed NFSD: Mark filecache "down" if init fails
2024-10-10Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds14-260/+206
Pull xfs fixes from Carlos Maiolino: - A few small typo fixes - fstests xfs/538 DEBUG-only fix - Performance fix on blockgc on COW'ed files, by skipping trims on cowblock inodes currently opened for write - Prevent cowblocks to be freed under dirty pagecache during unshare - Update MAINTAINERS file to quote the new maintainer * tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix a typo xfs: don't free cowblocks from under dirty pagecache on unshare xfs: skip background cowblock trims on inodes open for write xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc xfs: don't ifdef around the exact minlen allocations xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split xfs: return bool from xfs_attr3_leaf_add xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate() xfs: scrub: convert comma to semicolon xfs: Remove empty declartion in header file MAINTAINERS: add Carlos Maiolino as XFS release manager
2024-10-10openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)Aleksa Sarai1-0/+2
While we do currently return -EFAULT in this case, it seems prudent to follow the behaviour of other syscalls like clone3. It seems quite unlikely that anyone depends on this error code being EFAULT, but we can always revert this if it turns out to be an issue. Cc: [email protected] # v5.6+ Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall") Signed-off-by: Aleksa Sarai <[email protected]> Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-3-d2833dfe6edd@cyphar.com Signed-off-by: Christian Brauner <[email protected]>
2024-10-09ksmbd: fix user-after-free from session log offNamjae Jeon4-6/+34
There is racy issue between smb2 session log off and smb2 session setup. It will cause user-after-free from session log off. This add session_lock when setting SMB2_SESSION_EXPIRED and referece count to session struct not to free session while it is being used. Cc: [email protected] # v5.15+ Reported-by: [email protected] # ZDI-CAN-25282 Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2024-10-09Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of ↵Linus Torvalds3-14/+46
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes, 5 of which are c:stable. All singletons, about half of which are MM" * tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: delete comments for "value" member of 'struct zswap_entry'. CREDITS: sort alphabetically by name secretmem: disable memfd_secret() if arch cannot set direct map .mailmap: update Fangrui's email mm/huge_memory: check pmd_special() only after pmd_present() resource, kunit: fix user-after-free in resource_test_region_intersects() fs/proc/kcore.c: allow translation of physical memory addresses selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test device-dax: correct pgoff align in dax_set_mapping() kthread: unpark only parked kthread Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN" bcachefs: do not use PF_MEMALLOC_NORECLAIM
2024-10-09bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entryKent Overstreet1-1/+2
inode_bit_waitqueue() is changing - this update clears the way for sched changes. Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Check if stuck in journal_res_get()Kent Overstreet1-0/+13
Like how we already do when the allocator seems to be stuck, check if we're waiting too long for a journal reservation and print some debug info. This is specifically to track down https://github.com/koverstreet/bcachefs/issues/656 which is showing up in userspace where we don't have sysfs/debugfs to get the journal debug info. Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Fix state lock involved deadlockAlan Huang1-6/+9
We increased write ref, if the fs went to RO, that would lead to a deadlock, it actually happens: 00171 ========= TEST generic/279 00171 00172 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdb): recovering from clean shutdown, journal seq 35 00172 bcachefs (vdb): accounting_read... done 00172 bcachefs (vdb): alloc_read... done 00172 bcachefs (vdb): stripes_read... done 00172 bcachefs (vdb): snapshots_read... done 00172 bcachefs (vdb): journal_replay... done 00172 bcachefs (vdb): resume_logged_ops... done 00172 bcachefs (vdb): going read-write 00172 bcachefs (vdb): done starting filesystem 00172 FSTYP -- bcachefs 00172 PLATFORM -- Linux/aarch64 farm3-kvm 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 SMP Tue Oct 8 14:15:12 UTC 2024 00172 MKFS_OPTIONS -- --nocow /dev/vdc 00172 MOUNT_OPTIONS -- /dev/vdc /mnt/scratch 00172 00172 bcachefs (vdc): starting version 1.12: rebalance_work_acct_fix opts=nocow 00172 bcachefs (vdc): initializing new filesystem 00172 bcachefs (vdc): going read-write 00172 bcachefs (vdc): marking superblocks 00172 bcachefs (vdc): initializing freespace 00172 bcachefs (vdc): done initializing freespace 00172 bcachefs (vdc): reading snapshots table 00172 bcachefs (vdc): reading snapshots done 00172 bcachefs (vdc): done starting filesystem 00173 bcachefs (vdc): shutting down 00173 bcachefs (vdc): going read-only 00173 bcachefs (vdc): finished waiting for writes to stop 00173 bcachefs (vdc): flushing journal and stopping allocators, journal seq 4 00173 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 6 00173 bcachefs (vdc): shutdown complete, journal seq 7 00173 bcachefs (vdc): marking filesystem clean 00173 bcachefs (vdc): shutdown complete 00173 bcachefs (vdb): shutting down 00173 bcachefs (vdb): going read-only 00361 INFO: task umount:6180 blocked for more than 122 seconds. 00361 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00361 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 00361 task:umount state:D stack:0 pid:6180 tgid:6180 ppid:6176 flags:0x00000004 00361 Call trace: 00362 __switch_to (arch/arm64/kernel/process.c:556) 00362 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00363 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00365 bch2_fs_read_only (fs/bcachefs/super.c:346 (discriminator 41)) 00367 __bch2_fs_stop (fs/bcachefs/super.c:620) 00368 bch2_put_super (fs/bcachefs/fs.c:1942) 00369 generic_shutdown_super (include/linux/list.h:373 (discriminator 2) fs/super.c:650 (discriminator 2)) 00371 bch2_kill_sb (fs/bcachefs/fs.c:2170) 00372 deactivate_locked_super (fs/super.c:434 fs/super.c:475) 00373 deactivate_super (fs/super.c:508) 00374 cleanup_mnt (fs/namespace.c:250 fs/namespace.c:1374) 00376 __cleanup_mnt (fs/namespace.c:1381) 00376 task_work_run (include/linux/sched.h:2024 kernel/task_work.c:224) 00377 do_notify_resume (include/linux/resume_user_mode.h:50 arch/arm64/kernel/entry-common.c:151) 00377 el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:171 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00377 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00378 el0t_64_sync (arch/arm64/kernel/entry.S:598) 00378 INFO: task tee:6182 blocked for more than 122 seconds. 00378 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 00378 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 00378 task:tee state:D stack:0 pid:6182 tgid:6182 ppid:533 flags:0x00000004 00378 Call trace: 00378 __switch_to (arch/arm64/kernel/process.c:556) 00378 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529) 00378 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621) 00378 schedule_preempt_disabled (kernel/sched/core.c:6680) 00379 rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 1)) 00379 down_read (kernel/locking/rwsem.c:1529) 00381 bch2_gc_gens (fs/bcachefs/sb-members.h:77 fs/bcachefs/sb-members.h:88 fs/bcachefs/sb-members.h:128 fs/bcachefs/btree_gc.c:1240) 00383 bch2_fs_store_inner (fs/bcachefs/sysfs.c:473) 00385 bch2_fs_internal_store (fs/bcachefs/sysfs.c:417 fs/bcachefs/sysfs.c:580 fs/bcachefs/sysfs.c:576) 00386 sysfs_kf_write (fs/sysfs/file.c:137) 00387 kernfs_fop_write_iter (fs/kernfs/file.c:334) 00389 vfs_write (fs/read_write.c:497 fs/read_write.c:590) 00390 ksys_write (fs/read_write.c:643) 00391 __arm64_sys_write (fs/read_write.c:652) 00391 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54) 00392 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2)) 00392 el0_svc (arch/arm64/include/asm/irqflags.h:55 arch/arm64/include/asm/irqflags.h:76 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713) 00392 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731) 00392 el0t_64_sync (arch/arm64/kernel/entry.S:598) Signed-off-by: Alan Huang <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Fix NULL pointer dereference in bch2_opt_to_textMohammed Anees1-1/+3
This patch adds a bounds check to the bch2_opt_to_text function to prevent NULL pointer dereferences when accessing the opt->choices array. This ensures that the index used is within valid bounds before dereferencing. The new version enhances the readability. Reported-and-tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5 Signed-off-by: Mohammed Anees <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Release transaction before wake upAlan Huang1-2/+3
We will get this if we wake up first: Kernel panic - not syncing: btree_node_write_done leaked btree_trans since there are still transactions waiting for cycle detectors after BTREE_NODE_write_in_flight is cleared. Signed-off-by: Alan Huang <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: add check for btree id against max in try read nodePiotr Zalewski1-0/+3
Add check for read node's btree_id against BTREE_ID_NR_MAX in try_read_btree_node to prevent triggering EBUG_ON condition in bch2_btree_id_root[1]. [1] https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00 Fixes: 4409b8081d16 ("bcachefs: Repair pass for scanning for btree nodes") Signed-off-by: Piotr Zalewski <[email protected]> Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Disk accounting device validation fixesKent Overstreet3-37/+118
- Fix failure to validate that accounting replicas entries point to valid devices: this wasn't a real bug since they'd be cleaned up by GC, but is still something we should know about - Fix failure to validate that dev_data_type entries point to valid devices: this does fix a real bug, since bch2_accounting_read() would then try to copy the counters to that device and pop an inconsistent error when the device didn't exist - Remove accounting entries that are zeroed or invalid: if we're not validating them we need to get rid of them: they might not exist in the superblock, so we need the to trigger the superblock mark path when they're readded. This fixes the replication.ktest rereplicate test, which was failing with "superblock not marked for replicas..." Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: bch2_inode_or_descendents_is_open()Kent Overstreet4-21/+103
fsck can now correctly check if inodes in interior snapshot nodes are open/in use. - Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed, meaning inums in different subvolumes will hash to the same slot. Note that this is a hack, and will cause problems if anyone ever has the same file in many different snapshots open all at the same time. - Then check if any of those subvolumes is a descendent of the snapshot ID being checked Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: Kill bch2_propagate_key_to_snapshot_leaves()Kent Overstreet2-100/+0
Dead code now. Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09bcachefs: bcachefs_metadata_version_inode_has_child_snapshotsKent Overstreet9-78/+302
There's an inherent race in taking a snapshot while an unlinked file is open, and then reattaching it in the child snapshot. In the interior snapshot node the file will appear unlinked, as though it should be deleted - it's not referenced by anything in that snapshot - but we can't delete it, because the file data is referenced by the child snapshot. This was being handled incorrectly with propagate_key_to_snapshot_leaves() - but that doesn't resolve the fundamental inconsistency of "this file looks like it should be deleted according to normal rules, but - ". To fix this, we need to fix the rule for when an inode is deleted. The previous rule, ignoring snapshots (there was no well-defined rule for with snapshots) was: Unlinked, non open files are deleted, either at recovery time or during online fsck The new rule is: Unlinked, non open files, that do not exist in child snapshots, are deleted. To make this work transactionally, we add a new inode flag, BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when considering whether to delete an inode, or put it on the deleted list. For transactional consistency, clearing it handled by the inode trigger: when deleting an inode we check if there are parent inodes which can now have the BCH_INODE_has_child_snapshot flag cleared. Signed-off-by: Kent Overstreet <[email protected]>
2024-10-09fs/proc/kcore.c: allow translation of physical memory addressesAlexander Gordeev1-2/+34
When /proc/kcore is read an attempt to read the first two pages results in HW-specific page swap on s390 and another (so called prefix) pages are accessed instead. That leads to a wrong read. Allow architecture-specific translation of memory addresses using kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That way an architecture can deal with specific physical memory ranges. Re-use the existing /dev/mem callback implementation on s390, which handles the described prefix pages swapping correctly. For other architectures the default callback is basically NOP. It is expected the condition (vaddr == __va(__pa(vaddr))) always holds true for KCORE_RAM memory type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Gordeev <[email protected]> Suggested-by: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-10-09bcachefs: do not use PF_MEMALLOC_NORECLAIMMichal Hocko2-12/+12
Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed bacause we can easily pus the allocation context down the chain without too much clutter. [[email protected]: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Jan Kara <[email protected]> # For vfs changes Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: James Morris <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Paul Moore <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Yafang Shao <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-10-09NFS: remove revoked delegation from server's delegation listDai Ngo1-0/+5
After the delegation is returned to the NFS server remove it from the server's delegations list to reduce the time it takes to scan this list. Network trace captured while running the below script shows the time taken to service the CB_RECALL increases gradually due to the overhead of traversing the delegation list in nfs_delegation_find_inode_server. The NFS server in this test is a Solaris server which issues CB_RECALL when receiving the all-zero stateid in the SETATTR. mount=/mnt/data for i in $(seq 1 20) do echo $i mkdir $mount/testtarfile$i time tar -C $mount/testtarfile$i -xf 5000_files.tar done Signed-off-by: Dai Ngo <[email protected]> Reviewed-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2024-10-09Merge tag 'unicode-fixes-6.12-rc3' of ↵Linus Torvalds2-3427/+3346
git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode fix from Gabriel Krisman Bertazi: - Handle code-points with the Ignorable property as regular character instead of treating them as an empty string (me) * tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: Don't special case ignorable code points
2024-10-09unicode: Don't special case ignorable code pointsGabriel Krisman Bertazi2-3427/+3346
We don't need to handle them separately. Instead, just let them decompose/casefold to themselves. Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
2024-10-09btrfs: fix clear_dirty and writeback ordering in submit_one_sector()Naohiro Aota1-7/+7
This commit is a replay of commit 6252690f7e1b ("btrfs: fix invalid mapping of extent xarray state"). We need to call btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that xarray DIRTY tag is cleared. With a refactoring commit 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission"), it screwed up and the order is reversed and causing the same hang. Fix the ordering now in submit_one_sector(). Fixes: 8189197425e7 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission") Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-09btrfs: zoned: fix missing RCU locking in error message when loading zone infoFilipe Manana1-1/+1
At btrfs_load_zone_info() we have an error path that is dereferencing the name of a device which is a RCU string but we are not holding a RCU read lock, which is incorrect. Fix this by using btrfs_err_in_rcu() instead of btrfs_err(). The problem is there since commit 08e11a3db098 ("btrfs: zoned: load zone's allocation offset"), back then at btrfs_load_block_group_zone_info() but then later on that code was factored out into the helper btrfs_load_zone_info() by commit 09a46725cc84 ("btrfs: zoned: factor out per-zone logic from btrfs_load_block_group_zone_info"). Fixes: 08e11a3db098 ("btrfs: zoned: load zone's allocation offset") Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Naohiro Aota <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-09xfs: fix a typoAndrew Kreimer1-1/+1
Fix a typo in comments. Signed-off-by: Andrew Kreimer <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
2024-10-09xfs: don't free cowblocks from under dirty pagecache on unshareBrian Foster3-7/+23
fallocate unshare mode explicitly breaks extent sharing. When a command completes, it checks the data fork for any remaining shared extents to determine whether the reflink inode flag and COW fork preallocation can be removed. This logic doesn't consider in-core pagecache and I/O state, however, which means we can unsafely remove COW fork blocks that are still needed under certain conditions. For example, consider the following command sequence: xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \ -c "pwrite 0 32k" -c "funshare 0 1k" <file> This allocates a data block at offset 0, shares it, and then overwrites it with a larger buffered write. The overwrite triggers COW fork preallocation, 32 blocks by default, which maps the entire 32k write to delalloc in the COW fork. All but the shared block at offset 0 remains hole mapped in the data fork. The unshare command redirties and flushes the folio at offset 0, removing the only shared extent from the inode. Since the inode no longer maps shared extents, unshare purges the COW fork before the remaining 28k may have written back. This leaves dirty pagecache backed by holes, which writeback quietly skips, thus leaving clean, non-zeroed pagecache over holes in the file. To verify, fiemap shows holes in the first 32k of the file and reads return different data across a remount: $ xfs_io -c "fiemap -v" <file> <file>: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS ... 1: [8..511]: hole 504 ... $ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........ $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........ To avoid this problem, make unshare follow the same rules used for background cowblock scanning and never purge the COW fork for inodes with dirty pagecache or in-flight I/O. Fixes: 46afb0628b86347 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
2024-10-08Merge tag 'ntfs3_for_6.12' of ↵Linus Torvalds14-197/+410
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 updates from Konstantin Komarov: "New: - implement fallocate for compressed files - add support for the compression attribute - optimize large writes to sparse files Fixes: - fix several potential deadlock scenarios - fix various internal bugs detected by syzbot - add checks before accessing NTFS structures during parsing - correct the format of output messages Refactoring: - replace fsparam_flag_no with fsparam_flag in options parser - remove unused functions and macros" * tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits) fs/ntfs3: Format output messages like others fs in kernel fs/ntfs3: Additional check in ntfs_file_release fs/ntfs3: Fix general protection fault in run_is_mapped_full fs/ntfs3: Sequential field availability check in mi_enum_attr() fs/ntfs3: Additional check in ni_clear() fs/ntfs3: Fix possible deadlock in mi_read ntfs3: Change to non-blocking allocation in ntfs_d_hash fs/ntfs3: Remove unused al_delete_le fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag fs/ntfs3: Add support for the compression attribute fs/ntfs3: Implement fallocate for compressed files fs/ntfs3: Make checks in run_unpack more clear fs/ntfs3: Add rough attr alloc_size check fs/ntfs3: Stale inode instead of bad fs/ntfs3: Refactor enum_rstbl to suppress static checker fs/ntfs3: Fix sparse warning in ni_fiemap fs/ntfs3: Fix warning possible deadlock in ntfs_set_state fs/ntfs3: Fix sparse warning for bigendian fs/ntfs3: Separete common code for file_read/write iter/splice ...
2024-10-08Merge patch series "fsdax/xfs: unshare range fixes for 6.12"Christian Brauner3-32/+45
Darrick J. Wong <[email protected]> says: This patchset fixes multiple data corruption bugs in the fallocate unshare range implementation for fsdax. * patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs: fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs Signed-off-by: Christian Brauner <[email protected]>
2024-10-07btrfs: fix missing error handling when adding delayed ref with qgroups enabledFilipe Manana1-9/+33
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(), if we fail to insert the qgroup record we don't error out, we ignore it. In fact we treat it as if there was no error and there was already an existing record - we don't distinguish between the cases where btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already existed and we can free the given record, and the case where it returns a negative error value, meaning the insertion into the xarray that is used to track records failed. Effectively we end up ignoring that we are lacking qgroup record in the dirty extents xarray, resulting in incorrect qgroup accounting. Fix this by checking for errors and return them to the callers. Fixes: 3cce39a8ca4e ("btrfs: qgroup: use xarray to track dirty extents in transaction") Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-07btrfs: add cancellation points to trim loopsLuca Stefani3-3/+14
There are reports that system cannot suspend due to running trim because the task responsible for trimming the device isn't able to finish in time, especially since we have a free extent discarding phase, which can trim a lot of unallocated space. There are no limits on the trim size (unlike the block group part). Since trime isn't a critical call it can be interrupted at any time, in such cases we stop the trim, report the amount of discarded bytes and return an error. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: [email protected] # 5.15+ Signed-off-by: Luca Stefani <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-07btrfs: split remaining space to discard in chunksLuca Stefani2-4/+21
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device, mostly empty although we will do the split according to our super block locations, the last super block ends at 256G, we can submit a huge discard for the range [256G, 8T), causing a large delay. Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in preparation of introduction of cancellation points to trim. The value of the chunk size is arbitrary, it can be higher or derived from actual device capabilities but we can't easily read that using bio_discard_limit(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: [email protected] # 5.15+ Signed-off-by: Luca Stefani <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-10-07fsdax: dax_unshare_iter needs to copy entire blocksDarrick J. Wong1-7/+27
The code that copies data from srcmap to iomap in dax_unshare_iter is very very broken, which bfoster's recent fsx changes have exposed. If the pos and len passed to dax_file_unshare are not aligned to an fsblock boundary, the iter pos and length in the _iter function will reflect this unalignment. dax_iomap_direct_access always returns a pointer to the start of the kmapped fsdax page, even if its pos argument is in the middle of that page. This is catastrophic for data integrity when iter->pos is not aligned to a page, because daddr/saddr do not point to the same byte in the file as iter->pos. Hence we corrupt user data by copying it to the wrong place. If iter->pos + iomap_length() in the _iter function not aligned to a page, then we fail to copy a full block, and only partially populate the destination block. This is catastrophic for data confidentiality because we expose stale pmem contents. Fix both of these issues by aligning copy_pos/copy_len to a page boundary (remember, this is fsdax so 1 fsblock == 1 base page) so that we always copy full blocks. We're not done yet -- there's no call to invalidate_inode_pages2_range, so programs that have the file range mmap'd will continue accessing the old memory mapping after the file metadata updates have completed. Be careful with the return value -- if the unshare succeeds, we still need to return the number of bytes that the iomap iter thinks we're operating on. Cc: [email protected] Fixes: d984648e428b ("fsdax,xfs: port unshare to fsdax") Signed-off-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/172796813328.1131942.16777025316348797355.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-10-07fsdax: remove zeroing code from dax_unshare_iterDarrick J. Wong1-8/+0
Remove the code in dax_unshare_iter that zeroes the destination memory because it's not necessary. If srcmap is unwritten, we don't have to do anything because that unwritten extent came from the regular file mapping, and unwritten extents cannot be shared. The same applies to holes. Furthermore, zeroing to unshare a mapping is just plain wrong because unsharing means copy on write, and we should be copying data. This is effectively a revert of commit 13dd4e04625f ("fsdax: unshare: zero destination if srcmap is HOLE or UNWRITTEN") Cc: [email protected] Signed-off-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/172796813311.1131942.16033376284752798632.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-10-07iomap: share iomap_unshare_iter predicate code with fsdaxDarrick J. Wong2-16/+17
The predicate code that iomap_unshare_iter uses to decide if it's really needs to unshare a file range mapping should be shared with the fsdax version, because right now they're opencoded and inconsistent. Note that we simplify the predicate logic a bit -- we no longer allow unsharing of inline data mappings, but there aren't any filesystems that allow shared inline data currently. This is a fix in the sense that it should have been ported to fsdax. Fixes: b53fdb215d13 ("iomap: improve shared block detection in iomap_unshare_iter") Signed-off-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/172796813294.1131942.15762084021076932620.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-10-07xfs: don't allocate COW extents when unsharing a holeDarrick J. Wong1-1/+1
It doesn't make sense to allocate a COW extent when unsharing a hole because holes cannot be shared. Fixes: 1f1397b7218d7 ("xfs: don't allocate into the data fork for an unshare request") Signed-off-by: Darrick J. Wong <[email protected]> Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-10-07netfs: In readahead, put the folio refs as soon extractedDavid Howells2-33/+16
netfslib currently defers dropping the ref on the folios it obtains during readahead to after it has started I/O on the basis that we can do it whilst we wait for the I/O to complete, but this runs the risk of the I/O collection racing with this in future. Furthermore, Matthew Wilcox strongly suggests that the refs should be dropped immediately, as readahead_folio() does (netfslib is using __readahead_batch() which doesn't drop the refs). Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Suggested-by: Matthew Wilcox <[email protected]> Signed-off-by: David Howells <[email protected]> Link: https://lore.kernel.org/r/[email protected] cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-10-07xfs: skip background cowblock trims on inodes open for writeBrian Foster1-8/+23
The background blockgc scanner runs on a 5m interval by default and trims preallocation (post-eof and cow fork) from inodes that are otherwise idle. Idle effectively means that iolock can be acquired without blocking and that the inode has no dirty pagecache or I/O in flight. This simple mechanism and heuristic has worked fairly well for post-eof speculative preallocations. Support for reflink and COW fork preallocations came sometime later and plugged into the same mechanism, with similar heuristics. Some recent testing has shown that COW fork preallocation may be notably more sensitive to blockgc processing than post-eof preallocation, however. For example, consider an 8GB reflinked file with a COW extent size hint of 1MB. A worst case fully randomized overwrite of this file results in ~8k extents of an average size of ~1MB. If the same workload is interrupted a couple times for blockgc processing (assuming the file goes idle), the resulting extent count explodes to over 100k extents with an average size <100kB. This is significantly worse than ideal and essentially defeats the COW extent size hint mechanism. While this particular test is instrumented, it reflects a fairly reasonable pattern in practice where random I/Os might spread out over a large period of time with varying periods of (in)activity. For example, consider a cloned disk image file for a VM or container with long uptime and variable and bursty usage. A background blockgc scan that races and processes the image file when it happens to be clean and idle can have a significant effect on the future fragmentation level of the file, even when still in use. To help combat this, update the heuristic to skip cowblocks inodes that are currently opened for write access during non-sync blockgc scans. This allows COW fork preallocations to persist for as long as possible unless otherwise needed for functional purposes (i.e. a sync scan), the file is idle and closed, or the inode is being evicted from cache. While here, update the comments to help distinguish performance oriented heuristics from the logic that exists to maintain functional correctness. Suggested-by: Darrick Wong <[email protected]> Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
2024-10-07xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_allocChristoph Hellwig1-1/+7
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation variant fails to drop into the lowmode last resort allocator, and thus can sometimes fail allocations for which the caller has a transaction block reservation. Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
2024-10-07xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btallocChristoph Hellwig1-48/+13
xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after doing the basic setup. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>