aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2023-10-21Merge tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-3/+7
Pull iomap fix from Darrick Wong: - Fix a bug where a writev consisting of a bunch of sub-fsblock writes where the last buffer address is invalid could lead to an infinite loop * tag 'iomap-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fix short copy in iomap_write_iter()
2023-10-20Merge tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds5-21/+38
Pull NFS client fixes from Anna Schumaker: "Stable Fix: - Fix a pNFS hang in nfs4_evict_inode() Fixes: - Force update of suid/sgid bits after an NFS v4.2 ALLOCATE op - Fix a potential oops in nfs_inode_remove_request() - Check the validity of the layout pointer in ff_layout_mirror_prepare_stats() - Fix incorrectly marking the pNFS MDS with USE_PNFS_DS in some cases" * tag 'nfs-for-6.6-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS server pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_stats pNFS: Fix a hang in nfs4_evict_inode() NFS: Fix potential oops in nfs_inode_remove_request() nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE op
2023-10-20Merge tag 'fsnotify_for_v6.6-rc7' of ↵Linus Torvalds1-8/+17
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fanotify fix from Jan Kara: "Disable superblock / mount marks for filesystems that can encode file handles but not open them (currently only overlayfs). It is not clear the functionality is useful in any way so let's better disable it before someone comes up with some creative misuse" * tag 'fsnotify_for_v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: limit reporting of event with non-decodeable file handles
2023-10-19iomap: fix short copy in iomap_write_iter()Jan Stancek1-3/+7
Starting with commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace"), iomap_write_iter() can get into endless loop. This can be reproduced with LTP writev07 which uses partially valid iovecs: struct iovec wr_iovec[] = { { buffer, 64 }, { bad_addr, 64 }, { buffer + 64, 64 }, { buffer + 64 * 2, 64 }, }; commit bc1bb416bbb9 ("generic_perform_write()/iomap_write_actor(): saner logics for short copy") previously introduced the logic, which made short copy retry in next iteration with amount of "bytes" it managed to copy: if (unlikely(status == 0)) { /* * A short copy made iomap_write_end() reject the * thing entirely. Might be memory poisoning * halfway through, might be a race with munmap, * might be severe memory pressure. */ if (copied) bytes = copied; However, since 5d8edfb900d5 "bytes" is no longer carried into next iteration, because it is now always initialized at the beginning of the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with same value as previous iteration, making the loop retry same copy over and over, which leads to writev07 testcase hanging. Make next iteration retry with amount of bytes we managed to copy. Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") Signed-off-by: Jan Stancek <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2023-10-19Merge tag 'v6.6-rc7.vfs.fixes' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fix from Christian Brauner: "An openat() call from io_uring triggering an audit call can apparently cause the refcount of struct filename to be incremented from multiple threads concurrently during async execution, triggering a refcount underflow and hitting a BUG_ON(). That bug has been lurking around since at least v5.16 apparently. Switch to an atomic counter to fix that. The underflow check is downgraded from a BUG_ON() to a WARN_ON_ONCE() but we could easily remove that check altogether tbh" * tag 'v6.6-rc7.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: audit,io_uring: io_uring openat triggers audit reference count underflow
2023-10-19Merge tag 'ntfs3_for_6.6' of ↵Linus Torvalds16-82/+197
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 fixes from Konstantin Komarov: - memory leak - some logic errors, NULL dereferences - some code was refactored - more sanity checks * tag 'ntfs3_for_6.6' of https://github.com/Paragon-Software-Group/linux-ntfs3: fs/ntfs3: Avoid possible memory leak fs/ntfs3: Fix directory element type detection fs/ntfs3: Fix possible null-pointer dereference in hdr_find_e() fs/ntfs3: Fix OOB read in ntfs_init_from_boot fs/ntfs3: fix panic about slab-out-of-bounds caused by ntfs_list_ea() fs/ntfs3: Fix NULL pointer dereference on error in attr_allocate_frame() fs/ntfs3: Fix possible NULL-ptr-deref in ni_readpage_cmpr() fs/ntfs3: Do not allow to change label if volume is read-only fs/ntfs3: Add more info into /proc/fs/ntfs3/<dev>/volinfo fs/ntfs3: Refactoring and comments fs/ntfs3: Fix alternative boot searching fs/ntfs3: Allow repeated call to ntfs3_put_sbi fs/ntfs3: Use inode_set_ctime_to_ts instead of inode_set_ctime fs/ntfs3: Fix shift-out-of-bounds in ntfs_fill_super fs/ntfs3: fix deadlock in mark_as_free_ex fs/ntfs3: Add more attributes checks in mi_enum_attr() fs/ntfs3: Use kvmalloc instead of kmalloc(... __GFP_NOWARN) fs/ntfs3: Write immediately updated ntfs state fs/ntfs3: Add ckeck in ni_update_parent()
2023-10-19Merge tag 'for-6.6-rc6-tag' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Fix a bug in chunk size decision that could lead to suboptimal placement and filling patterns" * tag 'for-6.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix stripe length calculation for non-zoned data chunk allocation
2023-10-19fanotify: limit reporting of event with non-decodeable file handlesAmir Goldstein1-8/+17
Commit a95aef69a740 ("fanotify: support reporting non-decodeable file handles") merged in v6.5-rc1, added the ability to use an fanotify group with FAN_REPORT_FID mode to watch filesystems that do not support nfs export, but do know how to encode non-decodeable file handles, with the newly introduced AT_HANDLE_FID flag. At the time that this commit was merged, there were no filesystems in-tree with those traits. Commit 16aac5ad1fa9 ("ovl: support encoding non-decodable file handles"), merged in v6.6-rc1, added this trait to overlayfs, thus allowing fanotify watching of overlayfs with FAN_REPORT_FID mode. In retrospect, allowing an fanotify filesystem/mount mark on such filesystem in FAN_REPORT_FID mode will result in getting events with file handles, without the ability to resolve the filesystem objects from those file handles (i.e. no open_by_handle_at() support). For v6.6, the safer option would be to allow this mode for inode marks only, where the caller has the opportunity to use name_to_handle_at() at the time of setting the mark. In the future we can revise this decision. Fixes: a95aef69a740 ("fanotify: support reporting non-decodeable file handles") Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Jan Kara <[email protected]> Message-Id: <[email protected]>
2023-10-18NFSv4.1: fixup use EXCHGID4_FLAG_USE_PNFS_DS for DS serverOlga Kornievskaia1-2/+0
This patches fixes commit 51d674a5e488 "NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that commit was to mark EXCHANGE_ID to the DS with the appropriate flag. However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that to the MDS. Instead, just mark the pnfs path exclusively. Fixes: 51d674a5e488 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server") Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-10-18pNFS/flexfiles: Check the layout validity in ff_layout_mirror_prepare_statsTrond Myklebust1-7/+10
Ensure that we check the layout pointer and validity after dereferencing it in ff_layout_mirror_prepare_stats. Fixes: 08e2e5bc6c9a ("pNFS/flexfiles: Clean up layoutstats") Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-10-18pNFS: Fix a hang in nfs4_evict_inode()Trond Myklebust1-10/+23
We are not allowed to call pnfs_mark_matching_lsegs_return() without also holding a reference to the layout header, since doing so could lead to the reference count going to zero when we call pnfs_layout_remove_lseg(). This again can lead to a hang when we get to nfs4_evict_inode() and are unable to clear the layout pointer. pnfs_layout_return_unused_byserver() is guilty of this behaviour, and has been seen to trigger the refcount warning prior to a hang. Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode") Cc: [email protected] Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-10-15btrfs: fix stripe length calculation for non-zoned data chunk allocationZygo Blaxell1-1/+1
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct" broke data chunk allocations on non-zoned multi-device filesystems when using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk size and stripe length calculation" partially fixed that, and this patch completes the fix for that case. After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for a data chunk allocation on a non-zoned filesystem is: 1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was 1 GiB for non-zoned data chunks and not configurable. 2. btrfs_create_chunk calls gather_device_info which consumes and produces more fields of chunk_ctl. 3. gather_device_info multiplies ctl->max_stripe_len by ctl->dev_stripes (which is 1 in all cases except dup) and calls find_free_dev_extent with that number as num_bytes. 4. find_free_dev_extent locates the first dev_extent hole on a device which is at least as large as num_bytes. With default max_chunk_size from f6fca3917b4d, it finds the first hole which is longer than 10 GiB, or the largest hole if that hole is shorter than 10 GiB. This is different from the pre-f6fca3917b4d behavior, where num_bytes is 1 GiB, and find_free_dev_extent may choose a different hole. 5. gather_device_info repeats step 4 with all devices to find the first or largest dev_extent hole that can be allocated on each device. 6. gather_device_info sorts the device list by the hole size on each device, using total unallocated space on each device to break ties, then returns to btrfs_create_chunk with the list. 7. btrfs_create_chunk calls decide_stripe_size_regular. 8. decide_stripe_size_regular finds the largest stripe_len that fits across the first nr_devs device dev_extent holes that were found by gather_device_info (and satisfies other constraints on stripe_len that are not relevant here). 9. decide_stripe_size_regular caps the length of the stripe it computed at 1 GiB. This cap appeared in 5da431b71d4b to correct one of the other regressions introduced in f6fca3917b4d. 10. btrfs_create_chunk creates a new chunk with the above computed size and number of devices. At step 4, gather_device_info() has found a location where stripe up to 10 GiB in length could be allocated on several devices, and selected which devices should have a dev_extent allocated on them, but at step 9, only 1 GiB of the space that was found on each device can be used. This mismatch causes new suboptimal chunk allocation cases that did not occur in pre-f6fca3917b4d kernels. Consider a filesystem using raid1 profile with 3 devices. After some balances, device 1 has 10x 1 GiB unallocated space, while devices 2 and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of space, but distributed across different numbers of dev_extent holes. For visualization, let's ignore all the chunks that were allocated before this point, and focus on the remaining holes: Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated) Device 2: [__________] (10 GiB contig unallocated) Device 3: [__________] (10 GiB contig unallocated) Before f6fca3917b4d, the allocator would fill these optimally by allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3 ([13]), or 2 and 3 ([23]): [after 0 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [__________] (10 GiB) Device 3: [__________] (10 GiB) [after 1 chunk allocation] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] Device 2: [12] [_________] (9 GiB) Device 3: [__________] (10 GiB) [after 2 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [12] [_________] (9 GiB) Device 3: [13] [_________] (9 GiB) [after 3 chunk allocations] Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB) Device 2: [12] [12] [________] (8 GiB) Device 3: [13] [_________] (9 GiB) [...] [after 12 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 13 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB) [after 14 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB) [after 15 chunk allocations] Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full) Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full) Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full) This allocates all of the space with no waste. The sorting function used by gather_device_info considers free space holes above 1 GiB in length to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently long hole on each device, all the holes appear equal in the sort, and the comparison falls back to sorting devices by total free space. This keeps usable space on each device equal so they can all be filled completely. After f6fca3917b4d, the allocator prefers the devices with larger holes over the devices with more free space, so it makes bad allocation choices: [after 1 chunk allocation] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [_________] (9 GiB) Device 3: [23] [_________] (9 GiB) [after 2 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [________] (8 GiB) Device 3: [23] [23] [________] (8 GiB) [after 3 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [_______] (7 GiB) Device 3: [23] [23] [23] [_______] (7 GiB) [...] [after 9 chunk allocations] Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 10 chunk allocations] Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB) [after 11 chunk allocations] Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB) Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full) Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full) No further allocations are possible, with 8 GiB wasted (4 GiB of data space). The sort in gather_device_info now considers free space in holes longer than 1 GiB to be distinct, so it will prefer devices 2 and 3 over device 1 until all but 1 GiB is allocated on devices 2 and 3. At that point, with only 1 GiB unallocated on every device, the largest hole length on each device is equal at 1 GiB, so the sort finally moves to ordering the devices with the most free space, but by this time it is too late to make use of the free space on device 1. Note that it's possible to contrive a case where the pre-f6fca3917b4d allocator fails the same way, but these cases generally have extensive dev_extent fragmentation as a precondition (e.g. many holes of 768M in length on one device, and few holes 1 GiB in length on the others). With the regression in f6fca3917b4d, bad chunk allocation can occur even under optimal conditions, when all dev_extent holes are exact multiples of stripe_len in length, as in the example above. Also note that post-f6fca3917b4d kernels do treat dev_extent holes larger than 10 GiB as equal, so the bad behavior won't show up on a freshly formatted filesystem; however, as the filesystem ages and fills up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem can show up as a failure to balance after adding or removing devices, or an unexpected shortfall in available space due to unequal allocation. To fix the regression and make data chunk allocation work again, set ctl->max_stripe_len back to the original SZ_1G, or space_info->chunk_size if that's smaller (the latter can happen if the user set space_info->chunk_size to less than 1 GiB via sysfs, or it's a 32 MiB system chunk with a hardcoded chunk_size and stripe_len). While researching the background of the earlier commits, I found that an identical fix was already proposed at: https://lore.kernel.org/linux-btrfs/[email protected]/ The previous review missed one detail: ctl->max_stripe_len is used before decide_stripe_size_regular() is called, when it is too late for the changes in that function to have any effect. ctl->max_stripe_len is not used directly by decide_stripe_size_regular(), but the parameter does heavily influence the per-device free space data presented to the function. Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct") CC: [email protected] # 6.1+ Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Zygo Blaxell <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-10-15Merge tag 'ovl-fixes-6.6-rc6' of ↵Linus Torvalds2-69/+84
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: - Various fixes for regressions due to conversion to new mount api in v6.5 - Disable a new mount option syntax (append lowerdir) that was added in v6.5 because we plan to add a different lowerdir append syntax in v6.7 * tag 'ovl-fixes-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: temporarily disable appending lowedirs ovl: fix regression in showing lowerdir mount option ovl: fix regression in parsing of mount options with escaped comma fs: factor out vfs_parse_monolithic_sep() helper
2023-10-14Merge tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-7/+11
Pull smb server fixes from Steve French: - Fix for possible double free in RPC read - Add additional check to clarify smb2_open path and quiet Coverity - Fix incorrect error rsp in a compounding path - Fix to properly fail open of file with pending delete on close * tag '6.6-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix potential double free on smb2_read_pipe() error path ksmbd: fix Null pointer dereferences in ksmbd_update_fstate() ksmbd: fix wrong error response status by using set_smb2_rsp_status() ksmbd: not allow to open file if delelete on close bit is set
2023-10-14Merge tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2-74/+69
Pull smb client fixes from Steve French: - fix caching race with open_cached_dir and laundromat cleanup of cached dirs (addresses a problem spotted with xfstest run with directory leases enabled) - reduce excessive resource usage of laundromat threads * tag '6.6-rc5-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: prevent new fids from being removed by laundromat smb: client: make laundromat a delayed worker
2023-10-14ovl: temporarily disable appending lowedirsAmir Goldstein1-49/+3
Kernel v6.5 converted overlayfs to new mount api. As an added bonus, it also added a feature to allow appending lowerdirs using lowerdir=:/lower2,lowerdir=::/data3 syntax. This new syntax has raised some concerns regarding escaping of colons. We decided to try and disable this syntax, which hasn't been in the wild for so long and introduce it again in 6.7 using explicit mount options lowerdir+=/lower2,datadir+=/data3. Suggested-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/CAJfpegsr3A4YgF2YBevWa6n3=AcP7hNndG6EPMu3ncvV-AM71A@mail.gmail.com/ Fixes: b36a5780cb44 ("ovl: modify layer parameter parsing") Signed-off-by: Amir Goldstein <[email protected]>
2023-10-14Merge tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds5-5/+16
Pull xfs fixes from Chandan Babu: - Fix calculation of offset of AG's last block and its length - Update incore AG block count when shrinking an AG - Process free extents to busy list in FIFO order - Make XFS report its i_version as the STATX_CHANGE_COOKIE * tag 'xfs-6.6-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIE xfs: Remove duplicate include xfs: correct calculation for agend and blockcount xfs: process free extents to busy list in FIFO order xfs: adjust the incore perag block_count when shrinking
2023-10-14ovl: fix regression in showing lowerdir mount optionAmir Goldstein1-15/+23
Before commit b36a5780cb44 ("ovl: modify layer parameter parsing"), spaces and commas in lowerdir mount option value used to be escaped using seq_show_option(). In current upstream, when lowerdir value has a space, it is not escaped in /proc/mounts, e.g.: none /mnt overlay rw,relatime,lowerdir=l l,upperdir=u,workdir=w 0 0 which results in broken output of the mount utility: none on /mnt type overlay (rw,relatime,lowerdir=l) Store the original lowerdir mount options before unescaping and show them using the same escaping used for seq_show_option() in addition to escaping the colon separator character. Fixes: b36a5780cb44 ("ovl: modify layer parameter parsing") Signed-off-by: Amir Goldstein <[email protected]>
2023-10-13Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-clientLinus Torvalds3-5/+3
Pull ceph fixes from Ilya Dryomov: "Fixes for an overreaching WARN_ON, two error paths and a switch to kernel_connect() which recently grown protection against someone using BPF to rewrite the address. All but one marked for stable" * tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client: ceph: fix type promotion bug on 32bit systems libceph: use kernel_connect() ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr() ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13audit,io_uring: io_uring openat triggers audit reference count underflowDan Clash1-4/+5
An io_uring openat operation can update an audit reference count from multiple threads resulting in the call trace below. A call to io_uring_submit() with a single openat op with a flag of IOSQE_ASYNC results in the following reference count updates. These first part of the system call performs two increments that do not race. do_syscall_64() __do_sys_io_uring_enter() io_submit_sqes() io_openat_prep() __io_openat_prep() getname() getname_flags() /* update 1 (increment) */ __audit_getname() /* update 2 (increment) */ The openat op is queued to an io_uring worker thread which starts the opportunity for a race. The system call exit performs one decrement. do_syscall_64() syscall_exit_to_user_mode() syscall_exit_to_user_mode_prepare() __audit_syscall_exit() audit_reset_context() putname() /* update 3 (decrement) */ The io_uring worker thread performs one increment and two decrements. These updates can race with the system call decrement. io_wqe_worker() io_worker_handle_work() io_wq_submit_work() io_issue_sqe() io_openat() io_openat2() do_filp_open() path_openat() __audit_inode() /* update 4 (increment) */ putname() /* update 5 (decrement) */ __audit_uring_exit() audit_reset_context() putname() /* update 6 (decrement) */ The fix is to change the refcnt member of struct audit_names from int to atomic_t. kernel BUG at fs/namei.c:262! Call Trace: ... ? putname+0x68/0x70 audit_reset_context.part.0.constprop.0+0xe1/0x300 __audit_uring_exit+0xda/0x1c0 io_issue_sqe+0x1f3/0x450 ? lock_timer_base+0x3b/0xd0 io_wq_submit_work+0x8d/0x2b0 ? __try_to_del_timer_sync+0x67/0xa0 io_worker_handle_work+0x17c/0x2b0 io_wqe_worker+0x10a/0x350 Cc: [email protected] Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/ Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring") Signed-off-by: Dan Clash <[email protected]> Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-10-12ksmbd: fix potential double free on smb2_read_pipe() error pathNamjae Jeon1-1/+1
Fix new smatch warnings: fs/smb/server/smb2pdu.c:6131 smb2_read_pipe() error: double free of 'rpc_resp' Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12ksmbd: fix Null pointer dereferences in ksmbd_update_fstate()Namjae Jeon1-0/+3
Coverity Scan report the following one. This report is a false alarm. Because fp is never NULL when rc is zero. This patch add null check for fp in ksmbd_update_fstate to make alarm silence. *** CID 1568583: Null pointer dereferences (FORWARD_NULL) /fs/smb/server/smb2pdu.c: 3408 in smb2_open() 3402 path_put(&path); 3403 path_put(&parent_path); 3404 } 3405 ksmbd_revert_fsids(work); 3406 err_out1: 3407 if (!rc) { >>> CID 1568583: Null pointer dereferences (FORWARD_NULL) >>> Passing null pointer "fp" to "ksmbd_update_fstate", which dereferences it. 3408 ksmbd_update_fstate(&work->sess->file_table, fp, FP_INITED); 3409 rc = ksmbd_iov_pin_rsp(work, (void *)rsp, iov_len); 3410 } 3411 if (rc) { 3412 if (rc == -EINVAL) 3413 rsp->hdr.Status = STATUS_INVALID_PARAMETER; Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Reported-by: Coverity Scan <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12ksmbd: fix wrong error response status by using set_smb2_rsp_status()Namjae Jeon1-4/+5
set_smb2_rsp_status() after __process_request() sets the wrong error status. This patch resets all iov vectors and sets the error status on clean one. Fixes: e2b76ab8b5c9 ("ksmbd: add support for read compound") Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12ksmbd: not allow to open file if delelete on close bit is setNamjae Jeon1-2/+2
Cthon test fail with the following error. check for proper open/unlink operation nfsjunk files before unlink: -rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9 ./nfs2y8Jm9 open; unlink ret = 0 nfsjunk files after unlink: -rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9 data compare ok nfsjunk files after close: ls: cannot access './nfs2y8Jm9': No such file or directory special tests failed Cthon expect to second unlink failure when file is already unlinked. ksmbd can not allow to open file if flags of ksmbd inode is set with S_DEL_ON_CLS flags. Cc: [email protected] Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12ovl: fix regression in parsing of mount options with escaped commaAmir Goldstein1-0/+29
Ever since commit 91c77947133f ("ovl: allow filenames with comma"), the following example was legit overlayfs mount options: mount -t overlay overlay -o 'lowerdir=/tmp/a\,b/lower' /mnt The conversion to new mount api moved to using the common helper generic_parse_monolithic() and discarded the specialized ovl_next_opt() option separator. Bring back ovl_next_opt() and use vfs_parse_monolithic_sep() to fix the regression. Reported-by: Ryan Hendrickson <[email protected]> Closes: https://lore.kernel.org/r/[email protected]/ Fixes: 1784fbc2ed9c ("ovl: port to new mount api") Signed-off-by: Amir Goldstein <[email protected]>
2023-10-12fs: factor out vfs_parse_monolithic_sep() helperAmir Goldstein1-5/+29
Factor out vfs_parse_monolithic_sep() from generic_parse_monolithic(), so filesystems could use it with a custom option separator callback. Acked-by: Christian Brauner <[email protected]> Signed-off-by: Amir Goldstein <[email protected]>
2023-10-12smb: client: prevent new fids from being removed by laundromatPaulo Alcantara1-21/+35
Check if @cfid->time is set in laundromat so we guarantee that only fully cached fids will be selected for removal. While we're at it, add missing locks to protect access of @cfid fields in order to avoid races with open_cached_dir() and cfids_laundromat_worker(), respectively. Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12smb: client: make laundromat a delayed workerPaulo Alcantara2-55/+36
By having laundromat kthread processing cached directories on every second turned out to be overkill, especially when having multiple SMB mounts. Relax it by using a delayed worker instead that gets scheduled on every @dir_cache_timeout (default=30) seconds per tcon. This also fixes the 1s delay when tearing down tcon. Signed-off-by: Paulo Alcantara (SUSE) <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]>
2023-10-12xfs: reinstate the old i_version counter as STATX_CHANGE_COOKIEJeff Layton1-0/+5
The handling of STATX_CHANGE_COOKIE was moved into generic_fillattr in commit 0d72b92883c6 (fs: pass the request_mask to generic_fillattr), but we didn't account for the fact that xfs doesn't call generic_fillattr at all. Make XFS report its i_version as the STATX_CHANGE_COOKIE. Fixes: 0d72b92883c6 (fs: pass the request_mask to generic_fillattr) Signed-off-by: Jeff Layton <[email protected]> Reviewed-by: "Darrick J. Wong" <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2023-10-12xfs: Remove duplicate includeJiapeng Chong1-1/+0
./fs/xfs/scrub/xfile.c: xfs_format.h is included more than once. Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6209 Signed-off-by: Jiapeng Chong <[email protected]> Reviewed-by: "Darrick J. Wong" <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2023-10-12xfs: correct calculation for agend and blockcountShiyang Ruan1-3/+3
The agend should be "start + length - 1", then, blockcount should be "end + 1 - start". Correct 2 calculation mistakes. Also, rename "agend" to "range_agend" because it's not the end of the AG per se; it's the end of the dead region within an AG's agblock space. Fixes: 5cf32f63b0f4 ("xfs: fix the calculation for "end" and "length"") Signed-off-by: Shiyang Ruan <[email protected]> Reviewed-by: "Darrick J. Wong" <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2023-10-11Merge tag 'fs_for_v6.6-rc6' of ↵Linus Torvalds1-27/+39
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota regression fix from Jan Kara. * tag 'fs_for_v6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: quota: Fix slow quotaoff
2023-10-11Merge tag 'for-6.6-rc5-tag' of ↵Linus Torvalds3-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A revert of recent mount option parsing fix, this breaks mounts with security options. The second patch is a flexible array annotation" * tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() Revert "btrfs: reject unknown mount options early"
2023-10-11xfs: process free extents to busy list in FIFO orderDarrick J. Wong1-1/+2
When we're adding extents to the busy discard list, add them to the tail of the list so that we get FIFO order. For FITRIM commands, this means that we send discard bios sorted in order from longest to shortest, like we did before commit 89cfa899608fc. For transactions that are freeing extents, this puts them in the transaction's busy list in FIFO order as well, which shouldn't make any noticeable difference. Fixes: 89cfa899608fc ("xfs: reduce AGF hold times during fstrim operations") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2023-10-11xfs: adjust the incore perag block_count when shrinkingDarrick J. Wong1-0/+6
If we reduce the number of blocks in an AG, we must update the incore geometry values as well. Fixes: 0800169e3e2c9 ("xfs: Pre-calculate per-AG agbno geometry") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2023-10-11NFS: Fix potential oops in nfs_inode_remove_request()Scott Mayhew1-1/+3
Once a folio's private data has been cleared, it's possible for another process to clear the folio->mapping (e.g. via invalidate_complete_folio2 or evict_mapping_folio), so it wouldn't be safe to call nfs_page_to_inode() after that. Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios") Signed-off-by: Scott Mayhew <[email protected]> Reviewed-by: Benjamin Coddington <[email protected]> Tested-by: Benjamin Coddington <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-10-11nfs42: client needs to strip file mode's suid/sgid bit after ALLOCATE opDai Ngo1-1/+2
The Linux NFS server strips the SUID and SGID from the file mode on ALLOCATE op. Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to nfs_set_cache_invalid's argument to force update of the file mode suid/sgid bit. Suggested-by: Trond Myklebust <[email protected]> Signed-off-by: Dai Ngo <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Tested-by: Jeff Layton <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2023-10-11btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()Gustavo A. R. Silva2-2/+2
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version, to calculate the size for the allocation of the whole flexible structure, including of course, the flexible-array member. This code was found with the help of Coccinelle, and audited and fixed manually. Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-10-10Revert "btrfs: reject unknown mount options early"David Sterba1-4/+0
This reverts commit 5f521494cc73520ffac18ede0758883b9aedd018. The patch breaks mounts with security mount options like $ mount -o context=system_u:object_r:root_t:s0 /dev/sdX /mn mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdX, missing codepage or helper program, ... We cannot reject all unknown options in btrfs_parse_subvol_options() as intended, the security options can be present at this point and it's not possible to enumerate them in a future proof way. This means unknown mount options are silently accepted like before when the filesystem is mounted with either -o subvol=/path or as followup mounts of the same device. Reported-by: Shinichiro Kawasaki <[email protected] Signed-off-by: David Sterba <[email protected]>
2023-10-09ceph: fix type promotion bug on 32bit systemsDan Carpenter1-1/+1
In this code "ret" is type long and "src_objlen" is unsigned int. The problem is that on 32bit systems, when we do the comparison signed longs are type promoted to unsigned int. So negative error codes from do_splice_direct() are treated as success instead of failure. Cc: [email protected] Fixes: 1b0c3b9f91f0 ("ceph: re-org copy_file_range and fix some error paths") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Xiubo Li <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-10-09ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr()Luis Henriques1-1/+1
Before returning, function ceph_fname_to_usr() does a final IS_ERR() check in 'dir': if ((dir != fname->dir) && !IS_ERR(dir)) {...} This check is unnecessary because, if the 'dir' variable has changed to something other than 'fname->dir' (it's initial value), that error check has been performed already and, if there was indeed an error, it would have been returned immediately. Besides, this useless IS_ERR() is also confusing static analysis tools. Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/r/[email protected]/ Signed-off-by: Luis Henriques <[email protected]> Reviewed-by: Ilya Dryomov <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-10-09ceph: fix incorrect revoked caps assert in ceph_fill_file_size()Xiubo Li1-3/+1
When truncating the inode the MDS will acquire the xlock for the ifile Locker, which will revoke the 'Frwsxl' caps from the clients. But when the client just releases and flushes the 'Fw' caps to MDS, for exmaple, and once the MDS receives the caps flushing msg it just thought the revocation has finished. Then the MDS will continue truncating the inode and then issued the truncate notification to all the clients. While just before the clients receives the cap flushing ack they receive the truncation notification, the clients will detecte that the 'issued | dirty' is still holding the 'Fw' caps. Cc: [email protected] Link: https://tracker.ceph.com/issues/56693 Fixes: b0d7c2231015 ("ceph: introduce i_truncate_mutex") Signed-off-by: Xiubo Li <[email protected]> Reviewed-by: Milind Changire <[email protected]> Signed-off-by: Ilya Dryomov <[email protected]>
2023-10-08Merge tag '6.6-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds10-45/+151
Pull smb server fixes from Steve French: "Six SMB3 server fixes for various races found by RO0T Lab of Huawei: - Fix oops when racing between oplock break ack and freeing file - Simultaneous request fixes for parallel logoffs, and for parallel lock requests - Fixes for tree disconnect race, session expire race, and close/open race" * tag '6.6-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix race condition between tree conn lookup and disconnect ksmbd: fix race condition from parallel smb2 lock requests ksmbd: fix race condition from parallel smb2 logoff requests ksmbd: fix uaf in smb20_oplock_break_ack ksmbd: fix race condition with fp ksmbd: fix race condition between session lookup and expire
2023-10-07Merge tag '6.6-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds1-13/+13
Pull smb client fixes from Steve French: - protect cifs/smb3 socket connect from BPF address overwrite - fix case when directory leases disabled but wasting resources with unneeded thread on each mount * tag '6.6-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: do not start laundromat thread on nohandlecache smb: use kernel_connect() and kernel_bind()
2023-10-07Merge tag 'xfs-6.6-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds6-117/+311
Pull xfs fixes from Chandan Babu: - Prevent filesystem hang when executing fstrim operations on large and slow storage * tag 'xfs-6.6-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: abort fstrim if kernel is suspending xfs: reduce AGF hold times during fstrim operations xfs: move log discard work to xfs_discard.c
2023-10-06Merge tag 'for-6.6-rc4-tag' of ↵Linus Torvalds4-17/+47
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - reject unknown mount options - adjust transaction abort error message level - fix one more build warning with -Wmaybe-uninitialized - proper error handling in several COW-related cases * tag 'for-6.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: error out when reallocating block for defrag using a stale transaction btrfs: error when COWing block from a root that is being deleted btrfs: error out when COWing block using a stale transaction btrfs: always print transaction aborted messages with an error level btrfs: reject unknown mount options early btrfs: fix some -Wmaybe-uninitialized warnings in ioctl.c
2023-10-06quota: Fix slow quotaoffJan Kara1-27/+39
Eric has reported that commit dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide") heavily increases runtime of generic/270 xfstest for ext4 in nojournal mode. The reason for this is that ext4 in nojournal mode leaves dquots dirty until the last dqput() and thus the cleanup done in quota_release_workfn() has to write them all. Due to the way quota_release_workfn() is written this results in synchronize_srcu() call for each dirty dquot which makes the dquot cleanup when turning quotas off extremely slow. To be able to avoid synchronize_srcu() for each dirty dquot we need to rework how we track dquots to be cleaned up. Instead of keeping the last dquot reference while it is on releasing_dquots list, we drop it right away and mark the dquot with new DQ_RELEASING_B bit instead. This way we can we can remove dquot from releasing_dquots list when new reference to it is acquired and thus there's no need to call synchronize_srcu() each time we drop dq_list_lock. References: https://lore.kernel.org/all/ZRytn6CxFK2oECUt@debian-BULLSEYE-live-builder-AMD64 Reported-by: Eric Whitney <[email protected]> Fixes: dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide") CC: [email protected] Signed-off-by: Jan Kara <[email protected]>
2023-10-05Merge tag 'erofs-for-6.6-rc5-fixes' of ↵Linus Torvalds2-2/+5
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - Fix a memory leak issue when using LZMA global compressed deduplication - Fix empty device tags in flatdev mode - Update documentation for recent new features * tag 'erofs-for-6.6-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: update documentation erofs: allow empty device tags in flatdev mode erofs: fix memory leak of LZMA global compressed deduplication
2023-10-05Merge tag 'ovl-fixes-6.6-rc5' of ↵Linus Torvalds5-30/+28
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: - Fix for file reference leak regression - Fix for NULL pointer deref regression - Fixes for RCU-walk race regressions: Two of the fixes were taken from Al's RCU pathwalk race fixes series with his consent [1]. Note that unlike most of Al's series, these two patches are not about racing with ->kill_sb() and they are also very recent regressions from v6.5, so I think it's worth getting them into v6.5.y. There is also a fix for an RCU pathwalk race with ->kill_sb(), which may have been solved in vfs generic code as you suggested, but it also rids overlayfs from a nasty hack, so I think it's worth anyway. Link: https://lore.kernel.org/linux-fsdevel/20231003204749.GA800259@ZenIV/ [1] * tag 'ovl-fixes-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: fix NULL pointer defer when encoding non-decodable lower fid ovl: make use of ->layers safe in rcu pathwalk ovl: fetch inode once in ovl_dentry_revalidate_common() ovl: move freeing ovl_entry past rcu delay ovl: fix file reference leak when submitting aio
2023-10-04ksmbd: fix race condition between tree conn lookup and disconnectNamjae Jeon6-17/+90
if thread A in smb2_write is using work-tcon, other thread B use smb2_tree_disconnect free the tcon, then thread A will use free'd tcon. Time + Thread A | Thread A smb2_write | smb2_tree_disconnect | | | kfree(tree_conn) | // UAF! | work->tcon->share_conf | + This patch add state, reference count and lock for tree conn to fix race condition issue. Reported-by: luosili <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Steve French <[email protected]>