aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2024-09-01mm: remove PG_errorMatthew Wilcox (Oracle)1-1/+0
The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/hugetlb: remove hugetlb_follow_page_mask() leftoverDavid Hildenbrand1-9/+2
We removed hugetlb_follow_page_mask() in commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") but forgot to cleanup some leftovers. While at it, simplify the hugetlb comment, it's overly detailed and rather confusing. Stating that we may end up in there during coredumping is sufficient to explain the PF_DUMPCORE usage. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Muchun Song <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: move vma_shrink(), vma_expand() to internal headerLorenzo Stoakes1-75/+6
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code. To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h. The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable. Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01userfaultfd: move core VMA manipulation logic to mm/userfaultfd.cLorenzo Stoakes1-149/+11
Patch series "Make core VMA operations internal and testable", v4. There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there. More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself. This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h. Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h. This means we can then re-implement vma_internal.h in userland, adding shims for kernel mechanisms as required, allowing us to unit test internal VMA functionality. This testing is useful as opposed to an e.g. kunit implementation as this way we can avoid all external kernel side-effects while testing, run tests VERY quickly, and iterate on and debug problems quickly. Excitingly this opens the door to, in the future, recreating precise problems observed in production in userland and very quickly debugging problems that might otherwise be very difficult to reproduce. This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing. Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland. A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly. This patch (of 4): This patch forms part of a patch series intending to separate out VMA logic and render it testable from userspace, which requires that core manipulation functions be exposed in an mm/-internal header file. In order to do this, we must abstract APIs we wish to test, in this instance functions which ultimately invoke vma_modify(). This patch therefore moves all logic which ultimately invokes vma_modify() to mm/userfaultfd.c, trying to transfer code at a functional granularity where possible. [[email protected]: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/hugetlb: enforce that PMD PT sharing has split PMD PT locksDavid Hildenbrand1-0/+4
Sharing page tables between processes but falling back to per-MM page table locks cannot possibly work. So, let's make sure that we do have split PMD locks by adding a new Kconfig option and letting that depend on CONFIG_SPLIT_PMD_PTLOCKS. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: kvmalloc: align kvrealloc() with krealloc()Danilo Krummrich1-1/+1
Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [[email protected]: document concurrency restrictions] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Danilo Krummrich <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Christian König <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01Merge tag 'v6.11-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds7-45/+93
Pull smb client fixes from Steve French: - copy_file_range fix - two read fixes including read past end of file rc fix and read retry crediting fix - falloc zero range fix * tag 'v6.11-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix FALLOC_FL_ZERO_RANGE to preflush buffered part of target region cifs: Fix copy offload to flush destination region netfs, cifs: Fix handling of short DIO read cifs: Fix lack of credit renegotiation on read retry
2024-09-01Merge tag 'bcachefs-2024-08-21' of https://github.com/koverstreet/bcachefsLinus Torvalds5-116/+68
Push bcachefs fixes from Kent Overstreet: "The data corruption in the buffered write path is troubling; inode lock should not have been able to cause that... - Fix a rare data corruption in the rebalance path, caught as a nonce inconsistency on encrypted filesystems - Revert lockless buffered write path - Mark more errors as autofix" * tag 'bcachefs-2024-08-21' of https://github.com/koverstreet/bcachefs: bcachefs: Mark more errors as autofix bcachefs: Revert lockless buffered IO path bcachefs: Fix bch2_extents_match() false positive bcachefs: Fix failure to return error in data_update_index_update()
2024-08-31bcachefs: Mark more errors as autofixKent Overstreet1-5/+5
errors that are known to always be safe to fix should be autofix: this should be most errors even at this point, but that will need some thorough review. note that errors are still logged in the superblock, so we'll still know that they happened. Signed-off-by: Kent Overstreet <[email protected]>
2024-08-31bcachefs: Revert lockless buffered IO pathKent Overstreet2-110/+40
We had a report of data corruption on nixos when building installer images. https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334 It seems that writes are being dropped, but only when issued by QEMU, and possibly only in snapshot mode. It's undetermined if it's write calls are being dropped or dirty folios. Further testing, via minimizing the original patch to just the change that skips the inode lock on non appends/truncates, reveals that it really is just not taking the inode lock that causes the corruption: it has nothing to do with the other logic changes for preserving write atomicity in corner cases. It's also kernel config dependent: it doesn't reproduce with the minimal kernel config that ktest uses, but it does reproduce with nixos's distro config. Bisection the kernel config initially pointer the finger at page migration or compaction, but it appears that was erroneous; we haven't yet determined what kernel config option actually triggers it. Sadly it appears this will have to be reverted since we're getting too close to release and my plate is full, but we'd _really_ like to fully debug it. My suspicion is that this patch is exposing a preexisting bug - the inode lock actually covers very little in IO paths, and we have a different lock (the pagecache add lock) that guards against races with truncate here. Fixes: 7e64c86cdc6c ("bcachefs: Buffered write path now can avoid the inode lock") Signed-off-by: Kent Overstreet <[email protected]>
2024-09-01Merge tag 'nfsd-6.11-3' of ↵Linus Torvalds1-2/+9
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - One more write delegation fix * tag 'nfsd-6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party lease
2024-09-01Merge tag 'xfs-6.11-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds6-48/+114
Pull xfs fixes from Chandan Babu: - Do not call out v1 inodes with non-zero di_nlink field as being corrupt - Change xfs_finobt_count_blocks() to count "free inode btree" blocks rather than "inode btree" blocks - Don't report the number of trimmed bytes via FITRIM because the underlying storage isn't required to do anything and failed discard IOs aren't reported to the caller anyway - Fix incorrect setting of rm_owner field in an rmap query - Report missing disk offset range in an fsmap query - Obtain m_growlock when extending realtime section of the filesystem - Reset rootdir extent size hint after extending realtime section of the filesystem * tag 'xfs-6.11-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: reset rootdir extent size hint after growfsrt xfs: take m_growlock when running growfsrt xfs: Fix missing interval for missing_owner in xfs fsmap xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap code xfs: Fix the owner setting issue for rmap query in xfs fsmap xfs: don't bother reporting blocks trimmed via FITRIM xfs: xfs_finobt_count_blocks() walks the wrong btree xfs: fix folio dirtying for XFILE_ALLOC callers xfs: fix di_onlink checking for V1/V2 inodes
2024-08-30nfsd: fix nfsd4_deleg_getattr_conflict in presence of third party leaseNeilBrown1-2/+9
It is not safe to dereference fl->c.flc_owner without first confirming fl->fl_lmops is the expected manager. nfsd4_deleg_getattr_conflict() tests fl_lmops but largely ignores the result and assumes that flc_owner is an nfs4_delegation anyway. This is wrong. With this patch we restore the "!= &nfsd_lease_mng_ops" case to behave as it did before the change mentioned below. This is the same as the current code, but without any reference to a possible delegation. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: NeilBrown <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2024-08-30Merge tag 'execve-v6.11-rc6' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: - binfmt_elf_fdpic: fix AUXV size with ELF_HWCAP2 (Max Filippov) * tag 'execve-v6.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is defined
2024-08-30dcache: keep dentry_hashtable or d_hash_shift even when not usedStephen Brennan1-2/+7
The runtime constant feature removes all the users of these variables, allowing the compiler to optimize them away. It's quite difficult to extract their values from the kernel text, and the memory saved by removing them is tiny, and it was never the point of this optimization. Since the dentry_hashtable is a core data structure, it's valuable for debugging tools to be able to read it easily. For instance, scripts built on drgn, like the dentrycache script[1], rely on it to be able to perform diagnostics on the contents of the dcache. Annotate it as used, so the compiler doesn't discard it. Link: https://github.com/oracle-samples/drgn-tools/blob/3afc56146f54d09dfd1f6d3c1b7436eda7e638be/drgn_tools/dentry.py#L325-L355 [1] Fixes: e3c92e81711d ("runtime constants: add x86 architecture support") Signed-off-by: Stephen Brennan <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-08-28cifs: Fix FALLOC_FL_ZERO_RANGE to preflush buffered part of target regionDavid Howells1-2/+14
Under certain conditions, the range to be cleared by FALLOC_FL_ZERO_RANGE may only be buffered locally and not yet have been flushed to the server. For example: xfs_io -f -t -c "pwrite -S 0x41 0 4k" \ -c "pwrite -S 0x42 4k 4k" \ -c "fzero 0 4k" \ -c "pread -v 0 8k" /xfstest.test/foo will write two 4KiB blocks of data, which get buffered in the pagecache, and then fallocate() is used to clear the first 4KiB block on the server - but we don't flush the data first, which means the EOF position on the server is wrong, and so the FSCTL_SET_ZERO_DATA RPC fails (and xfs_io ignores the error), but then when we try to read it, we see the old data. Fix this by preflushing any part of the target region that above the server's idea of the EOF position to force the server to update its EOF position. Note, however, that we don't want to simply expand the file by moving the EOF before doing the FSCTL_SET_ZERO_DATA[*] because someone else might see the zeroed region or if the RPC fails we then have to try to clean it up or risk getting corruption. [*] And we have to move the EOF first otherwise FSCTL_SET_ZERO_DATA won't do what we want. This fixes the generic/008 xfstest. [!] Note: A better way to do this might be to split the operation into two parts: we only do FSCTL_SET_ZERO_DATA for the part of the range below the server's EOF and then, if that worked, invalidate the buffered pages for the part above the range. Fixes: 6b69040247e1 ("cifs/smb3: Fix data inconsistent when zero file range") Signed-off-by: David Howells <[email protected]> cc: Steve French <[email protected]> cc: Zhang Xiaoxu <[email protected]> cc: Pavel Shilovsky <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Rohith Surabattula <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
2024-08-29Merge tag 'nfsd-6.11-2' of ↵Linus Torvalds4-24/+49
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a number of crashers - Update email address for an NFSD reviewer * tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: fs/nfsd: fix update of inode attrs in CB_GETATTR nfsd: fix potential UAF in nfsd4_cb_getattr_release nfsd: hold reference to delegation when updating it for cb_getattr MAINTAINERS: Update Olga Kornievskaia's email address nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open nfsd: ensure that nfsd4_fattr_args.context is zeroed out
2024-08-29Merge tag 'for-6.11-rc5-tag' of ↵Linus Torvalds5-22/+27
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix use-after-free when submitting bios for read, after an error and partially submitted bio the original one is freed while it can be still be accessed again - fix fstests case btrfs/301, with enabled quotas wait for delayed iputs when flushing delalloc - fix periodic block group reclaim, an unitialized value can be returned if there are no block groups to reclaim - fix build warning (-Wmaybe-uninitialized) * tag 'for-6.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix uninitialized return value from btrfs_reclaim_sweep() btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk() btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in extent_fiemap() btrfs: run delayed iputs when flushing delalloc
2024-08-28cifs: Fix copy offload to flush destination regionDavid Howells1-17/+4
Fix cifs_file_copychunk_range() to flush the destination region before invalidating it to avoid potential loss of data should the copy fail, in whole or in part, in some way. Fixes: 7b2404a886f8 ("cifs: Fix flushing, invalidation and file size with copy_file_range()") Signed-off-by: David Howells <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Rohith Surabattula <[email protected]> cc: Matthew Wilcox <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
2024-08-28netfs, cifs: Fix handling of short DIO readDavid Howells2-10/+20
Short DIO reads, particularly in relation to cifs, are not being handled correctly by cifs and netfslib. This can be tested by doing a DIO read of a file where the size of read is larger than the size of the file. When it crosses the EOF, it gets a short read and this gets retried, and in the case of cifs, the retry read fails, with the failure being translated to ENODATA. Fix this by the following means: (1) Add a flag, NETFS_SREQ_HIT_EOF, for the filesystem to set when it detects that the read did hit the EOF. (2) Make the netfslib read assessment stop processing subrequests when it encounters one with that flag set. (3) Return rreq->transferred, the accumulated contiguous amount read to that point, to userspace for a DIO read. (4) Make cifs set the flag and clear the error if the read RPC returned ENODATA. (5) Make cifs set the flag and clear the error if a short read occurred without error and the read-to file position is now at the remote inode size. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
2024-08-28cifs: Fix lack of credit renegotiation on read retryDavid Howells6-16/+55
When netfslib asks cifs to issue a read operation, it prefaces this with a call to ->clamp_length() which cifs uses to negotiate credits, providing receive capacity on the server; however, in the event that a read op needs reissuing, netfslib doesn't call ->clamp_length() again as that could shorten the subrequest, leaving a gap. This causes the retried read to be done with zero credits which causes the server to reject it with STATUS_INVALID_PARAMETER. This is a problem for a DIO read that is requested that would go over the EOF. The short read will be retried, causing EINVAL to be returned to the user when it fails. Fix this by making cifs_req_issue_read() negotiate new credits if retrying (NETFS_SREQ_RETRYING now gets set in the read side as well as the write side in this instance). This isn't sufficient, however: the new credits might not be sufficient to complete the remainder of the read, so also add an additional field, rreq->actual_len, that holds the actual size of the op we want to perform without having to alter subreq->len. We then rely on repeated short reads being retried until we finish the read or reach the end of file and make a zero-length read. Also fix a couple of places where the subrequest start and length need to be altered by the amount so far transferred when being used. Fixes: 69c3c023af25 ("cifs: Implement netfslib hooks") Signed-off-by: David Howells <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
2024-08-28Merge tag 'v6.11-rc5-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-27/+43
Pull smb client fixes from Steve French: - two RDMA/smbdirect fixes and a minor cleanup - punch hole fix * tag 'v6.11-rc5-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Fix FALLOC_FL_PUNCH_HOLE support smb/client: fix rdma usage in smb2_async_writev() smb/client: remove unused rq_iter_size from struct smb_rqst smb/client: avoid dereferencing rdata=NULL in smb2_new_read_req()
2024-08-27btrfs: fix uninitialized return value from btrfs_reclaim_sweep()Filipe Manana2-13/+6
The return variable 'ret' at btrfs_reclaim_sweep() is never assigned if none of the space infos is reclaimable (for example if periodic reclaim is disabled, which is the default), so we return an undefined value. This can be fixed my making btrfs_reclaim_sweep() not return any value as well as do_reclaim_sweep() because: 1) do_reclaim_sweep() always returns 0, so we can make it return void; 2) The only caller of btrfs_reclaim_sweep() (btrfs_reclaim_bgs()) doesn't care about its return value, and in its context there's nothing to do about any errors anyway. Therefore remove the return value from btrfs_reclaim_sweep() and do_reclaim_sweep(). Fixes: e4ca3932ae90 ("btrfs: periodic block_group reclaim") Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-08-27xfs: reset rootdir extent size hint after growfsrtDarrick J. Wong1-0/+40
If growfsrt is run on a filesystem that doesn't have a rt volume, it's possible to change the rt extent size. If the root directory was previously set up with an inherited extent size hint and rtinherit, it's possible that the hint is no longer a multiple of the rt extent size. Although the verifiers don't complain about this, xfs_repair will, so if we detect this situation, log the root directory to clean it up. This is still racy, but it's better than nothing. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-27xfs: take m_growlock when running growfsrtDarrick J. Wong1-13/+25
Take the grow lock when we're expanding the realtime volume, like we do for the other growfs calls. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-27xfs: Fix missing interval for missing_owner in xfs fsmapZizhi Wo1-1/+23
In the fsmap query of xfs, there is an interval missing problem: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... BUG: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt [root@fedora ~]# Normally, we should be able to get [104, 107), but we got nothing. The problem is caused by shifting. The query for the problem-triggered scenario is for the missing_owner interval (e.g. freespace in rmapbt/ unknown space in bnobt), which is obtained by subtraction (gap). For this scenario, the interval is obtained by info->last. However, rec_daddr is calculated based on the start_block recorded in key[1], which is converted by calling XFS_BB_TO_FSBT. Then if rec_daddr does not exceed info->next_daddr, which means keys[1].fmr_physical >> (mp)->m_blkbb_log <= info->next_daddr, no records will be displayed. In the above example, 104 >> (mp)->m_blkbb_log = 12 and 107 >> (mp)->m_blkbb_log = 12, so the two are reduced to 0 and the gap is ignored: before calculate ----------------> after shifting 104(st) 107(ed) 12(st/ed) |---------| | sector size block size Resolve this issue by introducing the "end_daddr" field in xfs_getfsmap_info. This records |key[1].fmr_physical + key[1].length| at the granularity of sector. If the current query is the last, the rec_daddr is end_daddr to prevent missing interval problems caused by shifting. We only need to focus on the last query, because xfs disks are internally aligned with disk blocksize that are powers of two and minimum 512, so there is no problem with shifting in previous queries. After applying this patch, the above problem have been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 104 107' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [104..106]: free space 0 (104..106) 3 Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> [djwong: limit the range of end_addr correctly] Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-27xfs: use XFS_BUF_DADDR_NULL for daddrs in getfsmap codeDarrick J. Wong1-2/+2
Use XFS_BUF_DADDR_NULL (instead of a magic sentinel value) to mean "this field is null" like the rest of xfs. Cc: [email protected] Fixes: e89c041338ed6 ("xfs: implement the GETFSMAP ioctl") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-27Merge tag 'vfs-6.11-rc6.fixes' of ↵Linus Torvalds8-59/+79
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Ensure that backing files uses file->f_ops->splice_write() for splice netfs: - Revert the removal of PG_private_2 from netfs_release_folio() as cephfs still relies on this - When AS_RELEASE_ALWAYS is set on a mapping the folio needs to always be invalidated during truncation - Fix losing untruncated data in a folio by making letting netfs_release_folio() return false if the folio is dirty - Fix trimming of streaming-write folios in netfs_inval_folio() - Reset iterator before retrying a short read - Fix interaction of streaming writes with zero-point tracker afs: - During truncation afs currently calls truncate_setsize() which sets i_size, expands the pagecache and truncates it. The first two operations aren't needed because they will have already been done. So call truncate_pagecache() instead and skip the redundant parts overlayfs: - Fix checking of the number of allowed lower layers so 500 layers can actually be used instead of just 499 - Add missing '\n' to pr_err() output - Pass string to ovl_parse_layer() and thus allow it to be used for Opt_lowerdir as well pidfd: - Revert blocking the creation of pidfds for kthread as apparently userspace relies on this. Specifically, it breaks systemd during shutdown romfs: - Fix romfs_read_folio() to use the correct offset with folio_zero_tail()" * tag 'vfs-6.11-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix interaction of streaming writes with zero-point tracker netfs: Fix missing iterator reset on retry of short read netfs: Fix trimming of streaming-write folios in netfs_inval_folio() netfs: Fix netfs_release_folio() to say no if folio dirty afs: Fix post-setattr file edit to do truncation correctly mm: Fix missing folio invalidation calls during truncation ovl: ovl_parse_param_lowerdir: Add missed '\n' for pr_err ovl: fix wrong lowerdir number check for parameter Opt_lowerdir ovl: pass string to ovl_parse_layer() backing-file: convert to using fops->splice_write Revert "pidfd: prevent creation of pidfds for kthreads" romfs: fix romfs_read_folio() netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
2024-08-26bcachefs: Fix bch2_extents_match() false positiveKent Overstreet1-1/+22
This was caught as a very rare nonce inconsistency, on systems with encryption and replication (and tiering, or some form of rebalance operation running): [Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path [Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd [Wed Jul 17 13:30:03 2024] k: u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd [Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0 [Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only bch2_extents_match() was reporting true for extents that did not actually point to the same data. bch2_extent_match() iterates over pairs of pointers, looking for pointers that point to the same location on disk (with matching generation numbers). However one or both extents may have been trimmed (or merged) and they might not have the same disk offset: it corrects for this by subtracting the key offset and the checksum entry offset. However, this failed when an extent was immediately partially overwritten, and the new overwrite was allocated the next adjacent disk space. Normally, with compression off, this would never cause a bug, since the new extent would have to be immediately after the old extent for the pointer offsets to match, and the rebalance index update path is not looking for an extent outside the range of the extent it moved. However with compression enabled, extents take up less space on disk than they do in the btree index space - and spuriously matching after partial overwrite is possible. To fix this, add a secondary check, that strictly checks that the regions pointed to on disk overlap. https://github.com/koverstreet/bcachefs/issues/717 Signed-off-by: Kent Overstreet <[email protected]>
2024-08-26bcachefs: Fix failure to return error in data_update_index_update()Kent Overstreet1-0/+1
This fixes an assertion pop in io_write.c - if we don't return an error we're supposed to have completed all the btree updates. Signed-off-by: Kent Overstreet <[email protected]>
2024-08-27btrfs: fix a use-after-free when hitting errors inside btrfs_submit_chunk()Qu Wenruo1-8/+18
[BUG] There is an internal report that KASAN is reporting use-after-free, with the following backtrace: BUG: KASAN: slab-use-after-free in btrfs_check_read_bio+0xa68/0xb70 [btrfs] Read of size 4 at addr ffff8881117cec28 by task kworker/u16:2/45 CPU: 1 UID: 0 PID: 45 Comm: kworker/u16:2 Not tainted 6.11.0-rc2-next-20240805-default+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 Workqueue: btrfs-endio btrfs_end_bio_work [btrfs] Call Trace: dump_stack_lvl+0x61/0x80 print_address_description.constprop.0+0x5e/0x2f0 print_report+0x118/0x216 kasan_report+0x11d/0x1f0 btrfs_check_read_bio+0xa68/0xb70 [btrfs] process_one_work+0xce0/0x12a0 worker_thread+0x717/0x1250 kthread+0x2e3/0x3c0 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x11/0x20 Allocated by task 20917: kasan_save_stack+0x37/0x60 kasan_save_track+0x10/0x30 __kasan_slab_alloc+0x7d/0x80 kmem_cache_alloc_noprof+0x16e/0x3e0 mempool_alloc_noprof+0x12e/0x310 bio_alloc_bioset+0x3f0/0x7a0 btrfs_bio_alloc+0x2e/0x50 [btrfs] submit_extent_page+0x4d1/0xdb0 [btrfs] btrfs_do_readpage+0x8b4/0x12a0 [btrfs] btrfs_readahead+0x29a/0x430 [btrfs] read_pages+0x1a7/0xc60 page_cache_ra_unbounded+0x2ad/0x560 filemap_get_pages+0x629/0xa20 filemap_read+0x335/0xbf0 vfs_read+0x790/0xcb0 ksys_read+0xfd/0x1d0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Freed by task 20917: kasan_save_stack+0x37/0x60 kasan_save_track+0x10/0x30 kasan_save_free_info+0x37/0x50 __kasan_slab_free+0x4b/0x60 kmem_cache_free+0x214/0x5d0 bio_free+0xed/0x180 end_bbio_data_read+0x1cc/0x580 [btrfs] btrfs_submit_chunk+0x98d/0x1880 [btrfs] btrfs_submit_bio+0x33/0x70 [btrfs] submit_one_bio+0xd4/0x130 [btrfs] submit_extent_page+0x3ea/0xdb0 [btrfs] btrfs_do_readpage+0x8b4/0x12a0 [btrfs] btrfs_readahead+0x29a/0x430 [btrfs] read_pages+0x1a7/0xc60 page_cache_ra_unbounded+0x2ad/0x560 filemap_get_pages+0x629/0xa20 filemap_read+0x335/0xbf0 vfs_read+0x790/0xcb0 ksys_read+0xfd/0x1d0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 [CAUSE] Although I cannot reproduce the error, the report itself is good enough to pin down the cause. The call trace is the regular endio workqueue context, but the free-by-task trace is showing that during btrfs_submit_chunk() we already hit a critical error, and is calling btrfs_bio_end_io() to error out. And the original endio function called bio_put() to free the whole bio. This means a double freeing thus causing use-after-free, e.g.: 1. Enter btrfs_submit_bio() with a read bio The read bio length is 128K, crossing two 64K stripes. 2. The first run of btrfs_submit_chunk() 2.1 Call btrfs_map_block(), which returns 64K 2.2 Call btrfs_split_bio() Now there are two bios, one referring to the first 64K, the other referring to the second 64K. 2.3 The first half is submitted. 3. The second run of btrfs_submit_chunk() 3.1 Call btrfs_map_block(), which by somehow failed Now we call btrfs_bio_end_io() to handle the error 3.2 btrfs_bio_end_io() calls the original endio function Which is end_bbio_data_read(), and it calls bio_put() for the original bio. Now the original bio is freed. 4. The submitted first 64K bio finished Now we call into btrfs_check_read_bio() and tries to advance the bio iter. But since the original bio (thus its iter) is already freed, we trigger the above use-after free. And even if the memory is not poisoned/corrupted, we will later call the original endio function, causing a double freeing. [FIX] Instead of calling btrfs_bio_end_io(), call btrfs_orig_bbio_end_io(), which has the extra check on split bios and do the proper refcounting for cloned bios. Furthermore there is already one extra btrfs_cleanup_bio() call, but that is duplicated to btrfs_orig_bbio_end_io() call, so remove that label completely. Reported-by: David Sterba <[email protected]> Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios") CC: [email protected] # 6.6+ Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-08-26fs/nfsd: fix update of inode attrs in CB_GETATTRJeff Layton4-10/+26
Currently, we copy the mtime and ctime to the in-core inode and then mark the inode dirty. This is fine for certain types of filesystems, but not all. Some require a real setattr to properly change these values (e.g. ceph or reexported NFS). Fix this code to call notify_change() instead, which is the proper way to effect a setattr. There is one problem though: In this case, the client is holding a write delegation and has sent us attributes to update our cache. We don't want to break the delegation for this since that would defeat the purpose. Add a new ATTR_DELEG flag that makes notify_change bypass the try_break_deleg call. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2024-08-26binfmt_elf_fdpic: fix AUXV size calculation when ELF_HWCAP2 is definedMax Filippov1-0/+3
create_elf_fdpic_tables() does not correctly account the space for the AUX vector when an architecture has ELF_HWCAP2 defined. Prior to the commit 10e29251be0e ("binfmt_elf_fdpic: fix /proc/<pid>/auxv") it resulted in the last entry of the AUX vector being set to zero, but with that change it results in a kernel BUG. Fix that by adding one to the number of AUXV entries (nitems) when ELF_HWCAP2 is defined. Fixes: 10e29251be0e ("binfmt_elf_fdpic: fix /proc/<pid>/auxv") Cc: [email protected] Reported-by: Greg Ungerer <[email protected]> Closes: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Max Filippov <[email protected]> Tested-by: Greg Ungerer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-08-26nfsd: fix potential UAF in nfsd4_cb_getattr_releaseJeff Layton1-1/+1
Once we drop the delegation reference, the fields embedded in it are no longer safe to access. Do that last. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2024-08-26nfsd: hold reference to delegation when updating it for cb_getattrJeff Layton1-3/+7
Once we've dropped the flc_lock, there is nothing that ensures that the delegation that was found will still be around later. Take a reference to it while holding the lock and then drop it when we've finished with the delegation. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2024-08-26btrfs: initialize last_extent_end to fix -Wmaybe-uninitialized warning in ↵David Sterba1-1/+1
extent_fiemap() There's a warning (probably on some older compiler version): fs/btrfs/fiemap.c: warning: 'last_extent_end' may be used uninitialized in this function [-Wmaybe-uninitialized]: => 822:19 Initialize the variable to 0 although it's not necessary as it's either properly set or not used after an error. The called function is in the same file so this is a false alert but we want to fix all -Wmaybe-uninitialized reports. Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Geert Uytterhoeven <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-08-26xfs: Fix the owner setting issue for rmap query in xfs fsmapZizhi Wo1-1/+1
I notice a rmap query bug in xfs_io fsmap: [root@fedora ~]# xfs_io -c 'fsmap -vvvv' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 1: 253:16 [8..23]: per-AG metadata 0 (8..23) 16 2: 253:16 [24..39]: inode btree 0 (24..39) 16 3: 253:16 [40..47]: per-AG metadata 0 (40..47) 8 4: 253:16 [48..55]: refcount btree 0 (48..55) 8 5: 253:16 [56..103]: per-AG metadata 0 (56..103) 48 6: 253:16 [104..127]: free space 0 (104..127) 24 ...... Bug: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt [root@fedora ~]# Normally, we should be able to get one record, but we got nothing. The root cause of this problem lies in the incorrect setting of rm_owner in the rmap query. In the case of the initial query where the owner is not set, __xfs_getfsmap_datadev() first sets info->high.rm_owner to ULLONG_MAX. This is done to prevent any omissions when comparing rmap items. However, if the current ag is detected to be the last one, the function sets info's high_irec based on the provided key. If high->rm_owner is not specified, it should continue to be set to ULLONG_MAX; otherwise, there will be issues with interval omissions. For example, consider "start" and "end" within the same block. If high->rm_owner == 0, it will be smaller than the founded record in rmapbt, resulting in a query with no records. The main call stack is as follows: xfs_ioc_getfsmap xfs_getfsmap xfs_getfsmap_datadev_rmapbt __xfs_getfsmap_datadev info->high.rm_owner = ULLONG_MAX if (pag->pag_agno == end_ag) xfs_fsmap_owner_to_rmap // set info->high.rm_owner = 0 because fmr_owner == -1ULL dest->rm_owner = 0 // get nothing xfs_getfsmap_datadev_rmapbt_query The problem can be resolved by simply modify the xfs_fsmap_owner_to_rmap function internal logic to achieve. After applying this patch, the above problem have been solved: [root@fedora ~]# xfs_io -c 'fsmap -vvvv -d 0 3' /mnt EXT: DEV BLOCK-RANGE OWNER FILE-OFFSET AG AG-OFFSET TOTAL 0: 253:16 [0..7]: static fs metadata 0 (0..7) 8 Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Zizhi Wo <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-26xfs: don't bother reporting blocks trimmed via FITRIMDarrick J. Wong1-25/+11
Don't bother reporting the number of bytes that we "trimmed" because the underlying storage isn't required to do anything(!) and failed discard IOs aren't reported to the caller anyway. It's not like userspace can use the reported value for anything useful like adjusting the offset parameter of the next call, and it's not like anyone ever wrote a manpage about FITRIM's out parameters. Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Christoph Hellwig <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-26xfs: xfs_finobt_count_blocks() walks the wrong btreeDave Chinner1-1/+1
As a result of the factoring in commit 14dd46cf31f4 ("xfs: split xfs_inobt_init_cursor"), mount started taking a long time on a user's filesystem. For Anders, this made mount times regress from under a second to over 15 minutes for a filesystem with only 30 million inodes in it. Anders bisected it down to the above commit, but even then the bug was not obvious. In this commit, over 20 calls to xfs_inobt_init_cursor() were modified, and some we modified to call a new function named xfs_finobt_init_cursor(). If that takes you a moment to reread those function names to see what the rename was, then you have realised why this bug wasn't spotted during review. And it wasn't spotted on inspection even after the bisect pointed at this commit - a single missing "f" isn't the easiest thing for a human eye to notice.... The result is that xfs_finobt_count_blocks() now incorrectly calls xfs_inobt_init_cursor() so it is now walking the inobt instead of the finobt. Hence when there are lots of allocated inodes in a filesystem, mount takes a -long- time run because it now walks a massive allocated inode btrees instead of the small, nearly empty free inode btrees. It also means all the finobt space reservations are wrong, so mount could potentially given ENOSPC on kernel upgrade. In hindsight, commit 14dd46cf31f4 should have been two commits - the first to convert the finobt callers to the new API, the second to modify the xfs_inobt_init_cursor() API for the inobt callers. That would have made the bug very obvious during review. Fixes: 14dd46cf31f4 ("xfs: split xfs_inobt_init_cursor") Reported-by: Anders Blomdell <[email protected]> Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-26xfs: fix folio dirtying for XFILE_ALLOC callersDarrick J. Wong1-1/+1
willy pointed out that folio_mark_dirty is the correct function to use to mark an xfile folio dirty because it calls out to the mapping's aops to mark it dirty. For tmpfs this likely doesn't matter much since it currently uses nop_dirty_folio, but let's use the abstractions properly. Reported-by: [email protected] Fixes: 6907e3c00a40 ("xfs: add file_{get,put}_folio") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-26xfs: fix di_onlink checking for V1/V2 inodesDarrick J. Wong1-4/+10
"KjellR" complained on IRC that an old V4 filesystem suddenly stopped mounting after upgrading from 6.9.11 to 6.10.3, with the following splat when trying to read the rt bitmap inode: 00000000: 49 4e 80 00 01 02 00 01 00 00 00 00 00 00 00 00 IN.............. 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 43 d2 a9 da 21 0f d6 30 ........C...!..0 00000030: 43 d2 a9 da 21 0f d6 30 00 00 00 00 00 00 00 00 C...!..0........ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 02 00 00 00 00 00 00 00 04 00 00 00 00 ................ 00000060: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ As Dave Chinner points out, this is a V1 inode with both di_onlink and di_nlink set to 1 and di_flushiter == 0. In other words, this inode was formatted this way by mkfs and hasn't been touched since then. Back in the old days of xfsprogs 3.2.3, I observed that libxfs_ialloc would set di_nlink, but if the filesystem didn't have NLINK, it would then set di_version = 1. libxfs_iflush_int later sees the V1 inode and copies the value of di_nlink to di_onlink without zeroing di_onlink. Eventually this filesystem must have been upgraded to support NLINK because 6.10 doesn't support !NLINK filesystems, which is how we tripped over this old behavior. The filesystem doesn't have a realtime section, so that's why the rtbitmap inode has never been touched. Fix this by removing the di_onlink/di_nlink checking for all V1/V2 inodes because this is a muddy mess. The V3 inode handling code has always supported NLINK and written di_onlink==0 so keep that check. The removal of the V1 inode handling code when we dropped support for !NLINK obscured this old behavior. Reported-by: [email protected] Fixes: 40cb8613d612 ("xfs: check unused nlink fields in the ondisk inode") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Chandan Babu R <[email protected]>
2024-08-25btrfs: run delayed iputs when flushing delallocJosef Bacik1-0/+2
We have transient failures with btrfs/301, specifically in the part where we do for i in $(seq 0 10); do write 50m to file rm -f file done Sometimes this will result in a transient quota error, and it's because sometimes we start writeback on the file which results in a delayed iput, and thus the rm doesn't actually clean the file up. When we're flushing the quota space we need to run the delayed iputs to make sure all the unlinks that we think have completed have actually completed. This removes the small window where we could fail to find enough space in our quota. CC: [email protected] # 5.15+ Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
2024-08-25cifs: Fix FALLOC_FL_PUNCH_HOLE supportDavid Howells1-0/+22
The cifs filesystem doesn't quite emulate FALLOC_FL_PUNCH_HOLE correctly (note that due to lack of protocol support, it can't actually implement it directly). Whilst it will (partially) invalidate dirty folios in the pagecache, it doesn't write them back first, and so the EOF marker on the server may be lower than inode->i_size. This presents a problem, however, as if the punched hole invalidates the tail of the locally cached dirty data, writeback won't know it needs to move the EOF over to account for the hole punch (which isn't supposed to move the EOF). We could just write zeroes over the punched out region of the pagecache and write that back - but this is supposed to be a deallocatory operation. Fix this by manually moving the EOF over on the server after the operation if the hole punched would corrupt it. Note that the FSCTL_SET_ZERO_DATA RPC and the setting of the EOF should probably be compounded to stop a third party interfering (or, at least, massively reduce the chance). This was reproducible occasionally by using fsx with the following script: truncate 0x0 0x375e2 0x0 punch_hole 0x2f6d3 0x6ab5 0x375e2 truncate 0x0 0x3a71f 0x375e2 mapread 0xee05 0xcf12 0x3a71f write 0x2078e 0x5604 0x3a71f write 0x3ebdf 0x1421 0x3a71f * punch_hole 0x379d0 0x8630 0x40000 * mapread 0x2aaa2 0x85b 0x40000 fallocate 0x1b401 0x9ada 0x40000 read 0x15f2 0x7d32 0x40000 read 0x32f37 0x7a3b 0x40000 * The second "write" should extend the EOF to 0x40000, and the "punch_hole" should operate inside of that - but that depends on whether the VM gets in and writes back the data first. If it doesn't, the file ends up 0x3a71f in size, not 0x40000. Fixes: 31742c5a3317 ("enable fallocate punch hole ("fallocate -p") for SMB3") Signed-off-by: David Howells <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] Signed-off-by: Steve French <[email protected]>
2024-08-25smb/client: fix rdma usage in smb2_async_writev()Stefan Metzmacher1-20/+20
rqst.rq_iter needs to be truncated otherwise we'll also send the bytes into the stream socket... This is the logic behind rqst.rq_npages = 0, which was removed in "cifs: Change the I/O paths to use an iterator rather than a page list" (d08089f649a0cfb2099c8551ac47eef0cc23fdf2). Cc: [email protected] Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Reviewed-by: David Howells <[email protected]> Signed-off-by: Stefan Metzmacher <[email protected]> Signed-off-by: Steve French <[email protected]>
2024-08-25smb/client: remove unused rq_iter_size from struct smb_rqstStefan Metzmacher4-6/+0
Reviewed-by: David Howells <[email protected]> Fixes: d08089f649a0 ("cifs: Change the I/O paths to use an iterator rather than a page list") Signed-off-by: Stefan Metzmacher <[email protected]> Signed-off-by: Steve French <[email protected]>
2024-08-25smb/client: avoid dereferencing rdata=NULL in smb2_new_read_req()Stefan Metzmacher1-1/+1
This happens when called from SMB2_read() while using rdma and reaching the rdma_readwrite_threshold. Cc: [email protected] Fixes: a6559cc1d35d ("cifs: split out smb3_use_rdma_offload() helper") Reviewed-by: David Howells <[email protected]> Signed-off-by: Stefan Metzmacher <[email protected]> Signed-off-by: Steve French <[email protected]>
2024-08-25Merge tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefsLinus Torvalds25-192/+387
Pull bcachefs fixes from Kent Overstreet: - assorted syzbot fixes - some upgrade fixes for old (pre 1.0) filesystems - fix for moving data off a device that was switched to durability=0 after data had been written to it. - nocow deadlock fix - fix for new rebalance_work accounting * tag 'bcachefs-2024-08-24' of git://evilpiepirate.org/bcachefs: (28 commits) bcachefs: Fix rebalance_work accounting bcachefs: Fix failure to flush moves before sleeping in copygc bcachefs: don't use rht_bucket() in btree_key_cache_scan() bcachefs: add missing inode_walker_exit() bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() bcachefs: Fix double assignment in check_dirent_to_subvol() bcachefs: Fix refcounting in discard path bcachefs: Fix compat issue with old alloc_v4 keys bcachefs: Fix warning in bch2_fs_journal_stop() fs/super.c: improve get_tree() error message bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() bcachefs: Fix replay_now_at() assert bcachefs: Fix locking in bch2_ioc_setlabel() bcachefs: fix failure to relock in btree_node_fill() bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() bcachefs: unlock_long() before resort in journal replay bcachefs: fix missing bch2_err_str() bcachefs: fix time_stats_to_text() bcachefs: Fix bch2_bucket_gens_init() bcachefs: Fix bch2_trigger_alloc assert ...
2024-08-25Merge tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-10/+10
Pull smb server fixes from Steve French: - query directory flex array fix - fix potential null ptr reference in open - fix error message in some open cases - two minor cleanups * tag '6.11-rc5-server-fixes' of git://git.samba.org/ksmbd: smb/server: update misguided comment of smb2_allocate_rsp_buf() smb/server: remove useless assignment of 'file_present' in smb2_open() smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open() smb/server: fix return value of smb2_open() ksmbd: the buffer of smb2 query dir response has at least 1 byte
2024-08-24bcachefs: Fix rebalance_work accountingKent Overstreet5-27/+98
rebalance_work was keying off of the presence of rebelance_opts in the extent - but that was incorrect, we keep those around after rebalance for indirect extents since the inode's options are not directly available Fixes: 20ac515a9cc7 ("bcachefs: bch_acct_rebalance_work") Signed-off-by: Kent Overstreet <[email protected]>
2024-08-24bcachefs: Fix failure to flush moves before sleeping in copygcKent Overstreet1-1/+1
This fixes an apparent deadlock - rebalance would get stuck trying to take nocow locks because they weren't being released by copygc. Signed-off-by: Kent Overstreet <[email protected]>