Commit graph

93206 commits

Author SHA1 Message Date
Trond Myklebust
f92214e4c3 NFS: Avoid unnecessary rescanning of the per-server delegation list
If the call to nfs_delegation_grab_inode() fails, we will not have
dropped any locks that require us to rescan the list.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-22 17:01:10 -04:00
Trond Myklebust
d72b796311 NFSv4: Fix clearing of layout segments in layoutreturn
Make sure that we clear the layout segments in cases where we see a
fatal error, and also in the case where the layout is invalid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-22 17:01:10 -04:00
Trond Myklebust
a017ad1313 NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations
We're seeing reports of soft lockups when iterating through the loops,
so let's add rescheduling points.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-22 17:01:10 -04:00
Jeff Layton
95832998fb nfs: fix bitmap decoder to handle a 3rd word
It only decodes the first two words at this point. Have it decode the
third word as well. Without this, the client doesn't send delegated
timestamps in the CB_GETATTR response.

With this change we also need to expand the on-stack bitmap in
decode_recallany_args to 3 elements, in case the server sends a larger
bitmap than expected.

Fixes: 43df7110f4 ("NFSv4: Add CB_GETATTR support for delegated attributes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-22 17:01:10 -04:00
Jeff Layton
cb78f9b7d0 nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS
The client doesn't properly request FATTR4_OPEN_ARGUMENTS in the initial
SERVER_CAPS getattr. Add FATTR4_WORD2_OPEN_ARGUMENTS to the initial
request.

Fixes: 707f13b3d0 (NFSv4: Add support for the FATTR4_OPEN_ARGUMENTS attribute)
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-08-22 17:01:09 -04:00
ChenXiaoSong
5e51224d2a smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock
The comments have typos, fix that to not confuse readers.

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 15:44:19 -05:00
Jeff Layton
f58bab6fd4 nfsd: ensure that nfsd4_fattr_args.context is zeroed out
If nfsd4_encode_fattr4 ends up doing a "goto out" before we get to
checking for the security label, then args.context will be set to
uninitialized junk on the stack, which we'll then try to free.
Initialize it early.

Fixes: f59388a579 ("NFSD: Add nfsd4_encode_fattr4_sec_label()")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-22 14:49:10 -04:00
Paulo Alcantara
ec68680411 smb: client: ignore unhandled reparse tags
Just ignore reparse points that the client can't parse rather than
bailing out and not opening the file or directory.

Reported-by: Marc <1marc1@gmail.com>
Closes: https://lore.kernel.org/r/CAMHwNVv-B+Q6wa0FEXrAuzdchzcJRsPKDDRrNaYZJd6X-+iJzw@mail.gmail.com
Fixes: 539aad7f14 ("smb: client: introduce ->parse_reparse_point()")
Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 12:37:16 -05:00
Steve French
15179cf280 smb3: fix problem unloading module due to leaked refcount on shutdown
The shutdown ioctl can leak a refcount on the tlink which can
prevent rmmod (unloading the cifs.ko) module from working.

Found while debugging xfstest generic/043

Fixes: 69ca1f5755 ("smb3: add dynamic tracepoints for shutdown ioctl")
Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 12:36:57 -05:00
ChenXiaoSong
2b7e0573a4 smb/server: update misguided comment of smb2_allocate_rsp_buf()
smb2_allocate_rsp_buf() will return other error code except -ENOMEM.

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 09:52:00 -05:00
ChenXiaoSong
0dd771b7d6 smb/server: remove useless assignment of 'file_present' in smb2_open()
The variable is already true here.

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 09:52:00 -05:00
ChenXiaoSong
4e8771a366 smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open()
null-ptr-deref will occur when (req_op_level == SMB2_OPLOCK_LEVEL_LEASE)
and parse_lease_state() return NULL.

Fix this by check if 'lease_ctx_info' is NULL.

Additionally, remove the redundant parentheses in
parse_durable_handle_context().

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 09:52:00 -05:00
ChenXiaoSong
2186a11653 smb/server: fix return value of smb2_open()
In most error cases, error code is not returned in smb2_open(),
__process_request() will not print error message.

Fix this by returning the correct value at the end of smb2_open().

Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 09:52:00 -05:00
Namjae Jeon
ce61b605a0 ksmbd: the buffer of smb2 query dir response has at least 1 byte
When STATUS_NO_MORE_FILES status is set to smb2 query dir response,
->StructureSize is set to 9, which mean buffer has 1 byte.
This issue occurs because ->Buffer[1] in smb2_query_directory_rsp to
flex-array.

Fixes: eb3e28c1e8 ("smb3: Replace smb2pdu 1-element arrays with flex-arrays")
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-22 09:52:00 -05:00
Kent Overstreet
a592cdf516 bcachefs: don't use rht_bucket() in btree_key_cache_scan()
rht_bucket() does strange complicated things when a rehash is in
progress.

Instead, just skip scanning when a rehash is in progress: scanning is
going to be more expensive (many more empty slots to cover), and some
sort of infinite loop is being observed

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 10:04:41 -04:00
Kent Overstreet
3e878fe5a0 bcachefs: add missing inode_walker_exit()
fix a small leak

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 10:04:41 -04:00
Kent Overstreet
87313ac1f1 bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop()
bch2_btree_key_cache_drop() evicts the key cache entry - it's used when
we're doing an update that bypasses the key cache, because for cache
coherency reasons a key can't be in the key cache unless it also exists
in the btree - i.e. creates have to bypass the cache.

After evicting, the path no longer points to a key cache key, and
relock() will always fail if should_be_locked is true.

Prep for improving path->should_be_locked assertions

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 03:12:57 -04:00
Yuesong Li
dedb2fe375 bcachefs: Fix double assignment in check_dirent_to_subvol()
ret was assigned twice in check_dirent_to_subvol(). Reported by cocci.

Signed-off-by: Yuesong Li <liyuesong@vivo.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:40:52 -04:00
Kent Overstreet
0b50b7313e bcachefs: Fix refcounting in discard path
bch_dev->io_ref does not protect against the filesystem going away;
bch_fs->writes does.

Thus the filesystem write ref needs to be the last ref we release.

Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
8ed823b192 bcachefs: Fix compat issue with old alloc_v4 keys
we allow new fields to be added to existing key types, and new versions
should treat them as being zeroed; this was not handled in
alloc_v4_validate.

Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
7f2de6947f bcachefs: Fix warning in bch2_fs_journal_stop()
j->last_empty_seq needs to match j->seq when the journal is empty

Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
06f67437ab fs/super.c: improve get_tree() error message
seeing an odd bug where we fail to correctly return an error from
.get_tree():

https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66

we need to be able to distinguish between accidently returning a
positive error (as implied by the log) and no error.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
bdbdd4759f bcachefs: Fix missing validation in bch2_sb_journal_v2_validate()
Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
cab18be695 bcachefs: Fix replay_now_at() assert
Journal replay, in the slowpath where we insert keys in journal order,
was inserting keys in the wrong order; keys from early repair come last.

Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
6575b8c987 bcachefs: Fix locking in bch2_ioc_setlabel()
Fixes: 7a254053a5 ("bcachefs: support FS_IOC_SETFSLABEL")
Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
5dbfc4ef72 bcachefs: fix failure to relock in btree_node_fill()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
3c5d0b72a8 bcachefs: fix failure to relock in bch2_btree_node_mem_alloc()
We weren't always so strict about trans->locked state - but now we are,
and new assertions are shaking some bugs out.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
1dceae4cc1 bcachefs: unlock_long() before resort in journal replay
Fix another SRCU splat - this one pretty harmless.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
cecc328240 bcachefs: fix missing bch2_err_str()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:23 -04:00
Kent Overstreet
b8db1bd802 bcachefs: fix time_stats_to_text()
Fixes: 7423330e30 ("bcachefs: prt_printf() now respects \r\n\t")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
c2a503f3e9 bcachefs: Fix bch2_bucket_gens_init()
Comparing the wrong bpos - this was missed because normally
bucket_gens_init() runs on brand new filesystems, but this bug caused it
to overwrite bucket_gens keys with 0s when upgrading ancient
filesystems.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
e150a7e89c bcachefs: Fix bch2_trigger_alloc assert
On testing on an old mangled filesystem, we missed a case.

Fixes: bd864bc2d9 ("bcachefs: Fix bch2_trigger_alloc when upgrading from old versions")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
49203a6b9d bcachefs: Fix failure to relock in btree_node_get()
discovered by new trans->locked asserts

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
548e7f5167 bcachefs: setting bcachefs_effective.* xattrs is a noop
bcachefs_effective.* xattrs show the options inherited from parent
directories (as well as explicitly set); this namespace is not for
setting bcachefs options.

Change the .set() handler to a noop so that if e.g. rsync is copying
xattrs it'll do the right thing, and only copy xattrs in the bcachefs.*
namespace. We don't want to return an error, because that will cause
rsync to bail out or get spammy.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
8cc0e50614 bcachefs: Fix "trying to move an extent, but nr_replicas=0"
data_update_init() does a bunch of complicated stuff to decide how many
replicas to add, since we only want to increase an extent's durability
on an explicit rereplicate, but extent pointers may be on devices with
different durability settings.

There was a corner case when evacuating a device that had been set to
durability=0 after data had been written to it, and extents on that
device had already been rereplicated - then evacuate only needs to drop
pointers on that device, not move them.

So the assert for !m->op.nr_replicas was spurious; this was a perfectly
legitimate case that needed to be handled.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Kent Overstreet
3f53d05041 bcachefs: bch2_data_update_init() cleanup
Factor out some helpers - this function has gotten much too big.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-22 02:07:22 -04:00
Linus Torvalds
5c6154ffd4 Changes since last update:
- Allow large folios on compressed inodes;
 
  - Fix invalid memory accesses if z_erofs_gbuf_growsize()
    partially fails;
 
  - Two minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmbF1s8RHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qqzghAAjSi/OITIcpjYe3GOxDNEbIxTFf+5d7ge
 66KRMBm5FG6xBaXOK4xhit2EdFsDVMAna6FjYyRJYpDDQX5MAUsifDeZeGZCKOV5
 KpUk2CfhIp1zi9CeVA5Tl6rfdz0b0SKz77uRK9BN8PWmCdbeenOKgK6HU06S5enC
 uPud9QegQELXad0bIe+uo+4zGQnqxwGMfkNEW+cfbp200m6kpig/SgttoXCtz1mm
 CEZpv87MlrQLbLtox+KcesFhYofxWdx/My5ckJVbaHGZdM29HKehRLIsBFX3di5N
 4OkGnFbh16lhx6s4a/b1pCeDNH1NcmP7uqCoGr+dCVKQ79glSZQvfiV4myQk7vRN
 4Orj+zq4aBSmYOvizIFMH++IGDioPBxkhKv3ReeCe4t/L8LHPkGEaeT+iM9b+b4n
 uK0faj6qREygf2rHIRr6ciq7GdxlMwdlF7AeNUIyCHNO84J39+GkAID3Nqs+vzuF
 wSJAuqp959WgJf2Co5ugb4vjbrpq8TKfZG0yk7KnIOWJkhAKDcIwta+ooloIYtD8
 5gHhMRodIgKjYKPOi9vx9cQPlISTMsqVWpRuyzH9vCvqNvkzDriAez1kFq01tZTO
 gcK5QGekSgft/5h7PLFrbVV1kVfVdjNapsIY+XAEmBGGCJCScCrlIxjbsRwsseFW
 JIpQXilIU3c=
 =gFZW
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "As I mentioned in the merge window pull request, there is a regression
  which could cause system hang due to page migration. The corresponding
  fix landed upstream through MM tree last week (commit 2e6506e1c4:
  "mm/migrate: fix deadlock in migrate_pages_batch() on large folios"),
  therefore large folios can be safely allowed for compressed inodes and
  stress tests have been running on my fleet for over 20 days without
  any regression. Users have explicitly requested this for months, so
  let's allow large folios for EROFS full cases now for wider testing.

  Additionally, there is a fix which addresses invalid memory accesses
  on a failure path triggered by fault injection and two minor cleanups
  to simplify the codebase.

  Summary:

   - Allow large folios on compressed inodes

   - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially
     fails

   - Two minor cleanups"

* tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails
  erofs: allow large folios for compressed files
  erofs: get rid of check_layout_compatibility()
  erofs: simplify readdir operation
2024-08-22 06:06:09 +08:00
Christian Brauner
524b2c6dc8
romfs: fix romfs_read_folio()
Add the correct offset to folio_zero_tail().

Fixes: d86f2de026 ("romfs: Convert romfs_read_folio() to use a folio")
Reported-by: Greg Ungerer <gregungerer@westnet.com.au>
Link: https://lore.kernel.org/r/Zr0GTnPHfeA0P8nb@casper.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21 22:32:58 +02:00
David Howells
92764e8822
netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty"
This partially reverts commit 2ff1e97587.

In addition to reverting the removal of PG_private_2 wrangling from the
buffered read code[1][2], the removal of the waits for PG_private_2 from
netfs_release_folio() and netfs_invalidate_folio() need reverting too.

It also adds a wait into ceph_evict_inode() to wait for netfs read and
copy-to-cache ops to complete.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2]
Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com
cc: Max Kellermann <max.kellermann@ionos.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-21 22:32:58 +02:00
Gao Xiang
0005e01e1e erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails
If z_erofs_gbuf_growsize() partially fails on a global buffer due to
memory allocation failure or fault injection (as reported by syzbot [1]),
new pages need to be freed by comparing to the existing pages to avoid
memory leaks.

However, the old gbuf->pages[] array may not be large enough, which can
lead to null-ptr-deref or out-of-bound access.

Fix this by checking against gbuf->nrpages in advance.

[1] https://lore.kernel.org/r/000000000000f7b96e062018c6e3@google.com

Reported-by: syzbot+242ee56aaa9585553766@syzkaller.appspotmail.com
Fixes: d6db47e571 ("erofs: do not use pagepool in z_erofs_gbuf_growsize()")
Cc: <stable@vger.kernel.org> # 6.10+
Reviewed-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240820085619.1375963-1-hsiangkao@linux.alibaba.com
2024-08-21 08:12:05 +08:00
Kent Overstreet
2102bdac67 bcachefs: Extra debug for data move path
We don't have sufficient information to debug:

https://github.com/koverstreet/bcachefs/issues/726

- print out durability of extent ptrs, when non default
- print the number of replicas we need in data_update_to_text()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-19 21:42:08 -04:00
Gao Xiang
e080a26725 erofs: allow large folios for compressed files
As commit 2e6506e1c4 ("mm/migrate: fix deadlock in
migrate_pages_batch() on large folios") has landed upstream, large
folios can be safely enabled for compressed inodes since all
prerequisites have already landed in 6.11-rc1.

Stress tests has been running on my fleet for over 20 days without any
regression.  Additionally, users [1] have requested it for months.
Let's allow large folios for EROFS full cases upstream now for wider
testing.

[1] https://lore.kernel.org/r/CAGsJ_4wtE8OcpinuqVwG4jtdx6Qh5f+TON6wz+4HMCq=A2qFcA@mail.gmail.com

Cc: Barry Song <21cnbao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
[ Gao Xiang: minor commit typo fixes. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240819025207.3808649-1-hsiangkao@linux.alibaba.com
2024-08-19 16:10:04 +08:00
Hongzhen Luo
2c534624ae erofs: get rid of check_layout_compatibility()
Simple enough to just open-code it.

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240806112208.150323-1-hongzhen@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-08-19 11:06:20 +08:00
Hongzhen Luo
5b5c96c63d erofs: simplify readdir operation
- Use i_size instead of i_size_read() due to immutable fses;

 - Get rid of an unneeded goto since erofs_fill_dentries() also works;

 - Remove unnecessary lines.

Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2024-08-19 11:06:20 +08:00
Kent Overstreet
47cdc7b144 bcachefs: Fix incorrect gfp flags
fixes:
00488 WARNING: CPU: 9 PID: 194 at mm/page_alloc.c:4410 __alloc_pages_noprof+0x1818/0x1888
00488 Modules linked in:
00488 CPU: 9 UID: 0 PID: 194 Comm: kworker/u66:1 Not tainted 6.11.0-rc1-ktest-g18fa10d6495f #2931
00488 Hardware name: linux,dummy-virt (DT)
00488 Workqueue: writeback wb_workfn (flush-bcachefs-2)
00488 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00488 pc : __alloc_pages_noprof+0x1818/0x1888
00488 lr : __alloc_pages_noprof+0x5f4/0x1888
00488 sp : ffffff80ccd8ed00
00488 x29: ffffff80ccd8ed00 x28: 0000000000000000 x27: dfffffc000000000
00488 x26: 0000000000000010 x25: 0000000000000002 x24: 0000000000000000
00488 x23: 0000000000000000 x22: 1ffffff0199b1dbe x21: ffffff80cc680900
00488 x20: 0000000000000000 x19: ffffff80ccd8eed0 x18: 0000000000000000
00488 x17: ffffff80cc58a010 x16: dfffffc000000000 x15: 1ffffff00474e518
00488 x14: 1ffffff00474e518 x13: 1ffffff00474e518 x12: ffffffb8104701b9
00488 x11: 1ffffff8104701b8 x10: ffffffb8104701b8 x9 : ffffffc08043cde8
00488 x8 : 00000047efb8fe48 x7 : ffffff80ccd8ee20 x6 : 0000000000048000
00488 x5 : 1ffffff810470138 x4 : 0000000000000050 x3 : 1ffffff0199b1d94
00488 x2 : ffffffb0199b1d94 x1 : 0000000000000001 x0 : ffffffc082387448
00488 Call trace:
00488  __alloc_pages_noprof+0x1818/0x1888
00488  new_slab+0x284/0x2f0
00488  ___slab_alloc+0x208/0x8e0
00488  __kmalloc_noprof+0x328/0x340
00488  __bch2_writepage+0x106c/0x1830
00488  write_cache_pages+0xa0/0xe8

due to __GFP_NOFAIL without allowing reclaim

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:42:05 -04:00
Kent Overstreet
d9f49c3106 bcachefs: fix field-spanning write warning
attempts to retrofit memory safety onto C are increasingly annoying

------------[ cut here ]------------
memcpy: detected field-spanning write (size 4) of single field "&k.replicas" at fs/bcachefs/replicas.c:454 (size 3)
WARNING: CPU: 5 PID: 6525 at fs/bcachefs/replicas.c:454 bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
Modules linked in: dm_mod tun nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay msr sctp bcachefs lz4hc_compress lz4_compress libcrc32c xor raid6_pq lz4_decompress pps_ldisc pps_core wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic libchacha sit tunnel4 ip_tunnel af_packet bridge stp llc ip6table_nat ip6table_filter ip6_tables xt_MASQUERADE xt_conntrack iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tcp_bbr sch_fq_codel efivarfs nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet r8152 input_leds joydev mii amdgpu mousedev hid_generic usbhid hid ath10k_pci amd_atl edac_mce_amd ath10k_core kvm_amd ath kvm mac80211 bfq crc32_pclmul crc32c_intel polyval_clmulni polyval_generic sha512_ssse3 sha256_ssse3 sha1_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg i2c_algo_bit drm_exec snd_hda_codec r8169 drm_suballoc_helper
aesni_intel gf128mul crypto_simd amdxcp realtek mfd_core tpm_crb drm_buddy snd_hwdep mdio_devres libarc4 cryptd tpm_tis wmi_bmof cfg80211 evdev libphy snd_hda_core tpm_tis_core gpu_sched rapl xhci_pci xhci_hcd snd_pcm drm_display_helper snd_timer tpm sp5100_tco rfkill efi_pstore mpt3sas drm_ttm_helper ahci usbcore libaescfb ccp snd ttm 8250 libahci watchdog soundcore raid_class sha1_generic acpi_cpufreq k10temp 8250_base usb_common scsi_transport_sas i2c_piix4 hwmon video serial_mctrl_gpio serial_base ecdh_generic wmi rtc_cmos backlight ecc gpio_amdpt rng_core gpio_generic button
CPU: 5 UID: 0 PID: 6525 Comm: bcachefs Tainted: G        W          6.11.0-rc1-ojab-00058-g224bc118aec9 #6 6d5debde398d2a84851f42ab300dae32c2992027
Tainted: [W]=WARN
RIP: 0010:bch2_replicas_gc2+0x2cb/0x400 [bcachefs]
Code: c7 c2 60 91 d1 c1 48 89 c6 48 c7 c7 98 91 d1 c1 4c 89 14 24 44 89 5c 24 08 48 89 44 24 20 c6 05 fa 68 04 00 01 e8 05 a3 40 e4 <0f> 0b 4c 8b 14 24 44 8b 5c 24 08 48 8b 44 24 20 e9 55 fe ff ff 8b
RSP: 0018:ffffb434c9263d60 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a8efa79cc00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffb434c9263de0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
R13: ffff9a8efa73c300 R14: ffff9a8d9e880000 R15: ffff9a8d9e8806f8
FS:  0000000000000000(0000) GS:ffff9a9410c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000565423373090 CR3: 0000000164e30000 CR4: 00000000003506f0
Call Trace:
<TASK>
? __warn+0x97/0x150
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? report_bug+0x196/0x1c0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x17/0x80
? __wake_up_klogd.part.0+0x4c/0x80
? asm_exc_invalid_op+0x16/0x20
? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_replicas_gc2+0x2cb/0x400:
bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3)
? bch2_dev_usage_read+0xa0/0xa0 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3]
bch2_dev_usage_read+0xa0/0xa0:
discard_in_flight_remove at /home/ojab/src/bcachefs/fs/bcachefs/alloc_background.c:1712

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:42:05 -04:00
Kent Overstreet
d6d539c9a7 bcachefs: Reallocate table when we're increasing size
Fixes: c2f6e16a67 ("bcachefs: Increase size of cuckoo hash table on too many rehashes")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-18 20:41:50 -04:00
Thorsten Blum
7c525dddbe ksmbd: Replace one-element arrays with flexible-array members
Replace the deprecated one-element arrays with flexible-array members
in the structs filesystem_attribute_info and filesystem_device_info.

There are no binary differences after this conversion.

Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-18 17:02:36 -05:00
Namjae Jeon
76e98a158b ksmbd: fix race condition between destroy_previous_session() and smb2 operations()
If there is ->PreviousSessionId field in the session setup request,
The session of the previous connection should be destroyed.
During this, if the smb2 operation requests in the previous session are
being processed, a racy issue could happen with ksmbd_destroy_file_table().
This patch sets conn->status to KSMBD_SESS_NEED_RECONNECT to block
incoming  operations and waits until on-going operations are complete
(i.e. idle) before desctorying the previous session.

Fixes: c8efcc7861 ("ksmbd: add support for durable handles v1/v2")
Cc: stable@vger.kernel.org # v6.6+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25040
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-18 17:02:36 -05:00
Namjae Jeon
dfd046d0ce ksmbd: Use unsafe_memcpy() for ntlm_negotiate
rsp buffer is allocated larger than spnego_blob from
smb2_allocate_rsp_buf().

Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-18 17:02:36 -05:00
Steve French
e4be320eec smb3: fix broken cached reads when posix locks
Mandatory locking is enforced for cached reads, which violates
default posix semantics, and also it is enforced inconsistently.
This affected recent versions of libreoffice, and can be
demonstrated by opening a file twice from the same client,
locking it from handle one and trying to read from it from
handle two (which fails, returning EACCES).

There is already a mount option "forcemandatorylock"
(which defaults to off), so with this change only when the user
intentionally specifies "forcemandatorylock" on mount will we
break posix semantics on read to a locked range (ie we will
only fail in this case, if the user mounts with
"forcemandatorylock").

An earlier patch fixed the write path.

Fixes: 85160e03a7 ("CIFS: Implement caching mechanism for mandatory brlocks")
Cc: stable@vger.kernel.org
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Reported-by: abartlet@samba.org
Reported-by: Kevin Ottens <kevin.ottens@enioka.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-18 17:01:06 -05:00
Linus Torvalds
57b14823ea for-6.11-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmbB6dwACgkQxWXV+ddt
 WDu/Eg/9EXEoSPqRYxsRa2vLQjSbBbBCBDW1G75F5oUsvRKtfNhZ+w02JAqubirF
 wdBNaXoGQ9zJq/E0JtHHDqv9M6FV+g3aO0xM+ntmp9cZdFBVXRkrB3TlewMesKfI
 lXZW5kn35q6aeNi2MaJjk2G5Pr0MYjGGRezBuloc7TcIlgVijjLBlcnKEz263C1/
 rXvENxowxPA20LiWviA4ZjlqlRQLBrgxqpSXLGg7mZs93XdbtPa2ZvzS7ffuZTI+
 PUCYGEwI4E2Dpv+mswFb21SUdUPPmAycubERJvABqnZxCWupkevgvv+6MeWC5A2p
 7OjoTmINRDDsNWYSvyHQ04U+0XkPmHCBKEkAuy0ZIajHJU4G6rpeDySX04/Cwzht
 mJZ37FzMGZ9LjEpL1uoPifWKcH0nUW9sWw4Tw9tgeuBG9RfI/BxZqRT9WLEkiXUI
 7Bdq2Ir6fzv8IVKWkuqO6No+LDa//qF2ci0nWvCwbgdNquGFa9DmNmVDiKDqKW3R
 RlP+6laXEPCfKSycTpp94gASVeEEcKNYYC0B/FCBLJmXVcCJQ41qsMn/fYUn8Yn3
 vVdmAuGKThYjO4RNjqy4FgNVLdY9280OGazH0B9t3HhL+U+9JLN8fnhD5H4hEpR/
 dDC1quKmo6WIKAem6O6r3GRnCj8lbLaXpmp+MpyEUI7M5FyjsAc=
 =YBfJ
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs fixes from David Sterba:
 "A more fixes. We got reports that shrinker added in 6.10 still causes
  latency spikes and the fixes don't handle all corner cases. Due to
  summer holidays we're taking a shortcut to disable it for release
  builds and will fix it in the near future.

   - only enable extent map shrinker for DEBUG builds, temporary quick
     fix to avoid latency spikes for regular builds

   - update target inode's ctime on unlink, mandated by POSIX

   - properly take lock to read/update block group's zoned variables

   - add counted_by() annotations"

* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: only enable extent map shrinker for DEBUG builds
  btrfs: zoned: properly take lock to read/update block group's zoned variables
  btrfs: tree-checker: add dev extent item checks
  btrfs: update target inode's ctime on unlink
  btrfs: send: annotate struct name_cache_entry with __counted_by()
2024-08-18 08:50:36 -07:00
Jann Horn
3c0da3d163 fuse: Initialize beyond-EOF page contents before setting uptodate
fuse_notify_store(), unlike fuse_do_readpage(), does not enable page
zeroing (because it can be used to change partial page contents).

So fuse_notify_store() must be more careful to fully initialize page
contents (including parts of the page that are beyond end-of-file)
before marking the page uptodate.

The current code can leave beyond-EOF page contents uninitialized, which
makes these uninitialized page contents visible to userspace via mmap().

This is an information leak, but only affects systems which do not
enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the
corresponding kernel command line parameter).

Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574
Cc: stable@kernel.org
Fixes: a1d75f2582 ("fuse: add store request")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-08-18 08:45:39 -07:00
Linus Torvalds
e0fac5fc8b three client fixes, including two for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbBGkoACgkQiiy9cAdy
 T1HAJAv9G2efGXOuLHuDKM4IkoUBoeAsC/o5g5sVbZfINON1Ra0vQBLmRLunhAlW
 xIY2Ln92jMdvM6wNwFcsAI5bIWTiIrjdqP/HY9kiKRU5O5NvqNWeyPEDOB3aM41O
 UXq8jNKyyyyFD1P4QJNYMeZucTZatLJVb7WRZHGDEDcVMrCWdDVcnPwnMfyNeD0w
 GndMPAAxiQxV+AoL+RgE6+nfVr4EwHI3VFG/h3FyNcaMp2ZSzYHDu/TIwmGBHq6P
 DCJyxjKMJoXKzKO+3hVp3tKzKZ9EuE3ljb8liBbZ8g6J4quCHbQWC3Mh8Jhmgav6
 1KhDRKI6vjHZwu8tWjBEgadhwcRBHMuz/YZL+zrx3QHjA/AgV20Y7oyvyXKusj9t
 G5C1bTExusdhLnEOGN4+udxjAHrMkW36R6Vux5D85WYmhR3k2AbIdZevA+mLADKU
 veTye1VAX5vy9h0atyV69Zta9aBU6q3Mhcpgrcbj0u3C/Iuu1DafrEmb5hGgW7Dw
 xnGynYax
 =af3x
 -----END PGP SIGNATURE-----

Merge tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix for clang warning - additional null check

 - fix for cached write with posix locks

 - flexible structure fix

* tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: smb2pdu.h: Use static_assert() to check struct sizes
  smb3: fix lock breakage for cached writes
  smb/client: avoid possible NULL dereference in cifs_free_subrequest()
2024-08-17 16:31:12 -07:00
Linus Torvalds
d09840f8b3 Bug fixes for 6.11-rc4:
* Check for presence of only 'attr' feature before scrubbing an inode's
     attribute fork.
   * Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for
     long (i.e. 50ms) sleep durations to prevent high load averages.
   * Do not allow users to change the realtime flag of a file unless the
     datadev and rtdev both support fsdax access modes.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZr1wqwAKCRAH7y4RirJu
 9MYxAQCgHoAK8rqxb4obrrGmqVcHJdnHDYqSFRqbbvytRHybZgEA2hfaNbNpuQYT
 JOV5pGOUJf1LiSc5D6MBepg2BAFRNwo=
 =7Ibh
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Check for presence of only 'attr' feature before scrubbing an inode's
   attribute fork.

 - Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for
   long (i.e. 50ms) sleep durations to prevent high load averages.

 - Do not allow users to change the realtime flag of a file unless the
   datadev and rtdev both support fsdax access modes.

* tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
  xfs: revert AIL TASK_KILLABLE threshold
  xfs: attr forks require attr, not attr2
2024-08-17 09:51:28 -07:00
Linus Torvalds
b718175853 bcachefs fixes for 6.11-rc4
- New on disk format version, bcachefs_metadata_version_disk_accounting_inum
 
 This adds one more disk accounting counter, which counts disk usage and
 number of extents per inode number. This lets us track fragmentation,
 for implementing defragmentation later, and it also counts disk usage
 per inode in all snapshots, which will be a useful thing to expose to
 users.
 
 - One performance issue we've observed is threads spinning when they
 should be waiting for dirty keys in the key cache to be flushed by
 journal reclaim, so we now have hysteresis for the waiting thread, as
 well as improving the tracepoint and a new time_stat, for tracking time
 blocked waiting on key cache flushing.
 
 And, various assorted smaller fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma/9QkACgkQE6szbY3K
 bnYcBw/+LBSZ415gWSjPktdecf5rc6K4KxETxAxV0f0KesYzxqAtQzN0SCDvKt65
 3aALU03wM8vWITiLS38/ckT+j6S2BpXcOxdu/OC0nRYQEUg9ZLvqEG5lQ3a/LliV
 Q64N33qsSr6QaKszFllLYcN4tGduKg8HoMlHn6+vJ7HNPjdfv0HHERSUsc7K84/w
 jkRtDE2NxsRJZKMEvIFp8hd5KXUR5zyBz/kc4P0WliLXpSyJLITzhKw1JV7ikKVD
 0mO2bJ/0i7wPIabAD2HJahvbC7fl+2fkYFxUJ2XnvMTgU/+QyeGHEufbcbVrVSp0
 BpzBTmSMFbGXBkbQBruFX5rJetzXeBqdYf0Yfavd4KDhGvYlSfDZQUapXT1QKC2q
 aHSB/s+2r7Crr/MBJyjbeFgXFTNGvI5yerlbdp2yj1kxjYJHHaKrp6h7n6XXk21W
 /mGF5tkIMkFTv98rQnIaky4neJzOPsLTTgxeR8zEudCgMaVUqEcaMdIFvARDjY/3
 n52VR0zl3olV3vu7LgHaHfgH6lfaMV0sHPaGNYGL0YL+bCJD+lYM8a6l9aaks8vk
 md7+mFcOS4FUdDdS8MEKIN/k/gkEOC/EpmI864i9rIl0SiNXNy7FPTDKON8b+Ury
 5omBMUQMEe9Q/pgKGXfpJWFynhSPEVf4y1DIOsrXk/jeBqenFyo=
 =BPGT
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent OverstreetL

 - New on disk format version, bcachefs_metadata_version_disk_accounting_inum

   This adds one more disk accounting counter, which counts disk usage
   and number of extents per inode number. This lets us track
   fragmentation, for implementing defragmentation later, and it also
   counts disk usage per inode in all snapshots, which will be a useful
   thing to expose to users.

 - One performance issue we've observed is threads spinning when they
   should be waiting for dirty keys in the key cache to be flushed by
   journal reclaim, so we now have hysteresis for the waiting thread, as
   well as improving the tracepoint and a new time_stat, for tracking
   time blocked waiting on key cache flushing.

... and various assorted smaller fixes.

* tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix locking in __bch2_trans_mark_dev_sb()
  bcachefs: fix incorrect i_state usage
  bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru
  bcachefs: Fix forgetting to pass trans to fsck_err()
  bcachefs: Increase size of cuckoo hash table on too many rehashes
  bcachefs: bcachefs_metadata_version_disk_accounting_inum
  bcachefs: Kill __bch2_accounting_mem_mod()
  bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
  bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in
  bcachefs: Add a time_stat for blocked on key cache flush
  bcachefs: Improve trans_blocked_journal_reclaim tracepoint
  bcachefs: Add hysteresis to waiting on btree key cache flush
  lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc()
  bcachefs: Convert for_each_btree_node() to lockrestart_do()
  bcachefs: Add missing downgrade table entry
  bcachefs: disk accounting: ignore unknown types
  bcachefs: bch2_accounting_invalid() fixup
  bcachefs: Fix bch2_trigger_alloc when upgrading from old versions
  bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()
2024-08-17 09:46:10 -07:00
Kent Overstreet
0e49d3ff12 bcachefs: Fix locking in __bch2_trans_mark_dev_sb()
We run this in full RW mode now, so we have to guard against the
superblock buffer being reallocated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 20:45:15 -04:00
Qu Wenruo
534f7eff92 btrfs: only enable extent map shrinker for DEBUG builds
Although there are several patches improving the extent map shrinker,
there are still reports of too frequent shrinker behavior, taking too
much CPU for the kswapd process.

So let's only enable extent shrinker for now, until we got more
comprehensive understanding and a better solution.

Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-16 21:22:39 +02:00
Kent Overstreet
99c87fe0f5 bcachefs: fix incorrect i_state usage
Reported-by: syzbot+95e40eae71609e40d851@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
9482f3b053 bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru
Reported-by: syzbot+510b0b28f8e6de64d307@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
075cabf324 bcachefs: Fix forgetting to pass trans to fsck_err()
Reported-by: syzbot+e3938cd6d761b78750e6@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Kent Overstreet
c2f6e16a67 bcachefs: Increase size of cuckoo hash table on too many rehashes
Also, improve the calculation of the new table size, so that it can
shrink when needed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-16 12:46:40 -04:00
Gustavo A. R. Silva
5b4f3af39b smb: smb2pdu.h: Use static_assert() to check struct sizes
Commit 9f9bef9bc5 ("smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end
warnings") introduced tagged `struct create_context_hdr`. We want to
ensure that when new members need to be added to the flexible structure,
they are always included within this tagged struct.

So, we use `static_assert()` to ensure that the memory layout for
both the flexible structure and the tagged struct is the same after
any changes.

Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-15 16:06:55 -05:00
Steve French
836bb3268d smb3: fix lock breakage for cached writes
Mandatory locking is enforced for cached writes, which violates
default posix semantics, and also it is enforced inconsistently.
This apparently breaks recent versions of libreoffice, but can
also be demonstrated by opening a file twice from the same
client, locking it from handle one and writing to it from
handle two (which fails, returning EACCES).

Since there was already a mount option "forcemandatorylock"
(which defaults to off), with this change only when the user
intentionally specifies "forcemandatorylock" on mount will we
break posix semantics on write to a locked range (ie we will
only fail the write in this case, if the user mounts with
"forcemandatorylock").

Fixes: 85160e03a7 ("CIFS: Implement caching mechanism for mandatory brlocks")
Cc: stable@vger.kernel.org
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Reported-by: abartlet@samba.org
Reported-by: Kevin Ottens <kevin.ottens@enioka.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-15 16:04:47 -05:00
Su Hui
74c2ab6d65 smb/client: avoid possible NULL dereference in cifs_free_subrequest()
Clang static checker (scan-build) warning:
	cifsglob.h:line 890, column 3
	Access to field 'ops' results in a dereference of a null pointer.

Commit 519be98971 ("cifs: Add a tracepoint to track credits involved in
R/W requests") adds a check for 'rdata->server', and let clang throw this
warning about NULL dereference.

When 'rdata->credits.value != 0 && rdata->server == NULL' happens,
add_credits_and_wake_if() will call rdata->server->ops->add_credits().
This will cause NULL dereference problem. Add a check for 'rdata->server'
to avoid NULL dereference.

Cc: stable@vger.kernel.org
Fixes: 69c3c023af ("cifs: Implement netfslib hooks")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-15 15:32:30 -05:00
Naohiro Aota
e30729d4bd btrfs: zoned: properly take lock to read/update block group's zoned variables
__btrfs_add_free_space_zoned() references and modifies bg's alloc_offset,
ro, and zone_unusable, but without taking the lock. It is mostly safe
because they monotonically increase (at least for now) and this function is
mostly called by a transaction commit, which is serialized by itself.

Still, taking the lock is a safer and correct option and I'm going to add a
change to reset zone_unusable while a block group is still alive. So, add
locking around the operations.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:56 +02:00
Qu Wenruo
008e2512dc btrfs: tree-checker: add dev extent item checks
[REPORT]
There is a corruption report that btrfs refused to mount a fs that has
overlapping dev extents:

  BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272
  BTRFS error (device sdc): failed to verify dev extents against chunks: -117
  BTRFS error (device sdc): open_ctree failed

[CAUSE]
The direct cause is very obvious, there is a bad dev extent item with
incorrect length.

With btrfs check reporting two overlapping extents, the second one shows
some clue on the cause:

  ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272
  ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144
  ERROR: errors found in extent allocation tree or chunk allocation

The second one looks like a bitflip happened during new chunk
allocation:
hex(2257707008000) = 0x20da9d30000
hex(2257707270144) = 0x20da9d70000
diff               = 0x00000040000

So it looks like a bitflip happened during new dev extent allocation,
resulting the second overlap.

Currently we only do the dev-extent verification at mount time, but if the
corruption is caused by memory bitflip, we really want to catch it before
writing the corruption to the storage.

Furthermore the dev extent items has the following key definition:

	(<device id> DEV_EXTENT <physical offset>)

Thus we can not just rely on the generic key order check to make sure
there is no overlapping.

[ENHANCEMENT]
Introduce dedicated dev extent checks, including:

- Fixed member checks
  * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3)
  * chunk_objectid should always be
    BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256)

- Alignment checks
  * chunk_offset should be aligned to sectorsize
  * length should be aligned to sectorsize
  * key.offset should be aligned to sectorsize

- Overlap checks
  If the previous key is also a dev-extent item, with the same
  device id, make sure we do not overlap with the previous dev extent.

Reported: Stefan N <stefannnau@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:52 +02:00
Jeff Layton
3bc2ac2f8f btrfs: update target inode's ctime on unlink
Unlink changes the link count on the target inode. POSIX mandates that
the ctime must also change when this occurs.

According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html:

"Upon successful completion, unlink() shall mark for update the last data
 modification and last file status change timestamps of the parent
 directory. Also, if the file's link count is not 0, the last file status
 change timestamp of the file shall be marked for update."

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add link to the opengroup docs ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:44 +02:00
Thorsten Blum
c0247d289e btrfs: send: annotate struct name_cache_entry with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15 20:35:32 +02:00
Linus Torvalds
1fb918967b for-6.11-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAma9KksACgkQxWXV+ddt
 WDvMYw/8Ds6A3IcMd1AByPDryhHZnOpqU/YQS/HhneisTg08MYHD0TMZLX02GXw0
 vzqUyBHQ9yOYEt18SVtX67Dzapy0SWZWTK8Er/tJCn+14SLWVMdRiJCO2Y3Rdm8T
 J3S/b610iEVl5z6S6SpFD+zc56liCyVHfpK6obwSFBCzAyN6vm2p0ls5vq+hGsjb
 s/dOPJfOMnFTOFXVIumJ5KJRCubGuwG+PhZO9engwxiFIr1O4xxedzhKocXNSiiE
 +jt96gnKwW/K/Wh59YFGLbKk7h/jsIzM2CqE+JrEZlwIN+oNODFvP+Z37DbSC6jN
 0x7G8gqY8vXgxOUk1rub+IqP+/wIjMmCTzkxpO2uy80hB0h3YzqzZnUNHk6DlIu/
 zOhgcMnp5SMuvtJIXBpP2HRzzG/UbxfrjaPKDmUvwKvCUydrw0xL+XWdDMhi3bSn
 NPDW/Ixl1XGMS131YYla1v/KXTKJwZ2hK045svx7A8Aok0WaXAcYfwValQ9GMO0j
 gk89DRN0tMikBebaVD3aakE1FqyC/3dEZ80D3LSs7cgDGQA27wjdESHcSTtY373/
 +fCBXDH/N9ubanZwsu+gUEzNp3DukEoaw2r73IicxcbsiskDpEqvddcKQmrZQ1xW
 03UVziw1LpzkNSsDMZrlBwIoo5SnqbEbsEkTCMtjivJYOzhfyYk=
 =din2
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - extend tree-checker verification of directory item type

 - fix regression in page/folio and extent state tracking in xarray, the
   dirty status can get out of sync and can cause problems e.g. a hang

 - in send, detect last extent and allow to clone it instead of sending
   it as write, reduces amount of data transferred in the stream

 - fix checking extent references when cleaning deleted subvolumes

 - fix one more case in the extent map shrinker, let it run only in the
   kswapd context so it does not cause latency spikes during other
   operations

* tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix invalid mapping of extent xarray state
  btrfs: send: allow cloning non-aligned extent if it ends at i_size
  btrfs: only run the extent map shrinker from kswapd tasks
  btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
  btrfs: check delayed refs when we're checking if a ref exists
2024-08-14 17:56:15 -07:00
Linus Torvalds
4ac0f08f44 vfs-6.11-rc4.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc
 oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns
 M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg=
 =HZBL
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "VFS:

   - Fix the name of file lease slab cache. When file leases were split
     out of file locks the name of the file lock slab cache was used for
     the file leases slab cache as well.

   - Fix a type in take_fd() helper.

   - Fix infinite directory iteration for stable offsets in tmpfs.

   - When the icache is pruned all reclaimable inodes are marked with
     I_FREEING and other processes that try to lookup such inodes will
     block.

     But some filesystems like ext4 can trigger lookups in their inode
     evict callback causing deadlocks. Ext4 does such lookups if the
     ea_inode feature is used whereby a separate inode may be used to
     store xattrs.

     Introduce I_LRU_ISOLATING which pins the inode while its pages are
     reclaimed. This avoids inode deletion during inode_lru_isolate()
     avoiding the deadlock and evict is made to wait until
     I_LRU_ISOLATING is done.

  netfs:

   - Fault in smaller chunks for non-large folio mappings for
     filesystems that haven't been converted to large folios yet.

   - Fix the CONFIG_NETFS_DEBUG config option. The config option was
     renamed a short while ago and that introduced two minor issues.
     First, it depended on CONFIG_NETFS whereas it wants to depend on
     CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter
     does. Second, the documentation for the config option wasn't fixed
     up.

   - Revert the removal of the PG_private_2 writeback flag as ceph is
     using it and fix how that flag is handled in netfs.

   - Fix DIO reads on 9p. A program watching a file on a 9p mount
     wouldn't see any changes in the size of the file being exported by
     the server if the file was changed directly in the source
     filesystem. Fix this by attempting to read the full size specified
     when a DIO read is requested.

   - Fix a NULL pointer dereference bug due to a data race where a
     cachefiles cookies was retired even though it was still in use.
     Check the cookie's n_accesses counter before discarding it.

  nsfs:

   - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as
     the kernel is writing to userspace.

  pidfs:

   - Prevent the creation of pidfds for kthreads until we have a
     use-case for it and we know the semantics we want. It also confuses
     userspace why they can get pidfds for kthreads.

  squashfs:

   - Fix an unitialized value bug reported by KMSAN caused by a
     corrupted symbolic link size read from disk. Check that the
     symbolic link size is not larger than expected"

* tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  Squashfs: sanity check symbolic link size
  9p: Fix DIO read through netfs
  vfs: Don't evict inode under the inode lru traversing context
  netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
  netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
  file: fix typo in take_fd() comment
  pidfd: prevent creation of pidfds for kthreads
  netfs: clean up after renaming FSCACHE_DEBUG config
  libfs: fix infinite directory reads for offset dir
  nsfs: fix ioctl declaration
  fs/netfs/fscache_cookie: add missing "n_accesses" check
  filelock: fix name of file_lease slab cache
  netfs: Fault in smaller chunks for non-large folio mappings
2024-08-14 09:06:28 -07:00
Darrick J. Wong
8d16762047 xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set
If a file has the S_DAX flag (aka fsdax access mode) set, we cannot
allow users to change the realtime flag unless the datadev and rtdev
both support fsdax access modes.  Even if there are no extents allocated
to the file, the setattr thread could be racing with another thread
that has already started down the write code paths.

Fixes: ba23cba9b3 ("fs: allow per-device dax status checking for filesystems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:20:24 +05:30
Darrick J. Wong
04d6dbb553 xfs: revert AIL TASK_KILLABLE threshold
In commit 9adf40249e, we changed the behavior of the AIL thread to
set its own task state to KILLABLE whenever the timeout value is
nonzero.  Unfortunately, this missed the fact that xfsaild_push will
return 50ms (aka a longish sleep) when we reach the push target or the
AIL becomes empty, so xfsaild goes to sleep for a long period of time in
uninterruptible D state.

This results in artificially high load averages because KILLABLE
processes are UNINTERRUPTIBLE, which contributes to load average even
though the AIL is asleep waiting for someone to interrupt it.  It's not
blocked on IOs or anything, but people scrap ps for processes that look
like they're stuck in D state, so restore the previous threshold.

Fixes: 9adf40249e ("xfs: AIL doesn't need manual pushing")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:19:34 +05:30
Darrick J. Wong
73c34b0b85 xfs: attr forks require attr, not attr2
It turns out that I misunderstood the difference between the attr and
attr2 feature bits.  "attr" means that at some point an attr fork was
created somewhere in the filesystem.  "attr2" means that inodes have
variable-sized forks, but says nothing about whether or not there
actually /are/ attr forks in the system.

If we have an attr fork, we only need to check that attr is set.

Fixes: 99d9d8d05d ("xfs: scrub inode block mappings")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-14 21:19:14 +05:30
Kent Overstreet
58474f76a7 bcachefs: bcachefs_metadata_version_disk_accounting_inum
This adds another disk accounting counter to track usage per inode
number (any snapshot ID).

This will be used for a couple things:

- It'll give us a way to tell the user how much space a given file ista
  consuming in all snapshots; i.e. how much extra space it's consuming
  due to snapshot versioning.

- It counts number of extents and total size of extents (both in btree
  keyspace sectors and actual disk usage), meaning it gives us average
  extent size: that is, it'll let us cheaply find fragmented files that
  should be defragmented.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
5132b99bb6 bcachefs: Kill __bch2_accounting_mem_mod()
The next patch will be adding a disk accounting counter type which is
not kept in the in-memory eytzinger tree.

As prep, fold __bch2_accounting_mem_mod() into
bch2_accounting_mem_mod_locked() so that we can check for that counter
type and bail out without calling bpos_to_disk_accounting_pos() twice.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
d97de0d017 bcachefs: Make bkey_fsck_err() a wrapper around fsck_err()
bkey_fsck_err() was added as an interface that looks like fsck_err(),
but previously all it did was ensure that the appropriate error counter
was incremented in the superblock.

This is a cleanup and bugfix patch that converts it to a wrapper around
fsck_err(). This is needed to fix an issue with the upgrade path to
disk_accounting_v3, where the "silent fix" error list now includes
bkey_fsck errors; fsck_err() handles this in a unified way, and since we
need to change printing of bkey fsck errors from the caller to the inner
bkey_fsck_err() calls, this ends up being a pretty big change.

Als,, rename .invalid() methods to .validate(), for clarity, while we're
changing the function signature anyways (to drop the printbuf argument).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
c99471024f bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
06a8693b89 bcachefs: Add a time_stat for blocked on key cache flush
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:50 -04:00
Kent Overstreet
790666c8ac bcachefs: Improve trans_blocked_journal_reclaim tracepoint
include information about the state of the btree key cache

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:34 -04:00
Kent Overstreet
7254555c44 bcachefs: Add hysteresis to waiting on btree key cache flush
This helps ensure key cache reclaim isn't contending with threads
waiting for the key cache to be helped, and fixes a severe performance
bug.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 23:00:34 -04:00
Kent Overstreet
968feb854a bcachefs: Convert for_each_btree_node() to lockrestart_do()
for_each_btree_node() now works similarly to for_each_btree_key(), where
the loop body is passed as an argument to be passed to lockrestart_do().

This now calls trans_begin() on every loop iteration - which fixes an
SRCU warning in backpointers fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
48d6cc1b48 bcachefs: Add missing downgrade table entry
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
486d920735 bcachefs: disk accounting: ignore unknown types
forward compat fix

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
d9e615762b bcachefs: bch2_accounting_invalid() fixup
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
bd864bc2d9 bcachefs: Fix bch2_trigger_alloc when upgrading from old versions
bch2_trigger_alloc was assuming that the new key would always be newly
created and thus always an alloc_v4 key, but - not when called from
btree_gc.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Kent Overstreet
a24e6e7146 bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()
bch2_btree_path_traverse_cached() was previously checking if it could
just relock the path, which is a common idiom in path traversal.

However, it was using btree_node_relock(), not btree_path_relock();
btree_path_relock() only succeeds if the path was in state
BTREE_ITER_NEED_RELOCK.

If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is
needed; this led to a null ptr deref in
bch2_btree_path_traverse_cached().

And the short circuit check here isn't needed, since it was already done
in the main bch2_btree_path_traverse_one().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-13 22:56:50 -04:00
Linus Torvalds
6b0f8db921 execve fixes for v6.11-rc4
- binfmt_flat: Fix corruption when not offsetting data start
 
 - exec: Fix ToCToU between perm check and set-uid/gid usage
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZrvBmwAKCRA2KwveOeQk
 u9Z2AQDgtypF4Kficiwn9BwZL5OxCxr9XnFuCYUue8Ufzm2WdQEAwhl1tcbs+Auf
 VAqpr6gZTRlpBjbl55LHeyMdxIbwhg4=
 =L28M
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - binfmt_flat: Fix corruption when not offsetting data start

 - exec: Fix ToCToU between perm check and set-uid/gid usage

* tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  exec: Fix ToCToU between perm check and set-uid/gid usage
  binfmt_flat: Fix corruption when not offsetting data start
2024-08-13 16:10:32 -07:00
Kees Cook
f50733b45d exec: Fix ToCToU between perm check and set-uid/gid usage
When opening a file for exec via do_filp_open(), permission checking is
done against the file's metadata at that moment, and on success, a file
pointer is passed back. Much later in the execve() code path, the file
metadata (specifically mode, uid, and gid) is used to determine if/how
to set the uid and gid. However, those values may have changed since the
permissions check, meaning the execution may gain unintended privileges.

For example, if a file could change permissions from executable and not
set-id:

---------x 1 root root 16048 Aug  7 13:16 target

to set-id and non-executable:

---S------ 1 root root 16048 Aug  7 13:16 target

it is possible to gain root privileges when execution should have been
disallowed.

While this race condition is rare in real-world scenarios, it has been
observed (and proven exploitable) when package managers are updating
the setuid bits of installed programs. Such files start with being
world-executable but then are adjusted to be group-exec with a set-uid
bit. For example, "chmod o-x,u+s target" makes "target" executable only
by uid "root" and gid "cdrom", while also becoming setuid-root:

-rwxr-xr-x 1 root cdrom 16048 Aug  7 13:16 target

becomes:

-rwsr-xr-- 1 root cdrom 16048 Aug  7 13:16 target

But racing the chmod means users without group "cdrom" membership can
get the permission to execute "target" just before the chmod, and when
the chmod finishes, the exec reaches brpm_fill_uid(), and performs the
setuid to root, violating the expressed authorization of "only cdrom
group members can setuid to root".

Re-check that we still have execute permissions in case the metadata
has changed. It would be better to keep a copy from the perm-check time,
but until we can do that refactoring, the least-bad option is to do a
full inode_permission() call (under inode lock). It is understood that
this is safe against dead-locks, but hardly optimal.

Reported-by: Marco Vanotti <mvanotti@google.com>
Tested-by: Marco Vanotti <mvanotti@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-13 13:24:29 -07:00
Linus Torvalds
6b4aa469f0 2 smb3 server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma6liAACgkQiiy9cAdy
 T1Eh4wwAuTQDHjehfvCDspMn6lG8IXAtb3oio2cntkII3warxxQ/dRiIyG1JcG5Z
 38e+dokvRkaUF6ntrmudUbHOerw+NRl2ozYF5pQv0+ECyJLXHDqVGnuxNvNPAsD7
 RtHfFf50PdgzGKmXjmUg0GbXMgA6eLSHe9r+wwDkqmIwZHMxaJ2nGuwVjHoO/+uJ
 oynxpYHIUROa2DeQiQKZAz/KHwpdSAGR4+KJRutvVCjInlb9bmSGp//BG34W4vva
 nyQIpnqskmlFg4elV/ktOgCp1rbHc4lgQwsWoCDYrNOyKX83HEIRRWHUEIi7fi+Y
 PBcFgTblrnuhYbUL4Z+rSmHB3YuUkvMLeKkSWSJm2M2qAZzoZWTUNLpzOcAOAcIF
 uhkt1+GUuLsZu3ZoDbolMZl477DtBsbBOKsM0DZ5IMji3MRu8GpvhmOfGOAdVRpT
 msTWfUoWvrc2CM09v3HBtnsAfjDXb/4ebztZxGTGVFk0uYJA1Zg655bHbYbw3tWr
 jXKVa805
 =Q9Qj
 -----END PGP SIGNATURE-----

Merge tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Two smb3 server fixes for access denied problem on share path checks"

* tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: override fsids for smb2_query_info()
  ksmbd: override fsids for share path check
2024-08-13 09:03:23 -07:00
Naohiro Aota
6252690f7e btrfs: fix invalid mapping of extent xarray state
In __extent_writepage_io(), we call btrfs_set_range_writeback() ->
folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the
mapping xarray if the folio is not dirty. This worked fine before commit
97713b1a2c ("btrfs: do not clear page dirty inside
extent_write_locked_range()").

After the commit, however, the folio is still dirty at this point, so the
mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io()
calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That
results in the page being unlocked with a "strange" state. The page is not
PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY.

This strange state looks like causing a hang with a call trace below when
running fstests generic/091 on a null_blk device. It is waiting for a folio
lock.

While I don't have an exact relation between this hang and the strange
state, fixing the state also fixes the hang. And, that state is worth
fixing anyway.

This commit reorders btrfs_folio_clear_dirty() and
btrfs_set_range_writeback() in __extent_writepage_io(), so that the
PAGECACHE_TAG_DIRTY tag is properly removed from the xarray.

  [464.274] task:fsx             state:D stack:0     pid:3034  tgid:3034  ppid:2853   flags:0x00004002
  [464.286] Call Trace:
  [464.291]  <TASK>
  [464.295]  __schedule+0x10ed/0x6260
  [464.301]  ? __pfx___blk_flush_plug+0x10/0x10
  [464.308]  ? __submit_bio+0x37c/0x450
  [464.314]  ? __pfx___schedule+0x10/0x10
  [464.321]  ? lock_release+0x567/0x790
  [464.327]  ? __pfx_lock_acquire+0x10/0x10
  [464.334]  ? __pfx_lock_release+0x10/0x10
  [464.340]  ? __pfx_lock_acquire+0x10/0x10
  [464.347]  ? __pfx_lock_release+0x10/0x10
  [464.353]  ? do_raw_spin_lock+0x12e/0x270
  [464.360]  schedule+0xdf/0x3b0
  [464.365]  io_schedule+0x8f/0xf0
  [464.371]  folio_wait_bit_common+0x2ca/0x6d0
  [464.378]  ? folio_wait_bit_common+0x1cc/0x6d0
  [464.385]  ? __pfx_folio_wait_bit_common+0x10/0x10
  [464.392]  ? __pfx_filemap_get_folios_tag+0x10/0x10
  [464.400]  ? __pfx_wake_page_function+0x10/0x10
  [464.407]  ? __pfx___might_resched+0x10/0x10
  [464.414]  ? do_raw_spin_unlock+0x58/0x1f0
  [464.420]  extent_write_cache_pages+0xe49/0x1620 [btrfs]
  [464.428]  ? lock_acquire+0x435/0x500
  [464.435]  ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs]
  [464.443]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
  [464.451]  ? orc_find.part.0+0x1d4/0x380
  [464.457]  ? __pfx_lock_release+0x10/0x10
  [464.464]  ? __pfx_lock_release+0x10/0x10
  [464.471]  ? btrfs_do_write_iter+0x493/0x640 [btrfs]
  [464.478]  btrfs_writepages+0x1cc/0x460 [btrfs]
  [464.485]  ? __pfx_btrfs_writepages+0x10/0x10 [btrfs]
  [464.493]  ? is_bpf_text_address+0x6e/0x100
  [464.500]  ? kernel_text_address+0x145/0x160
  [464.507]  ? unwind_get_return_address+0x5e/0xa0
  [464.514]  ? arch_stack_walk+0xac/0x100
  [464.521]  do_writepages+0x176/0x780
  [464.527]  ? lock_release+0x567/0x790
  [464.533]  ? __pfx_do_writepages+0x10/0x10
  [464.540]  ? __pfx_lock_acquire+0x10/0x10
  [464.546]  ? __pfx_stack_trace_save+0x10/0x10
  [464.553]  ? do_raw_spin_lock+0x12e/0x270
  [464.560]  ? do_raw_spin_unlock+0x58/0x1f0
  [464.566]  ? _raw_spin_unlock+0x23/0x40
  [464.573]  ? wbc_attach_and_unlock_inode+0x3da/0x7d0
  [464.580]  filemap_fdatawrite_wbc+0x113/0x180
  [464.587]  ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
  [464.596]  __filemap_fdatawrite_range+0xaf/0xf0
  [464.603]  ? __pfx___filemap_fdatawrite_range+0x10/0x10
  [464.611]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.618]  ? kasan_quarantine_put+0xd7/0x1e0
  [464.625]  btrfs_start_ordered_extent+0x46f/0x570 [btrfs]
  [464.633]  ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs]
  [464.642]  ? __clear_extent_bit+0x2c0/0x9d0 [btrfs]
  [464.650]  btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs]
  [464.659]  ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs]
  [464.669]  btrfs_read_folio+0x12a/0x1d0 [btrfs]
  [464.676]  ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs]
  [464.684]  ? __pfx_filemap_add_folio+0x10/0x10
  [464.691]  ? __pfx___might_resched+0x10/0x10
  [464.698]  ? __filemap_get_folio+0x1c5/0x450
  [464.705]  prepare_uptodate_page+0x12e/0x4d0 [btrfs]
  [464.713]  prepare_pages.constprop.0+0x13c/0x5c0 [btrfs]
  [464.721]  ? fault_in_iov_iter_readable+0xd2/0x240
  [464.729]  btrfs_buffered_write+0x5bd/0x12f0 [btrfs]
  [464.737]  ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs]
  [464.745]  ? __pfx_lock_release+0x10/0x10
  [464.752]  ? generic_write_checks+0x275/0x400
  [464.759]  ? down_write+0x118/0x1f0
  [464.765]  ? up_write+0x19b/0x500
  [464.770]  btrfs_direct_write+0x731/0xba0 [btrfs]
  [464.778]  ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs]
  [464.785]  ? __pfx___might_resched+0x10/0x10
  [464.792]  ? lock_acquire+0x435/0x500
  [464.798]  ? lock_acquire+0x435/0x500
  [464.804]  btrfs_do_write_iter+0x494/0x640 [btrfs]
  [464.811]  ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs]
  [464.819]  ? __pfx___might_resched+0x10/0x10
  [464.825]  ? rw_verify_area+0x6d/0x590
  [464.831]  vfs_write+0x5d7/0xf50
  [464.837]  ? __might_fault+0x9d/0x120
  [464.843]  ? __pfx_vfs_write+0x10/0x10
  [464.849]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.856]  ? lock_release+0x567/0x790
  [464.862]  ksys_write+0xfb/0x1d0
  [464.867]  ? __pfx_ksys_write+0x10/0x10
  [464.873]  ? _raw_spin_unlock+0x23/0x40
  [464.879]  ? btrfs_getattr+0x4af/0x670 [btrfs]
  [464.886]  ? vfs_getattr_nosec+0x79/0x340
  [464.892]  do_syscall_64+0x95/0x180
  [464.898]  ? __do_sys_newfstat+0xde/0xf0
  [464.904]  ? __pfx___do_sys_newfstat+0x10/0x10
  [464.911]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.918]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.925]  ? do_syscall_64+0xa1/0x180
  [464.931]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.939]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.946]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.953]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.960]  ? do_syscall_64+0xa1/0x180
  [464.966]  ? btrfs_file_llseek+0xb1/0xfb0 [btrfs]
  [464.973]  ? trace_irq_enable.constprop.0+0xce/0x110
  [464.980]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [464.987]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
  [464.995]  ? trace_irq_enable.constprop.0+0xce/0x110
  [465.002]  ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs]
  [465.010]  ? do_syscall_64+0xa1/0x180
  [465.016]  ? lock_release+0x567/0x790
  [465.022]  ? __pfx_lock_acquire+0x10/0x10
  [465.028]  ? __pfx_lock_release+0x10/0x10
  [465.034]  ? trace_irq_enable.constprop.0+0xce/0x110
  [465.042]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.049]  ? do_syscall_64+0xa1/0x180
  [465.055]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.062]  ? do_syscall_64+0xa1/0x180
  [465.068]  ? syscall_exit_to_user_mode+0xac/0x2a0
  [465.075]  ? do_syscall_64+0xa1/0x180
  [465.081]  ? clear_bhb_loop+0x25/0x80
  [465.087]  ? clear_bhb_loop+0x25/0x80
  [465.093]  ? clear_bhb_loop+0x25/0x80
  [465.099]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [465.106] RIP: 0033:0x7f093b8ee784
  [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
  [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784
  [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003
  [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000
  [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000
  [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001
  [465.170]  </TASK>
  [465.174] INFO: lockdep is turned off.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 97713b1a2c ("btrfs: do not clear page dirty inside extent_write_locked_range()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 15:36:57 +02:00
Phillip Lougher
810ee43d9c
Squashfs: sanity check symbolic link size
Syzkiller reports a "KMSAN: uninit-value in pick_link" bug.

This is caused by an uninitialised page, which is ultimately caused
by a corrupted symbolic link size read from disk.

The reason why the corrupted symlink size causes an uninitialised
page is due to the following sequence of events:

1. squashfs_read_inode() is called to read the symbolic
   link from disk.  This assigns the corrupted value
   3875536935 to inode->i_size.

2. Later squashfs_symlink_read_folio() is called, which assigns
   this corrupted value to the length variable, which being a
   signed int, overflows producing a negative number.

3. The following loop that fills in the page contents checks that
   the copied bytes is less than length, which being negative means
   the loop is skipped, producing an uninitialised page.

This patch adds a sanity check which checks that the symbolic
link size is not larger than expected.

--

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk
Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
Reported-by: syzbot+24ac24ff58dc5b0d26b9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/
V2: fix spelling mistake.
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:56:46 +02:00
Dominique Martinet
e3786b29c5
9p: Fix DIO read through netfs
If a program is watching a file on a 9p mount, it won't see any change in
size if the file being exported by the server is changed directly in the
source filesystem, presumably because 9p doesn't have change notifications,
and because netfs skips the reads if the file is empty.

Fix this by attempting to read the full size specified when a DIO read is
requested (such as when 9p is operating in unbuffered mode) and dealing
with a short read if the EOF was less than the expected read.

To make this work, filesystems using netfslib must not set
NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF.
I don't want to mandatorily clear this flag in netfslib for DIO because,
say, ceph might make a read from an object that is not completely filled,
but does not reside at the end of file - and so we need to clear the
excess.

This can be tested by watching an empty file over 9p within a VM (such as
in the ktest framework):

        while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo

then writing something into the empty file.  The watcher should immediately
display the file content and break out of the loop.  Without this fix, it
remains in the loop indefinitely.

Fixes: 80105ed2fd ("9p: Use netfslib read/write_iter")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:53:09 +02:00
Zhihao Cheng
2a0629834c
vfs: Don't evict inode under the inode lru traversing context
The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
        if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
          ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
        deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
        reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
                  prune_dcache_sb
                  // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING
to pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion
cannot be triggered from inode_lru_isolate() thus avoiding the deadlock.
evict() is made to wait for I_LRU_ISOLATING to be cleared before
proceeding with inode cleanup.

Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:52:16 +02:00
Filipe Manana
46a6e10a1a btrfs: send: allow cloning non-aligned extent if it ends at i_size
If we a find that an extent is shared but its end offset is not sector
size aligned, then we don't clone it and issue write operations instead.
This is because the reflink (remap_file_range) operation does not allow
to clone unaligned ranges, except if the end offset of the range matches
the i_size of the source and destination files (and the start offset is
sector size aligned).

While this is not incorrect because send can only guarantee that a file
has the same data in the source and destination snapshots, it's not
optimal and generates confusion and surprising behaviour for users.

For example, running this test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Use a file size not aligned to any possible sector size.
  file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes
  dd if=/dev/random of=$MNT/foo bs=$file_size count=1
  cp --reflink=always $MNT/foo $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap
  rm -f /tmp/send-test
  btrfs send -f /tmp/send-test $MNT/snap

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -vv -f /tmp/send-test $MNT

  xfs_io -r -c "fiemap -v" $MNT/snap/bar

  umount $MNT

Gives the following result:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  write bar - offset=0 length=49152
  write bar - offset=49152 length=49152
  write bar - offset=98304 length=49152
  write bar - offset=147456 length=49152
  write bar - offset=196608 length=49152
  write bar - offset=245760 length=49152
  write bar - offset=294912 length=49152
  write bar - offset=344064 length=49152
  write bar - offset=393216 length=49152
  write bar - offset=442368 length=49152
  write bar - offset=491520 length=49152
  write bar - offset=540672 length=49152
  write bar - offset=589824 length=49152
  write bar - offset=638976 length=49152
  write bar - offset=688128 length=49152
  write bar - offset=737280 length=49152
  write bar - offset=786432 length=49152
  write bar - offset=835584 length=49152
  write bar - offset=884736 length=49152
  write bar - offset=933888 length=49152
  write bar - offset=983040 length=49152
  write bar - offset=1032192 length=16389
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056   0x1

There's no clone operation to clone extents from the file foo into file
bar and fiemap confirms there's no shared flag (0x2000).

So update send_write_or_clone() so that it proceeds with cloning if the
source and destination ranges end at the i_size of the respective files.

After this changes the result of the test is:

  (...)
  mkfile o258-7-0
  rename o258-7-0 -> bar
  clone bar - source=foo source offset=0 offset=0 length=1048581
  chown bar - uid=0, gid=0
  chmod bar - mode=0644
  utimes bar
  utimes
  BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7
  /mnt/sdi/snap/bar:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..2055]:       26624..28679      2056 0x2001

A test case for fstests will also follow up soon.

Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:45:42 +02:00
Filipe Manana
ae1e766f62 btrfs: only run the extent map shrinker from kswapd tasks
Currently the extent map shrinker can be run by any task when attempting
to allocate memory and there's enough memory pressure to trigger it.

To avoid too much latency we stop iterating over extent maps and removing
them once the task needs to reschedule. This logic was introduced in commit
b3ebb9b7e9 ("btrfs: stop extent map shrinker if reschedule is needed").

While that solved high latency problems for some use cases, it's still
not enough because with a too high number of tasks entering the extent map
shrinker code, either due to memory allocations or because they are a
kswapd task, we end up having a very high level of contention on some
spin locks, namely:

1) The fs_info->fs_roots_radix_lock spin lock, which we need to find
   roots to iterate over their inodes;

2) The spin lock of the xarray used to track open inodes for a root
   (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to
   be a red black tree and the spin lock was root->inode_lock;

3) The fs_info->delayed_iput_lock spin lock since the shrinker adds
   delayed iputs (calls btrfs_add_delayed_iput()).

Instead of allowing the extent map shrinker to be run by any task, make
it run only by kswapd tasks. This still solves the problem of running
into OOM situations due to an unbounded extent map creation, which is
simple to trigger by direct IO writes, as described in the changelog
of commit 956a17d9d0 ("btrfs: add a shrinker for extent maps"), and
by a similar case when doing buffered IO on files with a very large
number of holes (keeping the file open and creating many holes, whose
extent maps are only released when the file is closed).

Reported-by: kzd <kzd@56709.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121
Reported-by: Octavia Togami <octavia.togami@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/
Fixes: 956a17d9d0 ("btrfs: add a shrinker for extent maps")
CC: stable@vger.kernel.org # 6.10+
Tested-by: kzd <kzd@56709.net>
Tested-by: Octavia Togami <octavia.togami@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:43:28 +02:00
Qu Wenruo
31723c9542 btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type
[REPORT]
There is a bug report that kernel is rejecting a mismatching inode mode
and its dir item:

  [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with
  dir: inode mode=040700 btrfs type=2 dir type=0

[CAUSE]
It looks like the inode mode is correct, while the dir item type
0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all.

This may be caused by a memory bit flip.

[ENHANCEMENT]
Although tree-checker is not able to do any cross-leaf verification, for
this particular case we can at least reject any dir type with
BTRFS_FT_UNKNOWN.

So here we enhance the dir type check from [0, BTRFS_FT_MAX), to
(0, BTRFS_FT_MAX).
Although the existing corruption can not be fixed just by such enhanced
checking, it should prevent the same 0x2->0x0 bitflip for dir type to
reach disk in the future.

Reported-by: Kota <nospam@kota.moe>
Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:42:26 +02:00
Josef Bacik
42fac187b5 btrfs: check delayed refs when we're checking if a ref exists
In the patch 78c52d9eb6 ("btrfs: check for refs on snapshot delete
resume") I added some code to handle file systems that had been
corrupted by a bug that incorrectly skipped updating the drop progress
key while dropping a snapshot.  This code would check to see if we had
already deleted our reference for a child block, and skip the deletion
if we had already.

Unfortunately there is a bug, as the check would only check the on-disk
references.  I made an incorrect assumption that blocks in an already
deleted snapshot that was having the deletion resume on mount wouldn't
be modified.

If we have 2 pending deleted snapshots that share blocks, we can easily
modify the rules for a block.  Take the following example

subvolume a exists, and subvolume b is a snapshot of subvolume a.  They
share references to block 1.  Block 1 will have 2 full references, one
for subvolume a and one for subvolume b, and it belongs to subvolume a
(btrfs_header_owner(block 1) == subvolume a).

When deleting subvolume a, we will drop our full reference for block 1,
and because we are the owner we will drop our full reference for all of
block 1's children, convert block 1 to FULL BACKREF, and add a shared
reference to all of block 1's children.

Then we will start the snapshot deletion of subvolume b.  We look up the
extent info for block 1, which checks delayed refs and tells us that
FULL BACKREF is set, so sets parent to the bytenr of block 1.  However
because this is a resumed snapshot deletion, we call into
check_ref_exists().  Because check_ref_exists() only looks at the disk,
it doesn't find the shared backref for the child of block 1, and thus
returns 0 and we skip deleting the reference for the child of block 1
and continue.  This orphans the child of block 1.

The fix is to lookup the delayed refs, similar to what we do in
btrfs_lookup_extent_info().  However we only care about whether the
reference exists or not.  If we fail to find our reference on disk, go
look up the bytenr in the delayed refs, and if it exists look for an
existing ref in the delayed ref head.  If that exists then we know we
can delete the reference safely and carry on.  If it doesn't exist we
know we have to skip over this block.

This bug has existed since I introduced this fix, however requires
having multiple deleted snapshots pending when we unmount.  We noticed
this in production because our shutdown path stops the container on the
system, which deletes a bunch of subvolumes, and then reboots the box.
This gives us plenty of opportunities to hit this issue.  Looking at the
history we've seen this occasionally in production, but we had a big
spike recently thanks to faster machines getting jobs with multiple
subvolumes in the job.

Chris Mason wrote a reproducer which does the following

mount /dev/nvme4n1 /btrfs
btrfs subvol create /btrfs/s1
simoop -E -f 4k -n 200000 -z /btrfs/s1
while(true) ; do
	btrfs subvol snap /btrfs/s1 /btrfs/s2
	simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
	btrfs subvol snap /btrfs/s2 /btrfs/s3
	btrfs balance start -dusage=80 /btrfs
	btrfs subvol del /btrfs/s2 /btrfs/s3
	umount /btrfs
	btrfsck /dev/nvme4n1 || exit 1
	mount /dev/nvme4n1 /btrfs
done

On the second loop this would fail consistently, with my patch it has
been running for hours and hasn't failed.

I also used dm-log-writes to capture the state of the failure so I could
debug the problem.  Using the existing failure case to test my patch
validated that it fixes the problem.

Fixes: 78c52d9eb6 ("btrfs: check for refs on snapshot delete resume")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-13 13:42:26 +02:00
David Howells
7b589a9b45
netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags
The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used
correctly.  The problem is that we try to set them up in the request
initialisation, but we the cache may be in the process of setting up still,
and so the state may not be correct.  Further, we secondarily sample the
cache state and make contradictory decisions later.

The issue arises because we set up the cache resources, which allows the
cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which
triggers cache writing even if we didn't set the flags when allocating.

Fix this in the following way:

 (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in
     ->init_request() rather than trying to juggle that in
     netfs_alloc_request().

 (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is
     to be done, then PG_private_2 is to be used rather than only setting
     it if we decide to cache and then having netfs_rreq_unlock_folios()
     set the non-PG_private_2 writeback-to-cache if it wasn't set.

 (3) Split netfs_rreq_unlock_folios() into two functions, one of which
     contains the deprecated code for using PG_private_2 to avoid
     accidentally doing the writeback path - and always use it if
     USE_PGPRIV2 is set.

 (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always
     wait for PG_private_2.  This function is deprecated and only used by
     ceph anyway, and so label it so.

 (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use
     fscache_operation_valid() on the cache_resources instead.  This has
     the advantage of picking up the result of netfs_begin_cache_read() and
     fscache_begin_write_operation() - which are called after the object is
     initialised and will wait for the cache to come to a usable state.

Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be
applied on top of that.  Without this as well, things like:

 rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: {

and:

 WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386

may happen, along with some UAFs due to PG_private_2 not getting used to
wait on writeback completion.

Fixes: 2ff1e97587 ("netfs: Replace PG_fscache by setting folio->private and marking dirty")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Hristo Venev <hristo@venev.name>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1]
Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
David Howells
8e5ced7804
netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag"
This reverts commit ae678317b9.

Revert the patch that removes the deprecated use of PG_private_2 in
netfslib for the moment as Ceph is actually still using this to track
data copied to the cache.

Fixes: ae678317b9 ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag")
Reported-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:27 +02:00
Lukas Bulwahn
889ced4c93
netfs: clean up after renaming FSCACHE_DEBUG config
Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to
CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First,
NETFS_DEBUG mistakenly depends on the non-existing config NETFS, whereas
the actual intended config is called NETFS_SUPPORT. Second, the config
renaming misses to adjust the documentation of the functionality of this
config.

Clean up those two points.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
yangerkun
64a7ce76fb
libfs: fix infinite directory reads for offset dir
After we switch tmpfs dir operations from simple_dir_operations to
simple_offset_dir_operations, every rename happened will fill new dentry
to dest dir's maple tree(&SHMEM_I(inode)->dir_offsets->mt) with a free
key starting with octx->newx_offset, and then set newx_offset equals to
free key + 1. This will lead to infinite readdir combine with rename
happened at the same time, which fail generic/736 in xfstests(detail show
as below).

1. create 5000 files(1 2 3...) under one dir
2. call readdir(man 3 readdir) once, and get one entry
3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry)
4. loop 2~3, until readdir return nothing or we loop too many
   times(tmpfs break test with the second condition)

We choose the same logic what commit 9b378f6ad4 ("btrfs: fix infinite
directory reads") to fix it, record the last_index when we open dir, and
do not emit the entry which index >= last_index. The file->private_data
now used in offset dir can use directly to do this, and we also update
the last_index when we llseek the dir file.

Fixes: a2e459555c ("shmem: stable directory offsets")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Link: https://lore.kernel.org/r/20240731043835.1828697-1-yangerkun@huawei.com
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
[brauner: only update last_index after seek when offset is zero like Jan suggested]
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
Max Kellermann
f71aa06398
fs/netfs/fscache_cookie: add missing "n_accesses" check
This fixes a NULL pointer dereference bug due to a data race which
looks like this:

  BUG: kernel NULL pointer dereference, address: 0000000000000008
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43
  Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018
  Workqueue: events_unbound netfs_rreq_write_to_cache_work
  RIP: 0010:cachefiles_prepare_write+0x30/0xa0
  Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10
  RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286
  RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000
  RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438
  RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001
  R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68
  R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00
  FS:  0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0
  Call Trace:
   <TASK>
   ? __die+0x1f/0x70
   ? page_fault_oops+0x15d/0x440
   ? search_module_extables+0xe/0x40
   ? fixup_exception+0x22/0x2f0
   ? exc_page_fault+0x5f/0x100
   ? asm_exc_page_fault+0x22/0x30
   ? cachefiles_prepare_write+0x30/0xa0
   netfs_rreq_write_to_cache_work+0x135/0x2e0
   process_one_work+0x137/0x2c0
   worker_thread+0x2e9/0x400
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xcc/0x100
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
   </TASK>
  Modules linked in:
  CR2: 0000000000000008
  ---[ end trace 0000000000000000 ]---

This happened because fscache_cookie_state_machine() was slow and was
still running while another process invoked fscache_unuse_cookie();
this led to a fscache_cookie_lru_do_one() call, setting the
FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by
fscache_cookie_state_machine(), withdrawing the cookie via
cachefiles_withdraw_cookie(), clearing cookie->cache_priv.

At the same time, yet another process invoked
cachefiles_prepare_write(), which found a NULL pointer in this code
line:

  struct cachefiles_object *object = cachefiles_cres_object(cres);

The next line crashes, obviously:

  struct cachefiles_cache *cache = object->volume->cache;

During cachefiles_prepare_write(), the "n_accesses" counter is
non-zero (via fscache_begin_operation()).  The cookie must not be
withdrawn until it drops to zero.

The counter is checked by fscache_cookie_state_machine() before
switching to FSCACHE_COOKIE_STATE_RELINQUISHING and
FSCACHE_COOKIE_STATE_WITHDRAWING (in "case
FSCACHE_COOKIE_STATE_FAILED"), but not for
FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case
FSCACHE_COOKIE_STATE_ACTIVE").

This patch adds the missing check.  With a non-zero access counter,
the function returns and the next fscache_end_cookie_access() call
will queue another fscache_cookie_state_machine() call to handle the
still-pending FSCACHE_COOKIE_DO_LRU_DISCARD.

Fixes: 12bb21a29c ("fscache: Implement cookie user counting and resource pinning")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:26 +02:00
Omar Sandoval
3f65f3c099
filelock: fix name of file_lease slab cache
When struct file_lease was split out from struct file_lock, the name of
the file_lock slab cache was copied to the new slab cache for
file_lease. This name conflict causes confusion in /proc/slabinfo and
/sys/kernel/slab. In particular, it caused failures in drgn's test case
for slab cache merging.

Link: 9ad29fd864/tests/linux_kernel/helpers/test_slab.py (L81)
Fixes: c69ff40719 ("filelock: split leases out of struct file_lock")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/2d1d053da1cafb3e7940c4f25952da4f0af34e38.1722293276.git.osandov@fb.com
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:25 +02:00
Matthew Wilcox (Oracle)
98055bc359
netfs: Fault in smaller chunks for non-large folio mappings
As in commit 4e527d5841 ("iomap: fault in smaller chunks for non-large
folio mappings"), we can see a performance loss for filesystems
which have not yet been converted to large folios.

Fixes: c38f4e96e6 ("netfs: Provide func to copy data to pagecache for buffered write")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240527201735.1898381-1-willy@infradead.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-12 22:03:25 +02:00
Linus Torvalds
a1460e457e fix bitmap corruption on close_range(), take 2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZrFgeQAKCRBZ7Krx/gZQ
 6w4SAP48jL+Vil7ifIXviasoBrQGzf9lbTcOAmWAoaxjSlvlpAEAw4OyPhJUmKHW
 ykB/yqUMCajsrrTQPN5lmc0W5v0nqQ4=
 =G0ms
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull fd bitmap fix from Al Viro:
 "Fix bitmap corruption on close_range() by cleaning up
  copy_fd_bitmaps()"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
2024-08-12 08:03:28 -07:00
Linus Torvalds
5189dafa4c nfsd-6.11 fixes:
- Two minor fixes for recent changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAma3pFoACgkQM2qzM29m
 f5f4dA//SiPOS7v60jQubfV3f031E+W3a6YxvB5tqByA23hOavjYowGREyvhtaS5
 lgqRO25BXz7uNOjSWUXaV7SsR1GKElWtk/gGVsfYy03A91Qvafb24oapJRwRXeJ1
 5n6X5AScxJeoDGEIUMoGU3O87v4T6UoTj5K6bjUDqr5kTRTL6rPFj3jYvy0YVLe3
 EsA1P+BLctB1XiboMYNfZyEJbxNF5gpO7OaeS8VPQIeNYhaF1cwsa7A9wDJ22UpB
 BQEguTjPqgzkthOdkc3jNcoXtDUEzAXxQqACKR8CU6cCoyr159KupSlci4LqA2o4
 8CwcVd01L79JjGcrZqo+4vmbuY6rfhN7U/EYcNNJPZKUIo7Q5h4dq/pCaBADK2Ke
 UJVPsmdYXKMwn/pPJI9kjoellqSrjOxqFDXXqBb/e8Z/xXuXmiYRl6E+dtGeBlty
 j4qm12d1BGKs7zRaLLThbFTH6s3kUCjVhEr+uphse6CVF6HV/NNK79jooHrrgCiS
 e1vgRaYIgzMREOyq8fZwvINDuk/8Pa1Z4wlR65orwVzWlZMmk9bLm+wwplcyj34x
 una8RrQhxlicY5nAGUa+aRCkKUtO9oSPs6hukoT+DB06+mhQ7u3ro96ojueMfMdQ
 pKux8zrCdJe7jZUlnIW30/QsrgWhHW9KsrdUA9eKQ1FoZIOdS9I=
 =YX9E
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Two minor fixes for recent changes

* tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets
  sunrpc: avoid -Wformat-security warning
2024-08-10 10:44:21 -07:00
Linus Torvalds
31b2444606 bcachefs fixes for 6.11-rc2, more
- fix a bug that was causing ACLs to seemingly "disappear"
 - new on disk format version, bcachefs_metadata_version_disk_accounting_v3
   bcachefs_metadata_version_disk_accounting_v2 accidentally included
   padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we
   can fix this with another version bump.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma3mGIACgkQE6szbY3K
 bnagAw//XP8wYgEE6worKtLb8sNx+w+jeKR1FeAsYBP012+58xBvljxsUm2J6Mct
 bs7nJmhextVEhcIhm43/SEaic1dXWNxlC5hvTPQzJCGAjyVCPH2naXpz284GQzIu
 HlwBlnWK2HjnSpPHFPEqRm3jq/H9SGEhENDJTrccg7E82towGktQspvmx1Bzyt/E
 cb31An0Y4lQmZGu1/bxX0IDjcKYdI4Jr+FQwiJV35KwGXSw9gzkfHii0wNY+gTAj
 OklXiw5oOtIR52Oy7FKLhYWdl6GIe4/hYS+yYJ3sMiSUuYY50otwIGGnCRLICRlf
 AsE4fmN8wjTPCRryOOTDvc/MCuKuanQa8x1pH0cM3WMKqhl4mrDD2wne+38gfgDN
 zQwoLHfDdDXtadf0NB/VoVQnHTfYSk6GKpHHduMhZeb45XNm0WNryZcefQMunSMz
 CTBq22f22iSngF0KCvHQgp7mm8XNk+d12keQ37ldINRkC+mDWPu7Yi/mhwYzN43O
 +6WrzxyQsQPhLE+nfYRlvAuDCCVsXqjL2U2EVyuaBmUJaMMZbFU+u9zpb48dMjIl
 dRxBQfOsJjvbvTMTNlG4EDslKi8HjoqHXMo5Y3nhvSAmQRjkltfhiXSRivdJxYuR
 NwjJYtszbbBIqSsUZ6DqR+1eTG6HSQ1ML0B+Nm8orLLMqIFlAkQ=
 =f3fR
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs

Pull more bcachefs fixes from Kent Overstreet:
 "A couple last minute fixes for the new disk accounting

   - fix a bug that was causing ACLs to seemingly "disappear"

   - new on disk format version, bcachefs_metadata_version_disk_accounting_v3

     bcachefs_metadata_version_disk_accounting_v2 accidentally included
     padding in disk_accounting_key; fortunately, 6.11 isn't out yet so
     we can fix this with another version bump"

* tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs:
  bcachefs: bcachefs_metadata_version_disk_accounting_v3
  bcachefs: improve bch2_dev_usage_to_text()
  bcachefs: bch2_accounting_invalid()
  bcachefs: Switch to .get_inode_acl()
2024-08-10 10:06:26 -07:00
Linus Torvalds
34ac1e82e5 Three smb3 client fixes including two for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma2k9YACgkQiiy9cAdy
 T1GkqAv7B8wyDp7pVU8qtTxct1gVtDEqvd2Kte6962uwvKWEhhf8SKzagypICudL
 vXxwJuPsRlFM5jWd8h36/xXC17BYM1mEsz6GVq9viUYUcqtrKrGPsOL7+OLVR3Mp
 SwPRPLYyLyMBHhc3Su2/6uGr87NMkcgDTzugZrx67ojAaw/xkYsiuz+Wm8ijbdik
 waDkJ3zosj0Nbhud47sf6rkrymiqC5kP717rhlcXF+TxNscFDQ/eFO8b0lt3g5q5
 +Y338a4pLLFWXBm1jP9EOtbx7NSDKaqWVYYtnwEBs6EV3QGpKjbT1zBQ07wWjfT4
 FhcFhYYukq7Jc4X39JouBXqYWR2wjB8VpMzCuwNsDNJ7FahNaGTgfXCMx6C0Bi0w
 XICAMoZfBXGSAs+NTyi38AHyPQ+KdYJeSgA9doL+3FhMGx9KKAVkT6XB3twhxoOS
 w/iMyzhx/1mwv4CukCKm3lDflN09AseB688QiFsTshTmnTlWFcHdEnORqWbNrd2N
 evhrKvxK
 =GIpz
 -----END PGP SIGNATURE-----

Merge tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - DFS fix

 - fix for security flags for requiring encryption

 - minor cleanup

* tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: cifs_inval_name_dfs_link_error: correct the check for fullpath
  Fix spelling errors in Server Message Block
  smb3: fix setting SecurityFlags when encryption is required
2024-08-09 21:33:25 -07:00
Kees Cook
3eb3cd5992 binfmt_flat: Fix corruption when not offsetting data start
Commit 04d82a6d08 ("binfmt_flat: allow not offsetting data start")
introduced a RISC-V specific variant of the FLAT format which does
not allocate any space for the (obsolete) array of shared library
pointers. However, it did not disable the code which initializes the
array, resulting in the corruption of sizeof(long) bytes before the DATA
segment, generally the end of the TEXT segment.

Introduce MAX_SHARED_LIBS_UPDATE which depends on the state of
CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to guard the initialization of
the shared library pointer region so that it will only be initialized
if space is reserved for it.

Fixes: 04d82a6d08 ("binfmt_flat: allow not offsetting data start")
Co-developed-by: Stefan O'Rear <sorear@fastmail.com>
Signed-off-by: Stefan O'Rear <sorear@fastmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Greg Ungerer <gerg@linux-m68k.org>
Link: https://lore.kernel.org/r/20240807195119.it.782-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-09 20:19:00 -07:00
Kent Overstreet
8a2491db7b bcachefs: bcachefs_metadata_version_disk_accounting_v3
bcachefs_metadata_version_disk_accounting_v2 erroneously had padding
bytes in disk_accounting_key, which is a problem because we have to
guarantee that all unused bytes in disk_accounting_key are zeroed.

Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a
new version.

Reported-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 19:21:28 -04:00
Kent Overstreet
1a9e219db1 bcachefs: improve bch2_dev_usage_to_text()
Add a line for capacity

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 14:40:19 -04:00
Kent Overstreet
077e473723 bcachefs: bch2_accounting_invalid()
Implement bch2_accounting_invalid(); check for junk at the end, and
replicas accounting entries in particular need to be checked or we'll
pop asserts later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-09 14:40:19 -04:00
Namjae Jeon
f6bd41280a ksmbd: override fsids for smb2_query_info()
Sangsoo reported that a DAC denial error occurred when accessing
files through the ksmbd thread. This patch override fsids for
smb2_query_info().

Reported-by: Sangsoo Lee <constant.lee@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-08 22:54:09 -05:00
Namjae Jeon
a018c1b636 ksmbd: override fsids for share path check
Sangsoo reported that a DAC denial error occurred when accessing
files through the ksmbd thread. This patch override fsids for share
path check.

Reported-by: Sangsoo Lee <constant.lee@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-08 22:54:09 -05:00
Gleb Korobeynikov
36bb22a08a cifs: cifs_inval_name_dfs_link_error: correct the check for fullpath
Replace the always-true check tcon->origin_fullpath with
check of server->leaf_fullpath

See https://bugzilla.kernel.org/show_bug.cgi?id=219083

The check of the new @tcon will always be true during mounting,
since @tcon->origin_fullpath will only be set after the tree is
connected to the latest common resource, as well as checking if
the prefix paths from it are fully accessible.

Fixes: 3ae872de41 ("smb: client: fix shared DFS root mounts with different prefixes")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Gleb Korobeynikov <gkorobeynikov@astralinux.ru>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-08 20:06:22 -05:00
Linus Torvalds
9466b6ae6b tracing fixes for v6.11:
- Have reading of event format files test if the meta data still exists.
   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the meta data is
   set to state that it is to prevent any new references to it from happening
   while waiting for existing references to close. When the last reference
   closes, the meta data is freed. But the "format" was missing a check to
   this flag (along with some other files) that allowed new references to
   happen, and a use-after-free bug to occur.
 
 - Have the trace event meta data use the refcount infrastructure instead
   of relying on its own atomic counters.
 
 - Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.
 
 - Have eventfs_create_dir() return an ERR_PTR instead of NULL as
   the callers expect a real object or an ERR_PTR.
 
 - Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.
 
 - Fix ftrace_graph_ret_addr() to use the task passed in and not current.
 
 - Fix overflow bug in get_free_elt() where the counter can overflow
   the integer and cause an infinite loop.
 
 - Remove unused function ring_buffer_nr_pages()
 
 - Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own. When the kernel had randomize structure fields
   enabled, the rcu field of the tracefs_inode was overlapping the
   rcu field of the inode structure, and corrupting it. Instead,
   use the destroy_inode() callback to do the initial cleanup of
   the code, and then have free_inode() free it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZrTvXxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qu39AP9ze6ELpShDrxbXhf0adbNqG2IXMepa
 MMLqfq8tU8E/vAEAuZXJ6rKXeGvKeONa06ocvWJ0dpb2cy/n4hmx+KtM5gI=
 =Pkh4
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Have reading of event format files test if the metadata still exists.

   When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata
   is set to state that it is to prevent any new references to it from
   happening while waiting for existing references to close. When the
   last reference closes, the metadata is freed. But the "format" was
   missing a check to this flag (along with some other files) that
   allowed new references to happen, and a use-after-free bug to occur.

 - Have the trace event meta data use the refcount infrastructure
   instead of relying on its own atomic counters.

 - Have tracefs inodes use alloc_inode_sb() for allocation instead of
   using kmem_cache_alloc() directly.

 - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the
   callers expect a real object or an ERR_PTR.

 - Have release_ei() use call_srcu() and not call_rcu() as all the
   protection is on SRCU and not RCU.

 - Fix ftrace_graph_ret_addr() to use the task passed in and not
   current.

 - Fix overflow bug in get_free_elt() where the counter can overflow the
   integer and cause an infinite loop.

 - Remove unused function ring_buffer_nr_pages()

 - Have tracefs freeing use the inode RCU infrastructure instead of
   creating its own.

   When the kernel had randomize structure fields enabled, the rcu field
   of the tracefs_inode was overlapping the rcu field of the inode
   structure, and corrupting it. Instead, use the destroy_inode()
   callback to do the initial cleanup of the code, and then have
   free_inode() free it.

* tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracefs: Use generic inode RCU for synchronizing freeing
  ring-buffer: Remove unused function ring_buffer_nr_pages()
  tracing: Fix overflow in get_free_elt()
  function_graph: Fix the ret_stack used by ftrace_graph_ret_addr()
  eventfs: Use SRCU for freeing eventfs_inodes
  eventfs: Don't return NULL in eventfs_create_dir()
  tracefs: Fix inode allocation
  tracing: Use refcount for trace_event_file reference counter
  tracing: Have format file honor EVENT_FILE_FL_FREED
2024-08-08 13:32:59 -07:00
Linus Torvalds
b3f5620f76 bcachefs fixes for 6.11-rc3
Assorted little stuff:
 - lockdep fixup for lockdep_set_notrack_class()
 - we can now remove a device when using erasure coding without
   deadlocking, though we still hit other issues
 - the "allocator stuck" timeout is now configurable, and messages are
   ratelimited; default timeout has been increased from 10 seconds to 30
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma05JwACgkQE6szbY3K
 bnYE+Q//VJ6R/UxDoxjk8zgftRcdgwXod6U+/E0Kj3ZBKLYXkcGaWWmmGMkFafBp
 eL7Y3wtHSKiMsHYX9KEdFUZFLe1KI4c16RgNIXk9nwkF+3/+8pEDHKPFuoGHJH3O
 HComHGqwVg8Zx2jRNvEkvQ980iH7OBGhCjMFXhJ3xbMGLdw91TQQi49a+Q/vz7QT
 y3Cl1dgX5xBl7fqKefsYa+X6mpWi4/6t60vJvatI+bvDfznjI6jN3qGVLlQye7tC
 6VbJAjHsPPyNMlWa99UaHqDdaM325zR2ES0bsfHd8Up4iAwO8OgjzYQxpYTgi51i
 6DTiGEOV2S8gF+Rnprnbzsnau0hEvrtQY2Ub85TCIGbZJa8b+aDIlq9k8jF36O2E
 2CUTleQ/E129RxXpkZGsVRpNmemdCi6rHAcluaFEgezX4FJH8BVOwQQq2Xz7rd7E
 3ZP5iAWmX0IgOL0VOCP/ZXl/lEMwSk0VAED3jEbT7f2K7rU9nXDO2bIEx1wXDCm1
 b32kvmUi2FBjqLHSqvAPEb52tvvZuliMUY7z9dEx+AX9AVC9kGE+amGexosKb/LY
 nWzey+D0cKHtgbkMFrCClkpg75Tnt9ISJbad53+5qhN8an/a71djdj8Zk0UQnQjv
 6Amv4Ns1lDo3XGC1QtYkF5HqiWaupbUXAftptpS4Av4X1zZEQIc=
 =q1dD
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Assorted little stuff:

   - lockdep fixup for lockdep_set_notrack_class()

   - we can now remove a device when using erasure coding without
     deadlocking, though we still hit other issues

   - the 'allocator stuck' timeout is now configurable, and messages are
     ratelimited. The default timeout has been increased from 10 seconds
     to 30"

* tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs:
  bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
  bcachefs: Make allocator stuck timeout configurable, ratelimit messages
  bcachefs: Add missing path_traverse() to btree_iter_next_node()
  bcachefs: ec should not allocate from ro devs
  bcachefs: Improved allocator debugging for ec
  bcachefs: Add missing bch2_trans_begin() call
  bcachefs: Add a comment for bucket helper types
  bcachefs: Don't rely on implicit unsigned -> signed integer conversion
  lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT
  bcachefs: Fix double free of ca->buckets_nouse
2024-08-08 13:27:31 -07:00
Kent Overstreet
f39bae2e02 bcachefs: Switch to .get_inode_acl()
.set_acl() requires a dentry, and if one isn't passed it marks the VFS
inode as not having an ACL.

This has been causing inodes with ACLs to have them "disappear" on
bcachefs filesystem, depending on which path those inodes get pulled
into the cache from.

Switching to .get_inode_acl(), like other local filesystems, fixes this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-08 15:14:02 -04:00
Xiaxi Shen
bdcffe4be7 Fix spelling errors in Server Message Block
Fixed typos in various files under fs/smb/client/

Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-08 11:15:33 -05:00
Steve French
1b5487aefb smb3: fix setting SecurityFlags when encryption is required
Setting encryption as required in security flags was broken.
For example (to require all mounts to be encrypted by setting):

  "echo 0x400c5 > /proc/fs/cifs/SecurityFlags"

Would return "Invalid argument" and log "Unsupported security flags"
This patch fixes that (e.g. allowing overriding the default for
SecurityFlags  0x00c5, including 0x40000 to require seal, ie
SMB3.1.1 encryption) so now that works and forces encryption
on subsequent mounts.

Acked-by: Bharath SM <bharathsm@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-08 11:14:53 -05:00
Kent Overstreet
73dc1656f4 bcachefs: Use bch2_wait_on_allocator() in btree node alloc path
If the allocator gets stuck, we need to know why.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
cecf72798b bcachefs: Make allocator stuck timeout configurable, ratelimit messages
Limit these messages to once every 2 minutes to avoid spamming logs;
with multiple devices the output can be quite significant.

Also, up the default timeout to 30 seconds from 10 seconds.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Kent Overstreet
6d496e02b4 bcachefs: Add missing path_traverse() to btree_iter_next_node()
This fixes a bug exposed by the next path - we pop an assert in
path_set_should_be_locked().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 21:04:55 -04:00
Steven Rostedt
0b6743bd60 tracefs: Use generic inode RCU for synchronizing freeing
With structure layout randomization enabled for 'struct inode' we need to
avoid overlapping any of the RCU-used / initialized-only-once members,
e.g. i_lru or i_sb_list to not corrupt related list traversals when making
use of the rcu_head.

For an unlucky structure layout of 'struct inode' we may end up with the
following splat when running the ftrace selftests:

[<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object])
[<...>] ------------[ cut here ]------------
[<...>] kernel BUG at lib/list_debug.c:54!
[<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G                 N  6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65
[<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0
[<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f
[<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283
[<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000
[<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001
[<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25
[<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d
[<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000
[<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object]
[<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0
[<...>] RSI: __func__.47+0x4340/0x4400
[<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object]
[<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550]
[<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550]
[<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550]
[<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object]
[<...>] FS:  00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000
[<...>] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0
[<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[<...>] ASID: 0003
[<...>] Stack:
[<...>]  ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0
[<...>]  ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f
[<...>]  0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392
[<...>] Call Trace:
[<...>]  <TASK>
[<...>]  [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0
[<...>]  [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48
[<...>]  [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88
[<...>]  [<ffffffff8940fd19>] ? _atomic_dec_and_lock+0x119/0x200 fffffe80416afb90
[<...>]  [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8
[<...>]  [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8
[<...>]  [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40
[<...>]  [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48
[<...>]  [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70
[<...>]  [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78
[<...>]  [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80
[<...>]  [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8
[<...>]  [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0
[<...>]  [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8
[<...>]  [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18
[<...>]  [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38
[<...>]  [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70
[<...>]  [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90
[<...>]  [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0
[<...>]  [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28
[<...>]  [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60
[<...>]  [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88
[<...>]  [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8
[<...>]  [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8
[<...>]  [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0
[<...>]  [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30
[<...>]  [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70
[<...>]  [<ffffffff819f8818>] ? syscall_trace_enter+0x218/0x380 fffffe80416aff88
[<...>]  [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8
[<...>]  [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8
[<...>]  [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8
[<...>]  </TASK>
[<...>]  <PTREGS>
[<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
[<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48
[<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a
[<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70
[<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0
[<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000
[<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0
[<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read|exec|mayread|mayexec)]+0x0/0xb8 [userland map]
[<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read|write|mayread|maywrite|growsdown|account)]+0x0/0xb8 [userland map]
[<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read|write|mayread|maywrite|account)]+0x0/0xb8 [userland map]
[<...>]  </PTREGS>
[<...>] Modules linked in:
[<...>] ---[ end trace 0000000000000000 ]---

The list debug message as well as RBX's symbolic value point out that the
object in question was allocated from 'tracefs_inode_cache' and that the
list's '->next' member is at offset 0. Dumping the layout of the relevant
parts of 'struct tracefs_inode' gives the following:

  struct tracefs_inode {
    union {
      struct inode {
        struct list_head {
          struct list_head * next;                    /*     0     8 */
          struct list_head * prev;                    /*     8     8 */
        } i_lru;
        [...]
      } vfs_inode;
      struct callback_head {
        void (*func)(struct callback_head *);         /*     0     8 */
        struct callback_head * next;                  /*     8     8 */
      } rcu;
    };
    [...]
  };

Above shows that 'vfs_inode.i_lru' overlaps with 'rcu' which will
destroy the 'i_lru' list as soon as the 'rcu' member gets used, e.g. in
call_rcu() or later when calling the RCU callback. This will disturb
concurrent list traversals as well as object reuse which assumes these
list heads will keep their integrity.

For reproduction, the following diff manually overlays 'i_lru' with
'rcu' as, otherwise, one would require some good portion of luck for
gambling an unlucky RANDSTRUCT seed:

  --- a/include/linux/fs.h
  +++ b/include/linux/fs.h
  @@ -629,6 +629,7 @@ struct inode {
   	umode_t			i_mode;
   	unsigned short		i_opflags;
   	kuid_t			i_uid;
  +	struct list_head	i_lru;		/* inode LRU list */
   	kgid_t			i_gid;
   	unsigned int		i_flags;

  @@ -690,7 +691,6 @@ struct inode {
   	u16			i_wb_frn_avg_time;
   	u16			i_wb_frn_history;
   #endif
  -	struct list_head	i_lru;		/* inode LRU list */
   	struct list_head	i_sb_list;
   	struct list_head	i_wb_list;	/* backing dev writeback list */
   	union {

The tracefs inode does not need to supply its own RCU delayed destruction
of its inode. The inode code itself offers both a "destroy_inode()"
callback that gets called when the last reference of the inode is
released, and the "free_inode()" which is called after a RCU
synchronization period from the "destroy_inode()".

The tracefs code can unlink the inode from its list in the destroy_inode()
callback, and the simply free it from the free_inode() callback. This
should provide the same protection.

Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Ilkka =?utf-8?b?TmF1bGFww6TDpA==?= <digirigawa@gmail.com>
Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home
Fixes: baa23a8d43 ("tracefs: Reset permissions on remount if permissions are options")
Reported-by: Mathias Krause <minipli@grsecurity.net>
Reported-by: Brad Spengler <spender@grsecurity.net>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 20:27:49 -04:00
Mathias Krause
8e55643247 eventfs: Use SRCU for freeing eventfs_inodes
To mirror the SRCU lock held in eventfs_iterate() when iterating over
eventfs inodes, use call_srcu() to free them too.

This was accidentally(?) degraded to RCU in commit 43aa6f97c2
("eventfs: Get rid of dentry pointers without refcounts").

Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20240723210755.8970-1-minipli@grsecurity.net
Fixes: 43aa6f97c2 ("eventfs: Get rid of dentry pointers without refcounts")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 19:28:48 -04:00
Mathias Krause
12c20c65d0 eventfs: Don't return NULL in eventfs_create_dir()
Commit 77a06c33a2 ("eventfs: Test for ei->is_freed when accessing
ei->dentry") added another check, testing if the parent was freed after
we released the mutex. If so, the function returns NULL. However, all
callers expect it to either return a valid pointer or an error pointer,
at least since commit 5264a2f4bb ("tracing: Fix a NULL vs IS_ERR() bug
in event_subsystem_dir()"). Returning NULL will therefore fail the error
condition check in the caller.

Fix this by substituting the NULL return value with a fitting error
pointer.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Fixes: 77a06c33a2 ("eventfs: Test for ei->is_freed when accessing ei->dentry")
Link: https://lore.kernel.org/20240723122522.2724-1-minipli@grsecurity.net
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Ajay Kaher <ajay.kaher@broadcom.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 19:28:31 -04:00
Mathias Krause
0df2ac59be tracefs: Fix inode allocation
The leading comment above alloc_inode_sb() is pretty explicit about it:

  /*
   * This must be used for allocating filesystems specific inodes to set
   * up the inode reclaim context correctly.
   */

Switch tracefs over to alloc_inode_sb() to make sure inodes are properly
linked.

Cc: Ajay Kaher <ajay.kaher@broadcom.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20240807115143.45927-2-minipli@grsecurity.net
Fixes: ba37ff75e0 ("eventfs: Implement tracefs_inode_cache")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-07 18:49:06 -04:00
Linus Torvalds
6a0e382640 for-6.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt
 WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1
 66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H
 B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu
 7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx
 2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz
 BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+
 SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP
 kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb
 y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk
 EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j
 Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc=
 =03Gf
 -----END PGP SIGNATURE-----

Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix double inode unlock for direct IO sync writes (reported by
   syzbot)

 - fix root tree id/name map definitions, don't use fixed size buffers
   for name (reported by -Werror=unterminated-string-initialization)

 - fix qgroup reserve leaks in bufferd write path

 - update scrub status structure more often so it can be reported in
   user space more accurately and let 'resume' not repeat work

 - in preparation to remove space cache v1 in the future print a warning
   if it's detected

* tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: avoid using fixed char array size for tree names
  btrfs: fix double inode unlock for direct IO sync writes
  btrfs: emit a warning about space cache v1 being deprecated
  btrfs: fix qgroup reserve leaks in cow_file_range
  btrfs: implement launder_folio for clearing dirty page reserve
  btrfs: scrub: update last_physical after scrubbing one stripe
  btrfs: factor out stripe length calculation into a helper
2024-08-07 09:53:41 -07:00
Kent Overstreet
2caca9fb16 bcachefs: ec should not allocate from ro devs
This fixes a device removal deadlock when using erasure coding.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
c1e4446247 bcachefs: Improved allocator debugging for ec
chasing down a device removal deadlock with erasure coding

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
02026e8931 bcachefs: Add missing bch2_trans_begin() call
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
90b211fa2d bcachefs: Add a comment for bucket helper types
We've had bugs in the past with incorrect integer conversions in disk
accounting code, which is why bucket helpers now always return s64s; add
a comment explaining this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Kent Overstreet
7442b5cdf2 bcachefs: Don't rely on implicit unsigned -> signed integer conversion
implicit integer conversion is a fertile source of bugs, and we really
would rather not have the min()/max() macros doing it implicitly.
bcachefs appears to be the only place in the kernel where this happens,
so let's fix it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-07 08:31:10 -04:00
Al Viro
9a2fa14720 fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE
copy_fd_bitmaps(new, old, count) is expected to copy the first
count/BITS_PER_LONG bits from old->full_fds_bits[] and fill
the rest with zeroes.  What it does is copying enough words
(BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest.
That works fine, *if* all bits past the cutoff point are
clear.  Otherwise we are risking garbage from the last word
we'd copied.

For most of the callers that is true - expand_fdtable() has
count equal to old->max_fds, so there's no open descriptors
past count, let alone fully occupied words in ->open_fds[],
which is what bits in ->full_fds_bits[] correspond to.

The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds),
which is the smallest multiple of BITS_PER_LONG that covers all
opened descriptors below max_fds.  In the common case (copying on
fork()) max_fds is ~0U, so all opened descriptors will be below
it and we are fine, by the same reasons why the call in expand_fdtable()
is safe.

Unfortunately, there is a case where max_fds is less than that
and where we might, indeed, end up with junk in ->full_fds_bits[] -
close_range(from, to, CLOSE_RANGE_UNSHARE) with
	* descriptor table being currently shared
	* 'to' being above the current capacity of descriptor table
	* 'from' being just under some chunk of opened descriptors.
In that case we end up with observably wrong behaviour - e.g. spawn
a child with CLONE_FILES, get all descriptors in range 0..127 open,
then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending
up with descriptor #128, despite #64 being observably not open.

The minimally invasive fix would be to deal with that in dup_fd().
If this proves to add measurable overhead, we can go that way, but
let's try to fix copy_fd_bitmaps() first.

* new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size).
* make copy_fd_bitmaps() take the bitmap size in words, rather than
bits; it's 'count' argument is always a multiple of BITS_PER_LONG,
so we are not losing any information, and that way we can use the
same helper for all three bitmaps - compiler will see that count
is a multiple of BITS_PER_LONG for the large ones, so it'll generate
plain memcpy()+memset().

Reproducer added to tools/testing/selftests/core/close_range_test.c

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-05 19:23:11 -04:00
Linus Torvalds
3f3f6d6123 'smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmauj9AACgkQiiy9cAdy
 T1G4HgwAvdAPAn2BAFYT/SaXfkN3EX78mcGe85wA8CXQep7q/ik+3xAwvNMKOWmo
 OYmem3TqdfK4N4wkXCWd7TpKI+DZQAyt9ocIk8MhDWoIxp2A1nX/80SJUjTAJWvg
 8Q2HBZu8GfYyw8PW9KfR4hBOixvA8dLXZI7vNSvHP4S7XA10OP/HFTkwi4pPlkLF
 ZuSZNMU0Enwmzay1pUkVp9r2dq1ZDKtilbFFmN+bnuMoAigp//HDFFxx/zjIXCqb
 FdhA+bl9Wj1f2r164qDRHoVg2kVX2lyIzhQJtAWdIqxPEAfUgZCu//KN5NgdstYx
 sQID8DL0MDeDYRhvuoAVinLpLvJFVf0O4K43f5kY1HXA4JFn//lY9zPNE4FLuwrw
 Ez+WsB70YHhof14n5w1hgcDE5XMeLZLa3SbVNoyhTW4C7xjJj1cqMEfnqZTGUsLx
 s2sZJnLhoX0aThTp9+Wc4KLy9Z8QjOy3GMmc7tmCtwHfYJTocly8wfWCrR9VYvBP
 yVIhZCbt
 =MKOr
 -----END PGP SIGNATURE-----

Merge tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - two reparse point fixes

 - minor cleanup

 - additional trace point (to help debug a recent problem)

* tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal version number
  smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp
  smb3: add dynamic tracepoints for shutdown ioctl
  cifs: Remove cifs_aio_ctx
  smb: client: handle lack of FSCTL_GET_REPARSE_POINT support
2024-08-04 08:18:40 -07:00
Linus Torvalds
d3426a6ed9 Bug fixes for 6.11-rc1:
* Fix memory leak when corruption is detected during scrubbing parent
     pointers.
   * Allow SECURE namespace xattrs to use reserved block pool to in order to
     prevent ENOSPC.
   * Save stack space by passing tracepoint's char array to file_path() instead
     of another stack variable.
   * Remove unused parameter in macro XFS_DQUOT_LOGRES.
   * Replace comma with semicolon in a couple of places.
 
 Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZqoucAAKCRAH7y4RirJu
 9LlvAP9J85bGKmcBcy0SLbuqatg6aut/ev/7+qI0FHdaRp1mYQEA+ryJarLdl8kM
 8EqMUwDF3CzK3o88hrTMu6lT5F6Mpw4=
 =CCxD
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Chandan Babu:

 - Fix memory leak when corruption is detected during scrubbing parent
   pointers

 - Allow SECURE namespace xattrs to use reserved block pool to in order
   to prevent ENOSPC

 - Save stack space by passing tracepoint's char array to file_path()
   instead of another stack variable

 - Remove unused parameter in macro XFS_DQUOT_LOGRES

 - Replace comma with semicolon in a couple of places

* tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: convert comma to semicolon
  xfs: convert comma to semicolon
  xfs: remove unused parameter in macro XFS_DQUOT_LOGRES
  xfs: fix file_path handling in tracepoints
  xfs: allow SECURE namespace xattrs to use reserved block pool
  xfs: fix a memory leak
2024-08-03 09:09:25 -07:00
Qu Wenruo
12653ec361 btrfs: avoid using fixed char array size for tree names
[BUG]
There is a bug report that using the latest trunk GCC 15, btrfs would cause
unterminated-string-initialization warning:

  linux-6.6/fs/btrfs/print-tree.c:29:49: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization]
   29 |         { BTRFS_BLOCK_GROUP_TREE_OBJECTID,      "BLOCK_GROUP_TREE"      },
      |
      ^~~~~~~~~~~~~~~~~~

[CAUSE]
To print tree names we have an array of root_name_map structure, which
uses "char name[16];" to store the name string of a tree.

But the following trees have names exactly at 16 chars length:
- "BLOCK_GROUP_TREE"
- "RAID_STRIPE_TREE"

This means we will have no space for the terminating '\0', and can lead
to unexpected access when printing the name.

[FIX]
Instead of "char name[16];" use "const char *" instead.

Since the name strings are all read-only data, and are all NULL
terminated by default, there is not much need to bother the length at
all.

Reported-by: Sam James <sam@gentoo.org>
Reported-by: Alejandro Colomar <alx@kernel.org>
Fixes: edde81f1ab ("btrfs: add raid stripe tree pretty printer")
Fixes: 9c54e80ddc ("btrfs: add code to support the block group root")
CC: stable@vger.kernel.org # 6.1+
Suggested-by: Alejandro Colomar <alx@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Alejandro Colomar <alx@kernel.org>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-02 22:44:27 +02:00
Filipe Manana
e0391e92f9 btrfs: fix double inode unlock for direct IO sync writes
If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().

Fix that by checking if we have to skip inode unlocking under that label.

Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8 ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-02 22:32:40 +02:00
Linus Torvalds
1c4246294c A fix for a potential hang in the MDS when cap revocation races with
the client releasing the caps in question, marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmatDEgTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzixHdB/9k62+n/hLFTXem1WE421sg6oAwCNA+
 KCps+TU4ekeo7ckvExeO1MwL5SyRMUc05nY9zHDiqey0hkpuUsZfjvj9v/q0XHDh
 YcGLfIe9gTl2QiAeEjl6FRkJas5GA/MULjEKltEFUpRTCH9WQvS2z5/MzJByUobf
 kHtEokAViEB+OpIabN3Vm4ZcQJ5ZLdbmVzlvf3nRYZ1Yw+OinQn3yVQJ2VKYXiOi
 l9rk9XxUSISV7EKUq9N0qlSH7M1YsspMTM1lSIC1vlaILyeTA1dJa/85ACEtXOfm
 Z7mC77H6ivR2o+eXilgvpEAyCCDRfmwUp4GV1LnqFDZKBJsBAcwLyLGY
 =I++X
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A fix for a potential hang in the MDS when cap revocation races with
  the client releasing the caps in question, marked for stable"

* tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client:
  ceph: force sending a cap update msg back to MDS for revoke op
2024-08-02 10:33:06 -07:00
Steve French
a91bfa6760 cifs: update internal version number
To 2.50

Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-02 10:56:14 -05:00
Paulo Alcantara
ddecea00f8 smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp
NetApp server requires the file to be open with FILE_READ_EA access in
order to support FSCTL_GET_REPARSE_POINT, otherwise it will return
STATUS_INVALID_DEVICE_REQUEST.  It doesn't make any sense because
there's no requirement for FILE_READ_EA bit to be set nor
STATUS_INVALID_DEVICE_REQUEST being used for something other than
"unsupported reparse points" in MS-FSA.

To fix it and improve compatibility, set FILE_READ_EA & SYNCHRONIZE
bits to match what Windows client currently does.

Tested-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-02 10:56:02 -05:00
Steve French
69ca1f5755 smb3: add dynamic tracepoints for shutdown ioctl
For debugging an umount failure in xfstests generic/043 generic/044 in some
configurations, we needed more information on the shutdown ioctl which
was suspected of being related to the cause, so tracepoints are added
in this patch e.g.

  "trace-cmd record -e smb3_shutdown_enter -e smb3_shutdown_done -e smb3_shutdown_err"

Sample output:
  godown-47084   [011] .....  3313.756965: smb3_shutdown_enter: flags=0x1 tid=0x733b3e75
  godown-47084   [011] .....  3313.756968: smb3_shutdown_done: flags=0x1 tid=0x733b3e75

Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-02 10:55:49 -05:00
David Howells
cd93650798 cifs: Remove cifs_aio_ctx
Remove struct cifs_aio_ctx and its associated alloc/release functions as it
is no longer used, the functions being taken over by netfslib.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-02 10:55:45 -05:00
Paulo Alcantara
4b96024ef2 smb: client: handle lack of FSCTL_GET_REPARSE_POINT support
As per MS-FSA 2.1.5.10.14, support for FSCTL_GET_REPARSE_POINT is
optional and if the server doesn't support it,
STATUS_INVALID_DEVICE_REQUEST must be returned for the operation.

If we find files with reparse points and we can't read them due to
lack of client or server support, just ignore it and then treat them
as regular files or junctions.

Fixes: 5f71ebc412 ("smb: client: parse reparse point flag in create response")
Reported-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de>
Tested-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-08-02 10:55:22 -05:00
Linus Torvalds
bbea34e693 do_dup2() out-of-bounds array speculation fix
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZqvoJgAKCRBZ7Krx/gZQ
 6/QwAQD7oRzLm3Wg63EQIcbvhpZZlQvA7/FHVUVbIPE6+5ovWQEAsdHqyvyEXKMc
 osLQJ5hAaYkfEgzjgy9kxR4f7tbdiwI=
 =sbYX
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fix from Al Viro:
 "do_dup2() out-of-bounds array speculation fix"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  protect the fetch of ->fd[fd] in do_dup2() from mispredictions
2024-08-02 08:52:27 -07:00
Al Viro
8aa37bde1a protect the fetch of ->fd[fd] in do_dup2() from mispredictions
both callers have verified that fd is not greater than ->max_fds;
however, misprediction might end up with
        tofree = fdt->fd[fd];
being speculatively executed.  That's wrong for the same reasons
why it's wrong in close_fd()/file_close_fd_locked(); the same
solution applies - array_index_nospec(fd, fdt->max_fds) could differ
from fd only in case of speculative execution on mispredicted path.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-01 15:51:57 -04:00
Josef Bacik
1e7bec1f7d btrfs: emit a warning about space cache v1 being deprecated
We've been wanting to get rid of this for a while, add a message to
indicate that this feature is going away and when so we can finally have
a date when we're going to remove it.  The output looks like this

BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:30:50 +02:00
Boris Burkov
30479f31d4 btrfs: fix qgroup reserve leaks in cow_file_range
In the buffered write path, the dirty page owns the qgroup reserve until
it creates an ordered_extent.

Therefore, any errors that occur before the ordered_extent is created
must free that reservation, or else the space is leaked. The fstest
generic/475 exercises various IO error paths, and is able to trigger
errors in cow_file_range where we fail to get to allocating the ordered
extent. Note that because we *do* clear delalloc, we are likely to
remove the inode from the delalloc list, so the inodes/pages to not have
invalidate/launder called on them in the commit abort path.

This results in failures at the unmount stage of the test that look like:

  BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure
  BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure
  BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672
  ------------[ cut here ]------------
  WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs]
  Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq
  CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W          6.10.0-rc7-gab56fde445b8 #21
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
  RIP: 0010:close_ctree+0x222/0x4d0 [btrfs]
  RSP: 0018:ffffb4465283be00 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001
  RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8
  RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000
  R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  FS:  00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0
  Call Trace:
   <TASK>
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? __warn.cold+0x8e/0xea
   ? close_ctree+0x222/0x4d0 [btrfs]
   ? report_bug+0xff/0x140
   ? handle_bug+0x3b/0x70
   ? exc_invalid_op+0x17/0x70
   ? asm_exc_invalid_op+0x1a/0x20
   ? close_ctree+0x222/0x4d0 [btrfs]
   generic_shutdown_super+0x70/0x160
   kill_anon_super+0x11/0x40
   btrfs_kill_super+0x11/0x20 [btrfs]
   deactivate_locked_super+0x2e/0xa0
   cleanup_mnt+0xb5/0x150
   task_work_run+0x57/0x80
   syscall_exit_to_user_mode+0x121/0x130
   do_syscall_64+0xab/0x1a0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f916847a887
  ---[ end trace 0000000000000000 ]---
  BTRFS error (device dm-8 state EA): qgroup reserved space leaked

Cases 2 and 3 in the out_reserve path both pertain to this type of leak
and must free the reserved qgroup data. Because it is already an error
path, I opted not to handle the possible errors in
btrfs_free_qgroup_data.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:52 +02:00
Boris Burkov
872617a089 btrfs: implement launder_folio for clearing dirty page reserve
In the buffered write path, dirty pages can be said to "own" the qgroup
reservation until they create an ordered_extent. It is possible for
there to be outstanding dirty pages when a transaction is aborted, in
which case there is no cancellation path for freeing this reservation
and it is leaked.

We do already walk the list of outstanding delalloc inodes in
btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them.

This does *not* call btrfs_invalidate_folio(), as one might guess, but
rather calls launder_folio() and release_folio(). Since this is a
reservation associated with dirty pages only, rather than something
associated with the private bit (ordered_extent is cancelled separately
already in the cleanup transaction path), implementing this release
should be done via launder_folio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-01 17:26:40 +02:00