Age | Commit message (Collapse) | Author | Files | Lines |
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull smb client fixes from Steve French:
"Five smb3 client fixes
- three nets/fiolios cifs fixes
- fix typo in module parameters description
- fix incorrect swap warning"
* tag '6.10-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Move the 'pid' from the subreq to the req
cifs: Only pick a channel once per read request
cifs: Defer read completion
cifs: fix typo in module parameter enable_gcm_256
cifs: drop the incorrect assertion in cifs_swap_rw()
|
|
This fixes filesystem size not changing on device removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The debug code relies on btree_trans_list being ordered so that it can
resume on subsequent calls or lock restarts.
However, it was using trans->locknig_wait.task.pid, which is incorrect
since btree_trans objects are cached and reused - typically by different
tasks.
Fix this by switching to pointer order, and also sort them lazily when
required - speeding up the btree_trans_get() fastpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
debug.c was using closure_get() on a different thread's closure where
the we don't know if the object being refcounted is alive.
We keep btree_trans objects on a list so they can be printed by debug
code, and because it is cost prohibitive to touch the btree_trans list
every time we allocate and free btree_trans objects, cached objects are
also on this list.
However, we do not want the debug code to see cached but not in use
btree_trans objects - critically because the btree_paths array will have
been freed (if it was reallocated).
closure_get() is also incorrect to use when that get may race with it
hitting zero, i.e. we must already have a ref on the object or know the
ref can't currently hit 0 for other reasons (as used in the cycle
detector).
to fix this, use the previously introduced closure_get_not_zero(),
closure_return_sync(), and closure_init_stack_release(); the debug code
now can only take a ref on a trans object if it's alive and in use.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_deadlock_to_text() searches the list of btree transactions to find
a deadlock - when it finds one it's done; it's not like other *_read()
functions that's printing each object.
Factor out btree_deadlock_to_text() to make this clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were grabbing the sequence number before unlock incremented it - fix
this by moving the increment to seqmutex_lock() (so the seqmutex_relock()
failure path skips the mutex_trylock()), and returning the sequence
number from unlock(), to make the API simpler and safer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes incorrect/missign checking of strndup_user() returns.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix crashes triggered by administrative operations on the server
* tag 'nfsd-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: grab nfsd_mutex in nfsd_nl_rpc_status_get_dumpit()
nfsd: fix oops when reading pool_stats before server is started
|
|
Pull xfs fix from Chandan Babu:
- Fix assertion failure due to a race between unlink and cluster buffer
instantiation.
* tag 'xfs-6.10-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix unlink vs cluster buffer instantiation race
|
|
Pull bcachefs fixes from Kent Overstreet:
"Lots of (mostly boring) fixes for syzbot bugs and rare(r) CI bugs.
The LRU_TIME_BITS fix was slightly more involved; we only have 48 bits
for the LRU position (we would prefer 64), so wraparound is possible
for the cached data LRUs on a filesystem that has done sufficient
(petabytes) reads; this is now handled.
One notable user reported bugfix, where we were forgetting to
correctly set the bucket data type, which should have been
BCH_DATA_need_gc_gens instead of BCH_DATA_free; this was causing us to
go emergency read-only on a filesystem that had seen heavy enough use
to see bucket gen wraparoud.
We're now starting to fix simple (safe) errors without requiring user
intervention - i.e. a small incremental step towards full self
healing.
This is currently limited to just certain allocation information
counters, and the error is still logged in the superblock; see that
patch for more information. ("bcachefs: Fix safe errors by default")"
* tag 'bcachefs-2024-06-22' of https://evilpiepirate.org/git/bcachefs: (22 commits)
bcachefs: Move the ei_flags setting to after initialization
bcachefs: Fix a UAF after write_super()
bcachefs: Use bch2_print_string_as_lines for long err
bcachefs: Fix I_NEW warning in race path in bch2_inode_insert()
bcachefs: Replace bare EEXIST with private error codes
bcachefs: Fix missing alloc_data_type_set()
closures: Change BUG_ON() to WARN_ON()
bcachefs: fix alignment of VMA for memory mapped files on THP
bcachefs: Fix safe errors by default
bcachefs: Fix bch2_trans_put()
bcachefs: set_worker_desc() for delete_dead_snapshots
bcachefs: Fix bch2_sb_downgrade_update()
bcachefs: Handle cached data LRU wraparound
bcachefs: Guard against overflowing LRU_TIME_BITS
bcachefs: delete_dead_snapshots() doesn't need to go RW
bcachefs: Fix early init error path in journal code
bcachefs: Check for invalid btree IDs
bcachefs: Fix btree ID bitmasks
bcachefs: Fix shift overflow in read_one_super()
bcachefs: Fix a locking bug in the do_discard_fast() path
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two bugs, one originating in this cycle and one from 6.6"
* tag 'ovl-fixes-6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: fix encoding fid for lower only root
ovl: fix copy-up in tmpfile
|
|
`inode->ei_flags` setting and cleaning should be done after initialization,
otherwise the operation is invalid.
Fixes: 9ca4853b98af ("bcachefs: Fix quota support for snapshots")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
write_super() may reallocate the superblock buffer - but
bch_sb_field_ext was referencing it; don't use it after the write_super
call.
Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
printk strings get truncated to 1024 bytes; if we have a long error
message (journal debug info) we need to use a helper.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
discard_new_inode() is the correct interface for tearing down an indoe
that was fully created but not made visible to other threads, but it
expects I_NEW to be set, which we don't use.
Reported-by: https://github.com/koverstreet/bcachefs/issues/690
Fixes: bcachefs: Fix race path in bch2_inode_insert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Incorrect bucket state transition in the discard path; when incrementing
a bucket's generation number that had already been discarded, we were
forgetting to check if it should be need_gc_gens, not free.
This was caught by the .invalid checks in the transaction commit path,
causing us to go emergency read only.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Move the reference pid from the cifs_io_subrequest struct to the
cifs_io_request struct as it's the same for all subreqs of a particular
request.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
In cifs, only pick a channel when setting up a read request rather than
doing so individually for every subrequest and instead use that channel for
all. This mirrors what the code in v6.9 does.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Defer read completion from the I/O thread to the cifsiod thread so as not
to slow down the I/O thread. This restores the behaviour of v6.9.
Fixes: 3ee1a1fc3981 ("cifs: Cut over to using netfslib")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix potential infinite loop when doing block grou reclaim
- fix crash on emulated zoned device and NOCOW files
* tag 'for-6.10-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes
btrfs: retry block group reclaim without infinite loop
|
|
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs
for read-only mmapped files, such as shared libraries. However, the
kernel makes no attempt to actually align those mappings on 2MB
boundaries, which makes it impossible to use those THPs most of the
time. This issue applies to general file mapping THP as well as
existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily
fixed by using thp_get_unmapped_area for the unmapped_area function
in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use.
Similar to commit b0c582233a85 ("btrfs: fix alignment of VMA for
memory mapped files on THP").
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
i.e. the start of automatic self healing:
If errors=continue or fix_safe, we now automatically fix simple errors
without user intervention.
New error action option: fix_safe
This replaces the existing errors=ro option, which gets a new slot, i.e.
existing errors=ro users now get errors=fix_safe.
This is currently only enabled for a limited set of errors - initially
just disk accounting; errors we would never not want to fix, and we
don't want to require user intervention (i.e. to make sure a bug report
gets filed).
Errors will still be counted in the superblock, so we (developers) will
still know they've been occuring if a bug report gets filed (as bug
reports typically include the errors superblock section).
Eventually we'll be enabling this for a much wider set of errors, after
we've done thorough error injection testing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
reference: https://github.com/koverstreet/bcachefs/issues/692
trans->ref is the reference used by the cycle detector, which walks
btree_trans objects of other threads to walk the graph of held locks and
issue wakeups when an abort is required.
We have to wait for the ref to go to 1 before freeing trans->paths or
clearing trans->locking_wait.task.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
this is long running - help users see what's going on
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Missing enum conversion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We only have 48 bits for the LRU time field, which is insufficient to
prevent wraparound.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
LRUs only have 48 bits for the time field (i.e. LRU order); thus we need
overflow checks and guards.
Reported-by: syzbot+df3bf3f088dcaa728857@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've been moving away from going RW lazily; if we want to go RW we do
that in set_may_go_rw(), and if we didn't go RW we don't need to delete
dead snapshots.
Reported-by: syzbot+4366624c0b5aac4906cf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We shouln't be running the journal shutdown sequence if we never fully
initialized the journal.
Reported-by: syzbot+ffd2270f0bca3322ee00@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can only handle btree IDs up to 62, since the btree id (plus the type
for interior btree nodes) has to fit ito a 64 bit bitmask - check for
invalid ones to avoid invalid shifts later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
these should be 64 bit bitmasks, not 32 bit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+9f74cb4006b83e2a3df1@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can't discard a bucket while it's still open; this needs the
bucket_is_open_safe() version, which takes the open_buckets lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We use 0 size arrays as markers, but ubsan doesn't know that - cast them
to a pointer to fix the splat.
Also, make sure this code gets tested a bit more.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_iter_init() needs to happen before key_cache_init(), to initialize
btree_trans_barrier
Reported-by: syzbot+3cca837c2183f8f6fcaf@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
enable_gcm_256 (which allows the server to require the strongest
encryption) is enabled by default, but the modinfo description
incorrectly showed it disabled by default. Fix the typo.
Cc: stable@vger.kernel.org
Fixes: fee742b50289 ("smb3.1.1: enable negotiating stronger encryption by default")
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages then unplug them all together.
That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
actually equals the size of iov_iter_npages(iter, INT_MAX).
Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.
Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space")
Reported-by: Christoph Hellwig <hch@lst.de>
Closes: https://lore.kernel.org/linux-mm/20240614100329.1203579-1-hch@lst.de/
Cc: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <trondmy@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Paulo Alcantara <pc@manguebit.com>
Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Shyam Prasad N <sprasad@microsoft.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Bharath SM <bharathsm@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Mainly MM singleton fixes. And a couple of ocfs2 regression fixes"
* tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
kcov: don't lose track of remote references during softirqs
mm: shmem: fix getting incorrect lruvec when replacing a shmem folio
mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick
mm: fix possible OOB in numa_rebuild_large_mapping()
mm/migrate: fix kernel BUG at mm/compaction.c:2761!
selftests: mm: make map_fixed_noreplace test names stable
mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default
gcov: add support for GCC 14
zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
mm: huge_memory: fix misused mapping_large_folio_support() for anon folios
lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get()
lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n
MAINTAINERS: remove Lorenzo as vmalloc reviewer
Revert "mm: init_mlocked_on_free_v3"
mm/page_table_check: fix crash on ZONE_DEVICE
gcc: disable '-Warray-bounds' for gcc-9
ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger()
ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()
|
|
Grab nfsd_mutex lock in nfsd_nl_rpc_status_get_dumpit routine and remove
nfsd_nl_rpc_status_get_start() and nfsd_nl_rpc_status_get_done(). This
patch fix the syzbot log reported below:
INFO: task syz-executor.1:17770 blocked for more than 143 seconds.
Not tainted 6.10.0-rc3-syzkaller-00022-gcea2a26553ac #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor.1 state:D stack:23800 pid:17770 tgid:17767 ppid:11381 flags:0x00000006
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5408 [inline]
__schedule+0x17e8/0x4a20 kernel/sched/core.c:6745
__schedule_loop kernel/sched/core.c:6822 [inline]
schedule+0x14b/0x320 kernel/sched/core.c:6837
schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:6894
__mutex_lock_common kernel/locking/mutex.c:684 [inline]
__mutex_lock+0x6a4/0xd70 kernel/locking/mutex.c:752
nfsd_nl_listener_get_doit+0x115/0x5d0 fs/nfsd/nfsctl.c:2124
genl_family_rcv_msg_doit net/netlink/genetlink.c:1115 [inline]
genl_family_rcv_msg net/netlink/genetlink.c:1195 [inline]
genl_rcv_msg+0xb16/0xec0 net/netlink/genetlink.c:1210
netlink_rcv_skb+0x1e5/0x430 net/netlink/af_netlink.c:2564
genl_rcv+0x28/0x40 net/netlink/genetlink.c:1219
netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
netlink_unicast+0x7ec/0x980 net/netlink/af_netlink.c:1361
netlink_sendmsg+0x8db/0xcb0 net/netlink/af_netlink.c:1905
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x223/0x270 net/socket.c:745
____sys_sendmsg+0x525/0x7d0 net/socket.c:2585
___sys_sendmsg net/socket.c:2639 [inline]
__sys_sendmsg+0x2b0/0x3a0 net/socket.c:2668
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f24ed27cea9
RSP: 002b:00007f24ee0080c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f24ed3b3f80 RCX: 00007f24ed27cea9
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000005
RBP: 00007f24ed2ebff4 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
Fixes: 1bd773b4f0c9 ("nfsd: hold nfsd_mutex across entire netlink operation")
Fixes: bd9d6a3efa97 ("NFSD: add rpc_status netlink support")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Luis has been reporting an assert failure when freeing an inode
cluster during inode inactivation for a while. The assert looks
like:
XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs]
RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs
RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7
RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0
RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b
R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000
R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80
FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs
xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs
xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs
xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs
__xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs
xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs
xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs
xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs
process_one_work (kernel/workqueue.c:3231)
worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2))
kthread (kernel/kthread.c:389)
ret_from_fork (arch/x86/kernel/process.c:147)
ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
</TASK>
And occurs when the the inode precommit handlers is attempt to look
up the inode cluster buffer to attach the inode for writeback.
The trail of logic that I can reconstruct is as follows.
1. the inode is clean when inodegc runs, so it is not
attached to a cluster buffer when precommit runs.
2. #1 implies the inode cluster buffer may be clean and not
pinned by dirty inodes when inodegc runs.
3. #2 implies that the inode cluster buffer can be reclaimed
by memory pressure at any time.
4. The assert failure implies that the cluster buffer was
attached to the transaction, but not marked done. It had
been accessed earlier in the transaction, but not marked
done.
5. #4 implies the cluster buffer has been invalidated (i.e.
marked stale).
6. #5 implies that the inode cluster buffer was instantiated
uninitialised in the transaction in xfs_ifree_cluster(),
which only instantiates the buffers to invalidate them
and never marks them as done.
Given factors 1-3, this issue is highly dependent on timing and
environmental factors. Hence the issue can be very difficult to
reproduce in some situations, but highly reliable in others. Luis
has an environment where it can be reproduced easily by g/531 but,
OTOH, I've reproduced it only once in ~2000 cycles of g/531.
I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag
on the cluster buffers, even though they may not be initialised. The
reasons why I think this is safe are:
1. A buffer cache lookup hit on a XBF_STALE buffer will
clear the XBF_DONE flag. Hence all future users of the
buffer know they have to re-initialise the contents
before use and mark it done themselves.
2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which
means the buffer remains locked until the journal commit
completes and the buffer is unpinned. Hence once marked
XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only
context that can access the freed buffer is the currently
running transaction.
3. #2 implies that future buffer lookups in the currently
running transaction will hit the transaction match code
and not the buffer cache. Hence XBF_STALE and
XFS_BLI_STALE will not be cleared unless the transaction
initialises and logs the buffer with valid contents
again. At which point, the buffer will be marked marked
XBF_DONE again, so having XBF_DONE already set on the
stale buffer is a moot point.
4. #2 also implies that any concurrent access to that
cluster buffer will block waiting on the buffer lock
until the inode cluster has been fully freed and is no
longer an active inode cluster buffer.
5. #4 + #1 means that any future user of the disk range of
that buffer will always see the range of disk blocks
covered by the cluster buffer as not done, and hence must
initialise the contents themselves.
6. Setting XBF_DONE in xfs_ifree_cluster() then means the
unlinked inode precommit code will see a XBF_DONE buffer
from the transaction match as it expects. It can then
attach the stale but newly dirtied inode to the stale
but newly dirtied cluster buffer without unexpected
failures. The stale buffer will then sail through the
journal and do the right thing with the attached stale
inode during unpin.
Hence the fix is just one line of extra code. The explanation of
why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and
complex....
Fixes: 82842fee6e59 ("xfs: fix AGF vs inode cluster buffer deadlock")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a number of small char/misc and iio driver fixes for
6.10-rc4. Included in here are the following:
- iio driver fixes for a bunch of reported problems.
- mei driver fixes for a number of reported issues.
- amiga parport driver build fix.
- .editorconfig fix that was causing lots of unintended whitespace
changes to happen to files when they were being edited. Unless we
want to sweep the whole tree and remove all trailing whitespace at
once, this is needed for the .editorconfig file to be able to be
used at all. This change is required because the original
submitters never touched older files in the tree.
- jfs bugfix for a buffer overflow
The jfs bugfix is in here as I didn't know where else to put it, and
it's been ignored for a while as the filesystem seems to be abandoned
and I'm tired of seeing the same issue reported in multiple places.
All of these have been in linux-next with no reported issues"
* tag 'char-misc-6.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (25 commits)
.editorconfig: remove trim_trailing_whitespace option
jfs: xattr: fix buffer overflow for invalid xattr
misc: microchip: pci1xxxx: Fix a memory leak in the error handling of gp_aux_bus_probe()
misc: microchip: pci1xxxx: fix double free in the error handling of gp_aux_bus_probe()
parport: amiga: Mark driver struct with __refdata to prevent section mismatch
mei: vsc: Fix wrong invocation of ACPI SID method
mei: vsc: Don't stop/restart mei device during system suspend/resume
mei: me: release irq in mei_me_pci_resume error path
mei: demote client disconnect warning on suspend to debug
iio: inkern: fix channel read regression
iio: imu: inv_mpu6050: stabilized timestamping in interrupt
iio: adc: ad7173: Fix sampling frequency setting
iio: adc: ad7173: Clear append status bit
iio: imu: inv_icm42600: delete unneeded update watermark call
iio: imu: inv_icm42600: stabilized timestamp in interrupt
iio: invensense: fix odr switching to same value
iio: adc: ad7173: Remove index from temp channel
iio: adc: ad7173: Add ad7173_device_info names
iio: adc: ad7173: fix buffers enablement for ad7176-2
iio: temperature: mlx90635: Fix ERR_PTR dereference in mlx90635_probe()
...
|
|
Pull xfs fix from Chandan Babu:
"Ensure xfs incore superblock's allocated inode counter, free inode
counter, and free data block counter are all zero or positive when
they are copied over from xfs_mount->m_[icount,ifree,fdblocks]
respectively"
* tag 'xfs-6.10-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: make sure sb_fdblocks is non-negative
|
|
Pull smb server fixes from Steve French:
"Two small smb3 server fixes:
- set xatttr fix
- pathname parsing check fix"
* tag '6.10-rc3-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix missing use of get_write in in smb2_set_ea()
ksmbd: move leading slash check to smb2_get_name()
|
|
bdev->bd_super has been removed and commit 8887b94d9322 change the usage
from bdev->bd_super to b_assoc_map->host->i_sb. Since ocfs2 hasn't set
bh->b_assoc_map, it will trigger NULL pointer dereference when calling
into ocfs2_abort_trigger().
Actually this was pointed out in history, see commit 74e364ad1b13. But
I've made a mistake when reviewing commit 8887b94d9322 and then
re-introduce this regression.
Since we cannot revive bdev in buffer head, so fix this issue by
initializing all types of ocfs2 triggers when fill super, and then get the
specific ocfs2 trigger from ocfs2_caching_info when access journal.
[joseph.qi@linux.alibaba.com: v2]
Link: https://lkml.kernel.org/r/20240602112045.1112708-1-joseph.qi@linux.alibaba.com
Link: https://lkml.kernel.org/r/20240530110630.3933832-2-joseph.qi@linux.alibaba.com
Fixes: 8887b94d9322 ("ocfs2: stop using bdev->bd_super for journal error logging")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
bdev->bd_super has been removed and commit 8887b94d9322 change the usage
from bdev->bd_super to b_assoc_map->host->i_sb. This introduces the
following NULL pointer dereference in ocfs2_journal_dirty() since
b_assoc_map is still not initialized. This can be easily reproduced by
running xfstests generic/186, which simulate no more credits.
[ 134.351592] BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[ 134.355341] RIP: 0010:ocfs2_journal_dirty+0x14f/0x160 [ocfs2]
...
[ 134.365071] Call Trace:
[ 134.365312] <TASK>
[ 134.365524] ? __die_body+0x1e/0x60
[ 134.365868] ? page_fault_oops+0x13d/0x4f0
[ 134.366265] ? __pfx_bit_wait_io+0x10/0x10
[ 134.366659] ? schedule+0x27/0xb0
[ 134.366981] ? exc_page_fault+0x6a/0x140
[ 134.367356] ? asm_exc_page_fault+0x26/0x30
[ 134.367762] ? ocfs2_journal_dirty+0x14f/0x160 [ocfs2]
[ 134.368305] ? ocfs2_journal_dirty+0x13d/0x160 [ocfs2]
[ 134.368837] ocfs2_create_new_meta_bhs.isra.51+0x139/0x2e0 [ocfs2]
[ 134.369454] ocfs2_grow_tree+0x688/0x8a0 [ocfs2]
[ 134.369927] ocfs2_split_and_insert.isra.67+0x35c/0x4a0 [ocfs2]
[ 134.370521] ocfs2_split_extent+0x314/0x4d0 [ocfs2]
[ 134.371019] ocfs2_change_extent_flag+0x174/0x410 [ocfs2]
[ 134.371566] ocfs2_add_refcount_flag+0x3fa/0x630 [ocfs2]
[ 134.372117] ocfs2_reflink_remap_extent+0x21b/0x4c0 [ocfs2]
[ 134.372994] ? inode_update_timestamps+0x4a/0x120
[ 134.373692] ? __pfx_ocfs2_journal_access_di+0x10/0x10 [ocfs2]
[ 134.374545] ? __pfx_ocfs2_journal_access_di+0x10/0x10 [ocfs2]
[ 134.375393] ocfs2_reflink_remap_blocks+0xe4/0x4e0 [ocfs2]
[ 134.376197] ocfs2_remap_file_range+0x1de/0x390 [ocfs2]
[ 134.376971] ? security_file_permission+0x29/0x50
[ 134.377644] vfs_clone_file_range+0xfe/0x320
[ 134.378268] ioctl_file_clone+0x45/0xa0
[ 134.378853] do_vfs_ioctl+0x457/0x990
[ 134.379422] __x64_sys_ioctl+0x6e/0xd0
[ 134.379987] do_syscall_64+0x5d/0x170
[ 134.380550] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 134.381231] RIP: 0033:0x7fa4926397cb
[ 134.381786] Code: 73 01 c3 48 8b 0d bd 56 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 56 38 00 f7 d8 64 89 01 48
[ 134.383930] RSP: 002b:00007ffc2b39f7b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 134.384854] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fa4926397cb
[ 134.385734] RDX: 00007ffc2b39f7f0 RSI: 000000004020940d RDI: 0000000000000003
[ 134.386606] RBP: 0000000000000000 R08: 00111a82a4f015bb R09: 00007fa494221000
[ 134.387476] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 134.388342] R13: 0000000000f10000 R14: 0000558e844e2ac8 R15: 0000000000f10000
[ 134.389207] </TASK>
Fix it by only aborting transaction and journal in ocfs2_journal_dirty()
now, and leave ocfs2_abort() later when detecting an aborted handle,
e.g. start next transaction. Also log the handle details in this case.
Link: https://lkml.kernel.org/r/20240530110630.3933832-1-joseph.qi@linux.alibaba.com
Fixes: 8887b94d9322 ("ocfs2: stop using bdev->bd_super for journal error logging")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org> [6.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ovl_check_encode_origin() should return a positive number if the lower
dentry is to be encoded, zero otherwise. If there's no upper layer at all
(read-only overlay), then it obviously needs to return positive.
This was broken by commit 16aac5ad1fa9 ("ovl: support encoding
non-decodable file handles"), which didn't take the lower-only
configuration into account.
Fix by checking the no-upper-layer case up-front.
Reported-and-tested-by: Youzhong Yang <youzhong@gmail.com>
Closes: https://lore.kernel.org/all/CADpNCvaBimi+zCYfRJHvCOhMih8OU0rmZkwLuh24MKKroRuT8Q@mail.gmail.com/
Fixes: 16aac5ad1fa9 ("ovl: support encoding non-decodable file handles")
Cc: <stable@vger.kernel.org> # v6.6
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Shin'ichiro reported that when he's running fstests' test-case
btrfs/167 on emulated zoned devices, he's seeing the following NULL
pointer dereference in 'btrfs_zone_finish_endio()':
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
CPU: 4 PID: 2332440 Comm: kworker/u80:15 Tainted: G W 6.10.0-rc2-kts+ #4
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
RSP: 0018:ffff88867f107a90 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff893e5534
RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1081696028
R10: ffff88840b4b0143 R11: ffff88834dfff600 R12: ffff88840b4b0000
R13: 0000000000020000 R14: 0000000000000000 R15: ffff888530ad5210
FS: 0000000000000000(0000) GS:ffff888e3f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f87223fff38 CR3: 00000007a7c6a002 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? die_addr+0x46/0x70
? exc_general_protection+0x14f/0x250
? asm_exc_general_protection+0x26/0x30
? do_raw_read_unlock+0x44/0x70
? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
btrfs_finish_one_ordered+0x5d9/0x19a0 [btrfs]
? __pfx_lock_release+0x10/0x10
? do_raw_write_lock+0x90/0x260
? __pfx_do_raw_write_lock+0x10/0x10
? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
? _raw_write_unlock+0x23/0x40
? btrfs_finish_ordered_zoned+0x5a9/0x850 [btrfs]
? lock_acquire+0x435/0x500
btrfs_work_helper+0x1b1/0xa70 [btrfs]
? __schedule+0x10a8/0x60b0
? __pfx___might_resched+0x10/0x10
process_one_work+0x862/0x1410
? __pfx_lock_acquire+0x10/0x10
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0x1010
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? trace_irq_enable.constprop.0+0xce/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Enabling CONFIG_BTRFS_ASSERT revealed the following assertion to
trigger:
assertion failed: !list_empty(&ordered->list), in fs/btrfs/zoned.c:1815
This indicates, that we're missing the checksums list on the
ordered_extent. As btrfs/167 is doing a NOCOW write this is to be
expected.
Further analysis with drgn confirmed the assumption:
>>> inode = prog.crashed_thread().stack_trace()[11]['ordered'].inode
>>> btrfs_inode = drgn.container_of(inode, "struct btrfs_inode", \
"vfs_inode")
>>> print(btrfs_inode.flags)
(u32)1
As zoned emulation mode simulates conventional zones on regular devices,
we cannot use zone-append for writing. But we're only attaching dummy
checksums if we're doing a zone-append write.
So for NOCOW zoned data writes on conventional zones, also attach a
dummy checksum.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes")
CC: Naohiro Aota <Naohiro.Aota@wdc.com> # 6.6+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If inc_block_group_ro systematically fails (e.g. due to ETXTBUSY from
swap) or btrfs_relocate_chunk systematically fails (from lack of
space), then this worker becomes an infinite loop.
At the very least, this strands the cleaner thread, but can also result
in hung tasks/RCU stalls on PREEMPT_NONE kernels and if the
reclaim_bgs_lock mutex is not contended.
I believe the best long term fix is to manage reclaim via work queue,
where we queue up a relocation on the triggering condition and re-queue
on failure. In the meantime, this is an easy fix to apply to avoid the
immediate pain.
Fixes: 7e2718099438 ("btrfs: reinsert BGs failed to reclaim")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|