Age | Commit message (Collapse) | Author | Files | Lines |
|
Currently fiemap does not take the inode's lock (VFS lock), it only locks
a file range in the inode's io tree. This however can lead to a deadlock
if we have a concurrent fsync on the file and fiemap code triggers a fault
when accessing the user space buffer with fiemap_fill_next_extent(). The
deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
syzbot and triggers a trace like the following:
task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
__extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
__filemap_fdatawrite_range mm/filemap.c:421 [inline]
filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
start_ordered_ops fs/btrfs/file.c:1737 [inline]
btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
generic_write_sync include/linux/fs.h:2885 [inline]
btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
call_write_iter include/linux/fs.h:2189 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x7dc/0xc50 fs/read_write.c:584
ksys_write+0x177/0x2a0 fs/read_write.c:637
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
</TASK>
INFO: task syz-executor361:5697 blocked for more than 145 seconds.
Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5293 [inline]
__schedule+0x995/0xe20 kernel/sched/core.c:6606
schedule+0xcb/0x190 kernel/sched/core.c:6682
rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
__down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
wp_page_shared+0x15e/0x380 mm/memory.c:3295
handle_pte_fault mm/memory.c:4949 [inline]
__handle_mm_fault mm/memory.c:5073 [inline]
handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
handle_page_fault arch/x86/mm/fault.c:1519 [inline]
exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
Code: 74 0a 89 (...)
RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
_copy_to_user+0xe9/0x130 lib/usercopy.c:34
copy_to_user include/linux/uaccess.h:169 [inline]
fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
ioctl_fiemap fs/ioctl.c:219 [inline]
do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
__do_sys_ioctl fs/ioctl.c:868 [inline]
__se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f7d4054e9b9
RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
</TASK>
What happens is the following:
1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
before locking the inode and the i_mmap_lock semaphore, that is, before
calling btrfs_inode_lock();
2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
another task dirties a page;
3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
at step 2 remains dirty and unflushed. Then when it enters
extent_fiemap() and it locks a file range that includes the range of
the page dirtied in step 2;
4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
inode's i_mmap_lock semaphore in write mode. Then it tries to flush
delalloc by calling start_ordered_ops(), which will block, at
find_lock_delalloc_range(), when trying to lock the range of the page
dirtied at step 2, since this range was locked by the fiemap task (at
step 3);
5) Task B generates a page fault when accessing the user space fiemap
buffer with a call to fiemap_fill_next_extent().
The fault handler needs to call btrfs_page_mkwrite() for some other
page of our inode, and there we deadlock when trying to lock the
inode's i_mmap_lock semaphore in read mode, since the fsync task locked
it in write mode (step 4) and the fsync task can not progress because
it's waiting to lock a file range that is currently locked by us (the
fiemap task, step 3).
Fix this by taking the inode's lock (VFS lock) in shared mode when
entering fiemap. This effectively serializes fiemap with fsync (except the
most expensive part of fsync, the log sync), preventing this deadlock.
Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 6.1+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
If commit interval is 0, it means using default value.
Fixes: 6e47a3cc68fc ("ext4: get rid of super block and sbi from handle_mount_ops()")
Signed-off-by: Wang Jianjian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
To avoid 'sparse' warnings about missing endianness conversions, don't
store native endianness values into struct ext4_fc_tl. Instead, use a
separate struct type, ext4_fc_tl_mem.
Fixes: dcc5827484d6 ("ext4: factor out ext4_fc_get_tl()")
Cc: Ye Bin <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Refactor the in-inode and xattr block consistency checking, and report
more fine-grained reports of the consistency problems. Also add more
consistency checks for ea_inode number.
Reviewed-by: Andreas Dilger <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
When converting directory from in-ICB to normal format, the last
iteration through the directory fixing up directory enteries can fail
due to ENOMEM. We do not expect this iteration to fail since the
directory is already verified to be correct and it is difficult to undo
the conversion at this point. So just use GFP_NOFAIL to make sure the
small allocation cannot fail.
Reported-by: [email protected]
Fixes: 0aba4860b0d0 ("udf: Allocate name buffer in directory iterator on heap")
Signed-off-by: Jan Kara <[email protected]>
|
|
GCC does not like having a partially allocated object, since it cannot
reason about it for bounds checking when it is passed to other code.
Instead, fully allocate sig_inputArgs. (Alternatively, sig_inputArgs
should be defined as a struct coda_in_hdr, if it is actually not using
any other part of the union.) Seen under GCC 13:
../fs/coda/upcall.c: In function 'coda_upcall':
../fs/coda/upcall.c:801:22: warning: array subscript 'union inputArgs[0]' is partly outside array bounds of 'unsigned char[20]' [-Warray-bounds=]
801 | sig_inputArgs->ih.opcode = CODA_SIGNAL;
| ^~
Cc: Jan Harkes <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that fscrypt_add_test_dummy_key() is only called by
setup_file_encryption_key() and not by the individual filesystems,
un-export it. Also change its prototype to take the
fscrypt_key_specifier directly, as the caller already has it.
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that the key associated with the "test_dummy_operation" mount option
is added on-demand when it's needed, rather than immediately when the
filesystem is mounted, fscrypt_destroy_keyring() no longer needs to be
called from __put_super() to avoid a memory leak on mount failure.
Remove this call, which was causing confusion because it appeared to be
a sleep-in-atomic bug (though it wasn't, for a somewhat-subtle reason).
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that fs/crypto/ adds the test dummy encryption key on-demand when
it's needed, there's no need for individual filesystems to call
fscrypt_add_test_dummy_key(). Remove the call to it from f2fs.
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that fs/crypto/ adds the test dummy encryption key on-demand when
it's needed, there's no need for individual filesystems to call
fscrypt_add_test_dummy_key(). Remove the call to it from ext4.
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When the key for an inode is not found but the inode is using the
test_dummy_encryption policy, automatically add the
test_dummy_encryption key to the filesystem keyring. This eliminates
the need for all the individual filesystems to do this at mount time,
which is a bit tricky to clean up from on failure.
Note: this covers the call to fscrypt_find_master_key() from inode key
setup, but not from the fscrypt ioctls. So, this isn't *exactly* the
same as the key being present from the very beginning. I think we can
tolerate that, though, since the inode key setup caller is the only one
that actually matters in the context of test_dummy_encryption.
Signed-off-by: Eric Biggers <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
For LFS mode, it should update outplace and no need inplace update.
When using LFS mode for small-volume devices, IPU will not be used,
and the OPU writing method is actually used, but F2FS_IPU_FORCE can
be read from the ipu_policy node, which is different from the actual
situation. And remount to lfs mode should be disallowed when
f2fs ipu is enabled, let's fix it.
Fixes: 84b89e5d943d ("f2fs: add auto tuning for small devices")
Signed-off-by: Yangtao Li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
This patch is to fix typos in f2fs files.
Signed-off-by: Jinyoung Choi <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
We should return when io->bio is null before doing anything. Otherwise, panic.
BUG: kernel NULL pointer dereference, address: 0000000000000010
RIP: 0010:__submit_merged_write_cond+0x164/0x240 [f2fs]
Call Trace:
<TASK>
f2fs_submit_merged_write+0x1d/0x30 [f2fs]
commit_checkpoint+0x110/0x1e0 [f2fs]
f2fs_write_checkpoint+0x9f7/0xf00 [f2fs]
? __pfx_issue_checkpoint_thread+0x10/0x10 [f2fs]
__checkpoint_and_complete_reqs+0x84/0x190 [f2fs]
? preempt_count_add+0x82/0xc0
? __pfx_issue_checkpoint_thread+0x10/0x10 [f2fs]
issue_checkpoint_thread+0x4c/0xf0 [f2fs]
? __pfx_autoremove_wake_function+0x10/0x10
kthread+0xff/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
Cc: [email protected] # v5.18+
Fixes: 64bf0eef0171 ("f2fs: pass the bio operation to bio_alloc_bioset")
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
iostat_update_and_unbind_ctx()
Convert to use iostat_lat_type as parameter instead of raw number.
BTW, move NUM_PREALLOC_IOSTAT_CTXS to the header file, adjust
iostat_lat[{0,1,2}] to iostat_lat[{READ_IO,WRITE_SYNC_IO,WRITE_ASYNC_IO}]
in tracepoint function, and rename iotype to page_type to match the definition.
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Yangtao Li <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Signed-off-by: qixiaoyu1 <[email protected]>
Signed-off-by: xiongping1 <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
MDS expects the completed cap release prior to responding to the
session flush for cache drop.
Cc: [email protected]
Link: http://tracker.ceph.com/issues/38009
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Venky Shankar <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
To avoid confusing the compiler about possible negative sizes, switch
various size variables that can never be negative from int to u32. Seen
with GCC 13:
../fs/udf/directory.c: In function 'udf_copy_fi':
../include/linux/fortify-string.h:57:33: warning: '__builtin_memcpy' pointer overflow between offset 80 and size [-2147483648, -1] [-Warray-bounds=]
57 | #define __underlying_memcpy __builtin_memcpy
| ^
...
../fs/udf/directory.c:102:9: note: in expansion of macro 'memcpy'
102 | memcpy(&iter->fi, iter->bh[0]->b_data + off, len);
| ^~~~~~
Cc: Jan Kara <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
|
|
This patch passes the full response so that the audit function can use all
of it. The audit function was updated to log the additional information in
the AUDIT_FANOTIFY record.
Currently the only type of fanotify info that is defined is an audit
rule number, but convert it to hex encoding to future-proof the field.
Hex encoding suggested by Paul Moore <[email protected]>.
The {subj,obj}_trust values are {0,1,2}, corresponding to no, yes, unknown.
Sample records:
type=FANOTIFY msg=audit(1600385147.372:590): resp=2 fan_type=1 fan_info=3137 subj_trust=3 obj_trust=5
type=FANOTIFY msg=audit(1659730979.839:284): resp=1 fan_type=0 fan_info=0 subj_trust=2 obj_trust=2
Suggested-by: Steve Grubb <[email protected]>
Link: https://lore.kernel.org/r/3075502.aeNJFYEL58@x2
Tested-by: Steve Grubb <[email protected]>
Acked-by: Steve Grubb <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Message-Id: <bcb6d552e517b8751ece153e516d8b073459069c.1675373475.git.rgb@redhat.com>
|
|
This patch adds a flag, FAN_INFO and an extensible buffer to provide
additional information about response decisions. The buffer contains
one or more headers defining the information type and the length of the
following information. The patch defines one additional information
type, FAN_RESPONSE_INFO_AUDIT_RULE, to audit a rule number. This will
allow for the creation of other information types in the future if other
users of the API identify different needs.
The kernel can be tested if it supports a given info type by supplying
the complete info extension but setting fd to FAN_NOFD. It will return
the expected size but not issue an audit record.
Suggested-by: Steve Grubb <[email protected]>
Link: https://lore.kernel.org/r/2745105.e9J7NaK4W3@x2
Suggested-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Steve Grubb <[email protected]>
Acked-by: Steve Grubb <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Message-Id: <10177cfcae5480926b7176321a28d9da6835b667.1675373475.git.rgb@redhat.com>
|
|
The user space API for the response variable is __u32. This patch makes
sure that the whole path through the kernel uses u32 so that there is
no sign extension or truncation of the user space response.
Suggested-by: Steve Grubb <[email protected]>
Link: https://lore.kernel.org/r/12617626.uLZWGnKmhe@x2
Signed-off-by: Richard Guy Briggs <[email protected]>
Acked-by: Paul Moore <[email protected]>
Tested-by: Steve Grubb <[email protected]>
Acked-by: Steve Grubb <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Message-Id: <3778cb0b3501bc4e686ba7770b20eb9ab0506cf4.1675373475.git.rgb@redhat.com>
|
|
clang build fails with
fs/udf/partition.c:86:28: error: variable 'loc' is uninitialized when used here [-Werror,-Wuninitialized]
sb, block, partition, loc, index);
^~~
loc is now only known when bh is valid. So remove reporting loc in debug
output.
Fixes: 4215db46d538 ("udf: Use udf_bread() in udf_get_pblock_virt15()")
Reported-by: kernel test robot <[email protected]>
Reported-by: "kernelci.org bot" <[email protected]>
Signed-off-by: Tom Rix <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
|
|
Bits, which are related to Bitmap Descriptor logical blocks,
are not reset when buffer headers are allocated for them. As the
result, these logical blocks can be treated as free and
be used for other blocks.This can cause usage of one buffer header
for several types of data. UDF issues WARNING in this situation:
WARNING: CPU: 0 PID: 2703 at fs/udf/inode.c:2014
__udf_add_aext+0x685/0x7d0 fs/udf/inode.c:2014
RIP: 0010:__udf_add_aext+0x685/0x7d0 fs/udf/inode.c:2014
Call Trace:
udf_setup_indirect_aext+0x573/0x880 fs/udf/inode.c:1980
udf_add_aext+0x208/0x2e0 fs/udf/inode.c:2067
udf_insert_aext fs/udf/inode.c:2233 [inline]
udf_update_extents fs/udf/inode.c:1181 [inline]
inode_getblk+0x1981/0x3b70 fs/udf/inode.c:885
Found by Linux Verification Center (linuxtesting.org) with syzkaller.
[JK: Somewhat cleaned up the boundary checks]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Vladislav Efanov <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
|
|
When the network status is unstable, use-after-free may occur when
read data from the server.
BUG: KASAN: use-after-free in readpages_fill_pages+0x14c/0x7e0
Call Trace:
<TASK>
dump_stack_lvl+0x38/0x4c
print_report+0x16f/0x4a6
kasan_report+0xb7/0x130
readpages_fill_pages+0x14c/0x7e0
cifs_readv_receive+0x46d/0xa40
cifs_demultiplex_thread+0x121c/0x1490
kthread+0x16b/0x1a0
ret_from_fork+0x2c/0x50
</TASK>
Allocated by task 2535:
kasan_save_stack+0x22/0x50
kasan_set_track+0x25/0x30
__kasan_kmalloc+0x82/0x90
cifs_readdata_direct_alloc+0x2c/0x110
cifs_readdata_alloc+0x2d/0x60
cifs_readahead+0x393/0xfe0
read_pages+0x12f/0x470
page_cache_ra_unbounded+0x1b1/0x240
filemap_get_pages+0x1c8/0x9a0
filemap_read+0x1c0/0x540
cifs_strict_readv+0x21b/0x240
vfs_read+0x395/0x4b0
ksys_read+0xb8/0x150
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 79:
kasan_save_stack+0x22/0x50
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2e/0x50
__kasan_slab_free+0x10e/0x1a0
__kmem_cache_free+0x7a/0x1a0
cifs_readdata_release+0x49/0x60
process_one_work+0x46c/0x760
worker_thread+0x2a4/0x6f0
kthread+0x16b/0x1a0
ret_from_fork+0x2c/0x50
Last potentially related work creation:
kasan_save_stack+0x22/0x50
__kasan_record_aux_stack+0x95/0xb0
insert_work+0x2b/0x130
__queue_work+0x1fe/0x660
queue_work_on+0x4b/0x60
smb2_readv_callback+0x396/0x800
cifs_abort_connection+0x474/0x6a0
cifs_reconnect+0x5cb/0xa50
cifs_readv_from_socket.cold+0x22/0x6c
cifs_read_page_from_socket+0xc1/0x100
readpages_fill_pages.cold+0x2f/0x46
cifs_readv_receive+0x46d/0xa40
cifs_demultiplex_thread+0x121c/0x1490
kthread+0x16b/0x1a0
ret_from_fork+0x2c/0x50
The following function calls will cause UAF of the rdata pointer.
readpages_fill_pages
cifs_read_page_from_socket
cifs_readv_from_socket
cifs_reconnect
__cifs_reconnect
cifs_abort_connection
mid->callback() --> smb2_readv_callback
queue_work(&rdata->work) # if the worker completes first,
# the rdata is freed
cifs_readv_complete
kref_put
cifs_readdata_release
kfree(rdata)
return rdata->... # UAF in readpages_fill_pages()
Similarly, this problem also occurs in the uncache_fill_pages().
Fix this by adjusts the order of condition judgment in the return
statement.
Signed-off-by: ZhaoLong Wang <[email protected]>
Cc: [email protected]
Acked-by: Paulo Alcantara (SUSE) <[email protected]>
Signed-off-by: Steve French <[email protected]>
|
|
When logging a directory, we always set the inode's last_dir_index_offset
to the offset of the last dir index item we found. This is using an extra
field in the log context structure, and it makes more sense to update it
only after we insert dir index items, and we could directly update the
inode's last_dir_index_offset field instead.
So make this simpler by updating the inode's last_dir_index_offset only
when we actually insert dir index keys in the log tree, and getting rid
of the last_dir_item_offset field in the log context structure.
Reported-by: David Arendt <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reported-by: Maxim Mikityanskiy <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Reported-by: Hunter Wardlaw <[email protected]>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1207231
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216851
CC: [email protected] # 6.1+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- explicitly initialize zlib work memory to fix a KCSAN warning
- limit number of send clones by maximum memory allocated
- limit device size extent in case it device shrink races with chunk
allocation
- raid56 fixes:
- fix copy&paste error in RAID6 stripe recovery
- make error bitmap update atomic
* tag 'for-6.2-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: raid56: make error_bitmap update atomic
btrfs: send: limit number of clones and allocated memory size
btrfs: zlib: zero-initialize zlib workspace
btrfs: limit device extents to the device size
btrfs: raid56: fix stripes if vertical errors are found
|
|
Commit 5911d2d1d1a3 ("f2fs: introduce gc_merge mount option") forgot
to show nogc_merge option, let's fix it.
Signed-off-by: Yangtao Li <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
When writing a page from an encrypted file that is using
filesystem-layer encryption (not inline encryption), f2fs encrypts the
pagecache page into a bounce page, then writes the bounce page.
It also passes the bounce page to wbc_account_cgroup_owner(). That's
incorrect, because the bounce page is a newly allocated temporary page
that doesn't have the memory cgroup of the original pagecache page.
This makes wbc_account_cgroup_owner() not account the I/O to the owner
of the pagecache page as it should.
Fix this by always passing the pagecache page to
wbc_account_cgroup_owner().
Fixes: 578c647879f7 ("f2fs: implement cgroup writeback support")
Cc: [email protected]
Reported-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Eric Biggers <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Currently we wrongly calculate the new block age to
old * LAST_AGE_WEIGHT / 100.
Fix it to new * (100 - LAST_AGE_WEIGHT) / 100
+ old * LAST_AGE_WEIGHT / 100.
Signed-off-by: qixiaoyu1 <[email protected]>
Signed-off-by: xiongping1 <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Pull ELF fix from Al Viro:
"One of the many equivalent build warning fixes for !CONFIG_ELF_CORE
configs. Geert's is the earliest one I've been able to find"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
coredump: Move dump_emit_page() to kill unused warning
|
|
When we split a BMBT due to record insertion, we offload it to a
worker thread because we can be deep in the stack when we try to
allocate a new block for the BMBT. Allocation can use several
kilobytes of stack (full memory reclaim, swap and/or IO path can
end up on the stack during allocation) and we can already be several
kilobytes deep in the stack when we need to split the BMBT.
A recent workload demonstrated a deadlock in this BMBT split
offload. It requires several things to happen at once:
1. two inodes need a BMBT split at the same time, one must be
unwritten extent conversion from IO completion, the other must be
from extent allocation.
2. there must be a no available xfs_alloc_wq worker threads
available in the worker pool.
3. There must be sustained severe memory shortages such that new
kworker threads cannot be allocated to the xfs_alloc_wq pool for
both threads that need split work to be run
4. The split work from the unwritten extent conversion must run
first.
5. when the BMBT block allocation runs from the split work, it must
loop over all AGs and not be able to either trylock an AGF
successfully, or each AGF is is able to lock has no space available
for a single block allocation.
6. The BMBT allocation must then attempt to lock the AGF that the
second task queued to the rescuer thread already has locked before
it finds an AGF it can allocate from.
At this point, we have an ABBA deadlock between tasks queued on the
xfs_alloc_wq rescuer thread and a locked AGF. i.e. The queued task
holding the AGF lock can't be run by the rescuer thread until the
task the rescuer thread is runing gets the AGF lock....
This is a highly improbably series of events, but there it is.
There's a couple of ways to fix this, but the easiest way to ensure
that we only punt tasks with a locked AGF that holds enough space
for the BMBT block allocations to the worker thread.
This works for unwritten extent conversion in IO completion (which
doesn't have a locked AGF and space reservations) because we have
tight control over the IO completion stack. It is typically only 6
functions deep when xfs_btree_split() is called because we've
already offloaded the IO completion work to a worker thread and
hence we don't need to worry about stack overruns here.
The other place we can be called for a BMBT split without a
preceeding allocation is __xfs_bunmapi() when punching out the
center of an existing extent. We don't remove extents in the IO
path, so these operations don't tend to be called with a lot of
stack consumed. Hence we don't really need to ship the split off to
a worker thread in these cases, either.
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Variable names in this code module are inconsistent and confusing.
xfs_phys_extent describe physical mappings, so rename them "pmap".
xfs_refcount_intents describe refcount intents, so rename them "ri".
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Pass the incore refcount intent through the CUI logging code instead of
repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_rmap_intents describe block mapping intents, so rename them "ri".
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Pass the incore rmap space mapping through the RUI logging code instead
of repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Change the name of all pointers to xfs_extent_item structures to "xefi"
to make the name consistent and because the current selections ("new"
and "free") mean other things in C.
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Pass the incore xfs_extent_free_item through the EFI logging code
instead of repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Variable names in this code module are inconsistent and confusing.
xfs_map_extent describe file mappings, so rename them "map".
xfs_bmap_intents describe block mapping intents, so rename them "bi".
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
Instead of repeatedly boxing and unboxing the incore extent mapping
structure as it passes through the BUI code, pass the pointer directly
through.
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
The implementation of strscpy() is more robust and safer.
That's now the recommended way to copy NUL-terminated strings.
Signed-off-by: Xu Panda <[email protected]>
Signed-off-by: Yang Yang <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
I added $(srctree)/ to some included Makefiles in the following commits:
- 3204a7fb98a3 ("kbuild: prefix $(srctree)/ to some included Makefiles")
- d82856395505 ("kbuild: do not require sub-make for separate output tree builds")
They were a preparation for removing --include-dir flag.
I have never thought --include-dir useful. Rather, it _is_ harmful.
For example, run the following commands:
$ make -s ARCH=x86 mrproper defconfig
$ make ARCH=arm O=foo dtbs
make[1]: Entering directory '/tmp/linux/foo'
HOSTCC scripts/basic/fixdep
Error: kernelrelease not valid - run 'make prepare' to update it
UPD include/config/kernel.release
make[1]: Leaving directory '/tmp/linux/foo'
The first command configures the source tree for x86. The next command
tries to build ARM device trees in the separate foo/ directory - this
must stop because the directory foo/ has not been configured yet.
However, due to --include-dir=$(abs_srctree), the top Makefile includes
the wrong include/config/auto.conf from the source tree and continues
building. Kbuild traverses the directory tree, but of course it does
not work correctly. The Error message is also pointless - 'make prepare'
does not help at all for fixing the issue.
This commit fixes more arch Makefile, and finally removes --include-dir
from the top Makefile.
There are more breakages under drivers/, but I do not volunteer to fix
them all. I just moved --include-dir to drivers/Makefile.
With this commit, the second command will stop with a sensible message.
$ make -s ARCH=x86 mrproper defconfig
$ make ARCH=arm O=foo dtbs
make[1]: Entering directory '/tmp/linux/foo'
SYNC include/config/auto.conf.cmd
***
*** The source tree is not clean, please run 'make ARCH=arm mrproper'
*** in /tmp/linux
***
make[2]: *** [../Makefile:646: outputmakefile] Error 1
/tmp/linux/Makefile:770: include/config/auto.conf.cmd: No such file or directory
make[1]: *** [/tmp/linux/Makefile:793: include/config/auto.conf.cmd] Error 2
make[1]: Leaving directory '/tmp/linux/foo'
make: *** [Makefile:226: __sub-make] Error 2
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
This fix was nacked by Philip, for reasons identified in the email linked
below.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 72e544b1b28325 ("squashfs: harden sanity check in squashfs_read_xattr_id_table")
Cc: Alexey Khoroshilov <[email protected]>
Cc: Fedor Pchelkin <[email protected]>
Cc: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
The copy_mc_to_kernel() will return 0 if it executed successfully. Then
the return value should be set to the length it copied.
[[email protected]: don't mess up `ret', per Matthew]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: d984648e428b ("fsdax,xfs: port unshare to fsdax")
Signed-off-by: Shiyang Ruan <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Commit e4a0d3e720e7 ("aio: Make it possible to remap aio ring") introduced
a null-deref if mremap is called on an old aio mapping after fork as
mm->ioctx_table will be set to NULL.
[[email protected]: fix 80 column issue]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: e4a0d3e720e7 ("aio: Make it possible to remap aio ring")
Signed-off-by: Seth Jenkins <[email protected]>
Signed-off-by: Jeff Moyer <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Pull ceph fix from Ilya Dryomov:
"A safeguard to prevent the kernel client from further damaging the
filesystem after running into a case of an invalid snap trace.
The root cause of this metadata corruption is still being investigated
but it appears to be stemming from the MDS. As such, this is the best
we can do for now"
* tag 'ceph-for-6.2-rc7' of https://github.com/ceph/ceph-client:
ceph: blocklist the kclient when receiving corrupted snap trace
ceph: move mount state enum to super.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"25 hotfixes, mainly for MM. 13 are cc:stable"
* tag 'mm-hotfixes-stable-2023-02-02-19-24-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (26 commits)
mm: memcg: fix NULL pointer in mem_cgroup_track_foreign_dirty_slowpath()
Kconfig.debug: fix the help description in SCHED_DEBUG
mm/swapfile: add cond_resched() in get_swap_pages()
mm: use stack_depot_early_init for kmemleak
Squashfs: fix handling and sanity checking of xattr_ids count
sh: define RUNTIME_DISCARD_EXIT
highmem: round down the address passed to kunmap_flush_on_unmap()
migrate: hugetlb: check for hugetlb shared PMD in node migration
mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps
mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
Revert "mm: kmemleak: alloc gray object for reserved region with direct map"
freevxfs: Kconfig: fix spelling
maple_tree: should get pivots boundary by type
.mailmap: update e-mail address for Eugen Hristev
mm, mremap: fix mremap() expanding for vma's with vm_ops->close()
squashfs: harden sanity check in squashfs_read_xattr_id_table
ia64: fix build error due to switch case label appearing next to declaration
mm: multi-gen LRU: fix crash during cgroup migration
Revert "mm: add nodes= arg to memory.reclaim"
zsmalloc: fix a race with deferred_handles storing
...
|
|
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use the bvec_set_page and bvec_set_folio helpers to initialize bvecs.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use the bvec_set_page helper to initialize bvecs.
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Trond Myklebust <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|