Age | Commit message (Collapse) | Author | Files | Lines |
|
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count. Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.
Instead hold task_lock for the duration that task->files needs to be
stable in seq_show. The task_lock was already taken in
get_files_struct, and so skipping get_files_struct performs less work
overall, and avoids the problems with the files_struct reference
count.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count. Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.
Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving
the checking for the maximum file descritor into the generic code, and
by remvoing the need for capturing and releasing a reference on
files_struct.
As task_lookup_fd_rcu may update the fd ctx->pos has been changed
to be the fd +2 after task_lookup_fd_rcu returns.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
Tested-by: Andy Lavr <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
As a companion to fget_task and task_lookup_fd_rcu implement
task_lookup_next_fd_rcu that will return the struct file for the first
file descriptor number that is equal or greater than the fd argument
value, or NULL if there is no such struct file.
This allows file descriptors of foreign processes to be iterated
through safely, without needed to increment the count on files_struct.
Some concern[1] has been expressed that this function takes the task_lock
for each iteration and thus for each file descriptor. This place
where this function will be called in a commonly used code path is for
listing /proc/<pid>/fd. I did some small benchmarks and did not see
any measurable performance differences. For ordinary users ls is
likely to stat each of the directory entries and tid_fd_mode called
from tid_fd_revalidae has always taken the task lock for each file
descriptor. So this does not look like it will be a big change in
practice.
At some point is will probably be worth changing put_files_struct to
free files_struct after an rcu grace period so that task_lock won't be
needed at all.
[1] https://lkml.kernel.org/r/[email protected]
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count. Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.
Instead of manually coding finding the files struct for a task and
then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu
that combines those to steps. Making the code simpler and removing
the need to get a reference on a files_struct.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for
querying an arbitrary process about a specific file.
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Also remove the confusing comment about checking if a fd exists. I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
This change renames fcheck_files to files_lookup_fd_rcu. All of the
remaining callers take the rcu_read_lock before calling this function
so the _rcu suffix is appropriate. This change also tightens up the
debug check to verify that all callers hold the rcu_read_lock.
All callers that used to call files_check with the files->file_lock
held have now been changed to call files_lookup_fd_locked.
This change of name has helped remind me of which locks and which
guarantees are in place helping me to catch bugs later in the
patchset.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
To make it easy to tell where files->file_lock protection is being
used when looking up a file create files_lookup_fd_locked. Only allow
this function to be called with the file_lock held.
Update the callers of fcheck and fcheck_files that are called with the
files->file_lock held to call files_lookup_fd_locked instead.
Hopefully this makes it easier to quickly understand what is going on.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
The function fcheck despite it's comment is poorly named
as it has no callers that only check it's return value.
All of fcheck's callers use the returned file descriptor.
The same is true for fcheck_files and __fcheck_files.
A new less confusing name is needed. In addition the names
of these functions are confusing as they do not report
the kind of locks that are needed to be held when these
functions are called making error prone to use them.
To remedy this I am making the base functio name lookup_fd
and will and prefixes and sufficies to indicate the rest
of the context.
Name the function (previously called __fcheck_files) that proceeds
from a struct files_struct, looks up the struct file of a file
descriptor, and requires it's callers to verify all of the appropriate
locks are held files_lookup_fd_raw.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count. Those unnecessary increments can
result in exec unnecessarily unsharing files_struct which breaking
posix locks, and it can result in fget_light having to fallback to
fget reducing system performance.
Simplifying proc_fd_link is a little bit tricky. It is necessary to
know that there is a reference to fd_f ile while path_get is running.
This reference can either be guaranteed to exist either by locking the
fdtable as the code currently does or by taking a reference on the
file in question.
Use fget_task to remove the need for get_files_struct and
to take a reference to file in question.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Oleg Nesterov <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Now that exec no longer needs to restore the previous value of current->files
on error there are no more callers of reset_files_struct so remove it.
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Now that exec no longer needs to return the unshared files to their
previous value there is no reason to return displaced.
Instead when unshare_fd creates a copy of the file table, call
put_files_struct before returning from unshare_files.
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Many moons ago the binfmts were doing some very questionable things
with file descriptors and an unsharing of the file descriptor table
was added to make things better[1][2]. The helper steal_lockss was
added to avoid breaking the userspace programs[3][4][6].
Unfortunately it turned out that steal_locks did not work for network
file systems[5], so it was removed to see if anyone would
complain[7][8]. It was thought at the time that NPTL would not be
affected as the unshare_files happened after the other threads were
killed[8]. Unfortunately because there was an unshare_files in
binfmt_elf.c before the threads were killed this analysis was
incorrect.
This unshare_files in binfmt_elf.c resulted in the unshares_files
happening whenever threads were present. Which led to unshare_files
being moved to the start of do_execve[9].
Later the problems were rediscovered and the suggested approach was to
readd steal_locks under a different name[10]. I happened to be
reviewing patches and I noticed that this approach was a step
backwards[11].
I proposed simply moving unshare_files[12] and it was pointed
out that moving unshare_files without auditing the code was
also unsafe[13].
There were then several attempts to solve this[14][15][16] and I even
posted this set of changes[17]. Unfortunately because auditing all of
execve is time consuming this change did not make it in at the time.
Well now that I am cleaning up exec I have made the time to read
through all of the binfmts and the only playing with file descriptors
is either the security modules closing them in
security_bprm_committing_creds or is in the generic code in fs/exec.c.
None of it happens before begin_new_exec is called.
So move unshare_files into begin_new_exec, after the point of no
return. If memory is very very very low and the application calling
exec is sharing file descriptor tables between processes we might fail
past the point of no return. Which is unfortunate but no different
than any of the other places where we allocate memory after the point
of no return.
This movement allows another process that shares the file table, or
another thread of the same process and that closes files or changes
their close on exec behavior and races with execve to cause some
unexpected things to happen. There is only one time of check to time
of use race and it is just there so that execve fails instead of
an interpreter failing when it tries to open the file it is supposed
to be interpreting. Failing later if userspace is being silly is
not a problem.
With this change it the following discription from the removal
of steal_locks[8] finally becomes true.
Apps using NPTL are not affected, since all other threads are killed before
execve.
Apps using LinuxThreads are only affected if they
- have multiple threads during exec (LinuxThreads doesn't kill other
threads, the app may do it with pthread_kill_other_threads_np())
- rely on POSIX locks being inherited across exec
Both conditions are documented, but not their interaction.
Apps using clone() natively are affected if they
- use clone(CLONE_FILES)
- rely on POSIX locks being inherited across exec
I have investigated some paths to make it possible to solve this
without moving unshare_files but they all look more complicated[18].
Reported-by: Daniel P. Berrangé <[email protected]>
Reported-by: Jeff Layton <[email protected]>
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
[1] 02cda956de0b ("[PATCH] unshare_files"
[2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper")
[3] 088f5d7244de ("[PATCH] add steal_locks helper")
[4] 02c541ec8ffa ("[PATCH] use new steal_locks helper")
[5] https://lkml.kernel.org/r/[email protected]
[6] https://lkml.kernel.org/r/[email protected]
[7] https://lkml.kernel.org/r/[email protected]
[8] c89681ed7d0e ("[PATCH] remove steal_locks()")
[9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()")
[10] https://lkml.kernel.org/r/[email protected]
[11] https://lkml.kernel.org/r/[email protected]
[12] https://lkml.kernel.org/r/[email protected]
[13] https://lkml.kernel.org/r/[email protected]
[14] https://lkml.kernel.org/r/[email protected]
[15] https://lkml.kernel.org/r/[email protected]
[16] https://lkml.kernel.org/r/[email protected]
[17] https://lkml.kernel.org/r/[email protected]
[18] https://lkml.kernel.org/r/[email protected]
Acked-by: Christian Brauner <[email protected]>
v1: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
Al Viro pointed out that using the phrase "close_on_exec(fd,
rcu_dereference_raw(current->files->fdt))" instead of wrapping it in
rcu_read_lock(), rcu_read_unlock() is a very questionable
optimization[1].
Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase
becomes equivalent the helper function get_close_on_exec so
simplify the code and make it more robust by simply using
get_close_on_exec.
[1] https://lkml.kernel.org/r/[email protected]
Suggested-by: Al Viro <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric W. Biederman <[email protected]>
|
|
This patch addresses minor issues in compression chksum.
Fixes: b28f047b28c5 ("f2fs: compress: support chksum")
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
syzbot reported a bug which could cause shift-out-of-bounds issue,
fix it.
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x107/0x163 lib/dump_stack.c:120
ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
sanity_check_raw_super fs/f2fs/super.c:2812 [inline]
read_raw_super_block fs/f2fs/super.c:3267 [inline]
f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519
mount_bdev+0x34d/0x410 fs/super.c:1366
legacy_get_tree+0x105/0x220 fs/fs_context.c:592
vfs_get_tree+0x89/0x2f0 fs/super.c:1496
do_new_mount fs/namespace.c:2896 [inline]
path_mount+0x12ae/0x1e70 fs/namespace.c:3227
do_mount fs/namespace.c:3240 [inline]
__do_sys_mount fs/namespace.c:3448 [inline]
__se_sys_mount fs/namespace.c:3425 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: [email protected]
Signed-off-by: Anant Thazhemadam <[email protected]>
Signed-off-by: Chao Yu <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
|
|
Jan Kara's analysis of the syzbot report (edited):
The reproducer opens a directory on FUSE filesystem, it then attaches
dnotify mark to the open directory. After that a fuse_do_getattr() call
finds that attributes returned by the server are inconsistent, and calls
make_bad_inode() which, among other things does:
inode->i_mode = S_IFREG;
This then confuses dnotify which doesn't tear down its structures
properly and eventually crashes.
Avoid calling make_bad_inode() on a live inode: switch to a private flag on
the fuse inode. Also add the test to ops which the bad_inode_ops would
have caught.
This bug goes back to the initial merge of fuse in 2.6.14...
Reported-by: [email protected]
Signed-off-by: Miklos Szeredi <[email protected]>
Tested-by: Jan Kara <[email protected]>
Cc: <[email protected]>
|
|
Ensure that both getxattr and listxattr page array are correctly
aligned, and that getxattr correctly accounts for the page padding word.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
In zonefs_file_dio_append(), the pages obtained using
bio_iov_iter_get_pages() are not released on completion of the
REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
Furthermore, a call to bio_put() is missing when
bio_iov_iter_get_pages() fails.
Fix these resource leaks by adding BIO resource release code (bio_put()i
and bio_release_pages()) at the end of the function after the BIO
execution and add a jump to this resource cleanup code in case of
bio_iov_iter_get_pages() failure.
While at it, also fix the call to task_io_account_write() to be passed
the correct BIO size instead of bio_iov_iter_get_pages() return value.
Reported-by: Christoph Hellwig <[email protected]>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
|
|
Surprisingly, `block' in sector_t indicates the number of
i_blkbits-sized blocks rather than sectors for bmap.
In addition, considering buffer_head limits mapped size to 32-bits,
should avoid using generic_block_bmap.
Link: https://lore.kernel.org/r/[email protected]
Fixes: 9da681e017a3 ("staging: erofs: support bmap")
Reviewed-by: Chao Yu <[email protected]>
Reviewed-by: Gao Xiang <[email protected]>
Signed-off-by: Huang Jianan <[email protected]>
Signed-off-by: Guo Weichao <[email protected]>
[ Gao Xiang: slightly update the commit message description. ]
Signed-off-by: Gao Xiang <[email protected]>
|
|
Signed-off-by: Chunguang Xu <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Since ext4_data_block_valid() has been renamed to ext4_inode_block_valid(),
the related comments need to be updated.
Signed-off-by: Chunguang Xu <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
|
|
Checking !list_empty(&ctx->cq_overflow_list) around noflush in
io_cqring_events() is racy, because if it fails but a request overflowed
just after that, io_cqring_overflow_flush() still will be called.
Remove the second check, it shouldn't be a problem for performance,
because there is cq_check_overflow bit check just above.
Cc: <[email protected]> # 5.5+
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without
hodling uring_lock, because it does synchronisation differently. Make
sure we have it.
As for io_ring_exit_work(), we don't even need it there because
io_ring_ctx_wait_and_kill() already set force flag making all overflowed
requests to be dropped.
Cc: <[email protected]> # 5.5+
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
IOPOLL allows buffer remove/provide requests, but they doesn't
synchronise by rules of IOPOLL, namely it have to hold uring_lock.
Cc: <[email protected]> # 5.7+
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
[ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
[ 95.505921]
[ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B
W 5.10.0-rc5+ #1
[ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 95.508248] Call Trace:
[ 95.508683] dump_stack+0x107/0x163
[ 95.509323] ? io_commit_cqring+0x3ec/0x8e0
[ 95.509982] print_address_description.constprop.0+0x3e/0x60
[ 95.510814] ? vprintk_func+0x98/0x140
[ 95.511399] ? io_commit_cqring+0x3ec/0x8e0
[ 95.512036] ? io_commit_cqring+0x3ec/0x8e0
[ 95.512733] kasan_report_invalid_free+0x51/0x80
[ 95.513431] ? io_commit_cqring+0x3ec/0x8e0
[ 95.514047] __kasan_slab_free+0x141/0x160
[ 95.514699] kfree+0xd1/0x390
[ 95.515182] io_commit_cqring+0x3ec/0x8e0
[ 95.515799] __io_req_complete.part.0+0x64/0x90
[ 95.516483] io_wq_submit_work+0x1fa/0x260
[ 95.517117] io_worker_handle_work+0xeac/0x1c00
[ 95.517828] io_wqe_worker+0xc94/0x11a0
[ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00
[ 95.519151] ? __kthread_parkme+0x11d/0x1d0
[ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00
[ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00
[ 95.521211] kthread+0x396/0x470
[ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30
[ 95.522380] ? kthread_mod_delayed_work+0x180/0x180
[ 95.523108] ret_from_fork+0x22/0x30
[ 95.523684]
[ 95.523985] Allocated by task 4035:
[ 95.524543] kasan_save_stack+0x1b/0x40
[ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0
[ 95.525882] kmem_cache_alloc_trace+0x17b/0x310
[ 95.533930] io_queue_sqe+0x225/0xcb0
[ 95.534505] io_submit_sqes+0x1768/0x25f0
[ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10
[ 95.535900] do_syscall_64+0x33/0x40
[ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 95.537199]
[ 95.537505] Freed by task 4035:
[ 95.538003] kasan_save_stack+0x1b/0x40
[ 95.538599] kasan_set_track+0x1c/0x30
[ 95.539177] kasan_set_free_info+0x1b/0x30
[ 95.539798] __kasan_slab_free+0x112/0x160
[ 95.540427] kfree+0xd1/0x390
[ 95.540910] io_commit_cqring+0x3ec/0x8e0
[ 95.541516] io_iopoll_complete+0x914/0x1390
[ 95.542150] io_do_iopoll+0x580/0x700
[ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200
[ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340
[ 95.544206] io_uring_release+0x43/0x50
[ 95.544791] __fput+0x28d/0x940
[ 95.545291] task_work_run+0xea/0x1b0
[ 95.545873] do_exit+0xb6a/0x2c60
[ 95.546400] do_group_exit+0x12a/0x320
[ 95.546967] __x64_sys_exit_group+0x3f/0x50
[ 95.547605] do_syscall_64+0x33/0x40
[ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
The reason is that once we got a non EAGAIN error in io_wq_submit_work(),
we'll complete req by calling io_req_complete(), which will hold completion_lock
to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't
hold completion_lock to call io_commit_cqring(), then there maybe concurrent
access to ctx->defer_list, double free may happen.
To fix this bug, we always let io_iopoll_complete() complete polled io.
Cc: <[email protected]> # 5.5+
Reported-by: Abaci Fuzz <[email protected]>
Signed-off-by: Xiaoguang Wang <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Support timeout updates through IORING_OP_TIMEOUT_REMOVE with passed in
IORING_TIMEOUT_UPDATE. Updates doesn't support offset timeout mode.
Oirignal timeout.off will be ignored as well.
Signed-off-by: Pavel Begunkov <[email protected]>
[axboe: remove now unused 'ret' variable]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add io_timeout_extract() helper, which searches and disarms timeouts,
but doesn't complete them. No functional changes.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring_cancel_files()'s task check condition mistakenly got flipped.
1. There can't be a request in the inflight list without
IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler.
2. Also, don't call the function for files==NULL to not do such a check,
all that staff is already handled well by its counter part,
__io_uring_cancel_task_requests().
With that just flip the task check.
Also, it iowq-cancels all request of current task there, don't forget to
set right ->files into struct io_task_cancel.
Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task")
Reported-by: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_file_data_ref_zero() can be invoked from soft-irq from the RCU core,
hence we need to ensure that the file_data lock is bottom half safe. Use
the _bh() variants when grabbing this lock.
Reported-by: [email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_req_init() doesn't decrement state->ios_left if a request doesn't
need ->file, it just returns before that on if(!needs_file). That's
not really a problem but may cause overhead for an additional fput().
Also inline and kill io_req_set_file() as it's of no use anymore.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Keep submit state invariant of whether there are file refs left based on
state->nr_refs instead of (state->file==NULL), and always check against
the first one. It's easier to track and allows to remove 1 if. It also
automatically leaves struct submit_state in a consistent state after
io_submit_state_end(), that's not used yet but nice.
btw rename has_refs to file_refs for more clarity.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
syzbot reports following issue:
INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
task:syz-executor.2 state:D stack:28744 pid:12399 ppid: 8504 flags:0x00004004
Call Trace:
context_switch kernel/sched/core.c:3773 [inline]
__schedule+0x893/0x2170 kernel/sched/core.c:4522
schedule+0xcf/0x270 kernel/sched/core.c:4600
schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
do_wait_for_common kernel/sched/completion.c:85 [inline]
__wait_for_common kernel/sched/completion.c:106 [inline]
wait_for_common kernel/sched/completion.c:117 [inline]
wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
kthread_stop+0x17a/0x720 kernel/kthread.c:596
io_put_sq_data fs/io_uring.c:7193 [inline]
io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
io_finish_async fs/io_uring.c:7297 [inline]
io_sq_offload_create fs/io_uring.c:8015 [inline]
io_uring_create fs/io_uring.c:9433 [inline]
io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: Unable to access opcode bytes at RIP 0x45de8f.
RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Currently we don't have a reproducer yet, but seems that there is a
race in current codes:
=> io_put_sq_data
ctx_list is empty now. |
==> kthread_park(sqd->thread); |
| T1: sq thread is parked now.
==> kthread_stop(sqd->thread); |
KTHREAD_SHOULD_STOP is set now.|
===> kthread_unpark(k); |
| T2: sq thread is now unparkd, run again.
|
| T3: sq thread is now preempted out.
|
===> wake_up_process(k); |
|
| T4: Since sqd ctx_list is empty, needs_sched will be true,
| then sq thread sets task state to TASK_INTERRUPTIBLE,
| and schedule, now sq thread will never be waken up.
===> wait_for_completion |
I have artificially used mdelay() to simulate above race, will get same
stack like this syzbot report, but to be honest, I'm not sure this code
race triggers syzbot report.
To fix this possible code race, when sq thread is unparked, need to check
whether sq thread has been stopped.
Reported-by: [email protected]
Signed-off-by: Xiaoguang Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Double fixed files for splice/tee are done in a nasty way, it takes 2
ref_node refs, and during the second time it blindly overrides
req->fixed_file_refs hoping that it haven't changed. That works because
all that is done under iouring_lock in a single go but is error-prone.
Bind everything explicitly to a single ref_node and take only one ref,
with current ref_node ordering it's guaranteed to keep all files valid
awhile the request is inflight.
That's mainly a cleanup + preparation for generic resource handling,
but also saves pcpu_ref get/put for splice/tee with 2 fixed files.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
As tasks now cancel only theirs requests, and inflight_wait is awaited
only in io_uring_cancel_files(), which should be called with ->in_idle
set, instead of keeping a separate inflight_wait use tctx->wait.
That will add some spurious wakeups but actually is safer from point of
not hanging the task.
e.g.
task1 | IRQ
| *start* io_complete_rw_common(link)
| link: req1 -> req2 -> req3(with files)
*cancel_files() |
io_wq_cancel(), etc. |
| put_req(link), adds to io-wq req2
schedule() |
So, task1 will never try to cancel req2 or req3. If req2 is
long-standing (e.g. read(empty_pipe)), this may hang.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We don't even allow not plain data msg_control, which is disallowed in
__sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
there are cases that can be, e.g. IOSQE_ASYNC.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread
task context or in io worker task context. If current task context is sq
thread, we don't need to check whether should wake up sq thread.
io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory
barrier, before this patch, perf shows obvious overhead:
Samples: 481K of event 'cycles', Event count (approx.): 299807382878
Overhead Comma Shared Object Symbol
3.69% :9630 [kernel.vmlinux] [k] io_issue_sqe
With this patch, perf shows:
Samples: 482K of event 'cycles', Event count (approx.): 299929547283
Overhead Comma Shared Object Symbol
0.70% :4015 [kernel.vmlinux] [k] io_issue_sqe
It shows some obvious improvements.
Signed-off-by: Xiaoguang Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Both IOPOLL and sqes handling need to acquire uring_lock, combine
them together, then we just need to acquire uring_lock once.
Signed-off-by: Xiaoguang Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Some static checker reports below warning:
fs/io_uring.c:6939 io_sq_thread()
error: uninitialized symbol 'timeout'.
This is a false positive, but let's just initialize 'timeout' to make
sure we don't trip over this.
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Xiaoguang Wang <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
There are some issues about current io_sq_thread() implementation:
1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
multiple ctxs share one same poll thread, one ctx will put poll thread
in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't
need to change task's stat at all. I think only if all ctxs don't have
work to do, we can do it.
2. We use round-robin strategy to make multiple ctxs share one same
poll thread, but there are various condition in __io_sq_thread(), which
seems complicated and may affect round-robin strategy.
To improve above issues, I take below actions:
1. If multiple ctxs share one same poll thread, only if all all ctxs
don't have work to do, we can call prepare_to_wait() and schedule() to
make poll thread enter sleep state.
2. To make round-robin strategy more straight, I simplify
__io_sq_thread() a bit, it just does io poll and sqes submit work once,
does not check various condition.
3. For multiple ctxs share one same poll thread, we choose the biggest
sq_thread_idle among these ctxs as timeout condition, and will update
it when ctx is in or out.
4. Not need to check EBUSY especially, if io_submit_sqes() returns
EBUSY, IORING_SQ_CQ_OVERFLOW should be set, helper in liburing should
be aware of cq overflow and enters kernel to flush work.
Signed-off-by: Xiaoguang Wang <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check if there anything left.
In many cases it should be faster, and we can reuse a lot of code from
task cancellation.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Make io_poll_remove_all() and io_kill_timeouts() to match against files
as well. A preparation patch, effectively not used by now.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring_cancel_files() guarantees to cancel all matching requests,
that's not necessary to do that in a loop. Move it up in the callchain
into io_uring_cancel_task_requests().
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring_cancel_files() cancels all request that match files regardless
of task. There is no real need in that, cancel only requests of the
specified task. That also handles SQPOLL case as it already changes task
to it.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add io_match_task() that matches both task and files.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
If IORING_SETUP_SQPOLL is set all requests belong to the corresponding
SQPOLL task, so skip task checking in that case and always match.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Inline io_import_iovec() and leave only its former __io_import_iovec()
renamed to the original name. That makes it more obious what is reused in
io_read/write().
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_size and iov_count in io_read() and io_write() hold the same value,
kill the last one.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This is the only code that relies on import_iovec() returning
iter.count on success.
This allows a better interface to import_iovec().
Signed-off-by: David Laight <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
SQPOLL task may find sqo_task->files == NULL and
__io_sq_thread_acquire_files() would leave it unset, so following
fget_many() and others try to dereference NULL and fault. Propagate
an error files are missing.
[ 118.962785] BUG: kernel NULL pointer dereference, address:
0000000000000020
[ 118.963812] #PF: supervisor read access in kernel mode
[ 118.964534] #PF: error_code(0x0000) - not-present page
[ 118.969029] RIP: 0010:__fget_files+0xb/0x80
[ 119.005409] Call Trace:
[ 119.005651] fget_many+0x2b/0x30
[ 119.005964] io_file_get+0xcf/0x180
[ 119.006315] io_submit_sqes+0x3a4/0x950
[ 119.007481] io_sq_thread+0x1de/0x6a0
[ 119.007828] kthread+0x114/0x150
[ 119.008963] ret_from_fork+0x22/0x30
Reported-by: Josef Grieb <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|