path: root/fs
Age | Commit message | Author | Files | Lines
2020-12-10 | proc/fd: In fdinfo seq_show don't use get_files_struct | Eric W. Biederman | 1 | -4/+5
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, thereby breaking posix locks, and they can result in fget_light having to fall back to fget, reducing system performance. Instead, hold task_lock for the duration that task->files needs to be stable in seq_show. The task_lock was already taken in get_files_struct, so skipping get_files_struct performs less work overall and avoids the problems with the files_struct reference count. [1] https://lkml.kernel.org/r/[email protected] Suggested-by: Oleg Nesterov <[email protected]> Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
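For context, a minimal sketch of the pattern this commit describes: task_lock() pins task->files for the duration of the lookup, so no files_struct reference is taken. This is illustrative and simplified, not the literal fs/proc/fd.c code.

    task_lock(task);                        /* keeps task->files stable */
    files = task->files;
    if (files) {
            spin_lock(&files->file_lock);
            file = files_lookup_fd_locked(files, fd);
            if (file)
                    get_file(file);         /* pin the file before dropping locks */
            spin_unlock(&files->file_lock);
    }
    task_unlock(task);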
2020-12-10 | proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu | Eric W. Biederman | 1 | -12/+5
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, thereby breaking posix locks, and they can result in fget_light having to fall back to fget, reducing system performance. Using task_lookup_next_fd_rcu simplifies proc_readfd_common by moving the check for the maximum file descriptor into the generic code and by removing the need to capture and release a reference on files_struct. As task_lookup_next_fd_rcu may update the fd, ctx->pos has been changed to be the fd + 2 after task_lookup_next_fd_rcu returns. [1] https://lkml.kernel.org/r/[email protected] Suggested-by: Oleg Nesterov <[email protected]> Tested-by: Andy Lavr <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
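Roughly, the resulting readdir loop looks like the sketch below (simplified; the dirent emission is omitted). The fd + 2 offset accounts for the "." and ".." entries.

    rcu_read_lock();
    for (fd = ctx->pos - 2;
         (file = task_lookup_next_fd_rcu(p, &fd));
         fd++, ctx->pos = fd + 2) {
            /* emit one directory entry named after fd, e.g. via proc_fill_cache() */
    }
    rcu_read_unlock();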
2020-12-10 | file: Implement task_lookup_next_fd_rcu | Eric W. Biederman | 1 | -0/+21
As a companion to fget_task and task_lookup_fd_rcu, implement task_lookup_next_fd_rcu, which will return the struct file for the first file descriptor number that is equal to or greater than the fd argument value, or NULL if there is no such struct file. This allows file descriptors of foreign processes to be iterated through safely, without needing to increment the count on files_struct. Some concern[1] has been expressed that this function takes the task_lock for each iteration and thus for each file descriptor. The place where this function will be called in a commonly used code path is for listing /proc/<pid>/fd. I did some small benchmarks and did not see any measurable performance differences. For ordinary users ls is likely to stat each of the directory entries, and tid_fd_mode called from tid_fd_revalidate has always taken the task lock for each file descriptor. So this does not look like it will be a big change in practice. At some point it will probably be worth changing put_files_struct to free files_struct after an rcu grace period so that task_lock won't be needed at all. [1] https://lkml.kernel.org/r/[email protected] v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
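A sketch of what such a helper can look like, scanning forward from *ret_fd under task_lock; the mainline implementation in fs/file.c is close to this, but treat the code below as illustrative.

    struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret_fd)
    {
            /* Caller must hold rcu_read_lock() */
            struct file *file = NULL;
            unsigned int fd = *ret_fd;
            struct files_struct *files;

            task_lock(task);
            files = task->files;
            if (files) {
                    for (; fd < files_fdtable(files)->max_fds; fd++) {
                            file = files_lookup_fd_rcu(files, fd);
                            if (file)
                                    break;
                    }
            }
            task_unlock(task);
            *ret_fd = fd;
            return file;
    }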
2020-12-10 | proc/fd: In tid_fd_mode use task_lookup_fd_rcu | Eric W. Biederman | 1 | -6/+1
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, thereby breaking posix locks, and they can result in fget_light having to fall back to fget, reducing system performance. Instead of manually finding the files struct for a task and then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu that combines those two steps, making the code simpler and removing the need to get a reference on a files_struct. [1] https://lkml.kernel.org/r/[email protected] Suggested-by: Oleg Nesterov <[email protected]> Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2020-12-10 | file: Implement task_lookup_fd_rcu | Eric W. Biederman | 1 | -0/+15
As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for querying an arbitrary process about a specific file. Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
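The helper is essentially task_lock() wrapped around the existing per-files lookup; a hedged sketch:

    struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
    {
            /* Caller must hold rcu_read_lock() */
            struct file *file = NULL;
            struct files_struct *files;

            task_lock(task);
            files = task->files;
            if (files)
                    file = files_lookup_fd_rcu(files, fd);
            task_unlock(task);
            return file;
    }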
2020-12-10 | file: Rename fcheck lookup_fd_rcu | Eric W. Biederman | 1 | -1/+1
Also remove the confusing comment about checking if a fd exists. I could not find one instance in the entire kernel that still matches the description or the reason for the name fcheck. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2020-12-10 | file: Replace fcheck_files with files_lookup_fd_rcu | Eric W. Biederman | 2 | -4/+4
This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function, so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call fcheck_files with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place, helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
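The tightened helper amounts to a thin wrapper that asserts rcu_read_lock() is held before doing the raw lookup; a hedged sketch:

    static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd)
    {
            RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
                             "suspicious rcu_dereference_check() usage");
            return files_lookup_fd_raw(files, fd);
    }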
2020-12-10 | file: Factor files_lookup_fd_locked out of fcheck_files | Eric W. Biederman | 3 | -8/+10
To make it easy to tell where files->file_lock protection is being used when looking up a file create files_lookup_fd_locked. Only allow this function to be called with the file_lock held. Update the callers of fcheck and fcheck_files that are called with the files->file_lock held to call files_lookup_fd_locked instead. Hopefully this makes it easier to quickly understand what is going on. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
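Correspondingly, the locked variant asserts files->file_lock instead of the RCU read lock; again a sketch rather than the exact mainline definition:

    static inline struct file *files_lookup_fd_locked(struct files_struct *files, unsigned int fd)
    {
            RCU_LOCKDEP_WARN(!lockdep_is_held(&files->file_lock),
                             "suspicious rcu_dereference_check() usage");
            return files_lookup_fd_raw(files, fd);
    }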
2020-12-10 | file: Rename __fcheck_files to files_lookup_fd_raw | Eric W. Biederman | 1 | -1/+1
The function fcheck, despite its comment, is poorly named as it has no callers that only check its return value. All of fcheck's callers use the returned struct file. The same is true for fcheck_files and __fcheck_files. A new, less confusing name is needed. In addition, the names of these functions are confusing as they do not indicate which locks need to be held when these functions are called, making them error prone to use. To remedy this I am making the base function name lookup_fd and will add prefixes and suffixes to indicate the rest of the context. Name the function (previously called __fcheck_files) that proceeds from a struct files_struct, looks up the struct file of a file descriptor, and requires its callers to verify that all of the appropriate locks are held, files_lookup_fd_raw. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
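The raw lookup itself just indexes the fdtable and leaves all locking guarantees to the caller; roughly (illustrative, not the verbatim header):

    static inline struct file *files_lookup_fd_raw(struct files_struct *files, unsigned int fd)
    {
            struct fdtable *fdt = rcu_dereference_raw(files->fdt);

            if (fd < fdt->max_fds) {
                    fd = array_index_nospec(fd, fdt->max_fds);
                    return rcu_dereference_raw(fdt->fd[fd]);
            }
            return NULL;
    }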
2020-12-10 | proc/fd: In proc_fd_link use fget_task | Eric W. Biederman | 1 | -10/+3
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that switching them to helper functions instead will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct, thereby breaking posix locks, and they can result in fget_light having to fall back to fget, reducing system performance. Simplifying proc_fd_link is a little bit tricky. It is necessary to know that there is a reference to fd_file while path_get is running. This reference can be guaranteed to exist either by locking the fdtable, as the code currently does, or by taking a reference on the file in question. Use fget_task to remove the need for get_files_struct and to take a reference to the file in question. [1] https://lkml.kernel.org/r/[email protected] Suggested-by: Oleg Nesterov <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
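A sketch of the simplified proc_fd_link() flow after this change (error handling trimmed; the exact helper calls follow the description above and may differ from the real file):

    static int proc_fd_link(struct dentry *dentry, struct path *path)
    {
            struct task_struct *task = get_proc_task(d_inode(dentry));
            unsigned int fd = proc_fd(d_inode(dentry));
            struct file *fd_file;
            int ret = -ENOENT;

            if (!task)
                    return ret;

            fd_file = fget_task(task, fd);  /* takes its own reference on the file */
            put_task_struct(task);
            if (fd_file) {
                    *path = fd_file->f_path;
                    path_get(path);         /* safe: fd_file pins the path */
                    fput(fd_file);
                    ret = 0;
            }
            return ret;
    }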
2020-12-10 | exec: Remove reset_files_struct | Eric W. Biederman | 1 | -12/+0
Now that exec no longer needs to restore the previous value of current->files on error there are no more callers of reset_files_struct so remove it. Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2020-12-10 | exec: Simplify unshare_files | Eric W. Biederman | 2 | -8/+2
Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2020-12-10 | exec: Move unshare_files to fix posix file locking during exec | Eric W. Biederman | 1 | -14/+15
Many moons ago the binfmts were doing some very questionable things with file descriptors, and an unsharing of the file descriptor table was added to make things better[1][2]. The helper steal_locks was added to avoid breaking userspace programs[3][4][6]. Unfortunately it turned out that steal_locks did not work for network file systems[5], so it was removed to see if anyone would complain[7][8]. It was thought at the time that NPTL would not be affected, as the unshare_files happened after the other threads were killed[8]. Unfortunately, because there was an unshare_files in binfmt_elf.c before the threads were killed, this analysis was incorrect. This unshare_files in binfmt_elf.c resulted in the unshare_files happening whenever threads were present, which led to unshare_files being moved to the start of do_execve[9]. Later the problems were rediscovered and the suggested approach was to re-add steal_locks under a different name[10]. I happened to be reviewing patches and I noticed that this approach was a step backwards[11]. I proposed simply moving unshare_files[12] and it was pointed out that moving unshare_files without auditing the code was also unsafe[13]. There were then several attempts to solve this[14][15][16] and I even posted this set of changes[17]. Unfortunately, because auditing all of execve is time consuming, this change did not make it in at the time. Well, now that I am cleaning up exec I have made the time to read through all of the binfmts, and the only playing with file descriptors is either the security modules closing them in security_bprm_committing_creds or in the generic code in fs/exec.c. None of it happens before begin_new_exec is called. So move unshare_files into begin_new_exec, after the point of no return. If memory is very very very low and the application calling exec is sharing file descriptor tables between processes, we might fail past the point of no return. That is unfortunate but no different from any of the other places where we allocate memory after the point of no return. This movement allows another process that shares the file table, or another thread of the same process, that closes files or changes their close-on-exec behavior while racing with execve, to cause some unexpected things to happen. There is only one time-of-check to time-of-use race, and it is just there so that execve fails instead of an interpreter failing when it tries to open the file it is supposed to be interpreting. Failing later if userspace is being silly is not a problem. With this change the following description from the removal of steal_locks[8] finally becomes true. Apps using NPTL are not affected, since all other threads are killed before execve. Apps using LinuxThreads are only affected if they - have multiple threads during exec (LinuxThreads doesn't kill other threads, the app may do it with pthread_kill_other_threads_np()) - rely on POSIX locks being inherited across exec Both conditions are documented, but not their interaction. Apps using clone() natively are affected if they - use clone(CLONE_FILES) - rely on POSIX locks being inherited across exec I have investigated some paths to make it possible to solve this without moving unshare_files, but they all look more complicated[18]. Reported-by: Daniel P. 
Berrangé <[email protected]> Reported-by: Jeff Layton <[email protected]> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1] 02cda956de0b ("[PATCH] unshare_files") [2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper") [3] 088f5d7244de ("[PATCH] add steal_locks helper") [4] 02c541ec8ffa ("[PATCH] use new steal_locks helper") [5] https://lkml.kernel.org/r/[email protected] [6] https://lkml.kernel.org/r/[email protected] [7] https://lkml.kernel.org/r/[email protected] [8] c89681ed7d0e ("[PATCH] remove steal_locks()") [9] fd8328be874f ("[PATCH] sanitize handling of shared descriptor tables in failing execve()") [10] https://lkml.kernel.org/r/[email protected] [11] https://lkml.kernel.org/r/[email protected] [12] https://lkml.kernel.org/r/[email protected] [13] https://lkml.kernel.org/r/[email protected] [14] https://lkml.kernel.org/r/[email protected] [15] https://lkml.kernel.org/r/[email protected] [16] https://lkml.kernel.org/r/[email protected] [17] https://lkml.kernel.org/r/[email protected] [18] https://lkml.kernel.org/r/[email protected] Acked-by: Christian Brauner <[email protected]> v1: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
2020-12-10 | exec: Don't open code get_close_on_exec | Eric W. Biederman | 1 | -2/+1
Al Viro pointed out that using the phrase "close_on_exec(fd, rcu_dereference_raw(current->files->fdt))" instead of wrapping it in rcu_read_lock(), rcu_read_unlock() is a very questionable optimization[1]. Once wrapped with rcu_read_lock()/rcu_read_unlock(), that phrase becomes equivalent to the helper function get_close_on_exec, so simplify the code and make it more robust by simply using get_close_on_exec. [1] https://lkml.kernel.org/r/[email protected] Suggested-by: Al Viro <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric W. Biederman <[email protected]>
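For reference, the helper is roughly the open-coded phrase wrapped in the RCU read lock (a simplified sketch of the existing fs/file.c helper):

    bool get_close_on_exec(unsigned int fd)
    {
            bool res;

            rcu_read_lock();
            res = close_on_exec(fd, files_fdtable(current->files));
            rcu_read_unlock();
            return res;
    }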
2020-12-10 | f2fs: compress: fix compression chksum | Chao Yu | 2 | -2/+1
This patch addresses minor issues in compression chksum. Fixes: b28f047b28c5 ("f2fs: compress: support chksum") Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2020-12-10 | f2fs: fix shift-out-of-bounds in sanity_check_raw_super() | Chao Yu | 1 | -5/+4
syzbot reported a bug which could cause shift-out-of-bounds issue, fix it. Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 sanity_check_raw_super fs/f2fs/super.c:2812 [inline] read_raw_super_block fs/f2fs/super.c:3267 [inline] f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519 mount_bdev+0x34d/0x410 fs/super.c:1366 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1496 do_new_mount fs/namespace.c:2896 [inline] path_mount+0x12ae/0x1e70 fs/namespace.c:3227 do_mount fs/namespace.c:3240 [inline] __do_sys_mount fs/namespace.c:3448 [inline] __se_sys_mount fs/namespace.c:3425 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: [email protected] Signed-off-by: Anant Thazhemadam <[email protected]> Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2020-12-10 | fuse: fix bad inode | Miklos Szeredi | 7 | -17/+74
Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: [email protected] Signed-off-by: Miklos Szeredi <[email protected]> Tested-by: Jan Kara <[email protected]> Cc: <[email protected]>
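A hedged sketch of the private-flag approach described above. The flag and helper names (FUSE_I_BAD, fuse_make_bad, fuse_is_bad) are assumptions drawn from the description, and the real patch touches considerably more places:

    /* sketch, fuse_i.h style */
    static inline void fuse_make_bad(struct inode *inode)
    {
            set_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state);
    }

    static inline bool fuse_is_bad(struct inode *inode)
    {
            return unlikely(test_bit(FUSE_I_BAD, &get_fuse_inode(inode)->state));
    }

    /* ops then check the flag where bad_inode_ops would have caught it, e.g.: */
    if (fuse_is_bad(inode))
            return -EIO;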
2020-12-10 | NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages() | Trond Myklebust | 1 | -5/+7
Ensure that both the getxattr and listxattr page arrays are correctly aligned, and that getxattr correctly accounts for the page padding word. Signed-off-by: Trond Myklebust <[email protected]>
2020-12-10 | zonefs: fix page reference and BIO leak | Damien Le Moal | 1 | -6/+8
In zonefs_file_dio_append(), the pages obtained using bio_iov_iter_get_pages() are not released on completion of the REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails. Furthermore, a call to bio_put() is missing when bio_iov_iter_get_pages() fails. Fix these resource leaks by adding BIO resource release code (bio_put() and bio_release_pages()) at the end of the function after the BIO execution, and add a jump to this resource cleanup code in case of bio_iov_iter_get_pages() failure. While at it, also fix the call to task_io_account_write() to be passed the correct BIO size instead of the bio_iov_iter_get_pages() return value. Reported-by: Christoph Hellwig <[email protected]> Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
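The shape of the fix, as described, is a shared cleanup label that both the success path and the bio_iov_iter_get_pages() failure path reach. A simplified sketch, not the literal zonefs code:

    ret = bio_iov_iter_get_pages(bio, from);
    if (unlikely(ret))
            goto out_release;

    size = bio->bi_iter.bi_size;
    task_io_account_write(size);    /* account the BIO size, not the
                                       bio_iov_iter_get_pages() return value */
    ret = submit_bio_wait(bio);
    /* ... handle completion, advance iocb->ki_pos on success ... */

    out_release:
    bio_release_pages(bio, false);  /* drop the page references */
    bio_put(bio);                   /* drop the BIO itself */
    return ret;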
2020-12-10 | erofs: avoid using generic_block_bmap | Huang Jianan | 1 | -19/+7
Surprisingly, `block' in sector_t indicates the number of i_blkbits-sized blocks rather than sectors for bmap. In addition, considering that buffer_head limits the mapped size to 32 bits, we should avoid using generic_block_bmap. Link: https://lore.kernel.org/r/[email protected] Fixes: 9da681e017a3 ("staging: erofs: support bmap") Reviewed-by: Chao Yu <[email protected]> Reviewed-by: Gao Xiang <[email protected]> Signed-off-by: Huang Jianan <[email protected]> Signed-off-by: Guo Weichao <[email protected]> [ Gao Xiang: slightly update the commit message description. ] Signed-off-by: Gao Xiang <[email protected]>
2020-12-09 | ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set() | Chunguang Xu | 1 | -1/+0
Signed-off-by: Chunguang Xu <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2020-12-09 | ext4: update ext4_data_block_valid related comments | Chunguang Xu | 1 | -3/+3
Since ext4_data_block_valid() has been renamed to ext4_inode_block_valid(), the related comments need to be updated. Signed-off-by: Chunguang Xu <[email protected]> Reviewed-by: Andreas Dilger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2020-12-09 | io_uring: fix io_cqring_events()'s noflush | Pavel Begunkov | 1 | -1/+1
Checking !list_empty(&ctx->cq_overflow_list) around noflush in io_cqring_events() is racy, because if it fails but a request overflowed just after that, io_cqring_overflow_flush() will still be called. Remove the second check; it shouldn't be a problem for performance, because there is a cq_check_overflow bit check just above. Cc: <[email protected]> # 5.5+ Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: fix racy IOPOLL flush overflow | Pavel Begunkov | 1 | -4/+6
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without holding uring_lock, because it does synchronisation differently. Make sure we have it. As for io_ring_exit_work(), we don't even need it there because io_ring_ctx_wait_and_kill() already sets the force flag, making all overflowed requests be dropped. Cc: <[email protected]> # 5.5+ Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: fix racy IOPOLL completions | Pavel Begunkov | 1 | -5/+18
IOPOLL allows buffer remove/provide requests, but they don't synchronise by the rules of IOPOLL, namely they have to hold uring_lock. Cc: <[email protected]> # 5.7+ Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: always let io_iopoll_complete() complete polled io | Xiaoguang Wang | 1 | -2/+13
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring(): [ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0 [ 95.505921] [ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1 [ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 95.508248] Call Trace: [ 95.508683] dump_stack+0x107/0x163 [ 95.509323] ? io_commit_cqring+0x3ec/0x8e0 [ 95.509982] print_address_description.constprop.0+0x3e/0x60 [ 95.510814] ? vprintk_func+0x98/0x140 [ 95.511399] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512036] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512733] kasan_report_invalid_free+0x51/0x80 [ 95.513431] ? io_commit_cqring+0x3ec/0x8e0 [ 95.514047] __kasan_slab_free+0x141/0x160 [ 95.514699] kfree+0xd1/0x390 [ 95.515182] io_commit_cqring+0x3ec/0x8e0 [ 95.515799] __io_req_complete.part.0+0x64/0x90 [ 95.516483] io_wq_submit_work+0x1fa/0x260 [ 95.517117] io_worker_handle_work+0xeac/0x1c00 [ 95.517828] io_wqe_worker+0xc94/0x11a0 [ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.519151] ? __kthread_parkme+0x11d/0x1d0 [ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.521211] kthread+0x396/0x470 [ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30 [ 95.522380] ? kthread_mod_delayed_work+0x180/0x180 [ 95.523108] ret_from_fork+0x22/0x30 [ 95.523684] [ 95.523985] Allocated by task 4035: [ 95.524543] kasan_save_stack+0x1b/0x40 [ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 95.525882] kmem_cache_alloc_trace+0x17b/0x310 [ 95.533930] io_queue_sqe+0x225/0xcb0 [ 95.534505] io_submit_sqes+0x1768/0x25f0 [ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10 [ 95.535900] do_syscall_64+0x33/0x40 [ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 95.537199] [ 95.537505] Freed by task 4035: [ 95.538003] kasan_save_stack+0x1b/0x40 [ 95.538599] kasan_set_track+0x1c/0x30 [ 95.539177] kasan_set_free_info+0x1b/0x30 [ 95.539798] __kasan_slab_free+0x112/0x160 [ 95.540427] kfree+0xd1/0x390 [ 95.540910] io_commit_cqring+0x3ec/0x8e0 [ 95.541516] io_iopoll_complete+0x914/0x1390 [ 95.542150] io_do_iopoll+0x580/0x700 [ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200 [ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340 [ 95.544206] io_uring_release+0x43/0x50 [ 95.544791] __fput+0x28d/0x940 [ 95.545291] task_work_run+0xea/0x1b0 [ 95.545873] do_exit+0xb6a/0x2c60 [ 95.546400] do_group_exit+0x12a/0x320 [ 95.546967] __x64_sys_exit_group+0x3f/0x50 [ 95.547605] do_syscall_64+0x33/0x40 [ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is that once we got a non EAGAIN error in io_wq_submit_work(), we'll complete req by calling io_req_complete(), which will hold completion_lock to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't hold completion_lock to call io_commit_cqring(), then there maybe concurrent access to ctx->defer_list, double free may happen. To fix this bug, we always let io_iopoll_complete() complete polled io. Cc: <[email protected]> # 5.5+ Reported-by: Abaci Fuzz <[email protected]> Signed-off-by: Xiaoguang Wang <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: add timeout update | Pavel Begunkov | 1 | -4/+50
Support timeout updates through IORING_OP_TIMEOUT_REMOVE with IORING_TIMEOUT_UPDATE passed in. Updates don't support offset timeout mode; the original timeout.off will be ignored as well. Signed-off-by: Pavel Begunkov <[email protected]> [axboe: remove now unused 'ret' variable] Signed-off-by: Jens Axboe <[email protected]>
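From userspace, updating a pending timeout is an IORING_OP_TIMEOUT_REMOVE SQE with IORING_TIMEOUT_UPDATE set and the new timespec attached. A hedged sketch assuming liburing's io_uring_prep_timeout_update() helper is available (it was added around the same time as this feature; check your liburing version):

    #include <liburing.h>

    /* Re-arm the timeout originally submitted with user_data == td_user_data. */
    static int update_timeout(struct io_uring *ring, __u64 td_user_data,
                              struct __kernel_timespec *new_ts)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -EBUSY;
            /* Internally this becomes IORING_OP_TIMEOUT_REMOVE + IORING_TIMEOUT_UPDATE. */
            io_uring_prep_timeout_update(sqe, new_ts, td_user_data, 0);
            return io_uring_submit(ring);
    }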
2020-12-09 | io_uring: restructure io_timeout_cancel() | Pavel Begunkov | 1 | -19/+23
Add io_timeout_extract() helper, which searches and disarms timeouts, but doesn't complete them. No functional changes. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: fix files cancellation | Pavel Begunkov | 1 | -4/+4
io_uring_cancel_files()'s task check condition mistakenly got flipped. 1. There can't be a request in the inflight list without IO_WQ_WORK_FILES, so kill this check to keep the whole condition simpler. 2. Also, don't call the function for files==NULL to not do such a check; all that stuff is already handled well by its counterpart, __io_uring_cancel_task_requests(). With that, just flip the task check. Also, it io-wq-cancels all requests of the current task there; don't forget to set the right ->files into struct io_task_cancel. Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task") Reported-by: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: use bottom half safe lock for fixed file data | Jens Axboe | 1 | -8/+8
io_file_data_ref_zero() can be invoked from soft-irq from the RCU core, hence we need to ensure that the file_data lock is bottom half safe. Use the _bh() variants when grabbing this lock. Reported-by: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: fix miscounting ios_left | Pavel Begunkov | 1 | -12/+9
io_req_init() doesn't decrement state->ios_left if a request doesn't need ->file; it just returns before that on if(!needs_file). That's not really a problem, but it may cause overhead for an additional fput(). Also inline and kill io_req_set_file() as it's of no use anymore. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: change submit file state invariant | Pavel Begunkov | 1 | -11/+10
Keep the submit state invariant of whether there are file refs left based on state->nr_refs instead of (state->file==NULL), and always check against the first one. It's easier to track and allows removing one if. It also automatically leaves struct submit_state in a consistent state after io_submit_state_end(); that's not used yet but is nice. Btw, rename has_refs to file_refs for more clarity. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: check kthread stopped flag when sq thread is unparked | Xiaoguang Wang | 1 | -1/+9
syzbot reports following issue: INFO: task syz-executor.2:12399 can't die for more than 143 seconds. task:syz-executor.2 state:D stack:28744 pid:12399 ppid: 8504 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:3773 [inline] __schedule+0x893/0x2170 kernel/sched/core.c:4522 schedule+0xcf/0x270 kernel/sched/core.c:4600 schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x163/0x260 kernel/sched/completion.c:138 kthread_stop+0x17a/0x720 kernel/kthread.c:596 io_put_sq_data fs/io_uring.c:7193 [inline] io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290 io_finish_async fs/io_uring.c:7297 [inline] io_sq_offload_create fs/io_uring.c:8015 [inline] io_uring_create fs/io_uring.c:9433 [inline] io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45deb9 Code: Unable to access opcode bytes at RIP 0x45de8f. RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5 RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c INFO: task syz-executor.2:12399 blocked for more than 143 seconds. Not tainted 5.10.0-rc3-next-20201110-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Currently we don't have a reproducer yet, but seems that there is a race in current codes: => io_put_sq_data ctx_list is empty now. | ==> kthread_park(sqd->thread); | | T1: sq thread is parked now. ==> kthread_stop(sqd->thread); | KTHREAD_SHOULD_STOP is set now.| ===> kthread_unpark(k); | | T2: sq thread is now unparkd, run again. | | T3: sq thread is now preempted out. | ===> wake_up_process(k); | | | T4: Since sqd ctx_list is empty, needs_sched will be true, | then sq thread sets task state to TASK_INTERRUPTIBLE, | and schedule, now sq thread will never be waken up. ===> wait_for_completion | I have artificially used mdelay() to simulate above race, will get same stack like this syzbot report, but to be honest, I'm not sure this code race triggers syzbot report. To fix this possible code race, when sq thread is unparked, need to check whether sq thread has been stopped. Reported-by: [email protected] Signed-off-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: share fixed_file_refs b/w multiple rsrcs | Pavel Begunkov | 1 | -8/+15
Double fixed files for splice/tee are done in a nasty way: it takes 2 ref_node refs, and during the second time it blindly overrides req->fixed_file_refs hoping that it hasn't changed. That works because all of that is done under uring_lock in a single go, but it is error-prone. Bind everything explicitly to a single ref_node and take only one ref; with the current ref_node ordering it's guaranteed to keep all files valid while the request is in flight. That's mainly a cleanup + preparation for generic resource handling, but also saves pcpu_ref get/put for splice/tee with 2 fixed files. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: replace inflight_wait with tctx->wait | Pavel Begunkov | 1 | -7/+6
As tasks now cancel only theirs requests, and inflight_wait is awaited only in io_uring_cancel_files(), which should be called with ->in_idle set, instead of keeping a separate inflight_wait use tctx->wait. That will add some spurious wakeups but actually is safer from point of not hanging the task. e.g. task1 | IRQ | *start* io_complete_rw_common(link) | link: req1 -> req2 -> req3(with files) *cancel_files() | io_wq_cancel(), etc. | | put_req(link), adds to io-wq req2 schedule() | So, task1 will never try to cancel req2 or req3. If req2 is long-standing (e.g. read(empty_pipe)), this may hang. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: don't take fs for recvmsg/sendmsg | Pavel Begunkov | 1 | -4/+2
We don't even allow non-plain-data msg_control, which is disallowed in __sys_{send,recv}msg_sock(). So there is no need for fs in IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is now less contended than before, but there are still cases that can contend it, e.g. IOSQE_ASYNC. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: only wake up sq thread while current task is in io worker context | Xiaoguang Wang | 1 | -3/+8
If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread task context or in io worker task context. If current task context is sq thread, we don't need to check whether should wake up sq thread. io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory barrier, before this patch, perf shows obvious overhead: Samples: 481K of event 'cycles', Event count (approx.): 299807382878 Overhead Comma Shared Object Symbol 3.69% :9630 [kernel.vmlinux] [k] io_issue_sqe With this patch, perf shows: Samples: 482K of event 'cycles', Event count (approx.): 299929547283 Overhead Comma Shared Object Symbol 0.70% :4015 [kernel.vmlinux] [k] io_issue_sqe It shows some obvious improvements. Signed-off-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: don't acquire uring_lock twice | Xiaoguang Wang | 1 | -11/+7
Both IOPOLL and sqe handling need to acquire uring_lock; combine them together, then we just need to acquire uring_lock once. Signed-off-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: initialize 'timeout' properly in io_sq_thread() | Xiaoguang Wang | 1 | -1/+1
Some static checker reports below warning: fs/io_uring.c:6939 io_sq_thread() error: uninitialized symbol 'timeout'. This is a false positive, but let's just initialize 'timeout' to make sure we don't trip over this. Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Xiaoguang Wang <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: refactor io_sq_thread() handling | Xiaoguang Wang | 1 | -102/+67
There are some issues with the current io_sq_thread() implementation: 1. The prepare_to_wait() usage in __io_sq_thread() is weird. If multiple ctxs share the same poll thread, one ctx will put the poll thread in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't need to change the task's state at all. I think we should only do it if none of the ctxs have work to do. 2. We use a round-robin strategy to make multiple ctxs share the same poll thread, but there are various conditions in __io_sq_thread(), which seems complicated and may affect the round-robin strategy. To improve the above issues, I take the following actions: 1. If multiple ctxs share the same poll thread, only when all ctxs have no work to do can we call prepare_to_wait() and schedule() to make the poll thread enter the sleep state. 2. To make the round-robin strategy more straightforward, I simplify __io_sq_thread() a bit; it just does io poll and sqe submit work once, and does not check various conditions. 3. When multiple ctxs share one poll thread, we choose the biggest sq_thread_idle among these ctxs as the timeout condition, and update it when a ctx is added or removed. 4. There is no need to check EBUSY specially; if io_submit_sqes() returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, and the helper in liburing should be aware of cq overflow and enter the kernel to flush work. Signed-off-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
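A rough pseudocode sketch of the restructured loop described by points 1-4 above. Field names such as sqd->ctx_list and sqd->sq_thread_idle follow the description and surrounding commits; this is not the literal kernel code.

    while (!kthread_should_stop()) {
            bool did_work = false;

            list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
                    if (__io_sq_thread(ctx))        /* one poll + submit pass per ctx */
                            did_work = true;

            if (did_work || !time_after(jiffies, timeout)) {
                    if (did_work)
                            timeout = jiffies + sqd->sq_thread_idle;
                    cond_resched();
                    continue;
            }

            /* Every ctx has been idle past the largest sq_thread_idle:
             * only now is the thread put into TASK_INTERRUPTIBLE. */
            prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
            /* re-check for pending work, then schedule() and finish_wait() */
            timeout = jiffies + sqd->sq_thread_idle;
    }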
2020-12-09 | io_uring: always batch cancel in *cancel_files() | Pavel Begunkov | 3 | -125/+21
Instead of iterating over each request and cancelling it individually in io_uring_cancel_files(), try to cancel all matching requests and use ->inflight_list only to check if there is anything left. In many cases it should be faster, and we can reuse a lot of code from task cancellation. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: pass files into kill timeouts/poll | Pavel Begunkov | 1 | -8/+10
Make io_poll_remove_all() and io_kill_timeouts() match against files as well. A preparation patch; effectively not used yet. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: don't iterate io_uring_cancel_files() | Pavel Begunkov | 1 | -22/+12
io_uring_cancel_files() guarantees to cancel all matching requests, so it's not necessary to do that in a loop. Move it up the callchain into io_uring_cancel_task_requests(). Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: cancel only requests of current task | Pavel Begunkov | 1 | -18/+5
io_uring_cancel_files() cancels all requests that match files, regardless of task. There is no real need for that; cancel only requests of the specified task. That also handles the SQPOLL case, as it already changes task to it. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: add a {task,files} pair matching helper | Pavel Begunkov | 1 | -26/+22
Add io_match_task() that matches both task and files. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: simplify io_task_match() | Pavel Begunkov | 1 | -5/+1
If IORING_SETUP_SQPOLL is set all requests belong to the corresponding SQPOLL task, so skip task checking in that case and always match. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: inline io_import_iovec() | Pavel Begunkov | 1 | -24/+16
Inline io_import_iovec() and leave only its former __io_import_iovec() renamed to the original name. That makes it more obvious what is reused in io_read/write(). Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: remove duplicated io_size from rw | Pavel Begunkov | 1 | -10/+6
io_size and iov_count in io_read() and io_write() hold the same value, kill the last one. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | fs/io_uring Don't use the return value from import_iovec(). | David Laight | 1 | -4/+4
This is the only code that relies on import_iovec() returning iter.count on success. This allows a better interface to import_iovec(). Signed-off-by: David Laight <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09 | io_uring: NULL files dereference by SQPOLL | Pavel Begunkov | 1 | -7/+12
SQPOLL task may find sqo_task->files == NULL and __io_sq_thread_acquire_files() would leave it unset, so the following fget_many() and others try to dereference NULL and fault. Propagate an error when files are missing. [ 118.962785] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 118.963812] #PF: supervisor read access in kernel mode [ 118.964534] #PF: error_code(0x0000) - not-present page [ 118.969029] RIP: 0010:__fget_files+0xb/0x80 [ 119.005409] Call Trace: [ 119.005651] fget_many+0x2b/0x30 [ 119.005964] io_file_get+0xcf/0x180 [ 119.006315] io_submit_sqes+0x3a4/0x950 [ 119.007481] io_sq_thread+0x1de/0x6a0 [ 119.007828] kthread+0x114/0x150 [ 119.008963] ret_from_fork+0x22/0x30 Reported-by: Josef Grieb <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>