path: root/fs

2021-08-25  ceph: correctly handle releasing an embedded cap flush  (Xiubo Li)  4 files, -12/+22

The ceph_cap_flush structures are usually dynamically allocated, but the
ceph_cap_snap has an embedded one. When force umounting, the client will
try to remove all the session caps. During this, it will free them, but
that should not be done with the ones embedded in a capsnap.

Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.

At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.

Cc: [email protected]
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>

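A minimal sketch of the flag-and-skip pattern the fix describes; the
helper name and field layout here are assumptions, not code from the
patch:

    /* sketch: the real struct lives in fs/ceph and has many more fields */
    struct ceph_cap_flush {
        struct list_head g_list;    /* global cap-flush list */
        struct list_head i_list;    /* per-inode cap-flush list */
        bool is_capsnap;            /* embedded in a ceph_cap_snap */
    };

    static void ceph_free_cap_flush(struct ceph_cap_flush *cf)
    {
        /* an embedded flush is freed together with its capsnap */
        if (cf && !cf->is_capsnap)
            kfree(cf);
    }
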
2021-08-25  cachefiles: Use file_inode() rather than accessing ->f_inode  (David Howells)  1 file, -2/+2

Use the file_inode() helper rather than accessing ->f_inode directly.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/

2021-08-25  netfs: Move cookie debug ID to struct netfs_cache_resources  (David Howells)  1 file, -1/+1

Move the cookie debug ID from struct netfs_read_request to struct
netfs_cache_resources and drop the 'cookie_' prefix. This makes it
available for things that want to use netfs_cache_resources without
having a netfs_read_request.

Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/

2021-08-25  fscache: Select netfs stats if fscache stats are enabled  (David Howells)  1 file, -0/+1

Unconditionally select the stats produced by the netfs lib if fscache
stats are enabled, as the former are displayed in the latter's procfile.

Signed-off-by: David Howells <[email protected]>
cc: [email protected]
Reviewed-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/

2021-08-25  erofs: fix double free of 'copied'  (Gao Xiang)  1 file, -0/+1

Dan reported a new smatch warning [1]:

    fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'

Due to the new chunk-based format handling logic, the error path can be
called after kfree(copied). Set "copied = NULL" after kfree(copied) to
fix this.

[1] https://lore.kernel.org/r/[email protected]

Link: https://lore.kernel.org/r/[email protected]
Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>

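The underlying pattern, as a small hedged sketch; the function and
helper names here are hypothetical, only the NULL-after-kfree idiom
matters:

    static int read_inode_sketch(void *dst, size_t size)
    {
        void *copied = kmalloc(size, GFP_KERNEL);
        int err;

        if (!copied)
            return -ENOMEM;
        err = fill_from_disk(copied, size);     /* hypothetical helper */
        if (!err)
            memcpy(dst, copied, size);
        kfree(copied);
        copied = NULL;  /* kfree(NULL) is a no-op, so the shared error
                         * path below can no longer double-free */
        if (err)
            goto err_out;
        return 0;
    err_out:
        kfree(copied);  /* harmless now */
        return err;
    }
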
2021-08-25  Revert "btrfs: compression: don't try to compress if we don't have enough pages"  (Qu Wenruo)  1 file, -1/+1

This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.

[BUG]
It's no longer possible to create a compressed inline extent after
commit f2165627319f ("btrfs: compression: don't try to compress if we
don't have enough pages").

[CAUSE]
For the compression code, there are several possible reasons we have a
range that needs to be compressed while it's no more than one page:

- Compressed inline write
  The data is always smaller than one sector and the test lacks the
  condition to properly recognize a non-inline extent.

- Compressed subpage write
  For the incoming subpage compressed write support, we require page
  alignment of the delalloc range. And for 64K page size, we can
  compress just one page into smaller sectors.

For those reasons, the requirement that the data be more than one page
is not correct, and is already causing a regression for compressed
inline data writeback. The idea of skipping one page to avoid wasting
CPU time could be revisited in the future.

[FIX]
Fix it by reverting the offending commit.

Reported-by: Zygo Blaxell <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: [email protected] # 4.4+
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

2021-08-25  io_uring: accept directly into fixed file table  (Pavel Begunkov)  1 file, -6/+18

As was done with the open opcodes, allow accept to skip installing the
fd into the process's file table and instead put it directly into
io_uring's fixed file table. Same restrictions and design as for open.

Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-25  io_uring: hand code io_accept() fd installing  (Pavel Begunkov)  1 file, -7/+20

Make io_accept() handle file descriptor allocation and installation
itself. A preparation patch for bypassing file tables.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-25  io_uring: openat directly into fixed fd table  (Pavel Begunkov)  1 file, -8/+66

Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table. This
patch adds such a capability for IORING_OP_OPENAT and IORING_OP_OPENAT2.

The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour using normal file tables. If a non-zero value is
specified, then it will behave as described and place the file into the
fixed file slot sqe->file_index - 1. A file table must already have
been created, and the slot must be valid and empty, otherwise the
operation will fail.

Keep the error codes consistent with IORING_OP_FILES_UPDATE: ENXIO and
EINVAL on inappropriate fixed tables, and EBADF on collision with an
already registered file.

Note: IOSQE_FIXED_FILE can't be used to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.

Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

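A hedged userspace sketch of the new interface, using raw SQE setup
rather than any liburing helper; per the commit, file_index is 1-based
and 0 keeps the old install-an-fd behaviour:

    #include <linux/io_uring.h>
    #include <fcntl.h>
    #include <string.h>

    static void prep_openat_direct(struct io_uring_sqe *sqe, int dfd,
                                   const char *path, unsigned fixed_slot)
    {
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_OPENAT;
        sqe->fd = dfd;
        sqe->addr = (unsigned long)path;
        sqe->open_flags = O_RDONLY;
        sqe->file_index = fixed_slot + 1;   /* 1-based; 0 = normal fd */
    }
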
2021-08-25  configfs: fix a race in configfs_lookup()  (Sishuai Gong)  1 file, -3/+4

When configfs_lookup() is executing list_for_each_entry(), it is
possible that configfs_dir_lseek() is calling list_del(). Some
unfortunate interleavings of the two can cause a kernel NULL pointer
dereference, for example:

    Thread 1                         Thread 2
    //configfs_dir_lseek()           //configfs_lookup()
    list_del(&cursor->s_sibling);
                                     list_for_each_entry(sd, ...)

Fix this by grabbing configfs_dirent_lock in configfs_lookup() while
iterating ->s_children.

Signed-off-by: Sishuai Gong <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

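A sketch of the locking fix: walk ->s_children only while holding
configfs_dirent_lock. The helper and field names follow fs/configfs
internals, but treat this as an illustration rather than the actual
patch (the real configfs_lookup() does more per entry):

    static struct configfs_dirent *
    configfs_find_child(struct configfs_dirent *parent_sd, const char *name)
    {
        struct configfs_dirent *sd, *found = NULL;

        spin_lock(&configfs_dirent_lock);
        list_for_each_entry(sd, &parent_sd->s_children, s_sibling) {
            if (sd->s_element &&
                !strcmp(configfs_get_name(sd), name)) {
                found = sd;     /* safe: the list can't mutate here */
                break;
            }
        }
        spin_unlock(&configfs_dirent_lock);
        return found;
    }
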
2021-08-25  configfs: fold configfs_attach_attr into configfs_lookup  (Christoph Hellwig)  1 file, -49/+24

This makes it more clear what gets added to the dcache and prepares for
an additional locking fix.

Signed-off-by: Christoph Hellwig <[email protected]>

2021-08-25  configfs: simplify the configfs_dirent_is_ready  (Christoph Hellwig)  1 file, -3/+1

Return the error directly instead of using a goto.

Signed-off-by: Christoph Hellwig <[email protected]>

2021-08-25  configfs: return -ENAMETOOLONG earlier in configfs_lookup  (Christoph Hellwig)  1 file, -2/+3

Just like most other file systems: get the simple checks out of the way
first.

Signed-off-by: Christoph Hellwig <[email protected]>

2021-08-24  xfs: fix I_DONTCACHE  (Dave Chinner)  2 files, -2/+3

Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads
make it clear that it doesn't work as it should.

Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>

2021-08-24  block: mark blkdev_fsync static  (Christoph Hellwig)  1 file, -2/+2

blkdev_fsync is only used inside of block_dev.c since the removal of
the raw driver.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-24  fs: clean up after mandatory file locking support removal  (Lukas Bulwahn)  1 file, -7/+3

Commit 3efee0567b4a ("fs: remove mandatory file locking support")
removes some operations from rw_verify_area(). As the affected
functions are now simplified, do some syntactic clean-up as a follow-up
to the removal as well, which was pointed out by compiler warnings and
static analysis.

No functional change.

Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>

2021-08-23  xfs: only set IOMAP_F_SHARED when providing a srcmap to a write  (Darrick J. Wong)  1 file, -4/+4

While prototyping a free space defragmentation tool, I observed an
unexpected IO error while running a sequence of commands that can be
recreated with:

    # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
    # cp --reflink=always file1 file2
    # punch-alternating -o 1 file2
    # xfs_io -c "funshare 0 10m" file2
    fallocate: Input/output error

I then scraped this (abbreviated) stack trace from dmesg:

    WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
    CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    RIP: 0010:iomap_write_begin+0x376/0x450
    RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
    RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
    RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
    RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
    R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
    R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
    FS:  00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
    Call Trace:
     iomap_unshare_actor+0x95/0x140
     iomap_apply+0xfa/0x300
     iomap_file_unshare+0x44/0x60
     xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
     xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
     vfs_fallocate+0x133/0x330
     __x64_sys_fallocate+0x3e/0x70
     do_syscall_64+0x35/0x80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f4b8f79140a

Looking at the iomap tracepoints, I saw this:

    iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
    iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
    iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
    iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
    iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
    console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450

The first time funshare calls ->iomap_begin, xfs sees that the first
block is shared and creates a 128k delalloc reservation in the COW
fork. The delalloc reservation is returned as dstmap, and the shared
block is returned as srcmap. So far so good.

funshare calls ->iomap_begin to try the second block. This time there's
no srcmap (punch-alternating punched it out!) but we still have the
delalloc reservation in the COW fork. Therefore, we again return the
reservation as dstmap and the hole as srcmap. iomap_unshare_iter
incorrectly tries to unshare the hole, which __iomap_write_begin
rejects because shared regions must be fully written and therefore
cannot require zeroing.

Therefore, change the buffered write iomap_begin function not to set
IOMAP_F_SHARED when there isn't a source mapping to read from for the
unsharing.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>

2021-08-23  Keep read and write fds with each nlm_file  (J. Bruce Fields)  5 files, -41/+102

We shouldn't really be using a read-only file descriptor to take a
write lock. Most filesystems will put up with it. But NFS, for example,
won't.

Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

2021-08-23  io_uring: add support for IORING_OP_LINKAT  (Dmitry Kadashev)  3 files, -1/+74

IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments. In some internal places 'hardlink' is used instead of 'link'
to avoid confusion with the SQE links, since the name 'link' conflicts
with the existing 'link' member of io_kiocb.

Acked-by: Linus Torvalds <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: add support for IORING_OP_SYMLINKAT  (Dmitry Kadashev)  3 files, -2/+69

IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.

Acked-by: Linus Torvalds <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  block: use the percpu bio cache in __blkdev_direct_IO  (Christoph Hellwig)  1 file, -2/+4

Use bio_alloc_kiocb to dip into the percpu cache of bios when the
caller asks for it.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: enable use of bio alloc cache  (Jens Axboe)  1 file, -1/+1

Mark polled IO as being safe for dipping into the bio allocation cache,
in case the targeted bio_set has it enabled. This brings an IOPOLL gen2
Optane QD=128 workload from ~3.2M IOPS to ~3.5M IOPS.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: fix io_try_cancel_userdata race for iowq  (Pavel Begunkov)  1 file, -2/+3

    WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
    CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
    Call Trace:
     io_async_cancel fs/io_uring.c:6014 [inline]
     io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
     io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
     io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
     io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
     ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires; it is there to alert anyone
accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete, so
the only detail left is to zero tctx->io_wq after the context is
removed.

Note: one small assumption is that when IO_WQ_WORK_CANCEL is set, the
executor won't touch ->io_wq, because io_wq_destroy() might cancel
left-over pending requests in such a way.

Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

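A sketch of the resulting cleanup ordering (function name approximate):

    static void io_uring_clean_tctx_sketch(struct io_uring_task *tctx)
    {
        struct io_wq *wq = tctx->io_wq;

        if (wq) {
            io_wq_put_and_exit(wq); /* waits for all workers first */
            tctx->io_wq = NULL;     /* then zero it, so io-wq context
                                     * can no longer observe a stale
                                     * pointer */
        }
    }
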
2021-08-23  io_uring: add support for IORING_OP_MKDIRAT  (Dmitry Kadashev)  1 file, -0/+60

IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags and
arguments.

Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  namei: update do_*() helpers to return ints  (Dmitry Kadashev)  2 files, -8/+8

Update the following to return int rather than long, for uniformity
with the rest of the do_* helpers in namei.c:

* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  namei: make do_linkat() take struct filename  (Dmitry Kadashev)  1 file, -16/+29

Pass in the struct filename pointers instead of the user string, for
uniformity with do_renameat2, do_unlinkat, do_mknodat, etc.

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  namei: add getname_uflags()  (Dmitry Kadashev)  2 files, -6/+10

There are a couple of places where we already open-code the
(flags & AT_EMPTY_PATH) check, and io_uring will likely add another one
in the future. Let's just add a simple helper, getname_uflags(), that
handles this directly, and use it.

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

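A plausible minimal implementation, sketched from the description;
getname_flags() and LOOKUP_EMPTY are existing namei APIs, but the real
helper may differ in detail:

    struct filename *getname_uflags(const char __user *filename, int uflags)
    {
        int flags = (uflags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;

        return getname_flags(filename, flags, NULL);
    }
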
2021-08-23  namei: make do_symlinkat() take struct filename  (Dmitry Kadashev)  1 file, -11/+12

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_mknodat(), do_unlinkat(),
do_renameat() and do_mkdirat().

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  namei: make do_mknodat() take struct filename  (Dmitry Kadashev)  1 file, -8/+11

Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_unlinkat(), do_renameat() and
do_mkdirat().

Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  namei: make do_mkdirat() take struct filename  (Dmitry Kadashev)  2 files, -7/+20

Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename")
and behaves like do_unlinkat() and do_renameat2().

Cc: Al Viro <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

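The resulting calling convention, sketched (close to, though not
guaranteed identical to, the final code):

    int do_mkdirat(int dfd, struct filename *name, umode_t mode);

    SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname,
                    umode_t, mode)
    {
        /* do_mkdirat() now consumes the struct filename */
        return do_mkdirat(dfd, getname(pathname), mode);
    }
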
2021-08-23  namei: change filename_parentat() calling conventions  (Dmitry Kadashev)  1 file, -55/+53

Since commit 5c31b6cedb675 ("namei: saner calling conventions for
filename_parentat()"), filename_parentat() has had the following
behaviour WRT the passed-in struct filename *:

* On error the name is consumed (putname() is called on it);
* On success the name is returned back as the return value.

Now there is a need for filename_create() and filename_lookup()
variants that do not consume the passed filename, and following the
same "consume the name only on error" semantics has proven hard to
reason about and results in confusing code. Hence this preparation
change splits filename_parentat() into two: one that always consumes
the name and another that never consumes the name. This will allow
implementing two filename_create() variants in the same way, and is a
consistent and hopefully easier-to-reason-about approach.

Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

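The shape of the split, sketched; the parameter list mirrors the
existing filename_parentat() signature, with details assumed:

    /* never consumes @name */
    static int __filename_parentat(int dfd, struct filename *name,
                                   unsigned int flags, struct path *parent,
                                   struct qstr *last, int *type);

    /* always consumes @name */
    static int filename_parentat(int dfd, struct filename *name,
                                 unsigned int flags, struct path *parent,
                                 struct qstr *last, int *type)
    {
        int retval = __filename_parentat(dfd, name, flags, parent,
                                         last, type);

        putname(name);
        return retval;
    }
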
2021-08-23  namei: ignore ERR/NULL names in putname()  (Dmitry Kadashev)  1 file, -4/+5

Supporting ERR/NULL names in putname() makes callers' code cleaner, and
is what some other path walking functions already support for the same
reason. This also removes a few existing IS_ERR checks before
putname().

Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/
Acked-by: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

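Sketch of the tolerant putname(); the elided body stands for the
existing refcount-drop and freeing logic:

    void putname(struct filename *name)
    {
        if (IS_ERR_OR_NULL(name))
            return;     /* callers may now putname() unconditionally */
        /* ... existing refcount drop and freeing ... */
    }
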
2021-08-23  io_uring: IRQ rw completion batching  (Pavel Begunkov)  1 file, -1/+16

Employ inline completion logic for read/write completions done via
io_req_task_complete(). If ->uring_lock is contended, just do normal
request completion; if not, make tctx_task_work() grab the lock and do
batched inline completions in io_req_task_complete().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: batch task work locking  (Pavel Begunkov)  1 file, -31/+49

Many task_work handlers either grab ->uring_lock or may benefit from
having it. Move the locking logic out of the individual handlers to a
lazy approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock cycles.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

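The lazy-locking pattern, sketched: handlers receive a bool *locked,
the first handler that wants ->uring_lock takes it, and
tctx_task_work() drops it once at the end. The helper name mirrors the
pattern and is not guaranteed to match the actual code:

    static inline void io_tw_lock(struct io_ring_ctx *ctx, bool *locked)
    {
        if (!*locked) {
            mutex_lock(&ctx->uring_lock);
            *locked = true;     /* tctx_task_work() unlocks at the end */
        }
    }
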
2021-08-23  io_uring: flush completions for fallbacks  (Pavel Begunkov)  1 file, -0/+5

io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to
flush completions, preparing for further changes.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: add ->splice_fd_in checks  (Pavel Begunkov)  1 file, -22/+30

->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most request types, excluding
reads/writes/sends/recvs; we don't want overhead for those and can
leave them as is until the field is actually used.

Cc: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: add clarifying comment for io_cqring_ev_posted()  (Jens Axboe)  1 file, -0/+7

We've previously had an issue where overflow flush unconditionally
calls io_cqring_ev_posted() even if it didn't flush any events to the
ring, causing wake and eventfd increment where no new events are
available. Some applications don't like that; see commit b18032bb0a88
for details.

This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.

Link: https://lore.kernel.org/io-uring/[email protected]/
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: place fixed tables under memcg limits  (Pavel Begunkov)  1 file, -3/+4

Fixed tables may get large enough to matter, so place all of them,
together with the allocated tags, under memcg limits.

Cc: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

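In practice, placing allocations "under memcg limits" means making them
accounted; a hedged one-line sketch (variable names illustrative):

    /* GFP_KERNEL_ACCOUNT charges the allocation to the caller's memory
     * cgroup, so oversized fixed tables count against its limit */
    table = kvcalloc(nr_entries, sizeof(*table), GFP_KERNEL_ACCOUNT);
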
2021-08-23  io_uring: limit fixed table size by RLIMIT_NOFILE  (Pavel Begunkov)  1 file, -0/+2

Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE;
that's the first and the simplest restriction that we should impose.

Cc: [email protected]
Suggested-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

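The check itself is tiny; a sketch close to what the commit describes,
with the surrounding validation assumed:

    if (!nr_args || nr_args > IORING_MAX_FIXED_FILES)
        return -EINVAL;
    if (nr_args > rlimit(RLIMIT_NOFILE))    /* new: cap by the fd rlimit */
        return -EMFILE;
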
2021-08-23  io_uring: fix lack of protection for compl_nr  (Hao Xu)  1 file, -1/+2

compl_nr in ctx_flush_and_put() is not protected by uring_lock, and
this may cause problems when it is accessed in parallel. Say
compl_nr > 0:

    ctx_flush_and_put              other context
     if (compl_nr)
                                   get mutex
                                   compl_nr > 0
                                   do flush
                                   compl_nr = 0
                                   release mutex
     get mutex
     do flush (*)
     release mutex

At the point (*), we call io_cqring_ev_posted() and users likely get no
events there. To avoid spurious events, re-check the value when under
the lock.

Fixes: 2c32395d8111 ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

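The fix is a classic double-checked flush, sketched here with field
paths simplified:

    if (ctx->submit_state.compl_nr) {
        mutex_lock(&ctx->uring_lock);
        if (ctx->submit_state.compl_nr)     /* re-check under the lock */
            io_submit_flush_completions(ctx);
        mutex_unlock(&ctx->uring_lock);
    }
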
2021-08-23  io_uring: Add register support for non-4k PAGE_SIZE  (wangyangbo)  1 file, -2/+2

Now the allocated rsrc table uses PAGE_SIZE as the size of the
2nd-level, and accessing this table relies on each level index derived
from a fixed TABLE_SHIFT (12 - 3) that assumes the 4k page case. In
order to work correctly with non-4k pages, define TABLE_SHIFT as
non-fixed (PAGE_SHIFT - shift of data) for the 2nd-level table entry
number.

Signed-off-by: wangyangbo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

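Sketched as macros: table entries are 8-byte pointers, hence the 3; the
exact macro names are assumptions:

    #define IO_RSRC_TAG_TABLE_SHIFT (PAGE_SHIFT - 3)    /* was 12 - 3 */
    #define IO_RSRC_TAG_TABLE_MAX   (1U << IO_RSRC_TAG_TABLE_SHIFT)
    #define IO_RSRC_TAG_TABLE_MASK  (IO_RSRC_TAG_TABLE_MAX - 1)
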
2021-08-23  io_uring: extend task put optimisations  (Pavel Begunkov)  1 file, -7/+9

Now that IRQ completions are run via task_work, almost all request
freeing is done from the context of the submitter task, so it makes
sense to extend the task_put optimisation from
io_req_free_batch_finish() to cover all the cases, including
task_work, by moving it into io_put_task().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: add comments on why PF_EXITING checking is safe  (Jens Axboe)  1 file, -0/+2

We have two checks of task->flags & PF_EXITING left:

1) In io_req_task_submit(), which is called in task_work and hence
   always in the context of the original task. That means that
   req->task == current, and hence checking ->flags is totally fine.

2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
   it interfering with cancelation. This is only run from task_work as
   well, and hence for this case too req->task == current.

Add a comment to both spots detailing that.

Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io-wq: move nr_running and worker_refs out of wqe->lock protection  (Hao Xu)  1 file, -3/+4

We don't need to protect nr_running and worker_refs with wqe->lock, so
narrow the raw_spin_lock_irq()/raw_spin_unlock_irq() critical section.

Signed-off-by: Hao Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: fix io_timeout_remove locking  (Pavel Begunkov)  1 file, -4/+10

io_timeout_cancel() posts CQEs, so it needs ->completion_lock to be
held; grab it in io_timeout_remove().

Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: improve same wq polling  (Pavel Begunkov)  1 file, -3/+5

Move the check for whether __io_queue_proc() tries to poll an already
polled waitqueue earlier, and do the same for the second poll entry, if
any. Shouldn't really matter, but at least it gives more predictable
behaviour.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: reuse io_req_complete_post()  (Pavel Begunkov)  1 file, -37/+11

We have io_req_complete_post() to post a CQE and put the request. It
takes care of all the synchronisation and is more concise and
efficient, so replace all hand-coded occurrences of
"lock; post CQE; unlock; put_req()" with io_req_complete_post().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: better encapsulate buffer select for rw  (Pavel Begunkov)  1 file, -16/+7

Make io_put_rw_kbuf() do the REQ_F_BUFFER_SELECTED check, so that the
callers don't need to hand-code it. The number of places where we call
io_put_rw_kbuf() is growing, so this saves some pain.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

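A sketch of the encapsulated helper; the buffer-pointer retrieval is
simplified relative to the real code (the accessor name is assumed):

    static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
    {
        if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
            return 0;       /* nothing selected, nothing to put */
        return io_put_kbuf(req, io_buffer_from_req(req));   /* assumed */
    }
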
2021-08-23  io_uring: optimise io_prep_linked_timeout()  (Pavel Begunkov)  1 file, -3/+22

Linked timeout handling during issuing is heavy: it adds extra
instructions and forces saving the next linked timeout before
io_issue_sqe(). Following the same reasoning as in the refcounting
patches, a request can't be freed by the time it returns from
io_issue_sqe(), so now we don't need to do io_prep_linked_timeout() in
advance, and it can be delayed to colder paths, optimising the generic
path. Also, for requests with linked timeouts that are completed
inline, this should save quite a lot: the timeout spinlocking +
hrtimer_start() + hrtimer_try_to_cancel() and so on.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: cancel not-armed linked touts separately  (Pavel Begunkov)  1 file, -3/+16

Adjust io_disarm_next() so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used
in the following patch.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>