Age | Commit message (Collapse) | Author | Files | Lines |
|
The ceph_cap_flush structures are usually dynamically allocated, but
the ceph_cap_snap has an embedded one.
When force umounting, the client will try to remove all the session
caps. During this, it will free them, but that should not be done
with the ones embedded in a capsnap.
Fix this by adding a new boolean that indicates that the cap flush is
embedded in a capsnap, and skip freeing it if that's set.
At the same time, switch to using list_del_init() when detaching the
i_list and g_list heads. It's possible for a forced umount to remove
these objects but then handle_cap_flushsnap_ack() races in and does the
list_del_init() again, corrupting memory.
Cc: [email protected]
URL: https://tracker.ceph.com/issues/52283
Signed-off-by: Xiubo Li <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Use the file_inode() helper rather than accessing ->f_inode directly.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/162431192403.2908479.4590814090994846904.stgit@warthog.procyon.org.uk/
|
|
Move the cookie debug ID from struct netfs_read_request to struct
netfs_cache_resources and drop the 'cookie_' prefix. This makes it
available for things that want to use netfs_cache_resources without having
a netfs_read_request.
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/162431190784.2908479.13386972676539789127.stgit@warthog.procyon.org.uk/
|
|
Unconditionally select the stats produced by the netfs lib if fscache stats
are enabled as the former are displayed in the latter's procfile.
Signed-off-by: David Howells <[email protected]>
cc: [email protected]
Reviewed-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/162280352566.3319242.10615341893991206961.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/162431189627.2908479.9165349423842063755.stgit@warthog.procyon.org.uk/
|
|
Dan reported a new smatch warning [1]
"fs/erofs/inode.c:210 erofs_read_inode() error: double free of 'copied'"
Due to new chunk-based format handling logic, the error path can be
called after kfree(copied).
Set "copied = NULL" after kfree(copied) to fix this.
[1] https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Fixes: c5aa903a59db ("erofs: support reading chunk-based uncompressed files")
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Reviewed-by: Chao Yu <[email protected]>
Signed-off-by: Gao Xiang <[email protected]>
|
|
This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.
[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627319f ("btrfs: compression: don't try to compress if we don't
have enough pages").
[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.
- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.
- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.
For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.
[FIX]
Fix it by reverting the offending commit.
Reported-by: Zygo Blaxell <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: [email protected] # 4.4+
Signed-off-by: Qu Wenruo <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
|
|
As done with open opcodes, allow accept to skip installing fd into
processes' file tables and put it directly into io_uring's fixed file
table. Same restrictions and design as for open.
Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/6d16163f376fac7ac26a656de6b42199143e9721.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Make io_accept() to handle file descriptor allocations and installation.
A preparation patch for bypassing file tables.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/5b73d204caa0ce979ccb98136695b60f52a3d98c.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Instead of opening a file into a process's file table as usual and then
registering the fd within io_uring, some users may want to skip the
first step and place it directly into io_uring's fixed file table.
This patch adds such a capability for IORING_OP_OPENAT and
IORING_OP_OPENAT2.
The behaviour is controlled by setting sqe->file_index, where 0 implies
the old behaviour using normal file tables. If non-zero value is
specified, then it will behave as described and place the file into a
fixed file slot sqe->file_index - 1. A file table should be already
created, the slot should be valid and empty, otherwise the operation
will fail.
Keep the error codes consistent with IORING_OP_FILES_UPDATE, ENXIO and
EINVAL on inappropriate fixed tables, and return EBADF on collision with
already registered file.
Note: IOSQE_FIXED_FILE can't be used to switch between modes, because
accept takes a file, and it already uses the flag with a different
meaning.
Suggested-by: Josh Triplett <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
When configfs_lookup() is executing list_for_each_entry(),
it is possible that configfs_dir_lseek() is calling list_del().
Some unfortunate interleavings of them can cause a kernel NULL
pointer dereference error
Thread 1 Thread 2
//configfs_dir_lseek() //configfs_lookup()
list_del(&cursor->s_sibling);
list_for_each_entry(sd, ...)
Fix this by grabbing configfs_dirent_lock in configfs_lookup()
while iterating ->s_children.
Signed-off-by: Sishuai Gong <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
This makes it more clear what gets added to the dcache and prepares
for an additional locking fix.
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Return the error directly instead of using a goto.
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Just like most other file systems: get the simple checks out of the
way first.
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Yup, the VFS hoist broke it, and nobody noticed. Bulkstat workloads
make it clear that it doesn't work as it should.
Fixes: dae2f8ed7992 ("fs: Lift XFS_IDONTCACHE to the VFS layer")
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
|
|
blkdev_fsync is only used inside of block_dev.c since the
removal of the raw drŅver.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 3efee0567b4a ("fs: remove mandatory file locking support") removes
some operations in functions rw_verify_area().
As these functions are now simplified, do some syntactic clean-up as
follow-up to the removal as well, which was pointed out by compiler
warnings and static analysis.
No functional change.
Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
|
|
While prototyping a free space defragmentation tool, I observed an
unexpected IO error while running a sequence of commands that can be
recreated by the following sequence of commands:
# xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
# cp --reflink=always file1 file2
# punch-alternating -o 1 file2
# xfs_io -c "funshare 0 10m" file2
fallocate: Input/output error
I then scraped this (abbreviated) stack trace from dmesg:
WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:iomap_write_begin+0x376/0x450
RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
FS: 00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
Call Trace:
iomap_unshare_actor+0x95/0x140
iomap_apply+0xfa/0x300
iomap_file_unshare+0x44/0x60
xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
vfs_fallocate+0x133/0x330
__x64_sys_fallocate+0x3e/0x70
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4b8f79140a
Looking at the iomap tracepoints, I saw this:
iomap_iter: dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
iomap_iter_srcmap: dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
iomap_iter: dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
iomap_iter_dstmap: dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
console: WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
The first time funshare calls ->iomap_begin, xfs sees that the first
block is shared and creates a 128k delalloc reservation in the COW fork.
The delalloc reservation is returned as dstmap, and the shared block is
returned as srcmap. So far so good.
funshare calls ->iomap_begin to try the second block. This time there's
no srcmap (punch-alternating punched it out!) but we still have the
delalloc reservation in the COW fork. Therefore, we again return the
reservation as dstmap and the hole as srcmap. iomap_unshare_iter
incorrectly tries to unshare the hole, which __iomap_write_begin rejects
because shared regions must be fully written and therefore cannot
require zeroing.
Therefore, change the buffered write iomap_begin function not to set
IOMAP_F_SHARED when there isn't a source mapping to read from for the
unsharing.
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chandan Babu R <[email protected]>
|
|
We shouldn't really be using a read-only file descriptor to take a write
lock.
Most filesystems will put up with it. But NFS, for example, won't.
Signed-off-by: J. Bruce Fields <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.
In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.
Acked-by: Linus Torvalds <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>
|
|
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.
Acked-by: Linus Torvalds <[email protected]>
Suggested-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use bio_alloc_kiocb to dip into the percpu cache of bios when the
caller asks for it.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Mark polled IO as being safe for dipping into the bio allocation
cache, in case the targeted bio_set has it enabled.
This brings an IOPOLL gen2 Optane QD=128 workload from ~3.2M IOPS to
~3.5M IOPS.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
WARNING: CPU: 1 PID: 5870 at fs/io_uring.c:5975 io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
CPU: 0 PID: 5870 Comm: iou-wrk-5860 Not tainted 5.14.0-rc6-next-20210820-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:io_try_cancel_userdata+0x30f/0x540 fs/io_uring.c:5975
Call Trace:
io_async_cancel fs/io_uring.c:6014 [inline]
io_issue_sqe+0x22d5/0x65a0 fs/io_uring.c:6407
io_wq_submit_work+0x1dc/0x300 fs/io_uring.c:6511
io_worker_handle_work+0xa45/0x1840 fs/io-wq.c:533
io_wqe_worker+0x2cc/0xbb0 fs/io-wq.c:582
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
io_try_cancel_userdata() can be called from io_async_cancel() executing
in the io-wq context, so the warning fires, which is there to alert
anyone accessing task->io_uring->io_wq in a racy way. However,
io_wq_put_and_exit() always first waits for all threads to complete,
so the only detail left is to zero tctx->io_wq after the context is
removed.
note: one little assumption is that when IO_WQ_WORK_CANCEL, the executor
won't touch ->io_wq, because io_wq_destroy() might cancel left pending
requests in such a way.
Cc: [email protected]
Reported-by: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/dfdd37a80cfa9ffd3e59538929c99cdd55d8699e.1629721757.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags
and arguments.
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: add splice_fd_in check]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:
* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pass in the struct filename pointers instead of the user string, for
uniformity with do_renameat2, do_unlinkat, do_mknodat, etc.
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There are a couple of places where we already open-code the (flags &
AT_EMPTY_PATH) check and io_uring will likely add another one in the
future. Let's just add a simple helper getname_uflags() that handles
this directly and use it.
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210415100815.edrn4a7cy26wkowe@wittgenstein/
Signed-off-by: Christian Brauner <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_mkdnodat(), do_unlinkat(),
do_renameat(), do_mkdirat().
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pass in the struct filename pointers instead of the user string, for
uniformity with the recently converted do_unlinkat(), do_renameat(),
do_mkdirat().
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/20210330071700.kpjoyp5zlni7uejm@wittgenstein/
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename").
This behaves like do_unlinkat() and do_renameat2().
Cc: Al Viro <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since commit 5c31b6cedb675 ("namei: saner calling conventions for
filename_parentat()") filename_parentat() had the following behavior WRT
the passed in struct filename *:
* On error the name is consumed (putname() is called on it);
* On success the name is returned back as the return value;
Now there is a need for filename_create() and filename_lookup() variants
that do not consume the passed filename, and following the same "consume
the name only on error" semantics is proven to be hard to reason about
and result in confusing code.
Hence this preparation change splits filename_parentat() into two: one
that always consumes the name and another that never consumes the name.
This will allow to implement two filename_create() variants in the same
way, and is a consistent and hopefully easier to reason about approach.
Link: https://lore.kernel.org/io-uring/CAOKbgA7MiqZAq3t-HDCpSGUFfco4hMA9ArAE-74fTpU+EkvKPw@mail.gmail.com/
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Supporting ERR/NULL names in putname() makes callers code cleaner, and
is what some other path walking functions already support for the same
reason.
This also removes a few existing IS_ERR checks before putname().
Suggested-by: Linus Torvalds <[email protected]>
Link: https://lore.kernel.org/io-uring/CAHk-=wgCac9hBsYzKMpHk0EbLgQaXR=OUAjHaBtaY+G8A9KhFg@mail.gmail.com/
Acked-by: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Dmitry Kadashev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Employ inline completion logic for read/write completions done via
io_req_task_complete(). If ->uring_lock is contended, just do normal
request completion, but if not, make tctx_task_work() to grab the lock
and do batched inline completions in io_req_task_complete().
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/94589c3ce69eaed86a21bb1ec696407a54fab1aa.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Many task_work handlers either grab ->uring_lock, or may benefit from
having it. Move locking logic out of individual handlers to a lazy
approach controlled by tctx_task_work(), so we don't keep doing
tons of mutex lock/unlock.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d6a34e147f2507a2f3e2fa1e38a9c541dcad3929.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_fallback_req_func() doesn't expect anyone creating inline
completions, and no one currently does that. Teach the function to flush
completions preparing for further changes.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8b941516921f72e1a64d58932d671736892d7fff.1629286357.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
->splice_fd_in is used only by splice/tee, but no other request checks
it for validity. Add the check for most of request types excluding
reads/writes/sends/recvs, we don't want overhead for them and can leave
them be as is until the field is actually used.
Cc: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/f44bc2acd6777d932de3d71a5692235b5b2b7397.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
We've previously had an issue where overflow flush unconditionally calls
io_cqring_ev_posted() even if it didn't flush any events to the ring,
causing wake and eventfd increment where no new events are available.
Some applications don't like that, see commit b18032bb0a88 for details.
This came up in discussion for another patch recently, hence add a
comment detailing what the relationship between calling the events
posted helper and CQ ring entries is.
Link: https://lore.kernel.org/io-uring/[email protected]/
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fixed tables may be large enough, place all of them together with
allocated tags under memcg limits.
Cc: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b3ac9f5da9821bb59837b5fe25e8ef4be982218c.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Limit the number of files in io_uring fixed tables by RLIMIT_NOFILE,
that's the first and the simpliest restriction that we should impose.
Cc: [email protected]
Suggested-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b2756c340aed7d6c0b302c26dab50c6c5907f4ce.1629451684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
coml_nr in ctx_flush_and_put() is not protected by uring_lock, this
may cause problems when accessing in parallel:
say coml_nr > 0
ctx_flush_and put other context
if (compl_nr) get mutex
coml_nr > 0
do flush
coml_nr = 0
release mutex
get mutex
do flush (*)
release mutex
in (*) place, we call io_cqring_ev_posted() and users likely get
no events there. To avoid spurious events, re-check the value when
under the lock.
Fixes: 2c32395d8111 ("io_uring: fix __tctx_task_work() ctx race")
Signed-off-by: Hao Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now allocated rsrc table uses PAGE_SIZE as the size of 2nd-level, and
accessing this table relies on each level index from fixed TABLE_SHIFT
(12 - 3) in 4k page case. In order to correctly work in non-4k page,
define TABLE_SHIFT as non-fixed (PAGE_SHIFT - shift of data) for
2nd-level table entry number.
Signed-off-by: wangyangbo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now with IRQ completions done via IRQ, almost all requests freeing
are done from the context of submitter task, so it makes sense to
extend task_put optimisation from io_req_free_batch_finish() to cover
all the cases including task_work by moving it into io_put_task().
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/824a7cbd745ddeee4a0f3ff85c558a24fd005872.1629302453.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
We have two checks of task->flags & PF_EXITING left:
1) In io_req_task_submit(), which is called in task_work and hence always
in the context of the original task. That means that
req->task == current, and hence checking ->flags is totally fine.
2) In io_poll_rewait(), where we need to stop re-arming poll to prevent
it interfering with cancelation. This is only run from task_work as
well, and hence for this case too req->task == current.
Add a comment to both spots detailing that.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We don't need to protect nr_running and worker_refs by wqe->lock, so
narrow the range of raw_spin_lock_irq - raw_spin_unlock_irq
Signed-off-by: Hao Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_timeout_cancel() posts CQEs so needs ->completion_lock to be held,
so grab it in io_timeout_remove().
Fixes: 48ecb6369f1f2 ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d6f03d653a4d7bf693ef6f39b6a426b6d97fd96f.1629280204.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move earlier the check for whether __io_queue_proc() tries to poll
already polled waitqueue, and do the same for the second poll entry, if
any. Shouldn't really matter, but at least it would have a more
predictable behaviour.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8cb428cfe8ade0fd055859fabb878db8777d4c2f.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
We have io_req_complete_post() to post a CQE and put the request. It
takes care of all synchronisation and is more concise and efficent, so
replace all hancoded occurrences of
"lock; post CQE; unlock; + put_req()" with io_req_complete_post().
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/2c83463458a613f9d870e5147eb134da2aa70779.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Make io_put_rw_kbuf() to do the REQ_F_BUFFER_SELECTED check, so all the
callers don't need to hand code it. The number of places where we call
io_put_rw_kbuf() is growing, so saves some pain.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/3df3919e5e7efe03420c44ab4d9317a81a9cf398.1629228203.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Linked timeout handling during issuing is heavy, it adds extra
instructions and forces to save the next linked timeout before
io_issue_sqe().
Follwing the same reasoning as in refcounting patches, a request can't
be freed by the time it returns from io_issue_sqe(), so now we don't
need to do io_prep_linked_timeout() in advance, and it can be delayed to
colder paths optimising the generic path.
Also, it should also save quite a lot for requests with linked timeouts
and completed inline on timeout spinlocking + hrtimer_start() +
hrtimer_try_to_cancel() and so on.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/19bfc9a0d26c5c5f1e359f7650afe807ca8ef879.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Adjust io_disarm_next(), so it can detect if there is a linked but
not-yet-armed timeout and complete/cancel it separately. Will be used in
the following patch.
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/ae228cde2c0df3d92d29d5e4852ed9fa8a2a97db.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|