Age | Commit message (Collapse) | Author | Files | Lines |
|
It's now unused, drop the code related to it. This includes the
io_issue_defs->manual alloc field.
While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <[email protected]>
|
|
The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, -EIOCBQUEUED will get returned on issue, as the
operation got successfully queued. For that case, a persistent SQE isn't
needed, as it's just used for issue.
Unless execution goes async immediately, defer copying the double SQE
until it's necessary.
This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:
10.60% -8.58% [kernel.vmlinux] [k] io_uring_cmd_prep
where the prep side drops from 10.60% to ~2%, which is more expected.
Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
to where it was before the async command prep.
Tested-by: Anuj Gupta <[email protected]>
Reviewed-by: Anuj Gupta <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.
Tested-by: Anuj Gupta <[email protected]>
Reviewed-by: Anuj Gupta <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.
Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.
While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.
Signed-off-by: Jens Axboe <[email protected]>
|
|
A separate state struct is not needed anymore, just fold it in with
io_async_rw.
Signed-off-by: Jens Axboe <[email protected]>
|
|
read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.
Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.
This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.
In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that iovec recycling is being done, the iovec is no longer being
freed in there. Hence the kmsg parameter is now useless.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recyling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.
Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.
The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.
Signed-off-by: Jens Axboe <[email protected]>
|
|
All net commands have async data at this point, there's no reason to
check if this is the case or not.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the io_async_msghdr out of the issue path and into prep handling,
e it's now done unconditionally and hence does not need to be part
of the issue path. This means any usage of io_sendrecv_prep_async() and
io_sendmsg_prep_async(), and hence the forced async setup path is now
unified with the normal prep setup.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This reduces the footprint of the multishot fast
path of multiple invocations of ->issue() per prep, and also means that
using ->prep_async() can be dropped for recvmsg asthis is now done via
setup on the prep side.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We currently set this separately for async/sync entry, but let's just
move it to a generic pre-issue spot and eliminate the difference
between the two.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Rather than use an on-stack one and then need to allocate and copy if
async execution is required, always grab one upfront. This should be
very cheap, and potentially even have cache hotness benefits for
back-to-back send/recv requests.
For any recv type of request, this is probably a good choice in general,
as it's expected that no data is available initially. For send this is
not necessarily the case, as space in the socket buffer is expected to
be available. However, getting a cached io_async_msghdr is very cheap,
and as it should be cache hot, probably the difference here is neglible,
if any.
A nice side benefit is that io_setup_async_msg can get killed
completely, which has some nasty iovec manipulation code.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that recv/recvmsg both do the same cleanup, put it in the retry and
finish handlers.
Signed-off-by: Jens Axboe <[email protected]>
|
|
No functional changes in this patch, just in preparation for carrying
more state than what is available now, if necessary.
Signed-off-by: Jens Axboe <[email protected]>
|
|
No functional changes in this patch, just in preparation for carrying
more state then what is being done now, if necessary. While unifying
some of this code, add a generic send setup prep handler that they can
both use.
This gets rid of some manual msghdr and sockaddr on the stack, and makes
it look a bit more like the sendmsg/recvmsg variants. Going forward, more
can get unified on top.
Signed-off-by: Jens Axboe <[email protected]>
|
|
In practice, we just need to recycle a few elements for (by far) most
use cases. Shrink the total size down from 512 to 128, which should be
more than plenty.
Signed-off-by: Jens Axboe <[email protected]>
|
|
For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Just like we run the inline task_work, ensure we also factor in and
run the fallback task_work.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked()
and kill the else branch.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Make io_req_complete_post() to push all IORING_SETUP_IOPOLL requests
to task_work, it's much cleaner and should normally happen. We couldn't
do it before because there was a possibility of looping in
complete_post() -> tw -> complete_post() -> ...
Also, unexport the function and inline __io_req_complete_post().
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
task_work execution is now always locked, and we shouldn't get into
io_req_complete_post() from them. That means that complete_post() is
always called out of the original task context and we don't even need to
check current.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/24ec27f27db0d8f58c974d8118dca1d345314ddc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_post_aux_cqe(), which is used for multishot requests, delays
completions by putting CQEs into a temporary array for the purpose
completion lock/flush batching.
DEFER_TASKRUN doesn't need any locking, so for it we can put completions
directly into the CQ and defer post completion handling with a flag.
That leaves !DEFER_TASKRUN, which is not that interesting / hot for
multishot requests, so have conditional locking with deferred flush
for them.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
The restriction on multishot execution context disallowing io-wq is
driven by rules of io_fill_cqe_req_aux(), it should only be called in
the master task context, either from the syscall path or in task_work.
Since task_work now always takes the ctx lock implying
IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
always called with its defer argument set to true.
Kill the argument. Also rename the function for more consistency as
"fill" in CQE related functions was usually meant for raw interfaces
only copying data into the CQ without any locking, waking the user
and other accounting "post" functions take care of.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
ctx is always locked for task_work now, so get rid of struct
io_tw_state::locked. Note I'm stopping one step before removing
io_tw_state altogether, which is not empty, because it still serves the
purpose of indicating which function is a tw callback and forcing users
not to invoke them carelessly out of a wrong context. The removal can
always be done later.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
We can run normal task_work without locking the ctx, however we try to
lock anyway and most handlers prefer or require it locked. It might have
been interesting to multi-submitter ring with high contention completing
async read/write requests via task_work, however that will still need to
go through io_req_complete_post() and potentially take the lock for
rsrc node putting or some other case.
In other words, it's hard to care about it, so alawys force the locking.
The case described would also because of various io_uring caches.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/6ae858f2ef562e6ed9f13c60978c0d48926954ba.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
kiocb_done() should care to specifically redirecting requests to io-wq.
Remove the hopping to tw to then queue an io-wq, return -EAGAIN and let
the core code io_uring handle offloading.
Signed-off-by: Pavel Begunkov <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
!IO_URING_F_UNLOCKED does not translate to availability of the deferred
completion infra, IO_URING_F_COMPLETE_DEFER does, that what we should
pass and look for to use io_req_complete_defer() and other variants.
Luckily, it's not a real problem as two wrongs actually made it right,
at least as far as io_uring_cmd_work() goes.
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/aef76d34fe9410df8ecc42a14544fd76cd9d8b9e.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring cmd converts struct io_tw_state to issue_flags and later back
to io_tw_state, it's awfully ill-fated, not to mention that intermediate
issue_flags state is not correct.
Get rid of the last conversion, drag through tw everything that came
with IO_URING_F_UNLOCKED, and replace io_req_complete_defer() with a
direct call to io_req_complete_defer(), at least for the time being.
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/c53fa3df749752bd058cf6f824a90704822d6bcc.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's
move it closer to all cmd bits into uring_cmd.c
Signed-off-by: Pavel Begunkov <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
cac9e4418f4cb ("io_uring/net: save msghdr->msg_control for retries")
reinstatiates msg_control before every __sys_sendmsg_sock(), since the
function can overwrite the value in msghdr. We need to do same for
zerocopy sendmsg.
Cc: [email protected]
Fixes: 493108d95f146 ("io_uring/net: zerocopy sendmsg")
Link: https://github.com/axboe/liburing/issues/1067
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
There's a bunch of flags that are purely based on what the file
operations support while also never being conditionally set or unset.
IOW, they're not subject to change for individual files. Imho, such
flags don't need to live in f_mode they might as well live in the fops
structs itself. And the fops struct already has that lonely
mmap_supported_flags member. We might as well turn that into a generic
fop_flags member and move a few flags from FMODE_* space into FOP_*
space. That gets us four FMODE_* bits back and the ability for new
static flags that are about file ops to not have to live in FMODE_*
space but in their own FOP_* space. It's not the most beautiful thing
ever but it gets the job done. Yes, there'll be an additional pointer
chase but hopefully that won't matter for these flags.
I suspect there's a few more we can move into there and that we can also
redirect a bunch of new flag suggestions that follow this pattern into
the fop_flags field instead of f_mode.
Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
This bug was introduced in commit 950e79dd7313 ("io_uring: minor
io_cqring_wait() optimization"), which was made in preparation for
adc8682ec690 ("io_uring: Add support for napi_busy_poll"). The latter
got reverted in cb3182167325 ("Revert "io_uring: Add support for
napi_busy_poll""), so simply undo the former as well.
Cc: [email protected]
Fixes: 950e79dd7313 ("io_uring: minor io_cqring_wait() optimization")
Signed-off-by: Alexey Izbyshev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
If we look up the kbuf, ensure that it doesn't get unregistered until
after we're done with it. Since we're inside mmap, we cannot safely use
the io_uring lock. Rely on the fact that we can lookup the buffer list
under RCU now and grab a reference to it, preventing it from being
unregistered until we're done with it. The lookup returns the
io_buffer_list directly with it referenced.
Cc: [email protected] # v6.4+
Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU")
Signed-off-by: Jens Axboe <[email protected]>
|
|
No functional changes in this patch, just in preparation for being able
to keep the buffer list alive outside of the ctx->uring_lock.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that xarray is being exclusively used for the buffer_list lookup,
this check is no longer needed. Get rid of it and the is_ready member.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Just rely on the xarray for any kind of bgid. This simplifies things, and
it really doesn't bring us much, if anything.
Cc: [email protected] # v6.4+
Signed-off-by: Jens Axboe <[email protected]>
|
|
Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.
Cc: [email protected]
Reported-by: Rasmus Karlsson <[email protected]>
Tested-by: Rasmus Karlsson <[email protected]>
Tested-by: Iskren Chernev <[email protected]>
Link: https://github.com/axboe/liburing/issues/1113
Signed-off-by: Jens Axboe <[email protected]>
|
|
Do the same check for direct io-wq execution for multishot requests that
commit 2a975d426c82 did for the inline execution, and disable multishot
mode (and revert to single shot) if the file type doesn't support NOWAIT,
and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's
a requirement that nonblocking read attempts can be done.
Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Supporting multishot reads requires support for NOWAIT, as the
alternative would be always having io-wq execute the work item whenever
the poll readiness triggered. Any fast file type will have NOWAIT
support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the
given file type does not, then simply resort to single shot execution.
Cc: [email protected]
Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Jens Axboe <[email protected]>
|
|
Ideally we'd want to simply kill the task rather than wake it, but for
now let's just add a startup check that causes the thread to exit.
This can only happen if io_uring_alloc_task_context() fails, which
generally requires fault injection.
Reported-by: Ubisectech Sirius <[email protected]>
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Jens Axboe <[email protected]>
|
|
If failure happens before the opcode prep handler is called, ensure that
we clear the opcode specific area of the request, which holds data
specific to that request type. This prevents errors where opcode
handlers either don't get to clear per-request private data since prep
isn't even called.
Reported-and-tested-by: [email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
If we get a request with IOSQE_ASYNC set, then we first run the prep
async handlers. But if we then fail setting it up and want to post
a CQE with -EINVAL, we use ->done_io. This was previously guarded with
REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any
potential errors, but we need to cover the async setup too.
Fixes: 9817ad85899f ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io")
Signed-off-by: Jens Axboe <[email protected]>
|
|
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our waitid list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.
Remove the dead check in cancelation as well for the hash_node being
empty or not. We already have a waitid reference check for ownership,
so we don't need to check the list too.
Cc: [email protected]
Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support")
Signed-off-by: Jens Axboe <[email protected]>
|
|
We know the request is either being removed, or already in the process of
being removed through task_work, so we can delete it from our futex list
upfront. This is important for remove all conditions, as we otherwise
will find it multiple times and prevent cancelation progress.
Cc: [email protected]
Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait")
Fixes: 8f350194d5cf ("io_uring: add support for vectored futex waits")
Signed-off-by: Jens Axboe <[email protected]>
|
|
Taking the ctx lock is not enough to use the deferred request completion
infrastructure, it'll get queued into the list but no one would expect
it there, so it will sit there until next io_submit_flush_completions().
It's hard to care about the cancellation path, so complete it via tw.
Fixes: ef7dfac51d8ed ("io_uring/poll: serialize poll linked timer start with poll removal")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
|
|
Looking at the error path of __io_uaddr_map, if we fail after pinning
the pages for any reasons, ret will be set to -EINVAL and the error
handler won't properly release the pinned pages.
I didn't manage to trigger it without forcing a failure, but it can
happen in real life when memory is heavily fragmented.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Fixes: 223ef4743164 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|