Age  Commit message  Author  Files  Lines
2020-01-22  io_uring: honor IOSQE_ASYNC for linked reqs  Pavel Begunkov  1  -0/+4
REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-22  io_uring: prep req when do IOSQE_ASYNC  Pavel Begunkov  1  -0/+4
Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised. Prepare them before punting. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: use labeled array init in io_op_defs  Pavel Begunkov  1  -62/+29
Don't rely on the implicit ordering of IORING_OP_ entries; explicitly place them at the right spot in io_op_defs. The former comments are now part of the code and won't ever go out of date. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
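For illustration, a minimal sketch of the labeled (designated-initializer) style this refers to; the struct fields shown are simplified assumptions, not the exact io_op_defs layout:

    /* IORING_OP_* values come from the io_uring uapi header. Keying the
     * array by opcode means entries no longer depend on enum ordering.
     */
    struct io_op_def {
            unsigned needs_mm : 1;          /* needs current->mm set up */
            unsigned needs_file : 1;        /* takes a file argument */
            unsigned async_ctx : 1;         /* needs req->io for deferral */
    };

    static const struct io_op_def io_op_defs[] = {
            [IORING_OP_NOP]   = {},
            [IORING_OP_READV] = { .async_ctx = 1, .needs_mm = 1, .needs_file = 1 },
            [IORING_OP_FSYNC] = { .needs_file = 1 },
            /* ... every opcode gets an explicit, labeled slot ... */
    };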
2020-01-20  io_uring: optimise sqe-to-req flags translation  Pavel Begunkov  2  -34/+81
For each IOSQE_* flag there is a corresponding REQ_F_* flag, and translating one into the other follows a repetitive pattern, e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG*. Use the same numeric values/bits for both and copy them in one go instead of handling each flag by hand. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
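As a rough sketch of the idea (the REQ_F_* aliasing below is an illustrative assumption; only the IOSQE_* names are from the uapi header):

    #include <linux/io_uring.h>

    enum {
            REQ_F_FIXED_FILE  = IOSQE_FIXED_FILE,   /* 1U << 0 */
            REQ_F_IO_DRAIN    = IOSQE_IO_DRAIN,     /* 1U << 1 */
            REQ_F_LINK        = IOSQE_IO_LINK,      /* 1U << 2 */
            REQ_F_HARDLINK    = IOSQE_IO_HARDLINK,  /* 1U << 3 */
            REQ_F_FORCE_ASYNC = IOSQE_ASYNC,        /* 1U << 4 */
    };

    #define SQE_COPY_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_DRAIN | IOSQE_IO_LINK | \
                            IOSQE_IO_HARDLINK | IOSQE_ASYNC)

    /* one masked copy replaces the per-flag "if (sqe->flags & X) flags |= Y" chain */
    static void copy_sqe_flags(unsigned int *req_flags, const struct io_uring_sqe *sqe)
    {
            *req_flags |= sqe->flags & SQE_COPY_FLAGS;
    }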
2020-01-20  io_uring: remove REQ_F_IO_DRAINED  Pavel Begunkov  1  -5/+2
A request can get into the defer list only once, so there is no need to mark it as drained; remove the flag. It was probably left over from extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: file switch work needs to get flushed on exit  Jens Axboe  1  -1/+4
We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure that we flush after our teardown as well. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: hide uring_fd in ctx  Pavel Begunkov  1  -15/+12
req->ring_fd and req->ring_file are used only during the prep stage of submission, which is protected by a mutex. There is no need to store them per-request; place them in ctx instead. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: remove extra check in __io_commit_cqring  Pavel Begunkov  1  -7/+5
__io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: optimise use of ctx->drain_next  Pavel Begunkov  1  -20/+21
Move setting ctx->drain_next to the only place it can be set: when linked non-head requests are encountered. The same goes for checking it; it only matters for the head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes the REQ_F_DRAIN_LINK flag, as it is no longer needed. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for probing opcodes  Jens Axboe  2  -2/+69
The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other legwork in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on which opcodes are supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <[email protected]>
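A userspace sketch of how an application might consume this, using the raw io_uring_register(2) syscall and the probe structures from the uapi header (the 256 probe slots are an arbitrary choice here, not part of the ABI):

    #include <linux/io_uring.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* returns 1 if the running kernel supports the opcode, 0 if not, -1 on error */
    static int opcode_supported(int ring_fd, unsigned int opcode)
    {
            size_t len = sizeof(struct io_uring_probe) +
                         256 * sizeof(struct io_uring_probe_op);
            struct io_uring_probe *p = calloc(1, len);
            int ok = -1;

            if (!p)
                    return -1;
            if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_PROBE,
                        p, 256) >= 0)
                    ok = opcode <= p->last_op &&
                         (p->ops[opcode].flags & IO_URING_OP_SUPPORTED);
            free(p);
            return ok;
    }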
2020-01-20  io_uring: account fixed file references correctly in batch  Jens Axboe  1  -5/+9
We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <[email protected]> Reported-by: Pavel Begunkov <[email protected]> Tested-by: Dmitrii Dolgov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add opcode to issue trace event  Jens Axboe  2  -5/+11
For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for IORING_OP_OPENAT2  Jens Axboe  2  -6/+64
Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: remove 'fname' from io_open structure  Jens Axboe  1  -5/+6
We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add 'struct open_how' to the openat request context  Jens Axboe  1  -9/+8
We'll need this for openat2(2) support, remove flags and mode from the existing io_open struct. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: enable option to only trigger eventfd for async completions  Jens Axboe  2  -1/+17
If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following reads/writes to complete inline. And with that, it knows there are events available, and doesn't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
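A userspace sketch of registering an eventfd this way (raw syscall; error handling trimmed):

    #include <linux/io_uring.h>
    #include <sys/eventfd.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int register_async_eventfd(int ring_fd)
    {
            int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

            if (efd < 0)
                    return -1;
            /* like IORING_REGISTER_EVENTFD, but inline completions stay silent */
            if (syscall(__NR_io_uring_register, ring_fd,
                        IORING_REGISTER_EVENTFD_ASYNC, &efd, 1) < 0) {
                    close(efd);
                    return -1;
            }
            return efd;
    }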
2020-01-20  io_uring: change io_ring_ctx bool fields into bit fields  Jens Axboe  1  -7/+7
In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields. Signed-off-by: Jens Axboe <[email protected]>
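Schematically, the change amounts to something like the following (field names are illustrative, not the full io_ring_ctx definition):

    struct io_ring_ctx_bits_example {
            /* each flag costs one bit instead of a full bool */
            unsigned int compat : 1;
            unsigned int account_mem : 1;
            unsigned int cq_overflow_flushed : 1;
            unsigned int drain_next : 1;
            unsigned int eventfd_async : 1;         /* the new one there is now room for */
    };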
2020-01-20  io_uring: file set registration should use interruptible waits  Jens Axboe  1  -2/+8
If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: Remove unnecessary null check  YueHaibing  1  -2/+1
A null check before kfree() is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for send(2) and recv(2)  Jens Axboe  2  -5/+137
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <[email protected]>
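A userspace sketch of filling an SQE for IORING_OP_SEND; the exact SQE field usage (buffer in addr/len, send flags in msg_flags) is an assumption by analogy with send(2) and the existing sendmsg support, not quoted from the patch:

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/socket.h>

    static void prep_send_sqe(struct io_uring_sqe *sqe, int sockfd,
                              const void *buf, unsigned int len)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode    = IORING_OP_SEND;
            sqe->fd        = sockfd;
            sqe->addr      = (unsigned long)buf;
            sqe->len       = len;
            sqe->msg_flags = MSG_NOSIGNAL;
            sqe->user_data = 1;     /* anything; echoed back in the CQE */
    }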
2020-01-20  io_uring: remove extra io_wq_current_is_worker()  Pavel Begunkov  1  -2/+1
io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove the extra io_wq_current_is_worker() check. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: optimise commit_sqring() for common case  Pavel Begunkov  1  -8/+6
It should be pretty rare not to submit anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: optimise head checks in io_get_sqring()  Pavel Begunkov  1  -9/+4
A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop and spam smp_load_acquire() barriers. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: clamp to_submit in io_submit_sqes()  Pavel Begunkov  1  -2/+2
Make io_submit_sqes() clamp @to_submit itself. This removes duplicated code and prepares for the following changes. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for IORING_SETUP_CLAMP  Jens Axboe  2  -3/+15
Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <[email protected]>
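A userspace sketch: ask for an oversized ring and let the kernel clamp it rather than fail:

    #include <linux/io_uring.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            struct io_uring_params p;
            int fd;

            memset(&p, 0, sizeof(p));
            p.flags = IORING_SETUP_CLAMP;
            /* absurdly large ask; without the flag this would fail with -EINVAL */
            fd = syscall(__NR_io_uring_setup, 1U << 20, &p);
            if (fd < 0) {
                    perror("io_uring_setup");
                    return 1;
            }
            printf("got SQ=%u CQ=%u entries\n", p.sq_entries, p.cq_entries);
            close(fd);
            return 0;
    }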
2020-01-20  io_uring: extend batch freeing to cover more cases  Jens Axboe  1  -31/+69
Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and the fallback case, and makes io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: wrap multi-req freeing in struct req_batch  Jens Axboe  1  -34/+31
This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: batch getting pcpu references  Pavel Begunkov  1  -9/+17
percpu_ref_tryget() has its own overhead. Instead of getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <[email protected]> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  pcpu_ref: add percpu_ref_tryget_many()  Pavel Begunkov  1  -5/+21
Add percpu_ref_tryget_many(), which works the same way as percpu_ref_tryget(), but grabs specified number of refs. Signed-off-by: Pavel Begunkov <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Dennis Zhou <[email protected]> Cc: Christoph Lameter <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
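A kernel-side sketch of the batching pattern the io_uring patch above relies on (the submit loop itself is elided; only the percpu_ref calls are the point):

    #include <linux/errno.h>
    #include <linux/percpu-refcount.h>

    static int submit_batch(struct percpu_ref *refs, unsigned int nr)
    {
            unsigned int submitted = 0;

            /* one tryget for the whole batch instead of one per request */
            if (!percpu_ref_tryget_many(refs, nr))
                    return -EAGAIN;         /* ref is dying, e.g. ring teardown */

            /* ... submit up to nr requests, each consuming one ref ... */

            if (submitted != nr)            /* hand back the refs we didn't use */
                    percpu_ref_put_many(refs, nr - submitted);
            return submitted;
    }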
2020-01-20  io_uring: add IORING_OP_MADVISE  Jens Axboe  2  -0/+60
This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but it's hard to make bulletproof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
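A userspace sketch of queueing the madvise; placing the advice in the SQE's fadvise_advice field (shared with IORING_OP_FADVISE) is an assumption about the ABI, so treat it as illustrative:

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/mman.h>

    static void prep_madvise_sqe(struct io_uring_sqe *sqe, void *addr,
                                 unsigned int len)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode         = IORING_OP_MADVISE;
            sqe->fd             = -1;       /* operates on memory, not a file */
            sqe->addr           = (unsigned long)addr;
            sqe->len            = len;
            sqe->fadvise_advice = MADV_DONTNEED;
    }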
2020-01-20  mm: make do_madvise() available internally  Jens Axboe  2  -1/+7
This is in preparation for enabling this functionality through io_uring. Add a helper that is just exporting what sys_madvise() does, and have the system call use it. No functional changes in this patch. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add IORING_OP_FADVISE  Jens Axboe  2  -0/+55
This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: allow use of offset == -1 to mean file position  Jens Axboe  2  -1/+11
This behaves like preadv2/pwritev2 with offset == -1: it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple reads/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flag, IORING_FEAT_RW_CUR_POS, to allow applications to detect its presence. Reported-by: 李通洲 <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
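A userspace sketch of using the current-position mode; the application should first confirm the feature bit reported by io_uring_setup():

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/uio.h>

    /* only valid if (params.features & IORING_FEAT_RW_CUR_POS) was set at setup */
    static void prep_readv_cur_pos(struct io_uring_sqe *sqe, int fd,
                                   struct iovec *iov, unsigned int nr_vecs)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_READV;
            sqe->fd     = fd;
            sqe->addr   = (unsigned long)iov;
            sqe->len    = nr_vecs;
            sqe->off    = (__u64)-1;        /* use and advance the file position */
    }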
2020-01-20  io_uring: add non-vectored read/write commands  Jens Axboe  2  -0/+25
For use cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particularly true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write commands that don't require the iovec. Signed-off-by: Jens Axboe <[email protected]>
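A userspace sketch of the non-vectored variant: the SQE carries a plain buffer pointer and byte count instead of an iovec array:

    #include <linux/io_uring.h>
    #include <string.h>

    static void prep_read_sqe(struct io_uring_sqe *sqe, int fd, void *buf,
                              unsigned int len, __u64 offset)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_READ;   /* IORING_OP_WRITE mirrors this */
            sqe->fd     = fd;
            sqe->addr   = (unsigned long)buf;       /* buffer, not an iovec array */
            sqe->len    = len;                      /* byte count, not iovec count */
            sqe->off    = offset;
    }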
2020-01-20  io_uring: improve poll completion performance  Jens Axboe  1  -20/+88
For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggy back on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: split overflow state into SQ and CQ side  Jens Axboe  1  -13/+27
We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add lookup table for various opcode needs  Jens Axboe  1  -53/+155
We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: remove two unnecessary function declarations  Jens Axboe  1  -2/+0
__io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forwards. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: move *queue_link_head() from common path  Pavel Begunkov  1  -17/+15
Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: rename prev to head  Pavel Begunkov  1  -5/+5
Calling "prev" the head of a link is a bit misleading. Rename it. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add IOSQE_ASYNC  Jens Axboe  2  -2/+15
io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <[email protected]>
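From userspace this is just an SQE flag; a minimal sketch:

    #include <linux/io_uring.h>

    /* e.g. for a large buffered copy that would otherwise be attempted inline */
    static void force_async(struct io_uring_sqe *sqe)
    {
            sqe->flags |= IOSQE_ASYNC;      /* always punt this request to io-wq */
    }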
2020-01-20  io-wq: support concurrent non-blocking work  Jens Axboe  2  -1/+5
io-wq assumes that work will complete fast (and not block), so it doesn't create a new worker when work is enqueued, if we already have at least one worker running. This is done on the assumption that if work is running, then it will complete fast. Add an option to force io-wq to fork a new worker for work queued. This is signaled by setting IO_WQ_WORK_CONCURRENT on the work item. For that case, io-wq will create a new worker, even though workers are already running. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for IORING_OP_STATX  Jens Axboe  2  -1/+87
This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  fs: make two stat prep helpers available  Jens Axboe  2  -12/+28
To implement an async stat, we need to provide the flags mapping and the statx user copy. Make them available internally, through fs/internal.h. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: avoid ring quiesce for fixed file set unregister and update  Jens Axboe  2  -135/+351
We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add an IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for IORING_OP_CLOSE  Jens Axboe  2  -0/+110
This works just like close(2), unsurprisingly. We remove the file descriptor and post the completion inline, then offload the actual (potential) last file put to async context. Mark the async part of this work as uncancellable, as we really must guarantee that the latter part of the close is run. Signed-off-by: Jens Axboe <[email protected]>
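A userspace sketch of queueing the close; only the target fd is needed, and the CQE result mirrors what close(2) would have returned:

    #include <linux/io_uring.h>
    #include <string.h>

    static void prep_close_sqe(struct io_uring_sqe *sqe, int fd_to_close)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_CLOSE;
            sqe->fd     = fd_to_close;
    }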
2020-01-20  io-wq: add support for uncancellable work  Jens Axboe  3  -2/+12
Not all work can be cancelled, some of it we may need to guarantee that it runs to completion. Allow the caller to set IO_WQ_WORK_NO_CANCEL on work that must not be cancelled. Note that the caller work function must also check for IO_WQ_WORK_NO_CANCEL on work that is marked IO_WQ_WORK_CANCEL. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  fs: move filp_close() outside of __close_fd_get_file()  Jens Axboe  2  -4/+8
There is just one caller of this, so just use filp_close() there manually. This is important to allow async close/removal of the fd. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  io_uring: add support for IORING_OP_OPENAT  Jens Axboe  2  -2/+95
This works just like openat(2), except it can be performed async. For the normal case of a non-blocking path lookup this will complete inline. If we have to do IO to perform the open, it'll be done from async context. Signed-off-by: Jens Axboe <[email protected]>
2020-01-20  fs: make build_open_flags() available internally  Jens Axboe  2  -3/+4
This is a prep patch for supporting non-blocking open from io_uring. Signed-off-by: Jens Axboe <[email protected]>