path: root/fs
2021-08-23  io_uring: simplify io_prep_linked_timeout  (Pavel Begunkov; 1 file, -10/+10)

The link test in io_prep_linked_timeout() is pretty bulky; replace it with a flag. It's better for the normal path and linked requests, and will also be used later for request failing.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/3703770bfae8bc1ff370e43ef5767940202cab42.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: kill REQ_F_LTIMEOUT_ACTIVE  (Pavel Begunkov; 1 file, -9/+10)

Instead of handling double consecutive linked timeouts through tricky flag combinations, just check the submit_state.link during timeout_prep and fail that case in advance.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/04150760b0dc739522264b8abd309409f7421a06.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: optimise hot path of ltimeout prep  (Pavel Begunkov; 1 file, -20/+25)

io_prep_linked_timeout() grew too heavy and the compiler now refuses to inline the function. Help it by splitting it in two and annotating with inline.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/560636717a32e9513724f09b9ecaace942dde4d4.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

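The shape of that split, as a hedged sketch (the struct layout and flag name here are illustrative, not lifted from the patch): a trivial inline wrapper handles the common no-timeout case, and the heavy lifting moves into an out-of-line helper small enough for the compiler to inline the wrapper.

    /* Sketch only: a tiny inlinable fast path in front of a cold helper. */
    struct io_kiocb {
            unsigned int flags;                     /* request flags (assumed layout) */
            struct io_kiocb *link;                  /* next request in the link chain */
    };

    #define REQ_F_LINK_TIMEOUT     (1U << 6)        /* illustrative flag bit */

    struct io_kiocb *__io_prep_linked_timeout(struct io_kiocb *req);   /* out of line */

    static inline struct io_kiocb *io_prep_linked_timeout(struct io_kiocb *req)
    {
            /* Common case: no linked timeout to arm, return without a call. */
            if (!(req->flags & REQ_F_LINK_TIMEOUT))
                    return NULL;
            return __io_prep_linked_timeout(req);
    }
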
2021-08-23  io_uring: deduplicate cancellation code  (Pavel Begunkov; 1 file, -28/+18)

IORING_OP_ASYNC_CANCEL and IORING_OP_LINK_TIMEOUT overlap enough, so extract a helper for request cancellation and use it in both. This also removes some ugliness caused by success_ret.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/900122b588e65b637e71bfec80a260726c6a54d6.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: kill not necessary resubmit switch  (Pavel Begunkov; 1 file, -7/+7)

Commit 773af69121ecc ("io_uring: always reissue from task_work context") made all resubmission happen from task_work, so we don't need that hack with the resubmit/not-resubmit switch anymore.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/47fa177cca04e5ffd308a35227966c8e15d8525b.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: optimise initial ltimeout refcounting  (Pavel Begunkov; 1 file, -2/+1)

Linked timeouts are never refcounted when it comes to the first call to __io_prep_linked_timeout(), so save an io_ref_get() and set the desired value directly.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/177b24cc62ffbb42d915d6eb9e8876266e4c0d5a.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: don't inflight-track linked timeouts  (Pavel Begunkov; 1 file, -2/+0)

Tracking linked timeouts as inflight was needed to make sure that io-wq is not destroyed by io_uring_cancel_generic() racing with io_async_cancel_one() accessing it. Now cancellations issued by linked timeouts are done in the task context, so it's already synchronised.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/e1b05cf47cb69df2305efdbee8cf7ba36f46c1a3.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: optimise iowq refcounting  (Pavel Begunkov; 1 file, -9/+16)

If a request is forwarded into io-wq, there is a good chance it hasn't been refcounted yet, and we can save one req_ref_get() by setting the refcount number to the right value directly.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/2d53f4449faaf73b4a4c5de667fc3c176d974860.1628981736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

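A minimal sketch of the saving, assuming a REQ_F_REFCOUNT-style opt-in bit and an atomic_t counter in the request (the names, the folding into one helper, and the value 2 are modeled on the commit message, not quoted from the patch):

    #include <linux/atomic.h>

    struct io_kiocb {
            unsigned int flags;
            atomic_t refs;
    };

    #define REQ_F_REFCOUNT         (1U << 30)       /* assumed opt-in bit */

    /* Punting to io-wq: one atomic_set() instead of set-to-1 plus a get. */
    static void io_queue_async_work(struct io_kiocb *req)
    {
            if (!(req->flags & REQ_F_REFCOUNT)) {
                    req->flags |= REQ_F_REFCOUNT;
                    atomic_set(&req->refs, 2);      /* io-wq ref + completion ref */
            } else {
                    atomic_inc(&req->refs);         /* already counted: take one more */
            }
            /* ... hand the request off to the workqueue ... */
    }
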
2021-08-23  io_uring: correct __must_hold annotation  (Jens Axboe; 1 file, -1/+1)

io_req_free_batch() has a __must_hold annotation referencing a request being passed in, but we're passing in the context.

Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: code clean for completion_lock in io_arm_poll_handler()  (Hao Xu; 1 file, -6/+3)

We can merge two spin_unlock() operations into one, since we removed some code not long ago.

Signed-off-by: Hao Xu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove files pointer in cancellation functions  (Hao Xu; 1 file, -2/+2)

When doing cancellation, we use a parameter to indicate whether it's from do_exit or exec, so a boolean value is good enough for this; remove the struct files* as it is not necessary.

Signed-off-by: Hao Xu <[email protected]>
[axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: skip request refcounting  (Pavel Begunkov; 1 file, -1/+23)

As submission references are gone, there is only one initial reference left. Instead of actually doing atomic refcounting, add a flag indicating whether we're going to take more refs or do any other sync magic. The flag should be set before the request may get used in parallel.

Together with the previous patch it saves 2 refcount atomics per request for IOPOLL and IRQ completions, and 1 atomic per req for inline completions, with some exceptions. In particular, there are currently three cases where the refcounting has to be enabled:

- Polling, including apoll, because double poll entries take a ref. Might get relaxed in the near future.
- Link timeouts, enabled for both the timeout and the request it's bound to, because they work in parallel and we need to synchronise to cancel one of them on completion.
- When a request gets into io-wq, because it doesn't hold uring_lock and we need guarantees of submission references.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8b204b6c5f6643062270a1913d6d3a7f8f795fd9.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

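A hedged sketch of how such flag-gated refcounting can look (REQ_F_REFCOUNT is the assumed opt-in bit; this models the description above, not the patch itself): requests that never opted in short-circuit the atomics entirely.

    /* Opt in before the request can be observed in parallel. */
    static inline void io_req_set_refcount(struct io_kiocb *req)
    {
            if (!(req->flags & REQ_F_REFCOUNT)) {
                    req->flags |= REQ_F_REFCOUNT;
                    atomic_set(&req->refs, 1);
            }
    }

    static inline void req_ref_get(struct io_kiocb *req)
    {
            WARN_ON_ONCE(!(req->flags & REQ_F_REFCOUNT));   /* must have opted in */
            atomic_inc(&req->refs);
    }

    static inline bool req_ref_put_and_test(struct io_kiocb *req)
    {
            if (!(req->flags & REQ_F_REFCOUNT))
                    return true;    /* sole owner, free immediately: no atomics */
            return atomic_dec_and_test(&req->refs);
    }
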
2021-08-23  io_uring: remove submission references  (Pavel Begunkov; 1 file, -23/+14)

Requests are by default given two references, submission and completion. Completion references are straightforward: they represent request ownership and are put when a request is completed or so. Submission references are a bit trickier. They're needed when io_issue_sqe() goes deep into the submission stack (e.g. in fs, block, drivers, etc.); the request may have been given away for concurrent execution or already completed, and the code unwinding back to io_issue_sqe() may be accessing some pieces of our requests, e.g. file or iov.

Now we prevent such async/in-depth completions by pushing requests through task_work. Punting to io-wq is also done through task_work, apart from a couple of cases with a pretty well known context. So there are two cases:

1) io_issue_sqe() from the task context and protected by ->uring_lock. Either requests return back to io_uring or are handed to task_work, which won't be executed because we're currently controlling that task. So we can be sure that requests are staying alive all the time and we don't need submission references to pin them.

2) io_issue_sqe() from io-wq, which doesn't hold the mutex. The role of the submission reference is played by the io-wq reference, which is put by io_wq_submit_work(). Hence, it should be fine.

Considering that, we can carefully kill the submission reference.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/6b68f1c763229a590f2a27148aee77767a8d7750.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove req_ref_sub_and_test()  (Pavel Begunkov; 1 file, -17/+14)

Soon we won't need to put several references at once, so remove req_ref_sub_and_test() and the @nr argument from io_put_req_deferred(), and put the rest of the references by hand.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/1868c7554108bff9194fb5757e77be23fadf7fc0.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: move req_ref_get() and friends  (Pavel Begunkov; 1 file, -35/+35)

Move all request refcount helpers to avoid forward declarations in the future.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/89fd36f6f3fe5b733dfe4546c24725eee40df605.1628705069.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove IRQ aspect of io_ring_ctx completion lock  (Jens Axboe; 1 file, -80/+74)

We have no hard/soft IRQ users of this lock left; remove any IRQ disabling/saving and restoring when grabbing this lock. This is straightforward with no users entering with IRQs disabled anymore. The only thing to look out for is the waitqueue poll head lock, which nests inside the completion lock. That needs IRQs disabled, and hence we have to do that now instead of relying on the outer lock doing so.

Signed-off-by: Jens Axboe <[email protected]>

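The nesting consequence, sketched as a simplified model of the locking change (not the actual diff): the completion lock loses its IRQ flavour, so the nested poll waitqueue lock, which can also be taken from IRQ context, now disables IRQs itself.

    /* Before: outer lock disables IRQs, nested lock can be plain. */
    spin_lock_irqsave(&ctx->completion_lock, flags);
    spin_lock(&poll->head->lock);
    /* ... */
    spin_unlock(&poll->head->lock);
    spin_unlock_irqrestore(&ctx->completion_lock, flags);

    /* After: completion_lock has no IRQ users left, so the nested
     * waitqueue lock must take care of IRQs on its own. */
    spin_lock(&ctx->completion_lock);
    spin_lock_irq(&poll->head->lock);
    /* ... */
    spin_unlock_irq(&poll->head->lock);
    spin_unlock(&ctx->completion_lock);
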
2021-08-23  io_uring: run regular file completions from task_work  (Jens Axboe; 1 file, -7/+24)

This is in preparation for making the completion lock work outside of hard/soft IRQ context.

Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: run linked timeouts from task_work  (Jens Axboe; 1 file, -12/+30)

This is in preparation for making the completion lock work outside of hard/soft IRQ context.

Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: run timeouts from task_work  (Jens Axboe; 1 file, -14/+40)

This is in preparation for making the completion lock work outside of hard/soft IRQ context. Add a timeout_lock to handle the ordering of timeout completions or cancellations with the timeouts actually triggering.

Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove file batch-get optimisation  (Pavel Begunkov; 1 file, -49/+4)

For requests with non-fixed files, instead of grabbing just one reference, we grab references by the number of requests left, so the following requests using the same file can take it without atomics.

However, it's not all win. If there is one request in the middle not using files or having a fixed file, we'll need to put back the remaining references. Even worse, if an application submits requests dealing with different files, it will do a put for each new request, doubling the number of atomics needed. Also, even if not used, it still takes some cycles in the submission path.

If a file is used many times, it rather makes sense to pre-register it; if not, we may fall into the described pitfall. So this optimisation is a matter of use case; go with the simplest way code-wise and remove it.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: clean up tctx_task_work()  (Pavel Begunkov; 1 file, -18/+14)

After recent fixes, tctx_task_work() always does proper spinlocking before looking into ->task_list, so now we don't need atomics for ->task_state; replace it with a non-atomic task_running using the critical section. Tidy it up, combine two separate blocks with spinlocking, and always try to splice in there, so we do less locking when new requests arrive during the function's execution.

Signed-off-by: Pavel Begunkov <[email protected]>
[axboe: fix missing ->task_running reset on task_work_add() failure]
Signed-off-by: Jens Axboe <[email protected]>

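Roughly what the critical section can look like afterwards; a sketch assuming the field names from the message (INIT_WQ_LIST is io_uring's singly-linked work-list initialiser). task_running becomes a plain bool that is only read and written under ->task_lock:

    for (;;) {
            struct io_wq_work_node *node;

            spin_lock_irq(&tctx->task_lock);
            node = tctx->task_list.first;
            INIT_WQ_LIST(&tctx->task_list);         /* splice everything out */
            if (!node)
                    tctx->task_running = false;     /* going idle, inside the lock */
            spin_unlock_irq(&tctx->task_lock);
            if (!node)
                    break;
            /* ... run the spliced-out work items ... */
    }
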
2021-08-23  io_uring: inline io_poll_remove_waitqs  (Pavel Begunkov; 1 file, -17/+6)

Inline io_poll_remove_waitqs() into its only user and clean it up.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/2f1a91a19ffcd591531dc4c61e2f11c64a2d6a6d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove extra argument for overflow flush  (Pavel Begunkov; 1 file, -5/+5)

Unlike __io_cqring_overflow_flush(), nobody does forced flushing with io_cqring_overflow_flush(), so remove the argument from it.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/7594f869ca41b7cfb5a35a3c7c2d402242834e9e.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: inline struct io_comp_state  (Pavel Begunkov; 1 file, -34/+27)

Inline struct io_comp_state into struct io_submit_state. They are already coupled tightly, and with their mixed responsibilities it only brings confusion to keep them separate.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/e55bba77426b399e3a2e54e3c6c267c6a0fc4b57.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: use inflight_entry instead of compl.list  (Pavel Begunkov; 1 file, -8/+7)

req->compl.list is used to cache freed requests, and so can't overlap in time with req->inflight_entry. So, use inflight_entry to link requests and remove compl.list.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/e430e79d22d70a190d718831bda7bfed1daf8976.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove redundant args from cache_free  (Pavel Begunkov; 1 file, -4/+2)

We don't use @tsk argument of io_req_cache_free(), remove it.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/6a28b4a58ee0aaf0db98e2179b9c9f06f9b0cca1.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: cache __io_free_req()'d requests  (Pavel Begunkov; 1 file, -1/+6)

Don't kfree requests in __io_free_req(); put them back into the internal request cache instead. That makes allocations more sustainable and will be used for refcounting optimisations.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/9f4950fbe7771c8d41799366d0a3a08ac3040236.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

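A minimal sketch of the recycling (the cache list and helper names are assumptions based on the message, not the patch): the request is dismantled and parked on a free list for reuse by the next allocation.

    static void __io_free_req(struct io_kiocb *req)
    {
            struct io_ring_ctx *ctx = req->ctx;

            io_dismantle_req(req);          /* drop file refs, async data, etc. */
            io_put_task(req->task, 1);

            /* Park on the cache instead of kmem_cache_free(). */
            list_add(&req->inflight_entry, &ctx->submit_state.free_list);
    }
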
2021-08-23  io_uring: move io_fallback_req_func()  (Pavel Begunkov; 1 file, -15/+13)

Move io_fallback_req_func() to kill yet another forward declaration.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/d0a8f9d9a0057ed761d6237167d51c9378798d2d.1628536684.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: optimise putting task struct  (Pavel Begunkov; 1 file, -6/+11)

We cache all the references to task + tctx, so if io_put_task() is called by the corresponding task itself, we can save on atomics and return the refs right back into the cache. It's beneficial for all inline completions, and also for iopolling, when polling and submissions are done by the same task, including SQPOLL|IOPOLL.

Note: io_uring_cancel_generic() can return refs to the cache as well, so those should be flushed in the loop for tctx_inflight() to work right.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/6fe9646b3cb70e46aca1f58426776e368c8926b3.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: drop exec checks from io_req_task_submit  (Pavel Begunkov; 1 file, -1/+1)

In case of on-exec io_uring cancellations, tasks already wait for all submitted requests to get completed/cancelled, so we don't need to check for ->in_execve separately.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/be8707049f10df9d20ca03dc4ca3316239b5e8e0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: kill unused IO_IOPOLL_BATCH  (Pavel Begunkov; 1 file, -1/+0)

IO_IOPOLL_BATCH is not used, delete it.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b2bdf19dbee2c9fc8865bbab9412135a14e24a64.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: improve ctx hang handling  (Pavel Begunkov; 1 file, -2/+6)

If io_ring_exit_work() can't get it done in 5 minutes, something is going very wrong; don't keep spinning at a HZ / 20 rate, it doesn't help, and it may take a lot of CPU time if there are a lot of workers stuck as such.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/9e2d1ca81d569f6bc628af1a42ff6663bff7ce9c.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: deduplicate open iopoll check  (Pavel Begunkov; 1 file, -7/+4)

Move the IORING_SETUP_IOPOLL check into __io_openat_prep(), so both openat and openat2 reuse it.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/9a73ce83e4ee60d011180ef177eecef8e87ff2a2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: inline io_free_req_deferred  (Pavel Begunkov; 1 file, -8/+4)

Inline io_free_req_deferred(); there is no reason to keep it separated.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/ce04b7180d4eac0d69dd00677b227eefe80c2cc5.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: move io_rsrc_node_alloc() definition  (Pavel Begunkov; 1 file, -45/+44)

Move the function, together with io_rsrc_node_ref_zero(), within the source file as-is to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/4d81f6f833e7d017860b24463a9a68b14a8a5ed2.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: move io_put_task() definition  (Pavel Begunkov; 1 file, -12/+11)

Move the function within the source file as-is to get rid of forward declarations.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/33d917d69e4206557c75a5b98fe22bcdf77ce47d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: extract a helper for ctx quiesce  (Pavel Begunkov; 1 file, -24/+29)

Refactor __io_uring_register() by extracting a helper responsible for ctx quiesce. Looks better and will make it easier to add more optimisations.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0339e0027504176be09237eefa7945bf9a6f153d.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: optimise io_cqring_wait() hot path  (Pavel Begunkov; 1 file, -8/+6)

Turns out we always init struct io_wait_queue in io_cqring_wait(), even if it's not used afterwards, i.e. there are already enough CQEs. And often that's exactly what happens; for instance, requests may have been completed inline, or in case of io_uring_enter(submit=N, wait=1). It shows up in my profiler, so optimise it by delaying the struct init.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/6f1b81c60b947d165583dc333947869c3d85d037.1628471125.git.asml.silence@gmail.com
[axboe: fixed up for new cqring wait]
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: add more locking annotations for submit  (Pavel Begunkov; 1 file, -0/+6)

Add more annotations for submission path functions holding ->uring_lock.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/128ec4185e26fbd661dd3a424aa66108ee8ff951.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: don't halt iopoll too early  (Pavel Begunkov; 1 file, -9/+6)

IOPOLL users should care about getting completions for requests they submitted, not about "the device did/completed something". Currently, io_do_iopoll() may return a positive number, which will instruct io_iopoll_check() to break the loop and end the syscall, even if there are not enough CQEs, or none at all. Don't return positive numbers, so io_iopoll_check() exits only when it gets an actual error, needs to reschedule, or has got enough CQEs.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/641a88f751623b6758303b3171f0a4141f06726e.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

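The resulting loop shape, sketched with an assumed signature (function names per the message): io_do_iopoll() only ever returns 0 or a negative error, and progress is accounted through the completion counter instead of the return value.

    static int io_iopoll_check(struct io_ring_ctx *ctx, long min)
    {
            unsigned int nr_events = 0;
            int ret = 0;

            do {
                    ret = io_do_iopoll(ctx, &nr_events, min);
                    if (ret < 0)
                            break;          /* a real error: stop */
                    /* ret is never positive anymore: keep polling until we
                     * have enough CQEs or have to give the CPU back. */
            } while (nr_events < min && !need_resched());

            return ret;
    }
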
2021-08-23  io_uring: refactor io_alloc_req  (Pavel Begunkov; 1 file, -33/+33)

Replace the main if of io_flush_cached_reqs() with an inverted condition + goto, so all the cases are handled in the same way. Also extract io_preinit_req() to make it cleaner and easier to refer to.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/1abcba1f7b55dc53bf1dbe95036e345ffb1d5b01.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io-wq: improve wq_list_add_tail()  (Pavel Begunkov; 1 file, -1/+1)

Prepare nodes that we're going to add before actually linking them, it's always safer and costs us nothing.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/f7e53f0c84c02ed6748c488ed0789b98f8cc6185.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: remove unnecessary PF_EXITING check  (Pavel Begunkov; 1 file, -3/+1)

We prefer normal task_works even if they would fail requests inside. Kill the PF_EXITING check in io_req_task_work_add(); task_work_add() handles dying tasks well, i.e. returns an error when it can't enqueue due to late stages of do_exit().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/fc14297e8441cd8f5d1743a2488cf0df09bf48ac.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: clean io-wq callbacks  (Pavel Begunkov; 1 file, -9/+9)

Move io-wq callbacks closer to each other, so it's easier to work with them, and rename io_free_work() into io_wq_free_work() for consistency.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/851bbc7f0f86f206d8c1333efee8bcb9c26e419f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: avoid touching inode in rw prep  (Pavel Begunkov; 1 file, -10/+15)

If we use fixed files, we can be sure (almost) that REQ_F_ISREG is set. However, for non-reg files io_prep_rw() will still look into the inode to double check, and that's expensive and can be avoided. The only caveat is that it currently only works with 64+ bit architectures (see FFS_ISREG), so we should consider that.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0a62780c491ca2522cd52db4ae3f16e03aafed0f.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

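The trick relies on stashing flag bits in the low, always-zero bits of an aligned file pointer; here is a sketch with illustrative bit values (the real masks in fs/io_uring.c may differ):

    #define FFS_NOWAIT      0x1UL                   /* illustrative bit values */
    #define FFS_ISREG       0x2UL
    #define FFS_MASK        ~(FFS_NOWAIT | FFS_ISREG)

    static inline struct file *io_file_from_ptr(unsigned long file_ptr)
    {
            return (struct file *)(file_ptr & FFS_MASK);
    }

    static inline bool io_file_isreg(unsigned long file_ptr)
    {
            /* Answered from the tagged pointer: no inode dereference. */
            return file_ptr & FFS_ISREG;
    }
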
2021-08-23  io_uring: rename io_file_supports_async()  (Pavel Begunkov; 1 file, -15/+15)

io_file_supports_async() checks whether a file supports nowait operations, so "async" in the name is misleading. Rename it.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/33d55b5ce43aa1884c637c1957f1e30d30dc3bec.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: inline fixed part of io_file_get()  (Pavel Begunkov; 1 file, -26/+39)

Optimise io_file_get() with registered files, which is in a hot path, by inlining parts of the function. Saves a function call and the inefficiencies of passing arguments, e.g. evaluating (sqe_flags & IOSQE_FIXED_FILE). It couldn't have been done before as compilers were refusing to inline it because of the function size.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/52115cd6ce28f33bd0923149c0e6cb611084a0b1.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

2021-08-23  io_uring: use kvmalloc for fixed files  (Pavel Begunkov; 1 file, -23/+10)

Instead of hand-coded two-level tables for registered files, allocate them with kvmalloc(). In many cases small enough tables are enough, and so can be kmalloc()'ed, removing an extra memory load and a bunch of bit-logic instructions from the hot path. If the table is larger, we trade off all the pros for a TLB-assisted memory lookup.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/280421d3b48775dabab773006bb5588c7b2dabc0.1628471125.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

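What the allocation boils down to, as a sketch (the table type and helper names are assumed): kvmalloc_array() transparently picks kmalloc for small sizes and falls back to vmalloc for large ones, and kvfree() handles both cases.

    static int io_alloc_file_table(struct io_file_table *table, unsigned int nr_files)
    {
            table->files = kvmalloc_array(nr_files, sizeof(table->files[0]),
                                          GFP_KERNEL | __GFP_ZERO);
            return table->files ? 0 : -ENOMEM;
    }

    static void io_free_file_table(struct io_file_table *table)
    {
            kvfree(table->files);           /* works for kmalloc'ed and vmalloc'ed */
            table->files = NULL;
    }
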
2021-08-23  io_uring: be smarter about waking multiple CQ ring waiters  (Jens Axboe; 1 file, -14/+13)

Currently we only wake the first waiter, even if we have enough entries posted to satisfy multiple waiters. Improve that situation so that every waiter knows how much the CQ tail has to advance before it can be safely woken up.

With this change, if we have N waiters each asking for 1 event and we get 4 completions, then we wake up 4 waiters. If we have N waiters asking for 2 completions and we get 4 completions, then we wake up the first two. Previously, only the first waiter would've been woken up.

Signed-off-by: Jens Axboe <[email protected]>

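One way to express "how much the tail has to advance", sketched with assumed names: each waiter records the absolute CQ tail value it needs, and the wake-up check compares against the live tail with wrap-safe signed arithmetic.

    struct io_wait_queue {
            struct wait_queue_entry wq;
            struct io_ring_ctx *ctx;
            unsigned int cq_tail;           /* tail value this waiter needs */
    };

    static inline bool io_should_wake(struct io_wait_queue *iowq)
    {
            struct io_ring_ctx *ctx = iowq->ctx;

            /* Signed distance handles tail wrap-around. */
            return (int)(READ_ONCE(ctx->rings->cq.tail) - iowq->cq_tail) >= 0;
    }

    /* At wait setup, for a waiter wanting min_events CQEs:
     *     iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
     */
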
2021-08-23  io-wq: remove GFP_ATOMIC allocation off schedule out path  (Jens Axboe; 1 file, -32/+40)

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running stress-ng:

| [ 90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [ 90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [ 90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G W 5.14.0-rc4-rt4+ #89
| [ 90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [ 90.202561] Call Trace:
| [ 90.202577]  dump_stack_lvl+0x34/0x44
| [ 90.202584]  ___might_sleep.cold+0x87/0x94
| [ 90.202588]  rt_spin_lock+0x19/0x70
| [ 90.202593]  ___slab_alloc+0xcb/0x7d0
| [ 90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
| [ 90.202603]  ? dequeue_entity+0xc3/0x290
| [ 90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202610]  ? pick_next_task_fair+0xb9/0x330
| [ 90.202612]  ? __schedule+0x670/0x1410
| [ 90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
| [ 90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
| [ 90.202625]  io_wq_worker_sleeping+0x37/0x50
| [ 90.202628]  schedule+0x30/0xd0
| [ 90.202630]  schedule_timeout+0x8f/0x1a0
| [ 90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
| [ 90.202637]  io_wqe_worker+0xfd/0x320
| [ 90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
| [ 90.202644]  ? io_worker_handle_work+0x670/0x670
| [ 90.202646]  ? io_worker_handle_work+0x670/0x670
| [ 90.202649]  ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside a raw spinlock. Besides not working on RT, doing any kind of allocation from inside schedule() is kind of nasty and should be avoided if at all possible.

This particular path happens when an io-wq worker goes to sleep and we need a new worker to handle pending work. We currently allocate a small data item to hold the information we need to create a new worker, but we can instead include this data in the io_worker struct itself and just protect it with a single bit lock. We only really need one per worker anyway, as we will have run pending work between two sleep cycles.

https://lore.kernel.org/lkml/[email protected]/

Reported-by: Daniel Wagner <[email protected]>
Tested-by: Daniel Wagner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

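A sketch of the replacement (field and function names are assumptions based on the message): the create data lives in struct io_worker itself, and a bit lock ensures only one create request per worker is in flight, so the schedule-out path never allocates.

    struct io_worker {
            unsigned long create_state;     /* bit 0: create request in flight */
            struct callback_head create_work;
            int create_index;
            /* ... */
    };

    static bool io_queue_worker_create(struct io_worker *worker)
    {
            /* Claim the embedded data; no allocation, safe under raw locks. */
            if (test_and_set_bit_lock(0, &worker->create_state))
                    return false;           /* a create is already pending */

            /* ... fill in create_work/create_index and task_work_add() it;
             * on failure, release with clear_bit_unlock(0, &worker->create_state). */
            return true;
    }
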