path: root/io_uring
2023-01-29  io_uring: add lazy poll_wq activation  (Pavel Begunkov; 2 files changed, -5/+62)
Even though io_poll_wq_wake()'s waitqueue_active reuses a barrier we do for another waitqueue, it's not going to be the case in the future and so we want to have a fast path for it when the ring has never been polled. Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag called ->poll_activated. The idea behind the flag is to set it when the ring was polled for the first time. This requires additional sync to not miss events, which is done here by using task_work for ->task_complete rings, and by default enabling the flag for all other types of rings. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/060785e8e9137a920b232c0c7f575b131af19cac.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
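A minimal sketch of the fast path this enables, based on the description above. The ->poll_activated flag and __io_commit_cqring_flush() are named in the commit text; the other flush conditions shown (off_timeout_used, drain_active, has_evfd) are assumed pre-existing fields and the exact code may differ:

  static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
  {
          /* take the slow path only when something actually needs flushing;
           * ->poll_activated joins the existing conditions, so rings that
           * were never polled skip the poll_wq wakeup entirely */
          if (unlikely(ctx->off_timeout_used || ctx->drain_active ||
                       ctx->has_evfd || ctx->poll_activated))
                  __io_commit_cqring_flush(ctx);
  }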
2023-01-29  io_uring: separate wq for ring polling  (Pavel Begunkov; 2 files changed, -1/+11)
Don't use ->cq_wait for ring polling but add a separate wait queue for it. We need it for following patches. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/dea0be0bf990503443c5c6c337fc66824af7d590.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
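Roughly, the ring's ->poll() handler then waits on the dedicated queue instead of ->cq_wait. A hedged sketch, with the readiness checks simplified for illustration (only ctx->poll_wq comes from the commit text):

  static __poll_t io_uring_poll(struct file *file, poll_table *wait)
  {
          struct io_ring_ctx *ctx = file->private_data;
          __poll_t mask = 0;

          /* wait on the dedicated ring-poll waitqueue, not ->cq_wait */
          poll_wait(file, &ctx->poll_wq, wait);

          if (!io_sqring_full(ctx))
                  mask |= EPOLLOUT | EPOLLWRNORM;
          if (io_cqring_events(ctx))
                  mask |= EPOLLIN | EPOLLRDNORM;
          return mask;
  }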
2023-01-29  io_uring: move io_run_local_work_locked  (Pavel Begunkov; 2 files changed, -18/+17)
io_run_local_work_locked() is only used in io_uring.c, move it there. With that we can also make __io_run_local_work() static. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/91757bcb33e5774e49fed6f2b6e058630608119b.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: mark io_run_local_work static  (Pavel Begunkov; 2 files changed, -2/+1)
io_run_local_work is enclosed in io_uring.c, we don't need to export it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/b477fb81f5e77044f724a06fe245d5c078659364.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: don't set TASK_RUNNING in local tw runner  (Pavel Begunkov; 1 file changed, -3/+2)
The CQ waiting loop sets TASK_RUNNING before trying to execute task_work, no need to repeat it in io_run_local_work(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9d9422c429ef3f9457b4f4b8288bf4789564f33b.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: refactor io_wake_function  (Pavel Begunkov; 1 file changed, -4/+2)
Remove a local variable ctx in io_wake_function(), we don't need it if io_should_wake() triggers it to wake up. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/e60eb1008aebe286aab7d34c772ed01c447bddb1.1673274244.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: remove excessive unlikely on IS_ERR  (Dmitrii Bundin; 1 file changed, -1/+1)
The IS_ERR function uses the IS_ERR_VALUE macro under the hood which already wraps the condition into unlikely. Signed-off-by: Dmitrii Bundin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring/msg_ring: Pass custom flags to the cqe  (Breno Leitao; 1 file changed, -5/+19)
This patch adds a new flag (IORING_MSG_RING_FLAGS_PASS) in the message ring operations (IORING_OP_MSG_RING). This new flag enables the sender to specify custom flags, which will be copied over to cqe->flags in the receiving ring. These custom flags should be specified using the sqe->file_index field. This mechanism provides additional flexibility when sending messages between rings. Signed-off-by: Breno Leitao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
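From userspace, the feature could be used roughly as below. This is a hedged sketch with liburing: IORING_MSG_RING_FLAGS_PASS and the use of sqe->file_index come from the commit text, while `ring`, `target_fd` and MY_CQE_FLAGS are placeholders and io_uring_prep_msg_ring() is assumed available:

  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

  /* post a CQE into the ring referred to by target_fd, user_data 0x1234 */
  io_uring_prep_msg_ring(sqe, target_fd, 0, 0x1234,
                         IORING_MSG_RING_FLAGS_PASS);
  /* with FLAGS_PASS set, the value in sqe->file_index is copied into
   * cqe->flags on the receiving ring */
  sqe->file_index = MY_CQE_FLAGS;    /* placeholder custom flags */
  io_uring_submit(&ring);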
2023-01-29  io_uring: keep timeout in io_wait_queue  (Pavel Begunkov; 1 file changed, -14/+14)
Move waiting timeout into io_wait_queue Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/e4b48a9e26a3b1cf97c80121e62d4b5ab873d28d.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: optimise non-timeout waiting  (Pavel Begunkov; 1 file changed, -1/+3)
Unlike the jiffy scheduling version, schedule_hrtimeout() jumps a few functions before getting into schedule() even if there is no actual timeout needed. Some tests showed that it takes up to 1% of CPU cycles. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/89f880574eceee6f4899783377ead234df7b3d04.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
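The shape of the optimisation, as a hedged sketch (the iowq->timeout naming follows the preceding "keep timeout in io_wait_queue" patch; this is not a verbatim excerpt):

  /* in the CQ wait schedule path */
  if (iowq->timeout == KTIME_MAX)
          /* no user timeout: plain schedule() skips the hrtimer
           * setup/teardown that schedule_hrtimeout() always pays for */
          schedule();
  else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
          return -ETIME;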
2023-01-29  io_uring: set TASK_RUNNING right after schedule  (Pavel Begunkov; 1 file changed, -3/+2)
Instead of constantly watching that the state of the task is running before executing tw or taking locks in io_cqring_wait(), switch it back to TASK_RUNNING immediately. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/246dddee247d89fd52023f785ed17cc34962a008.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: simplify io_has_work  (Pavel Begunkov; 1 file changed, -2/+1)
->work_llist should never be non-empty for a non DEFER_TASKRUN ring, so we can safely skip checking the flag. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/26af9f73c09a56c9a035f94db56127358688f3aa.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
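After the change the helper can reduce to something like the following. Hedged sketch: only the ->work_llist emptiness test is described above; the overflow-bit check is assumed from surrounding context and may differ in the tree:

  static bool io_has_work(struct io_ring_ctx *ctx)
  {
          /* ->work_llist is only ever populated for DEFER_TASKRUN rings,
           * so the emptiness test alone is sufficient */
          return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
                 !llist_empty(&ctx->work_llist);
  }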
2023-01-29  io_uring: mimimise io_cqring_wait_schedule  (Pavel Begunkov; 1 file changed, -16/+23)
io_cqring_wait_schedule() is called after we started waiting on the cq wq and set the state to TASK_INTERRUPTIBLE, for that reason we have to constantly worry whether we have returned the state back to running or not. Leave only quick checks in io_cqring_wait_schedule() and move the rest including running task work to the callers. Note, we run tw in the loop after the sched checks because of the fast path in the beginning of the function. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/2814fabe75e2e019e7ca43ea07daa94564349805.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: parse check_cq out of wq waiting  (Pavel Begunkov; 1 file changed, -14/+18)
We already avoid flushing overflows in io_cqring_wait_schedule() but only return an error for the outer loop to handle it. Minimise it even further by moving all ->check_cq parsing there. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9dfcec3121013f98208dbf79368d636d74e1231a.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: move defer tw task checks  (Pavel Begunkov; 2 files changed, -9/+11)
Most places that want to run local tw explicitly and in advance check if they are allowed to do so. Don't rely on a similar check in __io_run_local_work(), leave it as a just-in-case warning and make sure callers check capabilities themselves. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/990fe0e8e70fd4d57e43625e5ce8fba584821d1a.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: kill io_run_task_work_ctx  (Pavel Begunkov; 2 files changed, -21/+5)
There is only one user of io_run_task_work_ctx(), inline it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/40953c65f7c88fb00cdc4d870ca5d5319fb3d7ea.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: don't iterate cq wait fast path  (Pavel Begunkov; 1 file changed, -10/+8)
Task work runners keep running until all queued tw items are exhausted. It's also rare for defer tw to queue normal tw and vice versa. Taking that into account, there is only a slim chance that further iterating the io_cqring_wait() fast path will get us anything, and so we can remove the loop there. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1f9565726661266abaa5d921e97433c831759ecf.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-29  io_uring: rearrange defer list checks  (Pavel Begunkov; 2 files changed, -4/+1)
There should be nothing in the ->work_llist for non DEFER_TASKRUN rings, so we can skip flag checks and test the list emptiness directly. Also move it out of io_run_local_work() for inlining. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/331d63fd15ca79b35b95c82a82d9246110686392.1672916894.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-27  io_uring: always prep_async for drain requests  (Dylan Yudaken; 1 file changed, -10/+8)
Drain requests all go through io_drain_req, which has a quick exit in case there is nothing pending (ie the drain is not useful). In that case it can issue the request immediately. However, for safety it queues it through task work. The problem is that in this case the request is run asynchronously, but the async work has not been prepared through io_req_prep_async. This has not been a problem up to now, as the task work would always run before returning to userspace, and so the user would not have a chance to race with it. However - with IORING_SETUP_DEFER_TASKRUN - this is no longer the case and the work might be deferred, giving userspace a chance to change data being referred to in the request. Instead _always_ prep_async for drain requests, which is simpler anyway and removes this issue. Cc: [email protected] Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Dylan Yudaken <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
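Conceptually the drain path then looks like the sketch below. io_drain_req and io_req_prep_async are named in the message; the placement shown is illustrative, not the actual diff:

  static __cold void io_drain_req(struct io_kiocb *req)
  {
          /* unconditionally prepare async data: with
           * IORING_SETUP_DEFER_TASKRUN the queued task_work may run after
           * the syscall returns, so the request must not keep pointing at
           * user memory that userspace is free to modify */
          io_req_prep_async(req);

          /* ... existing fast-exit / defer-list handling ... */
  }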
2023-01-23  io_uring/net: cache provided buffer group value for multishot receives  (Jens Axboe; 1 file changed, -0/+11)
If we're using ring provided buffers with multishot receive, and we end up doing an io-wq based issue at some point that also needs to select a buffer, we'll lose the initially assigned buffer group as io_ring_buffer_select() correctly clears the buffer group list as the issue isn't serialized by the ctx uring_lock. This is fine for normal receives as the request puts the buffer and finishes, but for multishot, we will re-arm and do further receives. On the next trigger for this multishot receive, the receive will try and pick from a buffer group whose value is the same as the buffer ID of the last receive. That is obviously incorrect, and will result in a premature -ENOBUFS error for the receive even if we had available buffers in the correct group. Cache the buffer group value at prep time, so we can restore it for future receives. This only needs doing for the above-mentioned case, but just do it by default to keep it easier to read. Cc: [email protected] Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Cc: Dylan Yudaken <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
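A hedged sketch of the idea: `sr` stands for the receive request's private data and the buf_group field name is an assumption; only "cache the buffer group value at prep time" comes from the text above:

  /* at prep time: remember which buffer group the user asked for */
  sr->buf_group = req->buf_index;

  /* before re-arming / re-selecting for the next multishot receive:
   * restore it, since req->buf_index now holds the last buffer ID */
  req->buf_index = sr->buf_group;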
2023-01-20  io_uring/poll: don't reissue in case of poll race on multishot request  (Jens Axboe; 1 file changed, -1/+5)
A previous commit fixed a poll race that can occur, but it's only applicable for multishot requests. For a multishot request, we can safely ignore a spurious wakeup, as we never leave the waitqueue to begin with. A blunt reissue of a multishot armed request can cause us to leak a buffer, if they are ring provided. While this seems like a bug in itself, it's not really defined behavior to reissue a multishot request directly. It's less efficient to do so as well, and not required to rearm anything like it is for singleshot poll requests. Cc: [email protected] Fixes: 6e5aedb9324a ("io_uring/poll: attempt request issue after racy poll wakeup") Reported-and-tested-by: Olivier Langlois <[email protected]> Link: https://github.com/axboe/liburing/issues/778 Signed-off-by: Jens Axboe <[email protected]>
2023-01-20  io_uring/msg_ring: fix remote queue to disabled ring  (Pavel Begunkov; 2 files changed, -2/+10)
IORING_SETUP_R_DISABLED rings don't have the submitter task set, so it's not always safe to use ->submitter_task. Disallow posting msg_ring messages to disabled rings. Also add a task NULL check for the loose sync around testing for IORING_SETUP_R_DISABLED. Cc: [email protected] Fixes: 6d043ee1164ca ("io_uring: do msg_ring in target task via tw") Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-20  io_uring/msg_ring: fix flagging remote execution  (Pavel Begunkov; 1 file changed, -17/+23)
There are a couple of problems with queueing a tw in io_msg_ring_data() for remote execution. First, once we queue it the target ring can go away and so setting IORING_SQ_TASKRUN there is not safe. Secondly, userspace might not expect IORING_SQ_TASKRUN. Extract a helper and uniformly use TWA_SIGNAL without TWA_SIGNAL_NO_IPI tricks for now, just as it was done in the original patch. Cc: [email protected] Fixes: 6d043ee1164ca ("io_uring: do msg_ring in target task via tw") Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-19  io_uring/msg_ring: fix missing lock on overflow for IOPOLL  (Jens Axboe; 1 file changed, -9/+30)
If the target ring is configured with IOPOLL, then we always need to hold the target ring uring_lock before posting CQEs. We could just grab it unconditionally, but since we don't expect many target rings to be of this type, make grabbing the uring_lock conditional on the ring type. Link: https://lore.kernel.org/io-uring/Y8krlYa52%[email protected]/ Reported-by: Xingyuan Mo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-19  io_uring/msg_ring: move double lock/unlock helpers higher up  (Jens Axboe; 1 file changed, -24/+23)
In preparation for needing them somewhere else, move them and get rid of the unused 'issue_flags' for the unlock side. No functional changes in this patch. Signed-off-by: Jens Axboe <[email protected]>
2023-01-18  mm/nommu: factor out check for NOMMU shared mappings into is_nommu_shared_mapping()  (David Hildenbrand; 1 file changed, -1/+1)
Patch series "mm/nommu: don't use VM_MAYSHARE for MAP_PRIVATE mappings". Trying to reduce the confusion around VM_SHARED and VM_MAYSHARE first requires !CONFIG_MMU to stop using VM_MAYSHARE for MAP_PRIVATE mappings. CONFIG_MMU only sets VM_MAYSHARE for MAP_SHARED mappings. This paves the way for further VM_MAYSHARE and VM_SHARED cleanups: for example, renaming VM_MAYSHARED to VM_MAP_SHARED to make it clearer what it actually means. Let's first get the weird case out of the way and not use VM_MAYSHARE in MAP_PRIVATE mappings, using a new VM_MAYOVERLAY flag instead. This patch (of 3): We want to stop using VM_MAYSHARE in private mappings to pave the way for clarifying the semantics of VM_MAYSHARE vs. VM_SHARED and reduce the confusion. While CONFIG_MMU uses VM_MAYSHARE to represent MAP_SHARED, !CONFIG_MMU also sets VM_MAYSHARE for selected R/O private file mappings that are an effective overlay of a file mapping. Let's factor out all relevant VM_MAYSHARE checks in !CONFIG_MMU code into is_nommu_shared_mapping() first. Note that whenever VM_SHARED is set, VM_MAYSHARE must be set as well (unless there is a serious BUG). So there is no need to test for VM_SHARED manually. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Nicolas Pitre <[email protected]> Cc: Pavel Begunkov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
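In essence, the factored-out helper looks like the sketch below for this first patch, before the later VM_MAYOVERLAY conversion in the rest of the series (hedged, exact kernel-doc wording may differ):

  /*
   * On !CONFIG_MMU, a mapping is treated as "shared" (a real shared mapping
   * or an effective file overlay) iff VM_MAYSHARE is set.
   */
  static inline bool is_nommu_shared_mapping(vm_flags_t flags)
  {
          return flags & VM_MAYSHARE;
  }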
2023-01-13  io_uring: lock overflowing for IOPOLL  (Pavel Begunkov; 1 file changed, -1/+5)
syzbot reports an issue with overflow filling for IOPOLL:

  WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
  CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
  Workqueue: events_unbound io_ring_exit_work
  Call trace:
   io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
   io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
   io_fill_cqe_req io_uring/io_uring.h:168 [inline]
   io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
   io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
   io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
   io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
   process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
   worker_thread+0x340/0x610 kernel/workqueue.c:2436
   kthread+0x12c/0x158 kernel/kthread.c:376
   ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863

There is no real problem for normal IOPOLL as flush is also called with uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL, for which __io_cqring_overflow_flush() happens from the CQ waiting path.

Reported-and-tested-by: [email protected]
Cc: [email protected] # 5.10+
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2023-01-12  io_uring/poll: attempt request issue after racy poll wakeup  (Jens Axboe; 1 file changed, -11/+20)
If we have multiple requests waiting on the same target poll waitqueue, then it's quite possible to get a request triggered and get disappointed in not being able to make any progress with it. If we race in doing so, we'll potentially leave the poll request on the internal tables, but removed from the waitqueue. That means that any subsequent trigger of the poll waitqueue will not kick that request into action, causing an application to potentially wait for completion of a request that will never happen. Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather than have complicated logic for how to re-arm a given type of request, just punt it for a reissue. While in there, move the 'ret' variable to the only section where it gets used. This avoids any confusion about its scope. Cc: [email protected] Fixes: eb0089d629ba ("io_uring: single shot poll removal optimisation") Signed-off-by: Jens Axboe <[email protected]>
2023-01-10  io_uring/fdinfo: include locked hash table in fdinfo output  (Jens Axboe; 1 file changed, -2/+10)
A previous commit split the hash table for polled requests into two parts, but didn't get the fdinfo output updated. This means that it's less useful for debugging, as we may think a given request is not pending poll. Fix this up by dumping the locked hash table contents too. Fixes: 9ca9fb24d5fe ("io_uring: mutex locked poll hashing") Signed-off-by: Jens Axboe <[email protected]>
2023-01-09  io_uring/poll: add hash if ready poll request can't complete inline  (Jens Axboe; 1 file changed, -5/+12)
If we don't, then we may lose access to it completely, leading to a request leak. This will eventually stall the ring exit process as well. Cc: [email protected] Fixes: 49f1c68e048f ("io_uring: optimise submission side poll_refs") Reported-and-tested-by: [email protected] Link: https://lore.kernel.org/io-uring/[email protected]/ Suggested-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-08  io_uring: use iter_ubuf for single range imports  (Jens Axboe; 1 file changed, -3/+6)
This is more efficient than iter_iov. Signed-off-by: Jens Axboe <[email protected]> [merge to 6.2, minor fixes] Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2023-01-08  io_uring: switch network send/recv to ITER_UBUF  (Jens Axboe; 1 file changed, -12/+5)
This is more efficient than iter_iov. Signed-off-by: Jens Axboe <[email protected]> [merged to 6.2] Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
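For context, the difference is between wrapping the single user range in a one-element iovec and pointing the iov_iter straight at the user buffer. A hedged sketch, where buf/len are placeholders:

  struct iovec iov = { .iov_base = buf, .iov_len = len };
  struct iov_iter iter;

  /* old approach: a one-element ITER_IOVEC around the single range */
  iov_iter_init(&iter, ITER_DEST, &iov, 1, len);

  /* new approach: ITER_UBUF references the user pointer directly, which
   * is cheaper to set up and to advance */
  iov_iter_ubuf(&iter, ITER_DEST, buf, len);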
2023-01-08  io_uring/io-wq: only free worker if it was allocated for creation  (Jens Axboe; 1 file changed, -1/+6)
We have two types of task_work based creation, one is using an existing worker to setup a new one (eg when going to sleep and we have no free workers), and the other is allocating a new worker. Only the latter should be freed when we cancel task_work creation for a new worker. Fixes: af82425c6a2d ("io_uring/io-wq: free worker if task_work creation is canceled") Reported-by: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-05  io_uring: fix CQ waiting timeout handling  (Pavel Begunkov; 1 file changed, -3/+3)
Jiffy to ktime CQ waiting conversion broke how we treat timeouts, in particular we rearm it anew every time we get into io_cqring_wait_schedule() without adjusting the timeout. Waiting for 2 CQEs and getting a task_work in the middle may double the timeout value, or even worse in some cases task may wait indefinitely. Cc: [email protected] Fixes: 228339662b398 ("io_uring: don't convert to jiffies for waiting on timeouts") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/f7bffddd71b08f28a877d44d37ac953ddb01590d.1672915663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
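The fix amounts to computing one absolute deadline up front rather than re-arming a relative timeout on every pass through the schedule path. A hedged sketch; `uts` stands for the user-supplied timespec pointer and the surrounding wait-loop code is elided:

  ktime_t timeout = KTIME_MAX;
  struct timespec64 ts;

  if (uts) {
          if (get_timespec64(&ts, uts))
                  return -EFAULT;
          /* one absolute deadline, computed once before the wait loop */
          timeout = ktime_add_ns(timespec64_to_ktime(ts), ktime_get_ns());
  }

  /* later, every trip through the schedule path reuses the same deadline,
   * so task_work runs in between no longer restart the timeout */
  if (!schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS))
          return -ETIME;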
2023-01-03  io_uring: lockdep annotate CQ locking  (Pavel Begunkov; 2 files changed, -3/+17)
Locking around CQE posting is complex and depends on options the ring is created with, add more thorough lockdep annotations checking all invariants. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/aa3770b4eacae3915d782cc2ab2f395a99b4b232.1672795976.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-03  io_uring: pin context while queueing deferred tw  (Pavel Begunkov; 1 file changed, -1/+7)
Unlike normal tw, nothing prevents deferred tw from being executed right after a tw item is added to ->work_llist in io_req_local_work_add(). For instance, the waiting task may get woken up by CQ posting or a normal tw. Thus we need to pin the ring for the rest of io_req_local_work_add(). Cc: [email protected] Fixes: c0e0d6ba25f18 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1a79362b9c10b8523ef70b061d96523650a23344.1672795998.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2023-01-02  io_uring/io-wq: free worker if task_work creation is canceled  (Jens Axboe; 1 file changed, -0/+1)
If we cancel the task_work, the worker will never come into existence. As this is the last reference to it, ensure that we get it freed appropriately. Cc: [email protected] Reported-by: 진호 <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-12-23  io_uring: check for valid register opcode earlier  (Jens Axboe; 1 file changed, -2/+3)
We only check the register opcode value inside the restricted ring section, move it into the main io_uring_register() function instead and check it up front. Signed-off-by: Jens Axboe <[email protected]>
2022-12-21  io_uring/cancel: re-grab ctx mutex after finishing wait  (Jens Axboe; 1 file changed, -5/+4)
If we have a signal pending during cancelations, it'll cause the task_work run to return an error. Since we didn't run task_work, the current task is left in TASK_INTERRUPTIBLE state when we need to re-grab the ctx mutex, and the kernel will rightfully complain about that. Move the lock grabbing for the error cases outside the loop to avoid that issue. Reported-by: [email protected] Link: https://lore.kernel.org/io-uring/[email protected]/ Signed-off-by: Jens Axboe <[email protected]>
2022-12-21  io_uring: finish waiting before flushing overflow entries  (Jens Axboe; 1 file changed, -9/+16)
If we have overflow entries being generated after we've done the initial flush in io_cqring_wait(), then we could be flushing them in the main wait loop as well. If that's done after having added ourselves to the cq_wait waitqueue, then the task state can be != TASK_RUNNING when we enter the overflow flush. Check for the need to overflow flush, and finish our wait cycle first if we have to do so. Reported-and-tested-by: [email protected] Link: https://lore.kernel.org/io-uring/[email protected]/ Signed-off-by: Jens Axboe <[email protected]>
2022-12-19  io_uring/net: fix cleanup after recycle  (Pavel Begunkov; 1 file changed, -1/+1)
Don't access io_async_msghdr after io_netmsg_recycle(); it may be reallocated. Cc: [email protected] Fixes: 9bb66906f23e5 ("io_uring: support multishot in recvmsg") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9e326f4ad4046ddadf15bf34bf3fa58c6372f6b5.1671461985.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-12-19  io_uring/net: ensure compat import handlers clear free_iov  (Jens Axboe; 1 file changed, -0/+1)
If we're not allocating the vectors because the count is below UIO_FASTIOV, we still do need to properly clear ->free_iov to prevent an erroneous free of on-stack data. Reported-by: Jiri Slaby <[email protected]> Fixes: 4c17a496a7a0 ("io_uring/net: fix cleanup double free free_iov init") Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-17  io_uring: include task_work run after scheduling in wait for events  (Jens Axboe; 1 file changed, -1/+10)
It's quite possible that we got woken up because task_work was queued, and we need to process this task_work to generate the events waited for. If we return to the wait loop without running task_work, we'll end up adding the task to the waitqueue again, only to call io_cqring_wait_schedule() again which will run the task_work. This is less efficient than it could be, as it requires adding to the cq_wait queue again. It also triggers the wakeup path for completions as cq_wait is now non-empty with the task itself, and it'll require another lock grab and deletion to remove ourselves from the waitqueue. Signed-off-by: Jens Axboe <[email protected]>
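In the wait loop this amounts to running task_work right after waking, before deciding whether to sleep again. A hedged sketch of the loop body; io_cqring_wait_schedule() and cq_wait come from the text above, while the io_should_wake()/iowq details are simplified for illustration:

  do {
          /* prepare_to_wait() on ctx->cq_wait, overflow checks, etc. */
          ret = io_cqring_wait_schedule(ctx, &iowq);
          __set_current_state(TASK_RUNNING);
          if (ret < 0)
                  break;
          /* we may have been woken only so queued task_work could run;
           * process it now so the CQEs it posts count towards the wait
           * condition instead of costing another cq_wait round trip */
          io_run_task_work();
  } while (!io_should_wake(&iowq));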
2022-12-17  io_uring: don't use TIF_NOTIFY_SIGNAL to test for availability of task_work  (Jens Axboe; 1 file changed, -2/+1)
Use task_work_pending() as a better test for whether we have task_work or not; TIF_NOTIFY_SIGNAL is only valid if any of the task_work items had been queued with TWA_SIGNAL as the notification mechanism. Hence task_work_pending() is a more reliable check. Signed-off-by: Jens Axboe <[email protected]>
2022-12-15  io_uring: use call_rcu_hurry if signaling an eventfd  (Dylan Yudaken; 1 file changed, -1/+1)
io_uring uses call_rcu in the case it needs to signal an eventfd as a result of an eventfd signal, since recursing eventfd signals are not allowed. This should be calling the new call_rcu_hurry API to not delay the signal. Signed-off-by: Dylan Yudaken <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Paul E. McKenney <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
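The change itself is a one-liner of this shape. Hedged sketch: call_rcu_hurry() takes the same arguments as call_rcu(), and io_eventfd_rcu_cb is a placeholder name for the existing RCU callback in the eventfd signalling path:

  /* was: call_rcu(&ev_fd->rcu, io_eventfd_rcu_cb); */
  call_rcu_hurry(&ev_fd->rcu, io_eventfd_rcu_cb);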
2022-12-15  io_uring: fix overflow handling regression  (Pavel Begunkov; 2 files changed, -2/+2)
Because the single task locking series got reordered ahead of the timeout and completion lock changes, two hunks inadvertently ended up using __io_fill_cqe_req() rather than io_fill_cqe_req(). This meant that we dropped overflow handling in those two spots. Reinstate the correct CQE filling helper. Fixes: f66f73421f0a ("io_uring: skip spinlocking for ->task_complete") Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-12-14  io_uring: ease timeout flush locking requirements  (Pavel Begunkov; 2 files changed, -7/+4)
We don't need completion_lock for timeout flushing, don't take it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1e3dc657975ac445b80e7bdc40050db783a5935a.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-12-14  io_uring: revise completion_lock locking  (Pavel Begunkov; 3 files changed, -15/+20)
io_kill_timeouts() doesn't post any events but queues everything to task_work. Locking there is needed for protecting linked requests traversing, we should grab completion_lock directly instead of using io_cq_[un]lock helpers. Same goes for __io_req_find_next_prep(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/88e75d481a65dc295cb59722bb1cf76402d1c06b.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-12-14  io_uring: protect cq_timeouts with timeout_lock  (Pavel Begunkov; 1 file changed, -1/+3)
Read cq_timeouts in io_flush_timeouts() only after taking the timeout_lock, as it's protected by it. There are many places where we also grab ->completion_lock, but for instance io_timeout_fn() doesn't and still modifies cq_timeouts. Cc: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9c79544dd6cf5c4018cb1bab99cf481a93ea46ef.1670002973.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-12-13  Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux  (Linus Torvalds; 1 file changed, -1/+2)
Pull block updates from Jens Axboe:

 - NVMe pull requests via Christoph:
     - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi)
     - Refactor PCIe probing and reset (Christoph Hellwig)
     - Various fabrics authentication fixes and improvements (Sagi Grimberg)
     - Avoid fallback to sequential scan due to transient issues (Uday Shankar)
     - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig)
     - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov)
     - Force reconnect when number of queue changes in nvmet (Daniel Wagner)
     - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET)
     - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni)
     - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig)
     - Cleanup the nvme-pci removal path (Christoph Hellwig)
     - Use kstrtobool() instead of strtobool (Christophe JAILLET)
     - Allow unprivileged passthrough of Identify Controller (Joel Granados)
     - Support io stats on the mpath device (Sagi Grimberg)
     - Minor nvmet cleanup (Sagi Grimberg)

 - MD pull requests via Song:
     - Code cleanups (Christoph)
     - Various fixes

 - Floppy pull request from Denis:
     - Fix a memory leak in the init error path (Yuan)

 - Series fixing some batch wakeup issues with sbitmap (Gabriel)

 - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg)

 - Fix for partition read on an exclusively opened bdev (Jan)

 - Series of elevator API cleanups (Jinlong, Christoph)

 - Series of fixes and cleanups for blk-iocost (Kemeng)

 - Series of fixes and cleanups for blk-throttle (Kemeng)

 - Series adding concurrent support for sync queues in BFQ (Yu)

 - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp)

 - Misc drbd fixes (Wang)

 - blk-wbt fixes and tweaks for enable/disable (Yu)

 - Fixes for mq-deadline for zoned devices (Damien)

 - Add support for read-only and offline zones for null_blk (Shin'ichiro)

 - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph)

 - Series enabling bio alloc caching for IRQ based IO (Pavel)

 - Series enabling userspace peer-to-peer DMA (Logan)

 - BFQ waker fixes (Khazhismel)

 - Series fixing elevator refcount issues (Christoph, Jinlong)

 - Series cleaning up references around queue destruction (Christoph)

 - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao)

 - Series untangling the queue kobject and queue references (Christoph)

 - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph)

* tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits)
  blktrace: Fix output non-blktrace event when blk_classic option enabled
  block: sed-opal: Don't include <linux/kernel.h>
  sed-opal: allow using IOC_OPAL_SAVE for locking too
  blk-cgroup: Fix typo in comment
  block: remove bio_set_op_attrs
  nvmet: don't open-code NVME_NS_ATTR_RO enumeration
  nvme-pci: use the tagset alloc/free helpers
  nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set
  nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers
  nvme: consolidate setting the tagset flags
  nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
  block: bio_copy_data_iter
  nvme-pci: split out a nvme_pci_ctrl_is_dead helper
  nvme-pci: return early on ctrl state mismatch in nvme_reset_work
  nvme-pci: rename nvme_disable_io_queues
  nvme-pci: cleanup nvme_suspend_queue
  nvme-pci: remove nvme_pci_disable
  nvme-pci: remove nvme_disable_admin_queue
  nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
  nvme: use nvme_wait_ready in nvme_shutdown_ctrl
  ...