path: root/io_uring
Age | Commit message | Author | Files | Lines
2024-04-25 | io_uring/rw: reinstate thread check for retries | Jens Axboe | 3 | -25/+29
Allowing retries for everything is arguably the right thing to do, now that every command type is async read from the start. But it's exposed a few issues around missing check for a retry (which cca6571381a0 exposed), and the fixup commit for that isn't necessarily 100% sound in terms of iov_iter state. For now, just revert these two commits. This unfortunately then re-opens the fact that -EAGAIN can get bubbled to userspace for some cases where the kernel very well could just sanely retry them. But until we have all the conditions covered around that, we cannot safely enable that. This reverts commit df604d2ad480fcf7b39767280c9093e13b1de952. This reverts commit cca6571381a0bdc88021a1f7a4c2349df21279f7. Signed-off-by: Jens Axboe <[email protected]>
2024-04-22 | io_uring/notif: implement notification stacking | Pavel Begunkov | 2 | -7/+67
The network stack allows only one ubuf_info per skb, and unlike MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate ubuf_info. That means that send requests can't reuse a previously allocated skb and need to get one or more new ones. That's fine for large sends, but otherwise it would spam the stack with lots of skbs carrying just a little data each. To help with that, implement linking notifications (i.e. an io_uring wrapper around ubuf_info) into a list. Each is refcounted by skbs and the stack as usual. Additionally, all non-head entries keep a reference to the head, which they put down when their refcount hits 0. When the head has no more users, it'll efficiently put all notifications in a batch. As mentioned previously about ->io_link_skb, the callback implementation always allows binding to an skb without a ubuf_info. Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/bf1e7f9b72f9ecc99999fdc0d2cded5eea87fd0b.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
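The reference scheme described in this message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `toy_notif`, `toy_link()` and `toy_put()` are hypothetical names, plain integers stand in for kernel refcounts, and "completing" a notification is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a stacked notification. */
struct toy_notif {
    int refs;               /* references held by skbs and the stack */
    struct toy_notif *head; /* points to itself for the head entry */
    struct toy_notif *next; /* next entry in the stack */
    bool completed;
};

/* Link a new entry behind the head; the entry pins the head with one reference. */
static void toy_link(struct toy_notif *head, struct toy_notif *n)
{
    n->head = head;
    n->next = head->next;
    head->next = n;
    head->refs++;
}

/* Drop one reference; the last put on the head completes the whole chain. */
static void toy_put(struct toy_notif *n)
{
    if (--n->refs)
        return;
    if (n->head != n) {
        /* Non-head entry is done: release the reference it held on the head. */
        toy_put(n->head);
        return;
    }
    /* Head has no users left: complete every linked notification in one pass. */
    for (; n; n = n->next)
        n->completed = true;
}

/* Returns 0 if nothing completes early and everything completes together. */
int demo_notif_stacking(void)
{
    struct toy_notif head = { .refs = 1, .head = &head };
    struct toy_notif second = { .refs = 1 };

    toy_link(&head, &second);
    toy_put(&head); /* head's own user is gone, but `second` still pins it */
    if (head.completed || second.completed)
        return -1;
    toy_put(&second); /* last user anywhere: the batch completes */
    return (head.completed && second.completed) ? 0 : -1;
}
```

Calling `demo_notif_stacking()` walks through the case where the head's own user goes away first: nothing completes until the last non-head entry drops its reference.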
2024-04-22 | io_uring/notif: simplify io_notif_flush() | Pavel Begunkov | 2 | -9/+6
io_notif_flush() is partially duplicating io_tx_ubuf_complete(), so instead of duplicating it, make the flush call io_tx_ubuf_complete. Reviewed-by: Jens Axboe <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/19e41652c16718b946a5c80d2ad409df7682e47e.1713369317.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-22 | Merge branch 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-6.10/io_uring | Jens Axboe | 1 | -2/+6
Merge net changes required for the upcoming send zerocopy improvements.
* 'for-uring-ubufops' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
  net: add callback for setting a ubuf_info to skb
  net: extend ubuf_info callback to ops structure
Signed-off-by: Jens Axboe <[email protected]>
2024-04-22 | net: extend ubuf_info callback to ops structure | Pavel Begunkov | 1 | -5/+13
We'll need to associate additional callbacks with ubuf_info, so introduce a structure holding ubuf_info callbacks. Apart from the smarter io_uring notification management introduced in the next patches, it can be used to generalise msg_zerocopy_put_abort() and also store ->sg_from_iter, which is currently passed in struct msghdr. Reviewed-by: Jens Axboe <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <[email protected]>
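The general pattern of replacing a single embedded callback with a pointer to a callback table might look like the sketch below. All names here (`toy_ubuf_info`, `toy_ubuf_ops`, `toy_ubuf_done()`) are hypothetical and chosen for illustration; the real kernel structures and signatures are not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ubuf_info;

/* Hypothetical ops table: one place to hang every per-buffer callback. */
struct toy_ubuf_ops {
    void (*complete)(struct toy_ubuf_info *uarg, bool success);
};

struct toy_ubuf_info {
    const struct toy_ubuf_ops *ops; /* replaces a single embedded callback pointer */
    int completions;
    bool last_success;
};

/* One concrete implementation of the completion callback. */
static void toy_complete(struct toy_ubuf_info *uarg, bool success)
{
    uarg->completions++;
    uarg->last_success = success;
}

/* The table is const and shared by every buffer of this kind. */
static const struct toy_ubuf_ops toy_ops = {
    .complete = toy_complete,
};

/* Callers dispatch through the table instead of a per-object function pointer. */
static void toy_ubuf_done(struct toy_ubuf_info *uarg, bool success)
{
    uarg->ops->complete(uarg, success);
}

/* Returns 0 if the callback was reached exactly once through the table. */
int demo_ubuf_ops(void)
{
    struct toy_ubuf_info uarg = { .ops = &toy_ops };

    toy_ubuf_done(&uarg, true);
    return (uarg.completions == 1 && uarg.last_success) ? 0 : -1;
}
```

Adding a further callback then means adding one member to the table, without growing every object that points at it.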
2024-04-22 | io_uring/net: support bundles for recv | Jens Axboe | 2 | -22/+97
If IORING_OP_RECV is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs buffers available and receives into them, posting a single completion for all of it. This can be used with multishot receive as well, or without it. Now that both send and receive support bundles, add a feature flag for it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the ring, then the kernel supports bundles for recv and send. Signed-off-by: Jens Axboe <[email protected]>
2024-04-22 | io_uring/net: support bundles for send | Jens Axboe | 1 | -17/+128
If IORING_OP_SEND is used with provided buffers, the caller may also set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea is that an application can fill outgoing buffers in a provided buffer group, and then arm a single send that will service them all. Once there are no more buffers to send, or if the requested length has been sent, the request posts a single completion for all the buffers. This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming in a separate patch. However, this patch does do a lot of the prep work that makes wiring up the sendmsg variant pretty trivial. They share the prep side. Signed-off-by: Jens Axboe <[email protected]>
2024-04-22 | io_uring/kbuf: add helpers for getting/peeking multiple buffers | Jens Axboe | 2 | -12/+198
Our provided buffer interface only allows selection of a single buffer. Add an API that allows getting/peeking multiple buffers at the same time. This is only implemented for the ring provided buffers. It could be added for the legacy provided buffers as well, but since it's strongly encouraged to use the new interface, let's keep it simpler and just provide it for the new API. The legacy interface will always just select a single buffer.
There are two new main functions:
- io_buffers_select(), which selects as many buffers as it can. The caller supplies the iovec array, and io_buffers_select() may allocate a bigger array if the 'out_len' being passed in is non-zero and bigger than what fits in the provided iovec. Buffers grabbed with this helper are permanently assigned.
- io_buffers_peek(), which works like io_buffers_select(), except they can be recycled, if needed. Callers using either of these functions should call io_put_kbufs() rather than io_put_kbuf() at completion time. The peek interface must be called with the ctx locked from peek to completion.
This adds a bit of state for the request:
- REQ_F_BUFFERS_COMMIT, which means that the buffers have been peeked and should be committed to the buffer ring head when they are put as part of completion. Prior to this, req->buf_list was cleared to NULL when committed.
Signed-off-by: Jens Axboe <[email protected]>
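The difference between peeking buffers and committing them can be illustrated with a toy ring in userspace C. This is a sketch under assumptions, not the kernel helpers: `toy_buf_ring`, `toy_peek()` and `toy_commit()` are hypothetical names, and trimming the last buffer to the remaining length is a simplification made for this example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RING_ENTRIES 8 /* power of two, illustrative size */

struct toy_buf {
    void *addr;
    size_t len;
};

struct toy_buf_ring {
    struct toy_buf bufs[TOY_RING_ENTRIES];
    unsigned short head; /* consumer position, only moved by commit */
    unsigned short tail; /* producer position, moved by the application */
};

/* Peek: copy up to `max` buffers covering at most `max_len` bytes; head stays put. */
static unsigned toy_peek(const struct toy_buf_ring *br, struct toy_buf *out,
                         unsigned max, size_t max_len)
{
    unsigned avail = (unsigned short)(br->tail - br->head);
    unsigned n = 0;

    while (n < avail && n < max && max_len) {
        struct toy_buf b = br->bufs[(br->head + n) & (TOY_RING_ENTRIES - 1)];

        if (b.len > max_len)
            b.len = max_len; /* trim the final buffer to what is still wanted */
        out[n++] = b;
        max_len -= b.len;
    }
    return n;
}

/* Commit: consume `nr` previously peeked buffers by advancing the ring head. */
static void toy_commit(struct toy_buf_ring *br, unsigned nr)
{
    br->head += nr;
}

/* Returns 0 if peeking is repeatable and commit is what consumes buffers. */
int demo_peek_commit(void)
{
    static char storage[3][100];
    struct toy_buf_ring br = { .head = 0, .tail = 0 };
    struct toy_buf out[4];
    unsigned n;

    for (int i = 0; i < 3; i++) {
        br.bufs[br.tail & (TOY_RING_ENTRIES - 1)] =
            (struct toy_buf){ storage[i], sizeof(storage[i]) };
        br.tail++;
    }

    n = toy_peek(&br, out, 4, 250);
    if (n != 3 || out[2].len != 50 || br.head != 0)
        return -1;
    if (toy_peek(&br, out, 4, 250) != 3) /* nothing was consumed by the first peek */
        return -1;
    toy_commit(&br, n);
    return toy_peek(&br, out, 4, 250) == 0 ? 0 : -1;
}
```

Because a peek leaves the head untouched, a request that ends up not using the buffers can simply walk away, which is the recycling property the message describes.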
2024-04-22 | io_uring/net: add provided buffer support for IORING_OP_SEND | Jens Axboe | 2 | -5/+21
It's pretty trivial to wire up provided buffer support for the send side, just like how it's done on the receive side. This enables setting up a buffer ring that an application can use to push pending sends to, and then have a send pick a buffer from that ring.

One of the challenges with async IO and networking sends is that you can get into reordering conditions if you have more than one inflight at the same time. Consider the following scenario where everything is fine:

1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, completes successfully, posts CQE
5) sendB is issued, completes successfully, posts CQE

All is fine. Requests are always issued in-order, and both complete inline as most sends do.

However, if we're flooding socket1 with sends, the following could also result from the same sequence:

1) App queues sendA for socket1
2) App queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, completes successfully, posts CQE
7) sendA is retried, completes successfully, posts CQE

Now we've sent sendB before sendA, which can make things unhappy. If both sendA and sendB had been using provided buffers, then it would look as follows instead:

1) App queues dataA for sendA, queues sendA for socket1
2) App queues dataB for sendB, queues sendB for socket1
3) App does io_uring_submit()
4) sendA is issued, socket1 is full, poll is armed for retry
5) Space frees up in socket1, this triggers sendA retry via task_work
6) sendB is issued, picks first buffer (dataA), completes successfully, posts CQE (which says "I sent dataA")
7) sendA is retried, picks first buffer (dataB), completes successfully, posts CQE (which says "I sent dataB")

Now we've sent the data in order, and everybody is happy.

It's worth noting that this also opens the door for supporting multishot sends, as provided buffers would be a prerequisite for that. Those can trigger either when new buffers are added to the outgoing ring, or (if stalled due to lack of space) when space frees up in the socket. Signed-off-by: Jens Axboe <[email protected]>
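The ordering argument in this message can be checked with a small userspace simulation. It is a sketch only: the ring is a plain FIFO, `toy_out_ring` and the helper names are hypothetical, and the socket is reduced to a string that records what went out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_RING_SIZE 8

/* Hypothetical stand-in for an outgoing provided-buffer ring: a plain FIFO. */
struct toy_out_ring {
    const char *bufs[TOY_RING_SIZE];
    unsigned head; /* next buffer a send will pick */
    unsigned tail; /* next free slot for the application */
};

/* Application side: queue payload in the order it must reach the wire. */
static void toy_ring_push(struct toy_out_ring *r, const char *data)
{
    r->bufs[r->tail++ % TOY_RING_SIZE] = data;
}

/* Send side: pick the head buffer at the moment the send actually runs. */
static const char *toy_ring_pick(struct toy_out_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    return r->bufs[r->head++ % TOY_RING_SIZE];
}

/*
 * sendA stalls on a full socket, so sendB executes first and sendA is
 * retried afterwards. Writes what reached the wire into `wire` and returns
 * 0 if the payload still went out as dataA followed by dataB.
 */
int simulate_reordered_sends(char *wire, size_t len)
{
    struct toy_out_ring ring = { .head = 0, .tail = 0 };
    const char *by_send_b, *by_send_a;

    toy_ring_push(&ring, "dataA");
    toy_ring_push(&ring, "dataB");

    by_send_b = toy_ring_pick(&ring); /* sendB runs first */
    by_send_a = toy_ring_pick(&ring); /* the sendA retry runs second */
    if (!by_send_b || !by_send_a)
        return -1;

    snprintf(wire, len, "%s%s", by_send_b, by_send_a);
    return strcmp(wire, "dataAdataB") == 0 ? 0 : -1;
}
```

The point of the sketch is that the payload is bound to the ring position rather than to the request, so whichever send executes first carries the oldest queued data.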
2024-04-22 | io_uring/net: add generic multishot retry helper | Jens Axboe | 1 | -12/+12
This is just moving io_recv_prep_retry() higher up so it can get used for sends as well, and rename it to be generically useful for both sends and receives. Signed-off-by: Jens Axboe <[email protected]>
2024-04-17 | io_uring/rw: ensure retry condition isn't lost | Jens Axboe | 3 | -7/+20
A previous commit removed the checking on whether or not it was possible to retry a request, since it's now possible to retry any of them. This would previously have caused the request to have been ended with an error, but now the retry condition can simply get lost instead. Cleanup the retry handling and always just punt it to task_work, which will queue it with io-wq appropriately. Reported-by: Changhui Zhong <[email protected]> Tested-by: Ming Lei <[email protected]> Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path") Signed-off-by: Jens Axboe <[email protected]>
2024-04-17 | io-wq: Drop intermediate step between pending list and active work | Gabriel Krisman Bertazi | 1 | -5/+2
next_work is only used to make the work visible for cancellation. Instead, we can just directly write to cur_work before dropping the acct_lock and avoid the extra hop. Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-17 | io-wq: write next_work before dropping acct_lock | Gabriel Krisman Bertazi | 1 | -5/+8
Commit 361aee450c6e ("io-wq: add intermediate work step between pending list and active work") closed a race between a cancellation and the work being removed from the wq for execution. To ensure the request is always reachable by the cancellation, we need to move it within the wq lock, which also synchronizes the cancellation. But commit 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") replaced the wq lock here and accidentally reintroduced the race by releasing the acct_lock too early. In other words:

    worker                          | cancellation
    work = io_get_next_work()       |
    raw_spin_unlock(&acct->lock);   |
                                    |
                                    | io_acct_cancel_pending_work
                                    | io_wq_worker_cancel()
    worker->next_work = work

Using acct_lock is still enough since we synchronize on it on io_acct_cancel_pending_work. Fixes: 42abc95f05bf ("io-wq: decouple work_list protection from the big wqe->lock") Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | remove call_{read,write}_iter() functions | Miklos Szeredi | 1 | -2/+2
These have no clear purpose. This is effectively a revert of commit bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()"). The patch was created with the help of a coccinelle script. Fixes: bb7462b6fd64 ("vfs: use helpers for calling f_op->{read,write}_iter()") Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-04-15 | io_uring/sqpoll: work around a potential audit memory leak | Jens Axboe | 1 | -0/+8
kmemleak complains that there's a memory leak related to connect handling:

unreferenced object 0xffff0001093bdf00 (size 128):
  comm "iou-sqp-455", pid 457, jiffies 4294894164
  hex dump (first 32 bytes):
    02 00 fa ea 7f 00 00 01 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 2e481b1a):
    [<00000000c0a26af4>] kmemleak_alloc+0x30/0x38
    [<000000009c30bb45>] kmalloc_trace+0x228/0x358
    [<000000009da9d39f>] __audit_sockaddr+0xd0/0x138
    [<0000000089a93e34>] move_addr_to_kernel+0x1a0/0x1f8
    [<000000000b4e80e6>] io_connect_prep+0x1ec/0x2d4
    [<00000000abfbcd99>] io_submit_sqes+0x588/0x1e48
    [<00000000e7c25e07>] io_sq_thread+0x8a4/0x10e4
    [<00000000d999b491>] ret_from_fork+0x10/0x20

which can happen if:

1) The command type does something on the prep side that triggers an audit call.
2) The thread hasn't done any operations before this that triggered an audit call inside ->issue(), where we have audit_uring_entry() and audit_uring_exit().

Work around this by issuing a blanket NOP operation before the SQPOLL does anything. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/notif: shrink account_pages to u32 | Pavel Begunkov | 1 | -1/+2
->account_pages is the number of pages we account against the user derived from unsigned len, it definitely fits into unsigned, which saves some space in struct io_notif_data. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/19f2687fcb36daa74d86f4a27bfb3d35cffec318.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/notif: remove ctx var from io_notif_tw_complete | Pavel Begunkov | 1 | -3/+2
We don't need ctx in the hottest path, i.e. registered buffers, let's get it only when we need it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/e7345e268ffaeaf79b4c8f3a5d019d6a87a3d1f1.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/notif: refactor io_tx_ubuf_complete() | Pavel Begunkov | 1 | -4/+5
Flip the dec_and_test "if", that makes the function extension easier in the future. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/43939e2b04dff03bff5d7227c98afedf951227b3.1713185320.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: ensure overflow entries are dropped when ring is exiting | Jens Axboe | 1 | -1/+2
A previous consolidation cleanup missed handling the case where the ring is dying, and __io_cqring_overflow_flush() doesn't flush entries if the CQ ring is already full. This is fine for the normal CQE overflow flushing, but if the ring is going away, we need to flush everything, even if it means simply freeing the overflown entries. Fixes: 6c948ec44b29 ("io_uring: consolidate overflow flushing") Signed-off-by: Jens Axboe <[email protected]>
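The behaviour this fix asks for can be sketched as a tiny state machine. This is an illustrative model only: `toy_ring` and `toy_overflow_flush()` are hypothetical names, and the overflow list and CQ ring are reduced to counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counters stand in for the CQ ring and the overflow list. */
struct toy_ring {
    unsigned cq_free;  /* free slots left in the CQ ring */
    unsigned overflow; /* entries waiting on the overflow list */
    unsigned posted;   /* overflow entries copied into the CQ ring */
    unsigned dropped;  /* overflow entries freed without being posted */
};

/*
 * Normal flushing stops as soon as the CQ ring is full. A dying ring must
 * not stop: every overflow entry is released even though none can be posted.
 */
static void toy_overflow_flush(struct toy_ring *r, bool dying)
{
    while (r->overflow) {
        if (dying) {
            r->dropped++;
        } else if (r->cq_free) {
            r->cq_free--;
            r->posted++;
        } else {
            break; /* ring full and still alive: keep the entry for later */
        }
        r->overflow--;
    }
}

/* Returns 0 if a full ring keeps entries while alive and drops them when dying. */
int demo_overflow_flush(void)
{
    struct toy_ring full = { .cq_free = 0, .overflow = 3 };
    struct toy_ring partial = { .cq_free = 2, .overflow = 3 };

    toy_overflow_flush(&full, false);
    if (full.overflow != 3)
        return -1;
    toy_overflow_flush(&full, true);
    if (full.overflow != 0 || full.dropped != 3 || full.posted != 0)
        return -1;

    toy_overflow_flush(&partial, false);
    if (partial.posted != 2 || partial.overflow != 1)
        return -1;
    return 0;
}
```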
2024-04-15 | io_uring/timeout: remove duplicate initialization of the io_timeout list. | Ruyi Zhang | 1 | -1/+0
In the __io_timeout_prep function, the io_timeout list is initialized twice. Remove the meaningless second initialization. Signed-off-by: Ruyi Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: consolidate overflow flushing | Pavel Begunkov | 1 | -25/+15
Consolidate __io_cqring_overflow_flush and io_cqring_overflow_kill() into a single function as it once was, it's easier to work with it this way. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/986b42c35e76a6be7aa0cdcda0a236a2222da3a7.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: always lock __io_cqring_overflow_flush | Pavel Begunkov | 1 | -5/+8
Conditional locking is never great, in case of __io_cqring_overflow_flush(), which is a slow path, it's not justified. Don't handle IOPOLL separately, always grab uring_lock for overflow flushing. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/162947df299aa12693ac4b305dacedab32ec7976.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: open code io_cqring_overflow_flush() | Pavel Begunkov | 1 | -8/+3
There is only one caller of io_cqring_overflow_flush(), open code it Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a1fecd56d9dba923ed8d4d159727fa939d3baa2a.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: remove extra SQPOLL overflow flush | Pavel Begunkov | 1 | -2/+0
c1edbf5f081be ("io_uring: flag SQPOLL busy condition to userspace") added an extra overflowed CQE flush in the SQPOLL submission path due to backpressure, which was later removed. Remove the flush and let io_cqring_wait() / iopoll handle it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/2a83b0724ca6ca9d16c7d79a51b77c81876b2e39.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: unexport io_req_cqe_overflow() | Pavel Begunkov | 2 | -2/+1
There are no users of io_req_cqe_overflow() apart from io_uring.c, make it static. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: separate header for exported net bits | Pavel Begunkov | 1 | -0/+1
We're exporting some io_uring bits to networking, e.g. for implementing a net callback for io_uring cmds, but we don't want to expose more than needed. Add a separate header for networking. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: David Wei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/net: set MSG_ZEROCOPY for sendzc in advance | Pavel Begunkov | 1 | -3/+3
We can set MSG_ZEROCOPY at the preparation step, do it so we don't have to care about it later in the issue callback. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/c2c22aaa577624977f045979a6db2b9fb2e5648c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/net: get rid of io_notif_complete_tw_ext | Pavel Begunkov | 3 | -20/+14
io_notif_complete_tw_ext() can be removed and combined with io_notif_complete_tw to make it simpler without sacrificing anything. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/025a124a5e20e2474a57e2f04f16c422eb83063c.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/net: merge ubuf sendzc callbacks | Pavel Begunkov | 1 | -18/+8
Splitting io_tx_ubuf_callback_ext from io_tx_ubuf_callback is a premature optimisation that doesn't give us much. Merge the functions into one and reclaim some simplicity. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/d44d68f6f7add33a0dcf0b7fd7b73c2dc543604f.1712534031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: return void from io_put_kbuf_comp() | Ming Lei | 2 | -7/+3
The only caller doesn't handle the return value of io_put_kbuf_comp(), so change its return type into void. Also follow Jens's suggestion to rename it as io_put_kbuf_drop(). Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: remove io_req_put_rsrc_locked() | Pavel Begunkov | 2 | -9/+2
io_req_put_rsrc_locked() is a weird shim function around io_req_put_rsrc(). All calls to io_req_put_rsrc() require holding ->uring_lock, so we can just use it directly. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a195bc78ac3d2c6fbaea72976e982fe51e50ecdd.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: remove async request cache | Pavel Begunkov | 1 | -22/+0
io_req_complete_post() was the sole user of ->locked_free_list, but since we just gutted the function, the cache is not used anymore and can be removed. ->locked_free_list served as an asynchronous counterpart of the main request (i.e. struct io_kiocb) cache for all unlocked cases like io-wq. Now they're all forced to be completed into the main cache directly, off of the normal completion path or via io_free_req(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/7bffccd213e370abd4de480e739d8b08ab6c1326.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: turn implicit assumptions into a warning | Pavel Begunkov | 1 | -1/+11
io_req_complete_post() is now io-wq only and shouldn't be used outside of it, i.e. it relies on io-wq holding a ref for the request as explained in a comment below. Let's add a warning to enforce the assumption and make sure nobody tries to do anything weird. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1013b60c35d431d0698cafbc53c06f5917348c20.1712331455.git.asml.silence@gmail.com Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: kill dead code in io_req_complete_post | Ming Lei | 2 | -35/+9
Since commit 8f6c829491fe ("io_uring: remove struct io_tw_state::locked"), io_req_complete_post() is only called from io-wq submit work, where the request reference is guaranteed to be grabbed and won't drop to zero in io_req_complete_post(). Kill the dead code, meantime add req_ref_put() to put the reference. Cc: Pavel Begunkov <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Signed-by: Pavel Begunkov <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/1d8297e2046553153e763a52574f0e0f4d512f86.1712331455.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/kbuf: remove dead define | Jens Axboe | 1 | -2/+0
We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: fix warnings on shadow variables | Jens Axboe | 3 | -7/+4
There are a few of those:

io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow]
  170 | struct file *f = io_file_from_index(&ctx->file_table, i);
      | ^
io_uring/fdinfo.c:53:67: note: previous declaration is here
   53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      | ^
io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow]
  187 | struct io_uring_task *tctx = node->task->io_uring;
      | ^
io_uring/cancel.c:166:31: note: previous declaration is here
  166 | struct io_uring_task *tctx,
      | ^
io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow]
  371 | struct io_uring_task *tctx = node->task->io_uring;
      | ^
io_uring/register.c:312:24: note: previous declaration is here
  312 | struct io_uring_task *tctx = NULL;
      | ^

and a simple cleanup gets rid of them. For the fdinfo case, make a distinction between the file being passed in (for the ring), and the registered files we iterate. For the other two cases, just get rid of shadowed variable, there's no reason to have a new one. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: move mapping/allocation helpers to a separate file | Jens Axboe | 7 | -335/+367
Move the related code from io_uring.c into memmap.c. No functional changes in this patch, just cleaning it up a bit now that the full transition is done. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: use unpin_user_pages() where appropriate | Jens Axboe | 2 | -6/+3
There are a few cases of open-rolled loops around unpin_user_page(), use the generic helper instead. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring | Jens Axboe | 4 | -154/+47
Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. This requires a bit of effort on the mmap lookup side, as the ctx uring_lock isn't held, which otherwise protects buffer_lists from being torn down, and it's not safe to grab from mmap context as that would introduce an ABBA deadlock between the mmap lock and the ctx uring_lock. Instead, lookup the buffer_list under RCU, as the list is RCU freed already. Use the existing reference count to determine whether it's possible to safely grab a reference to it (eg if it's not zero already), and drop that reference when done with the mapping. If the mmap reference is the last one, the buffer_list and the associated memory can go away, since the vma insertion has references to the inserted pages at that point. Signed-off-by: Jens Axboe <[email protected]>
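The "grab a reference only if the count is not already zero" step can be sketched with C11 atomics. This models only the reference counting, not RCU or the mmap path; `toy_buf_list`, `toy_get_unless_zero()` and `toy_put()` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object whose lifetime is governed by a reference count. */
struct toy_buf_list {
    atomic_int refs;
};

/* Take a reference only if the object is still live (count not yet zero). */
static bool toy_get_unless_zero(struct toy_buf_list *bl)
{
    int old = atomic_load(&bl->refs);

    while (old != 0) {
        /* On failure `old` is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&bl->refs, &old, old + 1))
            return true;
    }
    return false; /* already on its way to being freed: must not be used */
}

/* Drop a reference; returns true if this was the last one. */
static bool toy_put(struct toy_buf_list *bl)
{
    return atomic_fetch_sub(&bl->refs, 1) == 1;
}

/* Returns 0 if a live object can be pinned and a dead one cannot. */
int demo_get_unless_zero(void)
{
    struct toy_buf_list bl;

    atomic_init(&bl.refs, 1);
    if (!toy_get_unless_zero(&bl)) /* lookup side pins the live object */
        return -1;
    if (toy_put(&bl))              /* owner drops its reference: not the last */
        return -1;
    if (!toy_put(&bl))             /* lookup side drops the last reference */
        return -1;
    return toy_get_unless_zero(&bl) ? -1 : 0;
}
```

A plain increment would be unsafe here, because it could resurrect an object whose last reference has already been dropped.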
2024-04-15 | io_uring/kbuf: vmap pinned buffer ring | Jens Axboe | 1 | -24/+15
This avoids needing to care about HIGHMEM, and it makes the buffer indexing easier as both ring provided buffer methods are now virtually mapped in a contiguous fashion. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: unify io_pin_pages() | Jens Axboe | 2 | -55/+42
Move it into io_uring.c where it belongs, and use it in there as well rather than have two implementations of this. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: use vmap() for ring mapping | Jens Axboe | 1 | -29/+9
This is the last holdout which does odd page checking, convert it to vmap just like what is done for the non-mmap path. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: get rid of remap_pfn_range() for mapping rings/sqes | Jens Axboe | 2 | -8/+133
Rather than use remap_pfn_range() for this and manually free later, switch to using vm_insert_pages() and have it Just Work. If possible, allocate a single compound page that covers the range that is needed. If that works, then we can just use page_address() on that page. If we fail to get a compound page, allocate single pages and use vmap() to map them into the kernel virtual address space. This just covers the rings/sqes, the other remaining mmap user of remap_pfn_range() will be converted separately. Once that is done, we can kill the old alloc/free code. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: use the right type for work_llist empty check | Jens Axboe | 1 | -1/+1
io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's not an io_wq_work_list, it's a struct llist_head. They both have ->first as head-of-list, and it turns out the checks are identical. But be proper and use the right helper. Fixes: dac6a0eae793 ("io_uring: ensure iopoll runs local task work as well") Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: Remove the now superfluous sentinel elements from ctl_table array | Joel Granados | 1 | -1/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/) Remove sentinel element from kernel_io_uring_disabled_table Signed-off-by: Joel Granados <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: Remove unused function | Jiapeng Chong | 1 | -6/+0
The function is defined in the io_uring.c file, but not called elsewhere, so delete the unused function. io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'. Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660 Signed-off-by: Jiapeng Chong <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: re-arrange Makefile order | Jens Axboe | 1 | -7/+7
The object list is a bit of a mess, with core and opcode files mixed in. Re-arrange it so that we have the core bits first, and then opcode specific files after that. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring: refill request cache in memory order | Jens Axboe | 1 | -3/+3
The allocator will generally return memory in order, but __io_alloc_req_refill() then adds them to a stack and we'll extract them in the opposite order. This obviously isn't a huge deal, but: 1) it makes debugging easier when they are in order 2) keeping them in-order is the right thing to do 3) reduces the code for adding them to the stack Just add them in reverse to the stack. Signed-off-by: Jens Axboe <[email protected]>
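Why pushing in reverse yields in-order extraction can be seen with a minimal LIFO in userspace C. This is a sketch only: `toy_req` and the stack helpers are hypothetical, and a local array stands in for a batch returned by the allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request with an intrusive stack link. */
struct toy_req {
    struct toy_req *next;
    int id;
};

/* LIFO push onto the free stack. */
static void toy_stack_push(struct toy_req **top, struct toy_req *r)
{
    r->next = *top;
    *top = r;
}

/* LIFO pop; returns NULL when the stack is empty. */
static struct toy_req *toy_stack_pop(struct toy_req **top)
{
    struct toy_req *r = *top;

    if (r)
        *top = r->next;
    return r;
}

/* Push the batch last-to-first so that pops hand entries out first-to-last. */
static void toy_refill_in_order(struct toy_req **top, struct toy_req *batch, int n)
{
    for (int i = n - 1; i >= 0; i--)
        toy_stack_push(top, &batch[i]);
}

/* Returns 0 if the requests come back out in ascending memory order. */
int demo_refill_order(void)
{
    struct toy_req batch[4];
    struct toy_req *top = NULL;

    for (int i = 0; i < 4; i++)
        batch[i].id = i;
    toy_refill_in_order(&top, batch, 4);

    for (int i = 0; i < 4; i++) {
        if (toy_stack_pop(&top) != &batch[i])
            return -1;
    }
    return toy_stack_pop(&top) == NULL ? 0 : -1;
}
```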
2024-04-15 | io_uring/poll: shrink alloc cache size to 32 | Jens Axboe | 2 | -1/+3
This should be plenty, rather than the default of 128, and matches what we have on the rsrc and futex side as well. Signed-off-by: Jens Axboe <[email protected]>
2024-04-15 | io_uring/alloc_cache: switch to array based caching | Jens Axboe | 14 | -144/+92
Currently lists are being used to manage this, but best practice is usually to have these in an array instead as that is cheaper to manage. Outside of that detail, games are also played with KASAN as the list is inside the cached entry itself. Finally, all users of this need a struct io_cache_entry embedded in their struct, which is union'ized with something else in there that isn't used across the free -> realloc cycle. Get rid of all of that, and simply have it be an array. This will not change the memory used, as we're just trading an 8-byte member entry for the per-elem array size. This reduces the overhead of the recycled allocations, and it reduces the amount of code needed to support recycling to about half of what it currently is. Signed-off-by: Jens Axboe <[email protected]>
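An array-backed recycling cache of the kind described here can be sketched in userspace C as follows. The names (`toy_cache`, `toy_cache_get()`, `toy_cache_put()`) are hypothetical and `malloc()`/`free()` stand in for the kernel allocator; the point is only that cached objects need no embedded list entry.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cache: a bounded array of pointers to recycled objects. */
struct toy_cache {
    void **entries;
    unsigned nr_cached;
    unsigned max_cached;
    size_t elem_size;
};

/* Allocate the pointer array; returns 0 on success, -1 on allocation failure. */
static int toy_cache_init(struct toy_cache *c, unsigned max, size_t size)
{
    c->entries = malloc(max * sizeof(void *));
    if (!c->entries)
        return -1;
    c->nr_cached = 0;
    c->max_cached = max;
    c->elem_size = size;
    return 0;
}

/* Returns 1 if the object was cached, 0 if the cache is full and the caller must free it. */
static int toy_cache_put(struct toy_cache *c, void *obj)
{
    if (c->nr_cached < c->max_cached) {
        c->entries[c->nr_cached++] = obj;
        return 1;
    }
    return 0;
}

/* Reuse a cached object if one exists, otherwise fall back to the allocator. */
static void *toy_cache_get(struct toy_cache *c)
{
    if (c->nr_cached)
        return c->entries[--c->nr_cached];
    return malloc(c->elem_size);
}

/* Release every cached object and the pointer array itself. */
static void toy_cache_free(struct toy_cache *c)
{
    while (c->nr_cached)
        free(c->entries[--c->nr_cached]);
    free(c->entries);
    c->entries = NULL;
}

/* Returns 0 if objects are recycled and the bound is enforced. */
int demo_array_cache(void)
{
    struct toy_cache cache;
    void *a, *b, *extra1, *extra2;
    int ret = -1;

    if (toy_cache_init(&cache, 2, 64))
        return -1;

    a = toy_cache_get(&cache);      /* cache empty: comes from malloc() */
    if (!a || !toy_cache_put(&cache, a))
        goto out;
    b = toy_cache_get(&cache);      /* recycled: must be the same object */
    if (b != a || !toy_cache_put(&cache, b))
        goto out;

    extra1 = malloc(64);
    extra2 = malloc(64);
    if (!extra1 || !extra2) {
        free(extra1);
        free(extra2);
        goto out;
    }
    if (!toy_cache_put(&cache, extra1)) { /* second slot: still fits */
        free(extra1);
        free(extra2);
        goto out;
    }
    if (toy_cache_put(&cache, extra2))    /* cache full: must be rejected */
        goto out;
    free(extra2);
    ret = 0;
out:
    toy_cache_free(&cache);
    return ret;
}
```

Compared with a list threaded through the cached objects, the array keeps the bookkeeping outside the objects themselves, so nothing in a freed object needs to stay valid while it sits in the cache.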