Age  Commit message  Author  Files/Lines changed
2022-07-24  io_uring: cache struct io_notif  Pavel Begunkov  4 files, -7/+65

kmalloc'ing struct io_notif is too expensive when done frequently, so cache them like many other resources in io_uring. Keep two lists: the first one, protected by ->uring_lock, is where we get notifiers from; the second, protected by ->completion_lock, is where released notifiers are queued. We then splice one list into the other when needed.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/9dec18f7fcbab9f4bd40b96e5ae158b119945230.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
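As a rough illustration of the two-list scheme described above, here is a small standalone userspace C sketch; the names (notif_cache, cache_get, cache_put) and the plain pthread mutexes standing in for ->uring_lock and ->completion_lock are made up for this example and are not the kernel code.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct notif {
	struct notif *next;
};

struct notif_cache {
	pthread_mutex_t uring_lock;      /* protects the allocation-side list */
	struct notif *alloc_list;

	pthread_mutex_t completion_lock; /* protects the release-side list */
	struct notif *free_list;
};

/* Release path: queue a notifier on the completion-side list. */
static void cache_put(struct notif_cache *c, struct notif *n)
{
	pthread_mutex_lock(&c->completion_lock);
	n->next = c->free_list;
	c->free_list = n;
	pthread_mutex_unlock(&c->completion_lock);
}

/* Allocation path: take from the alloc-side list, refilling it by splicing
 * the whole release-side list over when it runs dry, then fall back to malloc(). */
static struct notif *cache_get(struct notif_cache *c)
{
	struct notif *n;

	pthread_mutex_lock(&c->uring_lock);
	if (!c->alloc_list) {
		pthread_mutex_lock(&c->completion_lock);
		c->alloc_list = c->free_list;   /* splice in one step */
		c->free_list = NULL;
		pthread_mutex_unlock(&c->completion_lock);
	}
	n = c->alloc_list;
	if (n)
		c->alloc_list = n->next;
	pthread_mutex_unlock(&c->uring_lock);

	return n ? n : malloc(sizeof(*n));
}

int main(void)
{
	struct notif_cache c = {
		.uring_lock = PTHREAD_MUTEX_INITIALIZER,
		.completion_lock = PTHREAD_MUTEX_INITIALIZER,
	};
	struct notif *n = cache_get(&c);

	cache_put(&c, n);
	printf("reused: %s\n", cache_get(&c) == n ? "yes" : "no");
	return 0;
}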
2022-07-24  io_uring: add zc notification infrastructure  Pavel Begunkov  6 files, -4/+179

Add the internal part of send zerocopy notifications. There are two main structures. The first is struct io_notif, which embeds a struct ubuf_info and maps 1:1 to it. io_uring will bind a number of zerocopy send requests to it and ask to complete (aka flush) it. When it is flushed and all attached requests and skbs complete, it generates one and only one CQE. These are intended to be passed into the network layer as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to register an array of slots and subsequently address them by index in the array. Slots are independent of each other. Each slot can have only one notifier at a time (called the active notifier) but many notifiers over its lifetime. While active, a notifier will not post any completion, but userspace can attach requests to it by specifying the corresponding slot while issuing send zc requests. Eventually, userspace will want to "flush" the notifier, losing any way to attach new requests to it; it can then use the next automatically added notifier of this slot or of any other slot. When the network layer is done with all enqueued skbs attached to a notifier and no longer needs the user data they reference, the flushed notifier will post a CQE.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/3ecf54c31a85762bf679b0a432c9f43ecf7e61cc.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
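A compact data-structure sketch of the relationships described above; every name and field below is an illustrative stand-in derived from this commit text (the real struct io_notif and slot definitions live in the patch itself and differ in detail).

struct ubuf_info_stub {                 /* stands in for struct ubuf_info */
	void (*complete)(void *arg);    /* invoked when the net layer drops its references */
	int refs;
};

struct io_notif_sketch {
	struct ubuf_info_stub uarg;     /* handed to the net stack via msghdr::msg_ubuf */
	unsigned long long cq_user_data;/* identifies the single CQE posted on flush */
	int flushed;                    /* once set, no new requests may attach */
};

struct io_notif_slot_sketch {
	struct io_notif_sketch *active; /* at most one active notifier per slot */
	unsigned int seq;               /* bumped each time a fresh notifier replaces it */
};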
2022-07-24  io_uring: export io_put_task()  Pavel Begunkov  4 files, -36/+36

Make io_put_task() available to non-core parts of io_uring; we'll need it for the notification infrastructure.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/3686807d4c03b72e389947b0e8692d4d44334ef0.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: initialise msghdr::msg_ubuf  Pavel Begunkov  1 file, -0/+2

Initialise newly added ->msg_ubuf in io_recv() and io_send().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/b8f9f263875a4a36e7f26cc5d55ebe315308f57d.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  Merge branch 'for-5.20/io_uring' into for-5.20/io_uring-zerocopy-send  Jens Axboe  633 files, -16785/+21978

* for-5.20/io_uring: (716 commits)
  io_uring: ensure REQ_F_ISREG is set async offload
  net: fix compat pointer in get_compat_msghdr()
  io_uring: Don't require reinitable percpu_ref
  io_uring: fix types in io_recvmsg_multishot_overflow
  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
  io_uring: support multishot in recvmsg
  net: copy from user before calling __get_compat_msghdr
  net: copy from user before calling __copy_msghdr
  io_uring: support 0 length iov in buffer select in compat
  io_uring: fix multishot ending when not polled
  io_uring: add netmsg cache
  io_uring: impose max limit on apoll cache
  io_uring: add abstraction around apoll cache
  io_uring: move apoll cache to poll.c
  io_uring: consolidate hash_locked io-wq handling
  io_uring: clear REQ_F_HASH_LOCKED on hash removal
  io_uring: don't race double poll setting REQ_F_ASYNC_DATA
  io_uring: don't miss setting REQ_F_DOUBLE_POLL
  io_uring: disable multishot recvmsg
  io_uring: only trace one of complete or overflow
  ...

Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux into for-5.20/io_uring-zerocopy-send  Jens Axboe  9 files, -64/+193

Merge the prep net series for io_uring tx zc from Jakub's tree.

* 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux:
  net: fix uninitialised msghdr->sg_from_iter
  tcp: support externally provided ubufs
  ipv6/udp: support externally provided ubufs
  ipv4/udp: support externally provided ubufs
  net: introduce __skb_fill_page_desc_noacc
  net: introduce managed frags infrastructure
  net: Allow custom iter handler in msghdr
  skbuff: carry external ubuf_info in msghdr
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: don't mix ubuf_info from different sources
  ipv6: avoid partial copy for zc
  ipv4: avoid partial copy for zc
2022-07-24  mm: honor FGP_NOWAIT for page cache page allocation  Jens Axboe  1 file, -0/+4

If we're creating a page cache page with FGP_CREAT but FGP_NOWAIT is set, we should dial back the gfp flags to avoid frivolous blocking, which is trivial to hit in low memory conditions:

[ 10.117661] __schedule+0x8c/0x550
[ 10.118305] schedule+0x58/0xa0
[ 10.118897] schedule_timeout+0x30/0xdc
[ 10.119610] __wait_for_common+0x88/0x114
[ 10.120348] wait_for_completion+0x1c/0x24
[ 10.121103] __flush_work.isra.0+0x16c/0x19c
[ 10.121896] flush_work+0xc/0x14
[ 10.122496] __drain_all_pages+0x144/0x218
[ 10.123267] drain_all_pages+0x10/0x18
[ 10.123941] __alloc_pages+0x464/0x9e4
[ 10.124633] __folio_alloc+0x18/0x3c
[ 10.125294] __filemap_get_folio+0x17c/0x204
[ 10.126084] iomap_write_begin+0xf8/0x428
[ 10.126829] iomap_file_buffered_write+0x144/0x24c
[ 10.127710] xfs_file_buffered_write+0xe8/0x248
[ 10.128553] xfs_file_write_iter+0xa8/0x120
[ 10.129324] io_write+0x16c/0x38c
[ 10.129940] io_issue_sqe+0x70/0x1cc
[ 10.130617] io_queue_sqe+0x18/0xfc
[ 10.131277] io_submit_sqes+0x5d4/0x600
[ 10.131946] __arm64_sys_io_uring_enter+0x224/0x600
[ 10.132752] invoke_syscall.constprop.0+0x70/0xc0
[ 10.133616] do_el0_svc+0xd0/0x118
[ 10.134238] el0_svc+0x78/0xa0

Clear the IO, FS, and reclaim flags, mark the allocation as GFP_NOWAIT, and add __GFP_NOWARN to avoid polluting dmesg with pointless allocation failures. A caller with FGP_NOWAIT must be expected to handle the resulting -EAGAIN return and retry from a suitable context without NOWAIT set.

Reviewed-by: Shakeel Butt <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
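A hedged sketch of the gfp adjustment described in the message (not the literal hunk); the GFP/FGP constants are the real kernel flags, but the surrounding lines are illustrative.

	struct folio *folio;

	if (fgp_flags & FGP_NOWAIT) {
		/* no direct reclaim, no IO or FS recursion, no allocation-failure warnings */
		gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS);
		gfp |= GFP_NOWAIT | __GFP_NOWARN;
	}
	folio = filemap_alloc_folio(gfp, 0);
	if (!folio)
		return NULL;	/* FGP_NOWAIT callers see this as -EAGAIN and retry later */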
2022-07-24  xfs: Add async buffered write support  Stefan Roesch  2 files, -7/+9

This adds async buffered write support to XFS. For async buffered write requests, the request will return -EAGAIN if the ilock cannot be obtained immediately.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  xfs: Specify lockmode when calling xfs_ilock_for_iomap()  Stefan Roesch  1 file, -3/+3

This patch changes the helper function xfs_ilock_for_iomap such that the lock mode must be passed in.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: Add tracepoint for short writes  Stefan Roesch  2 files, -0/+28

This adds the io_uring_short_write tracepoint to io_uring. A short write is reported if not all pages that are required for a write are in the page cache and the async buffered write has to return -EAGAIN.

Signed-off-by: Stefan Roesch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: fix issue with io_write() not always undoing sb_start_write()  Jens Axboe  1 file, -1/+8

This is actually an older issue, but we never used to hit the -EAGAIN path before having done sb_start_write(). Make sure that we always call kiocb_end_write() if we need to retry the write, so that we keep the calls to sb_start_write() etc balanced.

Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: Add support for async buffered writes  Stefan Roesch  1 file, -5/+24

This enables async buffered writes for the filesystems that support async buffered writes in io-uring. Buffered writes are enabled for blocks that are already in the page cache or can be acquired with noio.

Signed-off-by: Stefan Roesch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: adapt to 5.20 branch]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  fs: Add async write file modification handling.  Stefan Roesch  2 files, -3/+43

This adds a file_modified_async() function that returns -EAGAIN if the request either requires removing privileges or needs to update the file modification time. This is required for async buffered writes, so the request gets handled in the io worker of io-uring.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  fs: Split off inode_needs_update_time and __file_update_time  Stefan Roesch  1 file, -26/+50

This splits off the functions inode_needs_update_time() and __file_update_time() from the function file_update_time(). This is required to support async buffered writes. No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  fs: add __remove_file_privs() with flags parameter  Stefan Roesch  1 file, -20/+37

This adds the function __remove_file_privs, which allows the caller to pass the kiocb flags parameter. No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  fs: add a FMODE_BUF_WASYNC flags for f_mode  Stefan Roesch  2 files, -1/+6

This introduces the flag FMODE_BUF_WASYNC. If devices support async buffered writes, this flag can be set. It also modifies the check in generic_write_checks to take async buffered writes into consideration.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christian Brauner (Microsoft) <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  iomap: Return -EAGAIN from iomap_write_iter()  Stefan Roesch  1 file, -0/+4

If iomap_write_iter() encounters -EAGAIN, return -EAGAIN to the caller.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Christoph Hellwig <[email protected]>
[axboe: make the suggested ternary edit]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  iomap: Add async buffered write support  Stefan Roesch  1 file, -5/+28

This adds async buffered write support to iomap. It replaces the call to balance_dirty_pages_ratelimited() with a call to balance_dirty_pages_ratelimited_flags(), which allows specifying whether the write request is async or not.

In addition, this moves the above function call to the beginning of the function. If the call were at the end of the function and the decision were made to throttle writes, there would be no request that io-uring could wait on. By moving it to the beginning of the function, the write request is not issued and -EAGAIN is returned instead; io-uring will punt the request and process it in the io-worker. Moving the call to the beginning of the function means the write throttling will happen one page later.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  iomap: Add flags parameter to iomap_page_create()  Stefan Roesch  1 file, -10/+20

Add the kiocb flags parameter to the function iomap_page_create(). Depending on the value of the flags parameter, it enables different gfp flags. No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  mm: Add balance_dirty_pages_ratelimited_flags() function  Jan Kara  2 files, -10/+48

This adds the helper function balance_dirty_pages_ratelimited_flags(). It adds a flags parameter to balance_dirty_pages_ratelimited(), which is passed on to balance_dirty_pages(). For async buffered writes the flag value will be BDP_ASYNC.

If balance_dirty_pages() gets called for an async buffered write, we don't want to wait. Instead we need to indicate to the caller that throttling is needed so that it can stop writing and offload the rest of the write to a context that can block. The new helper function is also used by balance_dirty_pages_ratelimited().

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: fix kerneltest bot 'ret' issue]
Signed-off-by: Jens Axboe <[email protected]>
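An illustrative caller pattern based on the description above, assuming an iomap-style buffered write path where iocb and mapping come from the surrounding code; the actual call site in the series may differ in detail.

	unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
	int ret;

	ret = balance_dirty_pages_ratelimited_flags(mapping, bdp_flags);
	if (ret) {
		/*
		 * Throttling is needed but we must not sleep: hand back
		 * -EAGAIN so the write can be retried from a context that
		 * is allowed to block (e.g. the io-wq worker).
		 */
		return ret;
	}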
2022-07-24  mm: Move updates of dirty_exceeded into one place  Jan Kara  1 file, -5/+2

The transition of wb->dirty_exceeded from 0 to 1 happens before we go to sleep in balance_dirty_pages(), while the transition from 1 to 0 happens when exiting from balance_dirty_pages(), possibly based on old values. This does not make a lot of sense, since wb->dirty_exceeded should simply reflect whether wb is over the dirty limit and so we should ratelimit entering balance_dirty_pages() less. Move the two updates together.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  mm: Move starting of background writeback into the main balancing loop  Jan Kara  1 file, -17/+14

We start background writeback if we are over the background threshold after exiting the main loop in balance_dirty_pages(). This may result in basing the decision on already stale values (we may have slept for a significant amount of time), and it is also inconvenient for the refactoring needed for async dirty throttling. Move the check into the main waiting loop.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Stefan Roesch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: ensure REQ_F_ISREG is set async offload  Jens Axboe  3 files, -5/+8

If we're offloading requests directly to io-wq because IOSQE_ASYNC was set in the sqe, we can miss hashing writes appropriately because we haven't set REQ_F_ISREG yet. This can cause a performance regression with buffered writes, as io-wq then no longer correctly serializes writes to that file.

Ensure that we set the flags in io_prep_async_work(), which will cause the io-wq work item to be hashed appropriately.

Fixes: 584b0180f0f4 ("io_uring: move read/write file prep state into actual opcode handler")
Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/
Reported-by: kernel test robot <[email protected]>
Tested-by: Yin Fengwei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  net: fix compat pointer in get_compat_msghdr()  Jens Axboe  2 files, -2/+2

A previous change enabled external users to copy the data before calling __get_compat_msghdr(), but didn't modify get_compat_msghdr() or __io_compat_recvmsg_copy_hdr() to take that into account. They are both still passing in the __user pointer rather than the copied version. Ensure we pass in the kernel struct, not the pointer to the user data.

Link: https://lore.kernel.org/all/[email protected]/
Fixes: 1a3e4e94a1b9 ("net: copy from user before calling __get_compat_msghdr")
Reported-by: Marek Szyprowski <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: Don't require reinitable percpu_ref  Michal Koutný  1 file, -1/+1

The commit 8bb649ee1da3 ("io_uring: remove ring quiesce for io_uring_register") removed the workflow relying on reinit/resurrection of the percpu_ref, hence requesting that behaviour at initialization is a relic. This is based on code review; it causes no real bug (and theoretically can't).

Technically it's a revert of commit 214828962dea ("io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT"), but since the flag omission is now justified, I'm not making this a revert.

Fixes: 8bb649ee1da3 ("io_uring: remove ring quiesce for io_uring_register")
Signed-off-by: Michal Koutný <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: fix types in io_recvmsg_multishot_overflow  Dylan Yudaken  1 file, -5/+5

io_recvmsg_multishot_overflow had incorrect types on non-x64 systems. It also had an unnecessary INT_MAX check, which can be avoided by changing the type of the accumulator to int (also simplifying the casts).

Reported-by: Stephen Rothwell <[email protected]>
Fixes: a8b38c4ce724 ("io_uring: support multishot in recvmsg")
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: Use atomic_long_try_cmpxchg in __io_account_mem  Uros Bizjak  1 file, -4/+3

Use atomic_long_try_cmpxchg instead of the atomic_long_cmpxchg(*ptr, old, new) == old pattern in __io_account_mem. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after cmpxchg (and the related move instruction in front of cmpxchg). Also, atomic_long_try_cmpxchg implicitly assigns the old *ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended.

Signed-off-by: Uros Bizjak <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
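A standalone C11 illustration of the pattern, using atomic_compare_exchange_weak() in place of the kernel's atomic_long_try_cmpxchg(); the accounting limit and function names are made up for this sketch and are not the __io_account_mem() code.

#include <stdatomic.h>
#include <stdio.h>

static atomic_long account_mem;
static const long limit = 1024;

static int account_pages(long nr_pages)
{
	long old = atomic_load(&account_mem);
	long new;

	do {
		new = old + nr_pages;
		if (new > limit)
			return -1;	/* over the limit, like -ENOMEM */
		/* On failure, "old" is reloaded with the current value,
		 * so no separate re-read is needed before the retry. */
	} while (!atomic_compare_exchange_weak(&account_mem, &old, new));

	return 0;
}

int main(void)
{
	printf("first:  %d\n", account_pages(512));
	printf("second: %d\n", account_pages(600)); /* exceeds the limit */
	return 0;
}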
2022-07-24  io_uring: support multishot in recvmsg  Dylan Yudaken  3 files, -19/+174

Similar to multishot recv, this will require provided buffers to be used. However recvmsg is much more complex than recv, as it has multiple outputs: specifically flags, name, and control messages.

Support this by introducing a new struct io_uring_recvmsg_out with 4 fields: namelen, controllen and flags match the similar out fields in msghdr from standard recvmsg(2), and payloadlen is the length of the payload following the header. This struct is placed at the start of the returned buffer. Based on what the user specifies in struct msghdr, the next bytes of the buffer will be the name (the next msg_namelen bytes), and then the control data (the next msg_controllen bytes). The payload comes at the end.

The return value in the CQE is the total used size of the provided buffer.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: style fixups, see link]
Signed-off-by: Jens Axboe <[email protected]>
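A userspace sketch of how a consumer might walk the returned buffer. The struct layout below is written from this description, so check include/uapi/linux/io_uring.h for the authoritative field order before relying on it; all other names here are invented for the example.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct recvmsg_out_sketch {
	uint32_t namelen;
	uint32_t controllen;
	uint32_t payloadlen;
	uint32_t flags;
};

/* Layout: header, then msg_namelen bytes reserved for the name, then
 * msg_controllen bytes reserved for control data, then the payload. */
static void parse_buffer(const void *buf, uint32_t msg_namelen,
			 uint32_t msg_controllen)
{
	const struct recvmsg_out_sketch *out = buf;
	const char *name    = (const char *)(out + 1);
	const char *control = name + msg_namelen;
	const char *payload = control + msg_controllen;

	printf("name %u/%u, control %u/%u, payload %u bytes at offset %zu\n",
	       (unsigned)out->namelen, (unsigned)msg_namelen,
	       (unsigned)out->controllen, (unsigned)msg_controllen,
	       (unsigned)out->payloadlen,
	       (size_t)(payload - (const char *)buf));
}

int main(void)
{
	/* Fake buffer standing in for a provided buffer filled by the kernel. */
	unsigned char buf[256] = { 0 };
	struct recvmsg_out_sketch hdr = { .namelen = 16, .controllen = 0,
					  .payloadlen = 42, .flags = 0 };

	memcpy(buf, &hdr, sizeof(hdr));
	parse_buffer(buf, 32, 64);
	return 0;
}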
2022-07-24  net: copy from user before calling __get_compat_msghdr  Dylan Yudaken  3 files, -33/+28

This is in preparation for multishot receive from io_uring, where it needs to have access to the original struct user_msghdr. Functionally this should be a no-op.

Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  net: copy from user before calling __copy_msghdr  Dylan Yudaken  3 files, -33/+28

This is in preparation for multishot receive from io_uring, where it needs to have access to the original struct user_msghdr. Functionally this should be a no-op.

Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: support 0 length iov in buffer select in compat  Dylan Yudaken  1 file, -9/+14

Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer select", but for the compat code path.

Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select")
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: fix multishot ending when not polled  Dylan Yudaken  1 file, -0/+2

If multishot is not actually polling then return IOU_OK rather than the result. If the result was > 0 this will confuse things further up the callstack which expect a return <= 0.

Fixes: 1300ebb20286 ("io_uring: multishot recv")
Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: add netmsg cache  Jens Axboe  4 files, -12/+73

For recvmsg/sendmsg, if they don't complete inline, we currently need to allocate a struct io_async_msghdr for each request. This is a somewhat large struct.

Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the alloc + free overhead considerably, yielding 4-5% of extra performance running netbench.

Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: impose max limit on apoll cache  Jens Axboe  3 files, -3/+17

Caches like this tend to grow to the peak size, and then never get any smaller. Impose a max limit on the size, to prevent it from growing too big.

A somewhat randomly chosen 512 is the max size we'll allow the cache to get. If a batch of frees comes in and would bring it over that, we simply start kfree'ing the surplus.

Signed-off-by: Jens Axboe <[email protected]>
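A standalone userspace sketch of the capped-cache behaviour described above; the 512 ceiling mirrors the commit text, while the list and function names are invented for this example and are not the io_uring implementation.

#include <stdlib.h>
#include <stdio.h>

#define CACHE_MAX_ENTRIES 512

struct cache_entry {
	struct cache_entry *next;
};

struct alloc_cache {
	struct cache_entry *head;
	unsigned int nr_cached;
};

static void cache_free(struct alloc_cache *c, struct cache_entry *e)
{
	if (c->nr_cached >= CACHE_MAX_ENTRIES) {
		free(e);                 /* over the cap: really free it */
		return;
	}
	e->next = c->head;               /* under the cap: keep it cached */
	c->head = e;
	c->nr_cached++;
}

static struct cache_entry *cache_alloc(struct alloc_cache *c)
{
	struct cache_entry *e = c->head;

	if (e) {
		c->head = e->next;
		c->nr_cached--;
		return e;
	}
	return malloc(sizeof(*e));       /* cache empty: allocate fresh */
}

int main(void)
{
	struct alloc_cache c = { 0 };
	struct cache_entry *e = cache_alloc(&c);

	cache_free(&c, e);
	printf("cached entries: %u\n", c.nr_cached);
	return 0;
}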
2022-07-24  io_uring: add abstraction around apoll cache  Jens Axboe  5 files, -20/+62

In preparation for adding limits, and one more user, abstract out the core bits of the allocation+free cache.

Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: move apoll cache to poll.c  Jens Axboe  3 files, -12/+14

This is where it's used, move the flush handler in there.

Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: consolidate hash_locked io-wq handling  Pavel Begunkov  1 file, -5/+7

Don't duplicate code disabling REQ_F_HASH_LOCKED for IO_URING_F_UNLOCKED (i.e. io-wq), move the handling into __io_arm_poll_handler().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/0ff0ffdfaa65b3d536131535c3dad3c63d9b7bb0.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: clear REQ_F_HASH_LOCKED on hash removal  Pavel Begunkov  1 file, -5/+2

Instead of clearing REQ_F_HASH_LOCKED while arming a poll, unset the bit when we're removing the entry from the table in io_poll_tw_hash_eject().

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/02e48bb88d6f1480c94ac2924c43ad1fbd48e92a.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: don't race double poll setting REQ_F_ASYNC_DATA  Pavel Begunkov  1 file, -3/+3

Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it under io_poll_double_prepare().

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: don't miss setting REQ_F_DOUBLE_POLL  Pavel Begunkov  1 file, -8/+10

When adding a second poll entry we should set REQ_F_DOUBLE_POLL unconditionally. We might race with the first entry removal but that doesn't change the rule.

Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs")
Reported-and-tested-by: [email protected]
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/8b680d83ded07424db83e8745585e7a6d72826ef.1657203020.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: disable multishot recvmsg  Dylan Yudaken  1 file, -19/+10

recvmsg has semantics that do not make it trivial to extend to multishot. Specifically it has user pointers and returns data in the original parameter. In order to make this API useful these will need to be somehow included with the provided buffers.

For now remove multishot for recvmsg as it is not useful.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: only trace one of complete or overflow  Dylan Yudaken  2 files, -5/+8

In overflow we see a duplicate line in the trace, and in some cases 3 lines (if the initial io_post_aux_cqe fails). Instead, just trace once for each CQE.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: fix io_uring_cqe_overflow trace format  Dylan Yudaken  1 file, -1/+1

Make the trace format consistent with io_uring_complete for cflags.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: multishot recv  Dylan Yudaken  2 files, -13/+94

Support multishot receive for io_uring. Typical server applications will run a loop where for each recv CQE they requeue another recv/recvmsg. This can be simplified by using the existing multishot functionality combined with io_uring's provided buffers.

The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will then be posted (with the IORING_CQE_F_MORE flag set) when data is available and is read. Once an error occurs or the socket ends, the multishot will be removed and a completion without IORING_CQE_F_MORE will be posted.

The benefit of this is that the recv is much more performant:
  * Subsequent receives are queued up straight away without requiring the application to finish a processing loop.
  * If there is more data in the socket (say the provided buffer size is smaller than the socket buffer) then the data is immediately returned, improving batching.
  * Poll is only armed once and reused, saving CPU cycles.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
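A sketch of the intended userspace flow, assuming liburing and headers new enough to define IORING_RECV_MULTISHOT, with provided-buffer registration omitted; per this series the multishot flag is carried in the sqe's ioprio field, and BUF_GROUP and sockfd are placeholders for this example.

#include <liburing.h>
#include <stdio.h>

#define BUF_GROUP 0   /* id of a provided-buffer group registered elsewhere */

static void recv_multishot_loop(struct io_uring *ring, int sockfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	/* One submission arms poll once and keeps posting CQEs. */
	io_uring_prep_recv(sqe, sockfd, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BUF_GROUP;
	sqe->ioprio |= IORING_RECV_MULTISHOT;
	io_uring_submit(ring);

	for (;;) {
		if (io_uring_wait_cqe(ring, &cqe))
			break;

		if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
			unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

			printf("got %d bytes in buffer %u\n", cqe->res, buf_id);
			/* process and recycle the provided buffer here */
		}

		/* No IORING_CQE_F_MORE: the multishot ended (error or EOF)
		 * and must be re-armed if more data is expected. */
		if (!(cqe->flags & IORING_CQE_F_MORE)) {
			io_uring_cqe_seen(ring, cqe);
			break;
		}
		io_uring_cqe_seen(ring, cqe);
	}
}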
2022-07-24  io_uring: fix multishot accept ordering  Dylan Yudaken  1 file, -4/+7

Similar to multishot poll, drop multishot accept when CQE overflow occurs.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: fix multishot poll on overflow  Dylan Yudaken  1 file, -2/+4

On overflow, multishot poll can still complete with the IORING_CQE_F_MORE flag set. If in the meantime the user clears a CQE and the poll was cancelled, then the poll will post a CQE without IORING_CQE_F_MORE (and likely result -ECANCELED).

However, when processing, the application will encounter the non-overflow CQE, which indicates that there will be no more events posted. Typical userspace applications would free memory associated with the poll in this case. They will then subsequently receive the earlier CQE which has overflowed, which breaks the contract given by the IORING_CQE_F_MORE flag.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: add allow_overflow to io_post_aux_cqe  Dylan Yudaken  6 files, -11/+18

Some use cases of io_post_aux_cqe would not want to overflow as-is, but might want to change the flags/result. For example, multishot receive requires in-order CQEs, and so if there is an overflow it would need to stop receiving until the overflow is taken care of.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: add IOU_STOP_MULTISHOT return code  Dylan Yudaken  2 files, -2/+16

For multishot we want a way to signal the caller that multishot has ended, but this might not be an error return. For example, sockets return 0 when closed, which should end a multishot recv but still post a CQE with result 0.

Introduce IOU_STOP_MULTISHOT, which does this and indicates that the return code is stored inside req->cqe.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: clean up io_poll_check_events return values  Dylan Yudaken  1 file, -11/+16

The values returned are a bit confusing, where 0 and 1 have implied meaning, so add some definitions for them.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2022-07-24  io_uring: recycle buffers on error  Dylan Yudaken  1 file, -2/+8

Rather than passing an error back to the user with a buffer attached, recycle the buffer immediately.

Signed-off-by: Dylan Yudaken <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>