path: root/include/uapi/linux/io_uring.h
Commit history (newest first). Each entry lists the date, commit subject, author, and the files/lines changed in this header.
2021-10-19  io_uring: add flag to not fail link after timeout  (Pavel Begunkov, 1 file, -0/+1)
For some reason a non-offset IORING_OP_TIMEOUT always fails links; it's pretty inconvenient and unnecessarily limits chaining after it to hard linking, which is far from ideal, e.g. doesn't pair well with timeout cancellation. Add a flag forcing it to not fail links on -ETIME. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/17c7ec0fb7a6113cc6be8cdaedcada0ba836ac0e.1633199723.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
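For context, a minimal liburing sketch of the behaviour this enables, assuming the flag landed as IORING_TIMEOUT_ETIME_SUCCESS (the name in the released header) and an already initialized ring:

    /* Sketch: a timeout whose -ETIME no longer breaks the link chain. */
    struct __kernel_timespec ts = { .tv_sec = 1 };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_ETIME_SUCCESS);
    sqe->flags |= IOSQE_IO_LINK;     /* a soft link now suffices */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_nop(sqe);          /* still runs when the timeout expires */
    io_uring_submit(&ring);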
2021-09-13  io-wq: provide IO_WQ_* constants for IORING_REGISTER_IOWQ_MAX_WORKERS arg items  (Eugene Syromiatnikov, 1 file, -1/+7)
The items passed in the array pointed to by the arg parameter of the IORING_REGISTER_IOWQ_MAX_WORKERS io_uring_register operation carry certain semantics: they refer to different io-wq worker categories. Provide IO_WQ_* constants in the UAPI so these categories can be referenced in user space code. Suggested-by: Jens Axboe <[email protected]> Complements: 2e480058ddc21ec5 ("io-wq: provide a way to limit max number of workers") Signed-off-by: Eugene Syromiatnikov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
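For illustration, the constants serve as indices into the two-element argument array, here via the raw register syscall (the ring variable and header includes are assumed):

    /* Sketch: cap bounded and unbounded io-wq workers.
       Needs <unistd.h>, <sys/syscall.h>, <linux/io_uring.h>. */
    unsigned int vals[2];
    vals[IO_WQ_BOUND]   = 8;   /* workers for bounded work (file/block I/O) */
    vals[IO_WQ_UNBOUND] = 4;   /* workers for unbounded work (e.g. sockets) */
    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_REGISTER_IOWQ_MAX_WORKERS, vals, 2);
    /* on success, vals[] is overwritten with the previous limits */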
2021-08-30  Merge tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block  (Linus Torvalds, 1 file, -0/+4)
Pull io_uring mkdirat/symlinkat/linkat support from Jens Axboe: "This adds io_uring support for mkdirat, symlinkat, and linkat"
* tag 'for-5.15/io_uring-vfs-2021-08-30' of git://git.kernel.dk/linux-block:
  io_uring: add support for IORING_OP_LINKAT
  io_uring: add support for IORING_OP_SYMLINKAT
  io_uring: add support for IORING_OP_MKDIRAT
  namei: update do_*() helpers to return ints
  namei: make do_linkat() take struct filename
  namei: add getname_uflags()
  namei: make do_symlinkat() take struct filename
  namei: make do_mknodat() take struct filename
  namei: make do_mkdirat() take struct filename
  namei: change filename_parentat() calling conventions
  namei: ignore ERR/NULL names in putname()
2021-08-29  io_uring: allow updating linked timeouts  (Pavel Begunkov, 1 file, -5/+6)
We already allow updating normal timeouts; add support for adjusting the timings of linked timeouts as well. Reported-by: Victor Stewart <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-29  io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts  (Jens Axboe, 1 file, -0/+3)
Certain use cases want to use CLOCK_BOOTTIME or CLOCK_REALTIME rather than the default CLOCK_MONOTONIC. Add IORING_TIMEOUT_BOOTTIME and IORING_TIMEOUT_REALTIME flags that allow timeouts and linked timeouts to use the selected clock source. Only one clock source may be selected, and we -EINVAL the request if more than one is given. If neither BOOTTIME nor REALTIME is selected, the previous default of MONOTONIC is used. Link: https://github.com/axboe/liburing/issues/369 Signed-off-by: Jens Axboe <[email protected]>
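A liburing sketch of selecting a clock source (initialized ring assumed):

    /* Sketch: a 5s timeout measured on CLOCK_BOOTTIME. */
    struct __kernel_timespec ts = { .tv_sec = 5 };
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_timeout(sqe, &ts, 0, IORING_TIMEOUT_BOOTTIME);
    /* IORING_TIMEOUT_REALTIME would select CLOCK_REALTIME; setting both
       flags gets -EINVAL, setting neither keeps CLOCK_MONOTONIC */
    io_uring_submit(&ring);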
2021-08-29  io-wq: provide a way to limit max number of workers  (Jens Axboe, 1 file, -0/+3)
io-wq divides work into two categories:
1) Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring.
2) Work that may never complete, we call this unbounded work. The number of workers here is only limited by RLIMIT_NPROC.
For various use cases, it's handy to have the kernel limit the maximum number of pending workers for both categories. Provide a way to do so with a new IORING_REGISTER_IOWQ_MAX_WORKERS operation. IORING_REGISTER_IOWQ_MAX_WORKERS takes an array of two integers and sets the max worker count to what is being passed in for each category. The old values are returned into that same array. If 0 is passed in for either category, it simply returns the current value. The value is capped at RLIMIT_NPROC. This actually isn't that important, as it's more of a hint; if we're exceeding the value, then our attempt to fork a new worker will fail. This happens naturally already if more than one node is in the system, as these values are per-node internally for io-wq. Reported-by: Johannes Lundberg <[email protected]> Link: https://github.com/axboe/liburing/issues/420 Signed-off-by: Jens Axboe <[email protected]>
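Because zero means "return the current value", the same call doubles as a query; a sketch under the same assumptions as the snippet above:

    /* Sketch: read the current worker caps without changing them. */
    unsigned int vals[2] = { 0, 0 };
    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_REGISTER_IOWQ_MAX_WORKERS, vals, 2);
    /* vals[IO_WQ_BOUND] / vals[IO_WQ_UNBOUND] now hold the current limits */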
2021-08-25  io_uring: openat directly into fixed fd table  (Pavel Begunkov, 1 file, -1/+4)
Instead of opening a file into a process's file table as usual and then registering the fd within io_uring, some users may want to skip the first step and place it directly into io_uring's fixed file table. This patch adds such a capability for IORING_OP_OPENAT and IORING_OP_OPENAT2. The behaviour is controlled by setting sqe->file_index, where 0 implies the old behaviour using normal file tables. If a non-zero value is specified, then it will behave as described and place the file into the fixed file slot sqe->file_index - 1. The file table must already be created, and the slot must be valid and empty, otherwise the operation will fail. Keep the error codes consistent with IORING_OP_FILES_UPDATE: ENXIO and EINVAL on inappropriate fixed tables, and return EBADF on collision with an already registered file. Note: IOSQE_FIXED_FILE can't be used to switch between modes, because accept takes a file, and it already uses the flag with a different meaning. Suggested-by: Josh Triplett <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Link: https://lore.kernel.org/r/e9b33d1163286f51ea707f87d95bd596dada1e65.1629888991.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
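A sketch of the flow as this patch describes it, writing sqe->file_index by hand (liburing later added io_uring_prep_openat_direct() for the same thing; the sparse-table registration shown is an assumption):

    /* Sketch: open a file directly into fixed-file slot 3. */
    int table[8];
    memset(table, -1, sizeof(table));        /* sparse table, all slots empty */
    io_uring_register_files(&ring, table, 8);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat(sqe, AT_FDCWD, "data.txt", O_RDONLY, 0);
    sqe->file_index = 3 + 1;   /* slot + 1; 0 would mean a normal open */
    io_uring_submit(&ring);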
2021-08-23  io_uring: add support for IORING_OP_LINKAT  (Dmitry Kadashev, 1 file, -0/+2)
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: Linus Torvalds <[email protected]> Suggested-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <[email protected]>
2021-08-23  io_uring: add support for IORING_OP_SYMLINKAT  (Dmitry Kadashev, 1 file, -0/+1)
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <[email protected]> Suggested-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/ Signed-off-by: Dmitry Kadashev <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <[email protected]>
2021-08-23  io_uring: add support for IORING_OP_MKDIRAT  (Dmitry Kadashev, 1 file, -0/+1)
IORING_OP_MKDIRAT behaves like mkdirat(2) and takes the same flags and arguments. Acked-by: Linus Torvalds <[email protected]> Signed-off-by: Dmitry Kadashev <[email protected]> Acked-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: add splice_fd_in check] Signed-off-by: Jens Axboe <[email protected]>
2021-06-30  io_uring: simplify struct io_uring_sqe layout  (Pavel Begunkov, 1 file, -14/+10)
Flatten struct io_uring_sqe, the last union is exactly 64B, so move them out of union { struct { ... }}, and decrease __pad2 size. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/2e21ef7aed136293d654450bc3088973a8adc730.1624543113.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-06-17  io_uring: allow user configurable IO thread CPU affinity  (Jens Axboe, 1 file, -0/+4)
io-wq defaults to per-node masks for IO workers. This works fine by default, but isn't particularly handy for workloads that prefer more specific affinities, for either performance or isolation reasons. This adds IORING_REGISTER_IOWQ_AFF that allows the user to pass in a CPU mask that is then applied to IO thread workers, and an IORING_UNREGISTER_IOWQ_AFF that simply resets the masks back to the default of per-node. Note that no care is given to existing IO threads, they will need to go through a reschedule before the affinity is correct if they are already running or sleeping. Signed-off-by: Jens Axboe <[email protected]>
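A raw-syscall sketch (needs _GNU_SOURCE and <sched.h> for cpu_set_t; the final argument carries the mask size in bytes, per this description):

    /* Sketch: pin io-wq workers to CPUs 0 and 1, then restore the default. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    CPU_SET(1, &mask);
    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_REGISTER_IOWQ_AFF, &mask, sizeof(mask));
    /* later: back to the per-node default */
    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_UNREGISTER_IOWQ_AFF, NULL, 0);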
2021-06-10  io_uring: add feature flag for rsrc tags  (Pavel Begunkov, 1 file, -0/+1)
Add IORING_FEAT_RSRC_TAGS, indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]], that support rsrc tagging, and also indicating that dynamic fixed buffer updates are implemented. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9b995d4045b6c6b4ab7510ca124fd25ac2203af7.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-06-10  io_uring: change registration/upd/rsrc tagging ABI  (Pavel Begunkov, 1 file, -9/+9)
There are aspects of the recently added rsrc registration/update and tagging ABI that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hides different types of resources under it, which breaks fine-grained control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so it's better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource; however, that's not a good idea because there might be several ways of registering a single resource type, e.g. we may want to add non-contiguous buffers or something more exotic such as dma-mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] from the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9b554897a7c17ad6e3becc48dfed2f7af9f423d5.1623339162.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-04-25  io_uring: add full-fledged dynamic buffers support  (Pavel Begunkov, 1 file, -0/+1)
Hook buffers into all rsrc infrastructure, including tagging and updates. Suggested-by: Bijan Mottahedeh <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/119ed51d68a491dae87eb55fb467a47870c86aad.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-04-25  io_uring: add generic rsrc update with tags  (Pavel Begunkov, 1 file, -6/+16)
Add IORING_REGISTER_RSRC_UPDATE, which also supports passing in rsrc tags. Implement it for registered files. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/d4dc66df204212f64835ffca2c4eb5e8363f2f05.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-04-25  io_uring: add IORING_REGISTER_RSRC  (Pavel Begunkov, 1 file, -0/+8)
Add a new io_uring_register() opcode for rsrc registration. Instead of accepting a pointer to resources, fds or iovecs, the @arg parameter now points to a struct io_uring_rsrc_register, and the second argument tells how large that struct is, to make it easily extendible by adding new fields. All that is done mainly to be able to pass in a pointer with tags. Pass it in and enable CQE posting for file resources. Doesn't support setting tags on update yet. A design choice made here is to not post CQEs on rsrc de-registration, but only when a resource is removed through a dynamic rsrc update. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/c498aaec32a4bb277b2406b9069662c02cdda98c.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
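A sketch of a tagged file registration through this interface. The struct layout is written from this description, and the opcode is shown as IORING_REGISTER_FILES2, the name it took after the ABI-change commit above; both are assumptions to be checked against the installed header:

    /* Sketch: register two files, tagging them 1 and 2 (0 = untagged).
       fd_a/fd_b are hypothetical open descriptors. */
    int files[2] = { fd_a, fd_b };
    __u64 tags[2] = { 1, 2 };
    struct io_uring_rsrc_register reg = {
        .nr   = 2,
        .data = (__u64)(uintptr_t)files,
        .tags = (__u64)(uintptr_t)tags,
    };
    syscall(SYS_io_uring_register, ring.ring_fd, IORING_REGISTER_FILES2,
            &reg, sizeof(reg));
    /* removing a tagged file later posts a CQE whose user_data is the tag */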
2021-04-25  io_uring: enumerate dynamic resources  (Pavel Begunkov, 1 file, -0/+4)
As resources are getting more support and common parts, it'll be more convenient to enumerate resource types and use them for indexing. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/f0be63e9310212d5601d36277c2946ff7a040485.1619356238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-04-11  io_uring: allow events and user_data update of running poll requests  (Jens Axboe, 1 file, -0/+5)
This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are masked into sqe->len. If set, the POLL_ADD will have the following behavior:
- sqe->addr must contain the user_data of the poll request that needs to be modified. This field is otherwise invalid for a POLL_ADD command.
- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the new mask for the existing poll request. There are no checks for whether these are identical or not; if a matching poll request is found, then it is re-armed with the new mask.
- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new user_data for the existing poll request.
A POLL_ADD with any of these flags set may complete with any of the following results:
1) 0, which means that we successfully found the existing poll request specified, and performed the re-arm procedure. Any error from that re-arm will be exposed as a completion event for that original poll request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given user_data.
3) -EALREADY, if the existing poll request was already in the process of being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request (eg not one originally issued as IORING_OP_POLL_ADD).
The usual -EINVAL cases apply as well, if any invalid fields are set in the sqe for this command type. Signed-off-by: Jens Axboe <[email protected]>
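A raw-SQE sketch following the fields exactly as described here; note this is the interface as merged in this commit (later kernels moved poll updates over to IORING_OP_POLL_REMOVE, so treat the opcode choice as era-specific):

    /* Sketch: re-target an armed poll request: new mask and new user_data. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode      = IORING_OP_POLL_ADD;
    sqe->len         = IORING_POLL_UPDATE_EVENTS | IORING_POLL_UPDATE_USER_DATA;
    sqe->addr        = 0xcafe;       /* user_data of the poll to modify */
    sqe->poll_events = POLLIN;       /* replacement event mask */
    sqe->off         = 0xbeef;       /* replacement user_data */
    io_uring_submit(&ring);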
2021-04-11  io_uring: add multishot mode for IORING_OP_POLL_ADD  (Jens Axboe, 1 file, -0/+12)
The default io_uring poll mode is one-shot, where once the event triggers, the poll command is completed and won't trigger any further events. If we're doing repeated polling on the same file or socket, then it can be more efficient to do multishot, where we keep triggering whenever the event becomes true. This deviates from the usual norm of having one CQE per SQE submitted. Add a CQE flag, IORING_CQE_F_MORE, which tells the application to expect further completion events from the submitted SQE. Right now the only user of this is POLL_ADD in multishot mode. Since sqe->poll_events is using the space that we normally use for adding flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An application should expect more CQEs for the specified SQE if the CQE is flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an error will terminate the poll request, in which case the flag will be cleared. Signed-off-by: Jens Axboe <[email protected]>
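A liburing sketch of the reaping loop; sockfd and handle_ready() are hypothetical:

    /* Sketch: multishot poll; one SQE, many CQEs while F_MORE stays set. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, sockfd, POLLIN);
    sqe->len |= IORING_POLL_ADD_MULTI;        /* flag lives in sqe->len */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    for (;;) {
        io_uring_wait_cqe(&ring, &cqe);
        handle_ready(sockfd);                 /* hypothetical readiness handler */
        unsigned more = cqe->flags & IORING_CQE_F_MORE;
        io_uring_cqe_seen(&ring, cqe);
        if (!more)
            break;    /* poll was canceled or errored; re-arm if needed */
    }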
2021-02-23  io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS  (Jens Axboe, 1 file, -0/+1)
A few reasons to do this:
- The naming of the manager and worker has changed. That's a user visible change, so it makes sense to flag it.
- Opening certain files that use ->signal (like /proc/self or /dev/tty) now works, and the flag tells the application upfront that this is the case.
- Related to the above, using signalfd will now work as well.
Signed-off-by: Jens Axboe <[email protected]>
2021-02-01  io_uring: Add skip option for __io_sqe_files_update  (noah, 1 file, -0/+3)
This patch adds support for skipping a file descriptor when using IORING_REGISTER_FILES_UPDATE. __io_sqe_files_update will skip fds set to IORING_REGISTER_FILES_SKIP. IORING_REGISTER_FILES_SKIP is in turn added as a #define in io_uring.h. Signed-off-by: noah <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
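A liburing sketch; new_fd0 and new_fd2 are hypothetical descriptors:

    /* Sketch: update registered slots 0..2, leaving slot 1 untouched. */
    int fds[3] = { new_fd0, IORING_REGISTER_FILES_SKIP, new_fd2 };
    io_uring_register_files_update(&ring, 0 /* offset */, fds, 3);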
2021-02-01  io_uring: rename file related variables to rsrc  (Bijan Mottahedeh, 1 file, -0/+7)
This is a prep rename patch for subsequent patches to generalize file registration. [io_uring_rsrc_update:: rename fds -> data] Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Bijan Mottahedeh <[email protected]> [leave io_uring_files_update as struct] Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-12-09  io_uring: add timeout update  (Pavel Begunkov, 1 file, -0/+1)
Support timeout updates through IORING_OP_TIMEOUT_REMOVE with the passed-in IORING_TIMEOUT_UPDATE flag. Updates don't support the offset timeout mode; the original timeout.off will be ignored as well. Signed-off-by: Pavel Begunkov <[email protected]> [axboe: remove now unused 'ret' variable] Signed-off-by: Jens Axboe <[email protected]>
2020-12-09  io_uring: add timeout support for io_uring_enter()  (Hao Xu, 1 file, -0/+9)
Now users who want to get woken when waiting for events should submit a timeout command first. That is not safe for applications that split SQ and CQ handling between two threads, such as mysql; users should synchronize the two threads explicitly to protect the SQ, and that will impact performance. This patch adds support for a timeout to the existing io_uring_enter(). To avoid overloading arguments, it introduces a new parameter structure which contains a sigmask and a timeout. I have tested the workloads with one thread submitting nop requests while the other reaps the cqes with a timeout. It shows 1.8~2x speedup when the iodepth is 16. Signed-off-by: Jiufei Xue <[email protected]> Signed-off-by: Hao Xu <[email protected]> [axboe: various cleanups/fixes, and name change to SIG_IS_DATA] Signed-off-by: Jens Axboe <[email protected]>
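In the released UAPI this surfaced as the IORING_ENTER_EXT_ARG flag plus struct io_uring_getevents_arg (liburing's io_uring_wait_cqe_timeout() wraps the same mechanism); a raw sketch under that assumption:

    /* Sketch: wait for one CQE, but give up after 10ms, in one enter call. */
    struct __kernel_timespec ts = { .tv_nsec = 10 * 1000 * 1000 };
    struct io_uring_getevents_arg arg = {
        .sigmask    = 0,               /* no signal mask swap */
        .sigmask_sz = _NSIG / 8,
        .ts         = (__u64)(uintptr_t)&ts,
    };
    syscall(SYS_io_uring_enter, ring.ring_fd, 0 /* to_submit */,
            1 /* min_complete */, IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG,
            &arg, sizeof(arg));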
2020-12-09  io_uring: add support for IORING_OP_UNLINKAT  (Jens Axboe, 1 file, -0/+2)
IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags and arguments. Signed-off-by: Jens Axboe <[email protected]>
2020-12-09  io_uring: add support for IORING_OP_RENAMEAT  (Jens Axboe, 1 file, -0/+2)
IORING_OP_RENAMEAT behaves like renameat2(), and takes the same flags etc. Signed-off-by: Jens Axboe <[email protected]>
2020-12-09  io_uring: allow non-fixed files with SQPOLL  (Jens Axboe, 1 file, -0/+1)
The restriction of needing fixed files for SQPOLL is problematic, and prevents/inhibits several valid use cases. With the referenced files_struct that we have now, it's trivially supportable. Treat ->files like we do the mm for the SQPOLL thread - grab a reference to it (and assign it), and drop it when we're done. This feature is exposed as IORING_FEAT_SQPOLL_NONFIXED. Signed-off-by: Jens Axboe <[email protected]>
2020-11-23  io_uring: add support for shutdown(2)  (Jens Axboe, 1 file, -0/+1)
This adds support for the shutdown(2) system call, which is useful for dealing with sockets. shutdown(2) may block, so we have to punt it to async context. Suggested-by: Norman Maurer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-09-30  io_uring: provide IORING_ENTER_SQ_WAIT for SQPOLL SQ ring waits  (Jens Axboe, 1 file, -0/+1)
When using SQPOLL, applications can run into the issue of running out of SQ ring entries because the thread hasn't consumed them yet. The only option for dealing with that is checking later, or busy checking for the condition. Provide IORING_ENTER_SQ_WAIT if applications want to wait on this condition. Signed-off-by: Jens Axboe <[email protected]>
2020-09-30  io_uring: allow disabling rings during the creation  (Stefano Garzarella, 1 file, -0/+2)
This patch adds a new IORING_SETUP_R_DISABLED flag to start the rings disabled, allowing the user to register restrictions, buffers, and files before starting to process SQEs. When IORING_SETUP_R_DISABLED is set, SQEs are not processed and the SQPOLL kthread is not started. Restriction registration is allowed only while the rings are disabled, to prevent concurrency issues while processing SQEs. The rings can be enabled using the IORING_REGISTER_ENABLE_RINGS opcode with io_uring_register(2). Suggested-by: Jens Axboe <[email protected]> Signed-off-by: Stefano Garzarella <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
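A liburing sketch of the disabled-start flow; the raw ENABLE_RINGS call is used since early liburing had no helper for it (io_uring_enable_rings() came later):

    /* Sketch: start disabled, configure, then open the floodgates. */
    struct io_uring_params p = { .flags = IORING_SETUP_R_DISABLED };
    struct io_uring ring;
    io_uring_queue_init_params(64, &ring, &p);

    /* ... register restrictions, files, buffers here ... */

    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_REGISTER_ENABLE_RINGS, NULL, 0);  /* SQEs processed from now on */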
2020-09-30  io_uring: add IOURING_REGISTER_RESTRICTIONS opcode  (Stefano Garzarella, 1 file, -0/+31)
The new io_uring_register(2) IORING_REGISTER_RESTRICTIONS opcode permanently installs a feature allowlist on an io_ring_ctx. The io_ring_ctx can then be passed to untrusted code with the knowledge that only operations present in the allowlist can be executed. The allowlist approach ensures that new features added to io_uring do not accidentally become available when an existing application is launched on a newer kernel version. Currently it is possible to restrict sqe opcodes, sqe flags, and register opcodes. IORING_REGISTER_RESTRICTIONS can only be made once; afterwards it is not possible to change restrictions anymore. This prevents untrusted code from removing restrictions. Suggested-by: Stefan Hajnoczi <[email protected]> Signed-off-by: Stefano Garzarella <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
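A sketch of an allowlist that only admits read and write SQEs; struct and constant names are taken from the merged UAPI and should be treated as assumptions. This must run while the ring is still IORING_SETUP_R_DISABLED (see the entry above):

    /* Sketch: restrict the ring to IORING_OP_READ and IORING_OP_WRITE. */
    struct io_uring_restriction res[2] = {
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_READ  },
        { .opcode = IORING_RESTRICTION_SQE_OP, .sqe_op = IORING_OP_WRITE },
    };
    syscall(SYS_io_uring_register, ring.ring_fd,
            IORING_REGISTER_RESTRICTIONS, res, 2);
    /* a second call returns an error; the allowlist is now immutable */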
2020-09-30  io_uring: use an enumeration for io_uring_register(2) opcodes  (Stefano Garzarella, 1 file, -11/+16)
The enumeration allows us to keep track of the last io_uring_register(2) opcode available. Behaviour and opcodes names don't change. Signed-off-by: Stefano Garzarella <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-07-24  Merge branch 'io_uring-5.8' into for-5.9/io_uring  (Jens Axboe, 1 file, -0/+1)
Merge in io_uring-5.8 fixes, as changes/cleanups to how we do locked mem accounting require a fixup, and only one of the spots is noticed by git as the other merges cleanly. The flags fix from io_uring-5.8 also causes a merge conflict, as do the leak fix for recvmsg, the double poll fix, and the link failure locking fix.
* io_uring-5.8:
  io_uring: fix lockup in io_fail_links()
  io_uring: fix ->work corruption with poll_add
  io_uring: missed req_init_async() for IOSQE_ASYNC
  io_uring: always allow drain/link/hardlink/async sqe flags
  io_uring: ensure double poll additions work with both request types
  io_uring: fix recvmsg memory leak with buffer selection
  io_uring: fix not initialised work->flags
  io_uring: fix missing msg_name assignment
  io_uring: account user memory freed when exit has been queued
  io_uring: fix memleak in io_sqe_files_register()
  io_uring: fix memleak in __io_sqe_files_update()
  io_uring: export cq overflow status to userspace
Signed-off-by: Jens Axboe <[email protected]>
2020-07-08  io_uring: export cq overflow status to userspace  (Xiaoguang Wang, 1 file, -0/+1)
Applications that are not willing to use io_uring_enter() to reap and handle cqes may rely completely on liburing's io_uring_peek_cqe(); but if the cq ring has overflowed, io_uring_peek_cqe() is currently not aware of this overflow and won't enter the kernel to flush cqes. The test program below can reveal this bug:

    static void test_cq_overflow(struct io_uring *ring)
    {
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        int issued = 0;
        int ret = 0;

        do {
            sqe = io_uring_get_sqe(ring);
            if (!sqe) {
                fprintf(stderr, "get sqe failed\n");
                break;
            }
            ret = io_uring_submit(ring);
            if (ret <= 0) {
                if (ret != -EBUSY)
                    fprintf(stderr, "sqe submit failed: %d\n", ret);
                break;
            }
            issued++;
        } while (ret > 0);
        assert(ret == -EBUSY);

        printf("issued requests: %d\n", issued);

        while (issued) {
            ret = io_uring_peek_cqe(ring, &cqe);
            if (ret) {
                if (ret != -EAGAIN) {
                    fprintf(stderr, "peek completion failed: %s\n",
                            strerror(ret));
                    break;
                }
                printf("left requests: %d\n", issued);
                continue;
            }
            io_uring_cqe_seen(ring, cqe);
            issued--;
            printf("left requests: %d\n", issued);
        }
    }

    int main(int argc, char *argv[])
    {
        int ret;
        struct io_uring ring;

        ret = io_uring_queue_init(16, &ring, 0);
        if (ret) {
            fprintf(stderr, "ring setup failed: %d\n", ret);
            return 1;
        }

        test_cq_overflow(&ring);
        return 0;
    }

To fix this issue, export the cq overflow status to userspace by adding a new IORING_SQ_CQ_OVERFLOW flag; then helper functions in liburing, such as io_uring_peek_cqe(), can be aware of this cq overflow and do the flush accordingly. Signed-off-by: Xiaoguang Wang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-21  io_uring: change the poll type to be 32-bits  (Jiufei Xue, 1 file, -1/+3)
Poll events should be 32 bits to cover EPOLLEXCLUSIVE. Explicitly word-swap the poll32_events for big endian to make sure the ABI is not changed. We call this feature IORING_FEAT_POLL_32BITS; applications that want to use EPOLLEXCLUSIVE should check the feature bit first. Signed-off-by: Jiufei Xue <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
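A liburing sketch of checking the feature bit before relying on the high poll bits; sockfd is hypothetical and EPOLLEXCLUSIVE comes from <sys/epoll.h>:

    /* Sketch: only use EPOLLEXCLUSIVE when the kernel advertises 32-bit polls. */
    struct io_uring_params p = { 0 };
    struct io_uring ring;
    io_uring_queue_init_params(8, &ring, &p);
    if (p.features & IORING_FEAT_POLL_32BITS) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_poll_add(sqe, sockfd, EPOLLIN | EPOLLEXCLUSIVE);
        io_uring_submit(&ring);
    }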
2020-05-17  io_uring: add tee(2) support  (Pavel Begunkov, 1 file, -0/+1)
Add IORING_OP_TEE implementing tee(2) support. Almost identical to splice bits, but without offsets. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-05-15  io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags  (Stefano Garzarella, 1 file, -0/+7)
This new flag should be set/cleared by the application to disable/enable eventfd notifications when a request is completed and queued to the CQ ring. Before this patch, notifications were always sent if an eventfd was registered, so IORING_CQ_EVENTFD_DISABLED is not set during initialization. It is up to the application to set the flag after initialization if no notifications are required at the beginning. Signed-off-by: Stefano Garzarella <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
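liburing later wrapped this flag; a sketch assuming its io_uring_cq_eventfd_toggle() helper:

    /* Sketch: silence eventfd notifications for a batch, then re-enable. */
    io_uring_cq_eventfd_toggle(&ring, false);  /* sets IORING_CQ_EVENTFD_DISABLED */
    /* ... submit and reap a burst without eventfd wakeups ... */
    io_uring_cq_eventfd_toggle(&ring, true);   /* clears it again */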
2020-05-15  io_uring: add 'cq_flags' field for the CQ ring  (Stefano Garzarella, 1 file, -1/+3)
This patch adds the new 'cq_flags' field, which should be written by the application and read by the kernel. This new field is available to the userspace application through 'cq_off.flags'. We are using 4 bytes previously reserved and set to zero. This means that if the application finds this field to be zero, then the new functionality is not supported. In the next patch we will introduce the first flag available. Signed-off-by: Stefano Garzarella <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-03-21  io_uring: make spdxcheck.py happy  (Lukas Bulwahn, 1 file, -1/+1)
Commit bbbdeb4720a0 ("io_uring: dual license io_uring.h uapi header") uses a nested SPDX-License-Identifier to dual license the header. Since then, ./scripts/spdxcheck.py complains: include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR Add parentheses to make spdxcheck.py happy. Signed-off-by: Lukas Bulwahn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-03-11  io_uring: dual license io_uring.h uapi header  (Jens Axboe, 1 file, -1/+1)
This just syncs the header with the liburing version, so there's no confusion on the license of the header parts. Signed-off-by: Jens Axboe <[email protected]>
2020-03-10  io_uring: provide means of removing buffers  (Jens Axboe, 1 file, -0/+1)
We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is to trigger IO on them. The usual case of shrinking a buffer pool would be to just not replenish the buffers when IO completes, and instead just free it. But it may be nice to have a way to manually remove a number of buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that functionality. Signed-off-by: Jens Axboe <[email protected]>
2020-03-10  io_uring: support buffer selection for OP_READ and OP_RECV  (Jens Axboe, 1 file, -0/+14)
If a server process has tons of pending socket connections, generally it uses epoll to wait for activity. When the socket is ready for reading (or writing), the task can select a buffer and issue a recv/send on the given fd. Now that we have fast (non-async thread) support, a task can have tons of reads or writes pending. But that means they need buffers to back that data, and if the number of connections is high enough, having them preallocated for all possible connections is infeasible. With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and a given group ID in sqe->buf_group. When the fd becomes ready, a free buffer from the specified group is selected. If none are available, the request is terminated with -ENOBUFS. If successful, the CQE on completion will contain the buffer ID chosen in the cqe->flags member, encoded as: (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER; Once a buffer has been consumed by a request, it is no longer available and must be registered again with IORING_OP_PROVIDE_BUFFERS. Requests need to support this feature. For now, IORING_OP_READ and IORING_OP_RECV support it. This is checked on SQE submission; a CQE with res == -EOPNOTSUPP will be posted if attempted on unsupported requests. Signed-off-by: Jens Axboe <[email protected]>
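A liburing sketch of the consume side; sockfd and the group ID are hypothetical, and the buffers must have been provided beforehand (see the next entry):

    /* Sketch: recv with a kernel-selected buffer from group 7. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recv(sqe, sockfd, NULL, 4096, 0);   /* no buffer passed in */
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = 7;
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
        unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        /* data landed in group-7 buffer 'bid'; re-provide it when done */
    }
    io_uring_cqe_seen(&ring, cqe);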
2020-03-10  io_uring: add IORING_OP_PROVIDE_BUFFERS  (Jens Axboe, 1 file, -2/+8)
IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to support passing in an addr/len that is associated with a buffer ID and buffer group ID. The group ID is used to index and look up the buffers, while the buffer ID can be used to notify the application which buffer in the group was used. The addr passed in is the starting buffer address, and length is each buffer's length. The number of buffers to add can be specified, in which case addr is incremented by length for each addition, and each buffer increments the buffer ID specified. No validation is done of the buffer ID. If the application provides buffers within the same group with identical buffer IDs, then it'll have a hard time telling which buffer ID was used. The only restriction is that the buffer ID can be a max of 16 bits in size, so USHRT_MAX is the maximum ID that can be used. Signed-off-by: Jens Axboe <[email protected]>
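A liburing sketch of stocking a group, matching the addr/len/ID arithmetic described above:

    /* Sketch: provide 8 contiguous 4KiB buffers as group 7, IDs 0..7. */
    enum { NBUFS = 8, BUF_LEN = 4096 };
    static char pool[NBUFS * BUF_LEN];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_provide_buffers(sqe, pool, BUF_LEN, NBUFS,
                                  7 /* group */, 0 /* first buffer ID */);
    io_uring_submit(&ring);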
2020-03-02  io_uring: use poll driven retry for files that support it  (Jens Axboe, 1 file, -0/+1)
Currently io_uring tries any request in a non-blocking manner, if it can, and then retries from a worker thread if we get -EAGAIN. Now that we have a new and fancy poll based retry backend, use that to retry requests if the file supports it. This means that, for example, an IORING_OP_RECVMSG on a socket no longer requires an async thread to complete the IO. If we get -EAGAIN reading from the socket in a non-blocking manner, we arm a poll handler for notification on when the socket becomes readable. When it does, the pending read is executed directly by the task again, through the io_uring task work handlers. Not only is this faster and more efficient, it also means we're not generating potentially tons of async threads that just sit and block, waiting for the IO to complete. The feature is marked with IORING_FEAT_FAST_POLL, meaning that async pollable IO is fast, and that poll<link>other_op is fast as well. Signed-off-by: Jens Axboe <[email protected]>
2020-03-02  io_uring: add splice(2) support  (Pavel Begunkov, 1 file, -1/+13)
Add support for splice(2).
- the output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file is handled by generic code as well
- len is 32-bit, but should be fine
- the fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags)
Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
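A liburing sketch; pipe_rd and sockfd are hypothetical fds:

    /* Sketch: splice up to 4KiB from a pipe into a socket, no offsets. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_splice(sqe, pipe_rd, -1, sockfd, -1, 4096, 0);
    /* for a registered input file: pass its fixed-file index as fd_in and
       set SPLICE_F_FD_IN_FIXED in the splice flags */
    io_uring_submit(&ring);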
2020-01-29  io_uring: add support for epoll_ctl(2)  (Jens Axboe, 1 file, -0/+1)
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <[email protected]>
2020-01-28  io_uring: support using a registered personality for commands  (Jens Axboe, 1 file, -1/+6)
For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
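In the released ABI the IOSQE flag was dropped and a nonzero sqe->personality alone selects the credentials; a sketch assuming liburing's io_uring_register_personality():

    /* Sketch: capture current creds, then issue one request under them. */
    int id = io_uring_register_personality(&ring);   /* id > 0 on success */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);
    sqe->personality = id;     /* 0 would mean the submitter's own creds */
    io_uring_submit(&ring);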
2020-01-28  io_uring: allow registering credentials  (Jens Axboe, 1 file, -0/+2)
If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-01-28  io_uring: add io-wq workqueue sharing  (Pavel Begunkov, 1 file, -1/+3)
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd, whose io-wq will be shared with the newly created io_uring instance. If the flag is set but it can't share the io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private but want to share the async backend, to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <[email protected]> Reported-by: Daurnimator <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
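A liburing sketch; main_ring is a hypothetical, already initialized ring:

    /* Sketch: a sibling ring with private SQ/CQ but a shared io-wq backend. */
    struct io_uring_params p = {
        .flags = IORING_SETUP_ATTACH_WQ,
        .wq_fd = main_ring.ring_fd,
    };
    struct io_uring sibling;
    io_uring_queue_init_params(64, &sibling, &p);  /* fails if wq can't be shared */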