diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2019-03-08 14:48:40 -0800 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2019-03-08 14:48:40 -0800 |
commit | 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b (patch) | |
tree | 48812ba46a6fe37ee59d31e0de418f336bbb15ca /tools/io_uring/queue.c | |
parent | 80201fe175cbf7f3e372f53eba0a881a702ad926 (diff) | |
parent | 21b4aa5d20fd07207e73270cadffed5c63fb4343 (diff) |
Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block
Pull io_uring IO interface from Jens Axboe:
"Second attempt at adding the io_uring interface.
Since the first one, we've added basic unit testing of the three
system calls, that resides in liburing like the other unit tests that
we have so far. It'll take a while to get full coverage of it, but
we're working towards it. I've also added two basic test programs to
tools/io_uring. One uses the raw interface and has support for all the
various features that io_uring supports outside of standard IO, like
fixed files, fixed IO buffers, and polled IO. The other uses the
liburing API, and is a simplified version of cp(1).
This adds support for a new IO interface, io_uring.
io_uring allows an application to communicate with the kernel through
two rings, the submission queue (SQ) and completion queue (CQ) ring.
This allows for very efficient handling of IOs, see the v5 posting for
some basic numbers:
https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/
Outside of just efficiency, the interface is also flexible and
extendable, and allows for future use cases like the upcoming NVMe
key-value store API, networked IO, and so on. It also supports async
buffered IO, something that we've always failed to support in the
kernel.
Outside of basic IO features, it supports async polled IO as well.
This particular feature has already been tested at Facebook months ago
for flash storage boxes, with 25-33% improvements. It makes polled IO
actually useful for real world use cases, where even basic flash sees
a nice win in terms of efficiency, latency, and performance. These
boxes were IOPS bound before, now they are not.
This series adds three new system calls. One for setting up an
io_uring instance (io_uring_setup(2)), one for submitting/completing
IO (io_uring_enter(2)), and one for aux functions like registrating
file sets, buffers, etc (io_uring_register(2)). Through the help of
Arnd, I've coordinated the syscall numbers so merge on that front
should be painless.
Jon did a writeup of the interface a while back, which (except for
minor details that have been tweaked) is still accurate. Find that
here:
https://lwn.net/Articles/776703/
Huge thanks to Al Viro for helping getting the reference cycle code
correct, and to Jann Horn for his extensive reviews focused on both
security and bugs in general.
There's a userspace library that provides basic functionality for
applications that don't need or want to care about how to fiddle with
the rings directly. It has helpers to allow applications to easily set
up an io_uring instance, and submit/complete IO through it without
knowing about the intricacies of the rings. It also includes man pages
(thanks to Jeff Moyer), and will continue to grow support helper
functions and features as time progresses. Find it here:
git://git.kernel.dk/liburing
Fio has full support for the raw interface, both in the form of an IO
engine (io_uring), but also with a small test application (t/io_uring)
that can exercise and benchmark the interface"
* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
io_uring: add a few test tools
io_uring: allow workqueue item to handle multiple buffered requests
io_uring: add support for IORING_OP_POLL
io_uring: add io_kiocb ref count
io_uring: add submission polling
io_uring: add file set registration
net: split out functions related to registering inflight socket files
io_uring: add support for pre-mapped user IO buffers
block: implement bio helper to add iter bvec pages to bio
io_uring: batch io_kiocb allocation
io_uring: use fget/fput_many() for file references
fs: add fget_many() and fput_many()
io_uring: support for IO polling
io_uring: add fsync support
Add io_uring IO interface
Diffstat (limited to 'tools/io_uring/queue.c')
-rw-r--r-- | tools/io_uring/queue.c | 164 |
1 files changed, 164 insertions, 0 deletions
diff --git a/tools/io_uring/queue.c b/tools/io_uring/queue.c new file mode 100644 index 000000000000..88505e873ad9 --- /dev/null +++ b/tools/io_uring/queue.c @@ -0,0 +1,164 @@ +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/mman.h> +#include <unistd.h> +#include <errno.h> +#include <string.h> + +#include "liburing.h" +#include "barrier.h" + +static int __io_uring_get_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr, int wait) +{ + struct io_uring_cq *cq = &ring->cq; + const unsigned mask = *cq->kring_mask; + unsigned head; + int ret; + + *cqe_ptr = NULL; + head = *cq->khead; + do { + /* + * It's necessary to use a read_barrier() before reading + * the CQ tail, since the kernel updates it locklessly. The + * kernel has the matching store barrier for the update. The + * kernel also ensures that previous stores to CQEs are ordered + * with the tail update. + */ + read_barrier(); + if (head != *cq->ktail) { + *cqe_ptr = &cq->cqes[head & mask]; + break; + } + if (!wait) + break; + ret = io_uring_enter(ring->ring_fd, 0, 1, + IORING_ENTER_GETEVENTS, NULL); + if (ret < 0) + return -errno; + } while (1); + + if (*cqe_ptr) { + *cq->khead = head + 1; + /* + * Ensure that the kernel sees our new head, the kernel has + * the matching read barrier. + */ + write_barrier(); + } + + return 0; +} + +/* + * Return an IO completion, if one is readily available + */ +int io_uring_get_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr) +{ + return __io_uring_get_completion(ring, cqe_ptr, 0); +} + +/* + * Return an IO completion, waiting for it if necessary + */ +int io_uring_wait_completion(struct io_uring *ring, + struct io_uring_cqe **cqe_ptr) +{ + return __io_uring_get_completion(ring, cqe_ptr, 1); +} + +/* + * Submit sqes acquired from io_uring_get_sqe() to the kernel. + * + * Returns number of sqes submitted + */ +int io_uring_submit(struct io_uring *ring) +{ + struct io_uring_sq *sq = &ring->sq; + const unsigned mask = *sq->kring_mask; + unsigned ktail, ktail_next, submitted; + int ret; + + /* + * If we have pending IO in the kring, submit it first. We need a + * read barrier here to match the kernels store barrier when updating + * the SQ head. + */ + read_barrier(); + if (*sq->khead != *sq->ktail) { + submitted = *sq->kring_entries; + goto submit; + } + + if (sq->sqe_head == sq->sqe_tail) + return 0; + + /* + * Fill in sqes that we have queued up, adding them to the kernel ring + */ + submitted = 0; + ktail = ktail_next = *sq->ktail; + while (sq->sqe_head < sq->sqe_tail) { + ktail_next++; + read_barrier(); + + sq->array[ktail & mask] = sq->sqe_head & mask; + ktail = ktail_next; + + sq->sqe_head++; + submitted++; + } + + if (!submitted) + return 0; + + if (*sq->ktail != ktail) { + /* + * First write barrier ensures that the SQE stores are updated + * with the tail update. This is needed so that the kernel + * will never see a tail update without the preceeding sQE + * stores being done. + */ + write_barrier(); + *sq->ktail = ktail; + /* + * The kernel has the matching read barrier for reading the + * SQ tail. + */ + write_barrier(); + } + +submit: + ret = io_uring_enter(ring->ring_fd, submitted, 0, + IORING_ENTER_GETEVENTS, NULL); + if (ret < 0) + return -errno; + + return 0; +} + +/* + * Return an sqe to fill. Application must later call io_uring_submit() + * when it's ready to tell the kernel about it. The caller may call this + * function multiple times before calling io_uring_submit(). + * + * Returns a vacant sqe, or NULL if we're full. + */ +struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring) +{ + struct io_uring_sq *sq = &ring->sq; + unsigned next = sq->sqe_tail + 1; + struct io_uring_sqe *sqe; + + /* + * All sqes are used + */ + if (next - sq->sqe_head > *sq->kring_entries) + return NULL; + + sqe = &sq->sqes[sq->sqe_tail & *sq->kring_mask]; + sq->sqe_tail = next; + return sqe; +} |