Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

Pull io_uring IO interface from Jens Axboe: "Second attempt at adding the io_uring interface. Since the first one, we've added basic unit testing of the three system calls, that resides in liburing like the other unit tests that we have so far. It'll take a while to get full coverage of it, but we're working towards it. I've also added two basic test programs to tools/io_uring. One uses the raw interface and has support for all the various features that io_uring supports outside of standard IO, like fixed files, fixed IO buffers, and polled IO. The other uses the liburing API, and is a simplified version of cp(1). This adds support for a new IO interface, io_uring. io_uring allows an application to communicate with the kernel through two rings, the submission queue (SQ) and completion queue (CQ) ring. This allows for very efficient handling of IOs, see the v5 posting for some basic numbers: https://lore.kernel.org/linux-block/[email protected]/ Outside of just efficiency, the interface is also flexible and extendable, and allows for future use cases like the upcoming NVMe key-value store API, networked IO, and so on. It also supports async buffered IO, something that we've always failed to support in the kernel. Outside of basic IO features, it supports async polled IO as well. This particular feature has already been tested at Facebook months ago for flash storage boxes, with 25-33% improvements. It makes polled IO actually useful for real world use cases, where even basic flash sees a nice win in terms of efficiency, latency, and performance. These boxes were IOPS bound before, now they are not. This series adds three new system calls. One for setting up an io_uring instance (io_uring_setup(2)), one for submitting/completing IO (io_uring_enter(2)), and one for aux functions like registrating file sets, buffers, etc (io_uring_register(2)). Through the help of Arnd, I've coordinated the syscall numbers so merge on that front should be painless. Jon did a writeup of the interface a while back, which (except for minor details that have been tweaked) is still accurate. Find that here: https://lwn.net/Articles/776703/ Huge thanks to Al Viro for helping getting the reference cycle code correct, and to Jann Horn for his extensive reviews focused on both security and bugs in general. There's a userspace library that provides basic functionality for applications that don't need or want to care about how to fiddle with the rings directly. It has helpers to allow applications to easily set up an io_uring instance, and submit/complete IO through it without knowing about the intricacies of the rings. It also includes man pages (thanks to Jeff Moyer), and will continue to grow support helper functions and features as time progresses. Find it here: git://git.kernel.dk/liburing Fio has full support for the raw interface, both in the form of an IO engine (io_uring), but also with a small test application (t/io_uring) that can exercise and benchmark the interface" * tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block: io_uring: add a few test tools io_uring: allow workqueue item to handle multiple buffered requests io_uring: add support for IORING_OP_POLL io_uring: add io_kiocb ref count io_uring: add submission polling io_uring: add file set registration net: split out functions related to registering inflight socket files io_uring: add support for pre-mapped user IO buffers block: implement bio helper to add iter bvec pages to bio io_uring: batch io_kiocb allocation io_uring: use fget/fput_many() for file references fs: add fget_many() and fput_many() io_uring: support for IO polling io_uring: add fsync support Add io_uring IO interface
author: Linus Torvalds <[email protected]> 2019-03-08 14:48:40 -0800
committer: Linus Torvalds <[email protected]> 2019-03-08 14:48:40 -0800
commit: 38e7571c07be01f9f19b355a9306a4e3d5cb0f5b (patch)
tree: 48812ba46a6fe37ee59d31e0de418f336bbb15ca /tools/io_uring/queue.c
parent: 80201fe175cbf7f3e372f53eba0a881a702ad926 (diff)
parent: 21b4aa5d20fd07207e73270cadffed5c63fb4343 (diff)
1 files changed, 164 insertions, 0 deletions
diff --git a/tools/io_uring/queue.c b/tools/io_uring/queue.c
new file mode 100644
index 000000000000..88505e873ad9
--- /dev/null
+++ b/tools/io_uring/queue.c
@@ -0,0 +1,164 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+
+#include "liburing.h"
+#include "barrier.h"
+
+static int __io_uring_get_completion(struct io_uring *ring,
+				     struct io_uring_cqe **cqe_ptr, int wait)
+{
+	struct io_uring_cq *cq = &ring->cq;
+	const unsigned mask = *cq->kring_mask;
+	unsigned head;
+	int ret;
+
+	*cqe_ptr = NULL;
+	head = *cq->khead;
+	do {
+		/*
+		 * It's necessary to use a read_barrier() before reading
+		 * the CQ tail, since the kernel updates it locklessly. The
+		 * kernel has the matching store barrier for the update. The
+		 * kernel also ensures that previous stores to CQEs are ordered
+		 * with the tail update.
+		 */
+		read_barrier();
+		if (head != *cq->ktail) {
+			*cqe_ptr = &cq->cqes[head & mask];
+			break;
+		}
+		if (!wait)
+			break;
+		ret = io_uring_enter(ring->ring_fd, 0, 1,
+					IORING_ENTER_GETEVENTS, NULL);
+		if (ret < 0)
+			return -errno;
+	} while (1);
+
+	if (*cqe_ptr) {
+		*cq->khead = head + 1;
+		/*
+		 * Ensure that the kernel sees our new head, the kernel has
+		 * the matching read barrier.
+		 */
+		write_barrier();
+	}
+
+	return 0;
+}
+
+/*
+ * Return an IO completion, if one is readily available
+ */
+int io_uring_get_completion(struct io_uring *ring,
+			    struct io_uring_cqe **cqe_ptr)
+{
+	return __io_uring_get_completion(ring, cqe_ptr, 0);
+}
+
+/*
+ * Return an IO completion, waiting for it if necessary
+ */
+int io_uring_wait_completion(struct io_uring *ring,
+			     struct io_uring_cqe **cqe_ptr)
+{
+	return __io_uring_get_completion(ring, cqe_ptr, 1);
+}
+
+/*
+ * Submit sqes acquired from io_uring_get_sqe() to the kernel.
+ *
+ * Returns number of sqes submitted
+ */
+int io_uring_submit(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	const unsigned mask = *sq->kring_mask;
+	unsigned ktail, ktail_next, submitted;
+	int ret;
+
+	/*
+	 * If we have pending IO in the kring, submit it first. We need a
+	 * read barrier here to match the kernels store barrier when updating
+	 * the SQ head.
+	 */
+	read_barrier();
+	if (*sq->khead != *sq->ktail) {
+		submitted = *sq->kring_entries;
+		goto submit;
+	}
+
+	if (sq->sqe_head == sq->sqe_tail)
+		return 0;
+
+	/*
+	 * Fill in sqes that we have queued up, adding them to the kernel ring
+	 */
+	submitted = 0;
+	ktail = ktail_next = *sq->ktail;
+	while (sq->sqe_head < sq->sqe_tail) {
+		ktail_next++;
+		read_barrier();
+
+		sq->array[ktail & mask] = sq->sqe_head & mask;
+		ktail = ktail_next;
+
+		sq->sqe_head++;
+		submitted++;
+	}
+
+	if (!submitted)
+		return 0;
+
+	if (*sq->ktail != ktail) {
+		/*
+		 * First write barrier ensures that the SQE stores are updated
+		 * with the tail update. This is needed so that the kernel
+		 * will never see a tail update without the preceeding sQE
+		 * stores being done.
+		 */
+		write_barrier();
+		*sq->ktail = ktail;
+		/*
+		 * The kernel has the matching read barrier for reading the
+		 * SQ tail.
+		 */
+		write_barrier();
+	}
+
+submit:
+	ret = io_uring_enter(ring->ring_fd, submitted, 0,
+				IORING_ENTER_GETEVENTS, NULL);
+	if (ret < 0)
+		return -errno;
+
+	return 0;
+}
+
+/*
+ * Return an sqe to fill. Application must later call io_uring_submit()
+ * when it's ready to tell the kernel about it. The caller may call this
+ * function multiple times before calling io_uring_submit().
+ *
+ * Returns a vacant sqe, or NULL if we're full.
+ */
+struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+	struct io_uring_sq *sq = &ring->sq;
+	unsigned next = sq->sqe_tail + 1;
+	struct io_uring_sqe *sqe;
+
+	/*
+	 * All sqes are used
+	 */
+	if (next - sq->sqe_head > *sq->kring_entries)
+		return NULL;
+
+	sqe = &sq->sqes[sq->sqe_tail & *sq->kring_mask];
+	sq->sqe_tail = next;
+	return sqe;
+}
author	Linus Torvalds <[email protected]>	2019-03-08 14:48:40 -0800
committer	Linus Torvalds <[email protected]>	2019-03-08 14:48:40 -0800
commit	38e7571c07be01f9f19b355a9306a4e3d5cb0f5b (patch)
tree	48812ba46a6fe37ee59d31e0de418f336bbb15ca /tools/io_uring/queue.c
parent	80201fe175cbf7f3e372f53eba0a881a702ad926 (diff)
parent	21b4aa5d20fd07207e73270cadffed5c63fb4343 (diff)