2021-10-18  aoe: add error handling support for add_disk()  (Luis Chamberlain, 1 file, -1/+5)
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Signed-off-by: Luis Chamberlain <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
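As a rough sketch of the pattern these conversions follow (this is not the aoe code; the function name and teardown are illustrative), a driver now propagates the add_disk() return value and unwinds on failure:

#include <linux/genhd.h>

/* Hypothetical attach path illustrating the new error handling. */
static int mydrv_attach(struct gendisk *gd)
{
	int err;

	err = add_disk(gd);	/* now returns an int instead of void */
	if (err)
		goto out_put_disk;

	return 0;

out_put_disk:
	put_disk(gd);		/* stand-in for whatever teardown the driver needs */
	return err;
}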
2021-10-18  nbd: add error handling support for add_disk()  (Luis Chamberlain, 1 file, -1/+5)
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  loop: add error handling support for add_disk()  (Luis Chamberlain, 1 file, -1/+7)
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  null_blk: poll queue support  (Jens Axboe, 2 files, -4/+108)
There's currently no way to experiment with polled IO with null_blk, which seems like an oversight. This patch adds support for polled IO. We keep a list of issued IOs on submit, and then process that list when mq_ops->poll() is invoked. A new parameter is added, poll_queues. It defaults to 1 like the submit queues, meaning we'll have 1 poll queue available. Fixes-by: Bart Van Assche <[email protected]> Fixes-by: Pavel Begunkov <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nvme: wire up completion batching for the IRQ path  (Jens Axboe, 1 file, -1/+5)
Trivial to do now; we just need our own io_comp_batch on the stack and pass that in to the usual command completion handling. I pondered making this dependent on how many entries we had to process, but even for a single entry there's no discernible difference in performance or latency. Running a sync workload over io_uring:
t/io_uring -b512 -d1 -s1 -c1 -p0 -F1 -B1 -n2 /dev/nvme1n1 /dev/nvme2n1
yields the below performance before the patch:
IOPS=254820, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251174, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=250806, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
and the following after:
IOPS=255972, BW=124MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251920, BW=123MiB/s, IOS/call=1/1, inflight=(1 1)
IOPS=251794, BW=122MiB/s, IOS/call=1/1, inflight=(1 1)
which definitely isn't slower, and is about the same if you factor in a bit of variance. For peak performance workloads, benchmarking shows a 2% improvement. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  io_uring: utilize the io batching infrastructure for more efficient polled IO  (Jens Axboe, 1 file, -2/+6)
Wire up using an io_comp_batch for f_op->iopoll(). If the lower stack supports it, we can handle high rates of polled IO more efficiently. This raises the single core efficiency on my system from ~6.1M IOPS to ~6.6M IOPS running a random read workload at depth 128 on two gen2 Optane drives. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nvme: add support for batched completion of polled IO  (Jens Axboe, 3 files, -12/+51)
Take advantage of struct io_comp_batch, if passed in to the nvme poll handler. If it's set, rather than complete each request individually inline, store them in the io_comp_batch list. We only do so for requests that will complete successfully, anything else will be completed inline as before. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: add support for blk_mq_end_request_batch()  (Jens Axboe, 4 files, -19/+99)
Instead of calling blk_mq_end_request() on a single request, add a helper that takes the new struct io_comp_batch and completes any request stored in there. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
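A hedged sketch of how a completion path might use the new helper (driver-agnostic; the flush is shown inline here for brevity, while in practice whoever owns the io_comp_batch flushes it). The req_list field and the rq_list_add() helper follow the related commits in this series but are approximations, not copied code:

#include <linux/blk-mq.h>

static void mydrv_end_requests(struct request **done, unsigned int nr,
			       struct io_comp_batch *iob)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (iob)
			rq_list_add(&iob->req_list, done[i]);	 /* defer to the batch */
		else
			blk_mq_end_request(done[i], BLK_STS_OK); /* old one-by-one path */
	}

	if (iob)
		blk_mq_end_request_batch(iob);	/* one pass over the whole list */
}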
2021-10-18  sbitmap: add helper to clear a batch of tags  (Jens Axboe, 2 files, -3/+52)
sbitmap currently only supports clearing tags one-by-one; add a helper that allows the caller to pass in an array of tags to clear. Signed-off-by: Jens Axboe <[email protected]>
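A simplified sketch of the idea (not the sbitmap implementation itself): group tags that fall into the same bitmap word and clear each word with a single atomic operation instead of one atomic per tag.

#include <linux/atomic.h>
#include <linux/bitops.h>

static void tags_clear_batch(unsigned long *map, const unsigned int *tags,
			     unsigned int nr)
{
	unsigned int i = 0;

	while (i < nr) {
		unsigned int word = tags[i] / BITS_PER_LONG;
		unsigned long mask = 0;

		/* gather consecutive tags that live in the same word */
		while (i < nr && tags[i] / BITS_PER_LONG == word)
			mask |= 1UL << (tags[i++] % BITS_PER_LONG);

		/* clear the whole group with one atomic op */
		atomic_long_andnot(mask, (atomic_long_t *)&map[word]);
	}
}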
2021-10-18  block: add a struct io_comp_batch argument to fops->iopoll()  (Jens Axboe, 16 files, -25/+39)
struct io_comp_batch contains a list head and a completion handler, which will allow batches of IO completions to be handled more efficiently. For now there are no functional changes in this patch; we just define the io_comp_batch structure and add the argument to the file_operations iopoll handler. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
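The approximate shape of the new structure, as used by the later patches in this series (field names are an approximation; treat this as a sketch rather than the exact definition):

#include <linux/types.h>

struct io_comp_batch {
	struct request	*req_list;	/* singly linked list of completed requests */
	bool		need_ts;	/* whether completions still need timestamping */
	void		(*complete)(struct io_comp_batch *);	/* batch completion handler */
};

A NULL io_comp_batch simply means no batching, and callers then complete requests individually as before.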
2021-10-18  block: provide helpers for rq_list manipulation  (Jens Axboe, 2 files, -14/+34)
Instead of open-coding the list additions, traversal, and removal, provide a basic set of helpers. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
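The helpers boil down to a singly linked list threaded through the request's next pointer. A rough sketch of their shape (the in-tree macros may differ in detail):

#define rq_list_add(listptr, rq)	do {		\
	(rq)->rq_next = *(listptr);			\
	*(listptr) = rq;				\
} while (0)

#define rq_list_pop(listptr)				\
({							\
	struct request *__req = *(listptr);		\
	if (__req)					\
		*(listptr) = __req->rq_next;		\
	__req;						\
})

#define rq_list_peek(listptr)	(*(listptr))

#define rq_list_for_each(listptr, pos)			\
	for (pos = rq_list_peek(listptr); pos; pos = pos->rq_next)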
2021-10-18  block: remove some blk_mq_hw_ctx debugfs entries  (Jens Axboe, 3 files, -93/+0)
Just like the blk_mq_ctx counterparts, we've got a bunch of counters in here that are only for debugfs and are of questionable value. They are:
- dispatched, an index of how many requests were dispatched in one go
- poll_{considered,invoked,success}, which track poll success rates.
We're confident in the iopoll implementation at this point, so don't bother tracking these. As a bonus, this shrinks each hardware queue from 576 bytes to 512 bytes, dropping a whole cacheline. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: remove debugfs blk_mq_ctx dispatched/merged/completed attributes  (Jens Axboe, 4 files, -68/+1)
These were added as part of early days debugging for blk-mq, and they are not really useful anymore. Rather than spend cycles updating them, just get rid of them. As a bonus, this shrinks the per-cpu software queue size from 256b to 192b. That's a whole cacheline less. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: cache rq_flags inside blk_mq_rq_ctx_init()  (Pavel Begunkov, 1 file, -6/+8)
Add a local variable for rq_flags; it helps to compile out some of the rq_flags reloads. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: blk_mq_rq_ctx_init cache ctx/q/hctx  (Pavel Begunkov, 1 file, -5/+9)
We should have enough registers available in blk_mq_rq_ctx_init(); store ctx, q and hctx in local vars so we don't keep reloading them. Note: keeping q->elevator may look unnecessary, but it's also used inside the inlined blk_mq_tags_from_data(). Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: skip elevator fields init for non-elv queue  (Pavel Begunkov, 1 file, -14/+14)
Don't init rq->hash and rq->rb_node in blk_mq_rq_ctx_init() if there is no elevator. Also, move some other initialisers that imply barriers to the end, so the compiler is free to rearrange and optimise the rest of them. Note: this folds in a change from Jens that leaves the queue_list initialisation unconditional, as it might lead to problems otherwise. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: store elevator state in request  (Jens Axboe, 3 files, -20/+29)
Add a request-private RQF_ELV flag, which tells the block layer that this request was initialized on a queue that has an IO scheduler attached. This allows for faster checking in the fast path, rather than having to dereference rq->q later on. Elevator switching does a full quiesce of the queue before detaching an IO scheduler, so it's safe to cache this in the request itself. Signed-off-by: Jens Axboe <[email protected]>
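The fast-path check then reduces to a flag test on the request itself, for example (illustrative helper only, not from the patch):

#include <linux/blk-mq.h>

static inline bool rq_has_elevator(struct request *rq)
{
	/* set at allocation time when the queue had an IO scheduler attached */
	return (rq->rq_flags & RQF_ELV) != 0;
}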
2021-10-18  block: only mark bio as tracked if it really is tracked  (Jens Axboe, 1 file, -2/+3)
We set BIO_TRACKED unconditionally when rq_qos_throttle() is called, even though we may not even have an rq_qos handler. Only mark it as TRACKED if it really is potentially tracked. This saves considerable time for the case where the bio isn't tracked: 2.64% -1.65% [kernel.vmlinux] [k] bio_endio Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: improve layout of struct request  (Jens Axboe, 1 file, -44/+46)
It's been a while since this was analyzed; move some members around so the layout better follows the use case: initial state up top, and queued state after that. This improves my peak case by about 1.5%, from 7750K to 7900K IOPS. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: move update request helpers into blk-mq.c  (Jens Axboe, 3 files, -145/+146)
For some reason we still have them in blk-core, with the rest of the request completion being in blk-mq. That causes an out-of-line call for each completion. Move them into blk-mq.c instead, where they belong. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: remove useless caller argument to print_req_error()  (Jens Axboe, 1 file, -5/+4)
We have exactly one caller of this; just get rid of adding the useless function name to the output. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: don't bother iter advancing a fully done bio  (Jens Axboe, 2 files, -15/+24)
If we're completing nbytes and nbytes is the size of the bio, don't bother with calling into the iterator increment helpers. Just clear the bio size and we're done. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
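A sketch of that fast path, simplified from the actual completion code:

#include <linux/bio.h>

static void bio_complete_bytes(struct bio *bio, unsigned int nbytes)
{
	if (bio->bi_iter.bi_size == nbytes)
		bio->bi_iter.bi_size = 0;	/* fully done: just zero the size */
	else
		bio_advance(bio, nbytes);	/* partial completion still walks the iterator */
}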
2021-10-18  block: convert the rest of block to bdev_get_queue  (Pavel Begunkov, 8 files, -22/+22)
Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached queue pointer and so is faster. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/addf6ea988c04213697ba3684c853e4ed7642a39.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: use bdev_get_queue() in blk-core.c  (Pavel Begunkov, 1 file, -6/+7)
Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached queue pointer and so is faster. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/efc41f880262517c8dc32f932f1b23112f21b255.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: use bdev_get_queue() in bio.c  (Pavel Begunkov, 1 file, -5/+5)
Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached queue pointer and so is faster. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/85c36ea784d285a5075baa10049e6b59e15fb484.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: use bdev_get_queue() in bdev.c  (Pavel Begunkov, 1 file, -4/+4)
Convert bdev->bd_disk->queue to bdev_get_queue(), which uses a cached queue pointer and so is faster. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a352936ce5d9ac719645b1e29b173d931ebcdc02.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: cache request queue in bdev  (Pavel Begunkov, 4 files, -2/+6)
There are tons of places where we need to get a request_queue while having only a bdev, which turns into bdev->bd_disk->queue. There are probably a hundred such places considering inline helpers, and enough of them are in hot paths. Cache the queue pointer in struct block_device and make use of it in bdev_get_queue(). Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/a3bfaecdd28956f03629d0ca5c63ebc096e1c809.1634219547.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
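With the cached pointer in place, the helper becomes a single load; a sketch, assuming the new struct block_device field is called bd_queue:

#include <linux/blk_types.h>

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
	return bdev->bd_queue;	/* filled in when the block device is set up */
}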
2021-10-18  block: handle fast path of bio splitting inline  (Jens Axboe, 3 files, -21/+35)
The fast path is the case where no splitting is needed. Separate the handling into a check part that we can inline, and an out-of-line handling path for when we do need to split. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: use flags instead of bit fields for blkdev_dio  (Jens Axboe, 1 file, -14/+20)
This generates a lot better code for me, and bumps performance from 7650K IOPS to 7750K IOPS. Looking at profiles for the run and running perf diff, it confirms that we're now spending a lot less time there:
6.38% -2.80% [kernel.vmlinux] [k] blkdev_direct_IO
Taking it from the 2nd most cycle consumer to only the 9th most, at 3.35% of the CPU time. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
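A sketch of the shape of the change (names approximate): the individual bool bitfields become a single flags word, which the compiler can test and update with plain integer operations instead of read-modify-write sequences on adjacent bits.

enum {
	DIO_MULTI_BIO	 = 1 << 0,	/* more than one bio in flight */
	DIO_SHOULD_DIRTY = 1 << 1,	/* dirty pages on completion */
	DIO_IS_SYNC	 = 1 << 2,	/* synchronous request */
};

struct blkdev_dio_sketch {
	unsigned int	flags;		/* replaces the bool x:1 bitfields */
	/* ... remaining members unchanged ... */
};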
2021-10-18  block: cache bdev in struct file for raw bdev IO  (Pavel Begunkov, 1 file, -15/+12)
bdev = &BDEV_I(file->f_mapping->host)->bdev
Getting a struct block_device from a file requires 2 memory dereferences as illustrated above, which takes a toll on performance, so cache it in the otherwise unused file->private_data. That gives a noticeable peak performance improvement. Signed-off-by: Pavel Begunkov <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/8415f9fe12e544b9da89593dfbca8de2b52efe03.1634115360.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
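A hedged sketch of the idea (function names are illustrative, not the actual fs/block_dev.c code): resolve the block_device once at open time and load it from file->private_data in the hot IO path.

#include <linux/blkdev.h>
#include <linux/fs.h>

static int blkdev_open_sketch(struct inode *inode, struct file *filp)
{
	filp->private_data = I_BDEV(inode);	/* inode -> block_device, done once */
	return 0;
}

static struct block_device *blkdev_file_bdev(struct file *filp)
{
	return filp->private_data;		/* hot path: a single load */
}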
2021-10-18  nvme-multipath: enable polled I/O  (Christoph Hellwig, 1 file, -1/+15)
Set the poll queue flag to enable polling, given that the multipath node just dispatches the bios to a lower queue. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: don't allow writing to the poll queue attribute  (Christoph Hellwig, 1 file, -19/+4)
The poll attribute is a historic artefact from before we had explicit poll queues, which require driver-specific configuration. Just print a warning when writing to the attribute. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: switch polling to be bio based  (Christoph Hellwig, 39 files, -264/+232)
Replace the blk_poll interface, which requires the caller to keep a queue and cookie from the submissions, with polling based on the bio. Polling for the bio itself leads to a few advantages:
- the cookie construction can be made entirely private in blk-mq.c
- the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues
- keeping the device and the cookie inside the bio allows us to trivially support polling of BIOs remapped by stacking drivers
- a lot of code to propagate the cookie back up the submission path can be removed entirely.
Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: define 'struct bvec_iter' as packed  (Ming Lei, 1 file, -1/+1)
'struct bvec_iter' is embedded into 'struct bio'; define it as packed so that we can get an extra 4 bytes for other uses without expanding bio. 'struct bvec_iter' is often allocated on the stack, so making it packed doesn't affect performance. Also I have run io_uring on both nvme/null_blk, and did not observe any performance effect from this change. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
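For context, a sketch of the iterator with the attribute applied (abridged field list, renamed here to make clear it is illustrative):

#include <linux/types.h>

struct bvec_iter_sketch {
	sector_t	bi_sector;	/* device address in 512-byte sectors */
	unsigned int	bi_size;	/* residual I/O count */
	unsigned int	bi_idx;		/* current index into bi_io_vec */
	unsigned int	bi_bvec_done;	/* bytes completed in the current bvec */
} __attribute__((packed));		/* the kernel spells this __packed */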
2021-10-18  block: use SLAB_TYPESAFE_BY_RCU for the bio slab  (Christoph Hellwig, 1 file, -1/+2)
This flag ensures that the pages will not be reused for non-bio allocations before the end of an RCU grace period. With that we can safely use an RCU lookup for bio polling, as long as we are fine with occasionally polling the wrong device. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
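A sketch of slab cache creation with the flag set (the real bio slab setup has more bookkeeping around per-size caches):

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

static struct kmem_cache *bio_slab_sketch;

static int __init bio_slab_sketch_init(void)
{
	bio_slab_sketch = kmem_cache_create("bio-sketch", sizeof(struct bio), 0,
					    SLAB_HWCACHE_ALIGN | SLAB_TYPESAFE_BY_RCU,
					    NULL);
	return bio_slab_sketch ? 0 : -ENOMEM;
}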
2021-10-18  block: rename REQ_HIPRI to REQ_POLLED  (Christoph Hellwig, 11 files, -20/+19)
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague, the bio flag is specific to the polling implementation, so rename and document it properly. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  io_uring: don't sleep when polling for I/O  (Christoph Hellwig, 3 files, -2/+5)
There is no point in sleeping for the expected I/O completion timeout in the io_uring async polling model as we never poll for a specific I/O. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: replace the spin argument to blk_iopoll with a flags argument  (Christoph Hellwig, 9 files, -26/+26)
Switch the boolean spin argument to blk_poll to passing a set of flags instead. This will allow controlling polling behavior in a more fine-grained way. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: adapt to changed io_uring iopoll] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  blk-mq: remove blk_qc_t_valid  (Christoph Hellwig, 2 files, -6/+1)
Move the trivial check into the only caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  blk-mq: remove blk_qc_t_to_tag and blk_qc_t_is_internal  (Christoph Hellwig, 2 files, -13/+5)
Merge both functions into their only caller to keep the blk-mq tag to blk_qc_t mapping as private as possible in blk-mq.c. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18blk-mq: factor out a "classic" poll helperChristoph Hellwig1-64/+56
Factor the code to do the classic full metal polling out of blk_poll into a separate blk_mq_poll_classic helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  blk-mq: factor out a blk_qc_to_hctx helper  (Christoph Hellwig, 2 files, -6/+7)
Add a helper to get the hctx from a request_queue and cookie, and fold the blk_qc_t_to_queue_num helper into it as no other callers are left. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  io_uring: fix a layering violation in io_iopoll_req_issued  (Christoph Hellwig, 1 file, -8/+1)
syscall-level code can't just poke into the details of the poll cookie, which is private information of the block layer. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  iomap: don't try to poll multi-bio I/Os in __iomap_dio_rw  (Christoph Hellwig, 1 file, -1/+20)
If an iocb is split into multiple bios we can't poll for all of them. So don't bother to even try to poll in that case. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: don't try to poll multi-bio I/Os in __blkdev_direct_IO  (Christoph Hellwig, 1 file, -14/+7)
If an iocb is split into multiple bios we can't poll for all of them. So don't even bother to try to poll in that case. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  direct-io: remove blk_poll support  (Christoph Hellwig, 1 file, -10/+4)
The polling support in the legacy direct-io code is a little crufty. It already doesn't support the asynchronous polling needed for io_uring polling, and it is hard to adapt to the upcoming changes in the polling interfaces. Given that all the major file systems already use the iomap direct I/O code, just drop the polling support. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Mark Wunderlich <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: only check previous entry for plug merge attempt  (Jens Axboe, 1 file, -23/+13)
Currently we scan the entire plug list, which is potentially very expensive. In an IOPS bound workload, we can drive about 5.6M IOPS with merging enabled, and profiling shows that the plug merge check is by far the most expensive thing we're doing:
Overhead  Command   Shared Object     Symbol
+ 20.89%  io_uring  [kernel.vmlinux]  [k] blk_attempt_plug_merge
+  4.98%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+  4.78%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO
+  4.61%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio
Instead of browsing the whole list, just check the previously inserted entry. That is enough for a naive merge check and will catch most cases, and for devices that need full merging, the IO scheduler attached to such devices will do that anyway. The plug merge is meant to be an inexpensive check to avoid getting a request, but if we repeatedly scan the list for every single insert, it is very much not a cheap check. With this patch, the workload instead runs at ~7.0M IOPS, providing a 25% improvement. Disabling merging entirely yields another 5% improvement. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
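A hedged sketch of the new behaviour (try_merge_rq_bio() stands in for the existing single-request merge check and is a hypothetical name): only look at the most recently queued plug request instead of walking the whole plug list.

#include <linux/blkdev.h>

bool try_merge_rq_bio(struct request *rq, struct bio *bio, unsigned int nr_segs);

static bool plug_merge_last(struct request *last_rq, struct bio *bio,
			    unsigned int nr_segs)
{
	if (!last_rq || last_rq->q != bdev_get_queue(bio->bi_bdev))
		return false;	/* different queue, nothing to merge with */

	return try_merge_rq_bio(last_rq, bio, nr_segs);
}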
2021-10-18  block: move CONFIG_BLOCK guard to top Makefile  (Masahiro Yamada, 2 files, -2/+3)
Every object under block/ depends on CONFIG_BLOCK. Move the guard to the top Makefile since there is no point to descend into block/ if CONFIG_BLOCK=n. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: move menu "Partition type" to block/partitions/Kconfig  (Masahiro Yamada, 2 files, -4/+4)
Move the menu to the relevant place. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  block: simplify Kconfig files  (Masahiro Yamada, 2 files, -15/+7)
Everything under block/ depends on BLOCK. BLOCK_HOLDER_DEPRECATED is selected from drivers/md/Kconfig, which is entirely dependent on BLOCK. Extend the 'if BLOCK' ... 'endif' so it covers the whole block/Kconfig. Also, clean up the definition of BLOCK_COMPAT and BLK_MQ_PCI because COMPAT and PCI are boolean. Signed-off-by: Masahiro Yamada <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>