Age | Commit message | Author | Files | Lines |
|
blk_mq_start_stopped_hw_queues() is used implicitly
as the counterpart of blk_mq_quiesce_queue() for unquiescing a queue,
so introduce blk_mq_unquiesce_queue() and make it
the explicit counterpart of blk_mq_quiesce_queue().
This function will be used to improve the current quiescing mechanism
in the following patches.
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
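As a rough usage sketch (the driver-side helper is hypothetical; only the quiesce/unquiesce pairing is from this patch):

    /* Sketch: pause dispatch, change driver state, resume dispatch. */
    blk_mq_quiesce_queue(q);      /* waits for in-flight ->queue_rq() calls */
    update_driver_state(dev);     /* hypothetical driver-specific work */
    blk_mq_unquiesce_queue(q);    /* the explicit counterpart added here */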
|
|
This patch introduces blk_mq_quiesce_queue_nowait() so
that we can work around an mpt3sas issue when quiescing its queue.
Once mpt3sas is fixed, we can remove this helper.
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
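Illustrative usage of the nowait variant (a sketch; the surrounding logic is left out):

    /* Sketch: mark the queue quiesced without waiting for in-flight
     * ->queue_rq() calls, for callers that cannot block here. */
    blk_mq_quiesce_queue_nowait(q);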
|
|
We usually put blk_mq_*() into include/linux/blk-mq.h, so
move this API there.
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use the same values for request completion errors as for the return
value from ->queue_rq. BLK_STS_RESOURCE is special-cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
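A sketch of the resulting driver pattern; the device checks and submission helper below are hypothetical, only the blk_status_t plumbing reflects this patch:

    static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd)
    {
        struct request *rq = bd->rq;

        if (my_device_busy(hctx->driver_data))      /* hypothetical */
            return BLK_STS_RESOURCE;                /* special-cased: requeue */

        blk_mq_start_request(rq);
        if (my_submit_to_hw(rq))                    /* hypothetical */
            blk_mq_end_request(rq, BLK_STS_IOERR);  /* completed as-is */
        return BLK_STS_OK;
    }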
|
|
Currently we use normal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have an
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return something that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low-hanging
fruit to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
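The idiom looks roughly like the following; the numeric values are assumptions, the point is the __bitwise/__force pattern:

    /* Sketch of the sparse __bitwise idiom -- values are assumptions. */
    typedef u8 __bitwise blk_status_t;

    #define BLK_STS_OK      0
    #define BLK_STS_IOERR   ((__force blk_status_t)10)

    /* sparse now warns when a plain errno is passed where a
     * blk_status_t is expected, and vice versa. */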
|
|
No one uses it any more, so remove it.
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
This provides the infrastructure for schedulers to expose their internal
state through debugfs. We add a list of queue attributes and a list of
hctx attributes to struct elevator_type and wire them up when switching
schedulers.
Signed-off-by: Omar Sandoval <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Add missing seq_file.h header in blk-mq-debugfs.h
Signed-off-by: Jens Axboe <[email protected]>
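A hedged sketch of a scheduler using the new hooks; the struct layout and field names below are assumptions based on this description:

    /* Sketch: a scheduler exposing per-queue debugfs attributes. */
    static const struct blk_mq_debugfs_attr my_sched_queue_attrs[] = {
        {"state", 0400, my_sched_state_show},   /* hypothetical show fn */
        {},
    };

    static struct elevator_type my_sched = {
        /* ... scheduler ops ... */
        .queue_debugfs_attrs = my_sched_queue_attrs,
    };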
|
|
Originally, I tied debugfs registration/unregistration together with
sysfs. There's no reason to do this, and it's getting in the way of
letting schedulers define their own debugfs attributes. Instead, tie the
debugfs registration to the lifetime of the structures themselves.
The saner lifetimes mean we can also get rid of the extra mq directory
and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
just nvme0n1/hctx0/tags.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superfluous check in
mtip32xx it was unused anyway.
Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow-on cleanup.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
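The updated prototype then looks like this sketch (the driver names are hypothetical):

    /* Sketch: post-patch ->init_request, taking the tag_set directly
     * and no longer receiving a request_idx. */
    static int my_init_request(struct blk_mq_tag_set *set, struct request *rq,
                               unsigned int hctx_idx, unsigned int numa_node)
    {
        struct my_dev *dev = set->driver_data;   /* hypothetical driver */

        my_init_pdu(blk_mq_rq_to_pdu(rq), dev, numa_node);
        return 0;
    }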
|
|
The only difference between ->run_work and ->delay_work is that
the latter is used to defer running a queue. This is done by
marking the queue stopped, and scheduling ->delay_work to run
sometime in the future. While the queue is stopped, direct runs
or runs through ->run_work will not run the queue.
If we combine the handlers, then we need to handle two things:
1) If a delayed/stopped run is scheduled, then we should not run
the queue before that has been completed.
2) If a queue is delayed/stopped, the handler needs to restart
the queue. Normally a run of a queue with the stopped bit set
would be a no-op.
Case 1 is handled by moving a currently pending queue run
to the deadline set by the caller of blk_mq_delay_queue().
Subsequent attempts to schedule a queue run will find the work
item already pending, and direct runs will see a stopped queue
as before.
Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
that tells the work handler that it should clear a stopped
queue and run the handler.
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
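Paraphrased as a sketch (not the literal implementation; helper names are assumptions):

    /* Sketch of the combined work handler described above. */
    static void run_work_fn(struct work_struct *work)
    {
        struct blk_mq_hw_ctx *hctx =
            container_of(work, struct blk_mq_hw_ctx, run_work.work);

        if (test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state)) {
            /* case 2: a delayed run restarts a stopped queue */
            clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
            clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
        } else if (blk_mq_hctx_stopped(hctx)) {
            return;     /* normal stopped queue: a run is a no-op */
        }
        __blk_mq_run_hw_queue(hctx);
    }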
|
|
They serve the exact same purpose. Get rid of the non-delayed
work variant, and just run it without delay for the normal case.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This new callback function will be used in the next patch to show
more information about SCSI requests.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that all drivers that call blk_mq_complete_request() have a
->complete callback we can remove the direct call to blk_mq_end_request(),
as well as the error argument to blk_mq_complete_request().
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
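A sketch of the resulting pattern, with the status carried in the driver pdu (the pdu layout and names are hypothetical):

    struct my_cmd {
        int status;                 /* hypothetical per-request pdu */
    };

    static void my_hw_completion(struct request *rq, int error)
    {
        struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

        cmd->status = error;
        blk_mq_complete_request(rq);    /* no error argument anymore */
    }

    static void my_complete(struct request *rq)   /* ->complete callback */
    {
        struct my_cmd *cmd = blk_mq_rq_to_pdu(rq);

        blk_mq_end_request(rq, cmd->status);
    }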
|
|
blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().
This function will be used in the next patch in this series.
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Long Li <[email protected]>
Cc: K. Y. Srinivasan <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
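Illustrative usage (the delay value is arbitrary):

    /* Sketch: run the hardware queue again in 100 ms, without
     * stopping it in the meantime. */
    blk_mq_delay_run_hw_queue(hctx, 100);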
|
|
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
As the .q_usage_counter is used by both the legacy and
mq paths, we need to block new I/O if the queue becomes
dead in blk_queue_enter().
So rename it so that we can use this function in both
paths.
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This flag was never used since it was introduced.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A driver may wish to take corrective action if queued requests do not
complete within a set time.
Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
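A usage sketch; the helper name follows this description, but treat it, and the recovery hook, as assumptions:

    /* Sketch: give a freeze 5 seconds to complete, then recover. */
    if (blk_mq_freeze_queue_wait_timeout(q, 5 * HZ) == 0)
        my_corrective_action(dev);      /* hypothetical driver recovery */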
|
|
Drivers can start a freeze, so this provides a way to wait until the queue is frozen.
Signed-off-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
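The driver-side pattern, as a sketch (using the helper names from this point in the series; the freeze-start helper is renamed later in this log):

    /* Sketch: start a freeze, wait until all requests have drained. */
    blk_mq_freeze_queue_start(q);
    blk_mq_freeze_queue_wait(q);
    /* ... queue is frozen here ... */
    blk_mq_unfreeze_queue(q);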
|
|
Commit 50e1dab86aa2 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.
One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().
However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:
1. It avoids iterating over all hardware queues when completing an I/O,
which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
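A heavily condensed sketch of the mechanism; the field names and the wait-entry style follow kernels of this era and should be treated as assumptions:

    /* Sketch: a custom wake callback instead of blocking on a tag. */
    static int dispatch_wake(wait_queue_t *wait, unsigned mode,
                             int flags, void *key)
    {
        struct blk_mq_hw_ctx *hctx =
            container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);

        list_del_init(&wait->task_list);
        blk_mq_run_hw_queue(hctx, true);   /* rerun instead of blocking */
        return 1;
    }

    /* in the dispatch path, when no driver tag is available: */
    struct sbq_wait_state *ws = sbq_wait_ptr(&tags->bitmap_tags,
                                             &hctx->wait_index);
    add_wait_queue(&ws->wait, &hctx->dispatch_wait);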
|
|
Add Kconfig entries to manage what devices get assigned an MQ
scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
The latter is useful for admin type queues that still allocate a blk-mq
queue and tag set, but aren't used for normal IO.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
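For example, an admin-type queue might opt out like this (a sketch; the flag name follows this patch):

    /* Sketch: keep the I/O scheduler off an admin tag set. */
    set->nr_hw_queues = 1;
    set->flags = BLK_MQ_F_NO_SCHED;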
|
|
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.
We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interface, and with a separate set of ops hooks for MQ
devices.
We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
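A registration sketch; the ops layout follows the description above, but the exact field names are assumptions:

    /* Sketch: an MQ scheduler registering through the elevator API. */
    static struct elevator_type my_mq_sched = {
        .ops.mq = {
            .insert_requests  = my_insert_requests,
            .dispatch_request = my_dispatch_request,
        },
        .uses_mq        = true,       /* the blk-mq flagging support */
        .elevator_name  = "my-sched",
        .elevator_owner = THIS_MODULE,
    };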
|
|
It's only used in blk-mq, kill it from the main exported header
and kill the symbol export as well.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
|
|
We never change it, make that clear.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
|
|
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).
There's also an assortment of minor fixes, mostly in error legs or
other not very user visible stuff. The major change is the
pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
effectively makes IRQ mapping generic for the drivers and allows
blk_mq to use the information"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
scsi: qla4xxx: switch to pci_alloc_irq_vectors
scsi: hisi_sas: support deferred probe for v2 hw
scsi: megaraid_sas: switch to pci_alloc_irq_vectors
scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
scsi: be2iscsi: set errno on error path
scsi: be2iscsi: set errno on error path
scsi: hpsa: fallback to use legacy REPORT PHYS command
scsi: scsi_dh_alua: Fix RCU annotations
scsi: hpsa: use %phN for short hex dumps
scsi: hisi_sas: fix free'ing in probe and remove
scsi: isci: switch to pci_alloc_irq_vectors
scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
scsi: dpt_i2o: double free on error path
scsi: cxlflash: Migrate scsi command pointer to AFU command
scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
scsi: cxlflash: Cleanup queuecommand()
scsi: cxlflash: Cleanup send_tmf()
scsi: cxlflash: Remove AFU command lock
scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
scsi: cxlflash: Remove private command pool
...
|
|
We have a variant for all hardware queues, but not one for a single
hardware queue.
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
|
|
This will allow SCSI to have a single blk_mq_ops structure that either
lets the LLDD map the queues to PCIe MSIx vectors or use the default.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows the caller to kick the requeue list. This
was proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
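Callers then look like this sketch:

    /* Sketch: requeue and kick the requeue list in one call ... */
    blk_mq_requeue_request(rq, true);
    /* ... or add to the list and defer the kick to a later call. */
    blk_mq_add_to_requeue_list(rq, false, false);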
|
|
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).
The algorithm used by blk_mq_quiesce_queue() is as follows:
* Hold either an RCU read lock or an SRCU read lock around
.queue_rq() calls. The former is used if .queue_rq() does not
block and the latter if .queue_rq() may block.
* blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
followed by synchronize_srcu() or synchronize_rcu(). The latter
call waits for .queue_rq() invocations that started before
blk_mq_quiesce_queue() was called.
* The blk_mq_hctx_stopped() calls that control whether or not
.queue_rq() will be called are performed with the (S)RCU read lock
held. This is necessary to avoid race conditions against
blk_mq_quiesce_queue().
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
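A condensed sketch of the dispatch side of this scheme (the SRCU field name is an assumption):

    /* Sketch: (S)RCU protection around the ->queue_rq() dispatch path. */
    if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
        rcu_read_lock();
        if (!blk_mq_hctx_stopped(hctx))
            dispatch_requests(hctx);        /* invokes ->queue_rq() */
        rcu_read_unlock();
    } else {
        int idx = srcu_read_lock(&hctx->queue_rq_srcu);

        if (!blk_mq_hctx_stopped(hctx))
            dispatch_requests(hctx);
        srcu_read_unlock(&hctx->queue_rq_srcu, idx);
    }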
|
|
Since blk_mq_requeue_work() no longer restarts stopped queues,
canceling requeue work is no longer needed to prevent a
stopped queue from being restarted. Hence remove this function.
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The function blk_queue_stopped() allows testing whether or not a
traditional request queue has been stopped. Introduce a helper
function that allows block drivers to easily query whether or not
one or more hardware contexts of a blk-mq queue have been stopped.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
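Usage is a one-liner, sketched here with a hypothetical policy:

    /* Sketch: skip work while any hardware context is stopped. */
    if (blk_mq_queue_stopped(q))
        return;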
|
|
Pull blk-mq CPU hotplug update from Jens Axboe:
"This is the conversion of blk-mq to the new hotplug state machine"
* 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
blk-mq: fixup "Convert to new hotplug state machine"
blk-mq: Convert to new hotplug state machine
blk-mq/cpu-notif: Convert to new hotplug state machine
|
|
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
|
|
If a driver sets BLK_MQ_F_BLOCKING, it is allowed to block in its
->queue_rq() handler. For that case, blk-mq ensures that the handler
is always called from a safe context.
Signed-off-by: Jens Axboe <[email protected]>
Tested-by: Josef Bacik <[email protected]>
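Opting in is a tag-set flag, sketched below:

    /* Sketch: declare that ->queue_rq() may sleep; blk-mq then only
     * calls it from a context where blocking is safe. */
    set->flags |= BLK_MQ_F_BLOCKING;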
|
|
Replace the block-mq notifier list management with the multi instance
facility in the cpu hotplug state machine.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Enable devices without a gendisk instance to register themselves with blk-mq
and expose the associated multi-queue sysfs entries.
Signed-off-by: Matias Bjørling <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This is a generally useful data structure, so make it available to
anyone else who might want to use it. It's also a nice cleanup
separating the allocation logic from the rest of the tag handling logic.
The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
selected by CONFIG_BLOCK for now.
This should be a complete noop functionality-wise.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
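A self-contained usage sketch of the extracted structure:

    #include <linux/sbitmap.h>

    /* Sketch: a 128-bit scalable bitmap queue; grab and release one bit. */
    static int sbitmap_demo(void)
    {
        struct sbitmap_queue sbq;
        int nr;

        if (sbitmap_queue_init_node(&sbq, 128, -1, false, GFP_KERNEL,
                                    NUMA_NO_NODE))
            return -ENOMEM;

        nr = __sbitmap_queue_get(&sbq);     /* -1 if no bit is free */
        if (nr >= 0)
            sbitmap_queue_clear(&sbq, nr, raw_smp_processor_id());

        sbitmap_queue_free(&sbq);
        return 0;
    }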
|
|
Unused now that NVMe sets up irq affinity before calling into blk-mq.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This allows drivers to specify their own queue mapping by overriding the
setup-time function that builds the mq_map. This can be used, for
example, to build the map based on the MSI-X vector mapping provided
by the core interrupt layer for PCI devices.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
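An override sketch; blk_mq_pci_map_queues() stands in for whatever affinity-derived mapping a driver would build, and the driver structure is hypothetical:

    /* Sketch: driver-provided queue mapping based on PCI IRQ affinity. */
    static int my_map_queues(struct blk_mq_tag_set *set)
    {
        struct my_dev *dev = set->driver_data;    /* hypothetical */

        return blk_mq_pci_map_queues(set, dev->pdev);
    }

    /* wired up via .map_queues = my_map_queues in the blk_mq_ops */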
|
|
All drivers use the default, so provide an inline version of it. If we
ever need another queue mapping we can add an optional method back,
although supporting it will also require major changes to the queue setup
code.
This provides better code generation, and better debuggability as well.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The mapping is identical for all queues in a tag_set, so stop wasting
memory for building multiple. Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate if we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
blk_mq_delay_kick_requeue_list() provides the ability to kick the
q->requeue_list after a specified time. To do this the request_queue's
'requeue_work' member was changed to a delayed_work.
blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
requests when it doesn't make sense to requeue them immediately
(e.g. when all paths in a DM multipath have failed).
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
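Usage is a sketch one-liner:

    /* Sketch: process the requeue list after a 5000 ms delay. */
    blk_mq_delay_kick_requeue_list(q, 5000);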
|
|
blk_mq_alloc_single_hw_queue() is a prototype artifact that
should have been removed with
commit cdef54dd85ad ("blk-mq: remove alloc_hctx and free_hctx methods"),
where the last users of it were deleted.
Fixes: cdef54dd85ad ("blk-mq: remove alloc_hctx and free_hctx methods")
Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
In order to help determine the effectiveness of polling in a running
system it is useful to determine the ratio of how often the poll
function is called vs. how often the completion is checked. For this
reason we add a poll_considered variable and add it to the sysfs entry
for io_poll.
Signed-off-by: Stephen Bates <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Various cache line optimizations:
- Move delay_work towards the end. It's huge, and we don't use it
a lot (only SCSI).
- Move the atomic state into the same cacheline as the dispatch
list and lock.
- Rearrange a few members to pack it better.
- Shrink the max-order for dispatch accounting from 10 to 7. This
means that ->dispatched[] and ->run now take up their own
cacheline.
This shrinks struct blk_mq_hw_ctx down to 8 cachelines.
Signed-off-by: Jens Axboe <[email protected]>
|
|
We don't need the larger delayed work struct, since we always run it
immediately.
Signed-off-by: Jens Axboe <[email protected]>
|
|
The new nvme-rdma driver will need to reinitialize all the tags as part of
the error recovery procedure (realloc the tag memory region). Add a helper
in blk-mq for it that can iterate over all requests in a tagset to make
this easier.
Signed-off-by: Sagi Grimberg <[email protected]>
Tested-by: Ming Lin <[email protected]>
Reviewed-by: Stephen Bates <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Steve Wise <[email protected]>
Tested-by: Steve Wise <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
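A sketch of the resulting pattern; the callback name and per-request work are hypothetical, and the helper/callback shapes are assumptions based on this description:

    /* Sketch: re-register per-request memory regions for a tag set. */
    static int my_reinit_request(void *data, struct request *rq)
    {
        return my_remap_mr(data, blk_mq_rq_to_pdu(rq));  /* hypothetical */
    }

    /* with .reinit_request = my_reinit_request in the blk_mq_ops: */
    blk_mq_reinit_tagset(&ctrl->tag_set);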
|
|
For some protocols like NVMe over Fabrics we need to be able to send
initialization commands to a specific queue.
Based on an earlier patch from Christoph Hellwig <[email protected]>.
Signed-off-by: Ming Lin <[email protected]>
[hch: disallow sleeping allocation, req_op fixes]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
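A usage sketch; the rw and flags values are illustrative:

    /* Sketch: allocate a request on one specific hardware queue to
     * send an initialization command. */
    struct request *rq;

    rq = blk_mq_alloc_request_hctx(q, WRITE, BLK_MQ_REQ_NOWAIT, hctx_idx);
    if (IS_ERR(rq))
        return PTR_ERR(rq);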
|