|
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers for the parallel SCSI (SPI) bus.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.
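A minimal sketch of what the counter-based "pm-only" mode can look
like, modeled on the description above; the helper names and the
atomic_t q->pm_only member are assumptions of this sketch rather than
a quote of the final code:

    void blk_set_pm_only(struct request_queue *q)
    {
        atomic_inc(&q->pm_only);
    }

    void blk_clear_pm_only(struct request_queue *q)
    {
        int pm_only = atomic_dec_return(&q->pm_only);

        WARN_ON_ONCE(pm_only < 0);
        if (pm_only == 0)
            wake_up_all(&q->mq_freeze_wq);
    }

With a counter instead of a single flag, both the PM core and the
runtime-suspend path mentioned above can enter "pm-only" mode
independently.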
Signed-off-by: Bart Van Assche <[email protected]>
Acked-by: Martin K. Petersen <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Cc: Jianchao Wang <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Cc: Alan Stern <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.
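As an illustration, a sketch of how <linux/blk-pm.h> can expose the
runtime PM interface only when it is needed (the CONFIG_PM=n stub
shown here is an assumption of the sketch):

    #ifdef CONFIG_PM
    extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
    extern int blk_pre_runtime_suspend(struct request_queue *q);
    extern void blk_post_runtime_suspend(struct request_queue *q, int err);
    extern void blk_pre_runtime_resume(struct request_queue *q);
    extern void blk_post_runtime_resume(struct request_queue *q, int err);
    extern void blk_set_runtime_active(struct request_queue *q);
    #else
    static inline void blk_pm_runtime_init(struct request_queue *q,
                                           struct device *dev) {}
    #endif /* CONFIG_PM */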
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jianchao Wang <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Cc: Alan Stern <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
No need to expose these helpers outside the block layer.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Keep it close to the actual users instead of exposing the function to all
drivers.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
No need to expose these to drivers.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
After merging the iolatency policy, we potentially now have 4 policies
being registered, but only support 3. This causes one of them to fail
loading. Takashi reports that BFQ no longer works for him, because it
fails to load due to policy registration failure.
Bump the limit to 5 policies, and also add a warning for when we have
exceeded the global limit. If we have to touch this again, we should switch
to a dynamic scheme instead.
Reported-by: Takashi Iwai <[email protected]>
Reviewed-by: Jeff Moyer <[email protected]>
Tested-by: Takashi Iwai <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the
only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but
did not remove the corresponding #include directives. Since these
include directives are no longer needed, remove them.
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Jianchao Wang <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.
Acked-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Edwards <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.
Unfortunately, we want to track discards separately and @is_write
isn't enough information. This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.
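A hedged sketch of a driver ->rw_page() implementation after this
change; the mydev_* helpers are hypothetical:

    static int mydev_rw_page(struct block_device *bdev, sector_t sector,
                             struct page *page, unsigned int op)
    {
        struct mydev *dev = bdev->bd_disk->private_data;
        int err;

        if (op_is_write(op))
            err = mydev_write_page(dev, page, sector);
        else
            err = mydev_read_page(dev, page, sector);
        if (err)
            return err;

        /* page_endio() still takes a bool, as described above */
        page_endio(page, op_is_write(op), 0);
        return 0;
    }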
Signed-off-by: Tejun Heo <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dan Williams <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Remove the blkdev_entry_to_request() macro, which has remained unused
throughout the observable history; note also that it duplicates the
list_entry_rq() macro verbatim.
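For reference, the macro that remains is (as defined in <linux/blkdev.h>):

    #define list_entry_rq(ptr)  list_entry((ptr), struct request, queuelist)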
Signed-off-by: Vladimir Zapolskiy <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We have to remove synchronize_rcu() from blk_cleanup_queue(),
otherwise a long delay can be caused during LUN probing. To remove
it, we have to avoid iterating over set->tag_list in the I/O path,
e.g. in blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough I/O hang issues,
there isn't any reason to restart all queues sharing one tag set any
more, for the following reasons:
1) The blk-mq core can deal with the shared-tags case well via
blk_mq_get_driver_tag(), which can wake up queues waiting for a driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if the
queue, target or host isn't ready, but the SCSI built-in restart covers
all of these well; see scsi_end_request(): the queue will be rerun after
any request initiated from this host/target is completed.
In my test on scsi_debug (8 LUNs), this patch may improve IOPS by 20% -
30% when running I/O on these 8 LUNs concurrently.
Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Martin K. Petersen <[email protected]>
Cc: [email protected]
Reported-by: Andrew Jones <[email protected]>
Tested-by: Andrew Jones <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Exclude zoned block device members from struct request_queue for
CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
the code that uses these struct request_queue members if
CONFIG_BLK_DEV_ZONED != n.
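A sketch of the result inside struct request_queue (other members
omitted; the zoned fields shown are the ones introduced earlier in
this series):

    #ifdef CONFIG_BLK_DEV_ZONED
        unsigned int   nr_zones;
        unsigned long *seq_zones_bitmap;
        unsigned long *seq_zones_wlock;
    #endif /* CONFIG_BLK_DEV_ZONED */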
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Cc: Matias Bjorling <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since the implementation of blk_queue_nr_zones() is trivial and since
it only has a single caller, inline this function.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Cc: Matias Bjorling <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Remove this function since it has no callers. This function was
introduced in commit 6cc77e9cb080 ("block: introduce zoned block
devices zone write locking").
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Matias Bjorling <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A device may have boundary restrictions where the number of sectors
between boundaries exceeds its max transfer size. In this case, we need
to cap the max size to the smaller of the two limits.
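A sketch of the capping logic, modeled on blk_max_size_offset(); the
min() against max_sectors is the part this change adds:

    static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                   sector_t offset)
    {
        if (!q->limits.chunk_sectors)
            return q->limits.max_sectors;

        return min(q->limits.max_sectors,
                   (unsigned int)(q->limits.chunk_sectors -
                       (offset & (q->limits.chunk_sectors - 1))));
    }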
Reported-by: Jitendra Bhivare <[email protected]>
Tested-by: Jitendra Bhivare <[email protected]>
Cc: <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This function is entirely unused, so remove it and the tag_queue_busy
member of struct request_queue.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We can currently call the timeout handler again on a request that has
already been handed over to the timeout handler. Prevent that with a new
flag.
Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
Reported-by: Andrew Randrianasulu <[email protected]>
Tested-by: Andrew Randrianasulu <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Convert the core block functionality to embedded bio sets.
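A minimal sketch of the embedded pattern (pool size and flags are
illustrative):

    struct bio_set bs;          /* embedded, was: struct bio_set *bs */

    if (bioset_init(&bs, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS))
        return -ENOMEM;
    /* ... allocate and split bios from &bs ... */
    bioset_exit(&bs);           /* replaces bioset_free(bs) */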
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
After the recent timeout handling changes, we have two holes in
the struct. Move the timeout near the deadline, killing both,
and moving related members closer together. On my config on
x86-64, this shrinks struct request from 312 to 304 bytes.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
BLK_EH_NOT_HANDLED implies that nothing happened, but very often that
is not the case - instead, the driver has already completed the
command. Fix the symbolic name to reflect that a little better.
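The value was renamed to BLK_EH_DONE. A sketch of a blk-mq driver
timeout handler that has already completed the command (the mydrv_*
helper is hypothetical):

    static enum blk_eh_timer_return mydrv_timeout(struct request *rq,
                                                  bool reserved)
    {
        mydrv_abort_and_complete(rq);   /* hypothetical driver cleanup */
        return BLK_EH_DONE;             /* formerly BLK_EH_NOT_HANDLED */
    }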
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an inflight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on it: a driver completing a request is allowed to proceed to
the next state without additional synchronization with the block layer.
This also removes any need for generation sequence numbers, since a
request can no longer be recycled into a new instance while timeout
handling is operating on it.
To enable this, a refcount is added to struct request so that request
users can be sure they're operating on the same request without it
changing while they're processing it. The request's tag won't be
released for reuse until both the timeout handler and the completion
are done with it.
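A sketch of the resulting release path, modeled on the description
above; the wrapper name is illustrative:

    static void blk_mq_put_rq_ref(struct request *rq)
    {
        /* the tag is only released once the timeout handler and the
         * completion path have both dropped their reference */
        if (refcount_dec_and_test(&rq->ref))
            __blk_mq_free_request(rq);
    }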
Signed-off-by: Keith Busch <[email protected]>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg
for completions]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
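A sketch of the consolidated fields in struct request after this
change (both in nanoseconds obtained from ktime_get_ns()):

    u64 start_time_ns;      /* set at get_request time, used for iostats */
    u64 io_start_time_ns;   /* set at start_request time, used for blk-stats */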
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d47 ("block: disable preemption before
using sched_clock()").
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
struct blk_issue_stat squashes three things into one u64:
- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling
It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Eight bug fixes, one spelling update and one tracepoint addition.
The most serious is probably the mptsas write same fix because it
means anyone using these controllers sees errors when modern
filesystems try to issue discards"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: fix crash with iscsi target and dvd
scsi: sd_zbc: Avoid that resetting a zone fails sporadically
scsi: sd: Defer spinning up drive while SANITIZE is in progress
scsi: megaraid_sas: Do not log an error if FW successfully initializes.
scsi: ufs: add trace event for ufs upiu
scsi: core: remove reference to scsi_show_extd_sense()
scsi: mptsas: Disable WRITE SAME
scsi: fnic: fix spelling mistake in fnic stats "Abord" -> "Abort"
scsi: scsi_debug: IMMED related delay adjustments
scsi: iscsi: respond to netlink with unicast when appropriate
|
|
Since SCSI scanning occurs asynchronously, since sd_revalidate_disk() is
called from sd_probe_async() and since sd_revalidate_disk() calls
sd_zbc_read_zones(), it can happen that sd_zbc_read_zones() is called
concurrently with blkdev_report_zones() and/or blkdev_reset_zones(). That can
cause these functions to fail with -EIO because sd_zbc_read_zones() e.g. sets
q->nr_zones to zero before restoring it to the actual value, even if no drive
characteristics have changed. Prevent this from happening by making the
following changes:
- Protect the code that updates zone information with blk_queue_enter()
and blk_queue_exit().
- Modify sd_zbc_setup_seq_zones_bitmap() and sd_zbc_setup() such that
these functions do not modify struct scsi_disk before all zone
information has been obtained.
Note: since commit 055f6e18e08f ("block: Make q_usage_counter also track
legacy requests"; kernel v4.15) the request queue freezing mechanism also
affects legacy request queues.
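A hedged sketch of the first change above (serializing the zone
information update against in-flight report/reset requests); error
handling and the actual update are elided:

    int ret = blk_queue_enter(q, 0);    /* blocks while q is frozen */

    if (ret)
        return ret;
    /* ... update q->nr_zones and the zone bitmaps here ... */
    blk_queue_exit(q);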
Fixes: 89d947561077 ("sd: Implement support for ZBC devices")
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Damien Le Moal <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: [email protected] # v4.16
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
|
|
So we can check FUA support status from the iomap direct IO code.
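The helper this adds is a simple queue-flag test; a caller such as the
iomap direct I/O code can then do something like the following (the
usage line is illustrative):

    #define blk_queue_fua(q)    test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)

    use_fua = blk_queue_fua(bdev_get_queue(bdev));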
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
It often happens while I'm preparing a patch for a block driver that
I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
available to this driver? Do I have to introduce definitions of these
constants before I can use them? To avoid this confusion, move the
existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
<linux/blkdev.h> header file such that they become available to all
block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
header file conditional so that including that header file after
<linux/blkdev.h> does not cause the compiler to complain about a
SECTOR_SIZE redefinition.
Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
not been removed from uapi header files nor from NAND drivers in
which these constants are used for another purpose than converting
block layer offsets and sizes into a number of sectors.
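The shared definitions are small, and a typical conversion in a block
driver then looks like this (the variable names are illustrative):

    #define SECTOR_SHIFT 9
    #define SECTOR_SIZE (1 << SECTOR_SHIFT)

    sector_t nr_sects = len_in_bytes >> SECTOR_SHIFT;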
Cc: David S. Miller <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patch helps prevent new code that manipulates queue flags without
holding the queue lock, when that lock should be held, from being
introduced into block drivers.
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ming Lei <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since it is not safe to use queue_flag_(set|clear)_unlocked()
without holding the queue lock after the sysfs entries for a
queue have been created, complain if this happens.
Cc: Mike Snitzer <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ming Lei <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.
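A sketch of one of the new helpers, modeled on the description above;
the unlocked primitive it wraps is the pre-existing queue_flag_set():

    void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
    {
        unsigned long flags;

        spin_lock_irqsave(q->queue_lock, flags);
        queue_flag_set(flag, q);
        spin_unlock_irqrestore(q->queue_lock, flags);
    }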
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ming Lei <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the definition of queue_flag_clear_unlocked() up and move the
definition of queue_in_flight() down such that all queue flag
manipulation function definitions become contiguous.
This patch does not change any functionality.
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Ming Lei <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Philipp Reisner <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Update the comment typo _consisitent_ to _consistent_ introduced by the
following commit:
commit 0206319fdfee ("blk-mq: Fix poll_stat for new size-based bucketing.")
Cc: Jens Axboe <[email protected]>
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's always been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it got reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
|
|
The previous patch removed all users of these two functions. Hence
also remove the functions themselves.
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move completion related items (like the call single data) near the
end of the struct, instead of mixing them in with the initial
queueing related fields.
Move queuelist below the bio structures. Then we have all
queueing related bits in the first cache line.
This yields a 1.5-2% increase in IOPS for a null_blk test, both for
sync and for high thread count access. Sync test goes from 975K to
992K, 32-thread case from 20.8M to 21.2M IOPS.
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We only have one atomic flag left. Instead of using an entire
unsigned long for that, steal the bottom bit of the deadline
field that we already reserved.
Remove ->atomic_flags, since it's now unused.
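A conceptual sketch of the accessors this implies; the names and the
exact bit choice are illustrative, not the merged identifiers:

    #define RQ_COMPLETE_BIT 0   /* bottom bit of rq->__deadline */

    static inline unsigned long blk_rq_deadline(struct request *rq)
    {
        return rq->__deadline & ~(1UL << RQ_COMPLETE_BIT);
    }

    static inline void blk_rq_set_deadline(struct request *rq,
                                           unsigned long time)
    {
        rq->__deadline = time & ~(1UL << RQ_COMPLETE_BIT);
    }

    static inline bool blk_rq_is_complete(struct request *rq)
    {
        return test_bit(RQ_COMPLETE_BIT, &rq->__deadline);
    }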
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We reduce the resolution of request expiry, but since we're already
using jiffies for this where resolution depends on the kernel
configuration and since the timeout resolution is coarse anyway,
that should be fine.
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We don't need this to be an atomic flag, it can be a regular
flag. We either end up on the same CPU for the polling, in which
case the state is sane, or we did the sleep which would imply
the needed barrier to ensure we see the right state.
Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
After the recent updates to use generation number and state based
synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
to avoid firing the same timeout multiple times.
Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
times. This removes atomic bitops from hot paths too.
v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
Signed-off-by: Tejun Heo <[email protected]>
Cc: "jianchao.wang" <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules. Unfortunately, it contains quite a few holes.
There's a complex dance around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between the issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from the (N-2)'th recycle
instance, the deadline from the (N-1)'th, and run blk_mark_rq_complete()
against the Nth instance.
In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request. If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly. Nothing actually timed out. It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.
This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.
1. Each request has a u64 generation + state value, which can be
updated only by the request owner. Whenever a request becomes
in-flight, the generation number gets bumped up too. This provides
the basis for the timeout path to distinguish different recycle
instances of the request.
Also, marking a request in-flight and setting its deadline are
protected with a seqcount so that the timeout path can fetch both
values coherently.
2. The timeout path fetches the generation, state and deadline. If
the verdict is timeout, it records the generation into a dedicated
request abortion field and does RCU wait.
3. The completion path is also protected by RCU (from the previous
patch) and checks whether the current generation number and state
match the abortion field. If so, it skips completion.
4. The timeout path, after RCU wait, scans requests again and
terminates the ones whose generation and state still match the ones
requested for abortion.
By now, the timeout path knows that either the generation number
and state changed if it lost the race or the completion will yield
to it and can safely timeout the request.
While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places. Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.
Note that this patch adds a comment explaining a race condition in
BLK_EH_RESET_TIMER path. The race has always been there and this
patch doesn't change it. It's just documenting the existing race.
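A conceptual sketch of step 3 above (the completion-side check); the
helper and field names here are illustrative rather than the exact
merged identifiers:

    static void complete_if_not_aborted(struct blk_mq_hw_ctx *hctx,
                                        struct request *rq)
    {
        int srcu_idx;

        hctx_lock(hctx, &srcu_idx);     /* RCU/SRCU from the previous patch */
        if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
            __blk_mq_complete_request(rq); /* timeout path did not claim it */
        hctx_unlock(hctx, srcu_idx);
    }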
v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
- s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
- READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
v3: - Fixed possible extended seqcount / u64_stats_sync read looping
spotted by Peter.
- MQ_RQ_IDLE was incorrectly being set in complete_request instead
of free_request. Fixed.
v4: - Rebased on top of hctx_lock() refactoring patch.
- Added comment explaining the use of hctx_lock() in completion path.
v5: - Added comments requested by Bart.
- Note the addition of BLK_EH_RESET_TIMER race condition in the
commit message.
Signed-off-by: Tejun Heo <[email protected]>
Cc: "jianchao.wang" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Components relying only on the request_queue structure for accessing
block devices (e.g. I/O schedulers) have limited knowledge of the
device characteristics. In particular, the device capacity cannot be
easily discovered, which for a zoned block device also results in the
inability to easily know the number of zones of the device (the zone
size is indicated by the chunk_sectors field of the queue limits).
Introduce the nr_zones field to the request_queue structure to simplify
access to this information. Also, add the bitmap seq_zones_bitmap which
indicates which zones of the device are sequential zones (write
preferred or write required) and the bitmap seq_zones_wlock which
indicates if a zone is write locked, that is, if a write request
targeting a zone was dispatched to the device. These fields are
initialized by the low level block device driver (sd.c for ZBC/ZAC
disks). They are not initialized by stacking drivers (device mappers)
handling zoned block devices (e.g. dm-linear).
Using this, I/O schedulers can introduce zone write locking to control
request dispatching to a zoned block device and avoid write request
reordering by limiting to at most a single write request per zone
outside of the scheduler at any time.
Based on previous patches from Damien Le Moal.
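Sketches of the helpers built on top of these fields (reconstructed
from the description; see <linux/blkdev.h> for the exact versions):

    static inline unsigned int blk_queue_zone_no(struct request_queue *q,
                                                 sector_t sector)
    {
        if (!blk_queue_is_zoned(q))
            return 0;
        return sector >> ilog2(q->limits.chunk_sectors);
    }

    static inline bool blk_queue_zone_is_seq(struct request_queue *q,
                                             sector_t sector)
    {
        if (!blk_queue_is_zoned(q) || !q->seq_zones_bitmap)
            return false;
        return test_bit(blk_queue_zone_no(q, sector), q->seq_zones_bitmap);
    }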
Signed-off-by: Christoph Hellwig <[email protected]>
[Damien]
* Fixed comments and indentation in blkdev.h
* Changed helper functions
* Fixed this commit message
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A previous change blindly added massive alignment to the
call_single_data structure in struct request. This ballooned it in size
from 296 to 320 bytes on my setup, for no valid reason at all.
Use the unaligned struct __call_single_data variant instead.
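The member change amounts to a single line inside struct request
(sketch; surrounding members omitted):

    struct __call_single_data csd;  /* was: call_single_data_t csd (aligned) */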
Fixes: 966a967116e69 ("smp: Avoid using two cache lines for struct call_single_data")
Cc: [email protected] # v4.14
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
moved blk_queue_bounce() into blk_rq_append_bio(), but didn't consider
the fact that the bounced bio becomes invisible to the caller since the
parameter type is 'struct bio *'. Make it a pointer to a pointer to
a bio, so the caller also sees the right bio after a bounce.
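After the change the prototype and a typical caller look like this
(sketch; the bio_get() placement follows the note below about handling
blk_rq_append_bio() failure):

    int blk_rq_append_bio(struct request *rq, struct bio **bio);

    ret = blk_rq_append_bio(rq, &bio);
    if (ret)
        return ret;
    bio_get(bio);   /* only after blk_rq_append_bio() succeeded */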
Fixes: caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
Cc: Christoph Hellwig <[email protected]>
Reported-by: Michele Ballabio <[email protected]>
(handling failure of blk_rq_append_bio(), only call bio_get() after
blk_rq_append_bio() returns OK)
Tested-by: Michele Ballabio <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit a8821f3f3 ("block: Improvements to bounce-buffer handling") tries
to make sure that the bio passed to .make_request_fn won't exceed
BIO_MAX_PAGES, but ignores that passthrough I/O can use blk_queue_bounce()
too. In particular, passthrough I/O may not be sector-aligned, and the check
'sectors < bio_sectors(*bio_orig)' inside __blk_queue_bounce() may
become true even though the bvec count doesn't exceed BIO_MAX_PAGES,
which causes the bio to be split and the original passthrough bio to be
submitted to generic_make_request().
This patch fixes the issue by checking whether the bio is passthrough I/O
and using bio_kmalloc() to allocate the cloned passthrough bio.
Cc: NeilBrown <[email protected]>
Fixes: a8821f3f3("block: Improvements to bounce-buffer handling")
Tested-by: Michele Ballabio <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|