Age | Commit message (Collapse) | Author | Files | Lines |
|
Use a single switch to perform read and write specific checks and exit
early for other operations instead of having two checks using different
predicates.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Allocation failures already leave a stack trace, so don't spew another
warning.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
bio_integrity_add_page can add physically contiguous regions of any size,
so don't bother chunking up the kmalloced buffer.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The PI generation helpers already zero the app tag, so relax the zeroing
to non-PI metadata.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Sparse is a bit dumb about bitwise operation on __bitwise types used
in boolean contexts. Add a !! to explicitly propagate to boolean
without a warning.
Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Kent Overstreet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
1. Userspace sends the command "losetup -d" which uses the open() call
to open the device
2. Kernel receives the ioctl command "LOOP_CLR_FD" which calls the
function loop_clr_fd()
3. If LOOP_CLR_FD is the first command received at the time, then the
AUTOCLEAR flag is not set and deletion of the
loop device proceeds ahead and scans the partitions (drop/add
partitions)
if (disk_openers(lo->lo_disk) > 1) {
lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
loop_global_unlock(lo, true);
return 0;
}
4. Before scanning partitions, it will check to see if any partition of
the loop device is currently opened
5. If any partition is opened, then it will return EBUSY:
if (disk->open_partitions)
return -EBUSY;
6. So, after receiving the "LOOP_CLR_FD" command and just before the above
check for open_partitions, if any other command
(like blkid) opens any partition of the loop device, then the partition
scan will not proceed and EBUSY is returned as shown in above code
7. But in "__loop_clr_fd()", this EBUSY error is not propagated
8. We have noticed that this is causing the partitions of the loop to
remain stale even after the loop device is detached resulting in the
IO errors on the partitions
Fix:
Defer the detach of loop device to release function, which is called when
the last close happens, by setting the lo_flags to LO_FLAGS_AUTOCLEAR at
the time of detach i.e in loop_clr_fd() function.
Test case involves the following two scripts:
script1.sh:
while [ 1 ];
do
losetup -P -f /home/opt/looptest/test10.img
blkid /dev/loop0p1
done
script2.sh:
while [ 1 ];
do
losetup -d /dev/loop0
done
Without fix, the following IO errors have been observed:
kernel: __loop_clr_fd: partition scan of loop0 failed (rc=-16)
kernel: I/O error, dev loop0, sector 20971392 op 0x0:(READ) flags 0x80700
phys_seg 1 prio class 0
kernel: I/O error, dev loop0, sector 108868 op 0x0:(READ) flags 0x0
phys_seg 1 prio class 0
kernel: Buffer I/O error on dev loop0p1, logical block 27201, async page
read
Signed-off-by: Gulam Mohamed <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since commit 70200574cc22 ("block: remove QUEUE_FLAG_DISCARD"),
blk_queue_flag_test_and_set() has not been used, so delete it.
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
It is odd to check the offset amidst a series of assignments. Moving this
check to the beginning of the function makes the code look better.
Signed-off-by: Li Nan <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
For !CONFIG_BLK_DEV_INTEGRITY, rq_integrity_vec() wasn't updated
properly. Fix it up.
Fixes: cf546dd289e0 ("block: change rq_integrity_vec to respect the iterator")
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patch prepares for the introduction of a new iopolicy by breaking up
the nvme_find_path() code path into sub-routines.
Signed-off-by: John Meneghini <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
dma_pad_mask is a queue_limits by all ways of looking at it, so move it
there and set it through the atomic queue limits APIs.
Add a little helper that takes the alignment and pad into account to
simplify the code that is touched a bit.
Note that there never was any need for the > check in
blk_queue_update_dma_pad, this probably was just copy and paste from
dma_update_dma_alignment.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Now that all updates go through blk_validate_limits the default of 511
is set at initialization time. Also remove the unused NULL check as
calling this helper on a NULL queue can't happen (and doesn't make
much sense to start with).
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Mark blk_apply_bdi_limits non-static and open code disk_update_readahead
in the only caller.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
"static" never goes on a line of its own.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
... and let sparse help us catch mismatches or abuses.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
This is a flag for ->flags and not a feature for ->features. And fix the
one place that actually incorrectly cleared it from ->features.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Garry <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Check the features flag and the override flag using the
blk_queue_write_cache, helper otherwise we're going to always
report "write through".
Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The md driver wants to enforce a number of flags for all devices, even
when not inheriting them from the underlying devices. To make sure these
flags survive the queue_limits_set calls that md uses to update the
queue limits without deriving them form the previous limits add a new
md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
these flags.
Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
If we allocate a bio that is larger than NVMe maximum request size,
attach integrity metadata to it and send it to the NVMe subsystem, the
integrity metadata will be corrupted.
Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.
However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.
This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec reads the
iterator and uses it to build a bvec for the current position in the
iterator.
The "queue_max_integrity_segments(rq->q) > 1" check was removed, because
the updated rq_integrity_vec function works correctly with multiple
segments.
Signed-off-by: Mikulas Patocka <[email protected]>
Reviewed-by: Anuj Gupta <[email protected]>
Reviewed-by: Kanchan Joshi <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Scheduling reset_work after a nvme subsystem reset is expected to fail
on pcie, but this also prevents potential handling the platform's pcie
services may provide that might successfully recovering the link without
re-enumeration. Such examples include AER, DPC, and power's EEH.
Provide a pci specific operation that safely initiates a subsystem
reset, and instead of scheduling reset work, read back the status
register to trigger a pcie read error.
Since this only affects pci, the other fabrics drivers subscribe to a
generic nvmf subsystem reset that is exactly the same as before. The
loop fabric doesn't use it because nvmet doesn't support setting that
property anyway.
And since we're using the magic NSSR value in two places now, provide a
symbolic define for it.
Reported-by: Nilay Shroff <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Justin Tee <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address by
adding a callback 'host_traddr' for nvmet_fc_target_template.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
We want to display the transport address of the connected host
in debugfs, but this is a property of the transport.
So add a callback 'host_traddr' to allow the transport drivers
to fill in the data.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Add a debugfs hierarchy to display the configured subsystems
and the controllers attached to the subsystems.
Suggested-by: Redouane BOUFENGHOUR <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
I have graduated from SCUT. My old email is not available now.
Signed-off-by: Weiwen Hu <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
CDR/MORE/DNR fields are not belonging to SC in the NVMe spec, rename
them to NVME_STATUS_* to avoid confusion.
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Replaced some magic numbers about SC and SCT with enum and macro.
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
This should better match its semantic. "sc" is used in the NVMe spec to
specifically refer to the last 8 bits in the status field. We should not
reuse "sc" here.
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Combining both creates an ambiguous cleanup scenario for the caller if
an error is returned: does the device reference need to be dropped or
did the error occur before the device was initialized? If an error
occurs after the device is added, then the existing cleanup routines
will leak memory.
Furthermore, the nvme core is taking it upon itself to free the device's
kobj name under certain conditions rather than go through the core
device API. We shouldn't be peaking into these implementation details.
Split the device initialization from the addition to make it easier to
know the error handling actions, fix the existing memory leaks, and stop
the device layering violations.
Link: https://lore.kernel.org/linux-nvme/[email protected]/
Tested-by: Yi Zhang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme fc driver's error handling had different returns
in the error goto label's, which harm readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme rdma driver's error handling had different returns
in the error goto label's, which harm readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme tcp driver's error handling had different returns
in the error goto label's, which harm readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The apple driver had been doing this wrong, leaking the
controller device memory on a tagset failure.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Take care of the inverse polarity of the BLK_FEAT_ROTATIONAL flag
vs the old nonrot helper.
Fixes: bd4a633b6f7c ("block: move the nonrot flag to queue_limits")
Reported-by: kernel test robot <[email protected]>
Reported-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
make allmodconfig && make W=1 C=1 reports:
modpost: missing MODULE_DESCRIPTION() in drivers/block/brd.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
Signed-off-by: Jeff Johnson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/cdrom/cdrom.o
Add the missing MODULE_DESCRIPTION() macro invocation.
Signed-off-by: Jeff Johnson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Reviewed-by: Phillip Potter <[email protected]>
Link: https://lore.kernel.org/lkml/ZluYQbvrJkRlhnJC@KernelVM
Signed-off-by: Phillip Potter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
For arm32, we get the following build warning:
In file included from /tmp/next/build/include/linux/printk.h:10,
from /tmp/next/build/include/linux/kernel.h:31,
from /tmp/next/build/block/blk-settings.c:5:
/tmp/next/build/block/blk-settings.c: In function 'blk_validate_atomic_write_limits':
/tmp/next/build/include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast
222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
| ^~
The divident for do_div() should be 64b, which it is not. Since we want to
check 2x unsigned ints, just use % operator. This allows us to drop the
chunk_sectors variable.
Fixes: 9da3d1e912f3 ("block: Add core atomic write support")
Reported-by: Mark Brown <[email protected]>
Closes: https://lore.kernel.org/linux-next/[email protected]/T/#mbf067b1edd89c7f9d7dac6e258c516199953a108
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There is no need to conditionally define on CONFIG_BLK_DEV_ZONED the
inline helper functions bdev_nr_zones(), bdev_max_open_zones(),
bdev_max_active_zones() and disk_zone_no() as these function will return
the correct valu in all cases (zoned device or not, including when
CONFIG_BLK_DEV_ZONED is not set). Furthermore, disk_nr_zones()
definition can be simplified as disk->nr_zones is always 0 for regular
block devices.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There is no need for bdev_nr_zones() to be an exported function
calculating the number of zones of a block device. Instead, given that
all callers use this helper with a fully initialized block device that
has a gendisk, we can redefine this function as an inline helper in
blkdev.h.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
In null_register_zoned_dev(), there is no need to set disk->nr_zones as
the now uncoditional call to blk_revalidate_disk_zones() will do that.
So remove the assignment using bdev_nr_zones().
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.
NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.
Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.
NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.
Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.
If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)
and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF
If in the unlikely scenario that NABO is non-zero, then atomic writes will
not be supported at all as dealing with this adds extra complexity. This
policy may change in future.
In all cases, atomic_write_unit_min is set to the logical block size.
If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)
and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0
A new function, nvme_valid_atomic_write(), is also called from submission
path to verify that a request has been submitted to the driver will
actually be executed atomically. As mentioned, there is no dedicated NVMe
atomic write command (which may error for a command which exceeds the
controller atomic write limits).
Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does tell to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.
Signed-off-by: Alan Adamson <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
jpg: total rewrite
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add initial support for atomic writes.
As is standard method, feed device properties via modules param, those
being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks
These just match sbc4r22 section 6.6.4 - Block limits VPD page.
We just support ATOMIC WRITE (16).
The major change in the driver is how we lock the device for RW accesses.
Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.
This should not suit verifying atomic writes. Reason being that currently
all reads/writes are atomic, so using atomic writes does not prove
anything.
Change device access model to basis that regular writes only atomic on a
per-sector basis, while reads and atomic writes are fully atomic.
As mentioned, since accessing metadata and device media is atomic,
continue to have regular writes involving metadata - like discard or PI -
as atomic. We can improve this later.
Currently we only support model where overlapping going reads or writes
wait for current access to complete before commencing an atomic write.
This is described in 4.29.3.2 section of the SBC. However, we simplify,
things and wait for all accesses to complete (when issuing an atomic
write).
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Support is divided into two main areas:
- reading VPD pages and setting sdev request_queue limits
- support WRITE ATOMIC (16) command and tracing
The relevant block limits VPD page need to be read to allow the block layer
request_queue atomic write limits to be set. These VPD page limits are
described in sbc4r22 section 6.6.4 - Block limits VPD page.
There are five limits of interest:
- MAXIMUM ATOMIC TRANSFER LENGTH
- ATOMIC ALIGNMENT
- ATOMIC TRANSFER LENGTH GRANULARITY
- MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
- MAXIMUM ATOMIC BOUNDARY SIZE
MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
(16) command. It will not be greater than the device MAXIMUM TRANSFER
LENGTH.
ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
alignment and length values for an atomic write in terms of logical blocks.
Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
a per-IO boundary granularity. The maximum boundary size is specified in
MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
command can be found in sbc4r22 section 5.48. This boundary value is the
granularity size at which the device may atomically write the data. A value
of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
be atomically written together.
MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
length if a non-zero boundary value is set.
For atomic write support, the WRITE ATOMIC (16) boundary is not of much
interest, as the block layer expects each request submitted to be executed
atomically. However, the SCSI spec does leave itself open to a quirky
scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
non-zero. This case will be supported.
To set the block layer request_queue atomic write capabilities, sanitize
the VPD page limits and set limits as follows:
- atomic_write_unit_min is derived from granularity and alignment values.
If no granularity value is not set, use physical block size
- atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
set for this scenario.
- atomic_write_boundary_bytes is set to zero always
SCSI also supports a WRITE ATOMIC (32) command, which is for type 2
protection enabled. This is not going to be supported now, so check for
T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
To handle an atomic write request, add support for WRITE ATOMIC (16)
command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
is checked here for encoding ATOMIC BOUNDARY field.
Trace info is also added for WRITE_ATOMIC_16 command.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Support atomic writes by submitting a single BIO with the REQ_ATOMIC set.
It must be ensured that the atomic write adheres to its rules, like
naturally aligned offset, so call blkdev_dio_invalid() ->
blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to
blkdev_dio_invalid()] for this purpose. The BIO submission path currently
checks for atomic writes which are too large, so no need to check here.
In blkdev_direct_IO(), if the nr_pages exceeds BIO_MAX_VECS, then we cannot
produce a single BIO, so error in this case.
Finally set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes
and the associated file flag is for O_DIRECT.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Extend statx system call to return additional info for atomic write support
support if the specified file is a block device.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Prasad Singamsetty <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
indicates an LBA space boundary at which an atomic write straddles no
longer is atomically executed by the disk. The value must be a
power-of-2. Note that it would be acceptable to enforce a rule that
atomic_write_hw_boundary_sectors is a multiple of
atomic_write_hw_unit_max, but the resultant code would be more
complicated.
All atomic writes limits are by default set 0 to indicate no atomic write
support. Even though it is assumed by Linux that a logical block can always
be atomically written, we ignore this as it is not of particular interest.
Stacked devices are just not supported either for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated by request_queue max segments and the number
of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first+last, which each
can fit logical block size of data. The first+last will be LBS
length/aligned as we rely on direct IO alignment rules also.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Himanshu Madhani <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Extend statx system call to return additional info for atomic write support
support for a file.
Helper function generic_fill_statx_atomic_writes() can be used by FSes to
fill in the relevant statx fields. For now atomic_write_segments_max will
always be 1, otherwise some rules would need to be imposed on iovec length
and alignment, which we don't want now.
Signed-off-by: Prasad Singamsetty <[email protected]>
jpg: relocate bdev support to another patch
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|