|
If the call to nvme_auth_augmented_challenge() fails, or the kmalloc
for shash fails, we should free the memory allocated for challenge, so
add an error path out_free_challenge to fix the memory leak.
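A minimal, self-contained sketch of the resulting unwind order
(augment(), compute() and SHASH_LEN are stand-ins, not the patched
nvmet code):
        challenge = kmalloc(SHASH_LEN, GFP_KERNEL);
        if (!challenge)
                return -ENOMEM;
        ret = augment(challenge);        /* stand-in for nvme_auth_augmented_challenge() */
        if (ret)
                goto out_free_challenge; /* previously this path leaked 'challenge' */
        shash = kmalloc(SHASH_LEN, GFP_KERNEL);
        if (!shash) {
                ret = -ENOMEM;
                goto out_free_challenge; /* ... and so did this one */
        }
        ret = compute(shash, challenge); /* hypothetical hash work */
        kfree(shash);
out_free_challenge:
        kfree(challenge);
        return ret;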
Fixes: 7a277c37d352 ("nvmet-auth: Diffie-Hellman key exchange support")
Signed-off-by: Gaosheng Cui <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement the get_unique_id method to allow pNFS SCSI layout access to
NVMe namespaces.
This is the server side implementation of RFC 9561 "Using the Parallel
NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe)
Storage Devices".
Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
The round-robin path selector is inefficient in cases where there is a
difference in latency between paths. In the presence of one or more
high latency paths the round-robin selector continues to use the high
latency path equally. This results in a bias towards the highest latency
path and can cause a significant decrease in overall performance as
I/Os pile up on the highest latency path. This problem is acute with NVMe-oF
controllers.
The queue-depth path selector sends I/O down the path with the lowest
number of requests in its request queue. Paths with lower latency will
clear requests more quickly and have fewer requests queued compared to
higher latency paths. The goal of this path selector is to make more use
of lower latency paths which will bring down overall IO latency and
increase throughput and performance.
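The core selection loop reduces to something like the following sketch
(simplified: the real selector also honors ANA optimized/non-optimized
state and skips disabled paths; nr_active is a per-controller atomic
the patch maintains around each I/O):
        static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
        {
                struct nvme_ns *ns, *best = NULL;
                unsigned int min_depth = UINT_MAX;

                list_for_each_entry_rcu(ns, &head->list, siblings) {
                        unsigned int depth = atomic_read(&ns->ctrl->nr_active);

                        if (depth < min_depth) {
                                min_depth = depth;
                                best = ns;
                        }
                }
                return best;    /* fewest requests in flight wins */
        }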
Signed-off-by: Thomas Song <[email protected]>
[emilne: commandeered patch developed by Thomas Song @ Pure Storage]
Co-developed-by: Ewan D. Milne <[email protected]>
Signed-off-by: Ewan D. Milne <[email protected]>
Co-developed-by: John Meneghini <[email protected]>
Signed-off-by: John Meneghini <[email protected]>
Link: https://lore.kernel.org/linux-nvme/[email protected]/
Tested-by: Marco Patalano <[email protected]>
Tested-by: Jyoti Rani <[email protected]>
Tested-by: John Meneghini <[email protected]>
Reviewed-by: Randy Jennings <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
This patch prepares for the introduction of a new iopolicy by breaking up
the nvme_find_path() code path into sub-routines.
Signed-off-by: John Meneghini <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Scheduling reset_work after an nvme subsystem reset is expected to fail
on pcie, but this also prevents any handling the platform's pcie
services may provide that might successfully recover the link without
re-enumeration. Examples include AER, DPC, and Power's EEH.
Provide a pci specific operation that safely initiates a subsystem
reset, and instead of scheduling reset work, read back the status
register to trigger a pcie read error.
Since this only affects pci, the other fabrics drivers subscribe to a
generic nvmf subsystem reset that is exactly the same as before. The
loop fabric doesn't use it because nvmet doesn't support setting that
property anyway.
And since we're using the magic NSSR value in two places now, provide a
symbolic define for it.
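Roughly, the pci variant becomes (simplified sketch; locking and state
checks omitted):
        #define NVME_SUBSYS_RESET       0x4E564D65      /* "NVMe", the NSSR magic */

        static int nvme_pci_subsystem_reset(struct nvme_ctrl *ctrl)
        {
                struct nvme_dev *dev = to_nvme_dev(ctrl);

                writel(NVME_SUBSYS_RESET, dev->bar + NVME_REG_NSSR);

                /*
                 * Read back a register: if the reset took the link down,
                 * this surfaces as a pcie read error for AER/DPC/EEH to
                 * handle instead of scheduling reset_work.
                 */
                readl(dev->bar + NVME_REG_CSTS);
                return 0;
        }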
Reported-by: Nilay Shroff <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Justin Tee <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement the 'host_traddr' callback to display the host transport
address for nvmet debugfs.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address by
adding a callback 'host_traddr' for nvmet_fc_target_template.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Implement callback to display the host transport address.
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
We want to display the transport address of the connected host
in debugfs, but this is a property of the transport.
So add a callback 'host_traddr' to allow the transport drivers
to fill in the data.
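The hook is a new member of the target fabrics ops. A sketch of its
shape and a typical IP-based implementation (prototype approximate;
ctrl_to_queue() is a hypothetical accessor):
        /* new member of struct nvmet_fabrics_ops */
        void (*host_traddr)(struct nvmet_ctrl *ctrl, char *traddr, size_t traddr_len);

        /* e.g. an IP transport can format the peer sockaddr */
        static void nvmet_tcp_host_traddr(struct nvmet_ctrl *ctrl,
                        char *traddr, size_t traddr_len)
        {
                struct nvmet_tcp_queue *queue = ctrl_to_queue(ctrl);

                snprintf(traddr, traddr_len, "%pISc", &queue->sockaddr_peer);
        }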
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Add a debugfs hierarchy to display the configured subsystems
and the controllers attached to the subsystems.
Suggested-by: Redouane BOUFENGHOUR <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Daniel Wagner <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
I have graduated from SCUT. My old email is no longer available.
Signed-off-by: Weiwen Hu <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
The CRD/MORE/DNR fields do not belong to the SC (status code) field in
the NVMe spec, so rename them to NVME_STATUS_* to avoid confusion.
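For reference, the field layout the new names follow, after the CQE
status is shifted right by one to drop the phase bit (mask values
approximate include/linux/nvme.h):
        /*
         *   bits  7:0   SC    -> NVME_SC_MASK     (0x00ff)
         *   bits 10:8   SCT   -> NVME_SCT_MASK    (0x0700)
         *   bits 12:11  CRD
         *   bit  13     MORE  -> NVME_STATUS_MORE (0x2000)
         *   bit  14     DNR   -> NVME_STATUS_DNR  (0x4000)
         */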
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Replace some magic numbers for SC and SCT with enums and macros.
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
This better matches its semantics: "sc" is used in the NVMe spec to
refer specifically to the last 8 bits of the status field, so we should
not reuse "sc" here.
Signed-off-by: Weiwen Hu <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Combining both creates an ambiguous cleanup scenario for the caller if
an error is returned: does the device reference need to be dropped or
did the error occur before the device was initialized? If an error
occurs after the device is added, then the existing cleanup routines
will leak memory.
Furthermore, the nvme core is taking it upon itself to free the device's
kobj name under certain conditions rather than going through the core
device API. We shouldn't be peeking into these implementation details.
Split the device initialization from the addition to make it easier to
know the error handling actions, fix the existing memory leaks, and stop
the device layering violations.
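In outline, the pattern this moves to is the generic driver-core idiom
(sketch, not the exact nvme hunk; nvme_free_ctrl_dev is a stand-in
release callback):
        device_initialize(&ctrl->device);          /* refcount live from here on */
        ctrl->device.release = nvme_free_ctrl_dev; /* release(), not the caller, frees */

        ret = dev_set_name(&ctrl->device, "nvme%d", instance);
        if (ret)
                goto out_put;
        ret = device_add(&ctrl->device);
        if (ret)
                goto out_put;
        return 0;
out_put:
        /* drops the ref; the kobj name and release() handle all freeing */
        put_device(&ctrl->device);
        return ret;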
Link: https://lore.kernel.org/linux-nvme/[email protected]/
Tested-by: Yi Zhang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme fc driver's error handling had different returns
in the error goto labels, which harms readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme rdma driver's error handling had different returns
in the error goto labels, which harms readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The nvme tcp driver's error handling had different returns
in the error goto labels, which harms readability.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl.
Split the allocation side out to make the error handling boundary easier
to navigate. The apple driver had been doing this wrong, leaking the
controller device memory on a tagset failure.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Take care of the inverse polarity of the BLK_FEAT_ROTATIONAL flag
vs the old nonrot helper.
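The flip, in one line each (feature-flag spelling per the 6.11
queue_limits work):
        /* before: flag asserted "not rotational" */
        if (blk_queue_nonrot(q))
                ...
        /* after: the feature asserts "rotational", so the test inverts */
        if (!(q->limits.features & BLK_FEAT_ROTATIONAL))
                ...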
Fixes: bd4a633b6f7c ("block: move the nonrot flag to queue_limits")
Reported-by: kernel test robot <[email protected]>
Reported-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
make allmodconfig && make W=1 C=1 reports:
modpost: missing MODULE_DESCRIPTION() in drivers/block/brd.o
Add the missing invocation of the MODULE_DESCRIPTION() macro.
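The fix is a one-liner of this form (the exact description string may
differ from the committed one):
        MODULE_DESCRIPTION("Ram backed block device driver");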
Signed-off-by: Jeff Johnson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
make allmodconfig && make W=1 C=1 reports:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/cdrom/cdrom.o
Add the missing MODULE_DESCRIPTION() macro invocation.
Signed-off-by: Jeff Johnson <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Reviewed-by: Phillip Potter <[email protected]>
Link: https://lore.kernel.org/lkml/ZluYQbvrJkRlhnJC@KernelVM
Signed-off-by: Phillip Potter <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
For arm32, we get the following build warning:
In file included from /tmp/next/build/include/linux/printk.h:10,
from /tmp/next/build/include/linux/kernel.h:31,
from /tmp/next/build/block/blk-settings.c:5:
/tmp/next/build/block/blk-settings.c: In function 'blk_validate_atomic_write_limits':
/tmp/next/build/include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast
222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
| ^~
The dividend for do_div() should be 64-bit, which it is not. Since we want
to check two unsigned ints, just use the % operator. This allows us to drop
the chunk_sectors variable.
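In shorthand, with 'a' and 'b' standing in for the two unsigned int
limits being checked:
        u64 n = a;              /* do_div() demands a 64-bit lvalue dividend */
        if (do_div(n, b))       /* before: warns when the dividend is 32-bit on arm32 */
                ...
        if (a % b)              /* after: plain modulo on two unsigned ints */
                ...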
Fixes: 9da3d1e912f3 ("block: Add core atomic write support")
Reported-by: Mark Brown <[email protected]>
Closes: https://lore.kernel.org/linux-next/[email protected]/T/#mbf067b1edd89c7f9d7dac6e258c516199953a108
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There is no need to conditionally define the inline helper functions
bdev_nr_zones(), bdev_max_open_zones(), bdev_max_active_zones() and
disk_zone_no() on CONFIG_BLK_DEV_ZONED, as these functions return the
correct value in all cases (zoned device or not, including when
CONFIG_BLK_DEV_ZONED is not set). Furthermore, the disk_nr_zones()
definition can be simplified as disk->nr_zones is always 0 for regular
block devices.
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
There is no need for bdev_nr_zones() to be an exported function
calculating the number of zones of a block device. Instead, given that
all callers use this helper with a fully initialized block device that
has a gendisk, we can redefine this function as an inline helper in
blkdev.h.
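The helper then reduces to roughly:
        static inline unsigned int bdev_nr_zones(struct block_device *bdev)
        {
                return disk_nr_zones(bdev->bd_disk);
        }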
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
In null_register_zoned_dev(), there is no need to set disk->nr_zones as
the now unconditional call to blk_revalidate_disk_zones() will do that.
So remove the assignment that used bdev_nr_zones().
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.
NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) classes of parameters. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.
Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.
NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.
Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.
If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)
and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF
In the unlikely scenario that NABO is non-zero, atomic writes will not be
supported at all, as dealing with this adds extra complexity. This policy
may change in the future.
In all cases, atomic_write_unit_min is set to the logical block size.
If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)
and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0
A new function, nvme_valid_atomic_write(), is also called from the
submission path to verify that a request submitted to the driver will
actually be executed atomically. As mentioned, there is no dedicated NVMe
atomic write command (which could error for a command which exceeds the
controller atomic write limits).
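A sketch of what that check amounts to (accessor names follow the block
layer atomic write helpers; simplified from the patch):
        static bool nvme_valid_atomic_write(struct request *req)
        {
                struct request_queue *q = req->q;
                u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);

                if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
                        return false;

                if (boundary_bytes) {
                        u64 mask = boundary_bytes - 1;
                        u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
                        u64 end = start + blk_rq_bytes(req) - 1;

                        /* reject a write straddling an atomic boundary (NABSPF) */
                        if ((start & ~mask) != (end & ~mask))
                                return false;
                }
                return true;
        }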
Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
when NSFEAT bit 1 is unset. Figure 97 does not explicitly mention NABSPF
or how it is affected by bit 1. However, Figure 4 does say to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. In any case, nvme_update_disk_info() currently checks the
namespace parameter NABO regardless of this bit.
Signed-off-by: Alan Adamson <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
jpg: total rewrite
Signed-off-by: John Garry <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add initial support for atomic writes.
As is the standard method, feed device properties via module params, those
being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks
These just match sbc4r22 section 6.6.4 - Block limits VPD page.
We just support ATOMIC WRITE (16).
The major change in the driver is how we lock the device for RW accesses.
Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.
This does not suit verifying atomic writes: currently all reads/writes
are atomic, so using atomic writes proves nothing.
Change the device access model such that regular writes are only atomic
on a per-sector basis, while reads and atomic writes are fully atomic.
As mentioned, since accessing metadata and device media is atomic,
continue to have regular writes involving metadata - like discard or PI -
as atomic. We can improve this later.
Currently we only support a model where overlapping reads or writes
wait for the current access to complete before commencing an atomic
write. This is described in section 4.29.3.2 of the SBC. However, we
simplify things and wait for all accesses to complete (when issuing an
atomic write).
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Support is divided into two main areas:
- reading VPD pages and setting sdev request_queue limits
- supporting the WRITE ATOMIC (16) command and tracing
The relevant block limits VPD page needs to be read to allow the block layer
request_queue atomic write limits to be set. These VPD page limits are
described in sbc4r22 section 6.6.4 - Block limits VPD page.
There are five limits of interest:
- MAXIMUM ATOMIC TRANSFER LENGTH
- ATOMIC ALIGNMENT
- ATOMIC TRANSFER LENGTH GRANULARITY
- MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
- MAXIMUM ATOMIC BOUNDARY SIZE
MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
(16) command. It will not be greater than the device MAXIMUM TRANSFER
LENGTH.
ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
alignment and length values for an atomic write in terms of logical blocks.
Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
a per-IO boundary granularity. The maximum boundary size is specified in
MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
command can be found in sbc4r22 section 5.48. This boundary value is the
granularity size at which the device may atomically write the data. A value
of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
be atomically written together.
MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
length if a non-zero boundary value is set.
For atomic write support, the WRITE ATOMIC (16) boundary is not of much
interest, as the block layer expects each request submitted to be executed
atomically. However, the SCSI spec does leave itself open to a quirky
scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
non-zero. This case will be supported.
To set the block layer request_queue atomic write capabilities, sanitize
the VPD page limits and set limits as follows:
- atomic_write_unit_min is derived from granularity and alignment values.
If no granularity value is set, use the physical block size
- atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
set for this scenario.
- atomic_write_boundary_bytes is always set to zero
SCSI also supports a WRITE ATOMIC (32) command, which is for devices with
type 2 protection enabled. This is not going to be supported for now, so check for
T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
To handle an atomic write request, add support for WRITE ATOMIC (16)
command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
is checked here for encoding ATOMIC BOUNDARY field.
Trace info is also added for WRITE_ATOMIC_16 command.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Support atomic writes by submitting a single BIO with the REQ_ATOMIC set.
It must be ensured that the atomic write adheres to its rules, like a
naturally aligned offset, so call blkdev_dio_invalid() ->
blkdev_atomic_write_valid() [with renaming blkdev_dio_unaligned() to
blkdev_dio_invalid()] for this purpose. The BIO submission path currently
checks for atomic writes which are too large, so no need to check here.
In blkdev_direct_IO(), if nr_pages exceeds BIO_MAX_VECS, then we cannot
produce a single BIO, so return an error in this case.
Finally, set FMODE_CAN_ATOMIC_WRITE when the bdev can support atomic writes
and the file is opened with O_DIRECT.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Extend the statx system call to return additional info for atomic write
support if the specified file is a block device.
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Prasad Singamsetty <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add atomic write support, as follows:
- add helper functions to get request_queue atomic write limits
- report request_queue atomic write support limits to sysfs and update Doc
- support to safely merge atomic writes
- deal with splitting atomic writes
- misc helper functions
- add a per-request atomic write flag
New request_queue limits are added, as follows:
- atomic_write_hw_max is set by the block driver and is the maximum length
of an atomic write which the device may support. It is not
necessarily a power-of-2.
- atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
and atomic_write_max_sectors would be the limit on a merged atomic write
request size. This value is not capped at max_sectors, as the value in
max_sectors can be controlled from userspace, and it would only cause
trouble if userspace could limit atomic_write_unit_max_bytes and the
other atomic write limits.
- atomic_write_hw_unit_{min,max} are set by the block driver and are the
min/max length of an atomic write unit which the device may support. They
both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
the same value as atomic_write_hw_max.
- atomic_write_unit_{min,max} are derived from
atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
Both min and max values must be a power-of-2.
- atomic_write_hw_boundary is set by the block driver. If non-zero, it
  indicates an LBA space boundary; an atomic write which straddles it is
  no longer executed atomically by the disk. The value must be a
  power-of-2. Note that it would be acceptable to enforce a rule that
  atomic_write_hw_boundary_sectors is a multiple of
  atomic_write_hw_unit_max, but the resultant code would be more
  complicated.
All atomic write limits default to 0, indicating no atomic write support.
Even though Linux assumes that a logical block can always be atomically
written, we ignore this as it is not of particular interest. Stacked
devices are simply not supported for now.
An atomic write must always be submitted to the block driver as part of a
single request. As such, only a single BIO must be submitted to the block
layer for an atomic write. When a single atomic write BIO is submitted, it
cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
by the maximum guaranteed BIO size which will not be required to be split.
This max size is calculated from the request_queue max segments and the
number of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
segment containing PAGE_SIZE of data, apart from the first and last
segments, which can each fit a logical block size of data. The first and
last segments will be LBS length/aligned as we also rely on direct IO
alignment rules.
New sysfs files are added to report the following atomic write limits:
- atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
bytes
- atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
bytes
- atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
bytes
- atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
Atomic writes may only be merged with other atomic writes and only under
the following conditions:
- total resultant request length <= atomic_write_max_bytes
- the merged write does not straddle a boundary
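A condensed sketch of that merge gate, covering the two conditions above
(helper name hypothetical; the real checks live in blk-merge.c):
        static bool atomic_write_mergeable(struct request *rq, struct request *next)
        {
                unsigned int total = blk_rq_bytes(rq) + blk_rq_bytes(next);

                /* atomic and non-atomic requests never mix */
                if ((rq->cmd_flags & REQ_ATOMIC) != (next->cmd_flags & REQ_ATOMIC))
                        return false;
                /* condition 1: merged length within atomic_write_max_bytes */
                if (total > queue_atomic_write_max_bytes(rq->q))
                        return false;
                /* condition 2: the merged range must not straddle the boundary
                 * (boundary arithmetic omitted here for brevity) */
                return true;
        }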
Helper function bdev_can_atomic_write() is added to indicate whether
atomic writes may be issued to a bdev. If a bdev is a partition, the
partition start must be aligned with both atomic_write_unit_min_sectors
and atomic_write_hw_boundary_sectors.
FSes will rely on the block layer to validate that an atomic write BIO
submitted will be of valid size, so add blk_validate_atomic_write_op_size()
for this purpose. Userspace expects an atomic write which is of invalid
size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
invalid size BIO.
Flag REQ_ATOMIC is used for indicating an atomic write.
Co-developed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Himanshu Madhani <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Extend the statx system call to return additional info for atomic write
support for a file.
Helper function generic_fill_statx_atomic_writes() can be used by FSes to
fill in the relevant statx fields. For now atomic_write_segments_max will
always be 1, otherwise some rules would need to be imposed on iovec length
and alignment, which we don't want now.
Signed-off-by: Prasad Singamsetty <[email protected]>
jpg: relocate bdev support to another patch
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
An atomic write is a write issued with torn-write protection, meaning
that in the event of a power failure or any other hardware failure, either
all or none of the data from the write will be stored, but never a mix of
old and new data.
Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.
For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.
A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max
Both min and max values must be a power-of-2.
Applications can avail of the atomic write feature by ensuring that the
total length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
with respect to the total write length. The value in
atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt.
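From userspace the contract looks like this sketch (assumes glibc's
statx wrapper and headers new enough to define STATX_WRITE_ATOMIC and
RWF_ATOMIC; device path is an example, error handling trimmed):
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdlib.h>
        #include <sys/stat.h>
        #include <sys/uio.h>

        int main(void)
        {
                struct statx stx;
                int fd = open("/dev/sda", O_WRONLY | O_DIRECT);

                statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx);

                /* pick a compliant length: power-of-2, within [unit_min, unit_max] */
                size_t len = stx.stx_atomic_write_unit_min;
                struct iovec iov = {
                        .iov_base = aligned_alloc(len, len),    /* O_DIRECT alignment */
                        .iov_len  = len,
                };

                /* offset 0 is naturally aligned for any power-of-2 length */
                return pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0;
        }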
Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.
Add a type argument to kiocb_set_rw_flags() to allow reads which have
RWF_ATOMIC set to be rejected.
Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check that the iov_iter type is ubuf, which
implies iovcnt==1 for pwritev2(), which is an initial restriction for
atomic_write_segments_max. Initially the only user will be bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic writes sizes yet.
Signed-off-by: Prasad Singamsetty <[email protected]>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The purpose of the chunk_sectors limit is to ensure that a mergeable
request fits within the boundary of the chunk_sectors value.
Such a feature will be useful for other request_queue boundary limits, so
generalize the chunk_sectors merge code.
This idea was proposed by Hannes Reinecke.
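The generalized helper ends up looking roughly like:
        static inline unsigned int blk_boundary_sectors_left(sector_t offset,
                        unsigned int boundary_sectors)
        {
                if (unlikely(!is_power_of_2(boundary_sectors)))
                        return boundary_sectors - sector_div(offset, boundary_sectors);
                return boundary_sectors - (offset & (boundary_sectors - 1));
        }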
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently blk_queue_get_max_sectors() is passed an enum req_op. In future
the value returned from blk_queue_get_max_sectors() may depend on certain
request flags, so pass a request pointer.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Acked-by: Darrick J. Wong <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Merge in queue limits cleanups.
* for-6.11/block-limits:
block: move the raid_partial_stripes_expensive flag into the features field
block: remove the discard_alignment flag
block: move the misaligned flag into the features field
block: renumber and rename the cache disabled flag
block: fix spelling and grammar for in writeback_cache_control.rst
block: remove the unused blk_bounce enum
|
|
`blk_queue_flag_set` and `blk_queue_flag_clear` were removed in favor of a
new API. This caused a build error for the Rust block device abstractions.
Thus, use the new feature passing API instead of the old removed API.
Fixes: bd4a633b6f7c ("block: move the nonrot flag to queue_limits")
Signed-off-by: Andreas Hindborg <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the raid_partial_stripes_expensive flag into the features field to
reclaim a little bit of space.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
queue_limits.discard_alignment is never read except in the places
where it is stacked into another limit.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the misaligned flag into the features field to reclaim a little
bit of space.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Start with the first bit, and drop the plural-S from the name.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Suggested-by: Damien Le Moal <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
The enum has been replaced with the BLK_FEAT_BOUNCE_HIGH flag.
Reported-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Merge in last round of queue limits changes from Christoph.
* for-6.11/block-limits: (26 commits)
block: move the bounce flag into the features field
block: move the skip_tagset_quiesce flag to queue_limits
block: move the pci_p2pdma flag to queue_limits
block: move the zone_resetall flag to queue_limits
block: move the zoned flag into the features field
block: move the poll flag to queue_limits
block: move the dax flag to queue_limits
block: move the nowait flag to queue_limits
block: move the synchronous flag to queue_limits
block: move the stable_writes flag to queue_limits
block: move the io_stat flag setting to queue_limits
block: move the add_random flag to queue_limits
block: move the nonrot flag to queue_limits
block: move cache control settings out of queue->flags
block: remove blk_flush_policy
block: freeze the queue in queue_attr_store
nbd: move setting the cache control flags to __nbd_set_size
virtio_blk: remove virtblk_update_cache_mode
loop: fold loop_update_rotational into loop_reconfigure_limits
loop: also use the default block size from an underlying block device
...
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the bounce flag into the features field to reclaim a little bit of
space.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the skip_tagset_quiesce flag into the queue_limits feature field so
that it can be set atomically with the queue frozen.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Move the pci_p2pdma flag into the queue_limits feature field so that it
can be set atomically with the queue frozen.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|