aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-04nvme: split out a nvme_identify_ns_nvm helperChristoph Hellwig1-12/+26
Split the logic to query the Identify Namespace Data Structure, NVM Command Set into a separate helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move common logic into nvme_update_ns_infoChristoph Hellwig1-42/+42
nvme_update_ns_info_generic and nvme_update_ns_info_block share a fair amount of logic related to not fully supported namespace formats and updating the multipath information. Move this logic into the common caller. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move setting the write cache flags out of nvme_set_queue_limitsChristoph Hellwig1-3/+2
nvme_set_queue_limits is used on the admin queue and all gendisks including hidden ones that don't support block I/O. The write cache setting on the other hand only makes sense for block I/O. Move the blk_queue_write_cache call to nvme_update_ns_info_block instead. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move a few things out of nvme_update_disk_infoChristoph Hellwig1-20/+26
Move setting up the integrity profile and setting the disk capacity out of nvme_update_disk_info to get nvme_update_disk_info into a shape where it just sets queue_limits eventually. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: don't use nvme_update_disk_info for the multipath diskChristoph Hellwig1-1/+2
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for the namespace attached disk, and the multipath one (if it exists). This is very different from how other stacking drivers work, and leads to a lot of complexity. Switch to setting the disk capacity and initializing the integrity profile, and let blk_stack_limits which already is called just below deal with updating the other limits. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move blk_integrity_unregister into nvme_init_integrityChristoph Hellwig1-2/+2
Move uneregistering the existing integrity profile into the helper dealing with all the other integrity / metadata setup. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: cleanup the nvme_init_integrity calling conventionsChristoph Hellwig1-14/+15
Handle the no metadata support case in nvme_init_integrity as well to simplify the calling convention and prepare for future changes in the area. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move max_integrity_segments handling out of nvme_init_integrityChristoph Hellwig1-7/+4
max_integrity_segments is just a hardware limit and doesn't need to be in nvme_init_integrity with the PI setup. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: remove nvme_revalidate_zonesChristoph Hellwig3-12/+3
Handle setting the zone size / chunk_sectors and max_append_sectors limits together with the other ZNS limits, and just open code the call to blk_revalidate_zones in the current place. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discardChristoph Hellwig1-5/+6
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of nvme_config_discard so that it is combined with the normal write_zeroes limit handling. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: set max_hw_sectors unconditionallyChristoph Hellwig1-8/+8
All transports set a max_hw_sectors value in the nvme_ctrl, so make the code using it unconditional and clean it up using a little helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Reviewed-by: John Garry <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-fabrics: check max outstanding commandsGuixin Liu1-0/+5
Maxcmd is mandatory for fabrics, check it early to identify the root cause instead of waiting for it to propagate to "sqsize" and "allocing queue". By the way, change nvme_check_ctrl_fabric_info() to nvmf_validate_identify_ctrl(). Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Guixin Liu <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet-rdma: set max_queue_size for RDMA transportMax Gurtovoy2-1/+10
A new port configuration was added to set max_queue_size. Clamp user configuration to RDMA transport limits. Increase the maximal queue size of RDMA controllers from 128 to 256 (the default size stays 128 same as before). Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: introduce new max queue size configuration entryMax Gurtovoy3-3/+46
Using this port configuration, one will be able to set the maximal queue size to be used for any controller that will be associated to the configured port. The default value stayed 1024 but each transport will be able to set the its own values before enabling the port. Introduce lower limit of 16 for minimal queue depth (same as we use in the host fabrics drivers). Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Guixin Liu <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: clamp queue size according to ctrl capMax Gurtovoy1-4/+10
If a controller is configured with metadata support, clamp the maximal queue size to be 128 since there are more resources that are needed for metadata operations. Otherwise, clamp it to 256. Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definitionMax Gurtovoy2-1/+4
This definition will be used by controllers that are configured with metadata support. For now, both regular and metadata controllers have the same maximal queue size but later commit will increase the maximal queue size for regular RDMA controllers to 256. We'll keep the maximal queue size for metadata controllers to be 128 since there are more resources that are needed for metadata operations and 128 is the optimal size found for metadata controllers base on testing. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: set ctrl pi_support cap before initializing cap regMax Gurtovoy2-2/+1
This is a preparation for setting the maximal queue size of a controller that supports PI. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: set maxcmd to be per controllerMax Gurtovoy4-4/+4
This is a preparation for having a dynamic configuration of max queue size for a controller. Make sure that the maxcmd field stays the same as the MQES (+1) value as we do today. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: compare mqes and sqsize only for IO SQMax Gurtovoy1-1/+2
According to the NVMe Spec: " MQES: This field indicates the maximum individual queue size that the controller supports. For NVMe over PCIe implementations, this value applies to the I/O Submission Queues and I/O Completion Queues that the host creates. For NVMe over Fabrics implementations, this value applies to only the I/O Submission Queues that the host creates. " Align the target code to compare mqes and sqsize as mentioned in the NVMe Spec. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: move NVME_RDMA_IP_PORT from common fileMax Gurtovoy2-2/+2
The correct place for this definition is the nvme rdma header file and not the common nvme header file. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-01nbd: use the atomic queue limits API in nbd_set_sizeChristoph Hellwig1-4/+11
Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01nbd: freeze the queue for queue limits updatesChristoph Hellwig1-1/+13
nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are not commands in flight that can see torn or inconsistent limits. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01nbd: don't clear discard_sectors in nbd_config_putChristoph Hellwig1-1/+2
nbd_config_put currently clears discard_sectors when unusing a device. This is pretty odd behavior and different from the sector size configuration which is simply left in places and then reconfigured when nbd_set_size is as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported so that all the queue limits changes are handled in one place. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01pktcdvd: don't set max_hw_sectors on the underlying deviceChristoph Hellwig1-5/+6
pktcdvd sets max_hw_sectors on the queue of the underlying device that it doesn't own (and doesn't reset it ever) since the driver was merged. This can create all kinds of problems as the underlying driver doesn't even know about it changing the limit. As the state purpose is to not create I/Os larger than a single frame, and pktcdvd never builds bios larger than that, just set REQ_NOMERGE on the bios it submits so that largers I/Os never get built. Note: I don't have packet writing hardware, so this is compile tested only. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01dm: use queue_limits_setChristoph Hellwig2-16/+13
Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01block: add a queue_limits_stack_bdev helperChristoph Hellwig2-0/+27
Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of misaligned device. The name fits into the new queue limits API and the intent is to eventually replace disk_stack_limits. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01block: add a queue_limits_set helperChristoph Hellwig2-0/+19
Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01Merge tag 'md-6.9-20240301' of ↵Jens Axboe8-381/+549
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD updates from Song: "The major changes are: 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse. 2. Clean up and fix for md_ioctl, by Li Nan. 3. Other small fixes, by Gui-Dong Han and Heming Zhao." * tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits) md/raid1: factor out helpers to choose the best rdev from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out read_first_rdev() from read_balance() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1-10: add a helper raid1_check_read_range() md/raid1: fix choose next idle in read_balance() md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: factor out helpers to add rdev to conf md: add a new helper rdev_has_badblock() md/raid5: fix atomicity violation in raid5_cache_count md/md-bitmap: fix incorrect usage for sb_index md: check mddev->pers before calling md_set_readonly() md: clean up openers check in do_md_stop() and md_set_readonly() md: sync blockdev before stopping raid or setting readonly md: factor out a helper to sync mddev md: Don't clear MD_CLOSING when the raid is about to stop md: return directly before setting did_set_md_closing md: clean up invalid BUG_ON in md_ioctl ...
2024-02-29Merge branch 'raid1-read_balance' into md-6.9Song Liu6-280/+444
From: Yu Kuai <[email protected]> Co-developed-by: Paul Luse <[email protected]> The original idea is that Paul want to optimize raid1 read performance([1]), however, we think that the original code for read_balance() is quite complex, and we don't want to add more complexity. Hence we decide to refactor read_balance() first, to make code cleaner and easier for follow up. Before this patchset, read_balance() has many local variables and many branches, it want to consider all the scenarios in one iteration. The idea of this patch is to divide them into 4 different steps: 1) If resync is in progress, find the first usable disk, patch 5; Otherwise: 2) Loop through all disks and skipping slow disks and disks with bad blocks, choose the best disk, patch 10. If no disk is found: 3) Look for disks with bad blocks and choose the one with most number of sectors, patch 8. If no disk is found: 4) Choose first found slow disk with no bad blocks, or slow disk with most number of sectors, patch 7. Note that step 3) and step 4) are super code path, and performance should not be considered. And after this patchset, we'll continue to optimize read_balance for step 2), specifically how to choose the best rdev to read. [1] https://lore.kernel.org/all/[email protected]/ Yu Kuai (11): md: add a new helper rdev_has_badblock() md/raid1: factor out helpers to add rdev to conf md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: fix choose next idle in read_balance() md/raid1-10: add a helper raid1_check_read_range() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1: factor out read_first_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out helpers to choose the best rdev from read_balance()
2024-02-29md/raid1: factor out helpers to choose the best rdev from read_balance()Yu Kuai1-77/+98
The way that best rdev is chosen: 1) If the read is sequential from one rdev: - if rdev is rotational, use this rdev; - if rdev is non-rotational, use this rdev until total read length exceed disk opt io size; 2) If the read is not sequential: - if there is idle disk, use it, otherwise: - if the array has non-rotational disk, choose the rdev with minimal inflight IO; - if all the underlaying disks are rotational disk, choose the rdev with closest IO; There are no functional changes, just to make code cleaner and prepare for following refactor. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: factor out the code to manage sequential IOYu Kuai1-34/+37
There is no functional change for now, make read_balance() cleaner and prepare to fix problems and refactor the handler of sequential IO. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: factor out choose_bb_rdev() from read_balance()Yu Kuai1-31/+48
read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the rdev with bad blocks from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: factor out choose_slow_rdev() from read_balance()Yu Kuai1-17/+52
read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the slow rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: factor out read_first_rdev() from read_balance()Yu Kuai1-17/+46
read_balance() is hard to understand because there are too many status and branches, and it's overlong. This patch factor out the case to read the first rdev from read_balance(), there are no functional changes. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1-10: factor out a new helper raid1_should_read_first()Yu Kuai3-24/+24
If resync is in progress, read_balance() should find the first usable disk, otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same checking, hence factor out the checking to make code cleaner. Noted that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 update it before submitting resync IO. Fortunately, raid10 read IO can't concurrent with resync IO, hence there is no problem. And this patch also switch raid10 to use 'mddev->recovery_cp'. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1-10: add a helper raid1_check_read_range()Yu Kuai1-0/+49
The checking and handler of bad blocks appear many timers during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: fix choose next idle in read_balance()Yu Kuai1-10/+22
Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") add the case choose next idle in read_balance(): read_balance: for_each_rdev if(next_seq_sect == this_sector || dist == 0) -> sequential reads best_disk = disk; if (...) choose_next_idle = 1 continue; for_each_rdev -> iterate next rdev if (pending == 0) best_disk = disk; -> choose the next idle disk break; if (choose_next_idle) -> keep using this rdev if there are no other idle disk contine However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") remove the code: - /* If device is idle, use it */ - if (pending == 0) { - best_disk = disk; - break; - } Hence choose next idle will never work now, fix this problem by following: 1) don't set best_disk in this case, read_balance() will choose the best disk after iterating all the disks; 2) add 'pending' so that other idle disk will be chosen; 3) add a new local variable 'sequential_disk' to record the disk, and if there is no other idle disk, 'sequential_disk' will be chosen; Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: record nonrot rdevs while adding/removing rdevs to confYu Kuai3-7/+12
For raid1, each read will iterate all the rdevs from conf and check if any rdev is non-rotational, then choose rdev with minimal IO inflight if so, or rdev with closest distance otherwise. Disk nonrot info can be changed through sysfs entry: /sys/block/[disk_name]/queue/rotational However, consider that this should only be used for testing, and user really shouldn't do this in real life. Record the number of non-rotational disks in conf, to avoid checking each rdev in IO fast path and simplify read_balance() a little bit. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md/raid1: factor out helpers to add rdev to confYu Kuai1-32/+53
There are no functional changes, just make code cleaner and prepare to record disk non-rotational information while adding and removing rdev to conf Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-29md: add a new helper rdev_has_badblock()Yu Kuai4-72/+44
The current api is_badblock() must pass in 'first_bad' and 'bad_sectors', however, many caller just want to know if there are badblocks or not, and these caller must define two local variable that will never be used. Add a new helper rdev_has_badblock() that will only return if there are badblocks or not, remove unnecessary local variables and replace is_badblock() with the new helper in many places. There are no functional changes, and the new helper will also be used later to refactor read_balance(). Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-28ublk: add UBLK_CMD_DEL_DEV_ASYNCMing Lei2-3/+8
The current command UBLK_CMD_DEL_DEV won't return until the device is released, this way looks more reliable, but makes userspace more difficult to implement, especially about orders: unmap command buffer(which holds one ublkc reference), ublkc close, io_uring_file_unregister, ublkb close. Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait release, then userspace needn't worry about the above order. Actually both loop and nbd is deleted in this async way. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-28ublk: improve getting & putting ublk deviceMing Lei1-5/+7
Firstly convert get_device() and put_device() into ublk_get_device() and ublk_put_device(). Secondly annotate ublk_get_device() & ublk_put_device() as noinline for trace, especially it is often to trigger device deletion hang when incorrect order is used on ublkc mmap, ublkc close, io_uring_sqe_unregister_file, ublkb close. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-28blk-mq: don't change nr_hw_queues and nr_maps for kdump kernelMing Lei1-8/+6
For most of ARCHs, 'nr_cpus=1' is passed for kdump kernel, so nr_hw_queues for each mapping is supposed to be 1 already. More importantly, this way may cause trouble for driver, because blk-mq and driver see different queue mapping since driver should setup hardware queue setting before calling into allocating blk-mq tagset. So not overriding nr_hw_queues and nr_maps for kdump kernel. Cc: Wen Xiong <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27md/raid5: fix atomicity violation in raid5_cache_countGui-Dong Han1-6/+8
In raid5_cache_count(): if (conf->max_nr_stripes < conf->min_nr_stripes) return 0; return conf->max_nr_stripes - conf->min_nr_stripes; The current check is ineffective, as the values could change immediately after being checked. In raid5_set_cache_size(): ... conf->min_nr_stripes = size; ... while (size > conf->max_nr_stripes) conf->min_nr_stripes = conf->max_nr_stripes; ... Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The current checks are ineffective as values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow. This possible bug is found by an experimental static analysis tool developed by our team. This tool analyzes the locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs including data races and atomicity violations. The above possible bug is reported when our tool analyzes the source code of Linux 6.2. To resolve this issue, it is suggested to introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() fails to resolve atomicity violations, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked. With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we cannot test the patch in runtime testing, and just verify it according to the code logic. Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: [email protected] Signed-off-by: Gui-Dong Han <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2024-02-27ubd: open the backing files in ubd_addChristoph Hellwig1-42/+16
Opening the backing device only when the block device is opened is a bit weird as no one configures block devices to not use them. Opend them at add time, close them at remove time and remove the now superflous opened counter as remove can simply check for disk_openers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27ubd: remove the queue pointer in struct ubdChristoph Hellwig1-3/+1
No need for it now, everything goes through the gendisk. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27ubd: move set_disk_ro to ubd_addChristoph Hellwig1-1/+1
No need to delay this until open time. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27ubd: move setting the variable queue limits to ubd_addChristoph Hellwig1-6/+7
No reason to delay this until open time. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27ubd: move setting the nonrot flag to ubd_addChristoph Hellwig1-1/+1
No reason to delay this until open time. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-27ubd: remove ubd_disk_registerChristoph Hellwig1-22/+15
Fold it into the only caller to remove lots of references to the global ubd_devs array. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Richard Weinberger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>