aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-03-05Merge branch 'dmraid-fix-6.9' into md-6.9Song Liu4-40/+196
This is the second half of fixes for dmraid. The first half is available at [1]. This set contains fixes: - reshape can start unexpected, cause data corruption, patch 1,5,6; - deadlocks that reshape concurrent with IO, patch 8; - a lockdep warning, patch 9; For all the dmraid related tests in lvm2 suite, there is no new regressions compared against 6.6 kernels (which is good baseline before recent regressions). [1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/ * dmraid-fix-6.9: dm-raid: fix lockdep waring in "pers->hot_add_disk" dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape dm-raid: add a new helper prepare_suspend() in md_personality md/dm-raid: don't call md_reap_sync_thread() directly dm-raid: really frozen sync_thread during suspend md: add a new helper reshape_interrupted() md: export helper md_is_rdwr() md: export helpers to stop sync_thread md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
2024-03-05dm-raid: fix lockdep waring in "pers->hot_add_disk"Yu Kuai1-0/+2
The lockdep assert is added by commit a448af25becf ("md/raid10: remove rcu protection to access rdev from conf") in print_conf(). And I didn't notice that dm-raid is calling "pers->hot_add_disk" without holding 'reconfig_mutex'. "pers->hot_add_disk" read and write many fields that is protected by 'reconfig_mutex', and raid_resume() already grab the lock in other contex. Hence fix this problem by protecting "pers->host_add_disk" with the lock. Fixes: 9092c02d9435 ("DM RAID: Add ability to restore transiently failed devices on resume") Fixes: a448af25becf ("md/raid10: remove rcu protection to access rdev from conf") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent ↵Yu Kuai4-7/+74
with reshape For raid456, if reshape is still in progress, then IO across reshape position will wait for reshape to make progress. However, for dm-raid, in following cases reshape will never make progress hence IO will hang: 1) the array is read-only; 2) MD_RECOVERY_WAIT is set; 3) MD_RECOVERY_FROZEN is set; After commit c467e97f079f ("md/raid6: use valid sector values to determine if an I/O should wait on the reshape") fix the problem that IO across reshape position doesn't wait for reshape, the dm-raid test shell/lvconvert-raid-reshape.sh start to hang: [root@fedora ~]# cat /proc/979/stack [<0>] wait_woken+0x7d/0x90 [<0>] raid5_make_request+0x929/0x1d70 [raid456] [<0>] md_handle_request+0xc2/0x3b0 [md_mod] [<0>] raid_map+0x2c/0x50 [dm_raid] [<0>] __map_bio+0x251/0x380 [dm_mod] [<0>] dm_submit_bio+0x1f0/0x760 [dm_mod] [<0>] __submit_bio+0xc2/0x1c0 [<0>] submit_bio_noacct_nocheck+0x17f/0x450 [<0>] submit_bio_noacct+0x2bc/0x780 [<0>] submit_bio+0x70/0xc0 [<0>] mpage_readahead+0x169/0x1f0 [<0>] blkdev_readahead+0x18/0x30 [<0>] read_pages+0x7c/0x3b0 [<0>] page_cache_ra_unbounded+0x1ab/0x280 [<0>] force_page_cache_ra+0x9e/0x130 [<0>] page_cache_sync_ra+0x3b/0x110 [<0>] filemap_get_pages+0x143/0xa30 [<0>] filemap_read+0xdc/0x4b0 [<0>] blkdev_read_iter+0x75/0x200 [<0>] vfs_read+0x272/0x460 [<0>] ksys_read+0x7a/0x170 [<0>] __x64_sys_read+0x1c/0x30 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 This is because reshape can't make progress. For md/raid, the problem doesn't exist because register new sync_thread doesn't rely on the IO to be done any more: 1) If array is read-only, it can switch to read-write by ioctl/sysfs; 2) md/raid never set MD_RECOVERY_WAIT; 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold 'reconfig_mutex', hence it can be cleared and reshape can continue by sysfs api 'sync_action'. However, I'm not sure yet how to avoid the problem in dm-raid yet. This patch on the one hand make sure raid_message() can't change sync_thread() through raid_message() after presuspend(), on the other hand detect the above 3 cases before wait for IO do be done in dm_suspend(), and let dm-raid requeue those IO. Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05dm-raid: add a new helper prepare_suspend() in md_personalityYu Kuai2-0/+19
There are no functional changes for now, prepare to fix a deadlock for dm-raid456. Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05md/dm-raid: don't call md_reap_sync_thread() directlyYu Kuai1-10/+18
Currently md_reap_sync_thread() is called from raid_message() directly without holding 'reconfig_mutex', this is definitely unsafe because md_reap_sync_thread() can change many fields that is protected by 'reconfig_mutex'. However, hold 'reconfig_mutex' here is still problematic because this will cause deadlock, for example, commit 130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix deadlock"). Fix this problem by using stop_sync_thread() to unregister sync_thread, like md/raid did. Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05dm-raid: really frozen sync_thread during suspendYu Kuai2-11/+17
1) commit f52f5c71f3d4 ("md: fix stopping sync thread") remove MD_RECOVERY_FROZEN from __md_stop_writes() and doesn't realize that dm-raid relies on __md_stop_writes() to frozen sync_thread indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in md_stop_writes(), and since stop_sync_thread() is only used for dm-raid in this case, also move stop_sync_thread() to md_stop_writes(). 2) The flag MD_RECOVERY_FROZEN doesn't mean that sync thread is frozen, it only prevent new sync_thread to start, and it can't stop the running sync thread; In order to frozen sync_thread, after seting the flag, stop_sync_thread() should be used. 3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, use it as condition for md_stop_writes() in raid_postsuspend() doesn't look correct. Consider that reentrant stop_sync_thread() do nothing, always call md_stop_writes() in raid_postsuspend(). 4) raid_message can set/clear the flag MD_RECOVERY_FROZEN at anytime, and if MD_RECOVERY_FROZEN is cleared while the array is suspended, new sync_thread can start unexpected. Fix this by disallow raid_message() to change sync_thread status during suspend. Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the test shell/lvconvert-raid-reshape.sh start to hang in stop_sync_thread(), and with previous fixes, the test won't hang there anymore, however, the test will still fail and complain that ext4 is corrupted. And with this patch, the test won't hang due to stop_sync_thread() or fail due to ext4 is corrupted anymore. However, there is still a deadlock related to dm-raid456 that will be fixed in following patches. Reported-by: Mikulas Patocka <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05md: add a new helper reshape_interrupted()Yu Kuai1-0/+19
The helper will be used for dm-raid456 later to detect the case that reshape can't make progress. Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05md: export helper md_is_rdwr()Yu Kuai2-12/+12
There are no functional changes for now, prepare to fix a deadlock for dm-raid456. Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05md: export helpers to stop sync_threadYu Kuai2-0/+32
Add new helpers: void md_idle_sync_thread(struct mddev *mddev); void md_frozen_sync_thread(struct mddev *mddev); void md_unfrozen_sync_thread(struct mddev *mddev); The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly. Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resumeYu Kuai1-1/+4
After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the target") raid_ctr() will set MD_RECOVERY_FROZEN before md_run() and expect to keep array frozen until resume. However, md_run() will clear the flag by setting mddev->recovery to 0. Before commit 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspending to prevent starting new sync_thread. Fix this problem by keeping 'MD_RECOVERY_FROZEN' for dm-raid in md_run(). Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()") Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Xiao Ni <[email protected]> Acked-by: Mike Snitzer <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d""Song Liu1-0/+12
This reverts commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d. The original set [1][2] was expected to undo a suboptimal fix in [2], and replace it with a better fix [1]. However, as reported by Dan Moulding [2] causes an issue with raid5 with journal device. Revert [2] for now to close the issue. We will follow up on another issue reported by Juxiao Bi, as [2] is expected to fix it. We believe this is a good trade-off, because the latter issue happens less freqently. In the meanwhile, we will NOT revert [1], as it contains the right logic. [1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update") [2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Reported-by: Dan Moulding <[email protected]> Closes: https://lore.kernel.org/linux-raid/[email protected]/ Fixes: bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"") Cc: [email protected] # v5.19+ Cc: Junxiao Bi <[email protected]> Cc: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-05nvme: fcloop: make fcloop_class constantRicardo B. Marliere1-8/+9
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the fcloop_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Ricardo B. Marliere <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-05nvme: fabrics: make nvmf_class constantRicardo B. Marliere1-9/+11
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the nvmf_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Ricardo B. Marliere <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-05nvme: core: constify struct class usageRicardo B. Marliere1-25/+28
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the structures nvme_class, nvme_subsys_class and nvme_ns_chr_class to be declared at build time placing them into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Ricardo B. Marliere <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme-fabrics: typo in nvmf_parse_key()Hannes Reinecke1-1/+1
Of course we should use the key if there is no error ... Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme-multipath: use atomic queue limits API for stacking limitsChristoph Hellwig1-3/+6
Switch to the queue_limits_* helpers to stack the bdev limits, which also includes updating the readahead settings. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme-multipath: pass queue_limits to blk_alloc_diskChristoph Hellwig1-6/+7
The multipath disk starts out with the stacking default limits. The one interesting part here is that blk_set_stacking_limits sets the max_zone_append_sectorts to UINT_MAX, which fails the validation for non-zoned devices. With the old one call per limit scheme this was fine because no one verified this weird mismatch and it was fixed by blk_stack_limits a little later before I/O could be issued. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: use the atomic queue limits update APIChristoph Hellwig3-79/+80
Changes the callchains that update queue_limits to build an on-stack queue_limits and update it atomically. Note that for now only the admin queue actually passes it to the queue allocation function. Doing the same for the gendisks used for the namespaces will require a little more work. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: cleanup nvme_configure_metadataChristoph Hellwig1-28/+19
Fold nvme_init_ms into nvme_configure_metadata after splitting up a little helper to deal with the extended LBA formats. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: don't query identify data in configure_metadataChristoph Hellwig1-30/+19
Move reading the Identify Namespace Data Structure, NVM Command Set out of configure_metadata into the caller. This allows doing the identify call outside the frozen I/O queues, and prepares for using data from the Identify data structure for other purposes. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: split out a nvme_identify_ns_nvm helperChristoph Hellwig1-12/+26
Split the logic to query the Identify Namespace Data Structure, NVM Command Set into a separate helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move common logic into nvme_update_ns_infoChristoph Hellwig1-42/+42
nvme_update_ns_info_generic and nvme_update_ns_info_block share a fair amount of logic related to not fully supported namespace formats and updating the multipath information. Move this logic into the common caller. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move setting the write cache flags out of nvme_set_queue_limitsChristoph Hellwig1-3/+2
nvme_set_queue_limits is used on the admin queue and all gendisks including hidden ones that don't support block I/O. The write cache setting on the other hand only makes sense for block I/O. Move the blk_queue_write_cache call to nvme_update_ns_info_block instead. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move a few things out of nvme_update_disk_infoChristoph Hellwig1-20/+26
Move setting up the integrity profile and setting the disk capacity out of nvme_update_disk_info to get nvme_update_disk_info into a shape where it just sets queue_limits eventually. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: don't use nvme_update_disk_info for the multipath diskChristoph Hellwig1-1/+2
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for the namespace attached disk, and the multipath one (if it exists). This is very different from how other stacking drivers work, and leads to a lot of complexity. Switch to setting the disk capacity and initializing the integrity profile, and let blk_stack_limits which already is called just below deal with updating the other limits. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move blk_integrity_unregister into nvme_init_integrityChristoph Hellwig1-2/+2
Move uneregistering the existing integrity profile into the helper dealing with all the other integrity / metadata setup. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: cleanup the nvme_init_integrity calling conventionsChristoph Hellwig1-14/+15
Handle the no metadata support case in nvme_init_integrity as well to simplify the calling convention and prepare for future changes in the area. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move max_integrity_segments handling out of nvme_init_integrityChristoph Hellwig1-7/+4
max_integrity_segments is just a hardware limit and doesn't need to be in nvme_init_integrity with the PI setup. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: remove nvme_revalidate_zonesChristoph Hellwig3-12/+3
Handle setting the zone size / chunk_sectors and max_append_sectors limits together with the other ZNS limits, and just open code the call to blk_revalidate_zones in the current place. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discardChristoph Hellwig1-5/+6
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of nvme_config_discard so that it is combined with the normal write_zeroes limit handling. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-04nvme: set max_hw_sectors unconditionallyChristoph Hellwig1-8/+8
All transports set a max_hw_sectors value in the nvme_ctrl, so make the code using it unconditional and clean it up using a little helper. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Reviewed-by: John Garry <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-fabrics: check max outstanding commandsGuixin Liu1-0/+5
Maxcmd is mandatory for fabrics, check it early to identify the root cause instead of waiting for it to propagate to "sqsize" and "allocing queue". By the way, change nvme_check_ctrl_fabric_info() to nvmf_validate_identify_ctrl(). Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Guixin Liu <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet-rdma: set max_queue_size for RDMA transportMax Gurtovoy2-1/+10
A new port configuration was added to set max_queue_size. Clamp user configuration to RDMA transport limits. Increase the maximal queue size of RDMA controllers from 128 to 256 (the default size stays 128 same as before). Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: introduce new max queue size configuration entryMax Gurtovoy3-3/+46
Using this port configuration, one will be able to set the maximal queue size to be used for any controller that will be associated to the configured port. The default value stayed 1024 but each transport will be able to set the its own values before enabling the port. Introduce lower limit of 16 for minimal queue depth (same as we use in the host fabrics drivers). Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Guixin Liu <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: clamp queue size according to ctrl capMax Gurtovoy1-4/+10
If a controller is configured with metadata support, clamp the maximal queue size to be 128 since there are more resources that are needed for metadata operations. Otherwise, clamp it to 256. Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definitionMax Gurtovoy2-1/+4
This definition will be used by controllers that are configured with metadata support. For now, both regular and metadata controllers have the same maximal queue size but later commit will increase the maximal queue size for regular RDMA controllers to 256. We'll keep the maximal queue size for metadata controllers to be 128 since there are more resources that are needed for metadata operations and 128 is the optimal size found for metadata controllers base on testing. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: set ctrl pi_support cap before initializing cap regMax Gurtovoy2-2/+1
This is a preparation for setting the maximal queue size of a controller that supports PI. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: set maxcmd to be per controllerMax Gurtovoy4-4/+4
This is a preparation for having a dynamic configuration of max queue size for a controller. Make sure that the maxcmd field stays the same as the MQES (+1) value as we do today. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvmet: compare mqes and sqsize only for IO SQMax Gurtovoy1-1/+2
According to the NVMe Spec: " MQES: This field indicates the maximum individual queue size that the controller supports. For NVMe over PCIe implementations, this value applies to the I/O Submission Queues and I/O Completion Queues that the host creates. For NVMe over Fabrics implementations, this value applies to only the I/O Submission Queues that the host creates. " Align the target code to compare mqes and sqsize as mentioned in the NVMe Spec. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-02nvme-rdma: move NVME_RDMA_IP_PORT from common fileMax Gurtovoy2-2/+2
The correct place for this definition is the nvme rdma header file and not the common nvme header file. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Israel Rukshin <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Keith Busch <[email protected]>
2024-03-01nbd: use the atomic queue limits API in nbd_set_sizeChristoph Hellwig1-4/+11
Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01nbd: freeze the queue for queue limits updatesChristoph Hellwig1-1/+13
nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are not commands in flight that can see torn or inconsistent limits. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01nbd: don't clear discard_sectors in nbd_config_putChristoph Hellwig1-1/+2
nbd_config_put currently clears discard_sectors when unusing a device. This is pretty odd behavior and different from the sector size configuration which is simply left in places and then reconfigured when nbd_set_size is as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported so that all the queue limits changes are handled in one place. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01pktcdvd: don't set max_hw_sectors on the underlying deviceChristoph Hellwig1-5/+6
pktcdvd sets max_hw_sectors on the queue of the underlying device that it doesn't own (and doesn't reset it ever) since the driver was merged. This can create all kinds of problems as the underlying driver doesn't even know about it changing the limit. As the state purpose is to not create I/Os larger than a single frame, and pktcdvd never builds bios larger than that, just set REQ_NOMERGE on the bios it submits so that largers I/Os never get built. Note: I don't have packet writing hardware, so this is compile tested only. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01dm: use queue_limits_setChristoph Hellwig2-16/+13
Use queue_limits_set which validates the limits and takes care of updating the readahead settings instead of directly assigning them to the queue. For that make sure all limits are actually updated before the assignment. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01block: add a queue_limits_stack_bdev helperChristoph Hellwig2-0/+27
Add a small wrapper around blk_stack_limits that allows passing a bdev for the bottom device and prints an error in case of misaligned device. The name fits into the new queue limits API and the intent is to eventually replace disk_stack_limits. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01block: add a queue_limits_set helperChristoph Hellwig2-0/+19
Add a small wrapper around queue_limits_commit_update for stacking drivers that don't want to update existing limits, but set an entirely new set. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-03-01Merge tag 'md-6.9-20240301' of ↵Jens Axboe8-381/+549
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD updates from Song: "The major changes are: 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse. 2. Clean up and fix for md_ioctl, by Li Nan. 3. Other small fixes, by Gui-Dong Han and Heming Zhao." * tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits) md/raid1: factor out helpers to choose the best rdev from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out read_first_rdev() from read_balance() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1-10: add a helper raid1_check_read_range() md/raid1: fix choose next idle in read_balance() md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: factor out helpers to add rdev to conf md: add a new helper rdev_has_badblock() md/raid5: fix atomicity violation in raid5_cache_count md/md-bitmap: fix incorrect usage for sb_index md: check mddev->pers before calling md_set_readonly() md: clean up openers check in do_md_stop() and md_set_readonly() md: sync blockdev before stopping raid or setting readonly md: factor out a helper to sync mddev md: Don't clear MD_CLOSING when the raid is about to stop md: return directly before setting did_set_md_closing md: clean up invalid BUG_ON in md_ioctl ...
2024-02-29Merge branch 'raid1-read_balance' into md-6.9Song Liu6-280/+444
From: Yu Kuai <[email protected]> Co-developed-by: Paul Luse <[email protected]> The original idea is that Paul want to optimize raid1 read performance([1]), however, we think that the original code for read_balance() is quite complex, and we don't want to add more complexity. Hence we decide to refactor read_balance() first, to make code cleaner and easier for follow up. Before this patchset, read_balance() has many local variables and many branches, it want to consider all the scenarios in one iteration. The idea of this patch is to divide them into 4 different steps: 1) If resync is in progress, find the first usable disk, patch 5; Otherwise: 2) Loop through all disks and skipping slow disks and disks with bad blocks, choose the best disk, patch 10. If no disk is found: 3) Look for disks with bad blocks and choose the one with most number of sectors, patch 8. If no disk is found: 4) Choose first found slow disk with no bad blocks, or slow disk with most number of sectors, patch 7. Note that step 3) and step 4) are super code path, and performance should not be considered. And after this patchset, we'll continue to optimize read_balance for step 2), specifically how to choose the best rdev to read. [1] https://lore.kernel.org/all/[email protected]/ Yu Kuai (11): md: add a new helper rdev_has_badblock() md/raid1: factor out helpers to add rdev to conf md/raid1: record nonrot rdevs while adding/removing rdevs to conf md/raid1: fix choose next idle in read_balance() md/raid1-10: add a helper raid1_check_read_range() md/raid1-10: factor out a new helper raid1_should_read_first() md/raid1: factor out read_first_rdev() from read_balance() md/raid1: factor out choose_slow_rdev() from read_balance() md/raid1: factor out choose_bb_rdev() from read_balance() md/raid1: factor out the code to manage sequential IO md/raid1: factor out helpers to choose the best rdev from read_balance()
2024-02-29md/raid1: factor out helpers to choose the best rdev from read_balance()Yu Kuai1-77/+98
The way that best rdev is chosen: 1) If the read is sequential from one rdev: - if rdev is rotational, use this rdev; - if rdev is non-rotational, use this rdev until total read length exceed disk opt io size; 2) If the read is not sequential: - if there is idle disk, use it, otherwise: - if the array has non-rotational disk, choose the rdev with minimal inflight IO; - if all the underlaying disks are rotational disk, choose the rdev with closest IO; There are no functional changes, just to make code cleaner and prepare for following refactor. Co-developed-by: Paul Luse <[email protected]> Signed-off-by: Paul Luse <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]