path: root/drivers
Age | Commit message | Author | Files | Lines
2024-03-06 | md/raid5: use the atomic queue limit update APIs | Christoph Hellwig | 1 | -65/+65
Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious, also split the queue limits handling into separate helpers.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
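For readers unfamiliar with the atomic queue limit API, the pattern these md conversions follow looks roughly like the sketch below. The helper name raid5_set_limits and the specific limit fields are illustrative assumptions, not the exact patch contents:

  /* Hypothetical sketch of the queue_limits_set pattern described above. */
  static int raid5_set_limits(struct mddev *mddev)
  {
  	struct queue_limits lim = { };	/* built outside the queue */

  	lim.io_min = mddev->chunk_sectors << 9;	/* illustrative values */
  	lim.io_opt = lim.io_min * (mddev->raid_disks - 1);

  	/* Validate and apply all limits in one atomic step. */
  	return queue_limits_set(mddev->gendisk->queue, &lim);
  }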
2024-03-06 | md/raid1: use the atomic queue limit update APIs | Christoph Hellwig | 1 | -9/+16
Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious, also split the queue limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-06 | md/raid0: use the atomic queue limit update APIs | Christoph Hellwig | 1 | -15/+20
Build the queue limits outside the queue and apply them using queue_limits_set. To make the code more obvious, also split the queue limits handling into a separate helper function.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-06 | md: add queue limit helpers | Christoph Hellwig | 2 | -0/+48
Add a few helpers that wrap the block queue limits API for use in MD.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-06 | md: add a mddev_is_dm helper | Christoph Hellwig | 6 | -35/+38
Add a helper to check for a DM-mapped MD device instead of using the obfuscated ->gendisk or ->queue NULL checks.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
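As a rough sketch, assuming the DM-mapped case is distinguished by a NULL ->gendisk (per the commit description), the helper could look like:

  /* Sketch only: assumes a DM-mapped MD device has no gendisk of its own. */
  static inline bool mddev_is_dm(struct mddev *mddev)
  {
  	return !mddev->gendisk;
  }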
2024-03-06 | md: add a mddev_add_trace_msg helper | Christoph Hellwig | 6 | -29/+28
Add a small wrapper around blk_add_trace_msg that hides some argument dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
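A plausible shape for such a wrapper, hedged as an illustration (blk_add_trace_msg needs the request queue, which a DM-mapped MD device does not own, hence the mddev_is_dm check):

  /* Illustrative sketch of a trace-message wrapper for MD. */
  #define mddev_add_trace_msg(mddev, fmt, args...)		\
  do {								\
  	if (!mddev_is_dm(mddev))				\
  		blk_add_trace_msg((mddev)->gendisk->queue,	\
  				  fmt, ##args);			\
  } while (0)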
2024-03-06 | md: add a mddev_trace_remap helper | Christoph Hellwig | 6 | -37/+17
Add a helper to trace bio remapping that hides some argument dereferences and the check for a DM-mapped MD device.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Tested-by: Song Liu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
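A minimal sketch of what such a helper could look like, again skipping the DM-mapped case where no gendisk is owned:

  /* Sketch: remap tracing helper, a no-op for DM-mapped devices. */
  static inline void mddev_trace_remap(struct mddev *mddev,
  				     struct bio *bio, sector_t sector)
  {
  	if (!mddev_is_dm(mddev))
  		trace_block_bio_remap(bio, disk_devt(mddev->gendisk),
  				      sector);
  }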
2024-03-05 | Merge branch 'dmraid-fix-6.9' into md-6.9 | Song Liu | 4 | -40/+196
This is the second half of the fixes for dmraid. The first half is available at [1]. This set contains fixes for:
- reshape starting unexpectedly, causing data corruption (patches 1, 5, 6);
- deadlocks when reshape runs concurrently with IO (patch 8);
- a lockdep warning (patch 9).
For all the dmraid-related tests in the lvm2 suite, there are no new regressions compared against 6.6 kernels (a good baseline before the recent regressions).

[1] https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/

* dmraid-fix-6.9:
  dm-raid: fix lockdep warning in "pers->hot_add_disk"
  dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape
  dm-raid: add a new helper prepare_suspend() in md_personality
  md/dm-raid: don't call md_reap_sync_thread() directly
  dm-raid: really freeze sync_thread during suspend
  md: add a new helper reshape_interrupted()
  md: export helper md_is_rdwr()
  md: export helpers to stop sync_thread
  md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
2024-03-05 | dm-raid: fix lockdep warning in "pers->hot_add_disk" | Yu Kuai | 1 | -0/+2
The lockdep assert was added by commit a448af25becf ("md/raid10: remove rcu protection to access rdev from conf") in print_conf(), and I didn't notice that dm-raid is calling "pers->hot_add_disk" without holding 'reconfig_mutex'. "pers->hot_add_disk" reads and writes many fields that are protected by 'reconfig_mutex', and raid_resume() already grabs the lock in another context. Hence fix this problem by protecting "pers->hot_add_disk" with the lock.

Fixes: 9092c02d9435 ("DM RAID: Add ability to restore transiently failed devices on resume")
Fixes: a448af25becf ("md/raid10: remove rcu protection to access rdev from conf")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape | Yu Kuai | 4 | -7/+74
For raid456, if reshape is still in progress, then IO across the reshape position will wait for reshape to make progress. However, for dm-raid, in the following cases reshape will never make progress, hence IO will hang:

1) the array is read-only;
2) MD_RECOVERY_WAIT is set;
3) MD_RECOVERY_FROZEN is set.

After commit c467e97f079f ("md/raid6: use valid sector values to determine if an I/O should wait on the reshape") fixed the problem that IO across the reshape position doesn't wait for reshape, the dm-raid test shell/lvconvert-raid-reshape.sh started to hang:

  [root@fedora ~]# cat /proc/979/stack
  [<0>] wait_woken+0x7d/0x90
  [<0>] raid5_make_request+0x929/0x1d70 [raid456]
  [<0>] md_handle_request+0xc2/0x3b0 [md_mod]
  [<0>] raid_map+0x2c/0x50 [dm_raid]
  [<0>] __map_bio+0x251/0x380 [dm_mod]
  [<0>] dm_submit_bio+0x1f0/0x760 [dm_mod]
  [<0>] __submit_bio+0xc2/0x1c0
  [<0>] submit_bio_noacct_nocheck+0x17f/0x450
  [<0>] submit_bio_noacct+0x2bc/0x780
  [<0>] submit_bio+0x70/0xc0
  [<0>] mpage_readahead+0x169/0x1f0
  [<0>] blkdev_readahead+0x18/0x30
  [<0>] read_pages+0x7c/0x3b0
  [<0>] page_cache_ra_unbounded+0x1ab/0x280
  [<0>] force_page_cache_ra+0x9e/0x130
  [<0>] page_cache_sync_ra+0x3b/0x110
  [<0>] filemap_get_pages+0x143/0xa30
  [<0>] filemap_read+0xdc/0x4b0
  [<0>] blkdev_read_iter+0x75/0x200
  [<0>] vfs_read+0x272/0x460
  [<0>] ksys_read+0x7a/0x170
  [<0>] __x64_sys_read+0x1c/0x30
  [<0>] do_syscall_64+0xc6/0x230
  [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

This is because reshape can't make progress. For md/raid, the problem doesn't exist because registering a new sync_thread doesn't rely on the IO to be done any more:

1) If the array is read-only, it can be switched to read-write via ioctl/sysfs;
2) md/raid never sets MD_RECOVERY_WAIT;
3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold 'reconfig_mutex', hence the flag can be cleared and reshape can continue through the sysfs api 'sync_action'.

However, I'm not sure yet how to avoid the problem in dm-raid. This patch, on the one hand, makes sure raid_message() can't change the sync_thread after presuspend(); on the other hand, it detects the above 3 cases before waiting for IO to be done in dm_suspend(), and lets dm-raid requeue those IOs.

Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | dm-raid: add a new helper prepare_suspend() in md_personality | Yu Kuai | 2 | -0/+19
There are no functional changes for now; this prepares for fixing a deadlock in dm-raid456.

Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | md/dm-raid: don't call md_reap_sync_thread() directly | Yu Kuai | 1 | -10/+18
Currently md_reap_sync_thread() is called from raid_message() directly without holding 'reconfig_mutex'. This is definitely unsafe because md_reap_sync_thread() can change many fields that are protected by 'reconfig_mutex'. However, holding 'reconfig_mutex' here is still problematic because it will cause a deadlock; see, for example, commit 130443d60b1b ("md: refactor idle/frozen_sync_thread() to fix deadlock"). Fix this problem by using stop_sync_thread() to unregister the sync_thread, like md/raid does.

Fixes: be83651f0050 ("DM RAID: Add message/status support for changing sync action")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | dm-raid: really freeze sync_thread during suspend | Yu Kuai | 2 | -11/+17
1) Commit f52f5c71f3d4 ("md: fix stopping sync thread") removed MD_RECOVERY_FROZEN from __md_stop_writes() without realizing that dm-raid relies on __md_stop_writes() to freeze the sync_thread indirectly. Fix this problem by adding MD_RECOVERY_FROZEN in md_stop_writes(); and since stop_sync_thread() is only used for dm-raid in this case, also move stop_sync_thread() to md_stop_writes().

2) The flag MD_RECOVERY_FROZEN doesn't mean that the sync thread is frozen; it only prevents a new sync_thread from starting, and it can't stop a running sync thread. In order to freeze the sync_thread, stop_sync_thread() should be used after setting the flag.

3) The flag MD_RECOVERY_FROZEN doesn't mean that writes are stopped, so using it as the condition for md_stop_writes() in raid_postsuspend() doesn't look correct. Considering that a reentrant stop_sync_thread() does nothing, always call md_stop_writes() in raid_postsuspend().

4) raid_message() can set/clear the flag MD_RECOVERY_FROZEN at any time, and if MD_RECOVERY_FROZEN is cleared while the array is suspended, a new sync_thread can start unexpectedly. Fix this by disallowing raid_message() to change the sync_thread status during suspend.

Note that after commit f52f5c71f3d4 ("md: fix stopping sync thread"), the test shell/lvconvert-raid-reshape.sh started to hang in stop_sync_thread(), and with the previous fixes the test won't hang there anymore; however, the test will still fail and complain that ext4 is corrupted. With this patch, the test won't hang due to stop_sync_thread() or fail due to ext4 corruption anymore. However, there is still a deadlock related to dm-raid456 that will be fixed in the following patches.

Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 1af2048a3e87 ("dm raid: fix deadlock caused by premature md_stop_writes()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | md: add a new helper reshape_interrupted() | Yu Kuai | 1 | -0/+19
The helper will be used for dm-raid456 later to detect the case where reshape can't make progress.

Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
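A sketch of what such a check could look like, under the assumption that "interrupted" means a reshape position has been recorded but no sync_thread is running (or the running one is about to stop); the exact recovery-bit checks here are illustrative, not the literal patch:

  /* Sketch: has a reshape started but stopped making progress? */
  static inline bool reshape_interrupted(struct mddev *mddev)
  {
  	/* Reshape never started. */
  	if (mddev->reshape_position == MaxSector)
  		return false;

  	/* Interrupted: no sync_thread is running. */
  	if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
  		return true;

  	/* A running reshape will be interrupted soon. */
  	return test_bit(MD_RECOVERY_WAIT, &mddev->recovery) ||
  	       test_bit(MD_RECOVERY_INTR, &mddev->recovery) ||
  	       test_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
  }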
2024-03-05 | md: export helper md_is_rdwr() | Yu Kuai | 2 | -12/+12
There are no functional changes for now; this prepares for fixing a deadlock in dm-raid456.

Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
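The helper itself is tiny; something along these lines, assuming the tri-state ->ro field with MD_RDWR as the read-write value:

  /* Sketch: true if the array is in normal read-write mode. */
  static inline bool md_is_rdwr(struct mddev *mddev)
  {
  	return mddev->ro == MD_RDWR;
  }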
2024-03-05 | md: export helpers to stop sync_thread | Yu Kuai | 2 | -0/+32
Add new helpers:

  void md_idle_sync_thread(struct mddev *mddev);
  void md_frozen_sync_thread(struct mddev *mddev);
  void md_unfrozen_sync_thread(struct mddev *mddev);

The helpers will be used in dm-raid in later patches to fix regressions and prevent calling md_reap_sync_thread() directly.

Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
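Hedged sketches of the freeze/unfreeze pair, assuming an internal stop_sync_thread() helper that waits for the running sync_thread to exit (both that helper's signature and the exact bit manipulation are assumptions, not the patch contents):

  /* Sketch: freeze = forbid new sync_threads, then stop the current one. */
  void md_frozen_sync_thread(struct mddev *mddev)
  {
  	set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
  	stop_sync_thread(mddev, false);		/* assumed internal helper */
  }

  /* Sketch: unfreeze = allow sync_threads again and kick the daemon. */
  void md_unfrozen_sync_thread(struct mddev *mddev)
  {
  	clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
  	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
  	md_wakeup_thread(mddev->thread);
  }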
2024-03-05 | md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume | Yu Kuai | 1 | -1/+4
After commit 9dbd1aa3a81c ("dm raid: add reshaping support to the target"), raid_ctr() will set MD_RECOVERY_FROZEN before md_run() and expects to keep the array frozen until resume. However, md_run() will clear the flag by setting mddev->recovery to 0. Before commit 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()"), dm-raid actually relied on suspending to prevent starting a new sync_thread. Fix this problem by keeping MD_RECOVERY_FROZEN for dm-raid in md_run().

Fixes: 1baae052cccd ("md: Don't ignore suspended array in md_check_recovery()")
Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-05 | Revert "Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"" | Song Liu | 1 | -0/+12
This reverts commit bed9e27baf52a09b7ba2a3714f1e24e17ced386d.

The original set [1][2] was expected to undo a suboptimal fix in [2] and replace it with a better fix [1]. However, as reported by Dan Moulding, [2] causes an issue with raid5 with a journal device. Revert [2] for now to close the issue. We will follow up on another issue reported by Junxiao Bi, as [2] is expected to fix it. We believe this is a good trade-off, because the latter issue happens less frequently. In the meanwhile, we will NOT revert [1], as it contains the right logic.

[1] commit d6e035aad6c0 ("md: bypass block throttle for superblock update")
[2] commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")

Reported-by: Dan Moulding <[email protected]>
Closes: https://lore.kernel.org/linux-raid/[email protected]/
Fixes: bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d"")
Cc: [email protected] # v5.19+
Cc: Junxiao Bi <[email protected]>
Cc: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-03-01 | nbd: use the atomic queue limits API in nbd_set_size | Christoph Hellwig | 1 | -4/+11
Use queue_limits_start_update / queue_limits_commit_update to update all the limits in one go and with proper sanity checking.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
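The start/commit pattern looks roughly like this; the fields are standard queue_limits members, though the exact nbd logic shown is a simplified assumption:

  /* Sketch of an atomic limits update in nbd_set_size(). */
  struct queue_limits lim;
  int error;

  lim = queue_limits_start_update(nbd->disk->queue);
  lim.logical_block_size = blksize;
  lim.physical_block_size = blksize;
  error = queue_limits_commit_update(nbd->disk->queue, &lim);
  if (error)
  	return error;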
2024-03-01 | nbd: freeze the queue for queue limits updates | Christoph Hellwig | 1 | -1/+13
nbd currently updates the logical and physical block sizes as well as the discard_sectors on a live queue. Freeze the queue first to make sure there are no commands in flight that could see torn or inconsistent limits.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
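The freeze brackets the limit update, roughly as below (a sketch of the idea, not the literal patch):

  /* Sketch: no commands may be in flight while the limits change. */
  blk_mq_freeze_queue(nbd->disk->queue);
  /* ... update logical/physical block size and discard limits ... */
  blk_mq_unfreeze_queue(nbd->disk->queue);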
2024-03-01 | nbd: don't clear discard_sectors in nbd_config_put | Christoph Hellwig | 1 | -1/+2
nbd_config_put currently clears discard_sectors when unusing a device. This is pretty odd behavior and different from the sector size configuration, which is simply left in place and then reconfigured when nbd_set_size is called as part of configuring the device. Change nbd_set_size to clear discard_sectors if discard is not supported, so that all the queue limit changes are handled in one place.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-03-01 | pktcdvd: don't set max_hw_sectors on the underlying device | Christoph Hellwig | 1 | -5/+6
pktcdvd has been setting max_hw_sectors on the queue of the underlying device, which it doesn't own (and never resets), since the driver was merged. This can create all kinds of problems, as the underlying driver doesn't even know about the changed limit. As the stated purpose is to not create I/Os larger than a single frame, and pktcdvd never builds bios larger than that, just set REQ_NOMERGE on the bios it submits so that larger I/Os never get built.

Note: I don't have packet writing hardware, so this is compile tested only.

Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
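Marking a bio as non-mergeable is essentially a one-line change at submission time; a sketch of the idea (the surrounding submission path is assumed):

  /* Sketch: prevent the block layer from merging this bio with others,
   * so I/O to the underlying device never grows beyond a single frame.
   */
  bio->bi_opf |= REQ_NOMERGE;
  submit_bio(bio);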
2024-03-01 | dm: use queue_limits_set | Christoph Hellwig | 1 | -15/+12
Use queue_limits_set, which validates the limits and takes care of updating the readahead settings, instead of directly assigning them to the queue. For that, make sure all limits are actually updated before the assignment.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-03-01 | Merge tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block | Jens Axboe | 8 | -381/+549
Pull MD updates from Song:

"The major changes are:

 1. Refactor raid1 read_balance, by Yu Kuai and Paul Luse.
 2. Clean up and fix for md_ioctl, by Li Nan.
 3. Other small fixes, by Gui-Dong Han and Heming Zhao."

* tag 'md-6.9-20240301' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (22 commits)
  md/raid1: factor out helpers to choose the best rdev from read_balance()
  md/raid1: factor out the code to manage sequential IO
  md/raid1: factor out choose_bb_rdev() from read_balance()
  md/raid1: factor out choose_slow_rdev() from read_balance()
  md/raid1: factor out read_first_rdev() from read_balance()
  md/raid1-10: factor out a new helper raid1_should_read_first()
  md/raid1-10: add a helper raid1_check_read_range()
  md/raid1: fix choose next idle in read_balance()
  md/raid1: record nonrot rdevs while adding/removing rdevs to conf
  md/raid1: factor out helpers to add rdev to conf
  md: add a new helper rdev_has_badblock()
  md/raid5: fix atomicity violation in raid5_cache_count
  md/md-bitmap: fix incorrect usage for sb_index
  md: check mddev->pers before calling md_set_readonly()
  md: clean up openers check in do_md_stop() and md_set_readonly()
  md: sync blockdev before stopping raid or setting readonly
  md: factor out a helper to sync mddev
  md: Don't clear MD_CLOSING when the raid is about to stop
  md: return directly before setting did_set_md_closing
  md: clean up invalid BUG_ON in md_ioctl
  ...
2024-02-29 | md/raid1: factor out helpers to choose the best rdev from read_balance() | Yu Kuai | 1 | -77/+98
The way that the best rdev is chosen:

1) If the read is sequential from one rdev:
   - if the rdev is rotational, use this rdev;
   - if the rdev is non-rotational, use this rdev until the total read length exceeds the disk's optimal IO size;
2) If the read is not sequential:
   - if there is an idle disk, use it; otherwise:
   - if the array has a non-rotational disk, choose the rdev with minimal inflight IO;
   - if all the underlying disks are rotational, choose the rdev with the closest IO.

There are no functional changes; this just makes the code cleaner and prepares for the following refactor.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: factor out the code to manage sequential IO | Yu Kuai | 1 | -34/+37
There is no functional change for now; this makes read_balance() cleaner and prepares to fix problems and refactor the handling of sequential IO.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: factor out choose_bb_rdev() from read_balance() | Yu Kuai | 1 | -31/+48
read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading from an rdev with bad blocks from read_balance(); there are no functional changes.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: factor out choose_slow_rdev() from read_balance() | Yu Kuai | 1 | -17/+52
read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading from a slow rdev from read_balance(); there are no functional changes.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: factor out read_first_rdev() from read_balance() | Yu Kuai | 1 | -17/+46
read_balance() is hard to understand because there are too many states and branches, and it's overlong. This patch factors out the case of reading from the first rdev from read_balance(); there are no functional changes.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1-10: factor out a new helper raid1_should_read_first() | Yu Kuai | 3 | -24/+24
If resync is in progress, read_balance() should find the first usable disk; otherwise, data could be inconsistent after resync is done. raid1 and raid10 implement the same check, hence factor out the check to make the code cleaner. Note that raid1 is using 'mddev->recovery_cp', which is updated after all resync IO is done, while raid10 is using 'conf->next_resync', which is inaccurate because raid10 updates it before submitting resync IO. Fortunately, raid10 read IO can't run concurrently with resync IO, hence there is no problem. This patch also switches raid10 to use 'mddev->recovery_cp'.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1-10: add a helper raid1_check_read_range() | Yu Kuai | 1 | -0/+49
The checking and handling of bad blocks appear many times during read_balance() in raid1 and raid10. This helper will be used in later patches to simplify read_balance() a lot.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: fix choose next idle in read_balance() | Yu Kuai | 1 | -10/+22
Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added the choose-next-idle case in read_balance():

  read_balance:
   for_each_rdev
    if (next_seq_sect == this_sector || dist == 0)	-> sequential reads
     best_disk = disk;
     if (...)
      choose_next_idle = 1
      continue;

   for_each_rdev				-> iterate to the next rdev
    if (pending == 0)
     best_disk = disk;				-> choose the next idle disk
     break;

    if (choose_next_idle)			-> keep using this rdev if there is no other idle disk
     continue;

However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.") removed the code:

  -	/* If device is idle, use it */
  -	if (pending == 0) {
  -		best_disk = disk;
  -		break;
  -	}

Hence choose next idle will never work now. Fix this problem as follows:

1) don't set best_disk in this case; read_balance() will choose the best disk after iterating over all the disks;
2) add 'pending' so that another idle disk will be chosen;
3) add a new local variable 'sequential_disk' to record the disk, and if there is no other idle disk, 'sequential_disk' will be chosen.

Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: record nonrot rdevs while adding/removing rdevs to conf | Yu Kuai | 3 | -7/+12
For raid1, each read will iterate over all the rdevs from conf and check if any rdev is non-rotational, then choose the rdev with minimal inflight IO if so, or the rdev with the closest distance otherwise. Disk nonrot info can be changed through the sysfs entry:

  /sys/block/[disk_name]/queue/rotational

However, this should only be used for testing, and users really shouldn't do this in real life. Record the number of non-rotational disks in conf, to avoid checking each rdev in the IO fast path and to simplify read_balance() a little bit.

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md/raid1: factor out helpers to add rdev to conf | Yu Kuai | 1 | -32/+53
There are no functional changes; this just makes the code cleaner and prepares to record disk non-rotational information while adding and removing rdevs to conf.

Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-29 | md: add a new helper rdev_has_badblock() | Yu Kuai | 4 | -72/+44
The current api is_badblock() must be passed 'first_bad' and 'bad_sectors'; however, many callers just want to know whether there are badblocks or not, and these callers must define two local variables that will never be used. Add a new helper rdev_has_badblock() that only returns whether there are badblocks or not, remove the unnecessary local variables, and replace is_badblock() with the new helper in many places. There are no functional changes, and the new helper will also be used later to refactor read_balance().

Co-developed-by: Paul Luse <[email protected]>
Signed-off-by: Paul Luse <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
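Given that description, the helper is presumably a thin wrapper along these lines (the exact return convention is an assumption):

  /* Sketch: hide the unused out-parameters of is_badblock(). */
  static inline int rdev_has_badblock(struct md_rdev *rdev, sector_t s,
  				    int sectors)
  {
  	sector_t first_bad;
  	int bad_sectors;

  	return is_badblock(rdev, s, sectors, &first_bad, &bad_sectors);
  }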
2024-02-28 | ublk: add UBLK_CMD_DEL_DEV_ASYNC | Ming Lei | 1 | -3/+6
The current command UBLK_CMD_DEL_DEV won't return until the device is released. This way looks more reliable, but makes userspace more difficult to implement, especially regarding ordering: unmap the command buffer (which holds one ublkc reference), ublkc close, io_uring_file_unregister, ublkb close. Add UBLK_CMD_DEL_DEV_ASYNC so that device deletion won't wait for release; then userspace needn't worry about the above order. Actually both loop and nbd are deleted in this async way.

Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-28 | ublk: improve getting & putting ublk device | Ming Lei | 1 | -5/+7
Firstly convert get_device() and put_device() into ublk_get_device() and ublk_put_device(). Secondly annotate ublk_get_device() & ublk_put_device() as noinline for tracing; in particular, it is easy to trigger a device deletion hang when an incorrect order is used for ublkc mmap, ublkc close, io_uring_sqe_unregister_file, ublkb close.

Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-27 | md/raid5: fix atomicity violation in raid5_cache_count | Gui-Dong Han | 1 | -6/+8
In raid5_cache_count():

  if (conf->max_nr_stripes < conf->min_nr_stripes)
  	return 0;
  return conf->max_nr_stripes - conf->min_nr_stripes;

The current check is ineffective, as the values could change immediately after being checked. In raid5_set_cache_size():

  ...
  conf->min_nr_stripes = size;
  ...
  while (size > conf->max_nr_stripes)
  	conf->min_nr_stripes = conf->max_nr_stripes;
  ...

Due to intermediate value updates in raid5_set_cache_size(), concurrent execution of raid5_cache_count() and raid5_set_cache_size() may lead to inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes. The current checks are ineffective, as the values could change immediately after being checked, raising the risk of conf->min_nr_stripes exceeding conf->max_nr_stripes and potentially causing an integer overflow.

This possible bug was found by an experimental static analysis tool developed by our team. The tool analyzes the locking APIs to extract function pairs that can be concurrently executed, and then analyzes the instructions in the paired functions to identify possible concurrency bugs, including data races and atomicity violations. The above possible bug was reported when our tool analyzed the source code of Linux 6.2.

To resolve this issue, it is suggested to introduce local variables 'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the values remain stable throughout the check. Adding locks in raid5_cache_count() fails to resolve the atomicity violation, as raid5_set_cache_size() may hold intermediate values of conf->min_nr_stripes while unlocked. With this patch applied, our tool no longer reports the bug, with the kernel configuration allyesconfig for x86_64. Due to the lack of associated hardware, we could not test the patch at runtime and have just verified it according to the code logic.

Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: [email protected]
Signed-off-by: Gui-Dong Han <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
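A sketch of the suggested fix, snapshotting both counters once so the check and the subtraction operate on consistent values (READ_ONCE and the shrinker private_data access are illustrative assumptions, not the literal patch):

  /* Sketch: read each counter a single time to get a stable snapshot. */
  static unsigned long raid5_cache_count(struct shrinker *shrink,
  				       struct shrink_control *sc)
  {
  	struct r5conf *conf = shrink->private_data;	/* assumed field */
  	int max_stripes = READ_ONCE(conf->max_nr_stripes);
  	int min_stripes = READ_ONCE(conf->min_nr_stripes);

  	if (max_stripes < min_stripes)
  		return 0;
  	return max_stripes - min_stripes;
  }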
2024-02-27 | xen-blkfront: atomically update queue limits | Christoph Hellwig | 1 | -18/+23
Pass the initial queue limits to blk_mq_alloc_disk and use the blkif_set_queue_limits API to update the limits on reconnect.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Roger Pau Monné <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-27 | xen-blkfront: don't redundantly set max_segments in blkif_recover | Christoph Hellwig | 1 | -5/+3
blkif_set_queue_limits already sets the max_segments limit, so don't do it a second time. Also remove a comment about a long-fixed bug in blk_mq_update_nr_hw_queues.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Roger Pau Monné <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-27 | xen-blkfront: rely on the default discard granularity | Christoph Hellwig | 1 | -2/+2
The block layer now sets the discard granularity to the physical block size by default. Take advantage of that in xen-blkfront and only set the discard granularity if it is explicitly specified.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Roger Pau Monné <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-27 | xen-blkfront: set max_discard/secure erase limits to UINT_MAX | Christoph Hellwig | 1 | -4/+2
Currently xen-blkfront sets the max discard limit to the capacity of the device, which is suboptimal when the capacity changes. Just set it to UINT_MAX, which has the same effect and is simpler.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Roger Pau Monné <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2024-02-26 | md/md-bitmap: fix incorrect usage for sb_index | Heming Zhao | 1 | -3/+6
Commit d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file") removed page->index from the bitmap code, but left wrong code logic for clustered-md. The current code never sets the slot offset for cluster nodes, which will sometimes cause a crash in a clustered env.

Call trace (partly):

  md_bitmap_file_set_bit+0x110/0x1d8 [md_mod]
  md_bitmap_startwrite+0x13c/0x240 [md_mod]
  raid1_make_request+0x6b0/0x1c08 [raid1]
  md_handle_request+0x1dc/0x368 [md_mod]
  md_submit_bio+0x80/0xf8 [md_mod]
  __submit_bio+0x178/0x300
  submit_bio_noacct_nocheck+0x11c/0x338
  submit_bio_noacct+0x134/0x614
  submit_bio+0x28/0xdc
  submit_bh_wbc+0x130/0x1cc
  submit_bh+0x1c/0x28

Fixes: d7038f951828 ("md-bitmap: don't use ->index for pages backing the bitmap file")
Cc: [email protected] # v6.6+
Signed-off-by: Heming Zhao <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: check mddev->pers before calling md_set_readonly() | Li Nan | 1 | -11/+11
If 'mddev->pers' is NULL, there is nothing to do in md_set_readonly(). Except for md_ioctl(), the other two callers of md_set_readonly() have already checked 'mddev->pers'. To simplify the code, move the check of 'mddev->pers' to the caller.

Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: clean up openers check in do_md_stop() and md_set_readonly() | Li Nan | 1 | -23/+14
Before stopping or setting readonly, mddev_set_closing_and_sync_blockdev() is always called to check the openers, so there is no longer a need to check them again in do_md_stop() and md_set_readonly(). Clean it up.

Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: sync blockdev before stopping raid or setting readonly | Li Nan | 1 | -0/+16
Commit a05b7ea03d72 ("md: avoid crash when stopping md array races with closing other open fds.") added a blockdev sync before stopping raid and setting readonly. Later, in commit 260fa034ef7a ("md: avoid deadlock when dirty buffers during md_stop."), it was moved to the ioctl path, and array_state_store() was overlooked. Add the blockdev sync to array_state_store() now.

Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: factor out a helper to sync mddev | Li Nan | 1 | -11/+21
There are no functional changes; this prepares for syncing the mddev in array_state_store().

Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: Don't clear MD_CLOSING when the raid is about to stop | Li Nan | 1 | -4/+10
The raid should not be opened anymore when it is about to be stopped. However, other processes can open it again if the flag MD_CLOSING is cleared before exiting. From now on, this flag will not be cleared when the raid is about to be stopped.

Fixes: 065e519e71b2 ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Signed-off-by: Li Nan <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: return directly before setting did_set_md_closing | Li Nan | 1 | -17/+8
There is nothing to do at 'out' before setting 'did_set_md_closing' in md_ioctl(). Return directly, and it will help us remove 'did_set_md_closing' later.

Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-26 | md: clean up invalid BUG_ON in md_ioctl | Li Nan | 1 | -5/+0
'disk->private_data' is set to mddev in md_alloc() and never set to NULL, and users need to open the mddev before submitting an ioctl. So the mddev must not have been freed during the ioctl, and there is no need to check it here. Clean it up.

Signed-off-by: Li Nan <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]