aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-09-28s390/dasd: use blk_mq_alloc_diskChristoph Hellwig7-88/+39
As far as I can tell there is no need for the staged setup in dasd, so allocate the tagset and the disk with the queue in dasd_gendisk_alloc. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Stefan Haberland <[email protected]> Signed-off-by: Stefan Haberland <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27blk-cgroup: don't update the blkg lookup hint in blkg_conf_prepChristoph Hellwig1-13/+4
blkg_conf_prep just creates a new blkg structure, there is no real need to update the lookup hint which should only be done on a successful lookup in the I/O path. Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27blk-mq: use quiesced elevator switch when reinitializing queuesKeith Busch3-7/+6
The hctx's run_work may be racing with the elevator switch when reinitializing hardware queues. The queue is merely frozen in this context, but that only prevents requests from allocating and doesn't stop the hctx work from running. The work may get an elevator pointer that's being torn down, and can result in use-after-free errors and kernel panics (example below). Use the quiesced elevator switch instead, and make the previous one static since it is now only used locally. nvme nvme0: resetting controller nvme nvme0: 32/0/0 default/read/poll queues BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0 Oops: 0000 [#1] SMP PTI Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:kyber_has_work+0x29/0x70 ... Call Trace: __blk_mq_do_dispatch_sched+0x83/0x2b0 __blk_mq_sched_dispatch_requests+0x12e/0x170 blk_mq_sched_dispatch_requests+0x30/0x60 __blk_mq_run_hw_queue+0x2b/0x50 process_one_work+0x1ef/0x380 worker_thread+0x2d/0x3e0 Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig5-8/+10
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: pass a gendisk to the blkg allocation helpersChristoph Hellwig1-28/+28
Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: pass a gendisk to blkcg_schedule_throttleChristoph Hellwig5-10/+11
Pass the gendisk to blkcg_schedule_throttle as part of moving the blk-cgroup infrastructure to be gendisk based. Remove the unused !BLK_CGROUP stub while we're at it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: pass a gendisk to blkg_destroy_allChristoph Hellwig1-9/+4
Pass the gendisk to blkg_destroy_all as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-throttle: pass a gendisk to blk_throtl_cancel_biosChristoph Hellwig3-4/+5
Pass the gendisk to blk_throtl_cancel_bios as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-throttle: pass a gendisk to blk_throtl_register_queueChristoph Hellwig3-4/+5
Pass the gendisk to blk_throtl_register_queue as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-throttle: pass a gendisk to blk_throtl_init and blk_throtl_exitChristoph Hellwig3-9/+12
Pass the gendisk to blk_throtl_init and blk_throtl_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-iocost: cleanup ioc_qos_writeChristoph Hellwig1-6/+8
Use a local disk variable instead of retrieving the disk and request_queue over and over by various means. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-iocost: pass a gendisk to blk_iocost_initChristoph Hellwig1-3/+4
Pass the gendisk to blk_iocost_init as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-iocost: simplify ioc_nameChristoph Hellwig1-9/+5
Just directly dereference the disk name instead of going through multiple hoops to find the same value. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-iolatency: pass a gendisk to blk_iolatency_initChristoph Hellwig3-4/+5
Pass the gendisk to blk_iolatency_init as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: missed inline for blk_iolatency_init() and !CONFIG_BLK_CGROUP_IOLATENCY] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exitChristoph Hellwig3-10/+10
Pass the gendisk to blk_ioprio_init and blk_ioprio_exit as part of moving the blk-cgroup infrastructure to be gendisk based. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: pass a gendisk to blkcg_init_queue and blkcg_exit_queueChristoph Hellwig3-26/+12
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving the blk-cgroup infrastructure to be gendisk based. Also remove the rather pointless kerneldoc comments for these internal functions with a single caller each. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: remove blkg_lookup_checkChristoph Hellwig1-26/+10
The combinations of an error check with an ERR_PTR return and a lookup with a NULL return leads to ugly handling of the return values in the callers. Just open coding the check and the lookup is much simpler. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: cleanup the blkg_lookup family of functionsChristoph Hellwig2-50/+27
Add a fully inlined blkg_lookup as the extra two checks aren't going to generated a lot more code vs the call to the slowpath routine, and open code the hint update in the two callers that care. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: remove open coded blkg_lookup instancesChristoph Hellwig2-7/+7
Use blkg_lookup instead of open coding it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: remove blk_queue_root_blkgChristoph Hellwig2-15/+1
Just open code it in the only caller and drop the unused !BLK_CGROUP stub. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-26blk-cgroup: fix error unwinding in blkcg_init_queueChristoph Hellwig1-6/+7
When blk_throtl_init fails, we need to call blk_ioprio_exit. Switch to proper goto based unwinding to fix this. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Andreas Herrmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-24blk-mq: don't redirect completion for hctx withs only one ctx mappingLiu Song1-3/+5
High-performance NVMe devices usually support a large hw queues, which ensures a 1:1 mapping of hctx and ctx. In this case there will be no remote request, so we don't need to care about it. Signed-off-by: Liu Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-24blk-throttle: improve bypassing bios checkingsYu Kuai2-7/+28
"tg->has_rules" is extended to "tg->has_rules_iops/bps", thus bios that don't need to be throttled can be checked accurately. With this patch, bio will be throttled if: 1) Bio is read/write, and corresponding read/write iops limit exist. 2) If corresponding doesn't exist, corresponding bps limit exist and bio is not throttled before. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-24blk-throttle: remove THROTL_TG_HAS_IOPS_LIMITYu Kuai2-21/+3
Currently, "tg->has_rules" and "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" both try to bypass bios that don't need to be throttled, however, they are a little redundant and both not perfect: 1) "tg->has_rules" only distinguish read and write, but not iops and bps limit. 2) "tg->flags & THROTL_TG_HAS_IOPS_LIMIT" only check if iops limit exist, read and write is not distinguished, and bps limit is not checked. tg->has_rules will extended to distinguish bps and iops in the following patch. There is no need to keep the flag. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY supportZiyangZhang2-1/+118
START_USER_RECOVERY and END_USER_RECOVERY are two new control commands to support user recovery feature. After a crash, user should send START_USER_RECOVERY, it will: (1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was set by quiesce_work and (b)chardev is released (2) reinit all ubqs, including: (a) put the task_struct and reset ->ubq_daemon to NULL. (b) reset all ublk_io. (3) reset ub->mm to NULL. Then, user should start a new process and send FETCH_REQ on each ubq_daemon. Finally, user should send END_USER_RECOVERY, it will: (1) wait for all new ubq_daemons getting ready. (2) update ublksrv_pid (3) unquiesce the request queue and expect incoming ublk_queue_rq() (4) convert ub's state to UBLK_S_DEV_LIVE Note: we can handle STOP_DEV between START_USER_RECOVERY and END_USER_RECOVERY. This is helpful to users who cannot start new process after sending START_USER_RECOVERY ctrl-cmd. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: support UBLK_F_USER_RECOVERY_REISSUEZiyangZhang2-4/+20
UBLK_F_USER_RECOVERY_REISSUE implies that: With a dying ubq_daemon, ublk_drv let monitor_work requeues rq issued to userspace(ublksrv) before the ubq_daemon is dying. UBLK_F_USER_RECOVERY_REISSUE is designed for backends which: (1) tolerate double-write since ublk_drv may issue the same rq twice. (2) does not let frontend users get I/O error, such as read-only FS and VM backend. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: consider recovery feature in aborting mechanismZiyangZhang1-6/+110
With USER_RECOVERY feature enabled, the monitor_work schedules quiesce_work after finding a dying ubq_daemon. The monitor_work should also abort all rqs issued to userspace before the ubq_daemon is dying. The quiesce_work's job is to: (1) quiesce request queue. (2) check if there is any INFLIGHT rq. If so, we retry until all these rqs are requeued and become IDLE. These rqs should be requeued by ublk_queue_rq(), task work, io_uring fallback wq or monitor_work. (3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to do so because no ioucmd can be referenced now. (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for recovery. This state is exposed to userspace by GET_DEV_INFO. The driver can always handle STOP_DEV and cleanup everything no matter ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED, user can recover with new process. Note: we do not change the default behavior with reocvery feature disabled. monitor_work still schedules stop_work and abort inflight rqs. And finally ublk_device is released. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: requeue rqs with recovery feature enabledZiyangZhang1-5/+15
With recovery feature enabled, in ublk_queue_rq or task work (in exit_task_work or fallback wq), we requeue rqs instead of ending(aborting) them. Besides, No matter recovery feature is enabled or disabled, we schedule monitor_work immediately. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: define macros for recovery feature and check themZiyangZhang2-1/+20
Define some macros for recovery feature. UBLK_S_DEV_QUIESCED implies that ublk_device is quiesced and is ready for recovery. This state can be observed by userspace. UBLK_F_USER_RECOVERY implies that: (1) ublk_drv enables recovery feature. It won't let monitor_work to automatically abort rqs and release the device. (2) With a dying ubq_daemon, ublk_drv ends(aborts) rqs issued to userspace(ublksrv) before crash. (3) With a dying ubq_daemon, in task work and ublk_queue_rq(), ublk_drv requeues rqs. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23ublk_drv: check 'current' instead of 'ubq_daemon'ZiyangZhang1-2/+10
This check is not atomic. So with recovery feature, ubq_daemon may be modified simultaneously by recovery task. Instead, check 'current' is safe here because 'current' never changes. Also add comment explaining this check, which is really important for understanding recovery feature. Signed-off-by: ZiyangZhang <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-23Merge branch 'md-next' of ↵Jens Axboe7-142/+204
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.1/block Pull MD updates and fixes from Song: "1. Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. 2. Raid10 performance optimization, by Yu Kuai." * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: Fix spelling mistake in comments of r5l_log md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5d md/raid10: convert resync_lock to use seqlock md/raid10: fix improper BUG_ON() in raise_barrier() md/raid10: prevent unnecessary calls to wake_up() in fast path md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowait md/raid10: factor out code from wait_barrier() to stop_waiting_barrier() md: Remove extra mddev_get() in md_seq_start() md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk() md/raid5: Ensure stripe_fill happens on non-read IO with journal md/raid5: Don't read ->active_stripes if it's not needed md/raid5: Cleanup prototype of raid5_get_active_stripe() md/raid5: Drop extern on function declarations in raid5.h md/raid5: Refactor raid5_get_active_stripe() md: Replace snprintf with scnprintf md/raid10: fix compile warning md/raid5: Fix spelling mistakes in comments
2022-09-22md: Fix spelling mistake in comments of r5l_logZhou nan1-1/+1
Fix spelling of dones't in comments. Signed-off-by: Zhou nan <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Wait for MD_SB_CHANGE_PENDING in raid5dLogan Gunthorpe1-0/+12
A complicated deadlock exists when using the journal and an elevated group_thrtead_cnt. It was found with loop devices, but its not clear whether it can be seen with real disks. The deadlock can occur simply by writing data with an fio script. When the deadlock occurs, multiple threads will hang in different ways: 1) The group threads will hang in the blk-wbt code with bios waiting to be submitted to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 ops_run_io+0x46b/0x1a30 handle_stripe+0xcd3/0x36b0 handle_active_stripes.constprop.0+0x6f6/0xa60 raid5_do_work+0x177/0x330 Or: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 flush_deferred_bios+0x136/0x170 raid5_do_work+0x262/0x330 2) The r5l_reclaim thread will hang in the same way, submitting a bio to the block layer: io_schedule+0x70/0xb0 rq_qos_wait+0x153/0x210 wbt_wait+0x115/0x1b0 __rq_qos_throttle+0x38/0x60 blk_mq_submit_bio+0x589/0xcd0 __submit_bio+0xe6/0x100 submit_bio_noacct_nocheck+0x42e/0x470 submit_bio_noacct+0x4c2/0xbb0 submit_bio+0x3f/0xf0 md_super_write+0x12f/0x1b0 md_update_sb.part.0+0x7c6/0xff0 md_update_sb+0x30/0x60 r5l_do_reclaim+0x4f9/0x5e0 r5l_reclaim_thread+0x69/0x30b However, before hanging, the MD_SB_CHANGE_PENDING flag will be set for sb_flags in r5l_write_super_and_discard_space(). This flag will never be cleared because the submit_bio() call never returns. 3) Due to the MD_SB_CHANGE_PENDING flag being set, handle_stripe() will do no processing on any pending stripes and re-set STRIPE_HANDLE. This will cause the raid5d thread to enter an infinite loop, constantly trying to handle the same stripes stuck in the queue. The raid5d thread has a blk_plug that holds a number of bios that are also stuck waiting seeing the thread is in a loop that never schedules. These bios have been accounted for by blk-wbt thus preventing the other threads above from continuing when they try to submit bios. --Deadlock. To fix this, add the same wait_event() that is used in raid5_do_work() to raid5d() such that if MD_SB_CHANGE_PENDING is set, the thread will schedule and wait until the flag is cleared. The schedule action will flush the plug which will allow the r5l_reclaim thread to continue, thus preventing the deadlock. However, md_check_recovery() calls can also clear MD_SB_CHANGE_PENDING from the same thread and can thus deadlock if the thread is put to sleep. So avoid waiting if md_check_recovery() is being called in the loop. It's not clear when the deadlock was introduced, but the similar wait_event() call in raid5_do_work() was added in 2017 by this commit: 16d997b78b15 ("md/raid5: simplfy delaying of writes while metadata is updated.") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22Merge branch 'md-next-raid10-optimize' into md-nextSong Liu2-55/+96
This patchset tries to avoid that two locks are held unconditionally in hot path. Test environment: Architecture: aarch64 Huawei KUNPENG 920 x86 Intel(R) Xeon(R) Platinum 8380 Raid10 initialize: mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 \ /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 Test cmd: (task set -c 0-15) fio -name=0 -ioengine=libaio -direct=1 -\ group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers \ -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 \ -rw=randread Test result: aarch64: before this patchset: 3.2 GiB/s bind node before this patchset: 6.9 Gib/s after this patchset: 7.9 Gib/s bind node after this patchset: 8.0 Gib/s x86:(bind node is not tested yet) before this patchset: 7.0 GiB/s after this patchset : 9.3 GiB/s Please noted that in the test machine, memory access latency is very bad across nodes compare to local node in aarch64, which is why bandwidth while bind node is much better.
2022-09-22md/raid10: convert resync_lock to use seqlockYu Kuai2-30/+59
Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier', and io can't be dispatched until 'barrier' is dropped. Since holding the 'barrier' is not common, convert 'resync_lock' to use seqlock so that holding lock can be avoided in fast path. Signed-off-by: Yu Kuai <[email protected]> Reviewed-and-Tested-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid10: fix improper BUG_ON() in raise_barrier()Yu Kuai1-1/+1
'conf->barrier' is protected by 'conf->resync_lock', reading 'conf->barrier' without holding the lock is wrong. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid10: prevent unnecessary calls to wake_up() in fast pathYu Kuai1-2/+8
Currently, wake_up() is called unconditionally in fast path such as raid10_make_request(), which will cause lock contention under high concurrency: raid10_make_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty in allow_barrier() and raid10_make_request(). Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid10: don't modify 'nr_waitng' in wait_barrier() for the case nowaitYu Kuai1-2/+2
For the case nowait in wait_barrier(), there is no point to increase nr_waiting and then decrease it. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid10: factor out code from wait_barrier() to stop_waiting_barrier()Yu Kuai1-22/+28
Currently the nasty condition in wait_barrier() is hard to read. This patch factors out the condition into a function. There are no functional changes. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Paul Menzel <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md: Remove extra mddev_get() in md_seq_start()Logan Gunthorpe1-1/+0
A regression is seen where mddev devices stay permanently after they are stopped due to an elevated reference count. This was tracked down to an extra mddev_get() in md_seq_start(). It only happened rarely because most of the time the md_seq_start() is called with a zero offset. The path with an extra mddev_get() only happens when it starts with a non-zero offset. The commit noted below changed an mddev_get() to check its success but inadvertently left the original call in. Remove the extra call. Fixes: 12a6caf27324 ("md: only delete entries from all_mddevs when the disk is freed") Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Remove unnecessary bio_put() in raid5_read_one_chunk()David Sloan1-1/+0
When running chunk-sized reads on disks with badblocks duplicate bio free/puts are observed: ============================================================================= BUG bio-200 (Not tainted): Object already free ----------------------------------------------------------------------------- Allocated in mempool_alloc_slab+0x17/0x20 age=3 cpu=2 pid=7504 __slab_alloc.constprop.0+0x5a/0xb0 kmem_cache_alloc+0x31e/0x330 mempool_alloc_slab+0x17/0x20 mempool_alloc+0x100/0x2b0 bio_alloc_bioset+0x181/0x460 do_mpage_readpage+0x776/0xd00 mpage_readahead+0x166/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 force_page_cache_ra+0x181/0x1c0 page_cache_sync_ra+0x65/0xb0 filemap_get_pages+0x1df/0xaf0 filemap_read+0x1e1/0x700 blkdev_read_iter+0x1e5/0x330 vfs_read+0x42a/0x570 Freed in mempool_free_slab+0x17/0x20 age=3 cpu=2 pid=7504 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 raid5_make_request+0x2259/0x2450 md_handle_request+0x402/0x600 md_submit_bio+0xd9/0x120 __submit_bio+0x11f/0x1b0 submit_bio_noacct_nocheck+0x204/0x480 submit_bio_noacct+0x32e/0xc70 submit_bio+0x98/0x1a0 mpage_readahead+0x250/0x320 blkdev_readahead+0x15/0x20 read_pages+0x13f/0x5f0 page_cache_ra_unbounded+0x18d/0x220 Slab 0xffffea000481b600 objects=21 used=0 fp=0xffff8881206d8940 flags=0x17ffffc0010201(locked|slab|head|node=0|zone=2|lastcpupid=0x1fffff) CPU: 0 PID: 34525 Comm: kworker/u24:2 Not tainted 6.0.0-rc2-localyes-265166-gf11c5343fa3f #143 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: raid5wq raid5_do_work Call Trace: <TASK> dump_stack_lvl+0x5a/0x78 dump_stack+0x10/0x16 print_trailer+0x158/0x165 object_err+0x35/0x50 free_debug_processing.cold+0xb7/0xbe __slab_free+0x1ae/0x330 kmem_cache_free+0x46d/0x490 mempool_free_slab+0x17/0x20 mempool_free+0x66/0x190 bio_free+0x78/0x90 bio_put+0x100/0x1a0 mpage_end_io+0x36/0x150 bio_endio+0x2fd/0x360 md_end_io_acct+0x7e/0x90 bio_endio+0x2fd/0x360 handle_failed_stripe+0x960/0xb80 handle_stripe+0x1348/0x3760 handle_active_stripes.constprop.0+0x72a/0xaf0 raid5_do_work+0x177/0x330 process_one_work+0x616/0xb20 worker_thread+0x2bd/0x6f0 kthread+0x179/0x1b0 ret_from_fork+0x22/0x30 </TASK> The double free is caused by an unnecessary bio_put() in the if(is_badblock(...)) error path in raid5_read_one_chunk(). The error path was moved ahead of bio_alloc_clone() in c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk"). The previous code checked and freed align_bio which required a bio_put. After the move that is no longer needed as raid_bio is returned to the control of the common io path which performs its own endio resulting in a double free on bad device blocks. Fixes: c82aa1b76787c ("md/raid5: move checking badblock before clone bio in raid5_read_one_chunk") Signed-off-by: David Sloan <[email protected]> Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Ensure stripe_fill happens on non-read IO with journalLogan Gunthorpe1-1/+1
When doing degrade/recover tests using the journal a kernel BUG is hit at drivers/md/raid5.c:4381 in handle_parity_checks5(): BUG_ON(!test_bit(R5_UPTODATE, &dev->flags)); This was found to occur because handle_stripe_fill() was skipped for stripes in the journal due to a condition in that function. Thus blocks were not fetched and R5_UPTODATE was not set when the code reached handle_parity_checks5(). To fix this, don't skip handle_stripe_fill() unless the stripe is for read. Fixes: 07e83364845e ("md/r5cache: shift complex rmw from read path to write path") Link: https://lore.kernel.org/linux-raid/[email protected]/ Suggested-by: Song Liu <[email protected]> Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Don't read ->active_stripes if it's not neededLogan Gunthorpe1-3/+2
The atomic_read() is not needed in many cases so only do the read after the first checks are done. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Cleanup prototype of raid5_get_active_stripe()Logan Gunthorpe3-25/+39
Drop the three bools in the prototype of raid5_get_active_stripe() and replace them with a flags parameter. At the same time, drop the distinction with __raid5_get_active_stripe(). Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Drop extern on function declarations in raid5.hLogan Gunthorpe1-12/+10
externs should not be used in function declarations, so clean those up. Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Refactor raid5_get_active_stripe()Logan Gunthorpe1-41/+41
Refactor raid5_get_active_stripe() without the gotos with an explicit infinite loop and some additional nesting. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md: Replace snprintf with scnprintfSaurabh Sengar1-1/+1
Current code produces a warning as shown below when total characters in the constituent block device names plus the slashes exceeds 200. snprintf() returns the number of characters generated from the given input, which could cause the expression “200 – len” to wrap around to a large positive number. Fix this by using scnprintf() instead, which returns the actual number of characters written into the buffer. [ 1513.267938] ------------[ cut here ]------------ [ 1513.267943] WARNING: CPU: 15 PID: 37247 at <snip>/lib/vsprintf.c:2509 vsnprintf+0x2c8/0x510 [ 1513.267944] Modules linked in: <snip> [ 1513.267969] CPU: 15 PID: 37247 Comm: mdadm Not tainted 5.4.0-1085-azure #90~18.04.1-Ubuntu [ 1513.267969] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 05/09/2022 [ 1513.267971] RIP: 0010:vsnprintf+0x2c8/0x510 <-snip-> [ 1513.267982] Call Trace: [ 1513.267986] snprintf+0x45/0x70 [ 1513.267990] ? disk_name+0x71/0xa0 [ 1513.267993] dump_zones+0x114/0x240 [raid0] [ 1513.267996] ? _cond_resched+0x19/0x40 [ 1513.267998] raid0_run+0x19e/0x270 [raid0] [ 1513.268000] md_run+0x5e0/0xc50 [ 1513.268003] ? security_capable+0x3f/0x60 [ 1513.268005] do_md_run+0x19/0x110 [ 1513.268006] md_ioctl+0x195e/0x1f90 [ 1513.268007] blkdev_ioctl+0x91f/0x9f0 [ 1513.268010] block_ioctl+0x3d/0x50 [ 1513.268012] do_vfs_ioctl+0xa9/0x640 [ 1513.268014] ? __fput+0x162/0x260 [ 1513.268016] ksys_ioctl+0x75/0x80 [ 1513.268017] __x64_sys_ioctl+0x1a/0x20 [ 1513.268019] do_syscall_64+0x5e/0x200 [ 1513.268021] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 766038846e875 ("md/raid0: replace printk() with pr_*()") Reviewed-by: Michael Kelley <[email protected]> Acked-by: Guoqing Jiang <[email protected]> Signed-off-by: Saurabh Sengar <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid10: fix compile warningGuoqing Jiang1-1/+1
With W=1, compiler complains. drivers/md/raid10.c:1983: warning: bad line: Signed-off-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-22md/raid5: Fix spelling mistakes in commentsXU pengfei1-3/+3
Fix spelling of 'waitting' in comments. Signed-off-by: XU pengfei <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-09-21block/blk-rq-qos: delete useless enmu RQ_QOS_IOPRIOLi Jinlin2-3/+0
Since blk-ioprio handing was converted from a rqos policy to a direct call, RQ_QOS_IOPRIO is not used anymore, just delete it. Signed-off-by: Li Jinlin <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>