aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-15md: Hold mddev->reconfig_mutex when trying to get mddev->sync_threadLi Lingfeng7-14/+15
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mdev to md_unregister_thread(). Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15md/raid10: fix a 'conf->barrier' leakage in raid10_takeover()Yu Kuai1-1/+0
After commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()"), 'conf->barrier' will be leaked in the case that raid10 takeover raid0: level_store pers->takeover -> raid10_takeover raid10_takeover_raid0 WRITE_ONCE(conf->barrier, 1) mddev_suspend // still raid0 mddev->pers = pers // switch to raid10 mddev_resume // resume without suspend After the above commit, mddev_resume() will not decrease 'conf->barrier' that is set in raid10_takeover_raid0(). Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0(). By the way, this problem is found while I'm trying to make mddev_suspend/resume() to be independent from raid personalities. raid10 is the only personality to use reference count in the quiesce() callback and this problem is only related to raid10. Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()") Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Paul Menzel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15md: raid1: fix potential OOB in raid1_remove_disk()Zhang Shurong1-0/+4
If rddev->raid_disk is greater than mddev->raid_disks, there will be an out-of-bounds in raid1_remove_disk(). We have already found similar reports as follows: 1) commit d17f744e883b ("md-raid10: fix KASAN warning") 2) commit 1ebc2cec0b7d ("dm raid: fix KASAN warning in raid5_remove_disk") Fix this bug by checking whether the "number" variable is valid. Signed-off-by: Zhang Shurong <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15md/raid5-cache: fix a deadlock in r5l_exit_log()Yu Kuai1-3/+6
Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") introduce a new problem: // caller hold reconfig_mutex r5l_exit_log flush_work(&log->disable_writeback_work) r5c_disable_writeback_async wait_event /* * conf->log is not NULL, and mddev_trylock() * will fail, wait_event() can never pass. */ conf->log = NULL Fix this problem by setting 'config->log' to NULL before wake_up() as it used to be, so that wait_event() from r5c_disable_writeback_async() can exist. In the meantime, move forward md_unregister_thread() so that null-ptr-deref this commit fixed can still be fixed. Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15ublk: Switch to memdup_user_nul() helperRuan Jinjie1-8/+3
Use memdup_user_nul() helper instead of open-coding to simplify the code. Signed-off-by: Ruan Jinjie <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-15block: uapi: Fix compilation errors using ioprio.h with C++Damien Le Moal1-10/+11
The use of the "class" argument name in the ioprio_value() inline function in include/uapi/linux/ioprio.h confuses C++ compilers resulting in compilation errors such as: /usr/include/linux/ioprio.h:110:43: error: expected primary-expression before ‘int’ 110 | static __always_inline __u16 ioprio_value(int class, int level, int hint) | ^~~ for user C++ programs including linux/ioprio.h. Avoid these errors by renaming the arguments of the ioprio_value() function to prioclass, priolevel and priohint. For consistency, the arguments of the IOPRIO_PRIO_VALUE() and IOPRIO_PRIO_VALUE_HINT() macros are also renamed in the same manner. Reported-by: Igor Pylypiv <[email protected]> Fixes: 01584c1e2337 ("scsi: block: Improve ioprio value validity checks") Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by: Igor Pylypiv <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-08-14block: Bring back zero_fill_bio_iterKent Overstreet2-4/+9
This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is used by bcachefs. Signed-off-by: Kent Overstreet <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-14block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unsetKent Overstreet1-4/+6
bio_iov_iter_get_pages() trims the IO based on the block size of the block device the IO will be issued to. However, bcachefs is a multi device filesystem; when we're creating the bio we don't yet know which block device the bio will be submitted to - we have to handle the alignment checks elsewhere. Thus this is needed to avoid a null ptr deref. Signed-off-by: Kent Overstreet <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-14block: Add some exports for bcachefsKent Overstreet4-1/+4
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path - blk_status_to_str() - error messages - bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page() Signed-off-by: Kent Overstreet <[email protected]> Cc: [email protected] Cc: Jens Axboe <[email protected]> Signed-off-by: Kent Overstreet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-11ublk: fix 'warn: variable dereferenced before check 'req'' from SmatchMing Lei1-1/+3
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been done after the request is confirmed as valid. Actually here, the request should always been true, so add one WARN_ON_ONCE(!req), meantime move the zone_append check after checking the request. Cc: Andreas Hindborg <[email protected]> Reported-by: Dan Carpenter <[email protected]> Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-10block: fix bad lockdep annotation in blk-iolatencyJens Axboe1-1/+1
A previous commit added a lockdep annotation, but botched it. Use the right type. Fixes: 4eb44d10766a ("block: remove init_mutex and open-code blk_iolatency_try_init") Signed-off-by: Jens Axboe <[email protected]>
2023-08-10swim3: mark swim3_init() staticArnd Bergmann1-1/+1
This is the module init function, which by definition is used only locally, so mark it static to avoid a warning: drivers/block/swim3.c:1280:5: error: no previous prototype for 'swim3_init' [-Werror=missing-prototypes] Reviewed-by: Jack Wang <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-08-10block: remove init_mutex and open-code blk_iolatency_try_initLi Lingfeng1-24/+11
Commit a13696b83da4 ("blk-iolatency: Make initialization lazy") adds a mutex named "init_mutex" in blk_iolatency_try_init for the race condition of initializing RQ_QOS_LATENCY. Now a new lock has been add to struct request_queue by commit a13bd91be223 ("block/rq_qos: protect rq_qos apis with a new lock"). And it has been held in blkg_conf_open_bdev before calling blk_iolatency_init. So it's not necessary to keep init_mutex in blk_iolatency_try_init, just remove it. Since init_mutex has been removed, blk_iolatency_try_init can be open-coded back to iolatency_set_limit() like ioc_qos_write(). Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-10ublk: Fix signedness bug returning warningLi Zetao1-2/+2
There are two warnings reported by smatch: drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn: signedness bug returning '(-95)' drivers/block/ublk_drv.c:963 ublk_setup_iod() warn: signedness bug returning '(-5)' The type of "blk_status_t" is either be a u32 or u8, but this two functions return a negative value when not supported or failed. Use the error code of the blk module to fix these warnings. Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/r/[email protected]/ Signed-off-by: Li Zetao <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-09bio-integrity: create multi-page bvecs in bio_integrity_add_page()Jinyoung Choi1-7/+24
In general, the bvec data structure consists of one for physically continuous pages. But, in the bvec configuration for bip, physically continuous integrity pages are composed of each bvec. Allow bio_integrity_add_page() to create multi-page bvecs, just like the bio payloads. This simplifies adding larger payloads, and fixes support for non-tiny workloads with nvme, which stopped using scatterlist for metadata a while ago. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Fixes: 783b94bd9250 ("nvme-pci: do not build a scatterlist to map metadata") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09bio-integrity: cleanup adding integrity pages to bip's bvec.Jinyoung Choi1-13/+3
bio_integrity_add_page() returns the add length if successful, else 0, just as bio_add_page. Simply check return value checking in bio_integrity_prep to not deal with a > 0 but < len case that can't happen. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09bio-integrity: update the payload size in bio_integrity_add_page()Jinyoung Choi5-7/+3
Previously, the bip's bi_size has been set before an integrity pages were added. If a problem occurs in the process of adding pages for bip, the bi_size mismatch problem must be dealt with. When the page is successfully added to bvec, the bi_size is updated. The parts affected by the change were also contained in this commit. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09block: make bvec_try_merge_hw_page() non-staticJinyoung Choi2-1/+5
This will be used for multi-page configuration for integrity payload. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8 Signed-off-by: Jens Axboe <[email protected]>
2023-08-08block/mq-deadline: use correct way to throttling write requestsZhiguo Niu1-1/+2
The original formula was inaccurate: dd->async_depth = max(1UL, 3 * q->nr_requests / 4); For write requests, when we assign a tags from sched_tags, data->shallow_depth will be passed to sbitmap_find_bit, see the following code: nr = sbitmap_find_bit_in_word(&sb->map[index], min_t (unsigned int, __map_depth(sb, index), depth), alloc_hint, wrap); The smaller of data->shallow_depth and __map_depth(sb, index) will be used as the maximum range when allocating bits. For a mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so according to the previous calculation method, dd->async_depth = data->shallow_depth = 96, and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5, sb.maps[]=32/32/32/32, 32 is smaller than 96, whether it is a read or a write I/O, tags can be allocated to the maximum range each time, which has not throttling effect. In addition, refer to the methods of bfg/kyber I/O scheduler, limit ratiois are calculated base on sched_tags.bitmap_tags.sb.shift. This patch can throttle write requests really. Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests") Signed-off-by: Zhiguo Niu <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08ublk: enable zoned storage supportAndreas Hindborg2-24/+370
Add zoned storage support to ublk: report_zones and operations: - REQ_OP_ZONE_OPEN - REQ_OP_ZONE_CLOSE - REQ_OP_ZONE_FINISH - REQ_OP_ZONE_RESET - REQ_OP_ZONE_APPEND The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to communicate ALBA back to the kernel. Therefore ublk must be used with the user copy feature (UBLK_F_USER_COPY) for zoned storage support to be available. Without this feature, ublk will not allow zoned storage support. Signed-off-by: Andreas Hindborg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Tested-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08ublk: move check for empty address field on command submissionAndreas Hindborg1-5/+9
In preparation for zoned storage support, move the check for empty `addr` field into the command handler case statement. Note that the check makes no sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be set for this command. Signed-off-by: Andreas Hindborg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08ublk: add helper to check if device supports user copyAndreas Hindborg1-1/+6
This will be used by ublk zoned storage support. Signed-off-by: Andreas Hindborg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08iocost_monitor: improve it by adding iocg wait_msChengming Zhou1-4/+8
The iocg can have three throttled metrics: wait, debt, delay. This patch add missing wait_ms to IocgStat to show the latest wait_ms of iocg. As we are here, group iocg usage percents "inflt%" and "usage%" together, and group iocg throttled metrics "wait", "debt" and "delay" together. Effect after changes: nvme0n1 RUN per=50.0ms cur_per=177105.713:v1053528.587 busy= +0 vrate=135.00%:270.00% params=ssd_dfl(CQ) active weight hweight% inflt% usage% wait debt delay InterfererGroup0 * 100/ 100 54.28/ 9.09 0.34 24.07 0.00 0.00 0.00 interfered * 84/ 1000 45.72/ 90.91 0.48 41.09 0.00 0.00 0.00 Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08iocost_monitor: print vrate inuse along with base_vrateChengming Zhou1-2/+5
The real vrate iocost inuse is not base_vrate, but the atomic vtime_rate. We need iocost_monitor tool to display this real vrate that iocost use, to check if the boosted compensated vrate is normal. Effect after change: nvme0n1 RUN per=50.0ms cur_per=172116.580:v1040587.433 busy= +0 \ vrate=135.00%:270.00% params=ssd_dfl(CQ) ^ | this is real vrate inuse Signed-off-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08iocost_monitor: fix kernel queue kobj changesChengming Zhou1-1/+1
When I use iocost_monitor on nvme0n1, this error shows up: "Could not find ioc for nvme0n1" There is no kobj in struct queue in recent kernel, it seems that the commit 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") move the queue kobj to struct gendisk. Fix it by using mq_kobj which is at the same level with queue kobj. Signed-off-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-04fs/Kconfig: Fix compile error for romfsLi Zetao1-0/+1
There are some compile errors reported by kernel test robot: arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_read': storage.c:(.text+0x64): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x9c): undefined reference to `__bread_gfp' arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strnlen': storage.c:(.text+0x128): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x16c): undefined reference to `__bread_gfp' arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strcmp': storage.c:(.text+0x22c): undefined reference to `__bread_gfp' arm-linux-gnueabi-ld: storage.c:(.text+0x27c): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x2a8): undefined reference to `__bread_gfp' arm-linux-gnueabi-ld: storage.c:(.text+0x2bc): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x2d4): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x2f4): undefined reference to `__brelse' arm-linux-gnueabi-ld: storage.c:(.text+0x304): undefined reference to `__brelse' The reason for the problem is that the commit "925c86a19bac" ("fs: add CONFIG_BUFFER_HEAD") has added a new config "CONFIG_BUFFER_HEAD" that controls building the buffer_head code, and romfs needs to use the buffer_head API, but no corresponding config has beed added. Select the config "CONFIG_BUFFER_HEAD" in romfs Kconfig to resolve the problem. Fixes: 925c86a19bac ("fs: add CONFIG_BUFFER_HEAD") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Reviewed-by: Luis Chamberlain <[email protected]> Tested-by: Li Zetao <[email protected]> Signed-off-by: Li Zetao <[email protected]> [axboe: fold in Christoph's incremental] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02fs: add CONFIG_BUFFER_HEADChristoph Hellwig37-29/+119
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes and alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02block: use iomap for writes to block devicesChristoph Hellwig2-2/+30
Use iomap in buffer_head compat mode to write to block devices. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02block: stop setting ->direct_IOChristoph Hellwig1-2/+1
Direct I/O on block devices now nevers goes through aops->direct_IO. Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02block: open code __generic_file_write_iter for blkdev writesChristoph Hellwig1-2/+43
Open code __generic_file_write_iter to remove the indirect call into ->direct_IO and to prepare using the iomap based write code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02fs: rename and move block_page_mkwrite_returnChristoph Hellwig8-25/+31
block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02fs: remove emergency_thaw_bdevChristoph Hellwig3-13/+3
Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c to be built only when buffer_head support is enabled. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-29Merge tag 'md-next-20230729' of ↵Jens Axboe15-394/+431
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.6/block Pull MD updates from Song: "1. Deprecate bitmap file support, by Christoph Hellwig; 2. Fix deadlock with md sync thread, by Yu Kuai; 3. Refactor md io accounting, by Yu Kuai; 4. Various non-urgent fixes by Li Nan, Yu Kuai, and Jack Wang." * tag 'md-next-20230729' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (36 commits) md/md-bitmap: hold 'reconfig_mutex' in backlog_store() md/md-bitmap: remove unnecessary local variable in backlog_store() md/raid10: use dereference_rdev_and_rrdev() to get devices md/raid10: factor out dereference_rdev_and_rrdev() md/raid10: check replacement and rdev to prevent submit the same io twice md/raid1: Avoid lock contention from wake_up() md: restore 'noio_flag' for the last mddev_resume() md: don't quiesce in mddev_suspend() md: remove redundant check in fix_read_error() md/raid10: optimize fix_read_error md/raid1: prioritize adding disk to 'removed' mirror md/md-faulty: enable io accounting md/md-linear: enable io accounting md/md-multipath: enable io accounting md/raid10: switch to use md_account_bio() for io accounting md/raid1: switch to use md_account_bio() for io accounting raid5: fix missing io accounting in raid5_align_endio() md: also clone new io if io accounting is disabled md: move initialization and destruction of 'io_acct_set' to md.c md: deprecate bitmap file support ...
2023-07-27md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai1-0/+7
Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai1-2/+0
Local variable is definied first in the beginning of backlog_store(), there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan1-10/+5
Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, there are same issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid10: factor out dereference_rdev_and_rrdev()Li Nan1-9/+20
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors. Just to make code cleaner and prepare to fix the bug of io loss while 'replacement' replace 'rdev'. There is no functional change. Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid10: check replacement and rdev to prevent submit the same io twiceLi Nan1-0/+2
After commit 4ca40c2ce099 ("md/raid10: Allow replacement device to be replace old drive."), 'rdev' and 'replacement' could appear to be identical. There are already checks for that in wait_blocked_dev() and raid10_write_request(). Add check for raid10_handle_discard() now. Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid1: Avoid lock contention from wake_up()Jack Wang1-6/+12
wake_up is called unconditionally in a few paths such as make_request(), which cause lock contention under high concurrency workload like below raid1_end_write_request wake_up __wake_up_common_lock spin_lock_irqsave Improve performance by only call wake_up() if waitqueue is not empty Fio test script: [global] name=random reads and writes ioengine=libaio direct=1 readwrite=randrw rwmixread=70 iodepth=64 buffered=0 filename=/dev/md0 size=1G runtime=30 time_based randrepeat=0 norandommap refill_buffers ramp_time=10 bs=4k numjobs=400 group_reporting=1 [job1] Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server. Before this patch With this patch READ BW=4621MB/s BW=7337MB/s WRITE BW=1980MB/s BW=3144MB/s The patch is inspired by Yu Kuai's change for raid10: https://lore.kernel.org/r/[email protected] Cc: Yu Kuai <[email protected]> Signed-off-by: Jack Wang <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md: restore 'noio_flag' for the last mddev_resume()Yu Kuai1-2/+4
memalloc_noio_save() is called for the first mddev_suspend(), and repeated mddev_suspend() only increase 'suspended'. However, memalloc_noio_restore() is also called for the first mddev_resume(), which means that memory reclaim will be enabled before the last mddev_resume() is called, while the array is still suspended. Fix this problem by restore 'noio_flag' for the last mddev_resume(). Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md: don't quiesce in mddev_suspend()Yu Kuai1-2/+0
Some levels doesn't implement "pers->quiesce", for example raid0_quiesce() is empty, and now that all levels will drop 'active_io' until io is done, wait for 'active_io' to be 0 is enough to make sure all normal io is done, and percpu_ref_kill() for 'active_io' will make sure no new normal io can be dispatched. There is no need to call "pers->quiesce" anymore from mddev_suspend(). Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md: remove redundant check in fix_read_error()Li Nan2-2/+2
In fix_read_error(), 'success' will be checked immediately after assigning it, if it is set to 1 then the loop will break. Checking it again in condition of loop is redundant. Clean it up. Signed-off-by: Li Nan <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid10: optimize fix_read_errorLi Nan1-10/+10
We dereference r10_bio->read_slot too many times in fix_read_error(). Optimize it by using a variable to store read_slot. Signed-off-by: Li Nan <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid1: prioritize adding disk to 'removed' mirrorLi Nan1-11/+15
New disk should be added to "removed" position first instead of to be a replacement. Commit 6090368abcb4 ("md/raid10: prioritize adding disk to 'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now. Signed-off-by: Li Nan <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/md-faulty: enable io accountingYu Kuai1-0/+2
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md/md-linear: enable io accountingYu Kuai1-0/+1
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md/md-multipath: enable io accountingYu Kuai1-0/+1
use md_account_bio() to enable io accounting, also make sure mddev_suspend() will wait for all io to be done. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md/raid10: switch to use md_account_bio() for io accountingYu Kuai2-12/+9
Make sure that 'active_io' will represent inflight io instead of io that is dispatching, and io accounting from all levels will be consistent. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27md/raid1: switch to use md_account_bio() for io accountingYu Kuai2-9/+6
Two problems can be fixed this way: 1) 'active_io' will represent inflight io instead of io that is dispatching. 2) If io accounting is enabled or disabled while io is still inflight, bio_start_io_acct() and bio_end_io_acct() is not balanced and io inflight counter will be leaked. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-27raid5: fix missing io accounting in raid5_align_endio()Yu Kuai1-21/+8
Io will only be accounted as done from raid5_align_endio() if the io succeeded, and io inflight counter will be leaked if such io failed. Fix this problem by switching to use md_account_bio() for io accounting. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Xiao Ni <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]