Age | Commit message | Author | Files | Lines
2023-08-17 | md/raid0: Factor out helper for mapping and submitting a bio | Jan Kara | 1 | -39/+40
Factor out helper function for mapping and submitting a bio out of raid0_make_request(). We will use it later for submitting both parts of a split bio. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-17 | md raid1: allow writebehind to work on any leg device set WriteMostly | Heinz Mauelshagen | 1 | -2/+1
As the WriteMostly flag can be set on any component device of a RAID1 array, remove the constraint that it only works if set on the first one. Signed-off-by: Heinz Mauelshagen <[email protected]> Tested-by: Xiao Ni <[email protected]> Link: https://lore.kernel.org/r/2a9592bf3340f34bf588eec984b23ee219f3985e.1692013451.git.heinzm@redhat.com Signed-off-by: Song Liu <[email protected]>
2023-08-17 | md/raid1: hold the barrier until handle_read_error() finishes | Xueshi Hu | 1 | -1/+3
handle_read_error() will call allow_barrier() to match the former barrier raising. However, it should put the allow_barrier() at the end to avoid a concurrent raid reshape. Fixes: 689389a06ce7 ("md/raid1: simplify handle_read_error().") Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xueshi Hu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-17 | md/raid1: free the r1bio before waiting for blocked rdev | Xueshi Hu | 1 | -2/+2
Raid1 reshape will change mempool and r1conf::raid_disks which are needed to free r1bio. allow_barrier() makes a concurrent raid1_reshape() possible. So, free the in-flight r1bio before waiting for the blocked rdev. Fixes: 6bfe0b499082 ("md: support blocking writes to an array on device failure") Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xueshi Hu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-17 | md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() | Xueshi Hu | 1 | -3/+4
After allow_barrier, a concurrent raid1_reshape() will replace old mempool and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(), so that r1bio can be freed safely. Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Xueshi Hu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-17 | blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init | Tejun Heo | 1 | -14/+18
blk-iocost sometimes causes the following crash:

  BUG: kernel NULL pointer dereference, address: 00000000000000e0
  ...
  RIP: 0010:_raw_spin_lock+0x17/0x30
  Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
  RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
  RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
  RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
  R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
  R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
  FS: 00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
  Call Trace:
   <TASK>
   ioc_weight_write+0x13d/0x410
   cgroup_file_write+0x7a/0x130
   kernfs_fop_write_iter+0xf5/0x170
   vfs_write+0x298/0x370
   ksys_write+0x5f/0xb0
   __x64_sys_write+0x1b/0x20
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it.

blkcg_activate_policy() was doing the following:

1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
2. Initialize all pd's.
3. Online all pd's.

blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg's. The two can race and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to crash.
The crash can be reproduced with the following script:

  #!/bin/bash

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
  echo 100 > /sys/fs/cgroup/system.slice/io.weight
  bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
  sleep .2
  echo 100 > /sys/fs/cgroup/system.slice/io.weight

with the following patch applied:

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fc49be622e05..38d671d5e10c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
>  		pd->online = false;
>  	}
>
> +	if (system_state == SYSTEM_RUNNING) {
> +		spin_unlock_irq(&q->queue_lock);
> +		ssleep(1);
> +		spin_lock_irq(&q->queue_lock);
> +	}
> +
>  	/* all allocated, init in the same order */
>  	if (pol->pd_init_fn)
>  		list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)

I don't see a reason why all pd's should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs be initialized and onlined before children, which is guaranteed by the walking order. Let's fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined, removing the race window.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Breno Leitao <[email protected]>
Fixes: 9d179b865449 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2023-08-16 | drivers/rnbd: restore sysfs interface to rnbd-client | Li Zhijian | 1 | -1/+1
Commit 137380c0ec40 renamed 'rnbd-client' to 'rnbd_client', this changed sysfs interface to /sys/devices/virtual/rnbd_client/ctl/map_device from /sys/devices/virtual/rnbd-client/ctl/map_device. CC: Ivan Orlov <[email protected]> CC: "Md. Haris Iqbal" <[email protected]> CC: Jack Wang <[email protected]> Fixes: 137380c0ec40 ("block/rnbd: make all 'class' structures const") Signed-off-by: Li Zhijian <[email protected]> Acked-by: Jack Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-15 | Merge tag 'md-next-20230814-resend' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.6/block | Jens Axboe | 11 | -45/+57
Pull MD fixes from Song:

"1. raid6test build fixes, by WANG Xuerui
 2. Various non-urgent fixes."

* tag 'md-next-20230814-resend' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
  raid6: test: only check for Altivec if building on powerpc hosts
  raid6: test: make sure all intermediate and artifact files are .gitignored
  raid6: test: cosmetic cleanups for the test Makefile
  raid6: guard the tables.c include of <linux/export.h> with __KERNEL__
  raid6: remove the <linux/export.h> include from recov.c
  md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread
  md/raid10: fix a 'conf->barrier' leakage in raid10_takeover()
  md: raid1: fix potential OOB in raid1_remove_disk()
  md/raid5-cache: fix a deadlock in r5l_exit_log()
2023-08-15 | md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() | Yu Kuai | 1 | -2/+1
r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is empty, and then submit 'flush_bio'. However, r5l_log_flush_endio() clears the list first and only then clears the bio, which will cause null-ptr-deref:

T1: submit flush io
raid5d
 handle_active_stripes
  r5l_flush_stripe_to_raid
   // list is empty
   // add 'io_end_ios' to the list
   bio_init
   submit_bio
   // io1

T2: io1 is done
r5l_log_flush_endio
 list_splice_tail_init
 // clear the list
                        T3: submit new flush io
                        ...
                        r5l_flush_stripe_to_raid
                         // list is empty
                         // add 'io_end_ios' to the list
                         bio_init
 bio_uninit
 // clear bio->bi_blkg
                         submit_bio
                         // null-ptr-deref

Fix this problem by clearing bio before clearing the list in r5l_log_flush_endio().

Fixes: 0dd00cba99c3 ("raid5-cache: fully initialize flush_bio when needed")
Reported-and-tested-by: Corey Hickey <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Song Liu <[email protected]>
2023-08-15 | raid6: test: only check for Altivec if building on powerpc hosts | WANG Xuerui | 1 | -9/+10
Altivec is only available for powerpc hosts, so only check for its availability when the host is powerpc, to avoid error messages being shown on architectures other than x86, arm or powerpc. Signed-off-by: WANG Xuerui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | raid6: test: make sure all intermediate and artifact files are .gitignored | WANG Xuerui | 1 | -0/+3
Currently when the raid6test utility is built, the resulting binary and an int.uc file are not being ignored, which can get inadvertently committed as a result when one works on the raid6 code. Ignore them to make `git status` clean at all times. Signed-off-by: WANG Xuerui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | raid6: test: cosmetic cleanups for the test Makefile | WANG Xuerui | 1 | -15/+16
Use tabs/spaces consistently: hard tabs for marking recipe lines only, spaces for everything else. Also, the OPTFLAGS declaration actually included the tabs preceding the line comment, making compiler invocation lines unnecessarily long. As the entire block of declarations is meant for ad-hoc customization (otherwise they would probably make use of `?=` instead of `=`), move the "Adjust as desired" comment above the block too to fix the long invocation lines. Signed-off-by: WANG Xuerui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | raid6: guard the tables.c include of <linux/export.h> with __KERNEL__ | WANG Xuerui | 1 | -0/+2
The export directives for the tables are already emitted with __KERNEL__ guards, but the <linux/export.h> include is not, causing errors when building the raid6test program. Guard this include too to fix the raid6test build. Signed-off-by: WANG Xuerui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | raid6: remove the <linux/export.h> include from recov.c | WANG Xuerui | 1 | -1/+0
There is no exported symbol left in recov.c, so the include is now unnecessary, and breaks the raid6test build. Remove it. Signed-off-by: WANG Xuerui <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | md: Hold mddev->reconfig_mutex when trying to get mddev->sync_thread | Li Lingfeng | 7 | -14/+15
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in action_store"") removed the scenario of calling md_unregister_thread() without holding mddev->reconfig_mutex, so add a lock holding check before acquiring mddev->sync_thread by passing mddev to md_unregister_thread(). Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-08-15 | md/raid10: fix a 'conf->barrier' leakage in raid10_takeover() | Yu Kuai | 1 | -1/+0
After commit b39f35ebe86d ("md: don't quiesce in mddev_suspend()"), 'conf->barrier' will be leaked in the case that raid10 takes over raid0:

level_store
 pers->takeover -> raid10_takeover
  raid10_takeover_raid0
   WRITE_ONCE(conf->barrier, 1)

 mddev_suspend // still raid0
 mddev->pers = pers // switch to raid10
 mddev_resume // resume without suspend

After the above commit, mddev_resume() will not decrease 'conf->barrier' that is set in raid10_takeover_raid0().

Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0().

By the way, this problem was found while I was trying to make mddev_suspend/resume() independent of raid personalities. raid10 is the only personality to use a reference count in the quiesce() callback and this problem is only related to raid10.

Fixes: b39f35ebe86d ("md: don't quiesce in mddev_suspend()")
Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Paul Menzel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
2023-08-15 | md: raid1: fix potential OOB in raid1_remove_disk() | Zhang Shurong | 1 | -0/+4
If rdev->raid_disk is greater than mddev->raid_disks, there will be an out-of-bounds access in raid1_remove_disk(). We have already found similar reports as follows: 1) commit d17f744e883b ("md-raid10: fix KASAN warning") 2) commit 1ebc2cec0b7d ("dm raid: fix KASAN warning in raid5_remove_disk") Fix this bug by checking whether the "number" variable is valid. Signed-off-by: Zhang Shurong <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
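The validity check described in this message can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel code: `sketch_conf`, `SKETCH_MAX_DISKS` and `sketch_remove_disk()` are hypothetical names, and the per-disk state is reduced to an array of flags.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DISKS 8	/* hypothetical fixed capacity for the sketch */

struct sketch_conf {
	int raid_disks;			/* number of valid slots, <= SKETCH_MAX_DISKS */
	int present[SKETCH_MAX_DISKS];	/* 1 if a device occupies the slot */
};

/*
 * Remove the device in slot "number".
 * Returns 0 on success, -1 if the slot number is invalid or the slot is empty.
 */
static int sketch_remove_disk(struct sketch_conf *conf, int number)
{
	assert(conf != NULL);

	/* Reject the index before it is ever used to touch the array. */
	if (number < 0 || number >= conf->raid_disks)
		return -1;

	if (!conf->present[number])
		return -1;

	conf->present[number] = 0;
	return 0;
}
```

The point of the ordering is that the range test runs before any array access, so an oversized slot number is refused instead of reading past the table.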
2023-08-15 | md/raid5-cache: fix a deadlock in r5l_exit_log() | Yu Kuai | 1 | -3/+6
Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work") introduced a new problem:

// caller hold reconfig_mutex
r5l_exit_log
 flush_work(&log->disable_writeback_work)
                        r5c_disable_writeback_async
                         wait_event
                          /*
                           * conf->log is not NULL, and mddev_trylock()
                           * will fail, wait_event() can never pass.
                           */
 conf->log = NULL

Fix this problem by setting 'conf->log' to NULL before wake_up() as it used to be, so that wait_event() from r5c_disable_writeback_async() can exit. In the meantime, move forward md_unregister_thread() so that the null-ptr-deref this commit fixed can still be fixed.

Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work")
Signed-off-by: Yu Kuai <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Song Liu <[email protected]>
2023-08-15 | ublk: Switch to memdup_user_nul() helper | Ruan Jinjie | 1 | -8/+3
Use memdup_user_nul() helper instead of open-coding to simplify the code. Signed-off-by: Ruan Jinjie <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
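For readers unfamiliar with the pattern being replaced, here is a rough userspace analogue of "duplicate a buffer and NUL-terminate it". It is only a sketch: `sketch_memdup_nul()` is a hypothetical name, it uses `malloc()`/`memcpy()` instead of the kernel allocator and user-copy routines, and it reports failure with NULL rather than an error pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy "len" bytes from "src" into a fresh buffer and append a NUL byte,
 * so the result can be handled as a C string. The caller frees the result.
 * Returns NULL on allocation failure or if len + 1 would overflow.
 */
static char *sketch_memdup_nul(const void *src, size_t len)
{
	char *p;

	assert(src != NULL);

	if (len == SIZE_MAX)
		return NULL;

	p = malloc(len + 1);
	if (!p)
		return NULL;

	memcpy(p, src, len);
	p[len] = '\0';
	return p;
}
```

Wrapping allocation, copy and termination in one helper is what lets an open-coded sequence shrink to a single call plus an error check.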
2023-08-15 | block: uapi: Fix compilation errors using ioprio.h with C++ | Damien Le Moal | 1 | -10/+11
The use of the "class" argument name in the ioprio_value() inline function in include/uapi/linux/ioprio.h confuses C++ compilers resulting in compilation errors such as:

/usr/include/linux/ioprio.h:110:43: error: expected primary-expression before ‘int’
  110 | static __always_inline __u16 ioprio_value(int class, int level, int hint)
      |                                           ^~~

for user C++ programs including linux/ioprio.h.

Avoid these errors by renaming the arguments of the ioprio_value() function to prioclass, priolevel and priohint. For consistency, the arguments of the IOPRIO_PRIO_VALUE() and IOPRIO_PRIO_VALUE_HINT() macros are also renamed in the same manner.

Reported-by: Igor Pylypiv <[email protected]>
Fixes: 01584c1e2337 ("scsi: block: Improve ioprio value validity checks")
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Tested-by: Igor Pylypiv <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
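The renaming can be illustrated with a small standalone helper that avoids the C++ keyword `class` in its parameter list. This is a sketch, not the uapi header: the `SKETCH_*` constants and `sketch_ioprio_value()` are hypothetical names, and the bit layout used here (class in the top three bits, hint in the middle, level in the low three bits) is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only. */
#define SKETCH_CLASS_SHIFT	13
#define SKETCH_NR_CLASSES	8
#define SKETCH_NR_LEVELS	8
#define SKETCH_HINT_SHIFT	3
#define SKETCH_NR_HINTS		1024
#define SKETCH_CLASS_INVAL	7

/*
 * Pack class, level and hint into one 16-bit value. The parameter names
 * prioclass/priolevel/priohint are valid identifiers in both C and C++.
 */
static uint16_t sketch_ioprio_value(int prioclass, int priolevel, int priohint)
{
	/* The level field must fit exactly below the hint field. */
	assert(SKETCH_NR_LEVELS == (1 << SKETCH_HINT_SHIFT));

	if (prioclass < 0 || prioclass >= SKETCH_NR_CLASSES ||
	    priolevel < 0 || priolevel >= SKETCH_NR_LEVELS ||
	    priohint < 0 || priohint >= SKETCH_NR_HINTS)
		return (uint16_t)(SKETCH_CLASS_INVAL << SKETCH_CLASS_SHIFT);

	return (uint16_t)((prioclass << SKETCH_CLASS_SHIFT) |
			  (priohint << SKETCH_HINT_SHIFT) |
			  priolevel);
}
```

Because only the parameter names change, callers are unaffected; the same source now compiles whether it is included from a C or a C++ translation unit.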
2023-08-14 | block: Bring back zero_fill_bio_iter | Kent Overstreet | 2 | -4/+9
This reverts 6f822e1b5d9dda3d20e87365de138046e3baa03a - this helper is used by bcachefs. Signed-off-by: Kent Overstreet <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-14 | block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset | Kent Overstreet | 1 | -4/+6
bio_iov_iter_get_pages() trims the IO based on the block size of the block device the IO will be issued to. However, bcachefs is a multi device filesystem; when we're creating the bio we don't yet know which block device the bio will be submitted to - we have to handle the alignment checks elsewhere. Thus this is needed to avoid a null ptr deref. Signed-off-by: Kent Overstreet <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-14 | block: Add some exports for bcachefs | Kent Overstreet | 4 | -1/+4
 - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
 - blk_status_to_str() - error messages
 - bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page()

Signed-off-by: Kent Overstreet <[email protected]>
Cc: [email protected]
Cc: Jens Axboe <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2023-08-11 | ublk: fix 'warn: variable dereferenced before check 'req'' from Smatch | Ming Lei | 1 | -1/+3
The added check of 'req_op(req) == REQ_OP_ZONE_APPEND' should have been done after the request is confirmed as valid. Actually here, the request should always be valid, so add one WARN_ON_ONCE(!req), and meanwhile move the zone_append check after checking the request. Cc: Andreas Hindborg <[email protected]> Reported-by: Dan Carpenter <[email protected]> Fixes: 29802d7ca33b ("ublk: enable zoned storage support") Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-10 | block: fix bad lockdep annotation in blk-iolatency | Jens Axboe | 1 | -1/+1
A previous commit added a lockdep annotation, but botched it. Use the right type. Fixes: 4eb44d10766a ("block: remove init_mutex and open-code blk_iolatency_try_init") Signed-off-by: Jens Axboe <[email protected]>
2023-08-10 | swim3: mark swim3_init() static | Arnd Bergmann | 1 | -1/+1
This is the module init function, which by definition is used only locally, so mark it static to avoid a warning: drivers/block/swim3.c:1280:5: error: no previous prototype for 'swim3_init' [-Werror=missing-prototypes] Reviewed-by: Jack Wang <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-08-10 | block: remove init_mutex and open-code blk_iolatency_try_init | Li Lingfeng | 1 | -24/+11
Commit a13696b83da4 ("blk-iolatency: Make initialization lazy") adds a mutex named "init_mutex" in blk_iolatency_try_init for the race condition of initializing RQ_QOS_LATENCY. Now a new lock has been added to struct request_queue by commit a13bd91be223 ("block/rq_qos: protect rq_qos apis with a new lock"). And it has been held in blkg_conf_open_bdev before calling blk_iolatency_init. So it's not necessary to keep init_mutex in blk_iolatency_try_init, just remove it. Since init_mutex has been removed, blk_iolatency_try_init can be open-coded back to iolatency_set_limit() like ioc_qos_write(). Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-10 | ublk: Fix signedness bug returning warning | Li Zetao | 1 | -2/+2
There are two warnings reported by smatch:

drivers/block/ublk_drv.c:445 ublk_setup_iod_zoned() warn: signedness bug returning '(-95)'
drivers/block/ublk_drv.c:963 ublk_setup_iod() warn: signedness bug returning '(-5)'

The type of "blk_status_t" is either a u32 or a u8, but these two functions return a negative value when not supported or failed. Use the error code of the blk module to fix these warnings.

Fixes: 29802d7ca33b ("ublk: enable zoned storage support")
Reported-by: kernel test robot <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Li Zetao <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
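Why a negative return is a bug for an unsigned status type can be shown with a small standalone sketch. The names below (`sketch_status_t`, `SKETCH_STS_*`, `sketch_setup_buggy()`, `sketch_setup_fixed()`) are hypothetical stand-ins, the status values are illustrative, and the 8-bit width is just one of the two widths the message mentions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t sketch_status_t;	/* stand-in for an unsigned status type */

#define SKETCH_STS_OK		0
#define SKETCH_STS_NOTSUPP	1	/* illustrative value */

/*
 * Buggy pattern: a negative errno-style code is squeezed into the
 * unsigned status type and silently wraps around.
 */
static sketch_status_t sketch_setup_buggy(int err)
{
	assert(err < 0);
	return (sketch_status_t)err;	/* -95 wraps modulo 256 */
}

/* Fixed pattern: report failure with a code from the status type's own range. */
static sketch_status_t sketch_setup_fixed(int supported)
{
	return supported ? SKETCH_STS_OK : SKETCH_STS_NOTSUPP;
}
```

With an 8-bit status, passing -95 through the buggy helper yields 161, which callers comparing against the defined status codes would not recognise; the fixed helper always returns a value the caller can match.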
2023-08-09 | bio-integrity: create multi-page bvecs in bio_integrity_add_page() | Jinyoung Choi | 1 | -7/+24
In general, a single bvec covers a run of physically contiguous pages. In the bvec configuration for bip, however, each physically contiguous integrity page gets its own bvec. Allow bio_integrity_add_page() to create multi-page bvecs, just like the bio payloads. This simplifies adding larger payloads, and fixes support for non-tiny workloads with nvme, which stopped using scatterlist for metadata a while ago. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Fixes: 783b94bd9250 ("nvme-pci: do not build a scatterlist to map metadata") Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803025202epcms2p82f57cbfe32195da38c776377b55aed59@epcms2p8 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09 | bio-integrity: cleanup adding integrity pages to bip's bvec. | Jinyoung Choi | 1 | -13/+3
bio_integrity_add_page() returns the added length if successful, else 0, just as bio_add_page() does. Simplify the return value checking in bio_integrity_prep() so that it does not deal with a > 0 but < len case that can't happen. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803025058epcms2p5a4d0db5da2ad967668932d463661c633@epcms2p5 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09 | bio-integrity: update the payload size in bio_integrity_add_page() | Jinyoung Choi | 5 | -7/+3
Previously, the bip's bi_size was set before the integrity pages were added. If a problem occurred in the process of adding pages for bip, the bi_size mismatch had to be dealt with. Now bi_size is updated when the page is successfully added to the bvec. The parts affected by the change are also contained in this commit. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3 Signed-off-by: Jens Axboe <[email protected]>
2023-08-09 | block: make bvec_try_merge_hw_page() non-static | Jinyoung Choi | 2 | -1/+5
This will be used for multi-page configuration for integrity payload. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinyoung Choi <[email protected]> Tested-by: "Martin K. Petersen" <[email protected]> Reviewed-by: "Martin K. Petersen" <[email protected]> Link: https://lore.kernel.org/r/20230803024827epcms2p838d9e9131492c86a159fff25d195658f@epcms2p8 Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | block/mq-deadline: use correct way to throttling write requests | Zhiguo Niu | 1 | -1/+2
The original formula was inaccurate:

  dd->async_depth = max(1UL, 3 * q->nr_requests / 4);

For write requests, when we assign a tag from sched_tags, data->shallow_depth will be passed to sbitmap_find_bit, see the following code:

  nr = sbitmap_find_bit_in_word(&sb->map[index],
                  min_t (unsigned int, __map_depth(sb, index), depth),
                  alloc_hint, wrap);

The smaller of data->shallow_depth and __map_depth(sb, index) will be used as the maximum range when allocating bits.

For a mmc device (one hw queue, deadline I/O scheduler): q->nr_requests = sched_tags = 128, so according to the previous calculation method, dd->async_depth = data->shallow_depth = 96, and the platform is 64bits with 8 cpus, sched_tags.bitmap_tags.sb.shift=5, sb.maps[]=32/32/32/32. 32 is smaller than 96, so whether it is a read or a write I/O, tags can be allocated up to the maximum range each time, which has no throttling effect.

In addition, following the methods of the bfq/kyber I/O schedulers, limit ratios are calculated based on sched_tags.bitmap_tags.sb.shift.

With this patch, write requests are actually throttled.

Fixes: 07757588e507 ("block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests")
Signed-off-by: Zhiguo Niu <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
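The arithmetic in this message can be checked with a few lines of standalone C. This is a sketch under stated assumptions: the helper names are hypothetical, and the replacement limit is assumed to be three quarters of the per-word depth `1 << shift`, since the message only says the ratio is based on `sb.shift`.

```c
#include <assert.h>

/* Old limit: three quarters of the queue depth, at least 1. */
static unsigned int sketch_old_async_depth(unsigned int nr_requests)
{
	unsigned int d = 3 * nr_requests / 4;

	return d ? d : 1;
}

/* Assumed new limit: three quarters of one bitmap word's depth, at least 1. */
static unsigned int sketch_new_async_depth(unsigned int shift)
{
	unsigned int d;

	assert(shift < 30);	/* keep 3 * (1 << shift) within range */
	d = 3 * (1U << shift) / 4;
	return d ? d : 1;
}

/*
 * The allocator searches at most min(word depth, shallow depth) bits
 * in each word, so this is the limit that actually applies per word.
 */
static unsigned int sketch_effective_depth(unsigned int shift,
					   unsigned int shallow_depth)
{
	unsigned int word_depth;

	assert(shift < 32);
	word_depth = 1U << shift;
	return shallow_depth < word_depth ? shallow_depth : word_depth;
}
```

For the example in the message (nr_requests = 128, shift = 5) the old limit is 96, which is clamped to the 32-bit word depth and so restricts nothing, while a limit derived from the shift is 24 and leaves a quarter of every word unavailable to throttled requests.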
2023-08-08 | ublk: enable zoned storage support | Andreas Hindborg | 2 | -24/+370
Add zoned storage support to ublk: report_zones and operations:
 - REQ_OP_ZONE_OPEN
 - REQ_OP_ZONE_CLOSE
 - REQ_OP_ZONE_FINISH
 - REQ_OP_ZONE_RESET
 - REQ_OP_ZONE_APPEND

The zone append feature uses the `addr` field of `struct ublksrv_io_cmd` to communicate ALBA back to the kernel. Therefore ublk must be used with the user copy feature (UBLK_F_USER_COPY) for zoned storage support to be available. Without this feature, ublk will not allow zoned storage support.

Signed-off-by: Andreas Hindborg <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Tested-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | ublk: move check for empty address field on command submission | Andreas Hindborg | 1 | -5/+9
In preparation for zoned storage support, move the check for empty `addr` field into the command handler case statement. Note that the check makes no sense for `UBLK_IO_NEED_GET_DATA` because the `addr` field must always be set for this command. Signed-off-by: Andreas Hindborg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | ublk: add helper to check if device supports user copy | Andreas Hindborg | 1 | -1/+6
This will be used by ublk zoned storage support. Signed-off-by: Andreas Hindborg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | iocost_monitor: improve it by adding iocg wait_ms | Chengming Zhou | 1 | -4/+8
The iocg can have three throttled metrics: wait, debt, delay. This patch adds the missing wait_ms to IocgStat to show the latest wait_ms of iocg.

While we are here, group the iocg usage percents "inflt%" and "usage%" together, and group the iocg throttled metrics "wait", "debt" and "delay" together.

Effect after changes:

nvme0n1 RUN per=50.0ms cur_per=177105.713:v1053528.587 busy= +0 vrate=135.00%:270.00% params=ssd_dfl(CQ)
                  active  weight     hweight%      inflt%  usage%  wait  debt  delay
InterfererGroup0       *  100/ 100   54.28/ 9.09     0.34   24.07  0.00  0.00   0.00
interfered             *   84/ 1000  45.72/ 90.91    0.48   41.09  0.00  0.00   0.00

Signed-off-by: Chengming Zhou <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | iocost_monitor: print vrate inuse along with base_vrate | Chengming Zhou | 1 | -2/+5
The real vrate iocost inuse is not base_vrate, but the atomic vtime_rate. We need iocost_monitor tool to display this real vrate that iocost use, to check if the boosted compensated vrate is normal. Effect after change: nvme0n1 RUN per=50.0ms cur_per=172116.580:v1040587.433 busy= +0 \ vrate=135.00%:270.00% params=ssd_dfl(CQ) ^ | this is real vrate inuse Signed-off-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-08 | iocost_monitor: fix kernel queue kobj changes | Chengming Zhou | 1 | -1/+1
When I use iocost_monitor on nvme0n1, this error shows up: "Could not find ioc for nvme0n1" There is no kobj in struct queue in recent kernels; it seems that commit 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") moved the queue kobj to struct gendisk. Fix it by using mq_kobj which is at the same level as the queue kobj. Signed-off-by: Chengming Zhou <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-04 | fs/Kconfig: Fix compile error for romfs | Li Zetao | 1 | -0/+1
There are some compile errors reported by kernel test robot:

arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_read':
storage.c:(.text+0x64): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x9c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strnlen':
storage.c:(.text+0x128): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x16c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: fs/romfs/storage.o: in function `romfs_dev_strcmp':
storage.c:(.text+0x22c): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: storage.c:(.text+0x27c): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2a8): undefined reference to `__bread_gfp'
arm-linux-gnueabi-ld: storage.c:(.text+0x2bc): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2d4): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x2f4): undefined reference to `__brelse'
arm-linux-gnueabi-ld: storage.c:(.text+0x304): undefined reference to `__brelse'

The reason for the problem is that the commit "925c86a19bac" ("fs: add CONFIG_BUFFER_HEAD") has added a new config "CONFIG_BUFFER_HEAD" that controls building the buffer_head code, and romfs needs to use the buffer_head API, but no corresponding config has been added. Select the config "CONFIG_BUFFER_HEAD" in romfs Kconfig to resolve the problem.

Fixes: 925c86a19bac ("fs: add CONFIG_BUFFER_HEAD")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Reviewed-by: Luis Chamberlain <[email protected]>
Tested-by: Li Zetao <[email protected]>
Signed-off-by: Li Zetao <[email protected]>
[axboe: fold in Christoph's incremental]
Signed-off-by: Jens Axboe <[email protected]>
2023-08-02 | fs: add CONFIG_BUFFER_HEAD | Christoph Hellwig | 37 | -29/+119
Add a new config option that controls building the buffer_head code, and select it from all file systems and stacking drivers that need it. For the block device nodes an alternative iomap based buffered I/O path is provided when buffer_head support is not enabled, and iomap needs a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call into the buffer_head code when it doesn't exist. Otherwise this is just Kconfig and ifdef changes. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02 | block: use iomap for writes to block devices | Christoph Hellwig | 2 | -2/+30
Use iomap in buffer_head compat mode to write to block devices. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02block: stop setting ->direct_IOChristoph Hellwig1-2/+1
Direct I/O on block devices now never goes through aops->direct_IO. Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02block: open code __generic_file_write_iter for blkdev writesChristoph Hellwig1-2/+43
Open code __generic_file_write_iter to remove the indirect call into ->direct_IO and to prepare for using the iomap based write code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-08-02fs: rename and move block_page_mkwrite_returnChristoph Hellwig8-25/+31
block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
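As a rough illustration of what such a helper does, the userspace sketch below maps a filesystem error code to a page-fault result. The enum and the function name `sketch_vmf_fs_error()` are hypothetical stand-ins for the kernel's `VM_FAULT_*` codes and helper, and the exact mapping is an assumption based on the helper's general behavior rather than a copy of the kernel source.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's VM_FAULT_* result codes. */
enum sketch_vm_fault {
	SKETCH_VM_FAULT_LOCKED,	/* success, page returned locked */
	SKETCH_VM_FAULT_NOPAGE,	/* no page installed, fault is retried */
	SKETCH_VM_FAULT_OOM,	/* allocation failure */
	SKETCH_VM_FAULT_SIGBUS,	/* unrecoverable error for this access */
};

/* Translate a negative errno from the filesystem into a fault result. */
static inline enum sketch_vm_fault sketch_vmf_fs_error(int err)
{
	if (err == 0)
		return SKETCH_VM_FAULT_LOCKED;
	if (err == -EFAULT || err == -EAGAIN)
		return SKETCH_VM_FAULT_NOPAGE;
	if (err == -ENOMEM)
		return SKETCH_VM_FAULT_OOM;
	/* Everything else (for example -ENOSPC or -EIO) ends in SIGBUS. */
	return SKETCH_VM_FAULT_SIGBUS;
}
```

Nothing in this translation depends on the block layer or on mkwrite specifically, which is consistent with the commit's argument for moving the helper to mm.h.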
2023-08-02fs: remove emergency_thaw_bdevChristoph Hellwig3-13/+3
Fold emergency_thaw_bdev into its only caller, to prepare for buffer.c to be built only when buffer_head support is enabled. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-07-29Merge tag 'md-next-20230729' of ↵Jens Axboe15-394/+431
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.6/block Pull MD updates from Song: "1. Deprecate bitmap file support, by Christoph Hellwig; 2. Fix deadlock with md sync thread, by Yu Kuai; 3. Refactor md io accounting, by Yu Kuai; 4. Various non-urgent fixes by Li Nan, Yu Kuai, and Jack Wang." * tag 'md-next-20230729' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (36 commits) md/md-bitmap: hold 'reconfig_mutex' in backlog_store() md/md-bitmap: remove unnecessary local variable in backlog_store() md/raid10: use dereference_rdev_and_rrdev() to get devices md/raid10: factor out dereference_rdev_and_rrdev() md/raid10: check replacement and rdev to prevent submit the same io twice md/raid1: Avoid lock contention from wake_up() md: restore 'noio_flag' for the last mddev_resume() md: don't quiesce in mddev_suspend() md: remove redundant check in fix_read_error() md/raid10: optimize fix_read_error md/raid1: prioritize adding disk to 'removed' mirror md/md-faulty: enable io accounting md/md-linear: enable io accounting md/md-multipath: enable io accounting md/raid10: switch to use md_account_bio() for io accounting md/raid1: switch to use md_account_bio() for io accounting raid5: fix missing io accounting in raid5_align_endio() md: also clone new io if io accounting is disabled md: move initialization and destruction of 'io_acct_set' to md.c md: deprecate bitmap file support ...
2023-07-27md/md-bitmap: hold 'reconfig_mutex' in backlog_store()Yu Kuai1-0/+7
Several reasons why 'reconfig_mutex' should be held: 1) rdev_for_each() is not safe to be called without the lock, because rdev can be removed concurrently. 2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not be called concurrently. 3) mddev_suspend() from mddev_destroy/create_serial_pool() should be protected by the lock. Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/md-bitmap: remove unnecessary local variable in backlog_store()Yu Kuai1-2/+0
The local variable is already defined at the beginning of backlog_store(), so there is no need to define it again. Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device") Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>
2023-07-27md/raid10: use dereference_rdev_and_rrdev() to get devicesLi Nan1-10/+5
Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace rdev") reads replacement first to prevent io loss. However, the same issue also exists in wait_blocked_dev() and raid10_handle_discard(). Fix it by using dereference_rdev_and_rrdev() to get devices. Fixes: d30588b2731f ("md/raid10: improve raid10 discard request") Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function") Signed-off-by: Li Nan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]>