aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-12-05blk-throttle: fix typo in comment of throtl_adjusted_limitKemeng Shi1-1/+1
lapsed time -> elapsed time Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-05blk-throttle: simpfy low limit reached check in throtl_tg_can_upgradeKemeng Shi1-13/+18
Commit c79892c557616 ("blk-throttle: add upgrade logic for LIMIT_LOW state") added upgrade logic for low limit and methioned that 1. "To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit." 2. "If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed." Besides, we also determine that cgroup reaches low limit if low limit is 0, see comment in throtl_tg_can_upgrade. Collect the information above, the desgin of upgrade check is as following: 1.The low limit is reached if limit is zero or io is already queued. 2.Cgroup will pass upgrade check if low limits of READ and WRITE are both reached. Simpfy the check code described above to removce repeat check and improve readability. There is no functional change. Detail equivalence proof is as following: All replaced conditions to return true are as following: condition 1 (!read_limit && !write_limit) condition 2 read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE]) condition 3 write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ]) Transferring condition 2 as following: (read_limit && sq->nr_queued[READ]) && (!write_limit || sq->nr_queued[WRITE]) is equivalent to (read_limit && sq->nr_queued[READ]) && (!write_limit || (write_limit && sq->nr_queued[WRITE])) is equivalent to condition 2.1 (read_limit && sq->nr_queued[READ] && !write_limit) || condition 2.2 (read_limit && sq->nr_queued[READ] && (write_limit && sq->nr_queued[WRITE])) Transferring condition 3 as following: write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ]) is equivalent to (write_limit && sq->nr_queued[WRITE]) && (!read_limit || (read_limit && sq->nr_queued[READ])) is equivalent to condition 3.1 ((write_limit && sq->nr_queued[WRITE]) && !read_limit) || condition 3.2 ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) Condition 3.2 is the same as condition 2.2, so all conditions we get to return are as following: (!read_limit && !write_limit) (1) (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1) ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1) ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) (2.2) As we can extract conditions "(a1 || a2) && (b1 || b2)" to: a1 && b1 a1 && b2 a2 && b1 ab && b2 Considering that: a1 = !read_limit a2 = read_limit && sq->nr_queued[READ] b1 = !write_limit b2 = write_limit && sq->nr_queued[WRITE] We can pack replaced conditions to (!read_limit || (read_limit && sq->nr_queued[READ])) && (!write_limit || (write_limit && sq->nr_queued[WRITE])) which is equivalent to (!read_limit || sq->nr_queued[READ]) && (!write_limit || sq->nr_queued[WRITE]) Reported-by: kernel test robot <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-05blk-throttle: correct calculation of wait time in tg_may_dispatchKemeng Shi1-25/+13
In C language, When executing "if (expression1 && expression2)" and expression1 return false, the expression2 may not be executed. For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) && tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is limited, tg_within_bps_limit will return false and tg_within_iops_limit will not be called. So even bps and iops are both limited, iops_wait will not be calculated and is always zero. So wait time of iops is always ignored. Fix this by always calling tg_within_bps_limit and tg_within_iops_limit to get wait time for both bps and iops. Observed that: 1. Wait time in tg_within_iops_limit/tg_within_bps_limit need always be stored as wait argument is always passed. 2. wait time is stored to zero if iops/bps is limited otherwise non-zero is stored. Simpfy tg_within_iops_limit/tg_within_bps_limit by removing wait argument and return wait time directly. Caller tg_may_dispatch checks if wait time is zero to find if iops/bps is limited. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-05blk-throttle: ignore cgroup without io queued in blk_throtl_cancel_biosKemeng Shi1-1/+12
Ignore cgroup without io queued in blk_throtl_cancel_bios for two reasons: 1. Save cpu cycle for trying to dispatch cgroup which is no io queued. 2. Avoid non-consistent state that cgroup is inserted to service queue without THROTL_TG_PENDING set as tg_update_disptime will unconditional re-insert cgroup to service queue. If we are on the default hierarchy, IO dispatched from child in tg_dispatch_one_bio will trigger inserting cgroup to service queue without erase first and ruin the tree. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-05blk-throttle: Fix that bps of child could exceed bps limited in parentKemeng Shi1-1/+1
Consider situation as following (on the default hierarchy): HDD | root (bps limit: 4k) | child (bps limit :8k) | fio bs=8k Rate of fio is supposed to be 4k, but result is 8k. Reason is as following: Size of single IO from fio is larger than bytes allowed in one throtl_slice in child, so IOs are always queued in child group first. When queued IOs in child are dispatched to parent group, BIO_BPS_THROTTLED is set and these IOs will not be limited by tg_within_bps_limit anymore. Fix this by only set BIO_BPS_THROTTLED when the bio traversed the entire tree. There patch has no influence on situation which is not on the default hierarchy as each group is a single root group without parent. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-05blk-throttle: correct stale comment in throtl_pd_initKemeng Shi1-2/+3
On the default hierarchy (cgroup2), the throttle interface files don't exist in the root cgroup, so the ablity to limit the whole system by configuring root group is not existing anymore. In general, cgroup doesn't wanna be in the business of restricting resources at the system level, so correct the stale comment that we can limit whole system to we can only limit subtree. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-04Merge tag 'floppy-for-6.2' of https://github.com/evdenis/linux-floppy into ↵Jens Axboe1-1/+3
for-6.2/block Pull floppy fix from Denis: "Floppy patch for 6.2 The patch from Yuan Can fixes a memory leak in floppy init code. Signed-off-by: Denis Efremov <[email protected]>" * tag 'floppy-for-6.2' of https://github.com/evdenis/linux-floppy: floppy: Fix memory leak in do_floppy_init()
2022-12-04floppy: Fix memory leak in do_floppy_init()Yuan Can1-1/+3
A memory leak was reported when floppy_alloc_disk() failed in do_floppy_init(). unreferenced object 0xffff888115ed25a0 (size 8): comm "modprobe", pid 727, jiffies 4295051278 (age 25.529s) hex dump (first 8 bytes): 00 ac 67 5b 81 88 ff ff ..g[.... backtrace: [<000000007f457abb>] __kmalloc_node+0x4c/0xc0 [<00000000a87bfa9e>] blk_mq_realloc_tag_set_tags.part.0+0x6f/0x180 [<000000006f02e8b1>] blk_mq_alloc_tag_set+0x573/0x1130 [<0000000066007fd7>] 0xffffffffc06b8b08 [<0000000081f5ac40>] do_one_initcall+0xd0/0x4f0 [<00000000e26d04ee>] do_init_module+0x1a4/0x680 [<000000001bb22407>] load_module+0x6249/0x7110 [<00000000ad31ac4d>] __do_sys_finit_module+0x140/0x200 [<000000007bddca46>] do_syscall_64+0x35/0x80 [<00000000b5afec39>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 unreferenced object 0xffff88810fc30540 (size 32): comm "modprobe", pid 727, jiffies 4295051278 (age 25.529s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<000000007f457abb>] __kmalloc_node+0x4c/0xc0 [<000000006b91eab4>] blk_mq_alloc_tag_set+0x393/0x1130 [<0000000066007fd7>] 0xffffffffc06b8b08 [<0000000081f5ac40>] do_one_initcall+0xd0/0x4f0 [<00000000e26d04ee>] do_init_module+0x1a4/0x680 [<000000001bb22407>] load_module+0x6249/0x7110 [<00000000ad31ac4d>] __do_sys_finit_module+0x140/0x200 [<000000007bddca46>] do_syscall_64+0x35/0x80 [<00000000b5afec39>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 If the floppy_alloc_disk() failed, disks of current drive will not be set, thus the lastest allocated set->tag cannot be freed in the error handling path. A simple call graph shown as below: floppy_module_init() floppy_init() do_floppy_init() for (drive = 0; drive < N_DRIVE; drive++) blk_mq_alloc_tag_set() blk_mq_alloc_tag_set_tags() blk_mq_realloc_tag_set_tags() # set->tag allocated floppy_alloc_disk() blk_mq_alloc_disk() # error occurred, disks failed to allocated ->out_put_disk: for (drive = 0; drive < N_DRIVE; drive++) if (!disks[drive][0]) # the last disks is not set and loop break break; blk_mq_free_tag_set() # the latest allocated set->tag leaked Fix this problem by free the set->tag of current drive before jump to error handling path. Cc: [email protected] Fixes: 302cfee15029 ("floppy: use a separate gendisk for each media format") Signed-off-by: Yuan Can <[email protected]> [efremov: added stable list, changed title] Signed-off-by: Denis Efremov <[email protected]>
2022-12-03block: remove devnode callback from struct block_device_operationsGreg Kroah-Hartman2-12/+0
With the removal of the pktcdvd driver, there are no in-kernel users of the devnode callback in struct block_device_operations, so it can be safely removed. If it is needed for new block drivers in the future, it can be brought back. Cc: Jens Axboe <[email protected]> Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-02Merge branch 'md-next' of ↵Jens Axboe2-62/+36
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.2/block Pull MD fixes from Song: "This contains code cleanup by Christoph." * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fold unbind_rdev_from_array into md_kick_rdev_from_array md: mark md_kick_rdev_from_array static md: remove lock_bdev / unlock_bdev
2022-12-02pktcdvd: remove driver.Greg Kroah-Hartman8-3419/+0
Way back in 2016 in commit 5a8b187c61e9 ("pktcdvd: mark as unmaintained and deprecated") this driver was marked as "will be removed soon". 5 years seems long enough to have it stick around after that, so finally remove the thing now. Reported-by: Christoph Hellwig <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Thomas Maier <[email protected]> Cc: Peter Osterlund <[email protected]> Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-02md: fold unbind_rdev_from_array into md_kick_rdev_from_arrayChristoph Hellwig1-21/+16
unbind_rdev_from_array is only called from md_kick_rdev_from_array, so merge it into its only caller. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-12-02md: mark md_kick_rdev_from_array staticChristoph Hellwig2-3/+1
md_kick_rdev_from_array is only used in md.c, so unexport it and mark the symbol static. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-12-02md: remove lock_bdev / unlock_bdevChristoph Hellwig1-41/+22
These wrappers for blkdev_get / blkdev_put just horribly confuse the code with their odd naming. Remove them and improve the error unwinding in md_import_device with the now folded code. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Song Liu <[email protected]>
2022-12-01blk-cgroup: Fix some kernel-doc commentsYang Li1-1/+1
Make the description of @gendisk to @disk in blkcg_schedule_throttle() to clear the below warnings: block/blk-cgroup.c:1850: warning: Function parameter or member 'disk' not described in 'blkcg_schedule_throttle' block/blk-cgroup.c:1850: warning: Excess function parameter 'gendisk' description in 'blkcg_schedule_throttle' Fixes: de185b56e8a6 ("blk-cgroup: pass a gendisk to blkcg_schedule_throttle") Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3338 Reported-by: Abaci Robot <[email protected]> Signed-off-by: Yang Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01null_blk: support read-only and offline zone conditionsShin'ichiro Kawasaki3-4/+121
In zoned mode, zones with write pointers can have conditions "read-only" or "offline". In read-only condition, zones can not be written. In offline condition, the zones can be neither written nor read. These conditions are intended for zones with media failures, then it is difficult to set those conditions to zones on real devices. To test handling of zones in the conditions, add a feature to null_blk to set up zones in read-only or offline condition. Add new configuration attributes "zone_readonly" and "zone_offline". Write a sector to the attribute files to specify the target zone to set the zone conditions. For example, following command lines do it: echo 0 > nullb1/zone_readonly echo 524288 > nullb1/zone_offline When the specified zones are already in read-only or offline condition, normal empty condition is restored to the zones. These condition changes can be done only after the null_blk device get powered, since status area of each zone is not yet allocated before power-on. Also improve zone condition checks to inhibit all commands for zones in offline conditions. In same manner, inhibit write and zone management commands for zones in read-only condition. Signed-off-by: Shin'ichiro Kawasaki <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01drbd: add context parameter to expect() macroChristoph Böhmwalder6-42/+42
Originally-from: Andreas Gruenbacher <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01drbd: introduce drbd_ratelimit()Christoph Böhmwalder8-18/+26
Use call site specific ratelimit instead of one single static global. Also ratelimit ASSERTION messages generated by expect(). Originally-from: Lars Ellenberg <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01drbd: introduce dynamic debugChristoph Böhmwalder1-36/+97
Incorporate as many out-of-tree changes as possible without changing the genl API. Over the years, we restructured this several times, and also changed the log format. One breaking change is that DRBD 9 gained "implicit options", like a connection name. This cannot be replayed here without changing the API, so save it for later. Originally-from: Andreas Gruenbacher <[email protected]> Originally-from: Philipp Reisner <[email protected]> Originally-from: Lars Ellenberg <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01drbd: split polymorph printk to its own fileChristoph Böhmwalder2-67/+73
Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01drbd: unify how failed assertions are loggedChristoph Böhmwalder1-3/+5
Unify how failed assertions from D_ASSERT() and expect() are logged. Originally-from: Andreas Gruenbacher <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01block: bdev & blktrace: use consistent function doc. notationRandy Dunlap2-4/+4
Use only one hyphen in kernel-doc notation between the function name and its short description. The is the documented kerenl-doc format. It also fixes the HTML presentation to be consistent with other functions. Signed-off-by: Randy Dunlap <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01blk-iocost: Correct comment in blk_iocost_initKemeng Shi1-1/+1
There is no iocg_pd_init function. The pd_alloc_fn function pointer of iocost policy is set with ioc_pd_init. Just correct it. Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01blk-iocost: Remove vrate member in struct ioc_nowKemeng Shi1-3/+3
If we trace vtime_base_rate instead of vtime_rate, there is nowhere which accesses now->vrate except function ioc_now using now->vrate locally. Just remove it. Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01blk-iocost: Trace vtime_base_rate instead of vtime_rateKemeng Shi2-3/+3
Since commit ac33e91e2daca ("blk-iocost: implement vtime loss compensation") rename original vtime_rate to vtime_base_rate and current vtime_rate is original vtime_rate with compensation. The current rate showed in tracepoint is mixed with vtime_rate and vtime_base_rate: 1) In function ioc_adjust_base_vrate, the first trace_iocost_ioc_vrate_adj shows vtime_rate, the second trace_iocost_ioc_vrate_adj shows vtime_base_rate. 2) In function iocg_activate shows vtime_rate by calling TRACE_IOCG_PATH(iocg_activate... 3) In function ioc_check_iocgs shows vtime_rate by calling TRACE_IOCG_PATH(iocg_idle... Trace vtime_base_rate instead of vtime_rate as: 1) Before commit ac33e91e2daca ("blk-iocost: implement vtime loss compensation"), the traced rate is without compensation, so still show rate without compensation. 2) The vtime_base_rate is more stable while vtime_rate heavily depends on excess budeget on current period which may change abruptly in next period. Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01blk-iocost: Reset vtime_base_rate in ioc_refresh_paramsKemeng Shi1-1/+3
Since commit ac33e91e2daca("blk-iocost: implement vtime loss compensation") split vtime_rate into vtime_rate and vtime_base_rate, we need reset both vtime_base_rate and vtime_rate when device parameters are refreshed. If vtime_base_rate is no reset here, vtime_rate will be overwritten with old vtime_base_rate soon in ioc_refresh_vrate. Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01blk-iocost: Fix typo in commentKemeng Shi1-1/+1
soley -> solely Signed-off-by: Kemeng Shi <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-12-01block: Do not reread partition table on exclusively open deviceJan Kara3-8/+13
Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") we allow rereading of partition table although there are users of the block device. This has an undesirable consequence that e.g. if sda and sdb are assembled to a RAID1 device md0 with partitions, BLKRRPART ioctl on sda will rescan partition table and create sda1 device. This partition device under a raid device confuses some programs (such as libstorage-ng used for initial partitioning for distribution installation) leading to failures. Fix the problem refusing to rescan partitions if there is another user that has the block device exclusively open. Cc: [email protected] Link: https://lore.kernel.org/all/20221130135344.2ul4cyfstfs3znxg@quack3 Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions") Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: fold in followup fix] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30virtio-blk: replace ida_simple[get|remove] with ida_[alloc_range|free]Pankaj Raghav1-4/+4
ida_simple[get|remove] are deprecated, and are just wrappers to ida_[alloc_range|free]. Replace ida_simple[get|remove] with their corresponding counterparts. No functional changes. Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Stefan Hajnoczi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30block: mark blk_put_queue as potentially blockingChristoph Hellwig1-4/+2
We can't just say that the last reference release may block, as any reference dropped could be the last one. So move the might_sleep() from blk_free_queue to blk_put_queue and update the documentation. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30block: untangle request_queue refcounting from sysfsChristoph Hellwig8-87/+71
The kobject embedded into the request_queue is used for the queue directory in sysfs, but that is a child of the gendisks directory and is intimately tied to it. Move this kobject to the gendisk and use a refcount_t in the request_queue for the actual request_queue refcounting that is completely unrelated to the device model. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30block: fix error unwinding in blk_register_queueChristoph Hellwig1-12/+16
blk_register_queue fails to handle errors from blk_mq_sysfs_register, leaks various resources on errors and accidentally sets queue refs percpu refcount to percpu mode on kobject_add failure. Fix all that by properly unwinding on errors. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30block: factor out a blk_debugfs_remove helperChristoph Hellwig1-7/+14
Split the debugfs removal from blk_unregister_queue into a helper so that the it can be reused for blk_register_queue error handling. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30blk-crypto: pass a gendisk to blk_crypto_sysfs_{,un}registerChristoph Hellwig3-9/+12
Prepare for changes to the block layer sysfs handling by passing the readily available gendisk to blk_crypto_sysfs_{,un}register. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-30Revert "blk-cgroup: Flush stats at blkgs destruction path"Jens Axboe3-35/+1
This reverts commit dae590a6c96c799434e0ff8156ef29b88c257e60. We've had a few reports on this causing a crash at boot time, because of a reference issue. While this problem seemginly did exist before the patch and needs solving separately, this patch makes it a lot easier to trigger. Link: https://lore.kernel.org/linux-block/CA+QYu4oxiRKC6hJ7F27whXy-PRBx=Tvb+-7TQTONN8qTtV3aDA@mail.gmail.com/ Link: https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Jens Axboe <[email protected]>
2022-11-29block: use bool as the return type of elv_iosched_allow_bio_mergeJinlong Chen1-2/+2
We have bool type now, update the old signature. Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/0db0a0298758d60d0f4df8b7126ac6a381e5a5bb.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-29block: replace "len+name" with "name+len" in elv_iosched_showJinlong Chen1-1/+1
The "pointer + offset" pattern is more resonable. Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/d9beaee71b14f7b2a39ab0db6458dc0f7d961ceb.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-29block: always use 'e' when printing scheduler nameJinlong Chen1-1/+1
Printing e->elevator_name in all cases improves the readability, and 'e' and 'cur' are identical in this branch. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinlong Chen <[email protected]> Link: https://lore.kernel.org/r/4bae180ffbac608ea0cf46ffa9739ce0973b60aa.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-29block: replace continue with else-if in elv_iosched_showJinlong Chen1-4/+2
else-if is more readable than continue here. Signed-off-by: Jinlong Chen <[email protected]> Link: https://lore.kernel.org/r/77ac19ba556efd2c8639a6396eb4203c59bc13d6.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-29block: include 'none' for initial elv_iosched_show callJinlong Chen1-5/+4
This makes the printing order of the io schedulers consistent, and removes a redundant q->elevator check. Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/bdd7083ed4f232e3285f39081e3c5f30b20b8da2.1669736350.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-29Merge tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme into ↵Jens Axboe17-510/+766
for-6.2/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.2 - support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - refactor PCIe probing and reset (Christoph Hellwig) - various fabrics authentication fixes and improvements (Sagi Grimberg) - avoid fallback to sequential scan due to transient issues (Uday Shankar) - implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - force reconnect when number of queue changes in nvmet (Daniel Wagner) - minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET)" * tag 'nvme-6.2-2022-11-29' of git://git.infradead.org/nvme: (45 commits) nvmet: expose firmware revision to configfs nvmet: expose IEEE OUI to configfs nvme: rename the queue quiescing helpers nvmet: fix a memory leak in nvmet_auth_set_key nvme: return err on nvme_init_non_mdts_limits fail nvme: avoid fallback to sequential scan due to transient issues nvme-rdma: stop auth work after tearing down queues in error recovery nvme-tcp: stop auth work after tearing down queues in error recovery nvme-auth: have dhchap_auth_work wait for queues auth to complete nvme-auth: remove redundant auth_work flush nvme-auth: convert dhchap_auth_list to an array nvme-auth: check chap ctrl_key once constructed nvme-auth: no need to reset chap contexts on re-authentication nvme-auth: remove redundant deallocations nvme-auth: clear sensitive info right after authentication completes nvme-auth: guarantee dhchap buffers under memory pressure nvme-auth: don't keep long lived 4k dhchap buffer nvme-auth: remove redundant if statement nvme-auth: don't override ctrl keys before validation nvme-auth: don't ignore key generation failures when initializing ctrl keys ...
2022-11-28block: mq-deadline: Rename deadline_is_seq_writes()Damien Le Moal1-2/+2
Rename deadline_is_seq_writes() to deadline_is_seq_write() (remove the "s" plural) to more correctly reflect the fact that this function tests a single request, not multiple requests. Fixes: 015d02f48537 ("block: mq-deadline: Do not break sequential write streams to zoned HDDs") Signed-off-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-25blk-mq: fix possible memleak when register 'hctx' failedYe Bin1-2/+9
There's issue as follows when do fault injection test: unreferenced object 0xffff888132a9f400 (size 512): comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff ...........2.... 08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00 ...2............ backtrace: [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0 [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0 [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230 [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910 [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0 [<00000000a2a34657>] 0xffffffffa2ad310f [<00000000b173f718>] 0xffffffffa2af824a [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0 [<00000000f32fdf93>] do_init_module+0xdf/0x320 [<00000000cbe8541e>] load_module+0x3006/0x3390 [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0 [<00000000a1a29ae8>] do_syscall_64+0x35/0x80 [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fault injection context as follows: kobject_add blk_mq_register_hctx blk_mq_sysfs_register blk_register_queue device_add_disk null_add_dev.part.0 [null_blk] As 'blk_mq_register_hctx' may already add some objects when failed halfway, but there isn't do fallback, caller don't know which objects add failed. To solve above issue just do fallback when add objects failed halfway in 'blk_mq_register_hctx'. Signed-off-by: Ye Bin <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-24block: fix crash in 'blk_mq_elv_switch_none'Ye Bin1-1/+1
Syzbot found the following issue: general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef] CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:__elevator_get block/elevator.h:94 [inline] RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline] RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline] RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709 RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206 RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000 RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8 RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517 R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70 R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380 FS: 0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355 nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline] __nbd_ioctl drivers/block/nbd.c:1481 [inline] nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521 blkdev_ioctl+0x36e/0x800 block/ioctl.c:614 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl fs/ioctl.c:856 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd As after dd6f7f17bf58 commit move '__elevator_get(qe->type)' before set 'qe->type', so will lead to access wild pointer. To solve above issue get 'qe->type' after set 'qe->type'. Reported-by: [email protected] Fixes:dd6f7f17bf58("block: add proper helpers for elevator_type module refcount management") Signed-off-by: Ye Bin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-24drbd: destroy workqueue when drbd device was freedWang ShaoBo1-1/+5
A submitter workqueue is dynamically allocated by init_submitter() called by drbd_create_device(), we should destroy it when this device is not needed or destroyed. Fixes: 113fef9e20e0 ("drbd: prepare to queue write requests on a submit worker") Signed-off-by: Wang ShaoBo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-24drbd: remove call to memset before free device/resource/connectionWang ShaoBo1-3/+0
This revert c2258ffc56f2 ("drbd: poison free'd device, resource and connection structs"), add memset is odd here for debugging, there are some methods to accurately show what happened, such as kdump. Signed-off-by: Wang ShaoBo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-24block: mq-deadline: Do not break sequential write streams to zoned HDDsDamien Le Moal1-4/+62
mq-deadline ensures an in order dispatching of write requests to zoned block devices using a per zone lock (a bit). This implies that for any purely sequential write workload, the drive is exercised most of the time at a maximum queue depth of one. However, when such sequential write workload crosses a zone boundary (when sequentially writing multiple contiguous zones), zone write locking may prevent the last write to one zone to be issued (as the previous write is still being executed) but allow the first write to the following zone to be issued (as that zone is not yet being writen and not locked). This result in an out of order delivery of the sequential write commands to the device every time a zone boundary is crossed. While such behavior does not break the sequential write constraint of zoned block devices (and does not generate any write error), some zoned hard-disks react badly to seeing these out of order writes, resulting in lower write throughput. This problem can be addressed by always dispatching the first request of a stream of sequential write requests, regardless of the zones targeted by these sequential writes. To do so, the function deadline_skip_seq_writes() is introduced and used in deadline_next_request() to select the next write command to issue if the target device is an HDD (blk_queue_nonrot() being false). deadline_fifo_request() is modified using the new deadline_earlier_request() and deadline_is_seq_write() helpers to ignore requests in the fifo list that have a preceding request in lba order that is sequential. With this fix, a sequential write workload executed with the following fio command: fio --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \ --size=68719476736 --ioengine=libaio --iodepth=32 --rw=write \ --bs=65536 results in an increase from 225 MB/s to 250 MB/s of the write throughput of an SMR HDD (11% increase). Cc: <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-24block: mq-deadline: Fix dd_finish_request() for zoned devicesDamien Le Moal1-2/+15
dd_finish_request() tests if the per prio fifo_list is not empty to determine if request dispatching must be restarted for handling blocked write requests to zoned devices with a call to blk_mq_sched_mark_restart_hctx(). While simple, this implementation has 2 problems: 1) Only the priority level of the completed request is considered. However, writes to a zone may be blocked due to other writes to the same zone using a different priority level. While this is unlikely to happen in practice, as writing a zone with different IO priorirites does not make sense, nothing in the code prevents this from happening. 2) The use of list_empty() is dangerous as dd_finish_request() does not take dd->lock and may run concurrently with the insert and dispatch code. Fix these 2 problems by testing the write fifo list of all priority levels using the new helper dd_has_write_work(), and by testing each fifo list using list_empty_careful(). Fixes: c807ab520fc3 ("block/mq-deadline: Add I/O priority support") Cc: <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-23blk-crypto: Add a missing include directiveBart Van Assche1-0/+1
Allow the compiler to verify consistency of function declarations and function definitions. This patch fixes the following sparse errors: block/blk-crypto-profile.c:241:14: error: no previous prototype for ‘blk_crypto_get_keyslot’ [-Werror=missing-prototypes] 241 | blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:318:6: error: no previous prototype for ‘blk_crypto_put_keyslot’ [-Werror=missing-prototypes] 318 | void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot) | ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:344:6: error: no previous prototype for ‘__blk_crypto_cfg_supported’ [-Werror=missing-prototypes] 344 | bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:373:5: error: no previous prototype for ‘__blk_crypto_evict_key’ [-Werror=missing-prototypes] 373 | int __blk_crypto_evict_key(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~ Cc: Eric Biggers <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-23elevator: remove an outdated comment in elevator_changeJinlong Chen1-3/+0
mq is no longer a special case. Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/cbf47824fc726440371e74c867bf635ae1b671a3.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>