Age   Commit message   Author   Files   Lines
2022-07-28dm: Fix PR release handling for non All RegistrantsMike Christie1-14/+34
This commit fixes a bug where we are leaving the reservation in place even though pr_release has run and returned success.

If we have a Write Exclusive, Exclusive Access, or Write/Exclusive Registrants only reservation, the release must be sent down the path that is the reservation holder. The problem is multipath_prepare_ioctl most likely selected path N for the reservation, then later when we do the release multipath_prepare_ioctl will select a completely different path. The device will then return success because the nvme and scsi specs say to return success if there is no reservation or if the release is sent down from a path that is not the holder. We then think we have released the reservation.

This commit has us loop over each path and send a release so we can make sure the release is executed on the correct path. It has been tested with windows failover clustering's validation test which checks this case, and it has been tested manually (the libiscsi PGR tests don't have a test case for this yet, but I will be adding one).

Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
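The failure mode and the fix described in this commit message can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct mp_path`, `struct mp_device`, `device_release()` and `release_on_all_paths()` are hypothetical names invented for illustration, not the dm or multipath code. The model assumes the device reports success for a release from a non-holder path while keeping the reservation, as the message says the specs require.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one path to a shared device. */
struct mp_path {
    bool usable;     /* path can currently carry commands */
    bool is_holder;  /* the reservation was taken through this path */
};

/* Hypothetical device state shared by all paths. */
struct mp_device {
    bool reserved;
};

/*
 * Model of the device side: a release arriving on a non-holder path
 * returns success but leaves the reservation in place.
 */
static int device_release(struct mp_device *dev, const struct mp_path *p)
{
    if (dev->reserved && p->is_holder)
        dev->reserved = false;
    return 0; /* success in both cases */
}

/*
 * Model of the fix: send the release down every usable path so that
 * the holder's path is covered no matter which one it is.
 */
static int release_on_all_paths(struct mp_device *dev,
                                const struct mp_path *paths, size_t n)
{
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        int r;

        if (!paths[i].usable)
            continue;
        r = device_release(dev, &paths[i]);
        if (r)
            ret = r; /* remember the last error, keep going */
    }
    return ret;
}
```

In this model a single `device_release()` on a non-holder path returns 0 and leaves `reserved` set, which is the bug; `release_on_all_paths()` clears it because the holder is always among the paths visited.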
2022-07-28dm: Start pr_reserve from the same starting pathMike Christie1-14/+32
When an app does a pr_reserve it will go to whatever path we happen to be using at the time. This can result in errors when the app does a second pr_reserve call and expects success but gets a failure because the reserve is not done on the holder's path. This commit has us always start trying to do reserves from the first path in the first group. Windows failover clustering will produce the type of pattern above. With this commit, we will now pass its validation test for this case. Signed-off-by: Mike Christie <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-28dm: Allow dm_call_pr to be used for path searchesMike Christie1-12/+38
The specs state that if you send a reserve down a path that is already the holder success must be returned and if it goes down a path that is not the holder reservation conflict must be returned. Windows failover clustering will send a second reservation and expects that a device returns success. The problem for multipathing is that for an All Registrants reservation, we can send the reserve down any path but for all other reservation types there is one path that is the holder. To handle this we could add PR state to dm but that can get nasty. Look at target_core_pr.c for an example of the type of things we'd have to track. It will also get more complicated because other initiators can change the state so we will have to add in async event/sense handling. This commit, and the 3 commits that follow, tries to keep dm simple and keep just doing passthrough. This commit modifies dm_call_pr to be able to find the first usable path that can execute our pr_op then return. When dm_pr_reserve is converted to dm_call_pr in the next commit for the normal case we will use the same path for every reserve. Signed-off-by: Mike Christie <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
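The "find the first usable path that can execute our pr_op then return" behaviour can be pictured as an iterator whose callback decides when to stop. The sketch below is a hedged illustration only: `struct pr_path`, `pr_visit_fn`, `visit_until_done()` and `reserve_on_first_usable()` are hypothetical names, and the real dm_call_pr works on DM tables and targets rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical path descriptor; paths are kept in a fixed order. */
struct pr_path {
    int id;
    bool usable;
};

/* Callback returns true when the operation ran and iteration should stop. */
typedef bool (*pr_visit_fn)(const struct pr_path *p, void *data);

/* Walk the paths in order and stop at the first one the callback accepts. */
static int visit_until_done(const struct pr_path *paths, size_t n,
                            pr_visit_fn fn, void *data)
{
    for (size_t i = 0; i < n; i++) {
        if (fn(&paths[i], data))
            return 0;
    }
    return -1; /* no path could execute the operation */
}

/* Example callback: "reserve" on the first usable path and record its id. */
static bool reserve_on_first_usable(const struct pr_path *p, void *data)
{
    if (!p->usable)
        return false;
    *(int *)data = p->id;
    return true;
}
```

Because the walk always starts from the same place, repeated calls pick the same path as long as the set of usable paths does not change, which is the property the message relies on for the normal case.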
2022-07-28dm: return early from dm_pr_call() if DM device is suspendedMike Snitzer1-0/+5
Otherwise PR ops may be issued while the broader DM device is being reconfigured, etc. Fixes: 9c72bad1f31a ("dm: call PR reserve/unreserve on each underlying device") Signed-off-by: Mike Snitzer <[email protected]>
2022-07-15dm thin: fix use-after-free crash in dm_sm_register_threshold_callbackLuo Meng2-3/+8
Fault inject on pool metadata device reports:

  BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
  Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950

  CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_address_description.constprop.0.cold+0xeb/0x3f4
   kasan_report.cold+0xe6/0x147
   dm_pool_register_metadata_threshold+0x40/0x80
   pool_ctr+0xa0a/0x1150
   dm_table_add_target+0x2c8/0x640
   table_load+0x1fd/0x430
   ctl_ioctl+0x2c4/0x5a0
   dm_ctl_ioctl+0xa/0x10
   __x64_sys_ioctl+0xb3/0xd0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This can be easily reproduced using:

  echo offline > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
  dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"

If a metadata commit fails, the transaction will be aborted and the metadata space maps will be destroyed. If a DM table reload then happens for this failed thin-pool, a use-after-free will occur in dm_sm_register_threshold_callback (called from dm_pool_register_metadata_threshold).

Fix this in dm_pool_register_metadata_threshold() by returning the -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr() with a new error message: "Error registering metadata threshold".

Fixes: ac8c3f3df65e4 ("dm thin: generate event when metadata threshold passed")
Cc: [email protected]
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Luo Meng <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
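The shape of the fix, refusing to touch the space map once the pool is in fail mode, can be sketched in plain C. Everything here is a hypothetical stand-in written for illustration: `struct pool_md`, its `fail_io` flag, `struct space_map_sketch` and `register_metadata_threshold()` are assumptions and not the dm-thin structures or signatures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*threshold_fn)(void *context);

/* Hypothetical space map holding a registered threshold callback. */
struct space_map_sketch {
    unsigned long threshold;
    threshold_fn fn;
    void *context;
};

/* Hypothetical pool metadata; metadata_sm is gone once the pool failed. */
struct pool_md {
    bool fail_io;
    struct space_map_sketch *metadata_sm;
};

/*
 * Check fail mode before dereferencing the space map, and report
 * -EINVAL so that the caller can fail table construction cleanly.
 */
static int register_metadata_threshold(struct pool_md *pmd,
                                       unsigned long threshold,
                                       threshold_fn fn, void *context)
{
    if (pmd->fail_io)
        return -EINVAL;

    pmd->metadata_sm->threshold = threshold;
    pmd->metadata_sm->fn = fn;
    pmd->metadata_sm->context = context;
    return 0;
}
```

A caller modelled on pool_ctr() would treat a nonzero return as a constructor failure and report the new "Error registering metadata threshold" message.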
2022-07-14dm writecache: count number of blocks discarded, not number of discard biosMikulas Patocka2-2/+2
Change dm-writecache, so that it counts the number of blocks discarded instead of the number of discard bios. Make it consistent with the read and write statistics counters that were changed to count the number of blocks instead of bios. Fixes: e3a35d03407c ("dm writecache: add event counters") Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-14dm writecache: count number of blocks written, not number of write biosMikulas Patocka2-8/+14
Change dm-writecache, so that it counts the number of blocks written instead of the number of write bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: e3a35d03407c ("dm writecache: add event counters") Reported-by: Yu Kuai <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
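Why counting bios is unreliable when a bio can be split is easy to see with a small model. The sketch below assumes a hypothetical `struct wc_counters` and `account_write()` helper and assumes every submitted fragment is a whole number of cache blocks; it is not the dm-writecache accounting code.

```c
#include <stdint.h>

/* Hypothetical counters: one per bio seen, one per block covered. */
struct wc_counters {
    uint64_t write_bios;
    uint64_t write_blocks;
};

/* Account one bio (or one fragment of a split bio) of the given size. */
static void account_write(struct wc_counters *c, uint32_t bytes,
                          uint32_t block_size)
{
    c->write_bios += 1;                    /* changes if the bio is split */
    c->write_blocks += bytes / block_size; /* does not change if split */
}
```

An 8-block write accounted once and the same write accounted as a 3-block and a 5-block fragment produce different bio counts but the same block count, which is the consistency the commit is after.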
2022-07-14dm writecache: count number of blocks read, not number of read biosMikulas Patocka2-2/+3
Change dm-writecache, so that it counts the number of blocks read instead of the number of read bios. Bios can be split and requeued using the dm_accept_partial_bio function, so counting bios caused inaccurate results. Fixes: e3a35d03407c ("dm writecache: add event counters") Reported-by: Yu Kuai <[email protected]> Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-14dm writecache: return void from functionsMikulas Patocka1-13/+13
The functions writecache_map_remap_origin and writecache_bio_copy_ssd only return a single value, thus they can be made to return void. This helps simplify the following IO accounting changes. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-14dm kcopyd: use __GFP_HIGHMEM when allocating pagesMikulas Patocka1-1/+1
dm-kcopyd doesn't access the allocated pages directly, it only passes them to dm-io which adds them to a bio list - thus, we can allocate the pages from high memory. This will reduce pressure on the low memory when there are a large number of kcopyd jobs in progress. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-14dm writecache: set a default MAX_WRITEBACK_JOBSMikulas Patocka1-1/+1
dm-writecache has the capability to limit the number of writeback jobs in progress. However, this feature was off by default. As such there were some out-of-memory crashes observed when lowering the low watermark while the cache is full. This commit enables writeback limit by default. It is set to 256MiB or 1/16 of total system memory, whichever is smaller. Cc: [email protected] Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
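The default described above is a simple minimum of two values. The following sketch expresses it in bytes purely for illustration; `default_writeback_limit()` is a hypothetical helper, and the unit in which the driver actually stores the limit is not stated in the commit message.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/*
 * Default writeback limit as described in the commit message:
 * 256 MiB or 1/16 of total system memory, whichever is smaller.
 */
static uint64_t default_writeback_limit(uint64_t total_ram_bytes)
{
    uint64_t cap = 256 * MIB;
    uint64_t sixteenth = total_ram_bytes / 16;

    return sixteenth < cap ? sixteenth : cap;
}
```

With 2 GiB of memory the limit comes out as 128 MiB; with 16 GiB or more it is capped at 256 MiB.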
2022-07-07Documentation: dm writecache: Render status list as listBagas Sanjaya1-0/+1
The status list isn't rendered as a list, but rather as a normal paragraph, because a blank line is missing between the "Status:" line and the list. Fix the issue by adding the blank line separator. Fixes: 48debafe4f2fea ("dm: add writecache target") Signed-off-by: Bagas Sanjaya <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07Documentation: dm writecache: add blank line before optional parametersMauro Carvalho Chehab1-0/+1
Otherwise this warning occurs: Documentation/admin-guide/device-mapper/writecache.rst:23: WARNING: Unexpected indentation. Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm snapshot: fix typo in snapshot_map() commentZhang Jiaming1-1/+1
Signed-off-by: Zhang Jiaming <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm raid: remove redundant "the" in parse_raid_params() commentJiang Jian1-1/+1
Signed-off-by: Jiang Jian <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm cache: fix typo in 2 comment blocksSteven Lung2-2/+2
Replace neccessarily with necessarily. Signed-off-by: Steven Lung <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm verity: fix checkpatch close brace errorJeongHyeon Lee1-4/+3
Resolves: ERROR: else should follow close brace '}' Signed-off-by: JeongHyeon Lee <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm table: rename dm_target variable in dm_table_add_target()Mike Snitzer1-28/+28
Rename from "tgt" to "ti" so that all of dm-table.c code uses the same naming for dm_target variables. Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm table: audit all dm_table_get_target() callersMike Snitzer6-143/+97
All callers of dm_table_get_target() are expected to do proper bounds checking on the index they pass. Move dm_table_get_target() to dm-core.h to make it extra clear that only DM core code should be using it. Switch it to be inlined while at it. Standardize all DM core callers to use the same for loop pattern and make associated variables as local as possible. Rename some variables (e.g. s/table/t/ and s/tgt/ti/) along the way. Signed-off-by: Mike Snitzer <[email protected]>
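The standardized loop pattern mentioned above might look roughly like the sketch below. The types `struct tgt_sketch` and `struct table_sketch` and the helper `table_total_len()` are hypothetical simplifications; only the idea of an unchecked inline accessor whose callers bound the index themselves is taken from the commit message.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified target and table types. */
struct tgt_sketch {
    unsigned long begin;
    unsigned long len;
};

struct table_sketch {
    unsigned int num_targets;
    struct tgt_sketch *targets;
};

/* Inline accessor with no bounds check: callers must pass a valid index. */
static inline struct tgt_sketch *table_get_target(struct table_sketch *t,
                                                  unsigned int index)
{
    return t->targets + index;
}

/* Loop pattern: the index is bounded by num_targets, "ti" is loop-local. */
static unsigned long table_total_len(struct table_sketch *t)
{
    unsigned long total = 0;

    for (unsigned int i = 0; i < t->num_targets; i++) {
        struct tgt_sketch *ti = table_get_target(t, i);

        total += ti->len;
    }
    return total;
}
```

Keeping the bounds check in the loop condition is what makes an unchecked accessor safe, and declaring `ti` inside the loop keeps its scope as narrow as possible.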
2022-07-07dm table: remove dm_table_get_num_targets() wrapperMike Snitzer6-29/+23
More efficient and readable to just access table->num_targets directly. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm: add two stage requeue mechanismMing Lei3-31/+130
Commit 61b6e2e5321d ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio") reverted DM core's bio splitting back to using bio_split()+bio_chain() because it was found that otherwise DM's BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio completion that would never occur.

Restore using bio_trim()+bio_inc_remaining(), like was done in commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting"), but this time with proper handling for the above scenario that is covered in more detail in the commit header for 61b6e2e5321d.

Solve this issue by adding a two staged dm_io requeue mechanism that uses the new dm_bio_rewind() via dm_io_rewind():

1) requeue the dm_io into the requeue_list added to struct mapped_device, and schedule it via the newly added requeue work. This workqueue just clones the dm_io->orig_bio (which DM saves and ensures its end sector isn't modified). dm_io_rewind() uses the sectors and sectors_offset members of the dm_io that are recorded relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are then used to make that cloned bio reflect the subset of the original bio that is represented by the dm_io that is being requeued.

2) the 2nd stage requeue is the same as the original requeue, but io->orig_bio points to the new cloned bio (which matches the requeued dm_io as described above).

This allows DM core to shift the need for bio cloning from bio-split time (during IO submission) to the less likely BLK_STS_DM_REQUEUE handling (after IO completes with that error).

Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
2022-07-07dm: add dm_bio_rewind() API to DM coreMing Lei3-1/+146
Commit 7759eb23fd98 ("block: remove bio_rewind_iter()") removed a similar API for the following reasons:

```
It is pointed that bio_rewind_iter() is one very bad API[1]:

1) bio size may not be restored after rewinding

2) it causes some bogus change, such as 5151842b9d8732 (block: reset bi_iter.bi_done after splitting bio)

3) rewinding really makes things complicated wrt. bio splitting

4) unnecessary updating of .bi_done in fast path

[1] https://marc.info/?t=153549924200005&r=1&w=2

So this patch takes Kent's suggestion to restore one bio into its original state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(), given now bio_rewind_iter() is only used by bio integrity code.
```

However, saving off a copy of the 32 bytes bio->bi_iter in case rewind needed isn't efficient because it bloats per-bio-data for what is an unlikely case. That suggestion also ignores the need to restore crypto and integrity info.

Add dm_bio_rewind() API for a specific use-case that is much more narrow than the previous more generic rewind code that was reverted:

1) most bios have a fixed end sector since bio split is done from front of the bio, if driver just records how many sectors between current bio's start sector and the original bio's end sector, the original position can be restored. Keeping the original bio's end sector fixed is a _hard_ requirement for this interface!

2) if a bio's end sector won't change (usually bio_trim() isn't called, or in the case of DM it preserves original bio), user can restore the original position by storing sector offset from the current ->bi_iter.bi_sector to bio's end sector; together with saving bio size, only 8 bytes is needed to restore to original bio.

3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core needs to restore to an "original bio" which represents the current dm_io to be requeued (which may be a subset of the original bio). By storing the sector offset from the original bio's end sector and dm_io's size, dm_bio_rewind() can restore such original bio. See commit 7dd76d1feec7 ("dm: improve bio splitting and associated IO accounting") for more details on how DM does this. Leveraging this, allows DM core to shift the need for bio cloning from bio-split time (during IO submission) to the less likely BLK_STS_DM_REQUEUE handling (after IO completes with that error).

4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done to bvec_iter and there is no effect on the fast path.

Implement dm_bio_rewind() by factoring out clear helpers that it calls: dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter.

DM is able to ensure that dm_bio_rewind() is used safely but, given the constraint that the bio's end must never change, other hypothetical future callers may not take the same care. So make dm_bio_rewind() and all supporting code local to DM to avoid risk of hypothetical abuse. A "dm_" prefix was added to all functions to avoid any namespace collisions.

Suggested-by: Jens Axboe <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
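The arithmetic behind points 1) to 3) is small enough to show directly: if the end sector never moves, a start position can be recovered from its distance to the end. The sketch below is an assumption-laden model, not dm_bio_rewind(): `struct iter_sketch`, `struct rewind_info`, `record_position()` and `restore_position()` are hypothetical names, sizes are tracked in sectors for simplicity, and no integrity or crypto state is handled.

```c
#include <stdint.h>

/* Hypothetical stand-in for the part of a bio iterator that matters here. */
struct iter_sketch {
    uint64_t sector;       /* current start sector */
    uint32_t sectors_left; /* remaining size, in sectors */
};

/* The saved state: two 32-bit values, 8 bytes in total. */
struct rewind_info {
    uint32_t sector_offset; /* distance from start to the fixed end sector */
    uint32_t sectors;       /* size of the region to restore */
};

/* Record the current position relative to the original bio's end sector. */
static struct rewind_info record_position(const struct iter_sketch *it,
                                          uint64_t orig_end_sector)
{
    struct rewind_info info;

    info.sector_offset = (uint32_t)(orig_end_sector - it->sector);
    info.sectors = it->sectors_left;
    return info;
}

/* Restore a previously recorded position; requires an unchanged end sector. */
static void restore_position(struct iter_sketch *it, uint64_t orig_end_sector,
                             struct rewind_info info)
{
    it->sector = orig_end_sector - info.sector_offset;
    it->sectors_left = info.sectors;
}
```

For an original bio ending at sector 228, a region starting at sector 100 with 64 sectors is recorded as offset 128 and size 64. After the iterator has advanced, restoring from those two numbers yields sector 100 and size 64 again, provided the end sector is still 228.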
2022-06-29dm: improve BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handlingMing Lei1-25/+45
If either BLK_STS_DM_REQUEUE or BLK_STS_AGAIN is returned for POLLED io, we requeue the original bio into deferred list and kick md->wq to re-submit it to block layer.

Improve the handling in the following way:

1) Factor out dm_handle_requeue() for handling dm_io requeue.

2) Unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN: clear REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity, given BLK_STS_DM_REQUEUE is very unusual.

3) Queue md->wq explicitly in dm_handle_requeue(), so requeue handling becomes more robust.

Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
2022-06-29dm: refactor dm_md_mempool allocationChristoph Hellwig4-71/+44
The current split between dm_table_alloc_md_mempools and dm_alloc_md_mempools is rather arbitrary, so merge the two into one easy to follow function. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-06-29dm: unexport dm_get_reserved_rq_based_iosChristoph Hellwig1-1/+0
dm_get_reserved_rq_based_ios is only used in the core dm code, so remove the export. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2022-06-29block: simplify disk_set_independent_access_rangesChristoph Hellwig3-44/+18
Lift setting disk->ia_ranges from disk_register_independent_access_ranges into disk_set_independent_access_ranges, and make the behavior the same for the registered vs non-registered queue cases. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-29block: move ->ia_ranges from the request_queue to the gendiskChristoph Hellwig2-15/+15
Independent access ranges only matter for file system I/O and are only valid with a registered gendisk, so move them there. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-29block: remove "select BLK_RQ_IO_DATA_LEN" from BLK_CGROUP_IOCOST dependencyYing Sun1-1/+0
The configuration item BLK_RQ_IO_DATA_LEN is not declared anywhere in the kernel, so selecting it is meaningless and the select can be removed. Signed-off-by: Ying Sun <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28blk-mq: cleanup disk sysfs registrationChristoph Hellwig4-26/+26
Pass a gendisk to the sysfs register/unregister functions and give them descriptive names. Also move the unregistration helper next to the one doing the registration. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28blk-mq: rename blk_mq_sysfs_{,un}registerChristoph Hellwig3-6/+6
Add a _hctx postfix to better describe what the functions do, match the debugfs equivalents and release the old names for functions that should be using them. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: remove the extra gendisk reference in __blk_mq_register_devChristoph Hellwig1-3/+1
kobject_add already grabs a reference to the parent, no need to have another one. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: use default groups to register the queue attributesChristoph Hellwig1-6/+6
Set up the default_groups for blk_queue_ktype instead of manually calling sysfs_create_group. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: remove a superflous queue kobject referenceChristoph Hellwig1-5/+1
kobject_add already adds a reference to the parent that is dropped on deletion, so don't bother grabbing another one. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: simplify blktrace sysfs attribute creationChristoph Hellwig6-32/+6
Add the trace attributes to the default gendisk attributes, just like we already do for partitions. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: remove blk_cleanup_diskChristoph Hellwig47-90/+74
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: simplify disk shutdownChristoph Hellwig34-113/+105
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: stop setting the nomerges flags in blk_cleanup_queueChristoph Hellwig1-3/+0
These flags only apply to file system I/O, and all file system I/O is already drained by del_gendisk and thus can't be in progress when blk_cleanup_queue is called. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28block: remove QUEUE_FLAG_DEADChristoph Hellwig3-10/+3
Disallow setting the blk-mq state on any queue that is already dying as setting the state even then is a bad idea, and remove the now unused QUEUE_FLAG_DEAD flag. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28mtip32xx: fix device removalChristoph Hellwig2-114/+44
Use the proper helper to mark a surprise removal, remove the gendisk as soon as possible when removing the device and implement the ->free_disk callback to ensure the private data is alive as long as the gendisk has references. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-28mtip32xx: remove the device_status debugfs fileChristoph Hellwig2-144/+1
This file is a huge mess that iterates over all devices and is in the way of fixing the device removal in this driver, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27blk-mq: blk_mq_tag_busy is no need to return a valueLiu Song2-17/+11
Currently the return value of "blk_mq_tag_busy" has no effect, so stop returning one. Some of the surrounding code has also been adjusted to improve readability. Signed-off-by: Liu Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Always initialize bio IO priority on submitJan Kara1-0/+3
Currently, the IO priority set in a task's IO context is not reflected in bio->bi_ioprio for most IO (only io_uring and direct IO set it). This leads to odd behavior where a process submits some bios with one priority and other bios with a different (unset) priority, and because the priorities differ the bios cannot be merged. Make sure bio->bi_ioprio is always set on bio submission. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Initialize bio priority earlierJan Kara1-2/+2
Bio's IO priority needs to be initialized before we try to merge the bio with other bios. Otherwise we could merge bios which would otherwise receive different IO priorities leading to possible QoS issues. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27blk-ioprio: Convert from rqos policy to direct callJan Kara4-45/+23
Convert blk-ioprio handling from a rqos policy to a direct call from blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy anyway, it just needs a hook in bio submission path to set the bio's IO priority. Secondly, the rqos .track hook gets actually called too late for blk-ioprio purposes and introducing a special rqos hook just for blk-ioprio looks even weirder. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27blk-ioprio: Remove unneeded fieldJan Kara1-6/+3
blkcg->ioprio_set field is not really useful except for avoiding possibly more expensive checks inside blkcg_ioprio_track(). The check for blkcg->prio_policy being equal to POLICY_NO_CHANGE does the same service so just remove the ioprio_set field and replace the check. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Fix handling of tasks without ioprio in ioprio_get(2)Jan Kara1-7/+23
ioprio_get(2) can be asked to return the best IO priority from several tasks (IOPRIO_WHO_PGRP, IOPRIO_WHO_USER). Currently the call treats tasks without a set IO priority as having priority IOPRIO_CLASS_BE/IOPRIO_BE_NORM; however, this does not really reflect the IO priority the task will get (which depends on the task's nice value). Fix the code to use the real IO priority the task's IO will use. We have to modify the code for ioprio_get(IOPRIO_WHO_PROCESS) to keep returning IOPRIO_CLASS_NONE priority for tasks without a set IO priority, as a special case to maintain the userspace visible API. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Make ioprio_best() staticJan Kara2-6/+1
Nobody outside of block/ioprio.c uses it. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Generalize get_current_ioprio() for any taskJan Kara2-16/+36
get_current_ioprio() operates only on current task. We will need the same functionality for other tasks as well. Generalize get_current_ioprio() for that and also move the bulk out of the header file because it is large enough. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-06-27block: Return effective IO priority from get_current_ioprio()Jan Kara1-2/+9
get_current_ioprio() is used to initialize IO priority of various requests. As such it should be returning the effective IO priority of the task (i.e., reflecting the fact that unset IO priority should get set based on task's CPU priority) so that the conversion is concentrated in one place. Reviewed-by: Damien Le Moal <[email protected]> Tested-by: Damien Le Moal <[email protected]> Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
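The conversion from an unset IO priority to an effective one can be sketched in userspace C. The class numbering and the 13-bit class shift below follow the kernel's ioprio encoding, and the nice-to-level mapping `(nice + 20) / 5` follows the kernel's nice-based default. The helper names `nice_to_ioprio_level()` and `effective_ioprio()` are hypothetical, and the sketch assumes a task with a normal scheduling policy, so the derived class is always best-effort.

```c
/* ioprio encoding: class in the upper bits, level in the lower 13 bits. */
#define IOPRIO_CLASS_SHIFT 13

enum {
    IOPRIO_CLASS_NONE,
    IOPRIO_CLASS_RT,
    IOPRIO_CLASS_BE,
    IOPRIO_CLASS_IDLE
};

#define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_PRIO_CLASS(ioprio) ((ioprio) >> IOPRIO_CLASS_SHIFT)

/* Map a nice value in [-20, 19] to a best-effort level in [0, 7]. */
static int nice_to_ioprio_level(int nice)
{
    return (nice + 20) / 5;
}

/*
 * Hypothetical helper: if no IO priority was set (class NONE), derive a
 * best-effort priority from the nice value; otherwise keep the set value.
 */
static int effective_ioprio(int ioprio, int nice)
{
    if (IOPRIO_PRIO_CLASS(ioprio) == IOPRIO_CLASS_NONE)
        return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, nice_to_ioprio_level(nice));
    return ioprio;
}
```

A task at nice 0 with no IO priority set gets best-effort level 4, while a task that explicitly set a priority keeps it unchanged. Doing this in one helper is what the message means by concentrating the conversion in one place.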
2022-06-27block: fix default IO priority handling againJan Kara3-3/+5
Commit e70344c05995 ("block: fix default IO priority handling") introduced an inconsistency in get_current_ioprio() that tasks without IO context return IOPRIO_DEFAULT priority while tasks with freshly allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority. Tasks without IO context used to be rare before 5a9d041ba2f6 ("block: move io_context creation into where it's needed") but after this commit they became common because now only the BFQ IO scheduler sets up a task's IO context. A similar inconsistency is there for get_task_ioprio() so this inconsistency is now exposed to userspace and userspace will see different IO priority for tasks operating on devices with BFQ compared to devices without BFQ. Furthermore, the changes done by commit e70344c05995 change the behavior when no IO priority is set for the BFQ IO scheduler, which is also documented in the ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O priority will follow the CPU nice value (setpriority(2)). In Linux kernels before version 2.6.24, once an I/O priority had been set using ioprio_set(), there was no way to reset the I/O scheduling behavior to the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case before commit e70344c05995. Also clean up alloc_io_context() to explicitly set this IO priority for the allocated IO context to avoid future surprises. Note that we tweak ioprio_best() to maintain ioprio_get(2) behavior and make this commit easily backportable.

CC: [email protected]
Fixes: e70344c05995 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>