aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-01-29block, bfq: use helper macro RQ_BFQQ to get bfqq of requestKemeng Shi1-3/+3
Use helper macro RQ_BFQQ to get bfqq of request. Signed-off-by: Kemeng Shi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block, bfq: initialize bfqq->decrease_time_jif correctlyKemeng Shi1-0/+2
Inject limit is updated or reset when time_is_before_eq_jiffies( decrease_time_jif + several msecs) or think-time state changes. decrease_time_jif is initialized to 0 and will be set to current jiffies when inject limit is updated or reset. If the jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never chages, then the injection will not work as expected for long time. To be more specific: Function bfq_update_inject_limit maybe triggered when jiffies pasts decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting bfqd->wait_dispatch to true. Function bfq_reset_inject_limit are called in two conditions: 1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in function bfq_add_request. 2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or bfq think-time state change from short to long. Fix this by initializing bfqq->decrease_time_jif to current jiffies to trigger service injection soon when service injection conditions are met. Signed-off-by: Kemeng Shi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block, bfq: remove unsed parameter reason in bfq_bfqq_is_slowKemeng Shi1-3/+2
Parameter reason is never used, just remove it. Signed-off-by: Kemeng Shi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injectionKemeng Shi1-6/+4
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit to one request if current inject_limit is 0 before search of the source queue for injection. However the search below will reset inject limit to bfqd->in_service_queue which is zero for raised inject limit. Then the temporarily raised inject limit never works as expected. Assigment limit to bfqd->in_service_queue in search is needed as limit maybe overwriten to min_t(unsigned int, 1, limit) for condition that a large in-flight request is on non-rotational devices in found queue. So we need to reset limit to bfqd->in_service_queue for normal case. Actually, we have already make sure bfqd->rq_in_driver is < limit before search, then -Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int, 1, limit) is always 1. So we can simply check bfqd->rq_in_driver with 1 instead of result of min_t(unsigned int, 1, limit) for larget request in non-rotational device case to avoid overwritting limit and the bug is gone. -For normal case, we have already check bfqd->rq_in_driver is < limit, so we can return found bfqq unconditionally to remove unncessary check. Signed-off-by: Kemeng Shi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29sbitmap: correct wake_batch recalculation to avoid potential IO hungKemeng Shi1-4/+1
Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") mentioned that in case of shared tags, there could be just one real active hctx(queue) because of lazy detection of tag idle. Then driver tag allocation may wait forever on this real active hctx(queue) if wake_batch is > hctx_max_depth where hctx_max_depth is available tags depth for the actve hctx(queue). However, the condition wake_batch > hctx_max_depth is not strong enough to avoid IO hung as the sbitmap_queue_wake_up will only wake up one wait queue for each wake_batch even though there is only one waiter in the woken wait queue. After this, there is only one tag to free and wake_batch may not be reached anymore. Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened") methioned that driver tag allocation may wait forever. Actually, the inactive hctx(queue) will be truely idle after at most 30 seconds and will call blk_mq_tag_wakeup_all to wake one waiter per wait queue to break the hung. But IO hung for 30 seconds is also not acceptable. Set batch size to small enough that depth of the shared hctx(queue) is enough to wake up all of the queues like sbq_calc_wake_batch do to fix this potential IO hung. Although hctx_max_depth will be clamped to at least 4 while wake_batch recalculation does not do the clamp, the wake_batch will be always recalculated to 1 when hctx_max_depth <= 4. Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29sbitmap: add sbitmap_find_bit to remove repeat code in ↵Kemeng Shi1-37/+33
__sbitmap_get/__sbitmap_get_shallow There are three differences between __sbitmap_get and __sbitmap_get_shallow when searching free bit: 1. __sbitmap_get_shallow limit number of bit to search per word. __sbitmap_get has no such limit. 2. __sbitmap_get_shallow always searches with wrap set. __sbitmap_get set wrap according to round_robin. 3. __sbitmap_get_shallow always searches from first bit in first word. __sbitmap_get searches from first bit when round_robin is not set otherwise searches from SB_NR_TO_BIT(sb, alloc_hint). Add helper function sbitmap_find_bit function to do common search while accept "limit depth per word", "wrap flag" and "first bit to search" from caller to support the need of both __sbitmap_get and __sbitmap_get_shallow. Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29sbitmap: rewrite sbitmap_find_bit_in_index to reduce repeat codeKemeng Shi1-15/+15
Rewrite sbitmap_find_bit_in_index as following: 1. Rename sbitmap_find_bit_in_index to sbitmap_find_bit_in_word 2. Accept "struct sbitmap_word *" directly instead of accepting "struct sbitmap *" and "int index" to get "struct sbitmap_word *". 3. Accept depth/shallow_depth and wrap for __sbitmap_get_word from caller to support need of both __sbitmap_get_shallow and __sbitmap_get. With helper function sbitmap_find_bit_in_word, we can remove repeat code in __sbitmap_get_shallow to find bit considring deferred clear. Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29sbitmap: remove redundant check in __sbitmap_queue_get_batchKemeng Shi1-5/+3
Commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") mentioned that "Checking free bits when setting the target bits. Otherwise, it may reuse the busying bits." This commit add check to make sure all masked bits in word before cmpxchg is zero. Then the existing check after cmpxchg to check any zero bit is existing in masked bits in word is redundant. Actually, old value of word before cmpxchg is stored in val and we will filter out busy bits in val by "(get_mask & ~val)" after cmpxchg. So we will not reuse busy bits methioned in commit fbb564a557809 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()"). Revert new-added check to remove redundant check. Fixes: fbb564a55780 ("lib/sbitmap: Fix invalid loop in __sbitmap_queue_get_batch()") Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29sbitmap: remove unnecessary calculation of alloc_hint in __sbitmap_get_shallowKemeng Shi1-7/+4
Updates to alloc_hint in the loop in __sbitmap_get_shallow() are mostly pointless and equivalent to setting alloc_hint to zero (because SB_NR_TO_BIT() considers only low sb->shift bits from alloc_hint). So simplify the logic. Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Kemeng Shi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and ↵Yu Kuai2-6/+30
blkcg_deactivate_policy() Currently parent pd can be freed before child pd: t1: remove cgroup C1 blkcg_destroy_blkgs blkg_destroy list_del_init(&blkg->q_node) // remove blkg from queue list percpu_ref_kill(&blkg->refcnt) blkg_release call_rcu t2: from t1 __blkg_release blkg_free schedule_work t4: deactivate policy blkcg_deactivate_policy pd_free_fn // parent of C1 is freed first t3: from t2 blkg_free_workfn pd_free_fn If policy(for example, ioc_timer_fn() from iocost) access parent pd from child pd after pd_offline_fn(), then UAF can be triggered. Fix the problem by delaying 'list_del_init(&blkg->q_node)' from blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to synchronize blkg_free_workfn() and blkcg_deactivate_policy(). Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-cgroup: support to track if policy is onlineYu Kuai2-7/+18
A new field 'online' is added to blkg_policy_data to fix following 2 problem: 1) In blkcg_activate_policy(), if pd_alloc_fn() with 'GFP_NOWAIT' failed, 'queue_lock' will be dropped and pd_alloc_fn() will try again without 'GFP_NOWAIT'. In the meantime, remove cgroup can race with it, and pd_offline_fn() will be called without pd_init_fn() and pd_online_fn(). This way null-ptr-deference can be triggered. 2) In order to synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy(), 'list_del_init(&blkg->q_node)' will be delayed to blkg_free_workfn(), hence pd_offline_fn() can be called first in blkg_destroy(), and then blkcg_deactivate_policy() will call it again, we must prevent it. The new field 'online' will be set after pd_online_fn() and will be cleared after pd_offline_fn(), in the meantime pd_offline_fn() will only be called if 'online' is set. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-cgroup: dropping parent refcount after pd_free_fn() is doneYu Kuai1-2/+2
Some cgroup policies will access parent pd through child pd even after pd_offline_fn() is done. If pd_free_fn() for parent is called before child, then UAF can be triggered. Hence it's better to guarantee the order of pd_free_fn(). Currently refcount of parent blkg is dropped in __blkg_release(), which is before pd_free_fn() is called in blkg_free_work_fn() while blkg_free_work_fn() is called asynchronously. This patch make sure pd_free_fn() called from removing cgroup is ordered by delaying dropping parent refcount after calling pd_free_fn() for child. BTW, pd_free_fn() will also be called from blkcg_deactivate_policy() from deleting device, and following patches will guarantee the order. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-mq: cleanup unused methods: blk_mq_hw_sysfs_storeZhong Jinghua1-24/+0
We found that the blk_mq_hw_sysfs_store interface has no place to use. The object default_hw_ctx_attrs using blk_mq_hw_sysfs_ops only uses the show method and does not use the store method. Since this patch: 4a46f05ebf99 ("blk-mq: move hctx and ctx counters from sysfs to debugfs") moved the store method to debugfs, the store method is not used anymore. So let me do some tiny work to clean up unused code. Signed-off-by: Zhong Jinghua <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29s390/dcssblk:: don't call bio_split_to_limitsChristoph Hellwig1-4/+0
s390 iterates over the bio using bio_for_each_segment and doesn't need any bio splitting. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ps3vram: remove bio splittingChristoph Hellwig1-7/+0
ps3vram iterates over the bio one segment, that is page aligned and max page sized chunk, a time. Because of that there is no point in calling bio_split_to_limits, or explicitly setting the default limits that are only used by bio_split_to_limits. Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Geoff Levand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: treat poll queue enter similarly to timeoutsJens Axboe1-1/+10
We ran into an issue where a production workload would randomly grind to a halt and not continue until the pending IO had timed out. This turned out to be a complicated interaction between queue freezing and polled IO: 1) You have an application that does polled IO. At any point in time, there may be polled IO pending. 2) You have a monitoring application that issues a passthrough command, which is marked with side effects such that it needs to freeze the queue. 3) Passthrough command is started, which calls blk_freeze_queue_start() on the device. At this point the queue is marked frozen, and any attempt to enter the queue will fail (for non-blocking) or block. 4) Now the driver calls blk_mq_freeze_queue_wait(), which will return when the queue is quiesced and pending IO has completed. 5) The pending IO is polled IO, but any attempt to poll IO through the normal iocb_bio_iopoll() -> bio_poll() will fail when it gets to bio_queue_enter() as the queue is frozen. Rather than poll and complete IO, the polling threads will sit in a tight loop attempting to poll, but failing to enter the queue to do so. The end result is that progress for either application will be stalled until all pending polled IO has timed out. This causes obvious huge latency issues for the application doing polled IO, but also long delays for passthrough command. Fix this by treating queue enter for polled IO just like we do for timeouts. This allows quick quiesce of the queue as we still poll and complete this IO, while still disallowing queueing up new IO. Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: change div64_u64 to DIV64_U64_ROUND_UP in ioc_refresh_params()Li Nan1-2/+2
vrate_min is calculated by DIV64_U64_ROUND_UP, but vrate_max is calculated by div64_u64. Vrate_min may be 1 greater than vrate_max if the input values min and max of cost.qos are equal. Signed-off-by: Li Nan <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: fix divide by 0 error in calc_lcoefs()Li Nan1-3/+8
echo max of u64 to cost.model can cause divide by 0 error. # echo 8:0 rbps=18446744073709551615 > /sys/fs/cgroup/io.cost.model divide error: 0000 [#1] PREEMPT SMP RIP: 0010:calc_lcoefs+0x4c/0xc0 Call Trace: <TASK> ioc_refresh_params+0x2b3/0x4f0 ioc_cost_model_write+0x3cb/0x4c0 ? _copy_from_iter+0x6d/0x6c0 ? kernfs_fop_write_iter+0xfc/0x270 cgroup_file_write+0xa0/0x200 kernfs_fop_write_iter+0x17d/0x270 vfs_write+0x414/0x620 ksys_write+0x73/0x160 __x64_sys_write+0x1e/0x30 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd calc_lcoefs() uses the input value of cost.model in DIV_ROUND_UP_ULL, overflow would happen if bps plus IOC_PAGE_SIZE is greater than ULLONG_MAX, it can cause divide by 0 error. Fix the problem by setting basecost Signed-off-by: Li Nan <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: read params inside lock in sysfs apisYu Kuai1-0/+4
Otherwise, user might get abnormal values if params is updated concurrently. Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: don't allow to configure bio based deviceYu Kuai1-0/+10
iocost is based on rq_qos, which can only work for request based device, thus it doesn't make sense to configure iocost for bio based device. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: check return value of match_u64()Yu Kuai1-1/+2
This patch fixs that the return value of match_u64() from ioc_qos_write() is not checked, Signed-off-by: Yu Kuai <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29blk-iocost: avoid 64-bit division in ioc_timer_fnArnd Bergmann1-3/+5
The behavior of 'enum' types has changed in gcc-13, so now the UNBUSY_THR_PCT constant is interpreted as a 64-bit number because it is defined as part of the same enum definition as some other constants that do not fit within a 32-bit integer. This in turn leads to some inefficient code on 32-bit architectures as well as a link error: arm-linux-gnueabi/bin/arm-linux-gnueabi-ld: block/blk-iocost.o: in function `ioc_timer_fn': blk-iocost.c:(.text+0x68e8): undefined reference to `__aeabi_uldivmod' arm-linux-gnueabi-ld: blk-iocost.c:(.text+0x6908): undefined reference to `__aeabi_uldivmod' Split the enum definition to keep the 64-bit timing constants in a separate enum type from those constants that can clearly fit within a smaller type. Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: ublk: fix doc build warningMing Lei1-0/+6
Fix the following warning: Documentation/block/ublk.rst:157: WARNING: Enumerated list ends without a blank line; unexpected unindent. Documentation/block/ublk.rst:171: WARNING: Enumerated list ends without a blank line; unexpected unindent. Fixes: 56f5160bc1b8 ("ublk_drv: add mechanism for supporting unprivileged ublk device") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: introduce bdev_zone_no helperPankaj Raghav2-2/+6
Add a generic bdev_zone_no() helper to calculate zone number for a given sector in a block device. This helper internally uses disk_zone_no() to find the zone number. Use the helper bdev_zone_no() to calculate nr of zones. This lets us make modifications to the math if needed in one place. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Pankaj Raghav <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: add a new helper bdev_{is_zone_start, offset_from_zone_start}Pankaj Raghav3-3/+15
Instead of open coding to check for zone start, add a helper to improve readability and store the logic in one place. Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: remove superfluous check for request queue in bdev_is_zoned()Pankaj Raghav1-6/+1
Remove the superfluous request queue check in bdev_is_zoned() as bdev_get_queue() can never return NULL. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Pankaj Raghav <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: extend bio-cache for non-polled requestsAnuj Gupta1-4/+2
This patch modifies the present check, so that bio-cache is not limited to iopoll. Signed-off-by: Anuj Gupta <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29nvme: set REQ_ALLOC_CACHE for uring-passthru requestAnuj Gupta1-2/+2
This patch sets REQ_ALLOC_CACHE flag for uring-passthru requests. This is a prep-patch so that normal / IRQ-driven uring-passthru I/Os can also leverage bio-cache. Signed-off-by: Anuj Gupta <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: add mechanism for supporting unprivileged ublk deviceMing Lei3-17/+220
unprivileged ublk device is helpful for container use case, such as: ublk device created in one unprivileged container can be controlled and accessed by this container only. Implement this feature by adding flag of UBLK_F_UNPRIVILEGED_DEV, and if this flag isn't set, any control command has been run from privileged user. Otherwise, any control command can be sent from any unprivileged user, but the user has to be permitted to access the ublk char device to be controlled. In case of UBLK_F_UNPRIVILEGED_DEV: 1) for command UBLK_CMD_ADD_DEV, it is always allowed, and user needs to provide owner's uid/gid in this command, so that udev can set correct ownership for the created ublk device, since the device owner uid/gid can be queried via command of UBLK_CMD_GET_DEV_INFO. 2) for other control commands, they can only be run successfully if the current user is allowed to access the specified ublk char device, for running the permission check, path of the ublk char device has to be provided by these commands. Also add one control of command UBLK_CMD_GET_DEV_INFO2 which always include the char dev path in payload since userspace may not have knowledge if this device is created in unprivileged mode. For applying this mechanism, system administrator needs to take the following policies: 1) chmod 0666 /dev/ublk-control 2) change ownership of ublkcN & ublkbN - chown owner_uid:owner_gid /dev/ublkcN - chown owner_uid:owner_gid /dev/ublkbN Both can be done via one simple udev rule. Userspace: https://github.com/ming1/ubdsrv/tree/unprivileged-ublk 'ublk add -t $TYPE --un_privileged=1' is for creating one un-privileged ublk device if the user is un-privileged. Link: https://lore.kernel.org/linux-block/[email protected]/ Suggested-by: Stefan Hajnoczi <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: add module parameter of ublks_max for limiting max allowed ublk devMing Lei1-0/+19
Prepare for supporting unprivileged ublk device by limiting max number ublk devices added. Otherwise too many ublk devices could be added by un-trusted user, which can be thought as one DoS. Reviewed-by: ZiyangZhang <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: add device parameter UBLK_PARAM_TYPE_DEVTMing Lei2-1/+36
Userspace side only knows device ID, but the associated path of ublkc* and ublkb* could be changed by udev, and that depends on userspace's policy, so add parameter of UBLK_PARAM_TYPE_DEVT for retrieving major/minor of the ublkc* and ublkb*, then user may figure out major/minor of the ublk disks he/she owns. With major/minor, it is easy to find the device node path. Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: move ublk_get_device_from_id into ublk_ctrl_uring_cmdMing Lei1-89/+49
It is annoying for each control command handler to get/put ublk device and deal with failure. Control command handler is simplified a lot by moving ublk_get_device_from_id into ublk_ctrl_uring_cmd(). Reviewed-by: ZiyangZhang <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: don't probe partitions if the ubq daemon isn't trustedMing Lei1-0/+9
If any ubq daemon is unprivileged, the ublk char device is allowed for unprivileged user actually, and we can't trust the current user, so not probe partitions. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Reviewed-by: ZiyangZhang <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29ublk_drv: remove nr_aborted_queues from ublk_deviceMing Lei1-1/+0
No one uses 'nr_aborted_queues' any more, so remove it. Reviewed-by: ZiyangZhang <[email protected]> Signed-off-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: don't allow multiple bios for IOCB_NOWAIT issueJens Axboe1-3/+18
If we're doing a large IO request which needs to be split into multiple bios for issue, then we can run into the same situation as the below marked commit fixes - parts will complete just fine, one or more parts will fail to allocate a request. This will result in a partially completed read or write request, where the caller gets EAGAIN even though parts of the IO completed just fine. Do the same for large bios as we do for splits - fail a NOWAIT request with EAGAIN. This isn't technically fixing an issue in the below marked patch, but for stable purposes, we should have either none of them or both. This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return") Cc: [email protected] # 5.15+ Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio") Link: https://github.com/axboe/liburing/issues/766 Reported-and-tested-by: Michael Kelley <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: drbd_insert_interval(): Clarify commentAndreas Gruenbacher1-1/+1
Signed-off-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: interval tree: make removing an "empty" interval a no-opLars Ellenberg1-0/+4
Trying to remove an "empty" (just initialized, or "cleared") interval from the tree, this results in an endless loop. As we typically protect the tree with a spinlock_irq, the result is a hung system. Be nice to error cleanup code paths, ignore removal of empty intervals. Signed-off-by: Lars Ellenberg <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29MAINTAINERS: add drbd headersChristoph Böhmwalder1-0/+1
Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: remove macros using require_contextChristoph Böhmwalder1-11/+1
This require_context attribute originated in a proposed sparse patch by Philipp Reisner back in 2008. Johannes Berg had a different solution to a similar problem, and that patch "won" in the end; so the require_context thing never got merged. The whole history can be read at [0]. DRBD kept using these annotations anyway for a while. Nowadays, on a modern unmodified sparse, they obviously do nothing, and they are hardly used anymore anyway. So, just remove the definitions of these macros. [0] https://www.spinics.net/lists/linux-sparse/msg01150.html Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: remove unnecessary assignment in vli_encode_bitsChristoph Böhmwalder1-1/+1
Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: make limits unsignedChristoph Böhmwalder1-101/+101
These are almost always used as unsigned integers, so mark them as such. Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: fix DRBD_VOLUME_MAX 65535 -> 65534Robert Altnoeder1-1/+1
The protocol uses -1 as a reserved value for 'no specific volume', and since the protocol field is a 16 bit unsigned value, -1 is converted to 65535. Therefore, limit the range of valid volume numbers to [0, 65534]. Signed-off-by: Robert Altnoeder <[email protected]> Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: adjust drbd_limits license headerChristoph Böhmwalder1-1/+1
See also commit 93c68cc46a07 ("drbd: use consistent license"). We only want to license drbd under GPL-2.0, so use the corresponding SPDX header consistently. Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: split off drbd_config into separate fileChristoph Böhmwalder4-7/+18
To be more similar to what we do in the out-of-tree module and ease the upstreaming process. Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: drop API_VERSION defineChristoph Böhmwalder5-5/+4
Use the genetlink api version as defined in drbd_genl_api.h. Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29drbd: split off drbd_buildtag into separate fileChristoph Böhmwalder3-19/+23
To be more similar to what we do in the out-of-tree module and ease the upstreaming process. Signed-off-by: Christoph Böhmwalder <[email protected]> Reviewed-by: Joel Colledge <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: add a BUILD_BUG_ON() for adding more bio flags than we have spaceJens Axboe1-0/+2
We have BIO_FLAG_LAST in the enum for bio specific flags, but it's not used to check that we're not exceeding the size of them. Add such a check. Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: save user max_sectors limitKeith Busch4-9/+25
The user can set the max_sectors limit to any valid value via sysfs /sys/block/<dev>/queue/max_sectors_kb attribute. If the device limits are ever rescanned, though, the limit reverts back to the potentially artificially low BLK_DEF_MAX_SECTORS value. Preserve the user's setting as the max_sectors limit as long as it's valid. The user can reset back to defaults by writing 0 to the sysfs file. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block: make BLK_DEF_MAX_SECTORS unsignedKeith Busch3-4/+4
This is used as an unsigned value, so define it that way to avoid having to cast it. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-01-29block, bfq: balance I/O injection among underutilized actuatorsDavide Zini1-5/+13
Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commits breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is however skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Paolo Valente <[email protected]> Signed-off-by: Davide Zini <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>