Age | Commit message | Author | Files | Lines |
|
Add flags to dm_io and manage them using the same pattern used for
bi_flags in struct bio.
Signed-off-by: Mike Snitzer <[email protected]>
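A minimal sketch of the pattern (the flag and helper names here are illustrative, not the exact code):

    enum {
            DM_IO_EXAMPLE_FLAG,     /* illustrative flag bit */
    };

    static inline bool dm_io_flagged(struct dm_io *io, unsigned int bit)
    {
            return io->flags & (1 << bit);
    }

    static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
    {
            io->flags |= (1 << bit);
    }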
|
|
Update my email address to kernel.org to allow distinction between my
"upstream" and "Red" Hats.
Signed-off-by: Mike Snitzer <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Optimize dm_io_dec_pending() slightly by avoiding local variables.
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Remove the from_wq argument from dm_submit_bio_remap(). This eliminates
the need for dm_submit_bio_remap() callers to know whether they are
calling from a workqueue or from the original dm_submit_bio().
Add a map_task member to the dm_io struct, record the mapping task in
alloc_io(), and clear it after all target ->map() calls have completed.
Update dm_submit_bio_remap() to check whether 'current' matches
io->map_task rather than rely on a passed 'from_wq' argument.
This change greatly simplifies the chore of porting each DM target to
dm_submit_bio_remap() because there is no longer a risk of programming
errors caused by not knowing every context in which a method that calls
dm_submit_bio_remap() might be used.
Signed-off-by: Mike Snitzer <[email protected]>
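A rough sketch of the context check (surrounding code elided; the helper name is illustrative, the field is the map_task described above):

    /* alloc_io() records the task that drives the ->map() calls:
     *         io->map_task = current;
     * and DM core clears it once all ->map() calls have returned. */
    static bool dm_in_map_context(struct dm_io *io)
    {
            return current == io->map_task;
    }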
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
If a target uses dm_submit_bio_remap() it should set
ti->accounts_remapped_io.
Also, switch dm_start_io_acct() WARN_ON to WARN_ON_ONCE.
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Support bio polling (REQ_POLLED) using the following approach:
1) only support IO polling for normal READ/WRITE; other, abnormal IOs
still fall back to IRQ mode, so the target io (and DM's clone bio) sits
exactly inside the dm io.
2) hold one refcount on io->io_count after submitting this dm bio with
REQ_POLLED.
3) support DM's native bio splitting: each dm io instance associated
with the current bio is added to a list whose head is stored in
bio->bi_private, and bi_private is restored before ending the bio.
4) implement the .poll_bio() callback: call bio_poll() on the single
target bio inside the dm io, retrieved via bio->bi_bio_drv_data, and
call dm_io_dec_pending() after the target io is done in .poll_bio().
5) enable QUEUE_FLAG_POLL if all underlying queues enable
QUEUE_FLAG_POLL, which is based on Jeffle's previous patch.
These changes yield a 30-35% IOPS improvement for polled IO.
For detailed test results please see (Jens, thanks for testing!):
https://listman.redhat.com/archives/dm-devel/2022-March/049868.html
or https://marc.info/?l=linux-block&m=164684246214700&w=2
Tested-by: Jens Axboe <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
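A simplified sketch of the polling side (ignoring the bio-splitting list from point 3, so assuming a single dm io per bio; error handling elided):

    static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
                           unsigned int flags)
    {
            struct dm_io *io = bio->bi_bio_drv_data;    /* see point 4 */

            bio_poll(&io->tio.clone, iob, flags);
            if (atomic_read(&io->io_count) == 1) {
                    /* target io is done: drop the ref from point 2 */
                    dm_io_dec_pending(io, BLK_STS_OK);
                    return 1;
            }
            return 0;
    }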
|
|
Prepare for supporting IO polling in bio-based drivers.
Add a ->poll_bio callback so that bio-based drivers can provide their
own logic for polling bios.
Also fix ->submit_bio_bio typo in comment block above
__submit_bio_noacct.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
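The callback slots into block_device_operations next to ->submit_bio; its shape mirrors bio_poll():

    struct block_device_operations {
            void (*submit_bio)(struct bio *bio);
            int (*poll_bio)(struct bio *bio, struct io_comp_batch *iob,
                            unsigned int flags);
            /* ... remaining ops unchanged ... */
    };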
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Just use the %pg format specifier instead.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
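A representative before/after (the call site is illustrative); %pg prints the block device name without the bdevname() scratch buffer:

    char b[BDEVNAME_SIZE];

    DMERR("device %s", bdevname(bdev, b));  /* before */
    DMERR("device %pg", bdev);              /* after */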
|
|
Just use the %pg format specifier to print the block device name
directly.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Also remove empty newline before 'out:' label at end of __bind.
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Update both bio-based and request-based DM to requeue IO if the
mapping table is not available.
This race, of IO being submitted before the DM device is ready, is
quite narrow, yet possible for an initial table load given that the DM
device's request_queue is created beforehand, so it is best to requeue
the IO to handle this unlikely case.
Reported-by: Zhang Yi <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
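A sketch of the bio-based side, assuming DM core's existing queue_io() deferral helper:

    map = dm_get_live_table(md, &srcu_idx);
    if (unlikely(!map)) {
            /* no table loaded yet: hold the bio instead of erroring */
            queue_io(md, bio);
            goto out;
    }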
|
|
Commit 7eaceaccab5f ("block: remove per-queue plugging") dropped
unplug_delay and blk_unplug(). Plus, the current kernel has no
fundamental difference between sync_io() and async_io() except
sync_io() uses sync_io_complete() as the notify.fn and explicitly
calls wait_for_completion_io() to sync. The comment isn't valid
any more.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Signed-off-by: Zhiqiang Liu <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Use time_is_before_jiffies() to improve code readability.
Signed-off-by: Wang Qing <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
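An illustrative before/after; time_is_before_jiffies() wraps time_after(), so the test stays wraparound-safe but reads as plain English:

    unsigned long deadline = start + HZ;        /* illustrative */

    if (time_after(jiffies, deadline))          /* before */
            handle_timeout();

    if (time_is_before_jiffies(deadline))       /* after: "deadline has passed" */
            handle_timeout();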
|
|
Explicitly convert unsigned int in the right of the conditional
expression to int to match the left side operand and the return type,
fixing the following compiler warning:
drivers/md/dm-crypt.c:2593:43: warning: signed and unsigned
type in conditional expression [-Wsign-compare]
Fixes: c538f6ec9f56 ("dm crypt: add ability to use keys from the kernel key retention service")
Signed-off-by: Aashish Sharma <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
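The shape of the fix, with illustrative names; the point is casting the unsigned arm so both arms of ?: (and the return type) are int:

    int ret = setup_result;                 /* illustrative */
    unsigned int key_size = cc_key_size;    /* illustrative */

    return ret ? ret : (int)key_size;       /* was: ret ? ret : key_size */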
|
|
dm_cleanup_zoned_dev() uses the queue, so it must be called before
blk_cleanup_disk() starts tearing the queue down:
blk_cleanup_disk->blk_cleanup_queue()->kobject_put()->blk_release_queue()->
->...RCU...->blk_free_queue_rcu()->kmem_cache_free()
Otherwise, the RCU callback may execute first and
dm_cleanup_zoned_dev() will touch freed memory:
BUG: KASAN: use-after-free in dm_cleanup_zoned_dev+0x33/0xd0
Read of size 8 at addr ffff88805ac6e430 by task dmsetup/681
CPU: 4 PID: 681 Comm: dmsetup Not tainted 5.17.0-rc2+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x7d
print_address_description.constprop.0+0x1f/0x150
? dm_cleanup_zoned_dev+0x33/0xd0
kasan_report.cold+0x7f/0x11b
? dm_cleanup_zoned_dev+0x33/0xd0
dm_cleanup_zoned_dev+0x33/0xd0
__dm_destroy+0x26a/0x400
? dm_blk_ioctl+0x230/0x230
? up_write+0xd8/0x270
dev_remove+0x156/0x1d0
ctl_ioctl+0x269/0x530
? table_clear+0x140/0x140
? lock_release+0xb2/0x750
? remove_all+0x40/0x40
? rcu_read_lock_sched_held+0x12/0x70
? lock_downgrade+0x3c0/0x3c0
? rcu_read_lock_sched_held+0x12/0x70
dm_ctl_ioctl+0xa/0x10
__x64_sys_ioctl+0xb9/0xf0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fb6dfa95c27
Fixes: bb37d77239af ("dm: introduce zone append emulation")
Cc: [email protected]
Signed-off-by: Kirill Tkhai <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
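The fix is an ordering change in DM's device teardown path; roughly:

    /* dm_cleanup_zoned_dev() dereferences the queue, so run it
     * before blk_cleanup_disk() schedules the queue for RCU freeing */
    dm_cleanup_zoned_dev(md);
    blk_cleanup_disk(md->disk);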
|
|
It appears that cmd could be a Spectre v1 gadget, as it is supplied by
a user and used as an array index. Prevent the contents of kernel
memory from being leaked to userspace via speculative execution by
using array_index_nospec().
Signed-off-by: Jordy Zomer <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
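The standard mitigation, roughly as it applies to an ioctl dispatch table:

    #include <linux/nospec.h>

    if (cmd >= ARRAY_SIZE(_ioctls))
            return NULL;
    /* clamp the index even under misprediction so speculation
     * cannot read out of bounds */
    cmd = array_index_nospec(cmd, ARRAY_SIZE(_ioctls));
    return _ioctls[cmd].fn;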
|
|
Remove the second 'a'.
Signed-off-by: Tom Rix <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
All entries measured by DM IMA are prefixed by a version string
(dm_version=N.N.N). When there is no data to measure, the entire buffer
is overwritten with a string containing the version string again, and
the length of that string is added to the length of the version string.
The resulting length is wrong because it counts the version string
twice. This caused entries like this:
dm_version=4.45.0;name=test,uuid=test;table_clear=no_data; \
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \
current_device_capacity=204808;
Signed-off-by: Thore Sommer <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The 'table' static array is read-only, so it makes sense to make it
const. Also add the int type to clean up a checkpatch warning.
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
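Illustrative shape of the change (element values made up):

    static unsigned table[] = { 0, 1, 2 };              /* before */
    static const unsigned int table[] = { 0, 1, 2 };    /* after */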
|
|
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Care was taken to support kcryptd_io_read being called from crypt_map
or from a workqueue. Using an intermediate CRYPT_MAP_READ_GFP gfp_t
(defined as GFP_NOWAIT) avoids a maintenance burden should that flag
ever need to change.
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
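The indirection is a single define; in outline:

    #define CRYPT_MAP_READ_GFP GFP_NOWAIT

    /* both crypt_map() and the workqueue path pass the same flag,
     * so a future change touches only the define */
    kcryptd_io_read(io, CRYPT_MAP_READ_GFP);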
|
|
Where possible, switch from early bio-based IO accounting (at the time
DM clones each incoming bio) to late IO accounting just before each
remapped bio is issued to underlying device via submit_bio_noacct().
Allows more precise bio-based IO accounting for DM targets that use
their own workqueues to perform additional processing of each bio in
conjunction with their DM_MAPIO_SUBMITTED return from their map
function. When a target is updated to use dm_submit_bio_remap() it
must also set ti->accounts_remapped_io to true.
Use xchg() in start_io_acct(), as suggested by Mikulas, to ensure each
IO is only started once. The xchg race only happens if
__send_duplicate_bios() sends multiple bios -- that case is reflected
via tio->is_duplicate_bio. Given the niche nature of this race, it is
best to avoid any xchg performance penalty for normal IO.
For IO that was never submitted with dm_submit_bio_remap(), but whose
clone the target completes with bio_endio, accounting is started then
ended and the pending_io counter is decremented.
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
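The once-only start looks roughly like this (the 'was_accounted' field name is illustrative):

    if (tio->is_duplicate_bio) {
            /* duplicates may race; xchg() makes the start once-only */
            if (xchg(&io->was_accounted, 1) == 1)
                    return;
    }
    /* normal IO: single submitter, no atomic needed */
    __dm_start_io_acct(io, bio);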
|
|
Formally disallow dm_accept_partial_bio() on clones created by
__send_duplicate_bios() because their len_ptr points to a shared
unsigned int. __send_duplicate_bios() is only used for flush bios
and other "abnormal" bios (discards, write zeroes, etc). And
dm_accept_partial_bio() already didn't support flush bios.
Also refactor __send_changing_extent_only() to reflect it cannot fail.
As such __send_changing_extent_only() can update the clone_info before
__send_duplicate_bios() is called to fan-out __map_bio() calls.
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
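The guard amounts to hard assertions at the top of dm_accept_partial_bio(); schematically:

    struct dm_target_io *tio = clone_to_tio(bio);

    /* duplicates share one unsigned int behind len_ptr */
    BUG_ON(tio->is_duplicate_bio);
    /* flushes were already unsupported */
    BUG_ON(bio->bi_opf & REQ_PREFLUSH);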
|
|
Remove one 4 byte hole in dm_io struct.
Remove two 4 byte holes in dm_target_io struct.
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Suggested-by: Christoph Hellwig <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Prep for being able to defer trace_block_bio_remap() until when the
bio is remapped and submitted by the DM target.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Commit 8615cb65bd63 ("dm: remove useless loop in
__split_and_process_bio") showcased that we no longer loop.
Remove the bio_advance() in __split_and_process_bio() that was only
needed when looping was possible.
Similarly there is no need to advance the bio, using ci->sector
cursor, in __send_duplicate_bios().
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The flush_bio in question was just initialized to be empty, so there
is no way bio_has_data() will return true; remove the stale BUG_ON().
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Error path code (for handling DM_MAPIO_REQUEUE and DM_MAPIO_KILL) is
effectively identical.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
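After the merge, the handling collapses to one switch arm, along these lines:

    case DM_MAPIO_KILL:
    case DM_MAPIO_REQUEUE:
            /* identical teardown; only the completion status differs */
            free_tio(tio);
            dm_io_dec_pending(io, r == DM_MAPIO_KILL ?
                              BLK_STS_IOERR : BLK_STS_DM_REQUEUE);
            break;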
|
|
Remove needless branching and indentation. Leaves code to catch
malformed op_is_zone_mgmt bios (they shouldn't have a payload).
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Fold __clone_and_map_data_bio into its only caller.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Rename __split_and_process_bio to dm_split_and_process_bio.
Rename __split_and_process_non_flush to __split_and_process_bio.
Also fix a stale comment and whitespace.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Improves alignment and groups related members relative to cachelines.
Signed-off-by: Mike Snitzer <[email protected]>
|
|
There is no need for dm_io_dec_pending() to copy dm_io fields
anymore now that DM provides its own pending_io counters again.
The race documented in commit d208b89401e0 ("dm: fix mempool NULL
pointer race when completing IO") no longer exists now that block
core's in_flight counters aren't used to signal all dm_io is
complete.
Also, rename {start,end}_io_acct to dm_{start,end}_io_acct.
Signed-off-by: Mike Snitzer <[email protected]>
|
|
dm_stats_account_io()'s STAT_PRECISE_TIMESTAMPS support doesn't handle
the fact that with commit b879f915bc48 ("dm: properly fix redundant
bio-based IO accounting") io->start_time _may_ be in the past (meaning
the start_io_acct() was deferred until later).
Add a new dm_stats_recalc_precise_timestamps() helper that will
set/clear a new 'precise_timestamps' flag in the dm_stats struct based
on whether any configured stats enable STAT_PRECISE_TIMESTAMPS.
And update DM core's alloc_io() to use dm_stats_record_start() to set
stats_aux.duration_ns if stats->precise_timestamps is true.
Also, remove unused 'last_sector' and 'last_rw' members from the
dm_stats struct.
Fixes: b879f915bc48 ("dm: properly fix redundant bio-based IO accounting")
Cc: [email protected]
Co-developed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
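The new helper just scans the configured stats; in outline:

    void dm_stats_recalc_precise_timestamps(struct dm_stats *stats)
    {
            struct dm_stat *s;
            bool precise = false;

            list_for_each_entry(s, &stats->list, list_entry)
                    if (s->stat_flags & STAT_PRECISE_TIMESTAMPS)
                            precise = true;
            stats->precise_timestamps = precise;
    }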
|
|
DM handles a flush with data by first issuing an empty flush and then,
once it completes, removing the REQ_PREFLUSH flag and issuing the
payload. The problem fixed by this commit is that both the empty flush
bio and the data payload account the full extent of the data payload.
Fix this by factoring out dm_io_acct() and having it wrap all IO
accounting to set the size of bio with REQ_PREFLUSH to 0, account the
IO, and then restore the original size.
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
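In outline, dm_io_acct() zeroes the size around the accounting call for pre-flush bios and restores it afterwards:

    unsigned int bi_size = bio->bi_iter.bi_size;

    if (bio->bi_opf & REQ_PREFLUSH)
            bio->bi_iter.bi_size = 0;       /* empty flush moves no data */

    /* ... start or end the accounting here ... */

    if (bio->bi_opf & REQ_PREFLUSH)
            bio->bi_iter.bi_size = bi_size; /* restore for the payload */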
|
|
Commit d208b89401e0 ("dm: fix mempool NULL pointer race when
completing IO") didn't go far enough.
When bio_end_io_acct() is called, the count of in-flight I/Os may
reach zero and the DM device may be suspended. There is a possibility
that the suspend races with dm_stats_account_io().
Fix this by adding percpu "pending_io" counters to track outstanding
dm_io. Move kicking of suspend queue to dm_io_dec_pending(). Also,
rename md_in_flight_bios() to dm_in_flight_bios() and update it to
iterate all pending_io counters.
Fixes: d208b89401e0 ("dm: fix mempool NULL pointer race when completing IO")
Cc: [email protected]
Co-developed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
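With per-CPU counters, the quiescence check sums across CPUs; in outline:

    static bool dm_in_flight_bios(struct mapped_device *md)
    {
            int cpu;
            unsigned long sum = 0;

            for_each_possible_cpu(cpu)
                    sum += *per_cpu_ptr(md->pending_io, cpu);

            return sum != 0;
    }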
|
|
The return value is ioprio * BFQ_WEIGHT_CONVERSION_COEFF or 0.
What we want is ioprio or 0.
Correct this by changing the calculation.
Signed-off-by: Yahu Gao <[email protected]>
Acked-by: Paolo Valente <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
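Schematically, the corrected conversion divides the weight back down instead of scaling the level count up:

    static unsigned short bfq_weight_to_ioprio(int weight)
    {
            /* was: IOPRIO_NR_LEVELS * BFQ_WEIGHT_CONVERSION_COEFF - weight,
             * i.e. ioprio * BFQ_WEIGHT_CONVERSION_COEFF (or 0) */
            return max_t(int, 0,
                         IOPRIO_NR_LEVELS - weight / BFQ_WEIGHT_CONVERSION_COEFF);
    }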
|
|
When blk_mq_delay_run_hw_queues sets an hctx to run in the future, it can
reset the delay length for an already pending delayed work run_work. This
creates a scenario where multiple hctx may have their queues set to run,
but if one runs first and finds nothing to do, it can reset the delay of
another hctx and stall the other hctx's ability to run requests.
To avoid this I/O stall, when an hctx's run_work is already pending,
leave it untouched to run at its current designated time rather than
extending its delay. The work will still run, which keeps closed the
race that blk_mq_delay_run_hw_queues() is needed to handle, while also
avoiding the I/O stall.
Signed-off-by: David Jeffery <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/20220131203337.GA17666@redhat
Signed-off-by: Jens Axboe <[email protected]>
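The fix is a small guard in blk_mq_delay_run_hw_queues(); in outline:

    struct blk_mq_hw_ctx *hctx;
    unsigned long i;

    queue_for_each_hw_ctx(q, hctx, i) {
            if (blk_mq_hctx_stopped(hctx))
                    continue;
            /* a pending run_work already has a deadline; pushing it
             * out is what stalls the hctx, so leave it alone */
            if (delayed_work_pending(&hctx->run_work))
                    continue;
            blk_mq_delay_run_hw_queue(hctx, msecs);
    }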
|
|
Implement the ->free_disk method to free the virtio_blk structure only
once the last gendisk reference goes away instead of keeping a local
refcount.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
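The driver-side shape, simplified from the virtio_blk change (unrelated ops and fields omitted):

    static void virtblk_free_disk(struct gendisk *disk)
    {
            struct virtio_blk *vblk = disk->private_data;

            ida_simple_remove(&vd_index_ida, vblk->index);
            mutex_destroy(&vblk->vdev_mutex);
            kfree(vblk);
    }

    static const struct block_device_operations virtblk_fops = {
            .owner     = THIS_MODULE,
            .free_disk = virtblk_free_disk,
    };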
|
|
Implement the ->free_disk method to free the msb_data structure only once
the last gendisk reference goes away instead of keeping a local
refcount.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use set_disk_ro to propagate the read-only state to the block layer
instead of checking for it in ->open and leaking a reference in case
of a read-only device.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Implement the ->free_disk method to free the msb_data structure only once
the last gendisk reference goes away instead of keeping a local refcount.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add a method to notify the driver that the gendisk is about to be freed.
This allows drivers to tie the lifetime of their private data to that of
the gendisk and thus deal with device removal races without expensive
synchronization and boilerplate code.
A new flag is added so that ->free_disk is only called after a successful
call to add_disk, which significantly simplifies the error handling path
during probing.
Signed-off-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
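In outline, the method is one new op plus a state bit consulted when the gendisk is released:

    struct block_device_operations {
            /* ... */
            /* called just before the gendisk is freed */
            void (*free_disk)(struct gendisk *disk);
    };

    /* in disk_release(): only call back if add_disk() succeeded */
    if (test_bit(GD_ADDED, &disk->state) && disk->fops->free_disk)
            disk->fops->free_disk(disk);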
|