This patch adds support for handling incoming flush and/or FUA bios. Each
such bio is assigned to a struct vdo_flush. These are allocated as needed,
but there is always one kept in reserve in case allocations fail. In the
event of an allocation failure, bios may need to wait for an outstanding
flush to complete.
The logical address space is partitioned into logical zones, each handled
by its own thread. Each zone keeps a list of all data_vios handling write
requests for logical addresses in that zone. When a flush bio is processed,
each logical zone is informed of the flush. When all of the writes which
are in progress at the time of the notification have completed in all
zones, the flush bio is then allowed to complete.
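As an illustration of the allocate-with-reserve idea (a sketch; all names
besides struct vdo_flush are hypothetical, and the real code may differ):
  struct vdo_flush *flush = kmalloc(sizeof(*flush), GFP_NOWAIT);

  if (flush == NULL) {
          /* Fall back to the single reserved flush, if it is free. */
          spin_lock(&flusher->lock);
          flush = flusher->spare;
          flusher->spare = NULL;
          spin_unlock(&flusher->lock);
  }

  if (flush == NULL) {
          /* Even the spare is in use: the bio must wait for an
           * outstanding flush to complete and return the spare. */
          enqueue_waiting_bio(flusher, bio);
  }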
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Sweet Tea Dorminy <[email protected]>
Signed-off-by: Sweet Tea Dorminy <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Add the data and methods that implement the data_vio object that
handles user data bios as they are processed.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Sweet Tea Dorminy <[email protected]>
Signed-off-by: Sweet Tea Dorminy <[email protected]>
Co-developed-by: Bruce Johnston <[email protected]>
Signed-off-by: Bruce Johnston <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Add the data and methods that implement the vio object that is the basic
unit of I/O in vdo.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Sweet Tea Dorminy <[email protected]>
Signed-off-by: Sweet Tea Dorminy <[email protected]>
Co-developed-by: Bruce Johnston <[email protected]>
Signed-off-by: Bruce Johnston <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds the admin_state structures which are used to track the
states of individual vdo components for handling of operations like suspend
and resume. It also adds the action manager which is used to schedule and
manage cross-thread administrative and internal operations.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The deduplication index interface for index clients includes the
deduplication request and index session structures. This is the interface
that the rest of the vdo target uses to make requests, receive responses,
and collect statistics.
This patch also adds sysfs nodes for inspecting various index properties at
runtime.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: John Wiele <[email protected]>
Signed-off-by: John Wiele <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The top-level deduplication index brings all the earlier components
together. The top-level index creates the separate zone structures that
enable the index to handle several requests in parallel, handles
dispatching requests to the right zones and components, and coordinates
metadata to ensure that it remains consistent. It also coordinates recovery
in the event of an unexpected index failure.
If sparse caching is enabled, the top-level index also handles the
coordination required by the sparse chapter index cache, which (unlike most
index structures) is shared among all zones.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: Bruce Johnston <[email protected]>
Signed-off-by: Bruce Johnston <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The volume store structures manage the reading and writing of chapter
pages. When a chapter is closed, it is packed into a read-only structure,
split across several pages, and written to storage.
The volume store also contains a cache and specialized queues that sort and
batch requests by the page they need, in order to minimize latency and I/O
requests when records have to be read from storage. The cache and queues
also coordinate with the volume index to ensure that the volume does not
waste resources reading pages that are no longer valid.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: John Wiele <[email protected]>
Signed-off-by: John Wiele <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Deduplication records are stored in groups called chapters. New records are
collected in a structure called the open chapter, which is optimized for
adding, removing, and sorting records.
When a chapter fills, it is packed into a read-only structure called a
closed chapter, which is optimized for searching and reading. The closed
chapter includes a delta index, called the chapter index, which maps each
record name to the record page containing the record and allows the index
to read at most one record page when looking up a record.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The volume index is a large delta index that maps each record name to the
chapter which contains the newest record for that name. The volume index
can contain several million records and is stored entirely in memory while
the index is operating, accounting for the majority of the deduplication
index's memory budget.
The volume index is composed of two subindexes in order to handle sparse
hook names separately from regular names. If sparse indexing is not
enabled, the sparse hook portion of the volume index is not used or
instantiated.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The delta index is a space- and memory-efficient alternative to a hash table.
Instead of storing the entire key for each entry, the entries are sorted by
key and only the difference between adjacent keys (the delta) is stored.
If the keys are evenly distributed, the size of the deltas follows an
exponential distribution, and the deltas can use a Huffman code to take up
even less space.
This structure allows the index to use many fewer bytes per entry than a
traditional hash table, but it is slightly more expensive to look up
entries, because a request must read and sum every entry in a list of
deltas in order to find a given record. The delta index reduces this lookup
cost by splitting its key space into many sub-lists, each starting at a
fixed key value, so that each individual list is short.
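For example, keys 1000, 1003, and 1012 in a sub-list whose fixed starting
key is 1000 are stored as the deltas 0, 3, and 9. A lookup then sums deltas
until it reaches or passes the target key, roughly as in this sketch
(helper and field names are hypothetical):
  u32 current_key = sub_list->starting_key;
  bool found = false;

  for (entry = first_entry(sub_list); entry != NULL; entry = entry->next) {
          /* Each stored delta advances us to the next present key. */
          current_key += entry->delta;
          if (current_key >= key) {
                  found = (current_key == key);
                  break;
          }
  }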
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds infrastructure for managing reads and writes to the
underlying storage layer for the deduplication index. The deduplication
index uses dm-bufio for all of its reads and writes, so part of this
infrastructure is managing the various dm-bufio clients required. It also
adds the buffered reader and buffered writer abstractions, which simplify
reading and writing metadata structures that span several blocks.
This patch also includes structures and utilities for encoding and decoding
all of the deduplication index metadata, collectively called the index
layout.
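For reference, the underlying dm-bufio read pattern looks roughly like
this (a sketch; error handling trimmed):
  struct dm_buffer *buffer;
  u8 *data;

  data = dm_bufio_read(client, block_number, &buffer);
  if (IS_ERR(data))
          return PTR_ERR(data);
  /* The buffered reader copies out of the cached block... */
  memcpy(destination, data + offset, length);
  /* ...and releases it once the caller advances past it. */
  dm_bufio_release(buffer);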
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: John Wiele <[email protected]>
Signed-off-by: John Wiele <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Add structures which record the configuration of various deduplication
index parameters. This also includes facilities for saving and loading the
configuration and validating its integrity.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: John Wiele <[email protected]>
Signed-off-by: John Wiele <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds two hash maps, one keyed by integers, the other by
pointers, and also a priority heap. The integer map is used for locking of
logical and physical addresses. The pointer map is used for managing
concurrent writes of the same data, ensuring that those writes are
deduplicated. The priority heap is used to minimize the search time for
free blocks.
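Hypothetical usage of the integer map for logical-address locking (the
names and signatures here are assumptions for illustration, not the
actual dm-vdo API):
  struct int_map *lock_map;
  struct lbn_lock *prior = NULL;
  int result;

  result = vdo_int_map_create(LOCK_MAP_CAPACITY, &lock_map);
  if (result != VDO_SUCCESS)
          return result;

  /* Install our lock for this logical block number unless one is
   * already present; 'prior' then reports the current holder. */
  result = vdo_int_map_put(lock_map, logical_block_number, new_lock,
                           false, (void **) &prior);
  if (prior != NULL)
          wait_on_lock(prior);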
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds funnel_queue, a mostly lock-free multi-producer,
single-consumer queue. It also adds the request queue used by the dm-vdo
deduplication index, and the work_queue used by the dm-vdo data store. Both
of these are built on top of funnel queue and are intended to support the
dispatching of many short-running tasks. The work_queue also supports
priorities. Finally, this patch adds vdo_completion, the structure which is
enqueued on work_queues.
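The put side of a funnel queue, sketched under the usual MPSC
linked-queue design (simplified; not the actual dm-vdo code):
  struct funnel_queue_entry {
          struct funnel_queue_entry *next;
  };

  struct funnel_queue {
          struct funnel_queue_entry *newest;  /* shared by all producers */
          struct funnel_queue_entry *oldest;  /* owned by the consumer */
          struct funnel_queue_entry stub;
  };

  static void funnel_queue_put(struct funnel_queue *queue,
                               struct funnel_queue_entry *entry)
  {
          struct funnel_queue_entry *previous;

          entry->next = NULL;
          /* Atomically make ourselves the newest entry... */
          previous = xchg(&queue->newest, entry);
          /* ...then link the old newest to us. Until this store is
           * visible, the consumer can transiently see a broken chain,
           * which is why the queue is only "mostly" lock-free. */
          WRITE_ONCE(previous->next, entry);
  }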
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Sweet Tea Dorminy <[email protected]>
Signed-off-by: Sweet Tea Dorminy <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds utilities for managing and using named threads, as well as
several locking and synchronization utilities. These utilities help dm-vdo
minimize thread transitions and manage interactions between threads.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: Bruce Johnston <[email protected]>
Signed-off-by: Bruce Johnston <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Add definitions of constants defining the fixed parameters of a VDO
volume, and the default and maximum values of configurable or dynamic
parameters.
Add definitions of internal status codes used for internal
communication within the module and for logging.
Add definitions of types and structs used to manage the processing of
an I/O operation.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Add various support utilities for the vdo target and deduplication index,
including logging utilities, string and time management, and index-specific
error codes.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
This patch adds standardized allocation macros and memory tracking tools to
track and report any allocated memory that is not freed. This makes it
easier to ensure that the vdo target does not leak memory.
This patch also adds utilities for controlling whether certain threads are
allowed to allocate memory, since memory allocation during certain critical
code sections can cause the vdo target to deadlock.
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Michael Sclafani <[email protected]>
Signed-off-by: Michael Sclafani <[email protected]>
Co-developed-by: Thomas Jaskiewicz <[email protected]>
Signed-off-by: Thomas Jaskiewicz <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
written by Austin Appleby and placed in the public domain. This version has
been modified to produce the same result on both big endian and little
endian processors, making it suitable for use in portable persistent data.
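The endian-neutral behavior amounts to reading input blocks in one fixed
byte order, e.g. (a sketch; the helper name is illustrative):
  /* Always interpret 64-bit input blocks as little-endian so that
   * big- and little-endian CPUs compute identical digests. */
  static inline u64 murmur_getblock64(const u64 *p, int i)
  {
          return le64_to_cpu(*(const __le64 *) &p[i]);
  }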
Co-developed-by: J. corwin Coburn <[email protected]>
Signed-off-by: J. corwin Coburn <[email protected]>
Co-developed-by: Ken Raeburn <[email protected]>
Signed-off-by: Ken Raeburn <[email protected]>
Co-developed-by: John Wiele <[email protected]>
Signed-off-by: John Wiele <[email protected]>
Signed-off-by: Matthew Sakai <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The task status has already been set to TASK_RUNNING in schedule(),
so there is no need to set it again here.
Signed-off-by: Lizhe <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
Signed-off-by: Mike Snitzer <[email protected]>
|
|
"struct bvec_iter" is defined with the __packed attribute, so it is
aligned on a single byte. On X86 (and on other architectures that support
unaligned addresses in hardware), "struct bvec_iter" is accessed using the
8-byte and 4-byte memory instructions, however these instructions are less
efficient if they operate on unaligned addresses.
(on RISC machines that don't have unaligned access in hardware, GCC
generates byte-by-byte accesses that are very inefficient - see [1])
This commit reorders the entries in "struct dm_verity_io" and "struct
convert_context", so that "struct bvec_iter" is aligned on 8 bytes.
[1] https://lore.kernel.org/all/ZcLuWUNRZadJr0tQ@fedora/T/
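An illustration of the effect with stand-in types (not the actual dm
structs):
  struct packed_iter {             /* stand-in for struct bvec_iter */
          u64 sector;
          u32 size;
          u32 idx;
          u32 done;
  } __packed;

  struct io_before {
          u8 flags;
          struct packed_iter iter; /* offset 1: every access unaligned */
  };

  struct io_after {
          struct packed_iter iter; /* offset 0: naturally aligned */
          u8 flags;
  };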
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
If a userspace process reads (with O_DIRECT) multiple blocks into the same
buffer, dm-crypt reports an authentication error [1]. The error is
reported in the kernel log and may cause a RAID leg to be kicked out of
the array.
This commit fixes dm-crypt, so that if integrity verification fails, the
data is read again into a kernel buffer (where userspace can't modify it)
and the integrity tag is rechecked. If the recheck succeeds, the content
of the kernel buffer is copied into the user buffer; if the recheck fails,
an integrity error is reported.
[1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c
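In outline, the recheck pattern works like this (a sketch; all names are
hypothetical):
  if (!tag_matches(user_buffer, len)) {
          void *kbuf = kmalloc(len, GFP_NOIO);

          if (kbuf == NULL)
                  return -ENOMEM;
          /* Userspace cannot touch this buffer during the recheck. */
          read_block_again(kbuf, len);
          if (!tag_matches(kbuf, len)) {
                  kfree(kbuf);
                  return -EILSEQ;         /* genuine integrity error */
          }
          memcpy(user_buffer, kbuf, len); /* recheck passed */
          kfree(kbuf);
  }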
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
|
|
It was said that authenticated encryption could produce an invalid tag
when the data being encrypted is modified [1]. So, fix this problem by
copying the data into the clone bio first and then encrypting it inside
the clone bio.
This may reduce performance, but it is needed to prevent the user from
corrupting the device by writing data with O_DIRECT and modifying it at
the same time.
[1] https://lore.kernel.org/all/[email protected]/T/
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
|
|
If a userspace process reads (with O_DIRECT) multiple blocks into the same
buffer, dm-verity reports an error [1].
This commit fixes dm-verity, so that if hash verification fails, the data
is read again into a kernel buffer (where userspace can't modify it) and
the hash is rechecked. If the recheck succeeds, the content of the kernel
buffer is copied into the user buffer; if the recheck fails, an error is
reported.
[1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
|
|
If a userspace process reads (with O_DIRECT) multiple blocks into the same
buffer, dm-integrity reports an error [1]. The error is reported in the
kernel log and may cause a RAID leg to be kicked out of the array.
This commit fixes dm-integrity, so that if integrity verification fails,
the data is read again into a kernel buffer (where userspace can't modify
it) and the integrity tag is rechecked. If the recheck succeeds, the
content of the kernel buffer is copied into the user buffer; if the
recheck fails, an integrity error is reported.
[1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Signed-off-by: Mike Snitzer <[email protected]>
|
|
'def_bool X' is a shorthand for 'bool' plus 'default X'.
'def_bool' is redundant where 'bool' is already present, so 'def_bool X'
can be replaced with 'default X', or removed if X is 'n'.
Signed-off-by: Masahiro Yamada <[email protected]>
|
|
Pass the queue limits directly to blk_alloc_disk instead of setting them
one at a time.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This
will allow allocating queues with valid queue limits instead of setting
the values one at a time later.
Also change blk_alloc_disk to return an ERR_PTR instead of just NULL
which can't distinguish errors.
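With this change, a caller can pass its limits at allocation time,
roughly as follows (the limit values are examples only):
  struct queue_limits lim = {
          .logical_block_size = 4096,
          .max_hw_sectors     = 1024,
  };
  struct gendisk *disk;

  disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
  if (IS_ERR(disk))
          return PTR_ERR(disk);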
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
Pull MD changes from Song:
"1. Cleanup redundant checks, by Yu Kuai.
2. Remove deprecated headers, by Marc Zyngier and Song Liu.
3. Concurrency fixes, by Li Lingfeng.
4. Memory leak fix, by Li Nan."
* tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md: fix kmemleak of rdev->serial
md/multipath: Remove md-multipath.h
md/linear: Get rid of md-linear.h
md: use RCU lock to protect traversal in md_spares_need_change()
md: get rdev->mddev with READ_ONCE()
md: remove redundant md_wakeup_thread()
md: remove redundant check of 'mddev->sync_thread'
|
|
md_start_sync() will suspend the array if there are spares that can be
added to or removed from conf; however, if a reshape is still in progress,
this won't happen at all or data will be corrupted (remove_and_add_spares
won't be called from md_choose_sync_action for reshape), hence there is
no need to suspend the array if the reshape is not done yet.
Meanwhile, there is a potential deadlock for raid456:
1) reshape is interrupted;
2) set WantReplacement on one of the disks, and add a new disk to the
array; however, recovery won't start until the reshape is finished;
3) then issue an IO across the reshape position; this IO will wait for
the reshape to make progress;
4) continue to reshape, then md_start_sync() finds there is a spare disk
that can be added to conf, and mddev_suspend() is called;
Steps 3 and 4 wait for each other, and a deadlock is triggered. Note that
this problem was found by code review and has not been reproduced yet.
Fix this problem by not suspending the array for an interrupted reshape;
this is safe because conf won't be changed until the reshape is done.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently, if reshape is interrupted, then reassembling the array will
register the sync_thread directly from pers->run(). In this case
'MD_RECOVERY_RUNNING' is set directly; however, there is no guarantee
that md_do_sync() will be executed, hence stop_sync_thread() will hang
because 'MD_RECOVERY_RUNNING' can't be cleared.
The last patch makes sure that md_do_sync() will set MD_RECOVERY_DONE;
however, the following hang can still occasionally be triggered by the
dm-raid test shell/lvconvert-raid-reshape.sh:
[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74
Meanwhile mddev->recovery is:
MD_RECOVERY_RUNNING |
MD_RECOVERY_INTR |
MD_RECOVERY_RESHAPE |
MD_RECOVERY_FROZEN
Fix this problem by removing the code that registers the sync_thread
directly from raid10 and raid5, and let md_check_recovery() register
the sync_thread instead.
Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must
set MD_RECOVERY_DONE, so that the follow-up md_check_recovery() will
unregister the sync_thread, clear MD_RECOVERY_RUNNING, and wake up
stop_sync_thread().
If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will
return without setting MD_RECOVERY_DONE. And after commit f52f5c71f3d4
("md: fix stopping sync thread"), dm-raid switched from
md_reap_sync_thread() to stop_sync_thread() to unregister the sync_thread
from md_stop() and md_stop_writes(), causing the test
shell/lvconvert-raid-reshape.sh to hang.
We shouldn't switch back to md_reap_sync_thread() because it's
problematic in the first place. Fix the problem by making sure
md_do_sync() will set MD_RECOVERY_DONE.
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: d5d885fd514f ("md: introduce new personality funciton start()")
Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Usually, if the array is not read-write, md_check_recovery() won't
register a new sync_thread in the first place. And if the array is
read-write and a sync_thread is registered, md_set_readonly() will
unregister the sync_thread before setting the array read-only. md/raid
follows this behavior, hence there is no problem.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by the test shell/integrity-caching.sh:
1) array is read-only. dm-raid update super block:
rs_update_sbs
ro = mddev->ro
mddev->ro = 0
-> set array read-write
md_update_sb
2) register new sync thread concurrently.
3) dm-raid set array back to read-only:
rs_update_sbs
mddev->ro = ro
4) stop the array:
raid_dtr
md_stop
stop_sync_thread
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_wakeup_thread_directly(mddev->sync_thread);
wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
5) sync thread done:
md_do_sync
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
md_wakeup_thread(mddev->thread);
6) daemon thread can't unregister sync thread:
md_check_recovery
if (!md_is_rdwr(mddev) &&
!test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
return;
-> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hangs;
The root cause is that dm-raid manipulates 'mddev->ro' by itself;
however, dm-raid really should stop the sync thread before setting the
array read-only. Unfortunately, I need to read more code before I can
refactor the handling of 'mddev->ro' in dm-raid, hence let's fix the
problem the easy way for now to prevent a dm-raid regression.
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
mddev_suspend() never stops the sync_thread, hence it doesn't make sense
to ignore a suspended array in md_check_recovery(), as that might leave
the sync_thread unable to be unregistered.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
hang can be triggered by the test shell/integrity-caching.sh:
1) suspend the array:
raid_postsuspend
mddev_suspend
2) stop the array:
raid_dtr
md_stop
__md_stop_writes
stop_sync_thread
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_wakeup_thread_directly(mddev->sync_thread);
wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
3) sync thread done:
md_do_sync
set_bit(MD_RECOVERY_DONE, &mddev->recovery);
md_wakeup_thread(mddev->thread);
4) daemon thread can't unregister sync thread:
md_check_recovery
if (mddev->suspended)
return; -> return directly
md_reap_sync_thread
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
-> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hangs;
This problem is not just related to dm-raid; fix it by no longer
ignoring a suspended array in md_check_recovery(). Follow-up patches
will improve dm-raid to properly freeze the sync thread during suspend.
Reported-by: Mikulas Patocka <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Fixes: 68866e425be2 ("MD: no sync IO while suspended")
Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
Cc: [email protected] # v6.7+
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If kobject_add() fails in bind_rdev_to_array(), 'rdev->serial' will be
allocated but not freed, and a kmemleak occurs.
unreferenced object 0xffff88815a350000 (size 49152):
comm "mdadm", pid 789, jiffies 4294716910
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc f773277a):
[<0000000058b0a453>] kmemleak_alloc+0x61/0xe0
[<00000000366adf14>] __kmalloc_large_node+0x15e/0x270
[<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f
[<00000000f206d60a>] kvmalloc_node+0x74/0x150
[<0000000034bf3363>] rdev_init_serial+0x67/0x170
[<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220
[<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630
[<0000000073c28560>] md_add_new_disk+0x400/0x9f0
[<00000000770e30ff>] md_ioctl+0x15bf/0x1c10
[<000000006cfab718>] blkdev_ioctl+0x191/0x3f0
[<0000000085086a11>] vfs_ioctl+0x22/0x60
[<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0
[<00000000e54e675e>] do_syscall_64+0x71/0x150
[<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74
Fixes: 963c555e75b0 ("md: introduce mddev_create/destroy_wb_pool for the change of member device")
Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use
memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can
drop the gfp_mask parameter from blkdev_zone_mgmt() as well as
blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated().
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Guard the calls to blkdev_zone_mgmt() with a memalloc_noio scope.
This helps us get rid of the GFP_NOIO argument to blkdev_zone_mgmt().
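The resulting caller-side pattern, sketched:
  unsigned int noio_flag;
  int ret;

  noio_flag = memalloc_noio_save();
  ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, sector, nr_sectors);
  memalloc_noio_restore(noio_flag);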
Signed-off-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Reviewed-by: Damien Le Moal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
submit_flushes
 atomic_set(&mddev->flush_pending, 1);
 rdev_for_each_rcu(rdev, mddev)
  atomic_inc(&mddev->flush_pending);
  bi->bi_end_io = md_end_flush
  submit_bio(bi);
                        /* flush io is done first */
                        md_end_flush
                         if (atomic_dec_and_test(&mddev->flush_pending))
                          percpu_ref_put(&mddev->active_io)
                          -> active_io is not released

 if (atomic_dec_and_test(&mddev->flush_pending))
 -> missing release of active_io
As a consequence, mddev_suspend() will wait forever for 'active_io' to
reach zero.
Fix this problem by releasing 'active_io' in submit_flushes() if
'flush_pending' is decreased to zero.
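That is, roughly:
  /* In submit_flushes(), after submitting the per-rdev flush bios,
   * drop the initial reference; whoever sees it hit zero also puts
   * 'active_io' before queuing the flush work. */
  if (atomic_dec_and_test(&mddev->flush_pending)) {
          /* Pairs with percpu_ref_get() in md_flush_request(). */
          percpu_ref_put(&mddev->active_io);
          queue_work(md_wq, &mddev->flush_work);
  }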
Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
Cc: [email protected] # v6.1+
Reported-by: Blazej Kucman <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
md-multipath is already deprecated. Remove the header file.
Signed-off-by: Song Liu <[email protected]>
|
|
Given that 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR")
killed the linear flavour of MD, it seems only logical to drop
the leftover include file that used to come with it.
I also feel that it should be my own privilege to remove my 30 year
old attempt at writing kernel code ;-). RIP!
Cc: Song Liu <[email protected]>
Cc: Yu Kuai <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Since md_start_sync() is called without the protection of mddev_lock and
can run concurrently with array reconfiguration, traversal of rdevs in it
should be protected by the RCU lock.
Commit bc08041b32ab ("md: suspend array in md_start_sync() if array need
reconfiguration") added md_spares_need_change() to md_start_sync(),
causing use of rdevs without any protection.
Fix this by taking the RCU lock in md_spares_need_change().
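That is, roughly (per the commit description; helper names as used in
md.c):
  static bool md_spares_need_change(struct mddev *mddev)
  {
          struct md_rdev *rdev;
          bool ret = false;

          rcu_read_lock();
          rdev_for_each_rcu(rdev, mddev) {
                  if (rdev_removeable(rdev) || rdev_addable(rdev)) {
                          ret = true;
                          break;
                  }
          }
          rcu_read_unlock();
          return ret;
  }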
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration")
Cc: [email protected] # 6.7+
Signed-off-by: Li Lingfeng <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Users may get rdev->mddev via sysfs while the rdev is being released.
So use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing
and to read/write mddev atomically.
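The pattern, sketched:
  /* Release path: publish the disconnect with a single untorn store. */
  WRITE_ONCE(rdev->mddev, NULL);

  /* sysfs reader: take one untorn snapshot before dereferencing. */
  struct mddev *mddev = READ_ONCE(rdev->mddev);

  if (mddev != NULL)
          do_something_with(mddev);  /* hypothetical use of the snapshot */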
Signed-off-by: Li Lingfeng <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
On the one hand, mddev_unlock() will call md_wakeup_thread()
unconditionally; on the other hand, md_check_recovery() can't make
progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't
make sense to wake up the daemon thread while 'reconfig_mutex' is still
grabbed.
Remove all the md_wakeup_thread() calls for 'mddev->thread' that are
made while 'reconfig_mutex' is still grabbed.
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The lifetime of sync_thread:
1) Set MD_RECOVERY_NEEDED and wake up daemon thread (by ioctl/sysfs or
other events);
2) Daemon thread woke up, md_check_recovery() found that
MD_RECOVERY_NEEDED is set:
a) try to grab reconfig_mutex;
b) set MD_RECOVERY_RUNNING;
c) clear MD_RECOVERY_NEEDED, and then queue sync_work;
3) md_start_sync() choose sync_action, then register sync_thread;
4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up daemon thread;
5) Daemon thread woke up, md_check_recovery() found that
MD_RECOVERY_DONE is set:
a) try to grab reconfig_mutex;
b) unregister sync_thread;
c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE;
Hence there is no case where sync_thread is registered while
MD_RECOVERY_RUNNING is not set.
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Tasklets have an inherent problem with memory corruption. The function
tasklet_action_common calls tasklet_trylock, then it calls the tasklet
callback and then it calls tasklet_unlock. If the tasklet callback frees
the structure that contains the tasklet or if it calls some code that may
free it, tasklet_unlock will write into free memory.
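The hazard, in outline (hypothetical container and callback):
  struct my_io {                          /* hypothetical container */
          struct tasklet_struct tasklet;
          struct bio *bio;
  };

  static void my_io_tasklet_fn(unsigned long data)
  {
          struct my_io *io = (struct my_io *) data;

          bio_endio(io->bio);
          kfree(io);  /* frees the tasklet embedded in 'io' */
  }
  /* tasklet_action_common() then calls tasklet_unlock(&io->tasklet),
   * writing into the memory freed just above. */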
The commits 8e14f610159d and d9a02e016aaf try to fix it for dm-crypt, but
it is not a sufficient fix and the data corruption can still happen [1].
There is no fix for dm-verity and dm-verity will write into free memory
with every tasklet-processed bio.
Atomic workqueues will be implemented in kernel 6.9 [2]. They will have
a better interface and will not suffer from the memory corruption
problem.
But we need something that stops the memory corruption now and that can
be backported to the stable kernels. So, I'm proposing this commit that
disables tasklets in both dm-crypt and dm-verity. This commit doesn't
remove the tasklet support, because the tasklet code will be reused when
atomic workqueues are implemented.
[1] https://lore.kernel.org/all/[email protected]/T/
[2] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Mikulas Patocka <[email protected]>
Cc: [email protected]
Fixes: 39d42fa96ba1b ("dm crypt: add flags to optionally bypass kcryptd workqueues")
Fixes: 5721d4e5a9cdb ("dm verity: Add optional "try_verify_in_tasklet" feature")
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The function kvmalloc_node limits the allocation size to INT_MAX. This
limit is exceeded if dm-writecache attempts to map a device that is 1TiB
or larger. This commit changes kvmalloc_array to vmalloc_array to avoid
the limit.
to avoid the limit.
The commit also changes vmalloc(array_size()) to vmalloc_array().
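The substitution is mechanical, e.g.:
  /* Before: capped at INT_MAX inside kvmalloc_node(). */
  entries = kvmalloc_array(nr_entries, sizeof(*entries), GFP_KERNEL);

  /* After: vmalloc_array() has no INT_MAX cap. */
  entries = vmalloc_array(nr_entries, sizeof(*entries));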
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The kvmalloc function fails with a warning if the size is larger than
INT_MAX. Linus said that there should be limits that prevent this warning
from being hit. This commit adds the limits to the dm-stats subsystem
in DM core.
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|
|
The kvmalloc function fails with a warning if the size is larger than
INT_MAX. The warning was triggered by a syscall testing robot.
In order to avoid the warning, this commit limits the number of targets to
1048576 and the size of the parameter area to 1073741824.
Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
|