path: root/drivers/md
Age | Commit message | Author | Files | Lines
2024-02-20dm vdo: add flush supportMatthew Sakai4-0/+1066
This patch adds support for handling incoming flush and/or FUA bios. Each such bio is assigned to a struct vdo_flush. These are allocated as needed, but there is always one kept in reserve in case allocations fail. In the event of an allocation failure, bios may need to wait for an outstanding flush to complete. The logical address space is partitioned into logical zones, each handled by its own thread. Each zone keeps a list of all data_vios handling write requests for logical addresses in that zone. When a flush bio is processed, each logical zone is informed of the flush. When all of the writes which are in progress at the time of the notification have completed in all zones, the flush bio is then allowed to complete. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add data_vio, the request object which services incoming biosMatthew Sakai2-0/+2726
Add the data and methods that implement the data_vio object that handles user data bios as they are processed. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Sweet Tea Dorminy <[email protected]> Co-developed-by: Bruce Johnston <[email protected]> Signed-off-by: Bruce Johnston <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add vio, the request object for vdo metadataMatthew Sakai2-0/+700
Add the data and methods that implement the vio object that is the basic unit of I/O in vdo. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Sweet Tea Dorminy <[email protected]> Co-developed-by: Bruce Johnston <[email protected]> Signed-off-by: Bruce Johnston <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add administrative state and action managerMatthew Sakai4-0/+1182
This patch adds the admin_state structures which are used to track the states of individual vdo components for handling of operations like suspend and resume. It also adds the action manager which is used to schedule and manage cross-thread administrative and internal operations. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement external deduplication index interfaceMatthew Sakai5-0/+1367
The deduplication index interface for index clients includes the deduplication request and index session structures. This is the interface that the rest of the vdo target uses to make requests, receive responses, and collect statistics. This patch also adds sysfs nodes for inspecting various index properties at runtime. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: John Wiele <[email protected]> Signed-off-by: John Wiele <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement top-level deduplication indexMatthew Sakai4-0/+2109
The top-level deduplication index brings all the earlier components together. The top-level index creates the separate zone structures that enable the index to handle several requests in parallel, handles dispatching requests to the right zones and components, and coordinates metadata to ensure that it remains consistent. It also coordinates recovery in the event of an unexpected index failure. If sparse caching is enabled, the top-level index also handles the coordination required by the sparse chapter index cache, which (unlike most index structures) is shared among all zones. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: Bruce Johnston <[email protected]> Signed-off-by: Bruce Johnston <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement the chapter volume storeMatthew Sakai6-0/+2465
The volume store structures manage the reading and writing of chapter pages. When a chapter is closed, it is packed into a read-only structure, split across several pages, and written to storage. The volume store also contains a cache and specialized queues that sort and batch requests by the page they need, in order to minimize latency and I/O requests when records have to be read from storage. The cache and queues also coordinate with the volume index to ensure that the volume does not waste resources reading pages that are no longer valid. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: John Wiele <[email protected]> Signed-off-by: John Wiele <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement the open chapter and chapter indexesMatthew Sakai4-0/+860
Deduplication records are stored in groups called chapters. New records are collected in a structure called the open chapter, which is optimized for adding, removing, and sorting records. When a chapter fills, it is packed into a read-only structure called a closed chapter, which is optimized for searching and reading. The closed chapter includes a delta index, called the chapter index, which maps each record name to the record page containing the record and allows the index to read at most one record page when looking up a record. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement the volume indexMatthew Sakai2-0/+1482
The volume index is a large delta index that maps each record name to the chapter which contains the newest record for that name. The volume index can contain several million records and is stored entirely in memory while the index is operating, accounting for the majority of the deduplication index's memory budget. The volume index is composed of two subindexes in order to handle sparse hook names separately from regular names. If sparse indexing is not enabled, the sparse hook portion of the volume index is not used or instantiated. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: implement the delta indexMatthew Sakai3-0/+2333
The delta index is a space and memory efficient alternative to a hashtable. Instead of storing the entire key for each entry, the entries are sorted by key and only the difference between adjacent keys (the delta) is stored. If the keys are evenly distributed, the size of the deltas follows an exponential distribution, and the deltas can use a Huffman code to take up even less space. This structure allows the index to use many fewer bytes per entry than a traditional hash table, but it is slightly more expensive to look up entries, because a request must read and sum every entry in a list of deltas in order to find a given record. The delta index reduces this lookup cost by splitting its key space into many sub-lists, each starting at a fixed key value, so that each individual list is short. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
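The delta-list idea described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names (`delta_index_put`, `delta_index_contains`), the constants, and the fixed one-byte deltas are hypothetical, and the Huffman coding the commit mentions is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the key space is split into LIST_COUNT sub-lists,
 * each starting at a fixed key value and covering KEYS_PER_LIST keys. */
#define LIST_COUNT    4
#define KEYS_PER_LIST 256
#define MAX_ENTRIES   16

struct delta_list {
	uint8_t deltas[MAX_ENTRIES]; /* gap from the previous key (or the list base) */
	size_t count;
};

struct delta_index {
	struct delta_list lists[LIST_COUNT];
};

/* Insert a key, keeping its sub-list sorted.
 * Returns 0 on success, -1 if the key exists or the list is full. */
int delta_index_put(struct delta_index *idx, uint32_t key)
{
	assert(key < LIST_COUNT * KEYS_PER_LIST);
	struct delta_list *list = &idx->lists[key / KEYS_PER_LIST];
	uint32_t offset = key % KEYS_PER_LIST;
	uint32_t sum = 0;
	size_t i;

	/* Sum deltas until we reach the insertion point. */
	for (i = 0; i < list->count; i++) {
		uint32_t next = sum + list->deltas[i];

		if (next == offset)
			return -1;
		if (next > offset)
			break;
		sum = next;
	}
	if (list->count == MAX_ENTRIES)
		return -1;

	/* Make room, store the new gap, then shrink the successor's gap. */
	for (size_t j = list->count; j > i; j--)
		list->deltas[j] = list->deltas[j - 1];
	list->deltas[i] = (uint8_t)(offset - sum);
	list->count++;
	if (i + 1 < list->count)
		list->deltas[i + 1] -= list->deltas[i];
	return 0;
}

/* Look a key up by reading and summing the deltas of one short sub-list. */
int delta_index_contains(const struct delta_index *idx, uint32_t key)
{
	assert(key < LIST_COUNT * KEYS_PER_LIST);
	const struct delta_list *list = &idx->lists[key / KEYS_PER_LIST];
	uint32_t offset = key % KEYS_PER_LIST;
	uint32_t sum = 0;

	for (size_t i = 0; i < list->count; i++) {
		sum += list->deltas[i];
		if (sum == offset)
			return 1;
		if (sum > offset)
			break;
	}
	return 0;
}
```

Splitting into sub-lists bounds the number of deltas a lookup has to sum, which is the trade-off the commit message describes.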
2024-02-20dm vdo: add deduplication index storage interfaceMatthew Sakai5-0/+2368
This patch adds infrastructure for managing reads and writes to the underlying storage layer for the deduplication index. The deduplication index uses dm-bufio for all of its reads and writes, so part of this infrastructure is managing the various dm-bufio clients required. It also adds the buffered reader and buffered writer abstractions, which simplify reading and writing metadata structures that span several blocks. This patch also includes structures and utilities for encoding and decoding all of the deduplication index metadata, collectively called the index layout. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: John Wiele <[email protected]> Signed-off-by: John Wiele <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add deduplication configuration structuresMatthew Sakai4-0/+841
Add structures which record the configuration of various deduplication index parameters. This also includes facilities for saving and loading the configuration and validating its integrity. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: John Wiele <[email protected]> Signed-off-by: John Wiele <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add basic hash map data structuresMatthew Sakai6-0/+1808
This patch adds two hash maps, one keyed by integers, the other by pointers, and also a priority heap. The integer map is used for locking of logical and physical addresses. The pointer map is used for managing concurrent writes of the same data, ensuring that those writes are deduplicated. The priority heap is used to minimize the search time for free blocks. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add specialized request queueing functionalityMatthew Sakai9-0/+1635
This patch adds funnel_queue, a mostly lock-free multi-producer, single-consumer queue. It also adds the request queue used by the dm-vdo deduplication index, and the work_queue used by the dm-vdo data store. Both of these are built on top of funnel queue and are intended to support the dispatching of many short-running tasks. The work_queue also supports priorities. Finally, this patch adds vdo_completion, the structure which is enqueued on work_queues. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Sweet Tea Dorminy <[email protected]> Signed-off-by: Sweet Tea Dorminy <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
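For readers unfamiliar with multi-producer, single-consumer queues, here is a minimal sketch of one well-known intrusive design (often attributed to Dmitry Vyukov), written with C11 atomics. It is not claimed to be the funnel_queue code from this patch; the type and function names (`fq`, `fq_put`, `fq_poll`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fq_entry {
	_Atomic(struct fq_entry *) next;
};

struct fq {
	_Atomic(struct fq_entry *) newest; /* producers swap themselves in here */
	struct fq_entry *oldest;           /* touched only by the consumer */
	struct fq_entry stub;              /* keeps the chain non-empty */
};

void fq_init(struct fq *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->newest, &q->stub);
	q->oldest = &q->stub;
}

/* May be called from any number of producer threads; takes no lock. */
void fq_put(struct fq *q, struct fq_entry *e)
{
	assert(e != NULL);
	atomic_store(&e->next, NULL);
	struct fq_entry *prev = atomic_exchange(&q->newest, e);

	/* Until this store lands, the consumer briefly sees a gap in the chain. */
	atomic_store(&prev->next, e);
}

/* Single consumer only. Returns NULL if the queue is empty or if a
 * producer is in the middle of fq_put(). */
struct fq_entry *fq_poll(struct fq *q)
{
	struct fq_entry *oldest = q->oldest;
	struct fq_entry *next = atomic_load(&oldest->next);

	if (oldest == &q->stub) {
		if (next == NULL)
			return NULL;
		q->oldest = oldest = next; /* skip over the stub */
		next = atomic_load(&oldest->next);
	}
	if (next == NULL) {
		if (oldest != atomic_load(&q->newest))
			return NULL; /* a producer has not linked its entry yet */
		fq_put(q, &q->stub); /* re-insert the stub behind the last entry */
		next = atomic_load(&oldest->next);
		if (next == NULL)
			return NULL;
	}
	q->oldest = next;
	return oldest;
}
```

The "mostly lock-free" wording in the commit matches this kind of design: producers never block, but the consumer can momentarily observe an unlinked entry and has to retry.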
2024-02-20dm vdo: add thread and synchronization utilitiesMatthew Sakai7-0/+524
This patch adds utilities for managing and using named threads, as well as several locking and synchronization utilities. These utilities help dm-vdo minimize thread transitions and manage interactions between threads. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: Bruce Johnston <[email protected]> Signed-off-by: Bruce Johnston <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add vdo type declarations, constants, and simple data structuresMatthew Sakai7-0/+1101
Add definitions of constants defining the fixed parameters of a VDO volume, and the default and maximum values of configurable or dynamic parameters. Add definitions of internal status codes used for internal communication within the module and for logging. Add definitions of types and structs used to manage the processing of an I/O operation. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add basic logging and support utilitiesMatthew Sakai9-0/+940
Add various support utilities for the vdo target and deduplication index, including logging utilities, string and time management, and index-specific error codes. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add memory allocation utilitiesMatthew Sakai2-0/+604
This patch adds standardized allocation macros and memory tracking tools to track and report any allocated memory that is not freed. This makes it easier to ensure that the vdo target does not leak memory. This patch also adds utilities for controlling whether certain threads are allowed to allocate memory, since memory allocation during certain critical code sections can cause the vdo target to deadlock. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Michael Sclafani <[email protected]> Signed-off-by: Michael Sclafani <[email protected]> Co-developed-by: Thomas Jaskiewicz <[email protected]> Signed-off-by: Thomas Jaskiewicz <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm vdo: add the MurmurHash3 fast hashing algorithmMatthew Sakai2-0/+190
MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally written by Austin Appleby and placed in the public domain. This version has been modified to produce the same result on both big endian and little endian processors, making it suitable for use in portable persistent data. Co-developed-by: J. corwin Coburn <[email protected]> Signed-off-by: J. corwin Coburn <[email protected]> Co-developed-by: Ken Raeburn <[email protected]> Signed-off-by: Ken Raeburn <[email protected]> Co-developed-by: John Wiele <[email protected]> Signed-off-by: John Wiele <[email protected]> Signed-off-by: Matthew Sakai <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
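One common way to make a block-oriented hash produce the same result on big-endian and little-endian machines is to assemble each input word from bytes in a fixed order instead of casting the buffer pointer. The helper below is a sketch of that technique only; the name `load_le64` is hypothetical and this is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a 64-bit value from 8 bytes stored in little-endian order.
 * The result does not depend on the byte order of the host CPU. */
uint64_t load_le64(const uint8_t *p)
{
	uint64_t value = 0;

	assert(p != NULL);
	for (int i = 7; i >= 0; i--)
		value = (value << 8) | p[i];
	return value;
}
```

Feeding every input block through a loader like this means the persisted hash values can be compared across architectures.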
2024-02-20dm crypt: remove redundant state settings after waking upLizhe1-1/+0
The task status has been set to TASK_RUNNING in schedule(). No need to set again here. Signed-off-by: Lizhe <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm thin: add braces around conditional code that spans linesMike Snitzer1-8/+12
Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm-crypt, dm-integrity, dm-verity: bump target versionMike Snitzer3-3/+3
Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm-verity, dm-crypt: align "struct bvec_iter" correctlyMikulas Patocka2-4/+4
"struct bvec_iter" is defined with the __packed attribute, so it is aligned on a single byte. On X86 (and on other architectures that support unaligned addresses in hardware), "struct bvec_iter" is accessed using the 8-byte and 4-byte memory instructions, however these instructions are less efficient if they operate on unaligned addresses. (on RISC machines that don't have unaligned access in hardware, GCC generates byte-by-byte accesses that are very inefficient - see [1]) This commit reorders the entries in "struct dm_verity_io" and "struct convert_context", so that "struct bvec_iter" is aligned on 8 bytes. [1] https://lore.kernel.org/all/ZcLuWUNRZadJr0tQ@fedora/T/ Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
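The effect of member ordering on a packed struct can be checked with `offsetof`. The structs below are illustrative stand-ins, not the kernel definitions: the field names and the `io_before`/`io_after` layouts are assumptions, and `__attribute__((packed))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a packed iterator: alignment 1, size 20 bytes. */
struct iter_packed {
	uint64_t sector;
	uint32_t size;
	uint32_t idx;
	uint32_t done;
} __attribute__((packed));

/* Before: a 1-byte member in front leaves the iterator at offset 1. */
struct io_before {
	uint8_t flag;
	struct iter_packed iter;
	uint64_t start;
};

/* After: an 8-byte member in front places the iterator at offset 8. */
struct io_after {
	uint64_t start;
	struct iter_packed iter;
	uint8_t flag;
};

/* Compile-time check that the reordered layout keeps the iterator aligned. */
static_assert(offsetof(struct io_after, iter) % 8 == 0,
	      "iter should sit on an 8-byte boundary");
```

Because the packed struct has alignment 1, the compiler inserts no padding in front of it, so only the members preceding it decide where it lands.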
2024-02-20dm-crypt: recheck the integrity tag after a failureMikulas Patocka1-16/+73
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-crypt reports an authentication error [1]. The error is reported in a log and it may cause a RAID leg to be kicked out of the array. This commit fixes dm-crypt, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <[email protected]> Cc: [email protected] Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm-crypt: don't modify the data when using authenticated encryptionMikulas Patocka1-0/+6
It was reported that authenticated encryption could produce an invalid tag when the data that is being encrypted is modified [1]. So, fix this problem by copying the data into the clone bio first and then encrypting it inside the clone bio. This may reduce performance, but it is needed to prevent the user from corrupting the device by writing data with O_DIRECT and modifying it at the same time. [1] https://lore.kernel.org/all/[email protected]/T/ Signed-off-by: Mikulas Patocka <[email protected]> Cc: [email protected] Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20dm-verity: recheck the hash after a failureMikulas Patocka2-6/+86
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-verity reports an error [1]. This commit fixes dm-verity, so that if hash verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the hash is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <[email protected]> Cc: [email protected] Signed-off-by: Mike Snitzer <[email protected]>
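The recheck flow described above (verify, re-read into a private buffer, verify again, then copy) can be modelled in userspace. This is a sketch under loud assumptions: `toy_hash` is FNV-1a chosen for brevity and is not what dm-verity uses, `device_block` stands in for the storage device, and `verify_with_recheck` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Toy stand-in for the real hash (FNV-1a); illustration only. */
static uint32_t toy_hash(const uint8_t *data)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < BLOCK_SIZE; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Returns 0 if the block verifies (possibly after the recheck),
 * -1 if the data on the device itself does not match the hash. */
int verify_with_recheck(const uint8_t *device_block, uint32_t expected,
			uint8_t *user_buf)
{
	uint8_t private_buf[BLOCK_SIZE];

	assert(device_block != NULL && user_buf != NULL);

	/* First attempt: check the buffer that userspace may be scribbling on. */
	if (toy_hash(user_buf) == expected)
		return 0;

	/* Re-read into a buffer that userspace cannot modify, then recheck. */
	memcpy(private_buf, device_block, BLOCK_SIZE);
	if (toy_hash(private_buf) != expected)
		return -1; /* genuine corruption */

	/* The device data is fine; hand the verified copy to the caller. */
	memcpy(user_buf, private_buf, BLOCK_SIZE);
	return 0;
}
```

The key design point is that the second verification runs on memory the caller cannot touch, so a mismatch there can only mean the stored data is bad.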
2024-02-20dm-integrity: recheck the integrity tag after a failureMikulas Patocka1-9/+84
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-integrity reports an error [1]. The error is reported in a log and it may cause a RAID leg to be kicked out of the array. This commit fixes dm-integrity, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <[email protected]> Cc: [email protected] Signed-off-by: Mike Snitzer <[email protected]>
2024-02-20treewide: replace or remove redundant def_bool in Kconfig filesMasahiro Yamada1-1/+0
'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <[email protected]>
2024-02-19bcache: pass queue_limits to blk_mq_alloc_diskChristoph Hellwig1-22/+24
Pass the queue limits directly to blk_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-19block: pass a queue_limits argument to blk_alloc_diskChristoph Hellwig3-7/+8
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Himanshu Madhani <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
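The difference between returning NULL and returning an ERR_PTR can be shown with a userspace re-creation of the kernel's error-pointer convention. The three helpers below mirror the well-known kernel macros but are reimplemented here for illustration; `struct disk` and `alloc_disk_sketch` are hypothetical and are not blk_alloc_disk().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in the top page of the address space. */
static inline void *ERR_PTR(intptr_t error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)error;
}

static inline intptr_t PTR_ERR(const void *ptr)
{
	return (intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct disk {
	int node;
};

/* Hypothetical allocator: the caller can tell *why* it failed. */
struct disk *alloc_disk_sketch(int node, int limits_valid)
{
	struct disk *d;

	if (!limits_valid)
		return ERR_PTR(-EINVAL);
	d = malloc(sizeof(*d));
	if (d == NULL)
		return ERR_PTR(-ENOMEM);
	d->node = node;
	return d;
}
```

A caller checks `IS_ERR()` and propagates `PTR_ERR()`, so invalid limits and an allocation failure are no longer collapsed into the same NULL.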
2024-02-16Merge tag 'md-6.9-20240216' of ↵Jens Axboe4-85/+18
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block
Pull MD changes from Song:
"1. Cleanup redundant checks, by Yu Kuai.
 2. Remove deprecated headers, by Marc Zyngier and Song Liu.
 3. Concurrency fixes, by Li Lingfeng.
 4. Memory leak fix, by Li Nan."
* tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md: fix kmemleak of rdev->serial
  md/multipath: Remove md-multipath.h
  md/linear: Get rid of md-linear.h
  md: use RCU lock to protect traversal in md_spares_need_change()
  md: get rdev->mddev with READ_ONCE()
  md: remove redundant md_wakeup_thread()
  md: remove redundant check of 'mddev->sync_thread'
2024-02-15md: Don't suspend the array for interrupted reshapeYu Kuai1-4/+9
md_start_sync() will suspend the array if there are spares that can be added to or removed from conf. However, if reshape is still in progress, this won't happen at all, or data will be corrupted (remove_and_add_spares won't be called from md_choose_sync_action for reshape), hence there is no need to suspend the array if reshape is not done yet.
Meanwhile, there is a potential deadlock for raid456:
1) reshape is interrupted;
2) set one of the disks WantReplacement, and add a new disk to the array; however, recovery won't start until the reshape is finished;
3) then issue an IO across the reshape position; this IO will wait for reshape to make progress;
4) continue to reshape; md_start_sync() then finds there is a spare disk that can be added to conf, and mddev_suspend() is called;
Steps 4 and 3 are waiting for each other, so a deadlock is triggered. Note that this problem was found by code review and has not been reproduced yet.
Fix this problem by not suspending the array for an interrupted reshape. This is safe because conf won't be changed until reshape is done.
Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-15md: Don't register sync_thread for reshape directlyYu Kuai3-42/+8
Currently, if reshape is interrupted, reassembling the array will register sync_thread directly from pers->run(). In this case 'MD_RECOVERY_RUNNING' is set directly; however, there is no guarantee that md_do_sync() will be executed, hence stop_sync_thread() will hang because 'MD_RECOVERY_RUNNING' can't be cleared.
The last patch makes sure that md_do_sync() will set MD_RECOVERY_DONE; however, the following hang can still be triggered occasionally by the dm-raid test shell/lvconvert-raid-reshape.sh:
[root@fedora ~]# cat /proc/1982/stack
[<0>] stop_sync_thread+0x1ab/0x270 [md_mod]
[<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod]
[<0>] raid_presuspend+0x1e/0x70 [dm_raid]
[<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod]
[<0>] __dm_destroy+0x2a5/0x310 [dm_mod]
[<0>] dm_destroy+0x16/0x30 [dm_mod]
[<0>] dev_remove+0x165/0x290 [dm_mod]
[<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod]
[<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod]
[<0>] vfs_ioctl+0x21/0x60
[<0>] __x64_sys_ioctl+0xb9/0xe0
[<0>] do_syscall_64+0xc6/0x230
[<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74
Meanwhile mddev->recovery is: MD_RECOVERY_RUNNING | MD_RECOVERY_INTR | MD_RECOVERY_RESHAPE | MD_RECOVERY_FROZEN
Fix this problem by removing the code that registers sync_thread directly from raid10 and raid5, and let md_check_recovery() register sync_thread instead.
Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-15md: Make sure md_do_sync() will set MD_RECOVERY_DONEYu Kuai1-4/+8
stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must set MD_RECOVERY_DONE, so that the follow-up md_check_recovery() will unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up stop_sync_thread(). If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4 ("md: fix stopping sync thread"), dm-raid switched from md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread from md_stop() and md_stop_writes(), causing the test shell/lvconvert-raid-reshape.sh to hang. We shouldn't switch back to md_reap_sync_thread() because it's problematic in the first place. Fix the problem by making sure md_do_sync() will set MD_RECOVERY_DONE. Reported-by: Mikulas Patocka <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-15md: Don't ignore read-only array in md_check_recovery()Yu Kuai1-13/+18
Usually if the array is not read-write, md_check_recovery() won't register a new sync_thread in the first place. And if the array is read-write and sync_thread is registered, md_set_readonly() will unregister sync_thread before setting the array read-only. md/raid follows this behavior, hence there is no problem.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following hang can be triggered by test shell/integrity-caching.sh:
1) array is read-only. dm-raid updates the super block:
   rs_update_sbs
    ro = mddev->ro
    mddev->ro = 0
     -> set array read-write
    md_update_sb
2) register new sync thread concurrently.
3) dm-raid sets the array back to read-only:
   rs_update_sbs
    mddev->ro = ro
4) stop the array:
   raid_dtr
    md_stop
     stop_sync_thread
      set_bit(MD_RECOVERY_INTR, &mddev->recovery);
      md_wakeup_thread_directly(mddev->sync_thread);
      wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
5) sync thread done:
   md_do_sync
    set_bit(MD_RECOVERY_DONE, &mddev->recovery);
    md_wakeup_thread(mddev->thread);
6) daemon thread can't unregister sync thread:
   md_check_recovery
    if (!md_is_rdwr(mddev) && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
     return;
    -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hangs;
The root cause is that dm-raid manipulates 'mddev->ro' by itself; however, dm-raid really should stop the sync thread before setting the array read-only. Unfortunately, I need to read more code before I can refactor the handler of 'mddev->ro' in dm-raid, hence let's fix the problem the easy way for now to prevent a dm-raid regression.
Reported-by: Mikulas Patocka <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-15md: Don't ignore suspended array in md_check_recovery()Yu Kuai1-3/+0
mddev_suspend() never stops sync_thread, hence it doesn't make sense to ignore a suspended array in md_check_recovery(); doing so might leave sync_thread unable to be unregistered.
After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following hang can be triggered by test shell/integrity-caching.sh:
1) suspend the array:
   raid_postsuspend
    mddev_suspend
2) stop the array:
   raid_dtr
    md_stop
     __md_stop_writes
      stop_sync_thread
       set_bit(MD_RECOVERY_INTR, &mddev->recovery);
       md_wakeup_thread_directly(mddev->sync_thread);
       wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
3) sync thread done:
   md_do_sync
    set_bit(MD_RECOVERY_DONE, &mddev->recovery);
    md_wakeup_thread(mddev->thread);
4) daemon thread can't unregister sync thread:
   md_check_recovery
    if (mddev->suspended)
     return; -> return directly
    md_reap_sync_thread
     clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
    -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hangs;
This problem is not just related to dm-raid; fix it by no longer ignoring a suspended array in md_check_recovery(). Follow-up patches will improve dm-raid to freeze the sync thread during suspend.
Reported-by: Mikulas Patocka <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 68866e425be2 ("MD: no sync IO while suspended") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: [email protected] # v6.7+ Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
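The hang in this commit comes down to an early return that skips the only code able to clear a flag someone else is waiting on. The model below is a deliberately tiny sketch of that interaction: the struct, the bit values, and the function names are hypothetical, and the real md_check_recovery() does far more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; only the relationship between them matters. */
enum {
	RECOVERY_RUNNING = 1u << 0,
	RECOVERY_DONE    = 1u << 1,
};

struct mddev_sketch {
	unsigned int recovery;
	bool suspended;
};

/* Models the daemon thread. With skip_when_suspended set, it behaves like
 * the old code: a suspended array returns before the sync thread is reaped. */
void check_recovery_sketch(struct mddev_sketch *m, bool skip_when_suspended)
{
	assert(m != NULL);
	if (skip_when_suspended && m->suspended)
		return;

	/* Reap a finished sync thread: this is what clears RUNNING. */
	if ((m->recovery & RECOVERY_RUNNING) && (m->recovery & RECOVERY_DONE))
		m->recovery &= ~(RECOVERY_RUNNING | RECOVERY_DONE);
}

/* Models the wait_event() condition in stop_sync_thread(). */
bool stop_sync_thread_can_return(const struct mddev_sketch *m)
{
	assert(m != NULL);
	return !(m->recovery & RECOVERY_RUNNING);
}
```

With the early return in place the waiter's condition never becomes true for a suspended array; without it, the finished sync thread is reaped and the waiter proceeds.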
2024-02-12md: fix kmemleak of rdev->serialLi Nan1-0/+1
If kobject_add() fails in bind_rdev_to_array(), 'rdev->serial' has already been allocated but is never freed, and kmemleak reports the leak:

unreferenced object 0xffff88815a350000 (size 49152):
  comm "mdadm", pid 789, jiffies 4294716910
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc f773277a):
    [<0000000058b0a453>] kmemleak_alloc+0x61/0xe0
    [<00000000366adf14>] __kmalloc_large_node+0x15e/0x270
    [<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f
    [<00000000f206d60a>] kvmalloc_node+0x74/0x150
    [<0000000034bf3363>] rdev_init_serial+0x67/0x170
    [<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220
    [<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630
    [<0000000073c28560>] md_add_new_disk+0x400/0x9f0
    [<00000000770e30ff>] md_ioctl+0x15bf/0x1c10
    [<000000006cfab718>] blkdev_ioctl+0x191/0x3f0
    [<0000000085086a11>] vfs_ioctl+0x22/0x60
    [<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0
    [<00000000e54e675e>] do_syscall_64+0x71/0x150
    [<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74

Fixes: 963c555e75b0 ("md: introduce mddev_create/destroy_wb_pool for the change of member device")
Signed-off-by: Li Nan <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-02-12block: remove gfp_flags from blkdev_zone_mgmtJohannes Thumshirn1-1/+1
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2024-02-12dm: dm-zoned: guard blkdev_zone_mgmt with noio scopeJohannes Thumshirn1-1/+4
Guard the calls to blkdev_zone_mgmt() with a memalloc_noio scope. This helps us get rid of the GFP_NOIO argument to blkdev_zone_mgmt(). Signed-off-by: Johannes Thumshirn <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
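The idea of replacing a per-call GFP argument with an allocation scope can be illustrated with a small userspace model. Everything here is an assumption made for illustration: `model_task_flags` stands in for the per-task flags word, `MODEL_PF_NOIO` for the no-I/O flag, and the `model_*` helpers only mimic the save/restore shape of the kernel's scope API; none of it is the dm-zoned code.

```c
#include <stdbool.h>

#define MODEL_PF_NOIO 0x1u	/* hypothetical stand-in for the task's no-I/O flag */

/* Models the flags word of a single task. */
unsigned int model_task_flags;

/* Enter a no-I/O scope and return the previous state of the flag. */
unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOIO;

	model_task_flags |= MODEL_PF_NOIO;
	return old;
}

/* Leave the scope, putting the flag back to what save() returned. */
void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOIO) | old;
}

/* What an allocation deep in the call chain would consult. */
bool model_alloc_may_do_io(void)
{
	return (model_task_flags & MODEL_PF_NOIO) == 0;
}

/*
 * Models a guarded call site: the callee takes no gfp argument, yet any
 * allocation it performs sees the no-I/O restriction.
 */
bool model_zone_mgmt_guarded(void)
{
	unsigned int old = model_noio_save();
	bool may_io = model_alloc_may_do_io();

	model_noio_restore(old);
	return may_io;
}
```

Because restore takes the value returned by save, scopes nest correctly: an inner scope ending does not lift a restriction set by an outer one.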
2024-02-07md: Fix missing release of 'active_io' for flushYu Kuai1-1/+5
submit_flushes
 atomic_set(&mddev->flush_pending, 1);
 rdev_for_each_rcu(rdev, mddev)
  atomic_inc(&mddev->flush_pending);
  bi->bi_end_io = md_end_flush
  submit_bio(bi);
                        /* flush io is done first */
                        md_end_flush
                         if (atomic_dec_and_test(&mddev->flush_pending))
                          percpu_ref_put(&mddev->active_io)
                          -> active_io is not released

 if (atomic_dec_and_test(&mddev->flush_pending))
  -> missing release of active_io

As a consequence, mddev_suspend() will wait forever for 'active_io' to drop to zero.

Fix this problem by releasing 'active_io' in submit_flushes() if 'flush_pending' is decreased to zero.

Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
Cc: [email protected] # v6.1+
Reported-by: Blazej Kucman <[email protected]>
Closes: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
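The counting problem in this commit can be reproduced with C11 atomics in userspace. This is a rough model, not the md code: `struct flush_model` and the `model_*` functions are hypothetical, a plain atomic counter stands in for the percpu reference, and the flush bios are completed synchronously to force the "flush io is done first" ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct flush_model {
	atomic_int flush_pending;	/* bias of 1 plus one per flush bio */
	atomic_int active_io;		/* stands in for the percpu reference */
};

/* Completion side for one bio: whoever drops the count to zero releases active_io. */
void model_end_flush(struct flush_model *m)
{
	if (atomic_fetch_sub(&m->flush_pending, 1) == 1)
		atomic_fetch_sub(&m->active_io, 1);
}

/*
 * Submission side. All bios complete before the submitter drops its
 * bias, so the submitter is the one that reaches zero. 'fixed' selects
 * whether that path also releases active_io.
 */
void model_submit_flushes(struct flush_model *m, int nr_rdevs, bool fixed)
{
	atomic_store(&m->flush_pending, 1);		/* submitter's bias */
	for (int i = 0; i < nr_rdevs; i++)
		atomic_fetch_add(&m->flush_pending, 1);	/* one per flush bio */
	for (int i = 0; i < nr_rdevs; i++)
		model_end_flush(m);			/* flush io is done first */
	if (atomic_fetch_sub(&m->flush_pending, 1) == 1 && fixed)
		atomic_fetch_sub(&m->active_io, 1);
}
```

Starting with `active_io` at 1, the unfixed variant leaves it at 1 after all flushes finish, which is the state a suspend would wait on forever; the fixed variant brings it to 0.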
2024-02-05md/multipath: Remove md-multipath.hSong Liu1-32/+0
md-multipath is already deprecated. Remove the header file. Signed-off-by: Song Liu <[email protected]>
2024-02-05md/linear: Get rid of md-linear.hMarc Zyngier1-17/+0
Given that 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR") killed the linear flavour of MD, it seems only logical to drop the leftover include file that used to come with it. I also feel that it should be my own privilege to remove my 30 year old attempt at writing kernel code ;-). RIP! Cc: Song Liu <[email protected]> Cc: Yu Kuai <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-05md: use RCU lock to protect traversal in md_spares_need_change()Li Lingfeng1-2/+7
Since md_start_sync() will be called without the protection of mddev_lock, and it can run concurrently with array reconfiguration, traversal of rdev in it should be protected by RCU lock. Commit bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") added md_spares_need_change() to md_start_sync(), causing use of rdev without any protection. Fix this by adding RCU lock in md_spares_need_change(). Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: [email protected] # 6.7+ Signed-off-by: Li Lingfeng <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-05md: get rdev->mddev with READ_ONCE()Li Lingfeng1-2/+2
Users may read rdev->mddev through sysfs while the rdev is being released, so use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing and to read/write mddev atomically. Signed-off-by: Li Lingfeng <[email protected]> Reviewed-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-05md: remove redundant md_wakeup_thread()Yu Kuai1-18/+2
On the one hand, mddev_unlock() will call md_wakeup_thread() unconditionally; on the other hand, md_check_recovery() can't make progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't make sense to wake up the daemon thread while 'reconfig_mutex' is still grabbed. Remove all the md_wakeup_thread() calls for 'mddev->thread' made while 'reconfig_mutex' is still grabbed. Signed-off-by: Yu Kuai <[email protected]> Signed-off-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-02-05md: remove redundant check of 'mddev->sync_thread'Yu Kuai2-14/+6
The lifetime of sync_thread:

1) Set MD_RECOVERY_NEEDED and wake up daemon thread (by ioctl/sysfs or other events);

2) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_NEEDED is set:
 a) try to grab reconfig_mutex;
 b) set MD_RECOVERY_RUNNING;
 c) clear MD_RECOVERY_NEEDED, and then queue sync_work;

3) md_start_sync() choose sync_action, then register sync_thread;

4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up daemon thread;

5) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_DONE is set:
 a) try to grab reconfig_mutex;
 b) unregister sync_thread;
 c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE;

Hence there is no such case that MD_RECOVERY_RUNNING is not set, while sync_thread is registered.

Signed-off-by: Yu Kuai <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
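The five lifetime steps above can be written as a tiny state machine to check the stated invariant. This is a simplified sketch: `struct sync_model`, the `F_*` bits and the `model_*` functions are hypothetical names, locking is left out entirely, and refusing to start while RUNNING is set is a simplification chosen for the model.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MD_RECOVERY_NEEDED / RUNNING / DONE. */
enum {
	F_NEEDED  = 1 << 0,
	F_RUNNING = 1 << 1,
	F_DONE    = 1 << 2,
};

struct sync_model {
	unsigned int recovery;	/* bitmask of F_* flags */
	bool sync_thread;	/* true while a sync thread is registered */
};

/* Step 1: something requests a sync. */
void model_request_sync(struct sync_model *m)
{
	m->recovery |= F_NEEDED;
}

/* Step 2: daemon pass sees NEEDED, sets RUNNING, clears NEEDED. */
bool model_check_needed(struct sync_model *m)
{
	if (!(m->recovery & F_NEEDED) || (m->recovery & F_RUNNING))
		return false;
	m->recovery |= F_RUNNING;
	m->recovery &= ~(unsigned int)F_NEEDED;
	return true;	/* the caller would now queue sync_work */
}

/* Step 3: the queued work registers the sync thread. */
void model_start_sync(struct sync_model *m)
{
	m->sync_thread = true;
}

/* Step 4: the sync thread finishes and sets DONE. */
void model_sync_done(struct sync_model *m)
{
	m->recovery |= F_DONE;
}

/* Step 5: daemon pass sees DONE, unregisters, clears RUNNING and DONE. */
void model_reap(struct sync_model *m)
{
	if (!(m->recovery & F_DONE))
		return;
	m->sync_thread = false;
	m->recovery &= ~(unsigned int)(F_RUNNING | F_DONE);
}

/* The invariant from the commit message: registered implies RUNNING. */
bool model_invariant_holds(const struct sync_model *m)
{
	return !m->sync_thread || (m->recovery & F_RUNNING) != 0;
}
```

Walking the steps in order keeps `model_invariant_holds()` true at every point, which is why a separate check of the sync thread pointer adds nothing once the RUNNING bit has been tested.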
2024-02-02dm-crypt, dm-verity: disable taskletsMikulas Patocka3-61/+4
Tasklets have an inherent problem with memory corruption. The function tasklet_action_common calls tasklet_trylock, then it calls the tasklet callback and then it calls tasklet_unlock. If the tasklet callback frees the structure that contains the tasklet or if it calls some code that may free it, tasklet_unlock will write into free memory. The commits 8e14f610159d and d9a02e016aaf try to fix it for dm-crypt, but it is not a sufficient fix and the data corruption can still happen [1]. There is no fix for dm-verity and dm-verity will write into free memory with every tasklet-processed bio. There will be atomic workqueues implemented in the kernel 6.9 [2]. They will have better interface and they will not suffer from the memory corruption problem. But we need something that stops the memory corruption now and that can be backported to the stable kernels. So, I'm proposing this commit that disables tasklets in both dm-crypt and dm-verity. This commit doesn't remove the tasklet support, because the tasklet code will be reused when atomic workqueues will be implemented. [1] https://lore.kernel.org/all/[email protected]/T/ [2] https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Mikulas Patocka <[email protected]> Cc: [email protected] Fixes: 39d42fa96ba1b ("dm crypt: add flags to optionally bypass kcryptd workqueues") Fixes: 5721d4e5a9cdb ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <[email protected]>
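The lock, callback, unlock ordering described above can be sketched safely in userspace by marking an object as released instead of really freeing it. All names below (`struct model_tasklet`, `struct model_io`, the `model_*` functions) are hypothetical; this is not the kernel's tasklet implementation, only the shape of the problem.

```c
#include <stdbool.h>

struct model_tasklet {
	bool locked;					/* models the tasklet's run/lock bit */
	void (*callback)(struct model_tasklet *t);
};

/* An I/O object that embeds its tasklet as the first member. */
struct model_io {
	struct model_tasklet tasklet;
	bool freed;		/* set instead of actually calling free() */
};

/* Callback that completes the I/O and releases the containing object. */
void model_io_complete_and_free(struct model_tasklet *t)
{
	struct model_io *io = (struct model_io *)t;	/* valid: first member */

	io->freed = true;
}

/* Callback that leaves the object alive. */
void model_io_complete_only(struct model_tasklet *t)
{
	(void)t;
}

/*
 * Models the runner: trylock, run the callback, unlock. Returns true
 * when the unlock step touched an object the callback had already
 * released, which in real code would be a write into freed memory.
 */
bool model_tasklet_run(struct model_io *io)
{
	struct model_tasklet *t = &io->tasklet;
	bool wrote_after_free;

	if (t->locked)
		return false;		/* trylock failed, nothing ran */
	t->locked = true;
	t->callback(t);
	wrote_after_free = io->freed;
	t->locked = false;		/* the unlock write happens regardless */
	return wrote_after_free;
}
```

The runner cannot know whether the callback released the object, so the final store is unconditional. Running the work from a context that does not touch the object after the callback returns avoids the problem.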
2024-01-30dm writecache: allow allocations larger than 2GiBMikulas Patocka1-4/+4
The function kvmalloc_node limits the allocation size to INT_MAX. This limit will be overflowed if dm-writecache attempts to map a device with 1TiB or larger length. This commit changes kvmalloc_array to vmalloc_array to avoid the limit. The commit also changes vmalloc(array_size()) to vmalloc_array(). Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2024-01-30dm stats: limit the number of entriesMikulas Patocka1-0/+9
The kvmalloc function fails with a warning if the size is larger than INT_MAX. Linus said that there should be limits that prevent this warning from being hit. This commit adds the limits to the dm-stats subsystem in DM core. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
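A caller-side guard for the INT_MAX ceiling mentioned above might look like the following sketch. The function name is hypothetical and this is not the check added to dm-stats; it only shows how to test an element count against the ceiling without overflowing the intermediate product.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when n elements of elem_size bytes stay at or below
 * INT_MAX bytes in total. Dividing rather than multiplying means the
 * comparison cannot overflow, even for very large n.
 */
bool model_array_fits_int_max(uint64_t n, uint64_t elem_size)
{
	if (elem_size == 0)
		return true;
	return n <= (uint64_t)INT_MAX / elem_size;
}
```

Rejecting oversized requests up front lets the caller return an error to user space instead of reaching the allocator and triggering its warning.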
2024-01-30dm: limit the number of targets and parameter size areaMikulas Patocka3-3/+11
The kvmalloc function fails with a warning if the size is larger than INT_MAX. The warning was triggered by a syscall testing robot. In order to avoid the warning, this commit limits the number of targets to 1048576 and the size of the parameter area to 1073741824. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
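The two limits quoted above can be expressed as a small validation helper. The macro and function names are hypothetical, and treating the quoted values as inclusive upper bounds is an assumption; the commit message gives only the numbers.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the commit message; the names are made up for this sketch. */
#define MODEL_MAX_TARGETS	UINT64_C(1048576)	/* 2^20 */
#define MODEL_MAX_PARAM_SIZE	UINT64_C(1073741824)	/* 2^30 */

/* Accept a request only when both quantities are within the limits. */
bool model_table_args_ok(uint64_t num_targets, uint64_t param_size)
{
	return num_targets <= MODEL_MAX_TARGETS &&
	       param_size <= MODEL_MAX_PARAM_SIZE;
}
```

The parameter-area limit is 2^30 bytes, which is below INT_MAX on platforms with a 32-bit int, so a request that passes this check cannot exceed the allocator's size ceiling through the parameter area alone.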