path: root/drivers/md
Age  Commit message  [Author, files changed, -deleted/+added lines]
2020-06-05  dm zoned: improve logging messages for reclaim  [Hannes Reinecke, 1 file, -3/+10]
Instead of just reporting the errno, add more verbose debugging messages in the reclaim path. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-06-05  dm zoned: avoid unnecessary device recalculation for secondary superblock  [Hannes Reinecke, 1 file, -3/+2]
The secondary superblock must reside on the same device as the primary superblock, so there is no need to re-calculate the device. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-06-05  dm zoned: add debugging message for reading superblocks  [Hannes Reinecke, 1 file, -0/+4]
Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-06-05  dm ebs: use dm_bufio_forget_buffers  [Mikulas Patocka, 1 file, -2/+2]
Use dm_bufio_forget_buffers instead of a block-by-block loop that calls dm_bufio_forget. dm_bufio_forget_buffers is faster than the loop because it searches for used buffers via the rb-tree. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
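For illustration, a minimal sketch of the change this describes; the dm-bufio calls are the real API, but the context pointer name (ec->bufio) and the loop shape are assumptions rather than the exact dm-ebs code:

    /* Before: one rb-tree lookup per block. */
    sector_t i;

    for (i = 0; i < n_blocks; i++)
            dm_bufio_forget(ec->bufio, block + i);

    /* After: a single call that walks the client's rb-tree once for the
     * whole [block, block + n_blocks) range. */
    dm_bufio_forget_buffers(ec->bufio, block, n_blocks);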
2020-06-05  dm bufio: introduce forget_buffer_locked  [Mikulas Patocka, 1 file, -4/+56]
Introduce a function forget_buffer_locked that forgets a range of buffers. It is more efficient than calling forget_buffer in a loop. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-06-05  dm bufio: clean up rbtree block ordering  [Mikulas Patocka, 1 file, -3/+3]
dm-bufio used unnatural ordering in its rb-tree: blocks with smaller numbers were put in the right subtree and blocks with bigger numbers in the left subtree. Reverse that logic so the ordering is natural. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
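A sketch of what the natural ordering looks like on the lookup path; structure and field names here are illustrative, not the exact dm-bufio code:

    static struct dm_buffer *lookup(struct rb_node *n, sector_t block)
    {
            while (n) {
                    struct dm_buffer *b = container_of(n, struct dm_buffer, node);

                    if (b->block == block)
                            return b;
                    /* smaller block numbers now descend left, larger ones right */
                    n = block < b->block ? n->rb_left : n->rb_right;
            }
            return NULL;
    }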
2020-06-04  dm bufio: delete unused and inefficient dm_bufio_discard_buffers  [Mikulas Patocka, 1 file, -26/+0]
There is no user of this interface. If it is needed in the future, it can be reimplemented to walk the rbtree of buffers instead of doing block-by-block lookups. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-06-02  Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block  [Linus Torvalds, 18 files, -153/+299]
Pull block driver updates from Jens Axboe: "On top of the core changes, here are the block driver changes for this merge window:

 - NVMe changes:
     - NVMe over Fibre Channel protocol updates, which also reach over to drivers/scsi/lpfc (James Smart)
     - namespace revalidation support on the target (Anthony Iliopoulos)
     - gcc zero length array fix (Arnd Bergmann)
     - nvmet cleanups (Chaitanya Kulkarni)
     - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
     - use a SRQ per completion vector (Max Gurtovoy)
     - fix handling of runtime changes to the queue count (Weiping Zhang)
     - t10 protection information support for nvme-rdma and nvmet-rdma (Israel Rukshin and Max Gurtovoy)
     - target side AEN improvements (Chaitanya Kulkarni)
     - various fixes and minor improvements all over, including the nvme part of the lpfc driver
 - Floppy code cleanup series (Willy, Denis)
 - Floppy contention fix (Jiri)
 - Loop CONFIGURE support (Martijn)
 - bcache fixes/improvements (Coly, Joe, Colin)
 - q->queuedata cleanups (Christoph)
 - Get rid of ioctl_by_bdev (Christoph, Stefan)
 - md/raid5 allocation fixes (Coly)
 - zero length array fixes (Gustavo)
 - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registration to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02  Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block  [Linus Torvalds, 7 files, -45/+27]
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release:

 - Remove dead blk-throttle and blk-wbt code (Guoqing)
 - Include pid in blktrace note traces (Jan)
 - Don't spew I/O errors on wouldblock termination (me)
 - Zone append addition (Johannes, Keith, Damien)
 - IO accounting improvements (Konstantin, Christoph)
 - blk-mq hardware map update improvements (Ming)
 - Scheduler dispatch improvement (Salman)
 - Inline block encryption support (Satya)
 - Request map fixes and improvements (Weiping)
 - blk-iocost tweaks (Tejun)
 - Fix for timeout failing with error injection (Keith)
 - Queue re-run fixes (Douglas)
 - CPU hotplug improvements (Christoph)
 - Queue entry/exit improvements (Christoph)
 - Move DMA drain handling to the few drivers that use it (Christoph)
 - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02  mm: remove the pgprot argument to __vmalloc  [Christoph Hellwig, 1 file, -2/+2]
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Michael Kelley <[email protected]> [hyperv] Acked-by: Gao Xiang <[email protected]> [erofs] Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Wei Liu <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-02  md: remove __clear_page_buffers and use attach/detach_page_private  [Guoqing Jiang, 1 file, -10/+2]
After the introduction of attach/detach_page_private in pagemap.h, we can remove the duplicated code and call the new functions. Signed-off-by: Guoqing Jiang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Song Liu <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
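A sketch of the helper pattern referred to above; the pagemap.h helpers are the real API, while the surrounding md call site is paraphrased:

    /* was: get_page(page); set_page_private(page, (unsigned long)data);
     *      SetPagePrivate(page); */
    attach_page_private(page, data);

    /* was: the open-coded __clear_page_buffers() doing the reverse */
    data = detach_page_private(page);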
2020-05-29  blk-mq: drain I/O when all CPUs in a hctx are offline  [Ming Lei, 1 file, -1/+1]
Most blk-mq drivers depend on managed IRQs' auto-affinity to set up their queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, the current blk-mq implementation doesn't quiesce the hw queue before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers, avoiding a potential dead-lock. This is safe for stacking drivers as they do not use interrupts at all and their I/O completions are triggered by the underlying devices' I/O completions. [1] https://lore.kernel.org/linux-block/[email protected]/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
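A sketch of how a stacking driver opts out via the new flag; shown for the dm-rq tag set, with the exact assignment site paraphrased:

    md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;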
2020-05-27  dm: use bio_{start,end}_io_acct  [Christoph Hellwig, 1 file, -7/+2]
Switch dm to use the nicer bio accounting helpers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konstantin Khlebnikov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
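A sketch of the accounting pattern used by this conversion (and the bcache one below); the helpers are the 5.8 block-layer API, and the surrounding driver code is paraphrased:

    unsigned long start_time;

    start_time = bio_start_io_acct(bio);   /* account the start, sample the time */
    /* ... submit the bio and handle its completion ... */
    bio_end_io_acct(bio, start_time);      /* account completion against that start */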
2020-05-27  bcache: use bio_{start,end}_io_acct  [Christoph Hellwig, 1 file, -14/+4]
Switch bcache to use the nicer bio accounting helpers, and call the routines where we also sample the start time to give coherent accounting results. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konstantin Khlebnikov <[email protected]> Acked-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-05-27  bcache: configure the asynchronous registration to be experimental  [Coly Li, 2 files, -0/+11]
In order to avoid the experimental async registration interface being treated as a new kernel ABI by common users, this patch hides it behind an experimental kernel config option, BCACHE_ASYNC_REGISTRAION. This interface is for situations with extremely large amounts of cached data, to make sure the bcache device can always be created without hitting the udev timeout issue. For normal users, async or sync registration makes no difference. In the future, when we decide to use asynchronous registration as the default behavior, this experimental interface may be removed. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-05-27  bcache: asynchronous devices registration  [Coly Li, 1 file, -0/+100]
When there is a lot of data cached on the cache device, the bcache internal btree can take a very long time to validate during backing device and cache device registration. In my test it may take 55+ minutes to check all the internal btree nodes. The problem is that the registration is invoked by udev rules and udevd has a 180-second timeout by default. If the btree node checking time is longer than the udevd timeout, the registering process will be killed by udevd with SIGKILL. If the registering process has a pending signal, creating the kthread for bcache will fail and the device registration will fail. The result is that, for a bcache device with a lot of data cached on the cache device, the device node /dev/bcache<N> won't always be created, due to the very long btree checking time. A solution to avoid the udevd 180-second timeout is to register devices in an asynchronous way: after writing the cache or backing device path into /sys/fs/bcache/register_async, the kernel code creates a kworker and moves all the btree node checking (for a cache device) or dirty data counting (for a cached device) into the kworker context. The kworker is scheduled on system_wq and the registration code returns to the user space udev rule task right away. This way the udev task for the bcache rule completes in seconds; no matter how much time is spent in the kworker context, it won't be killed by udevd for a timeout. After all the checking and counting are done asynchronously in the kworker, the bcache device will eventually be created successfully. This patch makes the above change and adds a register sysfs file, /sys/fs/bcache/register_async. Writing the registering device path into this sysfs file will do the asynchronous registration. The register_async interface is for very rare conditions and won't be used by common users. In the future I plan to make asynchronous registration the default behavior, depending on the feedback for this patch. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
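A rough sketch of the asynchronous path described above; the struct, field, and helper names here are illustrative, not the exact bcache code:

    struct async_reg_args {
            struct work_struct reg_work;
            char *path;                     /* device path written to register_async */
    };

    static void register_device_async(struct work_struct *work)
    {
            struct async_reg_args *args =
                    container_of(work, struct async_reg_args, reg_work);

            /* the slow part: btree node checking (cache device) or
             * dirty data counting (backing device) */
            do_registration(args->path);    /* hypothetical helper */
            kfree(args->path);
            kfree(args);
    }

    /* in the register_async sysfs store handler: queue and return at once,
     * so udev is not held up for the duration of the btree check */
    INIT_WORK(&args->reg_work, register_device_async);
    queue_work(system_wq, &args->reg_work);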
2020-05-27  bcache: fix refcount underflow in bcache_device_free()  [Coly Li, 1 file, -2/+5]
The problematic code piece in bcache_device_free() is:

    785 static void bcache_device_free(struct bcache_device *d)
    786 {
    787     struct gendisk *disk = d->disk;
        [snipped]
    799     if (disk) {
    800         if (disk->flags & GENHD_FL_UP)
    801             del_gendisk(disk);
    802
    803         if (disk->queue)
    804             blk_cleanup_queue(disk->queue);
    805
    806         ida_simple_remove(&bcache_device_idx,
    807                           first_minor_to_idx(disk->first_minor));
    808         put_disk(disk);
    809     }
        [snipped]
    816 }

At line 808, put_disk(disk) may encounter an underflow of the kobject refcount of 'disk'. Here is how to reproduce the issue:
 - Attach the backing device to a cache device and do random writes to make the cache dirty.
 - Stop the bcache device while the cache device has dirty data for the backing device.
 - Only register the backing device back; do NOT register the cache device.
 - The bcache device node /dev/bcache0 won't show up, because the backing device waits for the cache device to show up with the missing dirty data.
 - Now echo 1 into /sys/fs/bcache/pendings_cleanup to stop the pending backing device.
 - After the pending backing device is stopped, use 'dmesg' to check the kernel messages: a use-after-free warning from KASAN reports that the refcount of the kobject linked to the 'disk' has underflowed.

The refcount dropped at line 808 in the above code piece is added by add_disk(d->disk) in bch_cached_dev_run(). But in the above scenario the cache device is not registered, bch_cached_dev_run() has no chance to be called, and the refcount is never added. Calling put_disk() on a gendisk kobject whose refcount was never added triggers an underflow warning. This patch checks whether GENHD_FL_UP is set in disk->flags; if it is not set then the bcache device was not added, so don't call put_disk() and the underflow issue is avoided. Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
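A sketch of the fix as described; the guard variable name is illustrative:

    if (disk) {
            bool disk_added = disk->flags & GENHD_FL_UP;

            if (disk_added)
                    del_gendisk(disk);
            if (disk->queue)
                    blk_cleanup_queue(disk->queue);
            ida_simple_remove(&bcache_device_idx,
                              first_minor_to_idx(disk->first_minor));
            /* only drop the reference taken by add_disk() if it was added */
            if (disk_added)
                    put_disk(disk);
    }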
2020-05-27  bcache: Convert pr_<level> uses to a more typical style  [Joe Perches, 10 files, -109/+110]
Remove the trailing newline from the define of pr_fmt and add newlines to the uses.

Miscellanea:
 o Convert bch_bkey_dump from multiple uses of pr_err to pr_cont, as the earlier conversion was inappropriately done, causing multiple lines to be emitted where only a single output line was desired
 o Use the vsprintf extension %pV in bch_cache_set_error to avoid multiple-line output where only a single line of output was desired
 o Coalesce formats

Fixes: 6ae63e3501c4 ("bcache: replace printk() by pr_*() routines") Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-05-27  bcache: remove redundant variables i and n  [Colin Ian King, 1 file, -2/+0]
Variables i and n are being assigned but are never used. They are redundant and can be removed. Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Coly Li <[email protected]> Addresses-Coverity: ("Unused value") Signed-off-by: Jens Axboe <[email protected]>
2020-05-22  dm zoned: remove leftover hunk for switching to sequential zones  [Hannes Reinecke, 1 file, -8/+0]
Remove a leftover hunk to switch from random zones to sequential zones when selecting a reclaim zone; the logic has moved into the caller and this hunk is now pointless. Fixes: 34f5affd04c4 ("dm zoned: separate random and cache zones") Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-22  block: remove the error_sector argument to blkdev_issue_flush  [Christoph Hellwig, 3 files, -5/+5]
The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
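After this change a typical caller in drivers/md looks like the sketch below; the gfp flag is just an example:

    ret = blkdev_issue_flush(bdev, GFP_NOIO);
    /* previously: blkdev_issue_flush(bdev, GFP_NOIO, &error_sector),
     * with error_sector never consumed by any caller */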
2020-05-20  dm zoned: terminate reclaim on congestion  [Hannes Reinecke, 3 files, -2/+9]
When dmz_get_chunk_mapping() selects a zone which is under reclaim we should terminate the reclaim copy process. Since we're changing the zone itself, reclaim needs to run afterwards again anyway. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: start reclaim with sequential zones  [Hannes Reinecke, 1 file, -5/+6]
Sequential zones perform better for reclaim, so start off using them and only use random zones as a fallback when cache zones are present. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: reclaim random zones when idle  [Hannes Reinecke, 3 files, -16/+29]
When the system is idle we should start reclaiming random zones, too. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: separate random and cache zones  [Hannes Reinecke, 4 files, -67/+159]
Instead of lumping emulated zones together with random zones, handle them as separate 'cache' zones. This improves code readability and allows an easier implementation of different cache policies. Also add additional allocation flags to separate the type (cache, random, or sequential) from the purpose (e.g. reclaim). Also switch the allocation policy to not use random zones as buffer zones if cache zones are present. This avoids a performance drop when all cache zones are used. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone  [Hannes Reinecke, 2 files, -4/+4]
The only case where dmz_get_zone_for_reclaim() cannot return a zone is if the respective lists are empty. So we should just return a simple NULL value here as we really don't have an error code which would make sense. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: Avoid 64-bit division error in dmz_fixup_devices  [Nathan Chancellor, 1 file, -2/+3]
When building arm32 allyesconfig:

    ld.lld: error: undefined symbol: __aeabi_uldivmod
    >>> referenced by dm-zoned-target.c
    >>>               md/dm-zoned-target.o:(dmz_ctr) in archive drivers/built-in.a

dmz_fixup_devices uses DIV_ROUND_UP with variables of type sector_t. As such, it should be using DIV_ROUND_UP_SECTOR_T, which handles this automatically. Fixes: 70978208ec91 ("dm zoned: metadata version 2") Signed-off-by: Nathan Chancellor <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
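Why this matters on 32-bit: sector_t is 64-bit, and DIV_ROUND_UP expands to a plain C division, which the compiler lowers to a call to __aeabi_uldivmod on arm32; DIV_ROUND_UP_SECTOR_T routes the 64-bit case through the kernel's do_div()-based helpers instead. A sketch of the call-site change, with illustrative variable names:

    sector_t nr_zones;

    /* before: links against __aeabi_uldivmod on 32-bit builds */
    nr_zones = DIV_ROUND_UP(capacity, zone_nr_sectors);

    /* after: safe on both 32- and 64-bit */
    nr_zones = DIV_ROUND_UP_SECTOR_T(capacity, zone_nr_sectors);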
2020-05-20  dm: use DMDEBUG macros now that they use pr_debug variants  [Mike Snitzer, 2 files, -7/+7]
Now that DMDEBUG uses pr_debug and DMDEBUG_LIMIT uses pr_debug_ratelimited, clean up DM's two direct pr_debug callers to use them and get the benefit of consistent DM_FMT formatting of debugging messages. While doing so, dm-mpath.c:dm_report_EIO() was switched over to DMDEBUG_LIMIT due to the potential for error-handling floods in the IO completion path. Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: remove spurious newlines from debugging messages  [Hannes Reinecke, 2 files, -4/+4]
DMDEBUG will already add a newline to the logging messages, so we shouldn't be adding it to the message itself. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm: replace zero-length array with flexible-array  [Gustavo A. R. Silva, 9 files, -9/+9]
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these is a flexible array member[1][2], introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kinds of undefined-behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
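For contrast with the C99 form quoted above, the zero-length-array form being replaced looks like this; the same schematic struct is used, not one of the actual DM structures touched by the patch:

    struct foo {
            int stuff;
            struct boo array[0];    /* GNU extension: sizeof(array) silently
                                     * evaluates to 0 and misplacement of the
                                     * member is not diagnosed */
    };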
2020-05-20  dm zoned: metadata version 2  [Hannes Reinecke, 3 files, -102/+400]
Implement handling for metadata version 2. The new metadata adds a label and UUID for the device mapper device, and additional UUID for the underlying block devices. It also allows for an additional regular drive to be used for emulating random access zones. The emulated zones will be placed logically in front of the zones from the zoned block device, causing the superblocks and metadata to be stored on that device. The first zone of the original zoned device will be used to hold another, tertiary copy of the metadata; this copy carries a generation number of 0 and is never updated; it's just used for identification. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: ignore metadata zone in dmz_alloc_zone()  [Hannes Reinecke, 1 file, -0/+6]
When looking up zones in dmz_alloc_zone() we need to ignore metadata zones so as not to accidentally overwrite metadata. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: Reduce logging output on startup  [Hannes Reinecke, 1 file, -12/+12]
dm-zoned is becoming quite chatty during startup; reduce the noise by moving some information to 'debug' level. Suggested-by: Mike Snitzer <[email protected]> Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-20  dm zoned: add metadata logging functions  [Hannes Reinecke, 1 file, -39/+57]
Use the metadata label for logging and not the underlying device. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-19  dm zoned: use dmz_zone_to_dev() when handling metadata I/O  [Hannes Reinecke, 1 file, -5/+7]
Use accessors to retrieve the device pointer in preparation for adding an additional block device. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-19  dm zoned: replace 'target' pointer in the bio context  [Hannes Reinecke, 1 file, -20/+24]
Replace the 'target' pointer in the bio context with the device pointer as this is what's actually used. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-19  dm zoned: remove 'dev' argument from reclaim  [Hannes Reinecke, 3 files, -30/+32]
Use the dmz_zone_to_dev() mapping function to remove the 'dev' argument from reclaim. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-19  blk-mq: allow blk_mq_make_request to consume the q_usage_counter reference  [Christoph Hellwig, 1 file, -1/+10]
blk_mq_make_request currently needs to grab a q_usage_counter reference when allocating a request. This is because the block layer grabs one before calling blk_mq_make_request, but also releases it as soon as blk_mq_make_request returns. Remove the blk_queue_exit call after blk_mq_make_request returns, and instead let it consume the reference. This works perfectly fine for the block layer caller; only device mapper needs an extra reference, as the old problem still persists there. Open-code blk_queue_enter_live in device mapper, as there should be no other callers and this allows better documenting why we do a non-try get. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-05-15  dm zoned: Introduce dmz_dev_is_dying() and dmz_check_dev()  [Hannes Reinecke, 4 files, -5/+18]
Introduce accessors dmz_dev_is_dying() and dmz_check_dev() to avoid having to reference the devices directly. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: introduce dmz_metadata_label() to format device name  [Hannes Reinecke, 4 files, -42/+62]
Introduce dmz_metadata_label() to format the device-mapper device name and use it instead of the device name of the underlying device. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: move fields from struct dmz_dev to dmz_metadata  [Hannes Reinecke, 4 files, -63/+95]
Move fields from the device structure into the metadata structure and provide accessor functions. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: store device in struct dmz_sb  [Hannes Reinecke, 1 file, -31/+59]
Store the device together with the superblock so that we don't have to go back to the metadata to find it. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: use array for superblock zones  [Hannes Reinecke, 1 file, -16/+25]
Instead of storing just the first superblock zone and calculating the secondary zone relative to it, use an array to hold the superblock zones. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: store zone id within the zone structure and kill dmz_id()  [Hannes Reinecke, 4 files, -36/+31]
Instead of calculating the zone index from the offset within the zone array, store the index within the structure itself. With that the helper dmz_id() is pointless and can be replaced with accessing the ->id value directly. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
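A sketch of the change; the helper being removed is shown schematically and field names follow the description above rather than the exact dm-zoned code:

    /* before: index recomputed from pointer arithmetic every time */
    unsigned int dmz_id(struct dmz_metadata *zmd, struct dm_zone *zone)
    {
            return zone - zmd->zones;
    }

    /* after: stored once when the zone array is initialized ... */
    zone->id = idx;
    /* ... and callers read zone->id directly instead of calling dmz_id() */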
2020-05-15  dm zoned: add 'message' callback  [Hannes Reinecke, 1 file, -0/+15]
Add callback for 'dmsetup message' to allow the reclaim process to be triggered manually. Eg. dmsetup message /dev/dm-X 0 message will start the reclaim process even if the default threshold of 50 percent of free random zones is not reached. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm zoned: add 'status' callback  [Hannes Reinecke, 3 files, -0/+44]
Add callback to supply information for 'dmsetup status' and 'dmsetup table'. The output for 'dmsetup status' is 0 <size> zoned <nr_zones> zones <nr_unmap_rnd>/<nr_rnd> random <nr_unmap_seq>/<nr_seq> sequential where <nr_unmap_rnd> is the number of unmapped (ie free) random zones, <nr_rnd> the total number of random zones, <nr_unmap_seq> the number of unmapped sequential zones, and <nr_seq> the total number of sequential zones. Signed-off-by: Hannes Reinecke <[email protected]> Reviewed-by: Bob Liu <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm mpath: add Historical Service Time Path Selector  [Khazhismel Kumykov, 3 files, -0/+573]
This new selector keeps an exponential moving average of the service time for each path (loosely defined as the delta between start_io and end_io), and uses this along with the number of inflight requests to estimate future service time for a path. Since we don't have a prober to account for temporarily slow paths, re-try "slow" paths every once in a while (num_paths * historical_service_time). To account for fast paths transitioning to slow, if a path has not completed any request within (num_paths * historical_service_time), limit the number of outstanding requests. To account for low-volume situations where the number of inflight IOs would be zero, the last finish time of each path is factored in. Signed-off-by: Khazhismel Kumykov <[email protected]> Co-developed-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
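A sketch of the exponential moving average described above; the smoothing constant, struct, and field names are illustrative, not the selector's exact code:

    #define HST_WEIGHT 64   /* hypothetical smoothing factor */

    struct path_info {
            u64 historical_service_time;   /* EMA of observed service time, ns */
            u64 last_finish;               /* last completion timestamp, ns */
    };

    /* new_avg = old_avg + (sample - old_avg) / weight */
    static void hst_update(struct path_info *pi, u64 service_time_ns)
    {
            pi->historical_service_time +=
                    ((s64)service_time_ns -
                     (s64)pi->historical_service_time) / HST_WEIGHT;
            pi->last_finish = ktime_get_ns();   /* feeds the low-volume correction */
    }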
2020-05-15  dm mpath: pass IO start time to path selector  [Gabriel Krisman Bertazi, 5 files, -6/+18]
The HST path selector needs this information to perform path prediction. For request-based mpath, struct request's io_start_time_ns is used, while for bio-based, use the start_time stored in dm_io. Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
2020-05-15  dm writecache: improve performance on DDR persistent memory (Optane)  [Mikulas Patocka, 1 file, -1/+37]
When testing the dm-writecache target on a real DDR persistent memory (Intel Optane), it turned out that explicit cache flushing using the clflushopt instruction performs better than non-temporal stores for block sizes 1k, 2k and 4k. The dm-writecache target is singlethreaded (all the copying is done while holding the writecache lock), so it benefits from clwb, see: http://lore.kernel.org/r/alpine.LRH.2.02.2004160411460.7833@file01.intranet.prod.int.rdu2.redhat.com Add a new function memcpy_flushcache_optimized() that tests if clflushopt is present - and if it is, we use it instead of memcpy_flushcache. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
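A rough shape of the optimization described above; the size handling and thresholds in the real dm-writecache code differ, so treat this as a sketch:

    static void memcpy_flushcache_optimized(void *dest, void *source, size_t size)
    {
            /* assumes size is a multiple of the 64-byte cache line, as it is
             * for the 1k/2k/4k blocks mentioned above */
    #ifdef CONFIG_X86
            if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
                likely(boot_cpu_data.x86_clflush_size == 64)) {
                    while (size) {
                            memcpy(dest, source, 64);
                            clflushopt(dest);       /* flush the line just written */
                            dest += 64;
                            source += 64;
                            size -= 64;
                    }
                    return;
            }
    #endif
            memcpy_flushcache(dest, source, size);  /* non-temporal store fallback */
    }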
2020-05-15  dm writecache: remove superfluous test in persistent_memory_claim  [Mikulas Patocka, 1 file, -4/+0]
Remove superfluous test if dax_dev is NULL - dax_direct_access already does this test. Signed-off-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>