aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2022-11-23blk-crypto: Add a missing include directiveBart Van Assche1-0/+1
Allow the compiler to verify consistency of function declarations and function definitions. This patch fixes the following sparse errors: block/blk-crypto-profile.c:241:14: error: no previous prototype for ‘blk_crypto_get_keyslot’ [-Werror=missing-prototypes] 241 | blk_status_t blk_crypto_get_keyslot(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:318:6: error: no previous prototype for ‘blk_crypto_put_keyslot’ [-Werror=missing-prototypes] 318 | void blk_crypto_put_keyslot(struct blk_crypto_keyslot *slot) | ^~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:344:6: error: no previous prototype for ‘__blk_crypto_cfg_supported’ [-Werror=missing-prototypes] 344 | bool __blk_crypto_cfg_supported(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~~~~~ block/blk-crypto-profile.c:373:5: error: no previous prototype for ‘__blk_crypto_evict_key’ [-Werror=missing-prototypes] 373 | int __blk_crypto_evict_key(struct blk_crypto_profile *profile, | ^~~~~~~~~~~~~~~~~~~~~~ Cc: Eric Biggers <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-23elevator: remove an outdated comment in elevator_changeJinlong Chen1-3/+0
mq is no longer a special case. Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/cbf47824fc726440371e74c867bf635ae1b671a3.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-23elevator: update the document of elevator_matchJinlong Chen1-3/+2
elevator_match does not care about elevator_features any more. Remove related descriptions from its document. Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match") Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/a58424555202c07a9ccf7f60c3ad7e247da09e25.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-23elevator: printk a warning if switching to a new io scheduler failsJinlong Chen1-0/+6
printk a warning to indicate that the io scheduler has been set to none if switching to a new io scheduler fails. Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/d51ed0fb457db7a4f9cbb0dbce36d534e22be457.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-23elevator: update the document of elevator_switchJinlong Chen1-4/+4
We no longer support falling back to the old io scheduler if switching to the new one fails. Update the document to indicate that. Fixes: a1ce35fa4985 ("block: remove dead elevator code") Signed-off-by: Jinlong Chen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/94250961689ba7d2e67a7d9e7995a11166fedb31.1669126766.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-22blk-mq: fix queue reference leak on blk_mq_alloc_disk_for_queue failureChristoph Hellwig1-1/+6
Drop the request queue reference just acquired when __alloc_disk_node failed. Fixes: 6f8191fdf41d ("block: simplify disk shutdown") Reported-by: Al Viro <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-22block: fix missing nr_hw_queues update in blk_mq_realloc_tag_set_tagsShin'ichiro Kawasaki1-2/+2
The commit ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags") cleaned up the function blk_mq_realloc_tag_set_tags. After this change, the function does not update nr_hw_queues of struct blk_mq_tag_set when new nr_hw_queues value is smaller than original. This results in failure of queue number change of block devices. To avoid the failure, add the missing nr_hw_queues update. Fixes: ee9d55210c2f ("blk-mq: simplify blk_mq_realloc_tag_set_tags") Reported-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Shin'ichiro Kawasaki <[email protected]> Link: https://lore.kernel.org/linux-block/20221118140640.featvt3fxktfquwh@shindev/ Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-21blk-crypto: move internal only declarations to blk-crypto-internal.hChristoph Hellwig1-0/+12
blk_crypto_get_keyslot, blk_crypto_put_keyslot, __blk_crypto_evict_key and __blk_crypto_cfg_supported are only used internally by the blk-crypto code, so move the out of blk-crypto-profile.h, which is included by drivers that supply blk-crypto functionality. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-21blk-crypto: add a blk_crypto_config_supported_natively helperChristoph Hellwig1-9/+12
Add a blk_crypto_config_supported_natively helper that wraps __blk_crypto_cfg_supported to retrieve the crypto_profile from the request queue. With this fscrypt can stop including blk-crypto-profile.h and rely on the public consumer interface in blk-crypto.h. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-21blk-crypto: don't use struct request_queue for public interfacesChristoph Hellwig1-10/+14
Switch all public blk-crypto interfaces to use struct block_device arguments to specify the device they operate on instead of th request_queue, which is a block layer implementation detail. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-18Merge tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linuxLinus Torvalds4-6/+7
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Two more bogus nid quirks (Bean Huo, Tiago Dias Ferreira) - Memory leak fix in nvmet (Sagi Grimberg) - Regression fix for block cgroups pinning the wrong blkcg, causing leaks of cgroups and blkcgs (Chris) - UAF fix for drbd setup error handling (Dan) - Fix DMA alignment propagation in DM (Keith) * tag 'block-6.1-2022-11-18' of git://git.kernel.dk/linux: dm-log-writes: set dma_alignment limit in io_hints dm-integrity: set dma_alignment limit in io_hints block: make blk_set_default_limits() private dm-crypt: provide dma_alignment limit in io_hints block: make dma_alignment a stacking queue_limit nvmet: fix a memory leak in nvmet_auth_set_key nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV7000 drbd: use after free in drbd_create_device() nvme-pci: add NVME_QUIRK_BOGUS_NID for Micron Nitro blk-cgroup: properly pin the parent in blkcg_css_online
2022-11-16blk-cgroup: Flush stats at blkgs destruction pathWaiman Long1-1/+14
As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg from being freed until some other events cause cgroup_rstat_flush() to be called to flush out the pending blkcg stats. To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush() function to flush stats for a given css and cpu and call it at the blkgs destruction path, blkcg_destroy_blkgs(), whenever there are still some pending stats to be flushed. This will ensure that blkcg reference count can reach 0 ASAP. Signed-off-by: Waiman Long <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16blk-cgroup: Optimize blkcg_rstat_flush()Waiman Long2-6/+80
For a system with many CPUs and block devices, the time to do blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It can be especially problematic as interrupt is disabled during the flush. It was reported that it might take seconds to complete in some extreme cases leading to hard lockup messages. As it is likely that not all the percpu blkg_iostat_set's has been updated since the last flush, those stale blkg_iostat_set's don't need to be flushed in this case. This patch optimizes blkcg_rstat_flush() by keeping a lockless list of recently updated blkg_iostat_set's in a newly added percpu blkcg->lhead pointer. The blkg_iostat_set is added to a lockless list on the update side in blk_cgroup_bio_start(). It is removed from the lockless list when flushed in blkcg_rstat_flush(). Due to racing, it is possible that blk_iostat_set's in the lockless list may have no new IO stats to be flushed, but that is OK. To protect against destruction of blkg, a percpu reference is gotten when putting into the lockless list and put back when removed. When booting up an instrumented test kernel with this patch on a 2-socket 96-thread system with cgroup v2, out of the 2051 calls to cgroup_rstat_flush() after bootup, 1788 of the calls were exited immediately because of empty lockless list. After an all-cpu kernel build, the ratio became 6295424/6340513. That was more than 99%. Signed-off-by: Waiman Long <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error pathWaiman Long1-8/+4
For blkcg_css_alloc(), the only error that will be returned is -ENOMEM. Simplify error handling code by returning this error directly instead of setting an intermediate "ret" variable. Signed-off-by: Waiman Long <[email protected]> Reviewed-by: Ming Lei <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: make blk_set_default_limits() privateKeith Busch2-1/+1
There are no external users of this function. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: make dma_alignment a stacking queue_limitKeith Busch2-4/+5
Device mappers had always been getting the default 511 dma mask, but the underlying device might have a larger alignment requirement. Since this value is used to determine alloweable direct-io alignment, this needs to be a stackable limit. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: don't allow a disk link holder to itselfYu Kuai1-0/+3
After creating a dm device, then user can reload such dm with itself, and dead loop will be triggered because dm keep looking up to itself. Test procedures: 1) dmsetup create test --table "xxx sda", assume dm-0 is created 2) dmsetup suspend test 3) dmsetup reload test --table "xxx dm-0" 4) dmsetup resume test Test result: BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5) stack guard page: 0000 [#1] PREEMPT SMP CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dm_blk_ioctl+0x50/0x1c0 ? dm_prepare_ioctl+0xe0/0x1e0 dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 ......(a lot of same lines) dm_blk_ioctl+0x88/0x1c0 dm_blk_ioctl+0x88/0x1c0 blkdev_ioctl+0x184/0x3e0 __x64_sys_ioctl+0xa3/0x110 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fddf7306577 Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8 RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577 RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007 RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080 R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000 R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:dm_prepare_ioctl+0xf/0x1e0 Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9 RSP: 0018:ffffc90002090000 EFLAGS: 00010206 RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000 RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000 RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331 R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000 FS: 00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fix the problem by forbidding a disk to create link to itself. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: store the holder kobject in bd_holder_diskYu Kuai1-5/+6
We hold a reference to the holder kobject for each bd_holder_disk, so to make the code a bit more robust, use a reference to it instead of the block_device. As long as no one clears ->bd_holder_dir in before freeing the disk, this isn't strictly required, but it does make the code more clear and more robust. Orignally-From: Christoph Hellwig <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: fix use after free for bd_holder_dirYu Kuai1-6/+15
Currently, the caller of bd_link_disk_holer() get 'bdev' by blkdev_get_by_dev(), which will look up 'bdev' by inode number 'dev'. Howerver, it's possible that del_gendisk() can be called currently, and 'bd_holder_dir' can be freed before bd_link_disk_holer() access it, thus use after free is triggered. t1: t2: bdev = blkdev_get_by_dev del_gendisk kobject_put(bd_holder_dir) kobject_free() bd_link_disk_holder Fix the problem by checking disk is still live and grabbing a reference to 'bd_holder_dir' first in bd_link_disk_holder(). Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: remove delayed holder registrationChristoph Hellwig2-55/+21
Now that dm has been fixed to track of holder registrations before add_disk, the somewhat buggy block layer code can be safely removed. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: clear ->slave_dir when dropping the main slave_dir referenceChristoph Hellwig1-0/+2
Zero out the pointer to ->slave_dir so that the holder code doesn't incorrectly treat the object as alive when add_disk failed or after del_gendisk was called. Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: remove blkdev_writepagesChristoph Hellwig1-7/+0
While the block device code should switch to implementing ->writepages instead of ->writepage eventually, the current implementation is entirely pointless as it does the same looping over ->writepage as the generic code if no ->writepages is present. Remove blkdev_writepages so that we can eventually unexport generic_writepages. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: shrink max number of pcpu cached biosPavel Begunkov1-1/+1
The downside of the bio pcpu cache is that bios of a cpu will be never freed unless there is new I/O issued from that cpu. We currently keep max 512 bios, which feels too much, half it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/bc198e8efb27d8c740d80c8ce477432729075096.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: add pcpu caching for non-polling bio_putPavel Begunkov1-16/+54
This patch extends REQ_ALLOC_CACHE to IRQ completions, whenever currently it's only limited to iopoll. Instead of guarding the list with irq toggling on alloc, which is expensive, it keeps an additional irq-safe list from which bios are spliced in batches to ammortise overhead. On the put side it toggles irqs, but in many cases they're already disabled and so cheap. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/c2306de96b900ab9264f4428ec37768ddcf0da36.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: split pcpu cache part of bio_put into a helperPavel Begunkov1-13/+25
Extract a helper out of bio_put for recycling into percpu caches. It's a preparation patch without functional changes. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/e97ab2026a89098ee1bfdd09bcb9451fced95f87.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: don't rob starving biosets of biosPavel Begunkov1-0/+2
Biosets keep a mempool, so as long as requests complete we can always can allocate and have forward progress. Percpu bio caches break that assumptions as we may complete into the cache of one CPU and after try and fail to allocate with another CPU. We also can't grab from another CPU's cache without tricky sync. If we're allocating with a bio while the mempool is undersaturated, remove REQ_ALLOC_CACHE flag, so on put it will go straight to mempool. It might try to free into mempool more requests than required, but assuming than there is no memory starvation in the system it'll stabilise and never hit that path. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/aa150caf9c263fa92269e86d7826cc8fa65f38de.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-14blk-cgroup: properly pin the parent in blkcg_css_onlineChris Mason1-1/+1
blkcg_css_online is supposed to pin the blkcg of the parent, but 397c9f46ee4d refactored things and along the way, changed it to pin the css instead. This results in extra pins, and we end up leaking blkcgs and cgroups. Fixes: 397c9f46ee4d ("blk-cgroup: move blkcg_{pin,unpin}_online out of line") Signed-off-by: Chris Mason <[email protected]> Spotted-by: Rik van Riel <[email protected]> Cc: <[email protected]> # v5.19+ Acked-by: Johannes Weiner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-11Merge tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linuxLinus Torvalds2-4/+32
Pull block fixes from Jens Axboe: - NVMe pull request via Christoph: - Quiet user passthrough command errors (Keith Busch) - Fix memory leak in nvmet_subsys_attr_model_store_locked - Fix a memory leak in nvmet-auth (Sagi Grimberg) - Fix a potential NULL point deref in bfq (Yu) - Allocate command/response buffers separately for DMA for sed-opal, rather than rely on embedded alignment (Serge) * tag 'block-6.1-2022-11-11' of git://git.kernel.dk/linux: nvmet: fix a memory leak nvmet: fix memory leak in nvmet_subsys_attr_model_store_locked nvme: quiet user passthrough command errors block: sed-opal: kmalloc the cmd/resp buffers block, bfq: fix null pointer dereference in bfq_bio_bfqg()
2022-11-10blk-mq: simplify blk_mq_realloc_tag_set_tagsChristoph Hellwig1-6/+4
Use set->nr_hw_queues for the current number of tags, and remove the duplicate set->nr_hw_queues update in the caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-10blk-mq: remove blk_mq_alloc_tag_set_tagsChristoph Hellwig1-9/+5
There is no point in trying to share any code with the realloc case when all that is needed by the initial tagset allocation is a simple kcalloc_node. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-09bfq: ignore oom_bfqq in bfq_check_wakerKhazhismel Kumykov1-1/+3
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker detection. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Khazhismel Kumykov <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-09bfq: fix waker_bfqq inconsistency crashKhazhismel Kumykov1-2/+7
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers") Signed-off-by: Khazhismel Kumykov <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-09block: set FOLL_PCI_P2PDMA in bio_map_user_iov()Logan Gunthorpe1-4/+8
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the NVMe passthru requests to use P2PDMA pages. Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-09block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()Logan Gunthorpe1-2/+7
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed from userspace and enables the O_DIRECT path in iomap based filesystems and direct to block devices. Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-09block: add check when merging zone device pagesLogan Gunthorpe1-0/+2
Consecutive zone device pages should not be merged into the same sgl or bvec segment with other types of pages or if they belong to different pgmaps. Otherwise getting the pgmap of a given segment is not possible without scanning the entire segment. This helper returns true either if both pages are not zone device pages or both pages are zone device pages with the same pgmap. Add a helper to determine if zone device pages are mergeable and use this helper in page_is_mergeable(). Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: John Hubbard <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-08block: sed-opal: kmalloc the cmd/resp buffersSerge Semin1-4/+28
In accordance with [1] the DMA-able memory buffers must be cacheline-aligned otherwise the cache writing-back and invalidation performed during the mapping may cause the adjacent data being lost. It's specifically required for the DMA-noncoherent platforms [2]. Seeing the opal_dev.{cmd,resp} buffers are implicitly used for DMAs in the NVME and SCSI/SD drivers in framework of the nvme_sec_submit() and sd_sec_submit() methods respectively they must be cacheline-aligned to prevent the denoted problem. One of the option to guarantee that is to kmalloc the buffers [2]. Let's explicitly allocate them then instead of embedding into the opal_dev structure instance. Note this fix was inspired by the commit c94b7f9bab22 ("nvme-hwmon: kmalloc the NVME SMART log buffer"). [1] Documentation/core-api/dma-api.rst [2] Documentation/core-api/dma-api-howto.rst Fixes: 455a7b238cd6 ("block: Add Sed-opal library") Signed-off-by: Serge Semin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-08block, bfq: fix null pointer dereference in bfq_bio_bfqg()Yu Kuai1-0/+4
Out test found a following problem in kernel 5.10, and the same problem should exist in mainline: BUG: kernel NULL pointer dereference, address: 0000000000000094 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4 Workqueue: kthrotld blk_throtl_dispatch_work_fn RIP: 0010:bfq_bio_bfqg+0x52/0xc0 Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002 RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000 RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18 RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10 R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000 R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: bfq_bic_update_cgroup+0x3c/0x350 ? ioc_create_icq+0x42/0x270 bfq_init_rq+0xfd/0x1060 bfq_insert_requests+0x20f/0x1cc0 ? ioc_create_icq+0x122/0x270 blk_mq_sched_insert_requests+0x86/0x1d0 blk_mq_flush_plug_list+0x193/0x2a0 blk_flush_plug_list+0x127/0x170 blk_finish_plug+0x31/0x50 blk_throtl_dispatch_work_fn+0x151/0x190 process_one_work+0x27c/0x5f0 worker_thread+0x28b/0x6b0 ? rescuer_thread+0x590/0x590 kthread+0x153/0x1b0 ? kthread_flush_work+0x170/0x170 ret_from_fork+0x1f/0x30 Modules linked in: CR2: 0000000000000094 ---[ end trace e2e59ac014314547 ]--- RIP: 0010:bfq_bio_bfqg+0x52/0xc0 Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002 RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000 RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18 RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10 R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000 R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Root cause is quite complex: 1) use bfq elevator for the test device. 2) create a cgroup CG 3) config blk throtl in CG blkg_conf_prep blkg_create 4) create a thread T1 and issue async io in CG: bio_init bio_associate_blkg ... submit_bio submit_bio_noacct blk_throtl_bio -> io is throttled // io submit is done 5) switch elevator: bfq_exit_queue blkcg_deactivate_policy list_for_each_entry(blkg, &q->blkg_list, q_node) blkg->pd[] = NULL // bfq policy is removed 5) thread t1 exist, then remove the cgroup CG: blkcg_unpin_online blkcg_destroy_blkgs blkg_destroy list_del_init(&blkg->q_node) // blkg is removed from queue list 6) switch elevator back to bfq bfq_init_queue bfq_create_group_hierarchy blkcg_activate_policy list_for_each_entry_reverse(blkg, &q->blkg_list) // blkg is removed from list, hence bfq policy is still NULL 7) throttled io is dispatched to bfq: bfq_insert_requests bfq_init_rq bfq_bic_update_cgroup bfq_bio_bfqg bfqg = blkg_to_bfqg(blkg) // bfqg is NULL because bfq policy is NULL The problem is only possible in bfq because only bfq can be deactivated and activated while queue is online, while others can only be deactivated while the device is removed. Fix the problem in bfq by checking if blkg is online before calling blkg_to_bfqg(). Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-07block: Fix some kernel-doc commentsYang Li1-1/+0
Remove the description of @required_features in elevator_match() to clear the below warning: block/elevator.c:103: warning: Excess function parameter 'required_features' description in 'elevator_match' Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2734 Fixes: ffb86425ee2c ("block: don't check for required features in elevator_match") Reported-by: Abaci Robot <[email protected]> Signed-off-by: Yang Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-05Merge tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linuxLinus Torvalds2-3/+3
Pull block fixes from Jens Axboe: - Fixes for the ublk driver (Ming) - Fixes for error handling memory leaks (Chen Jun, Chen Zhongjin) - Explicitly clear the last request in a chain when the plug is flushed, as it may have already been issued (Al) * tag 'block-6.1-2022-11-05' of git://git.kernel.dk/linux: block: blk_add_rq_to_plug(): clear stale 'last' after flush blk-mq: Fix kmemleak in blk_mq_init_allocated_queue block: Fix possible memory leak for rq_wb on add_disk failure ublk_drv: add ublk_queue_cmd() for cleanup ublk_drv: avoid to touch io_uring cmd in blk_mq io path ublk_drv: comment on ublk_driver entry of Kconfig ublk_drv: return flag of UBLK_F_URING_CMD_COMP_IN_TASK in case of module
2022-11-02blk-mq: use if-else instead of goto in blk_mq_alloc_cached_request()Jinlong Chen1-13/+14
if-else is more readable than goto here. Signed-off-by: Jinlong Chen <[email protected]> Link: https://lore.kernel.org/r/d3306fa4e92dc9cc614edc8f1802686096bafef2.1667356813.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-02blk-mq: improve error handling in blk_mq_alloc_rq_map()Jinlong Chen1-9/+10
Use goto-style error handling like we do elsewhere in the kernel. Signed-off-by: Jinlong Chen <[email protected]> Link: https://lore.kernel.org/r/bbbc2d9b17b137798c7fb92042141ca4cbbc58cc.1667356813.git.nickyc975@zju.edu.cn Signed-off-by: Jens Axboe <[email protected]>
2022-11-02blk-mq: add tagset quiesce interfaceChao Leng1-0/+27
Drivers that have shared tagsets may need to quiesce potentially a lot of request queues that all share a single tagset (e.g. nvme). Add an interface to quiesce all the queues on a given tagset. This interface is useful because it can speedup the quiesce by doing it in parallel. Because some queues should not need to be quiesced (e.g. the nvme connect_q) when quiescing the tagset, introduce a QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to ski quiescing a particular queue. Signed-off-by: Chao Leng <[email protected]> [hch: simplify for the per-tag_set srcu_struct] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Chao Leng <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-02blk-mq: pass a tagset to blk_mq_wait_quiesce_doneChristoph Hellwig1-7/+9
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chao Leng <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-02blk-mq: move the srcu_struct used for quiescing to the tagsetChristoph Hellwig6-53/+41
All I/O submissions have fairly similar latencies, and a tagset-wide quiesce is a fairly common operation. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Chao Leng <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: fix whitespace] Signed-off-by: Jens Axboe <[email protected]>
2022-11-02blk-mq: skip non-mq queues in blk_mq_quiesce_queueChristoph Hellwig1-1/+3
For submit_bio based queues there is no (S)RCU critical section during I/O submission and thus nothing to wait for in blk_mq_wait_quiesce_done, so skip doing any synchronization. No non-mq driver should be calling this, but for now we have core callers that unconditionally call into it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-02block: set the disk capacity to 0 in blk_mark_disk_deadChristoph Hellwig1-0/+5
nvme and xen-blkfront are already doing this to stop buffered writes from creating dirty pages that can't be written out later. Move it to the common code. This also removes the comment about the ordering from nvme, as bd_mutex not only is gone entirely, but also hasn't been used for locking updates to the disk size long before that, and thus the ordering requirement documented there doesn't apply any more. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Chao Leng <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-01block, bfq: don't declare 'bfqd' as type 'void *' in bfq_groupYu Kuai3-6/+4
Prevent unnecessary format conversion for bfqg->bfqd in multiple places. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Paolo Valente <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-01block, bfq: remove dead code for updating 'rq_in_driver'Yu Kuai1-16/+0
Such code are not even compiled since they are inside marco "#if 0". Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Paolo Valente <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-01block, bfq: cleanup bfq_activate_requeue_entity()Yu Kuai1-9/+5
Just make the code a litter cleaner by removing the unnecessary variable 'sd'. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Paolo Valente <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-01block, bfq: factor out code to update 'active_entities'Yu Kuai1-29/+32
Current code is a bit ugly and hard to read. Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Paolo Valente <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>