aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2023-06-16blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_noneMing Lei1-3/+7
After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16block: remove BIO_PAGE_REFFEDChristoph Hellwig1-2/+0
Now that all block direct I/O helpers use page pinning, this flag is unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20230614140341.521331-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16scsi: block: Improve ioprio value validity checksDamien Le Moal1-0/+1
The introduction of the macro IOPRIO_PRIO_LEVEL() in commit eca2040972b4 ("scsi: block: ioprio: Clean up interface definition") results in an iopriority level to always be masked using the macro IOPRIO_LEVEL_MASK, and thus to the kernel always seeing an acceptable value for an I/O priority level when checked in ioprio_check_cap(). Before this patch, this function would return an error for some (but not all) invalid values for a level valid range of [0..7]. Restore and improve the detection of invalid priority levels by introducing the inline function ioprio_value() to check an ioprio class, level and hint value before combining these fields into a single value to be used with ioprio_set() or AIOs. If an invalid value for the class, level or hint of an ioprio is detected, ioprio_value() returns an ioprio using the class IOPRIO_CLASS_INVALID, indicating an invalid value and causing ioprio_check_cap() to return -EINVAL. Fixes: 6c913257226a ("scsi: block: Introduce ioprio hints") Fixes: eca2040972b4 ("scsi: block: ioprio: Clean up interface definition") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230608095556.124001-1-dlemoal@kernel.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-06-14block: fix blktrace debugfs entries leakageYu Kuai1-1/+4
Commit 99d055b4fd4b ("block: remove per-disk debugfs files in blk_unregister_queue") moves blk_trace_shutdown() from blk_release_queue() to blk_unregister_queue(), this is safe if blktrace is created through sysfs, however, there is a regression in corner case. blktrace can still be enabled after del_gendisk() through ioctl if the disk is opened before del_gendisk(), and if blktrace is not shutdown through ioctl before closing the disk, debugfs entries will be leaked. Fix this problem by shutdown blktrace in disk_release(), this is safe because blk_trace_remove() is reentrant. Fixes: 99d055b4fd4b ("block: remove per-disk debugfs files in blk_unregister_queue") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230610022003.2557284-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14blk-mq: check on cpu id when there is only one ctx mappingEd Tsai1-2/+3
commit f168420c62e7 ("blk-mq: don't redirect completion for hctx withs only one ctx mapping") When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric. (e.g. Multiple SQs share one CQ) Therefore, 1:1 mapping of hctx and ctx won't complete request on the submission cpu. In this situation, this nr_ctx check could violate the QUEUE_FLAG_SAME_FORCE, as a result, check on cpu id when there is only one ctx mapping. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com> Suggested-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12blk-mq: fix potential io hang by wrong 'wake_batch'Yu Kuai3-8/+12
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic: t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */ Fix this problem by protecting them wih 'tags->lock', this is not a hot path, so performance should not be concerned. And now that all writers are inside the lock, switch 'actives_queues' from atomic to unsigned int. Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: store the holder in file->private_dataChristoph Hellwig1-6/+8
Store the file struct used as the holder in file->private_data as an indicator that this file descriptor was opened exclusively to remove the last use of FMODE_EXCL. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-30-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: always use I_BDEV on file->f_mapping->host to find the bdevChristoph Hellwig1-10/+8
Always use I_BDEV(file->f_mapping->host) to find the bdev for a file to free up file->private_data for other uses. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-29-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig6-65/+68
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: remove unused fmode_t arguments from ioctl handlersChristoph Hellwig3-12/+12
A few ioctl handlers have fmode_t arguments that are entirely unused, remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230608110258.189493-27-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: move a few internal definitions out of blkdev.hChristoph Hellwig1-2/+21
All these helpers are only used in core block code, so move them out of the public header. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12scsi: replace the fmode_t argument to ->sg_io_fn with a simple boolChristoph Hellwig2-4/+6
Instead of passing a fmode_t and only checking it for FMODE_WRITE, pass a bool open_for_write to prepare for callers that won't have the fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig4-24/+29
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: rename blkdev_close to blkdev_releaseChristoph Hellwig1-2/+2
Make the function name match the method name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: remove the unused mode argument to ->releaseChristoph Hellwig1-7/+7
The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: pass a gendisk to ->openChristoph Hellwig1-1/+1
->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: pass a gendisk on bdev_check_media_changeChristoph Hellwig1-9/+9
bdev_check_media_change should only ever be called for the whole device. Pass a gendisk to make that explicit and rename the function to disk_check_media_change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: also call ->open for incremental partition opensChristoph Hellwig1-10/+8
For whole devices ->open is called for each open, but for partitions it is only called on the first open of a partition, e.g.: open("/dev/vdb", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb1", ...) open("/dev/vdb", ...) - 2 call to ->open open("/dev/vdb", ...) open("/dev/vdb", ...) - just open call to ->open This is problematic as various block drivers look at open flags and might not do all the required setup if the earlier open was with an odd flag like O_NDELAY or the magic 3 ioctl-only open mode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Phillip Potter <phil@philpotter.co.uk> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-11blk-cgroup: Flush stats before releasing blkcg_gqMing Lei1-9/+31
As noted by Michal, the blkg_iostat_set's in the lockless list hold reference to blkg's to protect against their removal. Those blkg's hold reference to blkcg. When a cgroup is being destroyed, cgroup_rstat_flush() is only called at css_release_work_fn() which is called when the blkcg reference count reaches 0. This circular dependency will prevent blkcg and some blkgs from being freed after they are made offline. It is less a problem if the cgroup to be destroyed also has other controllers like memory that will call cgroup_rstat_flush() which will clean up the reference count. If block is the only controller that uses rstat, these offline blkcg and blkgs may never be freed leaking more and more memory over time. To prevent this potential memory leak: - flush blkcg per-cpu stats list in __blkg_release(), when no new stat can be added - add global blkg_stat_lock for covering concurrent parent blkg stat update - don't grab bio->bi_blkg reference when adding the stats into blkcg's per-cpu stat list since all stats are guaranteed to be consumed before releasing blkg instance, and grabbing blkg reference for stats was the most fragile part of original patch Based on Waiman's patch: https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/ Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") Cc: stable@vger.kernel.org Reported-by: Jay Shin <jaeshin@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: mkoutny@suse.com Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230609234249.1412858-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-09filemap: add a kiocb_write_and_wait helperChristoph Hellwig1-15/+3
Factor out a helper that does filemap_write_and_wait_range for the range covered by a read kiocb, or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write. Link: https://lkml.kernel.org/r/20230601145904.1385409-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09block: fix rootwait= againChristoph Hellwig1-1/+1
The previous rootwait fix added an -EINVAL return to a completely bogus superflous branch, fix this. Fixes: 1341c7d2ccf4 ("block: fix rootwait=") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Fabio Estevam <festevam@gmail.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20230609051737.328930-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07block: fix rootwait=Christoph Hellwig1-2/+2
Failures to look up the gendisk must return -ENODEV so that rootwait retries the lookup instead of -EINVAL which exits early. Fixes: cf056a431215 ("init: improve the name_to_dev_t interface") Reported-by: Fabio Estevam <festevam@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Fabio Estevam <festevam@gmail.com> Link: https://lore.kernel.org/r/20230607135746.92995-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07blk-cgroup: Reinit blkg_iostat_set after clearing in blkcg_reset_stats()Waiman Long1-0/+5
When blkg_alloc() is called to allocate a blkcg_gq structure with the associated blkg_iostat_set's, there are 2 fields within blkg_iostat_set that requires proper initialization - blkg & sync. The former field was introduced by commit 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") while the later one was introduced by commit f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat"). Unfortunately those fields in the blkg_iostat_set's are not properly re-initialized when they are cleared in v1's blkcg_reset_stats(). This can lead to a kernel panic due to NULL pointer access of the blkg pointer. The missing initialization of sync is less problematic and can be a problem in a debug kernel due to missing lockdep initialization. Fix these problems by re-initializing them after memory clearing. Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()") Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-07blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue()Yu Kuai1-2/+2
Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq() will enable irq unconditionally: spin_lock_irq queue_lock -> disable irq spin_lock_irq ioc->lock spin_unlock_irq ioc->lock -> enable irq /* * AA dead lock will be triggered if current context is preempted by irq, * and irq try to hold queue_lock again. */ spin_unlock_irq queue_lock Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'. Fixes: 5a0ac57c48aa ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-06blk-ioprio: Introduce promote-to-rt policyHou Tao1-3/+20
Since commit a78418e6a04c ("block: Always initialize bio IO priority on submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling blkcg_set_ioprio(), so there will be no way to promote the io-priority of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be greater than or equals to IOPRIO_CLASS_RT. It seems possible to call blkcg_set_ioprio() first then try to initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work for bio in which bi_ioprio is already initialized (e.g., direct-io), so introduce a new promote-to-rt policy to promote the iopriority of bio to IOPRIO_CLASS_RT if the ioprio is not already RT. For none-to-rt policy, although it doesn't work now, but considering that its purpose was also to override the io-priority to RT and allowing for a smoother transition, just keep it and treat it as an alias of the promote-to-rt policy. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05blk-iocost: use spin_lock_irqsave in adjust_inuse_and_calc_costLi Nan1-3/+4
adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled when unlock. DEADLOCK might happen if we have held other locks and disabled IRQ before invoking it. Fix it by using spin_lock_irqsave() instead, which can keep IRQ state consistent with before when unlock. ================================ WARNING: inconsistent lock state 5.10.0-02758-g8e5f91fd772f #26 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 {IN-HARDIRQ-W} state was registered at: __lock_acquire+0x3d7/0x1070 lock_acquire+0x197/0x4a0 __raw_spin_lock_irqsave _raw_spin_lock_irqsave+0x3b/0x60 bfq_idle_slice_timer_body bfq_idle_slice_timer+0x53/0x1d0 __run_hrtimer+0x477/0xa70 __hrtimer_run_queues+0x1c6/0x2d0 hrtimer_interrupt+0x302/0x9e0 local_apic_timer_interrupt __sysvec_apic_timer_interrupt+0xfd/0x420 run_sysvec_on_irqstack_cond sysvec_apic_timer_interrupt+0x46/0xa0 asm_sysvec_apic_timer_interrupt+0x12/0x20 irq event stamp: 837522 hardirqs last enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore hardirqs last enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40 hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50 softirqs last enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&bfqd->lock); <Interrupt> lock(&bfqd->lock); *** DEADLOCK *** 3 locks held by kworker/2:3/388: #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0 #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0 #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390 stack backtrace: CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kthrotld blk_throtl_dispatch_work_fn Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x107/0x167 print_usage_bug valid_state mark_lock_irq.cold+0x32/0x3a mark_lock+0x693/0xbc0 mark_held_locks+0x9e/0xe0 __trace_hardirqs_on_caller lockdep_hardirqs_on_prepare.part.0+0x151/0x360 trace_hardirqs_on+0x5b/0x180 __raw_spin_unlock_irq _raw_spin_unlock_irq+0x24/0x40 spin_unlock_irq adjust_inuse_and_calc_cost+0x4fb/0x970 ioc_rqos_merge+0x277/0x740 __rq_qos_merge+0x62/0xb0 rq_qos_merge bio_attempt_back_merge+0x12c/0x4a0 blk_mq_sched_try_merge+0x1b6/0x4d0 bfq_bio_merge+0x24a/0x390 __blk_mq_sched_bio_merge+0xa6/0x460 blk_mq_sched_bio_merge blk_mq_submit_bio+0x2e7/0x1ee0 __submit_bio_noacct_mq+0x175/0x3b0 submit_bio_noacct+0x1fb/0x270 blk_throtl_dispatch_work_fn+0x1ef/0x2b0 process_one_work+0x83e/0x13f0 process_scheduled_works worker_thread+0x7e3/0xd80 kthread+0x353/0x470 ret_from_fork+0x1f/0x30 Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks") Signed-off-by: Li Nan <linan122@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: mark early_lookup_bdev as __initChristoph Hellwig1-10/+9
early_lookup_bdev is now only used during the early boot code as it should, so mark it __init to not waste run time memory on it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: move more code to early-lookup.cChristoph Hellwig2-92/+92
blk_lookup_devt is only used by code in early-lookup.c, so move it there. printk_all_partitions and it's helper bdevt_str are only used by the early init code in init/do_mounts.c, so they should go there as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: move the code to do early boot lookup of block devices to block/Christoph Hellwig2-1/+225
Create a new block/early-lookup.c to keep the early block device lookup code instead of having this code sit with the early mount code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531125535.676098-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: add a mark_dead holder operationChristoph Hellwig1-0/+24
Add a mark_dead method to blk_holder_ops that is called from blk_mark_disk_dead to notify the holder that the block device it is using has been marked dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: introduce holder opsChristoph Hellwig4-16/+36
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: remove blk_drop_partitionsChristoph Hellwig1-12/+4
There is only a single caller left, so fold the loop into that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: delete partitions later in del_gendiskChristoph Hellwig3-13/+32
Delay dropping the block_devices for partitions in del_gendisk until after the call to blk_mark_disk_dead, so that we can implementat notification of removed devices in blk_mark_disk_dead. This requires splitting a lower-level drop_partition helper out of delete_partition and using that from del_gendisk, while having a common loop for the whole device and partitions that calls remove_inode_hash, fsync_bdev and __invalidate_device before the call to blk_mark_disk_dead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: unhash the inode earlier in delete_partitionChristoph Hellwig1-6/+6
Move the call to remove_inode_hash to the beginning of delete_partition, as we want to prevent opening a block_device that is about to be removed ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: avoid repeated work in blk_mark_disk_deadChristoph Hellwig1-1/+3
Check if GD_DEAD is already set in blk_mark_disk_dead, and don't duplicate the work already done. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: consolidate the shutdown logic in blk_mark_disk_dead and del_gendiskChristoph Hellwig1-14/+12
blk_mark_disk_dead does very similar work a a section of del_gendisk: - set the GD_DEAD flag - set the capacity to zero - start a queue drain but del_gendisk also sets QUEUE_FLAG_DYING on the queue if it is owned by the disk, sets the capacity to zero before starting the drain, and both with sending a uevent and kernel message for this fake capacity change. Move the exact logic from the more heavily used del_gendisk into blk_mark_disk_dead and then call blk_mark_disk_dead from del_gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: turn bdev_lock into a mutexChristoph Hellwig1-14/+13
There is no reason for this lock to spin, and being able to sleep under it will come in handy soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: refactor bd_may_claimChristoph Hellwig1-18/+22
The long if/else chain obsfucates the actual logic. Tidy it up to be more structured. Also drop the whole argument, as it can be trivially derived from bdev using bdev_whole, and having the bdev_whole in the function makes it easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: factor out a bd_end_claim helper from blkdev_putChristoph Hellwig1-30/+33
Move all the logic to release an exclusive claim into a helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-03blk-mq: fix blk_mq_hw_ctx active request accountingTian Lan1-4/+4
The nr_active counter continues to increase over time which causes the blk_mq_get_tag to hang until the thread is rescheduled to a different core despite there are still tags available. kernel-stack INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000 Call Trace: <TASK> __schedule+0x351/0xa20 scheduler+0x5d/0xe0 io_schedule+0x42/0x70 blk_mq_get_tag+0x11a/0x2a0 ? dequeue_task_stop+0x70/0x70 __blk_mq_alloc_requests+0x191/0x2e0 kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before __blk_mq_free_request being called. 320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1 b'__blk_mq_free_request+0x1 [kernel]' b'bt_iter+0x50 [kernel]' b'blk_mq_queue_tag_busy_iter+0x318 [kernel]' b'blk_mq_timeout_work+0x7c [kernel]' b'process_one_work+0x1c4 [kernel]' b'worker_thread+0x4d [kernel]' b'kthread+0xe6 [kernel]' b'ret_from_fork+0x1f [kernel]' Signed-off-by: Tian Lan <tian.lan@twosigma.com> Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01block: Replace all non-returning strlcpy with strscpyAzeem Shaikh3-3/+3
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230530155608.272266-1-azeemshaikh38@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01blk-ioc: protect ioc_destroy_icq() by 'queue_lock'Yu Kuai1-17/+13
Currently, icq is tracked by both request_queue(icq->q_node) and task(icq->ioc_node), and ioc_clear_queue() from elevator exit is not safe because it can access the list without protection: ioc_clear_queue ioc_release_fn lock queue_lock list_splice /* move queue list to a local list */ unlock queue_lock /* * lock is released, the local list * can be accessed through task exit. */ lock ioc->lock while (!hlist_empty) icq = hlist_entry lock queue_lock ioc_destroy_icq delete icq->ioc_node while (!list_empty) icq = list_entry() list_del icq->q_node /* * This is not protected by any lock, * list_entry concurrent with list_del * is not safe. */ unlock queue_lock unlock ioc->lock Fix this problem by protecting list 'icq->q_node' by queue_lock from ioc_clear_queue(). Reported-and-tested-by: Pradeep Pragallapati <quic_pragalla@quicinc.com> Link: https://lore.kernel.org/lkml/20230517084434.18932-1-quic_pragalla@quicinc.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230531073435.2923422-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-01block: add bio_add_folio_nofailJohannes Thumshirn1-0/+8
Just like for bio_add_pages() add a no-fail variant for bio_add_folio(). Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/924dff4077812804398ef84128fb920507fa4be1.1685532726.git.johannes.thumshirn@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30block: constify the whole_disk device_attributeThomas Weißschuh1-1/+1
The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-4-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30block: constify struct part_attr_groupThomas Weißschuh1-1/+1
The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-3-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30block: constify struct part_type part_typeThomas Weißschuh1-1/+1
The struct is never modified so it can be const. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-2-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30block: constify partition prober arrayThomas Weißschuh1-1/+1
The array is never modified so it can be const. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202304191640.SkNk7kVN-lkp@intel.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20230419-const-partition-v3-1-4e14e48be367@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-29block: fix revalidate performance regressionDamien Le Moal1-1/+2
The scsi driver function sd_read_block_characteristics() always calls disk_set_zoned() to a disk zoned model correctly, in case the device model changed. This is done even for regular disks to set the zoned model to BLK_ZONED_NONE and free any zone related resources if the drive previously was zoned. This behavior significantly impact the time it takes to revalidate disks on a large system as the call to disk_clear_zone_settings() done from disk_set_zoned() for the BLK_ZONED_NONE case results in the device request queued to be frozen, even if there are no zone resources to free. Avoid this overhead for non-zoned devices by not calling disk_clear_zone_settings() in disk_set_zoned() if the device model was already set to BLK_ZONED_NONE, which is always the case for regular devices. Reported by: Brian Bunker <brian@purestorage.com> Fixes: 508aebb80527 ("block: introduce blk_queue_clear_zone_settings()") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230529073237.1339862-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24block: convert bio_map_user_iov to use iov_iter_extract_pagesDavid Howells1-12/+11
This will pin pages or leave them unaltered rather than getting a ref on them as appropriate to the iterator. The pages need to be pinned for DIO rather than having refs taken on them to prevent VM copy-on-write from malfunctioning during a concurrent fork() (the result of the I/O could otherwise end up being visible to/affected by the child process). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-7-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pagesDavid Howells1-11/+12
This will pin pages or leave them unaltered rather than getting a ref on them as appropriate to the iterator. The pages need to be pinned for DIO rather than having refs taken on them to prevent VM copy-on-write from malfunctioning during a concurrent fork() (the result of the I/O could otherwise end up being affected by/visible to the child process). Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Hubbard <jhubbard@nvidia.com> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Jan Kara <jack@suse.cz> cc: Matthew Wilcox <willy@infradead.org> cc: Logan Gunthorpe <logang@deltatee.com> cc: linux-block@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230522205744.2825689-6-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>