aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-06-30blk-mq: pass hctx to blk_mq_dispatch_rq_listMing Lei3-12/+10
All requests in the 'list' of blk_mq_dispatch_rq_list belong to same hctx, so it is better to pass hctx instead of request queue, because blk-mq's dispatch target is hctx instead of request queue. Signed-off-by: Ming Lei <[email protected]> Tested-by: Baolin Wang <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Cc: Sagi Grimberg <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-30blk-mq: pass request queue into get/put budget callbackMing Lei5-23/+17
blk-mq budget is abstract from scsi's device queue depth, and it is always per-request-queue instead of hctx. It can be quite absurd to get a budget from one hctx, then dequeue a request from scheduler queue, and this request may not belong to this hctx, at least for bfq and deadline. So fix the mess and always pass request queue to get/put budget callback. Signed-off-by: Ming Lei <[email protected]> Tested-by: Baolin Wang <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Cc: Sagi Grimberg <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Douglas Anderson <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-mq: remove the BLK_MQ_REQ_INTERNAL flagChristoph Hellwig4-12/+6
Just check for a non-NULL elevator directly to make the code more clear. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-mq: put driver tag when this request is completedMing Lei2-6/+2
It is natural to release driver tag when this request is completed by LLD or device since its purpose is for LLD use. One big benefit is that the released tag can be re-used quicker since bio_endio() may take too long. Meantime we don't need to release driver tag for flush request. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-cgroup: remove a dead check in blk_throtl_bioChristoph Hellwig1-1/+1
bios must have a valid block group by the time they are submitted. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-cgroup: remove blkcg_bio_issue_checkChristoph Hellwig5-57/+47
blkcg_bio_issue_check is a giant inline function that does three entirely different things. Factor out the blk-cgroup related bio initalization into a new helper, and the open code the sequence in the only caller, relying on the fact that all the actual functionality is stubbed out for non-cgroup builds. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-cgroup: move rcu locking from blkcg_bio_issue_check to blk_throtl_bioChristoph Hellwig2-3/+2
The only thing in blkcg_bio_issue_check that needs to be under rcu_read_lock is blk_throtl_bio, so move the locking there. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29cgroup: unexport cgroup_rstat_updatedChristoph Hellwig1-1/+0
cgroup_rstat_updated is only used by core block code, no need to export it. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29blk-cgroup: remove the !bio->bi_blkg check in blkcg_bio_issue_checkChristoph Hellwig1-14/+1
This is purely a sanity check for grave programming errors. Remove it to simplify further work in this area. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: move the initial blkg lookup into blkg_tryget_closestChristoph Hellwig1-19/+14
By moving the initial blkg lookup into blkg_tryget_closest we get a nicely self contained routines that does all the RCU locking. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: bypass blkg_tryget_closest for the root_blkgChristoph Hellwig1-5/+10
The root_blkg is only torn down at the very end of removing a queue. So in the I/O submission path is always has a life reference and we can just grab another one using blkg_get instead of doing a tryget and parent walk that won't lead anywhere. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: merge blkg_lookup_create and __blkg_lookup_createChristoph Hellwig1-31/+18
No good reason to keep these two functions split. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: move the bio cgroup associatation helpers to blk-cgroup.cChristoph Hellwig3-107/+101
Keep the cgroup code together. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: move bio_associate_blkg_from_page to mm/page_io.cChristoph Hellwig3-33/+17
bio_associate_blkg_from_page is a special purpose helper for swap bios that doesn't need access to bio internals. Move it to the swap code instead of having it in bio.c. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: merge __bio_associate_blkg into bio_associate_blkg_from_cssChristoph Hellwig1-32/+13
Merge __bio_associate_blkg into the only caller, which allows to slightly reduce the RCU crticial section and better explain the code flow. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: really clone the block cgroup in bio_clone_blkg_associationChristoph Hellwig1-6/+6
bio_clone_blkg_association is supposed to clone the associatation, but actually ends up doing a search with a tryget. As we know we have a reference on the source cgroup just get an unconditional additional reference to it and call it a day. That also removes the need for a RCU critical section. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29block: remove bio_disassociate_blkgChristoph Hellwig2-21/+8
bio_disassociate_blkg has two callers, of which one immediately assigns a new value to >bi_blkg. Just open code the function in the two callers. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29dm: use bio_uninit instead of bio_disassociate_blkgChristoph Hellwig1-3/+2
bio_uninit is the proper API to clean up a BIO that has been allocated on stack or inside a structure that doesn't come from the BIO allocator. Switch dm to use that instead of bio_disassociate_blkg, which really is an implementation detail. Note that the bio_uninit calls are also moved to the two callers of __send_empty_flush, so that they better pair with the bio_init calls used to initialize them. Acked-by: Tejun Heo <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-28blk-rq-qos: remove redundant finish_wait to rq_qos_wait.Guo Xuenan1-2/+0
It is no need do finish_wait twice after acquiring inflight. Signed-off-by: Guo Xuenan <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-25blktrace: Provide event for request mergingJan Kara3-0/+27
Currently blk-mq does not report any event when two requests get merged in the elevator. This then results in difficult to understand sequence of events like: ... 8,0 34 1579 0.608765271 2718 I WS 215023504 + 40 [dbench] 8,0 34 1584 0.609184613 2719 A WS 215023544 + 56 <- (8,4) 2160568 8,0 34 1585 0.609184850 2719 Q WS 215023544 + 56 [dbench] 8,0 34 1586 0.609188524 2719 G WS 215023544 + 56 [dbench] 8,0 3 602 0.609684162 773 D WS 215023504 + 96 [kworker/3:1H] 8,0 34 1591 0.609843593 0 C WS 215023504 + 96 [0] and you can only guess (after quite some headscratching since the above excerpt is intermixed with a lot of other IO) that request 215023544+56 got merged to request 215023504+40. Provide proper event for request merging like we used to do in the legacy block layer. Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: move struct block_device to blk_types.hChristoph Hellwig9-42/+46
Move the struct block_device definition together with most of the block layer definitions, as it has nothing to do with the rest of fs.h. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: reduce ifdef CONFIG_BLOCK madness in headersChristoph Hellwig3-63/+46
Large part of bio.h, blkdev.h and genhd.h are under ifdef CONFIG_BLOCK for no good reason. Only stub out function that are called from code that is not dependent on CONFIG_BLOCK and leave the harmless other declarations around. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24fs: move the buffer_heads_over_limit stub to buffer_head.hChristoph Hellwig2-1/+1
Move the !CONFIG_BLOCK stub to the same place as the non-stub declaration. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: move block-related definitions out of fs.hChristoph Hellwig12-94/+96
Move most of the block related definition out of fs.h into more suitable headers. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: simplify sb_is_blkdev_sbChristoph Hellwig1-12/+6
Just use IS_ENABLED instead of providing a stub for !CONFIG_BLOCK. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24fs: remove the mount_bdev and kill_block_super stubsChristoph Hellwig1-16/+0
No one calls these functions without CONFIG_BLOCK, so don't bother stubbing them out. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24fs: remove the HAVE_UNLOCKED_IOCTL and HAVE_COMPAT_IOCTL definesChristoph Hellwig1-6/+0
These are not defined anywhere, and contrary to the comments we really do not care about out of tree code at all. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24fs: remove an unused block_device_operations forward declarationChristoph Hellwig1-2/+0
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: mark bd_finish_claiming staticChristoph Hellwig2-5/+2
Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24tty/sysrq: emergency_thaw_all does not depend on CONFIG_BLOCKChristoph Hellwig2-3/+1
We can also thaw non-block file systems. Remove the CONFIG_BLOCK in sysrq.c after making the prototype available unconditionally. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24nvme-rdma: fix a missing completion with remove invalidationChristoph Hellwig1-2/+3
Revert and incorret transformation that caused requests using remote invalidation to never complete. Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper") Reported-by: Bart Van Assche <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Tested-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-iocost: Use struct_size() in kzalloc_node()Gustavo A. R. Silva1-2/+1
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <[email protected]> Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: bio: Use struct_size() in kmalloc()Gustavo A. R. Silva1-3/+1
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <[email protected]> Addresses-KSPP-ID: https://github.com/KSPP/linux/issues/83 Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: create the request_queue debugfs_dir on registrationLuis Chamberlain6-48/+39
We were only creating the request_queue debugfs_dir only for make_request block drivers (multiqueue), but never for request-based block drivers. We did this as we were only creating non-blktrace additional debugfs files on that directory for make_request drivers. However, since blktrace *always* creates that directory anyway, we special-case the use of that directory on blktrace. Other than this being an eye-sore, this exposes request-based block drivers to the same debugfs fragile race that used to exist with make_request block drivers where if we start adding files onto that directory we can later run a race with a double removal of dentries on the directory if we don't deal with this carefully on blktrace. Instead, just simplify things by always creating the request_queue debugfs_dir on request_queue registration. Rename the mutex also to reflect the fact that this is used outside of the blktrace context. Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blktrace: ensure our debugfs dir existsLuis Chamberlain1-0/+12
We make an assumption that a debugfs directory exists, but since this can fail ensure it exists before allowing blktrace setup to complete. Otherwise we end up stuffing blktrace files on the debugfs root directory. In the worst case scenario this *in theory* can create an eventual panic *iff* in the future a similarly named file is created prior on the debugfs root directory. This theoretical crash can happen due to a recursive removal followed by a specific dentry removal. This doesn't fix any known crash, however I have seen the files go into the main debugfs root directory in cases where the debugfs directory was not created due to other internal bugs with blktrace now fixed. blktrace is also completely useless without this directory, so this ensures to userspace we only setup blktrace if the kernel can stuff files where they are supposed to go into. debugfs directory creations typically aren't checked for, and we have maintainers doing sweep removals of these checks, but since we need this check to ensure proper userspace blktrace functionality we make sure to annotate the justification for the check. Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blktrace: fix debugfs use after freeLuis Chamberlain1-6/+12
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory") merged on v4.12 Omar fixed the original blktrace code for request-based drivers (multiqueue). This however left in place a possible crash, if you happen to abuse blktrace while racing to remove / add a device. We used to use asynchronous removal of the request_queue, and with that the issue was easier to reproduce. Now that we have reverted to synchronous removal of the request_queue, the issue is still possible to reproduce, its however just a bit more difficult. We essentially run two instances of break-blktrace which add/remove a loop device, and setup a blktrace and just never tear the blktrace down. We do this twice in parallel. This is easily reproduced with the script run_0004.sh from break-blktrace [0]. We can end up with two types of panics each reflecting where we race, one a failed blktrace setup: [ 252.426751] debugfs: Directory 'loop0' with parent 'block' already present! [ 252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0 [ 252.436592] #PF: supervisor write access in kernel mode [ 252.439822] #PF: error_code(0x0002) - not-present page [ 252.442967] PGD 0 P4D 0 [ 252.444656] Oops: 0002 [#1] SMP NOPTI [ 252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164 [ 252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 [ 252.456343] RIP: 0010:down_write+0x15/0x40 [ 252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d [ 252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246 [ 252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000 [ 252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0 [ 252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001 [ 252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000 [ 252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0 [ 252.473346] FS: 00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000 [ 252.475225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0 [ 252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 252.479866] Call Trace: [ 252.480322] simple_recursive_removal+0x4e/0x2e0 [ 252.481078] ? debugfs_remove+0x60/0x60 [ 252.481725] ? relay_destroy_buf+0x77/0xb0 [ 252.482662] debugfs_remove+0x40/0x60 [ 252.483518] blk_remove_buf_file_callback+0x5/0x10 [ 252.484328] relay_close_buf+0x2e/0x60 [ 252.484930] relay_open+0x1ce/0x2c0 [ 252.485520] do_blk_trace_setup+0x14f/0x2b0 [ 252.486187] __blk_trace_setup+0x54/0xb0 [ 252.486803] blk_trace_ioctl+0x90/0x140 [ 252.487423] ? do_sys_openat2+0x1ab/0x2d0 [ 252.488053] blkdev_ioctl+0x4d/0x260 [ 252.488636] block_ioctl+0x39/0x40 [ 252.489139] ksys_ioctl+0x87/0xc0 [ 252.489675] __x64_sys_ioctl+0x16/0x20 [ 252.490380] do_syscall_64+0x52/0x180 [ 252.491032] entry_SYSCALL_64_after_hwframe+0x44/0xa9 And the other on the device removal: [ 128.528940] debugfs: Directory 'loop0' with parent 'block' already present! [ 128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0 [ 128.619537] #PF: supervisor write access in kernel mode [ 128.622700] #PF: error_code(0x0002) - not-present page [ 128.625842] PGD 0 P4D 0 [ 128.627585] Oops: 0002 [#1] SMP NOPTI [ 128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G E 5.7.0-rc2-next-20200420+ #164 [ 128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 [ 128.640471] RIP: 0010:down_write+0x15/0x40 [ 128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 45 08 5d [ 128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246 [ 128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000 [ 128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0 [ 128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0 [ 128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000 [ 128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0 [ 128.660500] FS: 00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000 [ 128.662204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0 [ 128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 128.667282] Call Trace: [ 128.667801] simple_recursive_removal+0x4e/0x2e0 [ 128.668663] ? debugfs_remove+0x60/0x60 [ 128.669368] debugfs_remove+0x40/0x60 [ 128.669985] blk_trace_free+0xd/0x50 [ 128.670593] __blk_trace_remove+0x27/0x40 [ 128.671274] blk_trace_shutdown+0x30/0x40 [ 128.671935] blk_release_queue+0x95/0xf0 [ 128.672589] kobject_put+0xa5/0x1b0 [ 128.673188] disk_release+0xa2/0xc0 [ 128.673786] device_release+0x28/0x80 [ 128.674376] kobject_put+0xa5/0x1b0 [ 128.674915] loop_remove+0x39/0x50 [loop] [ 128.675511] loop_control_ioctl+0x113/0x130 [loop] [ 128.676199] ksys_ioctl+0x87/0xc0 [ 128.676708] __x64_sys_ioctl+0x16/0x20 [ 128.677274] do_syscall_64+0x52/0x180 [ 128.677823] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The common theme here is: debugfs: Directory 'loop0' with parent 'block' already present This crash happens because of how blktrace uses the debugfs directory where it places its files. Upon init we always create the same directory which would be needed by blktrace but we only do this for make_request drivers (multiqueue) block drivers. When you race a removal of these devices with a blktrace setup you end up in a situation where the make_request recursive debugfs removal will sweep away the blktrace files and then later blktrace will also try to remove individual dentries which are already NULL. The inverse is also possible and hence the two types of use after frees. We don't create the block debugfs directory on init for these types of block devices: * request-based block driver block devices * every possible partition * scsi-generic And so, this race should in theory only be possible with make_request drivers. We can fix the UAF by simply re-using the debugfs directory for make_request drivers (multiqueue) and only creating the ephemeral directory for the other type of block devices. The new clarifications on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace *prior* to processing a blktrace ensures the debugfs directories are only created if no possible directory name clashes are possible. This goes tested with: o nvme partitions o ISCSI with tgt, and blktracing against scsi-generic with: o block o tape o cdrom o media changer o blktests This patch is part of the work which disputes the severity of CVE-2019-19770 which shows this issue is not a core debugfs issue, but a misuse of debugfs within blktace. Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory") Reported-by: [email protected] Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Nicolai Stange <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: yu kuai <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24loop: be paranoid on exit and prevent new additions / removalsLuis Chamberlain1-0/+4
Be pedantic on removal as well and hold the mutex. This should prevent uses of addition while we exit. Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blktrace: annotate required lock on do_blk_trace_setup()Luis Chamberlain1-0/+2
Ensure it is clear which lock is required on do_blk_trace_setup(). Suggested-by: Bart Van Assche <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: revert back to synchronous request_queue removalLuis Chamberlain4-23/+47
Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on v4.12 moved the work behind blk_release_queue() into a workqueue after a splat floated around which indicated some work on blk_release_queue() could sleep in blk_exit_rl(). This splat would be possible when a driver called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue() as its final call) from an atomic context. blk_put_queue() decrements the refcount for the request_queue kobject, and upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is now removed through commit db6d99523560 ("block: remove request_list code") on v5.0, we reserve the right to be able to sleep within blk_release_queue() context. The last reference for the request_queue must not be called from atomic context. *When* the last reference to the request_queue reaches 0 varies, and so let's take the opportunity to document when that is expected to happen and also document the context of the related calls as best as possible so we can avoid future issues, and with the hopes that the synchronous request_queue removal sticks. We revert back to synchronous request_queue removal because asynchronous removal creates a regression with expected userspace interaction with several drivers. An example is when removing the loopback driver, one uses ioctls from userspace to do so, but upon return and if successful, one expects the device to be removed. Likewise if one races to add another device the new one may not be added as it is still being removed. This was expected behavior before and it now fails as the device is still present and busy still. Moving to asynchronous request_queue removal could have broken many scripts which relied on the removal to have been completed if there was no error. Document this expectation as well so that this doesn't regress userspace again. Using asynchronous request_queue removal however has helped us find other bugs. In the future we can test what could break with this arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE. While at it, update the docs with the context expectations for the request_queue / gendisk refcount decrement, and make these expectations explicit by using might_sleep(). Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression") Suggested-by: Nicolai Stange <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Nicolai Stange <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: yu kuai <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: clarify context for refcount increment helpersLuis Chamberlain2-0/+8
Let us clarify the context under which the helpers to increment the refcount for the gendisk and request_queue can be called under. We make this explicit on the places where we may sleep with might_sleep(). We don't address the decrement context yet, as that needs some extra work and fixes, but will be addressed in the next patch. Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24block: add docs for gendisk / request_queue refcount helpersLuis Chamberlain2-1/+62
This adds documentation for the gendisk / request_queue refcount helpers. Signed-off-by: Luis Chamberlain <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24nvme: use blk_mq_complete_request_remote to avoid an indirect function callChristoph Hellwig6-9/+18
Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24nvme-rdma: factor out a nvme_rdma_end_request helperChristoph Hellwig1-19/+17
Factor a small sniplet of duplicated code into a new helper in preparation for making this sniplet a little bit less trivial. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: add a new blk_mq_complete_request_remote APIChristoph Hellwig2-19/+27
This is a variant of blk_mq_complete_request_remote that only completes the request if it needs to be bounced to another CPU or a softirq. If the request can be completed locally the function returns false and lets the driver complete it without requring and indirect function call. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: factor out a blk_mq_complete_need_ipi helperChristoph Hellwig1-18/+21
Add a helper to decide if we can complete locally or need an IPI. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: remove the get_cpu/put_cpu pair in blk_mq_complete_requestChristoph Hellwig1-2/+1
We don't really care if we get migrated during the I/O completion. In the worth case we either perform an IPI that wasn't required, or complete the request on a CPU which we just migrated off. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: move failure injection out of blk_mq_complete_requestChristoph Hellwig19-72/+61
Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: merge the softirq vs non-softirq IPI logicChristoph Hellwig1-65/+20
Both the softirq path for single queue devices and the multi-queue completion handler share the same logic to figure out if we need an IPI for the completion and eventually issue it. Merge the two versions into a single unified code path. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: short cut the IPI path in blk_mq_force_complete_rq for !SMPChristoph Hellwig1-1/+2
Let the compile optimize out the entire IPI path, given that we are obviously not going to use it. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-24blk-mq: complete polled requests directlyChristoph Hellwig1-6/+11
Even for single queue devices there is no point in offloading a polled completion to the softirq, given that blk_mq_force_complete_rq is called from the polling thread in that case and thus there are no starvation issues. Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>