aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-09-30io_uring: add io_uring_cmd_import_fixedAnuj Gupta2-0/+18
This is a new helper that callers can use to obtain a bvec iterator for the previously mapped buffer. This is preparatory work to enable fixed-buffer support for io_uring_cmd. Signed-off-by: Anuj Gupta <[email protected]> Signed-off-by: Kanchan Joshi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-30nvme: enable batched completions of passthrough IOJens Axboe1-2/+1
Now that the normal passthrough end_io path doesn't need the request anymore, we can kill the explicit blk_mq_free_request() and just pass back RQ_END_IO_FREE instead. This enables the batched completion from freeing batches of requests at the time. This brings passthrough IO performance at least on par with bdev based O_DIRECT with io_uring. With this and batche allocations, peak performance goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also about 10% faster than previously, going from ~61M to ~67M IOPS. Reviewed-by: Anuj Gupta <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Keith Busch <[email protected]> Co-developed-by: Stefan Roesch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-30nvme: split out metadata vs non metadata end_io uring_cmd completionsJens Axboe1-18/+61
By splitting up the metadata and non-metadata end_io handling, we can remove any request dependencies on the normal non-metadata IO path. This is in preparation for enabling the normal IO passthrough path to pass the ownership of the request back to the block layer. Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Anuj Gupta <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Keith Busch <[email protected]> Co-developed-by: Stefan Roesch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-30block: allow end_io based requests in the completion batch handlingJens Axboe2-3/+13
With end_io handlers now being able to potentially pass ownership of the request upon completion, we can allow requests with end_io handlers in the batch completion handling. Reviewed-by: Anuj Gupta <[email protected]> Reviewed-by: Keith Busch <[email protected]> Co-developed-by: Stefan Roesch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-30block: change request end_io handler to pass back a return valueJens Axboe13-29/+65
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-30block: enable batched allocation for blk_mq_alloc_request()Jens Axboe1-9/+71
The filesystem IO path can take advantage of allocating batches of requests, if the underlying submitter tells the block layer about it through the blk_plug. For passthrough IO, the exported API is the blk_mq_alloc_request() helper, and that one does not allow for request caching. Wire up request caching for blk_mq_alloc_request(), which is generally done without having a bio available upfront. Tested-by: Anuj Gupta <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-30block: kill deprecated BUG_ON() in the flush handlingJens Axboe1-1/+0
We've never had any useful reports from this BUG_ON(), and in fact a number of the BUG_ON()'s in the flush handling need to be turned into more graceful handling. In preparation for allowing batched completions of the end_io handling, where we can enter the flush completion with queuelist having been reused for the batch, get rid of this BUG_ON(). Signed-off-by: Jens Axboe <[email protected]>
2022-09-30Merge branch 'for-6.1/io_uring' into for-6.1/passthroughJens Axboe30-326/+859
* for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-30Merge branch 'for-6.1/block' into for-6.1/passthroughJens Axboe135-1644/+3204
* for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <[email protected]>
2022-09-29sbitmap: fix lockup while swappingHugh Dickins1-1/+1
Commit 4acb83417cad ("sbitmap: fix batched wait_cnt accounting") is a big improvement: without it, I had to revert to before commit 040b83fcecfb ("sbitmap: fix possible io hung due to lost wakeup") to avoid the high system time and freezes which that had introduced. Now okay on the NVME laptop, but 4acb83417cad is a disaster for heavy swapping (kernel builds in low memory) on another: soon locking up in sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling around with waitqueue_active() but wait_cnt 0 . Here is a backtrace, showing the common pattern of outer sbitmap_queue_wake_up() interrupted before setting wait_cnt 0 back to wake_batch (in some cases other CPUs are idle, in other cases they're spinning for a lock in dd_bio_merge()): sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request < scsi_end_request < scsi_io_completion < scsi_finish_command < scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq < __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt < _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up < sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag < __blk_mq_free_request < blk_mq_free_request < dd_bio_merge < blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio < __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct < submit_bio < __swap_writepage < swap_writepage < pageout < shrink_folio_list < evict_folios < lru_gen_shrink_lruvec < shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages < __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio < do_anonymous_page < __handle_mm_fault < handle_mm_fault < do_user_addr_fault < exc_page_fault < asm_exc_page_fault See how the process-context sbitmap_queue_wake_up() has been interrupted, after bringing wait_cnt down to 0 (and in this example, after doing its wakeups), before advancing wake_index and refilling wake_cnt: an interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck. I have almost no grasp of all the possible sbitmap races, and their consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0, so it is better if sbq_wake_ptr() skips on to the next ws in that case: which fixes the lockup and shows no adverse consequence for me. The check for wait_cnt being 0 is obviously racy, and ultimately can lead to lost wakeups: for example, when there is only a single waitqueue with waiters. However, lost wakeups are unlikely to matter in these cases, and a proper fix requires redesign (and benchmarking) of the batched wakeup code: so let's plug the hole with this bandaid for now. Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Keith Busch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-29io_uring/net: fix notif cqe reorderingPavel Begunkov1-5/+19
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so we can't use task-tw ordering trick to order notification cqes with requests completions. In this case leave it alone and let io_send_zc_cleanup() flush it. Cc: [email protected] Fixes: 53bdc88aac9a2 ("io_uring/notif: order notif vs send CQEs") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-09-29io_uring/net: don't update msg_name if not providedPavel Begunkov1-1/+2
io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't provide it, we should retain NULL in this case. Cc: [email protected] Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-09-29io_uring: don't gate task_work run on TIF_NOTIFY_SIGNALJens Axboe1-4/+4
This isn't a reliable mechanism to tell if we have task_work pending, we really should be looking at whether we have any items queued. This is problematic if forward progress is gated on running said task_work. One such example is reading from a pipe, where the write side has been closed right before the read is started. The fput() of the file queues TWA_RESUME task_work, and we need that task_work to be run before ->release() is called for the pipe. If ->release() isn't called, then the read will sit forever waiting on data that will never arise. Fix this by io_run_task_work() so it checks if we have task_work pending rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously doesn't work for task_work that is queued without TWA_SIGNAL. Reported-by: Christiano Haesbaert <[email protected]> Cc: [email protected] Link: https://github.com/axboe/liburing/issues/665 Signed-off-by: Jens Axboe <[email protected]>
2022-09-29io_uring/rw: defer fsnotify calls to task contextJens Axboe1-9/+15
We can't call these off the kiocb completion as that might be off soft/hard irq context. Defer the calls to when we process the task_work for this request. That avoids valid complaints like: stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3961 [inline] valid_state kernel/locking/lockdep.c:3973 [inline] mark_lock_irq kernel/locking/lockdep.c:4176 [inline] mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632 mark_lock kernel/locking/lockdep.c:4596 [inline] mark_usage kernel/locking/lockdep.c:4527 [inline] __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007 lock_acquire kernel/locking/lockdep.c:5666 [inline] lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline] fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688 might_alloc include/linux/sched/mm.h:271 [inline] slab_pre_alloc_hook mm/slab.h:700 [inline] slab_alloc mm/slab.c:3278 [inline] __kmem_cache_alloc_lru mm/slab.c:3471 [inline] kmem_cache_alloc+0x39/0x520 mm/slab.c:3491 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline] fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline] fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948 send_to_group fs/notify/fsnotify.c:360 [inline] fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230 fsnotify_parent include/linux/fsnotify.h:77 [inline] fsnotify_file include/linux/fsnotify.h:99 [inline] fsnotify_access include/linux/fsnotify.h:309 [inline] __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline] iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178 bio_endio+0x5f9/0x780 block/bio.c:1564 req_bio_endio block/blk-mq.c:695 [inline] blk_update_request+0x3fc/0x1300 block/blk-mq.c:825 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571 invoke_softirq kernel/softirq.c:445 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240 Fixes: f63cf5192fe3 ("io_uring: ensure that fsnotify is always called") Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/ Reported-by: [email protected] Reported-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2022-09-29block: add rationale for not using blk_mq_plug() when applicablePankaj Raghav2-0/+12
There are two places in the block layer at the moment where blk_mq_plug() helper could be used instead of directly accessing the plug from struct current. In both these cases, directly accessing the plug should not have any consequences for zoned devices. Make the intent explicit by adding comments instead of introducing unwanted checks with blk_mq_plug() helper.[1] [1] https://lore.kernel.org/linux-block/[email protected]/ Signed-off-by: Pankaj Raghav <[email protected]> Suggested-by: Jens Axboe <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: fixup multi-line comment style] Signed-off-by: Jens Axboe <[email protected]>
2022-09-29block: adapt blk_mq_plug() to not plug for writes that require a zone lockPankaj Raghav3-7/+14
The current implementation of blk_mq_plug() disables plugging for all operations that involves a transfer to the device as we just check if the last bit in op_is_write() function. Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and REQ_OP_WRITE_ZEROS as they might require a zone lock. Suggested-by: Christoph Hellwig <[email protected]> Suggested-by: Damien Le Moal <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Pankaj Raghav <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-29io_uring/net: fix fast_iov assignment in io_setup_async_msg()Stefan Metzmacher1-2/+4
I hit a very bad problem during my tests of SENDMSG_ZC. BUG(); in first_iovec_segment() triggered very easily. The problem was io_setup_async_msg() in the partial retry case, which seems to happen more often with _ZC. iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset being only relative to the first element. Which means kmsg->msg.msg_iter.iov is no longer the same as kmsg->fast_iov. But this would rewind the copy to be the start of async_msg->fast_iov, which means the internal state of sync_msg->msg.msg_iter is inconsitent. I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608 and got a short writes with: - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244 - ret=-EAGAIN => io_uring_poll_arm - ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469 - ret=-EAGAIN => io_uring_poll_arm - ret=802223 min_ret=802223 => res=8388692 While this was easily triggered with SENDMSG_ZC (queued for 6.1), it was a potential problem starting with 7ba89d2af17aa879dda30f5d5d3f152e587fc551 in 5.18 for IORING_OP_RECVMSG. And also with 4c3c09439c08b03d9503df0ca4c7619c5842892e in 5.19 for IORING_OP_SENDMSG. However 257e84a5377fbbc336ff563833a8712619acce56 introduced the critical code into io_setup_async_msg() in 5.11. Fixes: 7ba89d2af17aa ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Fixes: 257e84a5377fb ("io_uring: refactor sendmsg/recvmsg iov managing") Cc: [email protected] Signed-off-by: Stefan Metzmacher <[email protected]> Reviewed-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org Signed-off-by: Jens Axboe <[email protected]>
2022-09-28io_uring/net: fix non-zc send with addressPavel Begunkov1-6/+6
We're currently ignoring the dest address with non-zerocopy send because even though we copy it from the userspace shortly after ->msg_name gets zeroed. Move msghdr init earlier. Fixes: 516e82f0e043a ("io_uring/net: support non-zerocopy sendto") Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-09-28Merge tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme into ↵Jens Axboe18-435/+356
for-6.1/block Pull NVMe updates from Christoph: "nvme updates for Linux 6.1 - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)" * tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme: (31 commits) nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data nvme-tcp: remove the unused queue_size member in nvme_tcp_queue nvme: add common helpers to allocate and free tagsets nvme-auth: add a MAINTAINERS entry nvmet: add helpers to set the result field for connect commands nvme: improve the NVME_CONNECT_AUTHREQ* definitions nvmet-auth: don't try to cancel a non-initialized work_struct nvmet-tcp: remove nvmet_tcp_finish_cmd ...
2022-09-28s390/dasd: use blk_mq_alloc_diskChristoph Hellwig7-88/+39
As far as I can tell there is no need for the staged setup in dasd, so allocate the tagset and the disk with the queue in dasd_gendisk_alloc. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Stefan Haberland <[email protected]> Signed-off-by: Stefan Haberland <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-28io_uring/net: don't skip notifs for failed requestsPavel Begunkov1-21/+8
We currently only add a notification CQE when the send succeded, i.e. cqe.res >= 0. However, it'd be more robust to do buffer notifications for failed requests as well in case drivers decide do something fanky. Always return a buffer notification after initial prep, don't hide it. This behaviour is better aligned with documentation and the patch also helps the userspace to respect it. Cc: [email protected] # 6.0 Suggested-by: Stefan Metzmacher <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-09-27blk-cgroup: don't update the blkg lookup hint in blkg_conf_prepChristoph Hellwig1-13/+4
blkg_conf_prep just creates a new blkg structure, there is no real need to update the lookup hint which should only be done on a successful lookup in the I/O path. Suggested-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27nvmet: don't look at the request_queue in nvmet_bdev_set_limitsChristoph Hellwig1-6/+5
nvmet is a consumer of the block layer and should not directly look at the request_queue. Use the bdev_ helpers to retrieve the device limits instead. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]>
2022-09-27nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_allChristoph Hellwig1-2/+1
nvmet is a consumer of the block layer and should not directly look at the request_queue. Just use the NUMA node ID from the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]>
2022-09-27blk-mq: use quiesced elevator switch when reinitializing queuesKeith Busch3-7/+6
The hctx's run_work may be racing with the elevator switch when reinitializing hardware queues. The queue is merely frozen in this context, but that only prevents requests from allocating and doesn't stop the hctx work from running. The work may get an elevator pointer that's being torn down, and can result in use-after-free errors and kernel panics (example below). Use the quiesced elevator switch instead, and make the previous one static since it is now only used locally. nvme nvme0: resetting controller nvme nvme0: 32/0/0 default/read/poll queues BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 80000020c8861067 P4D 80000020c8861067 PUD 250f8c8067 PMD 0 Oops: 0000 [#1] SMP PTI Workqueue: kblockd blk_mq_run_work_fn RIP: 0010:kyber_has_work+0x29/0x70 ... Call Trace: __blk_mq_do_dispatch_sched+0x83/0x2b0 __blk_mq_sched_dispatch_requests+0x12e/0x170 blk_mq_sched_dispatch_requests+0x30/0x60 __blk_mq_run_hw_queue+0x2b/0x50 process_one_work+0x1ef/0x380 worker_thread+0x2d/0x3e0 Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig5-8/+10
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-09-27nvme: remove nvme_ctrl_init_connect_qChristoph Hellwig1-8/+0
Unused now. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-loop: use the tagset alloc/free helpersChristoph Hellwig1-64/+19
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_loop_ctrl. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-loop: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-5/+5
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-loop: initialize sqsize laterChristoph Hellwig1-1/+1
Defer initializing the sqsize field from the options until it has been capped by MAXCMD. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-fc: use the tagset alloc/free helpersChristoph Hellwig1-66/+17
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_fc_ctrl. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: James Smart <[email protected]>
2022-09-27nvme-fc: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-20/+12
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code and use the chance the cleanup the init_hctx methods a bit. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: James Smart <[email protected]>
2022-09-27nvme-fc: keep ctrl->sqsize in sync with opts->queue_sizeChristoph Hellwig1-9/+1
Also update the sqsize field when capping the queue size, and remove the check a queue size that is larger than sqsize given that sqsize is only initialized from opts->queue_size. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: James Smart <[email protected]>
2022-09-27nvme-rdma: use the tagset alloc/free helpersChristoph Hellwig1-99/+34
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_rdma_ctrl. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-rdma: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-tcp: use the tagset alloc/free helpersChristoph Hellwig1-85/+16
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_tcp_ctrl. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-tcp: store the generic nvme_ctrl in set->driver_dataChristoph Hellwig1-6/+6
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-tcp: remove the unused queue_size member in nvme_tcp_queueChristoph Hellwig1-6/+3
->nvme_tcp_queue is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme: add common helpers to allocate and free tagsetsChristoph Hellwig2-0/+110
Add common helpers to allocate and tear down the admin and I/O tag sets, including the special queues allocated with them. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvme-auth: add a MAINTAINERS entryChristoph Hellwig1-0/+9
Add Hannes as the nvme-auth maintainer. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Hannes Reinecke <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-09-27nvmet: add helpers to set the result field for connect commandsChristoph Hellwig1-10/+8
The code to set the result field for the admin and I/O connect commands is not only verbose and duplicated, but also violates the aliasing rules as it accesses both the u16 and u32 members in the union. Add a little helper to sort all that out. Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-09-27nvme: improve the NVME_CONNECT_AUTHREQ* definitionsChristoph Hellwig2-6/+4
Mark them as unsigned so that we don't need extra casts, and define them relative to cdword0 instead of requiring extra shifts. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-09-27nvmet-auth: don't try to cancel a non-initialized work_structChristoph Hellwig4-14/+13
Currently blktests nvme/002 trips up debugobjects if CONFIG_NVME_AUTH is enabled, but authentication is not on a queue. This is because nvmet_auth_sq_free cancels sq->auth_expired_work unconditionaly, while auth_expired_work is only ever initialized if authentication is enabled for a given controller. Fix this by calling most of what is nvmet_init_auth unconditionally when initializing the SQ, and just do the setting of the result field in the connect command handler. Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2022-09-27nvmet-tcp: remove nvmet_tcp_finish_cmdzhenwei pi1-8/+2
There is only a single call-site of nvmet_tcp_finish_cmd(), this becomes redundant. Remove nvmet_tcp_finish_cmd() and use the original function body instead. Suggested-by: Sagi Grimberg <[email protected]> Signed-off-by: zhenwei pi <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvmet-tcp: add bounds check on Transfer TagVarun Prakash1-2/+9
ttag is used as an index to get cmd in nvmet_tcp_handle_h2c_data_pdu(), add a bounds check to avoid out-of-bounds access. Signed-off-by: Varun Prakash <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvmet-tcp: handle ICReq PDU received in NVMET_TCP_Q_LIVE stateVarun Prakash1-0/+7
As per NVMe/TCP transport specification ICReq PDU is the first PDU received by the controller and controller should receive only one ICReq PDU. If controller receives more than one ICReq PDU then this can be considered as fatal error. nvmet-tcp driver does not check for ICReq PDU opcode if queue state is NVMET_TCP_Q_LIVE. In LIVE state ICReq PDU is treated as CapsuleCmd PDU, this can result in abnormal behavior. Add a check for ICReq PDU in nvmet_tcp_done_recv_pdu() to fix this issue. Signed-off-by: Varun Prakash <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvmet-tcp: fix NULL pointer dereference during releasezhenwei pi1-3/+16
nvmet-tcp frees CMD buffers in nvmet_tcp_uninit_data_in_cmds(), and waits the inflight IO requests in nvmet_sq_destroy(). During wait the inflight IO requests, the callback nvmet_tcp_queue_response() is called from backend after IO complete, this leads a typical Use-After-Free issue like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 107f80067 P4D 107f80067 PUD 10789e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 123 Comm: kworker/1:1H Kdump: loaded Tainted: G E 6.0.0-rc2.bm.1-amd64 #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp] RIP: 0010:shash_ahash_digest+0x2b/0x110 Code: 1f 44 00 00 41 57 41 56 41 55 41 54 55 48 89 fd 53 48 89 f3 48 83 ec 08 44 8b 67 30 45 85 e4 74 1c 48 8b 57 38 b8 00 10 00 00 <44> 8b 7a 08 44 29 f8 39 42 0c 0f 46 42 0c 41 39 c4 76 43 48 8b 03 RSP: 0018:ffffc9000051bdd8 EFLAGS: 00010206 RAX: 0000000000001000 RBX: ffff888100ab5470 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff888100ab5470 RDI: ffff888100ab5420 RBP: ffff888100ab5420 R08: ffff8881024d08c8 R09: ffff888103e1b4b8 R10: 8080808080808080 R11: 0000000000000000 R12: 0000000000001000 R13: 0000000000000000 R14: ffff88813412bd4c R15: ffff8881024d0800 FS: 0000000000000000(0000) GS:ffff88883fa40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000104b48000 CR4: 0000000000350ee0 Call Trace: <TASK> nvmet_tcp_io_work+0xa52/0xb52 [nvmet_tcp] ? __switch_to+0x106/0x420 process_one_work+0x1ae/0x380 ? process_one_work+0x380/0x380 worker_thread+0x30/0x360 ? process_one_work+0x380/0x380 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 Separate nvmet_tcp_uninit_data_in_cmds() into two steps: uninit data in cmds <- new step 1 nvmet_sq_destroy(); cancel_work_sync(&queue->io_work); free CMD buffers <- new step 2 Signed-off-by: zhenwei pi <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvme-pci: report the actual number of tagset mapsKeith Busch1-2/+4
We've been reporting 2 maps regardless of whether the module parameter asked for anything beyond the default queues. A consequence of this means that blk-mq will reinitialize the all the hardware contexts and io schedulers on every controller reset when the mapping is exactly the same as before. This unnecessary overhead is adding several milliseconds on a reset for environments that don't need it. Report the actual number of mappings in use. Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvme-pci: set min_align_mask before calculating max_hw_sectorsRishabh Bhatnagar1-1/+2
If swiotlb is force enabled dma_max_mapping_size ends up calling swiotlb_max_mapping_size which takes into account the min align mask for the device. Set the min align mask for nvme driver before calling dma_max_mapping_size while calculating max hw sectors. Signed-off-by: Rishabh Bhatnagar <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-09-27nvme: send a rediscover uevent when a persistent discovery controller reconnectsSagi Grimberg2-0/+11
When a discovery controller is disconnected, no AENs will arrive to notify the host about discovery log change events. In order to solve this, send a uevent notification when a persistent discovery controller reconnects. We add a new ctrl flag NVME_CTRL_STARTED_ONCE that will be set on the first start, and consecutive calls will find it set, and send the event to userspace if the controller is a discovery controller. Upon the event reception, userspace will re-read the discovery log page and will act upon changes as it sees fit. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Reviewed-by: James Smart <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>