aboutsummaryrefslogtreecommitdiff
path: root/scripts/gdb/linux/utils.py
diff options
context:
space:
mode:
authorSagi Grimberg <sagi@grimberg.me>2022-09-28 09:23:26 +0300
committerChristoph Hellwig <hch@lst.de>2022-10-12 11:35:46 +0200
commita1ae8d4d9be0178132df7c4931a1ba77d0e76039 (patch)
tree07c97ceece4624710d5be1e5cb454c1618c343fe /scripts/gdb/linux/utils.py
parent24a403340d70aad3667b3ee0f9a7aa5c0a5193a0 (diff)
nvme-rdma: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl which cancels ctrl error recovery work) 2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal 3. continue to teardown the controller However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that is originating from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2) as ns scanning is blocking on I/O (that will never be cancelled). The race is: 1. transport layer error observed -> err_work is scheduled 2. scan_work executes, discovers ns, generate I/O to it 3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed 4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing. Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O. Fixes: b435ecea2a4d ("nvme: Add .stop_ctrl to nvme ctrl ops") Fixes: f6c8e432cb04 ("nvme: flush namespace scanning work just before removing namespaces") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Diffstat (limited to 'scripts/gdb/linux/utils.py')
0 files changed, 0 insertions, 0 deletions