aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-11-16dm: remove free_table_devicesChristoph Hellwig1-14/+1
free_table_devices just warns and frees all table_device structures when the target removal did not remove them. This should never happen, but if it did, just freeing the structure without deleting them from the list or cleaning up the resources would not help at all. So just WARN on a non-empty list instead. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: clear ->slave_dir when dropping the main slave_dir referenceChristoph Hellwig1-0/+2
Zero out the pointer to ->slave_dir so that the holder code doesn't incorrectly treat the object as alive when add_disk failed or after del_gendisk was called. Fixes: 89f871af1b26 ("dm: delay registering the gendisk") Reported-by: Yu Kuai <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16sbitmap: Try each queue to wake up at least one waiterGabriel Krisman Bertazi1-16/+12
Jan reported the new algorithm as merged might be problematic if the queue being awaken becomes empty between the waitqueue_active inside sbq_wake_ptr check and the wake up. If that happens, wake_up_nr will not wake up any waiter and we loose too many wake ups. In order to guarantee progress, we need to wake up at least one waiter here, if there are any. This now requires trying to wake up from every queue. Instead of walking through all the queues with sbq_wake_ptr, this call moves the wake up inside that function. In a previous version of the patch, I found that updating wake_index several times when walking through queues had a measurable overhead. This ensures we only update it once, at the end. Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16wait: Return number of exclusive waiters awakenGabriel Krisman Bertazi2-8/+12
Sbitmap code will need to know how many waiters were actually woken for its batched wakeups implementation. Return the number of woken exclusive waiters from __wake_up() to facilitate that. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16sbitmap: Advance the queue index before waking up a queueGabriel Krisman Bertazi1-2/+8
When a queue is awaken, the wake_index written by sbq_wake_ptr currently keeps pointing to the same queue. On the next wake up, it will thus retry the same queue, which is unfair to other queues, and can lead to starvation. This patch, moves the index update to happen before the queue is returned, such that it will now try a different queue first on the next wake up, improving fairness. Fixes: 4f8126bb2308 ("sbitmap: Use single per-bitmap counting to wake up queued tags") Reported-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16block: remove blkdev_writepagesChristoph Hellwig1-7/+0
While the block device code should switch to implementing ->writepages instead of ->writepage eventually, the current implementation is entirely pointless as it does the same looping over ->writepage as the generic code if no ->writepages is present. Remove blkdev_writepages so that we can eventually unexport generic_writepages. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2022-11-16io_uring/rw: enable bio caches for IRQ rwPavel Begunkov1-1/+2
Now we can use IOCB_ALLOC_CACHE not only for iopoll'ed reads/write but also for normal IRQ driven I/O. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/fb8bd092ed5a4a3b037e84e4777074d07aa5639a.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: shrink max number of pcpu cached biosPavel Begunkov1-1/+1
The downside of the bio pcpu cache is that bios of a cpu will be never freed unless there is new I/O issued from that cpu. We currently keep max 512 bios, which feels too much, half it. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/bc198e8efb27d8c740d80c8ce477432729075096.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: add pcpu caching for non-polling bio_putPavel Begunkov1-16/+54
This patch extends REQ_ALLOC_CACHE to IRQ completions, whenever currently it's only limited to iopoll. Instead of guarding the list with irq toggling on alloc, which is expensive, it keeps an additional irq-safe list from which bios are spliced in batches to ammortise overhead. On the put side it toggles irqs, but in many cases they're already disabled and so cheap. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/c2306de96b900ab9264f4428ec37768ddcf0da36.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: split pcpu cache part of bio_put into a helperPavel Begunkov1-13/+25
Extract a helper out of bio_put for recycling into percpu caches. It's a preparation patch without functional changes. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/e97ab2026a89098ee1bfdd09bcb9451fced95f87.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16bio: don't rob starving biosets of biosPavel Begunkov1-0/+2
Biosets keep a mempool, so as long as requests complete we can always can allocate and have forward progress. Percpu bio caches break that assumptions as we may complete into the cache of one CPU and after try and fail to allocate with another CPU. We also can't grab from another CPU's cache without tricky sync. If we're allocating with a bio while the mempool is undersaturated, remove REQ_ALLOC_CACHE flag, so on put it will go straight to mempool. It might try to free into mempool more requests than required, but assuming than there is no memory starvation in the system it'll stabilise and never hit that path. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/aa150caf9c263fa92269e86d7826cc8fa65f38de.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16mempool: introduce mempool_is_saturatedPavel Begunkov1-0/+5
Introduce a helper mempool_is_saturated(), which tells if the mempool is under-filled or not. We need it to figure out whether it should be freed right into the mempool or could be cached with top level caches. Signed-off-by: Pavel Begunkov <[email protected]> Link: https://lore.kernel.org/r/636aed30be8c35d78f45e244998bc6209283cccc.1667384020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <[email protected]>
2022-11-16nvmet: fix a memory leak in nvmet_auth_set_keySagi Grimberg1-0/+2
When changing dhchap secrets we need to release the old secrets as well. kmemleak complaint: -- unreferenced object 0xffff8c7f44ed8180 (size 64): comm "check", pid 7304, jiffies 4295686133 (age 72034.246s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 4c 64 4c 4f 64 71 DHHC-1:00:LdLOdq 79 56 69 67 77 48 55 32 6d 5a 59 4c 7a 35 59 38 yVigwHU2mZYLz5Y8 backtrace: [<00000000b6fc5071>] kstrdup+0x2e/0x60 [<00000000f0f4633f>] 0xffffffffc0e07ee6 [<0000000053006c05>] 0xffffffffc0dff783 [<00000000419ae922>] configfs_write_iter+0xb1/0x120 [<000000008183c424>] vfs_write+0x2be/0x3c0 [<000000009005a2a5>] ksys_write+0x5f/0xe0 [<00000000cd495c89>] do_syscall_64+0x38/0x90 [<00000000f2a84ac5>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme: return err on nvme_init_non_mdts_limits failJoel Granados1-1/+1
In nvme_init_non_mdts_limits function we were returning 0 when kzalloc failed; it now returns -ENOMEM. Fixes: 5befc7c26e5a ("nvme: implement non-mdts command limits") Signed-off-by: Joel Granados <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme: avoid fallback to sequential scan due to transient issuesUday Shankar1-4/+11
Currently, if nvme_scan_ns_list fails, nvme_scan_work will fall back to a sequential scan. nvme_scan_ns_list can fail for a variety of reasons, e.g. a transient transport issue, and the resulting sequential scan can be extremely expensive on controllers reporting an NN value close to the maximum allowed (> 4 billion). Avoid sequential scans wherever possible by only falling back to them in two cases: - When the NVMe version supported (VS) value reported by the device is older than NVME_VS(1, 1, 0), before which support of Identify NS List not required. - When the Identify NS List command fails with the DNR bit set in the status. This is to accommodate (non-compliant) devices which report a VS value which implies support for Identify NS List, but nevertheless do not support the command. Such devices will most likely fail the command with the DNR bit set. The third case is when the device claims support for Identify NS List but the command fails with DNR not set. In such cases, fallback to sequential scan is potentially expensive and likely unnecessary, as a retry of the list scan should succeed. So this change skips the fallback in this third case. Signed-off-by: Uday Shankar <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-rdma: stop auth work after tearing down queues in error recoverySagi Grimberg1-1/+1
when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is tearing down there is no chance for the I/O to complete other than timing out which may unnecessarily take a full io timeout. So first tear down the queues, fail/cancel all inflight I/O (including potentially authentication) and only then stop authentication. This ensures that failover is not stalled due to blocked authentication I/O. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-tcp: stop auth work after tearing down queues in error recoverySagi Grimberg1-1/+1
when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is tearing down there is no chance for the I/O to complete other than timing out which may unnecessarily take a full io timeout. So first tear down the queues, fail/cancel all inflight I/O (including potentially authentication) and only then stop authentication. This ensures that failover is not stalled due to blocked authentication I/O. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: have dhchap_auth_work wait for queues auth to completeSagi Grimberg1-7/+6
It triggered the queue authentication work elements in parallel, but the ctrl authentication work itself completes when all of them completes. Hence wait for queues auth completions. This also makes nvme_auth_stop simply a sync cancel of ctrl dhchap_auth_work. Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: remove redundant auth_work flushSagi Grimberg1-6/+2
only ctrl deletion calls nvme_auth_free, which was stopped prior in the teardown stage, so there is no possibility that it should ever run when nvme_auth_free is called. As a result, we can remove a local chap pointer variable. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: convert dhchap_auth_list to an arraySagi Grimberg3-55/+69
We know exactly how many dhchap contexts we will need, there is no need to hold a list that we need to protect with a mutex. Convert to a dynamically allocated array. And dhchap_context access state is maintained by the chap itself. Make dhchap_auth_mutex protect only the ctrl host_key and ctrl_key in a fine-grained lock such that there is no long lasting acquisition of the lock and no need to take/release this lock when flushing authentication works. Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: check chap ctrl_key once constructedSagi Grimberg1-2/+2
ctrl ctrl_key member may be overwritten from a sysfs context driven by the user. Once a queue local copy was created, use that instead to minimize checks on a shared resource. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: no need to reset chap contexts on re-authenticationSagi Grimberg3-18/+0
Now that the chap context is reset upon completion, this is no longer needed. Also remove nvme_auth_reset as no callers are left. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: remove redundant deallocationsSagi Grimberg1-20/+0
These are now redundant as the dhchap context is removed after authentication completes. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: clear sensitive info right after authentication completesSagi Grimberg1-0/+2
We don't want to keep authentication sensitive info in memory for unlimited amount of time. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: guarantee dhchap buffers under memory pressureSagi Grimberg3-2/+43
We want to guarantee that we have chap buffers when a controller reconnects under memory pressure. Add a mempool specifically for that. Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: don't keep long lived 4k dhchap bufferSagi Grimberg1-23/+24
dhchap structure is per-queue, it is wasteful to keep it for the entire lifetime of the queue. Allocate it dynamically and get rid of it after authentication. We don't need kzalloc because all accessors are clearing it before writing to it. Also, remove redundant chap buf_size which is always 4096, use a define instead. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: remove redundant if statementSagi Grimberg1-1/+1
No one passes NVME_QID_ANY to nvme_auth_negotiate. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: don't override ctrl keys before validationSagi Grimberg1-2/+10
Replace ctrl ctrl_key/host_key only after nvme_auth_generate_key is successful. Also, this fixes a bug where the keys are leaked. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: don't ignore key generation failures when initializing ctrl keysSagi Grimberg3-7/+25
nvme_auth_generate_key can fail, don't ignore it upon initialization. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: remove redundant buffer deallocationsSagi Grimberg1-4/+0
host_response, host_key, ctrl_key and sess_key are freed in nvme_auth_reset_dhchap which is called from nvme_auth_free_dhchap. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: don't re-authenticate if the controller is not LIVESagi Grimberg1-0/+7
The connect sequence will re-authenticate. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: remove symbol export from nvme_auth_resetSagi Grimberg1-1/+0
Only the nvme module calls it. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: rename authentication work elementsSagi Grimberg1-4/+4
Use nvme_ctrl_auth_work and nvme_queue_auth_work for better readability. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-auth: rename __nvme_auth_[reset|free] to nvme_auth[reset|free]_dhchapSagi Grimberg1-6/+6
nvme_auth_[reset|free] operate on the controller while __nvme_auth_[reset|free] operate on a chap struct (which maps to a queue context). Rename it for clarity. Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-16nvme-pci: don't unbind the driver on reset failureChristoph Hellwig1-30/+10
Unbind a device driver when a reset fails is very unusual behavior. Just shut the controller down and leave it in dead state if we fail to reset it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-11-16nvme-pci: split the initial probe from the rest pathChristoph Hellwig1-53/+80
nvme_reset_work is a little fragile as it needs to handle both resetting a live controller and initializing one during probe. Split out the initial probe and open code it in nvme_probe and leave nvme_reset_work to just do the live controller reset. This fixes a recently introduced bug where nvme_dev_disable causes a NULL pointer dereferences in blk_mq_quiesce_tagset because the tagset pointer is not set when the reset state is entered directly from the new state. The separate probe code can skip the reset state and probe directly and fixes this. To make sure the system isn't single threaded on enabling nvme controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver structure so that the driver core probes in parallel. Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset") Reported-by: Gerd Bayer <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-16nvme-pci: move the HMPRE check into nvme_setup_host_memChristoph Hellwig1-5/+6
Check that a HMB is wanted into the allocation helper instead of the caller. This makes life simpler for an upcoming second caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-11-15nvme-pci: simplify nvme_dbbuf_dma_allocChristoph Hellwig1-16/+16
Move the OACS check and the error checking into nvme_dbbuf_dma_alloc so that an upcoming second caller doesn't have to duplicate this boilerplate code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>
2022-11-15nvme-pci: call nvme_pci_configure_admin_queue from nvme_pci_enableChristoph Hellwig1-5/+2
nvme_pci_configure_admin_queue is called right after nvme_pci_enable, and it's work is undone by nvme_dev_disable. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme-pci: set constant paramters in nvme_pci_alloc_ctrlChristoph Hellwig1-21/+17
Move setting of low-level constant parameters from nvme_reset_work to nvme_pci_alloc_ctrl. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme-pci: factor out a nvme_pci_alloc_dev helperChristoph Hellwig1-35/+46
Add a helper that allocates the nvme_dev structure up to the point where we can call nvme_init_ctrl. This pairs with the free_ctrl method and can thus be used to cleanup the teardown path and make it more symmetric. Note that this now calls nvme_init_ctrl a lot earlier during probing, which also means the per-controller character device shows up earlier. Due to the controller state no commnds can be send on it, but it might make sense to delay the cdev registration until nvme_init_ctrl_finish. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme-pci: factor the iod mempool creation into a helperChristoph Hellwig1-23/+18
Add a helper to create the iod mempool. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme-pci: move more teardown work to nvme_removeChristoph Hellwig1-2/+2
nvme_dbbuf_dma_free frees dma coherent memory, so it must not be called after ->remove has returned. Fortunately there is no way to use it after shutdown as no more I/O is possible so it can be moved. Similarly the iod_mempool can't be used for a device kept alive after shutdown, so move it next to freeing the PRP pools. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme-pci: put the admin queue in nvme_dev_remove_adminChristoph Hellwig1-2/+1
Once the controller is shutdown no one can access the admin queue. Tear it down in nvme_dev_remove_admin, which matches the flow in the other drivers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme: simplify transport specific device attribute handlingChristoph Hellwig3-17/+16
Allow the transport driver to override the attribute groups for the control device, so that the PCIe driver doesn't manually have to add a group after device creation and keep track of it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme: move OPAL setup from PCIe to coreChristoph Hellwig8-25/+29
Nothing about the TCG Opal support is PCIe transport specific, so move it to the core code. For this nvme_init_ctrl_finish grows a new was_suspended argument that allows the transport driver to tell the OPAL code if the controller came out of a suspend cycle. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: James Smart <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme: don't call nvme_init_ctrl_finish from nvme_passthru_endChristoph Hellwig1-2/+4
nvme_passthrough_end can race with a reset, which can lead to racing stores to the cels xarray as well as further shengians with upcoming more complicated initialization. So drop the call and just log that the controller capabilities might have changed and a reset could be required to use the new controller capabilities. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Tested-by Gerd Bayer <[email protected]>
2022-11-15nvme: implement the DEAC bit for the Write Zeroes commandChristoph Hellwig3-1/+14
While the specification allows devices to either deallocate data or to actually write zeroes on any Write Zeroes command, many SSDs only do the sensible thing and deallocate data when the DEAC bit is specific. Set it when it is supported and the caller doesn't explicitly opt out of deallocation. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]>
2022-11-15nvme: identify-namespace without CAP_SYS_ADMINKanchan Joshi1-2/+16
Allow all identify-namespace variants (CNS 00h, 05h and 08h) without requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is needed to form IO commands for passthrough interface. Signed-off-by: Kanchan Joshi <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2022-11-15nvme: fine-granular CAP_SYS_ADMIN for nvme io commandsKanchan Joshi2-33/+70
Currently both io and admin commands are kept under a coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely. $ ls -l /dev/ng* crw-rw-rw- 1 root root 242, 0 Sep 9 19:20 /dev/ng0n1 crw------- 1 root root 242, 1 Sep 9 19:20 /dev/ng0n2 In the example above, ng0n1 appears as if it may allow unprivileged read/write operation but it does not and behaves same as ng0n2. This patch implements a shift from CAP_SYS_ADMIN to more fine-granular control for io-commands. If CAP_SYS_ADMIN is present, nothing else is checked as before. Otherwise, following rules are in place - any admin-cmd is not allowed - vendor-specific and fabric commmand are not allowed - io-commands that can write are allowed if matching FMODE_WRITE permission is present - io-commands that read are allowed Add a helper nvme_cmd_allowed that implements above policy. Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed for any decision making. Since file open mode is counted for any approval/denial, change at various places to keep file-mode information handy. Signed-off-by: Kanchan Joshi <[email protected]> Reviewed-by: Jens Axboe <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>