2021-10-20  nvme-pci: clear shadow doorbell memory on resets  (Keith Busch, 1 file, -1/+8)

The host memory doorbell and event buffers need to be initialized on each reset so the driver doesn't observe stale values from the previous instantiation.

Signed-off-by: Keith Busch <[email protected]>
Tested-by: John Levon <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
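A minimal sketch of the idea, loosely modeled on the PCIe driver's dbbuf handling (field and helper names here are illustrative, not verified against the patch):

static int nvme_dbbuf_dma_alloc(struct nvme_dev *dev)
{
	unsigned int mem_size = nvme_dbbuf_size(dev);

	if (dev->dbbuf_dbs) {
		/*
		 * The buffers survive a controller reset; clear them
		 * instead of reallocating so no stale doorbell or
		 * event-index values from the previous instantiation
		 * are observed.
		 */
		memset(dev->dbbuf_dbs, 0, mem_size);
		memset(dev->dbbuf_eis, 0, mem_size);
		return 0;
	}

	/* ... first-time allocation via dma_alloc_coherent() ... */
	return 0;
}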
2021-10-20  nvme-rdma: fix error code in nvme_rdma_setup_ctrl  (Max Gurtovoy, 1 file, -0/+2)

In case icdoff is not zero, or mandatory keyed SGLs are not supported by the NVMe/RDMA target, we go to the error flow but return 0 to the caller. Fix it by returning an appropriate error code.

Fixes: c66e2998c8ca ("nvme-rdma: centralize controller setup sequence")
Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
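A sketch of the shape of the fix; the specific errno is an assumption here, since the message only says "an appropriate error code":

	if (ctrl->ctrl.icdoff) {
		dev_err(ctrl->ctrl.device, "icdoff is not supported!\n");
		ret = -EOPNOTSUPP;	/* was left at 0, caller saw success */
		goto destroy_admin;
	}

	if (!(ctrl->ctrl.sgls & (1 << 2))) {
		dev_err(ctrl->ctrl.device,
			"Mandatory keyed sgls are not supported!\n");
		ret = -EOPNOTSUPP;
		goto destroy_admin;
	}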
2021-10-20  nvme-multipath: add error handling support for add_disk()  (Luis Chamberlain, 1 file, -2/+12)

We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Since we can now tell for sure when a disk was added, set the NVME_NSHEAD_DISK_LIVE bit only when we did add the disk successfully. Nothing to do here as the cleanup is done elsewhere. We take care and use test_and_set_bit() because it protects against two nvme paths simultaneously calling device_add_disk() on the same namespace head.

Signed-off-by: Luis Chamberlain <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
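A hedged sketch of how the multipath set-live path might apply this, with the surrounding structure assumed from the description:

static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;
	int rc;

	if (!head->disk)
		return;

	/*
	 * test_and_set_bit() protects against two paths racing to add
	 * the same namespace head's disk.
	 */
	if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
		rc = device_add_disk(&head->subsys->dev, head->disk,
				     nvme_ns_id_attr_groups);
		if (rc) {
			clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
			return;
		}
	}
	/* ... */
}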
2021-10-20  nvmet: use macro definitions for setting cmic value  (Max Gurtovoy, 2 files, -1/+3)

This makes the code more readable.

Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet: use macro definition for setting nmic value  (Max Gurtovoy, 1 file, -1/+1)

This makes the code more readable.

Signed-off-by: Max Gurtovoy <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
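Illustratively, both of these patches replace literal bit values with the named definitions from the NVMe headers; the exact bit set shown here is an assumption:

	/* identify controller: multi-port, multi-controller, ANA-capable */
	id->cmic = NVME_CTRL_CMIC_MULTI_PORT | NVME_CTRL_CMIC_MULTI_CTRL |
		   NVME_CTRL_CMIC_ANA;

	/* identify namespace: may be attached to multiple controllers */
	id->nmic = NVME_NS_NMIC_SHARED;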
2021-10-20  nvme: display correct subsystem NQN  (Hannes Reinecke, 5 files, -4/+11)

With discovery controllers supporting unique subsystem NQNs, the actual subsystem NQN might be different from the one passed in via the connect args. So add a helper to display the resulting subsystem NQN.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
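A sketch of such a helper: prefer the NQN the subsystem actually reported once it has been identified, falling back to the connect argument (exact logic assumed):

static inline const char *nvmf_ctrl_subsysnqn(struct nvme_ctrl *ctrl)
{
	if (!ctrl->subsys)
		return ctrl->opts->subsysnqn;	/* not yet identified */
	return ctrl->subsys->subnqn;		/* NQN the subsystem reported */
}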
2021-10-20  nvme: Add connect option 'discovery'  (Hannes Reinecke, 3 files, -1/+13)

Add a connect option 'discovery' to specify that the connection should be made to a discovery controller, not a normal I/O controller. With discovery controllers supporting unique subsystem NQNs we cannot easily distinguish by the subsystem NQN whether this should be a discovery connection, but we need this information to blank out options not supported by discovery controllers.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
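A sketch of the fabrics plumbing such an option typically needs, assuming the existing opt_tokens[]/switch pattern in nvme-fabrics:

	/* token table entry */
	{ NVMF_OPT_DISCOVERY, "discovery" },

	/* option parsing loop */
	case NVMF_OPT_DISCOVERY:
		opts->discovery_nqn = true;
		break;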
2021-10-20  nvme: expose subsystem type in sysfs attribute 'subsystype'  (Hannes Reinecke, 2 files, -0/+28)

With unique discovery controller NQNs we cannot distinguish the subsystem type by the NQN alone, but need to check the subsystem type, too. So expose the subsystem type in a new sysfs attribute 'subsystype'.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
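A sketch of the attribute's show function, assuming the subsystem type is recorded in struct nvme_subsystem:

static ssize_t nvme_subsys_show_type(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct nvme_subsystem *subsys =
		container_of(dev, struct nvme_subsystem, dev);

	switch (subsys->subtype) {
	case NVME_NQN_DISC:
		return sysfs_emit(buf, "discovery\n");
	case NVME_NQN_NVME:
		return sysfs_emit(buf, "nvm\n");
	default:
		return sysfs_emit(buf, "reserved\n");
	}
}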
2021-10-20  nvmet: set 'CNTRLTYPE' in the identify controller data  (Hannes Reinecke, 3 files, -1/+9)

Set the correct 'CNTRLTYPE' field in the identify controller data.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet: add nvmet_is_disc_subsys() helper  (Hannes Reinecke, 3 files, -4/+9)

Add a helper function to determine if a given subsystem is a discovery subsystem.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
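The helper is presumably a one-liner along these lines (sketch):

static inline bool nvmet_is_disc_subsys(struct nvmet_subsys *subsys)
{
	return subsys->type == NVME_NQN_DISC;
}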
2021-10-20  nvme: add CNTRLTYPE definitions for 'identify controller'  (Hannes Reinecke, 1 file, -1/+9)

Update the 'identify controller' structure to define the newly added CNTRLTYPE field.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
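The CNTRLTYPE values themselves come from the NVMe base specification (1 = I/O, 2 = discovery, 3 = administrative controller); a sketch of the definitions, with the names assumed:

enum {
	NVME_CTRL_CNTRLTYPE_IO		= 1,	/* I/O controller */
	NVME_CTRL_CNTRLTYPE_DISC	= 2,	/* discovery controller */
	NVME_CTRL_CNTRLTYPE_ADMIN	= 3,	/* administrative controller */
};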
2021-10-20  nvmet: make discovery NQN configurable  (Hannes Reinecke, 2 files, -1/+41)

TPAR8013 allows for unique discovery NQNs, so make the discovery controller NQN configurable by exposing a subsys attribute 'discovery_nqn'.

Signed-off-by: Hannes Reinecke <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet-rdma: implement get_max_queue_size controller op  (Max Gurtovoy, 1 file, -0/+6)

Limit the maximal queue size for RDMA controllers. Today the target reports a limit of 1024, and this limit isn't valid for some of the RDMA based controllers. For now, limit the RDMA transport to 128 entries (the max queue depth configured for the Linux NVMe/RDMA host). A future general solution should use the RDMA/core API to calculate this size according to device capabilities and the number of WRs needed per NVMe IO request.

Reported-by: Mark Ruijter <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet: add get_max_queue_size op for controllers  (Max Gurtovoy, 2 files, -3/+6)

Some transports, such as RDMA, would like to set the queue size according to device/port/ctrl characteristics. Add a new nvmet transport op that is called during ctrl initialization. This will not affect transports that don't implement it.

Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
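A sketch of the op and a plausible call site during controller initialization (names and placement assumed):

	/* in struct nvmet_fabrics_ops */
	u16 (*get_max_queue_size)(const struct nvmet_ctrl *ctrl);

	/* when computing the controller's reported maximum queue entries */
	if (ctrl->ops->get_max_queue_size)
		mqes = min_t(u16, mqes, ctrl->ops->get_max_queue_size(ctrl));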
2021-10-20  nvme-rdma: limit the maximal queue size for RDMA controllers  (Max Gurtovoy, 2 files, -0/+9)

The current limit of 1024 isn't valid for some of the RDMA based ctrls. In case the target exposes a cap with a larger number of entries (e.g. 1024), the initiator may fail to create a QP of this size. Thus limit it to a value that works for all RDMA adapters. A future general solution should use the RDMA/core API to calculate this size according to device capabilities and the number of WRs needed per NVMe IO request.

Signed-off-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
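A sketch of the host-side clamp, assuming a shared NVME_RDMA_MAX_QUEUE_SIZE definition of 128 (the Linux host's max RDMA queue depth mentioned above):

#define NVME_RDMA_MAX_QUEUE_SIZE	128

	if (opts->queue_size > NVME_RDMA_MAX_QUEUE_SIZE) {
		dev_warn(ctrl->ctrl.device,
			 "queue_size %zu > supported maximum %d, clamping down\n",
			 opts->queue_size, NVME_RDMA_MAX_QUEUE_SIZE);
		opts->queue_size = NVME_RDMA_MAX_QUEUE_SIZE;
	}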
2021-10-20  nvmet-tcp: fix use-after-free when a port is removed  (Israel Rukshin, 1 file, -0/+16)

When removing a port, all its controllers are removed, but there are queues on the port that don't belong to any controller (during connection establishment). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroying the remaining queues after the accept_work has been cancelled guarantees that no new queue will be created.

Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet-rdma: fix use-after-free when a port is removed  (Israel Rukshin, 1 file, -0/+24)

When removing a port, all its controllers are removed, but there are queues on the port that don't belong to any controller (during connection establishment). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroying the remaining queues after the RDMA-CM was destroyed guarantees that no new queue will be created.

Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvmet: fix use-after-free when a port is removed  (Israel Rukshin, 1 file, -0/+2)

When a port is removed through configfs, any connected controllers start their teardown flow asynchronously and can still send commands. This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_parse_io_cmd). To fix this, wait for all the scheduled teardown works to complete (like release_work in the rdma/tcp drivers). This ensures there are no active controllers when the port is eventually removed.

Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
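A sketch of where such a flush could live; placing it in the configfs release path is an assumption of this sketch:

static void nvmet_port_release(struct config_item *item)
{
	struct nvmet_port *port = to_nvmet_port(item);

	/*
	 * Let asynchronous controller teardown (e.g. release_work in
	 * the rdma/tcp drivers) finish before the port memory goes away.
	 */
	flush_scheduled_work();

	kfree(port->ana_state);
	kfree(port);
}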
2021-10-20  qla2xxx: add ->map_queues support for nvme  (Saurav Kashyap, 1 file, -0/+15)

Implement ->map_queues and use the block layer blk_mq_pci_map_queues helper for mapping queues to CPUs. With this mapping, a minimum 10% increase in performance was observed.

Signed-off-by: Saurav Kashyap <[email protected]>
Signed-off-by: Nilesh Javali <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  nvme-fc: add support for ->map_queues  (Saurav Kashyap, 2 files, -0/+31)

The NVMe FC transport doesn't support ->map_queues, unlike the PCI, RDMA and TCP transports. Add a ->map_queues callout for the LLDDs to provide such functionality.

Signed-off-by: Saurav Kashyap <[email protected]>
Signed-off-by: Nilesh Javali <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
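A combined sketch of these two patches: nvme-fc forwards the mapping to the LLDD, and qla2xxx implements it with the PCI affinity helper. The signatures are best-effort assumptions for this kernel vintage:

/* nvme-fc: forward ->map_queues to the LLDD if it provides one */
static int nvme_fc_map_queues(struct blk_mq_tag_set *set)
{
	struct nvme_fc_ctrl *ctrl = set->driver_data;

	if (ctrl->lport->ops->map_queues)
		ctrl->lport->ops->map_queues(&ctrl->lport->localport,
					     &set->map[HCTX_TYPE_DEFAULT]);
	else
		blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
	return 0;
}

/* qla2xxx: LLDD implementation mapping hw queues to CPUs via PCI IRQs */
static void qla_nvme_map_queues(struct nvme_fc_local_port *lport,
				struct blk_mq_queue_map *map)
{
	struct scsi_qla_host *vha = lport->private;

	blk_mq_pci_map_queues(map, vha->hw->pdev, vha->irq_offset);
}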
2021-10-20  nvme: generate uevent once a multipath namespace is operational again  (Hannes Reinecke, 1 file, -2/+5)

When fast_io_fail_tmo is set, I/O will be aborted while recovery is still ongoing. This causes MD to set the namespace to failed, and no further I/O will be submitted to that namespace. However, once the recovery succeeds and the namespace becomes operational again, the NVMe subsystem doesn't send a notification, so MD cannot automatically reinstate operation and requires manual interaction.

This patch sends a KOBJ_CHANGE uevent per multipathed namespace once the underlying controller transitions to LIVE, allowing an automatic MD reassembly with these udev rules:

/etc/udev/rules.d/65-md-auto-re-add.rules:

SUBSYSTEM!="block", GOTO="md_end"
ACTION!="change", GOTO="md_end"
ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end"
PROGRAM="/sbin/md_raid_auto_readd.sh $devnode"
LABEL="md_end"

/sbin/md_raid_auto_readd.sh:

MDADM=/sbin/mdadm
DEVNAME=$1

export $(${MDADM} --examine --export ${DEVNAME})

if [ -z "${MD_UUID}" ]; then
    exit 1
fi

UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
MD_DEVNAME=${UUID_LINK##*/}

export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
if [ -z "${MD_METADATA}" ] ; then
    exit 1
fi

if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
    echo "${MD_DEVNAME}: array not degraded, nothing to do"
    exit 0
fi

MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
if [ ${MD_STATE} != "clean" ] ; then
    echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
    exit 1
fi

MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
if [ ${!MD_VARNAME} = "spare" ] ; then
    ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
fi

Changes to v2:
- Add udev rules example to description

Changes to v1:
- use disk_uevent() as suggested by hch

Signed-off-by: Hannes Reinecke <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
2021-10-20  bcache: remove bch_crc64_update  (Christoph Hellwig, 3 files, -10/+2)

bch_crc64_update is an entirely pointless wrapper around crc64_be.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  bcache: use bvec_kmap_local in bch_data_verify  (Christoph Hellwig, 1 file, -6/+5)

Using local kmaps slightly reduces the chances of stray writes, and the bvec interface cleans up the code a little bit. Also switch from page_address to bvec_kmap_local for cbv, to be on the safe side and to avoid pointlessly poking into bvec internals.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  bcache: remove the backing_dev_name field from struct cached_dev  (Christoph Hellwig, 6 files, -39/+29)

Just use the %pg format specifier to print the name directly.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  bcache: remove the cache_dev_name field from struct cache  (Christoph Hellwig, 3 files, -10/+7)

Just use the %pg format specifier to print the name directly.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
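Both removals rely on the %pg printk format specifier, which prints a block_device's name directly, making cached name strings unnecessary; an illustrative before/after:

	/* was: pr_err("I/O error on %s", dc->backing_dev_name); */
	pr_err("I/O error on backing device %pg", dc->bdev);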
2021-10-20  bcache: move calc_cached_dev_sectors to proper place on backing device detach  (Lin Feng, 1 file, -1/+1)

Calculation of a cache_set's cached sectors is done by traversing the cached_devs list, as shown below:

static void calc_cached_dev_sectors(struct cache_set *c)
{
	...
	list_for_each_entry(dc, &c->cached_devs, list)
		sectors += bdev_sectors(dc->bdev);

	c->cached_dev_sectors = sectors;
}

But the cached_dev won't be unlinked from the c->cached_devs list until we call the following list_move(&dc->list, &uncached_devices), so the previous fix in commit 46010141da6677b81cc77f9b47f8ac62bd1cbfd3 ("bcache: recal cached_dev_sectors on detach") is wrong; now we move it to its right place.

Signed-off-by: Lin Feng <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  bcache: fix error info in register_bcache()  (Chao Yu, 1 file, -3/+10)

In register_bcache(), there are several cases where we didn't set the correct error info (return value and/or error message):

- if kzalloc() fails, it needs to return -ENOMEM and print "cannot allocate memory";
- if register_cache() fails, it's better to propagate its return value rather than using the default -EINVAL.

Signed-off-by: Chao Yu <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
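A sketch of both fixes; the variable and label names are assumptions of this sketch:

	ca = kzalloc(sizeof(*ca), GFP_KERNEL);
	if (!ca) {
		ret = -ENOMEM;
		err = "cannot allocate memory";
		goto out_put_sb_page;
	}

	...

	ret = register_cache(sb, sb_disk, bdev, ca);
	if (ret)
		/* propagate register_cache()'s error, don't default to -EINVAL */
		goto out_free_sb;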
2021-10-20  bcache: reserve never used bits from bkey.high  (Coly Li, 1 file, -2/+2)

There are 3 bits in the high member of struct bkey that are never used, and there is no plan to support them in the future:

- HEADER_SIZE, start at bit 58, length 2 bits
- KEY_PINNED, start at bit 55, length 1 bit

No kernel code or user space tool references or accesses the three bits. Therefore it is possible and feasible to reserve these valuable bits from bkey.high. They can be used in the future for other purposes.

Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  md: bcache: Fix spelling of 'acquire'  (Ding Senjie, 1 file, -1/+1)

acqurie -> acquire

Signed-off-by: Ding Senjie <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: fix possibly missed path verification  (Stefan Haberland, 2 files, -3/+15)

__dasd_device_check_path_events() calls the discipline path event handler. This handler can leave the 'to be verified pathmask' populated for an additional verification. There is a race window where the worker has finished before dasd_path_clear_all_verify() is called, which resets the tbvpm. Due to this, outstanding path verifications could be missed. Fix by clearing the pathmasks before calling the handler and adding them again in case of an error.

Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Jan Hoeppner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: fix missing path conf_data after failed allocation  (Stefan Haberland, 1 file, -1/+12)

dasd_eckd_path_available_action() does a memory allocation to store the per path configuration data permanently. In the unlikely case that this allocation fails, there is no conf_data stored for the corresponding path. This is OK since it is not necessary for an operational path, but some features like control unit initiated reconfiguration (CUIR) do not work. To fix this, add the path to the 'to be verified pathmask' again and schedule the handler again.

Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Jan Hoeppner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: summarize dasd configuration data in a separate structure  (Stefan Haberland, 2 files, -97/+98)

Summarize the dasd configuration data in a separate structure so that functions that need temporary config data do not need to allocate the whole eckd_private structure.

Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Jan Hoeppner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: move dasd_eckd_read_fc_security  (Stefan Haberland, 1 file, -2/+3)

dasd_eckd_read_conf is called multiple times during device setup, but the fc_security feature needs to be read only once. So move it into the calling function.

Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Jan Hoeppner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: split up dasd_eckd_read_conf  (Stefan Haberland, 1 file, -79/+51)

Move the cabling check out of dasd_eckd_read_conf and split it up into separate functions to improve readability and re-use functions.

Signed-off-by: Stefan Haberland <[email protected]>
Reviewed-by: Jan Hoeppner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: fix kernel doc comment  (Heiko Carstens, 1 file, -2/+2)

Fix this:

drivers/s390/block/dasd_ioctl.c:666: warning: Function parameter or member 'disk' not described in 'dasd_biodasdinfo'
drivers/s390/block/dasd_ioctl.c:666: warning: Function parameter or member 'info' not described in 'dasd_biodasdinfo'

Acked-by: Jan Höppner <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Stefan Haberland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  s390/dasd: handle request magic consistently as unsigned int  (Heiko Carstens, 3 files, -8/+8)

Get rid of the rather odd casts to character pointer of the dasd_ccw_req magic member and simply use the unsigned int value unmodified everywhere.

Acked-by: Jan Höppner <[email protected]>
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Stefan Haberland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-20  nbd: Fix use-after-free in pid_show  (Ye Bin, 1 file, -9/+9)

I got an issue as follows:

[ 263.886511] BUG: KASAN: use-after-free in pid_show+0x11f/0x13f
[ 263.888359] Read of size 4 at addr ffff8880bf0648c0 by task cat/746
[ 263.890479] CPU: 0 PID: 746 Comm: cat Not tainted 4.19.90-dirty #140
[ 263.893162] Call Trace:
[ 263.893509]  dump_stack+0x108/0x15f
[ 263.893999]  print_address_description+0xa5/0x372
[ 263.894641]  kasan_report.cold+0x236/0x2a8
[ 263.895696]  __asan_report_load4_noabort+0x25/0x30
[ 263.896365]  pid_show+0x11f/0x13f
[ 263.897422]  dev_attr_show+0x48/0x90
[ 263.898361]  sysfs_kf_seq_show+0x24d/0x4b0
[ 263.899479]  kernfs_seq_show+0x14e/0x1b0
[ 263.900029]  seq_read+0x43f/0x1150
[ 263.900499]  kernfs_fop_read+0xc7/0x5a0
[ 263.903764]  vfs_read+0x113/0x350
[ 263.904231]  ksys_read+0x103/0x270
[ 263.905230]  __x64_sys_read+0x77/0xc0
[ 263.906284]  do_syscall_64+0x106/0x360
[ 263.906797]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reproduce this issue as follows:
1. nbd-server 8000 /tmp/disk
2. nbd-client localhost 8000 /dev/nbd1
3. cat /sys/block/nbd1/pid

This triggers the use-after-free in pid_show. The reason is that after step 2, the nbd-client process has already exited, so its task_struct has already been freed. To solve this issue, revert part of 6521d39a64b3's modification and remove the useless 'recv_task' member of nbd_device.

Fixes: 6521d39a64b3 ("nbd: Remove variable 'pid'")
Signed-off-by: Ye Bin <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
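A sketch of the fixed attribute, reading a stored pid instead of dereferencing a task_struct (reverting to the pre-6521d39a64b3 approach; the field name is assumed):

static ssize_t pid_show(struct device *dev,
			struct device_attribute *attr, char *buf)
{
	struct gendisk *disk = dev_to_disk(dev);
	struct nbd_device *nbd = disk->private_data;

	/* plain pid value: safe even after the client task has exited */
	return sprintf(buf, "%d\n", nbd->pid);
}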
2021-10-19  nvme: don't memset() the normal read/write command  (Jens Axboe, 1 file, -2/+6)

This memset in the fast path costs a lot of cycles on my setup. Here's a top-of-profile of doing ~6.7M IOPS:

+    5.90%  io_uring  [nvme]            [k] nvme_queue_rq
+    5.32%  io_uring  [nvme_core]       [k] nvme_setup_cmd
+    5.17%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+    4.97%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO

and a perf diff with this patch:

     0.92%     +4.40%  [nvme_core]  [k] nvme_setup_cmd

reducing it from 5.3% to only 0.9%. This takes it from the 2nd most cycle consumer to something that's mostly irrelevant.

Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2021-10-19  nvme: move command clear into the various setup helpers  (Jens Axboe, 2 files, -3/+9)

We don't have to worry about doing extra memsets by moving it outside the protection of RQF_DONTPREP, as nvme doesn't do partial completions. This is in preparation for making the read/write fast path not do a full memset of the command.

Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2021-10-19  block: ataflop: fix breakage introduced at blk-mq refactoring  (Michael Schmitz, 1 file, -15/+3)

Refactoring of the Atari floppy driver when converting to blk-mq has broken the state machine in not-so-subtle ways:

finish_fdc() must be called when operations on the floppy device have completed. This is crucial in order to release the ST-DMA lock, which protects against concurrent access to the ST-DMA controller by other drivers (some DMA related, most just related to device register access - broken beyond compare, I know).

When rewriting the driver's old do_request() function, the fact that finish_fdc() was called only when all queued requests had completed appears to have been overlooked. Instead, the new request function calls finish_fdc() immediately after the last request has been queued. finish_fdc() executes a dummy seek after most requests, and this overwrites the state machine's interrupt handler that was set up to wait for completion of the read/write request just prior. To make matters worse, finish_fdc() is called before device interrupts are re-enabled, making certain that the read/write interrupt is missed.

Shifting the finish_fdc() call into the read/write request completion handler ensures the driver waits for the request to actually complete. With a queue depth of 2, we won't see long request sequences, so calling finish_fdc() unconditionally just adds a little overhead for the dummy seeks, and keeps the code simple.

While we're at it, kill ataflop_commit_rqs() which does nothing but run finish_fdc() unconditionally, again likely wiping out an in-flight request.

Signed-off-by: Michael Schmitz <[email protected]>
Fixes: 6ec3938cff95 ("ataflop: convert to blk-mq")
CC: [email protected]
CC: Tetsuo Handa <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: fix uaf in nbd_handle_reply()  (Yu Kuai, 1 file, -1/+17)

There is a problem that nbd_handle_reply() might access a freed request:

1) At first, a normal io is submitted and completed with scheduler:

internal_tag = blk_mq_get_tag -> get tag from sched_tags
 blk_mq_rq_ctx_init
  sched_tags->rq[internal_tag] = sched_tag->static_rq[internal_tag]
...
blk_mq_get_driver_tag
 __blk_mq_get_driver_tag -> get tag from tags
 tags->rq[tag] = sched_tag->static_rq[internal_tag]

So, both tags->rq[tag] and sched_tags->rq[internal_tag] are pointing to the request: sched_tags->static_rq[internal_tag]. Even if the io is finished.

2) nbd server send a reply with random tag directly:

recv_work
 nbd_handle_reply
  blk_mq_tag_to_rq(tags, tag)
   rq = tags->rq[tag]

3) if the sched_tags->static_rq is freed:

blk_mq_sched_free_requests
 blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i)
  -> step 2) access rq before clearing rq mapping
  blk_mq_clear_rq_mapping(set, tags, hctx_idx);
  __free_pages() -> rq is freed here

4) Then, nbd continues to use the freed request in nbd_handle_reply

Fix the problem by getting 'q_usage_counter' before blk_mq_tag_to_rq(); the request is then guaranteed not to be freed because 'q_usage_counter' is not zero.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
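A sketch of the guard in recv_work(), assuming the percpu_ref API on the request queue's q_usage_counter:

	/* recv_work(): take a queue reference before resolving the tag */
	if (!percpu_ref_tryget(&q->q_usage_counter)) {
		/* queue is being torn down, requests may be freed */
		return;
	}

	cmd = nbd_handle_reply(nbd, args->index, &reply);
	/* ... complete the request ... */

	percpu_ref_put(&q->q_usage_counter);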
2021-10-18  nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()  (Yu Kuai, 1 file, -30/+44)

Prepare to fix the uaf in nbd_read_stat(); no functional changes.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: clean up return value checking of sock_xmit()  (Yu Kuai, 1 file, -6/+7)

Checking whether sock_xmit() returns 0 is useless because it never returns 0; document this and remove such checks.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: don't start request if nbd_queue_rq() failed  (Yu Kuai, 1 file, -3/+0)

commit 6a468d5990ec ("nbd: don't start req until after the dead connection logic") moved blk_mq_start_request() from nbd_queue_rq() to nbd_handle_cmd() to skip starting the request if the connection is dead. However, the request is still started in other error paths. Currently, blk_mq_end_request() will be called immediately if nbd_queue_rq() fails, so starting the request in such a situation is useless. Thus remove blk_mq_start_request() from the error paths in nbd_handle_cmd().

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: check sock index in nbd_read_stat()  (Yu Kuai, 1 file, -0/+4)

The sock on which the client sends a request in nbd_send_cmd() and receives the reply in nbd_read_stat() should be the same.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: make sure request completion won't concurrent  (Yu Kuai, 1 file, -2/+9)

commit cddce0116058 ("nbd: Aovid double completion of a request") tried to fix the case where nbd_clear_que() and recv_work() can complete a request concurrently. However, the problem still exists:

t1                    t2                     t3

nbd_disconnect_and_put
 flush_workqueue
                      recv_work
                       blk_mq_complete_request
                        blk_mq_complete_request_remote -> this is true
                         WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)
                          blk_mq_raise_softirq
                                             blk_done_softirq
                                              blk_complete_reqs
                                               nbd_complete_rq
                                                blk_mq_end_request
                                                 blk_mq_free_request
                                                  WRITE_ONCE(rq->state, MQ_RQ_IDLE)
  nbd_clear_que
   blk_mq_tagset_busy_iter
    nbd_clear_req
                                                   __blk_mq_free_request
                                                    blk_mq_put_tag
     blk_mq_complete_request -> complete again

There are three places where a request can be completed in nbd: recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they all hold cmd->lock before completing the request, it's easy to avoid the problem by setting and checking a cmd flag.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  nbd: don't handle response without a corresponding request message  (Yu Kuai, 1 file, -1/+21)

While handling a response message from the server, nbd_read_stat() will try to get the request by tag and then complete the request. However, this is problematic if nbd hasn't sent a corresponding request message:

t1                      t2

                        submit_bio
                         nbd_queue_rq
                          blk_mq_start_request
recv_work
 nbd_read_stat
  blk_mq_tag_to_rq
 blk_mq_complete_request
                          nbd_send_cmd

Thus add a new cmd flag 'NBD_CMD_INFLIGHT': it is set in nbd_send_cmd() and checked in nbd_read_stat(). Note that this patch can't fix that blk_mq_tag_to_rq() might return a freed request; that will be fixed in the following patches.

Signed-off-by: Yu Kuai <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
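A sketch of how the flag could be used under cmd->lock, per the two messages above; the exact locking details are assumptions of this sketch:

	/* nbd_send_cmd(): mark the command once the request message
	 * has actually been sent to the server */
	set_bit(NBD_CMD_INFLIGHT, &cmd->flags);

	/* completion paths (recv_work(), nbd_clear_que(),
	 * nbd_xmit_timeout()): only complete commands we really sent;
	 * test-and-clear under cmd->lock also prevents two paths from
	 * completing the same request twice */
	mutex_lock(&cmd->lock);
	if (!test_and_clear_bit(NBD_CMD_INFLIGHT, &cmd->flags)) {
		mutex_unlock(&cmd->lock);
		return;		/* no matching request message was sent */
	}
	mutex_unlock(&cmd->lock);
	blk_mq_complete_request(req);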
2021-10-18  mtip32xx: Remove redundant 'flush_workqueue()' calls  (Christophe JAILLET, 1 file, -2/+0)

'destroy_workqueue()' already drains the queue before destroying it, so there is no need to flush it explicitly. Remove the redundant 'flush_workqueue()' calls.

This was generated with coccinelle:

@@
expression E;
@@
- flush_workqueue(E);
  destroy_workqueue(E);

Signed-off-by: Christophe JAILLET <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Link: https://lore.kernel.org/r/0fea349c808c6cfbf549b0e33701320c7860c8b7.1634234221.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  md: update superblock after changing rdev flags in state_store  (Xiao Ni, 1 file, -1/+10)

When the in memory flag is changed, we need to persist the change in the rdev superblock flags. This is needed for "writemostly" and "failfast".

Reviewed-by: Li Feng <[email protected]>
Signed-off-by: Xiao Ni <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
2021-10-18  md: remove unused argument from md_new_event  (Guoqing Jiang, 4 files, -18/+18)

Actually, mddev is not used by md_new_event.

Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>