aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-08-16nvme: Have NVME_FABRICS select NVME_CORE instead of transport driversSagi Grimberg2-5/+1
Transport drivers need both core and fabrics modules, instead of selecting both, have the selection transitive such that NVME_FABRICS selects NVME_CORE and transport drivers select NVME_FABRICS. Suggested-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: James Smart <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvmet: check that host sqsize does not exceed ctrl MQESAmit Engel1-0/+9
Check that host sqsize is not greater-than Maximum Queue Entries Supported (MQES) value supported by the controller. Signed-off-by: Amit Engel <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvmet: avoid duplicate qid in connect cmdAmit Engel2-6/+15
According to the NVMe specification, if the host sends a Connect command specifying a queue id which has already been created, a status value of NVME_SC_CMD_SEQ_ERROR is returned. Signed-off-by: Amit Engel <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvmet: pass back cntlid on successful completionAmit Engel1-4/+5
According to the NVMe specification, the response dword 0 value of the Connect command is based on status code: return cntlid for successful compeltion return IPO and IATTR for connect invalid parameters. Fix a missing error information for a zero sized queue, and return the cntlid also for I/O queue Connect commands. Signed-off-by: Amit Engel <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-rdma: don't update queue count when failing to set io queuesRuozhu Li1-2/+2
We update ctrl->queue_count and schedule another reconnect when io queue count is zero.But we will never try to create any io queue in next reco- nnection, because ctrl->queue_count already set to zero.We will end up having an admin-only session in Live state, which is exactly what we try to avoid in the original patch. Update ctrl->queue_count after queue_count zero checking to fix it. Signed-off-by: Ruozhu Li <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-tcp: don't update queue count when failing to set io queuesRuozhu Li1-2/+2
We update ctrl->queue_count and schedule another reconnect when io queue count is zero.But we will never try to create any io queue in next reco- nnection, because ctrl->queue_count already set to zero.We will end up having an admin-only session in Live state, which is exactly what we try to avoid in the original patch. Update ctrl->queue_count after queue_count zero checking to fix it. Signed-off-by: Ruozhu Li <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-tcp: pair send_mutex init with destroyKeith Busch1-0/+2
Each mutex_init() should have a corresponding mutex_destroy(). Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme: allow user toggling hmb usageKeith Busch1-1/+44
The NVMe host memory buffer may consume a non-negligable amount of memory. Controllers are required to function without the host memory buffer enabled, but with possibly degraded performance. Export a sysfs property to toggle this feature on a per-device granularity so users may choose to reclaim memory at the expense of storage performance. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-pci: disable hmb on idle suspendKeith Busch1-7/+17
An idle suspend may or may not disable host memory access from devices placed in low power mode. Either way, it should always be safe to disable the host memory buffer prior to entering the low power mode, and this should also always be faster than a full device shutdown. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvmet: remove redundant assignments of variable statusColin Ian King1-4/+1
There are two occurrances where variable status is being assigned a value that is never read and it is being re-assigned a new value almost immediately afterwards on an error exit path. The assignments are redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvmet: add set feature tracing supportHou Pu1-1/+17
A nvme connect command produces following trace from the target side. Before: kworker/0:1H-56 [000] .... 9012.155139: nvmet_req_init: nvmet1: qid=0, cmdid=16, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/0:1H-56 [000] .... 9012.872272: nvmet_req_init: nvmet1: qid=0, cmdid=13, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) cmdline:/sys/kernel/debug/tracing# cat trace | grep feature kworker/0:1H-56 [000] .... 203.493914: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-56 [000] .... 204.197079: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme: add set feature tracing supportHou Pu1-1/+17
A nvme connect command produces following trace. Before: /sys/kernel/debug/tracing# cat trace | grep feature kworker/5:1H-98 [005] .... 3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) kworker/4:1H-124 [004] .... 3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00) After: /sys/kernel/debug/tracing# cat trace | grep feature kworker/0:1H-253 [000] .... 196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007) kworker/0:1H-253 [000] .... 196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900) Using ',' to separate different field like others in nvmet_trace_admin_get_features. Signed-off-by: Hou Pu <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_optionsHou Pu1-1/+0
Opts->host is NULL there. It is checked just before. So remove nvmf_host_put. It is introduced by commit 59a2f3f00fd7 ("nvme: fix potential memory leak in option parsing"). Signed-off-by: Hou Pu <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-pci: cmb sysfs: one file, one valueKeith Busch1-2/+26
An attribute should only be exporting one value as recommended in Documentation/filesystems/sysfs.rst. Implement CMB attributes this way. The old attribute will remain for backward compatibility. Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-pci: use attribute group for cmb sysfsKeith Busch1-26/+46
Appending sysfs files to the controller kobject is a bit clunky and becomes a maintenance problem as more attributes are added. The attribute group infrastructure handles this better, so use that. Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme: code command_id with a genctr for use-after-free validationSagi Grimberg6-20/+66
We cannot detect a (perhaps buggy) controller that is sending us a completion for a request that was already completed (for example sending a completion twice), this phenomenon was seen in the wild a few times. So to protect against this, we use the upper 4 msbits of the nvme sqe command_id to use as a 4-bit generation counter and verify it matches the existing request generation that is incrementing on every execution. The 16-bit command_id structure now is constructed by: | xxxx | xxxxxxxxxxxx | gen request tag This means that we are giving up some possible queue depth as 12 bits allow for a maximum queue depth of 4095 instead of 65536, however we never create such long queues anyways so no real harm done. Suggested-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Acked-by: Keith Busch <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Tested-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu dataSagi Grimberg1-11/+3
We already validate it when receiving the c2hdata pdu header and this is not changing so this is a redundant check. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16nvme-pci: limit maximum queue depth to 4095Sagi Grimberg1-9/+5
We are going to use the upper 4-bits of the command_id for a generation counter, so enforce the new queue depth upper limit. As we enforce both min and max queue depth, use param_set_uint_minmax istead of open coding it. Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Daniel Wagner <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-16params: lift param_set_uint_minmax to common codeSagi Grimberg3-18/+20
It is a useful helper hence move it to common code so others can enjoy it. Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2021-08-14remove the lightnvm subsystemChristoph Hellwig30-13678/+1
Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts to produce Open Channel SSDs and never made it into the NVMe spec proper. They have since been superceeded by NVMe enhancements such as ZNS support. Remove the support per the deprecation schedule. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Matias Bjørling <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: reduce the nbd_index_mutex scopeChristoph Hellwig1-27/+28
nbd_index_mutex is currently held over add_disk and inside ->open, which leads to lock order reversals. Refactor the device creation code path so that nbd_dev_add is called without nbd_index_mutex lock held and only takes it for the IDR insertation. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: fix whitespace] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: refactor device search and allocation in nbd_genl_connectChristoph Hellwig1-31/+14
Use idr_for_each_entry instead of the awkward callback to find an existing device for the index == -1 case, and de-duplicate the device allocation if no existing device was found. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: return the allocated nbd_device from nbd_dev_addChristoph Hellwig1-12/+9
Return the device we just allocated instead of doing an extra search for it in the caller. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: remove nbd_del_diskChristoph Hellwig1-12/+5
Fold nbd_del_disk and remove the pointless NULL check on ->disk given that it is always set for a successfully allocated nbd_device structure. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: refactor device removalChristoph Hellwig1-24/+13
Share common code for the synchronous and workqueue based device removal, and remove the pointless use of refcount_dec_and_mutex_lock. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: do del_gendisk() asynchronously for NBD_DESTROY_ON_DISCONNECTHou Tao1-9/+61
Now open_mutex is used to synchronize partition operations (e.g, blk_drop_partitions() and blkdev_reread_part()), however it makes nbd driver broken, because nbd may call del_gendisk() in nbd_release() or nbd_genl_disconnect() if NBD_CFLAG_DESTROY_ON_DISCONNECT is enabled, and deadlock occurs, as shown below: // AB-BA dead-lock nbd_genl_disconnect blkdev_open nbd_disconnect_and_put lock bd_mutex // last ref nbd_put lock nbd_index_mutex del_gendisk nbd_open try lock nbd_index_mutex try lock bd_mutex or // AA dead-lock nbd_release lock bd_mutex nbd_put try lock bd_mutex Instead of fixing block layer (e.g, introduce another lock), fixing the nbd driver to call del_gendisk() in a kworker when NBD_DESTROY_ON_DISCONNECT is enabled. When NBD_DESTROY_ON_DISCONNECT is disabled, nbd device will always be destroy through module removal, and there is no risky of deadlock. To ensure the reuse of nbd index succeeds, moving the calling of idr_remove() after del_gendisk(), so if the reused index is not found in nbd_index_idr, the old disk must have been deleted. And reusing the existing destroy_complete mechanism to ensure nbd_genl_connect() will wait for the completion of del_gendisk(). Also adding a new workqueue for nbd removal, so nbd_cleanup() can ensure all removals complete before exits. Reported-by: [email protected] Fixes: c76f48eb5c08 ("block: take bd_mutex around delete_partitions in del_gendisk") Signed-off-by: Hou Tao <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2021-08-13nbd: add the check to prevent overflow in __nbd_ioctl()Baokun Li1-2/+4
If user specify a large enough value of NBD blocks option, it may trigger signed integer overflow which may lead to nbd->config->bytesize becomes a large or small value, zero in particular. UBSAN: Undefined behaviour in drivers/block/nbd.c:325:31 signed integer overflow: 1024 * 4611686155866341414 cannot be represented in type 'long long int' [...] Call trace: [...] handle_overflow+0x188/0x1dc lib/ubsan.c:192 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213 nbd_size_set drivers/block/nbd.c:325 [inline] __nbd_ioctl drivers/block/nbd.c:1342 [inline] nbd_ioctl+0x998/0xa10 drivers/block/nbd.c:1395 __blkdev_driver_ioctl block/ioctl.c:311 [inline] [...] Although it is not a big deal, still silence the UBSAN by limit the input value. Reported-by: Hulk Robot <[email protected]> Signed-off-by: Baokun Li <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Link: https://lore.kernel.org/r/[email protected] [axboe: dropped unlikely()] Signed-off-by: Jens Axboe <[email protected]>
2021-08-09xen-blkfront: Remove redundant assignment to variable errColin Ian King1-1/+0
The variable err is being assigned a value that is never read, the assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block/rnbd: Use sysfs_emit instead of s*printf function for sysfs showMd Haris Iqbal2-25/+22
sysfs_emit function was added to be aware of the PAGE_SIZE maximum of the temporary buffer used for outputting sysfs content, so there is no possible overruns. So replace the uses of any s*printf functions for the sysfs show functions with sysfs_emit. Signed-off-by: Md Haris Iqbal <[email protected]> Signed-off-by: Jack Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block/rnbd-clt: Use put_cpu_ptr after get_cpu_ptrGioh Kim1-1/+1
This patch replaces put_cpu_var with put_cpu_ptr because get_cpu_ptr should be paired with put_cpu_ptr. Signed-off-by: Gioh Kim <[email protected]> Signed-off-by: Jack Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove blk-mq-sysfs dead codeDamien Le Moal1-55/+0
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to define any attribute since the "mq" sysfs directory contains only sub-directories (no attribute files). As a result, blk_mq_sysfs_show(), blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all unused and unnecessary. Remove all this unused code. Signed-off-by: Damien Le Moal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02loop: raise media_change eventMatteo Croce1-0/+5
Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach. # udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME & # losetup -f zero [ 7.454235] loop0: detected capacity change from 0 to 16384 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop0 DEVNAME=/dev/loop0 DEVNAME=/dev/loop0 # losetup -f zero [ 10.205245] loop1: detected capacity change from 0 to 16384 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop1 # losetup -f zero2 [ 13.532368] loop2: detected capacity change from 0 to 40960 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop2 DEVNAME=/dev/loop2 # losetup -D DEVNAME=/dev/loop1 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop1 DEVNAME=/dev/loop2 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop2 DEVNAME=/dev/loop0 DISK_MEDIA_CHANGE=1 DEVNAME=/dev/loop0 Signed-off-by: Matteo Croce <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: add a helper to raise a media changed eventMatteo Croce2-15/+47
Refactor disk_check_events() and move some code into disk_event_uevent(). Then add disk_force_media_change(), a helper which will be used by devices to force issuing a DISK_EVENT_MEDIA_CHANGE event. Co-developed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: export diskseq in sysfsMatteo Croce2-0/+22
Add a new sysfs handle to export the new diskseq value. Place it in <sysfs>/block/<disk>/diskseq and document it. $ grep . /sys/class/block/*/diskseq /sys/class/block/loop0/diskseq:13 /sys/class/block/loop1/diskseq:14 /sys/class/block/loop2/diskseq:5 /sys/class/block/loop3/diskseq:6 /sys/class/block/ram0/diskseq:1 /sys/class/block/ram1/diskseq:2 /sys/class/block/vda/diskseq:7 Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: add ioctl to read the disk sequence numberMatteo Croce2-0/+3
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number from the genhd structure. # ./getdiskseq /dev/loop* /dev/loop0: 13 /dev/loop0p1: 13 /dev/loop0p2: 13 /dev/loop0p3: 13 /dev/loop1: 14 /dev/loop1p1: 14 /dev/loop1p2: 14 /dev/loop2: 5 /dev/loop3: 6 Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: export the diskseq in ueventsMatteo Croce1-0/+9
Export the newly introduced diskseq in uevents: $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ E: DEVNAME=/dev/loop0 E: DISKSEQ=1 E: DEVNAME=/dev/loop1 E: DISKSEQ=2 E: DEVNAME=/dev/loop2 E: DISKSEQ=3 E: DEVNAME=/dev/loop3 E: DISKSEQ=4 E: DEVNAME=/dev/loop4 E: DISKSEQ=5 E: DEVNAME=/dev/loop5 E: DISKSEQ=6 E: DEVNAME=/dev/loop6 E: DISKSEQ=7 E: DEVNAME=/dev/loop7 E: DISKSEQ=8 E: DEVNAME=/dev/nvme0n1 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p1 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p2 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p3 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p4 E: DISKSEQ=9 E: DEVNAME=/dev/nvme0n1p5 E: DISKSEQ=9 E: DEVNAME=/dev/sda E: DISKSEQ=10 E: DEVNAME=/dev/sda1 E: DISKSEQ=10 E: DEVNAME=/dev/sda2 E: DISKSEQ=10 Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: add disk sequence numberMatteo Croce3-0/+29
Associating uevents with block devices in userspace is difficult and racy: the uevent netlink socket is lossy, and on slow and overloaded systems has a very high latency. Block devices do not have exclusive owners in userspace, any process can set one up (e.g. loop devices). Moreover, device names can be reused (e.g. loop0 can be reused again and again). A userspace process setting up a block device and watching for its events cannot thus reliably tell whether an event relates to the device it just set up or another earlier instance with the same name. Being able to set a UUID on a loop device would solve the race conditions. But it does not allow to derive orderings from uevents: if you see a uevent with a UUID that does not match the device you are waiting for, you cannot tell whether it's because the right uevent has not arrived yet, or it was already sent and you missed it. So you cannot tell whether you should wait for it or not. Associating a unique, monotonically increasing sequential number to the lifetime of each block device, which can be retrieved with an ioctl immediately upon setting it up, allows to solve the race conditions with uevents, and also allows userspace processes to know whether they should wait for the uevent they need or if it was dropped and thus they should move on. Additionally, increment the disk sequence number when the media change, i.e. on DISK_EVENT_MEDIA_CHANGE event. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Matteo Croce <[email protected]> Tested-by: Luca Boccassi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove cmdline-parser.cChristoph Hellwig7-319/+262
cmdline-parser.c is only used by the cmdline faux partition format, so merge the code into that and avoid an indirect call. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove disk_name()Christoph Hellwig2-9/+9
Remove the disk_name function now that all users are gone. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: simplify disk name formatting in check_partitionChristoph Hellwig1-1/+1
disk_name for partition 0 just copies out the disk_name field. Replace the call to disk_name with a %s format specifier. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: simplify printing the device names disk_stack_limitsChristoph Hellwig1-9/+3
Printk ->disk_name directly for the disk and use the %pg format specifier for the block device, which is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: use the %pg format specifier in show_partitionChristoph Hellwig1-4/+2
Simplify printing the partition name by using the %pg format specifier that is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: use the %pg format specifier in printk_all_partitionsChristoph Hellwig1-4/+2
Simplify printing the partition name by using the %pg format specifier that is equivalent to a bdevname call. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: reduce stack usage in diskstats_showAbd-Alrhman Masalkhi1-4/+2
I have compiled the kernel with a cross compiler "hppa-linux-gnu-" v9.3.0 on x86-64 host machine. I got the following warning: block/genhd.c: In function ‘diskstats_show’: block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger than 1280 bytes [-Wframe-larger-than=] 1227 | } By Reduced the stack footprint by using the %pg printk specifier instead of disk_name to remove the need for the on-stack buffer. Signed-off-by: Abd-Alrhman Masalkhi <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove bdputChristoph Hellwig4-10/+3
Now that we've stopped using inode references for anything meaninful in the block layer get rid of the helper to put it and just open code the call to iput on the block_device inode. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: remove bdgrabChristoph Hellwig2-16/+0
All callers are gone, and no one should grab a pure inode reference to a block device anymore. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02loop: don't grab a reference to the block deviceChristoph Hellwig1-5/+0
The whole device block device won't be removed while the disk is still alive, so don't bother to grab a reference to it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: change the refcounting for partitionsChristoph Hellwig2-38/+31
Instead of acquiring an inode reference on open make sure partitions always hold device model references to the disk while alive, and switch open to grab only a device model reference to the opened block device. If that is a partition the disk reference is transitively held by the partition already. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: allocate bd_meta_info later in add_partitionsChristoph Hellwig1-10/+7
Move the allocation of bd_meta_info after initializing the struct device to avoid the special bdput error handling path. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2021-08-02block: unhash the whole device inode earlierChristoph Hellwig2-7/+2
Unhash the whole device inode early in del_gendisk. This allows to remove the first GENHD_FL_UP check in the open path as we simply won't find a just removed inode. The second non-racy check after taking open_mutex is still kept. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>