aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2023-01-29block: make BLK_DEF_MAX_SECTORS unsignedKeith Busch1-1/+1
This is used as an unsigned value, so define it that way to avoid having to cast it. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230105205146.3610282-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: balance I/O injection among underutilized actuatorsDavide Zini1-5/+13
Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commits breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is however skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: inject I/O to underutilized actuatorsDavide Zini4-40/+139
The main service scheme of BFQ for sync I/O is serving one sync bfq_queue at a time, for a while. In particular, BFQ enforces this scheme when it deems the latter necessary to boost throughput or to preserve service guarantees. Unfortunately, when BFQ enforces this policy, only one actuator at a time gets served for a while, because each bfq_queue contains I/O only for one actuator. The other actuators may remain underutilized. Actually, BFQ may serve (inject) extra I/O, taken from other bfq_queues, in parallel with that of the in-service queue. This injection mechanism may provide the ground for dealing also with the above actuator-underutilization problem. Yet BFQ does not take the actuator load into account when choosing which queue to pick extra I/O from. In addition, BFQ may happen to inject extra I/O only when the in-service queue is temporarily empty. In view of these facts, this commit extends the injection mechanism in such a way that the latter: (1) takes into account also the actuator load; (2) checks such a load on each dispatch, and injects I/O for an underutilized actuator, if there is one and there is I/O for it. To perform the check in (2), this commit introduces a load threshold, currently set to 4. A linear scan of each actuator is performed, until an actuator is found for which the following two conditions hold: the load of the actuator is below the threshold, and there is at least one non-in-service queue that contains I/O for that actuator. If such a pair (actuator, queue) is found, then the head request of that queue is returned for dispatch, instead of the head request of the in-service queue. We have set the threshold, empirically, to the minimum possible value for which an actuator is fully utilized, or close to be fully utilized. By doing so, injected I/O 'steals' as few drive-queue slots as possibile to the in-service queue. This reduces as much as possible the probability that the service of I/O from the in-service bfq_queue gets delayed because of slot exhaustion, i.e., because all the slots of the drive queue are filled with I/O injected from other queues (NCQ provides for 32 slots). This new mechanism also counters actuator underutilization in the case of asymmetric configurations of bfq_queues. Namely if there are few bfq_queues containing I/O for some actuators and many bfq_queues containing I/O for other actuators. Or if the bfq_queues containing I/O for some actuators have lower weights than the other bfq_queues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: retrieve independent access ranges from request queueFederico Gavioli2-9/+58
This patch implements the code to gather the content of the independent_access_ranges structure from the request_queue and copy it into the queue's bfq_data. This copy is done at queue initialization. We copy the access ranges into the bfq_data to avoid taking the queue lock each time we access the ranges. This implementation, however, puts a limit to the maximum independent ranges supported by the scheduler. Such a limit is equal to the constant BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of dynamic memory. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Co-developed-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: split also async bfq_queues on a per-actuator basisDavide Zini2-22/+27
Similarly to sync bfq_queues, also async bfq_queues need to be split on a per-actuator basis. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: turn bfqq_data into an array in bfq_io_cqPaolo Valente2-45/+67
When a bfq_queue Q is merged with another queue, several pieces of information are saved about Q. These pieces are stored in the bfqq_data field in the bfq_io_cq data structure of the process associated with Q. Yet, with a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. Each of these queues may undergo a merge. So, the bfq_io_cq data structure must be able to accommodate the above information for N queues. This commit solves this problem by turning the bfqq_data scalar field into an array of N elements (and by changing code so as to handle this array). This solution is written under the assumption that bfq_queues associated with different actuators cannot be cross-merged. This assumption holds naturally with basic queue merging: the latter is triggered by spatial locality, and sectors for different actuators are not close to each other (apart from the corner case of the last sectors served by a given actuator and the first sectors served by the next actuator). As for stable cross-merging, the assumption here is that it is disabled. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: move io_cq-persistent bfqq data into a dedicated structPaolo Valente2-80/+110
With a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq data structure must be able to accommodate its per-queue persistent information for N queues. Currently it stores this information for just one queue, in several scalar fields. This is a preparatory commit for moving to accommodating persistent information for N queues. In particular, this commit packs all the above scalar fields into a single data structure. Then there is now only one field, in bfq_io_cq, that stores all the above information. This scalar field will then be turned into an array by a following commit. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: forbid stable merging of queues associated with different actuatorsPaolo Valente1-4/+9
If queues associated with different actuators are merged, then control is lost on each actuator. Therefore some actuator may be underutilized, and throughput may decrease. This problem cannot occur with basic queue merging, because the latter is triggered by spatial locality, and sectors for different actuators are not close to each other. Yet it may happen with stable merging. To address this issue, this commit prevents stable merging from occurring among queues associated with different actuators. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: split sync bfq_queues on a per-actuator basisPaolo Valente3-107/+195
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator SATA drives appear as a single device to the I/O subsystem [1]. Yet they address commands to different actuators internally, as a function of Logical Block Addressing (LBAs). A given sector is reachable by only one of the actuators. For example, Seagate’s Serial Advanced Technology Attachment (SATA) version contains two actuators and maps the lower half of the SATA LBA space to the lower actuator and the upper half to the upper actuator. Evidently, to fully utilize actuators, no actuator must be left idle or underutilized while there is pending I/O for it. The block layer must somehow control the load of each actuator individually. This commit lays the ground for allowing BFQ to provide such a per-actuator control. BFQ associates an I/O-request sync bfq_queue with each process doing synchronous I/O, or with a group of processes, in case of queue merging. Then BFQ serves one bfq_queue at a time. While in service, a bfq_queue is emptied in request-position order. Yet the same process, or group of processes, may generate I/O for different actuators. In this case, different streams of I/O (each for a different actuator) get all inserted into the same sync bfq_queue. So there is basically no individual control on when each stream is served, i.e., on when the I/O requests of the stream are picked from the bfq_queue and dispatched to the drive. This commit enables BFQ to control the service of each actuator individually for synchronous I/O, by simply splitting each sync bfq_queue into N queues, one for each actuator. In other words, a sync bfq_queue is now associated to a pair (process, actuator). As a consequence of this split, the per-queue proportional-share policy implemented by BFQ will guarantee that the sync I/O generated for each actuator, by each process, receives its fair share of service. This is just a preparatory patch. If the I/O of the same process happens to be sent to different queues, then each of these queues may undergo queue merging. To handle this event, the bfq_io_cq data structure must be properly extended. In addition, stable merging must be disabled to avoid loss of control on individual actuators. Finally, also async queues must be split. These issues are described in detail and addressed in next commits. As for this commit, although multiple per-process bfq_queues are provided, the I/O of each process or group of processes is still sent to only one queue, regardless of the actuator the I/O is for. The forwarding to distinct bfq_queues will be enabled after addressing the above issues. [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-27driver core: make struct device_type.devnode() take a const *Greg Kroah-Hartman1-1/+1
The devnode() callback in struct device_type should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Jens Axboe <axboe@kernel.dk> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Ben Widawsky <bwidawsk@kernel.org> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Alistar Popple <alistair@popple.id.au> Cc: Eddie James <eajames@linux.ibm.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jilin Yuan <yuanjilin@cdjrlc.com> Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Won Chung <wonchung@google.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-7-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-27driver core: make struct device_type.uevent() take a const *Greg Kroah-Hartman1-2/+2
The uevent() callback in struct device_type should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andreas Noever <andreas.noever@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bard Liao <yung-chuan.liao@linux.intel.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jilin Yuan <yuanjilin@cdjrlc.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Len Brown <lenb@kernel.org> Cc: Mark Gross <markgross@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Maximilian Luz <luzmaximilian@gmail.com> Cc: Michael Jamet <michael.jamet@intel.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sanyog Kale <sanyog.r.kale@intel.com> Cc: Sean Young <sean@mess.org> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Won Chung <wonchung@google.com> Cc: Yehezkel Bernat <YehezkelShB@gmail.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for Thunderbolt Acked-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Acked-by: Wolfram Sang <wsa@kernel.org> Acked-by: Vinod Koul <vkoul@kernel.org> Acked-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230111113018.459199-6-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-20Merge tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linuxLinus Torvalds4-7/+13
Pull block fixes from Jens Axboe: "Various little tweaks all over the place: - NVMe pull request via Christoph: - fix controller shutdown regression in nvme-apple (Janne Grunau) - fix a polling on timeout regression in nvme-pci (Keith Busch) - Fix a bug in the read request side request allocation caching (Pavel) - pktcdvd was brought back after we configured a NULL return on bio splits, make it consistent with the others (me) - BFQ refcount fix (Yu) - Block cgroup policy activation fix (Yu) - Fix for an md regression introduced in the 6.2 cycle (Adrian)" * tag 'block-6.2-2023-01-20' of git://git.kernel.dk/linux: nvme-pci: fix timeout request state check nvme-apple: only reset the controller when RTKit is running nvme-apple: reset controller during shutdown block: fix hctx checks for batch allocation block/rnbd-clt: fix wrong max ID in ida_alloc_max blk-cgroup: fix missing pd_online_fn() while activating policy pktcdvd: check for NULL returna fter calling bio_split_to_limits() block, bfq: switch 'bfqg->ref' to use atomic refcount apis md: fix incorrect declaration about claim_rdev in md_import_device
2023-01-17blk-mq: Build default queue map via group_cpus_evenly()Ming Lei1-50/+13
The default queue mapping builder of blk_mq_map_queues doesn't take NUMA topo into account, so the built mapping is pretty bad, since CPUs belonging to different NUMA node are assigned to same queue. It is observed that IOPS drops by ~30% when running two jobs on same hctx of null_blk from two CPUs belonging to two NUMA nodes compared with from same NUMA node. Address the issue by reusing group_cpus_evenly() for building queue mapping since group_cpus_evenly() does group cpus according to CPU/NUMA locality. Also performance data becomes more stable with this given correct queue mapping is applied wrt. numa locality viewpoint, for example, on one two nodes arm64 machine with 160 cpus, node 0(cpu 0~79), node 1(cpu 80~159): 1) modprobe null_blk nr_devices=1 submit_queues=2 2) run 'fio(t/io_uring -p 0 -n 4 -r 20 /dev/nullb0)', and observe that IOPS becomes much stable on multiple tests: - unpatched: IOPS is 2.5M ~ 4.5M - patched: IOPS is 4.3M ~ 5.0M Lots of drivers may benefit from the change, such as nvme pci poll, nvme tcp, ... Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20221227022905.352674-7-ming.lei@redhat.com
2023-01-17block: fix hctx checks for batch allocationPavel Begunkov1-1/+5
When there are no read queues read requests will be assigned a default queue on allocation. However, blk_mq_get_cached_request() is not prepared for that and will fail all attempts to grab read requests from the cache. Worst case it doubles the number of requests allocated, roughly half of which will be returned by blk_mq_free_plug_rqs(). It only affects batched allocations and so is io_uring specific. For reference, QD8 t/io_uring benchmark improves by 20-35%. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/80d4511011d7d4751b4cf6375c4e38f237d935e3.1673955390.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-16blk-cgroup: fix missing pd_online_fn() while activating policyYu Kuai1-0/+4
If the policy defines pd_online_fn(), it should be called after pd_init_fn(), like blkg_create(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230103112833.2013432-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-15block, bfq: switch 'bfqg->ref' to use atomic refcount apisYu Kuai2-6/+4
The updating of 'bfqg->ref' should be protected by 'bfqd->lock', however, during code review, we found that bfq_pd_free() update 'bfqg->ref' without holding the lock, which is problematic: 1) bfq_pd_free() triggered by removing cgroup is called asynchronously; 2) bfqq will grab bfqg reference, and exit bfqq will drop the reference, which can concurrent with 1). Unfortunately, 'bfqd->lock' can't be held here because 'bfqd' might already be freed in bfq_pd_free(). Fix the problem by using atomic refcount apis. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230103084755.1256479-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-13Merge tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linuxLinus Torvalds1-3/+0
Pull block fixes from Jens Axboe: "Nothing major in here, just a collection of NVMe fixes and dropping a wrong might_sleep() that static checkers tripped over but which isn't valid" * tag 'block-6.2-2023-01-13' of git://git.kernel.dk/linux: MAINTAINERS: stop nvme matching for nvmem files nvme: don't allow unprivileged passthrough on partitions nvme: replace the "bool vec" arguments with flags in the ioctl path nvme: remove __nvme_ioctl nvme-pci: fix error handling in nvme_pci_enable() nvme-pci: add NVME_QUIRK_IDENTIFY_CNS quirk to Apple T2 controllers nvme-apple: add NVME_QUIRK_IDENTIFY_CNS quirk to fix regression block: Drop spurious might_sleep() from blk_put_queue()
2023-01-14block: add a sanity check for non-write flush/fua biosChristoph Hellwig1-5/+9
Check that the PREFUSH and FUA flags are only set on write bios, given that the flush state machine expects that. [Damien] The check is also extended to REQ_OP_ZONE_APPEND operations as these are data write operations used by btrfs and zonefs and may also have the REQ_FUA bit set. Reported-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2023-01-11block: use iter_ubuf for single rangeKeith Busch1-4/+4
This is more efficient than iter_iov. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: fold in iovec assumption fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-08block: Drop spurious might_sleep() from blk_put_queue()Tejun Heo1-3/+0
Dan reports the following smatch detected the following: block/blk-cgroup.c:1863 blkcg_schedule_throttle() warn: sleeping in atomic context caused by blkcg_schedule_throttle() calling blk_put_queue() in an non-sleepable context. blk_put_queue() acquired might_sleep() in 63f93fd6fa57 ("block: mark blk_put_queue as potentially blocking") which transferred the might_sleep() from blk_free_queue(). blk_free_queue() acquired might_sleep() in e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") while turning request_queue removal synchronous. However, this isn't necessary as nothing in the free path actually requires sleeping. It's pretty unusual to require a sleeping context in a put operation and it's not needed in the first place. Let's drop it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lkml.kernel.org/r/Y7g3L6fntnTtOm63@kili Cc: Christoph Hellwig <hch@lst.de> Cc: Luis Chamberlain <mcgrof@kernel.org> Fixes: e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal") # v5.9+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/Y7iFwjN+XzWvLv3y@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-06Merge tag 'block-2023-01-06' of git://git.kernel.dk/linuxLinus Torvalds5-18/+50
Pull block fixes from Jens Axboe: "The big change here is obviously the revert of the pktcdvd driver removal. Outside of that, just minor tweaks. In detail: - Re-instate the pktcdvd driver, which necessitates adding back bio_copy_data_iter() and the fops->devnode() hook for now (me) - Fix for splitting of a bio marked as NOWAIT, causing either nowait reads or writes to error with EAGAIN even if parts of the IO completed (me) - Fix for ublk, punting management commands to io-wq as they can all easily block for extended periods of time (Ming) - Removal of SRCU dependency for the block layer (Paul)" * tag 'block-2023-01-06' of git://git.kernel.dk/linux: block: Remove "select SRCU" Revert "pktcdvd: remove driver." Revert "block: remove devnode callback from struct block_device_operations" Revert "block: bio_copy_data_iter" ublk: honor IO_URING_F_NONBLOCK for handling control command block: don't allow splitting of a REQ_NOWAIT bio block: handle bio_split_to_limits() NULL return
2023-01-05block: Remove "select SRCU"Paul E. McKenney1-1/+0
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-04Revert "block: remove devnode callback from struct block_device_operations"Jens Axboe1-0/+11
This reverts commit 85d6ce58e493ac8b7122e2fbe3f41b94d6ebdc11. We're reinstating the pktcdvd driver, which needs this API. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-04Revert "block: bio_copy_data_iter"Jens Axboe1-15/+22
This reverts commit db1c7d77976775483a8ef240b4c705f113e13ea1. We're reinstating the pktcdvd driver, which needs this API. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-04block: don't allow splitting of a REQ_NOWAIT bioJens Axboe1-0/+10
If we split a bio marked with REQ_NOWAIT, then we can trigger spurious EAGAIN if constituent parts of that split bio end up failing request allocations. Parts will complete just fine, but just a single failure in one of the chained bios will yield an EAGAIN final result for the parent bio. Return EAGAIN early if we end up needing to split such a bio, which allows for saner recovery handling. Cc: stable@vger.kernel.org # 5.15+ Link: https://github.com/axboe/liburing/issues/766 Reported-by: Michael Kelley <mikelley@microsoft.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-04block: handle bio_split_to_limits() NULL returnJens Axboe2-2/+7
This can't happen right now, but in preparation for allowing bio_split_to_limits() returning NULL if it ended the bio, check for it in all the callers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-29Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull block fixes from Jens Axboe: "Mostly just NVMe, but also a single fixup for BFQ for a regression that happened during the merge window. In detail: - NVMe pull requests via Christoph: - Fix doorbell buffer value endianness (Klaus Jensen) - Fix Linux vs NVMe page size mismatch (Keith Busch) - Fix a potential use memory access beyong the allocation limit (Keith Busch) - Fix a multipath vs blktrace NULL pointer dereference (Yanjun Zhang) - Fix various problems in handling the Command Supported and Effects log (Christoph Hellwig) - Don't allow unprivileged passthrough of commands that don't transfer data but modify logical block content (Christoph Hellwig) - Add a features and quirks policy document (Christoph Hellwig) - Fix some really nasty code that was correct but made smatch complain (Sagi Grimberg) - Use-after-free regression in BFQ from this merge window (Yu)" * tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux: nvme-auth: fix smatch warning complaints nvme: consult the CSE log page for unprivileged passthrough nvme: also return I/O command effects from nvme_command_effects nvmet: don't defer passthrough commands with trivial effects to the workqueue nvmet: set the LBCC bit for commands that modify data nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition docs, nvme: add a feature and quirk policy document nvme-pci: update sqsize when adjusting the queue depth nvme: fix setting the queue depth in nvme_alloc_io_tag_set block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq nvme: fix multipath crash caused by flush request when blktrace is enabled nvme-pci: fix page size checks nvme-pci: fix mempool alloc size nvme-pci: fix doorbell buffer value endianness
2022-12-26block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqqYu Kuai1-1/+1
Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq() can free bfqq first, and then call bic_set_bfqq(), which will cause uaf. Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-25treewide: Convert del_timer*() to timer_shutdown*()Steven Rostedt (Google)3-3/+3
Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); | - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); | kmem_cache_free(slab, ptr); | kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-21Merge tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linuxLinus Torvalds7-20/+34
Pull block fixes from Jens Axboe: - Various fixes for BFQ (Yu, Yuwei) - Fix for loop command line parsing (Isaac) - No need to specifically clear REQ_ALLOC_CACHE on IOPOLL downgrade anymore (me) - blk-iocost enum fix for newer gcc (Jiri) - UAF fix for queue release (Ming) - blk-iolatency error handling memory leak fix (Tejun) * tag 'block-6.2-2022-12-19' of git://git.kernel.dk/linux: block: don't clear REQ_ALLOC_CACHE for non-polled requests block: fix use-after-free of q->q_usage_counter block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHED blk-iolatency: Fix memory leak on add_disk() failures loop: Fix the max_loop commandline argument treatment when it is set to 0 block/blk-iocost (gcc13): keep large values in a new enum block, bfq: replace 0/1 with false/true in bic apis block, bfq: don't return bfqg from __bfq_bic_change_cgroup() block, bfq: fix possible uaf for 'bfqq->bic'
2022-12-16Merge tag 'driver-core-6.2-rc1' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const *", out of it can comes a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this have been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side affect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here are smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems" * tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits) device property: Fix documentation for fwnode_get_next_parent() firmware_loader: fix up to_fw_sysfs() to preserve const usb.h: take advantage of container_of_const() device.h: move kobj_to_dev() to use container_of_const() container_of: add container_of_const() that preserves const-ness of the pointer driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion. driver core: fix up missed scsi/cxlflash class.devnode() conversion. driver core: fix up some missing class.devnode() conversions. driver core: make struct class.devnode() take a const * driver core: make struct class.dev_uevent() take a const * cacheinfo: Remove of_node_put() for fw_token device property: Add a blank line in Kconfig of tests device property: Rename goto label to be more precise device property: Move PROPERTY_ENTRY_BOOL() a bit down device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*() kernfs: fix all kernel-doc warnings and multiple typos driver core: pass a const * into of_device_uevent() kobject: kset_uevent_ops: make name() callback take a const * kobject: kset_uevent_ops: make filter() callback take a const * kobject: make kobject_namespace take a const * ...
2022-12-15block: fix use-after-free of q->q_usage_counterMing Lei1-4/+5
For blk-mq, queue release handler is usually called after blk_mq_freeze_queue_wait() returns. However, the q_usage_counter->release() handler may not be run yet at that time, so this can cause a use-after-free. Fix the issue by moving percpu_ref_exit() into blk_free_queue_rcu(). Since ->release() is called with rcu read lock held, it is agreed that the race should be covered in caller per discussion from the two links. Reported-by: Zhang Wensheng <zhangwensheng@huaweicloud.com> Reported-by: Zhong Jinghua <zhongjinghua@huawei.com> Link: https://lore.kernel.org/linux-block/Y5prfOjyyjQKUrtH@T590/T/#u Link: https://lore.kernel.org/lkml/Y4%2FmzMd4evRg9yDi@fedora/ Cc: Hillf Danton <hdanton@sina.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Dennis Zhou <dennis@kernel.org> Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221215021629.74870-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHEDYuwei Guan3-4/+10
The 'bfqd->num_groups_with_pending_reqs' is used when CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled. Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14blk-iolatency: Fix memory leak on add_disk() failuresTejun Heo1-0/+2
When a gendisk is successfully initialized but add_disk() fails such as when a loop device has invalid number of minor device numbers specified, blkcg_init_disk() is called during init and then blkcg_exit_disk() during error handling. Unfortunately, iolatency gets initialized in the former but doesn't get cleaned up in the latter. This is because, in non-error cases, the cleanup is performed by del_gendisk() calling rq_qos_exit(), the assumption being that rq_qos policies, iolatency being one of them, can only be activated once the disk is fully registered and visible. That assumption is true for wbt and iocost, but not so for iolatency as it gets initialized before add_disk() is called. It is desirable to lazy-init rq_qos policies because they are optional features and add to hot path overhead once initialized - each IO has to walk all the registered rq_qos policies. So, we want to switch iolatency to lazy init too. However, that's a bigger change. As a fix for the immediate problem, let's just add an extra call to rq_qos_exit() in blkcg_exit_disk(). This is safe because duplicate calls to rq_qos_exit() become noop's. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: darklight2357@icloud.com Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/Y5TQ5gm3O4HXrXR3@slm.duckdns.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block/blk-iocost (gcc13): keep large values in a new enumJiri Slaby (SUSE)1-0/+2
Since gcc13, each member of an enum has the same type as the enum [1]. And that is inherited from its members. Provided: VTIME_PER_SEC_SHIFT = 37, VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT, ... AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC, the named type is unsigned long. This generates warnings with gcc-13: block/blk-iocost.c: In function 'ioc_weight_prfill': block/blk-iocost.c:3037:37: error: format '%u' expects argument of type 'unsigned int', but argument 4 has type 'long unsigned int' block/blk-iocost.c: In function 'ioc_weight_show': block/blk-iocost.c:3047:34: error: format '%u' expects argument of type 'unsigned int', but argument 3 has type 'long unsigned int' So split the anonymous enum with large values to a separate enum, so that they don't affect other members. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36113 Cc: Martin Liska <mliska@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: cgroups@vger.kernel.org Cc: linux-block@vger.kernel.org Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Link: https://lore.kernel.org/r/20221213120826.17446-1-jirislaby@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: replace 0/1 with false/true in bic apisYu Kuai2-6/+6
Just to make the code a litter cleaner, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: don't return bfqg from __bfq_bic_change_cgroup()Yu Kuai1-5/+3
The return value is not used, hence remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: fix possible uaf for 'bfqq->bic'Yu Kuai1-1/+6
Our test report a uaf for 'bfqq->bic' in 5.10: ================================================================== BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30 CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014 Call Trace: bfq_select_queue+0x378/0xa30 bfq_dispatch_request+0xe8/0x130 blk_mq_do_dispatch_sched+0x62/0xb0 __blk_mq_sched_dispatch_requests+0x215/0x2a0 blk_mq_sched_dispatch_requests+0x8f/0xd0 __blk_mq_run_hw_queue+0x98/0x180 __blk_mq_delay_run_hw_queue+0x22b/0x240 blk_mq_run_hw_queue+0xe3/0x190 blk_mq_sched_insert_requests+0x107/0x200 blk_mq_flush_plug_list+0x26e/0x3c0 blk_finish_plug+0x63/0x90 __iomap_dio_rw+0x7b5/0x910 iomap_dio_rw+0x36/0x80 ext4_dio_read_iter+0x146/0x190 [ext4] ext4_file_read_iter+0x1e2/0x230 [ext4] new_sync_read+0x29f/0x400 vfs_read+0x24e/0x2d0 ksys_read+0xd5/0x1b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Commit 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") changes that move process to a new cgroup will allocate a new bfqq to use, however, the old bfqq and new bfqq can point to the same bic: 1) Initial state, two process with io in the same cgroup. Process 1 Process 2 (BIC1) (BIC2) | Λ | Λ | | | | V | V | bfqq1 bfqq2 2) bfqq1 is merged to bfqq2. Process 1 Process 2 (BIC1) (BIC2) | | \-------------\| V bfqq1 bfqq2(coop) 3) Process 1 exit, then issue new io(denoce IOA) from Process 2. (BIC2) | Λ | | V | bfqq2(coop) 4) Before IOA is completed, move Process 2 to another cgroup and issue io. Process 2 (BIC2) Λ |\--------------\ | V bfqq2 bfqq3 Now that BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2. If all the requests are completed, and Process 2 exit, BIC2 will be freed while there is no guarantee that bfqq2 will be freed before BIC2. Fix the problem by clearing bfqq->bic while bfqq is detached from bic. Fixes: 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-13Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds38-877/+1132
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...
2022-12-12Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-0/+6
Pull fscrypt updates from Eric Biggers: "This release adds SM4 encryption support, contributed by Tianjia Zhang. SM4 is a Chinese block cipher that is an alternative to AES. I recommend against using SM4, but (according to Tianjia) some people are being required to use it. Since SM4 has been turning up in many other places (crypto API, wireless, TLS, OpenSSL, ARMv8 CPUs, etc.), it hasn't been very controversial, and some people have to use it, I don't think it would be fair for me to reject this optional feature. Besides the above, there are a couple cleanups" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: add additional documentation for SM4 support fscrypt: remove unused Speck definitions fscrypt: Add SM4 XTS/CTS symmetric algorithm support blk-crypto: Add support for SM4-XTS blk crypto mode fscrypt: add comment for fscrypt_valid_enc_modes_v1() fscrypt: pass super_block to fscrypt_put_master_key_activeref()
2022-12-08sed-opal: allow using IOC_OPAL_SAVE for locking tooLuca Boccassi1-0/+39
Usually when closing a crypto device (eg: dm-crypt with LUKS) the volume key is not required, as it requires root privileges anyway, and root can deny access to a disk in many ways regardless. Requiring the volume key to lock the device is a peculiarity of the OPAL specification. Given we might already have saved the key if the user requested it via the 'IOC_OPAL_SAVE' ioctl, we can use that key to lock the device if no key was provided here and the locking range matches, and the user sets the appropriate flag with 'IOC_OPAL_SAVE'. This allows integrating OPAL with tools and libraries that are used to the common behaviour and do not ask for the volume key when closing a device. Callers can always pass a non-zero key and it will be used regardless, as before. Suggested-by: Štěpán Horáček <stepan.horacek@gmail.com> Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20221206092913.4625-1-luca.boccassi@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-08blk-cgroup: Fix typo in commentKemeng Shi1-2/+2
Replace assocating with associating. Replace intiailized with initialized. Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221206093307.378249-1-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-06block: bio_copy_data_iterChristoph Hellwig1-22/+15
With the pktcdvdv removal, bio_copy_data_iter is unused now. Fold the logic into bio_copy_data and remove the separate lower level function. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221206144407.722049-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: Use more suitable time_after check for update of slice_startKemeng Shi1-1/+1
There is no need to update tg->slice_start[rw] to start when they are equal already. So remove "eq" part of check before update slice_start. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-10-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: remove repeat check of elapsed timeKemeng Shi1-2/+6
There is no need to check elapsed time from last upgrade for each node in hierarchy. Move this check before traversing as throtl_can_upgrade do to remove repeat check. Reported-by: kernel test robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: remove incorrect comment for tg_last_low_overflow_timeKemeng Shi1-1/+0
Function tg_last_low_overflow_time is called with intermediate node as following: throtl_hierarchy_can_downgrade throtl_tg_can_downgrade tg_last_low_overflow_time throtl_hierarchy_can_upgrade throtl_tg_can_upgrade tg_last_low_overflow_time throtl_hierarchy_can_downgrade/throtl_hierarchy_can_upgrade will traverse from leaf node to sub-root node and pass traversed intermediate node to tg_last_low_overflow_time. No such limit could be found from context and implementation of tg_last_low_overflow_time, so remove this limit in comment. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: fix typo in comment of throtl_adjusted_limitKemeng Shi1-1/+1
lapsed time -> elapsed time Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: simpfy low limit reached check in throtl_tg_can_upgradeKemeng Shi1-13/+18
Commit c79892c557616 ("blk-throttle: add upgrade logic for LIMIT_LOW state") added upgrade logic for low limit and methioned that 1. "To determine if a cgroup exceeds its limitation, we check if the cgroup has pending request. Since cgroup is throttled according to the limit, pending request means the cgroup reaches the limit." 2. "If a cgroup has limit set for both read and write, we consider the combination of them for upgrade. The reason is read IO and write IO can interfere with each other. If we do the upgrade based in one direction IO, the other direction IO could be severly harmed." Besides, we also determine that cgroup reaches low limit if low limit is 0, see comment in throtl_tg_can_upgrade. Collect the information above, the desgin of upgrade check is as following: 1.The low limit is reached if limit is zero or io is already queued. 2.Cgroup will pass upgrade check if low limits of READ and WRITE are both reached. Simpfy the check code described above to removce repeat check and improve readability. There is no functional change. Detail equivalence proof is as following: All replaced conditions to return true are as following: condition 1 (!read_limit && !write_limit) condition 2 read_limit && sq->nr_queued[READ] && (!write_limit || sq->nr_queued[WRITE]) condition 3 write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ]) Transferring condition 2 as following: (read_limit && sq->nr_queued[READ]) && (!write_limit || sq->nr_queued[WRITE]) is equivalent to (read_limit && sq->nr_queued[READ]) && (!write_limit || (write_limit && sq->nr_queued[WRITE])) is equivalent to condition 2.1 (read_limit && sq->nr_queued[READ] && !write_limit) || condition 2.2 (read_limit && sq->nr_queued[READ] && (write_limit && sq->nr_queued[WRITE])) Transferring condition 3 as following: write_limit && sq->nr_queued[WRITE] && (!read_limit || sq->nr_queued[READ]) is equivalent to (write_limit && sq->nr_queued[WRITE]) && (!read_limit || (read_limit && sq->nr_queued[READ])) is equivalent to condition 3.1 ((write_limit && sq->nr_queued[WRITE]) && !read_limit) || condition 3.2 ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) Condition 3.2 is the same as condition 2.2, so all conditions we get to return are as following: (!read_limit && !write_limit) (1) (!read_limit && (write_limit && sq->nr_queued[WRITE])) (3.1) ((read_limit && sq->nr_queued[READ]) && !write_limit) (2.1) ((write_limit && sq->nr_queued[WRITE]) && (read_limit && sq->nr_queued[READ])) (2.2) As we can extract conditions "(a1 || a2) && (b1 || b2)" to: a1 && b1 a1 && b2 a2 && b1 ab && b2 Considering that: a1 = !read_limit a2 = read_limit && sq->nr_queued[READ] b1 = !write_limit b2 = write_limit && sq->nr_queued[WRITE] We can pack replaced conditions to (!read_limit || (read_limit && sq->nr_queued[READ])) && (!write_limit || (write_limit && sq->nr_queued[WRITE])) which is equivalent to (!read_limit || sq->nr_queued[READ]) && (!write_limit || sq->nr_queued[WRITE]) Reported-by: kernel test robot <lkp@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: correct calculation of wait time in tg_may_dispatchKemeng Shi1-25/+13
In C language, When executing "if (expression1 && expression2)" and expression1 return false, the expression2 may not be executed. For "tg_within_bps_limit(tg, bio, bps_limit, &bps_wait) && tg_within_iops_limit(tg, bio, iops_limit, &iops_wait))", if bps is limited, tg_within_bps_limit will return false and tg_within_iops_limit will not be called. So even bps and iops are both limited, iops_wait will not be calculated and is always zero. So wait time of iops is always ignored. Fix this by always calling tg_within_bps_limit and tg_within_iops_limit to get wait time for both bps and iops. Observed that: 1. Wait time in tg_within_iops_limit/tg_within_bps_limit need always be stored as wait argument is always passed. 2. wait time is stored to zero if iops/bps is limited otherwise non-zero is stored. Simpfy tg_within_iops_limit/tg_within_bps_limit by removing wait argument and return wait time directly. Caller tg_may_dispatch checks if wait time is zero to find if iops/bps is limited. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-05blk-throttle: ignore cgroup without io queued in blk_throtl_cancel_biosKemeng Shi1-1/+12
Ignore cgroup without io queued in blk_throtl_cancel_bios for two reasons: 1. Save cpu cycle for trying to dispatch cgroup which is no io queued. 2. Avoid non-consistent state that cgroup is inserted to service queue without THROTL_TG_PENDING set as tg_update_disptime will unconditional re-insert cgroup to service queue. If we are on the default hierarchy, IO dispatched from child in tg_dispatch_one_bio will trigger inserting cgroup to service queue without erase first and ruin the tree. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kemeng Shi <shikemeng@huawei.com> Link: https://lore.kernel.org/r/20221205115709.251489-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>