path: root/include/linux/blkdev.h
Age | Commit message | Author | Files | Lines
2017-04-19 | block: Inline blk_rq_set_prio() | Bart Van Assche | 1 | -14/+0
Since only a single caller remains, inline blk_rq_set_prio(). Initialize req->ioprio even if no I/O priority has been set in either the bio or the I/O context. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Adam Manzanares <[email protected]> Tested-by: Adam Manzanares <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
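For reference, the logic that replaces the helper has roughly this shape (a sketch based on the description above, not the verbatim diff):

    if (ioprio_valid(bio_prio(bio)))
            req->ioprio = bio_prio(bio);
    else if (ioc)
            req->ioprio = ioc->ioprio;
    else
            /* covers the case where neither the bio nor the io_context set one */
            req->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);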
2017-04-19 | block: Export blk_init_request_from_bio() | Bart Van Assche | 1 | -0/+1
Export this function such that it becomes available to block drivers. Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Matias Bjørling <[email protected]> Cc: Adam Manzanares <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-19 | block: remove blk_end_request_cur | Christoph Hellwig | 1 | -1/+0
This function is not used anywhere in the kernel. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-19 | block: remove blk_end_request_err and __blk_end_request_err | Christoph Hellwig | 1 | -2/+0
Both functions are entirely unused. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-19 | block, bfq: add full hierarchical scheduling and cgroups support | Arianna Avanzini | 1 | -1/+1
Add complete support for full hierarchical scheduling, with a cgroups interface. Full hierarchical scheduling is implemented through the 'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues associated with processes, and groups are represented in general by entities. Given the bfq_queues associated with the processes belonging to a given group, the entities representing these queues are children of the entity representing the group. At higher levels, if a group, say G, contains other groups, then the entity representing G is the parent entity of the entities representing the groups in G.

Hierarchical scheduling is performed as follows: if the timestamps of a leaf entity (i.e., of a bfq_queue) change, and such a change lets the entity become the next-to-serve entity for its parent entity, then the timestamps of the parent entity are recomputed as a function of the budget of its new next-to-serve leaf entity. If the parent entity belongs, in its turn, to a group, and its new timestamps let it become the next-to-serve for its parent entity, then the timestamps of the latter parent entity are recomputed as well, and so on. When a new bfq_queue must be set in service, the reverse path is followed: the next-to-serve highest-level entity is chosen, then its next-to-serve child entity, and so on, until the next-to-serve leaf entity is reached, and the bfq_queue that this entity represents is set in service.

Writeback is accounted for on a per-group basis, i.e., for each group, the async I/O requests of the processes of the group are enqueued in a distinct bfq_queue, and the entity associated with this queue is a child of the entity associated with the group.

Weights can be assigned explicitly to groups and processes through the cgroups interface, differently from what happens, for single processes, if the cgroups interface is not used (as explained in the description of the previous patch). In particular, since each node has a full scheduler, each group can be assigned its own weight. Signed-off-by: Fabio Checconi <[email protected]> Signed-off-by: Paolo Valente <[email protected]> Signed-off-by: Arianna Avanzini <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
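A minimal sketch of the downward selection walk described above (field and function names are illustrative, not BFQ's exact internals):

    /* choose the bfq_queue to serve: follow next-to-serve entities down */
    struct bfq_entity *entity = lookup_next_entity(root_sched_data);

    while (entity && entity->my_sched_data)     /* not a leaf yet */
            entity = lookup_next_entity(entity->my_sched_data);
    /* 'entity' is now a leaf; serve the bfq_queue it represents */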
2017-04-14 | block: fix bio_will_gap() for first bvec with offset | Ming Lei | 1 | -4/+28
Commit 729204ef49ec ("block: relax check on sg gap") allows us to merge bios if both are physically contiguous. This change can merge a huge number of small bios; with mkfs, for example, mkfs.ntfs running time can be decreased to ~1/10. But if one rq starts with a non-aligned buffer (the 1st bvec's bv_offset is non-zero) and we allow the merge, it is quite difficult to respect the sg gap limit, especially the max segment size, or we risk having an unaligned virtual boundary. This patch tries to avoid the issue by disallowing a merge if the req starts with an unaligned buffer. Also add comments to explain why the merged segment can't end in an unaligned virt boundary. Fixes: 729204ef49ec ("block: relax check on sg gap") Tested-by: Johannes Thumshirn <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Ming Lei <[email protected]> Rewrote parts of the commit message and comments. Signed-off-by: Jens Axboe <[email protected]>
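In simplified form, the added rule amounts to the following check at the front of the merge decision (a sketch, not the exact kernel code):

    struct bio_vec pb;

    bio_get_first_bvec(req->bio, &pb);
    if (pb.bv_offset)       /* request starts in the middle of a page */
            return true;    /* report a gap, i.e. disallow the merge */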
2017-04-08 | block: remove the discard_zeroes_data flag | Christoph Hellwig | 1 | -15/+0
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-08 | block: add a new BLKDEV_ZERO_NOFALLBACK flag | Christoph Hellwig | 1 | -0/+1
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if the caller doesn't want them. Also clean up the convoluted check for the return condition that this new flag is added to. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
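Together with the flags argument introduced just below, a caller that cannot tolerate the write fallback would look roughly like this (hedged sketch of the post-series signature):

    /* zero via unmap/WRITE SAME/WRITE ZEROES only; get -EOPNOTSUPP
     * instead of a fallback to explicitly writing zero pages */
    ret = blkdev_issue_zeroout(bdev, sector, nr_sects, GFP_KERNEL,
                               BLKDEV_ZERO_NOFALLBACK);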
2017-04-08 | block: add a flags argument to (__)blkdev_issue_zeroout | Christoph Hellwig | 1 | -6/+10
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with similar semantics, but without referring to discard. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-07 | Merge branch 'for-linus' into for-4.12/block | Jens Axboe | 1 | -3/+2
We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <[email protected]>
2017-04-07 | blk-mq: Restart a single queue if tag sets are shared | Bart Van Assche | 1 | -1/+0
To improve scalability, if hardware queues are shared, restart a single hardware queue in round-robin fashion. Rename blk_mq_sched_restart_queues() to reflect the new semantics. Remove blk_mq_sched_mark_restart_queue() because this function has no callers. Remove flag QUEUE_FLAG_RESTART because this patch removes the code that uses this flag. Signed-off-by: Bart Van Assche <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-04-05 | block: move timeout field in struct request to pack better | Jens Axboe | 1 | -1/+2
After commit 64c7f1d1572c, we went from 1 to 2 holes in my test setup. If we move the timeout field a bit, we remove both of those holes and shrink struct request by 8 bytes. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
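The general idea, shown on a toy struct rather than the real struct request:

    struct before {          /* 64-bit: 32 bytes, two 4-byte holes */
            void *a;
            int   b;         /* 4-byte hole follows */
            void *c;
            int   d;         /* 4 bytes of tail padding */
    };

    struct after {           /* same members reordered: 24 bytes, no holes */
            void *a;
            void *c;
            int   b;
            int   d;
    };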
2017-04-05 | block, scsi: move the retries field to struct scsi_request | Christoph Hellwig | 1 | -1/+0
Instead of bloating the generic struct request with it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-03-29 | block: warn if sharing request queue across gendisks | Omar Sandoval | 1 | -0/+1
Now that the remaining drivers have been converted to one request queue per gendisk, let's warn if a request queue gets registered more than once. This will catch future drivers which might do it inadvertently or any old drivers that I may have missed. Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-03-21 | blk-stat: convert to callback-based statistics reporting | Omar Sandoval | 1 | -2/+8
Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for either of the in-tree users:

1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O.

2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window.

This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use.

Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
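In rough terms, a consumer of the new interface registers a bucketing function and a timer callback; treat the exact signatures below as assumptions drawn from the description, not verbatim API:

    /* bucket by direction: 0 = read, 1 = write */
    static int my_bucket_fn(const struct request *rq)
    {
            return rq_data_dir(rq);
    }

    static void my_timer_fn(struct blk_stat_callback *cb)
    {
            /* cb->stat[0] / cb->stat[1] hold this window's read/write stats */
    }

    cb = blk_stat_alloc_callback(my_timer_fn, my_bucket_fn, 2, my_data);
    blk_stat_add_callback(q, cb);
    blk_stat_activate_msecs(cb, 100);   /* arm a ~100 msec window */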
2017-03-08Revert "scsi, block: fix duplicate bdi name registration crashes"Jan Kara1-1/+0
This reverts commit 0dba1314d4f81115dce711292ec7981d17231064. It causes leaking of device numbers for SCSI when SCSI registers multiple gendisks for one request_queue in succession. It can be easily reproduced using Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE. Furthermore the protection provided by this commit is not needed anymore as the problem it was fixing got also fixed by commit 165a5e22fafb "block: Move bdi_unregister() to del_gendisk()". [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Signed-off-by: Jan Kara <[email protected]> Acked-by: Dan Williams <[email protected]> Tested-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-03-02 | sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> | Ingo Molnar | 1 | -0/+1
We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-08 | block: optionally merge discontiguous discard bios into a single request | Christoph Hellwig | 1 | -0/+26
Add a new merge strategy, used from the plug merging code, that merges discard bios into a request until the maximum number of discard ranges (or the maximum discard size) is reached. I/O scheduler merging is not wired up yet but might also be useful, although not for fast devices like NVMe, which are the only user for now. Note that for now we don't support limiting the size of each discard range, but if needed that can be added later. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-02-02 | block: fix debugfs config conditional in struct request_queue | Omar Sandoval | 1 | -1/+1
The debugfs dentries are only used for CONFIG_BLK_DEBUG_FS, so make them conditional on that instead of CONFIG_DEBUG_FS. Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-02-02 | scsi, block: fix duplicate bdi name registration crashes | Dan Williams | 1 | -0/+1
Warnings of the following form occur because scsi reuses a devt number while the block layer still has it referenced as the name of the bdi [1]:

    WARNING: CPU: 1 PID: 93 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x62/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/8:192'
    [..]
    Call Trace:
     dump_stack+0x86/0xc3
     __warn+0xcb/0xf0
     warn_slowpath_fmt+0x5f/0x80
     ? kernfs_path_from_node+0x4f/0x60
     sysfs_warn_dup+0x62/0x80
     sysfs_create_dir_ns+0x77/0x90
     kobject_add_internal+0xb2/0x350
     kobject_add+0x75/0xd0
     device_add+0x15a/0x650
     device_create_groups_vargs+0xe0/0xf0
     device_create_vargs+0x1c/0x20
     bdi_register+0x90/0x240
     ? lockdep_init_map+0x57/0x200
     bdi_register_owner+0x36/0x60
     device_add_disk+0x1bb/0x4e0
     ? __pm_runtime_use_autosuspend+0x5c/0x70
     sd_probe_async+0x10d/0x1c0
     async_run_entry_fn+0x39/0x170

This is a brute-force fix to pass the devt release information from sd_probe() to the locations where we register the bdi, device_add_disk(), and unregister the bdi, blk_cleanup_queue(). Thanks to Omar for the quick reproducer script [2]. This patch survives where an unmodified kernel fails in a few seconds. [1]: https://marc.info/?l=linux-scsi&m=147116857810716&w=4 [2]: http://marc.info/?l=linux-block&m=148554717109098&w=2 Cc: James Bottomley <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: "Martin K. Petersen" <[email protected]> Cc: Jan Kara <[email protected]> Reported-by: Omar Sandoval <[email protected]> Tested-by: Omar Sandoval <[email protected]> Signed-off-by: Dan Williams <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-02-02 | block: Get rid of blk_get_backing_dev_info() | Jan Kara | 1 | -1/+0
blk_get_backing_dev_info() is now a simple dereference. Remove that function and simplify some code around that. Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-02-02 | block: Dynamically allocate and refcount backing_dev_info | Jan Kara | 1 | -1/+0
Instead of storing backing_dev_info inside struct request_queue, allocate it dynamically, reference count it, and free it when the last reference is dropped. Currently only request_queue holds the reference but in the following patch we add other users referencing backing_dev_info. Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-02-02 | block: Use pointer to backing_dev_info from request_queue | Jan Kara | 1 | -1/+2
We will want to have struct backing_dev_info allocated separately from struct request_queue. As the first step add pointer to backing_dev_info to request_queue and convert all users touching it. No functional changes in this patch. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
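The conversion is mechanical at the call sites; for example, a reference to the readahead setting changes like this (illustrative member access):

    /* before: backing_dev_info embedded in struct request_queue */
    q->backing_dev_info.ra_pages = ra_pages;
    /* after: accessed through the new pointer */
    q->backing_dev_info->ra_pages = ra_pages;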
2017-01-31 | block: move internal_tag to same cache line as tag | Jens Axboe | 1 | -2/+3
Since we removed cmd_type, we now have a hole in the struct. Move the internal_tag member to the same cacheline as tag, since we use them at the same time. This doesn't fix the hole, just moves it elsewhere. Signed-off-by: Jens Axboe <[email protected]>
2017-01-31 | block: fold cmd_type into the REQ_OP_ space | Christoph Hellwig | 1 | -11/+11
Instead of keeping two levels of indirection for requests types, fold it all into the operations. The little caveat here is that previously cmd_type only applied to struct request, while the request and bio op fields were set to plain REQ_OP_READ/WRITE even for passthrough operations. Instead this patch adds new REQ_OP_* values for SCSI passthrough and driver private requests, although it has to add two for each so that we can communicate the data in/out nature of the request. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
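After the fold, pass-through detection reduces to checks on the op; a sketch of the resulting helpers (consistent with the blk_rq_is_passthrough patch below):

    static inline bool blk_rq_is_scsi(struct request *rq)
    {
            return req_op(rq) == REQ_OP_SCSI_IN || req_op(rq) == REQ_OP_SCSI_OUT;
    }

    static inline bool blk_rq_is_private(struct request *rq)
    {
            return req_op(rq) == REQ_OP_DRV_IN || req_op(rq) == REQ_OP_DRV_OUT;
    }

    static inline bool blk_rq_is_passthrough(struct request *rq)
    {
            return blk_rq_is_scsi(rq) || blk_rq_is_private(rq);
    }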
2017-01-31 | block: introduce blk_rq_is_passthrough | Christoph Hellwig | 1 | -5/+11
This can be used to check for fs vs non-fs requests and basically removes all BLOCK_PC-specific knowledge from the block layer, as well as preparing for removing the cmd_type field in struct request. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-27 | block: split scsi_request out of struct request | Christoph Hellwig | 1 | -13/+0
And require all drivers that want to support BLOCK_PC to allocate it as the first member of their private data. To support this, the legacy IDE and BSG code is switched to set cmd_size on their queues to let the block layer allocate the additional space. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
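A driver opting in would lay out its per-request data along these lines (mydrv_request is a hypothetical example; scsi_req() per this series):

    struct mydrv_request {
            struct scsi_request sreq;   /* must be the first member */
            /* ... driver-private fields ... */
    };

    /* the block layer allocates cmd_size extra bytes behind each request,
     * so the core can recover the scsi_request from any struct request */
    static inline struct scsi_request *scsi_req(struct request *rq)
    {
            return blk_mq_rq_to_pdu(rq);
    }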
2017-01-27 | block: allow specifying size for extra command data | Christoph Hellwig | 1 | -0/+7
This mirrors the blk-mq capabilities to allocate extra driver-specific data behind struct request by setting a cmd_size field, as well as having a constructor / destructor for it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-27 | block: simplify blk_init_allocated_queue | Christoph Hellwig | 1 | -2/+1
Return an errno value instead of the passed in queue so that the callers don't have to keep track of two queues, and move the assignment of the request_fn and lock to the caller as passing them as argument doesn't simplify anything. While we're at it also remove two pointless NULL assignments, given that the request structure is zeroed on allocation. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-27 | Merge branch 'for-4.11/block' into for-4.11/rq-refactor | Jens Axboe | 1 | -3/+31
Signed-off-by: Jens Axboe <[email protected]>
2017-01-27 | blk-mq-sched: fix starvation for multiple hardware queues and shared tags | Jens Axboe | 1 | -0/+1
If we have both multiple hardware queues and shared tag map between devices, we need to ensure that we propagate the hardware queue restart bit higher up. This is because we can get into a situation where we don't have any IO pending on a hardware queue, yet we fail getting a tag to start new IO. If that happens, it's not enough to mark the hardware queue as needing a restart, we need to bubble that up to the higher level queue as well. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Tested-by: Hannes Reinecke <[email protected]>
2017-01-27 | blk-mq: create debugfs directory tree | Omar Sandoval | 1 | -0/+5
In preparation for putting blk-mq debugging information in debugfs, create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

Also add the scaffolding for the actual files that will go in here, either under the hardware queue or software queue directories. Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-17 | blk-mq-sched: add framework for MQ capable IO schedulers | Jens Axboe | 1 | -1/+3
This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interface, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-13 | block: add blk_rq_payload_bytes | Christoph Hellwig | 1 | -0/+13
Add a helper to calculate the actual data transfer size for special payload requests. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
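Per the description, the helper distinguishes special-payload requests (e.g. discards after the magic-payload rework below) from regular ones; roughly:

    static inline unsigned int blk_rq_payload_bytes(struct request *rq)
    {
            if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                    return rq->special_vec.bv_len;
            return blk_rq_bytes(rq);
    }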
2017-01-12 | block: Rename blk_queue_zone_size and bdev_zone_size | Damien Le Moal | 1 | -3/+3
All block device data fields and functions returning a number of 512B sectors are by convention named xxx_sectors while names in the form xxx_size are generally used for a number of bytes. The blk_queue_zone_size and bdev_zone_size functions were not following this convention so rename them. No functional change is introduced by this patch. Signed-off-by: Damien Le Moal <[email protected]> Collapsed the two patches, they were nonsensically split and broke bisection. Signed-off-by: Jens Axboe <[email protected]>
2017-01-11 | blk-mq: make mq_ops a const pointer | Jens Axboe | 1 | -1/+1
We never change it, make that clear. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]>
2017-01-11 | block: relax check on sg gap | Ming Lei | 1 | -1/+21
If the last bvec of the 1st bio and the 1st bvec of the next bio are physically contiguous, and the latter can be merged into the last segment of the 1st bio, we should consider that they don't violate the sg gap (or virt boundary) limit. Both Vitaly and Dexuan reported that lots of unmergeable small bios are observed when running mkfs on Hyper-V virtual storage, and performance becomes quite low. This patch fixes that performance issue. The same issue should exist on NVMe, since it sets a virt boundary too. Reported-by: Vitaly Kuznetsov <[email protected]> Reported-by: Dexuan Cui <[email protected]> Tested-by: Dexuan Cui <[email protected]> Cc: Keith Busch <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
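The relaxed condition can be pictured as follows (simplified sketch; the real check also honors the queue's other segment limits):

    /* pb: last bvec of the 1st bio, nb: 1st bvec of the next bio */
    phys_addr_t pb_end   = page_to_phys(pb.bv_page) + pb.bv_offset + pb.bv_len;
    phys_addr_t nb_start = page_to_phys(nb.bv_page) + nb.bv_offset;

    if (pb_end == nb_start &&
        pb.bv_len + nb.bv_len <= queue_max_segment_size(q))
            return false;   /* physically contiguous: no sg gap */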
2016-12-17 | block: Remove unused member (busy) from struct blk_queue_tag | Ritesh Harjani | 1 | -1/+0
Signed-off-by: Ritesh Harjani <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-13 | Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata | Linus Torvalds | 1 | -0/+14
Pull libata updates from Tejun Heo:

- Adam added opt-in ATA command priority support.

- There are machines which hide multiple nvme devices behind an ahci BAR. Dan Williams proposed a solution to force-switch the mode but it was deemed too hackish. People are gonna discuss the proper way to handle the situation in nvme standard meetings. For now, detect and warn about the situation.

- Low level driver specific changes.

Christoph Hellwig pipes in about the hidden nvme warning: "I wish that was the case. We've pretty much agreed that we'll want to implement it as a virtual PCIe root bridge, similar to Intels other 'innovation' VMD that we work around that way. But Intel management has apparently decided that they don't want to spend more cycles on this now that Lenovo has an optional BIOS that doesn't force this broken mode anymore, and no one outside of Intel has enough information to implement something like this. So for now I guess this warning is it, until Intel reconsiders and spends resources on fixing up the damage their Chipset people caused"

* 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
  ahci: warn about remapped NVMe devices
  ahci-remap.h: add ahci remapping definitions
  nvme: move NVMe class code to pci_ids.h
  pata: imx: support controller modes up to PIO4
  pata: imx: add support of setting timings for PIO modes
  pata: imx: set controller PIO mode with .set_piomode callback
  pata: imx: sort headers out
  ata: set ncq_prio_enabled iff device has support
  ata: ATA Command Priority Disabled By Default
  ata: Enabling ATA Command Priorities
  block: Add iocontext priority to request
  ahci: qoriq: added ls1046a platform support
2016-12-09 | block: improve handling of the magic discard payload | Christoph Hellwig | 1 | -3/+12
Instead of allocating a single unused biovec for discard requests, send them down without any payload. Instead we allow the driver to add a "special" payload using a biovec embedded into struct request (unioned over other fields never used while in the driver), and overloading the number of segments for this case.

This has a couple of advantages:

- we don't have to allocate the bio_vec
- the amount of special casing for discard requests in the block layer is significantly reduced
- using this same scheme for other request types is trivial, which will be important for implementing the new WRITE_ZEROES op on devices where it actually requires a payload (e.g. SCSI)
- we can get rid of playing games with the request length, as we'll never touch it and completions will work just fine
- it will allow us to support ranged discard operations in the future by merging non-contiguous discard bios into a single request
- last but not least it removes a lot of code

This patch is the common base for my WIP series for ranged discards and to remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
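On the driver side, handling a request that carries such a payload looks roughly like this (sketch; map_payload() is a hypothetical driver helper):

    if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
            /* one driver-visible segment, embedded in the request itself */
            struct bio_vec *bv = &rq->special_vec;

            map_payload(page_address(bv->bv_page) + bv->bv_offset,
                        bv->bv_len);        /* hypothetical helper */
    }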
2016-12-01 | block: add support for REQ_OP_WRITE_ZEROES | Chaitanya Kulkarni | 1 | -0/+19
This adds a new block layer operation to zero out a range of LBAs. This makes it possible to implement zeroing for devices that don't use either discard with a predictable zero pattern or WRITE SAME of zeroes. The prominent example of that is NVMe with the Write Zeroes command, but in the future, this should also help with improving the way zeroing discards work. For this operation, a suitable entry is exported in sysfs which indicates the maximum number of bytes allowed in one write zeroes operation by the device. Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
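Submission follows the usual bio pattern, just without a data payload; a hedged sketch against the block APIs of this era:

    bio = bio_alloc(gfp_mask, 0);               /* zero biovecs needed */
    bio->bi_iter.bi_sector = sector;
    bio->bi_bdev = bdev;
    bio_set_op_attrs(bio, REQ_OP_WRITE_ZEROES, 0);
    bio->bi_iter.bi_size = nr_sects << 9;       /* the range to zero */
    submit_bio(bio);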
2016-12-01 | block: add async variant of blkdev_issue_zeroout | Chaitanya Kulkarni | 1 | -0/+3
Similar to __blkdev_issue_discard this variant allows submitting the final bio asynchronously and chaining multiple ranges into a single completion. Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-11-18 | block: Change extern inline to static inline | Tobias Klauser | 1 | -1/+1
With compilers which follow the C99 standard (like modern versions of gcc and clang), "extern inline" does the opposite thing from older versions of gcc (emits code for an externally linkable version of the inline function). "static inline" does the intended behavior in all cases instead. Description taken from commit 6d91857d4826 ("staging, rtl8192e, LLVMLinux: Change extern inline to static inline"). This also fixes the following GCC warning when building with CONFIG_PM disabled: ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes] Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()") Reviewed-by: Mika Westerberg <[email protected]> Signed-off-by: Tobias Klauser <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
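The difference in a standalone C example:

    /* Under C99 semantics, 'extern inline' in a header makes every
     * including translation unit provide an externally linkable
     * definition -- the opposite of the old gnu89 behavior. */
    extern inline int twice(int x) { return 2 * x; }

    /* 'static inline' gives each translation unit its own local copy,
     * which is what a header-only helper wants under either dialect. */
    static inline int twice(int x) { return 2 * x; }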
2016-11-17 | blk-mq: make the polling code adaptive | Jens Axboe | 1 | -1/+1
The previous commit introduced the hybrid sleep/poll mode. Take that one step further, and use the completion latencies to automatically sleep for half the mean completion time. This is a good approximation. This changes the 'io_poll_delay' sysfs file a bit to expose the various options. Depending on the value, the polling code will behave differently:

    -1  Never enter hybrid sleep mode
     0  Use half of the completion mean for the sleep delay
    >0  Use this specific value as the sleep delay

Signed-off-by: Jens Axboe <[email protected]> Tested-By: Stephen Bates <[email protected]> Reviewed-By: Stephen Bates <[email protected]>
2016-11-17 | blk-mq: implement hybrid poll mode for sync O_DIRECT | Jens Axboe | 1 | -0/+1
This patch enables a hybrid polling mode. Instead of polling after IO submission, we can induce an artificial delay, and then poll after that. For example, if the IO is presumed to complete in 8 usecs from now, we can sleep for 4 usecs, wake up, and then do our polling. This still puts a sleep/wakeup cycle in the IO path, but instead of the wakeup happening after the IO has completed, it'll happen before. With this hybrid scheme, we can achieve big latency reductions while still using the same (or less) amount of CPU. Signed-off-by: Jens Axboe <[email protected]> Tested-By: Stephen Bates <[email protected]> Reviewed-By: Stephen Bates <[email protected]>
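Conceptually, the completion wait turns into the following (pseudo-C; helper names are illustrative rather than the kernel's):

    /* I/O presumed to complete in ~8 usec: sleep ~4 usec, then poll */
    u64 expected_ns = 8000;

    sleep_hrtimeout_ns(expected_ns / 2);        /* illustrative helper */
    while (!io_completed(cookie))               /* illustrative helper */
            poll_once(q, cookie);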
2016-11-11 | block: move poll code to blk-mq | Jens Axboe | 1 | -1/+1
The poll code is blk-mq specific, let's move it to blk-mq.c. This is a prep patch for improving the polling code. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
2016-11-10 | block: hook up writeback throttling | Jens Axboe | 1 | -0/+3
Enable throttling of buffered writeback to make it a lot smoother, with far less impact on other system activity. Background writeback should be, by definition, background activity. The fact that we flush huge bundles of it at a time means that it potentially has heavy impacts on foreground workloads, which isn't ideal. We can't easily limit the sizes of writes that we do, since that would impact file system layout in the presence of delayed allocation. So just throttle back buffered writeback, unless someone is waiting for it.

The algorithm for when to throttle takes its inspiration from the CoDel network scheduling algorithm. Like CoDel, blk-wb monitors the minimum latencies of requests over a window of time. In that window of time, if the minimum latency of any request exceeds a given target, then a scale count is incremented and the queue depth is shrunk. The next monitoring window is shrunk accordingly. Unlike CoDel, if we hit a window that exhibits good behavior, then we simply increment the scale count and re-calculate the limits for that scale value. This prevents us from oscillating between a close-to-ideal value and max all the time, instead remaining in the windows where we get good behavior.

Unlike CoDel, blk-wb allows the scale count to go negative. This happens if we primarily have writes going on. Unlike positive scale counts, this doesn't change the size of the monitoring window. When the heavy writers finish, blk-wb quickly snaps back to its stable state of a zero scale count.

The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency target to be met. It defaults to 2 msec for non-rotational storage, and 75 msec for rotational storage. Setting this value to '0' disables blk-wb. Generally, a user would not have to touch this setting.

We don't enable WBT on devices that are managed with CFQ and have a non-root block cgroup attached. If we have a proportional share setup on this particular disk, then the wbt throttling will interfere with that. We don't have a strong need for wbt for that case, since we will rely on CFQ doing that for us. Signed-off-by: Jens Axboe <[email protected]>
2016-11-10 | block: add scalable completion tracking of requests | Jens Axboe | 1 | -0/+7
For legacy block, we simply track them in the request queue. For blk-mq, we track them on a per-sw queue basis, which we can then sum up through the hardware queues and finally to a per device state. The stats are tracked in, roughly, 0.1s interval windows. Add sysfs files to display the stats. The feature is off by default, to avoid any extra overhead. In-kernel users of it can turn it on by setting QUEUE_FLAG_STATS in the queue flags. We currently don't turn it on if someone just reads any of the stats files, that is something we could add as well. Signed-off-by: Jens Axboe <[email protected]>
2016-11-05 | block: add code to track actual device queue depth | Jens Axboe | 1 | -0/+11
For blk-mq, ->nr_requests does track queue depth, at least at init time. But for the older queue paths, it's simply a soft setting. On top of that, it's generally larger than the hardware setting on purpose, to allow backup of requests for merging. Fill a hole in struct request_queue with a 'queue_depth' member, and add a helper that drivers can call to more closely inform the block layer of the real queue depth. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Jan Kara <[email protected]>
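A driver would then report its real depth with something like the following (assuming the helper this patch adds is named blk_set_queue_depth()):

    /* e.g. after (re)negotiating the hardware queue depth */
    blk_set_queue_depth(sdev->request_queue, depth);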
2016-11-03 | block: immediately dispatch big size request | Shaohua Li | 1 | -0/+1
Currently the block plug holds up to 16 non-mergeable requests. This makes sense if the request size is small, e.g., to reduce lock contention. But if the request size is big enough, we don't need to worry about lock contention. Holding such a request makes no sense and it lowers disk utilization. In practice, this improves 10% throughput for my raid5 sequential write workload. The size (128k) is arbitrary right now, but it makes sure lock contention is small. This probably could be more intelligent, e.g., check the average size of held requests. Since this is mainly for sequential IO, probably not worth it. V2: check the last request instead of the first request, so as long as there is one big size request we flush the plug. Signed-off-by: Shaohua Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
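The plug-path check then amounts to this (sketch; constant name per this patch):

    if (request_count >= BLK_MAX_REQUEST_COUNT ||
        (last && blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE))
            blk_flush_plug_list(plug, false);   /* dispatch immediately */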