2018-07-23  nvmet-rdma: add an error flow for post_recv failures  (Max Gurtovoy, 1 file, -5/+21)
The post_recv operation can fail, thus we should make sure to have an error flow during the initialization phase. While we're here, add a debug print in case of a failure. Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-23  nvmet-rdma: add unlikely check in the fast path  (Max Gurtovoy, 1 file, -1/+1)
ib_post_send operation should succeed unless something unusual happened to the ib device. Signed-off-by: Max Gurtovoy <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
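For illustration, a minimal sketch of this kind of fast-path annotation (names are illustrative, not the actual nvmet-rdma hunk):

```c
struct ib_send_wr *bad_wr;
int ret;

/* The rare ib_post_send() failure is annotated unlikely() so the
 * compiler keeps the successful path hot. */
ret = ib_post_send(qp, first_wr, &bad_wr);
if (unlikely(ret)) {
	pr_err("sending response failed: %d\n", ret);
	handle_send_error(rsp);	/* hypothetical error handler */
}
```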
2018-07-23  nvmet-rdma: support max(16KB, PAGE_SIZE) inline data  (Steve Wise, 6 files, -43/+169)
The patch enables inline data sizes using up to 4 recv sges, and capping the size at 16KB or at least 1 page size. So on a 4K page system, up to 16KB is supported, and for a 64K page system 1 page of 64KB is supported. We avoid > 0 order page allocations for the inline buffers by using multiple recv sges, one for each page. If the device cannot support the configured inline data size due to lack of enough recv sges, then log a warning and reduce the inline size. Add a new configfs port attribute, called param_inline_data_size, to allow configuring the size of inline data for a given nvmf port. The maximum size allowed is still enforced by nvmet-rdma with NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is now max(16KB, PAGE_SIZE). And the default size, if not specified via configfs, is still PAGE_SIZE. This preserves the existing behavior, but allows larger inline sizes for small page systems. If the configured inline data size exceeds NVMET_RDMA_MAX_INLINE_DATA_SIZE, a warning is logged and the size is reduced. If param_inline_data_size is set to 0, then inline data is disabled for that nvmf port. Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
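A hedged sketch of the sizing rule described above (variable names are assumptions, not the in-tree code):

```c
/* Cap the configured inline size at max(16KB, PAGE_SIZE) and spread it
 * over one recv SGE per page so no >0-order allocations are needed. */
inline_data_size = min_t(int, port_inline_data_size,
			 max_t(int, SZ_16K, PAGE_SIZE));
num_inline_sges  = DIV_ROUND_UP(inline_data_size, PAGE_SIZE);

if (num_inline_sges > device_max_recv_sges) {	/* assumed device limit */
	pr_warn("reducing inline data size to fit %d recv SGEs\n",
		device_max_recv_sges);
	inline_data_size = device_max_recv_sges * PAGE_SIZE;
	num_inline_sges  = device_max_recv_sges;
}
```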
2018-07-23  nvme-rdma: support up to 4 segments of inline data  (Steve Wise, 1 file, -11/+27)
Allow up to 4 segments of inline data for NVMF WRITE operations. This reduces latency for small WRITEs by removing the need for the target to issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp. Also cap the inline segments used based on the limitations of the device. Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Max Gurtovoy <[email protected]> Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-23  nvmet: add buffered I/O support for file backed ns  (Chaitanya Kulkarni, 4 files, -5/+67)
Add a new "buffered_io" attribute, which disabled direct I/O and thus enables page cache based caching when enabled. The attribute can only be changed when the namespace is disabled as the file has to be reopend for the change to take effect. The possibly blocking read/write are deferred to a newly introduced global workqueue. Signed-off-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-23  nvmet: add commands supported and effects log page  (Chaitanya Kulkarni, 1 file, -1/+34)
This patch adds support for the Commands Supported and Effects log page (Log Identifier 05h) for NVMeOF. This also makes it easier to find which commands are supported, e.g.:

subnqn : testnqn1

Admin Command Set
ACS2   [Get Log Page               ] 00000001
ACS6   [Identify                   ] 00000001
ACS8   [Abort                      ] 00000001
ACS9   [Set Features               ] 00000001
ACS10  [Get Features               ] 00000001
ACS12  [Asynchronous Event Request ] 00000001
ACS24  [Keep Alive                 ] 00000001

NVM Command Set
IOCS0  [Flush                      ] 00000001
IOCS1  [Write                      ] 00000001
IOCS2  [Read                       ] 00000001
IOCS8  [Write Zeroes               ] 00000001
IOCS9  [Dataset Management         ] 00000001

This particular functionality can be used from the host side to examine the NVMeOF ctrl commands supported. Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-23  nvme: move init of keep_alive work item to controller initialization  (James Smart, 1 file, -3/+4)
Currently, the code initializes the keep alive work item whenever nvme_start_keep_alive() is called. However, this routine is called several times while reconnecting, etc. Although it's hoped that keep alive is always disabled and not scheduled when start is called, re-initializing the work item while it is scheduled or completing can have very bad side effects. There's no need for re-initialization. Move the keep_alive work item and cmd struct initialization to controller init. Signed-off-by: James Smart <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-23  nvme.h: resync with nvme-cli  (Revanth Rajashekar, 1 file, -0/+5)
Added some feature ids that are present in nvme-cli but not in the kernel. Signed-off-by: Revanth Rajashekar <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-22  blk-mq: fail the request in case issue failure  (Ming Lei, 1 file, -2/+6)
Inside blk_mq_try_issue_list_directly(), if issuing the request fails, we shouldn't try to issue it again, otherwise the warning in blk_mq_start_request() will be triggered. This change aligns with the behaviour of other request issue & dispatch paths. Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: Kashyap Desai <[email protected]> Cc: Laurence Oberman <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: kernel test robot <[email protected]> Cc: LKP <[email protected]> Reported-by: kernel test robot <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
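The shape of the change, as a hedged sketch rather than the literal hunk:

```c
blk_status_t ret;

/* A resource-busy result is re-queued; any other failure now ends the
 * request with that status instead of being issued a second time. */
ret = blk_mq_request_issue_directly(rq);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
	blk_mq_request_bypass_insert(rq, false);
else if (ret != BLK_STS_OK)
	blk_mq_end_request(rq, ret);
```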
2018-07-22  blk-rq-qos: make depth comparisons unsigned  (Josef Bacik, 2 files, -5/+5)
With the change to use UINT_MAX I broke the depth check, as any value of inflight (i.e. 0) would be less than (int)UINT_MAX. Fix this by changing everything to unsigned int to match the depth. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
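A stand-alone illustration of the pitfall (not the blk-rq-qos code itself):

```c
#include <limits.h>
#include <stdio.h>

/* Once a depth of UINT_MAX is pushed through a signed int it becomes -1,
 * so ordering against it flips; keeping the compares unsigned fixes it. */
int main(void)
{
	unsigned int inflight = 0;
	unsigned int max_depth = UINT_MAX;

	printf("signed:   inflight < depth? %d\n",
	       (int)inflight < (int)max_depth);	/* 0 < -1 -> 0 */
	printf("unsigned: inflight < depth? %d\n",
	       inflight < max_depth);			/* 0 < 4294967295 -> 1 */
	return 0;
}
```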
2018-07-18  blkcg: Track DISCARD statistics and output them in cgroup io.stat  (Tejun Heo, 3 files, -9/+19)
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two fields, dbytes and dios, are added to count the total bytes and the number of discards, respectively. Signed-off-by: Tejun Heo <[email protected]> Cc: Andy Newell <[email protected]> Cc: Michael Callahan <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-18  block: Track DISCARD statistics and output them in stat and diskstat  (Michael Callahan, 7 files, -18/+68)
Add tracking of REQ_OP_DISCARD ios to the partition statistics and append them to the various stat files in /sys as well as /proc/diskstats. These are tracked with the same four stats as reads and writes:

 - Number of discard ios completed
 - Number of discard ios merged
 - Number of discard sectors completed
 - Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then using it to index that stat field for discard requests. tj: Refreshed on top of v4.17 and other previous updates. Signed-off-by: Michael Callahan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: Andy Newell <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-18  block: Add and use op_stat_group() for indexing disk_stat fields.  (Michael Callahan, 13 files, -43/+50)
Add and use a new op_stat_group() function for indexing partition stat fields rather than indexing them by rq_data_dir() or bio_data_dir(). This function works similarly to op_is_sync() in that it takes the request::cmd_flags or bio::bi_opf flags and determines which stats should get updated. In addition, the second parameter to generic_start_io_acct() and generic_end_io_acct() is now a REQ_OP rather than simply a read or write bit, and it uses op_stat_group() on the parameter to determine the stat group. Note that the partition in_flight counts are not part of the per-cpu statistics and as such are not indexed via this function; they are still indexed by op_is_write(). tj: Refreshed on top of v4.17. Updated to pass around REQ_OP. Signed-off-by: Michael Callahan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Cc: Joshua Morris <[email protected]> Cc: Philipp Reisner <[email protected]> Cc: Matias Bjorling <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Alasdair Kergon <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
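A hedged sketch of what such a helper can look like; the discard branch reflects the DISCARD-tracking commits above, and the exact in-tree definition may differ:

```c
/* Map request::cmd_flags / bio::bi_opf to a partition stat group. */
static inline int op_stat_group(unsigned int op)
{
	if ((op & REQ_OP_MASK) == REQ_OP_DISCARD)
		return STAT_DISCARD;
	return op_is_write(op) ? STAT_WRITE : STAT_READ;
}
```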
2018-07-18  block: Define and use STAT_READ and STAT_WRITE  (Michael Callahan, 8 files, -28/+40)
Add defines for STAT_READ and STAT_WRITE for indexing the partition stat entries. This clarifies some fs/ code which has hardcoded 1 for STAT_WRITE and will make it easier to extend the stats with additional fields. tj: Refreshed on top of v4.17. Signed-off-by: Michael Callahan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Jaegeuk Kim <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-18  block: Add part_stat_read_accum to read across field entries.  (Michael Callahan, 4 files, -7/+7)
Add a part_stat_read_accum macro to genhd.h to read and sum across field entries, for example to sum up the number of read and write sectors completed. In addition to being a reasonable cleanup by itself, this will make it easier to add new stat fields in the future. tj: Refreshed on top of v4.17. Signed-off-by: Michael Callahan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
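A sketch of such an accumulator, assuming the STAT_READ/STAT_WRITE indices introduced in this series; the in-tree macro may differ in detail:

```c
/* Sum one per-cpu stat field across the read and write groups,
 * e.g. part_stat_read_accum(part, sectors) for total sectors. */
#define part_stat_read_accum(part, field)			\
	(part_stat_read(part, field[STAT_READ]) +		\
	 part_stat_read(part, field[STAT_WRITE]))
```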
2018-07-18  block: make bdev_ops->rw_page() take a REQ_OP instead of bool  (Tejun Heo, 7 files, -33/+34)
c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for read/write") replaced @op with boolean @is_write, which limited the amount of information going into ->rw_page() and more importantly page_endio(), which removed the need to expose block internals to mm. Unfortunately, we want to track discards separately and @is_write isn't enough information. This patch updates bdev_ops->rw_page() to take REQ_OP instead but leaves page_endio() to take bool @is_write. This allows the block part of operations to have enough information while not leaking it to mm. Signed-off-by: Tejun Heo <[email protected]> Cc: Mike Christie <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-17  pktcdvd: remove assignment in if condition  (RAGHU Halharvi, 1 file, -23/+46)
Remove checkpatch errors caused by assignment operations inside if conditions. Signed-off-by: RAGHU Halharvi <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
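The kind of rewrite involved, shown on an illustrative snippet (function name borrowed from pktcdvd; not the literal hunk):

```c
/* Before: assignment buried in the if condition (checkpatch error). */
if ((pd = pkt_find_dev_from_minor(minor)) == NULL)
	return -ENODEV;

/* After: the assignment is pulled out of the condition. */
pd = pkt_find_dev_from_minor(minor);
if (pd == NULL)
	return -ENODEV;
```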
2018-07-17  blk-mq: issue directly if hw queue isn't busy in case of 'none'  (Ming Lei, 3 files, -2/+36)
In case of the 'none' io scheduler, when the hw queue isn't busy, it isn't necessary to enqueue the request to the sw queue and dequeue it from the sw queue, because the request may be submitted to the hw queue asap without extra cost; meanwhile there shouldn't be many requests in the sw queue, and we don't need to worry about the effect on IO merging. There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...) which may connect high performance devices, so 'none' is often required for obtaining good performance. This patch improves IOPS and decreases CPU utilization on megaraid_sas, per Kashyap's test. Cc: Kashyap Desai <[email protected]> Cc: Laurence Oberman <[email protected]> Cc: Omar Sandoval <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Hannes Reinecke <[email protected]> Reported-by: Kashyap Desai <[email protected]> Tested-by: Kashyap Desai <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-16  blk-iolatency: truncate our current time  (Josef Bacik, 1 file, -0/+6)
In our longer tests we noticed that some boxes would degrade to the point of uselessness. This is because we truncate the current time when saving it in our bio, but I was using the raw current time to subtract from. So once the box had been up a certain amount of time it would appear as if our IO's were taking several years to complete. Fix this by truncating the current time so it matches the issue time. Verified this worked by running with this patch for a week on our test tier. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-16  blk-iolatency: don't change the latency window  (Josef Bacik, 1 file, -10/+0)
Early versions of these patches had us waiting for seconds at a time during submission, so we had to adjust the timing window we monitored for latency. Now we don't do things like that so this is unnecessary code. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: assume that chunks are closed on 1.2 devices  (Hans Holmberg, 1 file, -2/+3)
We can't know if a block is closed or not on 1.2 devices, so assume closed state to make sure that blocks are erased before writing. Fixes: 32ef9412c114 ("lightnvm: pblk: implement get log report chunk") Signed-off-by: Hans Holmberg <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: add asynchronous partial read  (Heiner Litz, 2 files, -63/+130)
In the read path, partial reads are currently performed synchronously which affects performance for workloads that generate many partial reads. This patch adds an asynchronous partial read path as well as the required partial read ctx. Signed-off-by: Heiner Litz <[email protected]> Reviewed-by: Igor Konopko <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: mark expected switch fall-through  (Gustavo A. R. Silva, 1 file, -0/+1)
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: expose generic disk name on pr_* msgs  (Matias Bjørling, 9 files, -138/+153)
The error messages in pblk do not say which pblk instance a message occurred from. Update each error message to reflect the instance it belongs to, and also prefix it with pblk, so we know the message comes from the pblk module. Signed-off-by: Matias Bjørling <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: limit get chunk meta request size  (Matias Bjørling, 1 file, -2/+8)
For devices that do not specify a limit on their transfer size, the get_chk_meta command may send down a single I/O retrieving the full chunk metadata table, resulting in large 2-4MB I/O requests. Instead, split up the I/Os to a maximum of 256KB and issue them separately to reduce memory requirements. Signed-off-by: Matias Bjørling <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: fix read_bitmap for 32bit archs  (Matias Bjørling, 1 file, -7/+7)
If using pblk on a 32bit architecture, and there is a need to perform a partial read, the partial read bitmap will only have allocated 32 entries, whereas 64 are needed. Make sure that the read_bitmap is initialized to 64 bits on 32bit architectures as well. Signed-off-by: Matias Bjørling <[email protected]> Reviewed-by: Igor Konopko <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: Remove redundant rq->__data_len initialization  (Bart Van Assche, 1 file, -4/+2)
Since both blk_old_get_request() and blk_mq_alloc_request() initialize rq->__data_len to zero, it is not necessary to initialize that member in nvme_nvm_alloc_request(). Hence remove the rq->__data_len initialization from nvme_nvm_alloc_request(). Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: enable line minor version detection  (Matias Bjørling, 1 file, -2/+3)
When recovering a line, an extra check was added when debugging was active, such that the minor version was also checked. Unfortunately, this used the ifdef NVM_DEBUG, which is not correct. Instead use the proper DEBUG def, and now that it compiles, also fix the variable. Signed-off-by: Matias Bjørling <[email protected]> Fixes: d0ab0b1ab991f ("lightnvm: pblk: check data lines version on recovery") Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: move NVM_DEBUG to pblk  (Matias Bjørling, 10 files, -70/+72)
There are no users of CONFIG_NVM_DEBUG in the LightNVM subsystem; all users are in pblk. Rename NVM_DEBUG to NVM_PBLK_DEBUG and enable it only for pblk. Also fix up the CONFIG_NVM_PBLK entry to follow the code style for Kconfig files. Signed-off-by: Matias Bjørling <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  lightnvm: pblk: handle case when mw_cunits equals to 0  (Marcin Dziegielewski, 2 files, -7/+5)
Some devices can expose mw_cunits equal to 0, which can cause the creation of a too-small write buffer and cause performance to drop on write workloads. Additionally, the write buffer size must cover the write data requirements, such as WS_MIN and MW_CUNITS - it must be greater than or equal to the larger of the two, multiplied by the number of PUs. However, for performance reasons, use the WS_OPT value in the calculation instead of WS_MIN. Because the place where the buffer size is calculated was changed, this patch also removes the pgs_in_buffer field in the pblk structure. Signed-off-by: Marcin Dziegielewski <[email protected]> Signed-off-by: Igor Konopko <[email protected]> Reviewed-by: Javier González <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-13  block: remove blkdev_entry_to_request() macro  (Vladimir Zapolskiy, 1 file, -2/+0)
Remove blkdev_entry_to_request() macro, which remained unused through the observable history, also note that it repeats list_entry_rq() macro verbatim. Signed-off-by: Vladimir Zapolskiy <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-12  block: skd: Use %pad printk format for dma_addr_t values  (Helge Deller, 1 file, -8/+8)
Use the existing %pad printk format to print dma_addr_t values. This avoids the following warnings when compiling on the parisc64 platform:

drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]

Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Helge Deller <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-12  bsg: remove read/write support  (Christoph Hellwig, 1 file, -454/+6)
The code poses a security risk due to user memory access in ->release and had an API that can't be used reliably. As far as we know it was never used for real, but if that turns out wrong we'll have to revert this commit and come up with a band aid. Jann Horn did look through software archives for users of this interface, and the only users found were example code in sg3_utils, and optional support in an optional module of the tgt user space iscsi target, which looks like a proof of concept extension of the /dev/sg read/write support. Tony Battersby chimes in that the code is basically unsafe to use in general: The read/write interface on /dev/bsg is impossible to use safely because the list of completed commands is per-device (bd->done_list) rather than per-fd like it is with /dev/sg. So if program A and program B are both using the write/read interface on the same bsg device, then their command responses will get mixed up, and program A will read() some command results from program B and vice versa. So no, I don't use read/write on /dev/bsg. From a security standpoint, it should definitely be fixed or removed. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-11  blk-iolatency: fix max_depth comparisons  (Josef Bacik, 1 file, -6/+6)
max_depth used to be a u64, but I changed it to an unsigned int and didn't convert my comparisons over everywhere. Fix by using UINT_MAX everywhere instead of (u64)-1. Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-10  block: iolatency: avoid 64-bit division  (Arnd Bergmann, 1 file, -2/+1)
On 32-bit architectures, dividing a 64-bit number needs to use the do_div() function or something like it to avoid a link failure:

block/blk-iolatency.o: In function `iolatency_prfill_limit':
blk-iolatency.c:(.text+0x8cc): undefined reference to `__aeabi_uldivmod'

Using div_u64() gives us the best output and avoids the need for an explicit cast. Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
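A small sketch of the portable division (variable names are illustrative):

```c
#include <linux/math64.h>

/* On 32-bit, "total_ns / nr_samples" on a u64 dividend emits a libgcc
 * call such as __aeabi_uldivmod and fails to link in the kernel;
 * div_u64() avoids both that and an explicit cast. */
u64 avg_ns = div_u64(total_ns, nr_samples);
```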
2018-07-09  block/DAC960.c: fix defined but not used build warnings  (Randy Dunlap, 1 file, -3/+6)
Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled by marking the unused functions as __maybe_unused.

../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]

Signed-off-by: Randy Dunlap <[email protected]> Cc: Jens Axboe <[email protected]> Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
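The shape of the fix, abridged (the function body is omitted):

```c
/* __maybe_unused silences -Wunused-function when the only caller is
 * compiled out under CONFIG_PROC_FS=n. */
static int __maybe_unused dac960_proc_show(struct seq_file *m, void *v)
{
	/* ...unchanged body... */
	return 0;
}
```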
2018-07-09  null_blk: add zone support  (Matias Bjørling, 5 files, -3/+234)
Adds support for exposing a null_blk device through the zone device interface. The interface is managed with the parameters zoned and zone_size. If zoned is set, the null_blk instance registers as a zoned block device. The zone_size parameter defines how big each zone will be. Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  null_blk: move shared definitions to header file  (Matias Bjørling, 2 files, -75/+81)
Split the null_blk device driver, such that it can prepare for zoned block interface support. Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  block: Add default switch case to blk_pm_allow_request() to kill warning  (Geert Uytterhoeven, 1 file, -2/+2)
With gcc 4.9.0 and 7.3.0:

block/blk-core.c: In function 'blk_pm_allow_request':
block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
  switch (rq->q->rpm_status) {
  ^

Convert the return statement below the switch() block into a default case to fix this. Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM") Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
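A sketch of the reworked switch, reconstructed from the warning text above (the surrounding function is abridged):

```c
switch (rq->q->rpm_status) {
case RPM_RESUMING:
case RPM_SUSPENDING:
	return rq->rq_flags & RQF_PM;
case RPM_SUSPENDED:
	return false;
default:
	/* covers RPM_ACTIVE and silences -Wswitch */
	return true;
}
```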
2018-07-09  block: fix infinite loop if the device loses discard capability  (Mikulas Patocka, 1 file, -0/+10)
If __blkdev_issue_discard is in progress and a device mapper device is reloaded with a table that doesn't support discard, q->limits.max_discard_sectors is set to zero. This results in infinite loop in __blkdev_issue_discard. This patch checks if max_discard_sectors is zero and aborts with -EOPNOTSUPP. Signed-off-by: Mikulas Patocka <[email protected]> Tested-by: Zdenek Kabelac <[email protected]> Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
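A hedged sketch of the guard described above (not the literal hunk):

```c
/* If the reloaded queue no longer supports discard, abort instead of
 * looping forever on a zero-sized request. */
if (!q->limits.max_discard_sectors)
	return -EOPNOTSUPP;
```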
2018-07-09  block, mm: remove unnecessary __GFP_HIGH flag  (Shakeel Butt, 1 file, -1/+1)
The flag GFP_ATOMIC already contains __GFP_HIGH. There is no need to explicitly OR in __GFP_HIGH again, so just remove the unnecessary __GFP_HIGH. Signed-off-by: Shakeel Butt <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
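For reference, an illustrative call site (the allocation itself is hypothetical):

```c
/* GFP_ATOMIC is defined as (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM),
 * so OR-ing __GFP_HIGH in again is a no-op. */
bio = bio_alloc(GFP_ATOMIC, nr_vecs);	/* was: GFP_ATOMIC | __GFP_HIGH */
```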
2018-07-09  null_blk: remove NULLB_DEV_FL_CONFIGURED on turning off nullb device  (Liu Bo, 1 file, -0/+1)
Currently the mbps knob can only be set once before switching the power knob to on; after the power knob has been set at least once, there is no way to set the mbps knob again due to -EBUSY. As nullb is mainly used for testing, in order to make it flexible, this removes the flag NULLB_DEV_FL_CONFIGURED so that the mbps knob can be reset when the power knob is off, e.g.:

echo 0 > /config/nullb/a/power
echo 40 > /config/nullb/a/mbps
echo 1 > /config/nullb/a/power

The same applies to the other knobs under /config/nullb/a. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  mm: skip readahead if the cgroup is congested  (Josef Bacik, 1 file, -0/+7)
We noticed in testing we'd get pretty bad latency stalls under heavy pressure because read ahead would try to do its thing while the cgroup was under severe pressure. If we're under this much pressure we want to do as little IO as possible so we can still make progress on real work if we're a throttled cgroup, so just skip readahead if our group is under pressure. Signed-off-by: Josef Bacik <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
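The shape of the early-out, assuming the blk_cgroup_congested() helper added by the blk-iolatency series:

```c
/* Skip the optional readahead IO entirely when the issuing task's blkcg
 * is congested, so a throttled group generates as little IO as possible. */
if (blk_cgroup_congested())
	return;
```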
2018-07-09  Documentation: add a doc for blk-iolatency  (Josef Bacik, 1 file, -0/+79)
A basic documentation to describe the interface, statistics, and behavior of io.latency. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  block: introduce blk-iolatency io controller  (Josef Bacik, 6 files, -2/+957)
Current IO controllers for the block layer are less than ideal for our use case. The io.max controller is great at hard limiting, but it is not work conserving. This patch introduces io.latency. You provide a latency target for your group and we monitor the io in short windows to make sure we are not exceeding those latency targets. This makes use of the rq-qos infrastructure and works much like the wbt stuff. There are a few differences from wbt:

 - It's bio based, so the latency covers the whole block layer in addition to the actual io.
 - We will throttle all IO types that come in here if we need to.
 - We use the mean latency over the 100ms window. This is because writes can be particularly fast, which could give us a false sense of the impact of other workloads on our protected workload.
 - By default there's no throttling; we set the queue_depth to INT_MAX so that we can have as many outstanding bio's as we're allowed to. Only at throttle time do we pay attention to the actual queue depth.
 - We backcharge cgroups for root cg issued IO and induce artificial delays in order to deal with cases like metadata only or swap heavy workloads.

In testing this has worked out relatively well. Protected workloads will throttle noisy workloads down to 1 io at a time if they are doing normal IO on their own, or induce up to a 1 second delay per syscall if they are doing a lot of root issued IO (metadata/swap IO). Our testing has revolved mostly around our production web servers where we have hhvm (the web server application) in a protected group and everything else in another group. We see slightly higher requests per second (RPS) on the test tier vs the control tier, and much more stable RPS across all machines in the test tier vs the control tier. Another test we run is a slow memory allocator in the unprotected group. Before, this would eventually push us into swap and cause the whole box to die and not recover at all. With these patches we see slight RPS drops (usually 10-15%) before the memory consumer is properly killed and things recover within seconds. Signed-off-by: Josef Bacik <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  rq-qos: introduce dio_bio callback  (Josef Bacik, 3 files, -0/+16)
wbt cares only about request completion time, but controllers may need information that is on the bio itself, so add a done_bio callback for rq-qos so things like blk-iolatency can use it to have the bio when it completes. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
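An abridged sketch of the callback table with the new hook (other members omitted; signatures assumed):

```c
struct rq_qos_ops {
	/* ... */
	void (*done)(struct rq_qos *rqos, struct request *rq);
	void (*done_bio)(struct rq_qos *rqos, struct bio *bio);	/* new */
	/* ... */
};
```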
2018-07-09  block: remove external dependency on wbt_flags  (Josef Bacik, 6 files, -42/+68)
We don't really need to save this stuff in the core block code, we can just pass the bio back into the helpers later on to derive the same flags and update the rq->wbt_flags appropriately. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  blk-rq-qos: refactor out common elements of blk-wbt  (Josef Bacik, 10 files, -251/+478)
blkcg-qos is going to do essentially what wbt does, only on a cgroup basis. Break out the common code that will be shared between blkcg-qos and wbt into blk-rq-qos.* so they can both utilize the same infrastructure. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  blk-stat: export helpers for modifying blk_rq_stat  (Josef Bacik, 2 files, -8/+12)
We need to use blk_rq_stat in the blkcg qos stuff, so export some of these helpers so they can be used by other things. Signed-off-by: Josef Bacik <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-09  memcontrol: schedule throttling if we are congested  (Tejun Heo, 7 files, -14/+81)
Memory allocations can induce swapping via kswapd or direct reclaim. If we are having IO done for us by kswapd and don't actually go into direct reclaim we may never get scheduled for throttling. So instead check to see if our cgroup is congested, and if so schedule the throttling. Before we return to user space the throttling stuff will only throttle if we actually required it. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Jens Axboe <[email protected]>