Age  Commit message  Author  Files  Lines
2018-08-05  lightnvm: remove minor version check for 2.0  Matias Bjørling  1  -6/+0
A minor version number increase should not break backwards compatibility. Fixes: 3cb98f84d368b ("lightnvm: add minor version to generic geometry") Reviewed-by: Javier González <[email protected]> Signed-off-by: Matias Bjørling <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-05  Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2  Jens Axboe  9  -74/+880
Pull NVMe changes from Christoph: "This contains the support for TP4004, Asymmetric Namespace Access, which makes NVMe multipathing usable in practice." * 'nvme-4.19' of git://git.infradead.org/nvme: nvmet: use Retain Async Event bit to clear AEN nvmet: support configuring ANA groups nvmet: add minimal ANA support nvmet: track and limit the number of namespaces per subsystem nvmet: keep a port pointer in nvmet_ctrl nvme: add ANA support nvme: remove nvme_req_needs_failover nvme: simplify the API for getting log pages nvme.h: add ANA definitions nvme.h: add support for the log specific field Signed-off-by: Jens Axboe <[email protected]>
2018-08-05  Merge tag 'v4.18-rc6' into for-4.19/block2  Jens Axboe  487  -2406/+4164
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  scsi: Check sense buffer size at build time  Kees Cook  3  -8/+18
To avoid introducing problems like those fixed in commit f7068114d45e ("sr: pass down correctly sized SCSI sense buffer"), this creates a macro wrapper for scsi_execute() that verifies the size of the sense buffer similar to what was done for command string sizes in commit 3756f6401c30 ("exec: avoid gcc-8 warning for get_task_comm"). Another solution could be to add a length argument to scsi_execute(), but this function already takes a lot of arguments and Jens was not fond of that approach. Additionally, this moves the SCSI_SENSE_BUFFERSIZE definition into scsi_device.h, and removes a redundant include for scsi_device.h from scsi_cmnd.h. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
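A minimal userspace sketch of such a build-time size check, not the kernel's actual macro. The names `execute_checked` and `do_execute` are hypothetical; only the 96-byte buffer size comes from this log. The wrapper refuses to compile unless `sizeof(sense)` matches the expected size, which catches a bare pointer or a short array passed by mistake.

```c
#include <string.h>

#define SENSE_BUFFERSIZE 96 /* the 96-byte sense buffer size mentioned in this log */

/* Hypothetical worker standing in for the real function: it clears the buffer. */
static int do_execute(unsigned char *sense)
{
	memset(sense, 0, SENSE_BUFFERSIZE);
	return 0;
}

/*
 * Wrapper macro: char[-1] is an invalid array type, so compilation fails
 * unless sizeof(sense) == SENSE_BUFFERSIZE. The argument must therefore be
 * an array of exactly that size (this sketch does not handle a NULL buffer).
 */
#define execute_checked(sense) \
	((void)sizeof(char[sizeof(sense) == SENSE_BUFFERSIZE ? 1 : -1]), \
	 do_execute(sense))
```

Passing `unsigned char *p` instead of an array would make `sizeof(sense)` the pointer size, so the build would stop at the call site rather than overrunning the buffer at run time.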
2018-08-02  libata-scsi: Move sense buffers onto stack  Kees Cook  1  -12/+6
To support future compile-time sizeof() checks that will be able to validate the length of sense buffers, this removes the only dynamically allocated sense buffers in the tree by putting the 96 byte sense buffers on the stack. Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Tejun Heo <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  cdrom: Use struct scsi_sense_hdr internally  Kees Cook  2  -3/+7
This removes more casts of struct request_sense and uses the standard struct scsi_sense_hdr instead. This also fixes any possible stale values since the prior code did not check the sense length. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  ide-cd: Remove redundant sense buffer  Kees Cook  1  -8/+9
This is already able to process the sense buffer, so remove the redundant parsing during the failure path. This also fixes any possible stale values since the prior code did not check the sense length. Acked-by: David S. Miller <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  block: Switch struct packet_command to use struct scsi_sense_hdr  Kees Cook  7  -65/+63
There is a lot of needless struct request_sense usage in the CDROM code. These can all be struct scsi_sense_hdr instead, to avoid any confusion over their respective structure sizes. This patch is a lot of noise changing "sense" to "sshdr", but the final code is more readable to distinguish between "sense" meaning "struct request_sense" and "sshdr" meaning "struct scsi_sense_hdr". Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  target: don't depend on SCSI  Christoph Hellwig  1  -2/+3
The core target code only needs code from scsi_common.c, which is now separately selectable. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  scsi: build scsi_common.o for all scsi passthrough request users  Christoph Hellwig  2  -2/+2
Split scsi_common.o out of SCSI so that non-SCSI users can pull it in easily for future sense buffer helper usage. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  scsi: cxlflash: Drop unused sense buffers  Kees Cook  2  -11/+4
This removes the unused sense buffer in read_cap16() and write_same16(). Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  ide-cd: Drop unused sense buffers  Kees Cook  3  -44/+28
This drops unused sense buffers from: cdrom_eject() cdrom_read_capacity() cdrom_read_tocentry() ide_cd_lockdoor() ide_cd_read_toc() Acked-by: David S. Miller <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  blk-mq: fix updating tags depth  Ming Lei  1  -4/+4
The passed 'nr' from userspace represents the total depth, while inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth and 'nr_reserved_tags' stores the reserved part. There are two issues in blk_mq_tag_update_depth() now: 1) for growing tags, we should have used the passed 'nr', and kept the number of reserved tags unchanged. 2) the passed 'nr' should have been checked against 'tags->nr_tags', instead of the number of the normal part. This patch fixes the above two cases, and avoids a kernel crash caused by wrongly resizing the sbitmap queue. Cc: "Ewan D. Milne" <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Omar Sandoval <[email protected]> Tested-by: Marco Patalano <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
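The two rules in this message can be sketched in plain C as below. This is an illustration under assumptions, not the kernel code: `struct tags`, `plan_depth_update` and the `RESIZE_*` names are hypothetical, and rejecting a depth that leaves no normal tags is my own addition.

```c
/* Simplified stand-in for the tag bookkeeping described above. */
struct tags {
	unsigned int nr_tags;          /* total depth, reserved part included */
	unsigned int nr_reserved_tags; /* reserved part */
};

enum resize_action { RESIZE_INVALID, RESIZE_GROW, RESIZE_IN_PLACE };

struct resize_plan {
	enum resize_action action;
	unsigned int total;    /* total depth after the update */
	unsigned int reserved; /* reserved depth after the update */
};

static struct resize_plan plan_depth_update(const struct tags *t, unsigned int nr)
{
	struct resize_plan p = { RESIZE_INVALID, t->nr_tags, t->nr_reserved_tags };

	/* Assumption: a depth that leaves no normal tags is rejected. */
	if (nr <= t->nr_reserved_tags)
		return p;

	/* Rule 1: use the passed total 'nr'; the reserved count is unchanged. */
	p.total = nr;
	/* Rule 2: compare 'nr' with the total depth, not with the normal part. */
	p.action = nr > t->nr_tags ? RESIZE_GROW : RESIZE_IN_PLACE;
	return p;
}
```

With 64 total tags of which 4 are reserved, a request for 62 is still a shrink; comparing against the normal part (60) would wrongly treat it as growth.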
2018-08-02  block: really disable runtime-pm for blk-mq  Ming Lei  1  -2/+4
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block: disable runtime-pm for blk-mq") tried to disable it. Unfortunately, it can't take effect in that way since user space still can switch it on via 'echo auto > /sys/block/sdN/device/power/control'. This patch disables runtime-pm for blk-mq really by pm_runtime_disable() and fixes all kinds of PM related kernel crash. Cc: Tomas Janousek <[email protected]> Cc: Przemek Socha <[email protected]> Cc: Alan Stern <[email protected]> Cc: <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Patrick Steinhardt <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  aoe: mark expected switch fall-through  Gustavo A. R. Silva  1  -0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 114722 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-02  block: make iolatency avg_lat exponentially decay  Dennis Zhou (Facebook)  2  -24/+57
Currently, avg_lat is calculated by accumulating the mean of every window in a long running cumulative average. As time goes on, the metric becomes less and less useful due to the accumulated history. This patch reuses the same calculation done in load averages to make the avg_lat metric more lively. Unlike load averages, the avg only advances when a window elapses (due to an io). Idle periods extend the most recent window. Bucketing is used to limit the history of avg_lat by binding it to the window size. So, the window range for 1/exp (decay rate) is [1 min, 2.5 min) when windows elapse immediately. The current sample window size is exposed in the debug info to enable calculation of the window range. Signed-off-by: Dennis Zhou <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
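A load-average style update can be sketched in fixed point as follows. This is a simplified illustration, not the patch itself: `decay_avg` is a hypothetical name, the 11-bit precision mirrors the scheme used by the kernel's load averages, and rounding and the bucketing of windows are left out.

```c
#define FSHIFT  11              /* bits of fixed-point precision */
#define FIXED_1 (1UL << FSHIFT) /* 1.0 in fixed point */

/*
 * One step of an exponentially decaying average:
 *   avg' = avg * exp + sample * (1 - exp)
 * where 'exp' is the decay factor in fixed point (0 .. FIXED_1).
 * It is meant to be called once per elapsed window.
 */
static unsigned long decay_avg(unsigned long avg, unsigned long exp,
			       unsigned long sample)
{
	return (avg * exp + sample * (FIXED_1 - exp)) >> FSHIFT;
}
```

A factor close to `FIXED_1` keeps a long memory, a factor near zero tracks the latest window almost directly, so old history fades instead of accumulating forever.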
2018-08-01  blk-cgroup: clear the throttle queue on fork  Josef Bacik  1  -0/+5
We were hitting a panic in production where we put too many times on the request queue. This is because we'd get the throttle_queue of the parent if we fork()'ed while we needed to be throttled, but we didn't have a reference on it. Instead just clear these flags on fork so the child doesn't pay for the sins of its father. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-01  blk-cgroup: hold the queue ref during throttling  Josef Bacik  1  -1/+1
The blkg lifetime is protected by the queue lifetime, so we need to put the queue _after_ we're done using the blkg. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-01  blk-iolatency: fix blkg leak in timer_fn  Josef Bacik  1  -1/+1
At this point we have a ref on the blkg, so we need to drop it if we don't have an iolat. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-08-01  block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow path  zhong jiang  1  -3/+2
Simplify the code by using PTR_ERR_OR_ZERO instead of open-coding the equivalent check. Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: zhong jiang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
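For readers unfamiliar with the idiom, here is a userspace re-creation of the error-pointer helpers. The lowercase names are stand-ins written for this sketch; the kernel's own macros live in its headers and are not reproduced here. The 4095 limit follows the kernel convention of encoding small negative errno values in the top of the address space.

```c
#include <stdint.h>

#define MAX_ERRNO 4095 /* largest errno value encoded in a pointer */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* A pointer is an error if it falls in the last MAX_ERRNO addresses. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* The helper in question: the errno for an error pointer, otherwise 0. */
static inline int ptr_err_or_zero(const void *ptr)
{
	return is_err(ptr) ? (int)(intptr_t)ptr : 0;
}
```

The open-coded form it replaces is typically an `if (is_err(p)) return ...; return 0;` pair, which the helper collapses into a single return statement.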
2018-07-31  t10-pi: provide empty t10_pi_complete() for !CONFIG_BLK_DEV_INTEGRITY  Jens Axboe  1  -0/+11
Fixes a link failure when BLK_DEV_INTEGRITY isn't defined. Fixes: 10c41ddd6132 ("block: move dif_prepare/dif_complete functions to block layer") Signed-off-by: Jens Axboe <[email protected]>
2018-07-30  block: blk_init_allocated_queue() set q->fq as NULL in the fail case  xiao jin  1  -0/+1
We found a memory use-after-free issue in __blk_drain_queue() on kernel 4.14. After reading the latest kernel, 4.18-rc6, we think it has the same problem. Memory is allocated for q->fq in blk_init_allocated_queue(). If the elevator init function returns an error, it runs into the fail case and frees q->fq. __blk_drain_queue() then uses the same memory after q->fq has been freed, which leads to unpredictable behavior. The patch sets q->fq to NULL in the fail case of blk_init_allocated_queue(). Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery") Cc: <[email protected]> Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: xiao jin <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
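The pattern behind this fix, clearing a pointer right after freeing it on the error path, can be shown with a small userspace sketch. All names here (`struct queue`, `init_queue`, `drain_queue`) are hypothetical stand-ins, and the NULL check in the drain function is an assumption about how a later user would guard the access.

```c
#include <stdlib.h>

struct flush_queue { int pending; };
struct queue { struct flush_queue *fq; };

/* Returns 0 on success, -1 on failure. */
static int init_queue(struct queue *q, int elevator_init_fails)
{
	q->fq = calloc(1, sizeof(*q->fq));
	if (!q->fq)
		return -1;

	if (elevator_init_fails) {
		free(q->fq);
		q->fq = NULL; /* the fix: never leave a dangling pointer behind */
		return -1;
	}
	return 0;
}

/* A later user can only test for NULL; it cannot detect a dangling pointer. */
static int drain_queue(const struct queue *q)
{
	return q->fq ? q->fq->pending : 0;
}
```

Without the assignment to NULL, `drain_queue()` would dereference freed memory after a failed `init_queue()`.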
2018-07-30  nvme: use blk API to remap ref tags for IOs with metadata  Max Gurtovoy  3  -82/+20
Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support). Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Acked-by: Keith Busch <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-30  block: move dif_prepare/dif_complete functions to block layer  Max Gurtovoy  5  -125/+118
Currently these functions are implemented in the scsi layer, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Also, use the tuple size from the integrity profile since it may vary between integrity types. Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-30  block: move ref_tag calculation func to the block layer  Max Gurtovoy  6  -12/+16
Currently this function is implemented in the scsi layer, but its actual place should be the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Suggested-by: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-30  block: don't account for split bio's size in cgroup stats  Josef Bacik  1  -2/+8
We need to check in blkcg_bio_issue_check if the bio is flagged as QUEUE_ENTERED, because if it is then we've already accounted for the size of the IO in the cgroup stats. We can still however account for the extra IO since it'll be another request. Reported-by: Tejun Heo <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-28  pktcdvd: Fix possible Spectre-v1 for pkt_devs  Jinbum Park  1  -1/+3
User controls @dev_minor, which is used as an index into pkt_devs, so it can be exploited via a Spectre-like attack (speculative execution). This kind of attack leaks the address of pkt_devs [1], which lets an attacker bypass security mechanisms such as KASLR. So sanitize @dev_minor before using it to prevent the attack. [1] https://github.com/jinb-park/linux-exploit/tree/master/exploit-remaining-spectre-gadget/leak_pkt_devs.c Signed-off-by: Jinbum Park <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
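Index sanitization of this kind is usually done with a branch-free mask, so that even a mispredicted bounds check cannot yield an out-of-range index. The sketch below is a userspace illustration with hypothetical names (`index_mask_nospec`, `index_nospec`); it assumes `size` fits in a signed long and that right-shifting a negative signed value is an arithmetic shift, which holds for GCC and Clang but is implementation-defined in ISO C.

```c
#include <limits.h>

/*
 * Returns all ones when index < size and zero otherwise, without a
 * conditional branch that the CPU could speculate past.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	/* For index >= size, (size - 1 - index) wraps and sets the top bit. */
	return ~(long)(index | (size - 1UL - index)) >>
	       (sizeof(long) * CHAR_BIT - 1);
}

/* Clamp an untrusted index: in-range values pass through, others become 0. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}
```

The usual order is a normal bounds check first, then the masked index for the actual array access, so an out-of-range value can at worst read element 0.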
2018-07-27  nvmet: use Retain Async Event bit to clear AEN  Chaitanya Kulkarni  1  -2/+15
In the current implementation, we clear the AEN bit when we get the "get log page" command if the given log page is associated with an AEN. This patch allows optionally retaining the AEN for the ctrl under consideration when the Retain Asynchronous Event (RAE) bit is set as a part of the "get log page" command. This allows the host to read the log page and optionally retain the AEN associated with this log page when using userspace tools like nvme-cli. Signed-off-by: Chaitanya Kulkarni <[email protected]> [hch: also use the new helper in the just merged ANA code] Signed-off-by: Christoph Hellwig <[email protected]>
2018-07-27  nvmet: support configuring ANA groups  Christoph Hellwig  4  -4/+236
Allow creating non-default ANA groups (group ID > 1). Groups are created either by assigning the group ID to a namespace, or by creating a configfs group object under a specific port. All namespaces assigned to a group that doesn't have a configfs object for a given port are marked as inaccessible. Allow changing the ANA state on a per-port basis by creating an ana_groups directory under each port, and another directory with an ana_state file in it. The default ANA group 1 directory is created automatically for each port. For all changes in ANA configuration the ANA change AEN is sent. We only keep a global changecount instead of additional per-group changecounts to keep the implementation as simple as possible. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvmet: add minimal ANA support  Christoph Hellwig  4  -4/+142
Add support for Asymmetric Namespace Access as specified in NVMe 1.3 TP 4004. Just add a default ANA group 1 that is optimized on all ports. This is (and will remain) the default assignment for any namespace not explicitly assigned to another ANA group. The ANA state can be manually changed through the configfs interface, including the change state. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvmet: track and limit the number of namespaces per subsystem  Christoph Hellwig  3  -1/+16
TP 4004 introduces a new 'Maximum Number of Allocated Namespaces' field in the Identify controller data to help the host size resources. Put an upper limit on the supported namespaces to be able to support this value as supporting 32-bits worth of namespaces would lead to very large buffers. The limit is completely arbitrary at this point. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvmet: keep a port pointer in nvmet_ctrl  Christoph Hellwig  2  -0/+4
This will be needed for the ANA AEN code. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvme: add ANA support  Christoph Hellwig  3  -27/+408
Add support for Asymmetric Namespace Access as specified in NVMe 1.3 TP 4004. With ANA each namespace attached to a controller belongs to an ANA group that describes the characteristics of accessing the namespaces through this controller. In the optimized and non-optimized states namespaces can be accessed regularly, although in a multi-pathing environment we should always prefer to access a namespace through a controller where an optimized relationship exists. Namespaces in Inaccessible, Persistent Loss or Change state for a given controller should not be accessed. The states are updated through reading the ANA log page, which is read once during controller initialization, whenever the ANA change notice AEN is received, or when one of the ANA specific status codes that signal a state change is received on a command. The ANA state is kept in the nvme_ns structure, which makes the checks in the fast path very simple. Updating the ANA state when reading the log page is also very simple, the only downside is that finding the initial ANA state when scanning for namespaces is a bit cumbersome. The gendisk for a ns_head is only registered once a live path for it exists. Without that the kernel would hang during partition scanning. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
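The path preference rule in this message can be sketched as a small selection function. This is an illustration only, not the driver's multipath code: `pick_path` and the enum names are hypothetical, and the numeric state values follow my reading of the ANA state encoding, which should be checked against the NVMe specification before being relied on.

```c
#include <stddef.h>

/* ANA states; numeric values are an assumption, see the note above. */
enum ana_state {
	ANA_OPTIMIZED       = 0x01,
	ANA_NONOPTIMIZED    = 0x02,
	ANA_INACCESSIBLE    = 0x03,
	ANA_PERSISTENT_LOSS = 0x04,
	ANA_CHANGE          = 0x0f,
};

/*
 * Pick a path for a namespace: prefer an optimized path, fall back to the
 * first non-optimized one, and never use the other states.
 * Returns the index of the chosen path, or -1 if no path is usable.
 */
static int pick_path(const enum ana_state *paths, size_t n)
{
	int fallback = -1;

	for (size_t i = 0; i < n; i++) {
		if (paths[i] == ANA_OPTIMIZED)
			return (int)i;
		if (paths[i] == ANA_NONOPTIMIZED && fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}
```

Keeping the state per namespace and per controller, as the message describes, makes this check a simple comparison in the I/O fast path.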
2018-07-27  nvme: remove nvme_req_needs_failover  Christoph Hellwig  3  -14/+2
Now that we just call out to blk_path_error there isn't really any good reason to not merge it into the only caller. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvme: simplify the API for getting log pages  Christoph Hellwig  3  -25/+16
Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes a plain nsid instead of the nvme_ns pointer. Also add support for the log specific field while we're at it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvme.h: add ANA definitions  Christoph Hellwig  1  -3/+47
Add various definitions from NVMe 1.3 TP 4004. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  nvme.h: add support for the log specific field  Christoph Hellwig  1  -1/+1
NVMe 1.3 added a new log specific field to the get log page CQ definition, add it to our get_log_page SQ structure. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2018-07-27  partitions/aix: append null character to print data from disk  Mauricio Faria de Oliveira  1  -2/+6
Even if properly initialized, the lvname array (i.e., strings) is read from disk, and might contain corrupt data (e.g., lack the null terminating character for strings). So, make sure the partition name string used in pr_warn() has the null terminating character. Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files") Suggested-by: Daniel J. Axtens <[email protected]> Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
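A common way to make a fixed-width on-disk string safe to print is to copy it into a buffer one byte larger and terminate it explicitly. The sketch below assumes a hypothetical field width `LVNAME_LEN` and a hypothetical helper `printable_name`; it is not the code from the patch.

```c
#include <string.h>

#define LVNAME_LEN 64 /* hypothetical width of the on-disk name field */

/*
 * Copy a fixed-width name that may lack a terminator into a buffer with
 * room for one extra byte, so that "%s" style printing cannot run past it.
 */
static void printable_name(char dst[LVNAME_LEN + 1], const char src[LVNAME_LEN])
{
	memcpy(dst, src, LVNAME_LEN);
	dst[LVNAME_LEN] = '\0';
}
```

A well-formed name keeps its own earlier terminator, so the extra byte only matters for corrupt data that fills the whole field.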
2018-07-27  partitions/aix: fix usage of uninitialized lv_info and lvname structures  Mauricio Faria de Oliveira  1  -2/+3
The if-block that sets a successful return value in aix_partition() uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized. For example, if 'numlvs' is zero or alloc_lvn() fails, neither is initialized, but are used anyway if alloc_pvd() succeeds after it. So, make the alloc_pvd() call conditional on their initialization. This has been hit when attaching an apparently corrupted/stressed AIX LUN, misleading the kernel to pr_warn() invalid data and hang. [...] partition (null) (11 pp's found) is not contiguous [...] partition (null) (2 pp's found) is not contiguous [...] partition (null) (3 pp's found) is not contiguous [...] partition (null) (64 pp's found) is not contiguous Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files") Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: stop using the deprecated get_seconds()  Arnd Bergmann  2  -8/+8
The get_seconds function is deprecated now since it returns a 32-bit value that will eventually overflow, and we are replacing it throughout the kernel with ktime_get_seconds() or ktime_get_real_seconds() that return a time64_t. bcache uses get_seconds() to read the current system time and store it in the superblock as well as in uuid_entry structures that are user visible. Unfortunately, the two structures are still limited to 32 bits, so this won't fix any real problems but will still overflow in year 2106. Let's at least document that properly, so that it can be fixed in case we get an updated format in the future. We still have a long time before the overflow and checking the tools at https://github.com/koverstreet/bcache-tools reveals no access to any of them. Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
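The year 2106 figure follows from storing seconds since 1970 in an unsigned 32-bit field. A short sketch of the arithmetic and of the truncation, with hypothetical helper names and an average Gregorian year of 365.2425 days:

```c
#include <stdint.h>

#define SECONDS_PER_YEAR 31556952LL /* 365.2425 days of 86400 seconds */

/* Storing a 64-bit timestamp in a 32-bit field keeps only the low 32 bits. */
static uint32_t store_seconds(int64_t now)
{
	return (uint32_t)now;
}

/* Calendar year in which an unsigned 32-bit seconds counter wraps. */
static int overflow_year(void)
{
	return 1970 + (int)(UINT32_MAX / SECONDS_PER_YEAR);
}
```

2^32 seconds is a little over 136 years, which puts the wrap-around in 2106; a timestamp just past that point would read back as a value near zero.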
2018-07-27  bcache: do not assign in if condition in bcache_device_init()  Florian Schmaus  1  -5/+11
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: do not assign in if condition in bcache_init()  Florian Schmaus  1  -3/+9
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: free heap cache_set->flush_btree in bch_journal_free  Shenghui Wang  1  -0/+1
Free the cache_set->flush_btree heap memory on journal free. Signed-off-by: Wang Sheng-Hui <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: do not assign in if condition register_bcache()  Florian Schmaus  1  -2/+6
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: fix I/O significant decline while backend devices registering  Tang Junhui  1  -3/+26
I attached several backend devices to the same cache set and produced lots of dirty data by running small random I/O writes for a long time. Then I continued running I/O on the other cached devices and stopped one cached device. After a while, I registered the stopped device again and saw the running I/O on the other cached devices drop significantly, sometimes even to zero. In the current code, bcache traverses every key and btree node to count the dirty data under the read lock, so the writer threads cannot get the btree write lock. When there are a lot of keys and btree nodes in the registering device, this lasts several seconds, so the write I/Os on the other cached devices are blocked and decline significantly. With this patch, when a device registers to a cache set in which other cached devices have running I/Os, we get the amount of dirty data of the device incrementally, and do not block the other cached devices all the time. Patch v2: Rename some variables and macros as Coly suggested. Signed-off-by: Tang Junhui <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: calculate the number of incremental GC nodes according to the total of btree nodes  Tang Junhui  1  -2/+35
This patch is based on "[PATCH] bcache: finish incremental GC". Since incremental GC stops for 100ms when front side I/O comes in, when there are many btree nodes and GC only processes a constant number (100) of nodes each time, GC would last a long time, the front side I/Os would run out of buckets (since no new bucket can be allocated during GC), and the I/Os would be blocked again. So GC should not process a constant number of nodes, but a number that varies with the total number of btree nodes. In this patch, GC is divided into a constant number (100) of rounds, so when there are many btree nodes, GC processes more nodes each round; otherwise GC processes fewer nodes each round (but no fewer than MIN_GC_NODES). Signed-off-by: Tang Junhui <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
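The sizing rule for each GC round can be sketched as a one-line calculation. `MIN_GC_NODES` is named in the message; the macro name `MAX_GC_TIMES`, the value 100 for both constants, and the function name `gc_nodes_per_round` are assumptions made for this illustration.

```c
#include <stddef.h>

#define MAX_GC_TIMES 100 /* assumed name: number of rounds GC is divided into */
#define MIN_GC_NODES 100 /* lower bound on nodes handled per round (assumed value) */

/* Nodes to process before GC pauses to let front side I/O through. */
static size_t gc_nodes_per_round(size_t total_nodes)
{
	size_t n = total_nodes / MAX_GC_TIMES;

	return n < MIN_GC_NODES ? MIN_GC_NODES : n;
}
```

With a million btree nodes each round handles 10000 of them, so the number of 100 ms pauses stays bounded instead of growing with the tree.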
2018-07-27  bcache: finish incremental GC  Tang Junhui  3  -1/+21
In the GC thread, we record the latest GC key in gc_done, which is expected to be used for incremental GC, but the current code does not implement it. When GC runs, front side I/O is blocked until GC is over, which takes a long time if there are a lot of btree nodes. This patch implements incremental GC. The main idea is that, when there are front side I/Os, after GC has processed some nodes (100), we stop GC, release the lock of the btree node, and process the front side I/Os for some time (100 ms), then go back to GC again. With this patch, I/Os are not blocked all the time while GC runs, and there is no obvious I/O drop to zero any more. Patch v2: Rename some variables and macros as Coly suggested. Signed-off-by: Tang Junhui <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  bcache: simplify the calculation of the total amount of flash dirty data  Tang Junhui  4  -20/+7
Currently we calculate the total amount of dirty data of flash only devices by adding up the dirty data of each flash only device under the registering lock. It is very inefficient. In this patch, we add a member flash_dev_dirty_sectors in struct cache_set to record the total amount of dirty data of flash only devices in real time, so we don't need to calculate the total amount of dirty data any more. Signed-off-by: Tang Junhui <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-27  readahead: stricter check for bdi io_pages  Markus Stockhausen  1  -2/+10
ondemand_readahead() checks bdi->io_pages to cap the maximum pages that need to be processed. This works until the readit section. If we would do an async only readahead (async size = sync size) and target is at beginning of window we expand the pages by another get_next_ra_size() pages. Btrace for large reads shows that kernel always issues a doubled size read at the beginning of processing. Add an additional check for io_pages in the lower part of the func. The fix helps devices that hard limit bio pages and rely on proper handling of max_hw_read_sectors (e.g. older FusionIO cards). For that reason it could qualify for stable. Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting") Cc: [email protected] Signed-off-by: Markus Stockhausen <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-07-26  scsi: virtio_scsi: fix pi_bytes{out,in} on 4 KiB block size devices  Greg Edwards  1  -4/+4
When the underlying device is a 4 KiB logical block size device with a protection interval exponent of 0, i.e. 4096 bytes data + 8 bytes PI, the driver miscalculates the pi_bytes{out,in} by a factor of 8x (64 bytes). This leads to errors on all reads and writes on 4 KiB logical block size devices when CONFIG_BLK_DEV_INTEGRITY is enabled and the VIRTIO_SCSI_F_T10_PI feature bit has been negotiated. Fixes: e6dc783a38ec0 ("virtio-scsi: Enable DIF/DIX modes in SCSI host LLD") Acked-by: Martin K. Petersen <[email protected]> Signed-off-by: Greg Edwards <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
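The factor of 8 comes from counting 512-byte sectors where protection intervals should be counted. The sketch below contrasts the two calculations; the function names and the `interval_shift` parameter (log2 of the protection interval size in bytes) are hypothetical and chosen for this illustration, not taken from the driver.

```c
/* Miscalculation: one PI tuple per 512-byte sector, whatever the block size. */
static unsigned int pi_bytes_per_sector(unsigned int sectors,
					unsigned int tuple_size)
{
	return sectors * tuple_size;
}

/*
 * Corrected idea: one PI tuple per protection interval.
 * 'sectors' counts 512-byte units (shift of 9); 'interval_shift' is 12
 * for a 4096-byte interval and 9 for a 512-byte one.
 */
static unsigned int pi_bytes_per_interval(unsigned int sectors,
					  unsigned int interval_shift,
					  unsigned int tuple_size)
{
	return (sectors >> (interval_shift - 9)) * tuple_size;
}
```

For one 4 KiB block (8 sectors) with 8-byte tuples, the first form yields 64 bytes and the second yields 8 bytes, matching the 8x error described above; on 512-byte devices the two agree.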