aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2017-01-23blk-throttle: Adjust two function calls together with a variable assignmentMarkus Elfring1-2/+4
The script "checkpatch.pl" pointed information out like the following. ERROR: do not use assignment in if condition Thus fix the affected source code places. Signed-off-by: Markus Elfring <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-23block: Initialize cfqq->ioprio_class in cfq_get_queue()Alexander Potapenko1-0/+2
KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of uninitialized memory in cfq_init_cfqq(): ================================================================== BUG: KMSAN: use of unitialized memory ... Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [<ffffffff8202ac97>] dump_stack+0x157/0x1d0 lib/dump_stack.c:51 [<ffffffff813e9b65>] kmsan_report+0x205/0x360 ??:? [<ffffffff813eabbb>] __msan_warning+0x5b/0xb0 ??:? [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754 [<ffffffff8201e110>] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857 ... origin: [<ffffffff8103ab37>] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67 [<ffffffff813e836b>] kmsan_internal_poison_shadow+0xab/0x150 ??:? [<ffffffff813e88ab>] kmsan_poison_slab+0xbb/0x120 ??:? [< inline >] allocate_slab mm/slub.c:1627 [<ffffffff813e533f>] new_slab+0x3af/0x4b0 mm/slub.c:1641 [< inline >] new_slab_objects mm/slub.c:2407 [<ffffffff813e0ef3>] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564 [< inline >] __slab_alloc mm/slub.c:2606 [< inline >] slab_alloc_node mm/slub.c:2669 [<ffffffff813dfb42>] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746 [<ffffffff8201d90d>] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850 ... ================================================================== (the line numbers are relative to 4.8-rc6, but the bug persists upstream) The uninitialized struct cfq_queue is created by kmem_cache_alloc_node() and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class before it's initialized. Signed-off-by: Alexander Potapenko <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-20blk-mq: allow resize of scheduler requestsJens Axboe3-13/+61
Add support for growing the tags associated with a hardware queue, for the scheduler tags. Currently we only support resizing within the limits of the original depth, change that so we can grow it as well by allocating and replacing the existing scheduler tag set. This is similar to how we could increase the software queue depth with the legacy IO stack and schedulers. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-19blk-mq: stop hardware queue in blk_mq_delay_queue()Jens Axboe1-0/+1
The run handler we register for the delayed work requires that the queue be stopped, yet we leave that up to the caller. Let's move it into blk_mq_delay_queue() itself, so that the API is sane. This fixes a stall with SCSI, where it calls blk_mq_delay_queue() without having stopped the queue. Hence the queue is never run. Reported-by: Hannes Reinecke <[email protected]> Fixes: 70f4db639c5b ("blk-mq: add blk_mq_delay_queue") Signed-off-by: Jens Axboe <[email protected]>
2017-01-19blk-mq-tag: remove redundant check for 'data->hctx' being non-NULLJens Axboe1-4/+2
We used to pass in NULL for hctx for reserved tags, but we don't do that anymore. Hence the check for whether hctx is NULL or not is now redundant, kill it. Reported-by: Dan Carpenter <[email protected]> Fixes: a642a158aec6 ("blk-mq-tag: cleanup the normal/reserved tag allocation") Signed-off-by: Jens Axboe <[email protected]>
2017-01-19elevator: fix unnecessary put of elevator in failure caseJens Axboe1-4/+0
We already checked that e is NULL, so no point in calling elevator_put() to free it. Reported-by: Dan Carpenter <[email protected]> Fixes: dc877dbd088f ("blk-mq-sched: add framework for MQ capable IO schedulers") Signed-off-by: Jens Axboe <[email protected]>
2017-01-18blk-cgroup: don't quiesce the queue on policy activate/deactivateJens Axboe1-12/+8
There's no potential harm in quiescing the queue, but it also doesn't buy us anything. And we can't run the queue async for policy deactivate, since we could be in the path of tearing the queue down. If we schedule an async run of the queue at that time, we're racing with queue teardown AFTER having we've already torn most of it down. Reported-by: Omar Sandoval <[email protected]> Fixes: 4d199c6f1c84 ("blk-cgroup: ensure that we clear the stop bit on quiesced queues") Tested-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-18blk-mq: Remove unused variableKeith Busch1-1/+0
Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-18blk-cgroup: ensure that we clear the stop bit on quiesced queuesJens Axboe1-4/+6
If we call blk_mq_quiesce_queue() on a queue, we must remember to pair that with something that clears the stopped by on the queues later on. Signed-off-by: Jens Axboe <[email protected]>
2017-01-17blk-mq-sched: allow setting of default IO schedulerJens Axboe5-7/+87
Add Kconfig entries to manage what devices get assigned an MQ scheduler, and add a blk-mq flag for drivers to opt out of scheduling. The latter is useful for admin type queues that still allocate a blk-mq queue and tag set, but aren't use for normal IO. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17mq-deadline: add blk-mq adaptation of the deadline IO schedulerJens Axboe3-0/+560
This is basically identical to deadline-iosched, except it registers as a MQ capable scheduler. This is still a single queue design. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq-sched: add framework for MQ capable IO schedulersJens Axboe14-192/+945
This adds a set of hooks that intercepts the blk-mq path of allocating/inserting/issuing/completing requests, allowing us to develop a scheduler within that framework. We reuse the existing elevator scheduler API on the registration side, but augment that with the scheduler flagging support for the blk-mq interfce, and with a separate set of ops hooks for MQ devices. We split driver and scheduler tags, so we can run the scheduling independently of device queue depth. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq: split tag ->rqs[] into twoJens Axboe3-9/+26
This is in preparation for having two sets of tags available. For that we need a static index, and a dynamically assignable one. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq: add support for carrying internal tag information in blk_qc_tJens Axboe1-3/+8
No functional change in this patch, just in preparation for having two types of tags available to the block layer for a single request. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq: abstract out helpers for allocating/freeing tag mapsJens Axboe2-48/+83
Prep patch for adding an extra tag map for scheduler requests. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq-tag: cleanup the normal/reserved tag allocationJens Axboe4-61/+44
This is in preparation for having another tag set available. Cleanup the parameters, and allow passing in of tags for blk_mq_put_tag(). Signed-off-by: Jens Axboe <[email protected]> [hch: even more cleanups] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq: export some helpers we need to the scheduling frameworkJens Axboe2-18/+46
Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17blk-mq: un-export blk_mq_free_hctx_request()Jens Axboe1-3/+2
It's only used in blk-mq, kill it from the main exported header and kill the symbol export as well. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17block: move rq_ioc() to blk.hJens Axboe2-16/+16
We want to use it outside of blk-core.c. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17block: move existing elevator ops to unionJens Axboe7-44/+44
Prep patch for adding MQ ops as well, since doing anon unions with named initializers doesn't work on older compilers. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Omar Sandoval <[email protected]>
2017-01-17partitions/efi: Fix integer overflow in GPT size calculationAlden Tondettar1-5/+12
If a GUID Partition Table claims to have more than 2**25 entries, the calculation of the partition table size in alloc_read_gpt_entries() will overflow a 32-bit integer and not enough space will be allocated for the table. Nothing seems to get written out of bounds, but later efi_partition() will read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing information to /proc/partitions and uevents. The problem exists on both 64-bit and 32-bit platforms. Fix the overflow and also print a meaningful debug message if the table size is too large. Signed-off-by: Alden Tondettar <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-13block: don't try to discard from __blkdev_issue_zerooutChristoph Hellwig1-7/+6
Discard can return -EIO asynchronously if the alignment for the request isn't suitable for the driver, which makes a proper fallback to other methods in __blkdev_issue_zeroout impossible. Thus only issue a sync discard from blkdev_issue_zeroout an don't try discard at all from __blkdev_issue_zeroout as a non-invasive workaround. One more reason why abusing discard for zeroing must die.. Signed-off-by: Christoph Hellwig <[email protected]> Reported-by: Eryu Guan <[email protected]> Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout") Signed-off-by: Jens Axboe <[email protected]>
2017-01-12block: Rename blk_queue_zone_size and bdev_zone_sizeDamien Le Moal2-9/+9
All block device data fields and functions returning a number of 512B sectors are by convention named xxx_sectors while names in the form xxx_size are generally used for a number of bytes. The blk_queue_zone_size and bdev_zone_size functions were not following this convention so rename them. No functional change is introduced by this patch. Signed-off-by: Damien Le Moal <[email protected]> Collapsed the two patches, they were nonsensically split and broke bisection. Signed-off-by: Jens Axboe <[email protected]>
2017-01-11blk-mq: make mq_ops a const pointerJens Axboe1-1/+1
We never change it, make that clear. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Bart Van Assche <[email protected]>
2017-01-04Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-6/+7
Pull block layer fixes from Jens Axboe: "A set of fixes for the current series, one fixing a regression with block size < page cache size in the alias series from Jan. Outside of that, two small cleanups for wbt from Bart, a nvme pull request from Christoph, and a few small fixes of documentation updates" * 'for-linus' of git://git.kernel.dk/linux-block: block: fix up io_poll documentation block: Avoid that sparse complains about context imbalance in __wbt_wait() block: Make wbt_wait() definition consistent with declaration clean_bdev_aliases: Prevent cleaning blocks that are not in block range genhd: remove dead and duplicated scsi code block: add back plugging in __blkdev_direct_IO nvmet/fcloop: remove some logically dead code performing redundant ret checks nvmet: fix KATO offset in Set Features nvme/fc: simplify error handling of nvme_fc_create_hw_io_queues nvme/fc: correct some printk information nvme/scsi: Remove START STOP emulation nvme/pci: Delete misleading queue-wrap comment nvme/pci: Fix whitespace problem nvme: simplify stripe quirk nvme: update maintainers information
2017-01-02block: Avoid that sparse complains about context imbalance in __wbt_wait()Bart Van Assche1-5/+6
This patch does not change any functionality. Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2017-01-02block: Make wbt_wait() definition consistent with declarationBart Van Assche1-1/+1
Fixes: e34cbd307477 ("blk-wbt: add general throttling mechanism") Signed-off-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-25ktime: Cleanup ktime_set() usageThomas Gleixner1-1/+1
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds3-3/+3
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-23Merge branch 'for-linus' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull final vfs updates from Al Viro: "Assorted cleanups and fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: sg_write()/bsg_write() is not fit to be called under KERNEL_DS ufs: fix function declaration for ufs_truncate_blocks fs: exec: apply CLOEXEC before changing dumpable task flags seq_file: reset iterator to first record for zero offset vfs: fix isize/pos/len checks for reflink & dedupe [iov_iter] fix iterate_all_kinds() on empty iterators move aio compat to fs/aio.c reorganize do_make_slave() clone_private_mount() doesn't need to touch namespace_sem remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id()
2016-12-22sg_write()/bsg_write() is not fit to be called under KERNEL_DSAl Viro1-0/+3
Both damn things interpret userland pointers embedded into the payload; worse, they are actually traversing those. Leaving aside the bad API design, this is very much _not_ safe to call with KERNEL_DS. Bail out early if that happens. Cc: [email protected] Signed-off-by: Al Viro <[email protected]>
2016-12-19block: check partition alignmentStefan Haberland1-0/+3
Partitions that are not aligned to the blocksize of a device may cause invalid I/O requests because the blocklayer cares only about alignment within the partition when building requests on partitions. device |--------4096--------|--------4096--------|--------4096--------| partition offset 512byte |-512-|--------4096--------|--------4096--------|--------4096--------| When reading/writing one 4k block of the partition this maps to reading/writing with an offset of 512 byte of the device leading to unaligned requests for the device which in turn may cause unexpected behavior of the device driver. For DASD devices we have to translate the block number into a cylinder, head, record format. The unaligned requests lead to wrong calculation and therefore to misdirected I/O. In a "good" case this leads to I/O errors because the underlying hardware detects the wrong addressing. In a worst case scenario this might destroy data on the device. To prevent partitions that are not aligned to the physical blocksize of a device check for the alignment in the blkpg_ioctl. Signed-off-by: Stefan Haberland <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-19block: allow WRITE_SAME commands with the SG_IO ioctlMauricio Faria de Oliveira1-0/+3
The WRITE_SAME commands are not present in the blk_default_cmd_filter write_ok list, and thus are failed with -EPERM when the SG_IO ioctl() is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users). [ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ] The problem can be reproduced with the sg_write_same command # sg_write_same --num 1 --xferlen 512 /dev/sda # # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' Write same: pass through os error: Operation not permitted # For comparison, the WRITE_VERIFY command does not observe this problem, since it is in that list: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda' # So, this patch adds the WRITE_SAME commands to the list, in order for the SG_IO ioctl to finish successfully: # capsh --drop=cap_sys_rawio -- -c \ 'sg_write_same --num 1 --xferlen 512 /dev/sda' # That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2]), which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu). In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls, which are translated to write-same calls in the guest kernel, and then into SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest: [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current] [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00 [...] blk_update_request: I/O error, dev sda, sector 17096824 Links: [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52 [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device') Signed-off-by: Mauricio Faria de Oliveira <[email protected]> Signed-off-by: Brahadambal Srinivasan <[email protected]> Reported-by: Manjunatha H R <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-14Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-8/+24
Pull block IO fixes from Jens Axboe: "A few fixes that I collected as post-merge. I was going to wait a bit with sending this out, but the O_DIRECT fix should really go in sooner rather than later" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: Fix failed allocation path when mapping queues blk-mq: Avoid memory reclaim when remapping queues block_dev: don't update file access position for sync direct IO nvme/pci: Log PCI_STATUS when the controller dies block_dev: don't test bdev->bd_contains when it is not stable
2016-12-14blk-mq: Fix failed allocation path when mapping queuesGabriel Krisman Bertazi1-5/+21
In blk_mq_map_swqueue, there is a memory optimization that frees the tags of a queue that has gone unmapped. Later, if that hctx is remapped after another topology change, the tags need to be reallocated. If this allocation fails, a simple WARN_ON triggers, but the block layer ends up with an active hctx without any corresponding set of tags. Then, any income IO to that hctx can trigger an Oops. I can reproduce it consistently by running IO, flipping CPUs on and off and eventually injecting a memory allocation failure in that path. In the fix below, if the system experiences a failed allocation of any hctx's tags, we remap all the ctxs of that queue to the hctx_0, which should always keep it's tags. There is a minor performance hit, since our mapping just got worse after the error path, but this is the simplest solution to handle this error path. The performance hit will disappear after another successful remap. I considered dropping the memory optimization all together, but it seemed a bad trade-off to handle this very specific error case. This should apply cleanly on top of Jens' for-next branch. The Oops is the one below: SP (3fff935ce4d0) is in userspace 1:mon> e cpu 0x1: Vector: 300 (Data Access) at [c000000fe99eb110] pc: c0000000005e868c: __sbitmap_queue_get+0x2c/0x180 lr: c000000000575328: __bt_get+0x48/0xd0 sp: c000000fe99eb390 msr: 900000010280b033 dar: 28 dsisr: 40000000 current = 0xc000000fe9966800 paca = 0xc000000007e80300 softe: 0 irq_happened: 0x01 pid = 11035, comm = aio-stress Linux version 4.8.0-rc6+ (root@bean) (gcc version 5.4.0 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.2) ) #3 SMP Mon Oct 10 20:16:53 CDT 2016 1:mon> s [c000000fe99eb3d0] c000000000575328 __bt_get+0x48/0xd0 [c000000fe99eb400] c000000000575838 bt_get.isra.1+0x78/0x2d0 [c000000fe99eb480] c000000000575cb4 blk_mq_get_tag+0x44/0x100 [c000000fe99eb4b0] c00000000056f6f4 __blk_mq_alloc_request+0x44/0x220 [c000000fe99eb500] c000000000570050 blk_mq_map_request+0x100/0x1f0 [c000000fe99eb580] c000000000574650 blk_mq_make_request+0xf0/0x540 [c000000fe99eb640] c000000000561c44 generic_make_request+0x144/0x230 [c000000fe99eb690] c000000000561e00 submit_bio+0xd0/0x200 [c000000fe99eb740] c0000000003ef740 ext4_io_submit+0x90/0xb0 [c000000fe99eb770] c0000000003e95d8 ext4_writepages+0x588/0xdd0 [c000000fe99eb910] c00000000025a9f0 do_writepages+0x60/0xc0 [c000000fe99eb940] c000000000246c88 __filemap_fdatawrite_range+0xf8/0x180 [c000000fe99eb9e0] c000000000246f90 filemap_write_and_wait_range+0x70/0xf0 [c000000fe99eba20] c0000000003dd844 ext4_sync_file+0x214/0x540 [c000000fe99eba80] c000000000364718 vfs_fsync_range+0x78/0x130 [c000000fe99ebad0] c0000000003dd46c ext4_file_write_iter+0x35c/0x430 [c000000fe99ebb90] c00000000038c280 aio_run_iocb+0x3b0/0x450 [c000000fe99ebce0] c00000000038dc28 do_io_submit+0x368/0x730 [c000000fe99ebe30] c000000000009404 system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Cc: Brian King <[email protected]> Cc: Douglas Miller <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Douglas Miller <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-14Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds3-4/+21
Pull SCSI updates from James Bottomley: "This update includes the usual round of major driver updates (ncr5380, lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas). There's also an assortment of minor fixes, mostly in error legs or other not very user visible stuff. The major change is the pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this effectively makes IRQ mapping generic for the drivers and allows blk_mq to use the information" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits) scsi: qla4xxx: switch to pci_alloc_irq_vectors scsi: hisi_sas: support deferred probe for v2 hw scsi: megaraid_sas: switch to pci_alloc_irq_vectors scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices scsi: be2iscsi: set errno on error path scsi: be2iscsi: set errno on error path scsi: hpsa: fallback to use legacy REPORT PHYS command scsi: scsi_dh_alua: Fix RCU annotations scsi: hpsa: use %phN for short hex dumps scsi: hisi_sas: fix free'ing in probe and remove scsi: isci: switch to pci_alloc_irq_vectors scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI scsi: dpt_i2o: double free on error path scsi: cxlflash: Migrate scsi command pointer to AFU command scsi: cxlflash: Migrate IOARRIN specific routines to function pointers scsi: cxlflash: Cleanup queuecommand() scsi: cxlflash: Cleanup send_tmf() scsi: cxlflash: Remove AFU command lock scsi: cxlflash: Wait for active AFU commands to timeout upon tear down scsi: cxlflash: Remove private command pool ...
2016-12-14blk-mq: Avoid memory reclaim when remapping queuesGabriel Krisman Bertazi1-3/+3
While stressing memory and IO at the same time we changed SMT settings, we were able to consistently trigger deadlocks in the mm system, which froze the entire machine. I think that under memory stress conditions, the large allocations performed by blk_mq_init_rq_map may trigger a reclaim, which stalls waiting on the block layer remmaping completion, thus deadlocking the system. The trace below was collected after the machine stalled, waiting for the hotplug event completion. The simplest fix for this is to make allocations in this path non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the issue anymore. This should apply on top of Jens's for-next branch cleanly. Changes since v1: - Use GFP_NOIO instead of GFP_NOWAIT. Call Trace: [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable) [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430 [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0 [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0 [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30 [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0 [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0 [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs] [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs] [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs] [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210 [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0 [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0 [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520 [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250 [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0 [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380 [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360 [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0 [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250 [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100 [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0 [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0 [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250 [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120 [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0 [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150 [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0 [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120 [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0 [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0 [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0 [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250 [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0 [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270 [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110 [c000000f0160be30] [c000000000009204] system_call+0x38/0xec Signed-off-by: Gabriel Krisman Bertazi <[email protected]> Cc: Brian King <[email protected]> Cc: Douglas Miller <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2016-12-13Merge branch 'for-4.10' of ↵Linus Torvalds1-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata Pull libata updates from Tejun Heo: - Adam added opt-in ATA command priority support. - There are machines which hide multiple nvme devices behind an ahci BAR. Dan Williams proposed a solution to force-switch the mode but deemed too hackishd. People are gonna discuss the proper way to handle the situation in nvme standard meetings. For now, detect and warn about the situation. - Low level driver specific changes. Christoph Hellwig pipes in about the hidden nvme warning: "I wish that was the case. We've pretty much agreed that we'll want to implement it as a virtual PCIe root bridge, similar to Intels other 'innovation' VMD that we work around that way. But Intel management has apparently decided that they don't want to spend more cycles on this now that Lenovo has an optional BIOS that doesn't force this broken mode anymore, and no one outside of Intel has enough information to implement something like this. So for now I guess this warning is it, until Intel reconsideres and spends resources on fixing up the damage their Chipset people caused" * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata: ahci: warn about remapped NVMe devices ahci-remap.h: add ahci remapping definitions nvme: move NVMe class code to pci_ids.h pata: imx: support controller modes up to PIO4 pata: imx: add support of setting timings for PIO modes pata: imx: set controller PIO mode with .set_piomode callback pata: imx: sort headers out ata: set ncq_prio_enabled iff device has support ata: ATA Command Priority Disabled By Default ata: Enabling ATA Command Priorities block: Add iocontext priority to request ahci: qoriq: added ls1046a platform support
2016-12-13Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds30-488/+2835
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
2016-12-12mm: don't cap request size based on read-ahead settingJens Axboe2-0/+2
We ran into a funky issue, where someone doing 256K buffered reads saw 128K requests at the device level. Turns out it is read-ahead capping the request size, since we use 128K as the default setting. This doesn't make a lot of sense - if someone is issuing 256K reads, they should see 256K reads, regardless of the read-ahead setting, if the underlying device can support a 256K read in a single command. This patch introduces a bdi hint, io_pages. This is the soft max IO size for the lower level, I've hooked it up to the bdev settings here. Read-ahead is modified to issue the maximum of the user request size, and the read-ahead max size, but capped to the max request size on the device side. The latter is done to avoid reading ahead too much, if the application asks for a huge read. With this patch, the kernel behaves like the application expects. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-09blk-stat: fix a few cases of missing batch flushingJens Axboe1-0/+8
Everytime we need to read ->nr_samples, we should have flushed the batch first. The non-mq read path also needs to flush the batch. Signed-off-by: Jens Axboe <[email protected]>
2016-12-09blk-flush: run the queue when inserting blk-mq flushJens Axboe1-1/+1
Currently we pass in to run the queue async, but don't flag the queue to be run. We don't need to run it async here, but we should run it. So fixup the parameters. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2016-12-09elevator: make the rqhash helpers exportedJens Axboe1-4/+4
Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2016-12-09blk-mq: abstract out blk_mq_dispatch_rq_list() helperJens Axboe2-38/+48
Takes a list of requests, and dispatches it. Moves any residual requests to the dispatch list. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2016-12-09blk-mq: add blk_mq_start_stopped_hw_queue()Jens Axboe1-7/+12
We have a variant for all hardware queues, but not one for a single hardware queue. Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]>
2016-12-09block: improve handling of the magic discard payloadChristoph Hellwig4-78/+21
Instead of allocating a single unused biovec for discard requests, send them down without any payload. Instead we allow the driver to add a "special" payload using a biovec embedded into struct request (unioned over other fields never used while in the driver), and overloading the number of segments for this case. This has a couple of advantages: - we don't have to allocate the bio_vec - the amount of special casing for discard requests in the block layer is significantly reduced - using this same scheme for other request types is trivial, which will be important for implementing the new WRITE_ZEROES op on devices where it actually requires a payload (e.g. SCSI) - we can get rid of playing games with the request length, as we'll never touch it and completions will work just fine - it will allow us to support ranged discard operations in the future by merging non-contiguous discard bios into a single request - last but not least it removes a lot of code This patch is the common base for my WIP series for ranges discards and to remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES, so it would be good to get it in quickly. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-09blk-wbt: don't throttle discard or write zeroesChristoph Hellwig1-3/+2
Both of these are metadata only commands that are not issued by the writeback code and not directly relevant to the writeback bandwith. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2016-12-07Don't feed anything but regular iovec's to blk_rq_map_user_iovLinus Torvalds1-0/+4
In theory we could map other things, but there's a reason that function is called "user_iov". Using anything else (like splice can do) just confuses it. Reported-and-tested-by: Johannes Thumshirn <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-05blk-mq: blk_account_io_start() takes a boolJens Axboe1-1/+1
Signed-off-by: Jens Axboe <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]>
2016-12-05block: fix unintended fallthrough in generic_make_request_checks()Nicolai Stange1-0/+1
Since commit e73c23ff736e ("block: add async variant of blkdev_issue_zeroout") messages like the following show up: EXT4-fs (dm-1): Delayed block allocation failed for inode 2368848 at logical offset 0 with max blocks 1 with error 95 EXT4-fs (dm-1): This should not happen!! Data will be lost Due to the following fallthrough introduced with commit 2d253440b5af ("block: Define zoned block device operations"), generic_make_request_checks() would accept a REQ_OP_WRITE_SAME bio only if the block device supports "write same" *and* is a zoned one: switch (bio_op(bio)) { [...] case REQ_OP_WRITE_SAME: if (!bdev_write_same(bio->bi_bdev)) goto not_supported; case REQ_OP_ZONE_REPORT: case REQ_OP_ZONE_RESET: if (!bdev_is_zoned(bio->bi_bdev)) goto not_supported; break; [...] } Thus, although the bio setup as done by __blkdev_issue_write_same() from commit e73c23ff736e ("block: add async variant of blkdev_issue_zeroout") would succeed, its actual submission would not, resulting in the EOPNOTSUPP == 95. Fix this by removing the fallthrough which, due to the lack of an explicit comment, seems to be unintended anyway. Fixes: e73c23ff736e ("block: add async variant of blkdev_issue_zeroout") Fixes: 2d253440b5af ("block: Define zoned block device operations") Signed-off-by: Nicolai Stange <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>