aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-10-15libata/ahci: Fix PCS quirk applicationDan Williams1-1/+3
Commit c312ef176399 "libata/ahci: Drop PCS quirk for Denverton and beyond" got the polarity wrong on the check for which board-ids should have the quirk applied. The board type board_ahci_pcs7 is defined at the end of the list such that "pcs7" boards can be special cased in the future if they need the quirk. All prior Intel board ids "< board_ahci_pcs7" should proceed with applying the quirk. Reported-by: Andreas Friedrich <[email protected]> Reported-by: Stephen Douthit <[email protected]> Fixes: c312ef176399 ("libata/ahci: Drop PCS quirk for Denverton and beyond") Cc: <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-15blk-rq-qos: fix first node deletion of rq_qos_del()Tejun Heo1-8/+5
rq_qos_del() incorrectly assigns the node being deleted to the head if it was the first on the list in the !prev path. Fix it by iterating with ** instead. Signed-off-by: Tejun Heo <[email protected]> Cc: Josef Bacik <[email protected]> Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: [email protected] # v4.19+ Signed-off-by: Jens Axboe <[email protected]>
2019-10-15blkcg: Fix multiple bugs in blkcg_activate_policy()Tejun Heo1-18/+51
blkcg_activate_policy() has the following bugs. * cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") added @blkcg to ->pd_alloc_fn(); however, blkcg_activate_policy() ends up using pd's allocated for the root blkcg for all preallocations, so ->pd_init_fn() for non-root blkcgs can be passed in pd's which are allocated for the root blkcg. For blk-iocost, this means that ->pd_init_fn() can write beyond the end of the allocated object as it determines the length of the flex array at the end based on the blkcg's nesting level. * Each pd is initialized as they get allocated. If alloc fails, the policy will get freed with pd's initialized on it. * After the above partial failure, the partial pds are not freed. This patch fixes all the above issues by * Restructuring blkcg_activate_policy() so that alloc and init passes are separate. Init takes place only after all allocs succeeded and on failure all allocated pds are freed. * Unifying and fixing the cleanup of the remaining pd_prealloc. Signed-off-by: Tejun Heo <[email protected]> Fixes: cf09a8ee19ad ("blkcg: pass @q and @blkcg into blkcg_pol_alloc_pd_fn()") Signed-off-by: Jens Axboe <[email protected]>
2019-10-15io_uring: consider the overflow of sequence for timeout reqyangerkun1-6/+21
Now we recalculate the sequence of timeout with 'req->sequence = ctx->cached_sq_head + count - 1', judge the right place to insert for timeout_list by compare the number of request we still expected for completion. But we have not consider about the situation of overflow: 1. ctx->cached_sq_head + count - 1 may overflow. And a bigger count for the new timeout req can have a small req->sequence. 2. cached_sq_head of now may overflow compare with before req. And it will lead the timeout req with small req->sequence. This overflow will lead to the misorder of timeout_list, which can lead to the wrong order of the completion of timeout_list. Fix it by reuse req->submit.sequence to store the count, and change the logic of inserting sort in io_timeout. Signed-off-by: yangerkun <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-14block: Fix elv_support_iosched()Damien Le Moal1-1/+2
A BIO based request queue does not have a tag_set, which prevent testing for the flag BLK_MQ_F_NO_SCHED indicating that the queue does not require an elevator. This leads to an incorrect initialization of a default elevator in some cases such as BIO based null_blk (queue_mode == BIO) with zoned mode enabled as the default elevator in this case is mq-deadline instead of "none". Fix this by testing for a NULL queue mq_ops field which indicates that the queue is BIO based and should not have an elevator. Reported-by: Shinichiro Kawasaki <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-10io_uring: fix sequence logic for timeout requestsJens Axboe1-17/+20
We have two ways a request can be deferred: 1) It's a regular request that depends on another one 2) It's a timeout that tracks completions We have a shared helper to determine whether to defer, and that attempts to make the right decision based on the request. But we only have some of this information in the caller. Un-share the two timeout/defer helpers so the caller can use the right one. Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support") Reported-by: yangerkun <[email protected]> Reviewed-by: Jackie Liu <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-10nbd: fix possible sysfs duplicate warningXiubo Li1-1/+1
1. nbd_put takes the mutex and drops nbd->ref to 0. It then does idr_remove and drops the mutex. 2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails to find an existing device, so it does nbd_dev_add. 3. just before the nbd_put could call nbd_dev_remove or not finished totally, but if nbd_dev_add try to add_disk, we can hit: debugfs: Directory 'nbd1' with parent 'block' already present! This patch will make sure all the disk add/remove stuff are done by holding the nbd_index_mutex lock. Reported-by: Mike Christie <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Xiubo Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-09null_blk: Fix zoned command return codeKeith Busch1-2/+1
The return code from null_handle_zoned() sets the cmd->error value. Returning OK status when an error occured overwrites the intended cmd->error. Return the appropriate error code instead of setting the error in the cmd. Fixes: fceb5d1b19cbe626 ("null_blk: create a helper for zoned devices") Cc: Chaitanya Kulkarni <[email protected]> Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-09io_uring: only flush workqueues on fileset removalJens Axboe1-1/+5
We should not remove the workqueue, we just need to ensure that the workqueues are synced. The workqueues are torn down on ctx removal. Cc: [email protected] Fixes: 6b06314c47e1 ("io_uring: add file set registration") Reported-by: Stefan Hajnoczi <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-07io_uring: remove wait loop spurious wakeupsPavel Begunkov1-12/+4
Any changes interesting to tasks waiting in io_cqring_wait() are commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that but with no reason, that means spurious wakeups every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead. Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-06blk-wbt: fix performance regression in wbt scale_up/scale_downHarshad Shirwadkar3-9/+15
scale_up wakes up waiters after scaling up. But after scaling max, it should not wake up more waiters as waiters will not have anything to do. This patch fixes this by making scale_up (and also scale_down) return when threshold is reached. This bug causes increased fdatasync latency when fdatasync and dd conv=sync are performed in parallel on 4.19 compared to 4.14. This bug was introduced during refactoring of blk-wbt code. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: [email protected] Cc: Josef Bacik <[email protected]> Signed-off-by: Harshad Shirwadkar <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-06Revert "libata, freezer: avoid block device removal while system is frozen"Mika Westerberg2-27/+0
This reverts commit 85fbd722ad0f5d64d1ad15888cd1eb2188bfb557. The commit was added as a quick band-aid for a hang that happened when a block device was removed during system suspend. Now that bdi_wq is not freezable anymore the hang should not be possible and we can get rid of this hack by reverting it. Acked-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Mika Westerberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-06bdi: Do not use freezable workqueueMika Westerberg1-2/+2
A removable block device, such as NVMe or SSD connected over Thunderbolt can be hot-removed any time including when the system is suspended. When device is hot-removed during suspend and the system gets resumed, kernel first resumes devices and then thaws the userspace including freezable workqueues. What happens in that case is that the NVMe driver notices that the device is unplugged and removes it from the system. This ends up calling bdi_unregister() for the gendisk which then schedules wb_workfn() to be run one more time. However, since the bdi_wq is still frozen flush_delayed_work() call in wb_shutdown() blocks forever halting system resume process. User sees this as hang as nothing is happening anymore. Triggering sysrq-w reveals this: Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme] Call Trace: ? __schedule+0x2c5/0x630 ? wait_for_completion+0xa4/0x120 schedule+0x3e/0xc0 schedule_timeout+0x1c9/0x320 ? resched_curr+0x1f/0xd0 ? wait_for_completion+0xa4/0x120 wait_for_completion+0xc3/0x120 ? wake_up_q+0x60/0x60 __flush_work+0x131/0x1e0 ? flush_workqueue_prep_pwqs+0x130/0x130 bdi_unregister+0xb9/0x130 del_gendisk+0x2d2/0x2e0 nvme_ns_remove+0xed/0x110 [nvme_core] nvme_remove_namespaces+0x96/0xd0 [nvme_core] nvme_remove+0x5b/0x160 [nvme] pci_device_remove+0x36/0x90 device_release_driver_internal+0xdf/0x1c0 nvme_remove_dead_ctrl_work+0x14/0x30 [nvme] process_one_work+0x1c2/0x3f0 worker_thread+0x48/0x3e0 kthread+0x100/0x140 ? current_work+0x30/0x30 ? kthread_park+0x80/0x80 ret_from_fork+0x35/0x40 This is not limited to NVMes so exactly same issue can be reproduced by hot-removing SSD (over Thunderbolt) while the system is suspended. Prevent this from happening by removing WQ_FREEZABLE from bdi_wq. Reported-by: AceLan Kao <[email protected]> Link: https://marc.info/?l=linux-kernel&m=138695698516487 Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385 Link: https://lore.kernel.org/lkml/[email protected]/#t Acked-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Mika Westerberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-04io_uring: fix reversed nonblock flag for link submissionPavel Begunkov1-1/+1
io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit() passes something opposite. Fixes: c576666863b78 ("io_uring: optimize submit_and_wait API") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Pavel Begunkov <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-03block: sed-opal: fix sparse warning: convert __be64 dataRandy Dunlap1-2/+2
sparse warns about incorrect type when using __be64 data. It is not being converted to CPU-endian but it should be. Fixes these sparse warnings: ../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types) ../block/sed-opal.c:375:20: expected unsigned long long [usertype] align ../block/sed-opal.c:375:20: got restricted __be64 const [usertype] alignment_granularity ../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types) ../block/sed-opal.c:376:25: expected unsigned long long [usertype] lowest_lba ../block/sed-opal.c:376:25: got restricted __be64 const [usertype] lowest_aligned_lba Fixes: 455a7b238cd6 ("block: Add Sed-opal library") Cc: Scott Bauer <[email protected]> Cc: Rafael Antognolli <[email protected]> Cc: [email protected] Reviewed-by: Jon Derrick <[email protected]> Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-03block: sed-opal: fix sparse warning: obsolete array init.Randy Dunlap1-1/+1
Fix sparse warning: (missing '=') ../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntax Fixes: ff91064ea37c ("block: sed-opal: check size of shadow mbr") Cc: [email protected] Cc: Jonas Rabenstein <[email protected]> Cc: David Kozub <[email protected]> Reviewed-by: Scott Bauer <[email protected]> Reviewed-by: Revanth Rajashekar <[email protected]> Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-02block: pg: add header include guardMasahiro Yamada1-1/+4
Add a header include guard just in case. Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-01Revert "s390/dasd: Add discard support for ESE volumes"Stefan Haberland1-54/+3
This reverts commit 7e64db1597fe114b83fe17d0ba96c6aa5fca419a. The thin provisioning feature introduces an IOCTL and the discard support to allow userspace tools and filesystems to release unused and previously allocated space respectively. During some internal performance improvements and further tests, the release of allocated space revealed some issues that may lead to data corruption in some configurations when filesystems are mounted with discard support enabled. While we're working on a fix and trying to clarify the situation, this commit reverts the discard support for ESE volumes to prevent potential data corruption. Cc: <[email protected]> # 5.3 Signed-off-by: Stefan Haberland <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-01s390/dasd: Fix error handling during online processingJan Höppner1-16/+8
It is possible that the CCW commands for reading volume and extent pool information are not supported, either by the storage server (for dedicated DASDs) or by z/VM (for virtual devices, such as MDISKs). As a command reject will occur in such a case, the current error handling leads to a failing online processing and thus the DASD can't be used at all. Since the data being read is not essential for an fully operational DASD, the error handling can be removed. Information about the failing command is sent to the s390dbf debug feature. Fixes: c729696bcf8b ("s390/dasd: Recognise data for ESE volumes") Cc: <[email protected]> # 5.3 Reported-by: Frank Heimes <[email protected]> Signed-off-by: Jan Höppner <[email protected]> Signed-off-by: Stefan Haberland <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-01io_uring: use __kernel_timespec in timeout ABIArnd Bergmann1-4/+4
All system calls use struct __kernel_timespec instead of the old struct timespec, but this one was just added with the old-style ABI. Change it now to enforce the use of __kernel_timespec, avoiding ABI confusion and the need for compat handlers on 32-bit architectures. Any user space caller will have to use __kernel_timespec now, but this is unambiguous and works for any C library regardless of the time_t definition. A nicer way to specify the timeout would have been a less ambiguous 64-bit nanosecond value, but I suppose it's too late now to change that as this would impact both 32-bit and 64-bit users. Fixes: 5262f567987d ("io_uring: IORING_OP_TIMEOUT support") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-10-01loop: change queue block size to match when using DIOMartijn Coenen1-0/+10
The loop driver assumes that if the passed in fd is opened with O_DIRECT, the caller wants to use direct I/O on the loop device. However, if the underlying block device has a different block size than the loop block queue, direct I/O can't be enabled. Instead of requiring userspace to manually change the blocksize and re-enable direct I/O, just change the queue block sizes to match, as well as the io_min size. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Martijn Coenen <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-27Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linusJens Axboe8-51/+177
Pull NVMe changes from Sagi: "This set consists of various fixes and cleanups: - controller removal race fix from Balbir - quirk additions from Gabriel and Jian-Hong - nvme-pci power state save fix from Mario - Add 64bit user commands (for 64bit registers) from Marta - nvme-rdma/nvme-tcp fixes from Max, Mark and Me - Minor cleanups and nits from James, Dan and John" * 'nvme-5.4' of git://git.infradead.org/nvme: nvme-rdma: fix possible use-after-free in connect timeout nvme: Move ctrl sqsize to generic space nvme: Add ctrl attributes for queue_count and sqsize nvme: allow 64-bit results in passthru commands nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T nvmet-tcp: remove superflous check on request sgl Added QUIRKs for ADATA XPG SX8200 Pro 512GB nvme-rdma: Fix max_hw_sectors calculation nvme: fix an error code in nvme_init_subsystem() nvme-pci: Save PCI state before putting drive into deepest state nvme-tcp: fix wrong stop condition in io_work nvme-pci: Fix a race in controller removal nvmet: change ppl to lpp
2019-09-27blk-mq: apply normal plugging for HDDMing Lei1-1/+5
Some HDD drive may expose multiple hardware queues, such as MegraRaid. Let's apply the normal plugging for such devices because sequential IO may benefit a lot from plug merging. Cc: Bart Van Assche <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Dave Chinner <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-27blk-mq: honor IO scheduler for multiqueue devicesMing Lei1-2/+4
If a device is using multiple queues, the IO scheduler may be bypassed. This may hurt performance for some slow MQ devices, and it also breaks zoned devices which depend on mq-deadline for respecting the write order in one zone. Don't bypass io scheduler if we have one setup. This patch can double sequential write performance basically on MQ scsi_debug when mq-deadline is applied. Cc: Bart Van Assche <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Dave Chinner <[email protected]> Reviewed-by: Javier González <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-27nvme-rdma: fix possible use-after-free in connect timeoutSagi Grimberg1-1/+2
If the connect times out, we may have already destroyed the queue in the timeout handler, so test if the queue is still allocated in the connect error handler. Reported-by: Yi Zhang <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-27block: fix null pointer dereference in blk_mq_rq_timed_out()Yufen Yu3-1/+21
We got a null pointer deference BUG_ON in blk_mq_rq_timed_out() as following: [ 108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040 [ 108.827059] PGD 0 P4D 0 [ 108.827313] Oops: 0000 [#1] SMP PTI [ 108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431 [ 108.829503] Workqueue: kblockd blk_mq_timeout_work [ 108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330 [ 108.838191] Call Trace: [ 108.838406] bt_iter+0x74/0x80 [ 108.838665] blk_mq_queue_tag_busy_iter+0x204/0x450 [ 108.839074] ? __switch_to_asm+0x34/0x70 [ 108.839405] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.839823] ? blk_mq_stop_hw_queue+0x40/0x40 [ 108.840273] ? syscall_return_via_sysret+0xf/0x7f [ 108.840732] blk_mq_timeout_work+0x74/0x200 [ 108.841151] process_one_work+0x297/0x680 [ 108.841550] worker_thread+0x29c/0x6f0 [ 108.841926] ? rescuer_thread+0x580/0x580 [ 108.842344] kthread+0x16a/0x1a0 [ 108.842666] ? kthread_flush_work+0x170/0x170 [ 108.843100] ret_from_fork+0x35/0x40 The bug is caused by the race between timeout handle and completion for flush request. When timeout handle function blk_mq_rq_timed_out() try to read 'req->q->mq_ops', the 'req' have completed and reinitiated by next flush request, which would call blk_rq_init() to clear 'req' as 0. After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"), normal requests lifetime are protected by refcount. Until 'rq->ref' drop to zero, the request can really be free. Thus, these requests cannot been reused before timeout handle finish. However, flush request has defined .end_io and rq->end_io() is still called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq' can be reused by the next flush request handle, resulting in null pointer deference BUG ON. We fix this problem by covering flush request with 'rq->ref'. If the refcount is not zero, flush_end_io() return and wait the last holder recall it. To record the request status, we add a new entry 'rq_status', which will be used in flush_end_io(). Cc: Christoph Hellwig <[email protected]> Cc: Keith Busch <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: [email protected] # v4.18+ Reviewed-by: Ming Lei <[email protected]> Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Yufen Yu <[email protected]> ------- v2: - move rq_status from struct request to struct blk_flush_queue v3: - remove unnecessary '{}' pair. v4: - let spinlock to protect 'fq->rq_status' v5: - move rq_status after flush_running_idx member of struct blk_flush_queue Signed-off-by: Jens Axboe <[email protected]>
2019-09-27rq-qos: get rid of redundant wbt_update_limits()Yufen Yu1-1/+0
We have updated limits after calling wbt_set_min_lat(). No need to update again. Reviewed-by: Bob Liu <[email protected]> Signed-off-by: Yufen Yu <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-26nvme: Move ctrl sqsize to generic spaceKeith Busch1-1/+1
This isn't specific to fabrics. Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-26iocost: bump up default latency targets for hard disksTejun Heo1-2/+2
The default hard disk param sets latency targets at 50ms. As the default target percentiles are zero, these don't directly regulate vrate; however, they're still used to calculate the period length - 100ms in this case. This is excessively low. A SATA drive with QD32 saturated with random IOs can easily reach avg completion latency of several hundred msecs. A period duration which is substantially lower than avg completion latency can lead to wildly fluctuating vrate. Let's bump up the default latency targets to 250ms so that the period duration is sufficiently long. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-26iocost: improve nr_lagging handlingTejun Heo1-8/+11
Some IOs may span multiple periods. As latencies are collected on completion, the inbetween periods won't register them and may incorrectly decide to increase vrate. nr_lagging tracks these IOs to avoid those situations. Currently, whenever there are IOs which are spanning from the previous period, busy_level is reset to 0 if negative thus suppressing vrate increase. This has the following two problems. * When latency target percentiles aren't set, vrate adjustment should only be governed by queue depth depletion; however, the current code keeps nr_lagging active which pulls in latency results and can keep down vrate unexpectedly. * When lagging condition is detected, it resets the entire negative busy_level. This turned out to be way too aggressive on some devices which sometimes experience extended latencies on a small subset of commands. In addition, a lagging IO will be accounted as latency target miss on completion anyway and resetting busy_level amplifies its impact unnecessarily. This patch fixes the above two problems by disabling nr_lagging counting when latency target percentiles aren't set and blocking vrate increases when there are lagging IOs while leaving busy_level as-is. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-26iocost: better trace vrate changesTejun Heo1-1/+6
vrate_adj tracepoint traces vrate changes; however, it does so only when busy_level is non-zero. busy_level turning to zero can sometimes be as interesting an event. This patch also enables vrate_adj tracepoint on other vrate related events - busy_level changes and non-zero nr_lagging. Signed-off-by: Tejun Heo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-26block: don't release queue's sysfs lock during switching elevatorMing Lei2-39/+5
cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to release & acquire sysfs_lock before registering/un-registering elevator queue during switching elevator for avoiding potential deadlock from showing & storing 'queue/iosched' attributes and removing elevator's kobject. Turns out there isn't such deadlock because 'q->sysfs_lock' isn't required in .show & .store of queue/iosched's attributes, and just elevator's sysfs lock is acquired in elv_iosched_store() and elv_iosched_show(). So it is safe to hold queue's sysfs lock when registering/un-registering elevator queue. The biggest issue is that commit cecf5d87ff20 assumes that concurrent write on 'queue/scheduler' can't happen. However, this assumption isn't true, because kernfs_fop_write() only guarantees that concurrent write aren't called on the same open file, but the write could be from different open on the file. So we can't release & re-acquire queue's sysfs lock during switching elevator, otherwise use-after-free on elevator could be triggered. Fixes the issue by not releasing queue's sysfs lock during switching elevator. Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks") Cc: Christoph Hellwig <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: Greg KH <[email protected]> Cc: Mike Snitzer <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-26blk-mq: move lockdep_assert_held() into elevator_exitMing Lei2-2/+2
Commit c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") removes q->sysfs_lock from elevator_init_mq(), but forgot to deal with lockdep_assert_held() called in blk_mq_sched_free_requests() which is run in failure path of elevator_init_mq(). blk_mq_sched_free_requests() is called in the following 3 functions: elevator_init_mq() elevator_exit() blk_cleanup_queue() In blk_cleanup_queue(), blk_mq_sched_free_requests() is followed exactly by 'mutex_lock(&q->sysfs_lock)'. So moving the lockdep_assert_held() from blk_mq_sched_free_requests() into elevator_exit() for fixing the report by syzbot. Reported-by: [email protected] Fixed: c48dac137a62 ("block: don't hold q->sysfs_lock in elevator_init_mq") Reviewed-by: Bart Van Assche <[email protected]> Reviewed-by: Damien Le Moal <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2019-09-25nvme: Add ctrl attributes for queue_count and sqsizeJames Smart1-0/+4
Current controller interrogation requires a lot of guesswork on how many io queues were created and what the io sq size is. The numbers are dependent upon core/fabric defaults, connect arguments, and target responses. Add sysfs attributes for queue_count and sqsize. Signed-off-by: James Smart <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme: allow 64-bit results in passthru commandsMarta Rybczynska2-16/+115
It is not possible to get 64-bit results from the passthru commands, what prevents from getting for the Capabilities (CAP) property value. As a result, it is not possible to implement IOL's NVMe Conformance test 4.3 Case 1 for Fabrics targets [1] (page 123). This issue has been already discussed [2], but without a solution. This patch solves the problem by adding new ioctls with a new passthru structure, including 64-bit results. The older ioctls stay unchanged. [1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf [2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html Signed-off-by: Marta Rybczynska <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme: Add quirk for Kingston NVME SSD running FW E8FK11.TJian-Hong Pan1-0/+10
Kingston NVME SSD with firmware version E8FK11.T has no interrupt after resume with actions related to suspend to idle. This patch applied NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue. Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887 Signed-off-by: Jian-Hong Pan <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvmet-tcp: remove superflous check on request sglSagi Grimberg1-8/+4
Now that sgl_free is null safe, drop the superflous check. Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25Added QUIRKs for ADATA XPG SX8200 Pro 512GBGabriel Craciunescu1-0/+3
Booting with default_ps_max_latency_us >6000 makes the device fail. Also SUBNQN is NULL and gives a warning on each boot/resume. $ nvme id-ctrl /dev/nvme0 | grep ^subnqn subnqn : (null) I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop. To be sure is not a Laptop issue only, I tested the device on my server board with the same results. ( with 2x,4x link on the board and 4x link on a PCI-E card ). Signed-off-by: Gabriel Craciunescu <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme-rdma: Fix max_hw_sectors calculationMax Gurtovoy1-5/+11
By default, the NVMe/RDMA driver should support max io_size of 1MiB (or upto the maximum supported size by the HCA). Currently, one will see that /sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024. A non power of 2 value can cause performance degradation due to unnecessary splitting of IO requests and unoptimized allocation units. The number of pages per MR has been fixed here, so there is no longer any need to reduce max_sectors by 1. Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Max Gurtovoy <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme: fix an error code in nvme_init_subsystem()Dan Carpenter1-2/+3
"ret" should be a negative error code here, but it's either success or possibly uninitialized. Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme-pci: Save PCI state before putting drive into deepest stateMario Limonciello1-7/+10
The action of saving the PCI state will cause numerous PCI configuration space reads which depending upon the vendor implementation may cause the drive to exit the deepest NVMe state. In these cases ASPM will typically resolve the PCIe link state and APST may resolve the NVMe power state. However it has also been observed that this register access after quiesced will cause PC10 failure on some device combinations. To resolve this, move the PCI state saving to before SetFeatures has been called. This has been proven to resolve the issue across a 5000 sample test on previously failing disk/system combinations. Signed-off-by: Mario Limonciello <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25nvme-tcp: fix wrong stop condition in io_workWunderlich, Mark1-2/+2
Allow the do/while statement to continue if current time is not after the proposed time 'deadline'. Intent is to allow loop to proceed for a specific time period. Currently the loop, as coded, will exit after first pass. Signed-off-by: Mark Wunderlich <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]>
2019-09-25Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds27-385/+767
Pull ceph updates from Ilya Dryomov: "The highlights are: - automatic recovery of a blacklisted filesystem session (Zheng Yan). This is disabled by default and can be enabled by mounting with the new "recover_session=clean" option. - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is taken to avoid serializing O_DIRECT reads and writes with each other, this is based on the exclusion scheme from NFS. - handle large osdmaps better in the face of fragmented memory (myself) - don't limit what security.* xattrs can be get or set (Jeff Layton). We were overly restrictive here, unnecessarily preventing things like file capability sets stored in security.capability from working. - allow copy_file_range() within the same inode and across different filesystems within the same cluster (Luis Henriques)" * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits) ceph: call ceph_mdsc_destroy from destroy_fs_client libceph: use ceph_kvmalloc() for osdmap arrays libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc() ceph: allow object copies across different filesystems in the same cluster ceph: include ceph_debug.h in cache.c ceph: move static keyword to the front of declarations rbd: pull rbd_img_request_create() dout out into the callers ceph: reconnect connection if session hang in opening state libceph: drop unused con parameter of calc_target() ceph: use release_pages() directly rbd: fix response length parameter for encoded strings ceph: allow arbitrary security.* xattrs ceph: only set CEPH_I_SEC_INITED if we got a MAC label ceph: turn ceph_security_invalidate_secctx into static inline ceph: add buffered/direct exclusionary locking for reads and writes libceph: handle OSD op ceph_pagelist_append() errors ceph: don't return a value from void function ceph: don't freeze during write page faults ceph: update the mtime when truncating up ceph: fix indentation in __get_snap_name() ...
2019-09-25Merge tag 'fuse-update-5.4' of ↵Linus Torvalds14-1614/+1730
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Continue separating the transport (user/kernel communication) and the filesystem layers of fuse. Getting rid of most layering violations will allow for easier cleanup and optimization later on. - Prepare for the addition of the virtio-fs filesystem. The actual filesystem will be introduced by a separate pull request. - Convert to new mount API. - Various fixes, optimizations and cleanups. * tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits) fuse: Make fuse_args_to_req static fuse: fix memleak in cuse_channel_open fuse: fix beyond-end-of-page access in fuse_parse_cache() fuse: unexport fuse_put_request fuse: kmemcg account fs data fuse: on 64-bit store time in d_fsdata directly fuse: fix missing unlock_page in fuse_writepage() fuse: reserve byteswapped init opcodes fuse: allow skipping control interface and forced unmount fuse: dissociate DESTROY from fuseblk fuse: delete dentry if timeout is zero fuse: separate fuse device allocation and installation in fuse_conn fuse: add fuse_iqueue_ops callbacks fuse: extract fuse_fill_super_common() fuse: export fuse_dequeue_forget() function fuse: export fuse_get_unique() fuse: export fuse_send_init_request() fuse: export fuse_len_args() fuse: export fuse_end_request() fuse: fix request limit ...
2019-09-25Merge tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmddLinus Torvalds5-13/+20
Pull tpm fixes from Jarkko Sakkinen. * tag 'tpmdd-next-20190925' of git://git.infradead.org/users/jjs/linux-tpmdd: tpm: Wrap the buffer from the caller to tpm_buf in tpm_send() MAINTAINERS: keys: Update path to trusted.h KEYS: trusted: correctly initialize digests and fix locking issue selftests/tpm2: Add log and *.pyc to .gitignore selftests/tpm2: Add the missing TEST_FILES assignment
2019-09-25Merge tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-21/+27
Pull iomap updates from Darrick Wong: "After last week's failed pull request attempt, I scuttled everything in the branch except for the directio endio api changes, which were trivial. Everything else will simply have to wait for the next cycle. Summary: - Report both io errors and short io results to the directio endio handler. - Allow directio callers to pass an ops structure to iomap_dio_rw" * tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: move the iomap_dio_rw ->end_io callback into a structure iomap: split size and error for iomap_dio_rw ->end_io
2019-09-24Merge branch 'i2c/for-5.4' of ↵Linus Torvalds42-267/+795
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c updates from Wolfram Sang: - new driver for ICY, an Amiga Zorro card :) - axxia driver gained slave mode support, NXP driver gained ACPI - the slave EEPROM backend gained 16 bit address support - and lots of regular driver updates and reworks * 'i2c/for-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (52 commits) i2c: tegra: Move suspend handling to NOIRQ phase i2c: imx: ACPI support for NXP i2c controller i2c: uniphier(-f): remove all dev_dbg() i2c: uniphier(-f): use devm_platform_ioremap_resource() i2c: slave-eeprom: Add comment about address handling i2c: exynos5: Remove IRQF_ONESHOT i2c: stm32f7: Make structure stm32f7_i2c_algo constant i2c: cht-wc: drop check because i2c_unregister_device() is NULL safe i2c-eeprom_slave: Add support for more eeprom models i2c: fsi: Add of_put_node() before break i2c: synquacer: Make synquacer_i2c_ops constant i2c: hix5hd2: Remove IRQF_ONESHOT i2c: i801: Use iTCO version 6 in Cannon Lake PCH and beyond watchdog: iTCO: Add support for Cannon Lake PCH iTCO i2c: iproc: Make bcm_iproc_i2c_quirks constant i2c: iproc: Add full name of devicetree node to adapter name i2c: piix4: Add ACPI support i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h i2c: ocores: use request_any_context_irq() to register IRQ handler i2c: designware: Fix optional reset error handling ...
2019-09-24Merge tag 'sound-fix-5.4-rc1' of ↵Linus Torvalds13-24/+67
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A few small remaining wrap-up for this merge window. Most of patches are device-specific (HD-audio and USB-audio quirks, FireWire, pcm316a, fsl, rsnd, Atmel, and TI fixes), while there is a simple fix (actually two commits) for ASoC core" * tag 'sound-fix-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: usb-audio: Add DSD support for EVGA NU Audio ALSA: hda - Add laptop imic fixup for ASUS M9V laptop ASoC: ti: fix SND_SOC_DM365_VOICE_CODEC dependencies ASoC: pcm3168a: The codec does not support S32_LE ASoC: core: use list_del_init and move it back to soc_cleanup_component ALSA: hda/realtek - PCI quirk for Medion E4254 ALSA: hda - Apply AMD controller workaround for Raven platform ASoC: rsnd: do error check after rsnd_channel_normalization() ASoC: atmel_ssc_dai: Remove wrong spinlock usage ASoC: core: delete component->card_list in soc_remove_component only ASoC: fsl_sai: Fix noise when using EDMA ALSA: usb-audio: Add Hiby device family to quirks for native DSD support ALSA: hda/realtek - Fix alienware headset mic ALSA: dice: fix wrong packet parameter for Alesis iO26
2019-09-25tpm: Wrap the buffer from the caller to tpm_buf in tpm_send()Jarkko Sakkinen1-7/+2
tpm_send() does not give anymore the result back to the caller. This would require another memcpy(), which kind of tells that the whole approach is somewhat broken. Instead, as Mimi suggested, this commit just wraps the data to the tpm_buf, and thus the result will not go to the garbage. Obviously this assumes from the caller that it passes large enough buffer, which makes the whole API somewhat broken because it could be different size than @buflen but since trusted keys is the only module using this API right now I think that this fix is sufficient for the moment. In the near future the plan is to replace the parameters with a tpm_buf created by the caller. Reported-by: Mimi Zohar <[email protected]> Suggested-by: Mimi Zohar <[email protected]> Cc: [email protected] Fixes: 412eb585587a ("use tpm_buf in tpm_transmit_cmd() as the IO parameter") Signed-off-by: Jarkko Sakkinen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]>
2019-09-25MAINTAINERS: keys: Update path to trusted.hDenis Efremov1-1/+1
Update MAINTAINERS record to reflect that trusted.h was moved to a different directory in commit 22447981fc05 ("KEYS: Move trusted.h to include/keys [ver #2]"). Cc: Denis Kenzior <[email protected]> Cc: James Bottomley <[email protected]> Cc: Jarkko Sakkinen <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: [email protected] Signed-off-by: Denis Efremov <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Signed-off-by: Jarkko Sakkinen <[email protected]>