Age | Commit message (Collapse) | Author | Files | Lines |
|
Allows mempools to be embedded in other structs, getting rid of a
pointer indirection from allocation fastpaths.
mempool_exit() is safe to call on an uninitialized but zeroed mempool.
Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
If we have multiple callers of sbq_wake_up(), we can end up in a
situation where the wait_cnt will continually go more and more
negative. Consider the case where our wake batch is 1, hence
wait_cnt will start out as 1.
wait_cnt == 1
CPU0 CPU1
atomic_dec_return(), cnt == 0
atomic_dec_return(), cnt == -1
cmpxchg(-1, 0) (succeeds)
[wait_cnt now 0]
cmpxchg(0, 1) (fails)
This ends up with wait_cnt being 0, we'll wakeup immediately
next time. Going through the same loop as above again, and
we'll have wait_cnt -1.
For the case where we have a larger wake batch, the only
difference is that the starting point will be higher. We'll
still end up with continually smaller batch wakeups, which
defeats the purpose of the rolling wakeups.
Always reset the wait_cnt to the batch value. Then it doesn't
matter who wins the race. But ensure that whomever does win
the race is the one that increments the ws index and wakes up
our batch count, loser gets to call __sbq_wake_up() again to
account his wakeups towards the next active wait state index.
Fixes: 6c0ca7ae292a ("sbitmap: fix wakeup hang after sbq resize")
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Same numerical value (for now at least), but a much better documentation
of intent.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We just can't do I/O when doing block layer requests allocations,
so use GFP_NOIO instead of the even more limited __GFP_DIRECT_RECLAIM.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
blk_old_get_request already has it at hand, and in blk_queue_bio, which
is the fast path, it is constant.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Switch everyone to blk_get_request_flags, and then rename
blk_get_request_flags to blk_get_request.
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Always GFP_KERNEL, and keeping it would cause serious complications for
the next change.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fixes: 7c2d748e8476 ("memstick: don't call blk_queue_bounce_limit")
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The ps3disk driver already kmaps all pages when copying from/to the
internal bounce buffer, so it can accept highmem pages just fine.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Just kmap the bio single page payload before processing it.
(and yes, now highmem on sparc32 anyway, but kmap_(un)map atomic are nops,
so this gives the right example)
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Use kmap_atomic when copying out of a bio_vec.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Just kmap the single payload page before passing it on to the FTL.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
All in-tree host drivers set up a proper dma mask and use the dma-mapping
helpers. This means they will be able to deal with any address that we
are throwing at them.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
DAC960 just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
mtip32xx just sets the block bounce limit to the dma mask, which means
that the iommu or swiotlb already take care of the bounce buffering,
and the block bouncing can be removed.
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
AER handling expects a successful return from slot_reset means the
driver made the device functional again. The nvme driver had been using
an asynchronous reset to recover the device, so the device
may still be initializing after control is returned to the
AER handler. This creates problems for subsequent event handling,
causing the initializion to fail.
This patch fixes that by syncing the controller reset before returning
to the AER driver, and reporting the true state of the reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199657
Reported-by: Alex Gagniuc <[email protected]>
Cc: Sinan Kaya <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Tested-by: Alex Gagniuc <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Martin K. Petersen <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
Make sure the user passed the right value to
sbitmap_queue_min_shallow_depth().
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We don't expect the async depth to be smaller than the wake batch
count for sbitmap, but just in case, inform sbitmap of what shallow
depth kyber may use.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
If our shallow depth is smaller than the wake batching of sbitmap,
we can introduce hangs. Ensure that sbitmap knows how low we'll go.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The sbitmap queue wake batch is calculated such that once allocations
start blocking, all of the bits which are already allocated must be
enough to fulfill the batch counters of all of the waitqueues. However,
the shallow allocation depth can break this invariant, since we block
before our full depth is being utilized. Add
sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
the sbq will use, and update sbq_calc_wake_batch() to take it into
account.
Acked-by: Paolo Valente <[email protected]>
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
bfqd->sb_shift was attempted used as a cache for the sbitmap queue
shift, but we don't need it, as it never changes. Kill it with fire.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
It doesn't change, so don't put it in the per-IO hot path.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Reserved tags are used for error handling, we don't need to
care about them for regular IO. The core won't call us for these
anyway.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
It's not useful, they are internal and/or error handling recovery
commands.
Acked-by: Paolo Valente <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
When invoked for an I/O request rq, the prepare_request hook of bfq
increments reference counters in the destination bfq_queue for rq. In
this respect, after this hook has been invoked, rq may still be
transformed into a request with no icq attached, i.e., for bfq, a
request not associated with any bfq_queue. No further hook is invoked
to signal this tranformation to bfq (in general, to the destination
elevator for rq). This leads bfq into an inconsistent state, because
bfq has no chance to correctly lower these counters back. This
inconsistency may in its turn cause incorrect scheduling and hangs. It
certainly causes memory leaks, by making it impossible for bfq to free
the involved bfq_queue.
On the bright side, no transformation can still happen for rq after rq
has been inserted into bfq, or merged with another, already inserted,
request. Exploiting this fact, this commit addresses the above issue
by delaying the preparation of an I/O request to when the request is
inserted or merged.
This change also gives a performance bonus: a lock-contention point
gets removed. To prepare a request, bfq needs to hold its scheduler
lock. After postponing request preparation to insertion or merging, no
lock needs to be grabbed any longer in the prepare_request hook, while
the lock already taken to perform insertion or merging is used to
preparare the request as well.
Tested-by: Oleksandr Natalenko <[email protected]>
Tested-by: Bart Van Assche <[email protected]>
Signed-off-by: Paolo Valente <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Branch to the right label in the error handling path in order to keep it
logical.
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This commit sets QUEUE_FLAG_NONROT and clears up QUEUE_FLAG_ADD_RANDOM
to mark the ramdisks as non-rotational device.
Signed-off-by: SeongJae Park <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Currently, struct request has four timestamp fields:
- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds,
used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq
These can all be consolidated into one start time and one I/O start
time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
request depending on the kernel config.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We want this next to blk_account_io_done() for the next change so that
we can call ktime_get() only once for both.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
cfq and bfq have some internal fields that use sched_clock() which can
trivially use ktime_get_ns() instead. Their timestamp fields in struct
request can also use ktime_get_ns(), which resolves the 8 year old
comment added by commit 28f4197e5d47 ("block: disable preemption before
using sched_clock()").
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
struct blk_issue_stat squashes three things into one u64:
- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling
It turns out that on x86_64, we have a 4 byte hole in struct request
which we can fill with the non-timestamp fields from blk_issue_stat,
simplifying things quite a bit.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
struct blk_issue_stat is going away, and bio->bi_issue_stat doesn't even
use the blk-stats interface, so we can provide a separate implementation
specific for bios. The helpers work the same way as the blk-stats
helpers.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
issue_stat is going to go away, so first make writeback throttling take
the containing request, update the internal wbt helpers accordingly, and
change rwb->sync_cookie to be the request pointer instead of the
issue_stat pointer. No functional change.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
A few helpers are only used from blk-wbt.c, so move them there, and put
wbt_track() behind the CONFIG_BLK_WBT typedef. This is in preparation
for changing how the wbt flags are tracked.
Signed-off-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Throttle discards like we would any background write. Discards should
be background activity, so if they are impacting foreground IO, then
we will throttle them down.
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This is in preparation for having more write queues, in which
case we would have needed to pass in more information than just
a simple 'is_kswapd' boolean.
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We currently special case WRITE and FLUSH, but we should really
just include any command with the write bit set. This ensures
that we account DISCARD.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Don't build discards bigger than what the user asked for, if the
user decided to limit the size by writing to 'discard_max_bytes'.
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
syzbot is hitting WARN() triggered by memory allocation fault
injection [1] because loop module is calling sysfs_remove_group()
when sysfs_create_group() failed.
Fix this by remembering whether sysfs_create_group() succeeded.
[1] https://syzkaller.appspot.com/bug?id=3f86c0edf75c86d2633aeb9dd69eccc70bc7e90b
Signed-off-by: Tetsuo Handa <[email protected]>
Reported-by: syzbot <syzbot+9f03168400f56df89dbc6f1751f4458fe739ff29@syzkaller.appspotmail.com>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Renamed sysfs_ready -> sysfs_inited.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 9c40cef2b799 ("sched: Move blk_schedule_flush_plug() out of
__schedule()") moved the blk_schedule_flush_plug() call out of the
interrupt/preempt disabled region in the scheduler. This allows to replace
local_irq_save/restore(flags) by local_irq_disable/enable() in
blk_flush_plug_list().
But it makes more sense to disable interrupts explicitly when the request
queue is locked end reenable them when the request to is unlocked. This
shortens the interrupt disabled section which is important when the plug
list contains requests for more than one queue. The comment which claims
that disabling interrupts around the loop is misleading as the called
functions can reenable interrupts unconditionally anyway and obfuscates the
scope badly:
local_irq_save(flags);
spin_lock(q->queue_lock);
...
queue_unplugged(q...);
scsi_request_fn();
spin_unlock_irq(q->queue_lock);
-------------------^^^ ????
spin_lock_irq(q->queue_lock);
spin_unlock(q->queue_lock);
local_irq_restore(flags);
Aside of that the detached interrupt disabling is a constant pain for
PREEMPT_RT as it requires patching and special casing when RT is enabled
while with the spin_*_irq() variants this happens automatically.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 2fff8a924d4c ("block: Check locking assumptions at runtime") added a
lockdep_assert_held(q->queue_lock) which makes the WARN_ON() redundant
because lockdep will detect and warn about context violations.
The unconditional WARN_ON() does not provide real additional value, so it
can be removed.
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
bounce_copy_vec() disables interrupts around kmap_atomic(). This is a
leftover from the old kmap_atomic() implementation which relied on fixed
mapping slots, so the caller had to make sure that the same slot could not
be reused from an interrupting context.
kmap_atomic() was changed to dynamic slots long ago and commit 1ec9c5ddc17a
("include/linux/highmem.h: remove the second argument of k[un]map_atomic()")
removed the slot assignements, but the callers were not checked for now
redundant interrupt disabling.
Remove the conditional interrupt disable.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
CONFIG_PRREMPT -> CONFIG_PREEMPT
Signed-off-by: Florian La Roche <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull DeviceTree fixes from Rob Herring:
- fix path to display timing binding
- fix some typos in interrupt-names and clock-names
- fix a resource leak on overlay removal
- add missing documentation for R8A77965 DMA, serial, and net
- cleanup sunxi pinctrl description
- add Kieback & Peter GmbH vendor prefix
* tag 'devicetree-fixes-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: panel: lvds: Fix path to display timing bindings
dt-bindings: mvebu-uart: DT fix s/interrupts-names/interrupt-names/
dt-bindings: meson-uart: DT fix s/clocks-names/clock-names/
of: overlay: Stop leaking resources on overlay removal
dtc: checks: drop warning for missing PCI bridge bus-range
dt-bindings: dmaengine: rcar-dmac: document R8A77965 support
dt-bindings: serial: sh-sci: Add support for r8a77965 (H)SCIF
dt-bindings: net: ravb: Add support for r8a77965 SoC
dt-bindings: pinctrl: sunxi: Fix reference to driver
doc: Add vendor prefix for Kieback & Peter GmbH
|
|
It is possible the driver's remove may have freed the controller if
the remove callback is invoked prior to the async_schedule starting
the reset_work. This patch fixes that by holding a reference on the
controller.
Reported-by: Mikulas Patocka <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
|
|
|
|
Pll KVM fixes from Radim Krčmář:
"ARM:
- Fix proxying of GICv2 CPU interface accesses
- Fix crash when switching to BE
- Track source vcpu git GICv2 SGIs
- Fix an outdated bit of documentation
x86:
- Speed up injection of expired timers (for stable)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: remove APIC Timer periodic/oneshot spikes
arm64: vgic-v2: Fix proxying of cpuif access
KVM: arm/arm64: vgic_init: Cleanup reference to process_maintenance
KVM: arm64: Fix order of vcpu_write_sys_reg() arguments
KVM: arm/arm64: vgic: Fix source vcpu issues for GICv2 SGI
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- fix a compile warning in the AMD IOMMU driver with irq remapping
disabled
- fix for VT-d interrupt remapping and invalidation size (caused a
BUG_ON when trying to invalidate more than 4GB)
- build fix and a regression fix for broken graphics with old DTS for
the rockchip iommu driver
- a revert in the PCI window reservation code which fixes a regression
with VFIO.
* tag 'iommu-fixes-v4.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu: rockchip: fix building without CONFIG_OF
iommu/vt-d: Use WARN_ON_ONCE instead of BUG_ON in qi_flush_dev_iotlb()
iommu/vt-d: fix shift-out-of-bounds in bug checking
iommu/dma: Move PCI window region reservation back into dma specific path.
iommu/rockchip: Make clock handling optional
iommu/amd: Hide unused iommu_table_lock
iommu/vt-d: Fix usage of force parameter in intel_ir_reconfigure_irte()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"Unbreak the CPUID CPUID_8000_0008_EBX reload which got dropped when
the evaluation of physical and virtual bits which uses the same CPUID
leaf was moved out of get_cpu_cap()"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Restore CPUID_8000_0008_EBX reload
|