Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.4/block
Pull MD fixes from Song.
* 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5: use bio_end_sector to calculate last_sector
md/raid1: fail run raid1 array when active disk less than one
md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone
|
|
Use the common way to get last_sector.
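For illustration, the helper from <linux/bio.h> replaces the open-coded form (the raid5 variable name here is an assumption):

    /* open-coded */
    last_sector = bi->bi_iter.bi_sector + (bi->bi_iter.bi_size >> 9);

    /* common helper */
    last_sector = bio_end_sector(bi);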
Signed-off-by: Guoqing Jiang <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
When running the following test case:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
the final mdadm assemble fails, with kernel messages as follows:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
In fact, when a raid1 array has fewer than one active disk, we
need to return failure from raid1_run().
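A minimal sketch of such a check in raid1_run() (exact placement and error code are assumptions):

    /* RAID1 needs at least one active disk */
    if (conf->raid_disks - mddev->degraded < 1) {
        ret = -EINVAL;
        goto abort;
    }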
Reviewed-by: NeilBrown <[email protected]>
Signed-off-by: Yufen Yu <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
Currently md raid0/linear have no mechanism to detect whether an array
member was removed or has failed. The driver keeps sending BIOs
regardless of the state of the array members, and the kernel shows state
'clean' in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, a user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. That means the user might think the written data is correctly
stored in the array when in fact garbage was written, given that raid0
does striping (and so requires all of its members to be working in order
not to corrupt data). For md/linear, writes to the available members will
work fine, but writes that land on the missing member(s) cause file
corruption, since that portion of the data is never effectively written.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written to 'array_state': since it only
indicates that one or more members are missing while otherwise behaving
like 'clean', writing it wouldn't make sense.
With this patch, the filesystem reacts much faster to a missing array
member: after a few I/O errors, ext4 for instance aborts the journal and
prevents corruption. Without this change, we can keep writing to the disk,
and after a machine reboot e2fsck reports severe fs errors that demand
fixing. This patch was tested on ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
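A hedged sketch of the submission-path check (the helper name and exact fields are assumptions; GENHD_FL_UP is the 5.4-era "disk is alive" flag):

    static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
    {
        if (!(rdev->bdev->bd_disk->flags & GENHD_FL_UP)) {
            /* warn only when the flag transitions */
            if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
                pr_warn("md: %s: %s array has a missing/failed member\n",
                        mdname(rdev->mddev), md_type);
            return true;
        }
        return false;
    }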
Cc: Song Liu <[email protected]>
Reviewed-by: NeilBrown <[email protected]>
Signed-off-by: Guilherme G. Piccoli <[email protected]>
Signed-off-by: Song Liu <[email protected]>
|
|
The race was when a thread using closure_sync() notices cl->s->done == 1
before the thread calling closure_put() calls wake_up_process(). Then,
it's possible for that thread to return and exit just before
wake_up_process() is called - so we're trying to wake up a process that
no longer exists.
rcu_read_lock() is sufficient to protect against this, as there's an rcu
barrier somewhere in the process teardown path.
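A sketch of the wakeup side (the field names follow the description above and are assumptions):

    struct task_struct *p;

    rcu_read_lock();            /* pins the task_struct across the wakeup */
    p = READ_ONCE(s->task);
    s->done = 1;
    wake_up_process(p);
    rcu_read_unlock();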
Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
The copy_to_user() function returns the number of bytes remaining to be
copied, but the intention here was to return -EFAULT if the copy fails.
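The canonical pattern (buffer names are placeholders):

    if (copy_to_user(user_buf, kbuf, len))
        return -EFAULT;         /* not the number of uncopied bytes */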
Fixes: cafe56359144 ("bcache: A block layer cache")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Reading /sys/fs/bcache/<uuid>/cacheN/priority_stats can take a very long
time with a huge cache after a long run.
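One plausible mitigation, sketched here and not necessarily the exact upstream change, is to yield inside the comparator used to sort the bucket priorities:

    static int __bch_cache_cmp(const void *l, const void *r)
    {
        cond_resched();    /* sorting a huge priority array can hog the CPU */
        return *((uint16_t *)r) - *((uint16_t *)l);    /* descending */
    }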
Signed-off-by: Shile Zhang <[email protected]>
Tested-by: Heitor Alves de Siqueira <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This argument has not been considered since blk-mq became the default,
so remove this documentation to avoid confusion.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
.txt file is now .rst
Signed-off-by: Jens Axboe <[email protected]>
|
|
This argument has been ignored since blk-mq became the default, so remove
it from the documentation.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
.txt file is now .rst
Signed-off-by: Jens Axboe <[email protected]>
|
|
Since the inclusion of blk-mq, the elevator argument has not been
considered anymore; its utility died along with the legacy IO path,
which has now been removed too.
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Marcos Paulo de Souza <[email protected]>
Fold with doc removal patch.
Signed-off-by: Jens Axboe <[email protected]>
|
|
Commit 7211aef86f79 ("block: mq-deadline: Fix write completion
handling") added a call to blk_mq_sched_mark_restart_hctx() in
dd_dispatch_request() to make sure that write request dispatching does
not stall when all target zones are locked. This fix left a subtle race
when a write completion happens during a dispatch execution on another
CPU:
CPU 0: Dispatch request                CPU 1: Write request completion

dd_dispatch_request()
  lock(&dd->lock);
  ...
  lock(&dd->zone_lock);                dd_finish_request()
  rq = find request                      lock(&dd->zone_lock);
  unlock(&dd->zone_lock);
                                         zone write unlock
                                         unlock(&dd->zone_lock);
                                         ...
                                         __blk_mq_free_request
                                           check restart flag (not set)
                                           -> queue not run
  ...
  if (!rq && have writes)
    blk_mq_sched_mark_restart_hctx()
  unlock(&dd->lock)
Since the dispatch context finishes after the write request completion
handling, the queue restart mark is not seen by __blk_mq_free_request(),
blk_mq_sched_restart() is not executed, and dispatching stalls under
100% write workloads.
Fix this by moving the call to blk_mq_sched_mark_restart_hctx() from
dd_dispatch_request() into dd_finish_request() under the zone lock to
ensure full mutual exclusion between write request dispatch selection
and zone unlock on write request completion.
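A hedged sketch of the resulting dd_finish_request() (the zone-lock fields come from the description above; the exact body is an assumption):

    static void dd_finish_request(struct request *rq)
    {
        struct deadline_data *dd = rq->q->elevator->elevator_data;

        if (blk_queue_is_zoned(rq->q)) {
            unsigned long flags;

            spin_lock_irqsave(&dd->zone_lock, flags);
            blk_req_zone_write_unlock(rq);
            /* marking now happens under zone_lock, so a concurrent
             * dispatch cannot complete without seeing it */
            if (!list_empty(&dd->fifo_list[WRITE]))
                blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
            spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
    }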
Fixes: 7211aef86f79 ("block: mq-deadline: Fix write completion handling")
Cc: [email protected]
Reported-by: Hans Holmberg <[email protected]>
Reviewed-by: Hans Holmberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
page->mapping may encode different values, and page_mapping() should
always be used to access the mapping pointer.
The track_foreign_dirty tracepoint was incorrectly accessing
page->mapping directly. Use page_mapping() instead. Also, add NULL
checks while at it.
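A sketch of the safe access pattern in the tracepoint assignment (the entry field name is an assumption):

    struct address_space *mapping = page_mapping(page);    /* not page->mapping */
    struct inode *inode = mapping ? mapping->host : NULL;

    __entry->ino = inode ? inode->i_ino : 0;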
Fixes: 3a8e9ac89e6a ("writeback: add tracepoints for cgroup foreign writebacks")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull NVMe changes from Sagi:
"The nvme updates include:
- ana log parse fix from Anton
- nvme quirks support for Apple devices from Ben
- fix missing bio completion tracing for multipath stack devices from
Hannes and Mikhail
- IP TOS settings for nvme rdma and tcp transports from Israel
- rq_dma_dir cleanups from Israel
- tracing for Get LBA Status command from Minwoo
- Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
- Some consolidation between the fabrics transports for handling the CAP
register
- reset race with ns scanning fix for fabrics (move fabrics commands to
a dedicated request queue with a different lifetime from the admin
request queue)."
* 'nvme-5.4' of git://git.infradead.org/nvme: (30 commits)
nvme-rdma: Use rq_dma_dir macro
nvme-fc: Use rq_dma_dir macro
nvme-pci: Tidy up nvme_unmap_data
nvme: make fabrics command run on a separate request queue
nvme-pci: Support shared tags across queues for Apple 2018 controllers
nvme-pci: Add support for Apple 2018+ models
nvme-pci: Add support for variable IO SQ element size
nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros
nvme: trace bio completion
nvme-multipath: fix ana log nsid lookup when nsid is not found
nvmet-tcp: Add TOS for tcp transport
nvme-tcp: Add TOS for tcp transport
nvme-tcp: Use struct nvme_ctrl directly
nvme-rdma: Add TOS for rdma transport
nvme-fabrics: Add type of service (TOS) configuration
nvmet-tcp: fix possible memory leak
nvmet-tcp: fix possible NULL deref
nvmet: trace: parse Get LBA Status command in detail
nvme: trace: parse Get LBA Status command in detail
nvme: trace: support for Get LBA Status opcode parsed
...
|
|
cgroup foreign inode handling has quite a bit of heuristics and
internal states which sometimes makes it difficult to understand
what's going on. Add tracepoints to improve visibility.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
ioc_cpd_alloc() forgot to check NULL return from kzalloc(). Add it.
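A sketch (the surrounding lines are assumptions):

    static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
    {
        struct ioc_cgrp *iocc;

        iocc = kzalloc(sizeof(*iocc), gfp);
        if (!iocc)
            return NULL;    /* the missing check */

        iocc->dfl_weight = CGROUP_WEIGHT_DFL;
        return &iocc->cpd;
    }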
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Remove code duplication.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Remove code duplication.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Remove pointless local variable and use rq_dma_dir macro.
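For illustration, the before/after pattern looks roughly like:

    /* before */
    enum dma_data_direction dma_dir = rq_data_dir(req) ?
                                      DMA_TO_DEVICE : DMA_FROM_DEVICE;
    dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);

    /* after: no local variable needed */
    dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));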
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We have a fundamental issue: fabrics commands use the admin_q.
The reason is that admin-connect, register reads and writes, and
admin commands cannot be guaranteed ordering while we are running
controller resets.
For example, when we reset a controller we perform:
1. disable the controller
2. teardown the admin queue
3. re-establish the admin queue
4. enable the controller
In order to perform (3), we need to unquiesce the admin queue; however,
we may have admin commands already pending on the quiesced admin_q that
will execute immediately when we unquiesce it, before we execute (4).
The host must not send admin commands to the controller before enabling
the controller.
To fix this, we have the fabric commands (admin connect and property
get/set, but not I/O queue connect) use a separate fabrics_q and make
sure to quiesce the admin_q before we disable the controller, and
unquiesce it only after we enable the controller.
This fixes the error prints from nvmet in a controller reset storm test:
kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0
which indicates that the host is sending an admin command while the
controller is not enabled.
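A hedged sketch of the resulting reset ordering (queue names follow the description; exact call sites vary per transport):

    blk_mq_quiesce_queue(ctrl->admin_q);    /* before disabling: no admin cmds in flight */
    nvme_disable_ctrl(ctrl);                /* steps 1 + 2: teardown */
    /* ... re-establish queues; admin-connect and property get/set
     * now go through ctrl->fabrics_q, not ctrl->admin_q ... */
    nvme_enable_ctrl(ctrl);                 /* step 4 */
    blk_mq_unquiesce_queue(ctrl->admin_q);  /* only now may admin cmds run */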
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Another issue with the Apple T2-based 2018 controllers seems to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the admin queue.
My suspicion is that they use our tags for their internal tracking
and don't mix them with the queue id. They also seem to dislike tags
going beyond the IO queue depth, i.e. 128 tags.
This adds a quirk that marks tags 0..31 of the IO queue reserved.
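A hedged sketch of how such a quirk might be wired up (the quirk and constant names are assumptions):

    /* reserve the admin-sized tag range on the IO tagset so IO tags
     * start above it and never collide with admin tags */
    if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
        dev->tagset.reserved_tags = NVME_AQ_DEPTH;    /* 32 */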
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Acked-by: Keith Busch <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Based on reverse engineering and an original patch by
Paul Pawlowski <[email protected]>.
This adds support for Apple's weird implementation of NVMe in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and for the fact that only interrupt vector 0
appears to function properly.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Per the spec, the size of a submission queue element should always be
6 (a power-of-two exponent, i.e. 64 bytes).
However some controllers such as Apple's are not properly implementing
the standard and require a different size.
This provides the ground work for the subsequent quirks for these
controllers.
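A hedged sketch of the groundwork (the field and quirk names are assumptions):

    /* per-controller shift for the SQ entry size: 6 (64 bytes) per
     * spec, 7 (128 bytes) on the quirky controllers */
    dev->io_sqes = NVME_NVM_IOSQES;
    if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES)
        dev->io_sqes = 7;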
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This will make it easier to handle variable queue entry sizes
later. No functional change.
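For illustration, the macro change looks roughly like (a sketch):

    /* before: takes a depth */
    #define SQ_SIZE(depth)  ((depth) * sizeof(struct nvme_command))

    /* after: takes the queue, so entry size can become per-queue later */
    #define SQ_SIZE(q)      ((q)->q_depth * sizeof(struct nvme_command))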
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
When native multipathing is enabled we cannot enable blktrace for
the underlying paths, so completions are never traced.
Signed-off-by: Hannes Reinecke <[email protected]>
[fixed-up by Mikhail for non-multipath-build]
Signed-off-by: Mikhail Skorzhinskii <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
ANA log parsing invokes nvme_update_ana_state() per ANA group desc.
This updates the state of namespaces with nsids in desc->nsids[].
Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid.
Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces:
- if current namespace matches the current desc->nsids[n],
this namespace is updated, and n is incremented.
- the walk stops when it reaches the end of either
ctrl->namespaces or desc->nsids[]
In case desc->nsids[n] does not match any of ctrl->namespaces,
the remaining nsids following desc->nsids[n] will not be updated.
Such a situation was considered abnormal and generated a WARN_ON_ONCE.
However ANA log MAY contain nsids not (yet) found in ctrl->namespaces.
For example, let's consider the following scenario:
- nvme0 exposes namespaces with nsids = [2, 3] to the host
- a new namespace nsid = 1 is added dynamically
- also, an ANA topology change is triggered
- an NS_CHANGED AEN is generated and triggers scan_work
- before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA
AEN is issued and ana_work receives an ANA log with nsids=[1, 2, 3]
Result: ana_work fails to update ANA state on existing namespaces [2, 3]
Solution:
Change the way nvme_update_ana_state() namespace list walk
checks the current namespace against desc->nsids[n] as follows:
a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces.
b) ns->head->ns_id == desc->nsids[n]: match, update the namespace
c) ns->head->ns_id > desc->nsids[n]: skip to desc->nsids[n+1]
This enables correct operation in the scenario described above.
This also allows ANA log to contain nsids currently invisible
to the host, i.e. inactive nsids.
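A sketch of the corrected walk (cases a/b/c above; nr_nsids comes from the group descriptor):

    n = 0;
    list_for_each_entry(ns, &ctrl->namespaces, list) {
        /* (c) nsids with no matching namespace yet: skip them */
        while (n < nr_nsids &&
               ns->head->ns_id > le32_to_cpu(desc->nsids[n]))
            n++;
        if (n == nr_nsids)
            break;
        if (ns->head->ns_id == le32_to_cpu(desc->nsids[n])) {
            nvme_update_ns_ana_state(desc, ns);    /* (b) match */
            n++;
        }
        /* (a) ns_id < nsids[n]: just advance to the next namespace */
    }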
Signed-off-by: Anton Eidelman <[email protected]>
Reviewed-by: James Smart <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Set the outgoing packets type of service (TOS) according to the
receiving TOS.
Signed-off-by: Israel Rukshin <[email protected]>
Suggested-by: Sagi Grimberg <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
TOS provides clients the ability to segregate traffic flows for
different types of data.
One use of TOS is bandwidth management, which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.
usage examples:
nvme connect --tos=0 --transport=tcp --traddr=10.0.1.1 --nqn=test-nvme
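For illustration, a hedged sketch of applying the option on the queue's socket (5.4-era kernel_setsockopt(); the call site and field names are assumptions):

    if (nctrl->opts->tos >= 0) {
        int tos = nctrl->opts->tos;

        ret = kernel_setsockopt(queue->sock, SOL_IP, IP_TOS,
                                (char *)&tos, sizeof(tos));
    }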
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This patch doesn't change any functionality.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
For RDMA transports, TOS is an extension of IB QoS that provides clients
the ability to segregate traffic flows for different types of data.
RDMA CM abstracts it for ULPs using rdma_set_service_type().
Internally, each traffic flow is represented by a connection with all of
its independent resources, like a normal connection, and is
differentiated by service type. In other words, there can be multiple QP
connections between an IP pair, each supporting a unique service type.
One use of TOS is bandwidth management, which allows setting bandwidth
limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A
and 20% to controllers at QoS class B.
Note: in addition to the TOS configuration, QoS must be configured on the
relevant HCA on the target (which sends RDMA commands) and on the
initiator in order to take effect on the traffic.
usage examples:
nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme
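A hedged sketch of the RDMA side (rdma_set_service_type() is the RDMA CM helper named above; the surrounding names are assumptions):

    if (ctrl->ctrl.opts->tos >= 0)
        rdma_set_service_type(queue->cm_id, ctrl->ctrl.opts->tos);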
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
TOS is user-defined and needs to be configured via nvme-cli.
It must be set before initiating any traffic and once set the TOS
cannot be changed.
Signed-off-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
When we uninit a command in the error flow, we also need to
free the iovec if it was allocated.
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We must only call sgl_free() for an sgl that we actually
allocated.
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Four different fields are carried in the CDWs of the Get LBA Status
command, so it is useful to see them in detail when tracing on the
target side as well.
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Four different fields are carried in the CDWs of the Get LBA Status
command, so it is useful to see them in detail when tracing.
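A hedged sketch of the decoder (the four fields per the command definition; the exact offsets are assumptions):

    /* SLBA, MNDW, RL, ATYPE decoded from the command dwords */
    u64 slba  = get_unaligned_le64(cdw10);
    u32 mndw  = get_unaligned_le32(cdw10 + 8);
    u16 rl    = get_unaligned_le16(cdw10 + 12);
    u8  atype = cdw10[15];

    trace_seq_printf(p, "slba=%llu, mndw=%u, rl=%u, atype=%u",
                     slba, mndw, rl, atype);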
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
This patch adds the Get LBA Status command's opcode to the macro used by
the trace feature. Now we can see "get_lba_status" instead of the raw
opcode value.
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
NVMe 1.4 added Get LBA Status command with opcode 0x86.
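For reference, the enum entry amounts to roughly:

    /* in include/linux/nvme.h, enum nvme_opcode */
    nvme_cmd_get_lba_status = 0x86,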
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
In NVMe spec 1.3 there is a definition for the data write/read counters
in the SMART log (see section 5.14.1.2):
This value is reported in thousands (i.e., a value of 1
corresponds to 1000 units of 512 bytes read) and is rounded up.
However, the nvme target reports the value in actual units,
not in thousands of units as the spec requires.
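A hedged sketch of the conversion (the bdev and stat field names are assumptions):

    /* report thousands of 512-byte units, rounded up */
    data_units_read =
        DIV_ROUND_UP(part_stat_read(ns->bdev->bd_part, sectors[READ]), 1000);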
Signed-off-by: Tom Wu <[email protected]>
Reviewed-by: Israel Rukshin <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Simple polling support via the socket busy_poll interface.
Although we do not shut down interrupts and simply hammer
the socket poll, we can sometimes find completions faster
than through the normal interrupt-driven RX path.
We add a per-queue nr_cqe counter that resets every time
the RX path is invoked, so that the .poll callback can return
it and stay consistent with the semantics.
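A hedged sketch of the .poll callback (the names are assumptions):

    static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx)
    {
        struct nvme_tcp_queue *queue = hctx->driver_data;
        struct sock *sk = queue->sock->sk;

        if (sk_can_busy_loop(sk) && skb_queue_empty(&sk->sk_receive_queue))
            sk_busy_loop(sk, true);    /* hammer the socket poll */
        nvme_tcp_try_recv(queue);
        return queue->nr_cqe;          /* completions seen this invocation */
    }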
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
The tcp host module now uses these APIs from crypto ahash:
(1) crypto_ahash_final()
(2) crypto_ahash_digest()
(3) crypto_alloc_ahash()
nvme-tcp should therefore depend on CRYPTO_CRC32C.
Cc: Christoph Hellwig <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Jens Axboe <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Minwoo Im <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
All callers pass ctrl->cap, so there is no need to pass it
at all.
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
nvme_enable_ctrl reads the cap register right afterwards, so
there is no need to do that locally in the transport driver.
Move the sqsize setting into nvme_init_identify.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Align with what the rest of the transports are doing.
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
No need to use a stack cap variable.
Reviewed-by: Minwoo Im <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
Using the socket-specific read_sock() call instead of calling
tcp_read_sock() directly allows any handlers registered by an LLD module
to be called from the nvme-tcp host.
This patch therefore replaces tcp_read_sock() with the socket-specific
prot_ops.
Signed-off-by: Potnuri Bharat Teja <[email protected]>
Acked-by: Sagi Grimberg <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
We can return directly from the switch statement.
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
|
|
blk_iocost_init() forgot to free its percpu stat on the error path.
Fix it.
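The error path then becomes, roughly (the label name is an assumption):

    err_free:
        free_percpu(ioc->pcpu_stat);    /* previously leaked */
        kfree(ioc);
        return ret;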
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Reported-by: Hillf Danton <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Stephen Rothwell <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add a script which can be used to generate device-specific iocost
linear model coefficients.
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Instead of mucking with debugfs and ->pd_stat(), add a drgn-based
monitoring script.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Omar Sandoval <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patchset implements a work-conserving proportional IO controller
based on an IO cost model.
While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others. In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.
One challenge of controlling IO resources is the lack of trivially
observable cost metric. The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern. However, the cost isn't a complete mystery. Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.
The function which determines the cost of a given IO is the IO cost
model for the device. This controller distributes IO capacity based
on the costs estimated by such a model. The more accurate the cost
model, the better, but the controller adapts based on IO completion
latency, and as long as the relative costs across different IO
patterns are consistent and sensible, it will adapt to the actual
performance of the device.
Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well. All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.
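To make the linear model concrete, a hedged sketch (the coefficient names are illustrative, not the driver's):

    /* cost = per-IO base + per-page cost, plus a penalty for random IO;
     * absolute numbers don't matter, only relative costs do */
    u64 iocost(int rw, bool seq, u64 bytes)
    {
        u64 cost = coef_base[rw];

        cost += coef_page[rw] * (bytes >> PAGE_SHIFT);
        if (!seq)
            cost += coef_rand[rw];
        return cost;
    }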
Please see the top comment in blk-iocost.c and documentation for
more details.
v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
for a divide-by-zero bug in current_hweight() triggered by zero
inuse_sum.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Andy Newell <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|