| Age | Commit message (Collapse) | Author | Files | Lines |
|
Otherwise, once the parent snapshot is removed, the clone's snapshot
wouldn't reflect the state of the clone prior to whole-object write or
zeroout because a deep-copyup was never done ("rbd flatten" wouldn't do
it because the modified object would exist in HEAD).
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
This is the core of deep-flatten feature: sending a copyup request
(i.e. a guarded write of the data read from the parent) with an empty
snapshot context (snaps = [], seq = 0) causes the OSD to reflect the
write in all existing snapshots. This allows "rbd flatten" to fully
disconnect the clone image and its snapshots from the parent and make
the parent snapshot removable.
The actual modification request is sent only after deep-copyup request
is completed. Waiting for deep-copyup reply is unnecessary, this will
be improved in the future.
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
In preparation for deep-flatten feature, split rbd_obj_issue_copyup()
into two functions and add a new write state to make the state machine
slightly more clear. Make the copyup op optional and start using that
for when the overlap goes to 0.
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
In preparation for deep-flatten feature, stop copying num_osd_ops from
the original request in rbd_obj_issue_copyup(). Split the calculation
into count_{write,zeroout}_ops() respectively and determine whether the
assert_exists guard is needed with the new rbd_obj_copyup_enabled().
As a nice side effect, we no longer guard in the writefull case as the
copyup'ed object is always fully overwritten.
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Allow passing a custom snapshot context: NULL for read and an empty
snapshot context for deep-copyup.
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Otherwise the assert in rbd_obj_end_request() is triggered.
Fixes: 3da691bf4366 ("rbd: new request handling code")
Signed-off-by: Ilya Dryomov <[email protected]>
|
|
Support for kernel layering hasn't been considered experimental for
a few years now. All the issues that I'm aware of were shaken out in
2014 and early 2015. Moreover, most of that code was rewritten with
the addition of support for fancy striping.
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jason Dillaman <[email protected]>
|
|
If, after rounding off, the discard request is smaller than alloc_size,
drop it on the floor in __rbd_img_fill_request().
Default alloc_size to 64k. This should cover both HDD and SSD based
bluestore OSDs and somewhat improve things for filestore. For OSDs on
filestore with filestore_punch_hole = false, alloc_size is best set to
object size in order to allow deletes and truncates and disallow zero
op.
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jason Dillaman <[email protected]>
|
|
With discard_zeroes_data gone in commit 48920ff2a5a9 ("block: remove
the discard_zeroes_data flag"), continuing to provide this guarantee is
pointless: applications can't query it and discards can only be used
for deallocating.
Add OBJ_OP_ZEROOUT and move the existing logic under it. As the first
step to divorcing OBJ_OP_DISCARD, stop worrying about copyups but keep
special casing whole-object layered discards.
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jason Dillaman <[email protected]>
|
|
It is effectively unused.
Signed-off-by: Ilya Dryomov <[email protected]>
Reviewed-by: Jason Dillaman <[email protected]>
|
|
genlmsg_reply can fail, so propagate its return code
Signed-off-by: Li RongQing <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/block/floppy.c: In function 'request_done':
drivers/block/floppy.c:2233:24: warning:
variable 'q' set but not used [-Wunused-but-set-variable]
It's never used and can be removed.
Acked-by: Jiri Kosina <[email protected]>
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
null_handle_bio() erroneously uses the bio_op macro
which masks respective request flag bits including REQ_FUA
out thus failing the check.
Fix by checking bio->bi_opf directly.
Signed-off-by: Heinz Mauelshagen <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
'ring-page-order' set by malicious blkfront
The xenstore 'ring-page-order' is used globally for each blkback queue and
therefore should be read from xenstore only once. However, it is obtained
in read_per_ring_refs() which might be called multiple times during the
initialization of each blkback queue.
If the blkfront is malicious and the 'ring-page-order' is set in different
value by blkfront every time before blkback reads it, this may end up at
the "WARN_ON(i != (XEN_BLKIF_REQS_PER_PAGE * blkif->nr_ring_pages));" in
xen_blkif_disconnect() when frontend is destroyed.
This patch reworks connect_ring() to read xenstore 'ring-page-order' only
once.
Signed-off-by: Dongli Zhang <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
Commit 0da03cab87e6
("loop: Fix deadlock when calling blkdev_reread_part()") moves
blkdev_reread_part() out of the loop_ctl_mutex. However,
GENHD_FL_NO_PART_SCAN is set before __blkdev_reread_part(). As a result,
__blkdev_reread_part() will fail the check of GENHD_FL_NO_PART_SCAN and
will not rescan the loop device to delete all partitions.
Below are steps to reproduce the issue:
step1 # dd if=/dev/zero of=tmp.raw bs=1M count=100
step2 # losetup -P /dev/loop0 tmp.raw
step3 # parted /dev/loop0 mklabel gpt
step4 # parted -a none -s /dev/loop0 mkpart primary 64s 1
step5 # losetup -d /dev/loop0
Step5 will not be able to delete /dev/loop0p1 (introduced by step4) and
there is below kernel warning message:
[ 464.414043] __loop_clr_fd: partition scan of loop0 failed (rc=-22)
This patch sets GENHD_FL_NO_PART_SCAN after blkdev_reread_part().
Fixes: 0da03cab87e6 ("loop: Fix deadlock when calling blkdev_reread_part()")
Signed-off-by: Dongli Zhang <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Do not print warn message when the partition scan returns 0.
Fixes: d57f3374ba48 ("loop: Move special partition reread handling in loop_clr_fd()")
Signed-off-by: Dongli Zhang <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
iov_iter is implemented on bvec itererator helpers, so it is safe to pass
multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Omar Sandoval <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
floppy_check_events() is supposed to return bit flags to say which
events occured. We should return zero to say that no event flags are
set. Only BIT(0) and BIT(1) are used in the caller. And .check_events
interface also expect to return an unsigned int value.
However, after commit a0c80efe5956, it may return -EINTR (-4u).
Here, both BIT(0) and BIT(1) are cleared. So this patch shouldn't
affect runtime, but it obviously is still worth fixing.
Reviewed-by: Dan Carpenter <[email protected]>
Fixes: a0c80efe5956 ("floppy: fix lock_fdc() signal handling")
Signed-off-by: Yufen Yu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We need the debugfs fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
We have various helpers for setting/clearing this flag, and also
a helper to check if the queue supports queueable flushes or not.
But nobody uses them anymore, kill it with fire.
Signed-off-by: Jens Axboe <[email protected]>
|
|
The mtip32xx driver uses managed resources for DMA coherent memory
and irqs, but then always pairs them with free calls anyway, making
the resource tracking rather pointless. Given some DMA allocations
are transient anyway, the irq freeing seems to require ordering vs
other hardware access the best solution seems to be to stop using
the managed resource API entirely.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
We are trying to get rid of BUS_ATTR() and the usage of that in rbd.c
can be trivially converted to use BUS_ATTR_WO and RO, so use those
macros instead.
Cc: Sage Weil <[email protected]>
Cc: Alex Elder <[email protected]>
Cc: Jens Axboe <[email protected]>
Acked-by: Ilya Dryomov <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Pull block fixes from Jens Axboe:
- block size setting fixes for loop/nbd (Jan Kara)
- md bio_alloc_mddev() cleanup (Marcos)
- Ensure we don't lose the REQ_INTEGRITY flag (Ming)
- Two NVMe fixes by way of Christoph:
- Fix NVMe IRQ calculation (Ming)
- Uninitialized variable in nvmet-tcp (Sagi)
- BFQ comment fix (Paolo)
- License cleanup for recently added blk-mq-debugfs-zoned (Thomas)
* tag 'for-linus-20190118' of git://git.kernel.dk/linux-block:
block: Cleanup license notice
nvme-pci: fix nvme_setup_irqs()
nvmet-tcp: fix uninitialized variable access
block: don't lose track of REQ_INTEGRITY flag
blockdev: Fix livelocks on loop device
nbd: Use set_blocksize() to set device blocksize
md: Make bio_alloc_mddev use bio_alloc_bioset
block, bfq: fix comments on __bfq_deactivate_entity
|
|
As 'be->blkif' is used for many times in connect_ring(), the stack variable
'blkif' is added to substitute 'be-blkif'.
Suggested-by: Paul Durrant <[email protected]>
Signed-off-by: Dongli Zhang <[email protected]>
Reviewed-by: Paul Durrant <[email protected]>
Reviewed-by: Roger Pau Monné <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
|
|
NBD can update block device block size implicitely through
bd_set_size(). Make it explicitely set blocksize with set_blocksize() as
this behavior of bd_set_size() is going away.
CC: Josef Bacik <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph, with little fixes all over the map
- Loop caching fix for offset/bs change (Jaegeuk Kim)
- Block documentation tweaks (Jeff, Jon, Weiping, John)
- null_blk zoned tweak (John)
- ahch mvebu suspend/resume support. Should have gone into the merge
window, but there was some confusion on which tree had it. (Miquel)
* tag 'for-linus-20190112' of git://git.kernel.dk/linux-block: (22 commits)
ata: ahci: mvebu: request PHY suspend/resume for Armada 3700
ata: ahci: mvebu: add Armada 3700 initialization needed for S2RAM
ata: ahci: mvebu: do Armada 38x configuration only on relevant SoCs
ata: ahci: mvebu: remove stale comment
ata: libahci_platform: comply to PHY framework
loop: drop caches if offset or block_size are changed
block: fix kerneldoc comment for blk_attempt_plug_merge()
nvme: don't initlialize ctrl->cntlid twice
nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN
nvme: pad fake subsys NQN vid and ssvid with zeros
nvme-multipath: zero out ANA log buffer
nvme-fabrics: unset write/poll queues for discovery controllers
nvme-tcp: don't ask if controller is fabrics
nvme-tcp: remove dead code
nvme-pci: fix out of bounds access in nvme_cqe_pending
nvme-pci: rerun irq setup on IO queue init errors
nvme-pci: use the same attributes when freeing host_mem_desc_bufs.
nvme-pci: fix the wrong setting of nr_maps
block: doc: add slice_idle_us to bfq documentation
block: clarify documentation for blk_{start|finish}_plug
...
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma_zalloc_coherent() removal from Christoph Hellwig:
"We've always had a weird situation around dma_zalloc_coherent. To
safely support mapping the allocations to userspace major
architectures like x86 and arm have always zeroed allocations from
dma_alloc_coherent, but a couple other architectures were missing that
zeroing either always or in corner cases.
Then later we grew anothe dma_zalloc_coherent interface to explicitly
request zeroing, but that just added __GFP_ZERO to the allocation
flags, which for some allocators that didn't end up using the page
allocator ended up being a no-op and still not zeroing the
allocations.
So for this merge window I fixed up all remaining architectures to
zero the memory in dma_alloc_coherent, and made dma_zalloc_coherent a
no-op wrapper around dma_alloc_coherent, which fixes all of the above
issues.
dma_zalloc_coherent is now pointless and can go away, and Luis helped
me writing a cocchinelle script and patch series to kill it, which I
think we should apply now just after -rc1 to finally settle these
issue"
* tag 'remove-dma_zalloc_coherent-5.0' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: remove dma_zalloc_coherent()
cross-tree: phase out dma_zalloc_coherent() on headers
cross-tree: phase out dma_zalloc_coherent()
|
|
Pull ceph updates from Ilya Dryomov:
"A patch to allow setting abort_on_full and a fix for an old "rbd
unmap" edge case, marked for stable"
* tag 'ceph-for-5.0-rc2' of git://github.com/ceph/ceph-client:
rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set
ceph: use vmf_error() in ceph_filemap_fault()
libceph: allow setting abort_on_full for rbd
|
|
There is a window between when RBD_DEV_FLAG_REMOVING is set and when
the device is removed from rbd_dev_list. During this window, we set
"already" and return 0.
Returning 0 from write(2) can confuse userspace tools because
0 indicates that nothing was written. In particular, "rbd unmap"
will retry the write multiple times a second:
10:28:05.463299 write(4, "0", 1) = 0
10:28:05.463509 write(4, "0", 1) = 0
10:28:05.463720 write(4, "0", 1) = 0
10:28:05.463942 write(4, "0", 1) = 0
10:28:05.464155 write(4, "0", 1) = 0
Cc: [email protected]
Signed-off-by: Ilya Dryomov <[email protected]>
Tested-by: Dongsheng Yang <[email protected]>
|
|
If we don't drop caches used in old offset or block_size, we can get old data
from new offset/block_size, which gives unexpected data to user.
For example, Martijn found a loopback bug in the below scenario.
1) LOOP_SET_FD loads first two pages on loop file
2) LOOP_SET_STATUS64 changes the offset on the loop file
3) mount is failed due to the cached pages having wrong superblock
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Reported-by: Martijn Coenen <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
This patch includes some fixes and cleanup for idle-page writeback.
1. writeback_limit interface
Now writeback_limit interface is rather conusing. For example, once
writeback limit budget is exausted, admin can see 0 from
/sys/block/zramX/writeback_limit which is same semantic with disable
writeback_limit at this moment. IOW, admin cannot tell that zero came
from disable writeback limit or exausted writeback limit.
To make the interface clear, let's sepatate enable of writeback limit to
another knob - /sys/block/zram0/writeback_limit_enable
* before:
while true :
# to re-enable writeback limit once previous one is used up
echo 0 > /sys/block/zram0/writeback_limit
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget
* new
# To enable writeback limit, from the beginning, admin should
# enable it.
echo $((200<<20)) > /sys/block/zram0/writeback_limit
echo 1 > /sys/block/zram/0/writeback_limit_enable
while true :
echo $((200<<20)) > /sys/block/zram0/writeback_limit
..
.. # used up the writeback limit budget
It's much strightforward.
2. fix condition check idle/huge writeback mode check
The mode in writeback_store is not bit opeartion any more so no need to
use bit operations. Furthermore, current condition check is broken in
that it does writeback every pages regardless of huge/idle.
3. clean up idle_store
No need to use goto.
[[email protected]: missed spin_lock_init]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Suggested-by: John Dias <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: John Dias <[email protected]>
Cc: Srinivas Paladugu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
We already need to zero out memory for dma_alloc_coherent(), as such
using dma_zalloc_coherent() is superflous. Phase it out.
This change was generated with the following Coccinelle SmPL patch:
@ replace_dma_zalloc_coherent @
expression dev, size, data, handle, flags;
@@
-dma_zalloc_coherent(dev, size, handle, flags)
+dma_alloc_coherent(dev, size, handle, flags)
Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
[hch: re-ran the script on the latest tree]
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
If the kernel is built without CONFIG_BLK_DEV_ZONED, a modprobe
of the null_blk driver with zoned=1 fails with 'Invalid argument'.
This can be confusing to users, prompting a search as to why the
parameter is invalid. To assist in that search, add a bit more
information to the failure, additionally adding to the documentation
that CONFIG_BLK_DEV_ZONED is needed for zoned=1.
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: John Pittman <[email protected]>
Added null_blk prefix to error message.
Signed-off-by: Jens Axboe <[email protected]>
|
|
vdc_blk_queue_start() may be called from irq context, so we can't run
queue via blk_mq_start_hw_queues() since we never allow to run queue
from irq context. Use blk_mq_start_stopped_hw_queues(q, true) to fix
this issue.
Fixes: fa182a1fa97dff56cd ("sunvdc: convert to blk-mq")
Reported-by: Anatoly Pugachev <[email protected]>
Tested-by: Anatoly Pugachev <[email protected]>
Cc: Anatoly Pugachev <[email protected]>
Cc: [email protected]
Acked-by: David S. Miller <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Pull virtio/vhost updates from Michael Tsirkin:
"Features, fixes, cleanups:
- discard in virtio blk
- misc fixes and cleanups"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct the related warning message
vhost: split structs into a separate header file
virtio: remove deprecated VIRTIO_PCI_CONFIG()
vhost/vsock: switch to a mutex for vhost_vsock_hash
virtio_blk: add discard and write zeroes support
|
|
Pull more block updates from Jens Axboe:
- Dead code removal for loop/sunvdc (Chengguang)
- Mark BIDI support for bsg as deprecated, logging a single dmesg
warning if anyone is actually using it (Christoph)
- blkcg cleanup, killing a dead function and making the tryget_closest
variant easier to read (Dennis)
- Floppy fixes, one fixing a regression in swim3 (Finn)
- lightnvm use-after-free fix (Gustavo)
- gdrom leak fix (Wenwen)
- a set of drbd updates (Lars, Luc, Nathan, Roland)
* tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-block: (28 commits)
block/swim3: Fix regression on PowerBook G3
block/swim3: Fix -EBUSY error when re-opening device after unmount
block/swim3: Remove dead return statement
block/amiflop: Don't log error message on invalid ioctl
gdrom: fix a memory leak bug
lightnvm: pblk: fix use-after-free bug
block: sunvdc: remove redundant code
block: loop: remove redundant code
bsg: deprecate BIDI support in bsg
blkcg: remove unused __blkg_release_rcu()
blkcg: clean up blkg_tryget_closest()
drbd: Change drbd_request_detach_interruptible's return type to int
drbd: Avoid Clang warning about pointless switch statment
drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")
drbd: skip spurious timeout (ping-timeo) when failing promote
drbd: don't retry connection if peers do not agree on "authentication" settings
drbd: fix print_st_err()'s prototype to match the definition
drbd: avoid spurious self-outdating with concurrent disconnect / down
drbd: do not block when adjusting "disk-options" while IO is frozen
drbd: fix comment typos
...
|
|
As of v4.20, the swim3 driver crashes when loaded on a PowerBook G3
(Wallstreet).
MacIO PCI driver attached to Gatwick chipset
MacIO PCI driver attached to Heathrow chipset
swim3 0.00015000:floppy: [fd0] SWIM3 floppy controller in media bay
0.00013020:ch-a: ttyS0 at MMIO 0xf3013020 (irq = 16, base_baud = 230400) is a Z85c30 ESCC - Serial port
0.00013000:ch-b: ttyS1 at MMIO 0xf3013000 (irq = 17, base_baud = 230400) is a Z85c30 ESCC - Infrared port
macio: fixed media-bay irq on gatwick
macio: fixed left floppy irqs
swim3 1.00015000:floppy: [fd1] Couldn't request interrupt
Unable to handle kernel paging request for data at address 0x00000024
Faulting instruction address: 0xc02652f8
Oops: Kernel access of bad area, sig: 11 [#1]
BE SMP NR_CPUS=2 PowerMac
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0 #2
NIP: c02652f8 LR: c026915c CTR: c0276d1c
REGS: df43ba10 TRAP: 0300 Not tainted (4.20.0)
MSR: 00009032 <EE,ME,IR,DR,RI> CR: 28228288 XER: 00000100
DAR: 00000024 DSISR: 40000000
GPR00: c026915c df43bac0 df439060 c0731524 df494700 00000000 c06e1c08 00000001
GPR08: 00000001 00000000 df5ff220 00001032 28228282 00000000 c0004ca4 00000000
GPR16: 00000000 00000000 00000000 c073144c dfffe064 c0731524 00000120 c0586108
GPR24: c073132c c073143c c073143c 00000000 c0731524 df67cd70 df494700 00000001
NIP [c02652f8] blk_mq_free_rqs+0x28/0xf8
LR [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
Call Trace:
[df43bac0] [c0045f50] flush_workqueue_prep_pwqs+0x178/0x1c4 (unreliable)
[df43bae0] [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
[df43bb00] [c02697f0] blk_mq_exit_sched+0x9c/0xb8
[df43bb20] [c0252794] elevator_exit+0x84/0xa4
[df43bb40] [c0256538] blk_exit_queue+0x30/0x50
[df43bb50] [c0256640] blk_cleanup_queue+0xe8/0x184
[df43bb70] [c034732c] swim3_attach+0x330/0x5f0
[df43bbb0] [c034fb24] macio_device_probe+0x58/0xec
[df43bbd0] [c032ba88] really_probe+0x1e4/0x2f4
[df43bc00] [c032bd28] driver_probe_device+0x64/0x204
[df43bc20] [c0329ac4] bus_for_each_drv+0x60/0xac
[df43bc50] [c032b824] __device_attach+0xe8/0x160
[df43bc80] [c032ab38] bus_probe_device+0xa0/0xbc
[df43bca0] [c0327338] device_add+0x3d8/0x630
[df43bcf0] [c0350848] macio_add_one_device+0x444/0x48c
[df43bd50] [c03509f8] macio_pci_add_devices+0x168/0x1bc
[df43bd90] [c03500ec] macio_pci_probe+0xc0/0x10c
[df43bda0] [c02ad884] pci_device_probe+0xd4/0x184
[df43bdd0] [c032ba88] really_probe+0x1e4/0x2f4
[df43be00] [c032bd28] driver_probe_device+0x64/0x204
[df43be20] [c032bfcc] __driver_attach+0x104/0x108
[df43be40] [c0329a00] bus_for_each_dev+0x64/0xb4
[df43be70] [c032add8] bus_add_driver+0x154/0x238
[df43be90] [c032ca24] driver_register+0x84/0x148
[df43bea0] [c0004aa0] do_one_initcall+0x40/0x188
[df43bf00] [c0690100] kernel_init_freeable+0x138/0x1d4
[df43bf30] [c0004cbc] kernel_init+0x18/0x10c
[df43bf40] [c00121e4] ret_from_kernel_thread+0x14/0x1c
Instruction dump:
5484d97e 4bfff4f4 9421ffe0 7c0802a6 bf410008 7c9e2378 90010024 8124005c
2f890000 419e0078 81230004 7c7c1b78 <81290024> 2f890000 419e0064 81440000
---[ end trace 12025ab921a9784c ]---
Reverting commit 8ccb8cb1892b ("swim3: convert to blk-mq") resolves the
problem.
That commit added a struct blk_mq_tag_set to struct floppy_state and
initialized it with a blk_mq_init_sq_queue() call. Unfortunately, there
is a memset() in swim3_add_device() that subsequently clears the
floppy_state struct. That means fs->tag_set->ops is a NULL pointer, and
it gets dereferenced by blk_mq_free_rqs() which gets called in the
request_irq() error path. Move the memset() to fix this bug.
BTW, the request_irq() failure for the left mediabay floppy (fd1) is not
a regression. I don't know why it happens. The right media bay floppy
(fd0) works fine however.
Reported-and-tested-by: Stan Johnson <[email protected]>
Fixes: 8ccb8cb1892b ("swim3: convert to blk-mq")
Cc: [email protected]
Signed-off-by: Finn Thain <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
When the block device is opened with FMODE_EXCL, ref_count is set to -1.
This value doesn't get reset when the device is closed which means the
device cannot be opened again. Fix this by checking for refcount <= 0
in the release method.
Reported-and-tested-by: Stan Johnson <[email protected]>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: [email protected]
Signed-off-by: Finn Thain <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Cc: [email protected]
Signed-off-by: Finn Thain <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Cc: [email protected]
Signed-off-by: Finn Thain <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
Merge misc updates from Andrew Morton:
- large KASAN update to use arm's "software tag-based mode"
- a few misc things
- sh updates
- ocfs2 updates
- just about all of MM
* emailed patches from Andrew Morton <[email protected]>: (167 commits)
kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
memcg, oom: notify on oom killer invocation from the charge path
mm, swap: fix swapoff with KSM pages
include/linux/gfp.h: fix typo
mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
memory_hotplug: add missing newlines to debugging output
mm: remove __hugepage_set_anon_rmap()
include/linux/vmstat.h: remove unused page state adjustment macro
mm/page_alloc.c: allow error injection
mm: migrate: drop unused argument of migrate_page_move_mapping()
blkdev: avoid migration stalls for blkdev pages
mm: migrate: provide buffer_migrate_page_norefs()
mm: migrate: move migrate_page_lock_buffers()
mm: migrate: lock buffers before migrate_page_move_mapping()
mm: migration: factor out code to compute expected number of page references
mm, page_alloc: enable pcpu_drain with zone capability
kmemleak: add config to select auto scan
mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
...
|
|
Pull block updates from Jens Axboe:
"This is the main pull request for block/storage for 4.21.
Larger than usual, it was a busy round with lots of goodies queued up.
Most notable is the removal of the old IO stack, which has been a long
time coming. No new features for a while, everything coming in this
week has all been fixes for things that were previously merged.
This contains:
- Use atomic counters instead of semaphores for mtip32xx (Arnd)
- Cleanup of the mtip32xx request setup (Christoph)
- Fix for circular locking dependency in loop (Jan, Tetsuo)
- bcache (Coly, Guoju, Shenghui)
* Optimizations for writeback caching
* Various fixes and improvements
- nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
* host and target support for NVMe over TCP
* Error log page support
* Support for separate read/write/poll queues
* Much improved polling
* discard OOM fallback
* Tracepoint improvements
- lightnvm (Hans, Hua, Igor, Matias, Javier)
* Igor added packed metadata to pblk. Now drives without metadata
per LBA can be used as well.
* Fix from Geert on uninitialized value on chunk metadata reads.
* Fixes from Hans and Javier to pblk recovery and write path.
* Fix from Hua Su to fix a race condition in the pblk recovery
code.
* Scan optimization added to pblk recovery from Zhoujie.
* Small geometry cleanup from me.
- Conversion of the last few drivers that used the legacy path to
blk-mq (me)
- Removal of legacy IO path in SCSI (me, Christoph)
- Removal of legacy IO stack and schedulers (me)
- Support for much better polling, now without interrupts at all.
blk-mq adds support for multiple queue maps, which enables us to
have a map per type. This in turn enables nvme to have separate
completion queues for polling, which can then be interrupt-less.
Also means we're ready for async polled IO, which is hopefully
coming in the next release.
- Killing of (now) unused block exports (Christoph)
- Unification of the blk-rq-qos and blk-wbt wait handling (Josef)
- Support for zoned testing with null_blk (Masato)
- sx8 conversion to per-host tag sets (Christoph)
- IO priority improvements (Damien)
- mq-deadline zoned fix (Damien)
- Ref count blkcg series (Dennis)
- Lots of blk-mq improvements and speedups (me)
- sbitmap scalability improvements (me)
- Make core inflight IO accounting per-cpu (Mikulas)
- Export timeout setting in sysfs (Weiping)
- Cleanup the direct issue path (Jianchao)
- Export blk-wbt internals in block debugfs for easier debugging
(Ming)
- Lots of other fixes and improvements"
* tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
kyber: use sbitmap add_wait_queue/list_del wait helpers
sbitmap: add helpers for add/del wait queue handling
block: save irq state in blkg_lookup_create()
dm: don't reuse bio for flushes
nvme-pci: trace SQ status on completions
nvme-rdma: implement polling queue map
nvme-fabrics: allow user to pass in nr_poll_queues
nvme-fabrics: allow nvmf_connect_io_queue to poll
nvme-core: optionally poll sync commands
block: make request_to_qc_t public
nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
nvme-tcp: fix endianess annotations
nvmet-tcp: fix endianess annotations
nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
nvme-pci: only set nr_maps to 2 if poll queues are supported
nvmet: use a macro for default error location
nvmet: fix comparison of a u16 with -1
blk-mq: enable IO poll if .nr_queues of type poll > 0
blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
blk-mq: skip zero-queue maps in blk_mq_map_swqueue
...
|
|
If there are lots of write IO with flash device, it could have a
wearout problem of storage. To overcome the problem, admin needs
to design write limitation to guarantee flash health
for entire product life.
This patch creates a new knob "writeback_limit" for zram.
writeback_limit's default value is 0 so that it doesn't limit
any writeback. If admin want to measure writeback count in a
certain period, he could know it via /sys/block/zram0/bd_stat's
3rd column.
If admin want to limit writeback as per-day 400M, he could do it
like below.
MB_SHIFT=20
4K_SHIFT=12
echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit.
If admin want to allow further write again, he could do it like below
echo 0 > /sys/block/zram0/writeback_limit
If admin want to see remaining writeback budget,
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram (e.g., system
reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback
happened until you reset the zram to allocate extra writeback budget in
next setting is user's job.
[[email protected]: v4]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Joey Pabalinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
bd_stat represents things that happened in the backing device. Currently
it supports bd_counts, bd_reads and bd_writes which are helpful to
understand wearout of flash and memory saving.
[[email protected]: v4]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: Joey Pabalinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a new feature "zram idle/huge page writeback". In the zram-swap use
case, zram usually has many idle/huge swap pages. It's pointless to keep
them in memory (ie, zram).
To solve this problem, this feature introduces idle/huge page writeback to
the backing device so the goal is to save more memory space on embedded
systems.
Normal sequence to use idle/huge page writeback feature is as follows,
while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# Unless there is no access for some blocks on zram,
# they are still IDLE marked pages.
echo "idle" > /sys/block/zram0/writeback
or/and
echo "huge" > /sys/block/zram0/writeback
# write the IDLE or/and huge marked slot into backing device
# and free the memory.
}
Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
This patch removes direct incommpressibe page writeback feature
(d2afd25114f4 ("zram: write incompressible pages to backing device")).
Below concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".
"incompressible writeback" is completely unpredictable and uncontrollable;
it depens on data patterns and compression algorithms. While "IDLE
writeback" is predictable.
I even suspect, that, *ideally*, we can remove "incompressible writeback".
"IDLE pages" is a super set which also includes "incompressible" pages.
So, technically, we still can do "incompressible writeback" from "IDLE
writeback" path; but a much more reasonable one, based on a page idling
period.
I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from flash
wearout, so I can see "incompressible writeback" path becoming a dead
code, long term.
== &< ==
Below concerns from Minchan:
== &< ==
My concern is if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage/idlepage writeck will turn on. However someuser want to
enable only idlepage writeback so we need to introduce turn on/off knob
for hugepage or new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those usecase. I
don't want to make it complicated *if possible*.
Long term, I imagine we need to make VM aware of new swap hierarchy a
little bit different with as-is. For example, first high priority swap
can return -EIO or -ENOCOMP, swap try to fallback to next lower priority
swap device. With that, hugepage writeback will work tranparently.
So we could regard it as regression because incompressible pages doesn't
go to backing storage automatically. Instead, user should do it via "echo
huge" > /sys/block/zram/writeback" manually.
== &< ==
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.
Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
User could see it by /sys/kernel/debug/zram/zram0/block_state.
300 75.033841 ...i
301 63.806904 s..i
302 63.806919 ..hi
Once there is IO for the slot, the mark will be disappeared.
300 75.033841 ...
301 63.806904 s..i
302 63.806919 ..hi
Therefore, 300th block is idle zpage. With this feature,
user can how many zram has idle pages which are waste of memory.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Rename some variables and restructure some code for better readability in
writeback and zs_free_page.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
If blkdev_get fails, we shouldn't do blkdev_put. Otherwise, kernel emits
below log. This patch fixes it.
WARNING: CPU: 0 PID: 1893 at fs/block_dev.c:1828 blkdev_put+0x105/0x120
Modules linked in:
CPU: 0 PID: 1893 Comm: swapoff Not tainted 4.19.0+ #453
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:blkdev_put+0x105/0x120
Call Trace:
__x64_sys_swapoff+0x46d/0x490
do_syscall_64+0x5a/0x190
entry_SYSCALL_64_after_hwframe+0x49/0xbe
irq event stamp: 4466
hardirqs last enabled at (4465): __free_pages_ok+0x1e3/0x490
hardirqs last disabled at (4466): trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last enabled at (3420): __do_softirq+0x333/0x446
softirqs last disabled at (3407): irq_exit+0xd1/0xe0
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Cc: <[email protected]> [4.14+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "zram idle page writeback", v3.
Inherently, swap device has many idle pages which are rare touched since
it was allocated. It is never problem if we use storage device as swap.
However, it's just waste for zram-swap.
This patchset supports zram idle page writeback feature.
* Admin can define what is idle page "no access since X time ago"
* Admin can define when zram should writeback them
* Admin can define when zram should stop writeback to prevent wearout
Details are in each patch's description.
This patch (of 7):
================================
WARNING: inconsistent lock state
4.19.0+ #390 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
zram_verify/2095 [HC0[0]:SC1[1]:HE1:SE0] takes:
00000000b1828693 (&(&zram->bitmap_lock)->rlock){+.?.}, at: put_entry_bdev+0x1e/0x50
{SOFTIRQ-ON-W} state was registered at:
_raw_spin_lock+0x2c/0x40
zram_make_request+0x755/0xdc9
generic_make_request+0x373/0x6a0
submit_bio+0x6c/0x140
__swap_writepage+0x3a8/0x480
shrink_page_list+0x1102/0x1a60
shrink_inactive_list+0x21b/0x3f0
shrink_node_memcg.constprop.99+0x4f8/0x7e0
shrink_node+0x7d/0x2f0
do_try_to_free_pages+0xe0/0x300
try_to_free_pages+0x116/0x2b0
__alloc_pages_slowpath+0x3f4/0xf80
__alloc_pages_nodemask+0x2a2/0x2f0
__handle_mm_fault+0x42e/0xb50
handle_mm_fault+0x55/0xb0
__do_page_fault+0x235/0x4b0
page_fault+0x1e/0x30
irq event stamp: 228412
hardirqs last enabled at (228412): [<ffffffff98245846>] __slab_free+0x3e6/0x600
hardirqs last disabled at (228411): [<ffffffff98245625>] __slab_free+0x1c5/0x600
softirqs last enabled at (228396): [<ffffffff98e0031e>] __do_softirq+0x31e/0x427
softirqs last disabled at (228403): [<ffffffff98072051>] irq_exit+0xd1/0xe0
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&zram->bitmap_lock)->rlock);
<Interrupt>
lock(&(&zram->bitmap_lock)->rlock);
*** DEADLOCK ***
no locks held by zram_verify/2095.
stack backtrace:
CPU: 5 PID: 2095 Comm: zram_verify Not tainted 4.19.0+ #390
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
<IRQ>
dump_stack+0x67/0x9b
print_usage_bug+0x1bd/0x1d3
mark_lock+0x4aa/0x540
__lock_acquire+0x51d/0x1300
lock_acquire+0x90/0x180
_raw_spin_lock+0x2c/0x40
put_entry_bdev+0x1e/0x50
zram_free_page+0xf6/0x110
zram_slot_free_notify+0x42/0xa0
end_swap_bio_read+0x5b/0x170
blk_update_request+0x8f/0x340
scsi_end_request+0x2c/0x1e0
scsi_io_completion+0x98/0x650
blk_done_softirq+0x9e/0xd0
__do_softirq+0xcc/0x427
irq_exit+0xd1/0xe0
do_IRQ+0x93/0x120
common_interrupt+0xf/0xf
</IRQ>
With writeback feature, zram_slot_free_notify could be called in softirq
context by end_swap_bio_read. However, bitmap_lock is not aware of that
so lockdep yell out:
get_entry_bdev
spin_lock(bitmap->lock);
irq
softirq
end_swap_bio_read
zram_slot_free_notify
zram_slot_lock <-- deadlock prone
zram_free_page
put_entry_bdev
spin_lock(bitmap->lock); <-- deadlock prone
With akpm's suggestion (i.e. bitmap operation is already atomic), we
could remove bitmap lock. It might fail to find a empty slot if serious
contention happens. However, it's not severe problem because huge page
writeback has already possiblity to fail if there is severe memory
pressure. Worst case is just keeping the incompressible in memory, not
storage.
The other problem is zram_slot_lock in zram_slot_slot_free_notify. To
make it safe is this patch introduces zram_slot_trylock where
zram_slot_free_notify uses it. Although it's rare to be contented, this
patch adds new debug stat "miss_free" to keep monitoring how often it
happens.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Joey Pabalinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|