aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2015-11-20mac: validate mac_partition is within sectorKees Cook1-3/+7
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single 512 byte sector would be read (secsize / 512). However the partition structure would be located past the end of the buffer (secsize % 512). Signed-off-by: Kees Cook <[email protected]> Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2015-11-19block: protect rw_page against device teardownDan Williams1-2/+0
Fix use after free crashes like the following: general protection fault: 0000 [#1] SMP Call Trace: [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem] [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem] [<ffffffff8128fd90>] bdev_read_page+0x50/0x60 [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50 [<ffffffff81297657>] mpage_readpages+0x107/0x170 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20 [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310 [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310 [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0 [<ffffffff811c76f6>] filemap_fault+0x396/0x530 [<ffffffff811f816e>] __do_fault+0x4e/0xf0 [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50 Cc: <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Alexander Viro <[email protected]> Reported-by: kbuild test robot <[email protected]> Acked-by: Matthew Wilcox <[email protected]> [willy: symmetry fixups] Signed-off-by: Dan Williams <[email protected]>
2015-11-16blk-flush: Queue through IO scheduler when flush not requiredJan Kara1-1/+1
Currently blk_insert_flush() just adds flush request to q->queue_head when flush is not required. That completely bypasses IO scheduler so e.g. CFQ can be idling waiting for new request to arrive and will idle through the whole window unnecessarily. Luckily this only happens in rare cases as usually checks in generic_make_request_checks() clear FLUSH and FUA flags early if they are not needed. When no flushing is actually required, we can easily fix the problem by properly queueing the request through the IO scheduler. Ideally IO scheduler should be also made aware of requests queued via blk_flush_queue_rq(). However inserting flush request through IO scheduler can have unwanted side-effects since due to flush batching delaying the flush request in IO scheduler will delay all flush requests possibly coming from other processes. So we keep adding the request directly to q->queue_head. Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-11-16elevator: use list_{first,prev,next}_entryGeliang Tang1-5/+5
To make the intention clearer, use list_{first,prev,next}_entry instead of list_entry. Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-11-11block: fix blk-core.c kernel-doc warningRandy Dunlap1-0/+3
Fix kernel-doc warning in blk-core.c: Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq' Signed-off-by: Randy Dunlap <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-11-11blk-mq: mark __blk_mq_complete_request() staticJens Axboe2-2/+1
It's no longer used outside of blk-mq core. Signed-off-by: Jens Axboe <[email protected]>
2015-11-10Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-blockLinus Torvalds4-33/+138
Pull block IO poll support from Jens Axboe: "Various groups have been doing experimentation around IO polling for (really) fast devices. The code has been reviewed and has been sitting on the side for a few releases, but this is now good enough for coordinated benchmarking and further experimentation. Currently O_DIRECT sync read/write are supported. A framework is in the works that allows scalable stats tracking so we can auto-tune this. And we'll add libaio support as well soon. Fow now, it's an opt-in feature for test purposes" * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block: direct-io: be sure to assign dio->bio_bdev for both paths directio: add block polling support NVMe: add blk polling support block: add block polling support blk-mq: return tag/queue combo in the make_request_fn handlers block: change ->make_request_fn() and users to return a queue cookie
2015-11-07block: add block polling supportJens Axboe3-0/+86
Add basic support for polling for specific IO to complete. This uses the cookie that blk-mq passes back, which enables the block layer to pass this cookie to the driver to spin for a specific request. This will be combined with request latency tracking, so we can make qualified decisions about when to poll and when not to. For now, for benchmark purposes, we add a sysfs file that controls whether polling is enabled or not. Signed-off-by: Jens Axboe <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Acked-by: Keith Busch <[email protected]>
2015-11-07blk-mq: return tag/queue combo in the make_request_fn handlersJens Axboe1-17/+28
Return a cookie, blk_qc_t, from the blk-mq make request functions, that allows a later caller to uniquely identify a specific IO. The cookie doesn't mean anything to the caller, but the caller can use it to later pass back to the block layer. The block layer can then identify the hardware queue and request from that cookie. Signed-off-by: Jens Axboe <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Acked-by: Keith Busch <[email protected]>
2015-11-07block: change ->make_request_fn() and users to return a queue cookieJens Axboe2-22/+30
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Acked-by: Keith Busch <[email protected]>
2015-11-06pidns: fix set/getpriority and ioprio_set/get in PRIO_USER modeBen Segall1-2/+4
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by: Ben Segall <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Al Viro <[email protected]> Cc: Ambrose Feinstein <[email protected]> Acked-by: "Eric W. Biederman" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-11-06mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIMMel Gorman3-6/+6
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [[email protected]: fix build] [[email protected]: coding-style fixes] Signed-off-by: Mel Gorman <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman5-26/+26
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Rik van Riel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-11-05Merge branch 'for-4.4' of ↵Linus Torvalds3-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "The cgroup core saw several significant updates this cycle: - percpu_rwsem for threadgroup locking is reinstated. This was temporarily dropped due to down_write latency issues. Oleg's rework of percpu_rwsem which is scheduled to be merged in this merge window resolves the issue. - On the v2 hierarchy, when controllers are enabled and disabled, all operations are atomic and can fail and revert cleanly. This allows ->can_attach() failure which is necessary for cpu RT slices. - Tasks now stay associated with the original cgroups after exit until released. This allows tracking resources held by zombies (e.g. pids) and makes it easy to find out where zombies came from on the v2 hierarchy. The pids controller was broken before these changes as zombies escaped the limits; unfortunately, updating this behavior required too many invasive changes and I don't think it's a good idea to backport them, so the pids controller on 4.3, the first version which included the pids controller, will stay broken at least until I'm sure about the cgroup core changes. - Optimization of a couple common tests using static_key" * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits) cgroup: fix race condition around termination check in css_task_iter_next() blkcg: don't create "io.stat" on the root cgroup cgroup: drop cgroup__DEVEL__legacy_files_on_dfl cgroup: replace error handling in cgroup_init() with WARN_ON()s cgroup: add cgroup_subsys->free() method and use it to fix pids controller cgroup: keep zombies associated with their original cgroups cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock cgroup: don't hold css_set_rwsem across css task iteration cgroup: reorganize css_task_iter functions cgroup: factor out css_set_move_task() cgroup: keep css_set and task lists in chronological order cgroup: make cgroup_destroy_locked() test cgroup_is_populated() cgroup: make css_sets pin the associated cgroups cgroup: relocate cgroup_[try]get/put() cgroup: move check_for_release() invocation cgroup: replace cgroup_has_tasks() with cgroup_is_populated() cgroup: make cgroup->nr_populated count the number of populated css_sets cgroup: remove an unused parameter from cgroup_task_migrate() cgroup: fix too early usage of static_branch_disable() cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically ...
2015-11-04Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-blockLinus Torvalds1-100/+230
Pull block reservation support from Jens Axboe: "This adds support for persistent reservations, both at the core level, as well as for sd and NVMe" [ Background from the docs: "Persistent Reservations allow restricting access to block devices to specific initiators in a shared storage setup. All implementations are expected to ensure the reservations survive a power loss and cover all connections in a multi path environment" ] * 'for-4.4/reservations' of git://git.kernel.dk/linux-block: NVMe: Precedence error in nvme_pr_clear() nvme: add missing endianess annotations in nvme_pr_command NVMe: Add persistent reservation ops sd: implement the Persistent Reservation API block: add an API for Persistent Reservations block: cleanup blkdev_ioctl
2015-11-04Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-blockLinus Torvalds10-197/+216
Pull block integrity updates from Jens Axboe: ""This is the joint work of Dan and Martin, cleaning up and improving the support for block data integrity" * 'for-4.4/integrity' of git://git.kernel.dk/linux-block: block, libnvdimm, nvme: provide a built-in blk_integrity nop profile block: blk_flush_integrity() for bio-based drivers block: move blk_integrity to request_queue block: generic request_queue reference counting nvme: suspend i/o during runtime blk_integrity_unregister md: suspend i/o during runtime blk_integrity_unregister md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown block: Inline blk_integrity in struct gendisk block: Export integrity data interval size in sysfs block: Reduce the size of struct blk_integrity block: Consolidate static integrity profile properties block: Move integrity kobject to struct gendisk
2015-11-04Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-blockLinus Torvalds7-40/+122
Pull core block updates from Jens Axboe: "This is the core block pull request for 4.4. I've got a few more topic branches this time around, some of them will layer on top of the core+drivers changes and will come in a separate round. So not a huge chunk of changes in this round. This pull request contains: - Enable blk-mq page allocation tracking with kmemleak, from Catalin. - Unused prototype removal in blk-mq from Christoph. - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two xchg()'s, from Davidlohr. - A plug flush fix from Jeff. - Also from Jeff, a fix that means we don't have to update shared tag sets at init time unless we do a state change. This cuts down boot times on thousands of devices a lot with scsi/blk-mq. - blk-mq waitqueue barrier fix from Kosuke. - Various fixes from Ming: - Fixes for segment merging and splitting, and checks, for the old core and blk-mq. - Potential blk-mq speedup by marking ctx pending at the end of a plug insertion batch in blk-mq. - direct-io no page dirty on kernel direct reads. - A WRITE_SYNC fix for mpage from Roman" * 'for-4.4/core' of git://git.kernel.dk/linux-block: blk-mq: avoid excessive boot delays with large lun counts blktrace: re-write setting q->blk_trace blk-mq: mark ctx as pending at batch in flush plug path blk-mq: fix for trace_block_plug() block: check bio_mergeable() early before merging blk-mq: check bio_mergeable() early before merging block: avoid to merge splitted bio block: setup bi_phys_segments after splitting block: fix plug list flushing for nomerge queues blk-mq: remove unused blk_mq_clone_flush_request prototype blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write block: kmemleak: Track the page allocations for struct request
2015-11-03blk-mq: avoid excessive boot delays with large lun countsJeff Moyer1-17/+30
Hi, Zhangqing Luo reported long boot times on a system with thousands of LUNs when scsi-mq was enabled. He narrowed the problem down to blk_mq_add_queue_tag_set, where every queue is frozen in order to set the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues added before it in sequence, which involves waiting for an RCU grace period for each one. We don't need to do this. After the second queue is added, only new queues need to be initialized with the shared tag. We can do that by percolating the flag up to the blk_mq_tag_set, and updating the newly added queue's hctxs if the flag is set. This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve support for shared tags maps). Reported-and-tested-by: Jason Luo <[email protected]> Reviewed-by: Ming Lei <[email protected]> Signed-off-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-28block: re-add discard_granularity and alignment checksMing Lin1-9/+22
In commit b49a087("block: remove split code in blkdev_issue_{discard,write_same}"), discard_granularity and alignment checks were removed. Ideally, with bio late splitting, the upper layers shouldn't need to depend on device's limits. Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe device when mkfs.xfs. We have not found the root cause yet. This patch re-adds discard_granularity and alignment checks by reverting the related changes in commit b49a087. The good thing is now we can remove the 2G discard size cap and just use UINT_MAX to avoid bi_size overflow. Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Christoph Hellwig <[email protected]> Signed-off-by: Ming Lin <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-22blkcg: don't create "io.stat" on the root cgroupTejun Heo1-0/+1
The stat files on the root cgroup shows stats for the whole system and usually don't contain any information which isn't available through the usual system monitoring mechanisms. Some controllers skip collecting these duplicate stats to optimize cases where cgroup isn't used and later try to emulate the result on demand. This leads to complexities and subtle differences in the information shown through different channels. This is entirely unnecessary and cgroup v2 is dropping stat files which are duplicate from all controllers. This patch removes "io.stat" from the root hierarchy. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Jens Axboe <[email protected]> Cc: Vivek Goyal <[email protected]>
2015-10-21blk-mq: mark ctx as pending at batch in flush plug pathMing Lei1-5/+13
Most of times, flush plug should be the hottest I/O path, so mark ctx as pending after all requests in the list are inserted. Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21blk-mq: fix for trace_block_plug()Ming Lei1-1/+1
The trace point is for tracing plug event of each request queue instead of each task, so we should check the request count in the plug list from current queue instead of current task. Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: check bio_mergeable() early before mergingMing Lei1-1/+1
After bio splitting is introduced, one bio can be splitted and it is marked as NOMERGE because it is too fat to be merged, so check bio_mergeable() earlier to avoid to try to merge it unnecessarily. Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21blk-mq: check bio_mergeable() early before mergingMing Lei1-1/+1
It isn't necessary to try to merge the bio which is marked as NOMERGE. Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: avoid to merge splitted bioMing Lei1-0/+3
The splitted bio has been already too fat to merge, so mark it as NOMERGE. Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: setup bi_phys_segments after splittingMing Lei1-7/+22
The number of bio->bi_phys_segments is always obtained during bio splitting, so it is natural to setup it just after bio splitting, then we can avoid to compute nr_segment again during merge. Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: fix plug list flushing for nomerge queuesJeff Moyer3-6/+36
Request queues with merging disabled will not flush the plug list after BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies on blk_attempt_plug_merge to compute the request_count. Fix this by computing the number of queued requests even for nomerge queues. Signed-off-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: add an API for Persistent ReservationsChristoph Hellwig1-0/+103
This commits adds a driver API and ioctls for controlling Persistent Reservations s/genericly/generically/ at the block layer. Persistent Reservations are supported by SCSI and NVMe and allow controlling who gets access to a device in a shared storage setup. Note that we add a pr_ops structure to struct block_device_operations instead of adding the members directly to avoid bloating all instances of devices that will never support Persistent Reservations. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: cleanup blkdev_ioctlChristoph Hellwig1-100/+127
Split out helpers for all non-trivial ioctls to make this function simpler, and also start passing around a pointer version of the argument, as that's what most ioctl handlers actually need. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block, libnvdimm, nvme: provide a built-in blk_integrity nop profileDan Williams1-1/+12
The libnvidmm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <[email protected]> Cc: Matthew Wilcox <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> [hch: kill the local nvme nop profile as well] Acked-by: Martin K. Petersen <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: blk_flush_integrity() for bio-based driversDan Williams3-0/+16
Since they lack requests to pin the request_queue active, synchronous bio-based drivers may have in-flight integrity work from bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush that work to prevent races to free the queue and the final usage of the blk_integrity profile. This is temporary unless/until bio-based drivers start to generically take a q_usage_counter reference while a bio is in-flight. Cc: Martin K. Petersen <[email protected]> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: move blk_integrity to request_queueDan Williams1-7/+7
A trace like the following proceeds a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: generic request_queue reference countingDan Williams5-73/+101
Allow pmem, and other synchronous/bio-based block drivers, to fallback on a per-cpu reference count managed by the core for tracking queue live/dead state. The existing per-cpu reference count for the blk_mq case is promoted to be used in all block i/o scenarios. This involves initializing it by default, waiting for it to drop to zero at exit, and holding a live reference over the invocation of q->make_request_fn() in generic_make_request(). The blk_mq code continues to take its own reference per blk_mq request and retains the ability to freeze the queue, but the check that the queue is frozen is moved to generic_make_request(). This fixes crash signatures like the following: BUG: unable to handle kernel paging request at ffff880140000000 [..] Call Trace: [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70 [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem] [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem] [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0 [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110 [<ffffffff8141f966>] submit_bio+0x76/0x170 [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160 [<ffffffff81286e62>] submit_bh+0x12/0x20 [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170 [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90 [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270 [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30 [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250 [<ffffffff81303494>] ext4_put_super+0x64/0x360 [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0 Cc: Jens Axboe <[email protected]> Cc: Keith Busch <[email protected]> Cc: Ross Zwisler <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Ross Zwisler <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: Inline blk_integrity in struct gendiskMartin K. Petersen3-102/+61
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <[email protected]> Reported-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Mike Snitzer <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: Export integrity data interval size in sysfsMartin K. Petersen1-0/+14
The size of the data interval was not exported in the sysfs integrity directory. Export it so that userland apps can tell whether the interval is different from the device's logical block size. Signed-off-by: Martin K. Petersen <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: Reduce the size of struct blk_integrityMartin K. Petersen2-5/+5
The per-device properties in the blk_integrity structure were previously unsigned short. However, most of the values fit inside a char. The only exception is the data interval size and we can work around that by storing it as a power of two. This cuts the size of the dynamic portion of blk_integrity in half. Signed-off-by: Martin K. Petersen <[email protected]> Reported-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: Consolidate static integrity profile propertiesMartin K. Petersen3-25/+16
We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by: Martin K. Petersen <[email protected]> Reported-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-21block: Move integrity kobject to struct gendiskMartin K. Petersen1-11/+11
The integrity kobject purely exists to support the integrity subdirectory in sysfs and doesn't really have anything to do with the blk_integrity data structure. Move the kobject to struct gendisk where it belongs. Signed-off-by: Martin K. Petersen <[email protected]> Reported-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Dan Williams <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-15block: don't release bdi while request_queue has live referencesTejun Heo2-1/+2
bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_dstroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by: Tejun Heo <[email protected]> Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: [email protected] # v4.2+ Reported-and-tested-by: Andrey Konovalov <[email protected]> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by: Jan Kara <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-15blk-mq: fix use-after-free in blk_mq_free_tag_set()Junichi Nomura2-3/+2
tags is freed in blk_mq_free_rq_map() and should not be used after that. The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because free_cpumask_var() is nop. tags->cpumask is allocated in blk_mq_init_tags() so it's natural to free cpumask in its counter part, blk_mq_free_tags(). Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements") Signed-off-by: Jun'ichi Nomura <[email protected]> Cc: Keith Busch <[email protected]> Reviewed-by: Jeff Moyer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-09blk-mq: remove unused blk_mq_clone_flush_request prototypeChristoph Hellwig1-2/+0
Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-09blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.cKosuke Tatsukawa1-0/+4
blk_mq_tag_update_depth() seems to be missing a memory barrier which might cause the waker to not notice the waiter and fail to send a wake_up as in the following figure. blk_mq_tag_update_depth bt_get ------------------------------------------------------------------------ if (waitqueue_active(&bs->wait)) /* The CPU might reorder the test for the waitqueue up here, before prior writes complete */ prepare_to_wait(&bs->wait, &wait, TASK_UNINTERRUPTIBLE); tag = __bt_get(hctx, bt, last_tag, tags); /* Value set in bt_update_count not visible yet */ bt_update_count(&tags->bitmap_tags, tdepth); /* blk_mq_tag_wakeup_all(tags, false); */ bt = &tags->bitmap_tags; wake_index = atomic_read(&bt->wake_index); ... io_schedule(); ------------------------------------------------------------------------ This patch adds the missing memory barrier. I found this issue when I was looking through the linux source code for places calling waitqueue_active() before wake_up*(), but without preceding memory barriers, after sending a patch to fix a similar issue in drivers/tty/n_tty.c (Details about the original issue can be found here: https://lkml.org/lkml/2015/9/28/849). Signed-off-by: Kosuke Tatsukawa <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-09Merge tag 'v4.3-rc4' into for-4.4/coreJens Axboe6-65/+128
Linux 4.3-rc4 Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone in since for-4.4/core was based.
2015-10-01blk-mq: factor out a helper to iterate all tags for a request_queueChristoph Hellwig3-18/+25
And replace the blk_mq_tag_busy_iter with it - the driver use has been replaced with a new helper a while ago, and internal to the block we only need the new version. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-10-01blk-mq: fix racy updates of rq->errorsChristoph Hellwig1-6/+6
blk_mq_complete_request may be a no-op if the request has already been completed by others means (e.g. a timeout or cancellation), but currently drivers have to set rq->errors before calling blk_mq_complete_request, which might leave us with the wrong error value. Add an error parameter to blk_mq_complete_request so that we can defer setting rq->errors until we known we won the race to complete the request. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-09-29blk-mq: fix deadlock when reading cpu_listAkinobu Mita2-4/+7
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) acquires all_q_mutex in blk_mq_queue_reinit_notify() and then removes sysfs entries by blk_mq_sysfs_unregister(). Removing sysfs entry needs to be blocked until the active reference of the kernfs_node to be zero. On the other hand, reading blk_mq_hw_sysfs_cpu sysfs entry (e.g. /sys/block/nullb0/mq/0/cpu_list) acquires all_q_mutex in blk_mq_hw_sysfs_cpus_show(). If these happen at the same time, a deadlock can happen. Because one can wait for the active reference to be zero with holding all_q_mutex, and the other tries to acquire all_q_mutex with holding the active reference. The reason that all_q_mutex is acquired in blk_mq_hw_sysfs_cpus_show() is to avoid reading an imcomplete hctx->cpumask. Since reading sysfs entry for blk-mq needs to acquire q->sysfs_lock, we can avoid deadlock and reading an imcomplete hctx->cpumask by protecting q->sysfs_lock while hctx->cpumask is being updated. Signed-off-by: Akinobu Mita <[email protected]> Reviewed-by: Ming Lei <[email protected]> Cc: Ming Lei <[email protected]> Cc: Wanpeng Li <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-09-29blk-mq: avoid inserting requests before establishing new mappingAkinobu Mita3-19/+52
Notifier callbacks for CPU_ONLINE action can be run on the other CPU than the CPU which was just onlined. So it is possible for the process running on the just onlined CPU to insert request and run hw queue before establishing new mapping which is done by blk_mq_queue_reinit_notify(). This can cause a problem when the CPU has just been onlined first time since the request queue was initialized. At this time ctx->index_hw for the CPU, which is the index in hctx->ctxs[] for this ctx, is still zero before blk_mq_queue_reinit_notify() is called by notifier callbacks for CPU_ONLINE action. For example, there is a single hw queue (hctx) and two CPU queues (ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and a request is inserted into ctx1->rq_list and set bit0 in pending bitmap as ctx1->index_hw is still zero. And then while running hw queue, flush_busy_ctxs() finds bit0 is set in pending bitmap and tries to retrieve requests in hctx->ctxs[0]->rq_list. But htx->ctxs[0] is a pointer to ctx0, so the request in ctx1->rq_list is ignored. Fix it by ensuring that new mapping is established before onlined cpu starts running. Signed-off-by: Akinobu Mita <[email protected]> Reviewed-by: Ming Lei <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-09-29blk-mq: fix q->mq_usage_counter access raceAkinobu Mita1-3/+4
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) accesses q->mq_usage_counter while freezing all request queues in all_q_list. On the other hand, q->mq_usage_counter is deinitialized in blk_mq_free_queue() before deleting the queue from all_q_list. So if CPU hotplug event occurs in the window, percpu_ref_kill() is called with q->mq_usage_counter which has already been marked dead, and it triggers warning. Fix it by deleting the queue from all_q_list earlier than destroying q->mq_usage_counter. Signed-off-by: Akinobu Mita <[email protected]> Reviewed-by: Ming Lei <[email protected]> Cc: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-09-29blk-mq: Fix use after of free q->mq_mapAkinobu Mita1-5/+3
CPU hotplug handling for blk-mq (blk_mq_queue_reinit) updates q->mq_map by blk_mq_update_queue_map() for all request queues in all_q_list. On the other hand, q->mq_map is released before deleting the queue from all_q_list. So if CPU hotplug event occurs in the window, invalid memory access can happen. Fix it by releasing q->mq_map in blk_mq_release() to make it happen latter than removal from all_q_list. Signed-off-by: Akinobu Mita <[email protected]> Suggested-by: Ming Lei <[email protected]> Reviewed-by: Ming Lei <[email protected]> Cc: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-09-29blk-mq: fix sysfs registration/unregistration raceAkinobu Mita2-11/+25
There is a race between cpu hotplug handling and adding/deleting gendisk for blk-mq, where both are trying to register and unregister the same sysfs entries. null_add_dev --> blk_mq_init_queue --> blk_mq_init_allocated_queue --> add to 'all_q_list' (*) --> add_disk --> blk_register_queue --> blk_mq_register_disk (++) null_del_dev --> del_gendisk --> blk_unregister_queue --> blk_mq_unregister_disk (--) --> blk_cleanup_queue --> blk_mq_free_queue --> del from 'all_q_list' (*) blk_mq_queue_reinit --> blk_mq_sysfs_unregister (-) --> blk_mq_sysfs_register (+) While the request queue is added to 'all_q_list' (*), blk_mq_queue_reinit() can be called for the queue anytime by CPU hotplug callback. But blk_mq_sysfs_unregister (-) and blk_mq_sysfs_register (+) in blk_mq_queue_reinit must not be called before blk_mq_register_disk (++) and after blk_mq_unregister_disk (--) is finished. Because '/sys/block/*/mq/' is not exists. There has already been BLK_MQ_F_SYSFS_UP flag in hctx->flags which can be used to track these sysfs stuff, but it is only fixing this issue partially. In order to fix it completely, we just need per-queue flag instead of per-hctx flag with appropriate locking. So this introduces q->mq_sysfs_init_done which is properly protected with all_q_mutex. Also, we need to ensure that blk_mq_map_swqueue() is called with all_q_mutex is held. Since hctx->nr_ctx is reset temporarily and updated in blk_mq_map_swqueue(), so we should avoid blk_mq_register_hctx() seeing the temporary hctx->nr_ctx value in CPU hotplug handling or adding/deleting gendisk . Signed-off-by: Akinobu Mita <[email protected]> Reviewed-by: Ming Lei <[email protected]> Cc: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>