aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-05-29mtip32xx: complete requests from ->timeoutChristoph Hellwig1-1/+2
By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29nbd: complete requests from ->timeoutChristoph Hellwig1-3/+4
By completing the request entirely in the driver we can remove the BLK_EH_HANDLED return value and thus the split responsibility between the driver and the block layer that has been causing trouble. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29nvme: return BLK_EH_DONE from ->timeoutChristoph Hellwig3-11/+7
NVMe always completes the request before returning from ->timeout, either by polling for it, or by disabling the controller. Return BLK_EH_DONE so that the block layer doesn't even try to complete it again. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONEChristoph Hellwig17-25/+25
The BLK_EH_NOT_HANDLED implies nothing happen, but very often that is not what is happening - instead the driver already completed the command. Fix the symbolic name to reflect that a little better. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Johannes Thumshirn <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29blk-mq: Remove generation seqeunceKeith Busch6-254/+89
This patch simplifies the timeout handling by relying on the request reference counting to ensure the iterator is operating on an inflight and truly timed out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on it: a driver completing a request is allowed to proceed to the next state without additional syncronization with the block layer. This also removes any need for generation sequence numbers since the request lifetime is prevented from being reallocated as a new sequence while timeout handling is operating on it. To enables this a refcount is added to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until both the timeout handler and the completion are done with it. Signed-off-by: Keith Busch <[email protected]> [hch: slight cleanups, added back submission side hctx lock, use cmpxchg for completions] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29blk-mq: Fix timeout and state orderKeith Busch1-1/+1
The block layer had been setting the state to in-flight prior to updating the timer. This is the wrong order since the timeout handler could observe the in-flight state with the older timeout, believing the request had expired when in fact it is just getting started. Signed-off-by: Keith Busch <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-29libata: remove ata_scsi_timed_outChristoph Hellwig2-53/+0
As far as I can tell this function can't even be called any more, given that ATA implements its own eh_strategy_handler with ata_scsi_error, which never calls ->eh_timed_out. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-28bcache: Replace bch_read_string_list() by __sysfs_match_string()Andy Shevchenko1-35/+9
Kernel library has a common function to match user input from sysfs against an array of strings. Thus, replace bch_read_string_list() by __sysfs_match_string(). Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-28bcache: Move couple of functions to sysfs.cAndy Shevchenko3-40/+35
There is couple of functions that are used exclusively in sysfs.c. Move it to there and make them static. Besides above, it will allow further clean up. No functional change intended. Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-28bcache: Move couple of string arrays to sysfs.cAndy Shevchenko3-20/+18
There is couple of string arrays that are used exclusively in sysfs.c. Move it to there and make them static. Besides above, it will allow further clean up. No functional change intended. Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Coly Li <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-28bcache: stop bcache device when backing device is offlineColy Li2-0/+55
Currently bcache does not handle backing device failure, if backing device is offline and disconnected from system, its bcache device can still be accessible. If the bcache device is in writeback mode, I/O requests even can success if the requests hit on cache device. That is to say, when and how bcache handles offline backing device is undefined. This patch tries to handle backing device offline in a rather simple way, - Add cached_dev->status_update_thread kernel thread to update backing device status in every 1 second. - Add cached_dev->offline_seconds to record how many seconds the backing device is observed to be offline. If the backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT (30) seconds, set dc->io_disable to 1 and call bcache_device_stop() to stop the bache device which linked to the offline backing device. Now if a backing device is offline for BACKING_DEV_OFFLINE_TIMEOUT seconds, its bcache device will be removed, then user space application writing on it will get error immediately, and handler the device failure in time. This patch is quite simple, does not handle more complicated situations. Once the bcache device is stopped, users need to recovery the backing device, register and attach it manually. Changelog: v3: call wait_for_kthread_stop() before exits kernel thread. v2: remove "bcache: " prefix when calling pr_warn(). v1: initial version. Signed-off-by: Coly Li <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Cc: Michael Lyle <[email protected]> Cc: Junhui Tang <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-25null_blk: add blocking description and remove lightnvmLiu Bo1-3/+6
- The description of 'blocking' is missing in null_blk.txt - The 'lightnvm' parameter has been removed in null_blk.c This updates both in null_blk.txt. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-24block drivers/block: Use octal not symbolic permissionsJoe Perches25-163/+154
Convert the S_<FOO> symbolic permissions to their octal equivalents as using octal and not symbolic permissions is preferred by many as more readable. see: https://lkml.org/lkml/2016/8/2/1945 Done with automated conversion via: $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace <files...> Miscellanea: o Wrapped modified multi-line calls to a single line where appropriate o Realign modified multi-line calls to open parenthesis Signed-off-by: Joe Perches <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-24blk-mq: avoid starving tag allocation after allocating process migratesMing Lei3-14/+34
When the allocation process is scheduled back and the mapped hw queue is changed, fake one extra wake up on previous queue for compensating wake up miss, so other allocations on the previous queue won't be starved. This patch fixes one request allocation hang issue, which can be triggered easily in case of very low nr_request. The race is as follows: 1) 2 hw queues, nr_requests are 2, and wake_batch is one 2) there are 3 waiters on hw queue 0 3) two in-flight requests in hw queue 0 are completed, and only two waiters of 3 are waken up because of wake_batch, but both the two waiters can be scheduled to another CPU and cause to switch to hw queue 1 4) then the 3rd waiter will wait for ever, since no in-flight request is in hw queue 0 any more. 5) this patch fixes it by the fake wakeup when waiter is scheduled to another hw queue Cc: <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: Ming Lei <[email protected]> Modified commit message to make it clearer, and make it apply on top of the 4.18 branch. Signed-off-by: Jens Axboe <[email protected]>
2018-05-23bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueueTejun Heo1-1/+17
From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001 From: Tejun Heo <[email protected]> Date: Wed, 23 May 2018 10:29:00 -0700 cgwb_release() punts the actual release to cgwb_release_workfn() on system_wq. Depending on the number of cgroups or block devices, there can be a lot of cgwb_release_workfn() in flight at the same time. We're periodically seeing close to 256 kworkers getting stuck with the following stack trace and overtime the entire system gets stuck. [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330 [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30 [<ffffffff811ccf23>] bdi_unregister+0x53/0x290 [<ffffffff811cd1e9>] release_bdi+0x89/0xc0 [<ffffffff811cd645>] wb_exit+0x85/0xa0 [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0 [<ffffffff810a68d0>] process_one_work+0x150/0x410 [<ffffffff810a71fd>] worker_thread+0x6d/0x520 [<ffffffff810ad3dc>] kthread+0x12c/0x160 [<ffffffff81969019>] ret_from_fork+0x29/0x40 [<ffffffffffffffff>] 0xffffffffffffffff The events leading to the lockup are... 1. A lot of cgwb_release_workfn() is queued at the same time and all system_wq kworkers are assigned to execute them. 2. They all end up calling synchronize_rcu_expedited(). One of them wins and tries to perform the expedited synchronization. 3. However, that invovles queueing rcu_exp_work to system_wq and waiting for it. Because #1 is holding all available kworkers on system_wq, rcu_exp_work can't be executed. cgwb_release_workfn() is waiting for synchronize_rcu_expedited() which in turn is waiting for cgwb_release_workfn() to free up some of the kworkers. We shouldn't be scheduling hundreds of cgwb_release_workfn() at the same time. There's nothing to be gained from that. This patch updates cgwb release path to use a dedicated percpu workqueue with @max_active of 1. While this resolves the problem at hand, it might be a good idea to isolate rcu_exp_work to its own workqueue too as it can be used from various paths and is prone to this sort of indirect A-A deadlocks. Signed-off-by: Tejun Heo <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2018-05-23nbd: set discard granularity properlyJosef Bacik1-2/+8
For some reason we had discard granularity set to 512 always even when discards were disabled. Fix this by having the default be 0, and then if we turn it on set the discard granularity to the blocksize. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-22blkdev_report_zones_ioctl(): Use vmalloc() to allocate large buffersBart Van Assche1-2/+6
Avoid that complaints similar to the following appear in the kernel log if the number of zones is sufficiently large: fio: page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null) Call Trace: dump_stack+0x63/0x88 warn_alloc+0xf5/0x190 __alloc_pages_slowpath+0x8f0/0xb0d __alloc_pages_nodemask+0x242/0x260 alloc_pages_current+0x6a/0xb0 kmalloc_order+0x18/0x50 kmalloc_order_trace+0x26/0xb0 __kmalloc+0x20e/0x220 blkdev_report_zones_ioctl+0xa5/0x1a0 blkdev_ioctl+0x1ba/0x930 block_ioctl+0x41/0x50 do_vfs_ioctl+0xaa/0x610 SyS_ioctl+0x79/0x90 do_syscall_64+0x79/0x1b0 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls") Signed-off-by: Bart Van Assche <[email protected]> Cc: Shaun Tancheff <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Martin K. Petersen <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-22block/ndb: add WQ_UNBOUND to the knbd-recv workqueueDan Melnic1-1/+2
Add WQ_UNBOUND to the knbd-recv workqueue so we're not bound to a single CPU that is selected at device creation time. Signed-off-by: Dan Melnic <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-22blk-mq: remove wrong 'unlikely' checkhuhai1-1/+1
When dispatch_rq_from_ctx is called, in the vast majority of cases the ctx->rq_list is not empty. Signed-off-by: huhai <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-21nvme-pci: fix race between poll and IRQ completionsJens Axboe1-4/+11
If polling completions are racing with the IRQ triggered by a completion, the IRQ handler will find no work and return IRQ_NONE. This can trigger complaints about spurious interrupts: [ 560.169153] irq 630: nobody cared (try booting with the "irqpoll" option) [ 560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65 [ 560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018 [ 560.175991] Call Trace: [ 560.175994] <IRQ> [ 560.176005] dump_stack+0x5c/0x7b [ 560.176010] __report_bad_irq+0x30/0xc0 [ 560.176013] note_interrupt+0x235/0x280 [ 560.176020] handle_irq_event_percpu+0x51/0x70 [ 560.176023] handle_irq_event+0x27/0x50 [ 560.176026] handle_edge_irq+0x6d/0x180 [ 560.176031] handle_irq+0xa5/0x110 [ 560.176036] do_IRQ+0x41/0xc0 [ 560.176042] common_interrupt+0xf/0xf [ 560.176043] </IRQ> [ 560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0 [ 560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd [ 560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f [ 560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000 [ 560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008 [ 560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358 [ 560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593 [ 560.176065] ? cpuidle_enter_state+0x8b/0x2b0 [ 560.176071] do_idle+0x1f4/0x260 [ 560.176075] cpu_startup_entry+0x6f/0x80 [ 560.176080] start_secondary+0x184/0x1d0 [ 560.176085] secondary_startup_64+0xa5/0xb0 [ 560.176088] handlers: [ 560.178387] [<00000000efb612be>] nvme_irq [nvme] [ 560.183019] Disabling IRQ #630 A previous commit removed ->cqe_seen that was handling this case, but we need to handle this a bit differently due to completions now running outside the queue lock. Return IRQ_HANDLED from the IRQ handler, if the completion ring head was moved since we last saw it. Fixes: 5cb525c8315f ("nvme-pci: handle completions outside of the queue lock") Reported-by: Keith Busch <[email protected]> Reviewed-by: Keith Busch <[email protected]> Tested-by: Keith Busch <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-21Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/blockJens Axboe5-94/+133
Pull NVMe changes from Keith: "This is just the first nvme pull request for 4.18. There are several fabrics and target patches that I missed, so there will be more to come." * 'nvme-4.18' of git://git.infradead.org/nvme: nvme-pci: drop IRQ disabling on submission queue lock nvme-pci: split the nvme queue lock into submission and completion locks nvme-pci: handle completions outside of the queue lock nvme-pci: move ->cq_vector == -1 check outside of ->q_lock nvme-pci: remove cq check after submission nvme-pci: simplify nvme_cqe_valid nvme: mark the result argument to nvme_complete_async_event volatile nvme/pci: Sync controller reset for AER slot_reset nvme/pci: Hold controller reference during async probe nvme: only reconfigure discard if necessary nvme/pci: Use async_schedule for initial reset work nvme: lightnvm: add granby support NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage nvme: change order of qid and cmdid in completion trace nvme: fc: provide a descriptive error
2018-05-18nvme-pci: drop IRQ disabling on submission queue lockJens Axboe1-4/+4
Since we aren't sharing the lock for completions now, we don't have to make it IRQ safe. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18nvme-pci: split the nvme queue lock into submission and completion locksJens Axboe1-21/+23
This is now feasible. We protect the submission queue ring with ->sq_lock, and the completion side with ->cq_lock. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18nvme-pci: handle completions outside of the queue lockJens Axboe1-42/+45
Split the completion of events into a two part process: 1) Reap the events inside the queue lock 2) Complete the events outside the queue lock Since we never wrap the queue, we can access it locklessly after we've updated the completion queue head. This patch started off with batching events on the stack, but with this trick we don't have to. Keith Busch <[email protected]> came up with that idea. Note that this kills the ->cqe_seen as well. I haven't been able to trigger any ill effects of this. If we do race with polling every so often, it should be rare enough NOT to trigger any issues. Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Keith Busch <[email protected]> [hch: refactored, restored poll early exit optimization] Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18nvme-pci: move ->cq_vector == -1 check outside of ->q_lockJens Axboe1-5/+13
We only clear it dynamically in nvme_suspend_queue(). When we do, ensure to do a full flush so that any nvme_queue_rq() invocation will see it. Ideally we'd kill this check completely, but we're using it to flush requests on a dying queue. Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18nvme-pci: remove cq check after submissionJens Axboe1-2/+0
We always check the completion queue after submitting, but in my testing this isn't a win even on DRAM/xpoint devices. In some cases it's actually worse. Kill it. Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18nvme-pci: simplify nvme_cqe_validChristoph Hellwig1-6/+6
We always look at the current CQ head and phase, so don't pass these as separate arguments, and rename the function to nvme_cqe_pending. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-18nvme: mark the result argument to nvme_complete_async_event volatileChristoph Hellwig2-2/+2
We'll need that in the PCIe driver soon as we'll read it straight off the CQ. Signed-off-by: Christoph Hellwig <[email protected]>
2018-05-18blk-mq: clear hctx->dispatch_from when mappings changehuhai1-0/+1
When the number of hardware queues is changed, the drivers will call blk_mq_update_nr_hw_queues() to remap hardware queues. This changes the ctx mappings, but the current code doesn't clear the ->dispatch_from hint. This can result in dispatch_from pointing to a ctx that isn't mapped to the hctx anymore. Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue") Signed-off-by: huhai <[email protected]> Reviewed-by: Ming Lei <[email protected]> Moved the placement of the clearing to where we clear other items pertaining to the existing mapping, added Fixes line, and reworded the commit message. Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: call nbd_bdev_reset instead of bd_set_size on disconnectJosef Bacik1-1/+1
We need to make sure we don't just set the size of the bdev to 0 while it's being used by a file system. We have the appropriate check in nbd_bdev_reset, simply use that helper instead. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: fix how we set bd_invalidatedJosef Bacik1-4/+3
bd_invalidated is kind of a pain wrt partitions as it really only triggers the partition rescan if it is set after bd_ops->open() runs, so setting it when we reset the device isn't useful. We also sporadically would still have partitions left over in some disconnect cases, so fix this by always setting bd_invalidated on open if there's no configuration or if we've had a disconnect action happen, that way the partition table gets invalidated and rescanned properly. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: clear_sock on netlink disconnectJosef Bacik1-0/+1
This is what the ioctl based nbd disconnect does as well. Without this the device will just sit there and wait for the connection to go away (or IO to occur) before the device gets torn down. Instead clear everything up on our end so the configuration goes away as quickly as possible. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: use bd_set_size when updating disk sizeJosef Bacik1-1/+9
When we stopped relying on the bdev everywhere I broke updating the block device size on the fly, which ceph relies on. We can't just do set_capacity, we also have to do bd_set_size so things like parted will notice the device size change. Fixes: 29eaadc ("nbd: stop using the bdev everywhere") cc: [email protected] Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: update size when connectedJosef Bacik1-0/+2
I messed up changing the size of an NBD device while it was connected by not actually updating the device or doing the uevent. Fix this by updating everything if we're connected and we change the size. cc: [email protected] Fixes: 639812a ("nbd: don't set the device size until we're connected") Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16nbd: fix nbd device deletionJosef Bacik1-1/+4
This fixes a use after free bug, we shouldn't be doing disk->queue right after we do del_gendisk(disk). Save the queue and do the cleanup after the del_gendisk. Fixes: c6a4759ea0c9 ("nbd: add device refcounting") cc: [email protected] Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16block: fix MAINTAINERS email for nbdJosef Bacik1-1/+1
I've been missing stuff because it's been going into my work email which is a black hole. Update to the email I actually use so I stop missing patches and bug reports. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-16blk-mq: remove redundant insert case in blk_mq_make_request()huhai1-15/+1
We can use blk_mq_sched_insert_request() even if we don't have an IO scheduler attached, since that case will end up being exactly the same as what blk_mq_queue_io() was doing now. Signed-off-by: huhai <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-15Remove jsflash driverJens Axboe4-708/+0
Nobody is using it anymore, and it's been abandoned. Since David is fine with removing it, kill it. Suggested-by: Christoph Hellwig <[email protected]> Acked-by: David S. Miller <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Add sysfs entry for fua supportKent Overstreet1-0/+11
Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Export bio check/set pages_dirtyKent Overstreet1-0/+2
Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Add warning for bi_next not NULL in bio_endio()Kent Overstreet2-1/+10
Recently found a bug where a driver left bi_next not NULL and then called bio_endio(), and then the submitter of the bio used bio_copy_data() which was treating src and dst as lists of bios. Fixed that bug by splitting out bio_list_copy_data(), but in case other things are depending on bi_next in weird ways, add a warning to help avoid more bugs like that in the future. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Add missing flush_dcache_page() callKent Overstreet1-0/+2
Since a bio can point to userspace pages (e.g. direct IO), this is generally necessary. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Split out bio_list_copy_data()Kent Overstreet3-35/+55
Found a bug (with ASAN) where we were passing a bio to bio_copy_data() with bi_next not NULL, when it should have been - a driver had left bi_next set to something after calling bio_endio(). Since the normal case is only copying single bios, split out bio_list_copy_data() to avoid more bugs like this in the future. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Add bio_copy_data_iter(), zero_fill_bio_iter()Kent Overstreet2-23/+39
Add versions that take bvec_iter args instead of using bio->bi_iter - to be used by bcachefs. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Use bioset_init() for fs_bio_setKent Overstreet4-8/+7
Minor optimization - remove a pointer indirection when using fs_bio_set. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Add bioset_init()/bioset_exit()Kent Overstreet2-28/+67
Similarly to mempool_init()/mempool_exit(), take a pointer indirection out of allocation/freeing by allowing biosets to be embedded in other structs. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: Convert bio_set to mempool_init()Kent Overstreet3-39/+36
Minor performance improvement by getting rid of pointer indirections from allocation/freeing fastpaths. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14mempool: Add mempool_init()/mempool_exit()Kent Overstreet2-27/+115
Allows mempools to be embedded in other structs, getting rid of a pointer indirection from allocation fastpaths. mempool_exit() is safe to call on an uninitialized but zeroed mempool. Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14sbitmap: fix race in wait batch accountingJens Axboe1-10/+25
If we have multiple callers of sbq_wake_up(), we can end up in a situation where the wait_cnt will continually go more and more negative. Consider the case where our wake batch is 1, hence wait_cnt will start out as 1. wait_cnt == 1 CPU0 CPU1 atomic_dec_return(), cnt == 0 atomic_dec_return(), cnt == -1 cmpxchg(-1, 0) (succeeds) [wait_cnt now 0] cmpxchg(0, 1) (fails) This ends up with wait_cnt being 0, we'll wakeup immediately next time. Going through the same loop as above again, and we'll have wait_cnt -1. For the case where we have a larger wake batch, the only difference is that the starting point will be higher. We'll still end up with continually smaller batch wakeups, which defeats the purpose of the rolling wakeups. Always reset the wait_cnt to the batch value. Then it doesn't matter who wins the race. But ensure that whomever does win the race is the one that increments the ws index and wakes up our batch count, loser gets to call __sbq_wake_up() again to account his wakeups towards the next active wait state index. Fixes: 6c0ca7ae292a ("sbitmap: fix wakeup hang after sbq resize") Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2018-05-14block: consistently use GFP_NOIO instead of __GFP_NORECLAIMChristoph Hellwig8-15/+16
Same numerical value (for now at least), but a much better documentation of intent. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Jens Axboe <[email protected]>