aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2020-08-21nvme: redirect commands on dying queueChao Leng1-4/+5
If a command send through nvme-multipath failed on a dying queue, resend it on another path. Signed-off-by: Chao Leng <[email protected]> [hch: rebased on top of the completion refactoring] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: just check the status code type in nvme_is_path_errorChristoph Hellwig1-10/+2
Check the SCT sub-field for a path related status instead of enumerating invididual status code. As of NVMe 1.4 this adds "Internal Path Error" and "Controller Pathing Error" to the list, but it also future proofs for additional status codes added to the category. Suggested-by: Chao Leng <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: refactor command completionChristoph Hellwig3-64/+89
Lift all the code to decide the dispostition of a completed command from nvme_complete_rq and nvme_failover_req into a new helper, which returns an emum of the potential actions. nvme_complete_rq then just switches on those and calls the proper helper for the action. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: rename and document nvme_end_requestChristoph Hellwig7-8/+14
nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. It's semantics also are non-trivial, so give it a more descriptive name and add a comment explaining the semantics. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Mike Snitzer <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: skip noiob for zoned devicesKeith Busch1-1/+1
Zoned block devices reuse the chunk_sectors queue limit to define zone boundaries. If a such a device happens to also report an optimal boundary, do not use that to define the chunk_sectors as that may intermittently interfere with io splitting and zone size queries. Signed-off-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme-pci: fix PRP pool sizeChristoph Hellwig1-1/+2
All operations are based on the controller, not the host page size. Switch the dma pool to use the controller page size as well to avoid massive overallocations on large page size systems. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depthJohn Garry1-6/+6
Recently nvme_dev.q_depth was changed from an int to u16 type. This falls over for the queue depth calculation in nvme_pci_enable(), where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the result: root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000 [0000000000000010] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT SMP Modules linked in: nvme nvme_core CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted 5.8.0-next-20200812 #27 Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 Workqueue: nvme-reset-wq nvme_reset_work [nvme] pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--) pc : __sg_alloc_table_from_pages+0xec/0x238 lr : __sg_alloc_table_from_pages+0xc8/0x238 sp : ffff800013ccbad0 x29: ffff800013ccbad0 x28: ffff0a27b3d380a8 x27: 0000000000000000 x26: 0000000000002dc2 x25: 0000000000000dc0 x24: 0000000000000000 x23: 0000000000000000 x22: ffff800013ccbbe8 x21: 0000000000000010 x20: 0000000000000000 x19: 00000000fffff000 x18: ffffffffffffffff x17: 00000000000000c0 x16: fffffe289eaf6380 x15: ffff800011b59948 x14: ffff002bc8fe98f8 x13: ff00000000000000 x12: ffff8000114ca000 x11: 0000000000000000 x10: ffffffffffffffff x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0a27b5f9b680 x4 : 0000000000000000 x3 : ffff0a27b5f9b680 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000 Call trace: __sg_alloc_table_from_pages+0xec/0x238 sg_alloc_table_from_pages+0x18/0x28 iommu_dma_alloc+0x474/0x678 dma_alloc_attrs+0xd8/0xf0 nvme_alloc_queue+0x114/0x160 [nvme] nvme_reset_work+0xb34/0x14b4 [nvme] process_one_work+0x1e8/0x360 worker_thread+0x44/0x478 kthread+0x150/0x158 ret_from_fork+0x10/0x34 Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5) ---[ end trace 89bb2b72d59bf925 ]--- Fix by making onto a u32. Also use u32 for nvme_dev.q_depth, as we assign this value from nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this avoids the same crash as above. Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth") Signed-off-by: John Garry <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: Use spin_lock_irq() when taking the ctrl->lockLogan Gunthorpe1-4/+4
When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a dead lock. The new spin_lock() calls recently added produce the following lockdep warning when running the blktest nvme/003: ================================ WARNING: inconsistent lock state -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes: ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0 {SOFTIRQ-ON-W} state was registered at: lock_acquire+0x164/0x500 _raw_spin_lock+0x28/0x40 nvme_get_effects_log+0x37/0x1c0 nvme_init_identify+0x9e4/0x14f0 nvme_reset_work+0xadd/0x2360 process_one_work+0x66b/0xb70 worker_thread+0x6e/0x6c0 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 irq event stamp: 1449221 hardirqs last enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140 hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60 softirqs last enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595 softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ctrl->lock); <Interrupt> lock(&ctrl->lock); *** DEADLOCK *** no locks held by ksoftirqd/2/22. stack backtrace: CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc8/0x11a print_usage_bug.cold.63+0x235/0x23e mark_lock+0xa9c/0xcf0 __lock_acquire+0xd9a/0x2b50 lock_acquire+0x164/0x500 _raw_spin_lock_irqsave+0x40/0x60 nvme_keep_alive_end_io+0x50/0xc0 blk_mq_end_request+0x158/0x210 nvme_complete_rq+0x146/0x500 nvme_loop_complete_rq+0x26/0x30 [nvme_loop] blk_done_softirq+0x187/0x1e0 __do_softirq+0x118/0x595 run_ksoftirqd+0x35/0x50 smpboot_thread_fn+0x1d3/0x310 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages") Signed-off-by: Logan Gunthorpe <[email protected]> Reviewed-by: Keith Busch <[email protected]> Tested-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvmet: call blk_mq_free_request() directlyChaitanya Kulkarni1-3/+3
Instead of calling blk_put_request() which calls blk_mq_free_request(), call blk_mq_free_request() directly for NVMeOF passthru. This is to mainly avoid an extra function call in the completion path nvmet_passthru_req_done(). Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Keith Busch <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvmet: fix oops in pt cmd executionChaitanya Kulkarni1-3/+3
In the existing NVMeOF Passthru core command handling on failure of nvme_alloc_request() it errors out with rq value set to NULL. In the error handling path it calls blk_put_request() without checking if rq is set to NULL or not which produces following Oops:- [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 1457.347838] #PF: supervisor read access in kernel mode [ 1457.348464] #PF: error_code(0x0000) - not-present page [ 1457.349085] PGD 0 P4D 0 [ 1457.349402] Oops: 0000 [#1] SMP NOPTI [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35 [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214 [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop] [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.363973] Call Trace: [ 1457.364296] nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet] [ 1457.364990] process_one_work+0x24e/0x5a0 [ 1457.365493] ? __schedule+0x353/0x840 [ 1457.365957] worker_thread+0x3c/0x380 [ 1457.366426] ? process_one_work+0x5a0/0x5a0 [ 1457.366948] kthread+0x135/0x150 [ 1457.367362] ? kthread_create_on_node+0x60/0x60 [ 1457.367934] ret_from_fork+0x22/0x30 [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer [ 1457.368414] ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_] [ 1457.380849] CR2: 0000000000000000 [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]--- [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.392525] Kernel panic - not syncing: Fatal exception [ 1457.394138] Kernel Offset: disabled [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]--- We fix this Oops by adding a new goto label out_put_req and reordering the blk_put_request call to avoid calling blk_put_request() with rq value is set to NULL. Here we also update the rest of the code accordingly. Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands") Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvmet: add ns tear down label for pt-cmd handlingChaitanya Kulkarni1-4/+5
In the current implementation before submitting the passthru cmd we may come across error e.g. getting ns from passthru controller, allocating a request from passthru controller, etc. For all the failure cases it only uses single goto label fail_out. In the target code, we follow the pattern to have a separate label for each error out the case when setting up multiple things before the actual action. This patch follows the same pattern and renames generic fail_out label to out_put_ns and updates the error out cases in the nvmet_passthru_execute_cmd() where it is needed. Signed-off-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: multipath: round-robin: eliminate "fallback" variableMartin Wilck1-5/+4
If we find an optimized path, we quit the loop immediately. Thus we can use just one variable for the next path, slighly simplifying the code. Signed-off-by: Martin Wilck <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme: multipath: round-robin: fix single non-optimized path caseMartin Wilck1-5/+10
If there's only one usable, non-optimized path, nvme_round_robin_path() returns NULL, which is wrong. Fix it by falling back to "old", like in the single optimized path case. Also, if the active path isn't changed, there's no need to re-assign the pointer. Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths") Signed-off-by: Martin Wilck <[email protected]> Signed-off-by: Martin George <[email protected]> Reviewed-by: Keith Busch <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvme-fc: Fix wrong return value in __nvme_fc_init_request()Tianjia Zhang1-2/+2
On an error exit path, a negative error code should be returned instead of a positive return value. Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport") Cc: James Smart <[email protected]> Signed-off-by: Tianjia Zhang <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvmet-passthru: Reject commands with non-sgl flags setLogan Gunthorpe1-0/+8
Any command with a non-SGL flag set (like fuse flags) should be rejected. Fixes: c1fef73f793b ("nvmet: add passthru code to process commands") Signed-off-by: Logan Gunthorpe <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21nvmet: fix a memory leakSagi Grimberg1-0/+1
We forgot to free new_model_number Fixes: 013b7ebe5a0d ("nvmet: make ctrl model configurable") Reviewed-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21blkcg: fix memleak for iolatencyYufen Yu1-3/+5
Normally, blkcg_iolatency_exit() will free related memory in iolatency when cleanup queue. But if blk_throtl_init() return error and queue init fail, blkcg_iolatency_exit() will not do that for us. Then it cause memory leak. Fixes: d70675121546 ("block: introduce blk-iolatency io controller") Signed-off-by: Yufen Yu <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21MAINTAINERS: Add missing header files to BLOCK LAYER sectionGeert Uytterhoeven1-0/+1
The various <linux/blk*.h> header files are part of the Block Layer. Add them to the corresponding section in the MAINTAINERS file, so scripts/get_maintainer.pl will pick them up. Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21block: fix get_max_io_size()Keith Busch1-1/+1
A previous commit aligning splits to physical block sizes inadvertently modified one return case such that that it now returns 0 length splits when the number of sectors doesn't exceed the physical offset. This later hits a BUG in bio_split(). Restore the previous working behavior. Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios") Reported-by: Eric Deal <[email protected]> Signed-off-by: Keith Busch <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: [email protected] Signed-off-by: Jens Axboe <[email protected]>
2020-08-21blk-mq: insert request not through ->queue_rq into sw/scheduler queueMing Lei1-1/+2
c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") supposed to add request which has been through ->queue_rq() to the hw queue dispatch list, however it adds request running out of budget or driver tag to hw queue too. This way basically bypasses request merge, and causes too many request dispatched to LLD, and system% is unnecessary increased. Fixes this issue by adding request not through ->queue_rq into sw/scheduler queue, and this way is safe because no ->queue_rq is called on this request yet. High %system can be observed on Azure storvsc device, and even soft lock is observed. This patch reduces %system during heavy sequential IO, meantime decreases soft lockup risk. Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") Signed-off-by: Ming Lei <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Bart Van Assche <[email protected]> Cc: Mike Snitzer <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-08-21block/rnbd: Ensure err is always initialized in process_rdmaNathan Chancellor1-1/+2
Clang warns: drivers/block/rnbd/rnbd-srv.c:150:6: warning: variable 'err' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized] if (IS_ERR(bio)) { ^~~~~~~~~~~ drivers/block/rnbd/rnbd-srv.c:177:9: note: uninitialized use occurs here return err; ^~~ drivers/block/rnbd/rnbd-srv.c:150:2: note: remove the 'if' if its condition is always false if (IS_ERR(bio)) { ^~~~~~~~~~~~~~~~~~ drivers/block/rnbd/rnbd-srv.c:126:9: note: initialize the variable 'err' to silence this warning int err; ^ = 0 1 warning generated. err is indeed uninitialized when this statement is taken. Ensure that it is assigned the error value of bio before jumping to the error handling label. Fixes: 735d77d4fd28 ("rnbd: remove rnbd_dev_submit_io") Reported-by: Brooke Basile <[email protected]> Signed-off-by: Nathan Chancellor <[email protected]> Acked-by: Jack Wang <[email protected]> Link: https://github.com/ClangBuiltLinux/linux/issues/1134 Signed-off-by: Jens Axboe <[email protected]>
2020-08-21dt-bindings: vendor-prefixes: Remove trailing whitespaceGeert Uytterhoeven1-1/+1
Fixes: f516fb704d02fff2 ("dt-bindings: Whitespace clean-ups in schema files") Signed-off-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Rob Herring <[email protected]>
2020-08-21KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not setWill Deacon1-4/+13
When an MMU notifier call results in unmapping a range that spans multiple PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary, since this avoids running into RCU stalls during VM teardown. Unfortunately, if the VM is destroyed as a result of OOM, then blocking is not permitted and the call to the scheduler triggers the following BUG(): | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394 | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper | INFO: lockdep is turned off. | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1 | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 | Call trace: | dump_backtrace+0x0/0x284 | show_stack+0x1c/0x28 | dump_stack+0xf0/0x1a4 | ___might_sleep+0x2bc/0x2cc | unmap_stage2_range+0x160/0x1ac | kvm_unmap_hva_range+0x1a0/0x1c8 | kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8 | __mmu_notifier_invalidate_range_start+0x218/0x31c | mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0 | __oom_reap_task_mm+0x128/0x268 | oom_reap_task+0xac/0x298 | oom_reaper+0x178/0x17c | kthread+0x1e4/0x1fc | ret_from_fork+0x10/0x30 Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier flags. Cc: <[email protected]> Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd") Cc: Marc Zyngier <[email protected]> Cc: Suzuki K Poulose <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Will Deacon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-08-21KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()Will Deacon10-10/+17
The 'flags' field of 'struct mmu_notifier_range' is used to indicate whether invalidate_range_{start,end}() are permitted to block. In the case of kvm_mmu_notifier_invalidate_range_start(), this field is not forwarded on to the architecture-specific implementation of kvm_unmap_hva_range() and therefore the backend cannot sensibly decide whether or not to block. Add an extra 'flags' parameter to kvm_unmap_hva_range() so that architectures are aware as to whether or not they are permitted to block. Cc: <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Suzuki K Poulose <[email protected]> Cc: James Morse <[email protected]> Signed-off-by: Will Deacon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-08-21dt-bindings: net: correct description of phy-connection-typeMadalin Bucur1-1/+2
The phy-connection-type parameter is described in ePAPR 1.1: Specifies interface type between the Ethernet device and a physical layer (PHY) device. The value of this property is specific to the implementation. Signed-off-by: Madalin Bucur <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Rob Herring <[email protected]>
2020-08-21Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-blockLinus Torvalds1-94/+79
Pull io_uring fixes from Jens Axboe: - Make sure the head link cancelation includes async work - Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as a separate function since you moved it into io_uring itself - io_import_iovec cleanups (Pavel, me) - Use system_unbound_wq for ring exit work, to avoid spawning tons of these if we have tons of rings exiting at the same time - Fix req->flags overflow flag manipulation (Pavel) * tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block: io_uring: kill extra iovec=NULL in import_iovec() io_uring: comment on kfree(iovec) checks io_uring: fix racy req->flags modification io_uring: use system_unbound_wq for ring exit work io_uring: cleanup io_import_iovec() of pre-mapped request io_uring: get rid of kiocb_wait_page_queue_init() io_uring: find and cancel head link async work on files exit
2020-08-21dt-bindings: PCI: intel,lgm-pcie: Fix matching on all snps,dw-pcie instancesRob Herring1-0/+8
The intel,lgm-pcie binding is matching on all snps,dw-pcie instances which is wrong. Add a custom 'select' entry to fix this. Fixes: e54ea45a4955 ("dt-bindings: PCI: intel: Add YAML schemas for the PCIe RC controller") Cc: Bjorn Helgaas <[email protected]> Cc: [email protected] Reviewed-by: Dilip Kota <[email protected]> Signed-off-by: Rob Herring <[email protected]>
2020-08-21Merge branch 'akpm' (patches from Andrew)Linus Torvalds10-9/+21
Merge misc fixes from Andrew Morton: "11 patches. Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc, romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc" * emailed patches from Andrew Morton <[email protected]>: mm, page_alloc: fix core hung in free_pcppages_bulk() mm: include CMA pages in lowmem_reserve at boot squashfs: avoid bio_alloc() failure with 1Mbyte blocks uprobes: __replace_page() avoid BUG in munlock_vma_page() kernel/relay.c: fix memleak on destroy relay channel romfs: fix uninitialized memory leak in romfs_dev_read() mm/rodata_test.c: fix missing function declaration mm/vunmap: add cond_resched() in vunmap_pmd_range khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter() hugetlb_cgroup: convert comma to semicolon mailmap: add Andi Kleen
2020-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller12-24/+78
Alexei Starovoitov says: ==================== pull-request: bpf 2020-08-21 The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 5 day(s) which contain a total of 12 files changed, 78 insertions(+), 24 deletions(-). The main changes are: 1) three fixes in BPF task iterator logic, from Yonghong. 2) fix for compressed dwarf sections in vmlinux, from Jiri. 3) fix xdp attach regression, from Andrii. ==================== Signed-off-by: David S. Miller <[email protected]>
2020-08-21Merge tag 'riscv-for-linus-5.9-rc2' of ↵Linus Torvalds19-153/+376
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - The CLINT driver has been split in two: one to handle the M-mode CLINT (memory mapped and used on NOMMU systems) and one to handle the S-mode CLINT (via SBI). - The addition of SiFive's drivers to rv32_defconfig * tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Add SiFive drivers to rv32_defconfig dt-bindings: timer: Add CLINT bindings RISC-V: Remove CLINT related code from timer and arch clocksource/drivers: Add CLINT timer driver RISC-V: Add mechanism to provide custom IPI operations
2020-08-21Merge tag 'for-linus-5.9-rc2-tag' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "One build fix and a minor fix for suppressing a useless warning when booting a Xen dom0 via UEFI" * tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: Fix build error when CONFIG_ACPI is not set/enabled: efi: avoid error message when booting under Xen
2020-08-21Merge tag 'pm-5.9-rc2' of ↵Linus Torvalds1-7/+12
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a few issues in the operating performance points (OPP) framework. Specifics: - Fix re-enabling of resources in dev_pm_opp_set_rate() (Rajendra Nayak) - Fix OPP table reference counting in error paths (Stephen Boyd)" * tag 'pm-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: opp: Enable resources again if they were disabled earlier opp: Put opp table in dev_pm_opp_set_rate() if _set_opp_bw() fails opp: Put opp table in dev_pm_opp_set_rate() for empty tables
2020-08-21bpf: Fix two typos in uapi/linux/bpf.hTobias Klauser2-10/+10
Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code. Signed-off-by: Tobias Klauser <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2020-08-21net: dsa: b53: check for timeoutTom Rix1-0/+2
clang static analysis reports this problem b53_common.c:1583:13: warning: The left expression of the compound assignment is an uninitialized value. The computed value will also be garbage ent.port &= ~BIT(port); ~~~~~~~~ ^ ent is set by a successful call to b53_arl_read(). Unsuccessful calls are caught by an switch statement handling specific returns. b32_arl_read() calls b53_arl_op_wait() which fails with the unhandled -ETIMEDOUT. So add -ETIMEDOUT to the switch statement. Because b53_arl_op_wait() already prints out a message, do not add another one. Fixes: 1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations") Signed-off-by: Tom Rix <[email protected]> Acked-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-08-21hwmon: (applesmc) check status earlier.Tom Rix1-15/+16
clang static analysis reports this representative problem applesmc.c:758:10: warning: 1st function call argument is an uninitialized value left = be16_to_cpu(*(__be16 *)(buffer + 6)) >> 2; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ buffer is filled by the earlier call ret = applesmc_read_key(LIGHT_SENSOR_LEFT_KEY, ... This problem is reported because a goto skips the status check. Other similar problems use data from applesmc_read_key before checking the status. So move the checks to before the use. Signed-off-by: Tom Rix <[email protected]> Reviewed-by: Henrik Rydberg <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-08-21hwmon: (nct7904) Correct divide by 0Jason Baron1-2/+2
We hit a kernel panic due to a divide by 0 in nct7904_read_fan() for the hwmon_fan_min case. Extend the check to hwmon_fan_input case as well for safety. [ 1656.545650] divide error: 0000 [#1] SMP PTI [ 1656.545779] CPU: 12 PID: 18010 Comm: sensors Not tainted 5.4.47 #1 [ 1656.546065] RIP: 0010:nct7904_read+0x1e9/0x510 [nct7904] ... [ 1656.546549] RAX: 0000000000149970 RBX: ffffbd6b86bcbe08 RCX: 0000000000000000 ... [ 1656.547548] Call Trace: [ 1656.547665] hwmon_attr_show+0x32/0xd0 [hwmon] [ 1656.547783] dev_attr_show+0x18/0x50 [ 1656.547898] sysfs_kf_seq_show+0x99/0x120 [ 1656.548013] seq_read+0xd8/0x3e0 [ 1656.548127] vfs_read+0x89/0x130 [ 1656.548234] ksys_read+0x7d/0xb0 [ 1656.548342] do_syscall_64+0x48/0x110 [ 1656.548451] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: d65a5102a99f5 ("hwmon: (nct7904) Convert to use new hwmon registration API") Signed-off-by: Jason Baron <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Guenter Roeck <[email protected]>
2020-08-21device property: Fix the secondary firmware node handling in ↵Heikki Krogerus1-4/+8
set_primary_fwnode() When the primary firmware node pointer is removed from a device (set to NULL) the secondary firmware node pointer, when it exists, is made the primary node for the device. However, the secondary firmware node pointer of the original primary firmware node is never cleared (set to NULL). To avoid situation where the secondary firmware node pointer is pointing to a non-existing object, clearing it properly when the primary node is removed from a device in set_primary_fwnode(). Fixes: 97badf873ab6 ("device property: Make it possible to use secondary firmware nodes") Cc: All applicable <[email protected]> Signed-off-by: Heikki Krogerus <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-08-21ACPI: ioremap: avoid redundant rounding to OS page sizeArd Biesheuvel1-2/+2
The arm64 implementation of acpi_os_ioremap() was recently updated to tighten the checks around which parts of memory are permitted to be mapped by ACPI code, which generally only needs access to memory regions that are statically described by firmware, and any attempts to access memory that is in active use by the OS is generally a bug or a hacking attempt. This tightening is based on the EFI memory map, which describes all memory in the system. The AArch64 architecture permits page sizes of 16k and 64k in addition to the EFI default, which is 4k, which means that the EFI memory map may describe regions that cannot be mapped seamlessly if the OS page size is greater than 4k. This is usually not a problem, given that the EFI spec does not permit memory regions requiring different memory attributes to share a 64k page frame, and so the usual rounding to page size performed by ioremap() is sufficient to deal with this. However, this rounding does complicate our EFI memory map permission check, due to the loss of information that occurs when several small regions share a single 64k page frame (where rounding each of them will result in the same 64k single page region). However, due to the fact that the region check occurs *before* the call to ioremap() where the necessary rounding is performed, we can deal with this issue simply by removing the redundant rounding performed by acpi_os_map_iomem(), as it appears to be the only place where the arguments to a call to acpi_os_ioremap() are rounded up. So omit the rounding in the call, and instead, apply the necessary masking when assigning the map->virt member. Fixes: 1583052d111f ("arm64/acpi: disallow AML memory opregions to access kernel memory") Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-08-21ACPI: SoC: APD: Check return value of acpi_dev_get_property()Furquan Shaikh1-2/+2
`fch_misc_setup()` uses `acpi_dev_get_property()` to read the value of "is-rv" passed in by BIOS in ACPI tables. However, not all BIOSes might pass in this property and hence it is important to first check the return value of `acpi_dev_get_property()` before referencing the object filled by it. Signed-off-by: Furquan Shaikh <[email protected]> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-08-21cpufreq: replace cpu_logical_map() with read_cpuid_mpir()Sumit Gupta1-3/+7
Commit eaecca9e7710 ("arm64: Fix __cpu_logical_map undefined issue") fixes the issue with building tegra194 cpufreq driver as module. But the fix might cause problem while supporting physical CPU hotplug[1]. This patch fixes the original problem by avoiding use of cpu_logical_map(). Instead calling read_cpuid_mpidr() to get MPIDR on target CPU. [1] https://lore.kernel.org/linux-arm-kernel/20200724131059.GB6521@bogus/ Fixes: df320f89359c ("cpufreq: Add Tegra194 cpufreq driver") Reviewed-by: Sudeep Holla <[email protected]> Signed-off-by: Sumit Gupta <[email protected]> Acked-by: Viresh Kumar <[email protected]> [ rjw: Subject & changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2020-08-21ARM64: vdso32: Install vdso32 from vdso_installStephen Boyd2-1/+2
Add the 32-bit vdso Makefile to the vdso_install rule so that 'make vdso_install' installs the 32-bit compat vdso when it is compiled. Fixes: a7f71a2c8903 ("arm64: compat: Add vDSO") Signed-off-by: Stephen Boyd <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Acked-by: Will Deacon <[email protected]> Cc: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
2020-08-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds28-424/+881
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improvements to ext4's block allocator performance for very large file systems, especially when the file system or files which are highly fragmented. There is a new mount option, prefetch_block_bitmaps which will pull in the block bitmaps and set up the in-memory buddy bitmaps when the file system is initially mounted. Beyond that, a lot of bug fixes and cleanups. In particular, a number of changes to make ext4 more robust in the face of write errors or file system corruptions" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: limit the length of per-inode prealloc list ext4: reorganize if statement of ext4_mb_release_context() ext4: add mb_debug logging when there are lost chunks ext4: Fix comment typo "the the". jbd2: clean up checksum verification in do_one_pass() ext4: change to use fallthrough macro ext4: remove unused parameter of ext4_generic_delete_entry function mballoc: replace seq_printf with seq_puts ext4: optimize the implementation of ext4_mb_good_group() ext4: delete invalid comments near ext4_mb_check_limits() ext4: fix typos in ext4_mb_regular_allocator() comment ext4: fix checking of directory entry validity for inline directories fs: prevent BUG_ON in submit_bh_wbc() ext4: correctly restore system zone info when remount fails ext4: handle add_system_zone() failure in ext4_setup_system_zone() ext4: fold ext4_data_block_valid_rcu() into the caller ext4: check journal inode extents more carefully ext4: don't allow overlapping system zones ext4: handle error of ext4_setup_system_zone() on remount ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp() ...
2020-08-21afs: Fix NULL deref in afs_dynroot_depopulate()David Howells1-9/+11
If an error occurs during the construction of an afs superblock, it's possible that an error occurs after a superblock is created, but before we've created the root dentry. If the superblock has a dynamic root (ie. what's normally mounted on /afs), the afs_kill_super() will call afs_dynroot_depopulate() to unpin any created dentries - but this will oops if the root hasn't been created yet. Fix this by skipping that bit of code if there is no root dentry. This leads to an oops looking like: general protection fault, ... KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f] ... RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385 ... Call Trace: afs_kill_super+0x13b/0x180 fs/afs/super.c:535 deactivate_locked_super+0x94/0x160 fs/super.c:335 afs_get_tree+0x1124/0x1460 fs/afs/super.c:598 vfs_get_tree+0x89/0x2f0 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x1387/0x2070 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 which is oopsing on this line: inode_lock(root->d_inode); presumably because sb->s_root was NULL. Fixes: 0da0b7fd73e4 ("afs: Display manually added cells in dynamic root mount") Reported-by: [email protected] Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-08-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds12-47/+45
Pull rdma fixes from Jason Gunthorpe: "One regression from 5.8 and a few bugs from earlier kernels: - Various spelling corrections in kernel prints - Bug fixes in hfi1 and bntx_re - Revert a 5.8 patch in hns - Batch update for Mellanox and Cumulus maintainers emails" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain Revert "RDMA/hns: Reserve one sge in order to avoid local length error" RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request RDMA/bnxt_re: Do not add user qps to flushlist RDMA/core: Fix spelling mistake "Could't" -> "Couldn't" RDMA/usnic: Fix spelling mistake "transistion" -> "transition" RDMA/hns: Fix spelling mistake "epmty" -> "empty"
2020-08-21Merge tag 'sound-5.9-rc2' of ↵Linus Torvalds23-260/+323
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "A collection of small fixes over several drivers, but all are driver- specific and nothing looks scary. Slightly large changes are seen in ASoC qcom driver for the bugs that were revealed by the recent ASoC core change to report the invalid register access errors. Also ASoC fsl got a slight intensive change for the distortion fix. Others are only trivial fixes or device-specific quirks" * tag 'sound-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (25 commits) ALSA: hda: avoid reset of sdo_limit ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion ALSA: usb-audio: ignore broken processing/extension unit ASoC: intel: Fix memleak in sst_media_open ASoC: wm8994: Avoid attempts to read unreadable registers ASoC: msm8916-wcd-analog: fix register Interrupt offset ASoC: wm8994: Prevent access to invalid VU register bits on WM1811 ALSA: hda/realtek: Add model alc298-samsung-headphone ALSA: usb-audio: Update documentation comment for MS2109 quirk ALSA: isa: fix spelling mistakes in the comments ALSA: usb-audio: Add capture support for Saffire 6 (USB 1.1) ALSA: hda/realtek: Add quirk for Samsung Galaxy Flex Book ASoC: q6routing: add dummy register read/write function ASoC: q6afe-dai: mark all widgets registers as SND_SOC_NOPM ASoC: Make soc_component_read() returning an error code again ASoC: amd: Replacing component->name with codec_dai->name. ASoC: fsl: Fix unused variable warning ASoC: tegra: tegra210_i2s: Fix compile warning with CONFIG_PM=n ASoC: tegra: tegra210_dmic: Fix compile warning with CONFIG_PM=n ASoC: tegra: tegra210_ahub: Fix compile warning with CONFIG_PM=n ...
2020-08-21Merge tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drmLinus Torvalds41-94/+317
Pull drm fixes from Dave Airlie: "Regular fixes pull for rc2. Usual rc2 doesn't seem too busy, mainly i915 and amdgpu. I'd expect the usual uptick for rc3. amdgpu: - Fix allocation size - SR-IOV fixes - Vega20 SMU feature state caching fix - Fix custom pptable handling - Arcturus golden settings update - Several display fixes - Fixes for Navy Flounder - Misc display fixes - RAS fix amdkfd: - SDMA fix for renoir i915: - Fix device parameter usage for selftest mock i915 device - Fix LPSP capability debugfs NULL dereference - Fix buddy register pagemask table - Fix intel_atomic_check() non-negative return value - Fix selftests passing a random 0 into ilog2() - Fix TGL power well enable/disable ordering - Switch to PMU module refcounting - GVT fixes virtio: - Add missing dma_fence_put() in virtio_gpu_execbuffer_ioctl() - Fix memory leak in virtio_gpu_cleanup_object()" * tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drm: (34 commits) Revert "drm/amdgpu: disable gfxoff for navy_flounder" drm/i915/tgl: Make sure TC-cold is blocked before enabling TC AUX power wells drm/i915/selftests: Avoid passing a random 0 into ilog2 drm/i915: Fix wrong return value in intel_atomic_check() drm/i915: Update bw_buddy pagemask table drm/i915/display: Check for an LPSP encoder before dereferencing drm/i915: Copy default modparams to mock i915_device drm/i915: Provide the perf pmu.module drm/amd/display: fix pow() crashing when given base 0 drm/amd/display: Reset scrambling on Test Pattern drm/amd/display: fix dcn3 wide timing dsc validation drm/amd/display: Fix DFPstate hang due to view port changed drm/amd/display: Assign correct left shift drm/amd/display: Call DMUB for eDP power control drm/amdkfd: fix the wrong sdma instance query for renoir drm/amdgpu: parse ta firmware for navy_flounder drm/amdgpu: fix NULL pointer access issue when unloading driver drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event() drm/amdgpu: disable gfxoff for navy_flounder drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal ...
2020-08-21mm, page_alloc: fix core hung in free_pcppages_bulk()Charan Teja Reddy1-0/+5
The following race is observed with the repeated online, offline and a delay between two successive online of memory blocks of movable zone. P1 P2 Online the first memory block in the movable zone. The pcp struct values are initialized to default values,i.e., pcp->high = 0 & pcp->batch = 1. Allocate the pages from the movable zone. Try to Online the second memory block in the movable zone thus it entered the online_pages() but yet to call zone_pcp_update(). This process is entered into the exit path thus it tries to release the order-0 pages to pcp lists through free_unref_page_commit(). As pcp->high = 0, pcp->count = 1 proceed to call the function free_pcppages_bulk(). Update the pcp values thus the new pcp values are like, say, pcp->high = 378, pcp->batch = 63. Read the pcp's batch value using READ_ONCE() and pass the same to free_pcppages_bulk(), pcp values passed here are, batch = 63, count = 1. Since num of pages in the pcp lists are less than ->batch, then it will stuck in while(list_empty(list)) loop with interrupts disabled thus a core hung. Avoid this by ensuring free_pcppages_bulk() is called with proper count of pcp list pages. The mentioned race is some what easily reproducible without [1] because pcp's are not updated for the first memory block online and thus there is a enough race window for P2 between alloc+free and pcp struct values update through onlining of second memory block. With [1], the race still exists but it is very narrow as we update the pcp struct values for the first memory block online itself. This is not limited to the movable zone, it could also happen in cases with the normal zone (e.g., hotplug to a node that only has DMA memory, or no other memory yet). [1]: https://patchwork.kernel.org/patch/11696389/ Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type") Signed-off-by: Charan Teja Reddy <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Vinayak Menon <[email protected]> Cc: <[email protected]> [2.6+] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-21mm: include CMA pages in lowmem_reserve at bootDoug Berger1-1/+1
The lowmem_reserve arrays provide a means of applying pressure against allocations from lower zones that were targeted at higher zones. Its values are a function of the number of pages managed by higher zones and are assigned by a call to the setup_per_zone_lowmem_reserve() function. The function is initially called at boot time by the function init_per_zone_wmark_min() and may be called later by accesses of the /proc/sys/vm/lowmem_reserve_ratio sysctl file. The function init_per_zone_wmark_min() was moved up from a module_init to a core_initcall to resolve a sequencing issue with khugepaged. Unfortunately this created a sequencing issue with CMA page accounting. The CMA pages are added to the managed page count of a zone when cma_init_reserved_areas() is called at boot also as a core_initcall. This makes it uncertain whether the CMA pages will be added to the managed page counts of their zones before or after the call to init_per_zone_wmark_min() as it becomes dependent on link order. With the current link order the pages are added to the managed count after the lowmem_reserve arrays are initialized at boot. This means the lowmem_reserve values at boot may be lower than the values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the ratio values are unchanged. In many cases the difference is not significant, but for example an ARM platform with 1GB of memory and the following memory layout cma: Reserved 256 MiB at 0x0000000030000000 Zone ranges: DMA [mem 0x0000000000000000-0x000000002fffffff] Normal empty HighMem [mem 0x0000000030000000-0x000000003fffffff] would result in 0 lowmem_reserve for the DMA zone. This would allow userspace to deplete the DMA zone easily. Funnily enough $ cat /proc/sys/vm/lowmem_reserve_ratio would fix up the situation because as a side effect it forces setup_per_zone_lowmem_reserve. This commit breaks the link order dependency by invoking init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages have the chance to be properly accounted in their zone(s) and allowing the lowmem_reserve arrays to receive consistent values. Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Jason Baron <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-21squashfs: avoid bio_alloc() failure with 1Mbyte blocksPhillip Lougher1-1/+5
This is a regression introduced by the patch "migrate from ll_rw_block usage to BIO". Bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a failure when reading 1 Mbyte block filesystems. The problem is a datablock can be fully (or almost uncompressed), requiring 256 pages, but, because blocks are not aligned to page boundaries, it may require 257 pages to read. Bio_kmalloc() can handle 1024 pages, and so use this for the edge condition. Fixes: 93e72b3c612a ("squashfs: migrate from ll_rw_block usage to BIO") Reported-by: Nicolas Prochazka <[email protected]> Reported-by: Tomoatsu Shimada <[email protected]> Signed-off-by: Phillip Lougher <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Guenter Roeck <[email protected]> Cc: Philippe Liard <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Adrien Schildknecht <[email protected]> Cc: Daniel Rosenberg <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-08-21uprobes: __replace_page() avoid BUG in munlock_vma_page()Hugh Dickins1-1/+1
syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when called from uprobes __replace_page(). Which of many ways to fix it? Settled on not calling when PageCompound (since Head and Tail are equals in this context, PageCompound the usual check in uprobes.c, and the prior use of FOLL_SPLIT_PMD will have cleared PageMlocked already). Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Reported-by: syzbot <[email protected]> Signed-off-by: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: <[email protected]> [5.4+] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>