aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-02-01mm, memory_hotplug: __offline_pages fix wrong lockingMichal Hocko1-2/+0
Jan has noticed that we do double unlock on some failure paths when offlining a page range. This is indeed the case when test_pages_in_a_zone respp. start_isolate_page_range fail. This was an omission when forward porting the debugging patch from an older kernel. Fix the issue by dropping mem_hotplug_done from the failure condition and keeping the single unlock in the catch all failure path. Link: http://lkml.kernel.org/r/[email protected] Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure") Signed-off-by: Michal Hocko <[email protected]> Reported-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Tested-by: Jan Kara <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm: hwpoison: use do_send_sig_info() instead of force_sig()Naoya Horiguchi1-1/+2
Currently memory_failure() is racy against process's exiting, which results in kernel crash by null pointer dereference. The root cause is that memory_failure() uses force_sig() to forcibly kill asynchronous (meaning not in the current context) processes. As discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM fixes, this is not a right thing to do. OOM solves this issue by using do_send_sig_info() as done in commit d2d393099de2 ("signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this patch is suggesting to do the same for hwpoison. do_send_sig_info() properly accesses to siglock with lock_task_sighand(), so is free from the reported race. I confirmed that the reported bug reproduces with inserting some delay in kill_procs(), and it never reproduces with this patch. Note that memory_failure() can send another type of signal using force_sig_mceerr(), and the reported race shouldn't happen on it because force_sig_mceerr() is called only for synchronous processes (i.e. BUS_MCEERR_AR happens only when some process accesses to the corrupted memory.) Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Naoya Horiguchi <[email protected]> Reported-by: Jane Chu <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: William Kucharski <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01kasan: mark file common so ftrace doesn't trace itAnders Roxell1-0/+1
When option CONFIG_KASAN is enabled toghether with ftrace, function ftrace_graph_caller() gets in to a recursion, via functions kasan_check_read() and kasan_check_write(). Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 179 mcount_get_pc x0 // function's pc (gdb) bt #0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179 #1 0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151 #2 0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105 #3 0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71 #4 atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284 #5 trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441 #6 0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741 #7 0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196 #8 0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231 #9 0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182 Backtrace stopped: previous frame identical to this frame (corrupt stack?) (gdb) Rework so that the kasan implementation isn't traced. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Anders Roxell <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Tested-by: Dmitry Vyukov <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01init/Kconfig: fix grammar by moving a closing parenthesisJonathan Neuschäfer1-1/+1
Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jonathan Neuschäfer <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01lib/test_kmod.c: potential double free in error handlingDan Carpenter1-1/+1
There is a copy and paste bug so we set "config->test_driver" to NULL twice instead of setting "config->test_fs". Smatch complains that it leads to a double free: lib/test_kmod.c:840 __kmod_config_init() warn: 'config->test_fs' double freed Link: http://lkml.kernel.org/r/20190121140011.GA14283@kadam Fixes: d9c6a72d6fa2 ("kmod: add test driver to stress test the module loader") Signed-off-by: Dan Carpenter <[email protected]> Acked-by: Luis Chamberlain <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm, oom: fix use-after-free in oom_kill_processShakeel Butt1-0/+8
Syzbot instance running on upstream kernel found a use-after-free bug in oom_kill_process. On further inspection it seems like the process selected to be oom-killed has exited even before reaching read_lock(&tasklist_lock) in oom_kill_process(). More specifically the tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task() and the put_task_struct within for_each_thread() frees the tsk and for_each_thread() tries to access the tsk. The easiest fix is to do get/put across the for_each_thread() on the selected task. Now the next question is should we continue with the oom-kill as the previously selected task has exited? However before adding more complexity and heuristics, let's answer why we even look at the children of oom-kill selected task? The select_bad_process() has already selected the worst process in the system/memcg. Due to race, the selected process might not be the worst at the kill time but does that matter? The userspace can use the oom_score_adj interface to prefer children to be killed before the parent. I looked at the history but it seems like this is there before git history. Link: http://lkml.kernel.org/r/[email protected] Reported-by: [email protected] Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock") Signed-off-by: Shakeel Butt <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm/hotplug: invalid PFNs from pfn_to_online_page()Qian Cai1-8/+10
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1] with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is not directly mapped (MEMBLOCK_NOMAP). Hence, the page->flags is uninitialized. This is due to the commit 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") starts to use pfn_to_online_page() instead of pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case, pfn_to_online_page() does not call memblock_is_map_memory() while pfn_valid() does. Historically, the commit 68709f45385a ("arm64: only consider memblocks with NOMAP cleared for linear mapping") causes pages marked as nomap being no long reassigned to the new zone in memmap_init_zone() by calling __init_single_page(). Since the commit 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes") introduced pfn_to_online_page() and was designed to return a valid pfn only, but it is clearly broken on arm64. Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can handle nomap thanks to the commit f52bb98f5ade ("arm64: mm: always enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on architectures where have no HOLES_IN_ZONE. [1] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006 Mem abort info: ESR = 0x96000005 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000005 CM = 0, WnR = 0 Internal error: Oops: 96000005 [#1] SMP CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8 pstate: 60400009 (nZCv daif +PAN -UAO) pc : page_mapping+0x24/0x144 lr : __dump_page+0x34/0x3dc sp : ffff00003a5cfd10 x29: ffff00003a5cfd10 x28: 000000000000802f x27: 0000000000000000 x26: 0000000000277d00 x25: ffff000010791f56 x24: ffff7fe000000000 x23: ffff000010772f8b x22: ffff00001125f670 x21: ffff000011311000 x20: ffff000010772f8b x19: fffffffffffffffe x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: ffff802698b19600 x13: ffff802698b1a200 x12: ffff802698b16f00 x11: ffff802698b1a400 x10: 0000000000001400 x9 : 0000000000000001 x8 : ffff00001121a000 x7 : 0000000000000000 x6 : ffff0000102c53b8 x5 : 0000000000000000 x4 : 0000000000000003 x3 : 0000000000000100 x2 : 0000000000000000 x1 : ffff000010772f8b x0 : ffffffffffffffff Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____)) Call trace: page_mapping+0x24/0x144 __dump_page+0x34/0x3dc dump_page+0x28/0x4c kmemleak_scan+0x4ac/0x680 kmemleak_scan_thread+0xb4/0xdc kthread+0x12c/0x13c ret_from_fork+0x10/0x18 Code: d503201f f9400660 36000040 d1000413 (f9400661) ---[ end trace 4d4bd7f573490c8e ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x002,20000c38 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/[email protected] Fixes: 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning") Signed-off-by: Qian Cai <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepagesOscar Salvador1-16/+20
This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm, page_alloc: fix has_unmovable_pages for HugePages"). Gigantic hugepages cross several memblocks, so it can be that the page we get in scan_movable_pages() is a page-tail belonging to a 1G-hugepage. If that happens, page_hstate()->size_to_hstate() will return NULL, and we will blow up in hugepage_migration_supported(). The splat is as follows: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 #PF error: [normal kernel read fault] PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1350 Comm: bash Tainted: G E 5.0.0-rc1-mm1-1-default+ #27 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 RIP: 0010:__offline_pages+0x6ae/0x900 Call Trace: memory_subsys_offline+0x42/0x60 device_offline+0x80/0xa0 state_store+0xab/0xc0 kernfs_fop_write+0x102/0x180 __vfs_write+0x26/0x190 vfs_write+0xad/0x1b0 ksys_write+0x42/0x90 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) bochs_drm(E) ttm(E) aesni_intel(E) drm_kms_helper(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) drm(E) virtio_net(E) syscopyarea(E) sysfillrect(E) net_failover(E) sysimgblt(E) pcspkr(E) failover(E) i2c_piix4(E) fb_sys_fops(E) parport_pc(E) parport(E) button(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) libata(E) crc32c_intel(E) serio_raw(E) virtio_pci(E) virtio_ring(E) virtio(E) sg(E) scsi_mod(E) autofs4(E) [[email protected]: fix brace layout, per David. Reduce indentation] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Reviewed-by: Anthony Yznaga <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01psi: fix aggregation idle shut-offJohannes Weiner3-5/+45
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good. Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!) Fix this by exempting aggregation workers from psi's clock waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if in psi we see a worker going to sleep after aggregating psi data, we will not reschedule the aggregation work item. What if the worker is also executing other items before or after? Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity. If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight. Link: http://lkml.kernel.org/r/[email protected] Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Suren Baghdasaryan <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zoneMikhail Zaslonko1-0/+3
If memory end is not aligned with the sparse memory section boundary, the mapping of such a section is only partly initialized. This may lead to VM_BUG_ON due to uninitialized struct pages access from test_pages_in_a_zone() function triggered by memory_hotplug sysfs handlers. Here are the the panic examples: CONFIG_DEBUG_VM_PGFLAGS=y kernel parameter mem=2050M -------------------------- page:000003d082008000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: test_pages_in_a_zone+0xde/0x160 show_valid_zones+0x5c/0x190 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: test_pages_in_a_zone+0xde/0x160 Kernel panic - not syncing: Fatal exception: panic_on_oops Fix this by checking whether the pfn to check is within the zone. [[email protected]: separated this change from http://lkml.kernel.org/r/[email protected]] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: separated this change from http://lkml.kernel.org/r/[email protected]] Signed-off-by: Michal Hocko <[email protected]> Signed-off-by: Mikhail Zaslonko <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Tested-by: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mikhail Gavrilov <[email protected]> Cc: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm, memory_hotplug: is_mem_section_removable do not pass the end of a zoneMichal Hocko1-1/+2
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2. Mikhail Zaslonko has posted fixes for the two bugs quite some time ago [1]. I have pushed back on those fixes because I believed that it is much better to plug the problem at the initialization time rather than play whack-a-mole all over the hotplug code and find all the places which expect the full memory section to be initialized. We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section") merged and cause a regression [2][3]. The reason is that there might be memory layouts when two NUMA nodes share the same memory section so the merged fix is simply incorrect. In order to plug this hole we really have to be zone range aware in those handlers. I have split up the original patch into two. One is unchanged (patch 2) and I took a different approach for `removable' crash. [1] http://lkml.kernel.org/r/[email protected] [2] https://bugzilla.redhat.com/show_bug.cgi?id=1666948 [3] http://lkml.kernel.org/r/[email protected] This patch (of 2): Mikhail has reported the following VM_BUG_ON triggered when reading sysfs removable state of a memory block: page:000003d08300c000 is uninitialized and poisoned page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p)) Call Trace: is_mem_section_removable+0xb4/0x190 show_mem_removable+0x9a/0xd8 dev_attr_show+0x34/0x70 sysfs_kf_seq_show+0xc8/0x148 seq_read+0x204/0x480 __vfs_read+0x32/0x178 vfs_read+0x82/0x138 ksys_read+0x5a/0xb0 system_call+0xdc/0x2d8 Last Breaking-Event-Address: is_mem_section_removable+0xb4/0x190 Kernel panic - not syncing: Fatal exception: panic_on_oops The reason is that the memory block spans the zone boundary and we are stumbling over an unitialized struct page. Fix this by enforcing zone range in is_mem_section_removable so that we never run away from a zone. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Mikhail Zaslonko <[email protected]> Debugged-by: Mikhail Zaslonko <[email protected]> Tested-by: Gerald Schaefer <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01oom, oom_reaper: do not enqueue same task twiceTetsuo Handa2-2/+3
Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <[email protected]> Reported-by: Arkadiusz Miskiewicz <[email protected]> Tested-by: Arkadiusz Miskiewicz <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Jay Kamat <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm: migrate: make buffer_migrate_page_norefs() actually succeedJan Kara1-5/+0
Currently, buffer_migrate_page_norefs() was constantly failing because buffer_migrate_lock_buffers() grabbed reference on each buffer. In fact, there's no reason for buffer_migrate_lock_buffers() to grab any buffer references as the page is locked during all our operation and thus nobody can reclaim buffers from the page. So remove grabbing of buffer references which also makes buffer_migrate_page_norefs() succeed. Link: http://lkml.kernel.org/r/[email protected] Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()" Signed-off-by: Jan Kara <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Zi Yan <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01kernel/exit.c: release ptraced tasks before zap_pid_ns_processesAndrei Vagin1-2/+10
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then zap_pid_ns_processes() waits on all tasks in a current pidns, and only then are tasks from the dead list released. zap_pid_ns_processes() can get stuck on waiting tasks from the dead list. In this case, we will have one unkillable process with one or more dead children. Thanks to Oleg for the advice to release tasks in find_child_reaper(). Link: http://lkml.kernel.org/r/[email protected] Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01x86_64: increase stack size for KASAN_EXTRAQian Cai1-0/+4
If the kernel is configured with KASAN_EXTRA, the stack size is increasted significantly because this option sets "-fstack-reuse" to "none" in GCC [1]. As a result, it triggers stack overrun quite often with 32k stack size compiled using GCC 8. For example, this reproducer https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/madvise/madvise06.c triggers a "corrupted stack end detected inside scheduler" very reliably with CONFIG_SCHED_STACK_END_CHECK enabled. There are just too many functions that could have a large stack with KASAN_EXTRA due to large local variables that have been called over and over again without being able to reuse the stacks. Some noticiable ones are size 7648 shrink_page_list 3584 xfs_rmap_convert 3312 migrate_page_move_mapping 3312 dev_ethtool 3200 migrate_misplaced_transhuge_page 3168 copy_process There are other 49 functions are over 2k in size while compiling kernel with "-Wframe-larger-than=" even with a related minimal config on this machine. Hence, it is too much work to change Makefiles for each object to compile without "-fsanitize-address-use-after-scope" individually. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23 Although there is a patch in GCC 9 to help the situation, GCC 9 probably won't be released in a few months and then it probably take another 6-month to 1-year for all major distros to include it as a default. Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020 when GCC 9 is everywhere. Until then, this patch will help users avoid stack overrun. This has already been fixed for arm64 for the same reason via 6e8830674ea ("arm64: kasan: Increase stack size for KASAN_EXTRA"). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Qian Cai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAITAndrea Arcangeli1-1/+2
hugetlb needs the same fix as faultin_nopage (which was applied in commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't in the FOLL_NOWAIT case. Link: http://lkml.kernel.org/r/[email protected] Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: "Dr. David Alan Gilbert" <[email protected]> Reported-by: "Dr. David Alan Gilbert" <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01arch: unexport asm/shmparam.h for all architecturesMasahiro Yamada14-7/+7
Most architectures do not export shmparam.h to user-space. $ find arch -name shmparam.h | sort arch/alpha/include/asm/shmparam.h arch/arc/include/asm/shmparam.h arch/arm64/include/asm/shmparam.h arch/arm/include/asm/shmparam.h arch/csky/include/asm/shmparam.h arch/ia64/include/asm/shmparam.h arch/mips/include/asm/shmparam.h arch/nds32/include/asm/shmparam.h arch/nios2/include/asm/shmparam.h arch/parisc/include/asm/shmparam.h arch/powerpc/include/asm/shmparam.h arch/s390/include/asm/shmparam.h arch/sh/include/asm/shmparam.h arch/sparc/include/asm/shmparam.h arch/x86/include/asm/shmparam.h arch/xtensa/include/asm/shmparam.h Strangely, some users of the asm-generic wrapper export shmparam.h $ git grep 'generic-y += shmparam.h' arch/c6x/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/h8300/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/hexagon/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/m68k/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/microblaze/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/openrisc/include/uapi/asm/Kbuild:generic-y += shmparam.h arch/riscv/include/asm/Kbuild:generic-y += shmparam.h arch/unicore32/include/uapi/asm/Kbuild:generic-y += shmparam.h The newly added riscv correctly creates the asm-generic wrapper in the kernel space, but the others (c6x, h8300, hexagon, m68k, microblaze, openrisc, unicore32) create the one in the uapi directory. Digging into the git history, now I guess fcc8487d477a ("uapi: export all headers under uapi directories") was the misconversion. Prior to that commit, no architecture exported to shmparam.h As its commit description said, that commit exported shmparam.h for c6x, h8300, hexagon, m68k, openrisc, unicore32. 83f0124ad81e ("microblaze: remove asm-generic wrapper headers") accidentally exported shmparam.h for microblaze. This commit unexports shmparam.h for those architectures. There is no more reason to export include/uapi/asm-generic/shmparam.h, so it has been moved to include/asm-generic/shmparam.h Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Stafford Horne <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Michal Simek <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Richard Kuo <[email protected]> Cc: Guan Xuetao <[email protected]> Cc: Nicolas Dichtel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Aurelien Jacquiot <[email protected]> Cc: Greentime Hu <[email protected]> Cc: Guo Ren <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Mark Salter <[email protected]> Cc: Albert Ou <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Vincent Chen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01proc: fix /proc/net/* after setns(2)Alexey Dobriyan6-1/+155
/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace. [[email protected]: coding-style fixes] [[email protected]: avoid vim miscolorization] [[email protected]: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time] Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2 Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2 Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <[email protected]> Reported-by: Mateusz Stępień <[email protected]> Reported-by: Ahmad Fatoum <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01mm, memory_hotplug: don't bail out in do_migrate_range() prematurelyOscar Salvador1-16/+2
do_migrate_range() takes a memory range and tries to isolate the pages to put them into a list. This list will be later on used in migrate_pages() to know the pages we need to migrate. Currently, if we fail to isolate a single page, we put all already isolated pages back to their LRU and we bail out from the function. This is quite suboptimal, as this will force us to start over again because scan_movable_pages will give us the same range. If there is no chance that we can isolate that page, we will loop here forever. Issue debugged in [1] has proved that. During the debugging of that issue, it was noticed that if do_migrate_ranges() fails to isolate a single page, we will just discard the work we have done so far and bail out, which means that scan_movable_pages() will find again the same set of pages. Instead, we can just skip the error, keep isolating as much pages as possible and then proceed with the call to migrate_pages(). This will allow us to do as much work as possible at once. [1] https://lkml.org/lkml/2018/12/6/324 Michal said: : I still think that this doesn't give us a whole picture. Looping for : ever is a bug. Failing the isolation is quite possible and it should : be a ephemeral condition (e.g. a race with freeing the page or : somebody else isolating the page for whatever reason). And here comes : the disadvantage of the current implementation. We simply throw : everything on the floor just because of a ephemeral condition. The : racy page_count check is quite dubious to prevent from that. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: William Kucharski <[email protected]> Cc: Pavel Tatashin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller16-67/+127
Alexei Starovoitov says: ==================== pull-request: bpf 2019-01-31 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) disable preemption in sender side of socket filters, from Alexei. 2) fix two potential deadlocks in syscall bpf lookup and prog_register, from Martin and Alexei. 3) fix BTF to allow typedef on func_proto, from Yonghong. 4) two bpftool fixes, from Jiri and Paolo. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-02-01dccp: fool proof ccid_hc_[rt]x_parse_options()Eric Dumazet1-2/+2
Similarly to commit 276bdb82dedb ("dccp: check ccid before dereferencing") it is wise to test for a NULL ccid. kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3+ #37 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:ccid_hc_tx_parse_options net/dccp/ccid.h:205 [inline] RIP: 0010:dccp_parse_options+0x8d9/0x12b0 net/dccp/options.c:233 Code: c5 0f b6 75 b3 80 38 00 0f 85 d6 08 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 b8 4c 8b b8 f8 07 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 95 08 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b kobject: 'loop5' (0000000080f78fc1): kobject_uevent_env RSP: 0018:ffff8880a94df0b8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880858ac723 RCX: dffffc0000000000 RDX: 0000000000000100 RSI: 0000000000000007 RDI: 0000000000000001 RBP: ffff8880a94df140 R08: 0000000000000001 R09: ffff888061b83a80 R10: ffffed100c370752 R11: ffff888061b83a97 R12: 0000000000000026 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0defa33518 CR3: 000000008db5e000 CR4: 00000000001406e0 kobject: 'loop5' (0000000080f78fc1): fill_kobj_path: path = '/devices/virtual/block/loop5' DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: dccp_rcv_state_process+0x2b6/0x1af6 net/dccp/input.c:654 dccp_v4_do_rcv+0x100/0x190 net/dccp/ipv4.c:688 sk_backlog_rcv include/net/sock.h:936 [inline] __sk_receive_skb+0x3a9/0xea0 net/core/sock.c:473 dccp_v4_rcv+0x10cb/0x1f80 net/dccp/ipv4.c:880 ip_protocol_deliver_rcu+0xb6/0xa20 net/ipv4/ip_input.c:208 ip_local_deliver_finish+0x23b/0x390 net/ipv4/ip_input.c:234 NF_HOOK include/linux/netfilter.h:289 [inline] NF_HOOK include/linux/netfilter.h:283 [inline] ip_local_deliver+0x1f0/0x740 net/ipv4/ip_input.c:255 dst_input include/net/dst.h:450 [inline] ip_rcv_finish+0x1f4/0x2f0 net/ipv4/ip_input.c:414 NF_HOOK include/linux/netfilter.h:289 [inline] NF_HOOK include/linux/netfilter.h:283 [inline] ip_rcv+0xed/0x620 net/ipv4/ip_input.c:524 __netif_receive_skb_one_core+0x160/0x210 net/core/dev.c:4973 __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5083 process_backlog+0x206/0x750 net/core/dev.c:5923 napi_poll net/core/dev.c:6346 [inline] net_rx_action+0x76d/0x1930 net/core/dev.c:6412 __do_softirq+0x30b/0xb11 kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:654 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646 smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164 kthread+0x357/0x430 kernel/kthread.c:246 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 Modules linked in: ---[ end trace 58a0ba03bea2c376 ]--- RIP: 0010:ccid_hc_tx_parse_options net/dccp/ccid.h:205 [inline] RIP: 0010:dccp_parse_options+0x8d9/0x12b0 net/dccp/options.c:233 Code: c5 0f b6 75 b3 80 38 00 0f 85 d6 08 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 b8 4c 8b b8 f8 07 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 95 08 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b RSP: 0018:ffff8880a94df0b8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8880858ac723 RCX: dffffc0000000000 RDX: 0000000000000100 RSI: 0000000000000007 RDI: 0000000000000001 RBP: ffff8880a94df140 R08: 0000000000000001 R09: ffff888061b83a80 R10: ffffed100c370752 R11: ffff888061b83a97 R12: 0000000000000026 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0defa33518 CR3: 0000000009871000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Cc: Gerrit Renker <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01Merge branch 'smc-fixes'David S. Miller8-21/+34
Ursula Braun says: ==================== net/smc: fixes 2019-01-30 here are some fixes in different areas of the smc code for the net tree. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: fix use of variable in cleared areaKarsten Graul1-4/+4
Do not use pend->idx as index for the arrays because its value is located in the cleared area. Use the existing local variable instead. Without this fix the wrong area might be cleared. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: use device link provided in qp_contextKarsten Graul1-3/+3
The device field of the IB event structure does not always point to the SMC IB device. Load the pointer from the qp_context which is always provided to smc_ib_qp_event_handler() in the priv field. And for qp events the affected port is given in the qp structure of the ibevent, derive it from there. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: call smc_cdc_msg_send() under send_lockKarsten Graul1-1/+4
Call smc_cdc_msg_send() under the connection send_lock to make sure all send operations for one connection are serialized. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: do not wait under send_lockKarsten Graul1-6/+4
smc_cdc_get_free_slot() might wait for free transfer buffers when using SMC-R. This wait should not be done under the send_lock, which is a spin_lock. This fixes a cpu loop in parallel threads waiting for the send_lock. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: recvmsg and splice_read should return 0 after shutdownKarsten Graul1-1/+10
When a socket was connected and is now shut down for read, return 0 to indicate end of data in recvmsg and splice_read (like TCP) and do not return ENOTCONN. This behavior is required by the socket api. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: don't wait for send buffer space when data was already sentKarsten Graul1-4/+3
When there is no more send buffer space and at least 1 byte was already sent then return to user space. The wait is only done when no data was sent by the sendmsg() call. This fixes smc_tx_sendmsg() which tried to always send all user data and started to wait for free send buffer space when needed. During this wait the user space program was blocked in the sendmsg() call and hence not able to receive incoming data. When both sides were in such a situation then the connection stalled forever. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: prevent races between smc_lgr_terminate() and smc_conn_free()Karsten Graul1-0/+4
To prevent races between smc_lgr_terminate() and smc_conn_free() add an extra check of the lgr field before accessing it, and cancel a delayed free_work when a new smc connection is created. This fixes the problem that free_work cleared the lgr variable but smc_lgr_terminate() or smc_conn_free() still access it in parallel. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: allow 16 byte pnetids in netlink policyHans Wippel1-1/+1
Currently, users can only send pnetids with a maximum length of 15 bytes over the SMC netlink interface although the maximum pnetid length is 16 bytes. This patch changes the SMC netlink policy to accept 16 byte pnetids. Signed-off-by: Hans Wippel <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01net/smc: fix another sizeof to int comparisonUrsula Braun1-1/+1
Comparing an int to a size, which is unsigned, causes the int to become unsigned, giving the wrong result. kernel_sendmsg can return a negative error code. Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01enic: fix checksum validation for IPv6Govindarajulu Varadarajan1-1/+2
In case of IPv6 pkts, ipv4_csum_ok is 0. Because of this, driver does not set skb->ip_summed. So IPv6 rx checksum is not offloaded. Signed-off-by: Govindarajulu Varadarajan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01sctp: walk the list of asoc safelyGreg Kroah-Hartman1-2/+2
In sctp_sendmesg(), when walking the list of endpoint associations, the association can be dropped from the list, making the list corrupt. Properly handle this by using list_for_each_entry_safe() Fixes: 4910280503f3 ("sctp: add support for snd flag SCTP_SENDALL process in sendmsg") Reported-by: Secunia Research <[email protected]> Tested-by: Secunia Research <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Acked-by: Neil Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds18-42/+97
Pull rdma fixes from Jason Gunthorpe: "Still not much going on, the usual set of oops and driver fixes this time: - Fix two uapi breakage regressions in mlx5 drivers - Various oops fixes in hfi1, mlx4, umem, uverbs, and ipoib - A protocol bug fix for hfi1 preventing it from implementing the verbs API properly, and a compatability fix for EXEC STACK user programs - Fix missed refcounting in the 'advise_mr' patches merged this cycle. - Fix wrong use of the uABI in the hns SRQ patches merged this cycle" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start IB/uverbs: Fix ioctl query port to consider device disassociation RDMA/mlx5: Fix flow creation on representors IB/uverbs: Fix OOPs upon device disassociation RDMA/umem: Add missing initialization of owning_mm RDMA/hns: Update the kernel header file of hns IB/mlx5: Fix how advise_mr() launches async work RDMA/device: Expose ib_device_try_get(() IB/hfi1: Add limit test for RC/UC send via loopback IB/hfi1: Remove overly conservative VM_EXEC flag check IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV RDMA/mlx5: Fix check for supported user flags when creating a QP
2019-02-01Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-7/+30
Pull iomap fixes from Darrick Wong: "A couple of iomap fixes to eliminate some memory corruption and hang problems that were reported: - fix page migration when using iomap for pagecache management - fix a use-after-free bug in the directio code" * tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fix a use after free in iomap_dio_rw iomap: get/put the page in iomap_page_create/release()
2019-02-01Merge tag 'pm-5.0-rc5' of ↵Linus Torvalds3-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a PM-runtime framework regression introduced by the recent switch-over of device autosuspend to hrtimers and a mistake in the "poll idle state" code introduced by a recent change in it. Specifics: - Since ktime_get() turns out to be problematic for device autosuspend in the PM-runtime framework, make it use ktime_get_mono_fast_ns() instead (Vincent Guittot). - Fix an initial value of a local variable in the "poll idle state" code that makes it behave not exactly as expected when all idle states except for the "polling" one are disabled (Doug Smythies)" * tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpuidle: poll_state: Fix default time limit PM-runtime: Fix deadlock with ktime_get()
2019-02-01Merge tag 'acpi-5.0-rc5' of ↵Linus Torvalds2-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI Kconfig fixes from Rafael Wysocki: "Prevent invalid configurations from being created (e.g. by randconfig) due to some ACPI-related Kconfig options' dependencies that are not specified directly (Sinan Kaya)" * tag 'acpi-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: platform/x86: Fix unmet dependency warning for SAMSUNG_Q10 platform/x86: Fix unmet dependency warning for ACPI_CMPC mfd: Fix unmet dependency warning for MFD_TPS68470
2019-02-01Merge tag 'mmc-v5.0-rc4' of ↵Linus Torvalds2-1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc Pull MMC host fixes from Ulf Hansson: - mediatek: Fix incorrect register write for tunings - bcm2835: Fixup leakage of DMA channel on probe errors * tag 'mmc-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: mmc: mediatek: fix incorrect register setting of hs400_cmd_int_delay mmc: bcm2835: Fix DMA channel leak on probe error
2019-02-01Merge tag 'batadv-net-for-davem-20190201' of git://git.open-mesh.org/linux-mergeDavid S. Miller3-2/+8
Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - Avoid WARN to report incorrect configuration, by Sven Eckelmann - Fix mac header position setting, by Sven Eckelmann - Fix releasing station statistics, by Felix Fietkau ==================== Signed-off-by: David S. Miller <[email protected]>
2019-02-01Merge tag 'i3c/fixes-for-5.0-rc5' of ↵Linus Torvalds2-7/+13
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux Pull i3c fixes from Boris Brezillon: - Fix a deadlock in the designware driver - Fix the error path in i3c_master_add_i3c_dev_locked() * tag 'i3c/fixes-for-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux: i3c: master: dw: fix deadlock i3c: fix missing detach if failed to retrieve i3c dev
2019-02-01Merge tag 'mac80211-for-davem-2019-02-01' of ↵David S. Miller4-4/+14
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two more fixes: * sometimes, not enough tailroom was allocated for software-encrypted management frames in mac80211 * cfg80211 regulatory restore got an additional condition, needs to rerun the checks after that condition changes ==================== Signed-off-by: David S. Miller <[email protected]>
2019-02-01skge: potential memory corruption in skge_get_regs()Dan Carpenter1-2/+4
The "p" buffer is 0x4000 bytes long. B3_RI_WTO_R1 is 0x190. The value of "regs->len" is in the 1-0x4000 range. The bug here is that "regs->len - B3_RI_WTO_R1" can be a negative value which would lead to memory corruption and an abrupt crash. Fixes: c3f8be961808 ("[PATCH] skge: expand ethtool debug register dump") Signed-off-by: Dan Carpenter <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-02-01x86/kexec: Don't setup EFI info if EFI runtime is not enabledKairui Song1-0/+3
Kexec-ing a kernel with "efi=noruntime" on the first kernel's command line causes the following null pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 #PF error: [normal kernel read fault] Call Trace: efi_runtime_map_copy+0x28/0x30 bzImage64_load+0x688/0x872 arch_kexec_kernel_image_load+0x6d/0x70 kimage_file_alloc_init+0x13e/0x220 __x64_sys_kexec_file_load+0x144/0x290 do_syscall_64+0x55/0x1a0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Just skip the EFI info setup if EFI runtime services are not enabled. [ bp: Massage commit message. ] Suggested-by: Dave Young <[email protected]> Signed-off-by: Kairui Song <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dave Young <[email protected]> Cc: AKASHI Takahiro <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: [email protected] Cc: David Howells <[email protected]> Cc: [email protected] Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Philipp Rudo <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Thomas Gleixner <[email protected]> Cc: x86-ml <[email protected]> Cc: Yannik Sembritzki <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-02-01x86: explicitly align IO accesses in memcpy_{to,from}ioLinus Torvalds1-3/+30
In commit 170d13ca3a2f ("x86: re-introduce non-generic memcpy_{to,from}io") I made our copy from IO space use a separate copy routine rather than rely on the generic memcpy. I did that because our generic memory copy isn't actually well-defined when it comes to internal access ordering or alignment, and will in fact depend on various CPUID flags. In particular, the default memcpy() for a modern Intel CPU will generally be just a "rep movsb", which works reasonably well for medium-sized memory copies of regular RAM, since the CPU will turn it into fairly optimized microcode. However, for non-cached memory and IO, "rep movs" ends up being horrendously slow and will just do the architectural "one byte at a time" accesses implied by the movsb. At the other end of the spectrum, if you _don't_ end up using the "rep movsb" code, you'd likely fall back to the software copy, which does overlapping accesses for the tail, and may copy things backwards. Again, for regular memory that's fine, for IO memory not so much. The thinking was that clearly nobody really cared (because things worked), but some people had seen horrible performance due to the byte accesses, so let's just revert back to our long ago version that dod "rep movsl" for the bulk of the copy, and then fixed up the potentially last few bytes of the tail with "movsw/b". Interestingly (and perhaps not entirely surprisingly), while that was our original memory copy implementation, and had been used before for IO, in the meantime many new users of memcpy_*io() had come about. And while the access patterns for the memory copy weren't well-defined (so arguably _any_ access pattern should work), in practice the "rep movsb" case had been very common for the last several years. In particular Jarkko Sakkinen reported that the memcpy_*io() change resuled in weird errors from his Geminilake NUC TPM module. And it turns out that the TPM TCG accesses according to spec require that the accesses be (a) done strictly sequentially (b) be naturally aligned otherwise the TPM chip will abort the PCI transaction. And, in fact, the tpm_crb.c driver did this: memcpy_fromio(buf, priv->rsp, 6); ... memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6); which really should never have worked in the first place, but back before commit 170d13ca3a2f it *happened* to work, because the memcpy_fromio() would be expanded to a regular memcpy, and (a) gcc would expand the first memcpy in-line, and turn it into a 4-byte and a 2-byte read, and they happened to be in the right order, and the alignment was right. (b) gcc would call "memcpy()" for the second one, and the machines that had this TPM chip also apparently ended up always having ERMS ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep movbs" for that copy. In other words, basically by pure luck, the code happened to use the right access sizes in the (two different!) memcpy() implementations to make it all work. But after commit 170d13ca3a2f, both of the memcpy_fromio() calls resulted in a call to the routine with the consistent memory accesses, and in both cases it started out transferring with 4-byte accesses. Which worked for the first copy, but resulted in the second copy doing a 32-bit read at an address that was only 2-byte aligned. Jarkko is actually fixing the fragile code in the TPM driver, but since this is an excellent example of why we absolutely must not use a generic memcpy for IO accesses, _and_ an IO-specific one really should strive to align the IO accesses, let's do exactly that. Side note: Jarkko also noted that the driver had been used on ARM platforms, and had worked. That was because on 32-bit ARM, memcpy_*io() ends up always doing byte accesses, and on 64-bit ARM it first does byte accesses to align to 8-byte boundaries, and then does 8-byte accesses for the bulk. So ARM actually worked by design, and the x86 case worked by pure luck. We *might* want to make x86-64 do the 8-byte case too. That should be a pretty straightforward extension, but let's do one thing at a time. And generally MMIO accesses aren't really all that performance-critical, as shown by the fact that for a long time we just did them a byte at a time, and very few people ever noticed. Reported-and-tested-by: Jarkko Sakkinen <[email protected]> Tested-by: Jerry Snitselaar <[email protected]> Cc: David Laight <[email protected]> Fixes: 170d13ca3a2f ("x86: re-introduce non-generic memcpy_{to,from}io") Signed-off-by: Linus Torvalds <[email protected]>
2019-02-01apparmor: Fix aa_label_build() error handling for failed mergesJohn Johansen1-1/+4
aa_label_merge() can return NULL for memory allocations failures make sure to handle and set the correct error in this case. Reported-by: Peng Hao <[email protected]> Signed-off-by: John Johansen <[email protected]>
2019-02-01mic: vop: Fix crash on removeVincent Whitchurch1-3/+6
The remove path contains a hack which depends on internal structures in other source files, similar to the one which was recently removed from the registration path. Since commit 1ce9e6055fa0 ("virtio_ring: introduce packed ring support"), this leads to a crash when vop devices are removed. The structure in question is only examined to get the virtual address of the allocated used page. Store that pointer locally instead to fix the crash. Fixes: 1ce9e6055fa0 ("virtio_ring: introduce packed ring support") Signed-off-by: Vincent Whitchurch <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-02-01mic: vop: Fix use-after-free on removeVincent Whitchurch1-1/+3
KASAN detects a use-after-free when vop devices are removed. This problem was introduced by commit 0063e8bbd2b62d136 ("virtio_vop: don't kfree device on register failure"). That patch moved the freeing of the struct _vop_vdev to the release function, but failed to ensure that vop holds a reference to the device when it doesn't want it to go away. A kfree() was replaced with a put_device() in the unregistration path, but the last reference to the device is already dropped in unregister_virtio_device() so the struct is freed before vop is done with it. Fix it by holding a reference until cleanup is done. This is similar to the fix in virtio_pci in commit 2989be09a8a9d6 ("virtio_pci: fix use after free on release"). ================================================================== BUG: KASAN: use-after-free in vop_scan_devices+0xc6c/0xe50 [vop] Read of size 8 at addr ffff88800da18580 by task kworker/0:1/12 CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.0.0-rc4+ #53 Workqueue: events vop_hotplug_devices [vop] Call Trace: dump_stack+0x74/0xbb print_address_description+0x5d/0x2b0 ? vop_scan_devices+0xc6c/0xe50 [vop] kasan_report+0x152/0x1aa ? vop_scan_devices+0xc6c/0xe50 [vop] ? vop_scan_devices+0xc6c/0xe50 [vop] vop_scan_devices+0xc6c/0xe50 [vop] ? vop_loopback_free_irq+0x160/0x160 [vop_loopback] process_one_work+0x7c0/0x14b0 ? pwq_dec_nr_in_flight+0x2d0/0x2d0 ? do_raw_spin_lock+0x120/0x280 worker_thread+0x8f/0xbf0 ? __kthread_parkme+0x78/0xf0 ? process_one_work+0x14b0/0x14b0 kthread+0x2ae/0x3a0 ? kthread_park+0x120/0x120 ret_from_fork+0x3a/0x50 Allocated by task 12: kmem_cache_alloc_trace+0x13a/0x2a0 vop_scan_devices+0x473/0xe50 [vop] process_one_work+0x7c0/0x14b0 worker_thread+0x8f/0xbf0 kthread+0x2ae/0x3a0 ret_from_fork+0x3a/0x50 Freed by task 12: kfree+0x104/0x310 device_release+0x73/0x1d0 kobject_put+0x14f/0x420 unregister_virtio_device+0x32/0x50 vop_scan_devices+0x19d/0xe50 [vop] process_one_work+0x7c0/0x14b0 worker_thread+0x8f/0xbf0 kthread+0x2ae/0x3a0 ret_from_fork+0x3a/0x50 The buggy address belongs to the object at ffff88800da18008 which belongs to the cache kmalloc-2k of size 2048 The buggy address is located 1400 bytes inside of 2048-byte region [ffff88800da18008, ffff88800da18808) The buggy address belongs to the page: page:ffffea0000368600 count:1 mapcount:0 mapping:ffff88801440dbc0 index:0x0 compound_mapcount: 0 flags: 0x4000000000010200(slab|head) raw: 4000000000010200 ffffea0000378608 ffffea000037a008 ffff88801440dbc0 raw: 0000000000000000 00000000000d000d 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88800da18480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88800da18500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88800da18580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88800da18600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88800da18680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 0063e8bbd2b62d136 ("virtio_vop: don't kfree device on register failure") Signed-off-by: Vincent Whitchurch <[email protected]> Cc: stable <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-02-01binderfs: remove separate device_initcall()Christian Brauner3-4/+16
binderfs should not have a separate device_initcall(). When a kernel is compiled with CONFIG_ANDROID_BINDERFS register the filesystem alongside CONFIG_ANDROID_IPC. This use-case is especially sensible when users specify CONFIG_ANDROID_IPC=y, CONFIG_ANDROID_BINDERFS=y and ANDROID_BINDER_DEVICES="". When CONFIG_ANDROID_BINDERFS=n then this always succeeds so there's no regression potential for legacy workloads. Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-02-01arm64: hibernate: Clean the __hyp_text to PoC after resumeJames Morse1-1/+3
During resume hibernate restores all physical memory. Any memory that is accessed with the MMU disabled needs to be cleaned to the PoC. KVMs __hyp_text was previously ommitted as it runs with the MMU enabled, but now that the hyp-stub is located in this section, we must clean __hyp_text too. This ensures secondary CPUs that come online after hibernate has finished resuming, and load KVM via the freshly written hyp-stub see the correct instructions. Signed-off-by: James Morse <[email protected]> Cc: [email protected] Signed-off-by: Will Deacon <[email protected]>
2019-02-01arm64: hyp-stub: Forbid kprobing of the hyp-stubJames Morse1-0/+2
The hyp-stub is loaded by the kernel's early startup code at EL2 during boot, before KVM takes ownership later. The hyp-stub's text is part of the regular kernel text, meaning it can be kprobed. A breakpoint in the hyp-stub causes the CPU to spin in el2_sync_invalid. Add it to the __hyp_text. Signed-off-by: James Morse <[email protected]> Cc: [email protected] Signed-off-by: Will Deacon <[email protected]>