aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2017-08-10rmap: do not call mmu_notifier_invalidate_page() under ptlKirill A. Shutemov1-22/+30
MMU notifiers can sleep, but in page_mkclean_one() we call mmu_notifier_invalidate_page() under page table lock. Let's instead use mmu_notifier_invalidate_range() outside page_vma_mapped_walk() loop. [[email protected]: try_to_unmap_one() do not call mmu_notifier under ptl] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Fixes: c7ab0d2fdc84 ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()") Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Jérôme Glisse <[email protected]> Reported-by: axie <[email protected]> Cc: Alex Deucher <[email protected]> Cc: "Writer, Tim" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: fix list corruptions on shmem shrinklistCong Wang1-2/+10
We saw many list corruption warnings on shmem shrinklist: WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0 list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960 Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_del_entry+0x9e/0xc0 shmem_unused_huge_shrink+0xfa/0x2e0 shmem_unused_huge_scan+0x20/0x30 super_cache_scan+0x193/0x1a0 shrink_slab.part.41+0x1e3/0x3f0 shrink_slab+0x29/0x30 shrink_node+0xf9/0x2f0 kswapd+0x2d8/0x6c0 kthread+0xd7/0xf0 ret_from_fork+0x22/0x30 WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8). Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1 Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013 Call Trace: dump_stack+0x4d/0x66 __warn+0xcb/0xf0 warn_slowpath_fmt+0x4f/0x60 __list_add+0x89/0xb0 shmem_setattr+0x204/0x230 notify_change+0x2ef/0x440 do_truncate+0x5d/0x90 path_openat+0x331/0x1190 do_filp_open+0x7e/0xe0 do_sys_open+0x123/0x200 SyS_open+0x1e/0x20 do_syscall_64+0x61/0x170 entry_SYSCALL64_slow_path+0x25/0x25 The problem is that shmem_unused_huge_shrink() moves entries from the global sbinfo->shrinklist to its local lists and then releases the spinlock. However, a parallel shmem_setattr() could access one of these entries directly and add it back to the global shrinklist if it is removed, with the spinlock held. The logic itself looks solid since an entry could be either in a local list or the global list, otherwise it is removed from one of them by list_del_init(). So probably the race condition is that, one CPU is in the middle of INIT_LIST_HEAD() but the other CPU calls list_empty() which returns true too early then the following list_add_tail() sees a corrupted entry. list_empty_careful() is designed to fix this situation. [[email protected]: add comments] Link: http://lkml.kernel.org/r/[email protected] Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure") Signed-off-by: Cong Wang <[email protected]> Acked-by: Linus Torvalds <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm/balloon_compaction.c: don't zero ballooned pagesWei Wang1-1/+1
Revert commit bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device")' Zeroing ballon pages is rather time consuming, especially when a lot of pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with __GFP_ZERO while it takes ~491ms without it. The original commit argued that zeroing will help ksmd to merge these pages on the host but this argument is assuming that the host actually marks balloon pages for ksm which is not universally true. So we pay performance penalty for something that even might not be used in the end which is wrong. The host can zero out pages on its own when there is a need. [[email protected]: new changelog text] Link: http://lkml.kernel.org/r/[email protected] Fixes: bb01b64cfab7 ("mm/balloon_compaction.c: enqueue zero page to balloon device") Signed-off-by: Wei Wang <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: zhenwei.pi <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: fix KSM data corruptionMinchan Kim1-1/+2
Nadav reported KSM can corrupt the user data by the TLB batching race[1]. That means data user written can be lost. Quote from Nadav Amit: "For this race we need 4 CPUs: CPU0: Caches a writable and dirty PTE entry, and uses the stale value for write later. CPU1: Runs madvise_free on the range that includes the PTE. It would clear the dirty-bit. It batches TLB flushes. CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We care about the fact that it clears the PTE write-bit, and of course, batches TLB flushes. CPU3: Runs KSM. Our purpose is to pass the following test in write_protect_page(): if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) || (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) Since it will avoid TLB flush. And we want to do it while the PTE is stale. Later, and before replacing the page, we would be able to change the page. Note that all the operations the CPU1-3 perform canhappen in parallel since they only acquire mmap_sem for read. We start with two identical pages. Everything below regards the same page/PTE. CPU0 CPU1 CPU2 CPU3 ---- ---- ---- ---- Write the same value on page [cache PTE as dirty in TLB] MADV_FREE pte_mkclean() 4 > clear_refs pte_wrprotect() write_protect_page() [ success, no flush ] pages_indentical() [ ok ] Write to page different value [Ok, using stale PTE] replace_page() Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0 already wrote on the page, but KSM ignored this write, and it got lost" In above scenario, MADV_FREE is fixed by changing TLB batching API including [set|clear]_tlb_flush_pending. Remained thing is soft-dirty part. This patch changes soft-dirty uses TLB batching API instead of flush_tlb_mm and KSM checks pending TLB flush by using mm_tlb_flush_pending so that it will flush TLB to avoid data lost if there are other parallel threads pending TLB flush. [1] http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Reported-by: Nadav Amit <[email protected]> Tested-by: Nadav Amit <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Russell King <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: fix MADV_[FREE|DONTNEED] TLB flush miss problemMinchan Kim1-2/+16
Nadav reported parallel MADV_DONTNEED on same range has a stale TLB problem and Mel fixed it[1] and found same problem on MADV_FREE[2]. Quote from Mel Gorman: "The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening." This patch aims for solving both problems all at once and is ready for other problem with KSM, MADV_FREE and soft-dirty story[3]. TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch there are parallel threads going on. In that case, forcefully, flush TLB to prevent for user to access memory via stale TLB entry although it fail to gather page table entry. I confirmed this patch works with [4] test program Nadav gave so this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2" in current mmotm. NOTE: This patch modifies arch-specific TLB gathering interface(x86, ia64, s390, sh, um). It seems most of architecture are straightforward but s390 need to be careful because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero which happens only a pte entry really is cleared by ptep_get_and_clear and friends. However, this problem never changes the pte entries but need to flush to prevent memory access from stale tlb. [1] http://lkml.kernel.org/r/[email protected] [2] http://lkml.kernel.org/r/[email protected] [3] http://lkml.kernel.org/r/[email protected] [4] https://patchwork.kernel.org/patch/9861621/ [[email protected]: decrease tlb flush pending count in tlb_finish_mmu] Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Reported-by: Nadav Amit <[email protected]> Reported-by: Mel Gorman <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Luck <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: make tlb_flush_pending globalMinchan Kim1-4/+0
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING| COMPACTION] but upcoming patches to solve subtle TLB flush batching problem will use it regardless of compaction/NUMA so this patch doesn't remove the dependency. [[email protected]: remove more ifdefs from world's ugliest printk statement] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Russell King <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: refactor TLB gathering APIMinchan Kim1-7/+21
This patch is a preparatory patch for solving race problems caused by TLB batch. For that, we will increase/decrease TLB flush pending count of mm_struct whenever tlb_[gather|finish]_mmu is called. Before making it simple, this patch separates architecture specific part and rename it to arch_tlb_[gather|finish]_mmu and generic part just calls it. It shouldn't change any behavior. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Nadav Amit <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Luck <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10Revert "mm: numa: defer TLB flush for THP migration as long as possible"Nadav Amit2-6/+7
While deferring TLB flushes is a good practice, the reverted patch caused pending TLB flushes to be checked while the page-table lock is not taken. As a result, in architectures with weak memory model (PPC), Linux may miss a memory-barrier, miss the fact TLB flushes are pending, and cause (in theory) a memory corruption. Since the alternative of using smp_mb__after_unlock_lock() was considered a bit open-coded, and the performance impact is expected to be small, the previous patch is reverted. This reverts b0943d61b8fa ("mm: numa: defer TLB flush for THP migration as long as possible"). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nadav Amit <[email protected]> Suggested-by: Mel Gorman <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Russell King <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: migrate: prevent racy access to tlb_flush_pendingNadav Amit2-3/+3
Patch series "fixes of TLB batching races", v6. It turns out that Linux TLB batching mechanism suffers from various races. Races that are caused due to batching during reclamation were recently handled by Mel and this patch-set deals with others. The more fundamental issue is that concurrent updates of the page-tables allow for TLB flushes to be batched on one core, while another core changes the page-tables. This other core may assume a PTE change does not require a flush based on the updated PTE value, while it is unaware that TLB flushes are still pending. This behavior affects KSM (which may result in memory corruption) and MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice, but was observed by hacking the kernel and adding a delay before flushing and replacing the KSM page. Finally, there is also one memory barrier missing, which may affect architectures with weak memory model. This patch (of 7): Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Link: http://lkml.kernel.org/r/[email protected] Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Signed-off-by: Nadav Amit <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Russell King <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED caseAndrea Arcangeli1-1/+1
huge_add_to_page_cache->add_to_page_cache implicitly unlocks the page before returning in case of errors. The error returned was -EEXIST by running UFFDIO_COPY on a non-hole offset of a VM_SHARED hugetlbfs mapping. It was an userland bug that triggered it and the kernel must cope with it returning -EEXIST from ioctl(UFFDIO_COPY) as expected. page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) kernel BUG at mm/filemap.c:964! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 22582 Comm: qemu-system-x86 Not tainted 4.11.11-300.fc26.x86_64 #1 RIP: unlock_page+0x4a/0x50 Call Trace: hugetlb_mcopy_atomic_pte+0xc0/0x320 mcopy_atomic+0x96f/0xbe0 userfaultfd_ioctl+0x218/0xe90 do_vfs_ioctl+0xa5/0x600 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x1a/0xa9 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrea Arcangeli <[email protected]> Tested-by: Maxime Coquelin <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Alexey Perevalov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: ratelimit PFNs busy info messageJonathan Toppins1-1/+1
The RDMA subsystem can generate several thousand of these messages per second eventually leading to a kernel crash. Ratelimit these messages to prevent this crash. Doug said: "I've been carrying a version of this for several kernel versions. I don't remember when they started, but we have one (and only one) class of machines: Dell PE R730xd, that generate these errors. When it happens, without a rate limit, we get rcu timeouts and kernel oopses. With the rate limit, we just get a lot of annoying kernel messages but the machine continues on, recovers, and eventually the memory operations all succeed" And: "> Well... why are all these EBUSY's occurring? It sounds inefficient > (at least) but if it is expected, normal and unavoidable then > perhaps we should just remove that message altogether? I don't have an answer to that question. To be honest, I haven't looked real hard. We never had this at all, then it started out of the blue, but only on our Dell 730xd machines (and it hits all of them), but no other classes or brands of machines. And we have our 730xd machines loaded up with different brands and models of cards (for instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines meant it wasn't tied to any particular brand/model of RDMA hardware. To me, it always smelled of a hardware oddity specific to maybe the CPUs or mainboard chipsets in these machines, so given that I'm not an mm expert anyway, I never chased it down. A few other relevant details: it showed up somewhere around 4.8/4.9 or thereabouts. It never happened before, but the prinkt has been there since the 3.18 days, so possibly the test to trigger this message was changed, or something else in the allocator changed such that the situation started happening on these machines? And, like I said, it is specific to our 730xd machines (but they are all identical, so that could mean it's something like their specific ram configuration is causing the allocator to hit this on these machine but not on other machines in the cluster, I don't want to say it's necessarily the model of chipset or CPU, there are other bits of identicalness between these machines)" Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com Signed-off-by: Jonathan Toppins <[email protected]> Reviewed-by: Doug Ledford <[email protected]> Tested-by: Doug Ledford <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Hillf Danton <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-10mm: fix global NR_SLAB_.*CLAIMABLE counter readsJohannes Weiner2-5/+6
As Tetsuo points out: "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") broke "Slab:" field of /proc/meminfo . It shows nearly 0kB" In addition to /proc/meminfo, this problem also affects the slab counters OOM/allocation failure info dumps, can cause early -ENOMEM from overcommit protection, and miscalculate image size requirements during suspend-to-disk. This is because the patch in question switched the slab counters from the zone level to the node level, but forgot to update the global accessor functions to read the aggregate node data instead of the aggregate zone data. Use global_node_page_state() to access the global slab counters. Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Reported-by: Tetsuo Handa <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Stefan Agner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02mm: take memory hotplug lock within numa_zonelist_order_handler()Heiko Carstens1-0/+2
Andre Wild reported the following warning: WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60 Modules linked in: CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 00000000701d8100 task.stack: 0000000073594000 Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60) ... Call Trace: lockdep_assert_cpus_held+0x42/0x60) stop_machine_cpuslocked+0x62/0xf0 build_all_zonelists+0x92/0x150 numa_zonelist_order_handler+0x102/0x150 proc_sys_call_handler.isra.12+0xda/0x118 proc_sys_write+0x34/0x48 __vfs_write+0x3c/0x178 vfs_write+0xbc/0x1a0 SyS_write+0x66/0xc0 system_call+0xc4/0x2b0 locks held by bash/1205: #0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0 #1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150 #2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150 Last Breaking-Event-Address: lockdep_assert_cpus_held+0x48/0x60 This can be easily triggered with e.g. echo n > /proc/sys/vm/numa_zonelist_order In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") memory hotplug locking was changed to fix a potential deadlock. This also switched the stop_machine() invocation within build_all_zonelists() to stop_machine_cpuslocked() which now expects that online cpus are locked when being called. This assumption is not true if build_all_zonelists() is being called from numa_zonelist_order_handler(). In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler(). Link: http://lkml.kernel.org/r/[email protected] Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem") Signed-off-by: Heiko Carstens <[email protected]> Reported-by: Andre Wild <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02mm/page_io.c: fix oops during block io poll in swapin pathTetsuo Handa1-0/+7
When a thread is OOM-killed during swap_readpage() operation, an oops occurs because end_swap_bio_read() is calling wake_up_process() based on an assumption that the thread which called swap_readpage() is still alive. Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 task: ffffffffb7c16500 task.stack: ffffffffb7c00000 RIP: 0010:__lock_acquire+0x151/0x12f0 Call Trace: <IRQ> lock_acquire+0x59/0x80 _raw_spin_lock_irqsave+0x3b/0x4f try_to_wake_up+0x3b/0x410 wake_up_process+0x10/0x20 end_swap_bio_read+0x6f/0xf0 bio_endio+0x92/0xb0 blk_update_request+0x88/0x270 scsi_end_request+0x32/0x1c0 scsi_io_completion+0x209/0x680 scsi_finish_command+0xd4/0x120 scsi_softirq_done+0x120/0x140 __blk_mq_complete_request_remote+0xe/0x10 flush_smp_call_function_queue+0x51/0x120 generic_smp_call_function_single_interrupt+0xe/0x20 smp_trace_call_function_single_interrupt+0x22/0x30 smp_call_function_single_interrupt+0x9/0x10 call_function_single_interrupt+0xa7/0xb0 </IRQ> RIP: 0010:native_safe_halt+0x6/0x10 default_idle+0xe/0x20 arch_cpu_idle+0xa/0x10 default_idle_call+0x1e/0x30 do_idle+0x187/0x200 cpu_startup_entry+0x6e/0x70 rest_init+0xd0/0xe0 start_kernel+0x456/0x477 x86_64_start_reservations+0x24/0x26 x86_64_start_kernel+0xf7/0x11a secondary_startup_64+0xa5/0xa5 Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff <f0> ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85 RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50 ---[ end trace 6c441db499169b1e ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception in interrupt Fix it by holding a reference to the thread. [[email protected]: add comment] Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path") Signed-off-by: Tetsuo Handa <[email protected]> Reviewed-by: Shaohua Li <[email protected]> Cc: Tim Chen <[email protected]> Cc: Huang Ying <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02zram: do not free pool->size_classMinchan Kim1-1/+0
Mike reported kernel goes oops with ltp:zram03 testcase. zram: Added device: zram0 zram0: detected capacity change from 0 to 107374182400 BUG: unable to handle kernel paging request at 0000306d61727a77 IP: zs_map_object+0xb9/0x260 PGD 0 P4D 0 Oops: 0000 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram] CPU: 6 PID: 12356 Comm: swapon Tainted: G E 4.13.0.g87b2c3f-default #194 Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698 , BIOS -[D6E150AUS-1.10]- 12/15/2010 task: ffff880158d2c4c0 task.stack: ffffc90001680000 RIP: 0010:zs_map_object+0xb9/0x260 Call Trace: zram_bvec_rw.isra.26+0xe8/0x780 [zram] zram_rw_page+0x6e/0xa0 [zram] bdev_read_page+0x81/0xb0 do_mpage_readpage+0x51a/0x710 mpage_readpages+0x122/0x1a0 blkdev_readpages+0x1d/0x20 __do_page_cache_readahead+0x1b2/0x270 ondemand_readahead+0x180/0x2c0 page_cache_sync_readahead+0x31/0x50 generic_file_read_iter+0x7e7/0xaf0 blkdev_read_iter+0x37/0x40 __vfs_read+0xce/0x140 vfs_read+0x9e/0x150 SyS_read+0x46/0xa0 entry_SYSCALL_64_fastpath+0x1a/0xa5 Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70 RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988 CR2: 0000306d61727a77 He bisected the problem is [1]. After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling"), zram doesn't use double pointer for pool->size_class any more in zs_create_pool so counter function zs_destroy_pool don't need to free it, either. Otherwise, it does kfree wrong address and then, kernel goes Oops. Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox Fixes: cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling") Signed-off-by: Minchan Kim <[email protected]> Reported-by: Mike Galbraith <[email protected]> Tested-by: Mike Galbraith <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Jerome Marchand <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02kasan: avoid -Wmaybe-uninitialized warningArnd Bergmann1-0/+1
gcc-7 produces this warning: mm/kasan/report.c: In function 'kasan_report': mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized] print_shadow_for_address(info->first_bad_addr); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here The code seems fine as we only print info.first_bad_addr when there is a shadow, and we always initialize it in that case, but this is relatively hard for gcc to figure out after the latest rework. Adding an intialization to the most likely value together with the other struct members shuts up that warning. Fixes: b235b9808664 ("kasan: unify report headers") Link: https://patchwork.kernel.org/patch/9641417/ Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Suggested-by: Alexander Potapenko <[email protected]> Suggested-by: Andrey Ryabinin <[email protected]> Acked-by: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02userfaultfd: non-cooperative: notify about unmap of destination during mremapMike Rapoport1-2/+5
When mremap is called with MREMAP_FIXED it unmaps memory at the destination address without notifying userfaultfd monitor. If the destination were registered with userfaultfd, the monitor has no way to distinguish between the old and new ranges and to properly relate the page faults that would occur in the destination region. Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02mm, mprotect: flush TLB if potentially racing with a parallel reclaim ↵Mel Gorman6-1/+44
leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [[email protected]: tweak comments] [[email protected]: fix spello] Link: http://lkml.kernel.org/r/[email protected] Reported-by: Nadav Amit <[email protected]> Signed-off-by: Mel Gorman <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: <[email protected]> [v4.4+] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-08-02mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errorsDaniel Jordan1-6/+3
Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain errors from follow_hugetlb_page. After such error, __get_user_pages subsequently calls faultin_page on the same VMA and start address that follow_hugetlb_page failed on instead of returning the error immediately as it should. In follow_hugetlb_page, when hugetlb_fault returns a value covered under VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages to 0 as __get_user_pages expects in this case, which causes the following to happen in __get_user_pages: the "while (nr_pages)" check succeeds, we skip the "if (!vma..." check because we got a VMA the last time around, we find no page with follow_page_mask, and we call faultin_page, which calls hugetlb_fault for the second time. This issue also slightly changes how __get_user_pages works. Before, it only returned error if it had made no progress (i = 0). But now, follow_hugetlb_page can clobber "i" with an error code since its new return path doesn't check for progress. So if "i" is nonzero before a failing call to follow_hugetlb_page, that indication of progress is lost and __get_user_pages can return error even if some pages were successfully pinned. To fix this, change follow_hugetlb_page so that it updates nr_pages, allowing __get_user_pages to fail immediately and restoring the "error only if no progress" behavior to __get_user_pages. Tested that __get_user_pages returns when expected on error from hugetlb_fault in follow_hugetlb_page. Fixes: 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Daniel Jordan <[email protected]> Acked-by: Punit Agrawal <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: James Morse <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: zhong jiang <[email protected]> Cc: <[email protected]> [4.12.x] Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-15Merge branch 'work.mount' of ↵Linus Torvalds1-0/+24
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-14mm: fix overflow check in expand_upwards()Helge Deller1-1/+1
Jörn Engel noticed that the expand_upwards() function might not return -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE. Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa which all define TASK_SIZE as 0xffffffff, but since none of those have an upwards-growing stack we currently have no actual issue. Nevertheless let's fix this just in case any of the architectures with an upward-growing stack (currently parisc, metag and partly ia64) define TASK_SIZE similar. Link: http://lkml.kernel.org/r/[email protected] Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <[email protected]> Reported-by: Jörn Engel <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12writeback: rework wb_[dec|inc]_stat family of functionsNikolay Borisov1-5/+5
Currently the writeback statistics code uses a percpu counters to hold various statistics. Furthermore we have 2 families of functions - those which disable local irq and those which doesn't and whose names begin with double underscore. However, they both end up calling __add_wb_stats which in turn calls percpu_counter_add_batch which is already irq-safe. Exploiting this fact allows to eliminated the __wb_* functions since they don't add any further protection than we already have. Furthermore, refactor the wb_* function to call __add_wb_stat directly without the irq-disabling dance. This will likely result in better runtime of code which deals with modifying the stat counters. While at it also document why percpu_counter_add_batch is in fact preempt and irq-safe since at least 3 people got confused. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nikolay Borisov <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jeff Layton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12mm, migration: do not trigger OOM killer when migrating memoryMichal Hocko1-1/+2
Page migration (for memory hotplug, soft_offline_page or mbind) needs to allocate a new memory. This can trigger an oom killer if the target memory is depleated. Although quite unlikely, still possible, especially for the memory hotplug (offlining of memoery). Up to now we didn't really have reasonable means to back off. __GFP_NORETRY can fail just too easily and __GFP_THISNODE sticks to a single node and that is not suitable for all callers. But now that we have __GFP_RETRY_MAYFAIL we should use it. It is preferable to fail the migration than disrupt the system by killing some processes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Alex Belits <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: David Daney <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizesMichal Hocko1-10/+4
Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the request size we can drop the hackish implementation for !costly orders. __GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward progress and backs of when we are out of memory for the requested size. Therefore we do not need to enforce__GFP_NORETRY for !costly orders just to silent the oom killer anymore. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Alex Belits <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: David Daney <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko7-16/+24
semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [[email protected]: fix arch/sparc/kernel/mdesc.c] [[email protected]: semantic fix] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Alex Belits <[email protected]> Cc: Chris Wilson <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: David Daney <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ralf Baechle <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-12mm/memory.c: mark create_huge_pmd() inline to prevent build failureGeert Uytterhoeven1-1/+1
With gcc 4.1.2: mm/memory.o: In function `create_huge_pmd': memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page' Interestingly, create_huge_pmd() is emitted in the assembler output, but never called. Converting transparent_hugepage_enabled() from a macro to a static inline function reduced the ability of the compiler to remove unused code. Fix this by marking create_huge_pmd() inline. Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Geert Uytterhoeven <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10kasan: make get_wild_bug_type() staticColin Ian King1-1/+1
The helper function get_wild_bug_type() does not need to be in global scope, so make it static. Cleans up sparse warning: "symbol 'get_wild_bug_type' was not declared. Should it be static?" Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Colin Ian King <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/kasan/kasan.c: rename XXX_is_zero to XXX_is_nonzeroJoonsoo Kim1-7/+7
They return positive value, that is, true, if non-zero value is found. Rename them to reduce confusion. Link: http://lkml.kernel.org/r/20170516012350.GA16015@js1304-desktop Signed-off-by: Joonsoo Kim <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/kasan: add support for memory hotplugAndrey Ryabinin2-6/+35
KASAN doesn't happen work with memory hotplug because hotplugged memory doesn't have any shadow memory. So any access to hotplugged memory would cause a crash on shadow check. Use memory hotplug notifier to allocate and map shadow memory when the hotplugged memory is going online and free shadow after the memory offlined. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/kasan: get rid of speculative shadow checksAndrey Ryabinin1-82/+16
For some unaligned memory accesses we have to check additional byte of the shadow memory. Currently we load that byte speculatively to have only single load + branch on the optimistic fast path. However, this approach has some downsides: - It's unaligned access, so this prevents porting KASAN on architectures which doesn't support unaligned accesses. - We have to map additional shadow page to prevent crash if speculative load happens near the end of the mapped memory. This would significantly complicate upcoming memory hotplug support. I wasn't able to notice any performance degradation with this patch. So these speculative loads is just a pain with no gain, let's remove them. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andrey Ryabinin <[email protected]> Acked-by: Dmitry Vyukov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/kasan/kasan_init.c: use kasan_zero_pud for p4d tableJoonsoo Kim1-0/+12
There is missing optimization in zero_p4d_populate() that can save some memory when mapping zero shadow. Implement it like as others. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Joonsoo Kim <[email protected]> Acked-by: Andrey Ryabinin <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/zsmalloc: simplify zs_max_alloc_size handlingJerome Marchand1-37/+15
Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented zsmalloc to allocate an object of the maximal size (ZS_MAX_ALLOC_SIZE). I think however the fix is unneededly complicated. This patch replaces the dynamic calculation of zs_size_classes at init time by a compile time calculation that uses the DIV_ROUND_UP() macro already used in get_size_class_index(). [[email protected]: use min_t] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jerome Marchand <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Mahendran Ganesh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/memory-hotplug: switch locking to a percpu rwsemThomas Gleixner2-75/+16
Andrey reported a potential deadlock with the memory hotplug lock and the cpu hotplug lock. The reason is that memory hotplug takes the memory hotplug lock and then calls stop_machine() which calls get_online_cpus(). That's the reverse lock order to get_online_cpus(); get_online_mems(); in mm/slub_common.c The problem has been there forever. The reason why this was never reported is that the cpu hotplug locking had this homebrewn recursive reader writer semaphore construct which due to the recursion evaded the full lock dep coverage. The memory hotplug code copied that construct verbatim and therefor has similar issues. Three steps to fix this: 1) Convert the memory hotplug locking to a per cpu rwsem so the potential issues get reported proper by lockdep. 2) Lock the online cpus in mem_hotplug_begin() before taking the memory hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc code to avoid recursive locking. 3) The cpu hotpluck locking in #2 causes a recursive locking of the cpu hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this by invoking lru_add_drain_all_cpuslocked() instead. Link: http://lkml.kernel.org/r/[email protected] Reported-by: Andrey Ryabinin <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm: swap: provide lru_add_drain_all_cpuslocked()Thomas Gleixner1-3/+8
The rework of the cpu hotplug locking unearthed potential deadlocks with the memory hotplug locking code. The solution for these is to rework the memory hotplug locking code as well and take the cpu hotplug lock before the memory hotplug lock in mem_hotplug_begin(), but this will cause a recursive locking of the cpu hotplug lock when the memory hotplug code calls lru_add_drain_all(). Split out the inner workings of lru_add_drain_all() into lru_add_drain_all_cpuslocked() so this function can be invoked from the memory hotplug code with the cpu hotplug lock held. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Reported-by: Andrey Ryabinin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm: use dedicated helper to access rlimit valueKrzysztof Opasiak1-3/+2
Use rlimit() helper instead of manually writing whole chain from current task to rlim_cur. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Krzysztof Opasiak <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/list_lru.c: fix list_lru_count_node() to be race freeSahitya Tummala1-8/+6
list_lru_count_node() iterates over all memcgs to get the total number of entries on the node but it can race with memcg_drain_all_list_lrus(), which migrates the entries from a dead cgroup to another. This can return incorrect number of entries from list_lru_count_node(). Fix this by keeping track of entries per node and simply return it in list_lru_count_node(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sahitya Tummala <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Jan Kara <[email protected]> Cc: Alexander Polakov <[email protected]> Cc: Al Viro <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/mmap.c: expand_downwards: don't require the gap if !vm_prevOleg Nesterov1-7/+3
expand_stack(vma) fails if address < stack_guard_gap even if there is no vma->vm_prev. I don't think this makes sense, and we didn't do this before the recent commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas"). We do not need a gap in this case, any address is fine as long as security_mmap_addr() doesn't object. This also simplifies the code, we know that address >= prev->vm_end and thus underflow is not possible. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Larry Woodman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stackMichal Hocko1-2/+4
Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has introduced a regression in some rust and Java environments which are trying to implement their own stack guard page. They are punching a new MAP_FIXED mapping inside the existing stack Vma. This will confuse expand_{downwards,upwards} into thinking that the stack expansion would in fact get us too close to an existing non-stack vma which is a correct behavior wrt safety. It is a real regression on the other hand. Let's work around the problem by considering PROT_NONE mapping as a part of the stack. This is a gros hack but overflowing to such a mapping would trap anyway an we only can hope that usespace knows what it is doing and handle it propely. Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Debugged-by: Vlastimil Babka <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Willy Tarreau <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/balloon_compaction.c: enqueue zero page to balloon devicezhenwei.pi1-1/+1
presently pages in the balloon device have random value, and these pages will be scanned by ksmd on the host. They usually cannot be merged. Enqueue zero pages will resolve this problem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: zhenwei.pi <[email protected]> Cc: Gioh Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Rafael Aquini <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10cma: fix calculation of aligned offsetDoug Berger1-9/+6
The align_offset parameter is used by bitmap_find_next_zero_area_off() to represent the offset of map's base from the previous alignment boundary; the function ensures that the returned index, plus the align_offset, honors the specified align_mask. The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") has the cma driver calculate the offset to the *next* alignment boundary. In most cases, the base alignment is greater than that specified when making allocations, resulting in a zero offset whether we align up or down. In the example given with the commit, the base alignment (8MB) was half the requested alignment (16MB) so the math also happened to work since the offset is 8MB in both directions. However, when requesting allocations with an alignment greater than twice that of the base, the returned index would not be correctly aligned. Also, the align_order arguments of cma_bitmap_aligned_mask() and cma_bitmap_aligned_offset() should not be negative so the argument type was made unsigned. Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Angus Clark <[email protected]> Signed-off-by: Doug Berger <[email protected]> Acked-by: Gregory Fong <[email protected]> Cc: Doug Berger <[email protected]> Cc: Angus Clark <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Lucas Stach <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Shiraz Hashim <[email protected]> Cc: Jaewon Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/memory_hotplug.c: remove unused local zone_type from __remove_zone()John Hubbard1-3/+0
__remove_zone() sets up up zone_type, but never uses it for anything. This does not cause a warning, due to the (necessary) use of -Wno-unused-but-set-variable. However, it's noise, so just delete it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: John Hubbard <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/swap_slots.c: don't disable preemption while taking the per-CPU cacheSebastian Andrzej Siewior1-3/+2
get_cpu_var() disables preemption and returns the per-CPU version of the variable. Disabling preemption is useful to ensure atomic access to the variable within the critical section. In this case however, after the per-CPU version of the variable is obtained the ->free_lock is acquired. For that reason it seems the raw accessor could be used. It only seems that ->slots_ret should be retested (because with disabled preemption this variable can not be set to NULL otherwise). This popped up during PREEMPT-RT testing because it tries to take spinlocks in a preempt disabled section. In RT, spinlocks can sleep. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tim Chen <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ying Huang <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/page_alloc.c: eliminate unsigned confusion in __rmqueue_fallbackRasmus Villemoes1-4/+7
Since current_order starts as MAX_ORDER-1 and is then only decremented, the second half of the loop condition seems superfluous. However, if order is 0, we may decrement current_order past 0, making it UINT_MAX. This is obviously too subtle ([1], [2]). Since we need to add some comment anyway, change the two variables to signed, making the counting-down for loop look more familiar, and apparently also making gcc generate slightly smaller code. [1] https://lkml.org/lkml/2016/6/20/493 [2] https://lkml.org/lkml/2017/6/19/345 [[email protected]: fix up reject fixupping] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Rasmus Villemoes <[email protected]> Reported-by: Hao Lee <[email protected]> Acked-by: Wei Yang <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm: avoid taking zone lock in pagetypeinfo_showmixed()Vinayak Menon2-11/+19
pagetypeinfo_showmixedcount_print is found to take a lot of time to complete and it does this holding the zone lock and disabling interrupts. In some cases it is found to take more than a second (On a 2.4GHz,8Gb RAM,arm64 cpu). Avoid taking the zone lock similar to what is done by read_page_owner, which means possibility of inaccurate results. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vinayak Menon <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: zhongjiang <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Sudip Mukherjee <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: David Rientjes <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm, hugetlb, soft_offline: use new_page_nodemask for soft offline migrationMichal Hocko1-9/+1
new_page is yet another duplication of the migration callback which has to handle hugetlb migration specially. We can safely use the generic new_page_nodemask for the same purpose. Please note that gigantic hugetlb pages do not need any special handling because alloc_huge_page_nodemask will make sure to check pages in all per node pools. The reason this was done previously was that alloc_huge_page_node treated NO_NUMA_NODE and a specific node differently and so alloc_huge_page_node(nid) would check on this specific node. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reported-by: Vlastimil Babka <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10hugetlb: add support for preferred node to alloc_huge_page_nodemaskMichal Hocko1-44/+44
alloc_huge_page_nodemask tries to allocate from any numa node in the allowed node mask starting from lower numa nodes. This might lead to filling up those low NUMA nodes while others are not used. We can reduce this risk by introducing a concept of the preferred node similar to what we have in the regular page allocator. We will start allocating from the preferred nid and then iterate over all allowed nodes in the zonelist order until we try them all. This is mimicing the page allocator logic except it operates on per-node mempools. dequeue_huge_page_vma already does this so distill the zonelist logic into a more generic dequeue_huge_page_nodemask and use it in alloc_huge_page_nodemask. This will allow us to use proper per numa distance fallback also for alloc_huge_page_node which can use alloc_huge_page_nodemask now and we can get rid of alloc_huge_page_node helper which doesn't have any user anymore. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm, hugetlb: unclutter hugetlb allocation layersMichal Hocko1-104/+29
Patch series "mm, hugetlb: allow proper node fallback dequeue". While working on a hugetlb migration issue addressed in a separate patchset[1] I have noticed that the hugetlb allocations from the preallocated pool are quite subotimal. [1] //lkml.kernel.org/r/[email protected] There is no fallback mechanism implemented and no notion of preferred node. I have tried to work around it but Vlastimil was right to push back for a more robust solution. It seems that such a solution is to reuse zonelist approach we use for the page alloctor. This series has 3 patches. The first one tries to make hugetlb allocation layers more clear. The second one implements the zonelist hugetlb pool allocation and introduces a preferred node semantic which is used by the migration callbacks. The last patch is a clean up. This patch (of 3): Hugetlb allocation path for fresh huge pages is unnecessarily complex and it mixes different interfaces between layers. __alloc_buddy_huge_page is the central place to perform a new allocation. It checks for the hugetlb overcommit and then relies on __hugetlb_alloc_buddy_huge_page to invoke the page allocator. This is all good except that __alloc_buddy_huge_page pushes vma and address down the callchain and so __hugetlb_alloc_buddy_huge_page has to deal with two different allocation modes - one for memory policy and other node specific (or to make it more obscure node non-specific) requests. This just screams for a reorganization. This patch pulls out all the vma specific handling up to __alloc_buddy_huge_page_with_mpol where it belongs. __alloc_buddy_huge_page will get nodemask argument and __hugetlb_alloc_buddy_huge_page will become a trivial wrapper over the page allocator. In short: __alloc_buddy_huge_page_with_mpol - memory policy handling __alloc_buddy_huge_page - overcommit handling and accounting __hugetlb_alloc_buddy_huge_page - page allocator layer Also note that __hugetlb_alloc_buddy_huge_page and its cpuset retry loop is not really needed because the page allocator already handles the cpusets update. Finally __hugetlb_alloc_buddy_huge_page had a special case for node specific allocations (when no policy is applied and there is a node given). This has relied on __GFP_THISNODE to not fallback to a different node. alloc_huge_page_node is the only caller which relies on this behavior so move the __GFP_THISNODE there. Not only does this remove quite some code it also should make those layers easier to follow and clear wrt responsibilities. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Tested-by: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/oom_kill.c: add tracepoints for oom reaper-related eventsRoman Gushchin1-0/+7
During the debugging of the problem described in https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug output is not really useful to understand issues related to the oom reaper. So, I assume, that adding some tracepoints might help with debugging of similar issues. Trace the following events: 1) a process is marked as an oom victim, 2) a process is added to the oom reaper list, 3) the oom reaper starts reaping process's mm, 4) the oom reaper finished reaping, 5) the oom reaper skips reaping. How it works in practice? Below is an example which show how the problem mentioned above can be found: one process is added twice to the oom_reaper list: $ cd /sys/kernel/debug/tracing $ echo "oom:mark_victim" > set_event $ echo "oom:wake_reaper" >> set_event $ echo "oom:skip_task_reaping" >> set_event $ echo "oom:start_task_reaping" >> set_event $ echo "oom:finish_task_reaping" >> set_event $ cat trace_pipe allocate-502 [001] .... 91.836405: mark_victim: pid=502 allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502 allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502 oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502 oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502 oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502 Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10userfaultfd: non-cooperative: add madvise() event for MADV_FREE requestMike Rapoport1-20/+22
MADV_FREE is identical to MADV_DONTNEED from the point of view of uffd monitor. The monitor has to stop handling #PF events in the range being freed. We are reusing userfaultfd_remove callback along with the logic required to re-get and re-validate the VMA which may change or disappear because userfaultfd_remove releases mmap_sem. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-07-10mm/truncate.c: fix THP handling in invalidate_mapping_pages()Jan Kara1-2/+8
The condition checking for THP straddling end of invalidated range is wrong - it checks 'index' against 'end' but 'index' has been already advanced to point to the end of THP and thus the condition can never be true. As a result THP straddling 'end' has been fully invalidated. Given the nature of invalidate_mapping_pages(), this could be only performance issue. In fact, we are lucky the condition is wrong because if it was ever true, we'd leave locked page behind. Fix the condition checking for THP straddling 'end' and also properly unlock the page. Also update the comment before the condition to explain why we decide not to invalidate the page as it was not clear to me and I had to ask Kirill. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>