Age         Commit message  (Author)  Files changed, lines -/+
2018-01-31  mm: improve comment on page->mapping  (Matthew Wilcox)  1 file, -9/+3
The comment on page->mapping is terse, and out of date (it does not mention the possibility of PAGE_MAPPING_MOVABLE). Instead, point the interested reader to page-flags.h where there is a much better comment. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: remove misleading alignment claims  (Matthew Wilcox)  1 file, -8/+5
The "third double word block" isn't on 32-bit systems. The layout looks like this:

    unsigned long flags;
    struct address_space *mapping;
    pgoff_t index;
    atomic_t _mapcount;
    atomic_t _refcount;

which is 32 bytes on 64-bit, but 20 bytes on 32-bit. Nobody is trying to use the fact that it's double-word aligned today, so just remove the misleading claims. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: de-indent struct page  (Matthew Wilcox)  1 file, -21/+19
I found the struct { union { struct { union { struct { } } } } } layout rather confusing. Fortunately, there is an easier way to write this. The innermost union is of four things which are the size of an int, so the ones which are used by slab/slob/slub can be pulled up two levels to be in the outermost union with 'counters'. That leaves us with struct { union { struct { atomic_t; atomic_t; } } } which has the same layout, but is easier to read. Output from the current git version of pahole, diffed with -uw to ignore the whitespace changes from the indentation:

     };                                          /*    16     8 */
     union {
             long unsigned int counters;         /*    24     8 */
    -        struct {
    -                union {
    -                        atomic_t _mapcount;         /*    24     4 */
                             unsigned int active;        /*    24     4 */
                             struct {
                                     unsigned int inuse:16;  /* 24:16  4 */
    @@ -21,7 +18,8 @@
                                     unsigned int frozen:1;  /* 24: 0  4 */
                             };                          /*    24     4 */
                             int units;                  /*    24     4 */
    -                };                                  /*    24     4 */
    +        struct {
    +                atomic_t _mapcount;                 /*    24     4 */
                     atomic_t _refcount;                 /*    28     4 */
             };                                          /*    24     8 */
     };                                                  /*    24     8 */

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: align struct page more aesthetically  (Matthew Wilcox)  1 file, -9/+7
Patch series "Restructure struct page", v2. This series does not attempt any grand restructuring. Instead, it cures the worst of the indentitis, fixes the documentation and reduces the ifdeffery. The only layout change is compound_dtor and compound_order are each reduced to one byte. This patch (of 8): Instead of an ifdef block at the end of the struct, which needed its own comment, define _struct_page_alignment up at the top where it fits nicely with the existing comment. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Randy Dunlap <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/zsmalloc: simplify shrinker init/destroy  (Aliaksei Karaliou)  1 file, -13/+9
Structure zs_pool has a special flag to indicate success of shrinker initialization. unregister_shrinker() has been improved and can detect by itself whether actual deinitialization should be performed or not, so the extra flag becomes redundant. [[email protected]: update comment (Aliaksei), remove unneeded cast] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aliaksei Karaliou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Acked-by: Minchan Kim <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm, oom: avoid reaping only for mm's with blockable invalidate callbacks  (David Rientjes)  1 file, -10/+11
This uses the new annotation to determine if an mm has mmu notifiers with blockable invalidate range callbacks to avoid oom reaping. Otherwise, the callbacks are used around unmap_page_range(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Christian König <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Alex Deucher <[email protected]> Cc: David Airlie <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks  (David Rientjes)  7 files, -3/+63
Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu notifiers") prevented the oom reaper from unmapping private anonymous memory when the oom victim mm had mmu notifiers registered. The rationale is that doing mmu_notifier_invalidate_range_{start,end}() around the unmap_page_range(), which is needed, can block and the oom killer will stall forever waiting for the victim to exit, which may not be possible without reaping. That concern is real, but only true for mmu notifiers that have blockable invalidate_range_{start,end}() callbacks. This patch adds a "flags" field to mmu notifier ops that can set a bit to indicate that these callbacks do not block. The implementation is steered toward an expensive slowpath, such as after the oom reaper has grabbed mm->mmap_sem of a still alive oom victim. [[email protected]: mmu_notifier_invalidate_range_end() can also call invalidate_range(), which must not block; fix comment] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: David Rientjes <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Acked-by: Christian König <[email protected]> Acked-by: Dimitri Sivanich <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Alex Deucher <[email protected]> Cc: David Airlie <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Doug Ledford <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Mike Marciniszyn <[email protected]> Cc: Sean Hefty <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Radim Krčmář <[email protected]> Signed-off-by: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: thp: use down_read_trylock() in khugepaged to avoid long block  (Yang Shi)  1 file, -4/+8
In the current design, khugepaged needs to acquire mmap_sem before scanning an mm. But in some corner cases, khugepaged may scan a process which is modifying its memory mapping, so khugepaged blocks in uninterruptible state. The process might hold the mmap_sem for a long time when modifying a huge memory space, and that can trigger the khugepaged hung issue below:

    INFO: task khugepaged:270 blocked for more than 120 seconds.
    Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    khugepaged      D    0   270      2 0x00000000
     ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
     ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
     0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
    Call Trace:
     schedule+0x36/0x80
     rwsem_down_read_failed+0xf0/0x150
     call_rwsem_down_read_failed+0x18/0x30
     down_read+0x20/0x40
     khugepaged+0x476/0x11d0
     kthread+0xe6/0x100
     ret_from_fork+0x25/0x30

It is pointless for khugepaged to just block waiting for the semaphore, so replace down_read() with down_read_trylock() and move on to scan the next mm quickly instead, so that other processes can get more chances to install THP. khugepaged can then come back to scan the skipped mm when it has finished the current round of full_scan. The change also appears to improve khugepaged efficiency a little bit. Below is the test result when running LTP on a 24-core, 4GB-memory, 2-node NUMA VM:

                                  pristine    w/ trylock
    full_scan                          197           187
    pages_collapsed                     21            26
    thp_fault_alloc                  40818         44466
    thp_fault_fallback               18413         16679
    thp_collapse_alloc                  21           150
    thp_collapse_alloc_failed           14            16
    thp_file_alloc                     369           369

[[email protected]: coding-style fixes] [[email protected]: tweak comment] [[email protected]: avoid uninitialized variable use] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
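The change itself is small. A sketch of the resulting pattern in khugepaged's scan loop (simplified; the label name is illustrative, not necessarily the one used upstream):

    /* Skip a contended mm instead of sleeping on its mmap_sem */
    if (unlikely(!down_read_trylock(&mm->mmap_sem)))
            goto breakouterloop_mmap_sem;   /* revisit this mm on the next full scan */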
2018-01-31  mm/thp: remove pmd_huge_split_prepare()  (Aneesh Kumar K.V)  7 files, -86/+35
Instead of marking the pmd ready for split, invalidate the pmd. This should take care of the powerpc requirement. The only side effect is that we mark the pmd invalid early. This can result in us blocking access to the page a bit longer if we race against a thp split. [[email protected]: rebased, dirty THP once] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Daney <[email protected]> Cc: David Miller <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: use updated pmdp_invalidate() interface to track dirty/accessed bits  (Kirill A. Shutemov)  2 files, -21/+16
Use the modified pmdp_invalidate() that returns the previous value of the pmd to transfer dirty and accessed bits. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Daney <[email protected]> Cc: David Miller <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: do not lose dirty and accessed bits in pmdp_invalidate()  (Kirill A. Shutemov)  2 files, -4/+4
Vlastimil noted that pmdp_invalidate() is not atomic and we can lose dirty and accessed bits if the CPU sets them after the pmdp dereference, but before set_pmd_at(). The patch changes pmdp_invalidate() to make the entry non-present atomically and return the previous value of the entry. This value can be used to check if the CPU set dirty/accessed bits under us. The race window is very small and I haven't seen any reports that can be attributed to the bug. For this reason, I don't think backporting to stable trees is needed. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Daney <[email protected]> Cc: David Miller <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineet Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
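For reference, a sketch of the generic shape this converges on, modeled on mm/pgtable-generic.c (helper names may differ slightly by tree): pmdp_invalidate() is built on an atomic pmdp_establish(), so the previous entry, including any dirty/accessed bits, is handed back to the caller.

    pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                          pmd_t *pmdp)
    {
            /* swap in a non-present copy atomically; the old value keeps the bits */
            pmd_t old = pmdp_establish(vma, address, pmdp,
                                       pmd_mknotpresent(*pmdp));

            flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
            return old;
    }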
2018-01-31  x86/mm: provide pmdp_establish() helper  (Kirill A. Shutemov)  2 files, -1/+51
We need an atomic way to set up a pmd page table entry, avoiding races with the CPU setting dirty/accessed bits. This is required to implement pmdp_invalidate() that doesn't lose these bits. On PAE we can avoid the expensive cmpxchg8b for cases when the new page table entry is not present. If it's present, fall back to a cmpxchg loop. [[email protected]: add missing `do' to do-while loop] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
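A minimal sketch of the cmpxchg-based idea (illustrative only, not the actual arch/x86 code, which additionally special-cases PAE and non-present entries as described above):

    static inline pmd_t pmdp_establish_sketch(struct mm_struct *mm,
                    unsigned long address, pmd_t *pmdp, pmd_t pmd)
    {
            pmdval_t old;

            /*
             * Retry until the entry is swapped in atomically; any dirty or
             * accessed bits the CPU sets concurrently end up in the value
             * returned to the caller instead of being lost.
             */
            do {
                    old = pmd_val(READ_ONCE(*pmdp));
            } while (cmpxchg(&pmdp->pmd, old, pmd_val(pmd)) != old);

            return __pmd(old);
    }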
2018-01-31  sparc64: update pmdp_invalidate() to return old pmd value  (Nitin Gupta)  2 files, -6/+19
It's required to avoid losing dirty and accessed bits. [[email protected]: add a `do' to the do-while loop] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Nitin Gupta <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: David Miller <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  s390/mm: modify pmdp_invalidate to return old value  (Martin Schwidefsky)  1 file, -2/+2
It's required to avoid losing dirty and accessed bits. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Martin Schwidefsky <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  powerpc/mm: update pmdp_invalidate to return old pmd value  (Aneesh Kumar K.V)  2 files, -4/+7
It's required to avoid losing dirty and accessed bits. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mips: use generic_pmdp_establish as pmdp_establish  (Kirill A. Shutemov)  1 file, -0/+3
MIPS doesn't support hardware dirty/accessed bits. generic_pmdp_establish() is suitable in this case. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: David Daney <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  arm64: provide pmdp_establish() helper  (Catalin Marinas)  1 file, -0/+7
We need an atomic way to setup pmd page table entry, avoiding races with CPU setting dirty/accessed bits. This is required to implement pmdp_invalidate() that doesn't lose these bits. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  arm/mm: provide pmdp_establish() helper  (Kirill A. Shutemov)  1 file, -0/+3
ARM LPAE doesn't have hardware dirty/accessed bits. generic_pmdp_establish() is the right implementation of pmdp_establish for this case. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  arc: use generic_pmdp_establish as pmdp_establish  (Kirill A. Shutemov)  1 file, -0/+3
ARC doesn't support hardware dirty/accessed bits. generic_pmdp_establish() is suitable in this case. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  asm-generic: provide generic_pmdp_establish()  (Kirill A. Shutemov)  1 file, -0/+15
Patch series "Do not lose dirty bit on THP pages", v4. Vlastimil noted that pmdp_invalidate() is not atomic and we can lose dirty and accessed bits if the CPU sets them after the pmdp dereference, but before set_pmd_at(). The bug can lead to data loss, but the race window is tiny and I haven't seen any reports that suggested that it happens in reality. So I don't think it is worth sending to stable. Unfortunately, there's no way to address the issue in a generic way. We need to fix all architectures that support THP one-by-one. All architectures that support THP have to provide an atomic pmdp_invalidate() that returns the previous value. If the generic implementation of pmdp_invalidate() is used, the architecture needs to provide an atomic pmdp_establish(). pmdp_establish() is not used outside the generic implementation of pmdp_invalidate() so far, but I think this can change in the future. This patch (of 12): This is an implementation of pmdp_establish() that is only suitable for an architecture that doesn't have hardware dirty/accessed bits. In this case we can't race with the CPU which sets these bits and a non-atomic approach is fine. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Daney <[email protected]> Cc: David Miller <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vineet Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: get 7% more pages in a pagevec  (Matthew Wilcox)  1 file, -3/+3
We don't have to use an entire 'long' for the number of elements in the pagevec; we know it's a number between 0 and 14 (now 15). So we can store it in a char, and then the bool packs next to it and we still have two or six bytes of padding for more elements in the header. That gives us space to cram in an extra page. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
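A rough illustration of the layout argument (the field names other than nr and pages are approximations, not the exact upstream struct):

    #define PAGEVEC_SIZE    15

    struct pagevec {
            unsigned char nr;       /* 0..15 easily fits in one byte */
            bool drained;           /* the bool mentioned above packs next to nr */
            /* the remaining padding in this word is still available */
            struct page *pages[PAGEVEC_SIZE];
    };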
2018-01-31  mm: add unmap_mapping_pages()  (Matthew Wilcox)  6 files, -60/+61
Several users of unmap_mapping_range() would prefer to express their range in pages rather than bytes. Unfortunately, on a 32-bit kernel, you have to remember to cast your page number to a 64-bit type before shifting it, and four places in the current tree didn't remember to do that. That's a sign of a bad interface. Conveniently, unmap_mapping_range() actually converts from bytes into pages, so hoist the guts of unmap_mapping_range() into a new function unmap_mapping_pages() and convert the callers which want to use pages. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Reported-by: "zhangyi (F)" <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
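The page-based interface then takes a pgoff_t range directly, and the byte-based wrapper does the conversion once, in one place. A simplified sketch (overflow handling omitted):

    void unmap_mapping_pages(struct address_space *mapping, pgoff_t start,
                             pgoff_t nr, bool even_cows);

    /* sketch of the byte-based wrapper layered on top of it */
    static void unmap_mapping_range_sketch(struct address_space *mapping,
                    loff_t holebegin, loff_t holelen, int even_cows)
    {
            pgoff_t hba = holebegin >> PAGE_SHIFT;
            pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

            unmap_mapping_pages(mapping, hba, hlen, even_cows);
    }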
2018-01-31  mm, userfaultfd, THP: avoid waiting when PMD under THP migration  (Huang Ying)  1 file, -1/+4
If THP migration is enabled, for a VMA handled by userfaultfd, consider the following situation:

    do_page_fault()
      __do_huge_pmd_anonymous_page()
        handle_userfault()
          userfault_msg()
            /* a huge page is allocated and mapped at fault address */
    /* the huge page is under migration, leaves migration entry in page table */
    userfaultfd_must_wait()
      /* return true because !pmd_present() */
      /* may wait in loop until fatal signal */

That is, it is possible for userfaultfd_must_wait() to encounter a PMD entry which is !pmd_none() && !pmd_present(). In the current implementation, we will wait for such PMD entries, which may cause unnecessary waiting and potential soft lockup. This is fixed by not waiting when !pmd_none() && !pmd_present(), and only waiting when pmd_none(). This may not be a problem in practice, because userfaultfd_must_wait() is always called with mm->mmap_sem read-locked, mremap() will write-lock mm->mmap_sem, and UFFDIO_COPY doesn't support copying a THP mapping. But the change still makes the code more correct, and makes the PMD and PTE code more consistent. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Zi Yan <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
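A simplified sketch of the distinction the fix draws (illustrative helper, not the exact upstream userfaultfd_must_wait() code):

    /* decide whether a userfault on this PMD should sleep and wait */
    static bool must_wait_on_pmd(pmd_t *pmd)
    {
            pmd_t _pmd = READ_ONCE(*pmd);

            if (pmd_none(_pmd))
                    return true;    /* nothing mapped yet: wait for the fault to be served */
            if (!pmd_present(_pmd))
                    return false;   /* e.g. THP migration entry: let the fault retry instead */
            return false;           /* already mapped: no need to wait */
    }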
2018-01-31  mm/huge_memory.c: fix comment in __split_huge_pmd_locked  (Yisheng Xie)  1 file, -4/+3
pmd_trans_splitting() was removed after THP refcounting redesign, therefore related comment should be updated. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: memory_hotplug: remove second __nr_to_section in register_page_bootmem_info_section()  (Oscar Salvador)  1 file, -2/+2
In register_page_bootmem_info_section() we call __nr_to_section() in order to get the mem_section struct at the beginning of the function. Since we already got it, there is no need for a second call to __nr_to_section(). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  fs/proc/task_mmu.c: do not show VmExe bigger than total executable virtual memory  (Konstantin Khlebnikov)  1 file, -3/+8
If the start_code / end_code pointers are screwed then "VmExe" could be bigger than total executable virtual memory and "VmLib" becomes negative:

    VmExe:    294320 kB
    VmLib:    18446744073709327564 kB

VmExe and VmLib are documented as text segment and shared library code size. Now their sum will always be equal to mm->exec_vm, which sums the sizes of executable, not writable and not stack areas. I've seen this for a huge (>2Gb) statically linked binary which has the whole world inside. For it the start_code .. end_code range also covers one of the rodata sections. Probably this is a bug in a customized linker, elf loader or both. Anyway, CONFIG_CHECKPOINT_RESTORE allows changing these pointers, thus we cannot trust them without validation. Link: http://lkml.kernel.org/r/150728955451.743749.11276392315459539583.stgit@buzz Signed-off-by: Konstantin Khlebnikov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
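The fix amounts to clamping the text size to mm->exec_vm before deriving the library size. A sketch of the arithmetic (simplified from what task_mem() ends up doing):

    unsigned long text, lib;

    text = PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK);
    text = min(text, mm->exec_vm << PAGE_SHIFT);    /* VmExe can no longer exceed exec_vm */
    lib  = (mm->exec_vm << PAGE_SHIFT) - text;      /* so VmLib can no longer underflow */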
2018-01-31  mm: update comment describing tlb_gather_mmu  (Mike Rapoport)  1 file, -4/+11
The comment describes @fullmm argument, but the function has no such parameter. Update the comment to match the code and convert it to kernel-doc markup. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/memory_hotplug.c: remove unnecessary check from register_page_bootmem_info_section()  (Oscar Salvador)  1 file, -3/+0
When we call register_page_bootmem_info_section() having CONFIG_SPARSEMEM_VMEMMAP enabled, we check if the pfn is valid. This check is redundant as we already checked this in register_page_bootmem_info_node() before calling register_page_bootmem_info_section(), so let's get rid of it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm, hugetlb: remove hugepages_treat_as_movable sysctl  (Michal Hocko)  4 files, -36/+1
hugepages_treat_as_movable was introduced by 396faf0303d2 ("Allow huge page allocations to use GFP_HIGH_MOVABLE") to allow hugetlb allocations from ZONE_MOVABLE even when hugetlb pages were not migratable. The purpose of the movable zone was different at the time. It aimed at reducing memory fragmentation, and hugetlb pages, being long lived and large, were not contributing to the fragmentation, so it was acceptable to use the zone back then. Things have changed though and the primary purpose of the zone became the migratability guarantee. If we allow non-migratable hugetlb pages to be in ZONE_MOVABLE, memory hotplug might fail to offline the memory. Remove the knob and only rely on hugepage_migration_supported to allow movable zones. Mel said:

: Primarily it was aimed at allowing the hugetlb pool to safely shrink with
: the ability to grow it again. The use case was for batched jobs, some of
: which needed huge pages and others that did not but didn't want the memory
: useless pinned in the huge pages pool.
:
: I suspect that more users rely on THP than hugetlbfs for flexible use of
: huge pages with fallback options so I think that removing the option
: should be ok.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Reported-by: Alexandru Moise <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Alexandru Moise <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: remove unused pgdat_reclaimable_pages()  (Jan Kara)  3 files, -34/+0
Remove unused function pgdat_reclaimable_pages() and node_page_state_snapshot() which becomes unused as well. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jan Kara <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/interval_tree.c: use vma_pages() helper  (Vasyl Gomonovych)  1 file, -1/+1
Use vma_pages function on vma object instead of explicit computation. mm/interval_tree.c:21:27-33: WARNING: Consider using vma_pages helper Generated by: scripts/coccinelle/api/vma_pages.cocci Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vasyl Gomonovych <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  selftests/vm: move 128TB mmap boundary test to generic directory  (Aneesh Kumar K.V)  4 files, -179/+311
Architectures like PPC64 support mmap hint address based large address space selection. This test can be run on those architectures too. Move the test from the x86 selftests to selftest/vm so that other architectures can use it too. We also add a few new test scenarios in this patch. We do test a few boundary conditions before we do a high address mmap. PPC64 uses the address limit to validate the address in the fault path. We had bugs in this area w.r.t SLB fault handling before we updated the address limit. We also touch the allocated space to make sure we don't have any bugs in the fault handling path. [[email protected]: restore tools/testing/selftests/vm/Makefile alpha ordering] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: do not stall register_shrinker()  (Minchan Kim)  1 file, -0/+9
Shakeel Butt reported he has observed in production systems that the job loader gets stuck for 10s of seconds while doing a mount operation. It turns out that it was stuck in register_shrinker() because some unrelated job was under memory pressure and was spending time in shrink_slab(). Machines have a lot of shrinkers registered, and jobs under memory pressure have to traverse all of those memcg-aware shrinkers and affect unrelated jobs which want to register their own shrinkers. To solve the issue, this patch simply bails out of slab shrinking if it is found that someone wants to register a shrinker in parallel. A downside is it could cause unfair shrinking between shrinkers. However, it should be rare and we can add more complicated logic if we find it's not enough. [[email protected]: tweak code comment] Link: http://lkml.kernel.org/r/20171115005602.GB23810@bbox Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: Shakeel Butt <[email protected]> Reported-by: Shakeel Butt <[email protected]> Tested-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/page_alloc.c: fix comment in __get_free_pages()  (Jiankang Chen)  1 file, -1/+1
__get_free_pages() will return a virtual address, but it is not just a 32-bit address, for example on a 64-bit system. And this comment really confuses new readers of mm. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jiankang Chen <[email protected]> Reported-by: Hanjun Guo <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yisheng Xie <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/page_owner.c: use PTR_ERR_OR_ZERO()  (Vasyl Gomonovych)  1 file, -3/+1
Fix ptr_ret.cocci warnings: mm/page_owner.c:639:1-3: WARNING: PTR_ERR_OR_ZERO can be used Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR Generated by: scripts/coccinelle/api/ptr_ret.cocci Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vasyl Gomonovych <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: memcontrol: fix excessive complexity in memory.stat reporting  (Johannes Weiner)  2 files, -84/+113
We've seen memory.stat reads in top-level cgroups take up to fourteen seconds during a userspace bug that created tens of thousands of ghost cgroups pinned by lingering page cache. Even with a more reasonable number of cgroups, aggregating memory.stat is unnecessarily heavy. The complexity is this: nr_cgroups * nr_stat_items * nr_possible_cpus where the stat items are ~70 at this point. With 128 cgroups and 128 CPUs - decent, not enormous setups - reading the top-level memory.stat has to aggregate over a million per-cpu counters. This doesn't scale. Instead of spreading the source of truth across all CPUs, use the per-cpu counters merely to batch updates to shared atomic counters. This is the same as the per-cpu stocks we use for charging memory to the shared atomic page_counters, and also the way the global vmstat counters are implemented. Vmstat has elaborate spilling thresholds that depend on the number of CPUs, amount of memory, and memory pressure - carefully balancing the cost of counter updates with the amount of per-cpu error. That's because the vmstat counters are system-wide, but also used for decisions inside the kernel (e.g. NR_FREE_PAGES in the allocator). Neither is true for the memory controller. Use the same static batch size we already use for page_counter updates during charging. The per-cpu error in the stats will be 128k, which is an acceptable ratio of cores to memory accounting granularity. [[email protected]: fix warning in __this_cpu_xchg() calls] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
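The batching scheme can be sketched as follows (illustrative: the function and field names are modeled on mod_memcg_state(), and the batch constant is an assumption here, not the upstream value):

    #define MEMCG_STAT_BATCH        32      /* assumed batch size for this sketch */

    /* accumulate per-cpu, spill into one shared atomic counter per stat item */
    static void mod_memcg_state_sketch(struct mem_cgroup *memcg, int idx, int val)
    {
            long x = val + __this_cpu_read(memcg->stat_cpu->count[idx]);

            if (unlikely(abs(x) > MEMCG_STAT_BATCH)) {
                    atomic_long_add(x, &memcg->stat[idx]);
                    x = 0;
            }
            __this_cpu_write(memcg->stat_cpu->count[idx], x);
    }

Readers then only sum nr_cgroups * nr_stat_items atomics instead of touching every possible CPU.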
2018-01-31  mm: memcontrol: implement lruvec stat functions on top of each other  (Johannes Weiner)  1 file, -22/+22
The implementation of the lruvec stat functions and their variants for accounting through a page, or accounting from a preemptible context, are mostly identical and needlessly repetitive. Implement the lruvec_page functions by looking up the page's lruvec and then using the lruvec function. Implement the functions for preemptible contexts by disabling preemption before calling the atomic context functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: memcontrol: eliminate raw access to stat and event counters  (Johannes Weiner)  2 files, -45/+45
Replace all raw 'this_cpu_' modifications of the stat and event per-cpu counters with API functions such as mod_memcg_state(). This makes the code easier to read, but is also in preparation for the next patch, which changes the per-cpu implementation of those counters. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Johannes Weiner <[email protected]> Acked-by: Vladimir Davydov <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/filemap.c: remove include of hardirq.h  (Yang Shi)  1 file, -1/+0
in_atomic() has been moved to include/linux/preempt.h, and the filemap.c doesn't use in_atomic() directly at all, so it sounds unnecessary to include hardirq.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: split deferred_init_range into initializing and freeing parts  (Pavel Tatashin)  1 file, -70/+76
In deferred_init_range() we initialize struct pages, and also free them to the buddy allocator. We do it in separate loops, because the buddy page is computed ahead, so we do not want to access a struct page that has not been initialized yet. There is still, however, a corner case where it is potentially possible to access an uninitialized struct page: this is when the buddy page is from the next memblock range. This patch fixes the problem by splitting deferred_init_range() into two functions: one to initialize struct pages, and another to free them. In addition, this patch brings the following improvements:

- Get rid of the __def_free() helper function, and simplify the loop logic by adding a new pfn validity check function: deferred_pfn_valid().
- Reduce the number of variables that we track, so there is a higher chance that we will avoid using the stack to store/load variables inside hot loops.
- Enable future multi-threading of these functions: do initialization in multiple threads, wait for all threads to finish, then do the freeing part in multiple threads.

Tested on x86 with 1T of memory to make sure no regressions are introduced. [[email protected]: fix spello in comment] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: use sc->priority for slab shrink targets  (Josef Bacik)  2 files, -47/+23
Previously we were using the ratio of the number of lru pages scanned to the number of eligible lru pages to determine the number of slab objects to scan. The problem with this is that these two things have nothing to do with each other, so in slab heavy work loads where there is little to no page cache we can end up with the pages scanned being a very low number. This means that we reclaim next to no slab pages and waste a lot of time reclaiming small amounts of space. Consider the following scenario, where we have the following values and the rest of the memory usage is in slab:

    Active:   58840 kB
    Inactive: 46860 kB

Every time we do a get_scan_count() we do this:

    scan = size >> sc->priority

where sc->priority starts at DEF_PRIORITY, which is 12. The first loop through reclaim would result in a scan target of 2 pages to 11715 total inactive pages, and 3 pages to 14710 total active pages. This is a really really small target for a system that is entirely slab pages. And this is super optimistic; this assumes we even get to scan these pages. We don't increment sc->nr_scanned unless we 1) isolate the page, which assumes it's not in use, and 2) can lock the page. Under pressure these numbers could probably go down; I'm sure there's some random pages from daemons that aren't actually in use, so the targets get even smaller. Instead use sc->priority in the same way we use it to determine scan amounts for the lru's. This generally equates to pages. Consider the following:

    slab_pages = (nr_objects * object_size) / PAGE_SIZE

What we would like to do is:

    scan = slab_pages >> sc->priority

but we don't know the number of slab pages each shrinker controls, only the objects. However, say that theoretically we knew how many pages a shrinker controlled; we'd still have to convert this to objects, which would look like the following:

    scan = shrinker_pages >> sc->priority
    scan_objects = (PAGE_SIZE / object_size) * scan

or written another way:

    scan_objects = (shrinker_pages >> sc->priority) * (PAGE_SIZE / object_size)

which can thus be written:

    scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >> sc->priority

which is just:

    scan_objects = nr_objects >> sc->priority

We don't need to know exactly how many pages each shrinker represents; its object counts are all the information we need. Making this change allows us to place an appropriate amount of pressure on the shrinker pools for their relative size. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Josef Bacik <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Dave Chinner <[email protected]> Acked-by: Andrey Ryabinin <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
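In code, the conclusion of the derivation is a small change to how the scan target is computed. A simplified sketch of the shape of the computation in do_shrink_slab() (not a verbatim copy):

    static unsigned long shrink_scan_target(struct shrinker *shrinker,
                                            struct shrink_control *shrinkctl,
                                            int priority)
    {
            unsigned long freeable = shrinker->count_objects(shrinker, shrinkctl);
            u64 delta;

            delta = freeable >> priority;   /* same scaling as the LRU scan targets */
            delta *= 4;
            do_div(delta, shrinker->seeks); /* DEFAULT_SEEKS is 2, so roughly 2 * (freeable >> priority) */

            return delta;
    }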
2018-01-31  mm: show total hugetlb memory consumption in /proc/meminfo  (Roman Gushchin)  2 files, -21/+42
Currently we display some hugepage statistics (total, free, etc) in /proc/meminfo, but only for the default hugepage size (e.g. 2Mb). If hugepages of different sizes are used (like 2Mb and 1Gb on x86-64), /proc/meminfo output can be confusing, as non-default sized hugepages are not reflected at all, and there are no signs that they exist and consume system memory. To solve this problem, let's display the total amount of memory consumed by hugetlb pages of all sizes (both free and used). Let's call it "Hugetlb", and display the size in kB to match the generic /proc/meminfo style. For example (1024 2Mb pages and 2 1Gb pages are pre-allocated):

    $ cat /proc/meminfo
    MemTotal:        8168984 kB
    MemFree:         3789276 kB
    <...>
    CmaFree:               0 kB
    HugePages_Total:    1024
    HugePages_Free:     1024
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       2048 kB
    Hugetlb:         4194304 kB
    DirectMap4k:       32632 kB
    DirectMap2M:     4161536 kB
    DirectMap1G:     6291456 kB

Also, this patch updates the corresponding docs to reflect the Hugetlb entry meaning and the difference between Hugetlb and HugePages_Total * Hugepagesize. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
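Conceptually the new counter just sums pages over every hstate. A simplified sketch (not the exact hugetlb_report_meminfo() code):

    /* total memory held in hugetlb pages of all sizes, in kB */
    static unsigned long hugetlb_total_kb(void)
    {
            struct hstate *h;
            unsigned long kb = 0;

            for_each_hstate(h)
                    kb += h->nr_huge_pages *
                          ((PAGE_SIZE << huge_page_order(h)) / 1024);

            return kb;
    }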
2018-01-31  mm: drop hotplug lock from lru_add_drain_all()  (Michal Hocko)  3 files, -10/+9
Pulling cpu hotplug locks inside an mm core function like lru_add_drain_all just asks for problems and the recent lockdep splat [1] just proves this. While the usage in that particular case might be wrong, we should avoid the locking as lru_add_drain_all() is used in many places. It seems that this is not all that hard to achieve actually. We have done the same thing for drain_all_pages, which is analogous, in commit a459eeb7b852 ("mm, page_alloc: do not depend on cpu hotplug locks inside the allocator"). All we have to care about is to handle:

- the work item might be executed on a different cpu in a worker from an unbound pool, so it doesn't run pinned on the cpu
- we have to make sure that we do not race with page_alloc_cpu_dead calling lru_add_drain_cpu

The first part is already handled because the worker calls lru_add_drain, which disables preemption when calling lru_add_drain_cpu on the local cpu it is draining. The latter is true because page_alloc_cpu_dead is called on the controlling CPU after the hotplugged CPU vanished completely. [1] http://lkml.kernel.org/r/[email protected] [add a cpu hotplug locking interaction as per tglx] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/mempolicy: add nodes_empty check in SYSC_migrate_pages  (Yisheng Xie)  1 file, -3/+7
As the manpage of migrate_pages states, errno should be set to EINVAL when none of the node IDs specified by new_nodes are on-line and allowed by the process's current cpuset context, or none of the specified nodes contain memory. However, when tested with the following case:

    new_nodes = 0;
    old_nodes = 0xf;
    ret = migrate_pages(pid, old_nodes, new_nodes, MAX);

ret will be 0 and no errno is set. As new_nodes is empty, we should expect EINVAL as documented. To fix cases like the above, this patch checks whether the target nodes ANDed with the current task_nodes is empty, and then whether that result ANDed with node_states[N_MEMORY] is empty. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Chris Salls <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Tan Xiaojun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm/mempolicy: fix the check of nodemask from user  (Yisheng Xie)  1 file, -3/+20
As Xiaojun reported, the LTP migrate_pages01 test fails on an arm64 system which has 4 nodes [0...3], all with memory, and CONFIG_NODES_SHIFT=2:

    migrate_pages01    0  TINFO  :  test_invalid_nodes
    migrate_pages01   14  TFAIL  :  migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
    migrate_pages01   15  TFAIL  :  migrate_pages_common.c:55: call succeeded unexpectedly

In this case the test_invalid_nodes part of migrate_pages01 calls SYSC_migrate_pages as:

    migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0

The new nodes specify one or more node IDs that are greater than the maximum supported node ID, however, errno is not set to EINVAL as expected. As the man pages of set_mempolicy[1], mbind[2], and migrate_pages[3] mention, when nodemask specifies one or more node IDs that are greater than the maximum supported node ID, errno should be set to EINVAL. However, get_nodes only checks whether the part of bits [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) is zero or not, and leaves [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES)) unchecked. This patch checks the bits of [MAX_NUMNODES, maxnode) in get_nodes to let migrate_pages set errno to EINVAL when nodemask specifies one or more node IDs that are greater than the maximum supported node ID, which follows the manpages' guidance. [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html [2] http://man7.org/linux/man-pages/man2/mbind.2.html [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Reported-by: Tan Xiaojun <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Chris Salls <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
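The essence of the fix is to reject any set bit in [MAX_NUMNODES, maxnode). A conceptual sketch (the real get_nodes() works on raw longs copied from userspace, so the actual code is more involved):

    /* illustrative only: a set bit at or above MAX_NUMNODES makes the mask invalid */
    static int check_high_bits(const unsigned long *mask, unsigned long maxnode)
    {
            unsigned long bit;

            for (bit = MAX_NUMNODES; bit < maxnode; bit++)
                    if (test_bit(bit, mask))
                            return -EINVAL;

            return 0;
    }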
2018-01-31  mm/mempolicy: remove redundant check in get_nodes  (Yisheng Xie)  1 file, -2/+0
We have already checked whether maxnode is a page worth of bits, by: maxnode > PAGE_SIZE*BITS_PER_BYTE So no need to check it once more. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yisheng Xie <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: David Rientjes <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Chris Salls <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: Tan Xiaojun <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  mm: relax deferred struct page requirements  (Pavel Tatashin)  4 files, -9/+1
There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT, as all the page initialization code is in common code. Also, there is no need to depend on MEMORY_HOTPLUG, as initialization code does not really use hotplug memory functionality. So, we can remove this requirement as well. This patch allows to use deferred struct page initialization on all platforms with memblock allocator. Tested on x86, arm64, and sparc. Also, verified that code compiles on PPC with CONFIG_MEMORY_HOTPLUG disabled. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tatashin <[email protected]> Acked-by: Heiko Carstens <[email protected]> [s390] Reviewed-by: Khalid Aziz <[email protected]> Acked-by: Michael Ellerman <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Steven Sistare <[email protected]> Cc: Daniel Jordan <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Reza Arbab <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  zswap: same-filled pages handling  (Srividya Desireddy)  1 file, -5/+66
Zswap is a cache which compresses the pages that are being swapped out and stores them into a dynamically allocated RAM-based memory pool. Experiments have shown that around 10-20% of pages stored in zswap are same-filled pages (i.e. the contents of the page are all the same), but these pages are handled as normal pages by compressing and allocating memory in the pool. This patch adds a check in zswap_frontswap_store() to identify a same-filled page before compression of the page. If the page is a same-filled page, set zswap_entry.length to zero, save the same-filled value and skip the compression of the page and the allocation of memory in zpool. In zswap_frontswap_load(), check if the value of zswap_entry.length is zero for the page to be loaded. If zswap_entry.length is zero, fill the page with the same-filled value. This saves the decompression time during load. On an ARM Quad Core 32-bit device with 1.5GB RAM, by launching and relaunching different applications, out of ~64000 pages stored in zswap, ~11000 pages were same-value filled pages (including zero-filled pages) and ~9000 pages were zero-filled pages. An average of 17% of pages (including zero-filled pages) in zswap are same-value filled pages and 14% of pages are zero-filled pages. An average of 3% of pages are same-filled non-zero pages. The below table shows the execution time profiling with the patch.

                              Baseline    With patch    % Improvement
    -----------------------------------------------------------------
    *Zswap Store Time           26.5ms        18ms            32%
     (of same value pages)
    *Zswap Load Time            25.5ms        13ms            49%
     (of same value pages)
    -----------------------------------------------------------------

On an Ubuntu PC with 2GB RAM, while executing kernel build and other test scripts and running multimedia applications, out of 360000 pages stored in zswap, 78000 (~22%) of pages were found to be same-value filled pages (including zero-filled pages) and 64000 (~17%) are zero-filled pages. So an average of 5% of pages are same-filled non-zero pages. The below table shows the execution time profiling with the patch.

                              Baseline    With patch    % Improvement
    -----------------------------------------------------------------
    *Zswap Store Time             91ms        74ms            19%
     (of same value pages)
    *Zswap Load Time              50ms        7.5ms           85%
     (of same value pages)
    -----------------------------------------------------------------

    *The execution times may vary with test device used.

Dan said:

: I did test this patch out this week, and I added some instrumentation to
: check the performance impact, and tested with a small program to try to
: check the best and worst cases.
:
: When doing a lot of swap where all (or almost all) pages are same-value, I
: found this patch does save both time and space, significantly. The exact
: improvement in time and space depends on which compressor is being used,
: but roughly agrees with the numbers you listed.
:
: In the worst case situation, where all (or almost all) pages have the
: same-value *except* the final long (meaning, zswap will check each long on
: the entire page but then still have to pass the page to the compressor),
: the same-value check is around 10-15% of the total time spent in
: zswap_frontswap_store(). That's a not-insignificant amount of time, but
: it's not huge. Considering that most systems will probably be swapping
: pages that aren't similar to the worst case (although I don't have any
: data to know that), I'd say the improvement is worth the possible
: worst-case performance impact.
[[email protected]: add memset_l instead of for loop] Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1 Signed-off-by: Srividya Desireddy <[email protected]> Acked-by: Dan Streetman <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Dinakar Reddy Pathireddy <[email protected]> Cc: SHARAN ALLUR <[email protected]> Cc: RAJIB BASU <[email protected]> Cc: JUHUN KIM <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Timofey Titovets <[email protected]> Cc: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
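The same-filled handling described in the zswap entry above boils down to two small helpers; a simplified sketch, close in spirit to mm/zswap.c but not a verbatim copy (memset_l on the load side matches the fixup note):

    /* detect a page whose every word holds the same value */
    static bool page_same_filled(void *ptr, unsigned long *value)
    {
            unsigned long *page = ptr;
            unsigned int pos;

            for (pos = 1; pos < PAGE_SIZE / sizeof(*page); pos++) {
                    if (page[pos] != page[0])
                            return false;
            }
            *value = page[0];
            return true;
    }

    /* reconstruct such a page on load without running the decompressor */
    static void fill_same_filled_page(void *ptr, unsigned long value)
    {
            memset_l(ptr, value, PAGE_SIZE / sizeof(unsigned long));
    }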
2018-01-31  mm: kmemleak: remove unused hardirq.h  (Yang Shi)  1 file, -1/+0
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by kmemleak at all. So, remove the unused hardirq.h. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-01-31  include/linux/sched/mm.h: uninline mmdrop_async(), etc  (Andrew Morton)  2 files, -234/+238
mmdrop_async() is only used in fork.c. Move that and its support functions into fork.c, uninline it all. Quite a lot of code gets moved around to avoid forward declarations. Cc: Ingo Molnar <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>