Age  Commit message  Author  Files  Lines
2018-06-07  include/linux/gfp.h: fix the annotation of GFP_ZONE_TABLE  (Huaisheng Ye, 1 file, -2/+2)
When bit is equal to 0x4, OPT_ZONE_DMA32 should be looked up from GFP_ZONE_TABLE. OPT_ZONE_DMA32 is equal to ZONE_DMA32 or ZONE_NORMAL depending on whether CONFIG_ZONE_DMA32 is set. Similarly, when bit is equal to 0xc, OPT_ZONE_DMA32 should be looked up together with the GFP_MOVABLE allocation policy, so ZONE_DMA32 or ZONE_NORMAL is again the possible result. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Huaisheng Ye <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Kate Stewart <[email protected]> Cc: "Levin, Alexander (Sasha Levin)" <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
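For reference, a simplified sketch of the lookup this annotation documents; it paraphrases the gfp_zone() logic in include/linux/gfp.h, so treat the exact macro names as illustrative rather than a quote of the patch:

    /* Sketch: GFP_ZONE_TABLE packs one zone per GFP_ZONEMASK combination.
     * Bit pattern 0x4 is ___GFP_DMA32, 0xc is ___GFP_DMA32 | ___GFP_MOVABLE;
     * both table slots resolve to OPT_ZONE_DMA32, i.e. ZONE_DMA32 when
     * CONFIG_ZONE_DMA32=y and ZONE_NORMAL otherwise. */
    static inline enum zone_type gfp_zone(gfp_t flags)
    {
            enum zone_type z;
            int bit = (__force int) (flags & GFP_ZONEMASK);

            z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
                                     ((1 << GFP_ZONES_SHIFT) - 1);
            VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
            return z;
    }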
2018-06-07  mm/shmem.c: zero out unused vma fields in shmem_pseudo_vma_init()  (Kirill A. Shutemov, 1 file, -2/+1)
shmem/tmpfs uses a pseudo vma to allocate pages with the correct NUMA policy. The pseudo vma doesn't have vm_page_prot set. We are going to encode the encryption KeyID in vm_page_prot, and having garbage there causes problems. Zero out all unused fields in the pseudo vma. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
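A minimal sketch of what the fix amounts to, based on the description above; the exact set of fields the real shmem_pseudo_vma_init() still assigns may differ:

    static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
                    struct shmem_inode_info *info, pgoff_t index)
    {
            /* Create a pseudo vma that just contains the policy; zero
             * everything else so fields we never set (vm_page_prot,
             * vm_ops, ...) do not carry stack garbage. */
            memset(vma, 0, sizeof(*vma));
            /* Bias interleave by inode number to distribute better across nodes */
            vma->vm_pgoff = index + info->vfs_inode.i_ino;
            vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
    }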
2018-06-07  mm, page_alloc: do not break __GFP_THISNODE by zonelist reset  (Vlastimil Babka, 1 file, -1/+0)
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for allocations that can ignore memory policies. The zonelist is obtained from current CPU's node. This is a problem for __GFP_THISNODE allocations that want to allocate on a different node, e.g. because the allocating thread has been migrated to a different CPU. This has been observed to break SLAB in our 4.4-based kernel, because there it relies on __GFP_THISNODE working as intended. If a slab page is put on wrong node's list, then further list manipulations may corrupt the list because page_to_nid() is used to determine which node's list_lock should be locked and thus we may take a wrong lock and race. Current SLAB implementation seems to be immune by luck thanks to commit 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node") but there may be others assuming that __GFP_THISNODE works as promised. We can fix it by simply removing the zonelist reset completely. There is actually no reason to reset it, because memory policies and cpusets don't affect the zonelist choice in the first place. This was different when commit 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their own restricted zonelists. We might consider this for 4.17 although I don't know if there's anything currently broken. SLAB is currently not affected, but in kernels older than 4.7 that don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node") it is. That's at least 4.4 LTS. Older ones I'll have to check. So stable backports should be more important, but will have to be reviewed carefully, as the code went through many changes. BTW I think that also the ac->preferred_zoneref reset is currently useless if we don't also reset ac->nodemask from a mempolicy to NULL first (which we probably should for the OOM victims etc?), but I would leave that for a separate patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") Acked-by: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  userfaultfd: prevent non-cooperative events vs mcopy_atomic races  (Mike Rapoport, 3 files, -9/+41)
If a process monitored with userfaultfd changes its memory mappings or forks at the same time as the uffd monitor fills the process memory with UFFDIO_COPY, the actual creation of page table entries and copying of the data in mcopy_atomic may happen either before or after the memory mapping modifications, and there is no way for the uffd monitor to maintain a consistent view of the process memory layout. For instance, consider fork() running in parallel with userfaultfd_copy():

        process                  |        uffd monitor
---------------------------------+------------------------------
fork()                           | userfaultfd_copy()
...                              | ...
    dup_mmap()                   |     down_read(mmap_sem)
    down_write(mmap_sem)         |     /* create PTEs, copy data */
    dup_uffd()                   |     up_read(mmap_sem)
    copy_page_range()            |
    up_write(mmap_sem)           |
    dup_uffd_complete()          |
        /* notify monitor */     |

If userfaultfd_copy() takes the mmap_sem first, the new page(s) will be present by the time copy_page_range() is called and they will appear in the child's memory mappings. However, if fork() is the first to take the mmap_sem, the new pages won't be mapped in the child's address space. If the pages are not present and the child tries to access them, the monitor will get a page fault notification and everything is fine. However, if the pages *are present*, the child can access them without uffd noticing, and if we copy them into the child it'll see the wrong data. Since we are talking about a background copy, we'd need to decide whether the pages should be copied or not regardless of #PF notifications. Since the userfaultfd monitor has no way to determine what the order was, let's disallow userfaultfd_copy in parallel with the non-cooperative events. In that case we return -EAGAIN and the uffd monitor can understand that userfaultfd_copy() clashed with a non-cooperative event and take an appropriate action. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Andrei Vagin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
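A sketch of the resulting check, assuming a per-context mmap_changing flag that the non-cooperative event handlers set while they are in flight; the actual field and helper names in fs/userfaultfd.c and mm/userfaultfd.c may differ:

    /*
     * Sketch: fork/mremap/madvise handlers set *mmap_changing before
     * notifying the monitor and clear it once the event is acknowledged;
     * the UFFDIO_COPY path bails out instead of racing with them.
     */
    if (mmap_changing && READ_ONCE(*mmap_changing)) {
            err = -EAGAIN;
            goto out_unlock;
    }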
2018-06-07  mm: memcg: allow lowering memory.swap.max below the current usage  (Tejun Heo, 2 files, -5/+6)
Currently an attempt to set swap.max to a value lower than the actual swap usage fails, which causes configuration problems as there's no way of lowering the configuration below the current usage short of turning off swap entirely. This makes swap.max difficult to use and allows delegatees to lock the delegator out of reducing swap allocation. This patch updates swap_max_write() so that the limit can be lowered below the current usage. It doesn't implement active reclaiming of swap entries for the following reasons.

* mem_cgroup_swap_full() already tells the swap machinery to aggressively reclaim swap entries if the usage is above 50% of the limit, so simply lowering the limit automatically triggers gradual reclaim.

* Forcing back swapped-out pages is likely to heavily impact the workload and mess up the working set. Given that swap usually is a lot less valuable and less scarce, letting the existing usage dissipate over time through the above gradual reclaim and as pages are faulted back in is likely the better behavior.

Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/shmem.c: use new return type vm_fault_t  (Souptick Joarder, 1 file, -6/+6)
Use the new return type vm_fault_t for the fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t"). vmf_error() is the newly introduced inline function in 4.17-rc6. Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
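As an illustration of the conversion pattern (not the exact shmem diff), a fault handler that used to return an int errno/VM_FAULT mix now returns vm_fault_t and maps errnos through vmf_error():

    static vm_fault_t example_fault(struct vm_fault *vmf)      /* was: static int */
    {
            int err = do_something(vmf);    /* hypothetical helper returning 0 or -errno */

            if (err)
                    return vmf_error(err);  /* -ENOMEM -> VM_FAULT_OOM, else VM_FAULT_SIGBUS */
            return VM_FAULT_NOPAGE;
    }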
2018-06-07  slub: remove 'reserved' file from sysfs  (Matthew Wilcox, 1 file, -7/+0)
Christoph doubts anyone was using the 'reserved' file in sysfs, so remove it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  slub: remove kmem_cache->reserved  (Matthew Wilcox, 2 files, -22/+20)
The reserved field was only used for embedding an rcu_head in the data structure. With the previous commit, we no longer need it. That lets us remove the 'reserved' argument to a lot of functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  slab,slub: remove rcu_head size checks  (Matthew Wilcox, 2 files, -27/+2)
rcu_head may now grow larger than list_head without affecting slab or slub. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: add hmm_data to struct page  (Matthew Wilcox, 2 files, -12/+8)
Make hmm_data an explicit member of the struct page union. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: add pt_mm to struct page  (Matthew Wilcox, 2 files, -4/+3)
For pgd page table pages, x86 overloads the page->index field to store a pointer to the mm_struct. Rename this to pt_mm so it's visible to other users. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: improve struct page documentation  (Matthew Wilcox, 1 file, -21/+19)
Rewrite the documentation to describe what you can use in struct page rather than what you can't. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: combine LRU and main union in struct page  (Matthew Wilcox, 2 files, -52/+47)
This gives us five words of space in a single union in struct page. The compound_mapcount moves position (from offset 24 to offset 20) on 64-bit systems, but that does not seem likely to cause any trouble. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: move lru union within struct page  (Matthew Wilcox, 2 files, -55/+55)
Since the LRU is two words, this does not affect the double-word alignment of SLUB's freelist. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: use page->deferred_list  (Matthew Wilcox, 2 files, -6/+3)
Now that we can represent the location of 'deferred_list' in C instead of comments, make use of that ability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: combine first three unions in struct page  (Matthew Wilcox, 1 file, -33/+33)
By combining these three one-word unions into one three-word union, we make it easier for users to add their own multi-word fields to struct page, as well as making it obvious that SLUB needs to keep its double-word alignment for its freelist & counters. No field moves position; verified with pahole. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: move _refcount out of struct page union  (Matthew Wilcox, 1 file, -15/+10)
Keeping the refcount in the union only encourages people to put something else in the union which will overlap with _refcount and eventually explode messily. pahole reports no fields change location. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: move 'private' union within struct page  (Matthew Wilcox, 2 files, -49/+27)
By moving page->private to the fourth word of struct page, we can put the SLUB counters in the same word as SLAB's s_mem and still do the cmpxchg_double trick. Now the SLUB counters no longer overlap with the mapcount or refcount so we can drop the call to page_mapcount_reset() and simplify set_page_slub_counters() to a single line. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: switch s_mem and slab_cache in struct page  (Matthew Wilcox, 2 files, -2/+3)
This will allow us to store slub's counters in the same bits as slab's s_mem. slub now needs to set page->mapping to NULL as it frees the page, just like slab does. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: mark pages in use for page tables  (Matthew Wilcox, 5 files, -1/+12)
Define a new PageTable bit in the page_type and use it to mark pages in use as page tables. This can be helpful when debugging crashdumps or analysing memory fragmentation. Add a KPF flag to report these pages to userspace and update page-types.c to interpret that flag. Note that only pages currently accounted as NR_PAGETABLES are tracked as PageTable; this does not include pgd/p4d/pud/pmd pages. Those will be the subject of a later patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: split page_type out from _mapcount  (Matthew Wilcox, 5 files, -35/+43)
We're already using a union of many fields here, so stop abusing the _mapcount and make page_type its own field. That implies renaming some of the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring back the PG_buddy, PG_balloon and PG_kmemcg names. As suggested by Kirill, make page_type a bitmask. Because it starts out life as -1 (thanks to sharing the storage with _mapcount), setting a page flag means clearing the appropriate bit. This gives us space for probably twenty or so extra bits (depending how paranoid we want to be about _mapcount underflow). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
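A simplified sketch of the inverted-bit scheme described above; the constants are abbreviated from include/linux/page-flags.h and may not match the tree exactly:

    #define PAGE_TYPE_BASE  0xf0000000      /* page_type starts life as -1 (all bits set) */
    #define PG_buddy        0x00000080

    /* A page "is" a given type when that type's bit has been cleared
     * while the guard bits in PAGE_TYPE_BASE are still set. */
    static inline bool page_is_buddy_sketch(const struct page *page)
    {
            return (page->page_type & (PAGE_TYPE_BASE | PG_buddy)) == PAGE_TYPE_BASE;
    }

    static inline void set_page_buddy_sketch(struct page *page)
    {
            page->page_type &= ~PG_buddy;   /* "setting" the type clears its bit */
    }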
2018-06-07  s390: use _refcount for pgtables  (Matthew Wilcox, 1 file, -9/+12)
Patch series "Rearrange struct page", v6. As presented at LSFMM, this patch-set rearranges struct page to give more contiguous usable space to users who have allocated a struct page for their own purposes. For a graphical view of before-and-after, see the first two tabs of https://docs.google.com/spreadsheets/d/1tvCszs_7FXrjei9_mtFiKV6nW1FLnYyvPvW-qNZhdog/edit?usp=sharing Highlights: - deferred_list now really exists in struct page instead of just a comment. - hmm_data also exists in struct page instead of being a nasty hack. - x86's PGD pages have a real pointer to the mm_struct. - VMalloc pages now have all sorts of extra information stored in them to help with debugging and tuning. - rcu_head is no longer tied to slab in case anyone else wants to free pages by RCU. - slub's counters no longer share space with _refcount. - slub's freelist+counters are now naturally dword aligned. - slub loses a parameter to a lot of functions and a sysfs file. This patch (of 17): s390 borrows the storage used for _mapcount in struct page in order to account whether the bottom or top half is being used for 2kB page tables. I want to use that for something else, so use the top byte of _refcount instead of the bottom byte of _mapcount. _refcount may temporarily be incremented by other CPUs that see a stale pointer to this page in the page cache, but each CPU can only increment it by one, and there are no systems with 2^24 CPUs today, so they will not change the upper byte of _refcount. We do have to be a little careful not to lose any of their writes (as they will subsequently decrement the counter). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Martin Schwidefsky <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: save two stranded bits in gfp_mask  (Shakeel Butt, 1 file, -5/+5)
___GFP_COLD and ___GFP_OTHER_NODE were removed but their bits were stranded. Fill the gaps by moving the existing gfp masks around. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm, hugetlbfs: pass fault address to no page handler  (Huang Ying, 1 file, -21/+21)
This is to take better advantage of the general huge page clearing optimization (commit c79b57e462b5: "mm: hugetlb: clear target sub-page last when clearing huge page") for hugetlbfs. In the general optimization patch, the sub-page to access is cleared last to avoid the cache lines of the to-be-accessed sub-page being evicted while the other sub-pages are cleared. This works better if we have the address of the sub-page to access, that is, the fault address inside the huge page. So the hugetlbfs no-page fault handler is changed to pass that information. This will benefit workloads which don't access the beginning of the hugetlbfs huge page after the page fault, under heavy cache contention for the shared last level cache. The patch is a generic optimization which should benefit quite a few workloads, not a specific use case. To demonstrate the performance benefit of the patch, we tested it with vm-scalability running on hugetlbfs. With this patch, the throughput increases ~28.1% in the vm-scalability anon-w-seq test case with 88 processes on a 2-socket Xeon E5 2699 v4 system (44 cores, 88 threads). The test case creates 88 processes, each of which mmaps a big anonymous memory area with MAP_HUGETLB and writes to it from the end to the beginning. For each process, the other processes can be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate is reduced from ~36.3% to ~25.6%, the IPC (instructions per cycle) increases from 0.3 to 0.37, and the time spent in user space is reduced by ~19.3%. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: change return type to vm_fault_t  (Souptick Joarder, 3 files, -6/+6)
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Joe Perches <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: use new return type vm_fault_t  (Souptick Joarder, 3 files, -7/+7)
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jan Kara <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/page_alloc.c: remove useless parameter of finalise_ac()  (Huaisheng Ye, 1 file, -3/+2)
finalise_ac() has a parameter, 'order', which is not used at all. Remove it. Signed-off-by: Huaisheng Ye <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/vmpressure.c: convert to use match_string() helper  (Andy Shevchenko, 1 file, -26/+6)
The new helper returns the index of the matching string in an array. We are going to use it here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/vmpressure.c: use kstrndup instead of kmalloc+strncpy  (Andy Shevchenko, 1 file, -2/+1)
Using kstrndup() simplifies the code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  memcg: introduce memory.min  (Roman Gushchin, 6 files, -50/+202)
Memory controller implements the memory.low best-effort memory protection mechanism, which works perfectly in many cases and allows protecting working sets of important workloads from sudden reclaim. But its semantics has a significant limitation: it works only as long as there is a supply of reclaimable memory. This makes it pretty useless against any sort of slow memory leaks or memory usage increases. This is especially true for swapless systems. If swap is enabled, memory soft protection effectively postpones problems, allowing a leaking application to fill all swap area, which makes no sense. The only effective way to guarantee the memory protection in this case is to invoke the OOM killer. It's possible to handle this case in userspace by reacting on MEMCG_LOW events; but there is still a place for a fail-safe in-kernel mechanism to provide stronger guarantees. This patch introduces the memory.min interface for cgroup v2 memory controller. It works very similarly to memory.low (sharing the same hierarchical behavior), except that it's not disabled if there is no more reclaimable memory in the system. If cgroup is not populated, its memory.min is ignored, because otherwise even the OOM killer wouldn't be able to reclaim the protected memory, and the system can stall. [[email protected]: s/low/min/ in docs] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: move is_pageblock_removable_nolock() to mm/memory_hotplug.c  (Mathieu Malaterre, 3 files, -24/+23)
is_pageblock_removable_nolock() is not used outside of mm/memory_hotplug.c. Move it next to its unique caller, is_mem_section_removable(), and make it static. Remove the prototype in <linux/memory_hotplug.h> to silence the gcc warning (W=1): mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mathieu Malaterre <[email protected]> Suggested-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: /proc/pid/pagemap: hide swap entries from unprivileged users  (Huang Ying, 1 file, -10/+16)
In commit ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace"), /proc/PID/pagemap was restricted to be readable only with CAP_SYS_ADMIN to address a security issue. In commit 1c90308e7a77 ("pagemap: hide physical addresses from non-privileged users"), the restriction was relaxed to make /proc/PID/pagemap readable, but hide the physical addresses from non-privileged users. However, the swap entries are still readable by non-privileged users, which has security implications too. For example, for a page under migration, the swap entry carries physical address information. So, in this patch, the swap entries are hidden from non-privileged users as well. Link: http://lkml.kernel.org/r/[email protected] Fixes: 1c90308e7a77 ("pagemap: hide physical addresses from non-privileged users") Signed-off-by: "Huang, Ying" <[email protected]> Suggested-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Reviewed-by: Konstantin Khlebnikov <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
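A sketch of the swap-entry branch in fs/proc/task_mmu.c after this change, gated on the same show_pfn flag already used for physical addresses (field names from memory, treat them as illustrative):

    } else if (is_swap_pte(pte)) {
            swp_entry_t entry = pte_to_swp_entry(pte);

            /* Only CAP_SYS_ADMIN readers see the swap type/offset;
             * others still get the PM_SWAP flag but a zero frame. */
            if (pm->show_pfn)
                    frame = swp_type(entry) |
                            (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
            flags |= PM_SWAP;
    }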
2018-06-07  mm/memblock: print memblock_remove  (Minchan Kim, 1 file, -0/+5)
A memblock_remove report is useful for seeing why MemTotal in /proc/meminfo differs between two kernels. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: memcontrol: drain memcg stock on force_empty  (Junaid Shahid, 1 file, -0/+3)
The per-cpu memcg stock can retain a charge of up to 32 pages. On a machine with a large number of CPUs, this can amount to a decent amount of memory. Additionally, the force_empty interface might be triggering unneeded memcg reclaims. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: memcontrol: drain stocks on resize limit  (Shakeel Butt, 1 file, -0/+7)
Resizing the memcg limit for cgroup-v2 drains the stocks before triggering the memcg reclaim. Do the same for cgroup-v1 to make the behavior consistent. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
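A sketch of where the drain slots into the v1 resize loop, assuming the same drain_all_stock() helper the v2 path uses; the surrounding retry logic is paraphrased, not quoted:

    /* in the limit-resize retry loop, before falling back to reclaim */
    if (!drained) {
            drain_all_stock(memcg);   /* return per-cpu cached charges first */
            drained = true;
            continue;                 /* retry the limit update */
    }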
2018-06-07  memcg: mark memcg1_events static const  (Greg Thelen, 1 file, -1/+1)
Mark memcg1_events static: it's only used by memcontrol.c. And mark it const: it's not modified. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Thelen <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  memcg: writeback: use memcg->cgwb_list directly  (Wang Long, 3 files, -8/+2)
mem_cgroup_cgwb_list is a very simple wrapper and it will never be used outside of code under CONFIG_CGROUP_WRITEBACK, so use memcg->cgwb_list directly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wang Long <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  tmpfs: allow decoding a file handle of an unlinked file  (Amir Goldstein, 1 file, -1/+10)
tmpfs uses the helper d_find_alias() to find a dentry from a decoded inode, but d_find_alias() skips unhashed dentries, so unlinked files cannot be decoded from a file handle. This can be reproduced using the xfstests test program open_by_handle:

  $ open_by_handle -c /tmp/testdir
  $ open_by_handle -dk /tmp/testdir
  open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file!

To fix this, if d_find_alias() can't find a hashed alias, call d_find_any_alias() to return an unhashed one. Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: NeilBrown <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jeff Layton <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
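The fix boils down to the fallback below (the real patch wraps it in a small shmem-local helper):

    /* Prefer a hashed alias, but fall back to an unhashed one so that
     * open-but-unlinked files can still be decoded from a file handle. */
    dentry = d_find_alias(inode);
    if (!dentry)
            dentry = d_find_any_alias(inode);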
2018-06-07  mm/ksm: move [set_]page_stable_node from ksm.h to ksm.c  (Mike Rapoport, 2 files, -11/+11)
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c, and there is no point in keeping them in include/linux/ksm.h. [[email protected]: fix SYSFS=n build] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/ksm: remove unused page_referenced_ksm declaration  (Mike Rapoport, 1 file, -6/+0)
Commit 9f32624be943 ("mm/rmap: use rmap_walk() in page_referenced()") removed the declaration of page_referenced_ksm for the case CONFIG_KSM=y, but left one for CONFIG_KSM=n. Remove the unused leftover. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  lockdep: fix fs_reclaim annotation  (Omar Sandoval, 3 files, -12/+32)
While revisiting my Btrfs swapfile series [1], I introduced a situation in which reclaim would lock i_rwsem, and even though the swapon() path clearly made GFP_KERNEL allocations while holding i_rwsem, I got no complaints from lockdep. It turns out that the rework of the fs_reclaim annotation was broken: if the current task has PF_MEMALLOC set, we don't acquire the dummy fs_reclaim lock, but when reclaiming we always check this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can fix this by moving the fs_reclaim_{acquire,release}() outside of the memalloc_noreclaim_{save,restore}(), although kswapd is slightly different. After applying this, I got the expected lockdep splats. [1]: https://lwn.net/Articles/625412/ Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation") Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
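The gist of the reorder, sketched for a direct-reclaim path (kswapd, as noted above, needs slightly different treatment):

    /* before: PF_MEMALLOC is already set, so the annotation is skipped */
    noreclaim_flag = memalloc_noreclaim_save();
    fs_reclaim_acquire(sc.gfp_mask);
    /* ... reclaim ... */
    fs_reclaim_release(sc.gfp_mask);
    memalloc_noreclaim_restore(noreclaim_flag);

    /* after: take the fs_reclaim annotation first, then set PF_MEMALLOC */
    fs_reclaim_acquire(sc.gfp_mask);
    noreclaim_flag = memalloc_noreclaim_save();
    /* ... reclaim ... */
    memalloc_noreclaim_restore(noreclaim_flag);
    fs_reclaim_release(sc.gfp_mask);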
2018-06-07  mm: shmem: make stat.st_blksize return huge page size if THP is on  (Yang Shi, 1 file, -0/+14)
Since tmpfs THP was supported in 4.8, hugetlbfs is not the only filesystem with huge page support anymore. tmpfs can use huge pages via THP when mounted with the "huge=" mount option. When applications use huge pages on hugetlbfs, they just need to check the filesystem magic number, but that is not enough for tmpfs. Make stat.st_blksize return the huge page size if the filesystem is mounted with an appropriate "huge=" option, to give applications a hint to optimize their behavior with THP. Some applications may not behave wisely with THP. For example, QEMU may mmap a file at a non-huge-page-aligned hint address with MAP_FIXED, which results in no pages being PMD mapped even though THP is used. Some applications may mmap a file with a non-huge-page-aligned offset. Both behaviors make THP pointless. statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it also may fall back to 4KB pages silently if there are not enough huge pages. Furthermore, a different f_bsize makes the max_blocks and free_blocks calculation harder but without much benefit. Returning the huge page size via stat.st_blksize sounds good enough. Since PUD size huge pages for THP have not been supported, it just returns HPAGE_PMD_SIZE for now. Hugh said:

: Sorry, I have no enthusiasm for this patch; but do I feel strongly
: enough to override you and everyone else to NAK it? No, I don't feel
: that strongly, maybe st_blksize isn't worth arguing over.
:
: We did look at struct stat when designing huge tmpfs, to see if there
: were any fields that should be adjusted for it; but concluded none.
: Yes, it would sometimes be nice to have a quickly accessible indicator
: for when tmpfs has been mounted huge (scanning /proc/mounts for options
: can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
: pages transparently, no difference seemed right.
:
: So, because st_blksize is a not very useful field of struct stat, with
: "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
: of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
: options (force or always, okay; within_size or advise, not so much).
: Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
: for file system I/O" than PAGE_SIZE was.
:
: Which we can expect to speed up some applications and disadvantage
: others, depending on how they interpret st_blksize: just like if we
: changed it in the same way on non-huge tmpfs. (Did I actually try
: changing st_blksize early on, and find it broke something? If so, I've
: now forgotten what, and a search through commit messages didn't find
: it; but I guess we'll find out soon enough.)
:
: If there were an mstat() syscall, returning a field "preferred
: alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
: there; but in stat()'s st_blksize? And what happens when (in future)
: mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
: should that filesystem then advertise a bigger st_blksize, despite the
: same disk layout as before? What happens with DAX?
:
: And this change is not going to help the QEMU suboptimality that
: brought you here (or does QEMU align mmaps according to st_blksize?).
: QEMU ought to work well with kernels without this change, and kernels
: with this change; and I hope it can easily deal with both by avoiding
: that use of MAP_FIXED which prevented the kernel's intended alignment.
[[email protected]: remove unneeded `else'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
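Following up on the st_blksize change above, a sketch of the stat hook it describes; is_huge_enabled() and the surrounding details are reconstructed from the description, not quoted from the patch:

    static int shmem_getattr(const struct path *path, struct kstat *stat,
                             u32 request_mask, unsigned int query_flags)
    {
            struct inode *inode = path->dentry->d_inode;
            struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

            generic_fillattr(inode, stat);
            /* is_huge_enabled(): hypothetical predicate for the huge= mount option */
            if (is_huge_enabled(sbinfo))
                    stat->blksize = HPAGE_PMD_SIZE;
            return 0;
    }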
2018-06-07  mm: vmalloc: pass proper vm_start into debugobjects  (Chintan Pandya, 1 file, -4/+5)
A client can call vunmap() with some intermediate 'addr' which may not be the start of the VM area. The entire unmap code works with vm->vm_start, which is proper, but the debug object API is called with 'addr'. This could be a problem within debug objects. Pass the proper start address into the debug object API. [[email protected]: fix warning] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yisheng Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: vmalloc: avoid racy handling of debugobjects in vunmap  (Chintan Pandya, 1 file, -1/+2)
Currently, the __vunmap flow is:

 1) Release the VM area
 2) Free the debug objects corresponding to that vm area.

This leaves a race window open:

 1) Release the VM area
 1.5) Some other client gets the same vm area
 1.6) This client allocates new debug objects on the same vm area
 2) Free the debug objects corresponding to this vm area.

Here, we actually free the 'other' client's debug objects. Fix this by freeing the debug objects first and then releasing the VM area. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yisheng Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: vmalloc: clean up vunmap to avoid pgtable ops twice  (Chintan Pandya, 1 file, -22/+7)
vunmap does page table clear operations twice when DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled. Clean up the code, as that is unintended. As a perf gain, we save a few microseconds. The ftrace data below was obtained while doing 1 MB of vmalloc/vfree on an ARM64-based SoC *without* this patch applied. After this patch, we can save ~3 us (on 1 extra vunmap_page_range).

  CPU  DURATION            FUNCTION CALLS
   |     |                   |   |   |   |
  6)               |  __vunmap() {
  6)               |    vmap_debug_free_range() {
  6)   3.281 us    |      vunmap_page_range();
  6) + 45.468 us   |    }
  6)   2.760 us    |    vunmap_page_range();
  6) ! 505.105 us  |  }

[[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Yisheng Xie <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Wei Yang <[email protected]> Cc: Byungchul Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/sparse.c: pass the __highest_present_section_nr + 1 to alloc_func()  (Wei Yang, 1 file, -1/+1)
In commit c4e1be9ec113 ("mm, sparsemem: break out of loops early") __highest_present_section_nr is introduced to reduce the loop counts for present section. This is also helpful for usemap and memmap allocation. This patch uses __highest_present_section_nr + 1 to optimize the loop. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm/sparse.c: check __highest_present_section_nr only for a present section  (Wei Yang, 1 file, -3/+1)
When searching for a present section, there are two boundaries:

 * __highest_present_section_nr
 * NR_MEM_SECTIONS

It is known that __highest_present_section_nr is a stricter boundary than NR_MEM_SECTIONS, so it is sufficient to check against __highest_present_section_nr only. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
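Taken together, the two sparse.c changes above amount to a loop of roughly this shape (a sketch; __highest_present_section_nr itself is internal to mm/sparse.c):

    /* Walk only up to the highest present section instead of NR_MEM_SECTIONS. */
    for (nr = 0; nr <= __highest_present_section_nr; nr++) {
            if (!present_section_nr(nr))
                    continue;
            /* ... allocate usemap/memmap for this present section ... */
    }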
2018-06-07  mm, gup: prevent pmd checking race in follow_pmd_mask()  (Huang Ying, 1 file, -11/+27)
mmap_sem will be read locked when calling follow_pmd_mask(). But this cannot prevent the PMD from being changed in all cases when the PTL is unlocked, for example, from pmd_trans_huge() to pmd_none() via MADV_DONTNEED. So it is possible for the pmd_present() check in follow_pmd_mask() to encounter an invalid PMD. This may cause an incorrect VM_BUG_ON() or an infinite loop. Fix this by reading the PMD entry into a local variable with READ_ONCE() and checking the local variable and pmd_none() in the retry loop. As Kirill pointed out, with the PTL unlocked, *pmd may be changed under us, so reading it directly again and again may incur weird bugs. So although using *pmd directly for anything other than the pmd_present() check may be safe, it is still better to read *pmd once and check the local variable multiple times. Replacing all uses of *pmd with the local variable while the PTL is unlocked was suggested by Kirill. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Al Viro <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
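The shape of the fix, sketched with helper names as used in mm/gup.c (the retry handling in the real patch is more involved):

    pmd_t pmdval;

retry:
    pmdval = READ_ONCE(*pmd);          /* take one snapshot; *pmd can change under us */
    if (pmd_none(pmdval))
            return no_page_table(vma, flags);
    if (!pmd_present(pmdval)) {
            if (!(flags & FOLL_MIGRATION))
                    return no_page_table(vma, flags);
            pmd_migration_entry_wait(mm, pmd);
            goto retry;                /* re-read instead of trusting a stale value */
    }
    /* all later checks use pmdval, not *pmd */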
2018-06-07  mm/docs: describe memory.low refinements  (Roman Gushchin, 1 file, -15/+13)
Refine cgroup v2 docs after latest memory.low changes. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07  mm: treat memory.low value inclusive  (Roman Gushchin, 1 file, -3/+3)
If memcg's usage is equal to the memory.low value, avoid reclaiming from this cgroup while there is a surplus of reclaimable memory. This sounds more logical and also matches memory.high and memory.max behavior: both are inclusive. Empty cgroups are not considered protected, so MEMCG_LOW events are not emitted for empty cgroups, if there is no more reclaimable memory in the system. Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>