aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2018-06-08Merge tag 'libnvdimm-for-4.18' of ↵Linus Torvalds4-22/+35
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "This adds a user for the new 'bytes-remaining' updates to memcpy_mcsafe() that you already received through Ingo via the x86-dax- for-linus pull. Not included here, but still targeting this cycle, is support for handling memory media errors (poison) consumed via userspace dax mappings. Summary: - DAX broke a fundamental assumption of truncate of file mapped pages. The truncate path assumed that it is safe to disconnect a pinned page from a file and let the filesystem reclaim the physical block. With DAX the page is equivalent to the filesystem block. Introduce dax_layout_busy_page() to enable filesystems to wait for pinned DAX pages to be released. Without this wait a filesystem could allocate blocks under active device-DMA to a new file. - DAX arranges for the block layer to be bypassed and uses dax_direct_access() + copy_to_iter() to satisfy read(2) calls. However, the memcpy_mcsafe() facility is available through the pmem block driver. In order to safely handle media errors, via the DAX block-layer bypass, introduce copy_to_iter_mcsafe(). - Fix cache management policy relative to the ACPI NFIT Platform Capabilities Structure to properly elide cache flushes when they are not necessary. The table indicates whether CPU caches are power-fail protected. Clarify that a deep flush is always performed on REQ_{FUA,PREFLUSH} requests" * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits) dax: Use dax_write_cache* helpers libnvdimm, pmem: Do not flush power-fail protected CPU caches libnvdimm, pmem: Unconditionally deep flush on *sync libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH acpi, nfit: Remove ecc_unit_size dax: dax_insert_mapping_entry always succeeds libnvdimm, e820: Register all pmem resources libnvdimm: Debug probe times linvdimm, pmem: Preserve read-only setting for pmem devices x86, nfit_test: Add unit test for memcpy_mcsafe() pmem: Switch to copy_to_iter_mcsafe() dax: Report bytes remaining in dax_iomap_actor() dax: Introduce a ->copy_to_iter dax operation uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation xfs, dax: introduce xfs_break_dax_layouts() xfs: prepare xfs_break_layouts() for another layout type xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL mm, fs, dax: handle layout changes to pinned dax mappings mm: fix __gup_device_huge vs unmap mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS ...
2018-06-08Merge branch 'for-4.18/dax' into libnvdimm-for-nextDan Williams4-22/+35
2018-06-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds32-737/+1008
Merge updates from Andrew Morton: - a few misc things - ocfs2 updates - v9fs updates - MM - procfs updates - lib/ updates - autofs updates * emailed patches from Andrew Morton <[email protected]>: (118 commits) autofs: small cleanup in autofs_getpath() autofs: clean up includes autofs: comment on selinux changes needed for module autoload autofs: update MAINTAINERS entry for autofs autofs: use autofs instead of autofs4 in documentation autofs: rename autofs documentation files autofs: create autofs Kconfig and Makefile autofs: delete fs/autofs4 source files autofs: update fs/autofs4/Makefile autofs: update fs/autofs4/Kconfig autofs: copy autofs4 to autofs autofs4: use autofs instead of autofs4 everywhere autofs4: merge auto_fs.h and auto_fs4.h fs/binfmt_misc.c: do not allow offset overflow checkpatch: improve patch recognition lib/ucs2_string.c: add MODULE_LICENSE() lib/mpi: headers cleanup lib/percpu_ida.c: use _irqsave() instead of local_irq_save() + spin_lock lib/idr.c: remove simple_ida_lock lib/bitmap.c: micro-optimization for __bitmap_complement() ...
2018-06-07mm: kvmalloc does not fallback to vmalloc for incompatible gfp flagsMichal Hocko1-2/+4
kvmalloc warned about incompatible gfp_mask to catch abusers (mostly GFP_NOFS) with an intention that this will motivate authors of the code to fix those. Linus argues that this just motivates people to do even more hacks like if (gfp == GFP_KERNEL) kvmalloc else kmalloc I haven't seen this happening much (Linus pointed to bucket_lock special cases an atomic allocation but my git foo hasn't found much more) but it is true that we can grow those in future. Therefore Linus suggested to simply not fallback to vmalloc for incompatible gfp flags and rather stick with the kmalloc path. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Hocko <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Cc: Tom Herbert <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/shmem.c: zero out unused vma fields in shmem_pseudo_vma_init()Kirill A. Shutemov1-2/+1
shmem/tmpfs uses pseudo vma to allocate page with correct NUMA policy. The pseudo vma doesn't have vm_page_prot set. We are going to encode encryption KeyID in vm_page_prot. Having garbage there causes problems. Zero out all unused fields in the pseudo vma. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm, page_alloc: do not break __GFP_THISNODE by zonelist resetVlastimil Babka1-1/+0
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for allocations that can ignore memory policies. The zonelist is obtained from current CPU's node. This is a problem for __GFP_THISNODE allocations that want to allocate on a different node, e.g. because the allocating thread has been migrated to a different CPU. This has been observed to break SLAB in our 4.4-based kernel, because there it relies on __GFP_THISNODE working as intended. If a slab page is put on wrong node's list, then further list manipulations may corrupt the list because page_to_nid() is used to determine which node's list_lock should be locked and thus we may take a wrong lock and race. Current SLAB implementation seems to be immune by luck thanks to commit 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node") but there may be others assuming that __GFP_THISNODE works as promised. We can fix it by simply removing the zonelist reset completely. There is actually no reason to reset it, because memory policies and cpusets don't affect the zonelist choice in the first place. This was different when commit 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their own restricted zonelists. We might consider this for 4.17 although I don't know if there's anything currently broken. SLAB is currently not affected, but in kernels older than 4.7 that don't yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on arbitrary node") it is. That's at least 4.4 LTS. Older ones I'll have to check. So stable backports should be more important, but will have to be reviewed carefully, as the code went through many changes. BTW I think that also the ac->preferred_zoneref reset is currently useless if we don't also reset ac->nodemask from a mempolicy to NULL first (which we probably should for the OOM victims etc?), but I would leave that for a separate patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") Acked-by: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07userfaultfd: prevent non-cooperative events vs mcopy_atomic racesMike Rapoport1-5/+17
If a process monitored with userfaultfd changes it's memory mappings or forks() at the same time as uffd monitor fills the process memory with UFFDIO_COPY, the actual creation of page table entries and copying of the data in mcopy_atomic may happen either before of after the memory mapping modifications and there is no way for the uffd monitor to maintain consistent view of the process memory layout. For instance, let's consider fork() running in parallel with userfaultfd_copy(): process | uffd monitor ---------------------------------+------------------------------ fork() | userfaultfd_copy() ... | ... dup_mmap() | down_read(mmap_sem) down_write(mmap_sem) | /* create PTEs, copy data */ dup_uffd() | up_read(mmap_sem) copy_page_range() | up_write(mmap_sem) | dup_uffd_complete() | /* notify monitor */ | If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be present by the time copy_page_range() is called and they will appear in the child's memory mappings. However, if the fork() is the first to take the mmap_sem, the new pages won't be mapped in the child's address space. If the pages are not present and child tries to access them, the monitor will get page fault notification and everything is fine. However, if the pages *are present*, the child can access them without uffd noticing. And if we copy them into child it'll see the wrong data. Since we are talking about background copy, we'd need to decide whether the pages should be copied or not regardless #PF notifications. Since userfaultfd monitor has no way to determine what was the order, let's disallow userfaultfd_copy in parallel with the non-cooperative events. In such case we return -EAGAIN and the uffd monitor can understand that userfaultfd_copy() clashed with a non-cooperative event and take an appropriate action. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Andrei Vagin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: memcg: allow lowering memory.swap.max below the current usageTejun Heo1-5/+1
Currently an attempt to set swap.max into a value lower than the actual swap usage fails, which causes configuration problems as there's no way of lowering the configuration below the current usage short of turning off swap entirely. This makes swap.max difficult to use and allows delegatees to lock the delegator out of reducing swap allocation. This patch updates swap_max_write() so that the limit can be lowered below the current usage. It doesn't implement active reclaiming of swap entries for the following reasons. * mem_cgroup_swap_full() already tells the swap machinary to aggressively reclaim swap entries if the usage is above 50% of limit, so simply lowering the limit automatically triggers gradual reclaim. * Forcing back swapped out pages is likely to heavily impact the workload and mess up the working set. Given that swap usually is a lot less valuable and less scarce, letting the existing usage dissipate over time through the above gradual reclaim and as they're falted back in is likely the better behavior. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shaohua Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/shmem.c: use new return type vm_fault_tSouptick Joarder1-6/+6
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t") vmf_error() is the newly introduce inline function in 4.17-rc6. Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07slub: remove 'reserved' file from sysfsMatthew Wilcox1-7/+0
Christoph doubts anyone was using the 'reserved' file in sysfs, so remove it. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07slub: remove kmem_cache->reservedMatthew Wilcox1-21/+20
The reserved field was only used for embedding an rcu_head in the data structure. With the previous commit, we no longer need it. That lets us remove the 'reserved' argument to a lot of functions. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07slab,slub: remove rcu_head size checksMatthew Wilcox2-27/+2
rcu_head may now grow larger than list_head without affecting slab or slub. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: combine LRU and main union in struct pageMatthew Wilcox1-1/+1
This gives us five words of space in a single union in struct page. The compound_mapcount moves position (from offset 24 to offset 20) on 64-bit systems, but that does not seem likely to cause any trouble. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: move lru union within struct pageMatthew Wilcox1-4/+4
Since the LRU is two words, this does not affect the double-word alignment of SLUB's freelist. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: use page->deferred_listMatthew Wilcox2-6/+3
Now that we can represent the location of 'deferred_list' in C instead of comments, make use of that ability. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: move 'private' union within struct pageMatthew Wilcox1-18/+2
By moving page->private to the fourth word of struct page, we can put the SLUB counters in the same word as SLAB's s_mem and still do the cmpxchg_double trick. Now the SLUB counters no longer overlap with the mapcount or refcount so we can drop the call to page_mapcount_reset() and simplify set_page_slub_counters() to a single line. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: switch s_mem and slab_cache in struct pageMatthew Wilcox1-0/+1
This will allow us to store slub's counters in the same bits as slab's s_mem. slub now needs to set page->mapping to NULL as it frees the page, just like slab does. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Christoph Lameter <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: "Kirill A . Shutemov" <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: split page_type out from _mapcountMatthew Wilcox1-8/+5
We're already using a union of many fields here, so stop abusing the _mapcount and make page_type its own field. That implies renaming some of the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring back the PG_buddy, PG_balloon and PG_kmemcg names. As suggested by Kirill, make page_type a bitmask. Because it starts out life as -1 (thanks to sharing the storage with _mapcount), setting a page flag means clearing the appropriate bit. This gives us space for probably twenty or so extra bits (depending how paranoid we want to be about _mapcount underflow). Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Jérôme Glisse <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm, hugetlbfs: pass fault address to no page handlerHuang Ying1-21/+21
This is to take better advantage of general huge page clearing optimization (commit c79b57e462b5: "mm: hugetlb: clear target sub-page last when clearing huge page") for hugetlbfs. In the general optimization patch, the sub-page to access will be cleared last to avoid the cache lines of to access sub-page to be evicted when clearing other sub-pages. This works better if we have the address of the sub-page to access, that is, the fault address inside the huge page. So the hugetlbfs no page fault handler is changed to pass that information. This will benefit workloads which don't access the begin of the hugetlbfs huge page after the page fault under heavy cache contention for shared last level cache. The patch is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patch, we tested it with vm-scalability run on hugetlbfs. With this patch, the throughput increases ~28.1% in vm-scalability anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4 system (44 cores, 88 threads). The test case creates 88 processes, each process mmaps a big anonymous memory area with MAP_HUGETLB and writes to it from the end to the begin. For each process, other processes could be seen as other workload which generates heavy cache pressure. At the same time, the cache miss rate reduced from ~36.3% to ~25.6%, the IPC (instruction per cycle) increased from 0.3 to 0.37, and the time spent in user space is reduced ~19.3%. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Jan Kara <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Christopher Lameter <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Punit Agrawal <[email protected]> Cc: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: change return type to vm_fault_tSouptick Joarder2-3/+3
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. See commit 1c8f422059ae ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Joe Perches <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Rientjes <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: use new return type vm_fault_tSouptick Joarder2-5/+5
Use new return type vm_fault_t for fault handler in struct vm_operations_struct. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Link: http://lkml.kernel.org/r/20180511190542.GA2412@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <[email protected]> Reviewed-by: Matthew Wilcox <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jan Kara <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/page_alloc.c: remove useless parameter of finalise_ac()Huaisheng Ye1-3/+2
finalise_ac() has parameter order which is not used at all. Remove it. Signed-off-by: Huaisheng Ye <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/vmpressure.c: convert to use match_string() helperAndy Shevchenko1-26/+6
The new helper returns index of the matching string in an array. We are going to use it here. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/vmpressure.c: use kstrndup instead of kmalloc+strncpyAndy Shevchenko1-2/+1
Using kstrndup() simplifies the code. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07memcg: introduce memory.minRoman Gushchin3-42/+157
Memory controller implements the memory.low best-effort memory protection mechanism, which works perfectly in many cases and allows protecting working sets of important workloads from sudden reclaim. But its semantics has a significant limitation: it works only as long as there is a supply of reclaimable memory. This makes it pretty useless against any sort of slow memory leaks or memory usage increases. This is especially true for swapless systems. If swap is enabled, memory soft protection effectively postpones problems, allowing a leaking application to fill all swap area, which makes no sense. The only effective way to guarantee the memory protection in this case is to invoke the OOM killer. It's possible to handle this case in userspace by reacting on MEMCG_LOW events; but there is still a place for a fail-safe in-kernel mechanism to provide stronger guarantees. This patch introduces the memory.min interface for cgroup v2 memory controller. It works very similarly to memory.low (sharing the same hierarchical behavior), except that it's not disabled if there is no more reclaimable memory in the system. If cgroup is not populated, its memory.min is ignored, because otherwise even the OOM killer wouldn't be able to reclaim the protected memory, and the system can stall. [[email protected]: s/low/min/ in docs] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: move is_pageblock_removable_nolock() to mm/memory_hotplug.cMathieu Malaterre2-23/+23
is_pageblock_removable_nolock() is not used outside of mm/memory_hotplug.c. Move it next to unique caller is_mem_section_removable() and make it static. Remove prototype in <linux/memory_hotplug.h> to silence gcc warning (W=1): mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mathieu Malaterre <[email protected]> Suggested-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/memblock: print memblock_removeMinchan Kim1-0/+5
memblock_remove report is useful to see why MemTotal of /proc/meminfo between two kernels makes difference. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Acked-by: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: memcontrol: drain memcg stock on force_emptyJunaid Shahid1-0/+3
The per-cpu memcg stock can retain a charge of upto 32 pages. On a machine with large number of cpus, this can amount to a decent amount of memory. Additionally force_empty interface might be triggering unneeded memcg reclaims. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Junaid Shahid <[email protected]> Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: memcontrol: drain stocks on resize limitShakeel Butt1-0/+7
Resizing the memcg limit for cgroup-v2 drains the stocks before triggering the memcg reclaim. Do the same for cgroup-v1 to make the behavior consistent. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Greg Thelen <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07memcg: mark memcg1_events static constGreg Thelen1-1/+1
Mark memcg1_events static: it's only used by memcontrol.c. And mark it const: it's not modified. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Thelen <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Vladimir Davydov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07memcg: writeback: use memcg->cgwb_list directlyWang Long2-7/+2
mem_cgroup_cgwb_list is a very simple wrapper and it will never be used outside of code under CONFIG_CGROUP_WRITEBACK. so use memcg->cgwb_list directly. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wang Long <[email protected]> Reviewed-by: Jan Kara <[email protected]> Acked-by: Tejun Heo <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07tmpfs: allow decoding a file handle of an unlinked fileAmir Goldstein1-1/+10
tmpfs uses the helper d_find_alias() to find a dentry from a decoded inode, but d_find_alias() skips unhashed dentries, so unlinked files cannot be decoded from a file handle. This can be reproduced using xfstests test program open_by_handle: $ open_by handle -c /tmp/testdir $ open_by_handle -dk /tmp/testdir open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file! To fix this, if d_find_alias() can't find a hashed alias, call d_find_any_alias() to return an unhashed one. Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com Signed-off-by: Amir Goldstein <[email protected]> Reviewed-by: NeilBrown <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jeff Layton <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Al Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/ksm: move [set_]page_stable_node from ksm.h to ksm.cMike Rapoport1-0/+11
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c and there is no point to keep them in the include/linux/ksm.h [[email protected]: fix SYSFS=n build] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07lockdep: fix fs_reclaim annotationOmar Sandoval2-12/+28
While revisiting my Btrfs swapfile series [1], I introduced a situation in which reclaim would lock i_rwsem, and even though the swapon() path clearly made GFP_KERNEL allocations while holding i_rwsem, I got no complaints from lockdep. It turns out that the rework of the fs_reclaim annotation was broken: if the current task has PF_MEMALLOC set, we don't acquire the dummy fs_reclaim lock, but when reclaiming we always check this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can fix this by moving the fs_reclaim_{acquire,release}() outside of the memalloc_noreclaim_{save,restore}(), althought kswapd is slightly different. After applying this, I got the expected lockdep splats. 1: https://lwn.net/Articles/625412/ Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation") Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: shmem: make stat.st_blksize return huge page size if THP is onYang Shi1-0/+14
Since tmpfs THP was supported in 4.8, hugetlbfs is not the only filesystem with huge page support anymore. tmpfs can use huge page via THP when mounting by "huge=" mount option. When applications use huge page on hugetlbfs, it just need check the filesystem magic number, but it is not enough for tmpfs. Make stat.st_blksize return huge page size if it is mounted by appropriate "huge=" option to give applications a hint to optimize the behavior with THP. Some applications may not do wisely with THP. For example, QEMU may mmap file on non huge page aligned hint address with MAP_FIXED, which results in no pages are PMD mapped even though THP is used. Some applications may mmap file with non huge page aligned offset. Both behaviors make THP pointless. statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it also may fallback to 4KB page silently if there is not enough huge page. Furthermore, different f_bsize makes max_blocks and free_blocks calculation harder but without too much benefit. Returning huge page size via stat.st_blksize sounds good enough. Since PUD size huge page for THP has not been supported, now it just returns HPAGE_PMD_SIZE. Hugh said: : Sorry, I have no enthusiasm for this patch; but do I feel strongly : enough to override you and everyone else to NAK it? No, I don't feel : that strongly, maybe st_blksize isn't worth arguing over. : : We did look at struct stat when designing huge tmpfs, to see if there : were any fields that should be adjusted for it; but concluded none. : Yes, it would sometimes be nice to have a quickly accessible indicator : for when tmpfs has been mounted huge (scanning /proc/mounts for options : can be tiresome, agreed); but since tmpfs tries to supply huge (or not) : pages transparently, no difference seemed right. : : So, because st_blksize is a not very useful field of struct stat, with : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge" : options (force or always, okay; within_size or advise, not so much). : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize : for file system I/O" than PAGE_SIZE was. : : Which we can expect to speed up some applications and disadvantage : others, depending on how they interpret st_blksize: just like if we : changed it in the same way on non-huge tmpfs. (Did I actually try : changing st_blksize early on, and find it broke something? If so, I've : now forgotten what, and a search through commit messages didn't find : it; but I guess we'll find out soon enough.) : : If there were an mstat() syscall, returning a field "preferred : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in : there; but in stat()'s st_blksize? And what happens when (in future) : mm maps this or that hard-disk filesystem's blocks with a pmd mapping - : should that filesystem then advertise a bigger st_blksize, despite the : same disk layout as before? What happens with DAX? : : And this change is not going to help the QEMU suboptimality that : brought you here (or does QEMU align mmaps according to st_blksize?). : QEMU ought to work well with kernels without this change, and kernels : with this change; and I hope it can easily deal with both by avoiding : that use of MAP_FIXED which prevented the kernel's intended alignment. [[email protected]: remove unneeded `else'] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Shi <[email protected]> Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Alexander Viro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: vmalloc: pass proper vm_start into debugobjectsChintan Pandya1-4/+5
Client can call vunmap with some intermediate 'addr' which may not be the start of the VM area. Entire unmap code works with vm->vm_start which is proper but debug object API is called with 'addr'. This could be a problem within debug objects. Pass proper start address into debug object API. [[email protected]: fix warning] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yisheng Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: vmalloc: avoid racy handling of debugobjects in vunmapChintan Pandya1-1/+2
Currently, __vunmap flow is, 1) Release the VM area 2) Free the debug objects corresponding to that vm area. This leave some race window open. 1) Release the VM area 1.5) Some other client gets the same vm area 1.6) This client allocates new debug objects on the same vm area 2) Free the debug objects corresponding to this vm area. Here, we actually free 'other' client's debug objects. Fix this by freeing the debug objects first and then releasing the VM area. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wei Yang <[email protected]> Cc: Yisheng Xie <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: vmalloc: clean up vunmap to avoid pgtable ops twiceChintan Pandya1-22/+7
vunmap does page table clear operations twice in the case when DEBUG_PAGEALLOC_ENABLE_DEFAULT is enabled. So, clean up the code as that is unintended. As a perf gain, we save few us. Below ftrace data was obtained while doing 1 MB of vmalloc/vfree on ARM64 based SoC *without* this patch applied. After this patch, we can save ~3 us (on 1 extra vunmap_page_range). CPU DURATION FUNCTION CALLS | | | | | | | 6) | __vunmap() { 6) | vmap_debug_free_range() { 6) 3.281 us | vunmap_page_range(); 6) + 45.468 us | } 6) 2.760 us | vunmap_page_range(); 6) ! 505.105 us | } [[email protected]: v3] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Chintan Pandya <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Yisheng Xie <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Wei Yang <[email protected]> Cc: Byungchul Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/sparse.c: pass the __highest_present_section_nr + 1 to alloc_func()Wei Yang1-1/+1
In commit c4e1be9ec113 ("mm, sparsemem: break out of loops early") __highest_present_section_nr is introduced to reduce the loop counts for present section. This is also helpful for usemap and memmap allocation. This patch uses __highest_present_section_nr + 1 to optimize the loop. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: David Rientjes <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/sparse.c: check __highest_present_section_nr only for a present sectionWei Yang1-3/+1
When searching a present section, there are two boundaries: * __highest_present_section_nr * NR_MEM_SECTIONS And it is known, __highest_present_section_nr is a more strict boundary than NR_MEM_SECTIONS. This means it would be necessary to check __highest_present_section_nr only. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm, gup: prevent pmd checking race in follow_pmd_mask()Huang Ying1-11/+27
mmap_sem will be read locked when calling follow_pmd_mask(). But this cannot prevent PMD from being changed for all cases when PTL is unlocked, for example, from pmd_trans_huge() to pmd_none() via MADV_DONTNEED. So it is possible for the pmd_present() check in follow_pmd_mask() to encounter an invalid PMD. This may cause an incorrect VM_BUG_ON() or an infinite loop. Fix this by reading the PMD entry into a local variable with READ_ONCE() and checking the local variable and pmd_none() in the retry loop. As Kirill pointed out, with PTL unlocked, the *pmd may be changed under us, so reading it directly again and again may incur weird bugs. So although using *pmd directly other than for pmd_present() checking may be safe, it is still better to replace them to read *pmd once and check the local variable multiple times. When PTL unlocked, replace all *pmd with local variable was suggested by Kirill. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: "Huang, Ying" <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Al Viro <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Dan Williams <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: treat memory.low value inclusiveRoman Gushchin1-3/+3
If memcg's usage is equal to the memory.low value, avoid reclaiming from this cgroup while there is a surplus of reclaimable memory. This sounds more logical and also matches memory.high and memory.max behavior: both are inclusive. Empty cgroups are not considered protected, so MEMCG_LOW events are not emitted for empty cgroups, if there is no more reclaimable memory in the system. Link: http://lkml.kernel.org/r/20180406122132.GA7185@castle Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: memory.low hierarchical behaviorRoman Gushchin2-29/+126
This patch aims to address an issue in current memory.low semantics, which makes it hard to use it in a hierarchy, where some leaf memory cgroups are more valuable than others. For example, there are memcgs A, A/B, A/C, A/D and A/E: A A/memory.low = 2G, A/memory.current = 6G //\\ BC DE B/memory.low = 3G B/memory.current = 2G C/memory.low = 1G C/memory.current = 2G D/memory.low = 0 D/memory.current = 2G E/memory.low = 10G E/memory.current = 0 If we apply memory pressure, B, C and D are reclaimed at the same pace while A's usage exceeds 2G. This is obviously wrong, as B's usage is fully below B's memory.low, and C has 1G of protection as well. Also, A is pushed to the size, which is less than A's 2G memory.low, which is also wrong. A simple bash script (provided below) can be used to reproduce the problem. Current results are: A: 1430097920 A/B: 711929856 A/C: 717426688 A/D: 741376 A/E: 0 To address the issue a concept of effective memory.low is introduced. Effective memory.low is always equal or less than original memory.low. In a case, when there is no memory.low overcommittment (and also for top-level cgroups), these two values are equal. Otherwise it's a part of parent's effective memory.low, calculated as a cgroup's memory.low usage divided by sum of sibling's memory.low usages (under memory.low usage I mean the size of actually protected memory: memory.current if memory.current < memory.low, 0 otherwise). It's necessary to track the actual usage, because otherwise an empty cgroup with memory.low set (A/E in my example) will affect actual memory distribution, which makes no sense. To avoid traversing the cgroup tree twice, page_counters code is reused. Calculating effective memory.low can be done in the reclaim path, as we conveniently traversing the cgroup tree from top to bottom and check memory.low on each level. So, it's a perfect place to calculate effective memory low and save it to use it for children cgroups. This also eliminates a need to traverse the cgroup tree from bottom to top each time to check if parent's guarantee is not exceeded. Setting/resetting effective memory.low is intentionally racy, but it's fine and shouldn't lead to any significant differences in actual memory distribution. With this patch applied results are matching the expectations: A: 2147930112 A/B: 1428721664 A/C: 718393344 A/D: 815104 A/E: 0 Test script: #!/bin/bash CGPATH="/sys/fs/cgroup" truncate /file1 --size 2G truncate /file2 --size 2G truncate /file3 --size 2G truncate /file4 --size 50G mkdir "${CGPATH}/A" echo "+memory" > "${CGPATH}/A/cgroup.subtree_control" mkdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E" echo 2G > "${CGPATH}/A/memory.low" echo 3G > "${CGPATH}/A/B/memory.low" echo 1G > "${CGPATH}/A/C/memory.low" echo 0 > "${CGPATH}/A/D/memory.low" echo 10G > "${CGPATH}/A/E/memory.low" echo $$ > "${CGPATH}/A/B/cgroup.procs" && vmtouch -qt /file1 echo $$ > "${CGPATH}/A/C/cgroup.procs" && vmtouch -qt /file2 echo $$ > "${CGPATH}/A/D/cgroup.procs" && vmtouch -qt /file3 echo $$ > "${CGPATH}/cgroup.procs" && vmtouch -qt /file4 echo "A: " `cat "${CGPATH}/A/memory.current"` echo "A/B: " `cat "${CGPATH}/A/B/memory.current"` echo "A/C: " `cat "${CGPATH}/A/C/memory.current"` echo "A/D: " `cat "${CGPATH}/A/D/memory.current"` echo "A/E: " `cat "${CGPATH}/A/E/memory.current"` rmdir "${CGPATH}/A/B" "${CGPATH}/A/C" "${CGPATH}/A/D" "${CGPATH}/A/E" rmdir "${CGPATH}/A" rm /file1 /file2 /file3 /file4 Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: rename page_counter's count/limit into usage/maxRoman Gushchin4-74/+74
This patch renames struct page_counter fields: count -> usage limit -> max and the corresponding functions: page_counter_limit() -> page_counter_set_max() mem_cgroup_get_limit() -> mem_cgroup_get_max() mem_cgroup_resize_limit() -> mem_cgroup_resize_max() memcg_update_kmem_limit() -> memcg_update_kmem_max() memcg_update_tcp_limit() -> memcg_update_tcp_max() The idea behind this renaming is to have the direct matching between memory cgroup knobs (low, high, max) and page_counters API. This is pure renaming, this patch doesn't bring any functional change. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Vladimir Davydov <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/memblock: introduce PHYS_ADDR_MAXStefan Agner1-11/+11
So far code was using ULLONG_MAX and type casting to obtain a phys_addr_t with all bits set. The typecast is necessary to silence compiler warnings on 32-bit platforms. Use the simpler but still type safe approach "~(phys_addr_t)0" to create a preprocessor define for all bits set. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Stefan Agner <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Pavel Tatashin <[email protected]> Cc: Ard Biesheuvel <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: remove odd HAVE_PTE_SPECIALLaurent Dufour1-9/+6
Remove the additional define HAVE_PTE_SPECIAL and rely directly on CONFIG_ARCH_HAS_PTE_SPECIAL. There is no functional change introduced by this patch Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Laurent Dufour <[email protected]> Acked-by: David Rientjes <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Christophe LEROY <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: introduce ARCH_HAS_PTE_SPECIALLaurent Dufour3-3/+6
Currently the PTE special supports is turned on in per architecture header files. Most of the time, it is defined in arch/*/include/asm/pgtable.h depending or not on some other per architecture static definition. This patch introduce a new configuration variable to manage this directly in the Kconfig files. It would later replace __HAVE_ARCH_PTE_SPECIAL. Here notes for some architecture where the definition of __HAVE_ARCH_PTE_SPECIAL is not obvious: arm __HAVE_ARCH_PTE_SPECIAL which is currently defined in arch/arm/include/asm/pgtable-3level.h which is included by arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set. So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE. powerpc __HAVE_ARCH_PTE_SPECIAL is defined in 2 files: - arch/powerpc/include/asm/book3s/64/pgtable.h - arch/powerpc/include/asm/pte-common.h The first one is included if (PPC_BOOK3S & PPC64) while the second is included in all the other cases. So select ARCH_HAS_PTE_SPECIAL all the time. sparc: __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) && defined(__arch64__) which are defined through the compiler in sparc/Makefile if !SPARC32 which I assume to be if SPARC64. So select ARCH_HAS_PTE_SPECIAL if SPARC64 There is no functional change introduced by this patch. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Laurent Dufour <[email protected]> Suggested-by: Jerome Glisse <[email protected]> Reviewed-by: Jerome Glisse <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Aneesh Kumar K . V" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: David S. Miller <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Albert Ou <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: David Rientjes <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Christophe LEROY <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/page_alloc: remove realsize in free_area_init_core()Wei Yang1-4/+4
Highmem's realsize always equals to freesize, so it is not necessary to spare a variable to record this. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm: restructure memfd codeMike Kravetz3-324/+346
With the addition of memfd hugetlbfs support, we now have the situation where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only supported on tmpfs, so it made sense that the code resided in shmem.c. In the current code, memfd is only functional if TMPFS is defined. If HUGETLFS is defined and TMPFS is not defined, then memfd functionality will not be available for hugetlbfs. This does not cause BUGs, just a lack of potentially desired functionality. Code is restructured in the following way: - include/linux/memfd.h is a new file containing memfd specific definitions previously contained in shmem_fs.h. - mm/memfd.c is a new file containing memfd specific code previously contained in shmem.c. - memfd specific code is removed from shmem_fs.h and shmem.c. - A new config option MEMFD_CREATE is added that is defined if TMPFS or HUGETLBFS is defined. No functional changes are made to the code: restructuring only. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Herrmann <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marc-Andr Lureau <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2018-06-07mm/shmem: update file sealing comments and file checkingMike Kravetz1-24/+26
In preparation for memfd code restructure, update comments, definitions and function names dealing with file sealing to indicate that tmpfs and hugetlbfs are the supported filesystems. Also, change file pointer checks in memfd_file_seals_ptr to use defined interfaces instead of directly referencing file_operation structs. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Kravetz <[email protected]> Reviewed-by: Khalid Aziz <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: David Herrmann <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Marc-Andr Lureau <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>