|
Drop the doubled word "for" in a comment.
Fix spello of "incremented".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Chris Down <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Drop the doubled word "in" in a comment.
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Change the doubled word "is" in a comment to "it is".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Drop the doubled words "to" and "the".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Drop the doubled words "used" and "by".
Drop the repeated acronym "TLB" and make several other fixes around it.
(capital letters, spellos)
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: SeongJae Park <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When onlining the first memory block in a zone, the pcp lists are not
updated, so the pcp struct keeps the default settings of ->high = 0,
->batch = 1. This means that until the second memory block in a zone
(if there is one) is onlined, the pcp lists of this zone will not
contain any pages, because pcp's ->count is always greater than ->high
and free_pcppages_bulk() is therefore called to free batch-size (=1)
pages every time the system wants to add a page to the pcp list through
free_unref_page().
In short, the system gets none of the benefits offered by the pcp lists
when there is a single onlineable memory block in a zone. Correct this
by always updating the pcp lists when a memory block is onlined.
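As a rough illustration of the effect (a toy userspace model; struct
toy_pcp and toy_free_unref_page are made-up names for this sketch, not
kernel code):

#include <stdio.h>

struct toy_pcp {
        int count;      /* pages currently on the per-cpu list */
        int high;       /* flush threshold                     */
        int batch;      /* pages flushed at a time             */
};

static void toy_free_unref_page(struct toy_pcp *pcp)
{
        pcp->count++;                           /* page goes on the list     */
        if (pcp->count >= pcp->high) {          /* 1 >= 0 is always true...  */
                pcp->count -= pcp->batch;       /* ...so it is flushed again */
                printf("flushed %d page(s), %d left on the pcp list\n",
                       pcp->batch, pcp->count);
        }
}

int main(void)
{
        struct toy_pcp pcp = { .count = 0, .high = 0, .batch = 1 };

        for (int i = 0; i < 3; i++)
                toy_free_unref_page(&pcp);      /* the list never holds a page */
        return 0;
}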
Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
Signed-off-by: Charan Teja Reddy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Vinayak Menon <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When check_memblock_offlined_cb() returns a failing rc (e.g. because the
memblock is still online at that time), mem_hotplug_begin/done is left
unpaired in that case. Hence the following warning:
Call Trace:
percpu_up_write+0x33/0x40
try_remove_memory+0x66/0x120
? _cond_resched+0x19/0x30
remove_memory+0x2b/0x40
dev_dax_kmem_remove+0x36/0x72 [kmem]
device_release_driver_internal+0xf0/0x1c0
device_release_driver+0x12/0x20
bus_remove_device+0xe1/0x150
device_del+0x17b/0x3e0
unregister_dev_dax+0x29/0x60
devm_action_release+0x15/0x20
release_nodes+0x19a/0x1e0
devres_release_all+0x3f/0x50
device_release_driver_internal+0x100/0x1c0
driver_detach+0x4c/0x8f
bus_remove_driver+0x5c/0xd0
driver_unregister+0x31/0x50
dax_pmem_exit+0x10/0xfe0 [dax_pmem]
Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Signed-off-by: Jia He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Dan Williams <[email protected]>
Cc: <[email protected]> [5.6+]
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Chuhong Yuan <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kaly Xin <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Introduce a general dummy helper. memory_add_physaddr_to_nid() is a
fallback option to get the nid in case NUMA_NO_NODE is detected.
After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their arch-specific versions.
This is preparation for setting a fallback value for dev_dax->target_node.
Signed-off-by: Jia He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chuhong Yuan <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Logan Gunthorpe <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Kaly Xin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes. The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM. The
search for each memory block is noticeable even with commit 4fb6eabf1037
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").
Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end. That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.
Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines. Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.
[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Daniel Jordan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Sistare <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix W=1 compile warnings (invalid kerneldoc):
mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_begin'
mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_register'
mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In that
function, current->flags is accessed 3 times. That may lead to duplicated
access of the same memory location.
This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses. With most of
the debug config options on, however, that may not be the case. For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes. With this patch
applied, the object size is reduced to 202 bytes. This is a saving of 107
bytes and will probably be slightly faster too.
Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.
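A minimal sketch of the pattern (illustrative userspace C, not the
kernel source; READ_ONCE is approximated with a volatile read and the
PF_*/GFP bit values are stand-ins):

#define READ_ONCE(x)  (*(volatile __typeof__(x) *)&(x))

unsigned int gfp_from_task_flags(unsigned int *task_flags, unsigned int gfp)
{
        unsigned int flags = READ_ONCE(*task_flags);   /* single access */

        if (flags & 0x1)        /* stand-in for PF_MEMALLOC_NOIO */
                gfp &= ~0x10;   /* stand-in for __GFP_IO         */
        if (flags & 0x2)        /* stand-in for PF_MEMALLOC_NOFS */
                gfp &= ~0x20;   /* stand-in for __GFP_FS         */
        return gfp;
}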
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The routine cma_init_reserved_areas is designed to activate all
reserved cma areas. It quits when it first encounters an error.
This can leave some areas in a state where they are reserved but
not activated. There is no feedback to code which performed the
reservation. Attempting to allocate memory from areas in such a
state will result in a BUG.
Modify cma_init_reserved_areas to always attempt to activate all
areas. The called routine, cma_activate_area is responsible for
leaving the area in a valid state. No one is making active use
of returned error codes, so change the routine to void.
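A toy sketch of the new flow (illustrative types and names, not the
kernel's struct cma or mm/cma.c):

struct toy_cma {
        unsigned long count;    /* 0 means the area is unusable      */
        unsigned long *bitmap;  /* NULL until activation succeeds    */
};

/* try to activate one area; on failure leave it in a safe state */
static void toy_activate_area(struct toy_cma *cma, int activation_ok)
{
        if (!activation_ok) {
                /* clearly unusable, instead of returning an error
                 * code that no caller acts on                      */
                cma->count = 0;
                cma->bitmap = 0;
        }
        /* (real activation work elided) */
}

static void toy_init_reserved_areas(struct toy_cma *areas,
                                    const int *activation_ok, int n)
{
        for (int i = 0; i < n; i++)             /* visit every area... */
                toy_activate_area(&areas[i], activation_ok[i]);
        /* ...and return nothing: the error codes were never used */
}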
How to reproduce: This example uses kernelcore, hugetlb and cma
as an easy way to reproduce. However, this is a more general cma
issue.
Two node x86 VM 16GB total, 8GB per node
Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
Related boot time messages,
hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
cma: Reserved 4096 MiB at 0x0000000100000000
hugetlb_cma: reserved 4096 MiB on node 0
cma: Reserved 4096 MiB at 0x0000000300000000
hugetlb_cma: reserved 4096 MiB on node 1
cma: CMA area hugetlb could not be activated
# echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
...
Call Trace:
bitmap_find_next_zero_area_off+0x51/0x90
cma_alloc+0x1a5/0x310
alloc_fresh_huge_page+0x78/0x1a0
alloc_pool_huge_page+0x6f/0xf0
set_max_huge_pages+0x10c/0x250
nr_hugepages_store_common+0x92/0x120
? __kmalloc+0x171/0x270
kernfs_fop_write+0xc1/0x1a0
vfs_write+0xc7/0x1f0
ksys_write+0x5f/0xe0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Acked-by: Barry Song <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Kyungmin Park <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Once we enable CMA_DEBUGFS, we get errors such as: directory
'cma-hugetlb' with parent 'cma' already present.
We should use different names for different CMA areas.
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: fix the names of general cma and hugetlb cma", v2.
The current CMA code only works when users pass a const string as the
name parameter, so we need to fix the way names are handled in CMA. On
the other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each
hugetlb CMA area should get a different name.
This patch (of 2):
If users pass a name stored on the stack, the current code ends up
keeping a pointer to stale stack memory. If users don't give a name
(NULL), kasprintf() will always return NULL because we are at an early
boot stage, which means cma_init_reserved_mem() will return -ENOMEM
whenever the name parameter is NULL.
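The gist of the fix, as a sketch (CMA_MAX_NAME and the field layout here
are assumptions based on the changelog, not a quote of the patch):

#include <stdio.h>

#define CMA_MAX_NAME 64

struct toy_cma {
        char name[CMA_MAX_NAME];   /* an owned copy, not a caller's pointer */
};

static void toy_set_name(struct toy_cma *cma, const char *name, int idx)
{
        if (name)                  /* copy the caller's string...             */
                snprintf(cma->name, sizeof(cma->name), "%s", name);
        else                       /* ...or build a default without kasprintf */
                snprintf(cma->name, sizeof(cma->name), "cma%d", idx);
}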
[[email protected]: return cma->name directly in cma_get_name]
Link: https://github.com/ClangBuiltLinux/linux/issues/1063
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Barry Song <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In some cases the cma area could not be activated, but cma_alloc() may
still be called on it; the kernel will then crash with a NULL pointer
dereference.
Add a bitmap validity check in cma_alloc to avoid this issue.
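A toy sketch of the added guard (field names follow the changelog; this
is not the kernel's cma_alloc()):

#include <stddef.h>

struct toy_cma { unsigned long count; unsigned long *bitmap; };

static void *toy_cma_alloc(struct toy_cma *cma, unsigned long nr_pages)
{
        (void)nr_pages;
        /* refuse to touch the bitmap of an area that was never activated */
        if (!cma || !cma->count || !cma->bitmap)
                return NULL;

        /* ... the normal bitmap search and allocation would follow ... */
        return NULL;
}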
Signed-off-by: Jianqun Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add the following new vmstat events, which will help in validating THP
migration without splits. Statistics reported through these new VM
events will help in performance debugging.
1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAILURE
3. THP_MIGRATION_SPLIT
In addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
also update the current trace event 'mm_migrate_pages' to accommodate
the now-available THP statistics.
[[email protected]: s/hpage_nr_pages/thp_nr_pages/]
[[email protected]: v2]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: s/thp_nr_pages/hpage_nr_pages/]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Zi Yan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Daniel Jordan <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
anon-THP") rewrote the CoW page fault path for THP, debug_cow is not
used anymore. So, just remove it.
Signed-off-by: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that
!vma_is_anonymous() ranges won't be migrated.
Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: "Bharata B Rao" <[email protected]>
Cc: Shuah Khan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm/migrate: optimize migrate_vma_setup() for holes".
A simple optimization for migrate_vma_*() when the source vma is not an
anonymous vma and a new test case to exercise it.
This patch (of 2):
When migrating system memory to device private memory, if the source
address range is a valid VMA range and there is no memory or a zero page,
the source PFN array is marked as valid but with no PFN.
This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
page if the VMA is an anonymous range.
There is no point in telling the device driver to allocate device private
memory and then not migrate the page. Instead, mark the source PFN array
entries as not migrating to avoid this overhead.
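The idea in miniature (TOY_MIGRATE_PFN_MIGRATE stands in for the real
source-PFN flag; this is a sketch, not mm/migrate.c):

#define TOY_MIGRATE_PFN_MIGRATE  0x1UL

static void toy_collect_hole(unsigned long *src, int npages, int vma_is_anon)
{
        /* only anonymous ranges are offered for migration; for anything
         * else the driver should not bother allocating device memory   */
        unsigned long val = vma_is_anon ? TOY_MIGRATE_PFN_MIGRATE : 0UL;

        for (int i = 0; i < npages; i++)
                src[i] = val;
}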
[[email protected]: v2]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ralph Campbell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: "Bharata B Rao" <[email protected]>
Cc: Shuah Khan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.
Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.
Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.
To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.
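Such assertion helpers could look roughly like this (a sketch; the names
and the use of lockdep here are assumptions based on the changelog, and
the snippet relies on kernel infrastructure rather than being
standalone):

static inline void i_mmap_assert_locked(struct address_space *mapping)
{
        lockdep_assert_held(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_assert_write_locked(struct address_space *mapping)
{
        lockdep_assert_held_write(&mapping->i_mmap_rwsem);
}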
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: "Kirill A.Shutemov" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Prakash Sangappa <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1]. Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking. There are no
known use cases for hugetlbfs stacking. Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Reported-by: [email protected]
Suggested-by: Amir Goldstein <[email protected]>
Signed-off-by: Mike Kravetz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Miklos Szeredi <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Colin Walters <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
When the OOM killer finds a victim and tries to kill it, if the victim is
already exiting, the task mm will be NULL and no process will be killed.
But dump_header() has already been executed, so it is strange to dump so
much information without killing a process. We'd better show some
helpful information to indicate why this happens.
Suggested-by: David Rientjes <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Qian Cai <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The exported value includes oom_score_adj, so the range is not [0, 1000]
as described in the previous section but rather [0, 2000]. Mention that
fact explicitly.
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yafang Shao <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There are at least two stale notes in the oom section. The 3% discount
for root processes is gone since d46078b28889 ("mm, oom: remove 3% bonus
for CAP_SYS_ADMIN processes").
Likewise, children of the selected oom victim are no longer sacrificed
since bbbe48029720 ("mm, oom: remove 'prefer children over parent'
heuristic").
Drop both of them.
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Yafang Shao <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Recently we found an issue in our production environment: when memcg oom
is triggered, the oom killer doesn't choose the process with the largest
resident memory but chooses the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should choose the process with the largest resident memory.
Below is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
We can see that the first scanned process, 5740 (pause), was killed, even
though its rss is only one page. That is because, when we calculate the
oom badness in oom_badness(), we always ignore negative points and
convert all of them to 1. Now, as the oom_score_adj of all the processes
in this targeted memcg has the same value -998, the points of these
processes are all negative. As a result, the first scanned process is
killed.
The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
Guaranteed pod, which has a higher priority and is protected from being
killed by a system oom.
To fix this issue, we should make the calculation of the oom points more
accurate. We can achieve that by converting chosen_points from 'unsigned
long' to 'long'.
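A toy illustration of why the signed type matters (userspace C, not the
kernel's oom_badness()):

#include <limits.h>

static long toy_badness(long rss_pages, long oom_score_adj, long totalpages)
{
        /* an adj of -998 can legitimately push the score negative;
         * keep it negative instead of clamping it to 1              */
        return rss_pages + oom_score_adj * totalpages / 1000;
}

static int toy_pick_victim(const long *points, int n)
{
        long best = LONG_MIN;
        int victim = -1;

        for (int i = 0; i < n; i++)
                if (points[i] > best) {     /* works for negative scores too */
                        best = points[i];
                        victim = i;
                }
        return victim;                      /* largest rss wins, as expected */
}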
[[email protected]: reported an issue in the previous version]
[[email protected]: fixed the issue reported by Cai]
[[email protected]: add the comment in proc_oom_score()]
[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Naresh Kamboju <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Qian Cai <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Change "interlave" to "interleave".
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The previous implementation calls untagged_addr() before the error
check; when the check fails and EINVAL is returned, the untagged_addr()
call is just wasted work. Check the parameters first instead.
Signed-off-by: Wenchao Hao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix W=1 compile warnings (invalid kerneldoc):
mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'
Signed-off-by: Krzysztof Kozlowski <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
There is no compact_defer_limit; it is compact_defer_shift that is in
use. Also add an explanation of compact_order_failed.
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Proactive compaction uses a per-node/zone "fragmentation score" which is
always in the range [0, 100], so use an unsigned type for these scores
as well as for the related constants.
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix compile error when COMPACTION_HPAGE_ORDER is assigned to
HUGETLB_PAGE_ORDER. The correct way to check if this constant is defined
is to check for CONFIG_HUGETLBFS.
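The shape of the guard (a sketch only; the fallback branch is
illustrative and may not match the in-tree definition exactly):

#ifdef CONFIG_HUGETLBFS
#define COMPACTION_HPAGE_ORDER  HUGETLB_PAGE_ORDER
#else
#define COMPACTION_HPAGE_ORDER  (PMD_SHIFT - PAGE_SHIFT)  /* assumed fallback */
#endif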
Reported-by: Nathan Chancellor <[email protected]>
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
For some applications, we need to allocate almost all memory as hugepages.
However, on a running system, higher-order allocations can fail if the
memory is fragmented. The Linux kernel currently does on-demand
compaction as we request more hugepages, but this style of compaction
incurs very high latency. Experiments with one-time full memory
compaction (followed by hugepage allocations) show that the kernel is
able to restore a highly fragmented memory state to a fairly compacted
memory state within <1 sec for a 32G system. Such data suggests that a
more proactive compaction can help us allocate a large fraction of
memory as hugepages while keeping allocation latencies low.
For a more proactive compaction, the approach taken here is to define a
new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
external fragmentation which kcompactd tries to maintain.
The tunable takes a value in range [0, 100], with a default of 20.
Note that a previous version of this patch [1] was found to introduce too
many tunables (per-order extfrag{low, high}), but this one reduces them to
just one sysctl. Also, the new tunable is an opaque value instead of
asking for specific bounds of "external fragmentation", which would have
been difficult to estimate. The internal interpretation of this opaque
value allows for future fine-tuning.
Currently, we use a simple translation from this tunable to [low, high]
"fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
The score for a node is defined as weighted mean of per-zone external
fragmentation. A zone's present_pages determines its weight.
To periodically check per-node score, we reuse per-node kcompactd threads,
which are woken up every 500 milliseconds to check the same. If a node's
score exceeds its high threshold (as derived from user-provided
proactiveness value), proactive compaction is started until its score
reaches its low threshold value. By default, proactiveness is set to 20,
which implies threshold values of low=80 and high=90.
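The threshold/score arithmetic described above, as a small sketch (toy
C; the extfrag values and weighting inputs are stand-ins, not kernel
code):

static unsigned int frag_score_low(unsigned int proactiveness)
{
        return 100U - proactiveness;                 /* e.g. 20 -> 80 */
}

static unsigned int frag_score_high(unsigned int proactiveness)
{
        return frag_score_low(proactiveness) + 10U;  /* e.g. 20 -> 90 */
}

/* node score = present_pages-weighted mean of per-zone external fragmentation */
static unsigned int toy_node_score(const unsigned int *zone_extfrag,
                                   const unsigned long *zone_present,
                                   int nr_zones, unsigned long node_present)
{
        unsigned long score = 0;

        for (int i = 0; i < nr_zones; i++)
                score += (unsigned long)zone_extfrag[i] * zone_present[i] /
                         node_present;
        return (unsigned int)score;                  /* stays within [0, 100] */
}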
This patch is largely based on ideas from Michal Hocko [2]. See also the
LWN article [3].
Performance data
================
System: x86_64, 1T RAM, 80 CPU threads.
Kernel: 5.6.0-rc3 + this patch
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
Before starting the driver, the system was fragmented from a userspace
program that allocates all memory and then for each 2M aligned section,
frees 3/4 of base pages using munmap. The workload is mainly anonymous
userspace pages, which are easy to move around. I intentionally avoided
unmovable pages in this test to see how much latency we incur when
hugepage allocations hit direct compaction.
1. Kernel hugepage allocation latencies
With the system in such a fragmented state, a kernel driver then allocates
as many hugepages as possible and measures allocation latency:
(all latency values are in microseconds)
- With vanilla 5.6.0-rc3
percentile latency
–––––––––– –––––––
5 7894
10 9496
25 12561
30 15295
40 18244
50 21229
60 27556
75 30147
80 31047
90 32859
95 33799
Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
- With 5.6.0-rc3 + this patch, with proactiveness=20
sysctl -w vm.compaction_proactiveness=20
percentile latency
–––––––––– –––––––
5 2
10 2
25 3
30 3
40 3
50 4
60 4
75 4
80 4
90 5
95 429
Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
total free => 98% of free memory could be allocated as hugepages)
2. JAVA heap allocation
In this test, we first fragment memory using the same method as for (1).
Then, we start a Java process with a heap size set to 700G and request the
heap to be allocated with THP hugepages. We also set THP to madvise to
allow hugepage backing of this heap.
/usr/bin/time
java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
The above command allocates 700G of Java heap using hugepages.
- With vanilla 5.6.0-rc3
17.39user 1666.48system 27:37.89elapsed
- With 5.6.0-rc3 + this patch, with proactiveness=20
8.35user 194.58system 3:19.62elapsed
Elapsed time remains around 3:15, as proactiveness is further increased.
Note that proactive compaction happens throughout the runtime of these
workloads. The situation of one-time compaction, sufficient to supply
hugepages for following allocation stream, can probably happen for more
extreme proactiveness values, like 80 or 90.
In the above Java workload, proactiveness is set to 20. The test starts
with a node's score of 80 or higher, depending on the delay between the
fragmentation step and starting the benchmark, which gives more-or-less
time for the initial round of compaction. As the benchmark consumes
hugepages, node's score quickly rises above the high threshold (90) and
proactive compaction starts again, which brings down the score to the low
threshold level (80). Repeat.
bpftrace also confirms proactive compaction running 20+ times during the
runtime of this Java benchmark. A kcompactd thread consumes 100% of one
of the CPUs while it tries to bring its node's score within the
thresholds.
Backoff behavior
================
Above workloads produce a memory state which is easy to compact. However,
if memory is filled with unmovable pages, proactive compaction should
essentially back off. To test this aspect:
- Created a kernel driver that allocates almost all memory as hugepages
followed by freeing first 3/4 of each hugepage.
- Set proactiveness=40
- Note that proactive_compact_node() is deferred the maximum number of
times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
(=> ~30 seconds between retries).
[1] https://patchwork.kernel.org/patch/11098289/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lwn.net/Articles/817905/
Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Tested-by: Oleksandr Natalenko <[email protected]>
Reviewed-by: Vlastimil Babka <[email protected]>
Reviewed-by: Khalid Aziz <[email protected]>
Reviewed-by: Oleksandr Natalenko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Khalid Aziz <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The keys in smaps output are padded to a fixed width with spaces, all
except THPeligible, which uses tabs (only since commit c06306696f83
("mm: thp: fix false negative of shmem vma's THP eligibility")).
Unify the output formatting to save time debugging some naïve parsers.
(Part of the unification is also aligning FilePmdMapped with others.)
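For example, after the change a naive fixed-width parser sees every key
padded the same way (the values and exact widths here are illustrative
only, not quoted output):

ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
THPeligible:           0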
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Now that workingset detection is implemented for anonymous LRU, we don't
need large inactive list to allow detecting frequently accessed pages
before they are reclaimed, anymore. This effectively reverts the
temporary measure put in by commit "mm/vmscan: make active/inactive ratio
as 1:1 for anon lru".
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch implements workingset detection for anonymous LRU. All the
infrastructure is implemented by the previous patches so this patch just
activates the workingset detection by installing/retrieving the shadow
entry and adding refault calculation.
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Workingset detection for anonymous pages will be implemented in the
following patch, and it requires storing shadow entries in the
swapcache. This patch implements the infrastructure to store the shadow
entry in the swapcache.
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
To prepare the workingset detection for anon LRU, this patch splits
workingset event counters for refault, activate and restore into anon and
file variants, as well as the refaults counter in struct lruvec.
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the current implementation, a newly created or swapped-in anonymous
page starts on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, pages on the active list aren't
protected at all.
Following is an example of this situation. Assume there are 50 hot pages
on the active list. Numbers denote the number of pages on the
active/inactive list (active | inactive).
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
This patch tries to fix this issue. As with the file LRU, newly created
or swapped-in anonymous pages will be inserted on the inactive list.
They are promoted to the active list if enough references happen. This
simple modification changes the above example as follows.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
As you can see, hot pages on the active list are now protected.
Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the
size of the inactive list but less than the size of the total
(active+inactive) list. To solve this potential issue, a following patch
will apply workingset detection similar to the one that's already
applied to the file LRU.
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "workingset protection/detection on the anonymous LRU list", v7.
* PROBLEM
In the current implementation, a newly created or swapped-in anonymous
page starts on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, hot pages on the active list
aren't protected at all.
Following is an example of this situation.
Assume that there are 50 hot pages on the active list and the system can
hold 100 pages in total. Numbers denote the number of pages on the
active/inactive list (active | inactive). (h) stands for hot pages and
(uo) stands for used-once pages.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(uo) | 50(h)
3. workload: another 50 newly created (used-once) pages
50(uo) | 50(uo), swap-out 50(h)
As we can see, hot pages are swapped out, which will cause swap-ins later.
* SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As on the file LRU list, newly created or swapped-in
anonymous pages start on the inactive list. Also, as on the file LRU
list, if enough references happen, the page will be promoted. This
simple modification changes the above example as follows.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (used-once) pages
50(h) | 50(uo)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(uo)
Hot pages remain on the active list. :)
* EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens on current implementation. I also checked that it is fixed by
this patchset.
* SUBJECT
workingset detection
* PROBLEM
Later part of the patchset implements the workingset detection for the
anonymous LRU list. There is a corner case that workingset protection
could cause thrashing. If we can avoid thrashing by workingset detection,
we can get the better performance.
Following is an example of thrashing due to the workingset protection.
1. 50 hot pages on active list
50(h) | 0
2. workload: 50 newly created (will be hot) pages
50(h) | 50(wh)
3. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
4. workload: 50 (will be hot) pages
50(h) | 50(wh), swap-in 50(wh)
5. workload: another 50 newly created (used-once) pages
50(h) | 50(uo), swap-out 50(wh)
6. repeat 4, 5
Without workingset detection, the pages in this kind of workload cannot
be promoted and thrashing happens forever.
* SOLUTION
Therefore, this patchset implements workingset detection. All the
infrastructure for workingset detection is already implemented, so there
is not much work to do. First, extend the workingset detection code to
deal with the anonymous LRU list. Then, make the swap cache handle the
exceptional value for the shadow entry. Lastly, install/retrieve the
shadow value into/from the swap cache and check the refault distance.
* EXPERIMENT
I made a test program that imitates the above scenario and confirmed
that the problem exists. Then, I checked that this patchset fixes it.
My test setup is a virtual machine with 8 cpus and 6100MB memory.
However, the amount of memory that the test program can use is only
about 280 MB. This is because the system uses a large ram-backed swap
and a large ramdisk to capture the trace.
The test scenario is as follows.
1. allocate cold memory (512MB)
2. allocate hot-1 memory (96MB)
3. activate hot-1 memory (96MB)
4. allocate another hot-2 memory (96MB)
5. access cold memory (128MB)
6. access hot-2 memory (96MB)
7. repeat 5, 6
Since hot-1 memory (96MB) is on the active list, the inactive list can
contain roughly 190MB of pages. hot-2 memory's re-access interval
(96+128 MB) is more than 190MB, so it cannot be promoted without
workingset detection, and swap-in/out happens repeatedly. With this
patchset, workingset detection works and promotion happens. Therefore,
swap-in/out occurs less.
Here is the result. (average of 5 runs)
type swap-in swap-out
base 863240 989945
patch 681565 809273
As we can see, the patched kernel does less swap-in/out.
* OVERALL TEST (ebizzy using modified random function)
ebizzy is a test program in which the main thread allocates lots of
memory and child threads access it randomly for a given time. Swap-in
will happen if the allocated memory is larger than the system memory.
A random function that follows the zipf distribution is used to create
hot/cold memory. The hot/cold ratio is controlled by a parameter. If the
parameter is high, hot memory is accessed much more than cold memory. If
the parameter is low, the number of accesses to each memory area is
similar. I used various parameters in order to show the effect of the
patchset on workloads with various hot/cold ratios.
My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
ram swap.
The result format is as follows.
param: 1-1024-0.1
- 1 (number of thread)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha,
0.1 behaves roughly like uniform random access,
1.3 means a small portion of memory is hot and the rest is cold)
pswpin: smaller is better
std: standard deviation
improvement: negative is better
* single thread
param pswpin std improvement
base 1-1024.0-0.1 14101983.40 79441.19
prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 )
detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 )
base 1-1024.0-0.7 7998368.80 43469.32
prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 )
detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 )
base 1-1024.0-1.3 1017400.80 38756.30
prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 )
detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 )
base 1-1280.0-0.1 22895541.40 50016.08
prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 )
detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 )
base 1-1280.0-0.7 13717645.60 46250.65
prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 )
detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 )
base 1-1280.0-1.3 1654251.40 4159.68
prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 )
detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 )
base 1-1536.0-0.1 31621775.40 31156.28
prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 )
detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 )
base 1-1536.0-0.7 19620760.60 60937.60
prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 )
detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 )
base 1-1536.0-1.3 2378142.40 20994.43
prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 )
detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 )
base 1-1792.0-0.1 40259714.80 90750.70
prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 )
detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 )
base 1-1792.0-0.7 25704884.40 69429.68
prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 )
detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 )
base 1-1792.0-1.3 3129497.00 32731.86
prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 )
detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 )
base 1-2048.0-0.1 48746924.40 50863.88
prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 )
detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 )
base 1-2048.0-0.7 32046424.40 78624.22
prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 )
detect 1-2048.0-0.7 30250315.80 101282.14 ( -5.60 )
base 1-2048.0-1.3 3916723.60 24048.55
prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 )
detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 )
* multi thread
param pswpin std improvement
base 8-1024.0-0.1 16219822.60 329474.01
prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 )
detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 )
base 8-1024.0-0.7 9174107.80 537619.33
prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 )
detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 )
base 8-1024.0-1.3 1108495.60 83555.98
prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 )
detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 )
base 8-1280.0-0.1 25776114.20 450480.45
prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 )
detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 )
base 8-1280.0-0.7 15218968.00 702007.69
prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 )
detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 )
base 8-1280.0-1.3 1792482.80 30512.90
prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 )
detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 )
base 8-1536.0-0.1 33923761.40 575455.85
prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 )
detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 )
base 8-1536.0-0.7 20628907.80 303851.34
prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 )
detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 )
base 8-1536.0-1.3 2588106.40 130769.20
prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 )
detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 )
base 8-1792.0-0.1 43328279.20 946469.12
prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 )
detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 )
base 8-1792.0-0.7 27155647.40 536253.57
prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 )
detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 )
base 8-1792.0-1.3 3260372.80 137907.92
prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 )
detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 )
base 8-2048.0-0.1 50583989.80 710121.48
prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 )
detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 )
base 8-2048.0-0.7 33765479.60 812659.55
prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 )
detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 )
base 8-2048.0-1.3 3941675.80 28436.45
prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 )
detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 )
As we can see, all the cases show improvement. Especially, the test
cases with zipf distribution 1.3 show bigger improvements. It means that
if there is a hot/cold tendency in anon pages, this patchset works
better.
This patch (of 6):
The current implementation of LRU management for anonymous pages has
some problems. The most important one is that it doesn't protect the
workingset, that is, the pages on the active LRU list. Although this
problem will be fixed in the following patchset, some preparation is
required, and this patch does it.
What the following patches do is implement workingset protection. After
the following patchset, newly created or swapped-in pages will start
their lifetime on the inactive list. If the inactive list is too small,
there is not enough chance for them to be referenced, and the pages
cannot become part of the workingset.
In order to give newly created or swapped-in anonymous pages enough
chance to be referenced again, this patch makes the active/inactive LRU
ratio 1:1.
This is just a temporary measure. Later patch in the series introduces
workingset detection for anonymous LRU that will be used to better decide
if pages should start on the active and inactive list. Afterwards this
patch is effectively reverted.
Signed-off-by: Joonsoo Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements, but we ignore the mempolicy of the
MPOL_BIND case. So someone's hugetlb mmap may succeed, yet the
subsequent memory allocation can fail due to mempolicy restrictions, and
the process receives a SIGBUS signal. This can be reproduced by the
following steps.
1) Compile the test case.
cd tools/testing/selftests/vm/
gcc map_hugetlb.c -o map_hugetlb
2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
system. Each node will pre-allocate one huge page.
echo 2 > /proc/sys/vm/nr_hugepages
3) Run test case(mmap 4MB). We receive the SIGBUS signal.
numactl --membind=0 ./map_hugetlb 4
With this patch applied, the mmap will fail in the step 3) and throw
"mmap: Cannot allocate memory".
[[email protected]: include sched.h for `current']
Reported-by: Jianchao Guo <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Mike Kravetz <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Baoquan He <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Add a simple test to check the percpu memory accounting. The test creates
a cgroup tree with 1000 child cgroups and checks values of memory.current
and memory.stat::percpu.
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Memory cgroups are using large chunks of percpu memory to store vmstat
data. Yet this memory is not accounted at all, so in the case when there
are many (dying) cgroups, it's not exactly clear where all the memory is.
Because the size of the memory cgroup internal structures can
dramatically exceed the size of the object or page which is pinning them
in memory, it's not a good idea to simply ignore it. It actually breaks
the isolation
between cgroups.
Let's account the consumed percpu memory to the parent cgroup.
[[email protected]: add WARN_ON_ONCE()s, per Johannes]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Percpu memory can represent a noticeable chunk of the total memory
consumption, especially on big machines with many CPUs. Let's track
percpu memory usage for each memcg and display it in memory.stat.
A percpu allocation is usually scattered over multiple pages (and nodes),
and can be significantly smaller than a page. So let's add a byte-sized
counter on the memcg level: MEMCG_PERCPU_B. The byte-sized vmstat
infrastructure created for slabs can be reused for the percpu case.
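Schematically, the update path then goes through the existing memcg vmstat
helpers with a byte count rather than a page count; the wrapper below is
hypothetical, only MEMCG_PERCPU_B and mod_memcg_state() come from the kernel:
  /* Kernel-context sketch (hypothetical wrapper): percpu charge and
   * uncharge sites adjust the byte-sized MEMCG_PERCPU_B counter, the
   * same way the byte-sized slab counters are handled. */
  static inline void memcg_adjust_percpu_bytes(struct mem_cgroup *memcg,
                                               int nr_bytes)
  {
          mod_memcg_state(memcg, MEMCG_PERCPU_B, nr_bytes);
  }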
[[email protected]: v3]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
up a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and a large
number of cgroups, they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
To implement the percpu accounting it's possible to take the slab memory
accounting as a model to follow. Let's introduce two types of percpu
chunks: root and memcg. What makes memcg chunks different is an
additional space allocated to store memcg membership information. If
__GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
If it's possible to charge the corresponding size to the target memory
cgroup, allocation is performed, and the memcg ownership data is recorded.
System-wide allocations are performed using root chunks, so there is no
additional memory overhead.
To implement a fast reparenting of percpu memory on memcg removal, we
don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
introduced for slab accounting.
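Roughly, the allocation-side flow described above looks like the sketch
below; pcpu_alloc_from_chunk() and pcpu_record_objcg() are invented
placeholders, while the obj_cgroup calls are the real slab-accounting API:
  /* Kernel-context sketch of the alloc-side decision, not the actual
   * mm/percpu.c code. */
  static void __percpu *pcpu_alloc_sketch(size_t size, gfp_t gfp)
  {
          struct obj_cgroup *objcg = NULL;
          void __percpu *ptr;

          if (gfp & __GFP_ACCOUNT) {
                  objcg = get_obj_cgroup_from_current();
                  if (objcg && obj_cgroup_charge(objcg, gfp, size)) {
                          obj_cgroup_put(objcg);
                          return NULL;            /* memcg limit hit */
                  }
          }

          /* memcg chunks carry an obj_cgroup vector; root chunks don't */
          ptr = pcpu_alloc_from_chunk(size, objcg != NULL);   /* invented */
          if (!ptr && objcg) {
                  obj_cgroup_uncharge(objcg, size);
                  obj_cgroup_put(objcg);
                  return NULL;
          }
          if (objcg)
                  pcpu_record_objcg(ptr, objcg, size);        /* invented */
          return ptr;
  }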
[[email protected]: fix CONFIG_MEMCG_KMEM=n build errors and warning]
[[email protected]: move unreachable code, per Roman]
[[email protected]: mm/percpu: fix 'defined but not used' warning]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Bixuan Cui <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Patch series "mm: memcg accounting of percpu memory", v3.
This patchset adds percpu memory accounting to memory cgroups. It's based
on the rework of the slab controller and reuses concepts and features
introduced for the per-object slab accounting.
Percpu memory is becoming more and more widely used by various subsystems,
and the total amount of memory controlled by the percpu allocator can make
up a good part of the total memory.
As an example, bpf maps can consume a lot of percpu memory, and they are
created by a user. Also, some cgroup internals (e.g. memory controller
statistics) can be quite large. On a machine with many CPUs and a large
number of cgroups, they can consume hundreds of megabytes.
So the lack of memcg accounting is creating a breach in the memory
isolation. Similar to the slab memory, percpu memory should be accounted
by default.
Percpu allocations are by their nature scattered over multiple pages, so
they can't be tracked on a per-page basis; instead, the per-object tracking
introduced by the new slab controller is reused.
The patchset implements charging of percpu allocations, adds memcg-level
statistics, enables accounting for percpu allocations made by memory
cgroup internals and provides some basic tests.
To implement the accounting of percpu memory without a significant memory
and performance overhead the following approach is used: all accounted
allocations are placed into a separate percpu chunk (or chunks). These
chunks are similar to default chunks, except that they have an attached
vector of pointers to obj_cgroup objects, large enough to hold a pointer
for each allocated object. At allocation time, if the allocation has to
be accounted (__GFP_ACCOUNT is passed, the allocating process belongs to
a non-root memory cgroup, etc), the memory cgroup is charged and, if the
maximum limit is not exceeded, the allocation is performed using a
memcg-aware chunk. Otherwise -ENOMEM is returned or the allocation is
forced over the limit, depending on the gfp flags (as with any other
kernel memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. At release time the memcg
information is restored from the vector and the cgroup is uncharged.
Unaccounted allocations (at this point the absolute majority of all
percpu allocations) are performed in the old way, so no additional
overhead is expected.
To avoid pinning dying memory cgroups by outstanding allocations,
obj_cgroup API is used instead of directly saving memory cgroup pointers.
obj_cgroup is basically a pointer to a memory cgroup with a standalone
reference counter. The trick is that it can be atomically swapped to
point at the parent cgroup, so that the original memory cgroup can be
released before all of the objects which have been charged to it. Because all
charges and statistics are fully recursive, it's perfectly correct to
uncharge the parent cgroup instead. This scheme is used in the slab
memory accounting, and percpu memory can just follow the scheme.
This patch (of 5):
To implement accounting of percpu memory we need to know the size of the
freed object. Return it from pcpu_free_area().
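A sketch of how that return value gets used later in the series (the lookup
helper is invented; pcpu_free_area(), obj_cgroup_uncharge() and
obj_cgroup_put() are real):
  /* Kernel-context sketch, not the actual code: the free path learns
   * the size of the freed object from pcpu_free_area() and uses it to
   * uncharge the owning memcg via its obj_cgroup. */
  static void pcpu_free_sketch(struct pcpu_chunk *chunk, int off)
  {
          struct obj_cgroup *objcg = pcpu_take_objcg(chunk, off); /* invented */
          int size = pcpu_free_area(chunk, off);  /* now returns bytes freed */

          if (objcg) {
                  obj_cgroup_uncharge(objcg, size);
                  obj_cgroup_put(objcg);
          }
  }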
Signed-off-by: Roman Gushchin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Tobin C. Harding <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Bixuan Cui <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
syzbot reported and bisected a use-after-free due to the recent init
cleanups.
The putname() should happen only once we know we're *not* branching back
to retry, the same as is done in do_unlinkat().
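Schematically, the intended ordering is the sketch below (not the actual
fs/namei.c code; do_rmdir_once() is a placeholder, retry_estale() and
putname() are real):
  /* Kernel-context sketch of the fix: the filename must stay alive
   * across the ESTALE retry, so putname() runs only on the path that
   * does not branch back to retry, mirroring do_unlinkat(). */
  static int rmdir_sketch(int dfd, struct filename *name)
  {
          unsigned int lookup_flags = LOOKUP_DIRECTORY;
          int error;

  retry:
          error = do_rmdir_once(dfd, name, lookup_flags);  /* placeholder */
          if (retry_estale(error, lookup_flags)) {
                  lookup_flags |= LOOKUP_REVAL;
                  goto retry;              /* name must still be valid here */
          }
          putname(name);                   /* safe: we will not retry */
          return error;
  }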
Reported-by: [email protected]
Fixes: e24ab0ef689d ("fs: push the getname from do_rmdir into the callers")
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
This patch implements the __smp_store_release and __smp_load_acquire barriers
using ordered stores and loads. This avoids the sync instruction present in
the generic implementation.
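For reference, the ordering contract these barriers must provide is the
usual acquire/release pairing; in portable C11 terms (an illustration of
the semantics only, not the parisc assembly):
  /* Illustration of store-release / load-acquire semantics using C11
   * atomics: a consumer that observes ready == 1 via the load-acquire
   * is guaranteed to also observe the payload written before the
   * store-release. */
  #include <stdatomic.h>

  static _Atomic int ready;
  static int payload;

  void producer(void)
  {
          payload = 42;                           /* plain store */
          atomic_store_explicit(&ready, 1, memory_order_release);
  }

  int consumer(void)
  {
          while (!atomic_load_explicit(&ready, memory_order_acquire))
                  ;                               /* spin */
          return payload;                         /* sees 42 */
  }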
Cc: <[email protected]> # 4.14+
Signed-off-by: Dave Anglin <[email protected]>
Signed-off-by: Helge Deller <[email protected]>
|
|
Fix multiple issues when handling alarms:
- Use a threaded interrupt to avoid scheduling while atomic (see the
sketch after this list)
- Stop matching on week day as it may not be set correctly
- Avoid parsing the DT interrupt and use what is provided by the i2c or
spi subsystem
- Avoid returning IRQ_NONE in case of error in the interrupt handler
- Never write WDTF as specified in the datasheet
- Set uie_unsupported since, as for the pcf85063, setting alarms every
second does not work correctly and confuses the RTC.
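A minimal sketch of the threaded-interrupt request from the first item
above (handler and data names are illustrative, not the driver's actual
symbols):
  /* Kernel-context sketch, not the pcf2127 driver's actual code: with a
   * threaded IRQ the alarm handling runs in process context, so it may
   * sleep (I2C/SPI transfers, mutexes) instead of running in atomic
   * hard-IRQ context.  IRQF_ONESHOT is required when no hard-IRQ handler
   * is supplied.  pcf_alarm_irq_thread() is an invented name. */
  static int pcf_request_alarm_irq(struct device *dev, int irq, void *pcf)
  {
          return devm_request_threaded_irq(dev, irq,
                                           NULL,                 /* no hard-IRQ part */
                                           pcf_alarm_irq_thread, /* thread fn */
                                           IRQF_TRIGGER_LOW | IRQF_ONESHOT,
                                           dev_name(dev), pcf);
  }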
Signed-off-by: Alexandre Belloni <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add alarm support for the pcf2127 RTC chip family.
Tested on pca2129.
Signed-off-by: Liam Beguin <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Reviewed-by: Bruno Thomsen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The PCA2129 is the automotive grade version of the PCF2129.
Add it to the list of compatibles.
Signed-off-by: Liam Beguin <[email protected]>
Signed-off-by: Alexandre Belloni <[email protected]>
Reviewed-by: Bruno Thomsen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|