In __khugepaged_enter(), if the MMF_VM_HUGEPAGE bit in mm->flags is
already set, the freshly allocated mm_slot is released and the function
returns. So allocate the mm_slot only after test_and_set_bit(), which
avoids the pointless allocation and free.
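For illustration, the reordered logic might look like this (a simplified
sketch, not the exact kernel code):
  void __khugepaged_enter(struct mm_struct *mm)
  {
          struct mm_slot *slot;

          /* Bail out first: nothing was allocated, nothing to free. */
          if (test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))
                  return;

          /* Allocate only once we know the mm is not yet tracked. */
          slot = mm_slot_alloc(mm_slot_cache);
          if (!slot)
                  return;
          /* ... link the slot into the khugepaged scan list ... */
  }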
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Xin Hao <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Commit 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock
held") moved wakeup_kswapd() from steal_suitable_fallback() to rmqueue()
using ZONE_BOOSTED_WATERMARK flag.
Only allocation contexts that include ALLOC_KSWAPD (which corresponds to
__GFP_KSWAPD_RECLAIM) should wake kswapd, because callers are expected to
drop __GFP_KSWAPD_RECLAIM when trying to hold pgdat->kswapd_wait carries a
risk of deadlock. But since zone->flags is a shared variable, a thread
doing a !__GFP_KSWAPD_RECLAIM allocation request might observe this flag
being set immediately after another thread doing a __GFP_KSWAPD_RECLAIM
allocation request set it, creating the possibility of deadlock.
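Illustratively, the fix gates the wakeup on the allocation context (a
simplified sketch of the check in rmqueue()):
  /* Only wake kswapd for requests that allow __GFP_KSWAPD_RECLAIM. */
  if ((alloc_flags & ALLOC_KSWAPD) &&
      unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
          clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
          wakeup_kswapd(zone, 0, 0, zone_idx(zone));
  }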
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 73444bc4d8f9 ("mm, page_alloc: do not wake kswapd with zone lock held")
Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
free_area_init_memoryless_node() is just a wrapper around
free_area_init_node(); remove it as a cleanup.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
free_transhuge_page() acquires the split queue lock and then checks
whether the THP was added to the deferred list. This causes heavy
deferred queue lock contention.
It is safe to check whether the THP is on the deferred list without
holding the deferred queue lock in free_transhuge_page(), because by the
time free_transhuge_page() is reached, nothing can still be trying to add
the folio to _deferred_list.
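The shape of the change, roughly (a simplified sketch):
  void free_transhuge_page(struct page *page)
  {
          struct folio *folio = page_folio(page);
          struct deferred_split *ds_queue = get_deferred_split_queue(folio);
          unsigned long flags;

          /* Lockless check: nothing can add the folio here anymore. */
          if (!list_empty(&folio->_deferred_list)) {
                  spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                  if (!list_empty(&folio->_deferred_list)) {
                          ds_queue->split_queue_len--;
                          list_del(&folio->_deferred_list);
                  }
                  spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
          }
          free_compound_page(page);
  }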
Running page_fault1 of will-it-scale with order-2 folios for anonymous
mappings and 96 processes on an Ice Lake 48C/96T test box, we could see
about 61% split_queue_lock contention:
- 63.02% 0.01% page_fault1_pro [kernel.kallsyms] [k] free_transhuge_page
- 63.01% free_transhuge_page
+ 62.91% _raw_spin_lock_irqsave
With this patch applied, the split_queue_lock contention is less
than 1%.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yin Fengwei <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The general rule for using a swap entry is as follows.
When we get a swap entry, unless there is some other way to prevent
swapoff (such as the folio in the swap cache being locked, the page table
lock being held, etc.), the swap entry may become invalid because of
swapoff. In that case, we need to enclose all swap-related function calls
with get_swap_device() and put_swap_device(), unless the swap functions
call get/put_swap_device() themselves.
Add this rule to the comments of get_swap_device().
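The usage pattern the rule implies (a minimal sketch):
  struct swap_info_struct *si;

  si = get_swap_device(entry);
  if (!si)
          return;         /* raced with swapoff; entry is no longer valid */
  /* ... safely call swap functions that dereference 'entry' ... */
  put_swap_device(si);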
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Chris Li (Google) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
__swap_duplicate() is called by
- swap_shmem_alloc(): the folio in swap cache is locked.
- copy_nonpresent_pte() -> swap_duplicate() and try_to_unmap_one() ->
swap_duplicate(): the page table lock is held.
- __read_swap_cache_async() -> swapcache_prepare(): enclosed with
get/put_swap_device() in __read_swap_cache_async() already.
So, it's safe to remove get/put_swap_device() in __swap_duplicate().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Chris Li (Google) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
__swp_swapcount() just wraps the call to swap_swapcount() with
get/put_swap_device(). It is only called from __read_swap_cache_async(),
which already encloses the call with get/put_swap_device(). So
__read_swap_cache_async() can call swap_swapcount() directly.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Chris Li (Google) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This makes the function a little easier to understand because we don't
need to consider swapoff. It also makes it possible to remove the
get/put_swap_device() calls in some functions called by
__read_swap_cache_async().
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Chris Li (Google) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "swap: cleanup get/put_swap_device() usage", v3.
The general rule for using a swap entry is as follows.
When we get a swap entry, unless there is some other way to prevent
swapoff (such as the folio in the swap cache being locked, the page table
lock being held, etc.), the swap entry may become invalid because of
swapoff. In that case, we need to enclose all swap-related function calls
with get_swap_device() and put_swap_device(), unless the swap functions
call get/put_swap_device() themselves.
Based on the above rule, all get/put_swap_device() usages were checked and
cleaned up where necessary.
This patch (of 5):
get/put_swap_device() were added to __swap_count() in commit
eb085574a752 ("mm, swap: fix race between swapoff and some swap
operations"). Later, in commit 2799e77529c2 ("swap: fix
do_swap_page() race with swapoff"), get/put_swap_device() were added to
do_swap_page(), and they enclose the only call site of __swap_count().
So it's safe to remove get/put_swap_device() from __swap_count() now.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Huang, Ying" <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Chris Li (Google) <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
In calculate_node_totalpages(), zone_start_pfn/zone_end_pfn are already
calculated in zone_spanned_pages_in_node(), so pass them as parameters
instead of node_start_pfn/node_end_pfn, and the duplicated calculation
can be dropped.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Suggested-by: Mike Rapoport <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Haifeng Xu <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Currently, no matter whether a node actually has memory or not,
calculate_node_totalpages() is used to account the number of pages in the
zone/node. However, for a node without memory, these unnecessary
calculations can be skipped: all the zone/node page counts can simply be
set to 0. So introduce reset_memoryless_node_totalpages() to perform this
action, and call calculate_node_totalpages() only for nodes with memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Suggested-by: Mike Rapoport <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
On Android app cycle workloads, MGLRU showed a significant reduction in
workingset refaults although pgpgin/pswpin remained relatively unchanged.
This indicated MGLRU may be undercounting workingset refaults.
This has an impact on userspace programs, like Android's LMKD, that
monitor workingset refault statistics to detect thrashing.
It was found that refaults were only accounted if the MGLRU shadow entry
was for a recently evicted folio. However, recently evicted folios should
be accounted as workingset activation, and refaults should be accounted
regardless of recency.
Fix MGLRU's workingset refault and activation accounting to more closely
match that of the conventional active/inactive LRU.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")
Signed-off-by: Kalesh Singh <[email protected]>
Reported-by: Charan Teja Kalla <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Jan Alexander Steffens (heftig) <[email protected]>
Cc: Oleksandr Natalenko <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export
memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's
watermark.
We allocate jobs to our compute farm using heuristics determined by memory
and swap usage from previous jobs. Tracking the peak swap usage for new
jobs is important for determining when jobs are exceeding their expected
bounds, or when our baseline metrics are getting outdated.
Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file
in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current"
would give less accurate results as well as add complication to the code.
Having this watermark exposed in sysfs is much preferred.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lars R. Damerow <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
shmem_show_options() uses sbinfo->mpol without incrementing its refcount.
This may lead to a race with replacement of the mpol by remount. The
execution sequence is as follows.
CPU0                                     CPU1
shmem_show_options()                     shmem_reconfigure()
  shmem_show_mpol(seq, sbinfo->mpol)       mpol = sbinfo->mpol
                                           mpol_put(mpol)
    mpol->mode   <-- use-after-free
The KASAN report is as follows.
BUG: KASAN: slab-use-after-free in shmem_show_options+0x21b/0x340
Read of size 2 at addr ffff888124324004 by task mount/2388
CPU: 2 PID: 2388 Comm: mount Not tainted 6.4.0-rc3-00017-g9d646009f65d-dirty #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x37/0x50
print_report+0xd0/0x620
? shmem_show_options+0x21b/0x340
? __virt_addr_valid+0xf4/0x180
? shmem_show_options+0x21b/0x340
kasan_report+0xb8/0xe0
? shmem_show_options+0x21b/0x340
shmem_show_options+0x21b/0x340
? __pfx_shmem_show_options+0x10/0x10
? strchr+0x2c/0x50
? strlen+0x23/0x40
? seq_puts+0x7d/0x90
show_vfsmnt+0x1e6/0x260
? __pfx_show_vfsmnt+0x10/0x10
? __kasan_kmalloc+0x7f/0x90
seq_read_iter+0x57a/0x740
vfs_read+0x2e2/0x4a0
? __pfx_vfs_read+0x10/0x10
? down_write_killable+0xb8/0x140
? __pfx_down_write_killable+0x10/0x10
? __fget_light+0xa9/0x1e0
? up_write+0x3f/0x80
ksys_read+0xb8/0x150
? __pfx_ksys_read+0x10/0x10
? fpregs_assert_state_consistent+0x55/0x60
? exit_to_user_mode_prepare+0x2d/0x120
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
</TASK>
Allocated by task 2387:
kasan_save_stack+0x22/0x50
kasan_set_track+0x25/0x30
__kasan_slab_alloc+0x59/0x70
kmem_cache_alloc+0xdd/0x220
mpol_new+0x83/0x150
mpol_parse_str+0x280/0x4a0
shmem_parse_one+0x364/0x520
vfs_parse_fs_param+0xf8/0x1a0
vfs_parse_fs_string+0xc9/0x130
shmem_parse_options+0xb2/0x110
path_mount+0x597/0xdf0
do_mount+0xcd/0xf0
__x64_sys_mount+0xbd/0x100
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Freed by task 2389:
kasan_save_stack+0x22/0x50
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2e/0x50
__kasan_slab_free+0x10e/0x1a0
kmem_cache_free+0x9c/0x350
shmem_reconfigure+0x278/0x370
reconfigure_super+0x383/0x450
path_mount+0xcc5/0xdf0
do_mount+0xcd/0xf0
__x64_sys_mount+0xbd/0x100
do_syscall_64+0x3c/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
The buggy address belongs to the object at ffff888124324000
which belongs to the cache numa_policy of size 32
The buggy address is located 4 bytes inside of
freed 32-byte region [ffff888124324000, ffff888124324020)
==================================================================
To fix the bug, call shmem_get_sbmpol() before shmem_show_mpol() and
mpol_put() after it, so the mempolicy cannot be freed while it is being
shown.
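Sketch of the fixed sequence (simplified):
  static int shmem_show_options(struct seq_file *seq, struct dentry *root)
  {
          struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
          struct mempolicy *mpol;
          ...
          mpol = shmem_get_sbmpol(sbinfo);        /* takes a reference */
          shmem_show_mpol(seq, mpol);
          mpol_put(mpol);                         /* drop it after use */
          ...
  }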
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tu Jinjiang <[email protected]>
Reviewed-by: Kefeng Wang <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Cc: Nanyong Sun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
I've observed that fast isolation often isolates more pages than
cc->nr_migratepages requires, and the excess freepages are then released
back to the buddy system. So skip fast freepages isolation once enough
freepages have been isolated, to save some CPU cycles.
Link: https://lkml.kernel.org/r/f39c2c07f2dba2732fd9c0843572e5bef96f7f67.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
fast_isolate_freepages() can also isolate freepages, but we currently
cannot tell how efficient that fast isolation is, and thus cannot gauge
the fast isolation pressure. So add a trace event exposing some numbers
to help understand the efficiency of fast freepages isolation.
Link: https://lkml.kernel.org/r/78d2932d0160d122c15372aceb3f2c45460a17fc.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
To keep the same logic as test_and_set_skip(), only set the skip flag if
cc->no_set_skip_hint is false, which makes the code more consistent.
Link: https://lkml.kernel.org/r/0eb2cd2407ffb259ae6e3071e10f70f2d41d0f3e.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
In fast_isolate_around(), it is assumed that the pageblock is fully
scanned if cc->nr_freepages < cc->nr_migratepages after trying to isolate
some free pages, and the skip flag is set to avoid scanning it again.
However, this can miss setting the skip flag for a fully scanned
pageblock (where the returned 'start_pfn' equals 'end_pfn') in the case
where cc->nr_freepages is larger than cc->nr_migratepages.
So it makes more sense to use the 'start_pfn' returned from
isolate_freepages_block() together with 'end_pfn' to decide whether a
pageblock is fully scanned. This also covers the case where
cc->nr_freepages < cc->nr_migratepages, in which 'start_pfn' is usually
equal to 'end_pfn' unless some uncommon fatal error occurs after
non-strict mode isolation.
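Concretely, the check becomes something like (a simplified sketch):
  /* 'start_pfn' is the value returned by isolate_freepages_block() */
  if (start_pfn == end_pfn && !cc->no_set_skip_hint)
          set_pageblock_skip(page);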
Link: https://lkml.kernel.org/r/f4efd2fa08735794a6d809da3249b6715ba6ad38.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
No caller cares about the return value of fast_isolate_freepages(), so
make it void.
Link: https://lkml.kernel.org/r/759fca20b22ebf4c81afa30496837b9e0fb2e53b.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "Misc cleanups and improvements for compaction".
This series contains some cleanups and improvements for compaction.
This patch (of 6):
The caller has validated the page before calling
update_pageblock_skip(), thus drop the redundant page validation in
update_pageblock_skip().
Link: https://lkml.kernel.org/r/5142e15b9295fe8c447dbb39b7907a20177a1413.1685018752.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Purging fragmented blocks is done unconditionally in several contexts:
1) From drain_vmap_area_work(), when the number of lazy to be freed
vmap_areas reached the threshold
2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
3) _vm_unmap_aliases()
#1 There is no reason to zap fragmented vmap blocks unconditionally, simply
because reclaiming all lazy areas drains at least
32MB * fls(num_online_cpus())
per invocation which is plenty.
#2 Reclaiming when running out of space or due to memory pressure makes a
lot of sense
#3 _vm_unmap_aliases() needs to touch everything because the caller has no
clue which vmap_area used a particular page last and the vmap_area lost
that information too.
Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
vmap area first and then cares about the flush. That in turn requires
a full walk of _all_ vmap areas including the one which was just
added to the purge list.
But as this has to be flushed anyway this is an opportunity to combine
outstanding TLB flushes and do the housekeeping of purging freed areas,
but like #1 there is no real good reason to zap usable vmap blocks
unconditionally.
Add a @force_purge argument to the newly split out block purge function
and, if it is not true, only purge fragmented blocks which have less than
1/4 of their capacity left.
Rename purge_vmap_area_lazy() to reclaim_and_purge_vmap_areas() to make it
clear what the function does.
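A rough sketch of the resulting purge decision (simplified; names follow
the description above):
  /* VMAP_PURGE_THRESHOLD: a quarter of a block's capacity. */
  #define VMAP_PURGE_THRESHOLD    (VMAP_BBMAP_BITS / 4)

  /* Skip blocks that still have live mappings, or that are entirely
   * dirty (those are handled by the regular free path). */
  if (vb->free + vb->dirty != VMAP_BBMAP_BITS ||
      vb->dirty == VMAP_BBMAP_BITS)
          return false;
  /* Don't purge usable blocks unless forced or mostly exhausted. */
  if (!(force_purge || vb->free < VMAP_PURGE_THRESHOLD))
          return false;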
[[email protected]: correct VMAP_PURGE_THRESHOLD check]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
purge_fragmented_blocks() accesses vmap_block::free and vmap_block::dirty
locklessly for a quick check.
Add the missing READ/WRITE_ONCE() annotations.
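Illustratively (a sketch, not the full patch):
  /* Writer, under vb->lock: */
  WRITE_ONCE(vb->free, vb->free - (1UL << order));

  /* Lockless reader doing the quick check: */
  if (READ_ONCE(vb->free) + READ_ONCE(vb->dirty) != VMAP_BBMAP_BITS)
          continue;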
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
vb_alloc() unconditionally locks a vmap_block on the free list to check
the free space.
This can be done locklessly because vmap_block::free never increases, it's
only decreased on allocations.
Check the free space locklessly and, only if that succeeds, recheck under
the lock.
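Roughly (a simplified sketch):
  /* vmap_block::free only ever decreases, so this precheck is safe. */
  if (READ_ONCE(vb->free) < (1UL << order))
          continue;

  spin_lock(&vb->lock);
  /* Recheck under the lock before committing to this block. */
  if (vb->free < (1UL << order)) {
          spin_unlock(&vb->lock);
          continue;
  }
  /* ... carve the allocation out of the block ... */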
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
vmap blocks which have active mappings cannot be purged. Allocations
which have been freed are accounted for in vmap_block::dirty_min/max, so
that they can be detected in _vm_unmap_aliases() as potentially stale
TLBs.
If there are several invocations of _vm_unmap_aliases() then each of them
will flush the dirty range. That's pointless and just increases the
probability of full TLB flushes.
Avoid that by resetting the flush range after accounting for it. That's
safe versus other invocations of _vm_unmap_aliases() because this is all
serialized with vmap_purge_lock.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
_vm_unmap_aliases() walks the per-CPU xarrays to find partially unmapped
blocks and then walks the per-CPU free lists to purge fragmented blocks.
Arguably that's a waste of CPU cycles and cache lines, as the full xarray
walk already touches every block.
Avoid this double iteration:
- Split out the code to purge one block and the code to free the local
purge list into helper functions.
- Try to purge the fragmented blocks in the xarray walk before looking at
their dirty space.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "mm/vmalloc: Assorted fixes and improvements", v2.
This series addresses the following issues:
1) Prevent the stale TLB problem related to fully utilized vmap blocks
2) Avoid the double per CPU list walk in _vm_unmap_aliases()
3) Avoid flushing dirty space over and over
4) Add a lockless quickcheck in vb_alloc() and add missing
READ/WRITE_ONCE() annotations
5) Prevent overeager purging of usable vmap_blocks if
not under memory/address space pressure.
This patch (of 6):
_vm_unmap_aliases() is used to ensure that no unflushed TLB entries for a
page are left in the system. This is required due to the lazy TLB flush
mechanism in vmalloc.
This is attempted by walking the per-CPU free lists, but those do not
contain fully utilized vmap blocks, because a block is removed from the
free list once its free space becomes zero. And when a block is not fully
unmapped, it is not on the purge list either.
So neither the per-CPU list iteration nor the purge list walk finds the
block, and if the page was mapped via such a block and the TLB has not yet
been flushed, the guarantee of _vm_unmap_aliases() that there are no stale
TLBs after returning is broken:
x = vb_alloc() // Removes vmap_block from free list because vb->free became 0
vb_free(x) // Unmaps page and marks in dirty_min/max range
// Block has still mappings and is not put on purge list
// Page is reused
vm_unmap_aliases() // Can't find vmap block with the dirty space -> FAIL
So instead of walking the per CPU free lists, walk the per CPU xarrays
which hold pointers to _all_ active blocks in the system including those
removed from the free lists.
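Sketch of the walk (simplified; assumes the per-CPU vmap_blocks xarray
that tracks all active blocks):
  int cpu;

  for_each_possible_cpu(cpu) {
          struct vmap_block_queue *vbq = &per_cpu(vmap_block_queue, cpu);
          struct vmap_block *vb;
          unsigned long idx;

          xa_for_each(&vbq->vmap_blocks, idx, vb) {
                  /* inspect dirty_min/max, accumulate the flush range */
          }
  }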
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: db64fe02258f ("mm: rewrite vmap layer")
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Avoid passing memcg* and pglist_data* to lru_gen_test_recent()
since we only use the lruvec anyway.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Reviewed-by: Yuanchu Xie <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Add helpers to page table walking code:
- Clarifies intent via name "should_walk_mmu" and "should_clear_pmd_young"
- Avoids repeating same logic in two places
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Reviewed-by: Yuanchu Xie <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
lru_gen_soft_reclaim() gets the lruvec from the memcg and node ID to keep a
cleaner interface on the caller side.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Reviewed-by: Yuanchu Xie <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Use the DECLARE_BITMAP macro when possible.
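DECLARE_BITMAP(name, bits) expands to an unsigned long array sized in
longs, so for example (hypothetical field name):
  /* before */
  unsigned long bitmap[BITS_TO_LONGS(NR_ITEMS)];
  /* after */
  DECLARE_BITMAP(bitmap, NR_ITEMS);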
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: T.J. Alumbaugh <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Yuanchu Xie <[email protected]>
Cc: Yu Zhao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Replace 'then' with 'than'.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Haifeng Xu <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
It is felt that the name mlock_future_check() is vague - it doesn't
particularly convey the function's operation. mlock_future_ok() is a
clearer name for a predicate function.
Acked-by: Vlastimil Babka <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
In all but one instance, mlock_future_check() is treated as a boolean
function despite returning an error code. In one instance, this error
code is ignored and replaced with -ENOMEM.
This is confusing, and the inversion of true -> failure, false -> success
is not warranted. Convert the function to a bool, lightly refactor and
return true if the check passes, false if not.
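A sketch of the converted predicate (simplified; details may differ from
the final code):
  static bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
                              unsigned long bytes)
  {
          unsigned long locked_pages, limit_pages;

          /* Not mlocked, or privileged: the check trivially passes. */
          if (!(flags & VM_LOCKED) || capable(CAP_IPC_LOCK))
                  return true;

          locked_pages = bytes >> PAGE_SHIFT;
          locked_pages += mm->locked_vm;
          limit_pages = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

          return locked_pages <= limit_pages;
  }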
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
During stress testing with higher-order allocations, a deadlock scenario
was observed in compaction: One GFP_NOFS allocation was sleeping on
mm/compaction.c::too_many_isolated(), while all CPUs in the system were
busy with compactors spinning on buffer locks held by the sleeping
GFP_NOFS allocation.
Reclaim is susceptible to this same deadlock; we fixed it there by
granting GFP_NOFS allocations additional LRU isolation headroom, to ensure
they make forward progress while holding fs locks that other reclaimers
might acquire. Do the same here.
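The headroom can be granted by lowering the isolation ceiling for
filesystem-capable contexts, so only GFP_NOFS callers can use the last
slice (an illustrative sketch, inside too_many_isolated()):
  /*
   * Let GFP_NOFS isolate past the limit used for regular compaction
   * runs: __GFP_FS callers get a lower ceiling, so they cannot block
   * a NOFS thread that holds fs locks they are spinning on.
   */
  if (cc->gfp_mask & __GFP_FS) {
          inactive >>= 3;
          active >>= 3;
  }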
This code has been like this since compaction was initially merged, and I
only managed to trigger this with out-of-tree patches that dramatically
increase the contexts that do GFP_NOFS compaction. While the issue is
real, it seems theoretical in nature given existing allocation sites.
Worth fixing now, but no Fixes tag or stable CC.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Since it only returns COMPACT_CONTINUE or COMPACT_SKIPPED now, a bool
return value simplifies the callsites.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Suggested-by: Vlastimil Babka <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The watermark check in compaction_zonelist_suitable(), called from
should_compact_retry(), is sandwiched between two watermark checks
already: before, there are freelist attempts as part of direct reclaim and
direct compaction; after, there is a last-minute freelist attempt in
__alloc_pages_may_oom().
The check in compaction_zonelist_suitable() isn't necessary. Kill it.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Baolin Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Remove the is_via_compact_memory() checks from all paths not reachable
via /proc/sys/vm/compact_memory.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Baolin Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
__compaction_suitable() is supposed to check for available migration
targets. However, it also checks whether the operation was requested via
/proc/sys/vm/compact_memory, and whether the original allocation request
can already succeed. These don't apply to all callsites.
Move the checks out to the callers, so that later patches can deal with
them one by one. No functional change intended.
[[email protected]: fix comment, per Vlastimil]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The different branches for retry are unnecessarily complicated. There are
really only three outcomes: progress (retry n times), skipped (retry if
reclaim can help), failed (retry with higher priority).
Rearrange the branches and the retry counter to make it simpler.
[[email protected]: restore behavior when hitting max_retries]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "mm: compaction: cleanups & simplifications".
These compaction cleanups are split out from the huge page allocator
series[1], as requested by reviewer feedback.
[1] https://lore.kernel.org/linux-mm/[email protected]/
This patch (of 5):
The compaction result helpers encode quirks that are specific to the
allocator's retry logic. E.g. COMPACT_SUCCESS and COMPACT_COMPLETE
actually represent failures that should be retried upon, and so on. I
frequently found myself pulling up the helper implementation in order to
understand and work on the retry logic. They're not quite clean
abstractions; rather they split the retry logic into two locations.
Remove the helpers and inline the checks. Then comment on the result
interpretations directly where the decision making happens.
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Michal Hocko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
smatch reports:
mm/page_alloc.c:247:5: warning: symbol
'sysctl_lowmem_reserve_ratio' was not declared. Should it be static?
This variable is only used in its defining file, so it should be static.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Tom Rix <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
If the iterator has moved to the previous entry, then step forward one
range, back to the gap.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: David Binderman <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vernon Yang <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The maple tree iterator clean up is incompatible with the way
do_vmi_align_munmap() expects it to behave. Update the expected behaviour
to match now, since the change will also work with the current code.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: David Binderman <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vernon Yang <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
MAS_WARN_ON() will provide more information on the maple state and can be
more useful for debugging. Use this version of WARN_ON() in the debugging
code when storing to the tree.
Update the printk to a pr_warn(), although this will only be printed when
maple tree debug is enabled anyway. Combining all the print statements
into one keeps them together on a busy terminal.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Sergey Senozhatsky <[email protected]>
Cc: David Binderman <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Vernon Yang <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Use the vma iterator in the validation code and combine the code to check
the maple tree into the main validate_mm() function.
Introduce a new function vma_iter_dump_tree() to dump the maple tree in
hex layout.
Replace all calls to validate_mm_mt() with validate_mm().
[[email protected]: update validate_mm() to use vma iterator CONFIG flag]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: David Binderman <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vernon Yang <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Allow different formatting strings to be used when dumping the tree.
Currently supports hex and decimal.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: David Binderman <[email protected]>
Cc: Peng Zhang <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Vernon Yang <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Almost all of the callers & implementors of migrate_pages() were already
converted to use folios. compaction_alloc() & compaction_free() are
trivial to convert as part of this patch and not worth splitting out.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Now that we have eliminated all callers of GUP APIs which used the vmas
parameter, eliminate it altogether.
This eliminates a class of bugs where vmas might have been kept around
longer than the mmap_lock and thus we need not be concerned about locks
being dropped during this operation leaving behind dangling pointers.
This simplifies the GUP API and makes it considerably clearer as to its
purpose - follow flags are applied and if pinning, an array of pages is
returned.
Link: https://lkml.kernel.org/r/6811b4b2b4b3baf3dd07f422bb18853bb2cd09fb.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Christian König <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
We are now in a position where no caller of pin_user_pages() requires the
vmas parameter at all, so eliminate this parameter from the function and
all callers.
This clears the way to removing the vmas parameter from GUP altogether.
Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Dennis Dalessandro <[email protected]> [qib]
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Sakari Ailus <[email protected]> [drivers/media]
Cc: Catalin Marinas <[email protected]>
Cc: Christian König <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The only get_user_pages_remote() invocations which used the vmas
parameter were for a single page, which can instead simply look up the VMA
directly. In particular:
- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
remove it.
- __access_remote_vm() was already using vma_lookup() when the original
lookup failed so by doing the lookup directly this also de-duplicates the
code.
We are able to perform these VMA operations as we already hold the
mmap_lock in order to be able to call get_user_pages_remote().
As part of this work we add get_user_page_vma_remote() which abstracts the
VMA lookup, error handling and decrementing the page reference count should
the VMA lookup fail.
This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.
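A sketch of such a helper (simplified; the exact signature may differ):
  static inline struct page *get_user_page_vma_remote(struct mm_struct *mm,
                          unsigned long addr, int gup_flags,
                          struct vm_area_struct **vmap)
  {
          struct page *page;
          struct vm_area_struct *vma;
          long got = get_user_pages_remote(mm, addr, 1, gup_flags,
                                           &page, NULL);

          if (got < 0)
                  return ERR_PTR(got);

          vma = vma_lookup(mm, addr);
          if (WARN_ON_ONCE(!vma)) {
                  /* Drop the reference taken by GUP before bailing out. */
                  put_page(page);
                  return ERR_PTR(-EINVAL);
          }
          *vmap = vma;
          return page;
  }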
[[email protected]: avoid passing NULL to PTR_ERR]
Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]> (for arm64)
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Janosch Frank <[email protected]> (for s390)
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Christian König <[email protected]>
Cc: Dennis Dalessandro <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jarkko Sakkinen <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Sakari Ailus <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>