blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2023-12-20	kunit: kasan_test: disable fortify string checker on kmalloc_oob_memset	Nico Pache	1	-4/+16
	Similar to commit 09c6304e38e4 ("kasan: test: fix compatibility with FORTIFY_SOURCE") the kernel is panicing in kmalloc_oob_memset_*. This is due to the `ptr` not being hidden from the optimizer which would disable the runtime fortify string checker. kernel BUG at lib/string_helpers.c:1048! Call Trace: [<00000000272502e2>] fortify_panic+0x2a/0x30 ([<00000000272502de>] fortify_panic+0x26/0x30) [<001bffff817045c4>] kmalloc_oob_memset_2+0x22c/0x230 [kasan_test] Hide the `ptr` variable from the optimizer to fix the kernel panic. Also define a memset_size variable and hide that as well. This cleans up the code and follows the same convention as other tests. [[email protected]: address review comments from Andrey] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nico Pache <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-14	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski	10	-41/+103
	Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c 3a0b5a2929fd ("iavf: Introduce new state machines for flow director") 95260816b489 ("iavf: use iavf_schedule_aq_request() helper") https://lore.kernel.org/all/[email protected]/ drivers/net/ethernet/broadcom/bnxt/bnxt.c c13e268c0768 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic") c2f8063309da ("bnxt_en: Refactor RX VLAN acceleration logic.") a7445d69809f ("bnxt_en: Add support for new RX and TPA_START completion types for P7") 1c7fd6ee2fe4 ("bnxt_en: Rename some macros for the P5 chips") https://lore.kernel.org/all/[email protected]/ drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c bd6781c18cb5 ("bnxt_en: Fix wrong return value check in bnxt_close_nic()") 84793a499578 ("bnxt_en: Skip nic close/open when configuring tstamp filters") https://lore.kernel.org/all/[email protected]/ drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c 3d7a3f2612d7 ("net/mlx5: Nack sync reset request when HotPlug is enabled") cecf44ea1a1f ("net/mlx5: Allow sync reset flow when BF MGT interface device is present") https://lore.kernel.org/all/[email protected]/ No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-12-14	mm: Introduce flush_cache_vmap_early()	Alexandre Ghiti	1	-7/+1
	The pcpu setup when using the page allocator sets up a new vmalloc mapping very early in the boot process, so early that it cannot use the flush_cache_vmap() function which may depend on structures not yet initialized (for example in riscv, we currently send an IPI to flush other cpus TLB). But on some architectures, we must call flush_cache_vmap(): for example, in riscv, some uarchs can cache invalid TLB entries so we need to flush the new established mapping to avoid taking an exception. So fix this by introducing a new function flush_cache_vmap_early() which is called right after setting the new page table entry and before accessing this new mapping. This new function implements a local flush tlb on riscv and is no-op for other architectures (same as today). Signed-off-by: Alexandre Ghiti <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Dennis Zhou <[email protected]>
2023-12-12	mm/mglru: reclaim offlined memcgs harder	Yu Zhao	1	-8/+16
	In the effort to reduce zombie memcgs [1], it was discovered that the memcg LRU doesn't apply enough pressure on offlined memcgs. Specifically, instead of rotating them to the tail of the current generation (MEMCG_LRU_TAIL) for a second attempt, it moves them to the next generation (MEMCG_LRU_YOUNG) after the first attempt. Not applying enough pressure on offlined memcgs can cause them to build up, and this can be particularly harmful to memory-constrained systems. On Pixel 8 Pro, launching apps for 50 cycles: Before After Change Zombie memcgs 45 35 -22% [1] https://lore.kernel.org/CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <[email protected]> Reported-by: T.J. Mercier <[email protected]> Tested-by: T.J. Mercier <[email protected]> Cc: Charan Teja Kalla <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Jaroslav Pulchart <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/mglru: respect min_ttl_ms with memcgs	Yu Zhao	1	-14/+16
	While investigating kswapd "consuming 100% CPU" [1] (also see "mm/mglru: try to stop at high watermarks"), it was discovered that the memcg LRU can breach the thrashing protection imposed by min_ttl_ms. Before the memcg LRU: kswapd() shrink_node_memcgs() mem_cgroup_iter() inc_max_seq() // always hit a different memcg lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation After the memcg LRU: kswapd() shrink_many() restart: iterate the memcg LRU: inc_max_seq() // occasionally hit the same memcg if raced with lru_gen_rotate_memcg(): goto restart lru_gen_age_node() mem_cgroup_iter() check the timestamp of the oldest generation Specifically, when the restart happens in shrink_many(), it needs to stick with the (memcg LRU) generation it began with. In other words, it should neither re-read memcg_lru->seq nor age an lruvec of a different generation. Otherwise it can hit the same memcg multiple times without giving lru_gen_age_node() a chance to check the timestamp of that memcg's oldest generation (against min_ttl_ms). [1] https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Link: https://lkml.kernel.org/r/[email protected] Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Yu Zhao <[email protected]> Tested-by: T.J. Mercier <[email protected]> Cc: Charan Teja Kalla <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Jaroslav Pulchart <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/mglru: try to stop at high watermarks	Yu Zhao	1	-8/+28
	The initial MGLRU patchset didn't include the memcg LRU support, and it relied on should_abort_scan(), added by commit f76c83378851 ("mm: multi-gen LRU: optimize multiple memcgs"), to "backoff to avoid overshooting their aggregate reclaim target by too much". Later on when the memcg LRU was added, should_abort_scan() was deemed unnecessary, and the test results [1] showed no side effects after it was removed by commit a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard"). However, that test used memory.reclaim, which sets nr_to_reclaim to SWAP_CLUSTER_MAX. So it can overshoot only by SWAP_CLUSTER_MAX-1 pages, i.e., from nr_reclaimed=nr_to_reclaim-1 to nr_reclaimed=nr_to_reclaim+SWAP_CLUSTER_MAX-1. Compared with the batch size kswapd sets to nr_to_reclaim, SWAP_CLUSTER_MAX is tiny. Therefore that test isn't able to reproduce the worst case scenario, i.e., kswapd overshooting GBs on large systems and "consuming 100% CPU" (see the Closes tag). Bring back a simplified version of should_abort_scan() on top of the memcg LRU, so that kswapd stops when all eligible zones are above their respective high watermarks plus a small delta to lower the chance of KSWAPD_HIGH_WMARK_HIT_QUICKLY. Note that this only applies to order-0 reclaim, meaning compaction-induced reclaim can still run wild (which is a different problem). On Android, launching 55 apps sequentially: Before After Change pgpgin 838377172 802955040 -4% pgpgout 38037080 34336300 -10% [1] https://lore.kernel.org/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: a579086c99ed ("mm: multi-gen LRU: remove eviction fairness safeguard") Signed-off-by: Yu Zhao <[email protected]> Reported-by: Charan Teja Kalla <[email protected]> Reported-by: Jaroslav Pulchart <[email protected]> Closes: https://lore.kernel.org/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/ Tested-by: Jaroslav Pulchart <[email protected]> Tested-by: Kalesh Singh <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Kairui Song <[email protected]> Cc: T.J. Mercier <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/mglru: fix underprotected page cache	Yu Zhao	2	-4/+4
	Unmapped folios accessed through file descriptors can be underprotected. Those folios are added to the oldest generation based on: 1. The fact that they are less costly to reclaim (no need to walk the rmap and flush the TLB) and have less impact on performance (don't cause major PFs and can be non-blocking if needed again). 2. The observation that they are likely to be single-use. E.g., for client use cases like Android, its apps parse configuration files and store the data in heap (anon); for server use cases like MySQL, it reads from InnoDB files and holds the cached data for tables in buffer pools (anon). However, the oldest generation can be very short lived, and if so, it doesn't provide the PID controller with enough time to respond to a surge of refaults. (Note that the PID controller uses weighted refaults and those from evicted generations only take a half of the whole weight.) In other words, for a short lived generation, the moving average smooths out the spike quickly. To fix the problem: 1. For folios that are already on LRU, if they can be beyond the tracking range of tiers, i.e., five accesses through file descriptors, move them to the second oldest generation to give them more time to age. (Note that tiers are used by the PID controller to statistically determine whether folios accessed multiple times through file descriptors are worth protecting.) 2. When adding unmapped folios to LRU, adjust the placement of them so that they are not too close to the tail. The effect of this is similar to the above. On Android, launching 55 apps sequentially: Before After Change workingset_refault_anon 25641024 25598972 0% workingset_refault_file 115016834 106178438 -8% Link: https://lkml.kernel.org/r/[email protected] Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <[email protected]> Reported-by: Charan Teja Kalla <[email protected]> Tested-by: Kalesh Singh <[email protected]> Cc: T.J. Mercier <[email protected]> Cc: Kairui Song <[email protected]> Cc: Hillf Danton <[email protected]> Cc: Jaroslav Pulchart <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/shmem: fix race in shmem_undo_range w/THP	David Stevens	1	-1/+18
	Split folios during the second loop of shmem_undo_range. It's not sufficient to only split folios when dealing with partial pages, since it's possible for a THP to be faulted in after that point. Calling truncate_inode_folio in that situation can result in throwing away data outside of the range being targeted. [[email protected]: tidy up comment layout] Link: https://lkml.kernel.org/r/[email protected] Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: David Stevens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Suleiman Souhlal <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/core: make damon_start() waits until kdamond_fn() starts	SeongJae Park	1	-0/+6
	The cleanup tasks of kdamond threads including reset of corresponding DAMON context's ->kdamond field and decrease of global nr_running_ctxs counter is supposed to be executed by kdamond_fn(). However, commit 0f91d13366a4 ("mm/damon: simplify stop mechanism") made neither damon_start() nor damon_stop() ensure the corresponding kdamond has started the execution of kdamond_fn(). As a result, the cleanup can be skipped if damon_stop() is called fast enough after the previous damon_start(). Especially the skipped reset of ->kdamond could cause a use-after-free. Fix it by waiting for start of kdamond_fn() execution from damon_start(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 0f91d13366a4 ("mm/damon: simplify stop mechanism") Signed-off-by: SeongJae Park <[email protected]> Reported-by: Jakub Acs <[email protected]> Cc: Changbin Du <[email protected]> Cc: Jakub Acs <[email protected]> Cc: <[email protected]> # 5.15.x Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: compaction: avoid fast_isolate_freepages blindly choose improper pageblock	Barry Song	1	-0/+3
	Testing shows fast_isolate_freepages can blindly choose an unsuitable pageblock from time to time particularly while the min mark is used from XXX path: if (!page) { cc->fast_search_fail++; if (scan_start) { /* * Use the highest PFN found above min. If one was * not found, be pessimistic for direct compaction * and use the min mark. / if (highest >= min_pfn) { page = pfn_to_page(highest); cc->free_pfn = highest; } else { if (cc->direct_compaction && pfn_valid(min_pfn)) { / XXX */ page = pageblock_pfn_to_page(min_pfn, min(pageblock_end_pfn(min_pfn), zone_end_pfn(cc->zone)), cc->zone); cc->free_pfn = min_pfn; } } } } The reason is that no code is doing any check on the min_pfn min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1)); In contrast, slow path of isolate_freepages() is always skipping unsuitable pageblocks in a decent way. This issue doesn't happen quite often. When running 25 machines with 16GiB memory for one night, most of them can hit this unexpected code path. However the frequency isn't like many times per second. It might be one time in a couple of hours. Thus, it is very hard to measure the visible performance impact in my machines though the affection of choosing the unsuitable migration_target should be negative in theory. I feel it's still worth fixing this to at least make the code theoretically self-explanatory as it is quite odd an unsuitable migration_target can be still migration_target. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Reported-by: Zhanyuan Hu <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kemeng Shi <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: use vma_pages() for vma objects	Chen Haonan	1	-1/+1
	vma_pages() is more readable and also better at avoiding error codes, so use vma_pages() instead of direct operations on vma Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chen Haonan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: cma: remove unnecessary initialization of ret	Li zeming	1	-1/+1
	The ret variable can be defined without assigning a value, as it is assigned before use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li zeming <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: hugetlb_vmemmap: move mmap lock to vmemmap_remap_range()	Muchun Song	1	-13/+4
	All the users of vmemmap_remap_range() will hold the mmap lock and release it once it returns, it is naturally to move the lock to vmemmap_remap_range() to simplify the code and the users. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: hugetlb_vmemmap: add check of CONFIG_MEMORY_HOTPLUG back	Muchun Song	1	-1/+1
	The compiler will optimize the code as much as possible if we add the check of CONFIG_MEMORY_HOTPLUG back. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: filemap: remove unnecessary iitialization of ret	Li zeming	1	-1/+1
	The ret variable can be defined without assigning a value, as it is assigned before use. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li zeming <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/thp: add CONFIG_TRANSPARENT_HUGEPAGE_NEVER option	Dmytro Maluka	1	-0/+6
	Currently enabling THP support (CONFIG_TRANSPARENT_HUGEPAGE) requires enabling either CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS or CONFIG_TRANSPARENT_HUGEPAGE_MADVISE, which both cause khugepaged starting by default at kernel bootup. Add the third choice CONFIG_TRANSPARENT_HUGEPAGE_NEVER, in line with the existing kernel command line setting transparent_hugepage=never, to disable THP by default (in particular, to prevent starting khugepaged by default) but still allow enabling it at runtime via sysfs. Rationale: khugepaged has its own non-negligible memory cost even if it is not used by any applications, since it bumps up vm.min_free_kbytes to its own required minimum in set_recommended_min_free_kbytes(). For example, on a machine with 4GB RAM, with 3 mm zones and pageblock_order == MAX_ORDER, starting khugepaged causes vm.min_free_kbytes increase from 8MB to 132MB. So if we use THP on machines with e.g. >=8GB of memory for better performance, but avoid using it on lower-memory machines to avoid its memory overhead, then for the same reason we also want to avoid even starting khugepaged on those <8GB machines. So with CONFIG_TRANSPARENT_HUGEPAGE_NEVER we can use the same kernel image on both >=8GB and <8GB machines, with THP support enabled but khugepaged not started by default. The userspace can then decide to enable THP via sysfs if needed, based on the total amount of memory. This could also be achieved with the existing transparent_hugepage=never setting in the kernel command line instead. But it seems cleaner to avoid tweaking the command line for such a basic setting. P.S. I see that CONFIG_TRANSPARENT_HUGEPAGE_NEVER was already proposed in the past [1] but without an explanation of the purpose. [1] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Dmytro Maluka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: huge_memory: use more folio api in __split_huge_page_tail()	Kefeng Wang	1	-6/+6
	Use more folio APIs to save six compound_head() calls in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers	Catalin Marinas	1	-81/+97
	On systems with large number of CPUs, the following soft lockup splat might sometimes happen: [ 2656.001617] watchdog: BUG: soft lockup - CPU#364 stuck for 21s! [ksoftirqd/364:2206] : [ 2656.141194] RIP: 0010:_raw_spin_unlock_irqrestore+0x3d/0x70 : 2656.241214] Call Trace: [ 2656.243971] <IRQ> [ 2656.246237] ? show_trace_log_lvl+0x1c4/0x2df [ 2656.251152] ? show_trace_log_lvl+0x1c4/0x2df [ 2656.256066] ? kmemleak_free_percpu+0x11f/0x1f0 [ 2656.261173] ? watchdog_timer_fn+0x379/0x470 [ 2656.265984] ? __pfx_watchdog_timer_fn+0x10/0x10 [ 2656.271179] ? __hrtimer_run_queues+0x5f3/0xd00 [ 2656.276283] ? __pfx___hrtimer_run_queues+0x10/0x10 [ 2656.281783] ? ktime_get_update_offsets_now+0x95/0x2c0 [ 2656.287573] ? ktime_get_update_offsets_now+0xdd/0x2c0 [ 2656.293380] ? hrtimer_interrupt+0x2e9/0x780 [ 2656.298221] ? __sysvec_apic_timer_interrupt+0x184/0x640 [ 2656.304211] ? sysvec_apic_timer_interrupt+0x8e/0xc0 [ 2656.309807] </IRQ> [ 2656.312169] <TASK> [ 2656.326110] kmemleak_free_percpu+0x11f/0x1f0 [ 2656.331015] free_percpu.part.0+0x1b/0xe70 [ 2656.335635] free_vfsmnt+0xb9/0x100 [ 2656.339567] rcu_do_batch+0x3c8/0xe30 [ 2656.363693] rcu_core+0x3de/0x5a0 [ 2656.367433] __do_softirq+0x2d0/0x9a8 [ 2656.381119] run_ksoftirqd+0x36/0x60 [ 2656.385145] smpboot_thread_fn+0x556/0x910 [ 2656.394971] kthread+0x2a4/0x350 [ 2656.402826] ret_from_fork+0x29/0x50 [ 2656.406861] </TASK> The issue is caused by kmemleak registering each per_cpu_ptr() corresponding to the __percpu pointer. This is unnecessary since such individual per-CPU pointers are not tracked anyway. Create a new object_percpu_tree_root rbtree that stores a single __percpu pointer together with an OBJECT_PERCPU flag for the kmemleak metadata. Scanning needs to be done for all per_cpu_ptr() pointers with a cond_resched() between each CPU iteration to avoid RCU stalls. [[email protected]: update comment] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected]: https://lkml.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]> Reported-by: Waiman Long <[email protected]> Closes: https://lore.kernel.org/r/[email protected] Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/readahead: do not allow order-1 folio	Ryan Roberts	1	-8/+6
	The THP machinery does not support order-1 folios because it requires meta data spanning the first 3 `struct page`s. So order-2 is the smallest large folio that we can safely create. There was a theoretical bug whereby if ra->size was 2 or 3 pages (due to the device-specific bdi->ra_pages being set that way), we could end up with order = 1. Fix this by unconditionally checking if the preferred order is 1 and if so, set it to 0. Previously this was done in a few specific places, but with this refactoring it is done just once, unconditionally, at the end of the calculation. This is a theoretical bug found during review of the code; I have no evidence to suggest this manifests in the real world (I expect all device-specific ra_pages values are much bigger than 3). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: memory: use folio_prealloc() in wp_page_copy()	Kefeng Wang	1	-15/+7
	Use folio_prealloc() helper to simplify code a bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: memory: use a folio in do_cow_fault()	Kefeng Wang	1	-10/+6
	Use folio_prealloc() helper and convert to use a folio in do_cow_fault(), which save five compound_head() calls. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Sidhartha Kumar <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: memory: rename page_copy_prealloc() to folio_prealloc()	Kefeng Wang	1	-4/+9
	Let's rename page_copy_prealloc() to folio_prealloc(), which could be reused in more functons, as it maybe zero the new page, pass a new need_zero to it, and call the vma_alloc_zeroed_movable_folio() if need_zero is true. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: memory: use a folio in validate_page_before_insert()	Kefeng Wang	1	-2/+5
	Use a folio in validate_page_before_insert() to save two compound_head() calls. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: ksm: use more folio api in ksm_might_need_to_copy()	Kefeng Wang	1	-18/+21
	Patch series "mm: cleanup and use more folio in page fault", v3. Rename page_copy_prealloc() to folio_prealloc(), which is used by more functions, also do more folio conversion in page fault. This patch (of 5): Since ksm only support normal page, no swapout/in for ksm large folio too, add large folio check in ksm_might_need_to_copy(), also convert page->index to folio->index as page->index is going away. Then convert ksm_might_need_to_copy() to use more folio api to save nine compound_head() calls, short 'address' to reduce max-line-length. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/core-test: add a unit test for the feedback loop algorithm	SeongJae Park	1	-0/+32
	Implement a simple kunit test for testing the behavior of the feedback loop algorithm for the aim-oriented feedback-friven DAMOS aggressiveness auto tuning. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/sysfs-schemes: implement a command for scheme quota goals only commit	SeongJae Park	3	-0/+46
	To update DAMOS quota goals, users need to enter 'commit' command to the 'state' file of the kdamond, which applies not only the goals but entire inputs. It is inefficient. Implement yet another 'state' file input command for reading and committing only the scheme quota goals, namely 'commit_schemes_quota_goals'. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/sysfs-schemes: commit damos quota goals user input to DAMOS	SeongJae Park	1	-0/+32
	Make DAMON sysfs interface to read the user inputs for DAMOS quota goals and pass those to DAMOS, so that the users can use the quota auto-tuning feature. It uses the DAMON sysfs interface's user input commit mechanism, which applies all user inputs for initial starting of DAMON and online input updates, which can be done by writing 'on' and 'commit' to the kdamond's 'state' file, respectively. In other words, the user should periodically write appropriate value to 'current_value' files and 'commit' command to the 'state' file. 'target_value' files could also be similarly updated at any time. Note that the interface is supporting multiple goals while the core logic supports only one goal. DAMON sysfs interface passes only best feedback among the given inputs, to avoid making DAMOS too aggressive. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/sysfs-schemes: implement files for scheme quota goals setup	SeongJae Park	1	-3/+221
	Implement DAMON sysfs directories and files for the goals of DAMOS quota. Those allow users set multiple goals for their aim, with target values. Users can further enter the current score value for each goal as feedback for DAMOS. Note that this commit is implementing only the basic file operations, and not connecting the files with the DAMOS core logic. Hence writing something to the files makes no real effect. The following commit will connect the file operations and the core logic. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning	SeongJae Park	1	-9/+59
	Patch series "mm/damon: let users feed and tame/auto-tune DAMOS". Introduce Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning. It makes DAMOS self-tuned with periodic simple user feedback. Background: DAMOS Control Difficulty ==================================== DAMOS helps users easily implement access pattern aware system operations. However, controlling DAMOS in the wild is not that easy. The basic way for DAMOS control is specifying the target access pattern. In this approach, the user is assumed to well understand the access pattern and the characteristics of the system and the workloads. Though there are useful tools for that, it takes time and effort depending on the complexity and the dynamicity of the system and the workloads. After all, the access pattern consists of three ranges, namely the size, the access rate, and the age of the regions. It means users need to tune six parameters, which is anyway not a simple task. One of the worst cases would be DAMOS being too aggressive like a berserker, and therefore consuming too much system resource and making unwanted radical system operations. To let users avoid such cases, DAMOS allows users to set the upper-limit of the schemes' aggressiveness, namely DAMOS quota. DAMOS further provides its best-effort under the limit by prioritizing regions based on the access pattern of the regions. For example, users can ask DAMOS to page out up to 100 MiB of memory regions per second. Then DAMOS pages out regions that are not accessed for a longer time (colder) first under the limit. This allows users to set the target access pattern a bit naive with wider ranges, and focus on tuning only one parameter, the quota. In other words, the number of parameters to tune can be reduced from six to one. Still, however, the optimum value for the quota depends on the system and the workloads' characteristics, so not that simple. The number of parameters to tune can also increase again if the user needs to run multiple schemes. Aim-oriented Feedback-driven DAMOS Aggressiveness Auto Tuning ============================================================= Users would use DAMOS since they want to achieve something with it. They will likely have measurable metrics representing the achievement and the target number of the metric like SLO, and continuously measure that anyway. While the additional cost of getting the information is nearly zero, it could be useful for DAMOS to understand how appropriate its current aggressiveness is set, and adjust it on its own to make the metric value more close to the target. Based on this idea, we introduce a new way of tuning DAMOS with nearly zero additional effort, namely Aim-oriented Feedback-driven DAMOS Aggressiveness Auto Tuning. It asks users to provide feedback representing how well DAMOS is doing relative to the users' aim. Then DAMOS adjusts its aggressiveness, specifically the quota that provides the best effort result under the limit, based on the current level of the aggressiveness and the users' feedback. Implementation ============== The implementation asks users to represent the feedback with score numbers. The scores could be anything including user-space specific metrics including latency and throughput of special user-space workloads, and system metrics including free memory ratio, memory pressure stall time (PSI), and active to inactive LRU lists size ratio. The feedback scores and the aggressiveness of the given DAMOS scheme are assumed to be positively proportional, though. Selecting metrics of the assumption is the users' responsibility. The core logic uses the below simple feedback loop algorithm to calculate the next aggressiveness level of the scheme from the current aggressiveness level and the current feedback (target_score and current_score). It calculates the compensation for next aggressiveness as a proportion of current aggressiveness and distance to the target score. As a result, it arrives at the near-goal state in a short time using big steps when it's far from the goal, but avoids making unnecessarily radical changes that could turn out to be a bad decision using small steps when its near to the goal. f(n) = max(1, f(n - 1) * ((target_score - current_score) / target_score + 1)) Note that the compensation value becomes negative when it's over achieving the goal. That's why the feedback metric and the aggressiveness of the scheme should be positively proportional. The distance-adaptive speed manipulation is simply applied. Example Use Cases ================= If users want to reduce the memory footprint of the system as much as possible as long as the time spent for handling the resulting memory pressure is within a threshold, they could use DAMOS scheme that reclaims cold memory regions aiming for a little level of memory pressure stall time. If users want the active/inactive LRU lists well balanced to reduce the performance impact due to possible future memory pressure, they could use two schemes. The first one would be set to locate hot pages in the active LRU list, aiming for a specific active-to-inactive LRU list size ratio, say, 70%. The second one would be to locate cold pages in the inactive LRU list, aiming for a specific inactive-to-active LRU list size ratio, say, 30%. Then, DAMOS will balance the two schemes based on the goal and feedback. This aim-oriented auto tuning could also be useful for general balancing-required access aware system operations such as system memory auto scaling[3] and tiered memory management[4]. These two example usages are not what current DAMOS implementation is already supporting, but require additional DAMOS action developments, though. Evaluation: subtle memory pressure aiming proactive reclamation =============================================================== To show if the implementation works as expected, we prepare four different system configurations on AWS i3.metal instances. The first setup (original) runs the workload without any DAMOS scheme. The second setup (not-tuned) runs the workload with a virtual address space-based proactive reclamation scheme that pages out memory regions that are not accessed for five seconds or more. The third setup (offline-tuned) runs the same proactive reclamation DAMOS scheme, but after making it tuned for each workload offline, using our previous user-space driven automatic tuning approach, namely DAMOOS[1]. The fourth and final setup (AFDAA) runs the scheme that is the same as that of 'not-tuned' setup, but aims to keep 0.5% of 'some' memory pressure stall time (PSI) for the last 10 seconds using the aiming-oriented auto tuning. For each setup, we run realistic workloads from PARSEC3 and SPLASH-2X benchmark suites. For each run, we measure RSS and runtime of the workload, and 'some' memory pressure stall time (PSI) of the system. We repeat the runs five times and use averaged measurements. For simple comparison of the results, we normalize the measurements to those of 'original'. In the case of the PSI, though, the measurement for 'original' was zero, so we normalize the value to that of 'not-tuned' scheme's result. The normalized results are shown below. Not-tuned Offline-tuned AFDAA RSS 0.622688178226118 0.787950678944904 0.740093483278979 runtime 1.11767826657912 1.0564674983585 1.0910833880499 PSI 1 0.727521443794069 0.308498846350299 The 'not-tuned' scheme achieves about 38.7% memory saving but incur about 11.7% runtime slowdown. The 'offline-tuned' scheme achieves about 22.2% memory saving with about 5.5% runtime slowdown. It also achieves about 28.2% memory pressure stall time saving. AFDAA achieves about 26% memory saving with about 9.1% runtime slowdown. It also achieves about 69.1% memory pressure stall time saving. We repeat this test multiple times, and get consistent results. AFDAA is now integrated in our daily DAMON performance test setup. Apparently the aggressiveness of 'AFDAA' setup is somewhere between those of 'not-tuned' and 'offline-tuned' setup, since its memory saving and runtime overhead are between those of the other two setups. Actually we set the memory pressure stall time goal aiming for this middle aggressiveness. The difference in the two metrics are not significant, though. However, it shows significant saving of the memory pressure stall time, which was the goal of the auto-tuning, over the two variants. Hence, we conclude the automatic tuning is working as expected. Please note that the AFDAA setup is only for the evaluation, and therefore intentionally set a bit aggressive. It might not be appropriate for production environments. The test code is also available[2], so you could reproduce it on your system and workloads. Patches Sequence ================ The first four patches implement the core logic and user interfaces for the auto tuning. The first patch implements the core logic for the auto tuning, and the API for DAMOS users in the kernel space. The second patch implements basic file operations of DAMON sysfs directories and files that will be used for setting the goals and providing the feedback. The third patch connects the quota goals files inputs to the DAMOS core logic. Finally the fourth patch implements a dedicated DAMOS sysfs command for efficiently committing the quota goals feedback. Two patches for simple tests of the logic and interfaces follow. The fifth patch implements the core logic unit test. The sixth patch implements a selftest for the DAMON Sysfs interface for the goals. Finally, three patches for documentation follows. The seventh patch documents the design of the feature. The eighth patch updates the API doc for the new sysfs files. The final eighth patch updates the usage document for the features. References ========== [1] DAOS paper: https://www.amazon.science/publications/daos-data-access-aware-operating-system [2] Evaluation code: https://github.com/damonitor/damon-tests/commit/3f884e61193f0166b8724554b6d06b0c449a712d [3] Memory auto scaling RFC idea: https://lore.kernel.org/damon/[email protected]/ [4] DAMON-based tiered memory management RFC idea: https://lore.kernel.org/damon/[email protected]/ This patch (of 9) Users can effectively control the upper-limit aggressiveness of DAMOS schemes using the quota feature. The quota provides best result under the limit by prioritizing regions based on the access pattern. That said, finding the best value, which could depend on dynamic characteristics of the system and the workloads, is still challenging. Implement a simple feedback-driven tuning mechanism and use it for automatic tuning of DAMOS quota. The implementation allows users to provide the feedback by setting a feedback score returning callback function. Then DAMOS periodically calls the function back and adjusts the quota based on the return value of the callback and current quota value. Note that the absolute-value based time/size quotas still work as the maximum hard limits of the scheme's aggressiveness. The feedback-driven auto-tuned quota is applied only if it is not exceeding the manually set maximum limits. Same for the scheme-target access pattern and filters like other features. [[email protected]: document get_score_arg field of struct damos_quota] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: David Gow <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	zswap: shrink zswap pool based on memory pressure	Nhat Pham	4	-4/+205
	Currently, we only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch implements a memcg- and NUMA-aware shrinker for zswap, that is initiated when there is memory pressure. The shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. Furthermore, to make it more robust for many workloads and prevent overshrinking (i.e evicting warm pages that might be refaulted into memory), we build in the following heuristics: * Estimate the number of warm pages residing in zswap, and attempt to protect this region of the zswap LRU. * Scale the number of freeable objects by an estimate of the memory saving factor. The better zswap compresses the data, the fewer pages we will evict to swap (as we will otherwise incur IO for relatively small memory saving). * During reclaim, if the shrinker encounters a page that is also being brought into memory, the shrinker will cautiously terminate its shrinking action, as this is a sign that it is touching the warmer region of the zswap LRU. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. [[email protected]: check shrinker enablement early, use less costly stat flushing] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Cc: Chris Li <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Chengming Zhou <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: memcg: add per-memcg zswap writeback stat	Domenico Cerasuolo	3	-0/+6
	Since zswap now writes back pages from memcg-specific LRUs, we now need a new stat to show writebacks count for each memcg. [[email protected]: rename ZSWP_WB to ZSWPWB] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Nhat Pham <[email protected]> Signed-off-by: Domenico Cerasuolo <[email protected]> Signed-off-by: Nhat Pham <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Cc: Chris Li <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	zswap: make shrinking memcg-aware	Domenico Cerasuolo	4	-59/+238
	Currently, we only have a single global LRU for zswap. This makes it impossible to perform worload-specific shrinking - an memcg cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/[email protected]/T/#u This patch fully resolves the issue by replacing the global zswap LRU with memcg- and NUMA-specific LRUs, and modify the reclaim logic: a) When a store attempt hits an memcg limit, it now triggers a synchronous reclaim attempt that, if successful, allows the new hotter page to be accepted by zswap. b) If the store attempt instead hits the global zswap limit, it will trigger an asynchronous reclaim attempt, in which an memcg is selected for reclaim in a round-robin-like fashion. [[email protected]: use correct function for the onlineness check, use mem_cgroup_iter_break()] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: drop the pool's reference at the end of the writeback step] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Domenico Cerasuolo <[email protected]> Co-developed-by: Nhat Pham <[email protected]> Signed-off-by: Nhat Pham <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Cc: Chris Li <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	list_lru: allow explicit memcg and NUMA node selection	Nhat Pham	2	-10/+42
	Patch series "workload-specific and memory pressure-driven zswap writeback", v8. There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/[email protected]/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. This patch (of 6): The interface of list_lru is based on the assumption that the list node and the data it represents belong to the same allocated on the correct node/memcg. While this assumption is valid for existing slab objects LRU such as dentries and inodes, it is undocumented, and rather inflexible for certain potential list_lru users (such as the upcoming zswap shrinker and the THP shrinker). It has caused us a lot of issues during our development. This patch changes list_lru interface so that the caller must explicitly specify numa node and memcg when adding and removing objects. The old list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and list_lru_del_obj(), respectively. It also extends the list_lru API with a new function, list_lru_putback, which undoes a previous list_lru_isolate call. Unlike list_lru_add, it does not increment the LRU node count (as list_lru_isolate does not decrement the node count). list_lru_putback also allows for explicit memcg and NUMA node selection. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Johannes Weiner <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Cc: Chris Li <[email protected]> Cc: Dan Streetman <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Vitaly Wool <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	maple_tree: separate ma_state node from status	Liam R. Howlett	1	-4/+4
	The maple tree node is overloaded to keep status as well as the active node. This, unfortunately, results in a re-walk on underflow or overflow. Since the maple state has room, the status can be placed in its own enum in the structure. Once an underflow/overflow is detected, certain modes can restore the status to active and others may need to re-walk just that one node to see the entry. The status being an enum has the benefit of detecting unhandled status in switch statements. [[email protected]: fix comments about MAS_*] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: update forking to separate maple state and node] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix mas_prev() state separation code] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Cc: Peng Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	maple_tree: move debug check to __mas_set_range()	Liam R. Howlett	1	-2/+0
	__mas_set_range() was created to shortcut resetting the maple state and a debug check was added to the caller (the vma iterator) to ensure the internal maple state remains safe to use. Move the debug check from the vma iterator into the maple tree itself so other users do not incorrectly use the advanced maple state modification. Fallout from this change include a large amount of debug setup needed to be moved to earlier in the header, and the maple_tree.h radix-tree test code needed to move the inclusion of the header to after the atomic define. None of those changes have functional changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam R. Howlett <[email protected]> Cc: Peng Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-12	mm: Deprecate pasid field	Tina Zhang	1	-3/+0
	Drop the pasid field, as all the information needed for sva domain management has been moved to the newly added iommu_mm field. Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Vasant Hegde <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Tina Zhang <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-12-12	iommu: Change kconfig around IOMMU_SVA	Jason Gunthorpe	2	-1/+4
	Linus suggested that the kconfig here is confusing: https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/ Let's break it into three kconfigs controlling distinct things: - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional fields for the IOMMU. Currently only PASID, but later patches store a struct iommu_mm_data * - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit for keeping track of the ENQCMD instruction. x86 will select this if IOMMU_SVA is enabled - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code for iommu driver use and the IOMMU exported API This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2023-12-10	kasan: record and report more information	Juntong Deng	5	-0/+58
	Record and report more information to help us find the cause of the bug and to help us correlate the error with other system events. This patch adds recording and showing CPU number and timestamp at allocation and free (controlled by CONFIG_KASAN_EXTRA_INFO). The timestamps in the report use the same format and source as printk. Error occurrence timestamp is already implicit in the printk log, and CPU number is already shown by dump_stack_lvl, so there is no need to add it. In order to record CPU number and timestamp at allocation and free, corresponding members need to be added to the relevant data structures, which will lead to increased memory consumption. In Generic KASAN, members are added to struct kasan_track. Since in most cases, alloc meta is stored in the redzone and free meta is stored in the object or the redzone, memory consumption will not increase much. In SW_TAGS KASAN and HW_TAGS KASAN, members are added to struct kasan_stack_ring_entry. Memory consumption increases as the size of struct kasan_stack_ring_entry increases (this part of the memory is allocated by memblock), but since this is configurable, it is up to the user to choose. Link: https://lkml.kernel.org/r/VI1P193MB0752BD991325D10E4AB1913599BDA@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Signed-off-by: Juntong Deng <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: memcg: add reminder comment for the memcg v2 events	Dmitry Rokosov	1	-0/+4
	To maintain the correct state, it is important to ensure that events for the memory cgroup v2 are aligned with the sample cgroup codes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Dmitry Rokosov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: hugetlb_vmemmap: convert page to folio	Muchun Song	1	-26/+25
	There are still some places where it does not be converted to folio, this patch convert all of them to folio. And this patch also does some trival cleanup to fix the code style problems. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: hugetlb_vmemmap: move PageVmemmapSelfHosted() check to ↵	Muchun Song	1	-46/+24
	split_vmemmap_huge_pmd() To check a page whether it is self-hosted needs to traverse the page table (e.g. pmd_off_k()), however, we already have done this in the next calling of vmemmap_remap_range(). Moving PageVmemmapSelfHosted() check to vmemmap_pmd_entry() could simplify the code a bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: hugetlb_vmemmap: use walk_page_range_novma() to simplify the code	Muchun Song	1	-109/+39
	It is unnecessary to implement a series of dedicated page table walking helpers since there is already a general one walk_page_range_novma(). So use it to simplify the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Reviewed-by: Mike Kravetz <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: pagewalk: assert write mmap lock only for walking the user page tables	Muchun Song	1	-1/+28
	The 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker") introduces an assertion to walk_page_range_novma() to make all the users of page table walker is safe. However, the race only exists for walking the user page tables. And it is ridiculous to hold a particular user mmap write lock against the changes of the kernel page tables. So only assert at least mmap read lock when walking the kernel page tables. And some users matching this case could downgrade to a mmap read lock to relief the contention of mmap lock of init_mm, it will be nicer in hugetlb (only holding mmap read lock) in the next patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Mike Kravetz <[email protected]> Cc: Kefeng Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm/swapfile: replace kmap_atomic() with kmap_local_page()	Fabio M. De Francesco	1	-17/+17
	kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in swapfile.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/swapfile.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Cc: Ira Weiny <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm/zswap: replace kmap_atomic() with kmap_local_page()	Fabio M. De Francesco	1	-5/+5
	kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in zswap.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/zswap.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Fabio M. De Francesco <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Chris Li <[email protected]> (Google) Cc: Ira Weiny <[email protected]> Cc: Seth Jennings <[email protected]> Cc: Dan Streetman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm, oom:dump_tasks add rss detailed information printing	Yong Wang	1	-3/+4
	When the system is under oom, it prints out the RSS information of each process. However, we don't know the size of rss_anon, rss_file, and rss_shmem. To distinguish the memory occupied by anonymous or file mappings or shmem, could help us identify the root cause of the oom. So this patch adds RSS details, which refers to the /proc/<pid>/status[1]. It can help us know more about process memory usage. Example of oom including the new rss_* fields: [ 1630.902466] Tasks state (memory values in pages): [ 1630.902870] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name [ 1630.903619] [ 149] 0 149 486 288 0 288 0 36864 0 0 ash [ 1630.904210] [ 156] 0 156 153531 153345 153345 0 0 1269760 0 0 mm_test [1] commit 8cee852ec53f ("mm, procfs: breakdown RSS for anon, shmem and file in /proc/pid/status"). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yong Wang <[email protected]> Reviewed-by: Yang Yang <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Xuexin Jiang <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm/gup: fix follow_devmap_p[mu]d() on page==NULL handling	Peter Xu	1	-0/+2
	This is a bug found not by any report but only by code observations. When GUP sees a devpmd/devpud and if page==NULL is returned, it means a fault is probably required. Here falling through when page==NULL can cause unexpected behavior. Fix both cases by catching the page==NULL cases with no_page_table(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings") Fixes: 080dbb618b4b ("mm/follow_page_mask: split follow_page_mask to smaller functions.") Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Dan Williams <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: page_alloc: unreserve highatomic page blocks before oom	Charan Teja Kalla	1	-8/+8
	__alloc_pages_direct_reclaim() is called from slowpath allocation where high atomic reserves can be unreserved after there is a progress in reclaim and yet no suitable page is found. Later should_reclaim_retry() gets called from slow path allocation to decide if the reclaim needs to be retried before OOM kill path is taken. should_reclaim_retry() checks the available(reclaimable + free pages) memory against the min wmark levels of a zone and returns: a) true, if it is above the min wmark so that slow path allocation will do the reclaim retries. b) false, thus slowpath allocation takes oom kill path. should_reclaim_retry() can also unreserves the high atomic reserves but only after all the reclaim retries are exhausted. In a case where there are almost none reclaimable memory and free pages contains mostly the high atomic reserves but allocation context can't use these high atomic reserves, makes the available memory below min wmark levels hence false is returned from should_reclaim_retry() leading the allocation request to take OOM kill path. This can turn into a early oom kill if high atomic reserves are holding lot of free memory and unreserving of them is not attempted. (early)OOM is encountered on a VM with the below state: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB local_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 5084kB (UMEH) 2418kB (UMEH) 14316kB (UMEH) 3332kB (UH) 764kB (UH) 0128kB 0256kB 0512kB 01024kB 02048kB 0*4096kB = 7752kB Per above log, the free memory of ~7MB exist in the high atomic reserves is not freed up before falling back to oom kill path. Fix it by trying to unreserve the high atomic reserves in should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback to oom kill path. Link: https://lkml.kernel.org/r/[email protected] Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: Charan Teja Kalla <[email protected]> Reported-by: Chris Goldsworthy <[email protected]> Suggested-by: Michal Hocko <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Chris Goldsworthy <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: page_alloc: enforce minimum zone size to do high atomic reserves	Charan Teja Kalla	1	-1/+4
	Highatomic reserves are set to roughly 1% of zone for maximum and a pageblock size for minimum. Encountered a system with the below configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB On such systems, even a single pageblock makes highatomic reserves are set to ~8% of the zone memory. This high value can easily exert pressure on the zone. Per discussion with Michal and Mel, it is not much useful to reserve the memory for highatomic allocations on such small systems[1]. Since the minimum size for high atomic reserves is always going to be a pageblock size and if 1% of zone managed pages is going to be below pageblock size, don't reserve memory for high atomic allocations. Thanks Michal for this suggestion[2]. Since no memory is being reserved for high atomic allocations and if respective allocation failures are seen, this patch can be reverted. [1] https://lore.kernel.org/linux-mm/[email protected]/ [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/ Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10	mm: page_alloc: correct high atomic reserve calculations	Charan Teja Kalla	1	-2/+3
	Patch series "mm: page_alloc: fixes for high atomic reserve caluculations", v3. The state of the system where the issue exposed shown in oom kill logs: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 5084kB (UMEH) 2418kB (UMEH) 14316kB (UMEH) 3332kB (UH) 764kB (UH) 0128kB 0256kB 0512kB 01024kB 02048kB 04096kB = 7752kB From the above, it is seen that ~16MB of memory reserved for high atomic reserves against the expectation of 1% reserves which is fixed in the 1st patch. Don't reserve the high atomic page blocks if 1% of zone memory size is below a pageblock size. This patch (of 2): reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages of a zone, which is used for the high order atomic allocations. It uses the below calculation to reserve: static void reserve_highatomic_pageblock(struct page page, ....) { ....... max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages; if (zone->nr_reserved_highatomic >= max_managed) goto out; zone->nr_reserved_highatomic += pageblock_nr_pages; set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); out: .... } Since we are always appending the 1% of zone managed pages count to pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the nr_reserved_highatomic is incremented/decremented in pageblock sizes. Encountered a system(actually a VM running on the Linux kernel) with the below zone configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB The existing calculations making it to reserve the 8MB(with pageblock size of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount of memory can easily exert memory pressure in the system thus may lead into unnecessary reclaims till unreserving of high atomic reserves. Since high atomic reserves are managed in pageblock size granules, as MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for high atomic reserves as, minimum is pageblock size , maximum is approximately 1% of the zone managed pages. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <[email protected]> Acked-by: Mel Gorman <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>