path: root/mm
Age | Commit message | Author | Files | Lines
2024-02-22mm/memory: further separate anon and pagecache folio handling in zap_present_pte()David Hildenbrand1-5/+11
We don't need up-to-date accessed-dirty information for anon folios and can simply work with the ptent we already have. Also, we know the RSS counter we want to update. We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() + zap_install_uffd_wp_if_needed() after updating the folio and RSS. While at it, only call zap_install_uffd_wp_if_needed() if there is even any chance that pte_install_uffd_wp_if_needed() would do *something*. That is, just don't bother if uffd-wp does not apply. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: handle !page case in zap_present_pte() separatelyDavid Hildenbrand1-11/+11
We don't need uptodate accessed/dirty bits, so in theory we could replace ptep_get_and_clear_full() by an optimized ptep_clear_full() function. Let's rely on the provided pte. Further, there is no scenario where we would have to insert uffd-wp markers when zapping something that is not a normal page (i.e., zeropage). Add a sanity check to make sure this remains true. should_zap_folio() no longer has to handle NULL pointers. This change replaces 2/3 "!page/!folio" checks by a single "!page" one. Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to detect shadow stack entries. But for shadow stack entries, the HW dirty bit (in combination with non-writable PTEs) is set by software. So for the arch_check_zapped_pte() check, we don't have to sync against HW setting the HW dirty bit concurrently, it is always set. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: factor out zapping of present pte into zap_present_pte()David Hildenbrand1-41/+53
Patch series "mm/memory: optimize unmap/zap with PTE-mapped THP", v3. This series is based on [1]. Similar to what we did with fork(), let's implement PTE batching during unmap/zap when processing PTE-mapped THPs. We collect consecutive PTEs that map consecutive pages of the same large folio, making sure that the other PTE bits are compatible, and (a) adjust the refcount only once per batch, (b) call rmap handling functions only once per batch, (c) perform batch PTE setting/updates and (d) perform TLB entry removal once per batch. Ryan was previously working on this in the context of cont-pte for arm64, int latest iteration [2] with a focus on arm6 with cont-pte only. This series implements the optimization for all architectures, independent of such PTE bits, teaches MMU gather/TLB code to be fully aware of such large-folio-pages batches as well, and amkes use of our new rmap batching function when removing the rmap. To achieve that, we have to enlighten MMU gather / page freeing code (i.e., everything that consumes encoded_page) to process unmapping of consecutive pages that all belong to the same large folio. I'm being very careful to not degrade order-0 performance, and it looks like I managed to achieve that. While this series should -- similar to [1] -- be beneficial for adding cont-pte support on arm64[2], it's one of the requirements for maintaining a total mapcount[3] for large folios with minimal added overhead and further changes[4] that build up on top of the total mapcount. Independent of all that, this series results in a speedup during munmap() and similar unmapping (process teardown, MADV_DONTNEED on larger ranges) with PTE-mapped THP, which is the default with THPs that are smaller than a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]). On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by PTE-mapped folios of the same size (stddev < 1%) results in the following runtimes for munmap() in seconds (shorter is better): Folio Size | mm-unstable | New | Change --------------------------------------------- 4KiB | 0.058110 | 0.057715 | - 1% 16KiB | 0.044198 | 0.035469 | -20% 32KiB | 0.034216 | 0.023522 | -31% 64KiB | 0.029207 | 0.018434 | -37% 128KiB | 0.026579 | 0.014026 | -47% 256KiB | 0.025130 | 0.011756 | -53% 512KiB | 0.024292 | 0.010703 | -56% 1024KiB | 0.023812 | 0.010294 | -57% 2048KiB | 0.023785 | 0.009910 | -58% [1] https://lkml.kernel.org/r/[email protected] [2] https://lkml.kernel.org/r/[email protected] [3] https://lkml.kernel.org/r/[email protected] [4] https://lkml.kernel.org/r/[email protected] [5] https://lkml.kernel.org/r/[email protected] This patch (of 10): Let's prepare for further changes by factoring out processing of present PTEs. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: [email protected] Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: "Naveen N. 
Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: compaction: limit the suitable target page order to be less than cc->orderBaolin Wang1-1/+3
It cannot improve the fragmentation if we isolate target free pages whose order exceeds cc->order, especially when cc->order is less than pageblock_order. For example, suppose pageblock_order is MAX_ORDER (4M in size) and cc->order is the 2M THP size: we should not isolate other 2M free pages to be the migration target, as that cannot improve the fragmentation. Moreover, this is also applicable for large folio compaction. Link: https://lkml.kernel.org/r/afcd9377351c259df7a25a388a4a0d5862b986f4.1705928395.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
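A minimal sketch of the constraint this adds when deciding whether an isolated free page is a suitable migration target (illustrative only, not the literal hunk; 'order' is assumed to be the free page's order):

	/* a free page already at or above cc->order cannot reduce fragmentation */
	if (cc->order > 0 && order >= cc->order)
		return false;	/* not a suitable migration target */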
2024-02-22mm/hugetlb: move page order check inside hugetlb_cma_reserve()Anshuman Khandual1-0/+7
All platforms could benefit from page order check against MAX_PAGE_ORDER before allocating a CMA area for gigantic hugetlb pages. Let's move this check from individual platforms to generic hugetlb. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Reviewed-by: Jane Chu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mglru: improve swappiness handlingKinsey Ho1-10/+10
The reclaimable number of anon pages used to set the initial reclaim priority is only based on get_swappiness(). Use can_reclaim_anon_pages() to include NUMA node demotion. Also move the handling of swappiness for the !__GFP_IO case from try_to_shrink_lruvec() into isolate_folios(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kinsey Ho <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Donet Tom <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mglru: improve struct lru_gen_mm_walkKinsey Ho1-24/+26
Rename max_seq to seq in struct lru_gen_mm_walk to keep consistent with struct lru_gen_mm_state. Note that seq is not always up to date with max_seq from lru_gen_folio. No functional changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kinsey Ho <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Donet Tom <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mglru: improve reset_mm_stats()Kinsey Ho1-20/+22
struct lruvec* is already a field of struct lru_gen_mm_walk. Remove the struct lruvec* parameter from functions that already have access to struct lru_gen_mm_walk*. Also, we do not need to reset histogram stats when !should_walk_mmu(). Remove the call to reset_mm_stats() in iterate_mm_list_nowalk(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kinsey Ho <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Donet Tom <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mglru: improve should_run_aging()Kinsey Ho1-14/+11
scan_control *sc does not need to be passed into should_run_aging(), as it provides only the reclaim priority. This can be moved to get_nr_to_scan(). Refactor should_run_aging() and get_nr_to_scan() to improve code readability. No functional changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kinsey Ho <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Donet Tom <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/mglru: drop unused parameterKinsey Ho1-5/+5
Patch series "mm/mglru: code cleanup and refactoring" This provides MGLRU code cleanup and refactoring for better readability. This patch (of 5): struct scan_control *sc is currently passed into try_to_inc_max_seq() and run_aging(). This parameter is not used. Drop the unused parameter struct scan_control *sc. No functional change. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kinsey Ho <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Donet Tom <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22kasan/test: avoid gcc warning for intentional overflowArnd Bergmann1-1/+2
The out-of-bounds test allocates an object that is three bytes too short in order to validate the bounds checking. Starting with gcc-14, this causes a compile-time warning as gcc has grown smart enough to understand the sizeof() logic: mm/kasan/kasan_test.c: In function 'kmalloc_oob_16': mm/kasan/kasan_test.c:443:14: error: allocation of insufficient size '13' for type 'struct <anonymous>' with size '16' [-Werror=alloc-size] 443 | ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL); | ^ Hide the actual computation behind a RELOC_HIDE() that ensures the compiler misses the intentional bug. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3f15801cdc23 ("lib: add kasan test module") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Marco Elver <[email protected]> Cc: Vincenzo Frascino <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
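A conceptual sketch of the technique: the point is to keep the compiler from seeing the deliberately undersized constant. The actual patch routes the computation through RELOC_HIDE(); the version below uses OPTIMIZER_HIDE_VAR() only to illustrate the same idea and is not the literal fix:

	/* kmalloc_oob_16(), sketched: allocate 3 bytes short without letting
	 * gcc prove it at compile time. */
	size_t size = sizeof(*ptr1) - 3;	/* intentionally too small */

	OPTIMIZER_HIDE_VAR(size);		/* hide the constant from -Walloc-size */
	ptr1 = kmalloc(size, GFP_KERNEL);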
2024-02-22mm/zswap: optimize and cleanup the invalidation of duplicate entryChengming Zhou1-18/+16
We may encounter a duplicate entry in zswap_store():
1. A swap slot freed to the per-cpu swap cache doesn't invalidate the zswap entry, then gets reused. This has been fixed.
2. In !exclusive load mode, a swapped-in folio will leave its zswap entry on the tree, then be swapped out again. This has been removed.
3. One folio can be dirtied again after zswap_store(), so we need to zswap_store() it again. This should be handled correctly.
So we must invalidate the old duplicate entry before inserting the new one, which actually doesn't have to be done at the beginning of zswap_store(). The good point is that we don't need to lock the tree twice in the normal store success path. And clean up the loop while we are here. Note we still need to invalidate the old duplicate entry when the store fails or zswap is disabled, otherwise the new data in the swapfile could be overwritten by the old data in the zswap pool during lru writeback. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Acked-by: Chris Li <[email protected]> Acked-by: Nhat Pham <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: compaction: refactor compact_node()Kefeng Wang1-44/+21
Refactor compact_node() to handle both proactive and synchronous memory compaction, which cleans up the code a bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/cma: add sysfs file 'release_pages_success'Anshuman Khandual3-0/+21
This adds a new sysfs file tracking the number of successfully released pages from a given CMA heap area. The file will be available via CONFIG_CMA_SYSFS and helps in determining the active CMA pages available on the CMA heap area. This adds a new 'nr_pages_released' (CONFIG_CMA_SYSFS) counter into 'struct cma' which gets updated during cma_release().
/sys/kernel/mm/cma/<cma-heap-area>/release_pages_success
After this change, a user will be able to find the active CMA pages available in a given CMA heap area via the following method:
Active pages = alloc_pages_success - release_pages_success
That's valuable information for both software designers and system admins, as it allows them to tune the number of CMA pages available in the system. This increases user visibility for the allocated CMA area and its utilization. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/demotion: print demotion targetsLi Zhijian1-1/+23
Currently, when a demotion occurs, it will prioritize selecting a node from the preferred targets as the destination node for the demotion. If the preferred node does not meet the requirements, it will try all the lower memory tier nodes until it finds a suitable demotion destination node or ultimately fails. However, the demotion target information isn't exposed to users, especially the preferred target information, which relies on more factors. This makes it hard for users to understand the exact demotion behavior. Rather than adding a new sysfs interface to expose this information, print it directly in kernel messages, just like the current page allocation fallback order does. A dmesg example with this patch is as follows:
[ 0.704860] Demotion targets for Node 0: null
[ 0.705456] Demotion targets for Node 1: null
// node 2 is onlined
[ 32.259775] Demotion targets for Node 0: perferred: 2, fallback: 2
[ 32.261290] Demotion targets for Node 1: perferred: 2, fallback: 2
[ 32.262726] Demotion targets for Node 2: null
// node 3 is onlined
[ 42.448809] Demotion targets for Node 0: perferred: 2, fallback: 2-3
[ 42.450704] Demotion targets for Node 1: perferred: 2, fallback: 2-3
[ 42.452556] Demotion targets for Node 2: perferred: 3, fallback: 3
[ 42.454136] Demotion targets for Node 3: null
// node 4 is onlined
[ 52.676833] Demotion targets for Node 0: perferred: 2, fallback: 2-4
[ 52.678735] Demotion targets for Node 1: perferred: 2, fallback: 2-4
[ 52.680493] Demotion targets for Node 2: perferred: 4, fallback: 3-4
[ 52.682154] Demotion targets for Node 3: null
[ 52.683405] Demotion targets for Node 4: null
// node 5 is onlined
[ 62.931902] Demotion targets for Node 0: perferred: 2, fallback: 2-5
[ 62.938266] Demotion targets for Node 1: perferred: 5, fallback: 2-5
[ 62.943515] Demotion targets for Node 2: perferred: 4, fallback: 3-4
[ 62.947471] Demotion targets for Node 3: null
[ 62.949908] Demotion targets for Node 4: null
[ 62.952137] Demotion targets for Node 5: perferred: 3, fallback: 3-4
Regarding this requirement, we have previously discussed [1]. The initial proposal involved introducing a new sysfs interface. However, due to concerns about potential changes and compatibility issues with the interface in the future, a consensus was not reached with the community. Therefore, this time, we are directly printing out the information. [1] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li Zhijian <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/damon/sysfs: handle 'state' file inputs for every sampling interval if possibleSeongJae Park3-16/+29
The DAMON sysfs interface needs to access kdamond-touching data for some of the kdamond user commands. It uses the ->after_aggregation() kdamond callback to safely access the data in that case. It had to use the aggregation interval callback because that was the only callback through which users can access complete monitoring results. Since the patch series "mm/damon: provide pseudo-moving sum based access rate", which starts from commit 78fbfb155d20 ("mm/damon/core: define and use a dedicated function for region access rate update"), DAMON provides good-to-use quality monitoring results for every sampling interval. It aims to help users who need to quickly retrieve the monitoring results, for example when the aggregation interval is set so long that waiting for it degrades the user experience, or when the access pattern is expected to change significantly[1]. However, because the DAMON sysfs interface is still handling the commands per aggregation interval, the end user cannot get the benefit. Update the DAMON sysfs interface to handle kdamond commands for every sampling interval if applicable. Specifically, all kdamond data accessing commands except the 'commit' command are applicable. [1] https://lore.kernel.org/r/20240129121316.GA9706@cuiyangpei Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: xiongping1 <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: hugetlb: improve the handling of hugetlb allocation failure for freed or in-use hugetlbBaolin Wang1-16/+16
alloc_and_dissolve_hugetlb_folio() preallocates a new hugetlb page before it takes hugetlb_lock. In 3 out of 4 cases the page is not really used and therefore the newly allocated page is just freed right away. This is wasteful and it might cause premature failures in those cases. Address that by moving the allocation down to the only case that needs it (the hugetlb page is really in the free pages pool). We need to drop hugetlb_lock to do so and therefore need to recheck the page state after regaining it. The patch is more of a cleanup than an actual fix to an existing problem. There are no known reports about premature failures. Link: https://lkml.kernel.org/r/62890fd60b1ecd5bf1cdc476c973f60fe37aa0cb.1707181934.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Acked-by: Michal Hocko <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/migrate: preserve exact soft-dirty statePaul Gofman1-2/+5
pte_mkdirty() sets both the _PAGE_DIRTY and _PAGE_SOFT_DIRTY bits. _PAGE_SOFT_DIRTY can therefore get set even if it wasn't set on the original page before migration. This makes non-soft-dirty pages soft-dirty just because of migration/compaction. Clear the _PAGE_SOFT_DIRTY flag if it wasn't set on the original page. By definition of the soft-dirty feature, there can be spurious soft-dirty pages because of the kernel's internal activity such as VMA merging or migration/compaction. This patch eliminates the spurious soft-dirty pages caused by migration/compaction. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Paul Gofman <[email protected]> Signed-off-by: Muhammad Usama Anjum <[email protected]> Acked-by: Andrei Vagin <[email protected]> Cc: Michał Mirosław <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
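A rough sketch of the idea in the migration-entry restore path; the helpers below are common mm/pgtable helpers, but treat the exact placement as an assumption rather than the literal hunk:

	/* Restoring a PTE after migration: mark it dirty if needed, but only
	 * keep soft-dirty if the original (swap) PTE carried it. */
	pte = pte_mkdirty(pte);			/* also sets _PAGE_SOFT_DIRTY */
	if (!pte_swp_soft_dirty(old_pte))
		pte = pte_clear_soft_dirty(pte);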
2024-02-22mm/zswap: zswap entry doesn't need refcount anymoreChengming Zhou1-52/+11
Since we no longer need to leave the zswap entry on the zswap tree, we should remove it from the tree once we find it there. Then, after using it, we can free it directly; no concurrent path can find it in the tree. Only the shrinker can see it from the lru list, and it double checks under the tree lock, so there is no race problem. So we don't need a refcount in the zswap entry anymore and don't need to take the spinlock a second time to invalidate it. The side effect is that zswap_entry_free() may no longer happen under the tree spinlock, but that's ok since nothing needs to be protected by the lock. Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-6-99d4084260a0@bytedance.com Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/zswap: only support zswap_exclusive_loads_enabledChengming Zhou2-27/+3
The !zswap_exclusive_loads_enabled mode will leave a compressed copy in the zswap tree and lru list after the folio swapin. There are some disadvantages in this mode:
1. It's a waste of memory since there are two copies of the data, one in the folio and one compressed in zswap. And it's unlikely the compressed data will be useful in the near future.
2. If that folio is dirtied, the compressed data must be useless, but we don't know that and don't invalidate the trashy memory in zswap.
3. It's not reclaimable by the zswap shrinker, since zswap_writeback_entry() will always return -EEXIST and terminate the shrinking process.
On the other hand, the only downside of zswap_exclusive_loads_enabled is a little more cpu usage/latency when compressing, and the same if the folio is removed from the swapcache or dirtied. More explanation by Johannes on why we should consider exclusive load as the default for zswap: Caching "swapout work" is helpful when the system is thrashing. Then recently swapped in pages might get swapped out again very soon. It certainly makes sense with conventional swap, because keeping a clean copy on the disk saves IO work and doesn't cost any additional memory. But with zswap, it's different. It saves some compression work on a thrashing page. But the act of keeping compressed memory contributes to a higher rate of thrashing. And that can cause IO in other places like zswap writeback and file memory. And the A/B test results of the kernel build in tmpfs with limited memory can support this theory:
                            !exclusive       exclusive
real                             63.80           63.01
user                           1063.83         1061.32
sys                             290.31          266.15
workingset_refault_anon     2383084.40      1976397.40
workingset_refault_file       44134.00        45689.40
workingset_activate_anon     837878.00       728441.20
workingset_activate_file       4710.00         4085.20
workingset_restore_anon      732622.60       639428.40
workingset_restore_file        1007.00          926.80
workingset_nodereclaim            0.00            0.00
pgscan                     14343003.40     12409570.20
pgscan_kswapd                     0.00            0.00
pgscan_direct              14343003.40     12409570.20
pgscan_khugepaged                 0.00            0.00
Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-5-99d4084260a0@bytedance.com Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/zswap: remove duplicate_entry debug valueChengming Zhou1-8/+1
cat /sys/kernel/debug/zswap/duplicate_entry
2086447
When testing, the duplicate_entry value is very high, but there is no warning message in the kernel log. From the comment on duplicate_entry, "Duplicate store was encountered (rare)", it seems something has gone wrong. Actually it's incremented at the beginning of zswap_store() when it finds that its zswap entry is already on the tree. And this is a normal case, since the folio could leave a zswap entry on the tree after swapin; later it's dirtied and swapped out / zswap_store()d again, finding its original zswap entry. So duplicate_entry should only be incremented in the real bug case, which already has a WARN_ON(1); it looks redundant to count the bug case, so this patch just removes it. Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-4-99d4084260a0@bytedance.com Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/zswap: stop lru list shrinking when encounter warm regionChengming Zhou2-1/+6
When the shrinker encounters an existing folio in the swap cache, it means we are shrinking into the warmer region. We should terminate shrinking if we're in the dynamic shrinker context. This patch adds LRU_STOP to support this, to avoid overshrinking. Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-3-99d4084260a0@bytedance.com Signed-off-by: Chengming Zhou <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Nhat Pham <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/zswap: invalidate zswap entry when swap entry freeChengming Zhou3-3/+6
During testing I found that zswap_writeback_entry() sometimes returns -ENOMEM, which is not what we expected:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[-12]: 1563
@[0]: 277221
The reason is that __read_swap_cache_async() returns NULL because swapcache_prepare() failed, which in turn happens because we don't invalidate the zswap entry when the swap entry is freed to the per-cpu pool; these zswap entries are still on the zswap tree and lru list. This patch moves the invalidation ahead to when the swap entry is freed to the per-cpu pool, since there is no benefit to leaving trashy zswap entries on the tree and lru list. With this patch:
bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}'
@[0]: 259744
Note: a large folio can't have a zswap entry for now, so don't bother adding zswap entry invalidation in the large folio swap free path. Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-2-99d4084260a0@bytedance.com Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/zswap: add more comments in shrink_memcg_cb()Chengming Zhou1-17/+26
Patch series "mm/zswap: optimize zswap lru list", v2. This series is motivated when observe the zswap lru list shrinking, noted there are some unexpected cases in zswap_writeback_entry(). bpftrace -e 'kr:zswap_writeback_entry {@[(int32)retval]=count()}' There are some -ENOMEM because when the swap entry is freed to per-cpu swap pool, it doesn't invalidate/drop zswap entry. Then the shrinker encounter these trashy zswap entries, it can't be reclaimed and return -ENOMEM. So move the invalidation ahead to when swap entry freed to the per-cpu swap pool, since there is no any benefit to leave trashy zswap entries on the zswap tree and lru list. Another case is -EEXIST, which is seen more in the case of !zswap_exclusive_loads_enabled, in which case the swapin folio will leave compressed copy on the tree and lru list. And it can't be reclaimed until the folio is removed from swapcache. Changing to zswap_exclusive_loads_enabled mode will invalidate when folio swapin, which has its own drawback if that folio is still clean in swapcache and swapout again, we need to compress it again. Please see the commit for details on why we choose exclusive load as the default for zswap. Another optimization for -EEXIST is that we add LRU_STOP to support terminating the shrinking process to avoid evicting warmer region. Testing using kernel build in tmpfs, one 50GB swapfile and zswap shrinker_enabled, with memory.max set to 2GB. mm-unstable zswap-optimize real 63.90s 63.25s user 1064.05s 1063.40s sys 292.32s 270.94s The main optimization is in sys cpu, about 7% improvement. This patch (of 6): Add more comments in shrink_memcg_cb() to describe the deref dance which is implemented to fix race problem between lru writeback and swapoff, and the reason why we rotate the entry at the beginning. Also fix the stale comments in zswap_writeback_entry(), and add more comments to state that we only deref the tree after we get the swapcache reference. Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-0-99d4084260a0@bytedance.com Link: https://lkml.kernel.org/r/20240201-b4-zswap-invalidate-entry-v2-1-99d4084260a0@bytedance.com Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Suggested-by: Yosry Ahmed <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22memory tier: make memory_tier_subsys constRicardo B. Marliere1-1/+1
Now that the driver core can properly handle constant struct bus_type, move the memory_tier_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ricardo B. Marliere <[email protected]> Suggested-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/vmscan: make too_many_isolated return boolHao Ge1-3/+3
too_many_isolated() should return bool as does the similar too_many_isolated() in mm/compaction.c. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hao Ge <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/cma: make MAX_CMA_AREAS = CONFIG_CMA_AREASAnshuman Khandual1-3/+3
There is no real difference between the global CMA area and the other CMA areas configured via CONFIG_CMA_AREAS, which always has a default value even without user input. This makes MAX_CMA_AREAS the same as CONFIG_CMA_AREAS, while also incrementing the latter's default values, thus maintaining the current default for MAX_CMA_AREAS on both UMA and NUMA systems. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
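In effect the change boils down to something like the following sketch of the intent (the accompanying Kconfig default bump is not shown):

	/* MAX_CMA_AREAS no longer reserves an extra slot for the global area */
	#define MAX_CMA_AREAS	CONFIG_CMA_AREAS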
2024-02-22mm/cma: drop CONFIG_CMA_DEBUGAnshuman Khandual2-18/+0
All pr_debug() prints in mm/cma.c can be enabled via the standard Makefile-based method. Besides, cma_debug_show_areas() should always be called on the cma_alloc() failure path. This seemingly redundant config, CONFIG_CMA_DEBUG, can be dropped without any problem. [[email protected]: remove debug code to removed CONFIG_CMA_DEBUG] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Lukas Bulwahn <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Robin Murphy <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22kasan: rename test_kasan_module_init to kasan_test_module_initTiezhu Yang1-2/+2
After commit f7e01ab828fd ("kasan: move tests to mm/kasan/"), the test module file is renamed from lib/test_kasan_module.c to mm/kasan/kasan_test_module.c, in order to keep consistent, rename test_kasan_module_init to kasan_test_module_init. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tiezhu Yang <[email protected]> Acked-by: Marco Elver <[email protected]> Reviewed-by: Andrey Konovalov <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/hugetlb: restore the reservation if neededBreno Leitao1-0/+25
Patch series "mm/hugetlb: Restore the reservation", v2. This is a fix for a case where a backing huge page could stolen after madvise(MADV_DONTNEED). A full reproducer is in selftest. See https://lore.kernel.org/all/[email protected]/ In order to test this patch, I instrumented the kernel with LOCKDEP and KASAN, and run the following tests, without any regression: * The self test that reproduces the problem * All mm hugetlb selftests SUMMARY: PASS=9 SKIP=0 FAIL=0 * All libhugetlbfs tests PASS: 0 86 FAIL: 0 0 This patch (of 2): Currently there is a bug that a huge page could be stolen, and when the original owner tries to fault in it, it causes a page fault. You can achieve that by: 1) Creating a single page echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages 2) mmap() the page above with MAP_HUGETLB into (void *ptr1). * This will mark the page as reserved 3) touch the page, which causes a page fault and allocates the page * This will move the page out of the free list. * It will also unreserved the page, since there is no more free page 4) madvise(MADV_DONTNEED) the page * This will free the page, but not mark it as reserved. 5) Allocate a secondary page with mmap(MAP_HUGETLB) into (void *ptr2). * it should fail, but, since there is no more available page. * But, since the page above is not reserved, this mmap() succeed. 6) Faulting at ptr1 will cause a SIGBUS * it will try to allocate a huge page, but there is none available A full reproducer is in selftest. See https://lore.kernel.org/all/[email protected]/ Fix this by restoring the reserved page if necessary. These are the condition for the page restore: * The system is not using surplus pages. The goal is to reduce the surplus usage for this case. * If the VMA has the HPAGE_RESV_OWNER flag set, and is PRIVATE. This is safely checked using __vma_private_lock() * The page is anonymous Once this is scenario is found, set the `hugetlb_restore_reserve` bit in the folio. Then check if the resv reservations need to be adjusted later, done later, after the spinlock, since the vma_xxxx_reservation() might touch the file system lock. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Breno Leitao <[email protected]> Suggested-by: Rik van Riel <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22kasan: add atomic testsPaul Heidekrüger1-0/+79
Test that KASan can detect some unsafe atomic accesses. As discussed in the linked thread below, these tests attempt to cover the most common uses of atomics and, therefore, aren't exhaustive. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/T/#u Signed-off-by: Paul Heidekrüger <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055 Acked-by: Mark Rutland <[email protected]> Cc: Marco Elver <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Vincenzo Frascino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: memcg: use larger batches for proactive reclaimT.J. Mercier1-2/+3
Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim") we passed the number of pages for the reclaim request directly to try_to_free_mem_cgroup_pages, which could lead to significant overreclaim. After 0388536ac291 the number of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim. However such a small batch size caused a regression in reclaim performance due to many more reclaim start/stop cycles inside memory_reclaim. The restart cost is amortized over more pages with larger batch sizes, and becomes a significant component of the runtime if the batch size is too small. Reclaim tries to balance nr_to_reclaim fidelity with fairness across nodes and cgroups over which the pages are spread. As such, the bigger the request, the bigger the absolute overreclaim error. Historic in-kernel users of reclaim have used fixed, small sized requests to approach an appropriate reclaim rate over time. When we reclaim a user request of arbitrary size, use decaying batch sizes to manage error while maintaining reasonable throughput.
MGLRU enabled - memcg LRU used, root - full reclaim:
                        pages/sec   time (sec)
pre-0388536ac291      :     68047        10.46
post-0388536ac291     :     13742          inf
(reclaim-reclaimed)/4 :     67352        10.51
MGLRU enabled - memcg LRU not used, /uid_0 - 1G reclaim:
                        pages/sec   time (sec)   overreclaim (MiB)
pre-0388536ac291      :    258822         1.12               107.8
post-0388536ac291     :    105174         2.49                 3.5
(reclaim-reclaimed)/4 :    233396         1.12                -7.4
MGLRU enabled - memcg LRU not used, /uid_0 - full reclaim:
                        pages/sec   time (sec)
pre-0388536ac291      :     72334         7.09
post-0388536ac291     :     38105        14.45
(reclaim-reclaimed)/4 :     72914         6.96
[[email protected]: v4] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim") Signed-off-by: T.J. Mercier <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutny <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Muchun Song <[email protected]> Cc: Efly Young <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
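The "(reclaim-reclaimed)/4" variant benchmarked above boils down to simple arithmetic; a self-contained sketch of the batch-size computation (the 32-page floor corresponds to SWAP_CLUSTER_MAX mentioned above):

	/* Decaying batch: request a quarter of what is still missing,
	 * but never less than one swap cluster. */
	static unsigned long reclaim_batch(unsigned long nr_to_reclaim,
					   unsigned long reclaimed)
	{
		unsigned long min_batch = 32;	/* SWAP_CLUSTER_MAX */
		unsigned long remaining = nr_to_reclaim - reclaimed;
		unsigned long batch = remaining / 4;

		return batch > min_batch ? batch : min_batch;
	}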
2024-02-22mm/mmap: pass vma to vma_merge()Yajun Deng1-14/+13
These vma_merge() callers pass mm, anon_vma and file, which all come from the same vma. There is no need to pass three parameters at the same time. Pass vma instead of mm, anon_vma and file to vma_merge(), saving two parameters. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Yajun Deng <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Yajun Deng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: ignore writable bit in folio_pte_batch()David Hildenbrand1-6/+24
... and conditionally return to the caller if any PTE except the first one is writable. fork() has to make sure to properly write-protect in case any PTE is writable. Other users (e.g., page unmaping) are expected to not care. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()David Hildenbrand1-5/+31
Let's always ignore the accessed/young bit: we'll always mark the PTE as old in our child process during fork, and upcoming users will similarly not care. Ignore the dirty bit only if we don't want to duplicate the dirty bit into the child process during fork. Maybe, we could just set all PTEs in the child dirty if any PTE is dirty. For now, let's keep the behavior unchanged, this can be optimized later if required. Ignore the soft-dirty bit only if the bit doesn't have any meaning in the src vma, and similarly won't have any in the copied dst vma. For now, we won't bother with the uffd-wp bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: optimize fork() with PTE-mapped THPDavid Hildenbrand1-19/+93
Let's implement PTE batching when consecutive (present) PTEs map consecutive pages of the same large folio, and all other PTE bits besides the PFNs are equal. We will optimize folio_pte_batch() separately, to ignore selected PTE bits. This patch is based on work by Ryan Roberts. Use __always_inline for __copy_present_ptes() and keep the handling for single PTEs completely separate from the multi-PTE case: we really want the compiler to optimize for the single-PTE case with small folios, to not degrade performance. Note that PTE batching will never exceed a single page table and will always stay within VMA boundaries. Further, processing PTE-mapped THP that maybe pinned and have PageAnonExclusive set on at least one subpage should work as expected, but there is room for improvement: We will repeatedly (1) detect a PTE batch (2) detect that we have to copy a page (3) fall back and allocate a single page to copy a single page. For now we won't care as pinned pages are a corner case, and we should rather look into maintaining only a single PageAnonExclusive bit for large folios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: pass PTE to copy_present_pte()David Hildenbrand1-4/+5
We already read it, let's just forward it. This patch is based on work by Ryan Roberts. [[email protected]: fix the hmm "exclusive_cow" selftest] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/memory: factor out copying the actual PTE in copy_present_pte()David Hildenbrand1-30/+33
Let's prepare for further changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/vmscan: change the type of file from int to boolHao Ge1-2/+2
Change the type of file from int to bool because is_file_lru() returns bool. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hao Ge <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepagesBaolin Wang1-2/+10
Currently we use the 'cc->nr_freepages >= cc->nr_migratepages' comparison to ensure that enough freepages are isolated in isolate_freepages(), however it just decreases cc->nr_freepages without updating cc->nr_migratepages in compaction_alloc(), which will waste more CPU cycles and cause too many freepages to be isolated. So we should also update cc->nr_migratepages when allocating or freeing the freepages to avoid isolating excess freepages. And I can see fewer free pages are scanned and isolated when running thpcompact on my Arm64 server:
                                         k6.7    k6.7_patched
Ops Compaction pages isolated    120692036.00    118160797.00
Ops Compaction migrate scanned   131210329.00    154093268.00
Ops Compaction free scanned     1090587971.00   1080632536.00
Ops Compact scan efficiency             12.03           14.26
Moreover, I did not see any obvious latency improvement; this is likely because isolating freepages is not the bottleneck in the thpcompact test case.
                              k6.7              k6.7_patched
Amean fault-both-1      1089.76 ( 0.00%)   1080.16 *  0.88%*
Amean fault-both-3      1616.48 ( 0.00%)   1636.65 * -1.25%*
Amean fault-both-5      2266.66 ( 0.00%)   2219.20 *  2.09%*
Amean fault-both-7      2909.84 ( 0.00%)   2801.90 *  3.71%*
Amean fault-both-12     4861.26 ( 0.00%)   4733.25 *  2.63%*
Amean fault-both-18     7351.11 ( 0.00%)   6950.51 *  5.45%*
Amean fault-both-24     9059.30 ( 0.00%)   9159.99 * -1.11%*
Amean fault-both-30    10685.68 ( 0.00%)  11399.02 * -6.68%*
Link: https://lkml.kernel.org/r/6440493f18da82298152b6305d6b41c2962a3ce6.1708409245.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Acked-by: Mel Gorman <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22userfaultfd: handle zeropage moves by UFFDIO_MOVESuren Baghdasaryan2-51/+98
Current implementation of UFFDIO_MOVE fails to move zeropages and returns EBUSY when it encounters one. We can handle them by mapping a zeropage at the destination and clearing the mapping at the source. This is done both for ordinary and for huge zeropages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/r/[email protected]/ Cc: Alexander Viro <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicolas Geoffray <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Shuah Khan <[email protected]> Cc: ZhangPeng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
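As a rough illustration of the zeropage handling described above (an assumption-laden sketch, not the actual hunk; locking, the huge-zeropage case and error paths are omitted):

	/* In the UFFDIO_MOVE PTE-move path: if the source maps the shared
	 * zeropage, map a zeropage at the destination and clear the source. */
	if (is_zero_pfn(pte_pfn(orig_src_pte))) {
		pte_t zero_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
						       dst_vma->vm_page_prot));

		set_pte_at(mm, dst_addr, dst_pte, zero_pte);
		ptep_clear_flush(src_vma, src_addr, src_pte);
		return 0;	/* instead of failing with -EBUSY */
	}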
2024-02-22mm/cma: don't treat bad input arguments for cma_alloc() as its failureAnshuman Khandual1-6/+4
Invalid cma_alloc() input scenarios, including an excess allocation request, should neither be counted as CMA_ALLOC_FAIL nor update 'cma->nr_pages_failed' (when applicable with CONFIG_CMA_SYSFS). This also drops the 'out' jump label, which has become redundant. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Anshuman Khandual <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Joonsoo Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm: ptdump: add check_wx_pages debugfs attributeChristophe Leroy1-0/+22
Add a readable attribute in debugfs to trigger a W^X pages check at any time. To trigger the test, just read /sys/kernel/debug/check_wx_pages It will report FAILED if the test failed, SUCCESS otherwise. Detailed result is provided into dmesg. Link: https://lkml.kernel.org/r/e947fb1a9f3f5466344823e532d343ff194ae03d.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: "Aneesh Kumar K.V (IBM)" <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Greg KH <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kees Cook <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Phong Tran <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Price <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
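A minimal sketch of such a read-triggered debugfs attribute; ptdump_check_wx() stands in for whatever helper actually performs the W^X scan and is an assumption here:

	static int check_wx_show(struct seq_file *m, void *v)
	{
		/* run the W^X scan on read; detailed results go to dmesg */
		if (ptdump_check_wx())		/* assumed helper returning bool */
			seq_puts(m, "SUCCESS\n");
		else
			seq_puts(m, "FAILED\n");
		return 0;
	}
	DEFINE_SHOW_ATTRIBUTE(check_wx);

	static int __init check_wx_debugfs_init(void)
	{
		debugfs_create_file("check_wx_pages", 0400, NULL, NULL,
				    &check_wx_fops);
		return 0;
	}
	device_initcall(check_wx_debugfs_init);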
2024-02-22mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seqGregory Price1-4/+23
In the event of rebind, pol->nodemask can change at the same time as an allocation occurs. We can detect this with tsk->mems_allowed_seq and prevent a miscount or an allocation failure from occurring. The same thing happens in the allocators to detect failure, but this can prevent spurious failures in a much smaller critical section. [[email protected]: weighted interleave checks wrong parameter] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Gregory Price <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hasan Al Maruf <[email protected]> Cc: Honggyu Kim <[email protected]> Cc: Hyeongtak Ji <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rakie Kim <[email protected]> Cc: Ravi Jonnalagadda <[email protected]> Cc: Srinivasulu Thanneeru <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
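The pattern being applied is the usual cpuset sequence-counter retry loop; a hedged sketch of how an interleave helper can be wrapped (the helper name follows this series, but the exact call sites may differ):

	unsigned int seq;
	unsigned int nid;

	do {
		seq = read_mems_allowed_begin();
		/* reads pol->nodemask, which a rebind may change concurrently */
		nid = weighted_interleave_nid(pol, ilx);
	} while (read_mems_allowed_retry(seq));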
2024-02-22mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleavingGregory Price1-4/+214
When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). 
Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Hasan Al Maruf <[email protected]> Signed-off-by: Gregory Price <[email protected]> Co-developed-by: Rakie Kim <[email protected]> Signed-off-by: Rakie Kim <[email protected]> Co-developed-by: Honggyu Kim <[email protected]> Signed-off-by: Honggyu Kim <[email protected]> Co-developed-by: Hyeongtak Ji <[email protected]> Signed-off-by: Hyeongtak Ji <[email protected]> Co-developed-by: Srinivasulu Thanneeru <[email protected]> Signed-off-by: Srinivasulu Thanneeru <[email protected]> Co-developed-by: Ravi Jonnalagadda <[email protected]> Signed-off-by: Ravi Jonnalagadda <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
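A minimal C sketch of the il_prev/il_weight bookkeeping described in the entry above, assuming a fixed two-node weight table; this is illustrative only and is not the kernel's weighted_interleave_nodes() implementation (the next_alloc_node() helper is made up for the example):

/*
 * Weighted round-robin node selection, mirroring the description of
 * current->il_prev / current->il_weight above.  Illustrative only.
 */
#include <stdio.h>

#define NR_NODES 2

static const unsigned weight[NR_NODES] = { 2, 1 }; /* node0:node1 = 2:1 */

static int il_prev = NR_NODES - 1;  /* node previously allocated from */
static unsigned il_weight;          /* remaining allocations on il_prev */

static int next_alloc_node(void)
{
	if (il_weight == 0) {
		/* weight exhausted: advance to the next node, reload its weight */
		il_prev = (il_prev + 1) % NR_NODES;
		il_weight = weight[il_prev];
	}
	il_weight--;
	return il_prev;
}

int main(void)
{
	/* with weights (2,1) the allocation order is 0 0 1 0 0 1 ... */
	for (int i = 0; i < 9; i++)
		printf("%d ", next_alloc_node());
	printf("\n");
	return 0;
}

The kernel additionally has to cope with nodemask rebinding and with weights read from sysfs under RCU, which this sketch deliberately omits.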
2024-02-22mm/mempolicy: refactor a read-once mechanism into a function for re-useGregory Price1-10/+16
Move the use of barrier() to force policy->nodemask onto the stack into a function `read_once_policy_nodemask` so that it may be re-used. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Gregory Price <[email protected]> Suggested-by: "Huang, Ying" <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Hasan Al Maruf <[email protected]> Cc: Honggyu Kim <[email protected]> Cc: Hyeongtak Ji <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rakie Kim <[email protected]> Cc: Ravi Jonnalagadda <[email protected]> Cc: Srinivasulu Thanneeru <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
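The shape of such a helper is roughly the following; a simplified sketch rather than the patch's exact read_once_policy_nodemask(), assuming only the pol->nodes member of struct mempolicy:

#include <linux/compiler.h>	/* barrier() */
#include <linux/mempolicy.h>	/* struct mempolicy */
#include <linux/nodemask.h>	/* nodemask_t, nodes_weight() */

/*
 * Simplified sketch: snapshot pol->nodes onto the caller's stack so it
 * can be iterated safely even if a concurrent rebind rewrites the
 * policy's nodemask.  Not the actual kernel code.
 */
static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
					      nodemask_t *mask)
{
	barrier();		/* force a fresh read of pol->nodes */
	*mask = pol->nodes;	/* copy to the on-stack mask */
	barrier();		/* and prevent re-reads after this point */
	return nodes_weight(*mask);	/* node count of the snapshot */
}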
2024-02-22mm/mempolicy: implement the sysfs-based weighted_interleave interfaceRakie Kim1-0/+223
Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs extension", v5. Weighted interleave is a new interleave policy intended to make use of heterogeneous memory environments appearing with CXL. The existing interleave mechanism does an even round-robin distribution of memory across all nodes in a nodemask, while weighted interleave distributes memory across nodes according to a provided weight. (Weight = # of page allocations per round) Weighted interleave is intended to reduce average latency when bandwidth is pressured - therefore increasing total throughput. In other words: It allows greater use of the total available bandwidth in a heterogeneous hardware environment (different hardware provides different bandwidth capacity). As bandwidth is pressured, latency increases - first linearly and then exponentially. By keeping bandwidth usage distributed according to available bandwidth, we therefore can reduce the average latency of a cacheline fetch. A good explanation of the bandwidth vs latency response curve: https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/ From the article: ``` Constant region: The latency response is fairly constant for the first 40% of the sustained bandwidth. Linear region: In between 40% to 80% of the sustained bandwidth, the latency response increases almost linearly with the bandwidth demand of the system due to contention overhead by numerous memory requests. Exponential region: Between 80% to 100% of the sustained bandwidth, the memory latency is dominated by the contention latency which can be as much as twice the idle latency or more. Maximum sustained bandwidth : Is 65% to 75% of the theoretical maximum bandwidth. ``` As a general rule of thumb: * If bandwidth usage is low, latency does not increase. It is optimal to place data in the nearest (lowest latency) device. * If bandwidth usage is high, latency increases. It is optimal to place data such that bandwidth use is optimized per-device. This is the top line goal: Provide a user a mechanism to target using the "maximum sustained bandwidth" of each hardware component in a heterogenous memory system. For example, the stream benchmark demonstrates that 1:1 (default) interleave is actively harmful, while weighted interleave can be beneficial. Default interleave distributes data such that too much pressure is placed on devices with lower available bandwidth. Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device) Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) Targeted weights : +2.5% to +4% (consistently better than DRAM) Global means the task-policy was set (set_mempolicy), while targeted means VMA policies were set (mbind2). We see weighted interleave is not always beneficial when applied globally, but is always beneficial when applied to bandwidth-driving memory regions. There are 4 patches in this set: 1) Implement system-global interleave weights as sysfs extension in mm/mempolicy.c. These weights are RCU protected, and a default weight set is provided (all weights are 1 by default). In future work, we intend to expose an interface for HMAT/CDAT code to set reasonable default values based on the memory configuration of the system discovered at boot/hotplug. 2) A mild refactor of some interleave-logic for re-use in the new weighted interleave logic. 3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind 4) Protect interleave logic (weighted and normal) with the mems_allowed seq cookie. 
   If the nodemask changes while accessing it during a rebind, just
   retry the access.

Included below are some performance and LTP test information, and a sample numactl branch which can be used for testing.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass

= version history
v5:
- style fixes
- mems_allowed cookie protection to detect rebind issues, prevents
  spurious allocation failures and/or mis-allocations
- sparse warning fixes related to __rcu on local variables

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <[email protected]>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           2.64x

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           2.26x

=====================================================================
Performance test - Stream
From - Gregory Price <[email protected]>

Hardware: Single socket, single CXL expander

numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind2 weights     : +2.5% to +4% (consistently better than DRAM)

dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function    Direction    BestRateMBs    AvgTime     MinTime     MaxTime
Copy:       0->0           200923.2     0.032662    0.031853    0.033301
Scale:      0->0           202123.0     0.032526    0.031664    0.032970
Add:        0->0           208873.2     0.047322    0.045961    0.047884
Triad:      0->0           208523.8     0.047262    0.046038    0.048414

CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:       0->0            22209.7     0.288661    0.288162    0.289342
Scale:      0->0            22288.2     0.287549    0.287147    0.288291
Add:        0->0            24419.1     0.393372    0.393135    0.393735
Triad:      0->0            24484.6     0.392337    0.392083    0.394331

Based on the above, the optimal weights are ~9:1
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:       0->0            44666.2     0.143671    0.143285    0.144174
Scale:      0->0            44781.6     0.143256    0.142916    0.143713
Add:        0->0            48600.7     0.197719    0.197528    0.197858
Triad:      0->0            48727.5     0.197204    0.197014    0.197439

global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:       0->0           190085.9     0.034289    0.033669    0.034645
Scale:      0->0           207677.4     0.031909    0.030817    0.033061
Add:        0->0           202036.8     0.048737    0.047516    0.053409
Triad:      0->0           217671.5     0.045819    0.044103    0.046755

targeted regions w/ global weights (modified stream to mbind2 malloc'd regions):
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy:       0->0           205827.0     0.031445    0.031094    0.031984
Scale:      0->0           208171.8     0.031320    0.030744    0.032505
Add:        0->0           217352.0     0.045087    0.044168    0.046515
Triad:      0->0           216884.8     0.045062    0.044263    0.046982

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <[email protected]>

Hardware: Single socket, single CXL memory expander
  NUMA node 0: 56 logical cores, 128 GB memory
  NUMA node 2: 96 GB CXL memory
Threads:  56
Lookups:  170,000,000

Summary: +19% over DRAM. +47% over default interleave.

1. dram only
$ numactl -m 0 ./XSBench -s XL -p 5000000
Runtime:   36.235 seconds
Lookups/s: 4,691,618

2. default interleave
$ numactl -i 0,2 ./XSBench -s XL -p 5000000
Runtime:   55.243 seconds
Lookups/s: 3,077,293

3. weighted interleave
$ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
Runtime:   29.262 seconds
Lookups/s: 5,809,513

=====================================================================
LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

= Existing tests
set_mempolicy, get_mempolicy, mbind

MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality, but the tests were not adjusted for weighting. Basically the weights were set to 1, which is the default, and the policy should behave the same as MPOL_INTERLEAVE if the logic is correct.

== set_mempolicy01 : passed 18, failed 0
== set_mempolicy02 : passed 10, failed 0
== set_mempolicy03 : passed 64, failed 0
== set_mempolicy04 : passed 32, failed 0
== set_mempolicy05 - n/a on non-x86
== set_mempolicy06 : passed 10, failed 0
   this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
== set_mempolicy07 : passed 32, failed 0
   set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy01 : passed 12, failed 0
   change: added MPOL_WEIGHTED_INTERLEAVE
== get_mempolicy02 : passed 2, failed 0
== mbind01 : passed 15, failed 0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind02 : passed 4, failed 0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind03 : passed 16, failed 0
   added MPOL_WEIGHTED_INTERLEAVE
== mbind04 : passed 48, failed 0
   added MPOL_WEIGHTED_INTERLEAVE

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct

Test program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* touch 256MB so the heap pages get distributed by the policy */
	char *mem = malloc(1024 * 1024 * 256);
	memset(mem, 1, 1024 * 1024 * 256);
	/* then allocate and touch one page at a time (leaking on purpose)
	 * so the anon mappings above accumulate interleaved pages */
	for (int i = 0; i < ((1024 * 1024 * 256) / 4096); i++) {
		mem = malloc(4096);
		mem[0] = 1;
	}
	printf("done\n");
	getchar();	/* keep the process alive so numa_maps can be inspected */
	return 0;
}

This patch (of 4):

This patch provides a way to set interleave weight information under 
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1.

Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: "Huang, Ying" <[email protected]> Signed-off-by: Rakie Kim <[email protected]> Signed-off-by: Honggyu Kim <[email protected]> Co-developed-by: Gregory Price <[email protected]> Signed-off-by: Gregory Price <[email protected]> Co-developed-by: Hyeongtak Ji <[email protected]> Signed-off-by: Hyeongtak Ji <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Dan Williams <[email protected]> Cc: Gregory Price <[email protected]> Cc: Hasan Al Maruf <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Srinivasulu Thanneeru <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
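Once per-node weights have been written as shown above, a task can opt in from userspace roughly as follows. This is an illustrative sketch: the MPOL_WEIGHTED_INTERLEAVE fallback definition (value 6) is an assumption only needed if the installed uapi headers predate this series, set_mempolicy() comes from libnuma's <numaif.h> (link with -lnuma), and error handling is minimal.

#include <numaif.h>	/* set_mempolicy() */
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* assumed value; newer uapi headers provide it */
#endif

int main(void)
{
	/* interleave the calling task's allocations across nodes 0 and 1 */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8)) {
		perror("set_mempolicy(MPOL_WEIGHTED_INTERLEAVE)");
		return EXIT_FAILURE;
	}

	/* first-touch of new anonymous memory is now distributed according
	 * to the weights in .../weighted_interleave/nodeN */
	size_t len = 64UL << 20;
	char *buf = malloc(len);
	for (size_t i = 0; buf && i < len; i += 4096)
		buf[i] = 1;

	puts("weighted interleave applied");
	free(buf);
	return 0;
}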
2024-02-22mm/mmap: use SZ_{8K, 128K} helper macroYajun Deng1-4/+4
Use the SZ_{8K, 128K} helper macros instead of open-coded numbers in init_user_reserve() and reserve_mem_notifier(); this is more readable. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yajun Deng <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
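For illustration only (not the patch's exact diff, and the helper name is made up for the example), the pattern looks like this, with the constants coming from <linux/sizes.h>:

#include <linux/minmax.h>	/* min() */
#include <linux/sizes.h>	/* SZ_8K, SZ_128K, ... */

/* illustrative: a self-describing size constant replaces an
 * open-coded power of two such as 1UL << 17 (128K) */
static unsigned long clamp_user_reserve(unsigned long free_kbytes)
{
	return min(free_kbytes / 32, (unsigned long)SZ_128K);
}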
2024-02-22mm/damon/dbgfs: rename monitor_on file to monitor_on_DEPRECATEDSeongJae Park1-1/+1
Kernel builders could silently enable CONFIG_DAMON_DBGFS_DEPRECATED. Users who manually check the files under the DAMON debugfs directory can notice the deprecation owing to the 'DEPRECATED' DAMON debugfs file, but there could be users who don't manually check the files. Make the deprecation impossible to ignore in that case by renaming the 'monitor_on' file, which is essential for any real use of DAMON at runtime, to 'monitor_on_DEPRECATED'. Users who control DAMON only via a user-space tool could still miss the deprecation, but that is something the tool developers should take care of. The DAMON user-space tool, damo, has already made a corresponding change[1].

[1] commit 935dae76f2aee ("_damon_args: Rename --damon_interface to --damon_interface_DEPRECATED") of https://github.com/awslabs/damo

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-02-22mm/damon/dbgfs: make debugfs interface deprecation message a macroSeongJae Park1-8/+7
The DAMON debugfs interface deprecation message is written twice: once for the warning, and again for the DEPRECATED file's read output. De-duplicate those by defining the message as a macro and reusing it. [[email protected]: s/comnst/const/] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Alex Shi <[email protected]> Cc: Hu Haowen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
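A minimal sketch of the de-duplication pattern; the macro name, message text, and function names below are placeholders, not the actual mm/damon/dbgfs.c code:

#include <linux/fs.h>		/* simple_read_from_buffer() */
#include <linux/printk.h>	/* pr_warn_once() */
#include <linux/string.h>	/* strlen() */

/* one definition of the notice, reused by the warning and the file */
#define DBGFS_DEPRECATION_NOTICE \
	"DAMON debugfs interface is deprecated; please use the sysfs interface instead.\n"

static void dbgfs_warn_deprecation(void)
{
	pr_warn_once(DBGFS_DEPRECATION_NOTICE);
}

static ssize_t dbgfs_deprecated_read(struct file *file, char __user *buf,
				     size_t count, loff_t *ppos)
{
	/* the same text backs the DEPRECATED file's read output */
	static const char kbuf[] = DBGFS_DEPRECATION_NOTICE;

	return simple_read_from_buffer(buf, count, ppos, kbuf, strlen(kbuf));
}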