2024-09-01  mm: kmem: add lockdep assertion to obj_cgroup_memcg  (Muchun Song; 2 files, -9/+22)
obj_cgroup_memcg() is only safe to prevent the returned memory cgroup from being freed when the caller is holding the rcu read lock, objcg_lock or cgroup_mutex. It is very easy to ignore those conditions when users call some upper APIs which call obj_cgroup_memcg() internally, like mem_cgroup_from_slab_obj() (see the link below). So it is better to add a lockdep assertion to obj_cgroup_memcg() to find those issues ASAP. Because there is no user of obj_cgroup_memcg() holding objcg_lock to make the returned memory cgroup safe, do not add an objcg_lock assertion (we would have to export objcg_lock if we really wanted to). Additionally, this is an internal implementation detail of memcg and should not be accessible outside memcg code. Some users like __mem_cgroup_uncharge() do not care about the lifetime of the returned memory cgroup and just want to know if the folio is charged to a memory cgroup, therefore they do not need to hold the needed locks. For that case, introduce a new helper folio_memcg_charged() to do this. Compared to folio_memcg(), it eliminates a memory access of objcg->memcg for kmem; admittedly a really small gain. [[email protected]: fix split_page_memcg()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Muchun Song <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
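A minimal kernel-context sketch of the assertion and helper described above; the names follow the commit text, but treat the bodies as an approximation rather than the verbatim hunks:

	static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
	{
		/* Caller must keep the returned memcg stable: RCU or cgroup_mutex. */
		lockdep_assert_once(rcu_read_lock_held() ||
				    lockdep_is_held(&cgroup_mutex));
		return READ_ONCE(objcg->memcg);
	}

	/* Only answers "is this folio charged?", so no lifetime guarantee
	 * (and thus no locking requirement) is needed. */
	static inline bool folio_memcg_charged(struct folio *folio)
	{
		return folio->memcg_data != 0;
	}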
2024-09-01  memcg: use ratelimited stats flush in the reclaim  (Shakeel Butt; 1 file, -3/+4)
Meta production is seeing a large amount of stalls in memcg stats flush from the memcg reclaim code path. At the moment, this specific callsite is doing a synchronous memcg stats flush. The rstat flush is an expensive and time consuming operation, so concurrent reclaimers will busywait on the lock, potentially for a long time. Actually this issue is not unique to Meta and has been observed by Cloudflare [1] as well. For the Cloudflare case, the stalls were due to contention between kswapd threads running on their 8 numa node machines, which does not make sense as rstat flush is global and a flush from one kswapd thread should be sufficient for all. Simply replace the synchronous flush with the ratelimited one. One may raise a concern about potentially using 2 sec stale (at worst) stats for heuristics like the desirable inactive:active ratio and preferring inactive file pages over anon pages, but these specific heuristics do not require very precise stats and are also ignored under severe memory pressure. More specifically for this code path, the stats are needed for two specific heuristics: 1. Deactivate LRUs 2. Cache trim mode. The deactivate LRUs heuristic is to maintain a desirable inactive:active ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE* and the hierarchical LRU size. WORKINGSET_ACTIVATE* is needed to check if there has been a refault since the last snapshot, and the LRU sizes are needed for the desirable ratio between inactive and active LRUs. See the table below on how the desirable ratio is calculated.
  /* total    target    max
   * memory   ratio     inactive
   * -------------------------------------
   *   10MB       1         5MB
   *  100MB       1        50MB
   *    1GB       3       250MB
   *   10GB      10       0.9GB
   *  100GB      31         3GB
   *    1TB     101        10GB
   *   10TB     320        32GB
   */
The desirable ratio only changes at the boundary of 1 GiB, 10 GiB, 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate LRU size information to calculate this ratio. In addition, if deactivation is skipped for some LRU, the kernel will force deactivation under severe memory pressure. For the cache trim mode, the inactive file LRU size is read and the kernel scales it down based on the reclaim iteration (file >> sc->priority) and only checks if it is zero or not. Again, precise information is not needed. This patch has been running on the Meta fleet for several months and we have not observed any issues. Please note that MGLRU is not impacted by this issue at all as it avoids rstat flushing completely. Link: https://lore.kernel.org/all/[email protected] [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Yosry Ahmed <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Nhat Pham <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
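The change itself is essentially a one-liner; a sketch of the reclaim-path call site, with the function name taken from the description above (an approximation, not the verbatim hunk):

	static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
	{
		/* Before: mem_cgroup_flush_stats(sc->target_mem_cgroup); */
		/* After: tolerate up to ~2s-stale stats instead of busywaiting
		 * on a synchronous rstat flush. */
		mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
	}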
2024-09-01  mm: remove legacy install_special_mapping() code  (Linus Torvalds; 7 files, -53/+63)
All relevant architectures had already been converted to the new interface (which just has an underscore in front of the name - not very imaginative naming); this just force-converts the stragglers. The modern interface is almost identical to the old one, except instead of the page pointer it takes a "struct vm_special_mapping" that describes the mapping (and contains the page pointer as one member), and it returns the resulting 'vma' instead of just the error code. Getting rid of the old interface also gets rid of some special casing, which had caused problems with the mremap extensions to "struct vm_special_mapping". [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/CAHk-=whvR+z=0=0gzgdfUiK70JTa-=+9vxD-4T=3BagXR6dciA@mail.gmail.com Tested-by: Rob Landley <[email protected]> # arch/sh/ Link: https://lore.kernel.org/all/20240819195120.GA1113263@thelio-3990X/ Signed-off-by: Linus Torvalds <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Anton Ivanov <[email protected]> Cc: Brian Cain <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Guo Ren <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Landley <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  powerpc/vdso: refactor error handling  (Michael Ellerman; 1 file, -11/+7)
Linus noticed that the error handling in __arch_setup_additional_pages() fails to clear the mm VDSO pointer if _install_special_mapping() fails. In practice there should be no actual bug, because if there's an error the VDSO pointer is cleared later in arch_setup_additional_pages(). However it's no longer necessary to set the pointer before installing the mapping. Commit c1bab64360e6 ("powerpc/vdso: Move to _install_special_mapping() and remove arch_vma_name()") reworked the code so that the VMA name comes from the vm_special_mapping.name, rather than relying on arch_vma_name(). So rework the code to only set the VDSO pointer once the mappings have been installed correctly, and remove the stale comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: remove arch_unmap()  (Michael Ellerman; 5 files, -27/+6)
Now that powerpc no longer uses arch_unmap() to handle VDSO unmapping, there are no meaningful implementations left. Drop support for it entirely, and update comments which refer to it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  powerpc/mm: handle VDSO unmapping via close() rather than arch_unmap()  (Michael Ellerman; 2 files, -4/+17)
Add a close() callback to the VDSO special mapping to handle unmapping of the VDSO. That will make it possible to remove the arch_unmap() hook entirely in a subsequent patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: add optional close() to struct vm_special_mapping  (Michael Ellerman; 2 files, -0/+9)
Add an optional close() callback to struct vm_special_mapping. It will be used, by powerpc at least, to handle unmapping of the VDSO. Although support for unmapping the VDSO was initially added for CRIU[1], it is not desirable to guard that support behind CONFIG_CHECKPOINT_RESTORE. There are other known users of unmapping the VDSO which are not related to CRIU, eg. Valgrind [2] and void-ship [3]. The powerpc arch_unmap() hook has been in place for ~9 years, with no ifdef, so there may be other unknown users that have come to rely on unmapping the VDSO. Even if the code was behind an ifdef, major distros enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO depends on that configuration option. It's also undesirable to have such core mm behaviour behind a relatively obscure CONFIG option. Longer term the unmap behaviour should be standardised across architectures, however that is complicated by the fact the VDSO pointer is stored differently across architectures. There was a previous attempt to unify that handling [4], which could be revived. See [5] for further discussion. [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 [3]: https://github.com/insanitybit/void-ship [4]: https://lore.kernel.org/lkml/[email protected]/ [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
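For reference, a sketch of the extended structure as described above; the surrounding fields are reproduced from memory and may not exactly match the upstream header:

	struct vm_special_mapping {
		const char *name;	/* e.g. "[vdso]" */
		struct page **pages;

		vm_fault_t (*fault)(const struct vm_special_mapping *sm,
				    struct vm_area_struct *vma,
				    struct vm_fault *vmf);
		int (*mremap)(const struct vm_special_mapping *sm,
			      struct vm_area_struct *new_vma);
		/* New, optional: invoked when the special mapping is unmapped,
		 * e.g. so powerpc can clear mm->context.vdso. */
		void (*close)(const struct vm_special_mapping *sm,
			      struct vm_area_struct *vma);
	};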
2024-09-01  kfence: save freeing stack trace at calling time instead of freeing time  (Tianchen Ding; 3 files, -13/+34)
For kmem_cache with SLAB_TYPESAFE_BY_RCU, the freeing stack trace taken at the time kmem_cache_free() is called is more useful, while the following stack is meaningless and provides no help:
  freed by task 46 on cpu 0 at 656.840729s:
   rcu_do_batch+0x1ab/0x540
   nocb_cb_wait+0x8f/0x260
   rcu_nocb_cb_kthread+0x25/0x80
   kthread+0xd2/0x100
   ret_from_fork+0x34/0x50
   ret_from_fork_asm+0x1a/0x30
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tianchen Ding <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  maple_tree: fix comment typo with corresponding maple_status  (Wei Yang; 1 file, -1/+1)
In the comment of function mas_start(), we list the return values for the different cases. According to the comment context, stating the maple_status here is more consistent with the others. Let's correct it to ma_active in the case where it's a tree. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  maple_tree: fix comment typo of ma_root  (Wei Yang; 1 file, -1/+1)
In the comment of mas_start(), we list the return value for different cases. In the case of a single entry, we set mas->status to ma_root, while the comment uses mas_root, which is not a maple_status. Fix the typo according to the code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  maple_tree: add test to replicate low memory race conditions  (Sidhartha Kumar; 3 files, -1/+101)
Add new callback fields to the userspace implementation of struct kmem_cache. This allows for executing callback functions in order to further test low memory scenarios where node allocation is retried. This callback can help test race conditions by calling a function when a low memory event is tested. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  maple_tree: reset mas->index and mas->last on write retries  (Sidhartha Kumar; 1 file, -4/+12)
The following scenario can result in a race condition. Consider a node with the following indices and values:

  a<------->b<----------->c<--------->d
       0xA       NULL          0xB

  CPU 1                              CPU 2
  ---------                          ---------
  mas_set_range(a,b)
  mas_erase()
    -> range is expanded (a,c)
       because of null expansion
  mas_nomem()
  mas_unlock()
                                     mas_store_range(b,c,0xC)

  The node now looks like:
  a<------->b<----------->c<--------->d
       0xA       0xC           0xB

  mas_lock()
  mas_erase() <------ range of erase is still (a,c)

The node is now NULL from (a,c) but the write from CPU 2 should have been retained and range (b,c) should still have 0xC as its value. We can fix this by re-initializing to the original index and last. This does not need a cc: Stable as there are no users of the maple tree which use internal locking and this condition is only possible with internal locking. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sidhartha Kumar <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
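A conceptual sketch of the fix, using a simplified retry loop in the style of the mas_*() store paths; the helpers suffixed _sketch are stand-ins for the internal write machinery, not maple tree API:

	int mas_store_retry_sketch(struct ma_state *mas, void *entry, gfp_t gfp)
	{
		unsigned long index = mas->index;	/* caller's original range */
		unsigned long last = mas->last;

	retry:
		attempt_store_sketch(mas, entry);	/* may expand the range over NULLs */
		if (mas_nomem(mas, gfp)) {
			/* The lock was dropped and reacquired: discard any stale,
			 * NULL-expanded range so a racing store is not wiped out. */
			mas->index = index;
			mas->last = last;
			goto retry;
		}
		return 0;	/* error handling elided */
	}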
2024-09-01  mm/hugetlb_vmemmap: batch HVO work when demoting  (Yu Zhao; 1 file, -64/+92)
Batch the HVO work, including de-HVO of the source and HVO of the destination hugeTLB folios, to speed up demotion. After commit bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers"), each request of HVO or de-HVO, batched or not, invokes synchronize_rcu() once. For example, when not batched, demoting one 1GB hugeTLB folio to 512 2MB hugeTLB folios invokes synchronize_rcu() 513 times (1 de-HVO plus 512 HVO requests), whereas when batched, only twice (1 de-HVO plus 1 HVO request). And the performance difference between the two cases is significant, e.g., echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size time echo 100 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote Before this patch: real 8m58.158s user 0m0.009s sys 0m5.900s After this patch: real 0m0.900s user 0m0.000s sys 0m0.851s Note that this patch changes the behavior of the `demote` interface when de-HVO fails. Before, the interface aborts immediately upon failure; now, it tries to finish an entire batch, meaning it can make extra progress if the rest of the batch contains folios that do not need to de-HVO. Link: https://lkml.kernel.org/r/[email protected] Fixes: bd225530a4c7 ("mm/hugetlb_vmemmap: fix race with speculative PFN walkers") Signed-off-by: Yu Zhao <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/swap: take folio refcount after testing the LRU flag  (yangge; 1 file, -5/+3)
Whoever passes a folio to __folio_batch_add_and_move() must hold a reference, otherwise something else would already be messed up. If the folio is referenced, it will not be freed elsewhere, so we can safely clear the folio's lru flag. As discussed with David in [1], we should take the reference after testing the LRU flag, not before. Link: https://lore.kernel.org/lkml/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: yangge <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  filemap: add trace events for get_pages, map_pages, and fault  (Takaya Saeki; 2 files, -0/+88)
To allow precise tracking of page caches accessed, add new tracepoints that trigger when a process actually accesses them. The ureadahead program used by ChromeOS traces the disk access of programs as they start up at boot up. It uses mincore(2) or the 'mm_filemap_add_to_page_cache' trace event to accomplish this. It stores this information in a "pack" file and on subsequent boots, it will read the pack file and call readahead(2) on the information so that disk storage can be loaded into RAM before the applications actually need it. A problem we see is that the kernel's readahead algorithm can aggressively pull in more data than needed (to try and accomplish the same goal), and this data is also recorded. The end result is that the pack file contains a lot of pages on disk that are never actually used. Calling readahead(2) on these unused pages can slow down the system boot up times. To solve this, add 3 new trace events, get_pages, map_pages, and fault. These will be used to trace the pages that are not only pulled in from disk, but are actually used by the application. Only those pages will be stored in the pack file, and this helps out the performance of boot up. With the combination of these 3 new trace events and mm_filemap_add_to_page_cache, we observed a reduction in the pack file by 7.3% - 20% on ChromeOS varying by device. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Takaya Saeki <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Cc: Junichi Uekawa <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/mprotect: fix dax pud handlings  (Peter Xu; 3 files, -8/+107)
This is only relevant to the two archs that support PUD dax, aka, x86_64 and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not count in this case. DAX have had PUD mappings for years, but change protection path never worked. When the path is triggered in any form (a simple test program would be: call mprotect() on a 1G dev_dax mapping), the kernel will report "bad pud". This patch should fix that. The new change_huge_pud() tries to keep everything simple. For example, it doesn't optimize write bit as that will need even more PUD helpers. It's not too bad anyway to have one more write fault in the worst case once for 1G range; may be a bigger thing for each PAGE_SIZE, though. Neither does it support userfault-wp bits, as there isn't such PUD mappings that is supported; file mappings always need a split there. The same to TLB shootdown: the pmd path (which was for x86 only) has the trick of using _ad() version of pmdp_invalidate*() which can avoid one redundant TLB, but let's also leave that for later. Again, the larger the mapping, the smaller of such effect. There's some difference on handling "retry" for change_huge_pud() (where it can return 0): it isn't like change_huge_pmd(), as the pmd version is safe with all conditions handled in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply smarter. For that, change_pud_range() will need proper retry if it races with something else when a huge PUD changed from under us. The last thing to mention is currently the PUD path ignores the huge pte numa counter (NUMA_HUGE_PTE_UPDATES), not only because DAX is not applicable to NUMA, but also that it's ambiguous on its own to decide how to account pud in this case. In one earlier version of this patchset I proposed to remove the counter as it doesn't even look right to do the accounting as of now [1], but then a further discussion suggests we can leave that for later, as that doesn't block this series if we choose to ignore that counter. That's what this patch does, by ignoring it. When at it, touch up the comment in pgtable_split_needed() to make it generic to either pmd or pud file THPs. [1] https://lore.kernel.org/all/[email protected]/ [2] https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Peter Xu <[email protected]> Cc: Dan Williams <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/x86: add missing pud helpers  (Peter Xu; 2 files, -8/+61)
Some new helpers will be needed for pud entry updates soon. Introduce these helpers by referencing the pmd ones. Namely: - pudp_invalidate(): this helper invalidates a huge pud before a split happens, so that the invalidated pud entry will make sure no race will happen (either with software, like a concurrent zap, or hardware, like a/d bit lost). - pud_modify(): this helper applies a new pgprot to an existing huge pud mapping. For more information on why we need these two helpers, please refer to the corresponding pmd helpers in the mprotect() code path. When at it, simplify the pud_modify()/pmd_modify() comments on shadow stack pgtable entries to reference pte_modify() to avoid duplicating the whole paragraph three times. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/x86: implement arch_check_zapped_pud()  (Peter Xu; 4 files, -1/+25)
Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps. It has the same logic as the PMD helper. One thing to mention is, it might be a good idea to use page_table_check in the future for trapping wrong setups of shadow stack pgtable entries [1]. That is left for the future as a separate effort. [1] https://lore.kernel.org/all/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/x86: make pud_leaf() only care about PSE bit  (Peter Xu; 1 file, -2/+1)
When working on mprotect() on 1G dax entries, I hit an zap bad pud error when zapping a huge pud that is with PROT_NONE permission. Here the problem is x86's pud_leaf() requires both PRESENT and PSE bits set to report a pud entry as a leaf, but that doesn't look right, as it's not following the pXd_leaf() definition that we stick with so far, where PROT_NONE entries should be reported as leaves. To fix it, change x86's pud_leaf() implementation to only check against PSE bit to report a leaf, irrelevant of whether PRESENT bit is set. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: Dave Hansen <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
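A sketch of what the x86 definition looks like with the described change, assuming the usual pud_val()/_PAGE_PSE helpers (not necessarily the verbatim hunk):

	#define pud_leaf pud_leaf
	static inline bool pud_leaf(pud_t pud)
	{
		/* Report a leaf based on PSE alone, so PROT_NONE (non-present)
		 * huge puds are still treated as leaves. */
		return pud_val(pud) & _PAGE_PSE;
	}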
2024-09-01  mm/powerpc: add missing pud helpers  (Peter Xu; 2 files, -0/+23)
Some new helpers will be needed for pud entry updates soon. Introduce these helpers by referencing the pmd ones. Namely: - pudp_invalidate(): this helper invalidates a huge pud before a split happens, so that the invalidated pud entry will make sure no race will happen (either with software, like a concurrent zap, or hardware, like a/d bit lost). - pud_modify(): this helper applies a new pgprot to an existing huge pud mapping. For more information on why we need these two helpers, please refer to the corresponding pmd helpers in the mprotect() code path. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/mprotect: push mmu notifier to PUDs  (Peter Xu; 1 file, -16/+16)
mprotect() does mmu notifiers in PMD levels. It's there since 2014 of commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to change_pmd_range"). At that time, the issue was that NUMA balancing can be applied on a huge range of VM memory, even if nothing was populated. The notification can be avoided in this case if no valid pmd detected, which includes either THP or a PTE pgtable page. Now to pave way for PUD handling, this isn't enough. We need to generate mmu notifications even on PUD entries properly. mprotect() is currently broken on PUD (e.g., one can easily trigger kernel error with dax 1G mappings already), this is the start to fix it. To fix that, this patch proposes to push such notifications to the PUD layers. There is risk on regressing the problem Rik wanted to resolve before, but I think it shouldn't really happen, and I still chose this solution because of a few reasons: 1) Consider a large VM that should definitely contain more than GBs of memory, it's highly likely that PUDs are also none. In this case there will have no regression. 2) KVM has evolved a lot over the years to get rid of rmap walks, which might be the major cause of the previous soft-lockup. At least TDP MMU already got rid of rmap as long as not nested (which should be the major use case, IIUC), then the TDP MMU pgtable walker will simply see empty VM pgtable (e.g. EPT on x86), the invalidation of a full empty region in most cases could be pretty fast now, comparing to 2014. 3) KVM has explicit code paths now to even give way for mmu notifiers just like this one, e.g. in commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). It'll also avoid contentions that may also contribute to a soft-lockup. 4) Stick with PMD layer simply don't work when PUD is there... We need one way or another to fix PUD mappings on mprotect(). Pushing it to PUD should be the safest approach as of now, e.g. there's yet no sign of huge P4D coming on any known archs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: David Rientjes <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/dax: dump start address in fault handler  (Peter Xu; 1 file, -3/+3)
Patch series "mm/mprotect: Fix dax puds", v5. Dax supports pud pages for a while, but mprotect on puds was missing since the start. This series tries to fix that by providing pud handling in mprotect(). The goal is to add more types of pud mappings like hugetlb or pfnmaps. This series paves way for it by fixing known pud entries. Considering nobody reported this until when I looked at those other types of pud mappings, I am thinking maybe it doesn't need to be a fix for stable and this may not need to be backported. I would guess whoever cares about mprotect() won't care 1G dax puds yet, vice versa. I hope fixing that in new kernels would be fine, but I'm open to suggestions. There're a few small things changed to teach mprotect work on PUDs. E.g. it will need to start with dropping NUMA_HUGE_PTE_UPDATES which may stop making sense when there can be more than one type of huge pte. OTOH, we'll also need to push the mmu notifiers from pmd to pud layers, which might need some attention but so far I think it's safe. For such details, please refer to each patch's commit message. The mprotect() pud process should be straightforward, as I kept it as simple as possible. There's no NUMA handled as dax simply doesn't support that. There's also no userfault involvements as file memory (even if work with userfault-wp async mode) will need to split a pud, so pud entry doesn't need to yet know userfault's existance (but hugetlb entries will; that's also for later). This patch (of 7): Currently the dax fault handler dumps the vma range when dynamic debugging enabled. That's mostly not useful. Dump the (aligned) address instead with the order info. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true  (Yuanchu Xie; 1 file, -2/+2)
When non-leaf pmd accessed bits are available, MGLRU page table walks can clear the non-leaf pmd accessed bit and ignore the accessed bit on the pte if it's on a different node, skipping a generation update as well. If another scan occurs on the same node as said skipped pte, the non-leaf pmd accessed bit might remain cleared and the pte accessed bits won't be checked. While this is sufficient for reclaim-driven aging, where the goal is to select a reasonably cold page, the access can be missed when aging proactively for workingset estimation of a node/memcg. In more detail, get_pfn_folio returns NULL if the folio's nid != node under scanning, so the page table walk skips processing of said pte. Now the pmd_young flag on this pmd is cleared, and if none of the ptes are accessed before another scan occurs on the folio's node, the pmd_young check fails and the pte accessed bit is skipped. Since force_scan disables various other optimizations, we check force_scan to ignore the non-leaf pmd accessed bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yuanchu Xie <[email protected]> Acked-by: Yu Zhao <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Lance Yang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: vmalloc: add optimization hint on page existence check  (Miao Wang; 1 file, -1/+1)
In commit 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped"), a BUG_ON macro was changed into an if statement, where the compiler optimization hint introduced in the BUG_ON macro was removed along with this change. This patch adds back the hint. Link: https://lkml.kernel.org/r/[email protected] Fixes: 21e516b913c1 ("mm: vmalloc: dump page owner info if page is already mapped") Signed-off-by: Miao Wang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Hariom Panthi <[email protected]> Cc: "Uladzislau Rezki (Sony)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: accept to promo watermark  (Kirill A. Shutemov; 1 file, -2/+2)
Commit c574bbe91703 ("NUMA balancing: optimize page placement for memory tiering system") introduced a new watermark above "high" -- "promo". Accept memory up to the highest watermark, which is WMARK_PROMO now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: page_isolation: handle unaccepted memory isolation  (Kirill A. Shutemov; 1 file, -0/+8)
Page isolation machinery doesn't know anything about unaccepted memory and considers it non-free. It leads to alloc_contig_pages() failure. Treat unaccepted memory as free and accept memory on pageblock isolation. Once memory is accepted it becomes PageBuddy() and page isolation knows how to deal with them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: add a helper to accept page  (Kirill A. Shutemov; 2 files, -12/+43)
Accept a given struct page and add it to the free list. The helper is useful for physical memory scanners that want to use free unaccepted memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: rework accept memory helpers  (Kirill A. Shutemov; 10 files, -38/+26)
Make accept_memory() and range_contains_unaccepted_memory() take 'start' and 'size' arguments instead of 'start' and 'end'. Remove accept_page(), replacing it with direct calls to accept_memory(). The accept_page() name is going to be used for a different function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: introduce PageUnaccepted() page type  (Kirill A. Shutemov; 2 files, -0/+10)
The new page type allows physical memory scanners to detect unaccepted memory and handle it accordingly. The page type is serialized with zone lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: accept memory in __alloc_pages_bulk()  (Kirill A. Shutemov; 1 file, -0/+11)
Currently, the kernel only accepts memory in get_page_from_freelist(), but there is another path that directly takes pages from free lists - __alloc_pages_bulk(). This function can consume all accepted memory and will resort to __alloc_pages_noprof() if necessary. Conditionally accept memory in __alloc_pages_bulk(). The same issue may arise due to deferred page initialization. Kick the deferred initialization machinery before abandoning the zone, as the kernel does in get_page_from_freelist(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: reduce deferred struct page init ifdeffery  (Kirill A. Shutemov; 1 file, -4/+5)
Patch series "mm: Fix several issues with unaccepted memory", v2. The patchset addresses several issues related to unaccepted memory. Pacth 1/7 preparatory cleanup. Patch 2/7 ensures that __alloc_pages_bulk() will not exhaust all accepted memory without accepting more. Patches 3/7-5/7 are preparations for patch 6/7, which fixes alloc_config_page() on machines with unaccepted memory. This allows, for example, the allocation of gigantic pages at runtime. Patch 7/7 enables the kernel to accept memory up to the promo watermark. This patch (of 7): Add dummy _deferred_grow_zone() for !DEFERRED_STRUCT_PAGE_INIT and remove #ifdefs in two places. No functional changes. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)  (Zi Yan; 3 files, -50/+47)
do_numa_page() and do_huge_pmd_numa_page() share a lot of common code. To reduce redundancy, move common code to numa_migrate_prep() and rename the function to numa_migrate_check() to reflect its functionality. Now do_huge_pmd_numa_page() also checks shared folios to set TNF_SHARED flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  memcg: replace memcg ID idr with xarray  (Shakeel Butt; 1 file, -29/+10)
At the moment memcg IDs are managed through IDR which requires external synchronization mechanisms and makes the allocation code a bit awkward. Let's switch to xarray and make the code simpler. [[email protected]: fix error path in mem_cgroup_alloc(), per Dan] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Suggested-by: Matthew Wilcox <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
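A sketch of what ID allocation looks like with an xarray; identifier names here are assumptions based on the text above, not the exact upstream code:

	static DEFINE_XARRAY_ALLOC1(mem_cgroup_ids);	/* IDs start at 1 */

	static int mem_cgroup_alloc_id_sketch(struct mem_cgroup *memcg)
	{
		u32 id;
		int ret;

		/* No external synchronization needed: the xarray locks internally. */
		ret = xa_alloc(&mem_cgroup_ids, &id, memcg,
			       XA_LIMIT(1, MEM_CGROUP_ID_MAX), GFP_KERNEL);
		if (ret)
			return ret;
		memcg->id.id = id;
		return 0;
	}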
2024-09-01  selftest mm/mseal: fix test_seal_mremap_move_dontunmap_anyaddr  (Jeff Xu; 1 file, -21/+36)
The mremap() syscall accepts the following: mremap(src, size, size, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, dst). When the src is sealed, the call will fail with error code EPERM. Previously, the test used hard-coded 0xdeaddead as dst, and it will fail on systems with a newer glibc installed. This patch removes the test's dependency on glibc for mremap(), and also fixes the test and removes the hardcoded address. Link: https://lkml.kernel.org/r/[email protected] Fixes: 4926c7a52de7 ("selftest mm/mseal memory sealing") Signed-off-by: Jeff Xu <[email protected]> Reported-by: Pedro Falcato <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: return the folio from swapin_readahead  (Matthew Wilcox (Oracle); 4 files, -16/+9)
The unuse_pte_range() caller only wants the folio while do_swap_page() wants both the page and the folio. Since do_swap_page() already has logic for handling both the folio and the page, move the folio-to-page logic there. This also lets us allocate larger folios in the SWP_SYNCHRONOUS_IO path in future. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: remove PG_error  (Matthew Wilcox (Oracle); 4 files, -8/+2)
The PG_error bit is now unused; delete it and free up a bit in page->flags. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  fs: remove calls to set and clear the folio error flag  (Matthew Wilcox (Oracle); 4 files, -15/+2)
Nobody checks the folio error flag any more, so we can stop setting and clearing it. Also remove the documentation suggesting to not bother setting the error bit. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: kfence: print the elapsed time for allocated/freed track  (qiwu.chen; 1 file, -2/+6)
Print the elapsed time for the allocated or freed track, which can be useful in some debugging scenarios. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: qiwu.chen <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: chenqiwu <[email protected]> Cc: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  percpu: remove pcpu_alloc_size()  (Jianhui Zhou; 2 files, -32/+0)
pcpu_alloc_size() was added in 7ac5c53e0073 "mm/percpu.c: introduce pcpu_alloc_size()", and was used to get the allocated memory size in bpf. However, pcpu_alloc_size() is no longer used since "bpf: Use c->unit_size to select target cache during free", because the actual allocated memory size may change at runtime due to the slab merging mechanism. Therefore, pcpu_alloc_size() can be removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jianhui Zhou <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: JonasZhou <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping  (David Hildenbrand; 1 file, -14/+13)
It is not immediately obvious, but we can move the folio->_nr_pages_mapped update out of the loop and reduce the number of atomic ops without affecting the stats. The important point to realize is that only removing the last PMD mapping will result in _nr_pages_mapped going below ENTIRELY_MAPPED, not the individual atomic_inc_return_relaxed() calls. Concurrent races with removal of PMD mappings should be handled as expected, just like when we would have such races right now on a single mapcount update. In a simple munmap() microbenchmark [1] on 1 GiB of memory backed by the same PTE-mapped folio size (only mapped by a single process such that they will get completely unmapped), this change results in a speedup (positive is good) per folio size on a x86-64 Intel machine of roughly (a bit of noise expected): * 16 KiB: +10% * 32 KiB: +15% * 64 KiB: +17% * 128 KiB: +21% * 256 KiB: +22% * 512 KiB: +22% * 1024 KiB: +23% * 2048 KiB: +27% [1] https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/pte-mapped-folio-benchmarks.c Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
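The shape of the optimization, paraphrased (not the literal rmap code; the helper name is a stand-in):

	static void nr_pages_mapped_batch_sketch(struct folio *folio, int nr_pages)
	{
		/* Before: one atomic RMW per PTE, inside the (un)map loop:
		 *	for (i = 0; i < nr_pages; i++)
		 *		atomic_inc_return_relaxed(&folio->_nr_pages_mapped);
		 */

		/* After: a single batched update with the same net effect on the
		 * stats; only removing the last PMD mapping can push the counter
		 * below ENTIRELY_MAPPED. */
		atomic_add_return_relaxed(nr_pages, &folio->_nr_pages_mapped);
	}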
2024-09-01  selftests/mm: add mseal test for no-discard madvise  (Pedro Falcato; 1 file, -1/+35)
Add an mseal test for madvise() operations that aren't considered "discard" (e.g purely advisory ops such as MADV_RANDOM). [[email protected]: adjust the mseal test's plan] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pedro Falcato <[email protected]> Tested-by: Jeff Xu <[email protected]> Reviewed-by: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  kfence: introduce burst mode  (Marco Elver; 3 files, -5/+18)
Introduce burst mode, which can be configured with kfence.burst=$count, where the burst count denotes the additional successive slab allocations to be allocated through KFENCE for each sample interval. The idea is that this can give developers an additional knob to make KFENCE more aggressive when debugging specific issues of systems where either rebooting or recompiling the kernel with KASAN is not possible. Experiment: To assess the effectiveness of the new option, we randomly picked a recent out-of-bounds [1] and use-after-free bug [2], each with a reproducer provided by syzbot, that initially detected these bugs with KASAN. We then tried to reproduce the bugs with KFENCE below. [1] Fixed by: 7c55b78818cf ("jfs: xattr: fix buffer overflow for invalid xattr") https://syzkaller.appspot.com/bug?id=9d1b59d4718239da6f6069d3891863c25f9f24a2 [2] Fixed by: f8ad00f3fb2a ("l2tp: fix possible UAF when cleaning up tunnels") https://syzkaller.appspot.com/bug?id=4f34adc84f4a3b080187c390eeef60611fd450e1 The following KFENCE configs were compared. A pool size of 1023 objects was used for all configurations. Baseline kfence.sample_interval=100 kfence.skip_covered_thresh=75 kfence.burst=0 Aggressive kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=0 AggressiveBurst kfence.sample_interval=1 kfence.skip_covered_thresh=10 kfence.burst=1000 Each reproducer was run 10 times (after a fresh reboot), with the following detection counts for each KFENCE config: | Detection Count out of 10 | | OOB [1] | UAF [2] | ------------------+-------------+-------------+ Default | 0/10 | 0/10 | Aggressive | 0/10 | 0/10 | AggressiveBurst | 8/10 | 8/10 | With the Default and even the Aggressive configs the results are unsurprising, given KFENCE has not been designed for deterministic bug detection of small test cases. However, when enabling burst mode with relatively large burst count, KFENCE can start to detect heap memory-safety bugs even in simpler test cases with high probability (in the above cases with ~80% probability). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Marco Elver <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Cc: Andrey Konovalov <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Jann Horn <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  mm: fix (harmless) type confusion in lock_vma_under_rcu()  (Jann Horn; 2 files, -6/+23)
There is a (harmless) type confusion in lock_vma_under_rcu(): After vma_start_read(), we have taken the VMA lock but don't know yet whether the VMA has already been detached and scheduled for RCU freeing. At this point, ->vm_start and ->vm_end are accessed. vm_area_struct contains a union such that ->vm_rcu uses the same memory as ->vm_start and ->vm_end; so accessing ->vm_start and ->vm_end of a detached VMA is illegal and leads to type confusion between union members. Fix it by reordering the vma->detached check above the address checks, and document the rules for RCU readers accessing VMAs. This will probably change the number of observed VMA_LOCK_MISS events (since previously, trying to access a detached VMA whose ->vm_rcu has been scheduled would bail out when checking the fault address against the rcu_head members reinterpreted as VMA bounds). Link: https://lkml.kernel.org/r/20240805-fix-vma-lock-type-confusion-v1-1-9f25443a9a71@google.com Fixes: 50ee32537206 ("mm: introduce lock_vma_under_rcu to be used from arch-specific code") Signed-off-by: Jann Horn <[email protected]> Acked-by: Suren Baghdasaryan <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
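A paraphrased excerpt of the reordered checks described above (not the verbatim function body):

	vma = mas_walk(&mas);
	if (!vma || !vma_start_read(vma))
		goto inval;

	/* Check "detached" before touching vm_start/vm_end: on a detached VMA
	 * that memory may already be reused as the vm_rcu union member. */
	if (unlikely(vma->detached)) {
		vma_end_read(vma);
		goto inval;
	}

	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
		vma_end_read(vma);
		goto inval;
	}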
2024-09-01  zswap: track swapins from disk more accurately  (Nhat Pham; 2 files, -7/+10)
Currently, there are a couple of issues with our disk swapin tracking for dynamic zswap shrinker heuristics: 1. We only increment the swapin counter on pivot pages. This means we are not taking into account pages that also need to be swapped in, but are already taken care of as part of the readahead window. 2. We are also incrementing when the pages are read from the zswap pool, which is inaccurate. This patch rectifies these issues by incrementing the counter whenever we need to perform a non-zswap read. Note that we are slightly overcounting, as a page might be read into memory by the readahead algorithm even though it will not be needed by users - however, this is an acceptable inaccuracy, as the readahead logic itself will adapt to these kinds of scenarios. To test this change, I built the kernel under a cgroup with its memory.max set to 2 GB: real: 236.66s user: 4286.06s sys: 652.86s swapins: 81552 For comparison, with just the new second chance algorithm, the build time is as follows: real: 244.85s user: 4327.22s sys: 664.39s swapins: 94663 With neither: real: 263.89s user: 4318.11s sys: 673.29s swapins: 227300.5 (average over 5 runs) With this change, the kernel CPU time reduces by a further 1.7%, and the real time is reduced by another 3.3%, compared to just the second chance algorithm by itself. The swapins count also reduces by another 13.85%. Combining the two changes, we reduce the real time by 10.32%, kernel CPU time by 3%, and number of swapins by 64.12%. To gauge the new scheme's ability to offload cold data, I ran another benchmark, in which the kernel was built under a cgroup with memory.max set to 3 GB, but with 0.5 GB worth of cold data allocated before each build (in a shmem file). Under the old scheme: real: 197.18s user: 4365.08s sys: 289.02s zswpwb: 72115.2 Under the new scheme: real: 195.8s user: 4362.25s sys: 290.14s zswpwb: 87277.8 (average over 5 runs) Notice that we actually observe a 21% increase in the number of written back pages - so the new scheme is just as good, if not better at offloading pages from the zswap pool when they are cold. Build time reduces by around 0.7% as a result. [[email protected]: squeeze a comment into a single line] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure") Signed-off-by: Nhat Pham <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Takero Funaki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01  zswap: implement a second chance algorithm for dynamic zswap shrinker  (Nhat Pham; 2 files, -54/+70)
Patch series "improving dynamic zswap shrinker protection scheme", v3. When experimenting with the memory-pressure based (i.e "dynamic") zswap shrinker in production, we observed a sharp increase in the number of swapins, which led to performance regression. We were able to trace this regression to the following problems with the shrinker's warm pages protection scheme: 1. The protection decays way too rapidly, and the decaying is coupled with zswap stores, leading to anomalous patterns, in which a small batch of zswap stores effectively erase all the protection in place for the warmer pages in the zswap LRU. This observation has also been corroborated upstream by Takero Funaki (in [1]). 2. We inaccurately track the number of swapped in pages, missing the non-pivot pages that are part of the readahead window, while counting the pages that are found in the zswap pool. To alleviate these two issues, this patch series improve the dynamic zswap shrinker in the following manner: 1. Replace the protection size tracking scheme with a second chance algorithm. This new scheme removes the need for haphazard stats decaying, and automatically adjusts the pace of pages aging with memory pressure, and writeback rate with pool activities: slowing down when the pool is dominated with zswpouts, and speeding up when the pool is dominated with stale entries. 2. Fix the tracking of the number of swapins to take into account non-pivot pages in the readahead window. With these two changes in place, in a kernel-building benchmark without any cold data added, the number of swapins is reduced by 64.12%. This translate to a 10.32% reduction in build time. We also observe a 3% reduction in kernel CPU time. In another benchmark, with cold data added (to gauge the new algorithm's ability to offload cold data), the new second chance scheme outperforms the old protection scheme by around 0.7%, and actually written back around 21% more pages to backing swap device. So the new scheme is just as good, if not even better than the old scheme on this front as well. [1]: https://lore.kernel.org/linux-mm/CAPpodddcGsK=0Xczfuk8usgZ47xeyf4ZjiofdT+ujiyz6V2pFQ@mail.gmail.com/ This patch (of 2): Current zswap shrinker's heuristics to prevent overshrinking is brittle and inaccurate, specifically in the way we decay the protection size (i.e making pages in the zswap LRU eligible for reclaim). We currently decay protection aggressively in zswap_lru_add() calls. This leads to the following unfortunate effect: when a new batch of pages enter zswap, the protection size rapidly decays to below 25% of the zswap LRU size, which is way too low. We have observed this effect in production, when experimenting with the zswap shrinker: the rate of shrinking shoots up massively right after a new batch of zswap stores. This is somewhat the opposite of what we want originally - when new pages enter zswap, we want to protect both these new pages AND the pages that are already protected in the zswap LRU. Replace existing heuristics with a second chance algorithm 1. When a new zswap entry is stored in the zswap pool, its referenced bit is set. 2. When the zswap shrinker encounters a zswap entry with the referenced bit set, give it a second chance - only flips the referenced bit and rotate it in the LRU. 3. If the shrinker encounters the entry again, this time with its referenced bit unset, then it can reclaim the entry. 
In this manner, the aging of the pages in the zswap LRUs is decoupled from zswap stores, and picks up the pace with increasing memory pressure (which is what we want). The second chance scheme allows us to modulate the writeback rate based on recent pool activities. Entries that recently entered the pool will be protected, so if the pool is dominated by such entries the writeback rate will reduce proportionally, protecting the workload's workingset. On the other hand, stale entries will be written back quickly, which increases the effective writeback rate. The referenced bit is added at the hole after the `length` field of struct zswap_entry, so there is no extra space overhead for this algorithm. We will still maintain the count of swapins, which is consumed and subtracted from the lru size in zswap_shrinker_count(), to further penalize past overshrinking that led to disk swapins (sketched after this entry). The idea is that had we considered this many more pages in the LRU active/protected, they would not have been written back and we would not have had to swap them in. To test the new heuristic, I built the kernel under a cgroup with memory.max set to 2G, on a host with 36 cores: With the old shrinker: real: 263.89s user: 4318.11s sys: 673.29s swapins: 227300.5 With the second chance algorithm: real: 244.85s user: 4327.22s sys: 664.39s swapins: 94663 (average over 5 runs) We observe a 1.3% reduction in kernel CPU usage, and around 7.2% reduction in real time. Note that the number of swapped in pages dropped by 58%. [[email protected]: fix a small mistake in the referenced bit documentation] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Takero Funaki <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
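A hedged sketch of the counting heuristic mentioned in the entry above, with illustrative names: the number of reclaimable objects reported to the shrinker is the LRU size minus the recently observed disk swapins, so past overshrinking reduces the next scan.

    static unsigned long shrinker_count_sketch(unsigned long lru_size,
                                               unsigned long recent_swapins)
    {
            /* Penalize past overshrinking: every recent disk swapin hides
             * one LRU entry from the shrinker on this invocation. */
            return lru_size > recent_swapins ? lru_size - recent_swapins : 0;
    }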
2024-09-01mm: only enforce minimum stack gap size if it's sensibleDavid Gow1-1/+1
The generic mmap_base code tries to leave a gap between the top of the stack and the mmap base address, but enforces a minimum gap size (MIN_GAP) of 128MB, which is too large on some setups. In particular, on arm tasks without ADDR_LIMIT_32BIT, the STACK_TOP value is less than 128MB, so it's impossible to fit such a gap in. Only enforce this minimum if MIN_GAP < MAX_GAP, as we'd prefer to honour MAX_GAP, which is defined proportionally, so scales better and always leaves us with both _some_ stack space and some room for mmap. This fixes the usercopy KUnit test suite on 32-bit arm, as it doesn't set any personality flags so gets the default (in this case 26-bit) task size. This test can be run with: ./tools/testing/kunit/kunit.py run --arch arm usercopy --make_options LLVM=1 Link: https://lkml.kernel.org/r/[email protected] Fixes: dba79c3df4a2 ("arm: use generic mmap top-down layout and brk randomization") Signed-off-by: David Gow <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Alexandre Ghiti <[email protected]> Cc: Linus Walleij <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Russell King <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
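A hedged sketch of the clamping described above, with placeholder constants chosen to mimic a small (e.g. 26-bit) task size; this illustrates the logic of the fix, not the literal generic mmap_base() code:

    /* Placeholder values: a 64 MB task with a proportional upper bound. */
    #define STACK_TOP  (64UL << 20)
    #define MIN_GAP    (128UL << 20)
    #define MAX_GAP    (STACK_TOP / 6 * 5)

    static unsigned long clamp_stack_gap(unsigned long gap)
    {
            /* Only honour the 128MB minimum when it can actually fit,
             * i.e. when MIN_GAP < MAX_GAP; otherwise fall back to the
             * proportional MAX_GAP. */
            if (gap < MIN_GAP && MIN_GAP < MAX_GAP)
                    gap = MIN_GAP;
            else if (gap > MAX_GAP)
                    gap = MAX_GAP;
            return gap;
    }

With these placeholder values MIN_GAP exceeds MAX_GAP, so the minimum is skipped and the gap is clamped to the proportional bound, leaving room for both the stack and mmap.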
2024-09-01mm: remove duplicated include in vma_internal.hYang Li1-1/+0
The header file linux/bug.h is included twice in vma_internal.h, so one inclusion can be removed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yang Li <[email protected]> Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9636 Reviewed-by: Lorenzo Stoakes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walkDavid Hildenbrand1-47/+16
Let's simplify by reusing folio_walk. Keep the existing behavior by handling migration entries and zeropages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: remove follow_page()David Hildenbrand5-36/+5
All users are gone, let's remove it and any leftovers in comments. We'll leave any FOLL/follow_page_() naming cleanups as future work. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walkDavid Hildenbrand1-6/+10
Let's get rid of another follow_page() user and perform the conversion under the PTL; note that this is also what follow_page_pte() ends up doing. Unfortunately, we cannot currently optimize out the additional reference, because arch_make_folio_accessible() must be called with a raised refcount to protect against concurrent conversion to secure. We can just move the arch_make_folio_accessible() call under the PTL, like follow_page_pte() would. We'll effectively drop the "writable" check implied by FOLL_WRITE: follow_page_pte() would also not check that when calling arch_make_folio_accessible(), so there is no good reason for doing that here. We'll lose the secretmem check from follow_page() as well, about which we shouldn't really care. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
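A hedged outline of the converted lookup, assuming the folio_walk API (folio_walk_start()/folio_walk_end()) introduced earlier in this series; the signatures and the surrounding s390 fault handling are simplified here, so treat this as a sketch rather than the actual patch:

    /* Inside the fault handler, with the mmap lock already held. */
    struct folio_walk fw;
    struct folio *folio;
    int rc;

    folio = folio_walk_start(&fw, vma, addr, 0);
    if (folio) {
            /* arch_make_folio_accessible() needs a raised refcount to
             * guard against concurrent conversion to secure, and is now
             * called under the page table lock taken by folio_walk. */
            folio_get(folio);
            rc = arch_make_folio_accessible(folio);
            folio_walk_end(&fw, vma);
            folio_put(folio);
            /* On failure, the caller retries or falls back as before. */
    }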