aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-09-01s390/uv: convert gmap_destroy_page() from follow_page() to folio_walkDavid Hildenbrand1-6/+12
Let's get rid of another follow_page() user and perform the UV calls under PTL -- which likely should be fine. No need for an additional reference while holding the PTL: uv_destroy_folio() and uv_convert_from_secure_folio() raise the refcount, so any concurrent make_folio_secure() would see an unexpted reference and cannot set PG_arch_1 concurrently. Do we really need a writable PTE? Likely yes, because the "destroy" part is, in comparison to the export, a destructive operation. So we'll keep the writability check for now. We'll lose the secretmem check from follow_page(). Likely we don't care about that here. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walkDavid Hildenbrand3-14/+19
Let's remove yet another follow_page() user. Note that we have to do the split without holding the PTL, after folio_walk_end(). We don't care about losing the secretmem check in follow_page(). [[email protected]: teach can_split_folio() that we are not holding an additional reference] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walkDavid Hildenbrand1-14/+24
Let's use folio_walk instead, for example avoiding taking temporary folio references if the folio does obviously not even apply and getting rid of one more follow_page() user. We cannot move all handling under the PTL, so leave the rmap handling (which implies an allocation) out. Note that zeropages obviously don't apply: old code could just have specified FOLL_DUMP. Further, we don't care about losing the secretmem check in follow_page(): these are never anon pages and vma_ksm_compatible() would never consider secretmem vmas (VM_SHARED | VM_MAYSHARE must be set for secretmem, see secretmem_mmap()). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/ksm: convert get_mergeable_page() from follow_page() to folio_walkDavid Hildenbrand1-12/+14
Let's use folio_walk instead, for example avoiding taking temporary folio references if the folio does not even apply and getting rid of one more follow_page() user. Note that zeropages obviously don't apply: old code could just have specified FOLL_DUMP. Anon folios are never secretmem, so we don't care about losing the check in follow_page(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/migrate: convert add_page_for_migration() from follow_page() to folio_walkDavid Hildenbrand1-55/+45
Let's use folio_walk instead, so we can avoid taking a folio reference when we won't even be trying to migrate the folio and to get rid of another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented. We now perform the folio_likely_mapped_shared() check under PTL, which is what we want: relying on the mapcount and friends after dropping the PTL does not make too much sense, as the page can get unmapped concurrently from this process. Further, we perform the folio isolation under PTL, similar to how we handle it for MADV_PAGEOUT. The possible return values for follow_page() were confusing, especially with FOLL_DUMP set. We'll handle it like documented in the man page: * -EFAULT: This is a zero page or the memory area is not mapped by the process. * -ENOENT: The page is not present. We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to do, but it likely doesn't really matter (just like for weird devmap, whereby we fake "not present"). The other errros are left as is, and match the documentation in the man page. While at it, rename add_page_for_migration() to add_folio_for_migration(). We'll lose the "secretmem" check, but that shouldn't really matter because these folios cannot ever be migrated. Should vma_migratable() refuse these VMAs? Maybe. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walkDavid Hildenbrand1-16/+15
Let's use folio_walk instead, so we can avoid taking a folio reference just to read the nid and get rid of another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented. The possible return values for follow_page() were confusing, especially with FOLL_DUMP set. We'll handle it like documented in the man page: * -EFAULT: This is a zero page or the memory area is not mapped by the process. * -ENOENT: The page is not present. We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to do, but it likely doesn't really matter (just like for weird devmap, whereby we fake "not present"). Note that the other errors (-EACCESS, -EBUSY, -EIO, -EINVAL, -ENOMEM) so far only applied when actually moving pages, not when only querying stats. We'll effectively drop the "secretmem" check we had in follow_page(), but that shouldn't really matter here, we're not accessing folio/page content after all. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/pagewalk: introduce folio_walk_start() + folio_walk_end()David Hildenbrand2-0/+260
We want to get rid of follow_page(), and have a more reasonable way to just lookup a folio mapped at a certain address, perform some checks while still under PTL, and then only conditionally grab a folio reference if really required. Further, we might want to get rid of some walk_page_range*() users that really only want to temporarily lookup a single folio at a single address. So let's add a new page table walker that does exactly that, similarly to GUP also being able to walk hugetlb VMAs. Add folio_walk_end() as a macro for now: the compiler is not easy to please with the pte_unmap()->kunmap_local(). Note that one difference between follow_page() and get_user_pages(1) is that follow_page() will not trigger faults to get something mapped. So folio_walk is at least currently not a replacement for get_user_pages(1), but could likely be extended/reused to achieve something similar in the future. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVESDavid Hildenbrand1-1/+1
Patch series "mm: replace follow_page() by folio_walk". Looking into a way of moving the last folio_likely_mapped_shared() call in add_folio_for_migration() under the PTL, I found myself removing follow_page(). This paves the way for cleaning up all the FOLL_, follow_* terminology to just be called "GUP" nowadays. The new page table walker will lookup a mapped folio and return to the caller with the PTL held, such that the folio cannot get unmapped concurrently. Callers can then conditionally decide whether they really want to take a short-term folio reference or whether the can simply unlock the PTL and be done with it. folio_walk is similar to page_vma_mapped_walk(), except that we don't know the folio we want to walk to and that we are only walking to exactly one PTE/PMD/PUD. folio_walk provides access to the pte/pmd/pud (and the referenced folio page because things like KSM need that), however, as part of this series no page table modifications are performed by users. We might be able to convert some other walk_page_range() users that really only walk to one address, such as DAMON with damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk in the future to optionally fault in a folio (if applicable), such that we can replace some get_user_pages() users that really only want to lookup a single page/folio under PTL without unconditionally grabbing a folio reference. I have plans to extend the approach to a range walker that will try batching various page table entries (not just folio pages) to be a better replace for walk_page_range() -- and users will be able to opt in which type of page table entries they want to process -- but that will require more work and more thoughts. KSM seems to work just fine (ksm_functional_tests selftests) and move_pages seems to work (migration selftest). I tested the leaf implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G) on arm64 using move_pages and did some more testing on x86-64. Cross compiled on a bunch of architectures. This patch (of 11): We want to make use of vm_normal_page_pmd() in generic page table walking code where we might walk hugetlb folios that are mapped by PMDs even without CONFIG_TRANSPARENT_HUGEPAGE. So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Claudio Imbrenda <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01include/linux/mmzone.h: clean up watermark accessorsAndrew Morton1-6/+26
- we have a helper wmark_pages(). Teach min_wmark_pages(), low_wmark_pages(), high_wmark_pages() and promo_wmark_pages() to use it instead of open-coding its implementation. - there's no reason to implement all these things as macros. Redo them in C. Acked-by: Johannes Weiner <[email protected]> Cc: Kaiyang Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: print the promo watermark in zoneinfoKaiyang Zhao1-0/+2
Print the promo watermark in zoneinfo just like other watermarks. This helps users check and verify all the watermarks are appropriate. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kaiyang Zhao <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: create promo_wmark_pages and clean up open-coded sitesKaiyang Zhao3-2/+3
Patch series "mm: print the promo watermark in zoneinfo", v2. This patch (of 2): Define promo_wmark_pages and convert current call sites of wmark_pages with fixed WMARK_PROMO to using it instead. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kaiyang Zhao <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: consider CMA pages in watermark check for NUMA balancing target nodeKaiyang Zhao1-1/+1
Currently in migrate_balanced_pgdat(), ALLOC_CMA flag is not passed when checking watermark on the migration target node. This does not match the gfp in alloc_misplaced_dst_folio() which allows allocation from CMA. This causes promotion failures when there are a lot of available CMA memory in the system. Therefore, we change the alloc_flags passed to zone_watermark_ok() in migrate_balanced_pgdat(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kaiyang Zhao <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: zswap: fix global shrinker error handling logicTakero Funaki1-7/+33
This patch fixes the zswap global shrinker, which did not shrink the zpool as expected. The issue addressed is that shrink_worker() did not distinguish between unexpected errors and expected errors, such as failed writeback from an empty memcg. The shrinker would stop shrinking after iterating through the memcg tree 16 times, even if there was only one empty memcg. With this patch, the shrinker no longer considers encountering an empty memcg, encountering a memcg with writeback disabled, or reaching the end of a memcg tree walk as a failure, as long as there are memcgs that are candidates for writeback. Systems with one or more empty memcgs will now observe significantly higher zswap writeback activity after the zswap pool limit is hit. To avoid an infinite loop when there are no writeback candidates, this patch tracks writeback attempts during memcg tree walks and limits reties if no writeback candidates are found. To handle the empty memcg case, the helper function shrink_memcg() is modified to check if the memcg is empty and then return -ENOENT. Link: https://lkml.kernel.org/r/[email protected] Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: zswap: fix global shrinker memcg iterationTakero Funaki1-29/+47
Patch series "mm: zswap: fixes for global shrinker", v5. This series addresses issues in the zswap global shrinker that could not shrink stored pages. With this series, the shrinker continues to shrink pages until it reaches the accept threshold more reliably, gives much higher writeback when the zswap pool limit is hit. This patch (of 2): This patch fixes an issue where the zswap global shrinker stopped iterating through the memcg tree. The problem was that shrink_worker() would restart iterating memcg tree from the tree root, considering an offline memcg as a failure, and abort shrinking after encountering the same offline memcg 16 times even if there is only one offline memcg. After this change, an offline memcg in the tree is no longer considered a failure. This allows the shrinker to continue shrinking the other online memcgs regardless of whether an offline memcg exists, gives higher zswap writeback activity. To avoid holding refcount of offline memcg encountered during the memcg tree walking, shrink_worker() must continue iterating to release the offline memcg to ensure the next memcg stored in the cursor is online. The offline memcg cleaner has also been changed to avoid the same issue. When the next memcg of the offlined memcg is also offline, the refcount stored in the iteration cursor was held until the next shrink_worker() run. The cleaner must release the offline memcg recursively. [[email protected]: make critical section more obvious, unify comments] Link: https://lkml.kernel.org/r/CAJD7tkaScz+SbB90Q1d5mMD70UfM2a-J2zhXDT9sePR7Qap45Q@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware") Signed-off-by: Takero Funaki <[email protected]> Signed-off-by: Yosry Ahmed <[email protected]> Acked-by: Yosry Ahmed <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Reviewed-by: Nhat Pham <[email protected]> Cc: Chengming Zhou <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: swap: allocate folio only first time in __read_swap_cache_async()Zhaoyu Liu1-27/+31
It should be checked by filemap_get_folio() if SWAP_HAS_CACHE was marked while reading a share swap page. It would re-allocate a folio if the swap cache was not ready now. We save the new folio to avoid page allocating again. Link: https://lkml.kernel.org/r/20240731133101.GA2096752@bytedance Signed-off-by: Zhaoyu Liu <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kairui Song <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: clarify folio_likely_mapped_shared() documentation for KSM foliosDavid Hildenbrand1-6/+8
For KSM folios, the function actually does what it is supposed to do: even having multiple mappings inside the same MM is considered "sharing", as there is no real relationship between these KSM page mappings -- in contrast to mapping the same file range twice and having the same pagecache page mapped twice. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/rmap: cleanup partially-mapped handling in __folio_remove_rmap()David Hildenbrand1-13/+10
Let's simplify and reduce code indentation. In the RMAP_LEVEL_PTE case, we already check for nr when computing partially_mapped. For RMAP_LEVEL_PMD, it's a bit more confusing. Likely, we don't need the "nr" check, but we could have "nr < nr_pmdmapped" also if we stumbled into the "/* Raced ahead of another remove and an add? */" case. So let's simply move the nr check in there. Note that partially_mapped is always false for small folios. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/hugetlb: remove hugetlb_follow_page_mask() leftoverDavid Hildenbrand2-12/+2
We removed hugetlb_follow_page_mask() in commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") but forgot to cleanup some leftovers. While at it, simplify the hugetlb comment, it's overly detailed and rather confusing. Stating that we may end up in there during coredumping is sufficient to explain the PF_DUMPCORE usage. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Muchun Song <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/memory_hotplug: get rid of __refWei Yang1-11/+11
After commit 73db3abdca58 ("init/modpost: conditionally check section mismatch to __meminit*"), we can get rid of __ref annotations. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: swap: add nr argument in swapcache_prepare and swapcache_clear to ↵Barry Song5-49/+63
support large folios Right now, swapcache_prepare() and swapcache_clear() supports one entry only, to support large folios, we need to handle multiple swap entries. To optimize stack usage, we iterate twice in __swap_duplicate(): the first time to verify that all entries are valid, and the second time to apply the modifications to the entries. Currently, we're using nr=1 for the existing users. [[email protected]: clarify swap_count_continued and improve readability for __swap_duplicate] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Tested-by: Baolin Wang <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kairui Song <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_poolUros Bizjak1-1/+1
Compiling z3fold.c results in several sparse warnings: z3fold.c:797:21: warning: incorrect type in initializer (different address spaces) z3fold.c:797:21: expected void const [noderef] __percpu *__vpp_verify z3fold.c:797:21: got struct list_head * z3fold.c:852:37: warning: incorrect type in initializer (different address spaces) z3fold.c:852:37: expected void const [noderef] __percpu *__vpp_verify z3fold.c:852:37: got struct list_head * z3fold.c:924:25: warning: incorrect type in assignment (different address spaces) z3fold.c:924:25: expected struct list_head *unbuddied z3fold.c:924:25: got void [noderef] __percpu *_res z3fold.c:930:33: warning: incorrect type in initializer (different address spaces) z3fold.c:930:33: expected void const [noderef] __percpu *__vpp_verify z3fold.c:930:33: got struct list_head * z3fold.c:949:25: warning: incorrect type in argument 1 (different address spaces) z3fold.c:949:25: expected void [noderef] __percpu *__pdata z3fold.c:949:25: got struct list_head *unbuddied z3fold.c:979:25: warning: incorrect type in argument 1 (different address spaces) z3fold.c:979:25: expected void [noderef] __percpu *__pdata z3fold.c:979:25: got struct list_head *unbuddied Add __percpu annotation to *unbuddied pointer to fix these warnings. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uros Bizjak <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/cma: change the addition of totalcma_pages in the cma_init_reserved_memHao Ge1-1/+1
Replace the unnecessary division calculation with cma->count when update the value of totalcma_pages. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hao Ge <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: improve code consistency with zonelist_* helper functionsWei Yang5-18/+18
Replace direct access to zoneref->zone, zoneref->zone_idx, or zone_to_nid(zoneref->zone) with the corresponding zonelist_* helper functions for consistency. No functional change. Link: https://lkml.kernel.org/r/[email protected] Co-developed-by: Shivank Garg <[email protected]> Signed-off-by: Shivank Garg <[email protected]> Signed-off-by: Wei Yang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01tools: add skeleton code for userland testing of VMA logicLorenzo Stoakes7-0/+1163
Establish a new userland VMA unit testing implementation under tools/testing which utilises existing logic providing maple tree support in userland utilising the now-shared code previously exclusive to radix tree testing. This provides fundamental VMA operations whose API is defined in mm/vma.h, while stubbing out superfluous functionality. This exists as a proof-of-concept, with the test implementation functional and sufficient to allow userland compilation of vma.c, but containing only cursory tests to demonstrate basic functionality. Link: https://lkml.kernel.org/r/533ffa2eec771cbe6b387dd049a7f128a53eb616.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Tested-by: SeongJae Park <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01tools: separate out shared radix-tree componentsLorenzo Stoakes27-69/+144
The core components contained within the radix-tree tests which provide shims for kernel headers and access to the maple tree are useful for testing other things, so separate them out and make the radix tree tests dependent on the shared components. This lays the groundwork for us to add VMA tests of the newly introduced vma.c file. Link: https://lkml.kernel.org/r/1ee720c265808168e0d75608e687607d77c36719.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01MAINTAINERS: add entry for new VMA filesLorenzo Stoakes1-0/+13
The vma files contain logic split from mmap.c for the most part and are all relevant to VMA logic, so maintain the same reviewers for both. Link: https://lkml.kernel.org/r/bf2581cce2b4d210deabb5376c6aa0ad6facf1ff.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: move internal core VMA manipulation functions to own fileLorenzo Stoakes8-2039/+2188
This patch introduces vma.c and moves internal core VMA manipulation functions to this file from mmap.c. This allows us to isolate VMA functionality in a single place such that we can create userspace testing code that invokes this functionality in an environment where we can implement simple unit tests of core functionality. This patch ensures that core VMA functionality is explicitly marked as such by its presence in mm/vma.h. It also places the header includes required by vma.c in vma_internal.h, which is simply imported by vma.c. This makes the VMA functionality testable, as userland testing code can simply stub out functionality as required. Link: https://lkml.kernel.org/r/c77a6aafb4c42aaadb8e7271a853658cbdca2e22.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: move vma_shrink(), vma_expand() to internal headerLorenzo Stoakes4-91/+106
The vma_shrink() and vma_expand() functions are internal VMA manipulation functions which we ought to abstract for use outside of memory management code. To achieve this, we replace shift_arg_pages() in fs/exec.c with an invocation of a new relocate_vma_down() function implemented in mm/mmap.c, which enables us to also move move_page_tables() and vma_iter_prev_range() to internal.h. The purpose of doing this is to isolate key VMA manipulation functions in order that we can both abstract them and later render them easily testable. Link: https://lkml.kernel.org/r/3cfcd9ec433e032a85f636fdc0d7d98fafbd19c5.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: move vma_modify() and helpers to internal headerLorenzo Stoakes2-60/+61
These are core VMA manipulation functions which invoke VMA splitting and merging and should not be directly accessed from outside of mm/. Link: https://lkml.kernel.org/r/5efde0c6342a8860d5ffc90b415f3989fd8ed0b2.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01userfaultfd: move core VMA manipulation logic to mm/userfaultfd.cLorenzo Stoakes3-149/+198
Patch series "Make core VMA operations internal and testable", v4. There are a number of "core" VMA manipulation functions implemented in mm/mmap.c, notably those concerning VMA merging, splitting, modifying, expanding and shrinking, which logically don't belong there. More importantly this functionality represents an internal implementation detail of memory management and should not be exposed outside of mm/ itself. This patch series isolates core VMA manipulation functionality into its own file, mm/vma.c, and provides an API to the rest of the mm code in mm/vma.h. Importantly, it also carefully implements mm/vma_internal.h, which specifies which headers need to be imported by vma.c, leading to the very useful property that vma.c depends only on mm/vma.h and mm/vma_internal.h. This means we can then re-implement vma_internal.h in userland, adding shims for kernel mechanisms as required, allowing us to unit test internal VMA functionality. This testing is useful as opposed to an e.g. kunit implementation as this way we can avoid all external kernel side-effects while testing, run tests VERY quickly, and iterate on and debug problems quickly. Excitingly this opens the door to, in the future, recreating precise problems observed in production in userland and very quickly debugging problems that might otherwise be very difficult to reproduce. This patch series takes advantage of existing shim logic and full userland maple tree support contained in tools/testing/radix-tree/ and tools/include/linux/, separating out shared components of the radix tree implementation to provide this testing. Kernel functionality is stubbed and shimmed as needed in tools/testing/vma/ which contains a fully functional userland vma_internal.h file and which imports mm/vma.c and mm/vma.h to be directly tested from userland. A simple, skeleton testing implementation is provided in tools/testing/vma/vma.c as a proof-of-concept, asserting that simple VMA merge, modify (testing split), expand and shrink functionality work correctly. This patch (of 4): This patch forms part of a patch series intending to separate out VMA logic and render it testable from userspace, which requires that core manipulation functions be exposed in an mm/-internal header file. In order to do this, we must abstract APIs we wish to test, in this instance functions which ultimately invoke vma_modify(). This patch therefore moves all logic which ultimately invokes vma_modify() to mm/userfaultfd.c, trying to transfer code at a functional granularity where possible. [[email protected]: fix user-after-free in userfaultfd_clear_vma()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/50c3ed995fd81c45876c86304c8a00bf3e396cfd.1722251717.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Brendan Higgins <[email protected]> Cc: Christian Brauner <[email protected]> Cc: David Gow <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kees Cook <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Rae Moar <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Pengfei Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm, memcg: cg2 memory{.swap,}.peak write testsDavid Finkel3-8/+280
Extend two existing tests to cover extracting memory usage through the newly mutable memory.peak and memory.swap.peak handlers. In particular, make sure to exercise adding and removing watchers with overlapping lifetimes so the less-trivial logic gets tested. The new/updated tests attempt to detect a lack of the write handler by fstat'ing the memory.peak and memory.swap.peak files and skip the tests if that's the case. Additionally, skip if the file doesn't exist at all. [[email protected]: update tests] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michal Koutný <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Waiman Long <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm, memcg: cg2 memory{.swap,}.peak write handlersDavid Finkel9-27/+173
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7. This patch (of 2): Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API. For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS. This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local. Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD. Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter. Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons. This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems. Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case. As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01s390/uv: drop arch_make_page_accessible()David Hildenbrand3-14/+0
All code was converted to using arch_make_folio_accessible(), let's drop arch_make_page_accessible(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/gup: convert to arch_make_folio_accessible()David Hildenbrand1-3/+5
Let's use arch_make_folio_accessible() instead so we can get rid of arch_make_page_accessible(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: simplify arch_make_folio_accessible()David Hildenbrand1-10/+1
Patch series "mm: remove arch_make_page_accessible()". Now that s390x implements arch_make_folio_accessible(), let's convert remaining users to use arch_make_folio_accessible() instead so we can remove arch_make_page_accessible(). This patch (of 3): Now that s390x implements HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE, let's turn generic arch_make_folio_accessible() into a NOP: there are no other targets that implement HAVE_ARCH_MAKE_PAGE_ACCESSIBLE but not HAVE_ARCH_MAKE_FOLIO_ACCESSIBLE. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01lib: test_hmm: use min() to improve dmirror_exclusive()Thorsten Blum1-4/+1
Use min() to simplify the dmirror_exclusive() function and improve its readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Thorsten Blum <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jérôme Glisse <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01powerpc/8xx: document and enforce that split PT locks are not usedDavid Hildenbrand1-0/+6
Right now, we cannot have split PT locks because 8xx does not support SMP. But for the sake of documentation *why* 8xx is fine regarding what we documented in huge_pte_lockptr(), let's just add code to enforce it at the same time as documenting it. This should also make everybody who wants to copy from the 8xx approach of supporting such unusual ways of mapping hugetlb folios aware that it gets tricky once multiple page tables are involved. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/hugetlb: enforce that PMD PT sharing has split PMD PT locksDavid Hildenbrand3-7/+10
Sharing page tables between processes but falling back to per-MM page table locks cannot possibly work. So, let's make sure that we do have split PMD locks by adding a new Kconfig option and letting that depend on CONFIG_SPLIT_PMD_PTLOCKS. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Russell King <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig optionsDavid Hildenbrand8-24/+26
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/[email protected] This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Russell King (Oracle) <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macroRoman Gushchin1-1/+1
When a page_counter structure is initialized, there is no need to use an atomic set operation to initialize the usage counter because at this point the structure is not visible to anybody else. ATOMIC_LONG_INIT() is what should be used in such cases. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCGRoman Gushchin2-0/+8
Put page_counter_calculate_protection() under CONFIG_MEMCG. The protection functionality (min/low limits) is not supported by any other cgroup subsystem, so page_counter_calculate_protection() and related static effective_protection() can be compiled out if CONFIG_MEMCG is not enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: memcg: don't call propagate_protected_usage() needlesslyRoman Gushchin4-14/+31
Patch series "mm: memcg: page counters optimizations", v3. This patchset contains 3 independent small optimizations of page counters. This patch (of 3): Memory protection (min/low) requires a constant tracking of protected memory usage. propagate_protected_usage() is called on each page counters update and does a number of operations even in cases when the actual memory protection functionality is not supported (e.g. hugetlb cgroups or memcg swap counters). It's obviously inefficient and leads to a waste of CPU cycles. It can be addressed by calling propagate_protected_usage() only for the counters which do support memory guarantees. As of now it's only memcg->memory - the unified memory memcg counter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Roman Gushchin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Acked-by: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: hugetlb: remove left over comment about follow_huge_foo()Kefeng Wang1-4/+0
The comment is useless after commit 57a196a58421 ("hugetlb: simplify hugetlb handling in follow_page_mask") since all follow_huge_foo() are killed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01kmemleak-test: add percpu leakPavel Tikhomirov1-0/+2
Add a per-CPU memory leak, which will be reported like: unreferenced object 0x3efa840195f8 (size 64): comm "modprobe", pid 4667, jiffies 4294688677 hex dump (first 32 bytes on cpu 0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): [<ffffffffa7fa87af>] pcpu_alloc+0x3df/0x840 [<ffffffffc11642d9>] kmemleak_test_init+0x2c9/0x2f0 [kmemleak_test] [<ffffffffa7c02264>] do_one_initcall+0x44/0x300 [<ffffffffa7de9e10>] do_init_module+0x60/0x240 [<ffffffffa7deb946>] init_module_from_file+0x86/0xc0 [<ffffffffa7deba99>] idempotent_init_module+0x109/0x2a0 [<ffffffffa7debd2a>] __x64_sys_finit_module+0x5a/0xb0 [<ffffffffa88f4f3a>] do_syscall_64+0x7a/0x160 [<ffffffffa8a0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tikhomirov <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Alexander Mikhalitsyn <[email protected]> Cc: Chen Jun <[email protected]> Cc: Wei Yongjun <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01kmemleak: enable tracking for percpu pointersPavel Tikhomirov1-59/+94
Patch series "kmemleak: support for percpu memory leak detect'. This is a rework of this series: https://lore.kernel.org/lkml/[email protected]/ Originally I was investigating a percpu leak on our customer nodes and having this functionality was a huge help, which lead to this fix [1]. So probably it's a good idea to have it in mainstream too, especially as after [2] it became much easier to implement (we already have a separate tree for percpu pointers). [1] commit 0af8c09c89681 ("netfilter: x_tables: fix percpu counter block leak on error path when creating new netns") [2] commit 39042079a0c24 ("kmemleak: avoid RCU stalls when freeing metadata for per-CPU pointers") This patch (of 2): This basically does: - Add min_percpu_addr and max_percpu_addr to filter out unrelated data similar to min_addr and max_addr; - Set min_count for percpu pointers to 1 to start tracking them; - Calculate checksum of percpu area as xor of crc32 for each cpu; - Split pointer lookup and update refs code into separate helper and use it twice: once as if the pointer is a virtual pointer and once as if it's percpu. [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pavel Tikhomirov <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Cc: Wei Yongjun <[email protected]> Cc: Chen Jun <[email protected]> Cc: Alexander Mikhalitsyn <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01task_stack: uninline stack_not_usedPasha Tatashin3-18/+23
Given that stack_not_used() is not performance critical function uninline it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01vmstat: kernel stack usage histogramPasha Tatashin3-0/+86
As part of the dynamic kernel stack project, we need to know the amount of data that can be saved by reducing the default kernel stack size [1]. Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact. The histogram data is presented in /proc/vmstat with entries like "kstack_1k", "kstack_2k", and so on, indicating the number of threads that exited with stack usage falling within each respective bucket. Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 Note: once the dynamic kernel stack is implemented it will depend on the implementation the usability of this feature: On hardware that supports faults on kernel stacks, we will have other metrics that show the total number of pages allocated for stacks. On hardware where faults are not supported, we will most likely have some optimization where only some threads are extended, and for those, these metrics will still be very useful. [1] https://lwn.net/Articles/974367 Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Reviewed-by: Kent Overstreet <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01memcg: increase the valid index range for memcg statsShakeel Butt1-22/+28
Patch series "Kernel stack usage histogram", v6. Provide histogram of stack sizes for the exited threads: Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 This patch (of 3): At the moment the valid index for the indirection tables for memcg stats and events is < S8_MAX. These indirection tables are used in performance critical codepaths. With the latest addition to the vm_events, the NR_VM_EVENT_ITEMS has gone over S8_MAX. One way to resolve is to increase the entry size of the indirection table from int8_t to int16_t but this will increase the potential number of cachelines needed to access the indirection table. This patch took a different approach and make the valid index < U8_MAX. In this way the size of the indirection tables will remain same and we only need to invalid index check from less than 0 to equal to U8_MAX. In this approach we have also removed a subtraction from the performance critical codepaths. [[email protected]: v6] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Shakeel Butt <[email protected]> Co-developed-by: Pasha Tatashin <[email protected]> Signed-off-by: Pasha Tatashin <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: shrink skip folio mapped by an exiting processZhiguo Jiang2-1/+21
The releasing process of the non-shared anonymous folio mapped solely by an exiting process may go through two flows: 1) the anonymous folio is firstly is swaped-out into swapspace and transformed into a swp_entry in shrink_folio_list; 2) then the swp_entry is released in the process exiting flow. This will result in the high cpu load of releasing a non-shared anonymous folio mapped solely by an exiting process. When the low system memory and the exiting process exist at the same time, it will be likely to happen, because the non-shared anonymous folio mapped solely by an exiting process may be reclaimed by shrink_folio_list. This patch is that shrink skips the non-shared anonymous folio solely mapped by an exting process and this folio is only released directly in the process exiting flow, which will save swap-out time and alleviate the load of the process exiting. Barry provided some effectiveness testing in [1]. "I observed that this patch effectively skipped 6114 folios (either 4KB or 64KB mTHP), potentially reducing the swap-out by up to 92MB (97,300,480 bytes) during the process exit. The working set size is 256MB." Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-mm/[email protected]/ [1] Signed-off-by: Zhiguo Jiang <[email protected]> Acked-by: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm/swap: remove boilerplateYu Zhao1-73/+38
Remove boilerplate by using a macro to choose the corresponding lock and handler for each folio_batch in cpu_fbatches. [[email protected]: handle zero-length local_lock_t] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix "BUG: using smp_processor_id() in preemptible"] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yu Zhao <[email protected]> Tested-by: Hugh Dickins <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>