path: root/mm
Age | Commit message | Author | Files | Lines
2024-05-02 | mm/slub: create kmalloc 96 and 192 caches regardless of cache size order | Hyunmin Lee | 1 | -12/+7
For SLAB, the kmalloc caches had to be created in ascending size order. That constraint no longer applies: SLAB has been removed, and SLUB does not need to comply with it. Thus, the kmalloc 96 and 192 caches can simply be created after the other kmalloc size caches, instead of checking on every iteration whether it is their turn to be created. This change also avoids confusing engineers with a constraint that no longer exists. Signed-off-by: Hyunmin Lee <[email protected]> Co-developed-by: Jeungwoo Yoo <[email protected]> Signed-off-by: Jeungwoo Yoo <[email protected]> Co-developed-by: Sangyun Kim <[email protected]> Signed-off-by: Sangyun Kim <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Gwan-gyeong Mun <[email protected]> Reviewed-by: Christoph Lameter <[email protected]> Acked-by: David Rientjes <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
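A minimal sketch of the resulting shape, assuming the helper name new_kmalloc_cache() and the index convention (1 = kmalloc-96, 2 = kmalloc-192) from mm/slab_common.c; illustrative, not the literal patch:

    /* Create the power-of-two caches first... */
    for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
        new_kmalloc_cache(i, type, flags);

    /* ...then 96 and 192, which no longer depend on creation order. */
    if (KMALLOC_MIN_SIZE <= 32)
        new_kmalloc_cache(1, type, flags);  /* kmalloc-96 */
    if (KMALLOC_MIN_SIZE <= 64)
        new_kmalloc_cache(2, type, flags);  /* kmalloc-192 */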
2024-05-02 | mm/slub: mark racy access on slab->freelist | linke li | 1 | -1/+1
In deactivate_slab(), slab->freelist can be changed concurrently. Mark the data race on slab->freelist as benign using READ_ONCE(). This patch is aimed at reducing the number of benign races reported by KCSAN in order to focus future debugging effort on harmful races. Signed-off-by: linke li <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
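The pattern, as a sketch (the exact statement in deactivate_slab() may differ):

    /* Annotate the lockless read so KCSAN treats the race as intentional. */
    freelist = READ_ONCE(slab->freelist);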
2024-05-01 | mm: Export writeback_iter() | David Howells | 1 | -0/+1
Export writeback_iter() so that it can be used by netfslib as a module. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Matthew Wilcox (Oracle) <[email protected]> cc: Christoph Hellwig <[email protected]> cc: [email protected]
2024-05-01 | mm: Provide a means of invalidation without using launder_folio | David Howells | 1 | -0/+54
Implement a replacement for launder_folio. The key feature of invalidate_inode_pages2() is that it locks each folio individually, unmaps it to prevent mmap'd accesses from interfering and calls the ->launder_folio() address_space op to flush it. This has problems: firstly, each folio is written individually as one or more small writes; secondly, adjacent folios cannot be added so easily into the laundry; thirdly, it's yet another op to implement. Instead, use the invalidate lock to cause anyone wanting to add a folio to the inode to wait, then unmap all the folios if we have mmaps, then, conditionally, use ->writepages() to flush any dirty data back and then discard all pages. The invalidate lock prevents ->read_iter(), ->write_iter() and faulting through mmap all from adding pages for the duration. This is then used from netfslib to handle the flushing in unbuffered and direct writes. Signed-off-by: David Howells <[email protected]> cc: Matthew Wilcox <[email protected]> cc: Miklos Szeredi <[email protected]> cc: Trond Myklebust <[email protected]> cc: Christoph Hellwig <[email protected]> cc: Andrew Morton <[email protected]> cc: Alexander Viro <[email protected]> cc: Christian Brauner <[email protected]> cc: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected]
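A rough sketch of the sequence just described, using existing mm helpers; the actual patch wraps this in a dedicated helper, so treat the exact calls, ranges and flag names as illustrative:

    filemap_invalidate_lock(mapping);           /* block new folio additions */
    if (mapping_mapped(mapping))
        unmap_mapping_pages(mapping, first, nr, false);
    if (flush_dirty)                            /* hypothetical condition */
        filemap_fdatawrite_range(mapping, start, end); /* via ->writepages() */
    invalidate_inode_pages2_range(mapping, first, last);
    filemap_invalidate_unlock(mapping);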
2024-05-01 | mm/slub: avoid zeroing outside-object freepointer for single free | Nicolas Bouchinet | 1 | -23/+29
Commit 284f17ac13fe ("mm/slub: handle bulk and single object freeing separately") splits single and bulk object freeing into two functions, slab_free() and slab_free_bulk(), which leads slab_free() to call slab_free_hook() directly instead of slab_free_freelist_hook(). If `init_on_free` is set, slab_free_hook() zeroes the object. Afterward, if `slub_debug=F` and `CONFIG_SLAB_FREELIST_HARDENED` are set, the do_slab_free() slowpath executes freelist consistency checks and tries to decode a zeroed freepointer, which leads to a "Freepointer corrupt" detection in check_object(). During bulk free, slab_free_freelist_hook() isn't affected, as it always sets its objects' freepointers using set_freepointer() to maintain its reconstructed freelist after `init_on_free`. For single free, zeroing must therefore avoid the object's freepointer when it is stored outside the object and `init_on_free` is set; with the freepointer zeroed along with the object, check_object() may later detect an invalid pointer value. To reproduce, set `slub_debug=FU init_on_free=1 log_level=7` on the command line of a kernel built with `CONFIG_SLAB_FREELIST_HARDENED=y`. dmesg sample log:

    [ 10.708715] =============================================================================
    [ 10.710323] BUG kmalloc-rnd-05-32 (Tainted: G B T ): Freepointer corrupt
    [ 10.712695] -----------------------------------------------------------------------------
    [ 10.712695]
    [ 10.712695] Slab 0xffffd8bdc400d580 objects=32 used=4 fp=0xffff9d9a80356f80 flags=0x200000000000a00(workingset|slab|node=0|zone=2)
    [ 10.716698] Object 0xffff9d9a80356600 @offset=1536 fp=0x7ee4f480ce0ecd7c
    [ 10.716698]
    [ 10.716698] Bytes b4 ffff9d9a803565f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 10.720703] Object ffff9d9a80356600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 10.720703] Object ffff9d9a80356610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 10.724696] Padding ffff9d9a8035666c: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    [ 10.724696] Padding ffff9d9a8035667c: 00 00 00 00 ....
    [ 10.724696] FIX kmalloc-rnd-05-32: Object at 0xffff9d9a80356600 not freed

Fixes: 284f17ac13fe ("mm/slub: handle bulk and single object freeing separately") Cc: <[email protected]> Co-developed-by: Chengming Zhou <[email protected]> Signed-off-by: Chengming Zhou <[email protected]> Signed-off-by: Nicolas Bouchinet <[email protected]> Signed-off-by: Vlastimil Babka <[email protected]>
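A heavily hedged sketch of the idea only, not the actual diff (the real patch derives the exact zeroing range from the cache layout): with `init_on_free`, the single-free path must zero the object without destroying an out-of-object freepointer that the hardened freelist checks will later decode.

    if (unlikely(init)) {
        /* Zero the payload for init_on_free... */
        memset(kasan_reset_tag(x), 0, s->object_size);
        /*
         * ...but when the freepointer is stored outside the object
         * (freeptr_outside_object(s)), it must not be zeroed along with
         * the surrounding metadata: a zeroed value fails decoding under
         * CONFIG_SLAB_FREELIST_HARDENED and triggers "Freepointer
         * corrupt" in check_object().
         */
    }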
2024-04-29 | mm: Remove the PG_fscache alias for PG_private_2 | David Howells | 1 | -3/+3
Remove the PG_fscache alias for PG_private_2 and use the latter directly. Use of this flag for marking pages undergoing writing to the cache should be considered deprecated and the folios should be marked dirty instead and the write done in ->writepages(). Note that PG_private_2 itself should be considered deprecated and up for future removal by the MM folks too. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Matthew Wilcox (Oracle) <[email protected]> cc: Ilya Dryomov <[email protected]> cc: Xiubo Li <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Ronnie Sahlberg <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Tom Talpey <[email protected]> cc: Bharath SM <[email protected]> cc: Trond Myklebust <[email protected]> cc: Anna Schumaker <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected]
2024-04-26 | Merge branch 'memory-observability' into x86/amd | Joerg Roedel | 1 | -0/+3
2024-04-25 | mm: kmsan: implement kmsan_memmove() | Alexander Potapenko | 1 | -0/+11
Provide a hook that can be used by custom memcpy implementations to tell KMSAN that the metadata needs to be copied. Without that, false positive reports are possible in the cases where KMSAN fails to intercept memory initialization. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Suggested-by: Tetsuo Handa <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
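An illustrative use of the new hook (function name hypothetical): a hand-rolled copy loop that KMSAN cannot see through, followed by kmsan_memmove() so shadow/origin metadata travels with the data.

    static void my_copy(void *dst, const void *src, size_t len)
    {
        const char *s = src;
        char *d = dst;
        size_t n = len;

        while (n--)
            *d++ = *s++;              /* data copied behind KMSAN's back */

        kmsan_memmove(dst, src, len); /* propagate metadata explicitly */
    }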
2024-04-25 | mm: convert free_zone_device_page to free_zone_device_folio | Matthew Wilcox (Oracle) | 3 | -17/+19
Both callers already have a folio; pass it in and save a few calls to compound_head(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: combine __folio_put_small, __folio_put_large and __folio_put | Matthew Wilcox (Oracle) | 1 | -26/+6
It's now obvious that __folio_put_small() and __folio_put_large() do almost exactly the same thing. Inline them both into __folio_put(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: inline destroy_large_folio() into __folio_put_large() | Matthew Wilcox (Oracle) | 2 | -17/+10
destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: combine free_the_page() and free_unref_page() | Matthew Wilcox (Oracle) | 1 | -14/+11
The pcp_allowed_order() check in free_the_page() was only being skipped by __folio_put_small() which is about to be rearranged. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: free non-hugetlb large folios in a batch | Matthew Wilcox (Oracle) | 1 | -2/+2
Patch series "Clean up __folio_put()". With all the changes over the last few years, __folio_put_small and __folio_put_large have become almost identical to each other ... except you can't tell because they're spread over two files. Rearrange it all so that you can tell, and then inline them both into __folio_put(). This patch (of 5): free_unref_folios() can now handle non-hugetlb large folios, so keep normal large folios in the batch. hugetlb folios still need to be handled specially. [[email protected]: fix panic] Link: https://lkml.kernel.org/r/ZikjPB0Dt5HA8-uL@x1n Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: convert pagecache_isize_extended to use a folio | Matthew Wilcox (Oracle) | 1 | -19/+17
Remove four hidden calls to compound_head(). Also exit early if the filesystem block size is >= PAGE_SIZE instead of just equal to PAGE_SIZE. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Pankaj Raghav <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid | Frank van der Linden | 1 | -3/+3
The hugetlb_cma code passes 0 in the order_per_bit argument to cma_declare_contiguous_nid (the alignment, computed using the page order, is correctly passed in). This causes a bit in the cma allocation bitmap to always represent a 4k page, making the bitmaps potentially very large, and slower. E.g. for a 4k page size on x86, hugetlb_cma=64G would mean a bitmap size of (64G / 4k) / 8 == 2M. With HUGETLB_PAGE_ORDER as order_per_bit, as intended, this would be (64G / 2M) / 8 == 4k, quite a difference. Also, this restricted the hugetlb_cma area to ((PAGE_SIZE << MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g. 128G on x86), since bitmap_alloc uses normal page allocation and is thus restricted by MAX_PAGE_ORDER. Specifying anything above that would fail the CMA initialization. So, correctly pass in the order instead. Link: https://lkml.kernel.org/r/[email protected] Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") Signed-off-by: Frank van der Linden <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
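A sketch of the corrected reservation call (argument list reconstructed from memory, so treat it as illustrative): order_per_bit now matches the huge page order already used for the alignment.

    ret = cma_declare_contiguous_nid(0, size, 0,
                                     PAGE_SIZE << HUGETLB_PAGE_ORDER,
                                     HUGETLB_PAGE_ORDER, /* was 0 */
                                     false, name, &hugetlb_cma[nid], nid);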
2024-04-25 | mm/cma: drop incorrect alignment check in cma_init_reserved_mem | Frank van der Linden | 1 | -4/+0
cma_init_reserved_mem uses IS_ALIGNED to check if the size represented by one bit in the cma allocation bitmask is aligned with CMA_MIN_ALIGNMENT_BYTES (pageblock size). However, this is too strict, as this will fail if order_per_bit > pageblock_order, which is a valid configuration. We could check IS_ALIGNED both ways, but since both numbers are powers of two, no check is needed at all. Link: https://lkml.kernel.org/r/[email protected] Fixes: de9e14eebf33 ("drivers: dma-contiguous: add initialization from device tree") Signed-off-by: Frank van der Linden <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | hugetlb: convert hugetlb_wp() to use struct vm_fault | Vishal Moola (Oracle) | 1 | -32/+32
hugetlb_wp() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 5 variables into a single struct. [[email protected]: simplify hugetlb_wp() arguments] Link: https://lkml.kernel.org/r/ZhQtoFNZBNwBCeXn@fedora Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | hugetlb: convert hugetlb_no_page() to use struct vm_fault | Vishal Moola (Oracle) | 1 | -32/+31
hugetlb_no_page() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 7 variables into a single struct. [[email protected]: simplify hugetlb_no_page() arguments] Link: https://lkml.kernel.org/r/ZhQtN8y5zud8iI1u@fedora Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | hugetlb: convert hugetlb_fault() to use struct vm_fault | Vishal Moola (Oracle) | 1 | -43/+41
Patch series "Hugetlb fault path to use struct vm_fault", v2. This patchset converts the hugetlb fault path to use struct vm_fault. This helps make the code more readable, and alleviates the stack by allowing us to consolidate many fault-related variables into an individual pointer. This patch (of 3): Now that hugetlb_fault() has a vm_fault available for fault tracking, use it throughout. This cleans up the code by removing 2 variables, and prepares hugetlb_fault() to take in a struct vm_fault argument. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vishal Moola (Oracle) <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: use "GUP-fast" instead "fast GUP" in remaining commentsDavid Hildenbrand2-2/+2
Let's fixup the remaining comments to consistently call that thing "GUP-fast". With this change, we consistently call it "GUP-fast". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST | David Hildenbrand | 3 | -7/+7
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm/gup: consistently name GUP-fast functions | David Hildenbrand | 1 | -102/+103
Patch series "mm/gup: consistently call it GUP-fast". Some cleanups around function names, comments and the config option of "GUP-fast" -- GUP without "lock" safety belts on. With this cleanup it's easy to judge which functions are GUP-fast specific. We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP", "lockless", or simply "gup" (which I always considered confusing in the code). So the magic now happens in functions that contain "gup_fast", whereby gup_fast() is the entry point into that magic. Comments consistently reference either "GUP-fast" or "gup_fast()".

This patch (of 3): Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename all relevant internal functions to start with "gup_fast", to make it clearer that this is not ordinary GUP. The current mixture of "lockless", "gup" and "gup_fast" is confusing. Further, avoid the term "huge" when talking about a "leaf" -- for example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd() -> gup_fast_pgd_leaf()
* gup_huge_pud() -> gup_fast_pud_leaf()
* gup_huge_pmd() -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()

The weird devmap stuff:
* __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
* __gup_device_huge_pmd() -> gup_fast_devmap_pmd_leaf()
* __gup_device_huge() -> gup_fast_devmap_leaf()
* undo_dev_pagemap() -> gup_fast_undo_dev_pagemap()

Helper functions:
* unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
* gup_fast_folio_allowed() is already properly named
* gup_fast_permitted() is already properly named

With "gup_fast()", we now even have a function that is referred to in a comment in mm/mmu_gather.c. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | hugetlb: convert alloc_buddy_hugetlb_folio to use a folio | Matthew Wilcox (Oracle) | 1 | -17/+16
While this function returned a folio, it was still using __alloc_pages() and __free_pages(). Use __folio_alloc() and folio_put() instead. This actually removes a call to compound_head(), but more importantly, it prepares us for the move to memdescs. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Sidhartha Kumar <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Reviewed-by: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: remove struct page from get_shadow_from_swap_cache | Matthew Wilcox (Oracle) | 1 | -4/+4
We don't actually use any parts of struct page; all we do is check the value of the pointer. So give the pointer the appropriate name & type. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
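The post-patch shape, reconstructed from the description (treat it as a sketch rather than the exact diff): the shadow entry is an XArray value entry that is only tested, never dereferenced as a struct page.

    void *get_shadow_from_swap_cache(swp_entry_t entry)
    {
        struct address_space *address_space = swap_address_space(entry);
        pgoff_t idx = swp_offset(entry);
        void *shadow;

        shadow = xa_load(&address_space->i_pages, idx);
        if (xa_is_value(shadow))
            return shadow;
        return NULL;
    }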
2024-04-25 | mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD | Ryan Roberts | 3 | -41/+62
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is, move to the next pte in the range and continue work on the remaining folios. Previously any failure of this sort would cause the entire operation to give up, and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would likely have led to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function (sketched below). This is more efficient than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
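A hedged sketch of the batch helper mentioned above (the in-tree version may differ): clear the young bit on a contiguous run of ptes in place, rather than get/modify/set per entry.

    static inline void mkold_ptes(struct vm_area_struct *vma,
                                  unsigned long addr, pte_t *ptep,
                                  unsigned int nr)
    {
        for (; nr-- > 0; ptep++, addr += PAGE_SIZE)
            ptep_test_and_clear_young(vma, addr, ptep);
    }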
2024-04-25 | mm: vmscan: avoid split during shrink_folio_list() | Ryan Roberts | 1 | -10/+10
Now that swap supports storing all mTHP sizes, avoid splitting large folios before swap-out. This benefits performance of the swap-out path by eliding split_folio_to_list(), which is expensive, and also sets us up for swapping in large folios in a future series. If the folio is partially mapped, we continue to split it since we want to avoid the extra IO overhead and storage of writing out pages unnecessarily. THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count events only for PMD-mappable folios to avoid user confusion. THP_SWPOUT already has the appropriate guard. Add a guard for THP_SWPOUT_FALLBACK. It may be appropriate to add per-size counters in future. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
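A sketch of the added guard (surrounding code elided):

    /* Count the fallback only for PMD-mappable folios, as for THP_SWPOUT. */
    if (nr_pages >= HPAGE_PMD_NR)
        count_vm_event(THP_SWPOUT_FALLBACK);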
2024-04-25 | mm: swap: allow storage of all mTHP orders | Ryan Roberts | 1 | -71/+91
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However, I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocating individual entries (as per the existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleaving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner can now steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: swap: update get_swap_pages() to take folio order | Ryan Roberts | 2 | -9/+10
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
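A sketch of a converted call site (label and surrounding logic illustrative): the caller passes the folio's order, and the allocator derives the page count as 1 << order internally.

    swp_entry_t entry;

    /* previously the third argument was a page count, not an order */
    if (!get_swap_pages(1, &entry, folio_order(folio)))
        goto fallback;  /* hypothetical fallback: split and retry order-0 */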
2024-04-25 | mm: swap: simplify struct percpu_cluster | Ryan Roberts | 1 | -11/+11
struct percpu_cluster stores the index of the cpu's current cluster and the offset of the next entry that will be allocated for the cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is that the structure used for it also has a flag to indicate "no cluster". However, this data structure also contains a spin lock, which is never used in this context; as a side effect, the code copies the spinlock_t structure, which is questionable coding practice in my view. So let's clean this up and store only the next offset, and use a sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately: the first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs being marked free, so it will never appear on the free list. This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
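The post-change layout, per the description above:

    /* Offset 0 holds the swap header and is never allocated, so it can
     * double as the "no cluster" sentinel. */
    #define SWAP_NEXT_INVALID   0

    struct percpu_cluster {
        unsigned int next; /* likely offset of next alloc, or SWAP_NEXT_INVALID */
    };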
2024-04-25 | mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() | Ryan Roberts | 4 | -28/+158
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [[email protected]: fix a build warning on parisc] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
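A sketch of the batched call from the zap path (helper names from this series, details illustrative):

    /* Gather a run of contiguous swap entries under the PTL... */
    nr = swap_pte_batch(pte, max_nr, ptent);
    /* ...then drop swap references and release the cache folio only once. */
    free_swap_and_cache_nr(pte_to_swp_entry(ptent), nr);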
2024-04-25 | mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags | Ryan Roberts | 2 | -42/+8
Patch series "Swap-out mTHP without splitting", v7. This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap out PMD-sized THP. There are a couple of reasons for swapping out mTHP without splitting:

- Performance: It is expensive to split a large folio, and under extreme memory pressure some workloads regressed performance when using 64K mTHP vs 4K small folios because of this extra cost in the swap-out path. This series not only eliminates the regression but makes it faster to swap out 64K mTHP vs 4K small folios.

- Memory fragmentation avoidance: If we can avoid splitting a large folio, memory is less likely to become fragmented, making it easier to re-allocate a large folio in future.

- Performance: Enables a separate series [7] to swap-in whole mTHPs, which means we won't lose the TLB-efficiency benefits of mTHP once the memory has been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). Discussion against the RFC concluded that this is sufficient.

Performance Testing
===================

I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The VM is set up with a 35G block ram device as the swap device and the test is run from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes, each allocating and writing 1G of memory. I've repeated everything 6 times and taken the mean performance improvement relative to the 4K page baseline:

| alloc size | baseline mm-unstable (~v6.9-rc1) | + this series |
|:-----------|---------------------------------:|--------------:|
| 4K Page    |                             0.0% |          1.3% |
| 64K THP    |                           -13.6% |         46.3% |
| 2M THP     |                            91.4% |         89.6% |

So with this change, the 64K swap performance goes from a 14% regression to a 46% improvement. While 2M shows a small regression, I'm confident that this is just noise.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/
[4] https://lore.kernel.org/linux-mm/[email protected]/
[5] https://lore.kernel.org/linux-mm/[email protected]/
[6] https://lore.kernel.org/linux-mm/[email protected]/
[7] https://lore.kernel.org/linux-mm/[email protected]/
[8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/
[9] https://lore.kernel.org/linux-mm/[email protected]/

This patch (of 7): As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, remove CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP, which is the same as the cluster size. The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in order, which originates from the folio's order. This allows the logic to work for folios of any order. The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. But it was only being called there to shortcut a call to __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent. That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this. Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster(), which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Acked-by: Chris Li <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: page_alloc: use the correct THP order for THP PCP | Baolin Wang | 1 | -3/+3
Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") extends the PCP allocator to store THP pages, and it determines whether to cache THP pages in PCP by comparing with pageblock_order. But the pageblock_order is not always equal to the THP order. It might also be MAX_PAGE_ORDER, which could prevent PCP from caching THP pages. Therefore, use HPAGE_PMD_ORDER instead to determine the need for caching THP in the PCP, which fixes this issue. Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Baolin Wang <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Reviewed-by: Barry Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
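The post-fix shape of the check, reconstructed from the description (treat as a sketch):

    static inline bool pcp_allowed_order(unsigned int order)
    {
        if (order <= PAGE_ALLOC_COSTLY_ORDER)
            return true;
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order == HPAGE_PMD_ORDER)   /* was: order == pageblock_order */
            return true;
    #endif
        return false;
    }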
2024-04-25 | khugepaged: use a folio throughout hpage_collapse_scan_file() | Matthew Wilcox (Oracle) | 1 | -17/+16
Replace the use of pages with folios. Saves a few calls to compound_head() and removes some uses of obsolete functions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: use a folio throughout collapse_file() | Matthew Wilcox (Oracle) | 1 | -59/+54
Pull folios from the page cache instead of pages. Half of this work had been done already, but we were still operating on pages for a large chunk of this function. There is no attempt in this patch to handle large folios that are smaller than a THP; that will have to wait for a future patch. [[email protected]: the unlikely() is embedded in IS_ERR()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: remove hpage from collapse_file() | Matthew Wilcox (Oracle) | 1 | -38/+39
Use new_folio throughout where we had been using hpage. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: pass a folio to __collapse_huge_page_copy() | Matthew Wilcox (Oracle) | 1 | -19/+15
Simplify the body of __collapse_huge_page_copy() while I'm looking at it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: remove hpage from collapse_huge_page() | Matthew Wilcox (Oracle) | 1 | -7/+5
Work purely in terms of the folio. Removes a call to compound_head() in put_page(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: convert alloc_charge_hpage to alloc_charge_folio | Matthew Wilcox (Oracle) | 1 | -8/+9
Both callers want to deal with a folio, so return a folio from this function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | khugepaged: inline hpage_collapse_alloc_folio() | Matthew Wilcox (Oracle) | 1 | -15/+4
Patch series "khugepaged folio conversions". We've been kind of hacking piecemeal at converting khugepaged to use folios instead of compound pages, and so this patchset is a little larger than it should be as I undo some of our wrong moves in the past. In particular, collapse_file() now consistently uses 'new_folio' for the freshly allocated folio and 'folio' for the one that's currently in use. This patch (of 7): This function has one caller, and the combined function is simpler to read, reason about and modify. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | memory: remove the now superfluous sentinel element from ctl_table array | Joel Granados | 7 | -7/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels), which will reduce the overall build-time size of the kernel and run-time memory bloat by ~64 bytes per sentinel (further information: https://lore.kernel.org/all/ZO5Yx5JFogGi%[email protected]/). Remove the sentinel from all files under mm/ that register a sysctl table. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Granados <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Miaohe Lin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
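An illustrative table (names hypothetical, not from the patch): registration now derives the array size with ARRAY_SIZE(), so the trailing empty element is gone.

    static struct ctl_table example_table[] = {
        {
            .procname     = "example_knob",
            .data         = &example_value,
            .maxlen       = sizeof(int),
            .mode         = 0644,
            .proc_handler = proc_dointvec,
        },
        /* previously a terminating {} sentinel lived here */
    };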
2024-04-25 | mm: rename vma_pgoff_address back to vma_address | Matthew Wilcox (Oracle) | 4 | -13/+12
With all callers converted, we can use the nice shorter name. Take this opportunity to reorder the arguments to the logical order (larger object first). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: remove vma_address() | Matthew Wilcox (Oracle) | 2 | -18/+17
Convert the three remaining callers to call vma_pgoff_address() directly. This removes an ambiguity where we'd check just one page if passed a tail page and all N pages if passed a head page. Also add better kernel-doc for vma_pgoff_address(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: correct page_mapped_in_vma() for large folios | Matthew Wilcox (Oracle) | 1 | -1/+3
Patch series "Unify vma_address and vma_pgoff_address". The current vma_address() pretends that the ambiguity between head & tail page is an advantage. If you pass a head page to vma_address(), it will operate on all pages in the folio, while if you pass a tail page, it will operate on a single page. That's not what any of the callers actually want, so first convert all callers to use vma_pgoff_address() and then rename vma_pgoff_address() to vma_address(). This patch (of 3): If 'page' is the first page of a large folio then vma_address() will scan for any page in the entire folio. This can lead to page_mapped_in_vma() returning true if some of the tail pages are mapped and the head page is not. This could lead to memory failure choosing to kill a task unnecessarily. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: huge_memory: add the missing folio_test_pmd_mappable() for THP split statistics | Baolin Wang | 1 | -2/+5
Now the mTHP can also be split or added into the deferred list, so add folio_test_pmd_mappable() validation for PMD mapped THP, to avoid confusion with PMD mapped THP related statistics. [[email protected]: check THP earlier in case folio is split, per Lance] Link: https://lkml.kernel.org/r/b99f8cb14bc85fdb6ab43721d1331cb5ebed2581.1713771041.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/a5341defeef27c9ac7b85c97f030f93e4368bbc1.1711694852.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Lance Yang <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
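A sketch of the kind of guard this adds (event name illustrative):

    /* Only PMD-mappable folios should bump the THP split counters. */
    if (folio_test_pmd_mappable(folio))
        count_vm_event(THP_SPLIT_PAGE);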
2024-04-25 | mm: support multi-size THP numa balancing | Baolin Wang | 2 | -13/+52
Now the anonymous page allocation already supports multi-size THP (mTHP), but the numa balancing still prohibits mTHP migration even though it is an exclusive mapping, which is unreasonable.

Allow scanning mTHP: Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section pages") skips shared CoW pages' NUMA page migration to avoid shared data segment migration. In addition, commit 80d47f5de5e3 ("mm: don't try to NUMA-migrate COW pages that have other uses") changed to use page_count() to avoid migrating GUP pages, which will also skip the mTHP numa scanning. Theoretically, we can use folio_maybe_dma_pinned() to detect the GUP issue; although there is still a GUP race, the issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared() to skip shared CoW pages, though this is not a precise sharers count. To check if the folio is shared, ideally we want to make sure every page is mapped to the same process, but doing that seems expensive, and using the estimated mapcount seems to work when running the autonuma benchmark.

Allow migrating mTHP: As mentioned in the previous thread [1], large folios (including THP) are more susceptible to false sharing issues among threads than 4K base pages, leading to pages ping-ponging back and forth during numa balancing, which is currently not easy to resolve. Therefore, as a start to support mTHP numa balancing, we can follow the PMD mapped THP's strategy: reuse the 2-stage filter in should_numa_migrate_memory() to check if the mTHP is being heavily contended among threads (through checking the CPU id and pid of the last access) to avoid false sharing to some degree. Thus, we can restore all PTE maps upon the first hint page fault of a large folio to follow the PMD mapped THP's strategy. In the future, we can continue to optimize the NUMA balancing algorithm to avoid the false sharing issue with large folios as much as possible.

Performance data: Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum. Base: 2024-03-25 mm-unstable branch. Enable mTHP to run autonuma-benchmark.

mTHP:16K
                         Base     Patched
    numa01               224.70   143.48
    numa01_THREAD_ALLOC  118.05    47.43
    numa02                13.45     9.29
    numa02_SMT            14.80     7.50

mTHP:64K
                         Base     Patched
    numa01               216.15   114.40
    numa01_THREAD_ALLOC  115.35    47.41
    numa02                13.24     9.25
    numa02_SMT            14.67     7.34

mTHP:128K
                         Base     Patched
    numa01               205.13   144.45
    numa01_THREAD_ALLOC  112.93    41.88
    numa02                13.16     9.18
    numa02_SMT            14.81     7.49

[1] https://lore.kernel.org/all/[email protected]/ [[email protected]: v3] Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: factor out the numa mapping rebuilding into a new helper | Baolin Wang | 1 | -7/+15
Patch series "support multi-size THP numa balancing", v2. This patchset tries to support mTHP numa balancing, as a simple solution to start, the NUMA balancing algorithm for mTHP will follow the THP strategy as the basic support. Please find details in each patch. This patch (of 2): To support large folio's numa balancing, factor out the numa mapping rebuilding into a new helper as a preparation. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/8bc2586bdd8dbbe6d83c09b77b360ec8fcac3736.1711683069.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Ryan Roberts <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: alloc_anon_folio: avoid doing vma_thp_gfp_mask in fallback cases | Barry Song | 1 | -0/+3
Fallback rates surpassing 90% have been observed on phones utilizing 64KiB CONT-PTE mTHP. In these scenarios, when one out of every 16 PTEs fails to allocate large folios, the remaining 15 PTEs fall back. Consequently, invoking vma_thp_gfp_mask seems redundant in such cases. Furthermore, abstaining from its use can also contribute to improved code readability. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Barry Song <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: Yu Zhao <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: John Hubbard <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Itaru Kitayama <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yin Fengwei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: init_mlocked_on_free_v3 | York Jasper Niebuhr | 4 | -8/+44
Implements the "init_mlocked_on_free" boot option. When this boot option is enabled, any mlock'ed pages are zeroed on free. If the pages are munlock'ed beforehand, no initialization takes place. This boot option is meant to combat the performance hit of "init_on_free" as reported in commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options"). With "init_mlocked_on_free=1" only relevant data is freed while everything else is left untouched by the kernel. Correspondingly, this patch introduces no performance hit for unmapping non-mlock'ed memory. The unmapping overhead for purely mlocked memory was measured to be approximately 13%. Realistically, most systems mlock only a fraction of the total memory, so the real-world system overhead should be close to zero. Optimally, userspace programs clear any key material or other confidential memory before exit and munlock the corresponding memory regions. If a program crashes, userspace key managers fail to do this job. Accordingly, no munlock operations are performed, so the data is caught and zeroed by the kernel. Should the program not crash, all memory will ideally be munlocked so no overhead is caused. CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON can be set to enable "init_mlocked_on_free" by default. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: York Jasper Niebuhr <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: York Jasper Niebuhr <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25 | mm: take placement mappings gap into account | Rick Edgecombe | 1 | -3/+9
When memory is being placed, mmap() will take care to respect the guard gaps of certain types of memory (VM_SHADOWSTACK, VM_GROWSUP and VM_GROWSDOWN). In order to ensure guard gaps between mappings, mmap() needs to consider two things:

1. That the new mapping isn't placed in any existing mapping's guard gaps.
2. That no existing mapping ends up inside the new mapping's guard gaps.

The longstanding behavior of mmap() is to ensure 1, but not take any care around 2. So, for example, if there is a PAGE_SIZE free area, and a mmap() with a PAGE_SIZE size and a type that has a guard gap is being placed, mmap() may place the shadow stack in the PAGE_SIZE free area. Then the mapping that is supposed to have a guard gap will not have a gap to the adjacent VMA. For MAP_GROWSDOWN/VM_GROWSDOWN and MAP_GROWSUP/VM_GROWSUP this has not been a problem in practice because applications place these kinds of mappings very early, when there are not many mappings to find a space between. But shadow stacks may be placed throughout the lifetime of the application. Use the start_gap field to find a space that includes the guard gap for the new mapping. Take care not to interfere with the alignment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rick Edgecombe <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
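A sketch of a placement request that accounts for the new mapping's own guard gap (limit values illustrative; start_gap is the member this series adds):

    struct vm_unmapped_area_info info = {
        .length     = len,
        .low_limit  = PAGE_SIZE,
        .high_limit = mmap_end,
        .start_gap  = stack_guard_gap, /* reserve the gap for the mapping */
    };
    addr = vm_unmapped_area(&info);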
2024-04-25 | treewide: use initializer for struct vm_unmapped_area_info | Rick Edgecombe | 1 | -7/+2
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new ones are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It would be possible to initialize the new member manually to 0 at each call site. This and a couple of other options were discussed. Having some struct vm_unmapped_area_info instances not zero-initialized will put those sites at risk of feeding garbage into vm_unmapped_area() if the convention is to zero-initialize the struct and any new field addition misses a call site that initializes each field manually. So it is useful to do things similarly across the kernel. The consensus (see links) was that, taking into account both code cleanliness and minimizing the chance of introducing bugs, the best general way to accomplish this was C99 static initialization. As in:

    struct vm_unmapped_area_info info = {};

With this method of initialization, the whole struct will be zero-initialized, and any statements setting fields to zero will be unneeded. The change should leave no cleanup behind at the call sites. While iterating through the possible solutions, a few archs kindly acked other variations that still zero-initialized the struct. These sites have been modified in previous changes using the pattern acked by the respective arch. So, to reduce the chance of bugs via uninitialized fields, perform a tree-wide change using the consensus for the best general way to do this change. Use C99 static initializing to zero the struct and remove any statements that simply set members to zero. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Rick Edgecombe <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
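The agreed-upon pattern at a call site, as a sketch (limit values illustrative): zero-initialize everything, then set only the fields that are non-zero for this site.

    struct vm_unmapped_area_info info = {};

    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = arch_get_mmap_end(addr, len, flags);
    addr = vm_unmapped_area(&info);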