Age  Commit message  Author  Files  Lines
2024-07-17  mm, page_alloc: put should_fail_alloc_page() back behind CONFIG_FAIL_PAGE_ALLOC  (Vlastimil Babka; 4 files, -11/+7)
This mostly reverts commit af3b854492f3 ("mm/page_alloc.c: allow error injection"). The commit made should_fail_alloc_page() a noinline function that's always called from the page allocation hotpath, even if it's empty because CONFIG_FAIL_PAGE_ALLOC is not enabled, and there is no option to disable it and prevent the associated function call overhead. As with the preceding patch "mm, slab: put should_failslab back behind CONFIG_SHOULD_FAILSLAB" and for the same reasons, put the should_fail_alloc_page() back behind the config option. When enabled, the ALLOW_ERROR_INJECTION and BTF_ID records are preserved so it's not a complete revert. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Akinobu Mita <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: Hao Luo <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Song Liu <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Yonghong Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-17  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB  (Vlastimil Babka; 4 files, -17/+12)
Patch series "revert unconditional slab and page allocator fault injection calls". These two patches largely revert commits that added function call overhead into slab and page allocation hotpaths and that cannot be currently disabled even though related CONFIG_ options do exist. A much more involved solution that can keep the callsites always existing but hidden behind a static key if unused, is possible [1] and can be pursued by anyone who believes it's necessary. Meanwhile the fact the should_failslab() error injection is already not functional on kernels built with current gcc without anyone noticing [2], and lukewarm response to [1] suggests the need is not there. I believe it will be more fair to have the state after this series as a baseline for possible further optimisation, instead of the unconditional overhead. For example a possible compromise for anyone who's fine with an empty function call overhead but not the full CONFIG_FAILSLAB / CONFIG_FAIL_PAGE_ALLOC overhead is to reuse patch 1 from [1] but insert a static key check only inside should_failslab() and should_fail_alloc_page() before performing the more expensive checks. [1] https://lore.kernel.org/all/[email protected]/#t [2] https://github.com/bpftrace/bpftrace/issues/3258 This patch (of 2): This mostly reverts commit 4f6923fbb352 ("mm: make should_failslab always available for fault injection"). The commit made should_failslab() a noinline function that's always called from the slab allocation hotpath, even if it's empty because CONFIG_SHOULD_FAILSLAB is not enabled, and there is no option to disable that call. This is visible in profiles and the function call overhead can be noticeable especially with cpu mitigations. Meanwhile the bpftrace program example in the commit silently does not work without CONFIG_SHOULD_FAILSLAB anyway with a recent gcc, because the empty function gets a .constprop clone that is actually being called (uselessly) from the slab hotpath, while the error injection is hooked to the original function that's not being called at all [1]. Thus put the whole should_failslab() function back behind CONFIG_SHOULD_FAILSLAB. It's not a complete revert of 4f6923fbb352 - the int return type that returns -ENOMEM on failure is preserved, as well ALLOW_ERROR_INJECTION annotation. The BTF_ID() record that was meanwhile added is also guarded by CONFIG_SHOULD_FAILSLAB. [1] https://github.com/bpftrace/bpftrace/issues/3258 Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Akinobu Mita <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: David Rientjes <[email protected]> Cc: Eduard Zingerman <[email protected]> Cc: Hao Luo <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Song Liu <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Yonghong Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-17  mm: ignore data-race in __swap_writepage  (Pei Li; 1 file, -1/+6)
Syzbot reported a possible data race:

BUG: KCSAN: data-race in __swap_writepage / scan_swap_map_slots

read-write to 0xffff888102fca610 of 8 bytes by task 7106 on cpu 1.
read to 0xffff888102fca610 of 8 bytes by task 7080 on cpu 0.

While we are in __swap_writepage to read sis->flags, scan_swap_map_slots is trying to update it with SWP_SCANNING:

value changed: 0x0000000000008083 -> 0x0000000000004083

While this can be updated non-atomically, it won't affect SWP_SYNCHRONOUS_IO, so we consider this data race safe. This was possibly introduced by commit 3222d8c2a7f8 ("block: remove ->rw_page"), where this if branch was introduced. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3222d8c2a7f8 ("block: remove ->rw_page") Signed-off-by: Pei Li <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=da25887cc13da6bf3b8c Cc: Dan Williams <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
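For illustration, a tolerated race like this is usually annotated with the KCSAN data_race() macro; the sketch below is not the literal diff, and the two helper names in it are hypothetical:

    /*
     * Sketch only: data_race() marks the lockless read of sis->flags as a
     * known, tolerated race.  The concurrent writer only toggles
     * SWP_SCANNING, so the SWP_SYNCHRONOUS_IO bit tested here is stable
     * for our purposes.
     */
    if (data_race(sis->flags & SWP_SYNCHRONOUS_IO))
        ret = swap_write_sync(folio, wbc);    /* hypothetical helper */
    else
        ret = swap_write_async(folio, wbc);   /* hypothetical helper */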
2024-07-17  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr  (Donet Tom; 1 file, -5/+6)
generic_hugetlb_get_unmapped_area() was returning an address less than mmap_min_addr if the mmap argument addr, after alignment, was less than mmap_min_addr, causing mmap to fail. This is because the current generic_hugetlb_get_unmapped_area() code does not take mmap_min_addr into account. This patch ensures that generic_hugetlb_get_unmapped_area() always returns an address that is greater than mmap_min_addr. Additionally, similar to generic_get_unmapped_area(), vm_end_gap() checks are included to maintain the stack gap.

How to reproduce
================

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE (16 * 1024 * 1024)

int main() {
	void *addr = mmap((void *)-1, HUGEPAGE_SIZE,
			  PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}

	snprintf((char *)addr, HUGEPAGE_SIZE, "Hello, Huge Pages!");
	printf("%s\n", (char *)addr);

	if (munmap(addr, HUGEPAGE_SIZE) == -1) {
		perror("munmap");
		exit(EXIT_FAILURE);
	}

	return 0;
}

Result without fix
==================
# cat /proc/meminfo |grep -i HugePages_Free
HugePages_Free: 20
# ./test
mmap: Permission denied
#

Result with fix
===============
# cat /proc/meminfo |grep -i HugePages_Free
HugePages_Free: 20
# ./test
Hello, Huge Pages!
#

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Donet Tom <[email protected]> Reported-by: Pavithra Prakash <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Ritesh Harjani (IBM) <[email protected]> Cc: Tony Battersby <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
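For illustration, the hint-address validation being described mirrors what generic_get_unmapped_area() already does; a hedged sketch of that shape (not the exact fs/hugetlbfs code, and the surrounding locals such as mmap_end come from the enclosing function):

    /*
     * Sketch of the intended check, modelled on generic_get_unmapped_area():
     * only honour the caller's hint if it sits above mmap_min_addr and keeps
     * the required gaps to the surrounding VMAs.
     */
    if (addr) {
        addr = ALIGN(addr, huge_page_size(h));
        vma = find_vma_prev(mm, addr, &prev);
        if (mmap_end - len >= addr && addr >= mmap_min_addr &&
            (!vma || addr + len <= vm_start_gap(vma)) &&
            (!prev || addr >= vm_end_gap(prev)))
            return addr;
    }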
2024-07-12  mm: shmem: rename mTHP shmem counters  (Ryan Roberts; 4 files, -26/+29)
The legacy PMD-sized THP counters at /proc/vmstat include thp_file_alloc, thp_file_fallback and thp_file_fallback_charge, which rather confusingly refer to shmem THP and do not include any other types of file pages. This is inconsistent since in most other places in the kernel, THP counters are explicitly separated for anon, shmem and file flavours. However, we are stuck with it since it constitutes a user ABI. Recently, commit 66f44583f9b6 ("mm: shmem: add mTHP counters for anonymous shmem") added equivalent mTHP stats for shmem, keeping the same "file_" prefix in the names. But in future, we may want to add extra stats to cover actual file pages, at which point, it would all become very confusing. So let's take the opportunity to rename these new counters "shmem_" before the change makes it upstream and the ABI becomes immutable. While we are at it, let's improve the documentation for the legacy counters to make it clear that they count shmem pages only. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Reviewed-by: Lance Yang <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Daniel Gomez <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()  (Kefeng Wang; 1 file, -2/+1)
Convert __read_swap_cache_async() to use the folio_alloc_mpol() helper. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/migrate: putback split folios when numa hint migration fails  (Peter Xu; 1 file, -9/+2)
This issue has not come from any report yet; it was found by code observation only. This is yet another fix besides Hugh's patch [1], but on a related code path, where an eager split of a folio can happen if the folio is already on the deferred list during a folio migration. Here the issue is that the NUMA path (migrate_misplaced_folio()) may now start to encounter such folio splits even with the MR_NUMA_MISPLACED hint applied. Then, when migrate_pages() didn't migrate all the folios, it's possible that the split small folios are put onto the list instead of the original folio. Then putting back only the head page won't be enough. Fix it by putting back all the folios on the list. [1] https://lore.kernel.org/all/[email protected]/ [[email protected]: remove now unused local `nr_pages'] Link: https://lkml.kernel.org/r/[email protected] Fixes: 7262f208ca68 ("mm/migrate: split source folio if it is on deferred split list") Signed-off-by: Peter Xu <[email protected]> Reviewed-by: Zi Yan <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: Yang Shi <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Huang Ying <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/truncate: batch-clear shadow entries  (Yu Zhao; 1 file, -37/+31)
Make clear_shadow_entry() clear shadow entries in `struct folio_batch` so that it can reduce contention on i_lock and i_pages locks, e.g.,

watchdog: BUG: soft lockup - CPU#29 stuck for 11s! [fio:2701649]
  clear_shadow_entry+0x3d/0x100
  mapping_try_invalidate+0x117/0x1d0
  invalidate_mapping_pages+0x10/0x20
  invalidate_bdev+0x3c/0x50
  blkdev_common_ioctl+0x5f7/0xa90
  blkdev_ioctl+0x109/0x270

Also, rename clear_shadow_entry() to clear_shadow_entries() accordingly. [[email protected]: v2] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reported-by: Bharata B Rao <[email protected]> Closes: https://lore.kernel.org/[email protected]/ Signed-off-by: Yu Zhao <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
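For illustration, a hedged sketch of the batching idea: take the expensive locks once per batch rather than once per shadow entry. Names and locking details follow my reading of mm/truncate.c and may not match the actual code:

    static void clear_shadow_entries(struct address_space *mapping,
                                     struct folio_batch *fbatch, pgoff_t *indices)
    {
        XA_STATE(xas, &mapping->i_pages, indices[0]);
        int i;

        /* Take i_lock and the i_pages lock once for the whole batch. */
        spin_lock(&mapping->host->i_lock);
        xas_lock_irq(&xas);

        for (i = 0; i < folio_batch_count(fbatch); i++) {
            struct folio *folio = fbatch->folios[i];

            if (xa_is_value(folio)) {       /* a shadow entry, not a real folio */
                xas_set(&xas, indices[i]);
                if (xas_load(&xas) == folio)
                    xas_store(&xas, NULL);
            }
        }

        xas_unlock_irq(&xas);
        if (mapping_shrinkable(mapping))
            inode_add_lru(mapping->host);
        spin_unlock(&mapping->host->i_lock);
    }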
2024-07-12  mm/memory-failure: remove obsolete MF_MSG_DIFFERENT_COMPOUND  (Miaohe Lin; 2 files, -17/+3)
The page cannot become a compound page again right after a folio is split, as an extra refcount is held. So the MF_MSG_DIFFERENT_COMPOUND case is obsolete and can be removed, to get rid of this false assumption and code burden. But add one WARN_ON() here to keep the situation clear. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Tony Luck <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: simplify folio_migrate_mapping()  (Hugh Dickins; 1 file, -2/+1)
Now that folio_undo_large_rmappable() is an inline function checking order and large_rmappable for itself (and __folio_undo_large_rmappable() is now declared even when CONFIG_TRANSPARENT_HUGEPAGE is off) there is no need for folio_migrate_mapping() to check large and large_rmappable first (in the mapping case when it has had to freeze anyway). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Yang Shi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/page_alloc: put __free_pages_core() in __meminit section  (Wei Yang; 2 files, -2/+3)
__free_pages_core() is only used in bootmem init and hot-add memory init path. Let's put it in __meminit section. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Wei Yang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Oscar Salvador <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: thp: support "THPeligible" semantics for mTHP with anonymous shmem  (Bang Li; 3 files, -12/+19)
After commit 7fb1b252afb5 ("mm: shmem: add mTHP support for anonymous shmem"), we can configure different policies through the multi-size THP sysfs interface for anonymous shmem. But currently, for anonymous shmem, "THPeligible" indicates only whether the mapping is eligible for allocating THP pages and whether that THP is PMD-mappable or not; we need to support semantics for mTHP with anonymous shmem similar to those for mTHP with anonymous memory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Bang Li <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  kpageflags: detect isolated KPF_THP folios  (Ran Xiaokai; 2 files, -14/+11)
When a folio is isolated, the PG_lru bit is cleared. So the PG_lru check in stable_page_flags() will miss this kind of isolated folio. Use folio_test_large_rmappable() instead to also include isolated folios. Since the pagecache supports large folios and with the introduction of mTHP, the semantics of KPF_THP have been expanded; it now indicates not only PMD-sized THP. Update the related documentation to clearly state that KPF_THP indicates multiple-order THPs. [[email protected]: directly use is_zero_folio(), per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ran Xiaokai <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Andrei Vagin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Muhammad Usama Anjum <[email protected]> Cc: Ryan Roberts <[email protected]> Cc: Svetly Todorov <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: fix khugepaged activation policy  (Ryan Roberts; 4 files, -25/+38)
Since the introduction of mTHP, the documentation has stated that khugepaged would be enabled when any mTHP size is enabled, and disabled when all mTHP sizes are disabled. There are 2 problems with this: 1. this is not what was implemented by the code, and 2. this is not the desirable behavior. Desirable behavior is for khugepaged to be enabled when any PMD-sized THP is enabled, anon or file. (Note that file THP is still controlled by the top-level control so we must always consider that, as well as the PMD-size mTHP control for anon). khugepaged only supports collapsing to PMD-sized THP so there is no value in enabling it when PMD-sized THP is disabled. So let's change the code and documentation to reflect this policy. Further, per-size enabled control modification events were not previously forwarded to khugepaged to give it an opportunity to start or stop. Consequently the following was resulting in khugepaged erroneously not being activated:

  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled

[[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Fixes: 3485b88390b0 ("mm: thp: introduce multi-size THP sysfs interface") Closes: https://lore.kernel.org/linux-mm/[email protected]/ Acked-by: David Hildenbrand <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Barry Song <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Lance Yang <[email protected]> Cc: Yang Shi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  memory tier: consolidate the initialization of memory tiers  (Ho-Ren (Jack) Chuang; 3 files, -37/+24)
The current memory tier initialization process is distributed across two different functions, memory_tier_init() and memory_tier_late_init(). This design is hard to maintain. Thus, this patch is proposed to reduce the possible code paths by consolidating the different initialization paths into one. The earlier discussion with Jonathan and Ying is listed here: https://lore.kernel.org/lkml/[email protected]/ If we want to put these two initializations together, they must be placed together in the later function. Only at that time will the HMAT information be ready, so the adist between nodes can be calculated and memory tiering can be established based on the adist. So we move the initialization done in memory_tier_init() to memory_tier_late_init(). Moreover, it's natural to keep memory_tier initialization in drivers at device_initcall() level. If we simply move the set_node_memory_tier() from memory_tier_init() to late_initcall(), it will result in HMAT not registering the mt_adistance_algorithm callback function, because set_node_memory_tier() is not performed during the memory tiering initialization phase, leading to a lack of correct default_dram information. Therefore, we introduced a nodemask to pass the information of the default DRAM nodes. The reason for not choosing to reuse default_dram_type->nodes is that it is not clean enough. So in the end, we use a __initdata variable, which is a variable that is released once initialization is complete, including both CPU and memory nodes for HMAT to iterate through. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ho-Ren (Jack) Chuang <[email protected]> Suggested-by: Jonathan Cameron <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: Gregory Price <[email protected]> Cc: Len Brown <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Ravi Jonnalagadda <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/page_counter: move calculating protection values to page_counter  (Maarten Lankhorst; 3 files, -151/+180)
It's a lot of math, and there is nothing memcontrol specific about it. This makes it easier to use inside of the drm cgroup controller. [[email protected]: fix kerneldoc, per Jeff Johnson] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Maarten Lankhorst <[email protected]> Acked-by: Roman Gushchin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Muchun Song <[email protected]> Cc: Jeff Johnson <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: add comments for allocation helpers explaining why they are macros  (Suren Baghdasaryan; 8 files, -3/+46)
A number of allocation helper functions were converted into macros to account them at the call sites. Add a comment for each converted allocation helper explaining why it has to be a macro and why we typecast the return value wherever required. The patch also moves acpi_os_acquire_object() closer to other allocation helpers to group them together under the same comment. The patch has no functional changes. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2c321f3f70bc ("mm: change inlined allocation helpers to account at the call site") Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Andrew Morton <[email protected]> Cc: Christian König <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Jan Kara <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Thorsten Blum <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
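For illustration, the kind of pattern and comment being described looks roughly like this; the helper and struct names below are hypothetical, not one of the converted kernel helpers:

    /*
     * Illustrative pattern only; "widget_ctx" and "alloc_widget_ctx" are
     * hypothetical.  Keeping the wrapper a macro means the underlying
     * kzalloc() expands at the call site, so allocation accounting and
     * profiling attribute the memory to the real caller rather than to the
     * helper.  The cast preserves the helper's intended return type.
     */
    #define alloc_widget_ctx() \
        ((struct widget_ctx *)kzalloc(sizeof(struct widget_ctx), GFP_KERNEL))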
2024-07-12  mm: unexport vmf_insert_mixed_mkwrite  (Christoph Hellwig; 1 file, -1/+0)
vmf_insert_mixed_mkwrite is only used by the built-in DAX code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: remove CONFIG_ARCH_HAS_HUGEPD  (Christophe Leroy; 4 files, -258/+9)
powerpc was the only user of CONFIG_ARCH_HAS_HUGEPD and doesn't use it anymore, so remove all related code. Link: https://lkml.kernel.org/r/4b10c54c794780b955f3ad6c657d0199dd792146.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/mm: remove hugepd leftovers  (Christophe Leroy; 7 files, -477/+3)
All targets have now opted out of CONFIG_ARCH_HAS_HUGEPD, so remove the leftover code. Link: https://lkml.kernel.org/r/39c0d0adee6790fc42cee9f458e05fb95136c3dd.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64s: use contiguous PMD/PUD instead of HUGEPD  (Christophe Leroy; 15 files, -186/+74)
On book3s/64, the only user of hugepd is hash in 4k mode. All other setups (hash-64k, radix-4k, radix-64k) use leaf PMD/PUD. Rework hash-4k to use contiguous PMD and PUD instead. In that setup there are only two huge page sizes: 16M and 16G. 16M sits at PMD level and 16G at PUD level. pte_update() doesn't know the page size, so let's use the same trick as hpte_need_flush() to get the page size from segment properties. That's not the most efficient way, but let's do that until callers of pte_update() provide the page size instead of just a huge flag. Link: https://lkml.kernel.org/r/7448f60a9b3efd396595f4f735d1e0babc5ae379.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/e500: use contiguous PMD instead of hugepd  (Christophe Leroy; 10 files, -79/+107)
e500 supports many page sizes, among which the following sizes are implemented in the kernel at the time being: 4M, 16M, 64M, 256M, 1G. On e500, TLB misses for hugepages are exclusively handled by SW, even on e6500 which has HW assistance for 4k pages, so there are no constraints like on the 8xx. On e500/32, all are at PGD/PMD level and can be handled as cont-PMD. On e500/64, smaller ones are on PMD while bigger ones are on PUD. Again, they can easily be handled as cont-PMD and cont-PUD instead of hugepd. On e500/32, use the pagesize bits in the PTE to know if it is a PMD or a leaf entry. This works because the pagesize bits are in the last 12 bits and page tables are 4k aligned. On e500/64, use the highest bit, which is always 1 on PxDs (because a PxD contains the virtual address of kernel memory) and always 0 on PTEs because not all bits of the RPN are used/possible. Link: https://lkml.kernel.org/r/dd085987816ed2a0c70adb7e34966cb833fc03e1.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/e500: free r10 for FIND_PTE  (Christophe Leroy; 1 file, -14/+16)
Move r13 load after the call to FIND_PTE, and use r13 instead of r10 for storing fault address. This will allow using r10 freely in FIND_PTE in following patch to handle hugepage size. Link: https://lkml.kernel.org/r/a3ee563ad5b13c891a15d3aae6c136c44ce8aa63.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/e500: don't pre-check write access on data TLB error  (Christophe Leroy; 1 file, -15/+0)
Don't pre-check write access on read-only pages on data TLB error. Load the TLB anyway and take a DSI exception when it happens. This avoids reading SPRN_ESR at every data TLB error exception. Link: https://lkml.kernel.org/r/8525518e1657d6032b7e980c1888102828d66950.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/e500: encode hugepage size in PTE bits  (Christophe Leroy; 2 files, -15/+22)
Use PTE page size bits to encode hugepage size with the following format corresponding to the values expected in bits 52-55 in the MAS1 register. Those bits are called TSIZE:

0001    4 Kbyte
0010   16 Kbyte
0011   64 Kbyte
0100  256 Kbyte
0101    1 Mbyte
0110    4 Mbyte
0111   16 Mbyte
1000   64 Mbyte
1001  256 Mbyte
1010    1 Gbyte
1011    4 Gbyte
1100   16 Gbyte
1101   64 Gbyte
1110  256 Gbyte
1111    1 Tbyte

It corresponds to the shift value minus 10 with the lowest bit removed. It is not the value expected in the PTE in that field, but only e6500 performs HW based TLB loading and the e6500 reference manual explicitly says that this field is ignored. Also add pte_huge_size() which will be used later. Link: https://lkml.kernel.org/r/6f7ce82fa8c381d55f65342d77060fc55802e612.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
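For illustration, the shift-to-TSIZE relationship stated above can be written as a one-liner; this is a standalone sanity check of the mapping in the table, not kernel code:

    #include <stdio.h>

    /* TSIZE = (shift - 10) with the lowest bit dropped, per the table above. */
    static unsigned int shift_to_tsize(unsigned int shift)
    {
        return (shift - 10) >> 1;
    }

    int main(void)
    {
        /* 4K (shift 12) -> 1, 16M (shift 24) -> 7, 1G (shift 30) -> 10 */
        printf("%u %u %u\n", shift_to_tsize(12), shift_to_tsize(24), shift_to_tsize(30));
        return 0;
    }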
2024-07-12  powerpc/e500: switch to 64 bits PGD on 85xx (32 bits)  (Christophe Leroy; 2 files, -4/+10)
At the time being when CONFIG_PTE_64BIT is selected, PTE entries are 64 bits but PGD entries are still 32 bits. In order to allow leaf PMD entries, switch the PGD to 64 bits entries. Link: https://lkml.kernel.org/r/ca85397df02564e5edc3a3c27b55cf43af3e4ef3.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/e500: remove enc and ind fields from struct mmu_psize_def  (Christophe Leroy; 4 files, -14/+4)
The enc field is hidden behind BOOK3E_PAGESZ_XX macros, and when you look closer you realise that this field is nothing else than the value of shift minus ten. So remove the enc field and calculate tsize from the shift field. Also remove the ind field, which is unused. Link: https://lkml.kernel.org/r/e99136779b5b0829c2c60d37f305a1410c65cf9b.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/8xx: simplify struct mmu_psize_def  (Christophe Leroy; 1 file, -7/+2)
On 8xx, only the shift field is used in struct mmu_psize_def. Remove the other fields and related macros. Link: https://lkml.kernel.org/r/dd0587a9e8354005858c7f8c9a775ad05523b314.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/8xx: rework support for 8M pages using contiguous PTE entries  (Christophe Leroy; 12 files, -114/+111)
In order to fit better with the standard Linux page table layout, add support for 8M pages using contiguous PTE entries in a standard page table. Page tables will then be populated with 1024 similar entries and two PMD entries will point to that page table. The PMD entries also get a flag to tell that they address an 8M page; this is required for the HW tablewalk assistance. Link: https://lkml.kernel.org/r/8693d9a0408371043ca63bf9e4a9c140667af63e.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/8xx: fix size given to set_huge_pte_at()  (Christophe Leroy; 1 file, -1/+2)
set_huge_pte_at() expects the size of the hugepage as an int, not the psize, which is the index of the page definition in the mmu_psize_defs[] table. Link: https://lkml.kernel.org/r/97f2090011e25d99b6b0aae73e22e1b921c5d1fb.1719928057.git.christophe.leroy@csgroup.eu Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()") Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/mm: allow hugepages without hugepd  (Christophe Leroy; 7 files, -12/+41)
In preparation for implementing huge pages on powerpc 8xx without hugepd, enclose hugepd-related code inside an ifdef CONFIG_ARCH_HAS_HUGEPD. This also allows removing some stubs. Link: https://lkml.kernel.org/r/ada097ca8a4fa85a77f51719516ef2478800d77a.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/mm: fix __find_linux_pte() on 32 bits with PMD leaf entries  (Christophe Leroy; 1 file, -1/+10)
Building on 32 bits with pmd_leaf() not returning always false leads to the following error: CC arch/powerpc/mm/pgtable.o arch/powerpc/mm/pgtable.c: In function '__find_linux_pte': arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr] 506 | } | ^ arch/powerpc/mm/pgtable.c:394:15: note: declared here 394 | pud_t pud, *pudp; | ^~~ arch/powerpc/mm/pgtable.c:394:15: note: declared here This is due to pmd_offset() being a no-op in that case. So rework it for powerpc/32 so that pXd_offset() are used on real pointers and not on on-stack copies. Behind fixing the problem, it also has the advantage of simplifying __find_linux_pte() including the removal of stack frame: After this patch: 00000018 <__find_linux_pte>: 18: 2c 06 00 00 cmpwi r6,0 1c: 41 82 00 0c beq 28 <__find_linux_pte+0x10> 20: 39 20 00 00 li r9,0 24: 91 26 00 00 stw r9,0(r6) 28: 2f 85 00 00 cmpwi cr7,r5,0 2c: 41 9e 00 0c beq cr7,38 <__find_linux_pte+0x20> 30: 39 20 00 00 li r9,0 34: 99 25 00 00 stb r9,0(r5) 38: 54 89 65 3a rlwinm r9,r4,12,20,29 3c: 7c 63 48 2e lwzx r3,r3,r9 40: 2f 83 00 00 cmpwi cr7,r3,0 44: 41 9e 00 30 beq cr7,74 <__find_linux_pte+0x5c> 48: 54 69 07 3a rlwinm r9,r3,0,28,29 4c: 2f 89 00 0c cmpwi cr7,r9,12 50: 54 63 00 26 clrrwi r3,r3,12 54: 54 84 b5 36 rlwinm r4,r4,22,20,27 58: 3c 63 c0 00 addis r3,r3,-16384 5c: 7c 63 22 14 add r3,r3,r4 60: 4c be 00 20 bnelr+ cr7 64: 4d 82 00 20 beqlr 68: 39 20 00 17 li r9,23 6c: 91 26 00 00 stw r9,0(r6) 70: 4e 80 00 20 blr 74: 38 60 00 00 li r3,0 78: 4e 80 00 20 blr Before this patch: 00000018 <__find_linux_pte>: 18: 2c 06 00 00 cmpwi r6,0 1c: 94 21 ff e0 stwu r1,-32(r1) 20: 41 82 00 0c beq 2c <__find_linux_pte+0x14> 24: 39 20 00 00 li r9,0 28: 91 26 00 00 stw r9,0(r6) 2c: 2f 85 00 00 cmpwi cr7,r5,0 30: 41 9e 00 0c beq cr7,3c <__find_linux_pte+0x24> 34: 39 20 00 00 li r9,0 38: 99 25 00 00 stb r9,0(r5) 3c: 54 89 65 3a rlwinm r9,r4,12,20,29 40: 7c 63 48 2e lwzx r3,r3,r9 44: 54 69 07 3a rlwinm r9,r3,0,28,29 48: 2f 89 00 0c cmpwi cr7,r9,12 4c: 90 61 00 0c stw r3,12(r1) 50: 41 9e 00 4c beq cr7,9c <__find_linux_pte+0x84> 54: 80 61 00 0c lwz r3,12(r1) 58: 54 69 07 3a rlwinm r9,r3,0,28,29 5c: 2f 89 00 0c cmpwi cr7,r9,12 60: 90 61 00 08 stw r3,8(r1) 64: 41 9e 00 38 beq cr7,9c <__find_linux_pte+0x84> 68: 80 61 00 08 lwz r3,8(r1) 6c: 2f 83 00 00 cmpwi cr7,r3,0 70: 41 9e 00 54 beq cr7,c4 <__find_linux_pte+0xac> 74: 54 69 07 3a rlwinm r9,r3,0,28,29 78: 2f 89 00 0c cmpwi cr7,r9,12 7c: 54 69 00 26 clrrwi r9,r3,12 80: 54 8a b5 36 rlwinm r10,r4,22,20,27 84: 3c 69 c0 00 addis r3,r9,-16384 88: 7c 63 52 14 add r3,r3,r10 8c: 54 84 93 be srwi r4,r4,14 90: 41 9e 00 14 beq cr7,a4 <__find_linux_pte+0x8c> 94: 38 21 00 20 addi r1,r1,32 98: 4e 80 00 20 blr 9c: 54 69 00 26 clrrwi r9,r3,12 a0: 54 84 93 be srwi r4,r4,14 a4: 3c 69 c0 00 addis r3,r9,-16384 a8: 54 84 25 36 rlwinm r4,r4,4,20,27 ac: 7c 63 22 14 add r3,r3,r4 b0: 41 a2 ff e4 beq 94 <__find_linux_pte+0x7c> b4: 39 20 00 17 li r9,23 b8: 91 26 00 00 stw r9,0(r6) bc: 38 21 00 20 addi r1,r1,32 c0: 4e 80 00 20 blr c4: 38 60 00 00 li r3,0 c8: 38 21 00 20 addi r1,r1,32 cc: 4e 80 00 20 blr Link: https://lkml.kernel.org/r/50a3cfbab5b11890a0da027de5cb011a9d47ba89.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/mm: remove _PAGE_PSIZE  (Christophe Leroy; 4 files, -12/+3)
The _PAGE_PSIZE macro is never used outside the place it is defined and is only used on 8xx and e500. Remove the indirection: remove it and use its content directly. Link: https://lkml.kernel.org/r/c41da3b0ceda7311a50f0391cc4d54302ae15b74.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: provide mm_struct and address to huge_ptep_get()  (Christophe Leroy; 21 files, -53/+53)
On powerpc 8xx huge_ptep_get() will need to know whether the given ptep is a PTE entry or a PMD entry. This cannot be known with the PMD entry itself because there is no easy way to know it from the content of the entry. So huge_ptep_get() will need to know either the size of the page or get the pmd. In order to be consistent with huge_ptep_get_and_clear(), give mm and address to huge_ptep_get(). Link: https://lkml.kernel.org/r/cc00c70dd384298796a4e1b25d6c4eb306d3af85.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
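For illustration, the new calling convention looks like the sketch below; the generic fallback accepts but ignores the extra arguments, which only powerpc 8xx actually needs (simplified, may not match the tree exactly):

    /* Generic fallback: mm and addr are accepted but unused here. */
    static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
    {
        return ptep_get(ptep);
    }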
2024-07-12  mm: define __pte_leaf_size() to also take a PMD entry  (Christophe Leroy; 2 files, -1/+4)
On powerpc 8xx, when a page is 8M size, the information is in the PMD entry. So allow architectures to provide __pte_leaf_size() instead of pte_leaf_size() and provide the PMD entry to that function. When __pte_leaf_size() is not defined, define it as a pte_leaf_size() so that architectures not interested in the PMD arguments are not impacted. Only define a default pte_leaf_size() when __pte_leaf_size() is not defined to make sure nobody adds new calls to pte_leaf_size() in the core. Link: https://lkml.kernel.org/r/c7c008f0a314bf8029ad7288fdc908db1ec7e449.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <[email protected]> Reviewed-by: Oscar Salvador <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
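For illustration, the fallback described above amounts to something like this sketch (the real definition lives in the generic pgtable headers and may differ slightly):

    /* Architectures that don't care about the PMD just fall back. */
    #ifndef __pte_leaf_size
    #define __pte_leaf_size(pmd, pte)   pte_leaf_size(pte)
    #endif

    /* Core code then always calls the two-argument form: */
    size = __pte_leaf_size(pmd, pte);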
2024-07-12  powerpc/64e: drop unused TLB miss handlers  (Michael Ellerman; 3 files, -232/+2)
There are two possibilities for book3e_htw_mode, PPC_HTW_E6500 or PPC_HTW_NONE. The TLB miss handlers are patched to use, respectively:

  - exc_[data|instruction]_tlb_miss_e6500_book3e
  - exc_[data|instruction]_tlb_miss_bolted_book3e

Which means the default handlers are never used. Remove those, and use the bolted handlers (PPC_HTW_NONE) by default. Link: https://lkml.kernel.org/r/9a670adc1771fb1871fba93ace5372f7eadc286f.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64e: consolidate TLB miss handler patching  (Michael Ellerman; 1 file, -23/+15)
The 64e TLB miss handler patching is done in setup_mmu_htw(), and then again immediately afterward in early_init_mmu_global(). Consolidate it into a single location. Link: https://lkml.kernel.org/r/7033b37493fb48a3e5245b59d0a42afb75dabfc1.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64e: drop MMU_FTR_TYPE_FSL_E checks in 64-bit code  (Michael Ellerman; 2 files, -65/+38)
All 64-bit Book3E have MMU_FTR_TYPE_FSL_E, since A2 was removed, so remove checks for it in 64-bit only code. Link: https://lkml.kernel.org/r/2b0b0bc9752e6cece222e4e2050358da70bb631d.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64e: drop E500 ifdefs in 64-bit code  (Michael Ellerman; 1 file, -12/+0)
All 64-bit Book3E have E500=y, so drop the unneeded ifdefs. Link: https://lkml.kernel.org/r/7fb88809c88a1b774063eda602a9333079403f83.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64e: split out nohash Book3E 64-bit code  (Michael Ellerman; 3 files, -343/+363)
A reasonable chunk of nohash/tlb.c is 64-bit only code, split it out into a separate file. Link: https://lkml.kernel.org/r/cb2b118f9d8a86f82d01bfb9ad309d1d304480a1.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  powerpc/64e: remove unused IBM HTW code  (Michael Ellerman; 3 files, -253/+2)
Patch series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)", v7. Unlike most architectures, powerpc 8xx HW requires a two-level pagetable topology for all page sizes. So a leaf PMD-contig approach is not feasible as such. Possible sizes on 8xx are 4k, 16k, 512k and 8M. First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries must point to a single entry level-2 page table. Until now that was done using hugepd. This series changes it to use standard page tables where the entry is replicated 1024 times on each of the two pagetables refered by the two associated PMD entries for that 8M page. For e500 and book3s/64 there are less constraints because it is not tied to the HW assisted tablewalk like on 8xx, so it is easier to use leaf PMDs (and PUDs). On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at PMD level on e500/32 (mpc85xx) and mix of PMD and PUD for e500/64. We encode page size with 4 available bits in PTE entries. On e300/32 PGD entries size is increases to 64 bits in order to allow leaf-PMD entries because PTE are 64 bits on e500. On book3s/64 only the hash-4k mode is concerned. It supports 16M pages as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-6k and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf entries. The hash processing make things a bit more complex. To ease things, __hash_page_huge() is modified to bail out when DIRTY or ACCESSED bits are missing, leaving it to mm core to fix it. This patch (of 23): The nohash HTW_IBM (Hardware Table Walk) code is unused since support for A2 was removed in commit fb5a515704d7 ("powerpc: Remove platforms/ wsp and associated pieces") (2014). The remaining supported CPUs use either no HTW (data_tlb_miss_bolted), or the e6500 HTW (data_tlb_miss_e6500). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/820dd1385ecc931f07b0d7a0fa827b1613917ab6.1719928057.git.christophe.leroy@csgroup.eu Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Christophe Leroy <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  zsmalloc: rename class stat mutators  (Sergey Senozhatsky; 1 file, -19/+19)
A cosmetic change.

o Rename class_stat_inc() and class_stat_dec() to class_stat_add() and class_stat_sub() correspondingly. inc/dec are usually associated with +1/-1 modifications, while zsmalloc can modify stats by up to ->objs_per_zspage. Use add/sub (follow atomics naming).

o Rename zs_stat_get() to class_stat_read(). get() is usually associated with ref-counting and is paired with put(). zs_stat_get() simply reads a class stat, so rename it to reflect that. (This also follows atomics naming.)

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergey Senozhatsky <[email protected]> Reviewed-by: Chengming Zhou <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: add docs for per-order mTHP split counters  (Lance Yang; 1 file, -4/+15)
This commit introduces documentation for mTHP split counters in transhuge.rst. [[email protected]: improve the doc as suggested by Ryan] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: tweak Documentation/admin-guide/mm/transhuge.rst] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mingzhe Yang <[email protected]> Signed-off-by: Lance Yang <[email protected]> Reviewed-by: Barry Song <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Bang Li <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm: add per-order mTHP split counters  (Lance Yang; 2 files, -2/+13)
Patch series "mm: introduce per-order mTHP split counters", v3. At present, the split counters in THP statistics no longer include PTE-mapped mTHP. Therefore, we want to introduce per-order mTHP split counters to monitor the frequency of mTHP splits. This will assist developers in better analyzing and optimizing system performance. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats split split_failed split_deferred This patch (of 2): Currently, the split counters in THP statistics no longer include PTE-mapped mTHP. Therefore, we propose introducing per-order mTHP split counters to monitor the frequency of mTHP splits. This will help developers better analyze and optimize system performance. /sys/kernel/mm/transparent_hugepage/hugepages-<size>/stats split split_failed split_deferred [[email protected]: make things more readable, per Barry and Baolin] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: use == for `order' test, per David] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mingzhe Yang <[email protected]> Signed-off-by: Lance Yang <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Acked-by: Barry Song <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Bang Li <[email protected]> Cc: Yang Shi <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/zsmalloc: move record_obj() into obj_malloc()  (Chengming Zhou; 1 file, -9/+6)
We always record_obj() to make handle points to object after obj_malloc(), so simplify the code by moving record_obj() into obj_malloc(). There should be no functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Minchan Kim <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  mm/zsmalloc: clarify class per-fullness zspage counts  (Chengming Zhou; 1 file, -0/+1)
We always use insert_zspage() and remove_zspage() to update a zspage's fullness location, which accounts correctly. But this special async free path uses "splice" instead of remove_zspage(), so the per-fullness zspage count for ZS_INUSE_RATIO_0 won't decrease. Clean things up by decreasing it when iterating over the zspage free list. This doesn't actually fix anything. ZS_INUSE_RATIO_0 is just a "placeholder" which is never used anywhere. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Chengming Zhou <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  selftests/proc: add PROCMAP_QUERY ioctl tests  (Andrii Nakryiko; 2 files, -0/+87)
Extend the existing proc-pid-vm.c tests with the PROCMAP_QUERY ioctl() API. Test a few successful and negative cases, validating that query filtering and the exact vs. next-VMA logic work as expected. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  tools: sync uapi/linux/fs.h header into tools subdir  (Andrii Nakryiko; 1 file, -10/+170)
We need this UAPI header in tools/include subdirectory for using it from BPF selftests. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence  (Andrii Nakryiko; 1 file, -0/+9)
Call out PROCMAP_QUERY ioctl() existence in the section describing /proc/PID/maps file in documentation. We refer user to UAPI header for low-level details of this programmatic interface. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-12  fs/procfs: add build ID fetching to PROCMAP_QUERY API  (Andrii Nakryiko; 2 files, -2/+53)
The need to get an ELF build ID reliably is an important aspect when dealing with profiling and stack trace symbolization, and the /proc/<pid>/maps textual representation doesn't help with this. To get the backing file's ELF build ID, the application has to first resolve the VMA, then use its start/end address range to follow a special /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is necessary because the backing file might have been removed from the disk or already replaced with another binary in the same file path). Such an approach, beyond just adding the complexity of having to do a bunch of extra work, has extra security implications. Because the application opens the underlying ELF file and needs read access to its entire contents (as far as the kernel is concerned), the kernel puts additional capable() checks on following the /proc/<pid>/map_files/<start>-<end> symlink. And that makes sense in general. But in the case of build ID, the profiler/symbolizer doesn't need the contents of the ELF file, per se. It's only the build ID that is of interest, and the ELF build ID itself doesn't provide any sensitive information. So this patch adds a way to request the backing file's ELF build ID along with the rest of the VMA information in the same API. The user has control over whether this piece of information is requested or not by setting the build_id_size field either to zero or to the non-zero maximum size of the buffer provided through the build_id_addr field (which encodes a user pointer as a __u64 field). This is a completely optional piece of information, and so has no performance implications for use cases that don't care about build ID, while improving performance and simplifying the setup for those applications that do need it. The kernel already implements build ID fetching, which is used from the BPF subsystem. We are reusing this code here, but plan follow-up changes to make it work better under the more relaxed assumption (compared to what the existing code assumes) of being called from user process context, in which page faults are allowed. The BPF-specific implementation currently bails out if the necessary part of the ELF file is not paged in, all due to extra BPF-specific restrictions (like the need to fetch build ID in restrictive contexts such as NMI handler). [[email protected]: fix integer to pointer cast warning in do_procmap_query()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Liam R. Howlett <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Al Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
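For illustration, a minimal userspace query that asks for the build ID alongside the covering VMA. The struct and flag names follow my reading of the PROCMAP_QUERY uAPI in <linux/fs.h> (kernels that carry this series) and should be checked against your header:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>          /* struct procmap_query, PROCMAP_QUERY */

    int main(void)
    {
        static const char probe = 'x';     /* lives in the executable's image */
        char build_id[20];                 /* GNU build IDs are typically 20 bytes */
        struct procmap_query q;
        int fd = open("/proc/self/maps", O_RDONLY);

        if (fd < 0)
            return 1;

        memset(&q, 0, sizeof(q));
        q.size = sizeof(q);                           /* lets the kernel version the struct */
        q.query_addr = (unsigned long)&probe;         /* any address of interest */
        q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
        q.build_id_addr = (unsigned long)build_id;    /* optional: ask for the build ID */
        q.build_id_size = sizeof(build_id);

        if (ioctl(fd, PROCMAP_QUERY, &q) == 0)
            printf("vma %llx-%llx, build_id_size %u\n",
                   (unsigned long long)q.vma_start,
                   (unsigned long long)q.vma_end,
                   q.build_id_size);

        close(fd);
        return 0;
    }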