aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-05-27mm: fix is_pinnable_page against a cma pageMinchan Kim2-4/+13
Pages in the CMA area could have MIGRATE_ISOLATE as well as MIGRATE_CMA so the current is_pinnable_page() could miss CMA pages which have MIGRATE_ISOLATE. It ends up pinning CMA pages as longterm for the pin_user_pages() API so CMA allocations keep failing until the pin is released. CPU 0 CPU 1 - Task B cma_alloc alloc_contig_range pin_user_pages_fast(FOLL_LONGTERM) change pageblock as MIGRATE_ISOLATE internal_get_user_pages_fast lockless_pages_from_mm gup_pte_range try_grab_folio is_pinnable_page return true; So, pinned the page successfully. page migration failure with pinned page .. .. After 30 sec unpin_user_page(page) CMA allocation succeeded after 30 sec. The CMA allocation path protects the migration type change race using zone->lock but what GUP path need to know is just whether the page is on CMA area or not rather than exact migration type. Thus, we don't need zone->lock but just checks migration type in either of (MIGRATE_ISOLATE and MIGRATE_CMA). Adding the MIGRATE_ISOLATE check in is_pinnable_page could cause rejecting of pinning pages on MIGRATE_ISOLATE pageblocks even though it's neither CMA nor movable zone if the page is temporarily unmovable. However, such a migration failure by unexpected temporal refcount holding is general issue, not only come from MIGRATE_ISOLATE and the MIGRATE_ISOLATE is also transient state like other temporal elevated refcount problem. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Reviewed-by: John Hubbard <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Cc: David Hildenbrand <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm: filter out swapin error entry in shmem mappingMiaohe Lin2-1/+7
There might be swapin error entries in shmem mapping. Filter them out to avoid "Bad swap file entry" complaint. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/shmem: fix infinite loop when swap in shmem error at swapoff timeMiaohe Lin1-0/+39
When swap in shmem error at swapoff time, there would be a infinite loop in the while loop in shmem_unuse_inode(). It's because swapin error is deliberately ignored now and thus info->swapped will never reach 0. So we can't escape the loop in shmem_unuse(). In order to fix the issue, swapin_error entry is stored in the mapping when swapin error occurs. So the swapcache page can be freed and the user won't end up with a permanently mounted swap because a sector is bad. If the page is accessed later, the user process will be killed so that corrupted data is never consumed. On the other hand, if the page is never accessed, the user won't even notice it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reported-by: Naoya Horiguchi <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_rangeMiaohe Lin1-5/+8
Once the MADV_FREE operation has succeeded, callers can expect they might get zero-fill pages if accessing the memory again. Therefore it should be safe to delete the hwpoison entry and swapin error entry. There is no reason to kill the process if it has called MADV_FREE on the range. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Suggested-by: Alistair Popple <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Naoya Horiguchi <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: NeilBrown <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/swapfile: fix lost swap bits in unuse_pte()Miaohe Lin1-3/+7
This is observed by code review only but not any real report. When we turn off swapping we could have lost the bits stored in the swap ptes. The new rmap-exclusive bit is fine since that turned into a page flag, but not for soft-dirty and uffd-wp. Add them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Suggested-by: Peter Xu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Alistair Popple <[email protected]> Cc: David Howells <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: NeilBrown <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/swapfile: unuse_pte can map random data if swap read failsMiaohe Lin4-2/+31
Patch series "A few fixup patches for mm", v4. This series contains a few patches to avoid mapping random data if swap read fails and fix lost swap bits in unuse_pte. Also we free hwpoison and swapin error entry in madvise_free_pte_range and so on. More details can be found in the respective changelogs. This patch (of 5): There is a bug in unuse_pte(): when swap page happens to be unreadable, page filled with random data is mapped into user address space. In case of error, a special swap entry indicating swap read fails is set to the page table. So the swapcache page can be freed and the user won't end up with a permanently mounted swap because a sector is bad. And if the page is accessed later, the user process will be killed so that corrupted data is never consumed. On the other hand, if the page is never accessed, the user won't even notice it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: David Howells <[email protected]> Cc: NeilBrown <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Peter Xu <[email protected]> Cc: Ralph Campbell <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27selftests: memcg: factor out common parts of memory.{low,min} testsMichal Koutný1-163/+36
The memory protection test setup and runtime is almost equal for memory.low and memory.min cases. It makes modification of the common parts prone to mistakes, since the protections are similar not only in setup but also in principle, factor the common part out. Past exceptions between the tests: - missing memory.min is fine (kept), - test_memcg_low protected orphaned pagecache (adapted like test_memcg_min and we keep the processes of protected memory running). The evaluation in two tests is different (OOM of allocator vs low events of protégés), this is kept different. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Koutný <[email protected]> Acked-by: Roman Gushchin <[email protected]> CC: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Richard Palethorpe <[email protected]> Cc: David Vernet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27selftests: memcg: remove protection from top level memcgMichal Koutný1-7/+3
The reclaim is triggered by memory limit in a subtree, therefore the testcase does not need configured protection against external reclaim. Also, correct respective comments. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Koutný <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: David Vernet <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Richard Palethorpe <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27selftests: memcg: adjust expected reclaim values of protected cgroupsMichal Koutný3-12/+107
The numbers are not easy to derive in a closed form (certainly mere protections ratios do not apply), therefore use a simulation to obtain expected numbers. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Koutný <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: David Vernet <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Richard Palethorpe <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27selftests: memcg: expect no low events in unprotected siblingMichal Koutný1-1/+1
This is effectively a revert of commit cdc69458a5f3 ("cgroup: account for memory_recursiveprot in test_memcg_low()"). The case test_memcg_low will fail with memory_recursiveprot until resolved in reclaim code. However, this patch preserves the existing helpers and variables for later uses. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Koutný <[email protected]> Reviewed-by: David Vernet <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Richard Palethorpe <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27selftests: memcg: fix compilationMichal Koutný1-11/+14
Patch series "memcontrol selftests fixups", v2. Flushing the patches to make memcontrol selftests check the events behavior we had consensus about (test_memcg_low fails). (test_memcg_reclaim, test_memcg_swap_max fail for me now but it's present even before the refactoring.) The two bigger changes are: - adjustment of the protected values to make tests succeed with the given tolerance, - both test_memcg_low and test_memcg_min check protection of memory in populated cgroups (actually as per Documentation/admin-guide/cgroup-v2.rst memory.min should not apply to empty cgroups, which is not the case currently. Therefore I unified tests with the populated case in order to to bring more broken tests). This patch (of 5): This fixes mis-applied changes from commit 72b1e03aa725 ("cgroup: account for memory_localevents in test_memcg_oom_group_leaf_events()"). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michal Koutný <[email protected]> Reviewed-by: David Vernet <[email protected]> Acked-by: Roman Gushchin <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Richard Palethorpe <[email protected]> Cc: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: fix z3fold_page_migrate races with z3fold_mapMiaohe Lin1-4/+12
Think about the below scenario: CPU1 CPU2 z3fold_page_migrate z3fold_map z3fold_page_trylock ... z3fold_page_unlock /* slots still points to old zhdr*/ get_z3fold_header get slots from handle get old zhdr from slots z3fold_page_trylock return *old* zhdr encode_handle(new_zhdr, FIRST|LAST|MIDDLE) put_page(page) /* zhdr is freed! */ but zhdr is still used by caller! z3fold_map can map freed z3fold page and lead to use-after-free bug. To fix it, we add PAGE_MIGRATED to indicate z3fold page is migrated and soon to be released. So get_z3fold_header won't return such page. Link: https://lkml.kernel.org/r/[email protected] Fixes: 1f862989b04a ("mm/z3fold.c: support page migration") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: fix z3fold_reclaim_page races with z3fold_freeMiaohe Lin1-15/+3
Think about the below scenario: CPU1 CPU2 z3fold_reclaim_page z3fold_free spin_lock(&pool->lock) get_z3fold_header -- hold page_lock kref_get_unless_zero kref_put--zhdr->refcount can be 1 now !z3fold_page_trylock kref_put -- zhdr->refcount is 0 now release_z3fold_page WARN_ON(!list_empty(&zhdr->buddy)); -- we're on buddy now! spin_lock(&pool->lock); -- deadlock here! z3fold_reclaim_page might race with z3fold_free and will lead to pool lock deadlock and zhdr buddy non-empty warning. To fix this, defer getting the refcount until page_lock is held just like what __z3fold_alloc does. Note this has the side effect that we won't break the reclaim if we meet a soon to be released z3fold page now. Link: https://lkml.kernel.org/r/[email protected] Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: always clear PAGE_CLAIMED under z3fold page lockMiaohe Lin1-3/+3
Think about the below race window: CPU1 CPU2 z3fold_reclaim_page z3fold_free test_and_set_bit PAGE_CLAIMED failed to reclaim page z3fold_page_lock(zhdr); add back to the lru list; z3fold_page_unlock(zhdr); get_z3fold_header page_claimed=test_and_set_bit PAGE_CLAIMED clear_bit(PAGE_CLAIMED, &page->private); if (!page_claimed) /* it's false true */ free_handle is not called free_handle won't be called in this case. So z3fold_buddy_slots will leak. Fix it by always clear PAGE_CLAIMED under z3fold page lock. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: put z3fold page back into unbuddied list when reclaim or ↵Miaohe Lin1-0/+4
migration fails When doing z3fold page reclaim or migration, the page is removed from unbuddied list. If reclaim or migration succeeds, it's fine as page is released. But in case it fails, the page is not put back into unbuddied list now. The page will be leaked until next compaction work, reclaim or migration is done. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27revert "mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"Miaohe Lin1-5/+3
Revert commit f1549cb5ab2b ("mm/z3fold.c: allow __GFP_HIGHMEM in z3fold_alloc"). z3fold can't support GFP_HIGHMEM page now. page_address is used directly at all places. Moreover, z3fold_header is on per cpu unbuddied list which could be accessed anytime. So we should remove the support of GFP_HIGHMEM allocation for z3fold. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: throw warning on failure of trylock_page in z3fold_allocMiaohe Lin1-4/+3
If trylock_page fails, the page won't be non-lru movable page. When this page is freed via free_z3fold_page, it will trigger bug on PageMovable check in __ClearPageMovable. Throw warning on failure of trylock_page to guard against such rare case just as what zsmalloc does. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Cc: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: remove buggy use of stale list for allocationMiaohe Lin1-22/+1
Currently if z3fold couldn't find an unbuddied page it would first try to pull a page off the stale list. But this approach is problematic. If init z3fold page fails later, the page should be freed via free_z3fold_page to clean up the relevant resource instead of using __free_page directly. And if page is successfully reused, it will BUG_ON later in __SetPageMovable because it's already non-lru movable page, i.e. PAGE_MAPPING_MOVABLE is already set in page->mapping. In order to fix all of these issues, we can simply remove the buggy use of stale list for allocation because can_sleep should always be false and we never really hit the reusing code path now. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: fix possible null pointer dereferencingMiaohe Lin1-1/+11
alloc_slots could fail to allocate memory under heavy memory pressure. So we should check zhdr->slots against NULL to avoid future null pointer dereferencing. Link: https://lkml.kernel.org/r/[email protected] Fixes: fc5488651c7d ("z3fold: simplify freeing slots") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/z3fold: fix sheduling while atomicMiaohe Lin1-2/+1
Patch series "A few fixup patches for z3fold". This series contains a few fixup patches to fix sheduling while atomic, fix possible null pointer dereferencing, fix various race conditions and so on. More details can be found in the respective changelogs. This patch (of 9): z3fold's page_lock is always held when calling alloc_slots. So gfp should be GFP_ATOMIC to avoid "scheduling while atomic" bug. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: fc5488651c7d ("z3fold: simplify freeing slots") Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Wool <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm: split free page with properly free memory accounting and without raceZi Yan3-9/+29
In isolate_single_pageblock(), free pages are checked without holding zone lock, but they can go away in split_free_page() when zone lock is held. Check the free page and its order again in split_free_page() when zone lock is held. Recheck the page if the free page is gone under zone lock. In addition, in split_free_page(), the free page was deleted from the page list without changing free page accounting. Add the missing free page accounting code. Fix the type of order parameter in split_free_page(). Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity") Signed-off-by: Zi Yan <[email protected]> Reported-by: Doug Berger <[email protected]> Link: https://lore.kernel.org/linux-mm/[email protected]/ Cc: David Hildenbrand <[email protected]> Cc: Qian Cai <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Eric Ren <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Marek Szyprowski <[email protected]> Cc: Michael Walle <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm: page-isolation: skip isolated pageblock in start_isolate_page_range()Zi Yan1-8/+18
start_isolate_page_range() first isolates the first and the last pageblocks in the range and ensure pages across range boundaries are split during isolation. But it missed the case when the range is <= a pageblock and the first and the last pageblocks are the same one, so the second isolate_single_pageblock() will always fail. To fix it, skip the pageblock isolation in second isolate_single_pageblock(). Link: https://lkml.kernel.org/r/[email protected] Fixes: 88ee134320b8 ("mm: fix a potential infinite loop in start_isolate_page_range()") Signed-off-by: Zi Yan <[email protected]> Reported-by: Marek Szyprowski <[email protected]> Tested-by: Marek Szyprowski <[email protected]> Link: https://lore.kernel.org/linux-mm/[email protected]/ Reported-by: Michael Walle <[email protected]> Tested-by: Michael Walle <[email protected]> Link: https://lore.kernel.org/linux-mm/[email protected]/ Cc: Christophe Leroy <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Doug Berger <[email protected]> Cc: Eric Ren <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Qian Cai <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27tools arch x86: Sync the msr-index.h copy with the kernel sourcesArnaldo Carvalho de Melo1-0/+19
To pick up the changes in: db1af12929c99d15 ("x86/msr-index: Define INTEGRITY_CAPABILITIES MSR") 089be16d5992dd0b ("x86/msr: Add PerfCntrGlobal* registers") f52ba93190457aa2 ("tools/power turbostat: Add Power Limit4 support") Addressing these tools/perf build warnings: diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h Warning: Kernel ABI header at 'tools/arch/x86/include/asm/msr-index.h' differs from latest version at 'arch/x86/include/asm/msr-index.h' That makes the beautification scripts to pick some new entries: $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > before $ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > after $ diff -u before after --- before 2022-05-26 12:50:01.228612839 -0300 +++ after 2022-05-26 12:50:07.699776166 -0300 @@ -116,6 +116,7 @@ [0x0000026f] = "MTRRfix4K_F8000", [0x00000277] = "IA32_CR_PAT", [0x00000280] = "IA32_MC0_CTL2", + [0x000002d9] = "INTEGRITY_CAPS", [0x000002ff] = "MTRRdefType", [0x00000309] = "CORE_PERF_FIXED_CTR0", [0x0000030a] = "CORE_PERF_FIXED_CTR1", @@ -176,6 +177,7 @@ [0x00000586] = "IA32_RTIT_ADDR3_A", [0x00000587] = "IA32_RTIT_ADDR3_B", [0x00000600] = "IA32_DS_AREA", + [0x00000601] = "VR_CURRENT_CONFIG", [0x00000606] = "RAPL_POWER_UNIT", [0x0000060a] = "PKGC3_IRTL", [0x0000060b] = "PKGC6_IRTL", @@ -260,6 +262,10 @@ [0xc0000102 - x86_64_specific_MSRs_offset] = "KERNEL_GS_BASE", [0xc0000103 - x86_64_specific_MSRs_offset] = "TSC_AUX", [0xc0000104 - x86_64_specific_MSRs_offset] = "AMD64_TSC_RATIO", + [0xc000010f - x86_64_specific_MSRs_offset] = "AMD_DBG_EXTN_CFG", + [0xc0000300 - x86_64_specific_MSRs_offset] = "AMD64_PERF_CNTR_GLOBAL_STATUS", + [0xc0000301 - x86_64_specific_MSRs_offset] = "AMD64_PERF_CNTR_GLOBAL_CTL", + [0xc0000302 - x86_64_specific_MSRs_offset] = "AMD64_PERF_CNTR_GLOBAL_STATUS_CLR", }; #define x86_AMD_V_KVM_MSRs_offset 0xc0010000 @@ -318,4 +324,5 @@ [0xc00102b4 - x86_AMD_V_KVM_MSRs_offset] = "AMD_CPPC_STATUS", [0xc00102f0 - x86_AMD_V_KVM_MSRs_offset] = "AMD_PPIN_CTL", [0xc00102f1 - x86_AMD_V_KVM_MSRs_offset] = "AMD_PPIN", + [0xc0010300 - x86_AMD_V_KVM_MSRs_offset] = "AMD_SAMP_BR_FROM", }; $ Now one can trace systemwide asking to see backtraces to where those MSRs are being read/written, see this example with a previous update: # perf trace -e msr:*_msr/max-stack=32/ --filter="msr>=IA32_U_CET && msr<=IA32_INT_SSP_TAB" ^C# If we use -v (verbose mode) we can see what it does behind the scenes: # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr>=IA32_U_CET && msr<=IA32_INT_SSP_TAB" Using CPUID AuthenticAMD-25-21-0 0x6a0 0x6a8 New filter for msr:read_msr: (msr>=0x6a0 && msr<=0x6a8) && (common_pid != 597499 && common_pid != 3313) 0x6a0 0x6a8 New filter for msr:write_msr: (msr>=0x6a0 && msr<=0x6a8) && (common_pid != 597499 && common_pid != 3313) mmap size 528384B ^C# Example with a frequent msr: # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr==IA32_SPEC_CTRL" --max-events 2 Using CPUID AuthenticAMD-25-21-0 0x48 New filter for msr:read_msr: (msr==0x48) && (common_pid != 2612129 && common_pid != 3841) 0x48 New filter for msr:write_msr: (msr==0x48) && (common_pid != 2612129 && common_pid != 3841) mmap size 528384B Looking at the vmlinux_path (8 entries long) symsrc__init: build id mismatch for vmlinux. Using /proc/kcore for kernel data Using /proc/kallsyms for symbols 0.000 Timer/2525383 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6) do_trace_write_msr ([kernel.kallsyms]) do_trace_write_msr ([kernel.kallsyms]) __switch_to_xtra ([kernel.kallsyms]) __switch_to ([kernel.kallsyms]) __schedule ([kernel.kallsyms]) schedule ([kernel.kallsyms]) futex_wait_queue_me ([kernel.kallsyms]) futex_wait ([kernel.kallsyms]) do_futex ([kernel.kallsyms]) __x64_sys_futex ([kernel.kallsyms]) do_syscall_64 ([kernel.kallsyms]) entry_SYSCALL_64_after_hwframe ([kernel.kallsyms]) __futex_abstimed_wait_common64 (/usr/lib64/libpthread-2.33.so) 0.030 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL, val: 2) do_trace_write_msr ([kernel.kallsyms]) do_trace_write_msr ([kernel.kallsyms]) __switch_to_xtra ([kernel.kallsyms]) __switch_to ([kernel.kallsyms]) __schedule ([kernel.kallsyms]) schedule_idle ([kernel.kallsyms]) do_idle ([kernel.kallsyms]) cpu_startup_entry ([kernel.kallsyms]) secondary_startup_64_no_verify ([kernel.kallsyms]) # Cc: Adrian Hunter <[email protected]> Cc: Hans de Goede <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Len Brown <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sandipan Das <[email protected]> Cc: Sumeet Pawnikar <[email protected]> Cc: Tony Luck <[email protected]> Link: https://lore.kernel.org/lkml/Yo+i%[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2022-05-27perf scripts python: Support Arm CoreSight trace data disassemblyLeo Yan1-0/+272
This commit adds python script to parse CoreSight tracing event and print out source line and disassembly, it generates readable program execution flow for easier humans inspecting. The script receives CoreSight tracing packet with below format: +------------+------------+------------+ packet(n): | addr | ip | cpu | +------------+------------+------------+ packet(n+1): | addr | ip | cpu | +------------+------------+------------+ packet::addr presents the start address of the coming branch sample, and packet::ip is the last address of the branch smple. Therefore, a code section between branches starts from packet(n)::addr and it stops at packet(n+1)::ip. As results we combines the two continuous packets to generate the address range for instructions: [ sample(n)::addr .. sample(n+1)::ip ] The script supports both objdump or llvm-objdump for disassembly with specifying option '-d'. If doesn't specify option '-d', the script simply outputs source lines and symbols. Below shows usages with llvm-objdump or objdump to output disassembly. # perf script -s scripts/python/arm-cs-trace-disasm.py -- -d llvm-objdump-11 -k ./vmlinux ARM CoreSight Trace Data Assembler Dump ffff800008eb3198 <etm4_enable_hw>: ffff800008eb3310: c0 38 00 35 cbnz w0, 0xffff800008eb3a28 <etm4_enable_hw+0x890> ffff800008eb3314: 9f 3f 03 d5 dsb sy ffff800008eb3318: df 3f 03 d5 isb ffff800008eb331c: f5 5b 42 a9 ldp x21, x22, [sp, #32] ffff800008eb3320: fb 73 45 a9 ldp x27, x28, [sp, #80] ffff800008eb3324: e0 82 40 39 ldrb w0, [x23, #32] ffff800008eb3328: 60 00 00 34 cbz w0, 0xffff800008eb3334 <etm4_enable_hw+0x19c> ffff800008eb332c: e0 03 19 aa mov x0, x25 ffff800008eb3330: 8c fe ff 97 bl 0xffff800008eb2d60 <etm4_cs_lock.isra.0.part.0> main 6728/6728 [0004] 0.000000000 etm4_enable_hw+0x198 [kernel.kallsyms] ffff800008eb2d60 <etm4_cs_lock.isra.0.part.0>: ffff800008eb2d60: 1f 20 03 d5 nop ffff800008eb2d64: 1f 20 03 d5 nop ffff800008eb2d68: 3f 23 03 d5 hint #25 ffff800008eb2d6c: 00 00 40 f9 ldr x0, [x0] ffff800008eb2d70: 9f 3f 03 d5 dsb sy ffff800008eb2d74: 00 c0 3e 91 add x0, x0, #4016 ffff800008eb2d78: 1f 00 00 b9 str wzr, [x0] ffff800008eb2d7c: bf 23 03 d5 hint #29 ffff800008eb2d80: c0 03 5f d6 ret main 6728/6728 [0004] 0.000000000 etm4_cs_lock.isra.0.part.0+0x20 # perf script -s scripts/python/arm-cs-trace-disasm.py -- -d objdump -k ./vmlinux ARM CoreSight Trace Data Assembler Dump ffff800008eb3310 <etm4_enable_hw+0x178>: ffff800008eb3310: 350038c0 cbnz w0, ffff800008eb3a28 <etm4_enable_hw+0x890> ffff800008eb3314: d5033f9f dsb sy ffff800008eb3318: d5033fdf isb ffff800008eb331c: a9425bf5 ldp x21, x22, [sp, #32] ffff800008eb3320: a94573fb ldp x27, x28, [sp, #80] ffff800008eb3324: 394082e0 ldrb w0, [x23, #32] ffff800008eb3328: 34000060 cbz w0, ffff800008eb3334 <etm4_enable_hw+0x19c> ffff800008eb332c: aa1903e0 mov x0, x25 ffff800008eb3330: 97fffe8c bl ffff800008eb2d60 <etm4_cs_lock.isra.0.part.0> main 6728/6728 [0004] 0.000000000 etm4_enable_hw+0x198 [kernel.kallsyms] ffff800008eb2d60 <etm4_cs_lock.isra.0.part.0>: ffff800008eb2d60: d503201f nop ffff800008eb2d64: d503201f nop ffff800008eb2d68: d503233f paciasp ffff800008eb2d6c: f9400000 ldr x0, [x0] ffff800008eb2d70: d5033f9f dsb sy ffff800008eb2d74: 913ec000 add x0, x0, #0xfb0 ffff800008eb2d78: b900001f str wzr, [x0] ffff800008eb2d7c: d50323bf autiasp ffff800008eb2d80: d65f03c0 ret main 6728/6728 [0004] 0.000000000 etm4_cs_lock.isra.0.part.0+0x20 Signed-off-by: Leo Yan <[email protected]> Co-authored-by: Al Grant <[email protected]> Co-authored-by: Mathieu Poirier <[email protected]> Co-authored-by: Tor Jeremiassen <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Eelco Chaudron <[email protected]> Cc: German Gomez <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephen Brennan <[email protected]> Cc: Tanmay Jagdale <[email protected]> Cc: [email protected] Cc: zengshun . wu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2022-05-27perf scripting python: Expose dso and map informationLeo Yan1-4/+17
This change adds dso build_id and corresponding map's start and end address. The info of dso build_id can be used to find dso file path, and we can validate if a branch address falls into the range of map's start and end addresses. In addition, the map's start address can be used as an offset for disassembly. Signed-off-by: Leo Yan <[email protected]> Acked-by: Adrian Hunter <[email protected]> Cc: Al Grant <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Eelco Chaudron <[email protected]> Cc: German Gomez <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephen Brennan <[email protected]> Cc: Tanmay Jagdale <[email protected]> Cc: [email protected] Cc: zengshun . wu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2022-05-27perf jevents: Fix event syntax error caused by ExtSelZhengjun Xing1-1/+1
In the origin code, when "ExtSel" is 1, the eventcode will change to "eventcode |= 1 << 21”. For event “UNC_Q_RxL_CREDITS_CONSUMED_VN0.DRS", its "ExtSel" is "1", its eventcode will change from 0x1E to 0x20001E, but in fact the eventcode should <=0x1FF, so this will cause the parse fail: # perf stat -e "UNC_Q_RxL_CREDITS_CONSUMED_VN0.DRS" -a sleep 0.1 event syntax error: '.._RxL_CREDITS_CONSUMED_VN0.DRS' \___ value too big for format, maximum is 511 On the perf kernel side, the kernel assumes the valid bits are continuous. It will adjust the 0x100 (bit 8 for perf tool) to bit 21 in HW. DEFINE_UNCORE_FORMAT_ATTR(event_ext, event, "config:0-7,21"); So the perf tool follows the kernel side and just set bit8 other than bit21. Fixes: fedb2b518239cbc0 ("perf jevents: Add support for parsing uncore json files") Reviewed-by: Kan Liang <[email protected]> Signed-off-by: Xing Zhengjun <[email protected]> Acked-by: Ian Rogers <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2022-05-27perf tools arm64: Add support for VG registerJames Clark2-0/+40
Add the name of the VG register so it can be used in --user-regs The event will fail to open if the register is requested but not available so only add it to the mask if the kernel supports sve and also if it supports that specific register. Committer notes: Add conditional definition of HWCAP_SVE, as suggested by Leo Yan, to build on older systems where this is not available in the system headers. Reviewed-by: Leo Yan <[email protected]> Signed-off-by: James Clark <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: German Gomez <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Garry <[email protected]> Cc: Mark Brown <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Poirier <[email protected]> Cc: Mike Leach <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2022-05-27mm/page_table_check: fix accessing unmapped ptepMiaohe Lin1-1/+1
ptep is unmapped too early, so ptep could theoretically be accessed while it's unmapped. This might become a problem if/when CONFIG_HIGHPTE becomes available on riscv. Fix it by deferring pte_unmap() until page table checking is done. [[email protected]: account for ptep alteration, per Matthew] Link: https://lkml.kernel.org/r/[email protected] Fixes: 80110bbfbba6 ("mm/page_table_check: check entries at pmd levels") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Cc: Qi Zheng <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27kexec_file: drop weak attribute from arch_kexec_apply_relocations[_add]Naveen N. Rao4-42/+56
Since commit d1bcae833b32f1 ("ELF: Don't generate unused section symbols") [1], binutils (v2.36+) started dropping section symbols that it thought were unused. This isn't an issue in general, but with kexec_file.c, gcc is placing kexec_arch_apply_relocations[_add] into a separate .text.unlikely section and the section symbol ".text.unlikely" is being dropped. Due to this, recordmcount is unable to find a non-weak symbol in .text.unlikely to generate a relocation record against. Address this by dropping the weak attribute from these functions. Instead, follow the existing pattern of having architectures #define the name of the function they want to override in their headers. [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1 [[email protected]: arch/s390/include/asm/kexec.h needs linux/module.h] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Signed-off-by: Naveen N. Rao <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27mm/page_alloc: always attempt to allocate at least one page during bulk ↵Mel Gorman1-2/+2
allocation Peter Pavlisko reported the following problem on kernel bugzilla 216007. When I try to extract an uncompressed tar archive (2.6 milion files, 760.3 GiB in size) on newly created (empty) XFS file system, after first low tens of gigabytes extracted the process hangs in iowait indefinitely. One CPU core is 100% occupied with iowait, the other CPU core is idle (on 2-core Intel Celeron G1610T). It was bisected to c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for buffers") but XFS is only the messenger. The problem is that nothing is waking kswapd to reclaim some pages at a time the PCP lists cannot be refilled until some reclaim happens. The bulk allocator checks that there are some pages in the array and the original intent was that a bulk allocator did not necessarily need all the requested pages and it was best to return as quickly as possible. This was fine for the first user of the API but both NFS and XFS require the requested number of pages be available before making progress. Both could be adjusted to call the page allocator directly if a bulk allocation fails but it puts a burden on users of the API. Adjust the semantics to attempt at least one allocation via __alloc_pages() before returning so kswapd is woken if necessary. It was reported via bugzilla that the patch addressed the problem and that the tar extraction completed successfully. This may also address bug 215975 but has yet to be confirmed. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007 BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975 Link: https://lkml.kernel.org/r/[email protected] Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator") Signed-off-by: Mel Gorman <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Jan Kara <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Cc: Chuck Lever <[email protected]> Cc: <[email protected]> [5.13+] Signed-off-by: Andrew Morton <[email protected]>
2022-05-27hugetlb: fix huge_pmd_unshare address updateMike Kravetz1-1/+8
The routine huge_pmd_unshare() is passed a pointer to an address associated with an area which may be unshared. If unshare is successful this address is updated to 'optimize' callers iterating over huge page addresses. For the optimization to work correctly, address should be updated to the last huge page in the unmapped/unshared area. However, in the common case where the passed address is PUD_SIZE aligned, the address is incorrectly updated to the address of the preceding huge page. That wastes CPU cycles as the unmapped/unshared range is scanned twice. Link: https://lkml.kernel.org/r/[email protected] Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <[email protected]> Acked-by: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-05-27powerpc/64: Include cache.h directly in paca.hMichael Ellerman1-0/+1
paca.h uses ____cacheline_aligned without directly including cache.h, where it's defined. For Book3S builds that's OK because paca.h includes lppaca.h, and it does include cache.h. But Book3E builds have been getting cache.h indirectly via printk.h, which is dicey, and in fact that include was recently removed, leading to build errors such as: ld: fs/isofs/dir.o:(.bss+0x0): multiple definition of `____cacheline_aligned'; fs/isofs/namei.o:(.bss+0x0): first defined here So include cache.h directly to fix the build error. Fixes: 534aa1dc975a ("printk: stop including cache.h from printk.h") Reported-by: Guenter Roeck <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2022-05-27Revert "printk: wake up all waiters"John Ogness1-1/+1
This reverts commit 938ba4084abcf6fdd21d9078513c52f8fb9b00d0. The wait queue @log_wait never has exclusive waiters, so there is no need to use wake_up_interruptible_all(). Using wake_up_interruptible() was the correct function to wake all waiters. Since there are no exclusive waiters, erroneously changing wake_up_interruptible() to wake_up_interruptible_all() did not result in any behavior change. However, using wake_up_interruptible_all() on a wait queue without exclusive waiters is fundamentally wrong. Go back to using wake_up_interruptible() to wake all waiters. Signed-off-by: John Ogness <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-26Merge tag 'for-5.19/dm-changes' of ↵Linus Torvalds15-287/+409
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set. This change improves DM's hipri bio polling (REQ_POLLED) performance by 7 - 20% depending on the system. - Update DM core to use jump_labels to further reduce cost of unlikely branches for zoned block devices, dm-stats and swap_bios throttling. - Various DM core changes to reduce bio-based DM overhead and simplify IO accounting. - Fundamental DM core improvements to dm_io reference counting and the elimination of using bio_split()+bio_chain() -- instead DM's bio-based IO accounting is updated to account that a split occurred. - Improve DM core's abnormal bio processing to do less work. - Improve DM core's hipri polling support to use a single list rather than an hlist. - Update DM core to pass NULL bdev to bio_alloc_clone() so that initialization that isn't useful for DM can be elided. - Add cond_resched to DM stats' various loops that loop over all entries. - Fix incorrect error code return from DM integrity's constructor. - Make DM crypt's printing of the key constant-time. - Update bio-based DM multipath to provide high-resolution timer to the Historical Service Time (HST) path selector. * tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm: pass NULL bdev to bio_alloc_clone dm cache metadata: remove unnecessary variable in __dump_mapping dm mpath: provide high-resolution timer to HST for bio-based dm crypt: make printing of the key constant-time dm integrity: fix error code in dm_integrity_ctr() dm stats: add cond_resched when looping over entries dm: improve abnormal bio processing dm: simplify bio-based IO accounting further dm: put all polled dm_io instances into a single list dm: improve dm_io reference counting dm: don't grab target io reference in dm_zone_map_bio dm: improve bio splitting and associated IO accounting dm: switch to bdev based IO accounting interfaces dm: pass dm_io instance to dm_io_acct directly dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct dm: use bio_sectors in dm_aceept_partial_bio dm: simplify basic targets dm: conditionally enable branching for less used features dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio dm: move hot dm_io members to same cacheline as dm_target_io ...
2022-05-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds88-2002/+1968
Pull rdma updates from Jason Gunthorpe: "Small collection of incremental improvement patches: - Minor code cleanup patches, comment improvements, etc from static tools - Clean the some of the kernel caps, reducing the historical stealth uAPI leftovers - Bug fixes and minor changes for rdmavt, hns, rxe, irdma - Remove unimplemented cruft from rxe - Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer - flush_workqueue(system_unbound_wq) removal - Ensure rxe waits for objects to be unused before allowing the core to free them - Several rc quality bug fixes for hfi1" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits) RDMA/rtrs-clt: Fix one kernel-doc comment RDMA/hfi1: Remove all traces of diagpkt support RDMA/hfi1: Consolidate software versions RDMA/hfi1: Remove pointless driver version RDMA/hfi1: Fix potential integer multiplication overflow errors RDMA/hfi1: Prevent panic when SDMA is disabled RDMA/hfi1: Prevent use of lock before it is initialized RDMA/rxe: Fix an error handling path in rxe_get_mcg() IB/core: Fix typo in comment RDMA/core: Fix typo in comment IB/hf1: Fix typo in comment IB/qib: Fix typo in comment IB/iser: Fix typo in comment RDMA/mlx4: Avoid flush_scheduled_work() usage IB/isert: Avoid flush_scheduled_work() usage RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr() RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq() RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx() RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx() RDMA/irdma: Add SW mechanism to generate completions on error ...
2022-05-26Merge tag 'hardening-v5.19-rc1-fix1' of ↵Linus Torvalds6-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kernel hardening fix from Kees Cook: "This fixes an unlucky build race condition when using the GCC plugins, noticed by a few folks. - Avoid GCC plugins needing utsrelease.h build target (Masahiro Yamada)" * tag 'hardening-v5.19-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: use KERNELVERSION for plugin version
2022-05-26Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds30-426/+985
Pull nfsd updates from Chuck Lever: "We introduce 'courteous server' in this release. Previously NFSD would purge open and lock state for an unresponsive client after one lease period (typically 90 seconds). Now, after one lease period, another client can open and lock those files and the unresponsive client's lease is purged; otherwise if the unresponsive client's open and lock state is uncontended, the server retains that open and lock state for up to 24 hours, allowing the client's workload to resume after a lengthy network partition. A longstanding issue with NFSv4 file creation is also addressed. Previously a file creation can fail internally, returning an error to the client, but leave the newly created file in place as an artifact. The file creation code path has been reorganized so that internal failures and race conditions are less likely to result in an unwanted file creation. A fault injector has been added to help exercise paths that are run during kernel metadata cache invalidation. These caches contain information maintained by user space about exported filesystems. Many of our test workloads do not trigger cache invalidation. There is one patch that is needed to support PREEMPT_RT and a fix for an ancient 'sleep while spin-locked' splat that seems to have become easier to hit since v5.18-rc3" * tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits) NFSD: nfsd_file_put() can sleep NFSD: Add documenting comment for nfsd4_release_lockowner() NFSD: Modernize nfsd4_release_lockowner() NFSD: Fix possible sleep during nfsd4_release_lockowner() nfsd: destroy percpu stats counters after reply cache shutdown nfsd: Fix null-ptr-deref in nfsd_fill_super() nfsd: Unregister the cld notifier when laundry_wq create failed SUNRPC: Use RMW bitops in single-threaded hot paths NFSD: Clean up the show_nf_flags() macro NFSD: Trace filecache opens NFSD: Move documenting comment for nfsd4_process_open2() NFSD: Fix whitespace NFSD: Remove dprintk call sites from tail of nfsd4_open() NFSD: Instantiate a struct file when creating a regular NFSv4 file NFSD: Clean up nfsd_open_verified() NFSD: Remove do_nfsd_create() NFSD: Refactor NFSv4 OPEN(CREATE) NFSD: Refactor NFSv3 CREATE NFSD: Refactor nfsd_create_setattr() NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create() ...
2022-05-26tracing: Fix comments for event_trigger_separate_filter()sunliming1-4/+4
The parameter name in comments of event_trigger_separate_filter() is inconsistent with actual parameter name, fix it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26x86/traceponit: Fix comment about irq vector tracepointssunliming1-3/+0
Commit: 4b9a8dca0e58 ("x86/idt: Remove the tracing IDT completely") removed the 'tracing IDT' from arch/x86/kernel/tracepoint.c, but left related comment. So that the comment become anachronistic. Just remove the comment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26x86,tracing: Remove unused headerssunliming1-3/+0
Commit 4b9a8dca0e58 ("x86/idt: Remove the tracing IDT completely") removed the tracing IDT from the file arch/x86/kernel/tracepoint.c, but left the related headers unused, remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26ftrace: Clean up hash direct_functions on register failuresSong Liu1-3/+2
We see the following GPF when register_ftrace_direct fails: [ ] general protection fault, probably for non-canonical address \ 0x200000000000010: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [...] [ ] RIP: 0010:ftrace_find_rec_direct+0x53/0x70 [ ] Code: 48 c1 e0 03 48 03 42 08 48 8b 10 31 c0 48 85 d2 74 [...] [ ] RSP: 0018:ffffc9000138bc10 EFLAGS: 00010206 [ ] RAX: 0000000000000000 RBX: ffffffff813e0df0 RCX: 000000000000003b [ ] RDX: 0200000000000000 RSI: 000000000000000c RDI: ffffffff813e0df0 [ ] RBP: ffffffffa00a3000 R08: ffffffff81180ce0 R09: 0000000000000001 [ ] R10: ffffc9000138bc18 R11: 0000000000000001 R12: ffffffff813e0df0 [ ] R13: ffffffff813e0df0 R14: ffff888171b56400 R15: 0000000000000000 [ ] FS: 00007fa9420c7780(0000) GS:ffff888ff6a00000(0000) knlGS:000000000 [ ] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ ] CR2: 000000000770d000 CR3: 0000000107d50003 CR4: 0000000000370ee0 [ ] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ ] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ ] Call Trace: [ ] <TASK> [ ] register_ftrace_direct+0x54/0x290 [ ] ? render_sigset_t+0xa0/0xa0 [ ] bpf_trampoline_update+0x3f5/0x4a0 [ ] ? 0xffffffffa00a3000 [ ] bpf_trampoline_link_prog+0xa9/0x140 [ ] bpf_tracing_prog_attach+0x1dc/0x450 [ ] bpf_raw_tracepoint_open+0x9a/0x1e0 [ ] ? find_held_lock+0x2d/0x90 [ ] ? lock_release+0x150/0x430 [ ] __sys_bpf+0xbd6/0x2700 [ ] ? lock_is_held_type+0xd8/0x130 [ ] __x64_sys_bpf+0x1c/0x20 [ ] do_syscall_64+0x3a/0x80 [ ] entry_SYSCALL_64_after_hwframe+0x44/0xae [ ] RIP: 0033:0x7fa9421defa9 [ ] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 9 f8 [...] [ ] RSP: 002b:00007ffed743bd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000141 [ ] RAX: ffffffffffffffda RBX: 00000000069d2480 RCX: 00007fa9421defa9 [ ] RDX: 0000000000000078 RSI: 00007ffed743bd80 RDI: 0000000000000011 [ ] RBP: 00007ffed743be00 R08: 0000000000bb7270 R09: 0000000000000000 [ ] R10: 00000000069da210 R11: 0000000000000246 R12: 0000000000000001 [ ] R13: 00007ffed743c4b0 R14: 00000000069d2480 R15: 0000000000000001 [ ] </TASK> [ ] Modules linked in: klp_vm(OK) [ ] ---[ end trace 0000000000000000 ]--- One way to trigger this is: 1. load a livepatch that patches kernel function xxx; 2. run bpftrace -e 'kfunc:xxx {}', this will fail (expected for now); 3. repeat #2 => gpf. This is because the entry is added to direct_functions, but not removed. Fix this by remove the entry from direct_functions when register_ftrace_direct fails. Also remove the last trailing space from ftrace.c, so we don't have to worry about it anymore. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 763e34e74bb7 ("ftrace: Add register_ftrace_direct()") Signed-off-by: Song Liu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing: Fix comments of create_filter()sunliming1-1/+1
The name in comments of parameter "filter_string" in function create_filter is annotated as "filter_str", just fix it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing: Disable kcov on trace_preemptirq.cCongyu Liu1-0/+4
Functions in trace_preemptirq.c could be invoked from early interrupt code that bypasses kcov trace function's in_task() check. Disable kcov on this file to reduce random code coverage. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Dmitry Vyukov <[email protected]> Signed-off-by: Congyu Liu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing: Initialize integer variable to prevent garbage return valueGautam Menghani1-1/+1
Initialize the integer variable to 0 to fix the clang scan warning: Undefined or garbage value returned to caller [core.uninitialized.UndefReturn] return ret; Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 8993665abcce ("tracing/boot: Support multiple handlers for per-event histogram") Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Gautam Menghani <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26ftrace: Fix typo in commentJulia Lawall1-1/+1
Spelling mistake (triple letters) in comment. Detected with the help of Coccinelle. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Julia Lawall <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26ftrace: Remove return value of ftrace_arch_modify_*()Li kunyu6-28/+13
All instances of the function ftrace_arch_modify_prepare() and ftrace_arch_modify_post_process() return zero. There's no point in checking their return value. Just have them be void functions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li kunyu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing: Cleanup code by removing init "char *name"liqiong1-3/+1
The pointer is assigned to "type->name" anyway. no need to initialize with "preemption". Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: liqiong <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing: Change "char *" string form to "char []"liqiong2-2/+2
The "char []" string form declares a single variable. It is better than "char *" which creates two variables in the final assembly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: liqiong <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing/timerlat: Do not wakeup the thread if the trace stops at the IRQDaniel Bristot de Oliveira1-0/+2
There is no need to wakeup the timerlat/ thread if stop tracing is hit at the timerlat's IRQ handler. Return before waking up timerlat's thread. Link: https://lkml.kernel.org/r/b392356c91b56aedd2b289513cc56a84cf87e60d.1652175637.git.bristot@kernel.org Cc: Juri Lelli <[email protected]> Cc: Clark Williams <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2022-05-26tracing/timerlat: Print stacktrace in the IRQ handler if neededDaniel Bristot de Oliveira2-2/+16
If print_stack and stop_tracing_us are set, and stop_tracing_us is hit with latency higher than or equal to print_stack, print the stack at the IRQ handler as it is useful to define the root cause for the IRQ latency. Link: https://lkml.kernel.org/r/fd04530ce98ae9270e41bb124ee5bf67b05ecfed.1652175637.git.bristot@kernel.org Cc: Juri Lelli <[email protected]> Cc: Clark Williams <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>