path: root/arch/x86
Age | Commit message | Author | Files | Lines (-/+)
2024-09-03 | x86/numa_emu: use a helper function to get MAX_DMA32_PFN | Mike Rapoport (Microsoft) | 3 files, -2/+8
This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
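For readers following the series: the helper presumably just wraps the x86-only constant so the soon-to-be-generic emulation code no longer references it directly. A minimal sketch, with the helper name assumed rather than quoted from the patch:

    /* Hypothetical helper shape; the name in the actual patch may differ. */
    u64 __init numa_emu_dma_end(void)
    {
            /* Keep the x86-specific MAX_DMA32_PFN detail out of generic code. */
            return PFN_PHYS(MAX_DMA32_PFN);
    }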
2024-09-03 | x86/numa_emu: split __apicid_to_node update to a helper function | Mike Rapoport (Microsoft) | 3 files, -13/+25
This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03 | x86/numa_emu: simplify allocation of phys_dist | Mike Rapoport (Microsoft) | 1 file, -6/+2
By the time numa_emulation() is called, all physical memory is already mapped in the direct map and there is no need to define limits for memblock allocation. Replace memblock_phys_alloc_range() with memblock_alloc(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
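The shape of the change is roughly the following sketch (not the literal diff; variable names assumed). Note that memblock_phys_alloc_range() returns a physical address while memblock_alloc() returns a mapped virtual address, so the caller is adjusted accordingly:

    /* Before: allocation constrained below the last mapped page. */
    phys = memblock_phys_alloc_range(phys_size, PAGE_SIZE, 0,
                                     PFN_PHYS(max_pfn_mapped));

    /* After: the direct map is fully populated here, so a plain alloc suffices. */
    phys_dist = memblock_alloc(phys_size, PAGE_SIZE);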
2024-09-03 | x86/numa: move FAKE_NODE_* defines to numa_emu | Mike Rapoport (Microsoft) | 2 files, -2/+3
The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are only used by numa emulation code, make them local to arch/x86/mm/numa_emulation.c Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03 | x86/numa: use get_pfn_range_for_nid to verify that node spans memory | Mike Rapoport (Microsoft) | 1 file, -10/+7
Instead of looping over the numa_meminfo array to detect a node's start and end addresses, use get_pfn_range_for_nid(). This is shorter and makes it easier to lift numa_memblks to generic code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03 | x86/numa: simplify numa_distance allocation | Mike Rapoport (Microsoft) | 1 file, -5/+2
Allocation of numa_distance uses memblock_phys_alloc_range() to limit allocation to be below the last mapped page. But NUMA initialization runs after the direct map is populated and there is also code in setup_arch() that adjusts the memblock limit to reflect how much memory is already mapped in the direct map. Simplify the allocation of numa_distance and use plain memblock_alloc(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03 | arch, mm: pull out allocation of NODE_DATA to generic code | Mike Rapoport (Microsoft) | 1 file, -33/+1
Architectures that support NUMA duplicate the code that allocates NODE_DATA on the node-local memory with slight variations in reporting of the addresses where the memory was allocated. Use x86 version as the basis for the generic alloc_node_data() function and call this function in architecture specific numa initialization. Round up node data size to SMP_CACHE_BYTES rather than to PAGE_SIZE like x86 used to do since the bootmem era when allocation granularity was PAGE_SIZE anyway. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
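A sketch of what the generic allocator might look like, following the description above (node-local allocation, SMP_CACHE_BYTES rounding); the exact body is an assumption based on the x86 version it is said to be derived from:

    void __init alloc_node_data(int nid)
    {
            const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
            u64 nd_pa;

            /* Prefer node-local memory; memblock falls back to any node. */
            nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
            if (!nd_pa)
                    panic("Cannot allocate %zu bytes for node %d data\n",
                          nd_size, nid);

            node_data[nid] = __va(nd_pa);
            memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
    }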
2024-09-03 | arch, mm: move definition of node_data to generic code | Mike Rapoport (Microsoft) | 5 files, -44/+1
Every architecture that supports NUMA defines node_data in the same way: struct pglist_data *node_data[MAX_NUMNODES]; No reason to keep multiple copies of this definition and its forward declarations, especially when such forward declaration is the only thing in include/asm/mmzone.h for many architectures. Add definition and declaration of node_data to generic code and drop architecture-specific versions. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
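The duplicated definition being centralized is essentially just the following pair, presumably with the usual export so modules keep building:

    struct pglist_data *node_data[MAX_NUMNODES];
    EXPORT_SYMBOL(node_data);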
2024-09-04 | dma-mapping: clearly mark DMA ops as an architecture feature | Christoph Hellwig | 1 file, -1/+1
DMA ops are a helper for architectures and not for drivers to override the DMA implementation. Unfortunately driver authors keep ignoring this. Make the fact more clear by renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers overriding their dma_ops depend on that. These drivers should probably be marked broken, but we can give them a bit of a grace period for that. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Sakari Ailus <[email protected]> # for IPU6 Acked-by: Robin Murphy <[email protected]>
2024-09-03 | x86/cpu/intel: Replace PAT erratum model/family magic numbers with symbolic IFM references | Dave Hansen | 2 files, -8/+12
There's an erratum that prevents the PAT from working correctly: https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf # Document 316515 Version 010 The kernel currently disables PAT support on those CPUs, but it does it with some magic numbers. Replace the magic numbers with the new "IFM" macros. Make the check refer to the last affected CPU (INTEL_CORE_YONAH) rather than the first fixed one. This makes it easier to find the documentation of the erratum since Intel documents where it is broken and not where it is fixed. I don't think the Pentium Pro (or Pentium II) is actually affected. But the old check included them, so it can't hurt to keep doing the same. I'm also not completely sure about the "Pentium M" CPUs (models 0x9 and 0xd). But, again, they were included in the old checks and were close Pentium III derivatives, so are likely affected. While we're at it, revise the comment referring to the erratum name and make sure it is a quote of the language from the actual errata doc. That should make it easier to find in the future when the URL inevitably changes. Why bother with this in the first place? It actually gets rid of one of the very few remaining direct references to c->x86{,_model}. No change in functionality intended. Signed-off-by: Dave Hansen <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Len Brown <[email protected]> Link: https://lore.kernel.org/r/[email protected]
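For illustration, the old and new styles of such a check; the exact bounds below are inferred from the description (INTEL_CORE_YONAH as the last affected CPU), not quoted from the patch:

    /* Old style: raw family/model magic numbers. */
    if (c->x86 == 6 && c->x86_model < 15)
            clear_cpu_cap(c, X86_FEATURE_PAT);

    /* New style: symbolic IFM range ending at the last affected CPU. */
    if (c->x86_vfm >= INTEL_PENTIUM_PRO && c->x86_vfm <= INTEL_CORE_YONAH)
            clear_cpu_cap(c, X86_FEATURE_PAT);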
2024-09-02 | KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM | Tom Dohrmann | 1 file, -1/+2
Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on x86, but this is no longer the case for SEV-ES and SEV-SNP VMs. When KVM_CHECK_EXTENSION is invoked on a VM, only advertise KVM_CAP_READONLY_MEM when it's actually supported. Fixes: 66155de93bcf ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)") Cc: Sean Christopherson <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Michael Roth <[email protected]> Signed-off-by: Tom Dohrmann <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
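Conceptually the check moves from a constant to a per-VM query, along these lines (the helper name here is an assumption for illustration):

    case KVM_CAP_READONLY_MEM:
            /* Without a VM to inspect, report the historical default. */
            r = kvm ? kvm_arch_has_readonly_mem(kvm) : 1;
            break;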
2024-09-01 | x86/mm: add testmmiotrace MODULE_DESCRIPTION() | Jeff Johnson | 1 file, -0/+1
Fix the following 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/mm/testmmiotrace.o Link: https://lkml.kernel.org/r/20240730-module_description_orphans-v1-2-7094088076c8@quicinc.com Signed-off-by: Jeff Johnson <[email protected]> Cc: Alexandre Torgue <[email protected]> Cc: Alistar Popple <[email protected]> Cc: Andrew Jeffery <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Eddie James <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeremy Kerr <[email protected]> Cc: Joel Stanley <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Nouveau <[email protected]> Cc: Pekka Paalanen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
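The fix for this class of warning is a one-liner per module; the description text below is illustrative, not quoted from the patch:

    MODULE_DESCRIPTION("Test module for mmiotrace");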
2024-09-01 | mm: remove legacy install_special_mapping() code | Linus Torvalds | 1 file, -4/+8
All relevant architectures had already been converted to the new interface (which just has an underscore in front of the name - not very imaginative naming); this just force-converts the stragglers. The modern interface is almost identical to the old one, except instead of the page pointer it takes a "struct vm_special_mapping" that describes the mapping (and contains the page pointer as one member), and it returns the resulting 'vma' instead of just the error code. Getting rid of the old interface also gets rid of some special casing, which had caused problems with the mremap extensions to "struct vm_special_mapping". [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/CAHk-=whvR+z=0=0gzgdfUiK70JTa-=+9vxD-4T=3BagXR6dciA@mail.gmail.com Tested-by: Rob Landley <[email protected]> # arch/sh/ Link: https://lore.kernel.org/all/20240819195120.GA1113263@thelio-3990X/ Signed-off-by: Linus Torvalds <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Anton Ivanov <[email protected]> Cc: Brian Cain <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Guo Ren <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Johannes Berg <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rich Felker <[email protected]> Cc: Rob Landley <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | mm: remove arch_unmap() | Michael Ellerman | 1 file, -5/+0
Now that powerpc no longer uses arch_unmap() to handle VDSO unmapping, there are no meaningful implementations left. Drop support for it entirely, and update comments which refer to it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Michael Ellerman <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pedro Falcato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | mm/x86: add missing pud helpers | Peter Xu | 2 files, -8/+61
Some new helpers will be needed for pud entry updates soon. Introduce these helpers by referencing the pmd ones. Namely: - pudp_invalidate(): this helper invalidates a huge pud before a split happens, so that the invalidated pud entry will make sure no race will happen (either with software, like a concurrent zap, or hardware, like a/d bit lost). - pud_modify(): this helper applies a new pgprot to an existing huge pud mapping. For more information on why we need these two helpers, please refer to the corresponding pmd helpers in the mprotect() code path. When at it, simplify the pud_modify()/pmd_modify() comments on shadow stack pgtable entries to reference pte_modify() to avoid duplicating the whole paragraph three times. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | mm/x86: implement arch_check_zapped_pud() | Peter Xu | 2 files, -0/+16
Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps. It has the same logic as the PMD helper. One thing to mention is, it might be a good idea to use page_table_check in the future for trapping wrong setups of shadow stack pgtable entries [1]. That is left for the future as a separate effort. [1] https://lore.kernel.org/all/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | mm/x86: make pud_leaf() only care about PSE bit | Peter Xu | 1 file, -2/+1
When working on mprotect() on 1G dax entries, I hit a 'zap bad pud' error when zapping a huge pud that has PROT_NONE permission. Here the problem is x86's pud_leaf() requires both PRESENT and PSE bits set to report a pud entry as a leaf, but that doesn't look right, as it's not following the pXd_leaf() definition that we stick with so far, where PROT_NONE entries should be reported as leaves. To fix it, change x86's pud_leaf() implementation to only check against the PSE bit to report a leaf, irrespective of whether the PRESENT bit is set. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Acked-by: Dave Hansen <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Jiang <[email protected]> Cc: David Rientjes <[email protected]> Cc: "Edgecombe, Rick P" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
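Given the one-line delta, the fix is essentially the following (presented as a sketch):

    static inline bool pud_leaf(pud_t pud)
    {
            /* PROT_NONE huge puds clear PRESENT but keep PSE; still a leaf. */
            return pud_val(pud) & _PAGE_PSE;
    }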
2024-09-01 | mm: rework accept memory helpers | Kirill A. Shutemov | 2 files, -2/+2
Make accept_memory() and range_contains_unaccepted_memory() take 'start' and 'size' arguments instead of 'start' and 'end'. Remove accept_page(), replacing it with direct calls to accept_memory(). The accept_page() name is going to be used for a different function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Rapoport (Microsoft) <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
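In other words, the interface becomes something like this (the exact type of 'size' is assumed):

    void accept_memory(phys_addr_t start, unsigned long size);

    /* Callers convert from the old (start, end) style: */
    accept_memory(start, end - start);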
2024-09-01 | mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PMD_PTLOCKS into Kconfig options | David Hildenbrand | 1 file, -3/+4
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/[email protected] This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Russell King (Oracle) <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01 | Merge tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 11 files, -20/+61
Pull x86 fixes from Thomas Gleixner:
 - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling it, thereby creating inconsistent state. Reorder the logic so it actually works correctly.
 - The XSTATE logic for handling LBR is incorrect as it assumes that XSAVES supports LBR when the CPU supports LBR. In fact both conditions need to be true. Otherwise the enablement of LBR in the IA32_XSS MSR fails and subsequently the machine crashes on the next XRSTORS operation because IA32_XSS is not initialized. Cache the XSTATE support bit during init and make the related functions use this cached information and the LBR CPU feature bit to cure this.
 - Cure a long standing bug in KASLR: KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before.
 - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of an initialized variable on the stack to the VMM. The variable is only required as an output value, so it does not have to be exposed to the VMM in the first place.
 - Prevent an array overrun in the resource control code on systems with Sub-NUMA Clustering enabled because the code failed to adjust the index by the number of SNC nodes per L3 cache.
* tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Fix arch_mbm_* array overrun on SNC
  x86/tdx: Fix data leak in mmio_read()
  x86/kaslr: Expose and use the end of the physical memory address space
  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported
  x86/apic: Make x2apic_disable() work correctly
2024-09-01 | tinyconfig: remove unnecessary 'is not set' for choice blocks | Masahiro Yamada | 1 file, -4/+0
This reverts the following commits: - 236dec051078 ("kconfig: tinyconfig: provide whole choice blocks to avoid warnings") - b0f269728ccd ("x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'") Since commit f79dc03fe68c ("kconfig: refactor choice value calculation"), it is no longer necessary to disable the remaining options in choice blocks. Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Thomas Gleixner <[email protected]>
2024-08-29 | x86/EISA: Dereference memory directly instead of using readl() | Maciej W. Rozycki | 1 file, -2/+2
Sparse expects an __iomem pointer, but after converting the EISA probe to memremap() the pointer is a regular memory pointer. Access it directly instead. [ tglx: Converted it to fix the already applied version ] Fixes: 80a4da05642c ("x86/EISA: Use memremap() to probe for the EISA BIOS signature") Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-28 | KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP | Vitaly Kuznetsov | 1 file, -2/+4
SEV-SNP support is present since commit 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") but Kconfig entry wasn't updated and still mentions SEV and SEV-ES only. Add SEV-SNP there and, while on it, expand 'SEV' in the description as 'Encrypted VMs' is not what 'SEV' stands for. No functional change. Signed-off-by: Vitaly Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-28 | x86/resctrl: Fix arch_mbm_* array overrun on SNC | Peter Newman | 2 files, -6/+8
When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring groups may be allocated RMID values which would overrun the arch_mbm_{local,total} arrays. This is due to inconsistencies in whether the SNC-adjusted num_rmid value or the unadjusted value in resctrl_arch_system_num_rmid_idx() is used. The num_rmid value for the L3 resource is currently: resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache As a simple fix, make resctrl_arch_system_num_rmid_idx() return the SNC-adjusted, L3 num_rmid value on x86. Fixes: e13db55b5a0d ("x86/resctrl: Introduce snc_nodes_per_l3_cache") Signed-off-by: Peter Newman <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
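A sketch of the described fix, assuming the x86 helper simply folds in the SNC divisor; the exact code may differ:

    static inline unsigned int resctrl_arch_system_num_rmid_idx(void)
    {
            /* RMIDs are shared per L3 domain; divide by SNC nodes per cache. */
            return (boot_cpu_data.x86_cache_max_rmid + 1) /
                   snc_nodes_per_l3_cache;
    }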
2024-08-27 | virt: sev-guest: Ensure the SNP guest messages do not exceed a page | Nikunj A Dadhania | 1 file, -1/+1
Currently, struct snp_guest_msg includes a message header (96 bytes) and a payload (4000 bytes). There is an implicit assumption here that the SNP message header will always be 96 bytes, and with that assumption the payload array size has been set to 4000 bytes - a magic number. If any new member is added to the SNP message header, the SNP guest message will span more than a page. Instead of using a magic number for the payload, declare struct snp_guest_msg in a way that payload plus the message header do not exceed a page. [ bp: Massage. ] Suggested-by: Tom Lendacky <[email protected]> Signed-off-by: Nikunj A Dadhania <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
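Expressed as code, the described change makes the payload size derive from the header instead of being a literal (field names assumed to match the description):

    struct snp_guest_msg {
            struct snp_guest_msg_hdr hdr;
            u8 payload[PAGE_SIZE - sizeof(struct snp_guest_msg_hdr)];
    } __packed;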
2024-08-26 | x86/tdx: Fix data leak in mmio_read() | Kirill A. Shutemov | 1 file, -1/+0
The mmio_read() function makes a TDVMCALL to retrieve MMIO data for an address from the VMM. Sean noticed that mmio_read() unintentionally exposes the value of an initialized variable (val) on the stack to the VMM. This variable is only needed as an output value. It did not need to be passed to the VMM in the first place. Do not send the original value of *val to the VMM. [ dhansen: clarify what 'val' is used for. ] Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO") Reported-by: Sean Christopherson <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/20240826125304.1566719-1-kirill.shutemov%40linux.intel.com
2024-08-26 | x86/ioremap: Improve iounmap() address range checks | Max Ramanouski | 1 file, -1/+2
Allowing iounmap() on memory that was not ioremap()'d in the first place is obviously a bad idea. There is currently a feeble attempt to avoid errant iounmap()s by checking to see if the address is below "high_memory". But that's imprecise at best because there are plenty of high addresses that are also invalid to call iounmap() on. Thankfully, there is a more precise helper: is_ioremap_addr(). x86 just does not use it in iounmap(). Restrict iounmap() to addresses in the ioremap region by using is_ioremap_addr(). This aligns x86 closer to the generic iounmap() implementation. Additionally, add a warning in case there is an attempt to iounmap() invalid memory. This replaces an existing silent return and will help alert folks to any incorrect usage of iounmap(). Because VMALLOC_START is not available from asm/pgtable.h on i386, an include of asm/vmalloc.h had to be added to include/linux/ioremap.h. [ dhansen: tweak subject and changelog ] Signed-off-by: Max Ramanouski <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Link: https://lore.kernel.org/all/20240824220111.84441-1-max8rr8%40gmail.com
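A sketch of the strengthened check; its exact placement within iounmap() is assumed:

    void iounmap(volatile void __iomem *addr)
    {
            /* Warn about and reject addresses that were never ioremap()'d. */
            if (WARN_ON_ONCE(!is_ioremap_addr((void __force *)addr)))
                    return;

            /* ... normal unmap path continues here ... */
    }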
2024-08-25 | x86/entry: Set FRED RSP0 on return to userspace instead of context switch | Xin Li (Intel) | 4 files, -5/+27
The FRED RSP0 MSR points to the top of the kernel stack for user level event delivery. As this is the task stack it needs to be updated when a task is scheduled in. The update is done at context switch. That means it's also done when switching to kernel threads, which is pointless as those never go out to user space. For KVM threads this means there are two writes to FRED_RSP0 as KVM has to switch to the guest value before VMENTER. Defer the update to the exit to user space path and cache the per CPU FRED_RSP0 value, so redundant writes can be avoided. Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual MSR value after returning from guest to host mode. [ tglx: Massage change log ] Suggested-by: Sean Christopherson <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism | Andrew Cooper | 3 files, -16/+11
Per the discussion about FRED MSR writes with the WRMSRNS instruction [1], use the alternatives mechanism to choose WRMSRNS when it's available, otherwise fall back to WRMSR. Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer dependent on FRED. [1] https://lore.kernel.org/lkml/[email protected]/ Use a DS prefix to pad WRMSR instead of a NOP. The prefix is ignored. At least that's the current information from the hardware folks. Signed-off-by: Andrew Cooper <[email protected]> Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/entry: Test ti_work for zero before processing individual bits | Xin Li (Intel) | 1 file, -2/+8
In most cases, ti_work values passed to arch_exit_to_user_mode_prepare() are zero, e.g., 99% in kernel build tests. So an obvious optimization is to test ti_work for zero before processing individual bits in it. Omit the optimization when FPU debugging is enabled, otherwise the FPU consistency check is never executed. Intel 0day tests did not find a performance regression with this change. Suggested-by: H. Peter Anvin (Intel) <[email protected]> Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
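The described optimization is roughly the following (a sketch; the per-bit work shown is abbreviated):

    static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
                                                      unsigned long ti_work)
    {
            /* Keep the FPU consistency check reachable with FPU debugging on. */
            if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) || unlikely(ti_work)) {
                    if (ti_work & _TIF_USER_RETURN_NOTIFY)
                            fire_user_return_notifiers();
                    /* ... remaining per-bit processing ... */
            }
    }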
2024-08-25 | perf/x86/intel: Limit the period on Haswell | Kan Liang | 1 file, -2/+21
Running the ltp test cve-2015-3290 concurrently reports the following warnings: perfevents: irq loop stuck! WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370 Call Trace: <NMI> ? __warn+0xa4/0x220 ? intel_pmu_handle_irq+0x285/0x370 ? __report_bug+0x123/0x130 ? intel_pmu_handle_irq+0x285/0x370 ? __report_bug+0x123/0x130 ? intel_pmu_handle_irq+0x285/0x370 ? report_bug+0x3e/0xa0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x18/0x50 ? asm_exc_invalid_op+0x1a/0x20 ? irq_work_claim+0x1e/0x40 ? intel_pmu_handle_irq+0x285/0x370 perf_event_nmi_handler+0x3d/0x60 nmi_handle+0x104/0x330 Thanks to Thomas Gleixner's analysis, the issue is caused by the low initial period (1) of the frequency estimation algorithm, which triggers the defects of the HW, specifically erratum HSW11 and HSW143. (For the details, please refer to https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/) The HSW11 erratum requires a period larger than 100 for the INST_RETIRED.ALL event, but the initial period in the freq mode is 1. The erratum is the same as the BDM11, which has been supported in the kernel. A minimum period of 128 is enforced as well on HSW. HSW143 concerns fixed counter 1, which may overcount by 32 when Hyper-Threading is enabled. However, based on the test, the hardware has more issues than it tells. Besides the fixed counter 1, the message 'interrupt took too long' can be observed on any counter which was armed with a period < 32 and two events expired in the same NMI. A minimum period of 32 is enforced for the rest of the events. The recommended workaround code for HSW143 is not implemented, because it only addresses the issue for the fixed counter and brings extra overhead through extra MSR writing. No related overcounting issue has been reported so far. Fixes: 3a632cb229bf ("perf/x86/intel: Add simple Haswell PMU support") Reported-by: Li Huafei <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/[email protected] Closes: https://lore.kernel.org/lkml/[email protected]/
2024-08-25 | x86/fred: Set SS to __KERNEL_DS when enabling FRED | Xin Li (Intel) | 1 file, -0/+14
SS is initialized to NULL during boot time and not explicitly set to __KERNEL_DS. With FRED enabled, if a kernel event is delivered before a CPU goes to user level for the first time, its SS is NULL thus NULL is pushed into the SS field of the FRED stack frame. But before ERETS is executed, the CPU may context switch to another task and go to user level. Then when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS. Later when ERETS is executed to return from the kernel event handler, a #GP fault is generated because SS doesn't match the SS saved in the FRED stack frame. Initialize SS to __KERNEL_DS when enabling FRED to prevent that. Note, IRET doesn't check if SS matches the SS saved in its stack frame, thus IDT doesn't have this problem. For IDT it doesn't matter whether SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt or exception delivery and __KERNEL_DS upon SYSCALL. Thus it's pointless to initialize SS for IDT. Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
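The fix presumably amounts to a single extra segment load in the FRED enable path, e.g.:

    /* Give SS a well-defined value so the ERETS SS match check passes. */
    loadsegment(ss, __KERNEL_DS);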
2024-08-25 | x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h-70h | Richard Gong | 1 file, -0/+4
Add new PCI IDs for Device 18h and Function 4 to enable the amd_atl driver on those systems. Signed-off-by: Richard Gong <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/extable: Remove unused declaration fixup_bug() | Yue Haibing | 1 file, -1/+0
Commit 15a416e8aaa7 ("x86/entry: Treat BUG/WARN as NMI-like entries") removed the implementation but left the declaration. Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/boot/64: Strip percpu address space when setting up GDT descriptors | Uros Bizjak | 1 file, -1/+2
init_per_cpu_var() returns a pointer in the percpu address space while rip_rel_ptr() expects a pointer in the generic address space. When strict address space checks are enabled, GCC's named address space checks fail: asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from pointer to non-enclosed address space Add an explicit cast to strip the address space from the returned pointer. Fixes: 11e36b0f7c21 ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]") Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/cpu: Clarify the error message when BIOS does not support SGX | WangYuli | 1 file, -1/+1
When SGX is not supported by the BIOS, the kernel log contains the error 'SGX disabled by BIOS', which can be confusing since there might not be an SGX-related option in the BIOS settings. For the kernel it's difficult to distinguish between the BIOS not supporting SGX and the BIOS supporting SGX but having it disabled. Therefore, update the error message to 'SGX disabled or unsupported by BIOS' to make it easier for those reading kernel logs to understand what's happening. Reported-by: Bo Wu <[email protected]> Co-developed-by: Zelong Xiang <[email protected]> Signed-off-by: Zelong Xiang <[email protected]> Signed-off-by: WangYuli <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/all/[email protected] Closes: https://github.com/linuxdeepin/developer-center/issues/10032
2024-08-25 | x86/kexec: Add comments around swap_pages() assembly to improve readability | Kai Huang | 1 file, -2/+6
The current assembly around swap_pages() in relocate_kernel() takes some time to follow because it is easy to lose track of which register holds what as the assembly grows long. Add a couple of comments to clarify the code around swap_pages() to improve readability. Signed-off-by: Kai Huang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
2024-08-25 | x86/kexec: Fix a comment of swap_pages() assembly | Kai Huang | 1 file, -1/+1
When relocate_kernel() gets called, %rdi holds 'indirection_page' and %rsi holds 'page_list'. And %rdi always holds 'indirection_page' when swap_pages() is called. Therefore the comment of the first line code of swap_pages() movq %rdi, %rcx /* Put the page_list in %rcx */ .. isn't correct because it actually moves the 'indirection_page' to the %rcx. Fix it. Signed-off-by: Kai Huang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
2024-08-25 | x86/sgx: Fix a W=1 build warning in function comment | Kai Huang | 1 file, -1/+1
Building the SGX code with W=1 generates the warning below: arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'low' not described in 'sgx_calc_section_metric' arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'high' not described in 'sgx_calc_section_metric' ... The function sgx_calc_section_metric() is a simple helper which is only used in sgx/main.c. There's no need to use a kernel-doc style comment for it. Downgrade to a normal comment. Signed-off-by: Kai Huang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/EISA: Use memremap() to probe for the EISA BIOS signature | Maciej W. Rozycki | 1 file, -3/+3
The area at the 0x0FFFD9 physical location in the PC memory space is regular memory, traditionally ROM BIOS and more recently a copy of BIOS code and data in RAM, write-protected. Therefore use memremap() to get access to it rather than ioremap(), avoiding issues in virtualization scenarios and complementing changes such as commit f7750a795687 ("x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use memremap() for RAM mappings") or commit 5997efb96756 ("x86/boot: Use memremap() to map the MPF and MPC data"). Reported-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected] Closes: https://lore.kernel.org/r/[email protected]
2024-08-25 | x86/mtrr: Remove obsolete declaration for mtrr_bp_restore() | Gaosheng Cui | 1 file, -2/+0
mtrr_bp_restore() has been removed in commit 0b9a6a8bedbf ("x86/mtrr: Add a stop_machine() handler calling only cache_cpu_init()"), but the declaration was left behind. Remove it. Signed-off-by: Gaosheng Cui <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-25 | x86/cpu_entry_area: Annotate percpu_setup_exception_stacks() as __init | Nathan Chancellor | 1 file, -1/+1
After a recent LLVM change that deduces __cold on functions that only call cold code (such as __init functions), there is a section mismatch warning from percpu_setup_exception_stacks(), which got moved to .text.unlikely. as a result of that optimization: WARNING: modpost: vmlinux: section mismatch in reference: percpu_setup_exception_stacks+0x3a (section: .text.unlikely.) -> cea_map_percpu_pages (section: .init.text) Drop the inline keyword, which does not guarantee inlining, and replace it with __init, as percpu_setup_exception_stacks() is only called from __init code, which clears up the warning. Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/20240822-x86-percpu_setup_exception_stacks-init-v1-1-57c5921b8209@kernel.org
2024-08-24 | crypto: x86/sha256 - Add parentheses around macros' single arguments | Fangrui Song | 1 file, -8/+8
The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an unexpected/undocumented behavior of the GNU assembler, which might change in the future (https://sourceware.org/bugzilla/show_bug.cgi?id=32073). M (1) (2) // 1 arg !? Future: 2 args M 1 + 2 // 1 arg !? Future: 3 args M 1 2 // 2 args Add parentheses around the single arguments to support future GNU assembler and LLVM integrated assembler (when the IsOperator hack from the following link is dropped). Link: https://github.com/llvm/llvm-project/commit/055006475e22014b28a070db1bff41ca15f322f0 Signed-off-by: Fangrui Song <[email protected]> Reviewed-by: Jan Beulich <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2024-08-23 | x86/hyperv: use helpers to read control registers in hv_snp_boot_ap() | Yosry Ahmed | 1 file, -3/+3
Use native_read_cr*() helpers to read control registers into vmsa->cr* instead of open-coded assembly. No functional change intended, unless there was a purpose to specifying rax. Signed-off-by: Yosry Ahmed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Wei Liu <[email protected]> Message-ID: <[email protected]>
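That is, replacing open-coded asm of the first form below with the existing helpers (a sketch):

    /* Before: open-coded, needlessly pinning %rax. */
    asm volatile("movq %%cr0, %%rax" : "=a" (vmsa->cr0));

    /* After: */
    vmsa->cr0 = native_read_cr0();
    vmsa->cr4 = native_read_cr4();
    vmsa->cr3 = __native_read_cr3();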
2024-08-23 | x86/syscall: Avoid memcpy() for ia32 syscall_get_arguments() | Kees Cook | 1 file, -1/+6
Modern (fortified) memcpy() prefers to avoid writing (or reading) beyond the end of the addressed destination (or source) struct member: In function ‘fortify_memcpy_chk’, inlined from ‘syscall_get_arguments’ at ./arch/x86/include/asm/syscall.h:85:2, inlined from ‘populate_seccomp_data’ at kernel/seccomp.c:258:2, inlined from ‘__seccomp_filter’ at kernel/seccomp.c:1231:3: ./include/linux/fortify-string.h:580:25: error: call to ‘__read_overflow2_field’ declared with attribute warning: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror=attribute-warning] 580 | __read_overflow2_field(q_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As already done for x86_64 and compat mode, do not use memcpy() to extract syscall arguments from struct pt_regs but rather just perform direct assignments. Binary output differences are negligible, and the result actually ends up using less stack space: - sub $0x84,%esp + sub $0x6c,%esp and less text size: text data bss dec hex filename 10794 252 0 11046 2b26 gcc-32b/kernel/seccomp.o.stock 10714 252 0 10966 2ad6 gcc-32b/kernel/seccomp.o.after Closes: https://lore.kernel.org/lkml/[email protected]/ Reported-by: Mirsad Todorovac <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Gustavo A. R. Silva <[email protected]> Acked-by: Dave Hansen <[email protected]> Tested-by: Mirsad Todorovac <[email protected]> Link: https://lore.kernel.org/all/20240708202202.work.477-kees%40kernel.org
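The ia32 variant presumably ends up mirroring the x86_64 one, assigning registers directly; a sketch:

    static inline void syscall_get_arguments(struct task_struct *task,
                                             struct pt_regs *regs,
                                             unsigned long *args)
    {
            /* Direct assignments avoid a fortified memcpy() spanning members. */
            args[0] = regs->bx;
            args[1] = regs->cx;
            args[2] = regs->dx;
            args[3] = regs->si;
            args[4] = regs->di;
            args[5] = regs->bp;
    }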
2024-08-22 | KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing | Ravi Bangoria | 1 file, -0/+3
If the host supports Bus Lock Detect, KVM advertises it to guests even if SVM support is absent. Additionally, the guest wouldn't be able to use it despite the guest CPUID bit being set. Fix it by unconditionally clearing the feature bit in KVM cpu capability. Reported-by: Jim Mattson <[email protected]> Closes: https://lore.kernel.org/r/CALMp9eRet6+v8Y1Q-i6mqPm4hUow_kJNhmVHfOV8tMfuSS=tVg@mail.gmail.com Fixes: 76ea438b4afc ("KVM: X86: Expose bus lock debug exception to guest") Cc: [email protected] Signed-off-by: Ravi Bangoria <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22 | KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE | Maxim Levitsky | 1 file, -0/+12
If these MSRs are read by the emulator (e.g. due to the 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases and return them to the emulator. Fix that. Cc: [email protected] Signed-off-by: Maxim Levitsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22 | KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS | Sean Christopherson | 1 file, -0/+2
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
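The unconditional SRCU acquisition described above would look roughly like this in the ioctl dispatch (a sketch, not the literal patch):

    case KVM_SET_VCPU_EVENTS: {
            struct kvm_vcpu_events events;

            r = -EFAULT;
            if (copy_from_user(&events, argp, sizeof(events)))
                    break;

            kvm_vcpu_srcu_read_lock(vcpu);
            r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
            kvm_vcpu_srcu_read_unlock(vcpu);
            break;
    }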
2024-08-22 | KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs | Sean Christopherson | 1 file, -1/+3
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to fault-in SPTEs will fail miserably due to root.hpa pointing at garbage. Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the WARN isn't user-triggerable and won't run afoul of warn-on-panic because the kernel would already be panicking. BUG: unable to handle page fault for address: 000029ffffffffe8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm] RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206 RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78 RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080 RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225 R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080 R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000 FS: 00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0 Call Trace: <TASK> kvm_tdp_page_fault+0xcc/0x160 [kvm] kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm] kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm] kvm_vcpu_ioctl+0x761/0x8c0 [kvm] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Modules linked in: kvm_intel kvm CR2: 000029ffffffffe8 ---[ end trace 0000000000000000 ]--- Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected] Reviewed-by: Kai Huang <[email protected]> Tested-by: xingwei lee <[email protected]> Tested-by: yuxin wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
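The described fix is a small early-out before any SPTEs are faulted in, roughly:

    /* Make sure a valid root is loaded before trying to fault in SPTEs. */
    r = kvm_mmu_reload(vcpu);
    if (r)
            return r;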
2024-08-22 | KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename | Yan Zhao | 3 files, -8/+8
Replace "removed" with "frozen" in comments as appropriate to complete the rename of REMOVED_SPTE to FROZEN_SPTE. Fixes: 964cea817196 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE") Signed-off-by: Yan Zhao <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: write changelog] Signed-off-by: Sean Christopherson <[email protected]>
Replace "removed" with "frozen" in comments as appropriate to complete the rename of REMOVED_SPTE to FROZEN_SPTE. Fixes: 964cea817196 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE") Signed-off-by: Yan Zhao <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: write changelog] Signed-off-by: Sean Christopherson <[email protected]>