path: root/arch/x86/include/asm
2022-05-13  x86/tsc: Use fallback for random_get_entropy() instead of zero  (Jason A. Donenfeld; 2 files, -4/+12)
In the event that random_get_entropy() can't access a cycle counter or similar, falling back to returning 0 is suboptimal. Instead, fall back to calling random_get_entropy_fallback(), which isn't extremely high precision or guaranteed to be entropic, but is certainly better than returning zero all the time. If CONFIG_X86_TSC=n, then it's possible for the kernel to run on systems without RDTSC, such as 486 and certain 586, so the fallback code is only required for that case. As well, fix up both the new function and the get_cycles() function from which it was derived to use cpu_feature_enabled() rather than boot_cpu_has(), and use !IS_ENABLED() instead of #ifndef. Signed-off-by: Jason A. Donenfeld <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: [email protected]
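A minimal sketch of the resulting helper, following the description above (the exact header placement and surrounding guards are assumptions, not a verbatim copy of the patch):

  static inline unsigned long random_get_entropy(void)
  {
      if (!IS_ENABLED(CONFIG_X86_TSC) &&
          !cpu_feature_enabled(X86_FEATURE_TSC))
          return random_get_entropy_fallback();
      return rdtsc();
  }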
2022-05-13  mm: page_table_check: add hooks to public helpers  (Tong Tiangen; 1 file, -10/+0)
Move ptep_clear() to include/linux/pgtable.h and add page table check related hooks to some helpers, in preparation for supporting the page table check feature on new architectures. Also optimize the implementation of ptep_clear(): since the page table check stubs already gate the feature, the interface control should live in the stubs, and there is no rationale for an additional IS_ENABLED() check here. Architectures that do not enable CONFIG_PAGE_TABLE_CHECK will call fallback page table check stubs[1] when using their page table helpers[2] in include/linux/pgtable.h. [1] page table check stubs defined in include/linux/page_table_check.h [2] ptep_clear() ptep_get_and_clear() pmdp_huge_get_and_clear() pudp_huge_get_and_clear() Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tong Tiangen <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
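A rough sketch of the generic helper after the move (assuming the hook wiring lives in include/linux/page_table_check.h, as described above):

  /* include/linux/pgtable.h -- sketch of the moved helper. The page
   * table check hook is invoked from ptep_get_and_clear(); when
   * CONFIG_PAGE_TABLE_CHECK is off, the stub compiles away, so no
   * IS_ENABLED() check is needed here. */
  static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep)
  {
      ptep_get_and_clear(mm, addr, ptep);
  }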
2022-05-13  mm: page_table_check: move pxx_user_accessible_page into x86  (Kefeng Wang; 1 file, -0/+17)
The pxx_user_accessible_page() helpers check PTE bits; as this is architecture-specific code, move them into x86's pgtable.h. These helpers are being moved out to make the page table check framework platform independent. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Signed-off-by: Tong Tiangen <[email protected]> Acked-by: Pasha Tatashin <[email protected]> Reviewed-by: Anshuman Khandual <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
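A sketch of the x86 PTE variant after the move ("user accessible" meaning present and user-visible; the pmd/pud variants follow the same pattern with their huge-page present checks):

  #ifdef CONFIG_PAGE_TABLE_CHECK
  static inline bool pte_user_accessible_page(pte_t pte)
  {
      return (pte_val(pte) & _PAGE_PRESENT) && (pte_val(pte) & _PAGE_USER);
  }
  #endif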
2022-05-13  mm: avoid unnecessary flush on change_huge_pmd()  (Nadav Amit; 1 file, -0/+5)
Calls to change_protection_range() on THP can trigger, at least on x86, two TLB flushes for one page: one immediately, when pmdp_invalidate() is called by change_huge_pmd(), and then another one later (that can be batched) when change_protection_range() finishes. The first TLB flush is only necessary to prevent the dirty bit (and with a lesser importance the access bit) from changing while the PTE is modified. However, this is not necessary as the x86 CPUs set the dirty-bit atomically with an additional check that the PTE is (still) present. One caveat is Intel's Knights Landing that has a bug and does not do so. Leverage this behavior to eliminate the unnecessary TLB flush in change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad() that only protects the access and dirty bits from further changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nadav Amit <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andrew Cooper <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Peter Xu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Nick Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
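A sketch of the new arch-specific helper, assuming it reduces to pmdp_establish() without the flush that pmdp_invalidate() performs:

  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
                           pmd_t *pmdp)
  {
      /*
       * Once the PMD is marked invalid, the CPU can no longer set the
       * access/dirty bits in it, so no TLB flush is needed for that
       * purpose (modulo the Knights Landing erratum noted above).
       */
      return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
  }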
2022-05-13  mm/mprotect: do not flush when not required architecturally  (Nadav Amit; 2 files, -0/+99)
Currently, using mprotect() or uffd to unprotect a memory region causes a TLB flush. However, in such cases the PTE is often not modified (i.e., it remains RO) and therefore no TLB flush is needed. Add an arch-specific pte_needs_flush() which tells whether a TLB flush is needed based on the old PTE and the new one. Implement an x86 pte_needs_flush(). Always flush the TLB when it is architecturally needed, even when skipping the flush would only result in spurious page faults. Even with such a conservative approach, we can in the future further refine the checks to test whether a PTE is present by only considering the architectural _PAGE_PRESENT flag instead of {pte|pmd}_present(). For now, be careful and use the latter. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nadav Amit <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yu Zhao <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Andrew Cooper <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
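A deliberately simplified sketch of the idea behind pte_needs_flush() (the real x86 version is considerably more careful, e.g. about access/dirty bits and software-only flags; treat this as illustration, not the actual implementation):

  static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
  {
      /* A non-present PTE cannot be cached in the TLB. */
      if (!(pte_flags(oldpte) & _PAGE_PRESENT))
          return false;

      /* Changing the target page always needs a flush. */
      if (pte_pfn(oldpte) != pte_pfn(newpte))
          return true;

      /* Clearing a bit (e.g. RW -> RO) may leave a stale, more
       * permissive translation cached; only setting bits may not. */
      return (pte_flags(oldpte) & ~pte_flags(newpte)) != 0;
  }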
2022-05-13  x86/prctl: Remove pointless task argument  (Thomas Gleixner; 2 files, -4/+2)
The functions invoked via do_arch_prctl_common() can only operate on the current task and none of these functions uses the task argument. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/87lev7vtxj.ffs@tglx
2022-05-12  KVM: x86/mmu: Expand and clean up page fault stats  (Sean Christopherson; 1 file, -0/+5)
Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
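The counters described above, gathered into one standalone struct for illustration (the real fields live in struct kvm_vcpu_stat; names follow the commit message and may differ slightly):

  struct mmu_fault_stats {          /* illustration only */
      u64 pf_taken;                 /* all faults taken */
      u64 pf_fixed;                 /* faults actually fixed */
      u64 pf_fast;                  /* fixed in the fast path */
      u64 pf_mmio_spte_created;     /* faults creating an MMIO SPTE */
      u64 pf_emulate;               /* faults triggering emulation */
      u64 pf_spurious;              /* spurious (already fixed) faults */
  };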
2022-05-12  x86/msr-index: Define INTEGRITY_CAPABILITIES MSR  (Tony Luck; 1 file, -0/+7)
The INTEGRITY_CAPABILITIES MSR is enumerated by bit 2 of the CORE_CAPABILITIES MSR. Add defines for the CORE_CAPS enumeration as well as for the integrity MSR. Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Hans de Goede <[email protected]>
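The defines plausibly look like this (the CORE_CAPS MSR address 0xcf is the existing define in msr-index.h; the INTEGRITY_CAPS address shown is from memory and should be verified against the tree):

  #define MSR_IA32_CORE_CAPS                      0x000000cf
  #define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT   2
  #define MSR_IA32_CORE_CAPS_INTEGRITY_CAPS       BIT(MSR_IA32_CORE_CAPS_INTEGRITY_CAPS_BIT)

  #define MSR_INTEGRITY_CAPS                      0x000002d9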
2022-05-12  x86/microcode/intel: Expose collect_cpu_info_early() for IFS  (Jithu Joseph; 1 file, -0/+18)
IFS is a CPU feature that allows a binary blob, similar to microcode, to be loaded and consumed to perform low level validation of CPU circuitry. In fact, it carries the same Processor Signature (family/model/stepping) details that are contained in Intel microcode blobs. In support of an IFS driver to trigger loading, validation, and running of these test blobs, make the functionality of cpu_signatures_match() and collect_cpu_info_early() available outside of the microcode driver. Add an "intel_" prefix and drop the "_early" suffix from collect_cpu_info_early() and EXPORT_SYMBOL_GPL() it. Add a declaration to x86 <asm/cpu.h>. Make cpu_signatures_match() an inline function in x86 <asm/cpu.h>, and also give it an "intel_" prefix. No functional change intended. Reviewed-by: Dan Williams <[email protected]> Signed-off-by: Jithu Joseph <[email protected]> Co-developed-by: Tony Luck <[email protected]> Signed-off-by: Tony Luck <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Borislav Petkov <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Hans de Goede <[email protected]>
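A sketch of the new inline in <asm/cpu.h> (Intel microcode headers use a processor-flags value of 0 to mean "any platform", and non-zero pf values are bitmasks that must intersect):

  static inline bool intel_cpu_signatures_match(unsigned int s1, unsigned int p1,
                                                unsigned int s2, unsigned int p2)
  {
      if (s1 != s2)
          return false;
      if (!p1 && !p2)
          return true;
      return p1 & p2;
  }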
2022-05-11  swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm  (Christoph Hellwig; 2 files, -24/+6)
swiotlb-xen uses very different ways to allocate coherent memory on x86 vs arm. On the former it allocates memory from the page allocator, while on the latter it reuses the dma-direct allocator that handles the complexities of non-coherent DMA on arm platforms. Unfortunately the complexities of trying to deal with the two cases in the swiotlb-xen.c code led to a bug in the handling of DMA_ATTR_NO_KERNEL_MAPPING on arm. With the DMA_ATTR_NO_KERNEL_MAPPING flag the coherent memory allocator does not actually allocate coherent memory, but just a DMA handle for some memory that is DMA addressable by the device, but which does not have to have a kernel mapping. Thus dereferencing the return value will lead to kernel crashes and memory corruption. Fix this by using the dma-direct allocator directly for arm, which works perfectly fine because on arm swiotlb-xen is only used when the domain is 1:1 mapped, and then simplifying the remaining code to only cater for the x86 case with DMA coherent devices. Reported-by: Rahul Singh <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Rahul Singh <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Tested-by: Rahul Singh <[email protected]>
2022-05-11  perf/ibs: Fix comment  (Ravi Bangoria; 1 file, -1/+1)
s/IBS Op Data 2/IBS Op Data 1/ for MSR 0xc0011035. Signed-off-by: Ravi Bangoria <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-11  perf/amd/ibs: Add support for L3 miss filtering  (Ravi Bangoria; 1 file, -0/+3)
IBS L3 miss filtering works by tagging an instruction on IBS counter overflow and generating an NMI if the tagged instruction causes an L3 miss. Samples without an L3 miss are discarded and the counter is reset with a random value (between 1-15 for the fetch PMU and 1-127 for the op PMU). This helps in reducing sampling overhead when the user is interested only in such samples. One use case of such filtered samples is to feed data to a page-migration daemon in tiered memory systems. Add support for L3 miss filtering in the IBS driver via the new PMU attribute "l3missonly". Example usage:

  # perf record -a -e ibs_op/l3missonly=1/ --raw-samples sleep 5

Signed-off-by: Ravi Bangoria <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-11  Merge branch 'v5.18-rc5'  (Peter Zijlstra; 14 files, -40/+50)
Obtain the new INTEL_FAM6 stuff required. Signed-off-by: Peter Zijlstra <[email protected]>
2022-05-09  x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE  (David Hildenbrand; 3 files, -1/+25)
Let's use bit 3 to remember PG_anon_exclusive in swap ptes. [[email protected]: fix 32-bit swap layout] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Don Dutile <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Liang Zhang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Oded Gabbay <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Pedro Demarchi Gomes <[email protected]> Cc: Peter Xu <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
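On 64-bit, bit 3 corresponds to _PAGE_PWT, so the helpers plausibly take the following shape (a sketch; the 32-bit swap layout fix mentioned above differs):

  #define _PAGE_SWP_EXCLUSIVE    _PAGE_PWT

  static inline pte_t pte_swp_mkexclusive(pte_t pte)
  {
      return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
  }

  static inline int pte_swp_exclusive(pte_t pte)
  {
      return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
  }

  static inline pte_t pte_swp_clear_exclusive(pte_t pte)
  {
      return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
  }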
2022-05-09  entry: Rename arch_check_user_regs() to arch_enter_from_user_mode()  (Sven Schnelle; 1 file, -2/+2)
When entering the kernel from userspace, arch_check_user_regs() is used to verify that struct pt_regs contains valid values. Note that the NMI codepath doesn't call this function. s390 needs a place in the generic entry code to modify a cpu data structure when switching from userspace to kernel mode. As arch_check_user_regs() is exactly this, rename it to arch_enter_from_user_mode(). Signed-off-by: Sven Schnelle <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Heiko Carstens <[email protected]>
2022-05-07  fork: Generalize PF_IO_WORKER handling  (Eric W. Biederman; 2 files, -5/+5)
Add fn and fn_arg members into struct kernel_clone_args and test for them in copy_thread (instead of testing for PF_KTHREAD | PF_IO_WORKER). This allows any task that wants to be a user space task that only runs in kernel mode to use this functionality. The code on x86 is an exception and still retains a PF_KTHREAD test because x86, unlike everything else, handles kthreads slightly differently than user space tasks that start with a function. The functions that created tasks that start with a function have been updated to set ".fn" and ".fn_arg" instead of ".stack" and ".stack_size". These functions are fork_idle(), create_io_thread(), kernel_thread(), and user_mode_thread(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: "Eric W. Biederman" <[email protected]>
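A sketch of the shape of the change (the struct and helper names below are illustrative, not copied from the patch):

  struct kernel_clone_args_sketch {
      int (*fn)(void *);    /* start function for in-kernel tasks */
      void *fn_arg;         /* its argument */
  };

  /* In copy_thread(), this replaces testing PF_KTHREAD | PF_IO_WORKER
   * on the new task (x86 keeps an extra PF_KTHREAD test, see above). */
  static inline bool task_starts_with_fn(struct kernel_clone_args_sketch *args)
  {
      return args->fn != NULL;
  }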
2022-05-06  x86/mm: Simplify RESERVE_BRK()  (Josh Poimboeuf; 1 file, -19/+11)
RESERVE_BRK() reserves data in the .brk_reservation section. The data is initialized to zero, like BSS, so the macro specifies 'nobits' to prevent the data from taking up space in the vmlinux binary. The only way to get the compiler to do that (without putting the variable in .bss proper) is to use inline asm. The macro also has a hack which encloses the inline asm in a discarded function, which allows the size to be passed (global inline asm doesn't allow inputs). Remove the need for the discarded function hack by just stringifying the size rather than supplying it as an input to the inline asm. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
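A sketch of the simplified macro per the description (stringified size, no discarded-function wrapper; details may differ from the final patch):

  #define RESERVE_BRK(name, size)                                   \
      asm(".pushsection .brk_reservation,\"aw\",@nobits\n\t"        \
          ".brk." #name ":\n\t"                                     \
          ".skip " __stringify(size) "\n\t"                         \
          ".size .brk." #name ", " __stringify(size) "\n\t"         \
          ".popsection\n\t")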
2022-05-04  perf/x86/amd/core: Detect available counters  (Sandipan Das; 1 file, -0/+17)
If AMD Performance Monitoring Version 2 (PerfMonV2) is supported, use CPUID leaf 0x80000022 EBX to detect the number of Core PMCs. This offers more flexibility if the counts change in later processor families. Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/68a6d9688df189267db26530378870edd34f7b06.1650515382.git.sandipan.das@amd.com
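A sketch of the detection, assuming EBX[3:0] of leaf 0x80000022 reports the Core PMC count (the union and field names are illustrative):

  union cpuid_0x80000022_ebx {
      struct {
          unsigned int num_core_pmc:4;    /* EBX[3:0] */
      } split;
      unsigned int full;
  };

  static int amd_pmu_num_core_counters(void)
  {
      union cpuid_0x80000022_ebx ebx;

      ebx.full = cpuid_ebx(0x80000022);
      return ebx.split.num_core_pmc;
  }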
2022-05-04  x86/msr: Add PerfCntrGlobal* registers  (Sandipan Das; 1 file, -0/+5)
Add MSR definitions that will be used to enable the new AMD Performance Monitoring Version 2 (PerfMonV2) features. These include: * Performance Counter Global Control (PerfCntrGlobalCtl) * Performance Counter Global Status (PerfCntrGlobalStatus) * Performance Counter Global Status Clear (PerfCntrGlobalStatusClr) The new Performance Counter Global Control and Status MSRs provide an interface for enabling or disabling multiple counters at the same time and for testing overflow without probing the individual registers for each PMC. The availability of these registers is indicated through the PerfMonV2 feature bit of CPUID leaf 0x80000022 EAX. Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/cdc0d8f75bd519848731b5c64d924f5a0619a573.1650515382.git.sandipan.das@amd.com
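The new defines plausibly look like this (addresses per AMD's documentation; verify against msr-index.h):

  #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS       0xc0000300
  #define MSR_AMD64_PERF_CNTR_GLOBAL_CTL          0xc0000301
  #define MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR   0xc0000302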
2022-05-04  x86/cpufeatures: Add PerfMonV2 feature bit  (Sandipan Das; 1 file, -1/+1)
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 0 of EAX indicates support for Performance Monitoring Version 2 (PerfMonV2) features. If found to be set during PMU initialization, the EBX bits of the same CPUID function can be used to determine the number of available PMCs for different PMU types. Additionally, Core PMCs can be managed using new global control and status registers. For better utilization of feature words, PerfMonV2 is added as a scattered feature bit. Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/c70e497e22f18e7f05b025bb64ca21cc12b17792.1650515382.git.sandipan.das@amd.com
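A scattered feature bit is wired up via an entry in arch/x86/kernel/cpu/scattered.c; a sketch of the entry for PerfMonV2 (leaf 0x80000022, EAX bit 0; entry format is { feature, register, bit, leaf, sub-leaf }):

  { X86_FEATURE_PERFMON_V2,    CPUID_EAX,  0, 0x80000022, 0 },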
2022-05-03  efi: libstub: declare DXE services table  (Baskov Evgeniy; 1 file, -0/+5)
UEFI DXE services are not yet used in kernel code but are required to manipulate page table memory protection flags. Add required declarations to use DXE services functions. Signed-off-by: Baskov Evgeniy <[email protected]> Link: https://lore.kernel.org/r/[email protected] [ardb: ignore absent DXE table but warn if the signature check fails] Signed-off-by: Ard Biesheuvel <[email protected]>
2022-05-03  x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS  (Lai Jiangshan; 1 file, -8/+0)
XENPV doesn't use swapgs_restore_regs_and_return_to_usermode(), error_entry() and the code between entry_SYSENTER_compat() and entry_SYSENTER_compat_after_hwframe. Change the PV-compatible SWAPGS to the ASM instruction swapgs in these places. Also remove the definition of SWAPGS since it no longer has any users. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-03  x86/traps: Use pt_regs directly in fixup_bad_iret()  (Lai Jiangshan; 1 file, -1/+1)
Always stash the address error_entry() is going to return to, in %r12 and get rid of the void *error_entry_ret; slot in struct bad_iret_stack which was supposed to account for it and pt_regs pushed on the stack. After this, both fixup_bad_iret() and sync_regs() can work on a struct pt_regs pointer directly. [ bp: Rewrite commit message, touch ups. ] Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-02  x86/aperfmperf: Make it correct on 32bit and UP kernels  (Thomas Gleixner; 1 file, -3/+3)
The utilization of arch_scale_freq_tick() for CPU frequency readouts is incomplete as it failed to move the function prototype and the define out of the CONFIG_SMP && CONFIG_X86_64 #ifdef. Make them unconditionally available. Fixes: bb6e89df9028 ("x86/aperfmperf: Make parts of the frequency invariance code unconditional") Reported-by: kernel test robot <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]
2022-05-01  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds; 1 file, -4/+0)
Pull kvm fixes from Paolo Bonzini:

 "ARM:
  - Take care of faults occurring between the PARange and IPA range by injecting an exception
  - Fix S2 faults taken from a host EL0 in protected mode
  - Work around Oops caused by a PMU access from a 32bit guest when PMU has been created. This is a temporary bodge until we fix it for good.

 x86:
  - Fix potential races when walking host page table
  - Fix shadow page table leak when KVM runs nested
  - Work around bug in userspace when KVM synthesizes leaf 0x80000021 on older (pre-EPYC) or Intel processors

 Generic (but affects only RISC-V):
  - Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: work around QEMU issue with synthetic CPUID leaves
  Revert "x86/mm: Introduce lookup_address_in_mm()"
  KVM: x86/mmu: fix potential races when walking host page table
  KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
  KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  KVM: arm64: Inject exception on out-of-IPA-range translation fault
  KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set
  KVM: arm64: Handle host stage-2 faults from 32-bit EL0
2022-05-01  Merge tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 3 files, -0/+6)
Pull x86 fixes from Borislav Petkov:
 - A fix to disable PCI/MSI[-X] masking for XEN_HVM guests as that is solely controlled by the hypervisor
 - A build fix to make the function prototype (__warn()) as visible as the definition itself
 - A bunch of objtool annotation fixes which have accumulated over time
 - An ORC unwinder fix to handle bad input gracefully
 - Well, we thought the microcode gets loaded in time in order to restore the microcode-emulated MSRs but we thought wrong. So there's a fix for that to have the ordering done properly
 - Add new Intel model numbers
 - A spelling fix

* tag 'x86_urgent_for_v5.18_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
  bug: Have __warn() prototype defined unconditionally
  x86/Kconfig: fix the spelling of 'becoming' in X86_KERNEL_IBT config
  objtool: Use offstr() to print address of missing ENDBR
  objtool: Print data address for "!ENDBR" data warnings
  x86/xen: Add ANNOTATE_NOENDBR to startup_xen()
  x86/uaccess: Add ENDBR to __put_user_nocheck*()
  x86/retpoline: Add ANNOTATE_NOENDBR for retpolines
  x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline
  objtool: Enable unreachable warnings for CLANG LTO
  x86,objtool: Explicitly mark idtentry_body()s tail REACHABLE
  x86,objtool: Mark cpu_startup_entry() __noreturn
  x86,xen,objtool: Add UNWIND hint
  lib/strn*,objtool: Enforce user_access_begin() rules
  MAINTAINERS: Add x86 unwinding entry
  x86/unwind/orc: Recheck address range after stack info was updated
  x86/cpu: Load microcode during restore_processor_state()
  x86/cpu: Add new Alderlake and Raptorlake CPU model numbers
2022-04-29  KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest  (Lai Jiangshan; 1 file, -2/+3)
When shadowing 5-level NPT for a 4-level NPT L1 guest, the root_sp is allocated with role.level = 5 and the guest pagetable's root gfn. And root_sp->spt[0] is also allocated with the same gfn and the same role except role.level = 4. Luckily they are different shadow pages, but only root_sp->spt[0] is the real translation of the guest pagetable. Here comes a problem: if the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa) and uses the same gfn as the root page for nested NPT before and after the switch, the host (hCR4_LA57=1) might use the same root_sp for the guest even though the guest switched gCR4_LA57. The guest will see unexpected pages mapped and L2 may exploit the bug and hurt L1. It is lucky that the problem can't hurt L0. And three special cases need to be handled: The root_sp should be like role.direct=1 sometimes: its contents are not backed by gptes, root_sp->gfns is meaningless. (For a normal high level sp in shadow paging, sp->gfns is often unused and kept zero, but it could be relevant and meaningful if sp->gfns is used because they are backed by concrete gptes.) For such a root_sp in this case, root_sp is just a portal to contribute root_sp->spt[0], and root_sp->gfns should not be used and root_sp->spt[0] should not be dropped if gpte[0] of the guest root pagetable is changed. Such a root_sp should not be accounted either. So add role.passthrough to distinguish the shadow pages in the hash when gCR4_LA57 is toggled and fix the above special cases by using it in kvm_mmu_page_{get|set}_gfn() and sp_has_gptes(). Signed-off-by: Lai Jiangshan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: replace direct_map with root_role.direct  (Paolo Bonzini; 1 file, -1/+0)
direct_map is always equal to the direct field of the root page's role:
 - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is copied from cpu_role.base.direct
 - for TDP, it is always true and root_role.direct is also always true
 - for shadow TDP, it is always false and root_role.direct is also always false

Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: replace root_level with cpu_role.base.level  (Paolo Bonzini; 1 file, -1/+0)
Remove another duplicate field of struct kvm_mmu. This time it's the root level for page table walking; the separate field is always initialized as cpu_role.base.level, so its users can look up the CPU mode directly instead. Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: replace shadow_root_level with root_role.level  (Paolo Bonzini; 1 file, -1/+0)
root_role.level always has the same value as shadow_root_level:
 - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
 - it's the level argument when going through kvm_init_shadow_ept_mmu
 - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context

Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: remove redundant bits from extended role  (Paolo Bonzini; 1 file, -2/+0)
Before the separation of the CPU and the MMU role, CR0.PG was not available in the base MMU role, because two-dimensional paging always used direct=1 in the MMU role. However, now that the raw role is snapshotted in mmu->cpu_role, the value of CR0.PG always matches both !cpu_role.base.direct and cpu_role.base.level > 0. There is no need to store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg accessor by hand that takes care of the conversion. Use cpu_role.base.level since the future of the direct field is unclear. Likewise, CR4.PAE is now always present in the CPU role as !cpu_role.base.has_4_byte_gpte. The inversion makes certain tests on the MMU role easier, and is easily hidden by the is_cr4_pae accessor when operating on the CPU role. Signed-off-by: Paolo Bonzini <[email protected]>
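The hand-written accessors described above, roughly:

  static inline bool is_cr0_pg(struct kvm_mmu *mmu)
  {
      return mmu->cpu_role.base.level > 0;
  }

  static inline bool is_cr4_pae(struct kvm_mmu *mmu)
  {
      return !mmu->cpu_role.base.has_4_byte_gpte;
  }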
2022-04-29  KVM: x86/mmu: rename kvm_mmu_role union  (Paolo Bonzini; 1 file, -3/+3)
It is quite confusing that the "full" union is called kvm_mmu_role but is used for the "cpu_role" field of struct kvm_mmu. Rename it to kvm_cpu_role. Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: remove extended bits from mmu_role, rename field  (Paolo Bonzini; 1 file, -1/+1)
mmu_role represents the role of the root of the page tables. It does not need any extended bits, as those govern only KVM's page table walking; the is_* functions used for page table walking always use the CPU role. ext.valid is not present anymore in the MMU role, but an all-zero MMU role is impossible because the level field is never zero in the MMU role. So just zap the whole mmu_role in order to force invalidation after CPUID is updated. While making this change, which requires touching almost every occurrence of "mmu_role", rename it to "root_role". Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: remove ept_ad field  (Paolo Bonzini; 1 file, -1/+0)
The ept_ad field is used during page walk to determine if the guest PTEs have accessed and dirty bits. In the MMU role, the ad_disabled bit represents whether the *shadow* PTEs have the bits, so it would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just !mmu->mmu_role.base.ad_disabled. However, the similar field in the CPU mode, ad_disabled, is initialized correctly: to the opposite value of ept_ad for shadow EPT, and zero for non-EPT guest paging modes (which always have A/D bits). It is therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode, like other page-format fields; it just has to be inverted to account for the different polarity. In fact, now that the CPU mode is distinct from the MMU roles, it would even be possible to remove the PT_HAVE_ACCESSED_DIRTY macro altogether, and use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because the macro has a small effect in terms of dead code elimination:

     text    data    bss     dec     hex
   103544   16665    112  120321   1d601   # as of this patch
   103746   16665    112  120523   1d6cb   # without PT_HAVE_ACCESSED_DIRTY

Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86/mmu: split cpu_role from mmu_role  (Paolo Bonzini; 1 file, -0/+1)
Snapshot the state of the processor registers that govern page walk into a new field of struct kvm_mmu. This is a more natural representation than having it *mostly* in mmu_role but not exclusively; the delta right now is represented in other fields, such as root_level. The nested MMU now has only the CPU role; and in fact the new function kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role, except that it has role.base.direct equal to !CR0.PG. For a walk-only MMU, "direct" has no meaning, but we set it to !CR0.PG so that role.ext.cr0_pg can go away in a future patch. Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: x86: Clean up and document nested #PF workaround  (Sean Christopherson; 1 file, -0/+2)
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts  (Babu Moger; 1 file, -1/+1)
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID instructions. The TSC_AUX value is set using the WRMSR instruction to the TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID instructions. If the read/write of the TSC_AUX MSR is intercepted, then RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization is present. However, the RDPID instruction can't be intercepted. This means that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR read/write must not be intercepted for SEV-ES (or SEV-SNP) guests. Signed-off-by: Babu Moger <[email protected]> Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29  x86/cpufeatures: Add virtual TSC_AUX feature bit  (Babu Moger; 1 file, -0/+1)
The TSC_AUX Virtualization feature allows AMD SEV-ES guests to securely use the TSC_AUX (auxiliary time stamp counter data) MSR in RDTSCP and RDPID instructions. The TSC_AUX MSR is typically initialized to the APIC ID or another unique identifier so that software can quickly associate the returned TSC value with the logical processor. Add the feature bit and also include it in KVM for detection. Signed-off-by: Babu Moger <[email protected]> Acked-by: Borislav Petkov <[email protected]> Message-Id: <165040157111.1399644.6123821125319995316.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <[email protected]>
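A sketch of the new define in cpufeatures.h (word 19 holds the CPUID 0x8000001F EAX bits; the exact bit position should be checked against the APM):

  #define X86_FEATURE_V_TSC_AUX    (19*32 + 9) /* Virtual TSC_AUX */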
2022-04-29  Revert "x86/mm: Introduce lookup_address_in_mm()"  (Sean Christopherson; 1 file, -4/+0)
Drop lookup_address_in_mm() now that KVM is providing its own variant of lookup_address_in_pgd() that is safe for use with user addresses, e.g. guards against page tables being torn down. A variant that provides a non-init mm is inherently dangerous and flawed, as the only reason to use an mm other than init_mm is to walk a userspace mapping, and lookup_address_in_pgd() does not play nice with userspace mappings, e.g. doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE() to ensure an upper level entry isn't converted to a huge page between checking the PAGE_SIZE bit and grabbing the address of the next level down. This reverts commit 13c72c060f1ba6f4eddd7b1c4f52a8aded43d6d9. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <YmwIi3bXr/1yhYV/@google.com> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-28  x86/mm: enable ARCH_HAS_VM_GET_PAGE_PROT  (Christoph Hellwig; 1 file, -5/+0)
This defines and exports a platform specific custom vm_get_page_prot() via subscribing ARCH_HAS_VM_GET_PAGE_PROT. This also unsubscribes from config ARCH_HAS_FILTER_PGPROT, after dropping off arch_filter_pgprot() and arch_vm_get_page_prot(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Anshuman Khandual <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: David S. Miller <[email protected]> Cc: Khalid Aziz <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
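A sketch of the x86 override, assuming it folds the old arch_vm_get_page_prot() pkey handling into the generic protection_map lookup:

  pgprot_t vm_get_page_prot(unsigned long vm_flags)
  {
      unsigned long val = pgprot_val(protection_map[vm_flags &
                          (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)]);

  #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
      /* pkeys are an x86-specific addition on top of the r/w/x bits */
      if (vm_flags & VM_PKEY_BIT0)
          val |= _PAGE_PKEY_BIT0;
      if (vm_flags & VM_PKEY_BIT1)
          val |= _PAGE_PKEY_BIT1;
      if (vm_flags & VM_PKEY_BIT2)
          val |= _PAGE_PKEY_BIT2;
      if (vm_flags & VM_PKEY_BIT3)
          val |= _PAGE_PKEY_BIT3;
  #endif
      return __pgprot(val);
  }
  EXPORT_SYMBOL(vm_get_page_prot);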
2022-04-27  x86/aperfmperf: Make parts of the frequency invariance code unconditional  (Thomas Gleixner; 2 files, -4/+2)
The frequency invariance support is currently limited to x86/64 and SMP, which is the vast majority of machines. arch_scale_freq_tick() is called every tick on all CPUs and reads the APERF and MPERF MSRs. The CPU frequency getter functions do the same via dedicated IPIs. While it could be argued that on systems where frequency invariance support is disabled (32bit, !SMP) the per tick read of the APERF and MPERF MSRs can be avoided, it does not make sense to keep the extra code and the resulting runtime issues of mass IPIs around. As a first step split out the non frequency invariance specific initialization code and the read MSR portion of arch_scale_freq_tick(). The rest of the code is still conditional and guarded with a static key. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
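A heavily simplified sketch of the resulting split (the helper and per-CPU variable names here are assumptions, not taken from the patch):

  void arch_scale_freq_tick(void)
  {
      struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
      u64 acnt, mcnt, aperf, mperf;

      if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
          return;

      /* Unconditional part: sample the MSRs on every tick. */
      rdmsrl(MSR_IA32_APERF, aperf);
      rdmsrl(MSR_IA32_MPERF, mperf);
      acnt = aperf - s->aperf;
      mcnt = mperf - s->mperf;
      s->aperf = aperf;
      s->mperf = mperf;

      /* The frequency-invariance part stays behind the static key. */
      if (arch_scale_freq_invariant())
          scale_freq_tick(acnt, mcnt);
  }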
2022-04-27  x86/aperfmperf: Untangle Intel and AMD frequency invariance init  (Thomas Gleixner; 1 file, -9/+4)
AMD boot CPU initialization happens late via ACPI/CPPC which prevents the Intel parts from being marked __init. Split out the common code and provide a dedicated interface for the AMD initialization and mark the Intel specific code and data __init. The remaining text size is almost cut in half:

  text:       2614 -> 1350
  init.text:     0 ->  786

Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/aperfmperf: Separate AP/BP frequency invariance init  (Thomas Gleixner; 1 file, -7/+5)
This code is convoluted and because it can be invoked post init via the ACPI/CPPC code, all of the initialization functionality is built in instead of being part of init text and init data. As a first step create separate calls for the boot and the application processors. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/split-lock: Remove unused TIF_SLD bit  (Tony Luck; 2 files, -5/+1)
Changes to the "warn" mode of split lock handling mean that TIF_SLD is never set. Remove the bit, and the functions that use it. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/pm: Fix false positive kmemleak report in msr_build_context()  (Matthieu Baerts; 2 files, -5/+9)
Since e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume") kmemleak reports this issue:

  unreferenced object 0xffff888009cedc00 (size 256):
    comm "swapper/0", pid 1, jiffies 4294693823 (age 73.764s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 48 00 00 00 00 00 00 00  ........H.......
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      msr_build_context (include/linux/slab.h:621)
      pm_check_save_msr (arch/x86/power/cpu.c:520)
      do_one_initcall (init/main.c:1298)
      kernel_init_freeable (init/main.c:1370)
      kernel_init (init/main.c:1504)
      ret_from_fork (arch/x86/entry/entry_64.S:304)

Reproducer:
 - boot the VM with a debug kernel config (see https://github.com/multipath-tcp/mptcp_net-next/issues/268)
 - wait ~1 minute
 - start a kmemleak scan

The root cause here is alignment within the packed struct saved_context (from suspend_64.h). Kmemleak only searches for pointers that are aligned (see how pointers are scanned in kmemleak.c), but pahole shows that the saved_msrs struct member and all members after it in the structure are unaligned:

  struct saved_context {
          struct pt_regs             regs;                /*    0   168 */
          /* --- cacheline 2 boundary (128 bytes) was 40 bytes ago --- */
          u16                        ds;                  /*  168     2 */
          ...
          u64                        misc_enable;         /*  232     8 */
          bool                       misc_enable_saved;   /*  240     1 */
          /* Note below odd offset values for the remainder of this struct */
          struct saved_msrs          saved_msrs;          /*  241    16 */
          /* --- cacheline 4 boundary (256 bytes) was 1 bytes ago --- */
          long unsigned int          efer;                /*  257     8 */
          u16                        gdt_pad;             /*  265     2 */
          struct desc_ptr            gdt_desc;            /*  267    10 */
          u16                        idt_pad;             /*  277     2 */
          struct desc_ptr            idt;                 /*  279    10 */
          u16                        ldt;                 /*  289     2 */
          u16                        tss;                 /*  291     2 */
          long unsigned int          tr;                  /*  293     8 */
          long unsigned int          safety;              /*  301     8 */
          long unsigned int          return_address;      /*  309     8 */

          /* size: 317, cachelines: 5, members: 25 */
          /* last cacheline: 61 bytes */
  } __attribute__((__packed__));

Move misc_enable_saved to the end of the struct declaration so that saved_msrs fits in before the cacheline 4 boundary. The comment above the saved_context declaration says to fix the wakeup_64.S file and __save/__restore_processor_state() if the struct is modified: it looks like all the accesses in wakeup_64.S are done through offsets which are computed at build-time. Update that comment accordingly. In the end, the false positive kmemleak report is due to a limitation of kmemleak, but it is always good to avoid unaligned members for optimisation purposes. Please note that it looks like this issue is not new, e.g. https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ [ bp: Massage + cleanup commit message. ] Fixes: 7a9c2dd08ead ("x86/pm: Introduce quirk framework to save/restore extra MSR registers around suspend/resume") Suggested-by: Mat Martineau <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-27  x86/sev: Get the AP jump table address from secrets page  (Brijesh Singh; 1 file, -0/+35)
The GHCB specification section 2.7 states that when SEV-SNP is enabled, a guest should not rely on the hypervisor to provide the address of the AP jump table. Instead, if a guest BIOS wants to provide an AP jump table, it should record the address in the SNP secrets page so the guest operating system can obtain it directly from there. Fix this on the guest kernel side by having SNP guests use the AP jump table address published in the secrets page rather than issuing a GHCB request to get it. [ mroth: - Improve error handling when ioremap()/memremap() return NULL - Don't mix function calls with declarations - Add missing __init - Tweak commit message ] Fixes: 0afb6b660a6b ("x86/sev: Use SEV-SNP AP creation to start secondary CPUs") Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-26  asm-generic: compat: Cleanup duplicate definitions  (Guo Ren; 1 file, -68/+12)
There are 7 64bit architectures that support Linux COMPAT mode to run 32bit applications. A lot of definitions are duplicated:
 - COMPAT_USER_HZ
 - COMPAT_RLIM_INFINITY
 - COMPAT_OFF_T_MAX
 - __compat_uid_t, __compat_gid_t
 - compat_dev_t
 - compat_ipc_pid_t
 - struct compat_flock
 - struct compat_flock64
 - struct compat_statfs
 - struct compat_ipc64_perm, compat_semid64_ds, compat_msqid64_ds, compat_shmid64_ds

Clean up the duplicate definitions and merge them into asm-generic. Signed-off-by: Guo Ren <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-26  fs: stat: compat: Add __ARCH_WANT_COMPAT_STAT  (Guo Ren; 1 file, -0/+1)
RISC-V doesn't need compat_stat, so use __ARCH_WANT_COMPAT_STAT to exclude the unnecessary SYSCALL functions. Signed-off-by: Guo Ren <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2022-04-26  compat: consolidate the compat_flock{,64} definition  (Christoph Hellwig; 1 file, -17/+3)
Provide a single common definition for the compat_flock and compat_flock64 structures using the same tricks as for the native variants. Another extra define is added for the packing required on x86. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Acked-by: Helge Deller <[email protected]> # parisc Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
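A sketch of the consolidated definition (the pad/pack hook names below are from memory and may differ from the patch; x86 is assumed to supply the packed attribute via its hook, which is the "extra define" mentioned above):

  struct compat_flock64 {
      short           l_type;
      short           l_whence;
      compat_loff_t   l_start;
      compat_loff_t   l_len;
      compat_pid_t    l_pid;
  #ifdef __ARCH_COMPAT_FLOCK64_PAD
      __ARCH_COMPAT_FLOCK64_PAD
  #endif
  } __ARCH_COMPAT_FLOCK64_PACKED;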
2022-04-26  uapi: always define F_GETLK64/F_SETLK64/F_SETLKW64 in fcntl.h  (Christoph Hellwig; 1 file, -4/+0)
The F_GETLK64/F_SETLK64/F_SETLKW64 fcntl opcodes are only implemented for the 32-bit syscall APIs, but are also needed for compat handling on 64-bit kernels. Consolidate them in fcntl.h instead of defining the internal compat definitions in compat.h, which is rather error prone (e.g. parisc gets the values wrong currently). Note that before this change they were never visible to userspace due to the fact that CONFIG_64BIT is only set for kernel builds. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Guo Ren <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Tested-by: Heiko Stuebner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>