path: root/arch/x86
Age | Commit message | Author | Files | Lines
2024-06-12 | x86/shstk: Make return uprobe work with shadow stack | Jiri Olsa | 3 | -1/+19
Currently, an application with shadow stack enabled will crash if it sets up a return uprobe. The reason is that the uretprobe kernel code changes the user space task's stack, but does not update the shadow stack accordingly. Add new functions to update values on the shadow stack and use them in the uprobe code to keep the shadow stack in sync with uretprobe changes to the user stack. Link: https://lore.kernel.org/all/[email protected]/ Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Rick Edgecombe <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface") Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-06-11 | x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checking | Kees Cook | 2 | -3/+7
When reworking the range checking for get_user(), the get_user_8() case on 32-bit wasn't zeroing the high register. (The jump to bad_get_user_8 was accidentally dropped.) Restore the correct error handling destination (and rename the jump target to use the expected ".L" prefix). While here, switch to using a named argument ("size") for the call template ("%c4" to "%c[size]") as already used in the other call templates in this file. Found after moving the usercopy selftests to KUnit:

  # usercopy_test_invalid: EXPECTATION FAILED at lib/usercopy_kunit.c:278
  Expected val_u64 == 0, but
      val_u64 == -60129542144 (0xfffffff200000000)

Closes: https://lore.kernel.org/all/CABVgOSn=tb=Lj9SxHuT4_9MTjjKVxsq-ikdXC4kGHO4CfKVmGQ@mail.gmail.com Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Reported-by: David Gow <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Qiuxu Zhuo <[email protected]> Tested-by: David Gow <[email protected]> Link: https://lore.kernel.org/all/20240610210213.work.143-kees%40kernel.org
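For reference, a minimal sketch of the guarantee the fix restores (fragment only; the pointer value is a deliberately bad, hypothetical address):

  u64 val_u64 = 0xdeadbeefdeadbeefULL;
  u64 __user *bad_uptr = (u64 __user *)-8;   /* hypothetical faulting user address */

  /*
   * On a 32-bit kernel, get_user() of a u64 from a faulting address must
   * return -EFAULT *and* zero both halves of the destination; the bug left
   * the upper 32 bits stale (hence 0xfffffff2_00000000 in the report above).
   */
  if (get_user(val_u64, bad_uptr))
          WARN_ON(val_u64 != 0);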
2024-06-11 | KVM: x86: Drop now-superfluous setting of l1tf_flush_l1d in vcpu_run() | Sean Christopherson | 2 | -4/+4
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(), drop the redundant store from vcpu_run(). The flag is cleared only when VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's impossible for l1tf_flush_l1d to be cleared between loading the vCPU and calling vcpu_run(). Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load | Sean Christopherson | 1 | -6/+5
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting it only when the vCPU is being scheduled back in. The flag is processed only when VM-Enter is imminent, and KVM obviously needs to load the vCPU before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no meaningful value. I.e. the flag _will_ be set either way, it's simply a matter of when. Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
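A rough sketch of the resulting shape of the arch load hook (simplified, not the verbatim patch):

  void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
  {
          /*
           * The flag is consumed only immediately before VM-Enter, so setting
           * it unconditionally on every load is harmless and simpler than
           * trying to set it only when the vCPU is scheduled back in.
           */
          vcpu->arch.l1tf_flush_l1d = true;

          /* ... rest of the load path ... */
  }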
2024-06-11 | KVM: Delete the now unused kvm_arch_sched_in() | Sean Christopherson | 2 | -8/+3
Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <[email protected]> Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load() | Sean Christopherson | 7 | -27/+16
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying off the recently added kvm_vcpu.scheduled_out as appropriate. Note, there is a very slight functional change, as PLE shrink updates will now happen after blasting WBINVD, but that is quite uninteresting as the two operations do not interact in any way. Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load() | Sean Christopherson | 1 | -32/+32
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation of moving the sched_in logic, which handles shrinking the PLE window, into vmx_vcpu_load(). No functional change intended. Acked-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | KVM: x86: Don't re-setup empty IRQ routing when KVM_CAP_SPLIT_IRQCHIP | Yi Wang | 3 | -11/+0
Now that KVM sets up empty IRQ routing during VM creation, don't recreate empty routing during KVM_CAP_SPLIT_IRQCHIP. Setting IRQ routes during KVM_CAP_SPLIT_IRQCHIP can result in 20+ milliseconds of delay due to the synchronize_srcu_expedited() call in kvm_set_irq_routing(). Note, the empty routing is guaranteed to be intact as KVM x86 only allows changing the IRQ routing after an in-kernel IRQCHIP has been created, and KVM_CAP_SPLIT_IRQCHIP is disallowed after creating an IRQCHIP. Signed-off-by: Yi Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: massage changelog, remove unused empty_routing array] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | x86/cpufeatures: Add AMD FAST CPPC feature flag | Perry Yuan | 2 | -0/+2
Some AMD Zen 4 processors support a new feature, FAST CPPC, which allows for a faster CPPC loop due to internal architectural enhancements. The goal of this faster loop is higher performance at the same power consumption. Reference: See page 99 of the PPR for AMD Family 19h Model 61h rev. B1, docID 56713. Signed-off-by: Perry Yuan <[email protected]> Signed-off-by: Xiaojian Du <[email protected]> Reviewed-by: Borislav Petkov (AMD) <[email protected]>
2024-06-11 | KVM: x86/pmu: Add a helper to enable bits in FIXED_CTR_CTRL | Sean Christopherson | 1 | -10/+12
Add a helper, intel_pmu_enable_fixed_counter_bits(), to dedup code that enables fixed counter bits, i.e. when KVM clears bits in the reserved mask used to detect invalid MSR_CORE_PERF_FIXED_CTR_CTRL values. No functional change intended. Cc: Dapeng Mi <[email protected]> Reviewed-by: Dapeng Mi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-11 | x86/alternative: Replace the old macros | Borislav Petkov (AMD) | 1 | -121/+32
Now that the new macros have been gradually put in place, replace the old ones. Leave the new label numbers starting at 7xx as a hint that the new nested alternatives are being used now. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert the asm ALTERNATIVE_3() macro | Borislav Petkov (AMD) | 1 | -25/+0
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert the asm ALTERNATIVE_2() macro | Borislav Petkov (AMD) | 1 | -22/+0
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert the asm ALTERNATIVE() macro | Borislav Petkov (AMD) | 1 | -21/+1
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag | Thomas Prescher | 2 | -0/+4
When a vCPU is interrupted by a signal while running a nested guest, KVM will exit to userspace with L2 state. However, userspace has no way to know whether it sees L1 or L2 state (besides calling KVM_GET_STATS_FD, which does not have a stable ABI). This causes multiple problems: The simplest one is L2 state corruption when userspace marks the sregs as dirty. See this mailing list thread [1] for a complete discussion. Another problem is that if userspace decides to continue by emulating instructions, it will unknowingly emulate with L2 state as if L1 doesn't exist, which can be considered a weird guest escape. Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data structure, which is set when the vCPU exited while running a nested guest. Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to advertise the functionality to userspace. [1] https://lore.kernel.org/kvm/[email protected]/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55 Signed-off-by: Thomas Prescher <[email protected]> Signed-off-by: Julian Stecklina <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
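For illustration, a hedged sketch of how a VMM's run loop might consume the new flag (vcpu_fd and run are placeholder names for the usual per-vCPU file descriptor and mmap()ed kvm_run area):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* run = mmap() of the vCPU fd; vcpu_fd obtained via KVM_CREATE_VCPU */
  ioctl(vcpu_fd, KVM_RUN, 0);

  if (run->flags & KVM_RUN_X86_GUEST_MODE) {
          /* Exit state belongs to the nested (L2) guest: don't interpret
           * or write back regs/sregs as if they were L1 state. */
  }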
2024-06-11 | x86/alternative: Convert ALTERNATIVE_3() | Borislav Petkov (AMD) | 2 | -29/+9
Zap the hack of using an ALTERNATIVE_3() internal label, as suggested by bgerst: https://lore.kernel.org/r/CAMzpN2i4oJ-Dv0qO46Fd-DxNv5z9=x%2BvO%[email protected] in favor of a label local to this macro only, as it should be done. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert ALTERNATIVE_TERNARY() | Borislav Petkov (AMD) | 1 | -6/+0
The C macro. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert alternative_call_2() | Borislav Petkov (AMD) | 1 | -15/+6
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert alternative_call() | Borislav Petkov (AMD) | 1 | -5/+1
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert alternative_io() | Borislav Petkov (AMD) | 1 | -5/+1
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert alternative_input() | Borislav Petkov (AMD) | 1 | -5/+1
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternative: Convert alternative_2() | Borislav Petkov (AMD) | 1 | -3/+0
Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it | Steven Rostedt (Google) | 1 | -2/+0
All architectures that implement function graph also implements HAVE_FUNCTION_GRAPH_RET_ADDR_PTR. Remove it, as it is no longer a differentiator. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Guo Ren <[email protected]> Cc: Huacai Chen <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Albert Ou <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-11 | x86/alternative: Convert alternative() | Borislav Petkov (AMD) | 1 | -4/+1
Split conversion deliberately into minimal pieces to ease bisection because debugging alternatives is a nightmare. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | x86/alternatives: Add nested alternatives macros | Peter Zijlstra | 2 | -4/+136
Instead of making increasingly complicated ALTERNATIVE_n() implementations, use a nested alternative expression. The only difference between:

  ALTERNATIVE_2(oldinst, newinst1, flag1, newinst2, flag2)

and

  ALTERNATIVE(ALTERNATIVE(oldinst, newinst1, flag1), newinst2, flag2)

is that the outer alternative can add additional padding when the inner alternative is the shorter one, which then results in alt_instr::instrlen being inconsistent. However, this is easily remedied since the alt_instr entries will be consecutive and it is trivial to compute max(alt_instr::instrlen) at runtime while patching.

Specifically, after this the ALTERNATIVE_2 macro, after CPP expansion (and manual layout), looks like this:

  .macro ALTERNATIVE_2 oldinstr, newinstr1, ft_flags1, newinstr2, ft_flags2
  740:
  740: \oldinstr ;
  741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
  742: .pushsection .altinstructions,"a" ;
       altinstr_entry 740b,743f,\ft_flags1,742b-740b,744f-743f ;
       .popsection ;
       .pushsection .altinstr_replacement,"ax" ;
  743: \newinstr1 ;
  744: .popsection ; ;
  741: .skip -(((744f-743f)-(741b-740b)) > 0) * ((744f-743f)-(741b-740b)),0x90 ;
  742: .pushsection .altinstructions,"a" ;
       altinstr_entry 740b,743f,\ft_flags2,742b-740b,744f-743f ;
       .popsection ;
       .pushsection .altinstr_replacement,"ax" ;
  743: \newinstr2 ;
  744: .popsection ;
  .endm

The only label that is ambiguous is 740, however they all reference the same spot, so that doesn't matter.

NOTE: obviously only @oldinstr may be an alternative; making @newinstr an alternative would mean patching .altinstr_replacement which very likely isn't what is intended, also the labels will be confused in that case.

[ bp: Debug an issue where it would match the wrong two insns and consider them nested due to the same signed offsets in the .alternative section, and use instr_va() to compare the full virtual addresses instead.
  - Use new labels to denote that the new, nested alternatives are being used when staring at preprocessed output.
  - Use the %c constraint everywhere instead of %P and document the difference for future reference. ]

Signed-off-by: Peter Zijlstra <[email protected]> Co-developed-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
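Conceptually, the series makes the multi-feature variants plain nestings of the basic ALTERNATIVE() building block, roughly like this (a sketch of the idea, not the exact kernel macro text):

  /* Each extra (replacement, feature-flag) pair is one more layer of nesting. */
  #define ALTERNATIVE_2(oldinstr, newinstr1, ft_flags1, newinstr2, ft_flags2)     \
          ALTERNATIVE(ALTERNATIVE(oldinstr, newinstr1, ft_flags1),                \
                      newinstr2, ft_flags2)

  #define ALTERNATIVE_3(oldinstr, newinstr1, ft_flags1, newinstr2, ft_flags2,     \
                        newinstr3, ft_flags3)                                     \
          ALTERNATIVE(ALTERNATIVE_2(oldinstr, newinstr1, ft_flags1,               \
                                    newinstr2, ft_flags2),                        \
                      newinstr3, ft_flags3)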
2024-06-11 | x86/alternative: Zap alternative_ternary() | Borislav Petkov (AMD) | 1 | -3/+0
Unused. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-11 | perf/x86: Serialize set_attr_rdpmc() | Thomas Gleixner | 1 | -0/+3
Yue and Xingwei reported a jump label failure. It's caused by the lack of serialization in set_attr_rdpmc():

  CPU0                                    CPU1

  Assume: x86_pmu.attr_rdpmc == 0

  if (val != x86_pmu.attr_rdpmc) {
    if (val == 0)
      ...
    else if (x86_pmu.attr_rdpmc == 0)
      static_branch_dec(&rdpmc_never_available_key);

                                          if (val != x86_pmu.attr_rdpmc) {
                                            if (val == 0)
                                              ...
                                            else if (x86_pmu.attr_rdpmc == 0)
    FAIL, due to imbalance --->               static_branch_dec(&rdpmc_never_available_key);

The reported BUG() is a consequence of the above and of another bug in the jump label core code. The core code needs a separate fix, but that cannot prevent the imbalance problem caused by set_attr_rdpmc(). Prevent this by serializing set_attr_rdpmc() locally. Fixes: a66734297f78 ("perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks") Closes: https://lore.kernel.org/r/CAEkJfYNzfW1vG=ZTMdz_Weoo=RXY1NDunbxnDaLyj8R4kEoE_w@mail.gmail.com Reported-by: Yue Sun <[email protected]> Reported-by: Xingwei Lee <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
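The fix boils down to taking a lock around the read-check-update sequence; a minimal sketch of the pattern (the mutex name is illustrative):

  static DEFINE_MUTEX(rdpmc_mutex);               /* illustrative name */

  static ssize_t set_attr_rdpmc(struct device *cdev, struct device_attribute *attr,
                                const char *buf, size_t count)
  {
          /* ... parse 'val' from buf ... */

          mutex_lock(&rdpmc_mutex);               /* serialize check vs. static branch update */
          if (val != x86_pmu.attr_rdpmc) {
                  /* ... static_branch_inc()/static_branch_dec() as before ... */
                  x86_pmu.attr_rdpmc = val;
          }
          mutex_unlock(&rdpmc_mutex);

          return count;
  }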
2024-06-11 | x86/sev: Use kernel provided SVSM Calling Areas | Tom Lendacky | 6 | -39/+362
The SVSM Calling Area (CA) is used to communicate between Linux and the SVSM. Since the firmware supplied CA for the BSP is likely to be in reserved memory, switch off that CA to a kernel provided CA so that access and use of the CA is available during boot. The CA switch is done using the SVSM core protocol SVSM_CORE_REMAP_CA call. An SVSM call is executed by filling out the SVSM CA and setting the proper register state as documented by the SVSM protocol. The SVSM is invoked by requesting the hypervisor to run VMPL0. Once it is safe to allocate/reserve memory, allocate a CA for each CPU. After allocating the new CAs, the BSP will switch from the boot CA to the per-CPU CA. The CA for an AP is identified to the SVSM when creating the VMSA in preparation for booting the AP. [ bp: Heavily simplify svsm_issue_call() asm, other touchups. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/fa8021130bcc3bcf14d722a25548cb0cdf325456.1717600736.git.thomas.lendacky@amd.com
2024-06-11 | x86/sev: Check for the presence of an SVSM in the SNP secrets page | Tom Lendacky | 5 | -9/+132
During early boot phases, check for the presence of an SVSM when running as an SEV-SNP guest. An SVSM is present if not running at VMPL0 and the 64-bit value at offset 0x148 into the secrets page is non-zero. If an SVSM is present, save the SVSM Calling Area address (CAA), located at offset 0x150 into the secrets page, and set the VMPL level of the guest, which should be non-zero, to indicate the presence of an SVSM. [ bp: Touchups. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/9d3fe161be93d4ea60f43c2a3f2c311fe708b63b.1717600736.git.thomas.lendacky@amd.com
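A loose sketch of the detection described above (variable and field names are illustrative; the real layout is defined by the SNP/SVSM specifications):

  /* secrets_page is a mapping of the SNP secrets page; vmpl is the current VMPL. */
  u64 svsm_base = *(u64 *)((u8 *)secrets_page + 0x148);
  u64 svsm_caa  = *(u64 *)((u8 *)secrets_page + 0x150);

  if (vmpl != 0 && svsm_base) {
          /* An SVSM is present: remember its Calling Area and our non-zero VMPL. */
          boot_svsm_caa_pa = svsm_caa;    /* illustrative names */
          guest_vmpl       = vmpl;
  }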
2024-06-11 | x86/irqflags: Provide native versions of the local_irq_save()/restore() | Tom Lendacky | 1 | -0/+20
Functions that need to disable IRQs, but are common to both early boot and post-boot execution, are unable to deal with paravirt support associated with local_irq_save() and local_irq_restore(). Create native versions of these for use in these situations. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/c4c33c0d07200164a3dd8cfd6da0344f57732648.1717600736.git.thomas.lendacky@amd.com
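A minimal sketch of what such native helpers look like, built from the existing native_save_fl()/native_irq_disable()/native_irq_enable() primitives (simplified, not necessarily the exact patch):

  static __always_inline unsigned long native_local_irq_save(void)
  {
          unsigned long flags = native_save_fl();   /* read EFLAGS directly, no paravirt */

          native_irq_disable();                     /* cli */
          return flags;
  }

  static __always_inline void native_local_irq_restore(unsigned long flags)
  {
          if (flags & X86_EFLAGS_IF)                /* re-enable only if IRQs were on */
                  native_irq_enable();              /* sti */
  }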
2024-06-10 | KVM: x86: Bury guest_cpuid_is_amd_or_hygon() in cpuid.c | Sean Christopherson | 2 | -10/+12
Move guest_cpuid_is_amd_or_hygon() into cpuid.c now that, except for one Intel quirk in the emulator, KVM checks for AMD vs. Intel *compatible* vCPUs, not exact vendors, i.e. now that there should not be any reason for KVM at-large to care about the exact vendor. Opportunistically refactor the guts of the helper to use "entry" instead of "best", and short circuit the !entry path to make the common case more readable. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Open code vendor_intel() in string_registers_quirk() | Sean Christopherson | 1 | -12/+8
Open code the is_guest_vendor_intel() check in string_registers_quirk() to discourage making exact vendor==Intel checks in the emulator, and to remove the rather awful #ifdeffery. The string quirk is literally the only Intel specific, *non-architectural* behavior that KVM emulates. All Intel specific behavior that is architecturally defined applies to all vendors that are compatible with Intel's architecture, i.e. should use guest_cpuid_is_intel_compatible(). Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Allow SYSENTER in Compatibility Mode for all Intel compat vCPUs | Sean Christopherson | 1 | -4/+6
Emulate SYSENTER in Compatibility Mode for all vCPU models that are compatible with Intel's architecture, as the behavior of SYSENTER is architecturally defined in Intel's SDM, i.e. should be followed by any CPU that implements Intel's architecture. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: SVM: Emulate SYSENTER RIP/RSP behavior for all Intel compat vCPUs | Sean Christopherson | 2 | -15/+7
Emulate bits 63:32 of the SYSENTER_R{I,S}P MSRs for all vCPUs that are compatible with Intel's architecture, not just strictly vCPUs that have vendor==Intel. The behavior of bits 63:32 is architecturally defined in the SDM, i.e. not some uarch specific quirk of Intel CPUs. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Use "is Intel compatible" helper to emulate SYSCALL in !64-bit | Sean Christopherson | 3 | -36/+16
Use guest_cpuid_is_intel_compatible() to determine whether SYSCALL in 32-bit Protected Mode (including Compatibility Mode) should #UD or succeed. The existing code already does the exact equivalent of guest_cpuid_is_intel_compatible(), just in a rather roundabout way. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Inhibit code #DBs in MOV-SS shadow for all Intel compat vCPUs | Sean Christopherson | 1 | -8/+6
Treat code #DBs as inhibited in MOV/POP-SS shadows for vCPU models that are Intel compatible, not just strictly vCPUs with vendor==Intel. The behavior is explicitly called out in the SDM, and thus architectural, i.e. applies to all CPUs that implement Intel's architecture, and isn't a quirk that is unique to CPUs manufactured by Intel:

  However, if an instruction breakpoint is placed on an instruction located
  immediately after a POP SS/MOV SS instruction, the breakpoint will be
  suppressed as if EFLAGS.RF were 1.

Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit baf67ca8e545 ("KVM: x86: Suppress code #DBs on Intel if MOV/POP SS blocking is active"). Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Apply Intel's TSC_AUX reserved-bit behavior to Intel compat vCPUs | Sean Christopherson | 1 | -4/+4
Extend Intel's check on MSR_TSC_AUX[63:32] to all vCPU models that are Intel compatible, i.e. aren't AMD or Hygon in KVM's world, as the behavior is architectural, i.e. applies to any CPU that is compatible with Intel's architecture. Applying the behavior strictly to Intel wasn't intentional, KVM simply didn't have a concept of "Intel compatible" as of commit 61a05d444d2c ("KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU model"). Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86/pmu: Squash period for checkpointed events based on host HLE/RTM | Sean Christopherson | 1 | -1/+1
Zero out the sampling period for checkpointed events if the host supports HLE or RTM, i.e. supports transactions and thus checkpointed events, not based on whether the vCPU vendor model is Intel. Perf's refusal to allow a sample period for checkpointed events is based purely on whether or not the CPU supports HLE/RTM transactions, i.e. perf has no knowledge of the vCPU vendor model. Note, it is _extremely_ unlikely that the existing code is a problem in real world usage, as there are far, far bigger hurdles that would need to be cleared to support cross-vendor vPMUs. The motivation is mainly to eliminate the use of guest_cpuid_is_intel(), in order to get to a state where KVM pivots on AMD vs. Intel compatibility, i.e. doesn't check for exactly vendor==Intel except in rare circumstances (i.e. for CPU quirks). Cc: Like Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: VMX: Remove unused declaration of vmx_request_immediate_exit() | Binbin Wu | 1 | -1/+0
After commit 0ec3d6d1f169 "KVM: x86: Fully defer to vendor code to decide how to force immediate exit", vmx_request_immediate_exit() was removed. Commit 5f18c642ff7e "KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX" added its declaration by accident. Remove it. Signed-off-by: Binbin Wu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definition | Hou Wenlong | 2 | -2/+0
The check_apicv_inhibit_reasons() callback implementation was dropped in commit b3f257a84696 ("KVM: x86: Track required APICv inhibits with variable, not callback"), but removal of the callback definition was missed in the final version of that patch (it had been removed in v4). Therefore, drop it now, and also remove the vmx_check_apicv_inhibit_reasons() function declaration. Signed-off-by: Hou Wenlong <[email protected]> Reviewed-by: Alejandro Jimenez <[email protected]> Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <[email protected]>
2024-06-10 | x86/resctrl: Replace open coded cacheinfo searches | Tony Luck | 2 | -20/+11
pseudo_lock_region_init() and rdtgroup_cbm_to_size() open code a search for details of a particular cache level. Replace with get_cpu_cacheinfo_level(). Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
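For context, a hedged sketch of the simplification (assuming get_cpu_cacheinfo_level(cpu, level) returns the matching struct cacheinfo or NULL):

  struct cacheinfo *ci = get_cpu_cacheinfo_level(cpu, 3);  /* e.g. the L3 cache */

  if (ci) {
          /* Replaces open-coded walks of get_cpu_cacheinfo(cpu)->info_list. */
          cache_size = ci->size;
          cache_id   = ci->id;
  }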
2024-06-08 | Merge tag 'x86-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 2 | -3/+17
Pull x86 fixes from Ingo Molnar:
 "Miscellaneous fixes:
   - Fix kexec() crash if call depth tracking is enabled
   - Fix SMN reads on inaccessible registers on certain AMD systems"
* tag 'x86-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd_nb: Check for invalid SMN reads
  x86/kexec: Fix bug with call depth tracking
2024-06-07 | KVM: VMX: Always honor guest PAT on CPUs that support self-snoop | Sean Christopherson | 2 | -7/+11
Unconditionally honor guest PAT on CPUs that support self-snoop, as Intel has confirmed that CPUs that support self-snoop always snoop caches and store buffers. I.e. CPUs with self-snoop maintain cache coherency even in the presence of aliased memtypes, thus there is no need to trust the guest behaves and only honor PAT as a last resort, as KVM does today. Honoring guest PAT is desirable for use cases where the guest has access to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual (mediated, for all intents and purposes) GPU is exposed to the guest, along with buffers that are consumed directly by the physical GPU, i.e. which can't be proxied by the host to ensure writes from the guest are performed with the correct memory type for the GPU. Cc: Yiwei Zhang <[email protected]> Suggested-by: Yan Zhao <[email protected]> Suggested-by: Kevin Tian <[email protected]> Tested-by: Xiangfei Ma <[email protected]> Tested-by: Yongwei Ma <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-07 | KVM: x86: Ensure a full memory barrier is emitted in the VM-Exit path | Yan Zhao | 1 | -0/+6
Ensure a full memory barrier is emitted in the VM-Exit path, as a full barrier is required on Intel CPUs to evict WC buffers. This will allow unconditionally honoring guest PAT on Intel CPUs that support self-snoop. As srcu_read_lock() is always called in the VM-Exit path and it internally has a smp_mb(), call smp_mb__after_srcu_read_lock() to avoid adding a second fence and make sure smp_mb() is called without dependency on implementation details of srcu_read_lock(). Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Kevin Tian <[email protected]> Signed-off-by: Yan Zhao <[email protected]> [sean: massage changelog] Tested-by: Xiangfei Ma <[email protected]> Tested-by: Yongwei Ma <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
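A sketch of the resulting pattern in the exit path (conceptual): piggyback on the smp_mb() already implied by srcu_read_lock() rather than adding a second full barrier:

  vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
  /*
   * srcu_read_lock() already executed a full smp_mb(); this call is a no-op
   * that documents (and guarantees, independent of SRCU internals) that a
   * full barrier -- needed on Intel to evict WC buffers -- sits here.
   */
  smp_mb__after_srcu_read_lock();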
2024-06-07 | crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM | Eric Biggers | 5 | -4818/+1390
Rewrite the AES-NI implementations of AES-GCM, taking advantage of things I learned while writing the VAES-AVX10 implementations. This is a complete rewrite that reduces the AES-NI GCM source code size by about 70% and the binary code size by about 95%, while not regressing performance and in fact improving it significantly in many cases.

The following summarizes the state before this patch:

- The aesni-intel module registered algorithms "generic-gcm-aesni" and "rfc4106-gcm-aesni" with the crypto API that actually delegated to one of three underlying implementations according to the CPU capabilities detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.

- The AES-NI + AVX and AES-NI + AVX2 assembly code was in aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and 257 KB of binary. This massive binary size was not really appropriate, and depending on the kconfig it could take up over 1% of the size of the entire vmlinux. The main loops did 8 blocks per iteration. The AVX code minimized the use of carryless multiplication whereas the AVX2 code did not. The "AVX2" code did not actually use AVX2; the check for AVX2 was really a check for Intel Haswell or later to detect support for fast carryless multiplication. The long source length was caused by factors such as significant code duplication.

- The AES-NI only assembly code was in aesni-intel_asm.S and consisted of 1501 lines of source and 15 KB of binary. The main loops did 4 blocks per iteration and minimized the use of carryless multiplication by using Karatsuba multiplication and a multiplication-less reduction.

- The assembly code was contributed in 2010-2013. Maintenance has been sporadic and most design choices haven't been revisited.

- The assembly function prototypes and the corresponding glue code were separate from and were not consistent with the new VAES-AVX10 code I recently added. The older code had several issues such as not precomputing the GHASH key powers, which hurt performance.

This rewrite achieves the following goals:

- Much shorter source and binary sizes. The assembly source shrinks from 4300 lines to 1130 lines, and it produces about 9 KB of binary instead of 272 KB. This is achieved via a better designed AES-GCM implementation that doesn't excessively unroll the code and instead prioritizes the parts that really matter. Sharing the C glue code with the VAES-AVX10 implementations also saves 250 lines of C source.

- Improve performance on most (possibly all) CPUs on which this code runs, for most (possibly all) message lengths. Benchmark results are given in Tables 1 and 2 below.

- Use the same function prototypes and glue code as the new VAES-AVX10 algorithms. This fixes some issues with the integration of the assembly and results in some significant performance improvements, primarily on short messages. Also, the AVX and non-AVX implementations are now registered as separate algorithms with the crypto API, which makes them both testable by the self-tests.

- Keep support for AES-NI without AVX (for Westmere, Silvermont, Goldmont, and Tremont), but unify the source code with AES-NI + AVX. Since 256-bit vectors cannot be used without VAES anyway, this is made feasible by just using the non-VEX coded form of most instructions.

- Use a unified approach where the main loop does 8 blocks per iteration and uses Karatsuba multiplication to save one pclmulqdq per block but does not use the multiplication-less reduction. This strikes a good balance across the range of CPUs on which this code runs.

- Don't spam the kernel log with an informational message on every boot.

The following tables summarize the improvement in AES-GCM throughput on various CPU microarchitectures as a result of this patch:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  -------------------+-------+-------+-------+-------+-------+-------+
  Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
  Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
  Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
  AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
  AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
  AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |

                     |   300 |   200 |    64 |    63 |    16 |
  -------------------+-------+-------+-------+-------+-------+
  Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
  Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
  Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
  AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
  AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
  AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  -------------------+-------+-------+-------+-------+-------+-------+
  Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
  Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
  Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
  AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
  AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
  AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |

                     |   300 |   200 |    64 |    63 |    16 |
  -------------------+-------+-------+-------+-------+-------+
  Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
  Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
  Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
  AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
  AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
  AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |

The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be listed as 10%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths.

I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or Intel low-power CPUs, as these weren't readily available to me. However, based on the design of the new code and the available information about these other CPU microarchitectures, I wouldn't expect any significant regressions, and there's a good chance performance is improved just as it is above.

Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2024-06-07 | crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM | Eric Biggers | 4 | -16/+1759
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or AVX10. There are two implementations, sharing most source code: one using 256-bit vectors and one using 512-bit vectors. This patch improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.

I wrote the new AES-GCM assembly code from scratch, focusing on correctness, performance, code size (both source and binary), and documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is about 1200 lines including extensive comments, and it generates less than 8 KB of binary code. The main loop does 4 vectors at a time, with the AES and GHASH instructions interleaved. Any remainder is handled using a simple 1 vector at a time loop, with masking.

Several VAES + AVX512 implementations of AES-GCM exist from Intel, including one in OpenSSL and one proposed for inclusion in Linux in 2021 (https://lore.kernel.org/linux-crypto/[email protected]/). These aren't really suitable to be used, though, due to the massive amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux) as well as the significantly larger amount of assembly source (4978 lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not support 256-bit vectors, which makes it not usable on future AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have downclocking issues. So I ended up starting from scratch. Usually my much shorter code is actually slightly faster than Intel's AVX512 code, though it depends on message length and on which of Intel's implementations is used; for details, see Tables 3 and 4 below.

To facilitate potential integration into other projects, I've dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause, the same as the recently added RISC-V crypto code.

The following two tables summarize the performance improvement over the existing AES-GCM code in Linux that uses AES-NI and AVX2:

Table 1: AES-256-GCM encryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                        | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  ----------------------+-------+-------+-------+-------+-------+-------+
  Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
  Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
  Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
  AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |

                        |   300 |   200 |    64 |    63 |    16 |
  ----------------------+-------+-------+-------+-------+-------+
  Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
  Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
  Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
  AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |

Table 2: AES-256-GCM decryption throughput improvement,
         CPU microarchitecture vs. message length in bytes:

                        | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  ----------------------+-------+-------+-------+-------+-------+-------+
  Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
  Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
  Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
  AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |

                        |   300 |   200 |    64 |    63 |    16 |
  ----------------------+-------+-------+-------+-------+-------+
  Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
  Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
  Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
  AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |

The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be listed as 50%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths.

The following two tables summarize how the performance of my code compares with Intel's AVX512 AES-GCM code, both the version that is in OpenSSL and the version that was proposed for inclusion in Linux. Neither version exists in Linux currently, but these are alternative AES-GCM implementations that could be chosen instead of mine. I collected the following numbers on Emerald Rapids using a userspace benchmark program that calls the assembly functions directly. I've also included a comparison with Cloudflare's AES-GCM implementation from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.

Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
         implementation name vs. message length in bytes:

                       | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
  AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
  AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
  AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |

                       |   300 |   200 |    64 |    63 |    16 |
  ---------------------+-------+-------+-------+-------+-------+
  This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
  AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
  AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
  AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |

Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
         implementation name vs. message length in bytes:

                       | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
  ---------------------+-------+-------+-------+-------+-------+-------+
  This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
  AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
  AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
  AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |

                       |   300 |   200 |    64 |    63 |    16 |
  ---------------------+-------+-------+-------+-------+-------+
  This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
  AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
  AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
  AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |

So, usually my code is actually slightly faster than Intel's code, though the OpenSSL implementation has a slight edge on messages shorter than 256 bytes in this microbenchmark. (This also holds true when doing the same tests on AMD Zen 4.) It can be seen that the large code size (up to 94x larger!) of the Intel implementations doesn't seem to bring much benefit, so starting from scratch with much smaller code, as I've done, seems appropriate. The performance of my code on messages shorter than 256 bytes could be improved through a limited amount of unrolling, but it's unclear it would be worth it, given code size considerations (e.g. caches) that don't get measured in microbenchmarks.

Signed-off-by: Eric Biggers <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2024-06-07 | crypto: x86 - add missing MODULE_DESCRIPTION() macros | Jeff Johnson | 2 | -0/+2
On x86, make allmodconfig && make W=1 C=1 warns:

  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/crc32-pclmul.o
  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/curve25519-x86_64.o

Add the missing MODULE_DESCRIPTION() macro invocations. Signed-off-by: Jeff Johnson <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2024-06-06 | x86/mm/numa: Use NUMA_NO_NODE when calling memblock_set_node() | Jan Beulich | 1 | -3/+3
memblock_set_node() warns about using MAX_NUMNODES, see e0eec24e2e19 ("memblock: make memblock_set_node() also warn about use of MAX_NUMNODES") for details. Reported-by: Narasimhan V <[email protected]> Signed-off-by: Jan Beulich <[email protected]> Cc: [email protected] [bp: commit message] Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Tested-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (IBM) <[email protected]>
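The fix is essentially the choice of nid constant for ranges that have no node; a sketch (start/size are illustrative):

  /* Before: now triggers the memblock_set_node() warning */
  memblock_set_node(start, size, &memblock.memory, MAX_NUMNODES);

  /* After: explicitly mark the range as belonging to no node */
  memblock_set_node(start, size, &memblock.memory, NUMA_NO_NODE);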
2024-06-05 | x86/amd_nb: Check for invalid SMN reads | Yazen Ghannam | 1 | -1/+8
AMD Zen-based systems use a System Management Network (SMN) that provides access to implementation-specific registers. SMN accesses are done indirectly through an index/data pair in PCI config space. The PCI config access may fail and return an error code. This would prevent the "read" value from being updated. However, the PCI config access may succeed, but the return value may be invalid. This is similar to a bad PCI read, which returns all bits set. Most systems will return 0 for SMN addresses that are not accessible. This is in line with AMD convention that unavailable registers are Read-as-Zero/Writes-Ignored. However, some systems will return a "PCI Error Response" instead. This value, along with an error code of 0 from the PCI config access, will confuse callers of the amd_smn_read() function. Check for this condition, clear the return value, and set a proper error code. Fixes: ddfe43cdc0da ("x86/amd_nb: Add SMN and Indirect Data Fabric access for AMD Fam17h") Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
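A hedged sketch of the added check (helper name and error code are illustrative; the point is to reject the all-ones "PCI Error Response" pattern even when the config access itself reported success):

  err = smn_pci_read(node, address, &val);   /* hypothetical wrapper around the PCI config access */

  if (!err && val == 0xFFFFFFFF) {           /* "PCI Error Response" pattern */
          val = 0;                           /* don't hand a bogus value to the caller */
          err = -ENODEV;                     /* error code choice is illustrative */
  }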
2024-06-05 | KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1 | Sean Christopherson | 2 | -10/+0
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization"), and followed it by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not necessitated by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <[email protected]> Tested-by: Yongwei Ma <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>