aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/include
AgeCommit message (Collapse)AuthorFilesLines
2022-04-10Merge tag 'perf_urgent_for_v5.18_rc2' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - A couple of fixes to cgroup-related handling of perf events - A couple of fixes to event encoding on Sapphire Rapids - Pass event caps of inherited events so that perf doesn't fail wrongly at fork() - Add support for a new Raptor Lake CPU * tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Always set cpuctx cgrp when enable cgroup event perf/core: Fix perf_cgroup_switch() perf/core: Use perf_cgroup_info->active to check if cgroup is active perf/core: Don't pass task around when ctx sched in perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids perf/x86/intel: Don't extend the pseudo-encoding to GP counters perf/core: Inherit event_caps perf/x86/uncore: Add Raptor Lake uncore support perf/x86/msr: Add Raptor Lake CPU support perf/x86/cstate: Add Raptor Lake support perf/x86: Add Intel Raptor Lake support
2022-04-10x86/PCI: Add $IRT PIRQ routing table supportMaciej W. Rozycki1-0/+9
Handle the $IRT PCI IRQ Routing Table format used by AMI for its BCP (BIOS Configuration Program) external tool meant for tweaking BIOS structures without the need to rebuild it from sources[1]. The $IRT format has been invented by AMI before Microsoft has come up with its $PIR format and a $IRT table is therefore there in some systems that lack a $PIR table, such as the DataExpert EXP8449 mainboard based on the ALi FinALi 486 chipset (M1489/M1487), which predates DMI 2.0 and cannot therefore be easily identified at run time. Unlike with the $PIR format there is no alignment guarantee as to the placement of the $IRT table, so scan the whole BIOS area bytewise. Credit to Michal Necasek for helping me chase documentation for the format. References: [1] "What is BCP? - AMI", <https://www.ami.com/what-is-bcp/> Signed-off-by: Maciej W. Rozycki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Dmitry Osipenko <[email protected]> # crosvm Link: https://lore.kernel.org/r/[email protected]
2022-04-08x86/PCI: Clip only host bridge windows for E820 regionsBjorn Helgaas1-0/+5
ACPI firmware advertises PCI host bridge resources via PNP0A03 _CRS methods. Some BIOSes include non-window address space in _CRS, and if we allocate that non-window space for PCI devices, they don't work. 4dc2287c1805 ("x86: avoid E820 regions when allocating address space") works around this issue by clipping out any regions mentioned in the E820 table in the allocate_resource() path, but the implementation has a couple issues: - The clipping is done for *all* allocations, not just those for PCI address space, and - The clipping is done at each allocation instead of being done once when setting up the host bridge windows. Rework the implementation so we only clip PCI host bridge windows, and we do it once when setting them up. Example output changes: BIOS-e820: [mem 0x00000000b0000000-0x00000000c00fffff] reserved + acpi PNP0A08:00: clipped [mem 0xc0000000-0xfebfffff window] to [mem 0xc0100000-0xfebfffff window] for e820 entry [mem 0xb0000000-0xc00fffff] - pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window] + pci_bus 0000:00: root bus resource [mem 0xc0100000-0xfebfffff window] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Reviewed-by: Mika Westerberg <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]>
2022-04-07ACPICA: Avoid cache flush inside virtual machinesKirill A. Shutemov1-1/+13
While running inside virtual machine, the kernel can bypass cache flushing. Changing sleep state in a virtual machine doesn't affect the host system sleep state and cannot lead to data loss. Before entering sleep states, the ACPI code flushes caches to prevent data loss using the WBINVD instruction. This mechanism is required on bare metal. But, any use WBINVD inside of a guest is worthless. Changing sleep state in a virtual machine doesn't affect the host system sleep state and cannot lead to data loss, so most hypervisors simply ignore it. Despite this, the ACPI code calls WBINVD unconditionally anyway. It's useless, but also normally harmless. In TDX guests, though, WBINVD stops being harmless; it triggers a virtualization exception (#VE). If the ACPI cache-flushing WBINVD were left in place, TDX guests would need handling to recover from the exception. Avoid using WBINVD whenever running under a hypervisor. This both removes the useless WBINVDs and saves TDX from implementing WBINVD handling. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/mm: Make DMA memory shared for TD guestKirill A. Shutemov1-3/+3
Intel TDX doesn't allow VMM to directly access guest private memory. Any memory that is required for communication with the VMM must be shared explicitly. The same rule applies for any DMA to and from the TDX guest. All DMA pages have to be marked as shared pages. A generic way to achieve this without any changes to device drivers is to use the SWIOTLB framework. The previous patch ("Add support for TDX shared memory") gave TDX guests the _ability_ to make some pages shared, but did not make any pages shared. This actually marks SWIOTLB buffers *as* shared. Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in TDX guests. This has several implications: - Allows the existing mem_encrypt_init() to be used for TDX which sets SWIOTLB buffers shared (aka. "decrypted"). - Ensures that all DMA is routed via the SWIOTLB mechanism (see pci_swiotlb_detect()) Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set indirectly by selecting X86_MEM_ENCRYPT. mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to a generic area of the header. Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/acpi/x86/boot: Add multiprocessor wake-up supportKuppuswamy Sathyanarayanan1-0/+5
Secondary CPU startup is currently performed with something called the "INIT/SIPI protocol". This protocol requires assistance from VMMs to boot guests. As should be a familiar story by now, that support can not be provded to TDX guests because TDX VMMs are not trusted by guests. To remedy this situation a new[1] "Multiprocessor Wakeup Structure" has been added to to an existing ACPI table (MADT). This structure provides the physical address of a "mailbox". A write to the mailbox then steers the secondary CPU to the boot code. Add ACPI MADT wake structure parsing support and wake support. Use this support to wake CPUs whenever it is present instead of INIT/SIPI. While this structure can theoretically be used on 32-bit kernels, there are no 32-bit TDX guest kernels. It has not been tested and can not practically *be* tested on 32-bit. Make it 64-bit only. 1. Details about the new structure can be found in ACPI v6.4, in the "Multiprocessor Wakeup Structure" section. Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Rafael J. Wysocki <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/boot: Add a trampoline for booting APs via firmware handoffSean Christopherson2-0/+3
Historically, x86 platforms have booted secondary processors (APs) using INIT followed by the start up IPI (SIPI) messages. In regular VMs, this boot sequence is supported by the VMM emulation. But such a wakeup model is fatal for secure VMs like TDX in which VMM is an untrusted entity. To address this issue, a new wakeup model was added in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot the APs. More details about this wakeup model can be found in ACPI specification v6.4, the section titled "Multiprocessor Wakeup Structure". Since the existing trampoline code requires processors to boot in real mode with 16-bit addressing, it will not work for this wakeup model (because it boots the AP in 64-bit mode). To handle it, extend the trampoline code to support 64-bit mode firmware handoff. Also, extend IDT and GDT pointers to support 64-bit mode hand off. There is no TDX-specific detection for this new boot method. The kernel will rely on it as the sole boot method whenever the new ACPI structure is present. The ACPI table parser for the MADT multiprocessor wake up structure and the wakeup method that uses this structure will be added by the following patch in this series. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Wire up KVM hypercallsKuppuswamy Sathyanarayanan2-0/+33
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI is similar, those instructions no longer function for TDX guests. Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX guests to run with KVM acting as the hypervisor. Among other things, KVM hypercall is used to send IPIs. Since the KVM driver can be built as a kernel module, export tdx_kvm_hypercall() to make the symbols visible to kvm.ko. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Port I/O: Add early boot supportAndi Kleen1-0/+4
TDX guests cannot do port I/O directly. The TDX module triggers a #VE exception to let the guest kernel emulate port I/O by converting them into TDCALLs to call the host. But before IDT handlers are set up, port I/O cannot be emulated using normal kernel #VE handlers. To support the #VE-based emulation during this boot window, add a minimal early #VE handler support in early exception handlers. This is similar to what AMD SEV does. This is mainly to support earlyprintk's serial driver, as well as potentially the VGA driver. The early handler only supports I/O-related #VE exceptions. Unhandled or failed exceptions will be handled via early_fixup_exceptions() (like normal exception failures). At runtime I/O-related #VE exceptions (along with other types) handled by virt_exception_kernel(). Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/boot: Port I/O: Add decompression-time support for TDXKirill A. Shutemov2-27/+32
Port I/O instructions trigger #VE in the TDX environment. In response to the exception, kernel emulates these instructions using hypercalls. But during early boot, on the decompression stage, it is cumbersome to deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE handling. Hook up TDX-specific port I/O helpers if booting in TDX environment. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86: Consolidate port I/O helpersKirill A. Shutemov2-20/+36
There are two implementations of port I/O helpers: one in the kernel and one in the boot stub. Move the helpers required for both to <asm/shared/io.h> and use the one implementation everywhere. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86: Adjust types used in port I/O helpersKirill A. Shutemov1-13/+13
Change port I/O helpers to use u8/u16/u32 instead of unsigned char/short/int for values. Use u16 instead of int for port number. It aligns the helpers with implementation in boot stub in preparation for consolidation. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Detect TDX at early kernel decompression timeKuppuswamy Sathyanarayanan2-3/+9
The early decompression code does port I/O for its console output. But, handling the decompression-time port I/O demands a different approach from normal runtime because the IDT required to support #VE based port I/O emulation is not yet set up. Paravirtualizing I/O calls during the decompression step is acceptable because the decompression code doesn't have a lot of call sites to IO instruction. To support port I/O in decompression code, TDX must be detected before the decompression code might do port I/O. Detect whether the kernel runs in a TDX guest. Add an early_is_tdx_guest() interface to query the cached TDX guest status in the decompression code. TDX is detected with CPUID. Make cpuid_count() accessible outside boot/cpuflags.c. TDX detection in the main kernel is very similar. Move common bits into <asm/shared/tdx.h>. The actual port I/O paravirtualization will come later in the series. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Add HLT support for TDX guestsKirill A. Shutemov1-0/+4
The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/traps: Add #VE support for TDX guestKirill A. Shutemov2-0/+25
Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel: * Specific instructions (WBINVD, for example) * Specific MSR accesses * Specific CPUID leaf accesses * Access to specific guest physical addresses Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack. For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE. Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs. During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops. Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called. For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users. Co-developed-by: Sean Christopherson <[email protected]> Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functionsKuppuswamy Sathyanarayanan1-0/+30
Guests communicate with VMMs with hypercalls. Historically, these are implemented using instructions that are known to cause VMEXITs like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer expose the guest state to the host. This prevents the old hypercall mechanisms from working. So, to communicate with VMM, TDX specification defines a new instruction called TDCALL. In a TDX based VM, since the VMM is an untrusted entity, an intermediary layer -- TDX module -- facilitates secure communication between the host and the guest. TDX module is loaded like a firmware into a special CPU mode called SEAM. TDX guests communicate with the TDX module using the TDCALL instruction. A guest uses TDCALL to communicate with both the TDX module and VMM. The value of the RAX register when executing the TDCALL instruction is used to determine the TDCALL type. A leaf of TDCALL used to communicate with the VMM is called TDVMCALL. Add generic interfaces to communicate with the TDX module and VMM (using the TDCALL instruction). __tdx_module_call() - Used to communicate with the TDX module (via TDCALL instruction). __tdx_hypercall() - Used by the guest to request services from the VMM (via TDVMCALL leaf of TDCALL). Also define an additional wrapper _tdx_hypercall(), which adds error handling support for the TDCALL failure. The __tdx_module_call() and __tdx_hypercall() helper functions are implemented in assembly in a .S file. The TDCALL ABI requires shuffling arguments in and out of registers, which proved to be awkward with inline assembly. Just like syscalls, not all TDVMCALL use cases need to use the same number of argument registers. The implementation here picks the current worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer than 4 arguments, there will end up being a few superfluous (cheap) instructions. But, this approach maximizes code reuse. For registers used by the TDCALL instruction, please check TDX GHCI specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL Interface". Based on previous patch by Sean Christopherson. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Provide common base for SEAMCALL and TDCALL C wrappersKirill A. Shutemov1-0/+29
Secure Arbitration Mode (SEAM) is an extension of VMX architecture. It defines a new VMX root operation (SEAM VMX root) and a new VMX non-root operation (SEAM VMX non-root) which are both isolated from the legacy VMX operation where the host kernel runs. A CPU-attested software module (called 'TDX module') runs in SEAM VMX root to manage and protect VMs running in SEAM VMX non-root. SEAM VMX root is also used to host another CPU-attested software module (called 'P-SEAMLDR') to load and update the TDX module. Host kernel transits to either P-SEAMLDR or TDX module via the new SEAMCALL instruction, which is essentially a VMExit from VMX root mode to SEAM VMX root mode. SEAMCALLs are leaf functions defined by P-SEAMLDR and TDX module around the new SEAMCALL instruction. A guest kernel can also communicate with TDX module via TDCALL instruction. TDCALLs and SEAMCALLs use an ABI different from the x86-64 system-v ABI. RAX is used to carry both the SEAMCALL leaf function number (input) and the completion status (output). Additional GPRs (RCX, RDX, R8-R11) may be further used as both input and output operands in individual leaf. TDCALL and SEAMCALL share the same ABI and require the largely same code to pass down arguments and retrieve results. Define an assembly macro that can be used to implement C wrapper for both TDCALL and SEAMCALL. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/tdx: Detect running as a TDX guest in early bootKuppuswamy Sathyanarayanan3-1/+29
In preparation of extending cc_platform_has() API to support TDX guest, use CPUID instruction to detect support for TDX guests in the early boot code (via tdx_early_init()). Since copy_bootdata() is the first user of cc_platform_has() API, detect the TDX guest status before it. Define a synthetic feature flag (X86_FEATURE_TDX_GUEST) and set this bit in a valid TDX guest platform. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07x86/sev: Register SEV-SNP guest request platform deviceBrijesh Singh1-0/+4
Version 2 of the GHCB specification provides a Non Automatic Exit (NAE) event type that can be used by the SEV-SNP guest to communicate with the PSP without risk from a malicious hypervisor who wishes to read, alter, drop or replay the messages sent. SNP_LAUNCH_UPDATE can insert two special pages into the guest’s memory: the secrets page and the CPUID page. The PSP firmware populates the contents of the secrets page. The secrets page contains encryption keys used by the guest to interact with the firmware. Because the secrets page is encrypted with the guest’s memory encryption key, the hypervisor cannot read the keys. See SEV-SNP firmware spec for further details on the secrets page format. Create a platform device that the SEV-SNP guest driver can bind to get the platform resources such as encryption key and message id to use to communicate with the PSP. The SEV-SNP guest driver provides a userspace interface to get the attestation report, key derivation, extended attestation report etc. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/sev: Provide support for SNP guest request NAEsBrijesh Singh3-0/+21
Version 2 of GHCB specification provides SNP_GUEST_REQUEST and SNP_EXT_GUEST_REQUEST NAE that can be used by the SNP guest to communicate with the PSP. While at it, add a snp_issue_guest_request() helper that will be used by driver or other subsystem to issue the request to PSP. See SEV-SNP firmware and GHCB spec for more details. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/sev: Add SEV-SNP feature detection/setupMichael Roth1-0/+2
Initial/preliminary detection of SEV-SNP is done via the Confidential Computing blob. Check for it prior to the normal SEV/SME feature initialization, and add some sanity checks to confirm it agrees with SEV-SNP CPUID/MSR bits. Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/compressed: Add SEV-SNP feature detection/setupMichael Roth1-0/+3
Initial/preliminary detection of SEV-SNP is done via the Confidential Computing blob. Check for it prior to the normal SEV/SME feature initialization, and add some sanity checks to confirm it agrees with SEV-SNP CPUID/MSR bits. Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/boot: Add a pointer to Confidential Computing blob in bootparamsMichael Roth2-1/+3
The previously defined Confidential Computing blob is provided to the kernel via a setup_data structure or EFI config table entry. Currently, these are both checked for by boot/compressed kernel to access the CPUID table address within it for use with SEV-SNP CPUID enforcement. To also enable that enforcement for the run-time kernel, similar access to the CPUID table is needed early on while it's still using the identity-mapped page table set up by boot/compressed, where global pointers need to be accessed via fixup_pointer(). This isn't much of an issue for accessing setup_data, and the EFI config table helper code currently used in boot/compressed *could* be used in this case as well since they both rely on identity-mapping. However, it has some reliance on EFI helpers/string constants that would need to be accessed via fixup_pointer(), and fixing it up while making it shareable between boot/compressed and run-time kernel is fragile and introduces a good bit of ugliness. Instead, add a boot_params->cc_blob_address pointer that the boot/compressed kernel can initialize so that the run-time kernel can access the CC blob from there instead of re-scanning the EFI config table. Also document these in Documentation/x86/zero-page.rst. While there, add missing documentation for the acpi_rsdp_addr field, which serves a similar purpose in providing the run-time kernel a pointer to the ACPI RSDP table so that it does not need to [re-]scan the EFI configuration table. [ bp: Fix typos, massage commit message. ] Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/compressed/64: Add support for SEV-SNP CPUID table in #VC handlersMichael Roth1-0/+2
CPUID instructions generate a #VC exception for SEV-ES/SEV-SNP guests, for which early handlers are currently set up to handle. In the case of SEV-SNP, guests can use a configurable location in guest memory that has been pre-populated with a firmware-validated CPUID table to look up the relevant CPUID values rather than requesting them from hypervisor via a VMGEXIT. Add the various hooks in the #VC handlers to allow CPUID instructions to be handled via the table. The code to actually configure/enable the table will be added in a subsequent commit. Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07KVM: x86: Move lookup of indexed CPUID leafs to helperMichael Roth1-0/+34
Determining which CPUID leafs have significant ECX/index values is also needed by guest kernel code when doing SEV-SNP-validated CPUID lookups. Move this to common code to keep future updates in sync. Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/boot: Add Confidential Computing type to setup_dataBrijesh Singh2-0/+19
While launching encrypted guests, the hypervisor may need to provide some additional information during the guest boot. When booting under an EFI-based BIOS, the EFI configuration table contains an entry for the confidential computing blob that contains the required information. To support booting encrypted guests on non-EFI VMs, the hypervisor needs to pass this additional information to the guest kernel using a different method. For this purpose, introduce SETUP_CC_BLOB type in setup_data to hold the physical address of the confidential computing blob location. The boot loader or hypervisor may choose to use this method instead of an EFI configuration table. The CC blob location scanning should give preference to a setup_data blob over an EFI configuration table. In AMD SEV-SNP, the CC blob contains the address of the secrets and CPUID pages. The secrets page includes information such as a VM to PSP communication key and the CPUID page contains PSP-filtered CPUID values. Define the AMD SEV confidential computing blob structure. While at it, define the EFI GUID for the confidential computing blob. [ bp: Massage commit message, mark struct __packed. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/msi: Fix msi message data shadow structReto Buerki1-8/+11
The x86 MSI message data is 32 bits in total and is either in compatibility or remappable format, see Intel Virtualization Technology for Directed I/O, section 5.1.2. Fixes: 6285aa50736 ("x86/msi: Provide msi message shadow structs") Co-developed-by: Adrian-Ken Rueegsegger <[email protected]> Signed-off-by: Adrian-Ken Rueegsegger <[email protected]> Signed-off-by: Reto Buerki <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2022-04-07x86/extable: Prefer local labels in .set directivesNick Desaulniers1-10/+10
Bernardo reported an error that Nathan bisected down to (x86_64) defconfig+LTO_CLANG_FULL+X86_PMEM_LEGACY. LTO vmlinux.o ld.lld: error: <instantiation>:1:13: redefinition of 'found' .set found, 0 ^ <inline asm>:29:1: while in macro instantiation extable_type_reg reg=%eax, type=(17 | ((0) << 16)) ^ This appears to be another LTO specific issue similar to what was folded into commit 4b5305decc84 ("x86/extable: Extend extable functionality"), where the `.set found, 0` in DEFINE_EXTABLE_TYPE_REG in arch/x86/include/asm/asm.h conflicts with the symbol for the static function `found` in arch/x86/kernel/pmem.c. Assembler .set directive declare symbols with global visibility, so the assembler may not rename such symbols in the event of a conflict. LTO could rename static functions if there was a conflict in C sources, but it cannot see into symbols defined in inline asm. The symbols are also retained in the symbol table, regardless of LTO. Give the symbols .L prefixes making them locally visible, so that they may be renamed for LTO to avoid conflicts, and to drop them from the symbol table regardless of LTO. Fixes: 4b5305decc84 ("x86/extable: Extend extable functionality") Reported-by: Bernardo Meurer Costa <[email protected]> Debugged-by: Nathan Chancellor <[email protected]> Signed-off-by: Nick Desaulniers <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/head/64: Re-enable stack protectionMichael Roth1-1/+0
Due to 103a4908ad4d ("x86/head/64: Disable stack protection for head$(BITS).o") kernel/head{32,64}.c are compiled with -fno-stack-protector to allow a call to set_bringup_idt_handler(), which would otherwise have stack protection enabled with CONFIG_STACKPROTECTOR_STRONG. While sufficient for that case, there may still be issues with calls to any external functions that were compiled with stack protection enabled that in-turn make stack-protected calls, or if the exception handlers set up by set_bringup_idt_handler() make calls to stack-protected functions. Subsequent patches for SEV-SNP CPUID validation support will introduce both such cases. Attempting to disable stack protection for everything in scope to address that is prohibitive since much of the code, like the SEV-ES #VC handler, is shared code that remains in use after boot and could benefit from having stack protection enabled. Attempting to inline calls is brittle and can quickly balloon out to library/helper code where that's not really an option. Instead, re-enable stack protection for head32.c/head64.c, and make the appropriate changes to ensure the segment used for the stack canary is initialized in advance of any stack-protected C calls. For head64.c: - The BSP will enter from startup_64() and call into C code (startup_64_setup_env()) shortly after setting up the stack, which may result in calls to stack-protected code. Set up %gs early to allow for this safely. - APs will enter from secondary_startup_64*(), and %gs will be set up soon after. There is one call to C code prior to %gs being setup (__startup_secondary_64()), but it is only to fetch 'sme_me_mask' global, so just load 'sme_me_mask' directly instead, and remove the now-unused __startup_secondary_64() function. For head32.c: - BSPs/APs will set %fs to __BOOT_DS prior to any C calls. In recent kernels, the compiler is configured to access the stack canary at %fs:__stack_chk_guard [1], which overlaps with the initial per-cpu '__stack_chk_guard' variable in the initial/"master" .data..percpu area. This is sufficient to allow access to the canary for use during initial startup, so no changes are needed there. [1] 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable") [ bp: Massage commit message. ] Suggested-by: Joerg Roedel <[email protected]> #for 64-bit %gs set up Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Use SEV-SNP AP creation to start secondary CPUsTom Lendacky3-0/+10
To provide a more secure way to start APs under SEV-SNP, use the SEV-SNP AP Creation NAE event. This allows for guest control over the AP register state rather than trusting the hypervisor with the SEV-ES Jump Table address. During native_smp_prepare_cpus(), invoke an SEV-SNP function that, if SEV-SNP is active, will set/override apic->wakeup_secondary_cpu. This will allow the SEV-SNP AP Creation NAE event method to be used to boot the APs. As a result of installing the override when SEV-SNP is active, this method of starting the APs becomes the required method. The override function will fail to start the AP if the hypervisor does not have support for AP creation. [ bp: Work in forgotten review comments. ] Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/mm: Validate memory when changing the C-bitBrijesh Singh3-0/+28
Add the needed functionality to change pages state from shared to private and vice-versa using the Page State Change VMGEXIT as documented in the GHCB spec. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Add helper for validating pages in early enc attribute changesBrijesh Singh1-0/+10
early_set_memory_{encrypted,decrypted}() are used for changing the page state from decrypted (shared) to encrypted (private) and vice versa. When SEV-SNP is active, the page state transition needs to go through additional steps. If the page is transitioned from shared to private, then perform the following after the encryption attribute is set in the page table: 1. Issue the page state change VMGEXIT to add the page as a private in the RMP table. 2. Validate the page after its successfully added in the RMP table. To maintain the security guarantees, if the page is transitioned from private to shared, then perform the following before clearing the encryption attribute from the page table. 1. Invalidate the page. 2. Issue the page state change VMGEXIT to make the page shared in the RMP table. early_set_memory_{encrypted,decrypted}() can be called before the GHCB is setup so use the SNP page state MSR protocol VMGEXIT defined in the GHCB specification to request the page state change in the RMP table. While at it, add a helper snp_prep_memory() which will be used in probe_roms(), in a later patch. [ bp: Massage commit message. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Register GHCB memory when SEV-SNP is activeBrijesh Singh1-0/+2
The SEV-SNP guest is required by the GHCB spec to register the GHCB's Guest Physical Address (GPA). This is because the hypervisor may prefer that a guest uses a consistent and/or specific GPA for the GHCB associated with a vCPU. For more information, see the GHCB specification section "GHCB GPA Registration". [ bp: Cleanup comments. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/compressed: Register GHCB memory when SEV-SNP is activeBrijesh Singh1-0/+13
The SEV-SNP guest is required by the GHCB spec to register the GHCB's Guest Physical Address (GPA). This is because the hypervisor may prefer that a guest use a consistent and/or specific GPA for the GHCB associated with a vCPU. For more information, see the GHCB specification section "GHCB GPA Registration". If hypervisor can not work with the guest provided GPA then terminate the guest boot. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/compressed: Add helper for validating pages in the decompression stageBrijesh Singh1-0/+26
Many of the integrity guarantees of SEV-SNP are enforced through the Reverse Map Table (RMP). Each RMP entry contains the GPA at which a particular page of DRAM should be mapped. The VMs can request the hypervisor to add pages in the RMP table via the Page State Change VMGEXIT defined in the GHCB specification. Inside each RMP entry is a Validated flag; this flag is automatically cleared to 0 by the CPU hardware when a new RMP entry is created for a guest. Each VM page can be either validated or invalidated, as indicated by the Validated flag in the RMP entry. Memory access to a private page that is not validated generates a #VC. A VM must use the PVALIDATE instruction to validate a private page before using it. To maintain the security guarantee of SEV-SNP guests, when transitioning pages from private to shared, the guest must invalidate the pages before asking the hypervisor to change the page state to shared in the RMP table. After the pages are mapped private in the page table, the guest must issue a page state change VMGEXIT to mark the pages private in the RMP table and validate them. Upon boot, BIOS should have validated the entire system memory. During the kernel decompression stage, early_setup_ghcb() uses set_page_decrypted() to make the GHCB page shared (i.e. clear encryption attribute). And while exiting from the decompression, it calls set_page_encrypted() to make the page private. Add snp_set_page_{private,shared}() helpers that are used by set_page_{decrypted,encrypted}() to change the page state in the RMP table. [ bp: Massage commit message and comments. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Check the VMPL levelBrijesh Singh2-0/+17
The Virtual Machine Privilege Level (VMPL) feature in the SEV-SNP architecture allows a guest VM to divide its address space into four levels. The level can be used to provide hardware isolated abstraction layers within a VM. VMPL0 is the highest privilege level, and VMPL3 is the least privilege level. Certain operations must be done by the VMPL0 software, such as: * Validate or invalidate memory range (PVALIDATE instruction) * Allocate VMSA page (RMPADJUST instruction when VMSA=1) The initial SNP support requires that the guest kernel is running at VMPL0. Add such a check to verify the guest is running at level 0 before continuing the boot. There is no easy method to query the current VMPL level, so use the RMPADJUST instruction to determine whether the guest is running at the VMPL0. [ bp: Massage commit message. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Add a helper for the PVALIDATE instructionBrijesh Singh1-0/+21
An SNP-active guest uses the PVALIDATE instruction to validate or rescind the validation of a guest page’s RMP entry. Upon completion, a return code is stored in EAX and rFLAGS bits are set based on the return code. If the instruction completed successfully, the carry flag (CF) indicates if the content of the RMP were changed or not. See AMD APM Volume 3 for additional details. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Check SEV-SNP features supportBrijesh Singh3-1/+9
Version 2 of the GHCB specification added the advertisement of features that are supported by the hypervisor. If the hypervisor supports SEV-SNP then it must set the SEV-SNP features bit to indicate that the base functionality is supported. Check that feature bit while establishing the GHCB; if failed, terminate the guest. Version 2 of the GHCB specification adds several new Non-Automatic Exits (NAEs), most of them are optional except the hypervisor feature. Now that the hypervisor feature NAE is implemented, bump the GHCB maximum supported protocol version. While at it, move the GHCB protocol negotiation check from the #VC exception handler to sev_enable() so that all feature detection happens before the first #VC exception. While at it, document why the GHCB page cannot be setup from load_stage2_idt(). [ bp: Massage commit message. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Save the negotiated GHCB versionBrijesh Singh1-1/+1
The SEV-ES guest calls sev_es_negotiate_protocol() to negotiate the GHCB protocol version before establishing the GHCB. Cache the negotiated GHCB version so that it can be used later. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/sev: Define the Linux-specific guest termination reasonsBrijesh Singh1-0/+8
The GHCB specification defines the reason code for reason set 0. The reason codes defined in the set 0 do not cover all possible causes for a guest to request termination. The reason sets 1 to 255 are reserved for the vendor-specific codes. Reserve the reason set 1 for the Linux guest. Define the error codes for reason set 1 so that one can have meaningful termination reasons and thus better guest failure diagnosis. While at it, change sev_es_terminate() to accept a reason set parameter. [ bp: Massage commit message. ] Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/mm: Extend cc_attr to include AMD SEV-SNPBrijesh Singh1-0/+2
The CC_ATTR_GUEST_SEV_SNP can be used by the guest to query whether the SNP (Secure Nested Paging) feature is active. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06x86/boot: Introduce helpers for MSR reads/writesMichael Roth2-10/+16
The current set of helpers used throughout the run-time kernel have dependencies on code/facilities outside of the boot kernel, so there are a number of call-sites throughout the boot kernel where inline assembly is used instead. More will be added with subsequent patches that add support for SEV-SNP, so take the opportunity to provide a basic set of helpers that can be used by the boot kernel to reduce reliance on inline assembly. Use boot_* prefix so that it's clear these are helpers specific to the boot kernel to avoid any confusion with the various other MSR read/write helpers. [ bp: Disambiguate parameter names and trim comment. ] Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06KVM: SVM: Update the SEV-ES save area mappingTom Lendacky1-16/+50
This is the final step in defining the multiple save areas to keep them separate and ensuring proper operation amongst the different types of guests. Update the SEV-ES/SEV-SNP save area to match the APM. This save area will be used for the upcoming SEV-SNP AP Creation NAE event support. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06KVM: SVM: Create a separate mapping for the GHCB save areaTom Lendacky1-3/+45
The initial implementation of the GHCB spec was based on trying to keep the register state offsets the same relative to the VM save area. However, the save area for SEV-ES has changed within the hardware causing the relation between the SEV-ES save area to change relative to the GHCB save area. This is the second step in defining the multiple save areas to keep them separate and ensuring proper operation amongst the different types of guests. Create a GHCB save area that matches the GHCB specification. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-06KVM: SVM: Create a separate mapping for the SEV-ES save areaTom Lendacky1-20/+67
The save area for SEV-ES/SEV-SNP guests, as used by the hardware, is different from the save area of a non SEV-ES/SEV-SNP guest. This is the first step in defining the multiple save areas to keep them separate and ensuring proper operation amongst the different types of guests. Create an SEV-ES/SEV-SNP save area and adjust usage to the new save area definition where needed. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-05x86/amd_nb: Unexport amd_cache_northbridges()Muralidhara M K1-1/+0
amd_cache_northbridges() is exported by amd_nb.c and is called by amd64-agp.c and amd64_edac.c modules at module_init() time so that NB descriptors are properly cached before those drivers can use them. However, the init_amd_nbs() initcall already does call amd_cache_northbridges() unconditionally and thus makes sure the NB descriptors are enumerated. That initcall is a fs_initcall type which is on the 5th group (starting from 0) of initcalls that gets run in increasing numerical order by the init code. The module_init() call is turned into an __initcall() in the MODULE=n case and those are device-level initcalls, i.e., group 6. Therefore, the northbridges caching is already finished by the time module initialization starts and thus the correct initialization order is retained. Unexport amd_cache_northbridges(), update dependent modules to call amd_nb_num() instead. While at it, simplify the checks in amd_cache_northbridges(). [ bp: Heavily massage and *actually* explain why the change is ok. ] Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Naveen Krishna Chatradhi <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-05KVM: SVM: Define sev_features and VMPL field in the VMSABrijesh Singh1-2/+4
The hypervisor uses the sev_features field (offset 3B0h) in the Save State Area to control the SEV-SNP guest features such as SNPActive, vTOM, ReflectVC etc. An SEV-SNP guest can read the sev_features field through the SEV_STATUS MSR. While at it, update dump_vmcb() to log the VMPL level. See APM2 Table 15-34 and B-4 for more details. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Venu Busireddy <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-05KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loadedSean Christopherson1-2/+3
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: [email protected] Reported-by: Bruno Goncalves <[email protected]> Reported-by: Jan Stancek <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-05x86/bug: Prevent shadowing in __WARN_FLAGSVincent Mailhol1-2/+2
The macro __WARN_FLAGS() uses a local variable named "f". This being a common name, there is a risk of shadowing other variables. For example, GCC would yield: | In file included from ./include/linux/bug.h:5, | from ./include/linux/cpumask.h:14, | from ./arch/x86/include/asm/cpumask.h:5, | from ./arch/x86/include/asm/msr.h:11, | from ./arch/x86/include/asm/processor.h:22, | from ./arch/x86/include/asm/timex.h:5, | from ./include/linux/timex.h:65, | from ./include/linux/time32.h:13, | from ./include/linux/time.h:60, | from ./include/linux/stat.h:19, | from ./include/linux/module.h:13, | from virt/lib/irqbypass.mod.c:1: | ./include/linux/rcupdate.h: In function 'rcu_head_after_call_rcu': | ./arch/x86/include/asm/bug.h:80:21: warning: declaration of 'f' shadows a parameter [-Wshadow] | 80 | __auto_type f = BUGFLAG_WARNING|(flags); \ | | ^ | ./include/asm-generic/bug.h:106:17: note: in expansion of macro '__WARN_FLAGS' | 106 | __WARN_FLAGS(BUGFLAG_ONCE | \ | | ^~~~~~~~~~~~ | ./include/linux/rcupdate.h:1007:9: note: in expansion of macro 'WARN_ON_ONCE' | 1007 | WARN_ON_ONCE(func != (rcu_callback_t)~0L); | | ^~~~~~~~~~~~ | In file included from ./include/linux/rbtree.h:24, | from ./include/linux/mm_types.h:11, | from ./include/linux/buildid.h:5, | from ./include/linux/module.h:14, | from virt/lib/irqbypass.mod.c:1: | ./include/linux/rcupdate.h:1001:62: note: shadowed declaration is here | 1001 | rcu_head_after_call_rcu(struct rcu_head *rhp, rcu_callback_t f) | | ~~~~~~~~~~~~~~~^ For reference, sparse also warns about it, c.f. [1]. This patch renames the variable from f to __flags (with two underscore prefixes as suggested in the Linux kernel coding style [2]) in order to prevent collisions. [1] https://lore.kernel.org/all/CAFGhKbyifH1a+nAMCvWM88TK6fpNPdzFtUXPmRGnnQeePV+1sw@mail.gmail.com/ [2] Linux kernel coding style, section 12) Macros, Enums and RTL, paragraph 5) namespace collisions when defining local variables in macros resembling functions https://www.kernel.org/doc/html/latest/process/coding-style.html#macros-enums-and-rtl Fixes: bfb1a7c91fb7 ("x86/bug: Merge annotate_reachable() into_BUG_FLAGS() asm") Signed-off-by: Vincent Mailhol <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-05perf/x86/amd: Add idle hooks for branch samplingStephane Eranian1-0/+23
On AMD Fam19h Zen3, the branch sampling (BRS) feature must be disabled before entering low power and re-enabled (if was active) when returning from low power. Otherwise, the NMI interrupt may be held up for too long and cause problems. Stopping BRS will cause the NMI to be delivered if it was held up. Define a perf_amd_brs_lopwr_cb() callback to stop/restart BRS. The callback is protected by a jump label which is enabled only when AMD BRS is detected. In all other cases, the callback is never called. Signed-off-by: Stephane Eranian <[email protected]> [peterz: static_call() and build fixes] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]