path: root/arch/x86/include/asm
Age | Commit message | Author | Files | Lines
2022-04-25 | x86/fpu/xsave: Support XSAVEC in the kernel | Thomas Gleixner | 1 | -1/+1
XSAVEC is the user space counterpart of XSAVES which cannot save supervisor state. In virtualization scenarios the hypervisor does not expose XSAVES but XSAVEC to the guest, though the kernel does not make use of it.

That's unfortunate because XSAVEC uses the compacted format of saving the XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as it does not create holes for XSTATE components which are not supported or enabled by the kernel but are available in hardware. There is room for further optimizations when XSAVEC/S and XGETBV1 are supported.

In order to support XSAVEC:

 - Define the XSAVEC ASM macro as it's not yet supported by the required minimal toolchain.

 - Create a software defined X86_FEATURE_XCOMPACTED to select the compacted XSTATE buffer format for both XSAVEC and XSAVES.

 - Make XSAVEC an option in the 'XSAVE' ASM alternatives

Requested-by: Andrew Cooper <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
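The format selection can be pictured with a small sketch. This is a hedged illustration, not the actual fpu/xstate code: it only shows how a software-defined X86_FEATURE_XCOMPACTED flag could be forced whenever either compacting instruction is available; the helper name is invented.

  #include <asm/cpufeature.h>

  /* Illustrative only: force the software-defined "compacted format" flag
   * when either XSAVES or XSAVEC is available, so the rest of the XSTATE
   * code only has to test a single feature bit. */
  static void __init example_setup_xstate_format(void)
  {
  	if (boot_cpu_has(X86_FEATURE_XSAVES) || boot_cpu_has(X86_FEATURE_XSAVEC))
  		setup_force_cpu_cap(X86_FEATURE_XCOMPACTED);
  }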
2022-04-22 | objtool: Make jump label hack optional | Josh Poimboeuf | 1 | -3/+3
Objtool secretly does a jump label hack to overcome the limitations of the toolchain. Make the hack explicit (and optional for other arches) by turning it into a cmdline option and kernel config option. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Miroslav Benes <[email protected]> Link: https://lkml.kernel.org/r/3bdcbfdd27ecb01ddec13c04bdf756a583b13d24.1650300597.git.jpoimboe@redhat.com
2022-04-22 | objtool: Add CONFIG_OBJTOOL | Josh Poimboeuf | 1 | -3/+3
Now that stack validation is an optional feature of objtool, add CONFIG_OBJTOOL and replace most usages of CONFIG_STACK_VALIDATION with it. CONFIG_STACK_VALIDATION can now be considered to be frame-pointer specific. CONFIG_UNWINDER_ORC is already inherently valid for live patching, so no need to "validate" it. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Miroslav Benes <[email protected]> Link: https://lkml.kernel.org/r/939bf3d85604b2a126412bf11af6e3bd3b872bcb.1650300597.git.jpoimboe@redhat.com
2022-04-21 | KVM: SEV: add cache flush to solve SEV cache incoherency issues | Mingwei Zhang | 2 | -0/+2
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned.

Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines is released by the VM and reallocated to another user, the stale cachelines may corrupt the new user at a later time.

KVM takes a shortcut by assuming all confidential memory remains pinned until the end of the VM's lifetime, and therefore does not flush caches at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicious userspace can crash the host kernel by creating a malicious VM that continuously allocates and releases unpinned confidential memory pages while the VM is running.

Add cache flush operations to the mmu_notifier operations to ensure that any physical memory leaving the guest VM gets flushed. In particular, hook the mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush caches accordingly. The flush is done after releasing the mmu lock to avoid contention with other vCPUs.

Cc: [email protected]
Suggested-by: Sean Christopherson <[email protected]>
Reported-by: Mingwei Zhang <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-21 | virt: sevguest: Change driver name to reflect generic SEV support | Tom Lendacky | 1 | -1/+1
During patch review, it was decided the SNP guest driver name should not be SEV-SNP specific, but should be generic for use with anything SEV. However, this feedback was missed and the driver name, and many of the driver functions and structures, are SEV-SNP name specific. Rename the driver to "sev-guest" (to match the misc device that is created) and update some of the function and structure names, too. While in the file, adjust the one pr_err() message to be a dev_err() message so that the message, if issued, uses the driver name. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Link: https://lore.kernel.org/r/307710bb5515c9088a19fd0b930268c7300479b2.1650464054.git.thomas.lendacky@amd.com
2022-04-19 | x86/static_call: Add ANNOTATE_NOENDBR to static call trampoline | Josh Poimboeuf | 1 | -0/+1
The static call trampoline is never indirect-branched to, but is referenced by the static call key. Add ANNOTATE_NOENDBR. Fixes: ed53a0d97192 ("x86/alternative: Use .ibt_endbr_seal to seal indirect calls") Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/1b5b54aad7d81241dabe5e0c9b40dea64b540b00.1650300597.git.jpoimboe@redhat.com
2022-04-19 | x86/cpu: Load microcode during restore_processor_state() | Borislav Petkov | 1 | -0/+2
When resuming from system sleep state, restore_processor_state() restores the boot CPU MSRs. These MSRs could be emulated by microcode. If the microcode is not loaded yet, writing to emulated MSRs leads to an unchecked MSR access error:

  ...
  PM: Calling lapic_suspend+0x0/0x210
  unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0...0) at rIP: ... (native_write_msr)
  Call Trace:
   <TASK>
   ? restore_processor_state
   x86_acpi_suspend_lowlevel
   acpi_suspend_enter
   suspend_devices_and_enter
   pm_suspend.cold
   state_store
   kobj_attr_store
   sysfs_kf_write
   kernfs_fop_write_iter
   new_sync_write
   vfs_write
   ksys_write
   __x64_sys_write
   do_syscall_64
   entry_SYSCALL_64_after_hwframe
  RIP: 0033:0x7fda13c260a7

To ensure microcode emulated MSRs are available for restoration, load the microcode on the boot CPU before restoring these MSRs.

[ Pawan: write commit message and productize it. ]

Fixes: e2a1256b17b1 ("x86/speculation: Restore speculation related MSRs during S3 resume")
Reported-by: Kyle D. Pelton <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
Tested-by: Kyle D. Pelton <[email protected]>
Cc: [email protected]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215841
Link: https://lore.kernel.org/r/4350dfbf785cd482d3fafa72b2b49c83102df3ce.1650386317.git.pawan.kumar.gupta@linux.intel.com
2022-04-19 | Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux | Rafael J. Wysocki | 1 | -0/+1
Pull turbostat changes for 5.19 from Len Brown:

 "Chen Yu (1):
        tools/power turbostat: Support thermal throttle count print

  Dan Merillat (1):
        tools/power turbostat: fix dump for AMD cpus

  Len Brown (5):
        tools/power turbostat: tweak --show and --hide capability
        tools/power turbostat: fix ICX DRAM power numbers
        tools/power turbostat: be more useful as non-root
        tools/power turbostat: No build warnings with -Wextra
        tools/power turbostat: version 2022.04.16

  Sumeet Pawnikar (2):
        tools/power turbostat: Add Power Limit4 support
        tools/power turbostat: print power values upto three decimal

  Zephaniah E. Loss-Cutler-Hull (2):
        tools/power turbostat: Allow -e for all names.
        tools/power turbostat: Allow printing header every N iterations"

* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
  tools/power turbostat: version 2022.04.16
  tools/power turbostat: No build warnings with -Wextra
  tools/power turbostat: be more useful as non-root
  tools/power turbostat: fix ICX DRAM power numbers
  tools/power turbostat: Support thermal throttle count print
  tools/power turbostat: Allow printing header every N iterations
  tools/power turbostat: Allow -e for all names.
  tools/power turbostat: print power values upto three decimal
  tools/power turbostat: Add Power Limit4 support
  tools/power turbostat: fix dump for AMD cpus
  tools/power turbostat: tweak --show and --hide capability
2022-04-19 | x86/cpu: Add new Alderlake and Raptorlake CPU model numbers | Tony Luck | 1 | -0/+3
Intel is subdividing the mobile segment with additional models with the same codename. Using the Intel "N" and "P" suffixes for these will be less confusing than trying to map to some different naming convention.

Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2022-04-18 | x86: remove cruft from <asm/dma-mapping.h> | Christoph Hellwig | 2 | -11/+2
<asm/dma-mapping.h> gets pulled in by all drivers using the DMA API. Remove x86 internal variables and unnecessary includes from it. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-18 | swiotlb: merge swiotlb-xen initialization into swiotlb | Christoph Hellwig | 1 | -5/+0
Reuse the generic swiotlb initialization for xen-swiotlb. For ARM/ARM64 this works trivially, while for x86 xen_swiotlb_fixup needs to be passed as the remap argument to swiotlb_init_remap/swiotlb_init_late. Note that the lower bound of the swiotlb size is changed to the smaller IO_TLB_MIN_SLABS based value with this patch, but that is fine as the 2MB value used in Xen before was just an optimization and is not the hard lower bound. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Stefano Stabellini <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-18 | x86: remove the IOMMU table infrastructure | Christoph Hellwig | 6 | -138/+8
The IOMMU table tries to separate the different IOMMUs into different backends, but actually requires various cross calls. Rewrite the code to do the generic swiotlb/swiotlb-xen setup directly in pci-dma.c and then just call into the IOMMU drivers. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Boris Ostrovsky <[email protected]>
2022-04-17 | Merge tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull x86 fixes from Thomas Gleixner:
 "Two x86 fixes related to TSX:

   - Use either MSR_TSX_FORCE_ABORT or MSR_IA32_TSX_CTRL to disable TSX to cover all CPUs which allow to disable it.

   - Disable TSX development mode at boot so that a microcode update which provides TSX development mode does not suddenly make the system vulnerable to TSX Asynchronous Abort"

* tag 'x86-urgent-2022-04-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/tsx: Disable TSX development mode at boot
  x86/tsx: Use MSR_TSX_CTRL to clear CPUID bits
2022-04-16 | tools/power turbostat: Add Power Limit4 support | Sumeet Pawnikar | 1 | -0/+1
Add Power Limit4 support. Signed-off-by: Sumeet Pawnikar <[email protected]> Acked-by: Zhang Rui <[email protected]> Signed-off-by: Len Brown <[email protected]>
2022-04-15 | mm/vmalloc: fix spinning drain_vmap_work after reading from /proc/vmcore | Omar Sandoval | 1 | -2/+0
Commit 3ee48b6af49c ("mm, x86: Saving vmcore with non-lazy freeing of vmas") introduced set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1, ensuring that any future vunmaps() immediately purge the vmap areas instead of doing it lazily.

Commit 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context") moved the purging from the vunmap() caller to a worker thread. Unfortunately, set_iounmap_nonlazy() can cause the worker thread to spin (possibly forever). For example, consider the following scenario:

 1. Thread reads from /proc/vmcore. This eventually calls __copy_oldmem_page() -> set_iounmap_nonlazy(), which sets vmap_lazy_nr to lazy_max_pages() + 1.

 2. Then it calls free_vmap_area_noflush() (via iounmap()), which adds 2 pages (one page plus the guard page) to the purge list and vmap_lazy_nr. vmap_lazy_nr is now lazy_max_pages() + 3, so the drain_vmap_work is scheduled.

 3. Thread returns from the kernel and is scheduled out.

 4. Worker thread is scheduled in and calls drain_vmap_area_work(). It frees the 2 pages on the purge list. vmap_lazy_nr is now lazy_max_pages() + 1.

 5. This is still over the threshold, so it tries to purge areas again, but doesn't find anything.

 6. Repeat 5.

If the system is running with only one CPU (which is typical for kdump) and preemption is disabled, then this will never make forward progress: there aren't any more pages to purge, so it hangs. If there is more than one CPU or preemption is enabled, then the worker thread will spin forever in the background. (Note that if there were already pages to be purged at the time that set_iounmap_nonlazy() was called, this bug is avoided.)

This can be reproduced with anything that reads from /proc/vmcore multiple times. E.g., vmcore-dmesg /proc/vmcore.

It turns out that improvements to vmap() over the years have obsoleted the need for this "optimization". I benchmarked `dd if=/proc/vmcore of=/dev/null` with 4k and 1M read sizes on a system with a 32GB vmcore. The test was run on 5.17, 5.18-rc1 with a fix that avoided the hang, and 5.18-rc1 with set_iounmap_nonlazy() removed entirely:

      |5.17  |5.18+fix|5.18+removal
    4k|40.86s|  40.09s|      26.73s
    1M|24.47s|  23.98s|      21.84s

The removal was the fastest (by a wide margin with 4k reads). This patch removes set_iounmap_nonlazy().

Link: https://lkml.kernel.org/r/52f819991051f9b865e9ce25605509bfdbacadcd.1649277321.git.osandov@fb.com
Fixes: 690467c81b1a ("mm/vmalloc: Move draining areas out of caller context")
Signed-off-by: Omar Sandoval <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Baoquan He <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2022-04-14 | x86/asm: Merge load_gs_index() | Brian Gerst | 2 | -10/+4
Merge the 32- and 64-bit implementations of load_gs_index(). Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-14 | x86/32: Remove lazy GS macros | Brian Gerst | 2 | -6/+1
GS is always a user segment now. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-13 | mm/usercopy: Check kmap addresses properly | Matthew Wilcox (Oracle) | 1 | -0/+1
If you are copying to an address in the kmap region, you may not copy across a page boundary, no matter what the size of the underlying allocation. You can't kmap() a slab page because slab pages always come from low memory. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-13 | x86/uaccess: Implement macros for CMPXCHG on user addresses | Peter Zijlstra | 1 | -0/+142
Add support for CMPXCHG loops on userspace addresses. Provide both an "unsafe" version for tight loops that do their own uaccess begin/end, as well as a "safe" version for use cases where the CMPXCHG is not buried in a loop, e.g. KVM will resume the guest instead of looping when emulation of a guest atomic access fails the CMPXCHG.

Provide 8-byte versions for 32-bit kernels so that KVM can do CMPXCHG on guest PAE PTEs, which are accessed via userspace addresses.

Guard the asm_volatile_goto() variation with CC_HAS_ASM_GOTO_TIED_OUTPUT, the "+m" constraint fails on some compilers that otherwise support CC_HAS_ASM_GOTO_OUTPUT.

Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
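As a rough picture of the "unsafe" variant described above, the following is a hedged, kernel-context sketch rather than the patch itself: the helper name set_bit_in_user_word() is invented for the example, while user_access_begin()/unsafe_get_user() are existing uaccess primitives.

  #include <linux/uaccess.h>
  #include <linux/bits.h>

  /* Hedged sketch: a tight CMPXCHG loop on a user address inside an explicit
   * uaccess region, retrying until the compare-and-exchange succeeds. */
  static int set_bit_in_user_word(u32 __user *uaddr, unsigned int bit)
  {
  	u32 old, new;

  	if (!user_access_begin(uaddr, sizeof(*uaddr)))
  		return -EFAULT;

  	unsafe_get_user(old, uaddr, fault);
  	do {
  		new = old | BIT(bit);
  		/* on failure 'old' is refreshed with the current user value */
  	} while (!unsafe_try_cmpxchg_user(uaddr, &old, new, fault));

  	user_access_end();
  	return 0;

  fault:
  	user_access_end();
  	return -EFAULT;
  }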
2022-04-13 | KVM: x86: Use static calls to reduce kvm_pmu_ops overhead | Like Xu | 1 | -0/+31
Use static calls to improve kvm_pmu_ops performance, following the same pattern and naming scheme used by kvm-x86-ops.h.

Here are the worst fenced_rdtsc() cycle numbers for the kvm_pmu_ops functions that are most often called (up to 7 digits of calls) when running a single perf test case in a guest on an ICX 2.70GHz host (mitigations=on):

                  |  legacy | static call
  ------------------------------------------------------------
  .pmc_idx_to_pmc | 1304840 |  994872 (+23%)
  .pmc_is_enabled |  978670 | 1011750 (-3%)
  .msr_idx_to_pmc |   47828 |   41690 (+12%)
  .is_valid_msr   |   28786 |   30108 (-4%)

Signed-off-by: Like Xu <[email protected]>
[sean: Handle static call updates in pmu.c, tweak changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
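The general static_call pattern the changelog leans on can be sketched as below. This is a hedged, simplified illustration with made-up names (pmu_pmc_is_enabled, struct kvm_pmu_ops_example), not the actual kvm-x86-pmu-ops machinery.

  #include <linux/static_call.h>

  struct kvm_pmc;

  struct kvm_pmu_ops_example {
  	bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
  };

  /* Declare a patchable call site, initially pointing nowhere. */
  DEFINE_STATIC_CALL_NULL(pmu_pmc_is_enabled,
  			*(((struct kvm_pmu_ops_example *)NULL)->pmc_is_enabled));

  /* Done once at setup time, e.g. when the vendor module registers its ops. */
  static void pmu_ops_update(const struct kvm_pmu_ops_example *ops)
  {
  	static_call_update(pmu_pmc_is_enabled, ops->pmc_is_enabled);
  }

  static bool pmc_enabled(struct kvm_pmc *pmc)
  {
  	/* Patched into a direct call: no retpoline/indirect-branch overhead. */
  	return static_call(pmu_pmc_is_enabled)(pmc);
  }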
2022-04-13 | KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdata | Like Xu | 1 | -2/+1
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata. That'll save those precious few bytes, and more importantly make the original ops unreachable, i.e. make it harder to sneak in post-init modification bugs. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13 | KVM: x86: Move kvm_ops_static_call_update() to x86.c | Like Xu | 1 | -14/+0
The kvm_ops_static_call_update() is defined in kvm_host.h. That's completely unnecessary, it should have exactly one caller, kvm_arch_hardware_setup(). Move the helper to x86.c and have it do the actual memcpy() of the ops in addition to the static call updates. This will also allow for cleanly giving kvm_pmu_ops static_call treatment. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Like Xu <[email protected]> [sean: Move memcpy() into the helper and rename accordingly] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13 | KVM: x86/mmu: Derive EPT violation RWX bits from EPTE RWX bits | Sean Christopherson | 1 | -6/+2
Derive the mask of RWX bits reported on EPT violations from the mask of RWX bits that are shoved into EPT entries; the layout is the same, the EPT violation bits are simply shifted by three. Use the new shift and a slight copy-paste of the mask derivation instead of completely open coding the same to convert between the EPT entry bits and the exit qualification when synthesizing a nested EPT Violation. No functional change intended. Cc: SU Hang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
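The relationship the changelog describes can be written down in a few lines. This is a hedged sketch with illustrative macro names (the kernel's actual definitions live in the VMX headers): the EPTE permission bits 0-2 map to exit-qualification bits 3-5, i.e. a shift by three.

  #include <linux/types.h>

  #define EXAMPLE_EPT_RWX_MASK		0x7ULL	/* EPTE bits 0-2: R, W, X permissions */
  #define EXAMPLE_EPT_VIOLATION_RWX_SHIFT	3	/* exit qualification bits 3-5        */

  /* Synthesize the "permissions of the faulting GPA" part of an EPT-violation
   * exit qualification from the relevant EPT entry's RWX bits. */
  static inline u64 epte_rwx_to_exit_qual(u64 epte)
  {
  	return (epte & EXAMPLE_EPT_RWX_MASK) << EXAMPLE_EPT_VIOLATION_RWX_SHIFT;
  }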
2022-04-13 | KVM: VMX: replace 0x180 with EPT_VIOLATION_* definition | SU Hang | 1 | -0/+2
Use the self-explanatory macro definitions EPT_VIOLATION_GVA_VALIDATION and EPT_VIOLATION_GVA_TRANSLATED instead of the magic constant 0x180 in FNAME(walk_addr_generic)().

Signed-off-by: SU Hang <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13 | kvm: x86: Adjust the location of pkru_mask of kvm_mmu to reduce memory | Peng Hao | 1 | -8/+9
Move the pkru_mask field to just after direct_map so the structure packs to 8-byte alignment. This reduces the size of struct kvm_mmu by 8 bytes.

Signed-off-by: Peng Hao <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
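The effect of that kind of reordering can be shown with a toy example. This is illustrative only, with an invented subset of fields; the real struct kvm_mmu has many more members.

  #include <linux/types.h>

  /* Before: the lone 4-byte field sits between 8-byte members, so the
   * compiler pads both it and the later bool out to full 8-byte slots. */
  struct mmu_layout_before {
  	u64	root_hpa;	/* 8 bytes                        */
  	u32	pkru_mask;	/* 4 bytes + 4 bytes of padding   */
  	u64	pdptrs[4];	/* 32 bytes                       */
  	bool	direct_map;	/* 1 byte + 7 bytes of padding    */
  };				/* sizeof == 56 */

  /* After: the small fields share one 8-byte slot, saving 8 bytes. */
  struct mmu_layout_after {
  	u64	root_hpa;
  	u64	pdptrs[4];
  	bool	direct_map;	/* 1 byte + 3 bytes of padding    */
  	u32	pkru_mask;	/* fills the rest of the same slot */
  };				/* sizeof == 48 */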
2022-04-13 | Merge branch 'kvm-older-features' into HEAD | Paolo Bonzini | 2 | -9/+26
Merge branch for features that did not make it into 5.18:

 * New ioctls to get/set TSC frequency for a whole VM
 * Allow userspace to opt out of hypercall patching

Nested virtualization improvements for AMD:

 * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF)
 * Allow AVIC to co-exist with a nested guest running
 * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support
 * PAUSE filtering for nested hypervisors

Guest support:

 * Decoupling of vcpu_is_preempted from PV spinlocks

Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13 | x86/apic: Clarify i82489DX bit overlap in APIC_LVT0 | Thomas Gleixner | 1 | -6/+0
Daniel stumbled over the bit overlap of the i82489DX external APIC and the TSC deadline timer configuration bit in modern APICs, which is neither documented in the code nor in the current SDM. Maciej provided links to the original i82489DX/486 documentation. See Link.

Remove the i82489DX macro maze, use an i82489DX specific define in the apic code and document the overlap in a comment.

Reported-by: Daniel Vacek <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Maciej W. Rozycki <[email protected]>
Link: https://lore.kernel.org/r/87ee22f3ci.ffs@tglx
2022-04-12 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 | -5/+5
Pull kvm fixes from Paolo Bonzini:

 "x86:
   - Miscellaneous bugfixes
   - A small cleanup for the new workqueue code
   - Documentation syntax fix

  RISC-V:
   - Remove hgatp zeroing in kvm_arch_vcpu_put()
   - Fix alignment of the guest_hang() in KVM selftest
   - Fix PTE A and D bits in KVM selftest
   - Missing #include in vcpu_fp.c

  ARM:
   - Some PSCI fixes after introducing PSCIv1.1 and SYSTEM_RESET2
   - Fix the MMU write-lock not being taken on THP split
   - Fix mixed-width VM handling
   - Fix potential UAF when debugfs registration fails
   - Various selftest updates for all of the above"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (24 commits)
  KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU
  KVM: SVM: Do not activate AVIC for SEV-enabled guest
  Documentation: KVM: Add SPDX-License-Identifier tag
  selftests: kvm: add tsc_scaling_sync to .gitignore
  RISC-V: KVM: include missing hwcap.h into vcpu_fp
  KVM: selftests: riscv: Fix alignment of the guest_hang() function
  KVM: selftests: riscv: Set PTE A and D bits in VS-stage page table
  RISC-V: KVM: Don't clear hgatp CSR in kvm_arch_vcpu_put()
  selftests: KVM: Free the GIC FD when cleaning up in arch_timer
  selftests: KVM: Don't leak GIC FD across dirty log test iterations
  KVM: Don't create VM debugfs files outside of the VM directory
  KVM: selftests: get-reg-list: Add KVM_REG_ARM_FW_REG(3)
  KVM: avoid NULL pointer dereference in kvm_dirty_ring_push
  KVM: arm64: selftests: Introduce vcpu_width_config
  KVM: arm64: mixed-width check should be skipped for uninitialized vCPUs
  KVM: arm64: vgic: Remove unnecessary type castings
  KVM: arm64: Don't split hugepages outside of MMU write lock
  KVM: arm64: Drop unneeded minor version check from PSCI v1.x handler
  KVM: arm64: Actually prevent SMC64 SYSTEM_RESET2 from AArch32
  KVM: arm64: Generally disallow SMC64 for AArch32 guests
  ...
2022-04-12 | stat: fix inconsistency between struct stat and struct compat_stat | Mikulas Patocka | 1 | -4/+2
struct stat (defined in arch/x86/include/uapi/asm/stat.h) has 32-bit st_dev and st_rdev; struct compat_stat (defined in arch/x86/include/asm/compat.h) has 16-bit st_dev and st_rdev followed by a 16-bit padding.

This patch fixes struct compat_stat to match struct stat.

[ Historical note: the old x86 'struct stat' did have that 16-bit field that the compat layer had kept around, but it was changed back in 2003 by "struct stat - support larger dev_t":

    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=e95b2065677fe32512a597a79db94b77b90c968d

  and back in those days, the x86_64 port was still new, and separate from the i386 code, and had already picked up the old version with a 16-bit st_dev field ]

Note that we can't change compat_dev_t because it is used by compat_loop_info.

Also, if the st_dev and st_rdev values are 32-bit, we don't have to use old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW on filesystems that are on NVMe because NVMe uses the major number 259.

Signed-off-by: Mikulas Patocka <[email protected]>
Cc: Andreas Schwab <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
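For orientation, the mismatch can be sketched as below. This is a hedged, heavily trimmed illustration of the layouts the changelog talks about, not the full structure definitions.

  #include <linux/types.h>

  /* Old compat layout: 16-bit device numbers plus explicit padding, which
   * cannot represent modern dev_t values such as NVMe's major number 259. */
  struct compat_stat_old_sketch {
  	u16 st_dev;
  	u16 __pad1;
  	/* ... other fields unchanged ... */
  	u16 st_rdev;
  	u16 __pad2;
  	/* ... */
  };

  /* Fixed layout: 32-bit st_dev/st_rdev, matching the native struct stat. */
  struct compat_stat_new_sketch {
  	u32 st_dev;
  	/* ... other fields unchanged ... */
  	u32 st_rdev;
  	/* ... */
  };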
2022-04-12 | x86/32: Simplify ELF_CORE_COPY_REGS | Brian Gerst | 1 | -13/+2
GS is now always a user segment, so there is no difference between user and kernel registers. Signed-off-by: Brian Gerst <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-04-11 | KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU | Vitaly Kuznetsov | 1 | -3/+1
The following WARN is triggered from kvm_vm_ioctl_set_clock():

  WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
  ...
  CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G W O 5.16.0.stable #20
  Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
  RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
  ...
  Call Trace:
   <TASK>
   ? kvm_write_guest+0x114/0x120 [kvm]
   kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
   kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
   ? schedule+0x4e/0xc0
   ? __cond_resched+0x1a/0x50
   ? futex_wait+0x166/0x250
   ? __send_signal+0x1f1/0x3d0
   kvm_vm_ioctl+0x747/0xda0 [kvm]
   ...

The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty() is called without an active vCPU") but the change seems to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's no real need to actually write to guest memory to invalidate the TSC page, this can be done by the first vCPU which goes through kvm_guest_time_update().

Reported-by: Maxim Levitsky <[email protected]>
Reported-by: Naresh Kamboju <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <[email protected]>
2022-04-11 | KVM: SVM: Do not activate AVIC for SEV-enabled guest | Suravee Suthikulpanit | 1 | -0/+1
Since current AVIC implementation cannot support encrypted memory, inhibit AVIC for SEV-enabled guest. Signed-off-by: Suravee Suthikulpanit <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-11 | x86/tsx: Disable TSX development mode at boot | Pawan Gupta | 1 | -2/+2
A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline, and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA).

To be on the safe side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later.

[*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557

[ bp: Drop unstable web link, massage heavily. ]

Suggested-by: Andrew Cooper <[email protected]>
Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Tested-by: Neelima Krishnan <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com
2022-04-10 | Merge tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 3 | -20/+23
Pull x86 fixes from Borislav Petkov:

 - Fix the MSI message data struct definition

 - Use local labels in the exception table macros to avoid symbol conflicts with clang LTO builds

 - A couple of fixes to objtool checking of the relatively newly added SLS and IBT code

 - Rename a local var in the WARN* macro machinery to prevent shadowing

* tag 'x86_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msi: Fix msi message data shadow struct
  x86/extable: Prefer local labels in .set directives
  x86,bpf: Avoid IBT objtool warning
  objtool: Fix SLS validation for kcov tail-call replacement
  objtool: Fix IBT tail-call detection
  x86/bug: Prevent shadowing in __WARN_FLAGS
  x86/mm/tlb: Revert retpoline avoidance approach
2022-04-10 | Merge tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+5
Pull perf fixes from Borislav Petkov:

 - A couple of fixes to cgroup-related handling of perf events

 - A couple of fixes to event encoding on Sapphire Rapids

 - Pass event caps of inherited events so that perf doesn't fail wrongly at fork()

 - Add support for a new Raptor Lake CPU

* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Always set cpuctx cgrp when enable cgroup event
  perf/core: Fix perf_cgroup_switch()
  perf/core: Use perf_cgroup_info->active to check if cgroup is active
  perf/core: Don't pass task around when ctx sched in
  perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
  perf/x86/intel: Don't extend the pseudo-encoding to GP counters
  perf/core: Inherit event_caps
  perf/x86/uncore: Add Raptor Lake uncore support
  perf/x86/msr: Add Raptor Lake CPU support
  perf/x86/cstate: Add Raptor Lake support
  perf/x86: Add Intel Raptor Lake support
2022-04-10 | x86/PCI: Add $IRT PIRQ routing table support | Maciej W. Rozycki | 1 | -0/+9
Handle the $IRT PCI IRQ Routing Table format used by AMI for its BCP (BIOS Configuration Program) external tool meant for tweaking BIOS structures without the need to rebuild it from sources[1].

The $IRT format was invented by AMI before Microsoft had come up with its $PIR format, and a $IRT table is therefore present in some systems that lack a $PIR table, such as the DataExpert EXP8449 mainboard based on the ALi FinALi 486 chipset (M1489/M1487), which predates DMI 2.0 and cannot therefore be easily identified at run time.

Unlike with the $PIR format there is no alignment guarantee as to the placement of the $IRT table, so scan the whole BIOS area bytewise.

Credit to Michal Necasek for helping me chase documentation for the format.

References:
[1] "What is BCP? - AMI", <https://www.ami.com/what-is-bcp/>

Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Dmitry Osipenko <[email protected]> # crosvm
Link: https://lore.kernel.org/r/[email protected]
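A byte-granular signature scan of the mapped BIOS area, as described above, might look roughly like the sketch below. This is a hedged illustration: the helper name and parameters are invented, and the real parser in arch/x86/pci validates checksums and entry counts as well.

  #include <linux/string.h>
  #include <linux/types.h>

  /* Illustrative only: walk a mapped copy of the BIOS area one byte at a
   * time looking for the "$IRT" signature, since (unlike $PIR) no 16-byte
   * alignment of the table can be assumed. */
  static const u8 *find_irt_signature(const u8 *bios, size_t len)
  {
  	const u8 *p;

  	for (p = bios; p + 4 <= bios + len; p++)
  		if (!memcmp(p, "$IRT", 4))
  			return p;

  	return NULL;
  }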
2022-04-08 | x86/PCI: Clip only host bridge windows for E820 regions | Bjorn Helgaas | 1 | -0/+5
ACPI firmware advertises PCI host bridge resources via PNP0A03 _CRS methods. Some BIOSes include non-window address space in _CRS, and if we allocate that non-window space for PCI devices, they don't work.

4dc2287c1805 ("x86: avoid E820 regions when allocating address space") works around this issue by clipping out any regions mentioned in the E820 table in the allocate_resource() path, but the implementation has a couple issues:

  - The clipping is done for *all* allocations, not just those for PCI address space, and

  - The clipping is done at each allocation instead of being done once when setting up the host bridge windows.

Rework the implementation so we only clip PCI host bridge windows, and we do it once when setting them up. Example output changes:

    BIOS-e820: [mem 0x00000000b0000000-0x00000000c00fffff] reserved
  + acpi PNP0A08:00: clipped [mem 0xc0000000-0xfebfffff window] to [mem 0xc0100000-0xfebfffff window] for e820 entry [mem 0xb0000000-0xc00fffff]
  - pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
  + pci_bus 0000:00: root bus resource [mem 0xc0100000-0xfebfffff window]

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Hans de Goede <[email protected]>
Reviewed-by: Mika Westerberg <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
2022-04-07 | ACPICA: Avoid cache flush inside virtual machines | Kirill A. Shutemov | 1 | -1/+13
While running inside a virtual machine, the kernel can bypass cache flushing. Changing sleep state in a virtual machine doesn't affect the host system sleep state and cannot lead to data loss.

Before entering sleep states, the ACPI code flushes caches to prevent data loss using the WBINVD instruction. This mechanism is required on bare metal. But any use of WBINVD inside of a guest is worthless. Changing sleep state in a virtual machine doesn't affect the host system sleep state and cannot lead to data loss, so most hypervisors simply ignore it. Despite this, the ACPI code calls WBINVD unconditionally anyway. It's useless, but also normally harmless.

In TDX guests, though, WBINVD stops being harmless; it triggers a virtualization exception (#VE). If the ACPI cache-flushing WBINVD were left in place, TDX guests would need handling to recover from the exception.

Avoid using WBINVD whenever running under a hypervisor. This both removes the useless WBINVDs and saves TDX from implementing WBINVD handling.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
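The core of the change amounts to gating the flush on the hypervisor feature bit. A minimal, hedged sketch follows; the real change is inside the x86 ACPI cache-flush macro, and the helper name here is invented.

  #include <asm/cpufeature.h>
  #include <asm/special_insns.h>

  /* Illustrative only: skip the cache flush when a hypervisor is present,
   * since the flush does not affect host sleep state and WBINVD would
   * trigger a #VE in TDX guests. */
  static inline void example_acpi_flush_cpu_cache(void)
  {
  	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
  		wbinvd();
  }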
2022-04-07 | x86/mm: Make DMA memory shared for TD guest | Kirill A. Shutemov | 1 | -3/+3
Intel TDX doesn't allow VMM to directly access guest private memory. Any memory that is required for communication with the VMM must be shared explicitly. The same rule applies for any DMA to and from the TDX guest. All DMA pages have to be marked as shared pages. A generic way to achieve this without any changes to device drivers is to use the SWIOTLB framework.

The previous patch ("Add support for TDX shared memory") gave TDX guests the _ability_ to make some pages shared, but did not make any pages shared. This actually marks SWIOTLB buffers *as* shared.

Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in TDX guests. This has several implications:

 - Allows the existing mem_encrypt_init() to be used for TDX which sets SWIOTLB buffers shared (aka. "decrypted").

 - Ensures that all DMA is routed via the SWIOTLB mechanism (see pci_swiotlb_detect())

Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set indirectly by selecting X86_MEM_ENCRYPT.

mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to a generic area of the header.

Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/acpi/x86/boot: Add multiprocessor wake-up support | Kuppuswamy Sathyanarayanan | 1 | -0/+5
Secondary CPU startup is currently performed with something called the "INIT/SIPI protocol". This protocol requires assistance from VMMs to boot guests. As should be a familiar story by now, that support can not be provided to TDX guests because TDX VMMs are not trusted by guests.

To remedy this situation a new[1] "Multiprocessor Wakeup Structure" has been added to an existing ACPI table (MADT). This structure provides the physical address of a "mailbox". A write to the mailbox then steers the secondary CPU to the boot code.

Add ACPI MADT wake structure parsing support and wake support. Use this support to wake CPUs whenever it is present instead of INIT/SIPI.

While this structure can theoretically be used on 32-bit kernels, there are no 32-bit TDX guest kernels. It has not been tested and can not practically *be* tested on 32-bit. Make it 64-bit only.

1. Details about the new structure can be found in ACPI v6.4, in the "Multiprocessor Wakeup Structure" section.

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Rafael J. Wysocki <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/boot: Add a trampoline for booting APs via firmware handoff | Sean Christopherson | 2 | -0/+3
Historically, x86 platforms have booted secondary processors (APs) using INIT followed by the start up IPI (SIPI) messages. In regular VMs, this boot sequence is supported by the VMM emulation. But such a wakeup model is fatal for secure VMs like TDX in which VMM is an untrusted entity. To address this issue, a new wakeup model was added in ACPI v6.4, in which firmware (like TDX virtual BIOS) will help boot the APs. More details about this wakeup model can be found in ACPI specification v6.4, the section titled "Multiprocessor Wakeup Structure". Since the existing trampoline code requires processors to boot in real mode with 16-bit addressing, it will not work for this wakeup model (because it boots the AP in 64-bit mode). To handle it, extend the trampoline code to support 64-bit mode firmware handoff. Also, extend IDT and GDT pointers to support 64-bit mode hand off. There is no TDX-specific detection for this new boot method. The kernel will rely on it as the sole boot method whenever the new ACPI structure is present. The ACPI table parser for the MADT multiprocessor wake up structure and the wakeup method that uses this structure will be added by the following patch in this series. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/tdx: Wire up KVM hypercalls | Kuppuswamy Sathyanarayanan | 2 | -0/+33
KVM hypercalls use the VMCALL or VMMCALL instructions. Although the ABI is similar, those instructions no longer function for TDX guests. Make vendor-specific TDVMCALLs instead of VMCALL. This enables TDX guests to run with KVM acting as the hypervisor. Among other things, KVM hypercall is used to send IPIs. Since the KVM driver can be built as a kernel module, export tdx_kvm_hypercall() to make the symbols visible to kvm.ko. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
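Conceptually, the wiring described above looks like the hedged sketch below: the KVM hypercall number and arguments are forwarded through a vendor-specific TDVMCALL leaf. The internal names and the constant value (example_tdvmcall(), EXAMPLE_TDX_HYPERCALL_VENDOR_KVM) are placeholders, not the kernel's actual identifiers.

  #include <linux/export.h>
  #include <linux/types.h>

  #define EXAMPLE_TDX_HYPERCALL_VENDOR_KVM	1ULL	/* illustrative value only */

  /* Placeholder for the assembly TDVMCALL helper added in this series. */
  long example_tdvmcall(u64 vendor, u64 nr, u64 p1, u64 p2, u64 p3, u64 p4);

  /* Hedged sketch: route a KVM hypercall through a vendor-specific TDVMCALL
   * instead of VMCALL/VMMCALL, and export it so kvm.ko can use it. */
  long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
  		       unsigned long p3, unsigned long p4)
  {
  	return example_tdvmcall(EXAMPLE_TDX_HYPERCALL_VENDOR_KVM, nr, p1, p2, p3, p4);
  }
  EXPORT_SYMBOL_GPL(tdx_kvm_hypercall);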
2022-04-07 | x86/tdx: Port I/O: Add early boot support | Andi Kleen | 1 | -0/+4
TDX guests cannot do port I/O directly. The TDX module triggers a #VE exception to let the guest kernel emulate port I/O by converting them into TDCALLs to call the host.

But before IDT handlers are set up, port I/O cannot be emulated using normal kernel #VE handlers. To support the #VE-based emulation during this boot window, add minimal early #VE handler support in the early exception handlers. This is similar to what AMD SEV does. This is mainly to support earlyprintk's serial driver, as well as potentially the VGA driver.

The early handler only supports I/O-related #VE exceptions. Unhandled or failed exceptions will be handled via early_fixup_exceptions() (like normal exception failures). At runtime, I/O-related #VE exceptions (along with other types) are handled by virt_exception_kernel().

Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/boot: Port I/O: Add decompression-time support for TDX | Kirill A. Shutemov | 2 | -27/+32
Port I/O instructions trigger #VE in the TDX environment. In response to the exception, kernel emulates these instructions using hypercalls. But during early boot, on the decompression stage, it is cumbersome to deal with #VE. It is cleaner to go to hypercalls directly, bypassing #VE handling. Hook up TDX-specific port I/O helpers if booting in TDX environment. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86: Consolidate port I/O helpers | Kirill A. Shutemov | 2 | -20/+36
There are two implementations of port I/O helpers: one in the kernel and one in the boot stub. Move the helpers required for both to <asm/shared/io.h> and use the one implementation everywhere. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
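For reference, the kind of helper being shared looks like the minimal sketch below (illustrative names with an example_ prefix; the real header stamps out the 8/16/32-bit variants with a macro).

  #include <linux/types.h>

  /* Minimal sketch of x86 port I/O accessors of the kind shared between the
   * kernel proper and the boot stub. */
  static inline void example_outb(u8 value, u16 port)
  {
  	asm volatile("outb %b0, %w1" : : "a" (value), "Nd" (port));
  }

  static inline u8 example_inb(u16 port)
  {
  	u8 value;

  	asm volatile("inb %w1, %b0" : "=a" (value) : "Nd" (port));
  	return value;
  }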
2022-04-07 | x86: Adjust types used in port I/O helpers | Kirill A. Shutemov | 1 | -13/+13
Change port I/O helpers to use u8/u16/u32 instead of unsigned char/short/int for values. Use u16 instead of int for port number. It aligns the helpers with implementation in boot stub in preparation for consolidation. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/tdx: Detect TDX at early kernel decompression time | Kuppuswamy Sathyanarayanan | 2 | -3/+9
The early decompression code does port I/O for its console output. But handling the decompression-time port I/O demands a different approach from normal runtime because the IDT required to support #VE based port I/O emulation is not yet set up. Paravirtualizing I/O calls during the decompression step is acceptable because the decompression code doesn't have a lot of call sites for I/O instructions.

To support port I/O in decompression code, TDX must be detected before the decompression code might do port I/O. Detect whether the kernel runs in a TDX guest.

Add an early_is_tdx_guest() interface to query the cached TDX guest status in the decompression code.

TDX is detected with CPUID. Make cpuid_count() accessible outside boot/cpuflags.c.

TDX detection in the main kernel is very similar. Move common bits into <asm/shared/tdx.h>.

The actual port I/O paravirtualization will come later in the series.

Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/tdx: Add HLT support for TDX guests | Kirill A. Shutemov | 1 | -0/+4
The HLT instruction is a privileged instruction, executing it stops instruction execution and places the processor in a HALT state. It is used in kernel for cases like reboot, idle loop and exception fixup handlers. For the idle case, interrupts will be enabled (using STI) before the HLT instruction (this is also called safe_halt()). To support the HLT instruction in TDX guests, it needs to be emulated using TDVMCALL (hypercall to VMM). More details about it can be found in Intel Trust Domain Extensions (Intel TDX) Guest-Host-Communication Interface (GHCI) specification, section TDVMCALL[Instruction.HLT]. In TDX guests, executing HLT instruction will generate a #VE, which is used to emulate the HLT instruction. But #VE based emulation will not work for the safe_halt() flavor, because it requires STI instruction to be executed just before the TDCALL. Since idle loop is the only user of safe_halt() variant, handle it as a special case. To avoid *safe_halt() call in the idle function, define the tdx_guest_idle() and use it to override the "x86_idle" function pointer for a valid TDX guest. Alternative choices like PV ops have been considered for adding safe_halt() support. But it was rejected because HLT paravirt calls only exist under PARAVIRT_XXL, and enabling it in TDX guest just for safe_halt() use case is not worth the cost. Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/traps: Add #VE support for TDX guest | Kirill A. Shutemov | 2 | -0/+25
Virtualization Exceptions (#VE) are delivered to TDX guests due to specific guest actions which may happen in either user space or the kernel:

 * Specific instructions (WBINVD, for example)
 * Specific MSR accesses
 * Specific CPUID leaf accesses
 * Access to specific guest physical addresses

Syscall entry code has a critical window where the kernel stack is not yet set up. Any exception in this window leads to hard to debug issues and can be exploited for privilege escalation. Exceptions in the NMI entry code also cause issues. Returning from the exception handler with IRET will re-enable NMIs and nested NMI will corrupt the NMI stack.

For these reasons, the kernel avoids #VEs during the syscall gap and the NMI entry code. Entry code paths do not access TD-shared memory, MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves that might generate #VE. VMM can remove memory from TD at any point, but access to unaccepted (or missing) private memory leads to VM termination, not to #VE.

Similarly to page faults and breakpoints, #VEs are allowed in NMI handlers once the kernel is ready to deal with nested NMIs.

During #VE delivery, all interrupts, including NMIs, are blocked until TDGETVEINFO is called. It prevents #VE nesting until the kernel reads the VE info. TDGETVEINFO retrieves the #VE info from the TDX module, which also clears the "#VE valid" flag. This must be done before anything else as any #VE that occurs while the valid flag is set escalates to #DF by TDX module. It will result in an oops.

Virtual NMIs are inhibited if the #VE valid flag is set. NMI will not be delivered until TDGETVEINFO is called.

For now, convert unhandled #VE's (everything, until later in this series) so that they appear just like a #GP by calling the ve_raise_fault() directly. The ve_raise_fault() function is similar to #GP handler and is responsible for sending SIGSEGV to userspace and CPU die and notifying debuggers and other die chain users.

Co-developed-by: Sean Christopherson <[email protected]>
Co-developed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Dave Hansen <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
2022-04-07 | x86/tdx: Add __tdx_module_call() and __tdx_hypercall() helper functions | Kuppuswamy Sathyanarayanan | 1 | -0/+30
Guests communicate with VMMs with hypercalls. Historically, these are implemented using instructions that are known to cause VMEXITs like VMCALL, VMLAUNCH, etc. However, with TDX, VMEXITs no longer expose the guest state to the host. This prevents the old hypercall mechanisms from working. So, to communicate with VMM, TDX specification defines a new instruction called TDCALL. In a TDX based VM, since the VMM is an untrusted entity, an intermediary layer -- TDX module -- facilitates secure communication between the host and the guest. TDX module is loaded like a firmware into a special CPU mode called SEAM. TDX guests communicate with the TDX module using the TDCALL instruction. A guest uses TDCALL to communicate with both the TDX module and VMM. The value of the RAX register when executing the TDCALL instruction is used to determine the TDCALL type. A leaf of TDCALL used to communicate with the VMM is called TDVMCALL. Add generic interfaces to communicate with the TDX module and VMM (using the TDCALL instruction). __tdx_module_call() - Used to communicate with the TDX module (via TDCALL instruction). __tdx_hypercall() - Used by the guest to request services from the VMM (via TDVMCALL leaf of TDCALL). Also define an additional wrapper _tdx_hypercall(), which adds error handling support for the TDCALL failure. The __tdx_module_call() and __tdx_hypercall() helper functions are implemented in assembly in a .S file. The TDCALL ABI requires shuffling arguments in and out of registers, which proved to be awkward with inline assembly. Just like syscalls, not all TDVMCALL use cases need to use the same number of argument registers. The implementation here picks the current worst-case scenario for TDCALL (4 registers). For TDCALLs with fewer than 4 arguments, there will end up being a few superfluous (cheap) instructions. But, this approach maximizes code reuse. For registers used by the TDCALL instruction, please check TDX GHCI specification, the section titled "TDCALL instruction" and "TDG.VP.VMCALL Interface". Based on previous patch by Sean Christopherson. Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]> Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Tony Luck <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Borislav Petkov <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
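As a rough picture of the two interfaces described above, the prototypes might be sketched as follows. This is a hedged illustration only: the exact register struct and signatures in the kernel may differ, and the example_ names are placeholders.

  #include <linux/types.h>

  /* Registers the TDX module / VMM may hand back from a TDCALL. */
  struct example_tdx_output {
  	u64 rcx, rdx, r8, r9, r10, r11;
  };

  /* Talk to the TDX module: RAX selects the TDCALL leaf ("fn"). */
  u64 example_tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
  			    struct example_tdx_output *out);

  /* Request a service from the VMM via the TDVMCALL leaf of TDCALL; four
   * argument registers cover the current worst case mentioned above. */
  u64 example_tdx_hypercall(u64 fn, u64 r12, u64 r13, u64 r14, u64 r15,
  			  struct example_tdx_output *out);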