blaster4385/linux-IllusionX - Linux kernel with personal config changes for arch linux

Age	Commit message (Collapse)	Author	Files	Lines
2024-05-07	KVM: x86/mmu: Pass full 64-bit error code when handling page faults	Isaku Yamahata	3	-5/+4
	Plumb the full 64-bit error code throughout the page fault handling code so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK will be used to determine whether or not a fault is private vs. shared. Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change the behavior of permission_fault() when invoked in the page fault path, as KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault(). Continue passing '0' from the async #PF worker, as guest_memfd and thus private memory doesn't support async page faults. Signed-off-by: Isaku Yamahata <[email protected]> [mdr: drop references/changes on rebase, update commit message] Signed-off-by: Michael Roth <[email protected]> [sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-05-07	KVM: x86: Move synthetic PFERR_* sanity checks to SVM's #NPF handler	Sean Christopherson	2	-11/+12
	Move the sanity check that hardware never sets bits that collide with KVM- define synthetic bits from kvm_mmu_page_fault() to npf_interception(), i.e. make the sanity check #NPF specific. The legacy #PF path already WARNs if _any_ of bits 63:32 are set, and the error code that comes from VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's EXIT_QUALIFICATION into error code flags). Add a compile-time assert in the legacy #PF handler to make sure that KVM- define flags are covered by its existing sanity check on the upper bits. Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we are removing the comment that defined it. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-05-07	KVM: x86: Remove separate "bit" defines for page fault error code masks	Sean Christopherson	1	-3/+2
	Open code the bit number directly in the PFERR_* masks and drop the intermediate PFERR_*_BIT defines, as having to bounce through two macros just to see which flag corresponds to which bit is quite annoying, as is having to define two macros just to add recognition of a new flag. Use ternary operator to derive the bit in permission_fault(), the one function that actually needs the bit number as part of clever shifting to avoid conditional branches. Generally the compiler is able to turn it into a conditional move, and if not it's not really a big deal. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-05-07	KVM: x86/mmu: Exit to userspace with -EFAULT if private fault hits emulation	Sean Christopherson	2	-8/+19
	Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault triggers emulation of any kind, as KVM doesn't currently support emulating access to guest private memory. Practically speaking, private faults and emulation are already mutually exclusive, but there are many flow that can result in KVM returning RET_PF_EMULATE, and adding one last check to harden against weird, unexpected combinations and/or KVM bugs is inexpensive. Suggested-by: Yan Zhao <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-05-02	KVM: x86: Remove VT-d mention in posted interrupt tracepoint	Alejandro Jimenez	1	-2/+2
	The kvm_pi_irte_update tracepoint is called from both SVM and VMX vendor code, and while the "posted interrupt" naming is also adopted by SVM in several places, VT-d specifically refers to Intel's "Virtualization Technology for Directed I/O". Signed-off-by: Alejandro Jimenez <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-05-02	KVM: x86: Only set APICV_INHIBIT_REASON_ABSENT if APICv is enabled	Alejandro Jimenez	1	-7/+4
	Use the APICv enablement status to determine if APICV_INHIBIT_REASON_ABSENT needs to be set, instead of unconditionally setting the reason during initialization. Specifically, in cases where AVIC is disabled via module parameter or lack of hardware support, unconditionally setting an inhibit reason due to the absence of an in-kernel local APIC can lead to a scenario where the reason incorrectly remains set after a local APIC has been created by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. This is because the helpers in charge of removing the inhibit return early if enable_apicv is not true, and therefore the bit remains set. This leads to confusion as to the cause why APICv is not active, since an incorrect reason will be reported by tracepoints and/or a debugging tool that examines the currently set inhibit reasons. Fixes: ef8b4b720368 ("KVM: ensure APICv is considered inactive if there is no APIC") Signed-off-by: Alejandro Jimenez <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-05-02	KVM: x86/mmu: Fix a largely theoretical race in kvm_mmu_track_write()	Sean Christopherson	1	-3/+17
	Add full memory barriers in kvm_mmu_track_write() and account_shadowed() to plug a (very, very theoretical) race where kvm_mmu_track_write() could miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant, stale SPTEs. Without the barriers, because modern x86 CPUs allow (per the SDM): Reads may be reordered with older writes to different locations but not with older writes to the same location. it's possible that the following could happen (terms of values being visible/resolved): CPU0 CPU1 read memory[gfn] (=Y) memory[gfn] Y=>X read indirect_shadow_pages (=0) indirect_shadow_pages 0=>1 or conversely: CPU0 CPU1 indirect_shadow_pages 0=>1 read indirect_shadow_pages (=0) read memory[gfn] (=Y) memory[gfn] Y=>X E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could fail to retry the faulting instruction, resulting in a KVM entering the guest with a stale SPTE (map PTE=X instead of PTE=Y). PTE = X; CPU0: emulator_write_phys() PTE = Y kvm_page_track_write() kvm_mmu_track_write() // memory barrier missing here if (indirect_shadow_pages) zap(); CPU1: FNAME(page_fault) FNAME(walk_addr) FNAME(walk_addr_generic) gw->pte = PTE; // X FNAME(fetch) kvm_mmu_get_child_sp kvm_mmu_get_shadow_page __kvm_mmu_get_shadow_page kvm_mmu_alloc_shadow_page account_shadowed indirect_shadow_pages++ // memory barrier missing here if (FNAME(gpte_changed)) // if (PTE == X) return RET_PF_RETRY; In practice, this bug likely cannot be observed as both the 0=>1 transition and reordering of this scope are extremely rare occurrences. Note, if the cost of the barrier (which is simply a locked ADD, see commit 450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")), is problematic, KVM could avoid the barrier by bailing earlier if checking kvm_memslots_have_rmaps() is false. But the odds of the barrier being problematic is extremely low, and the odds of the extra checks being meaningfully faster overall is also low. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-05-02	KVM: x86: Allow, don't ignore, same-value writes to immutable MSRs	Sean Christopherson	1	-7/+4
	When handling userspace writes to immutable feature MSRs for a vCPU that has already run, fall through into the normal code to set the MSR instead of immediately returning '0'. I.e. allow such writes, instead of ignoring such writes. This fixes a bug where KVM incorrectly allows writes to the VMX MSRs that enumerate which CR{0,4} can be set, but only if the vCPU has already run. The intent of returning '0' and thus ignoring the write, was to avoid any side effects, e.g. refreshing the PMU and thus doing weird things with perf events while the vCPU is running. That approach sounds nice in theory, but in practice it makes it all but impossible to maintain a sane ABI, e.g. all VMX MSRs return -EBUSY if the CPU is post-VMXON, and the VMX MSRs for fixed-1 CR bits are never writable, etc. As for refreshing the PMU, kvm_set_msr_common() explicitly skips the PMU refresh if MSR_IA32_PERF_CAPABILITIES is being written with the current value, specifically to avoid unwanted side effects. And if necessary, adding similar logic for other MSRs is not difficult. Fixes: 0094f62c7eaa ("KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN") Reported-by: Jim Mattson <[email protected]> Cc: Raghavendra Rao Ananta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-30	x86/irq: Remove bitfields in posted interrupt descriptor	Jacob Pan	2	-3/+3
	Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30	KVM: VMX: Move posted interrupt descriptor out of VMX code	Jacob Pan	3	-93/+3
	To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-20	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds	15	-113/+154
	Pull kvm fixes from Paolo Bonzini: "This is a bit on the large side, mostly due to two changes: - Changes to disable some broken PMU virtualization (see below for details under "x86 PMU") - Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. Everything else is small bugfixes and selftest changes: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: - Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. - Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks perf/x86/intel: Expose existence of callback support to KVM KVM: VMX: Snapshot LBR capabilities during module initialization KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding KVM: SVM: Remove a useless zeroing of allocated memory ...
2024-04-19	KVM: VMX: Introduce test mode related to EPT violation VE	Isaku Yamahata	4	-4/+73
	To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the suppress #VE bit in EPT entries selectively, in order to be able to trap non-present conditions. However, #VE isn't used for VMX and it's a bug if it happens. To be defensive and test that VMX case isn't broken introduce an option ept_violation_ve_test and when it's set, BUG the vm. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	KVM, x86: add architectural support code for #VE	Paolo Bonzini	1	-0/+4
	Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <[email protected]>. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	KVM: x86/mmu: Track shadow MMIO value on a per-VM basis	Sean Christopherson	4	-10/+11
	TDX will use a different shadow PTE entry value for MMIO from VMX. Add a member to kvm_arch and track value for MMIO per-VM instead of a global variable. By using the per-VM EPT entry value for MMIO, the existing VMX logic is kept working. Introduce a separate setter function so that guest TD can use a different value later. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	KVM: x86/mmu: Add Suppress VE bit to EPT shadow_mmio_mask/shadow_present_mask	Isaku Yamahata	1	-2/+4
	To make use of the same value of shadow_mmio_mask and shadow_present_mask for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and shadow_present_mask so that they can be common for both VMX and TDX. TDX will require shadow_mmio_mask and shadow_present_mask to include VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the spte value is defined so as to cause EPT misconfig. Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com> Reviewed-by: Xiaoyao Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE	Sean Christopherson	3	-14/+28
	For TD guest, the current way to emulate MMIO doesn't work any more, as KVM is not able to access the private memory of TD guest and do the emulation. Instead, TD guest expects to receive #VE when it accesses the MMIO and then it can explicitly make hypercall to KVM to get the expected information. To achieve this, the TDX module always enables "EPT-violation #VE" in the VMCS control. And accordingly, for the MMIO spte for the shared GPA, 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT violation happens on TD accessing MMIO range. 2. On EPT violation, KVM sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA that is not populated yet, EPT violation need to be triggered when TD guest accesses such shared GPA. The non-present SPTE value for shared GPA should set "suppress #VE" bit. Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63) for both AMD and Intel as: 1) AMD hardware doesn't use this bit when present bit is off; 2) for normal VMX guest, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE	Sean Christopherson	4	-13/+19
	The TDX support will need the "suppress #VE" bit (bit 63) set as the initial value for SPTE. To reduce code change size, introduce a new macro SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table entry (SPTE) and replace hard-coded value 0 for it. Initialize shadow page tables with their value. The plan is to unconditionally set the "suppress #VE" bit for both AMD and Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and ignored for non-present SPTE; 2) for conventional VMX guests, KVM never enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by hardware. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <acdf09bf60cad12c495005bf3495c54f6b3069c9.1705965635.git.isaku.yamahata@intel.com> [Remove unnecessary CONFIG_X86_64 check. - Paolo] Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-19	Merge x86 bugfixes from Linux 6.9-rc3	Paolo Bonzini	3	-1/+4
	Pull fix for SEV-SNP late disable bugs. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-17	Merge branch 'svm' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini	5	-67/+57
	Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. The "standard" __svm_vcpu_run() can't be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's also the case for __vmx_vcpu_run(), and getting "close enough" is better than not even trying. As for SEV-ES, after yet another refresher on swap types, I realized KVM can simply let the hardware restore registers after #VMEXIT, all that's missing is storing the current values to the host save area (they are swap type B). This should provide 100% accuracy when using stack frames for unwinding, and requires less assembly. In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out "support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was unnecessarily polluting the code for a configuration that is disabled at build time. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-16	Merge tag 'kvm-x86-fixes-6.9-rcN' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini	6	-43/+82
	- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Disable support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken and can leak host LBRs to the guest. - Fix a bug where KVM neglects to set the enable bits for general purpose counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel and AMD architectures require the bits to be set at RESET in order for v2 PMUs to be backwards compatible with software that was written for v1 PMUs, i.e. for software that will never manually set the global enables. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the virtual LBR perf event, i.e. KVM will always fail to create LBR events on such CPUs. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test. - Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow root due KVM unnecessarily clobbering root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1 hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU to run L2). For simplicity, KVM always disables PML when running L2, but the TDP MMU wasn't accounting for root-specific conditions that force write- protect based dirty logging.
2024-04-12	KVM: SEV: use u64_to_user_ptr throughout	Paolo Bonzini	1	-22/+22
	Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-12	KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument	Sean Christopherson	1	-9/+7
	TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Paolo Bonzini <[email protected]> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-12	KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX	Paolo Bonzini	4	-270/+391
	KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <[email protected]> Signed-off-by: Xiaoyao Li <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Binbin Wu <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Yuan Yao <[email protected]> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-12	KVM: x86: Split core of hypercall emulation to helper function	Sean Christopherson	1	-18/+38
	By necessity, TDX will use a different register ABI for hypercalls. Break out the core functionality so that it may be reused for TDX. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-12	Merge branch 'kvm-sev-init2' into HEAD	Paolo Bonzini	7	-129/+318
	The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. The first source of variability that was encountered is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. This series adds all the APIs that are needed to customize the features, with room for future enhancements: - a new /dev/kvm device attribute to retrieve the set of supported features (right now, only debug swap) - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct, replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT It then puts the new op to work by including the VMSA features as a field of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility; but I am considering also making them use zero as the feature mask, and will gladly adjust the patches if so requested. In order to avoid creating two new KVM_MEM_ENCRYPT_OPs, I decided that I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP to reuse the KVM_SEV_INIT2 ioctl. And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all, SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT, reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor once the VMSA has been encrypted... which is how the API should have always behaved. Second, they also synchronize the FPU and AVX state. Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting	David Matlack	1	-10/+6
	Drop the "If AD bits are enabled/disabled" verbiage from the comments above kvm_tdp_mmu_clear_dirty_{slot,pt_masked}() since TDP MMU SPTEs may need to be write-protected even when A/D bits are enabled. i.e. These comments aren't technically correct. No functional change intended. Signed-off-by: David Matlack <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()	David Matlack	1	-14/+0
	Drop the comments above clear_dirty_gfn_range() and clear_dirty_pt_masked(), since each is word-for-word identical to the comment above their parent function. Leave the comment on the parent functions since they are APIs called by the KVM/x86 MMU. No functional change intended. Signed-off-by: David Matlack <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status	David Matlack	1	-5/+16
	Check kvm_mmu_page_ad_need_write_protect() when deciding whether to write-protect or clear D-bits on TDP MMU SPTEs, so that the TDP MMU accounts for any role-specific reasons for disabling D-bit dirty logging. Specifically, TDP MMU SPTEs must be write-protected when the TDP MMU is being used to run an L2 (i.e. L1 has disabled EPT) and PML is enabled. KVM always disables PML when running L2, even when L1 and L2 GPAs are in the some domain, so failing to write-protect TDP MMU SPTEs will cause writes made by L2 to not be reflected in the dirty log. Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821 Fixes: 5982a5392663 ("KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot") Cc: [email protected] Cc: Vipin Sharma <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: massage shortlog and changelog, tweak ternary op formatting] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update	Sean Christopherson	1	-3/+3
	Set kvm_mmu_page_role.invalid to mark the various MMU root_roles invalid during CPUID update in order to force a refresh, instead of zeroing out the entire role. This fixes a bug where kvm_mmu_free_roots() incorrectly thinks a root is indirect, i.e. not a TDP MMU, due to "direct" being zeroed, which in turn causes KVM to take mmu_lock for write instead of read. Note, paving over the entire role was largely unintentional, commit 7a458f0e1ba1 ("KVM: x86/mmu: remove extended bits from mmu_role, rename field") simply missed that "invalid" could be set. Fixes: 576a15de8d29 ("KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read") Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected] Cc: Phi Nguyen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks	Sean Christopherson	1	-1/+9
	Disable LBR virtualization if the CPU doesn't support callstacks, which were introduced in HSW (see commit e9d7f7cd97c4 ("perf/x86/intel: Add basic Haswell LBR call stack support"), as KVM unconditionally configures the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR virtualization always fails on pre-HSW CPUs. Simply disable LBR support on such CPUs, as it has never worked, i.e. there is no risk of breaking an existing setup, and figuring out a way to performantly context switch LBRs on old CPUs is not worth the effort. Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Cc: Mingwei Zhang <[email protected]> Cc: Jim Mattson <[email protected]> Tested-by: Mingwei Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: VMX: Snapshot LBR capabilities during module initialization	Sean Christopherson	3	-5/+8
	Snapshot VMX's LBR capabilities once during module initialization instead of calling into perf every time a vCPU reconfigures its vPMU. This will allow massaging the LBR capabilities, e.g. if the CPU doesn't support callstacks, without having to remember to update multiple locations. Opportunistically tag vmx_get_perf_capabilities() with __init, as it's only called from vmx_set_cpu_caps(). Reviewed-by: Mingwei Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-11	KVM: delete .change_pte MMU notifier callback	Paolo Bonzini	5	-125/+7
	The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were not bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <[email protected]> Acked-by: Anup Patel <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Reviewed-by: Bibo Mao <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: allow SEV-ES DebugSwap again	Paolo Bonzini	1	-1/+1
	The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. Its status is record in VMSA, and therefore attestation signatures depend on whether it is enabled or not. In order to avoid invalidating the signatures depending on the host machine, it was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES DebugSwap by default", 2024-03-09). However, we now have a new API to create SEV VMs that allows enabling DebugSwap based on what the user tells KVM to do, and we also changed the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore possible to re-enable the feature without breaking compatibility with kernels that pre-date the introduction of DebugSwap, so go ahead. Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: introduce KVM_SEV_INIT2 operation	Paolo Bonzini	1	-7/+46
	The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time	Paolo Bonzini	2	-8/+50
	SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA. Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark FPU contents as confidential after it has been copied and encrypted into the VMSA. Since the XSAVE state for AVX is the first, it does not need the compacted-state handling of get_xsave_addr(). However, there are other parts of XSAVE state in the VMSA that currently are not handled, and the validation logic of get_xsave_addr() is pointless to duplicate in KVM, so move get_xsave_addr() to public FPU API; it is really just a facility to operate on XSAVE state and does not expose any internal details of arch/x86/kernel/fpu. Acked-by: Dave Hansen <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: define VM types for SEV and SEV-ES	Paolo Bonzini	3	-3/+25
	Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: introduce to_kvm_sev_info	Paolo Bonzini	2	-2/+7
	Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86: Add supported_vm_types to kvm_caps	Paolo Bonzini	2	-6/+8
	This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES), and also allows the vendor module to specify which VM types are supported. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86: add fields to struct kvm_arch for CoCo features	Paolo Bonzini	1	-19/+74
	Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: store VMSA features in kvm_sev_info	Paolo Bonzini	3	-10/+24
	Right now, the set of features that are stored in the VMSA upon initialization is fixed and depends on the module parameters for kvm-amd.ko. However, the hypervisor cannot really change it at will because the feature word has to match between the hypervisor and whatever computes a measurement of the VMSA for attestation purposes. Add a field to kvm_sev_info that holds the set of features to be stored in the VMSA; and query it instead of referring to the module parameters. Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this does not yet introduce any functional change, but it paves the way for an API that allows customization of the features per-VM. Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SEV: publish supported VMSA features	Paolo Bonzini	3	-2/+25
	Compute the set of features to be stored in the VMSA when KVM is initialized; move it from there into kvm_sev_info when SEV is initialized, and then into the initial VMSA. The new variable can then be used to return the set of supported features to userspace, via the KVM_GET_DEVICE_ATTR ioctl. Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR	Paolo Bonzini	1	-14/+24
	Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Isaku Yamahata <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86: use u64_to_user_ptr()	Paolo Bonzini	1	-21/+3
	There is no danger to the kernel if 32-bit userspace provides a 64-bit value that has the high bits set, but for whatever reason happens to resolve to an address that has something mapped there. KVM uses the checked version of get_user() and put_user(), so any faults are caught properly. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y	Paolo Bonzini	4	-43/+38
	Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers is quite confusing. To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks and the #VMGEXIT handler. Stubs are also restricted to functions that check sev_enabled and to the destruction functions sev_free_cpu() and sev_vm_destroy(), where the style of their callers is to leave checks to the callers. Most call sites instead rely on dead code elimination to take care of functions that are guarded with sev_guest() or sev_es_guest(). Signed-off-by: Sean Christopherson <[email protected]> Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: SVM: Invert handling of SEV and SEV_ES feature flags	Sean Christopherson	2	-5/+5
	Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside from the fact that sev_set_cpu_caps() is wildly misleading when it clears capabilities, this will allow compiling out sev.c without falsely advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms	Sandipan Das	1	-1/+2
	On AMD and Hygon platforms, the local APIC does not automatically set the mask bit of the LVTPC register when handling a PMI and there is no need to clear it in the kernel's PMI handler. For guests, the mask bit is currently set by kvm_apic_local_deliver() and unless it is cleared by the guest kernel's PMI handler, PMIs stop arriving and break use-cases like sampling with perf record. This does not affect non-PerfMonV2 guests because PMIs are handled in the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC mask bit irrespective of the vendor. Before: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ] After: $ perf record -e cycles:u true [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ] Fixes: a16eb25b09c0 ("KVM: x86: Mask LVTPC when handling a PMI") Cc: [email protected] Signed-off-by: Sandipan Das <[email protected]> Reviewed-by: Jim Mattson <[email protected]> [sean: use is_intel_compatible instead of !is_amd_or_hygon()] Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11	KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible	Sean Christopherson	4	-2/+13
	Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel vs. AMD behavior related to masking the LVTPC entry, KVM will need to check for vendor compatibility on every PMI injection, i.e. querying for AMD will soon be a moderately hot path. Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's default behavior, both if userspace omits (or never sets) CPUID 0x0 and if userspace sets a completely unknown vendor. One could argue that KVM should treat such vCPUs as not being compatible with Intel or AMD, but that would add useless complexity to KVM. KVM needs to do something in the face of vendor specific behavior, and so unless KVM conjured up a magic third option, choosing to treat unknown vendors as neither Intel nor AMD means that checks on AMD compatibility would yield Intel behavior, and checks for Intel compatibility would yield AMD behavior. And that's far worse as it would effectively yield random behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs. !Intel. And practically speaking, all x86 CPUs follow either Intel or AMD architecture, i.e. "supporting" an unknown third architecture adds no value. Deliberately don't convert any of the existing guest_cpuid_is_intel() checks, as the Intel side of things is messier due to some flows explicitly checking for exactly vendor==Intel, versus some flows assuming anything that isn't "AMD compatible" gets Intel behavior. The Intel code will be cleaned up in the future. Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-09	KVM: Use vfree for memory allocated by vcalloc()/__vcalloc()	Li RongQing	2	-4/+4
	commit 37b2a6510a48("KVM: use __vcalloc for very large allocations") replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree() with vfree(). Signed-off-by: Li RongQing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-09	KVM: x86: Advertise max mappable GPA in CPUID.0x80000008.GuestPhysBits	Gerd Hoffmann	3	-3/+32
	Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max mappable GPA to userspace, i.e. the max GPA that is addressable by the CPU itself. Typically this is identical to the max effective GPA, except in the case where the CPU supports MAXPHYADDR > 48 but does not support 5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the fifth level TDP page table entry). Enumerating the max mappable GPA via CPUID will allow guest firmware to map resources like PCI bars in the highest possible address space, while ensuring that the GPA is addressable by the CPU. Without precise knowledge about the max mappable GPA, the guest must assume that 5-level paging is unsupported and thus restrict its mappings to the lower 48 bits. Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace doesn't have easy access to whether or not 5-level paging is supported, and to play nice with userspace VMMs that reflect the supported CPUID directly into the guest. AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Tom Lendacky confirmed that the purpose of GuestPhysBits is software use and KVM can use it as described above. Real hardware always returns zero. Leave GuestPhysBits as '0' when TDP is disabled in order to comply with the APM's statement that GuestPhysBits "applies only to guest using nested paging". As above, guest firmware will likely create suboptimal mappings, but that is a very minor issue and not a functional concern. Signed-off-by: Gerd Hoffmann <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: massage changelog] Signed-off-by: Sean Christopherson <[email protected]>
2024-04-09	KVM: x86: Don't advertise guest.MAXPHYADDR as host.MAXPHYADDR in CPUID	Gerd Hoffmann	1	-11/+10
	Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16]) to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths to userspace via KVM_GET_SUPPORTED_CPUID. Per AMD, GuestPhysBits is intended for software use, and physical CPUs do not set that field. I.e. GuestPhysBits will be non-zero if and only if KVM is running as a nested hypervisor, and in that case, GuestPhysBits is NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with TDP enabled. E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum addressable guest physical address, which would result in KVM under- reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52, but without 5-level paging. Signed-off-by: Gerd Hoffmann <[email protected]> Cc: [email protected] Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: rewrite changelog with --verbose, Cc stable@] Signed-off-by: Sean Christopherson <[email protected]>