Age | Commit message (Collapse) | Author | Files | Lines |
|
Plumb the full 64-bit error code throughout the page fault handling code
so that KVM can use the upper 32 bits, e.g. SNP's PFERR_GUEST_ENC_MASK
will be used to determine whether or not a fault is private vs. shared.
Note, passing the 64-bit error code to FNAME(walk_addr)() does NOT change
the behavior of permission_fault() when invoked in the page fault path, as
KVM explicitly clears PFERR_IMPLICIT_ACCESS in kvm_mmu_page_fault().
Continue passing '0' from the async #PF worker, as guest_memfd and thus
private memory doesn't support async page faults.
Signed-off-by: Isaku Yamahata <[email protected]>
[mdr: drop references/changes on rebase, update commit message]
Signed-off-by: Michael Roth <[email protected]>
[sean: drop truncation in call to FNAME(walk_addr)(), rewrite changelog]
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move the sanity check that hardware never sets bits that collide with KVM-
define synthetic bits from kvm_mmu_page_fault() to npf_interception(),
i.e. make the sanity check #NPF specific. The legacy #PF path already
WARNs if _any_ of bits 63:32 are set, and the error code that comes from
VMX's EPT Violatation and Misconfig is 100% synthesized (KVM morphs VMX's
EXIT_QUALIFICATION into error code flags).
Add a compile-time assert in the legacy #PF handler to make sure that KVM-
define flags are covered by its existing sanity check on the upper bits.
Opportunistically add a description of PFERR_IMPLICIT_ACCESS, since we
are removing the comment that defined it.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Open code the bit number directly in the PFERR_* masks and drop the
intermediate PFERR_*_BIT defines, as having to bounce through two macros
just to see which flag corresponds to which bit is quite annoying, as is
having to define two macros just to add recognition of a new flag.
Use ternary operator to derive the bit in permission_fault(), the one
function that actually needs the bit number as part of clever shifting
to avoid conditional branches. Generally the compiler is able to turn
it into a conditional move, and if not it's not really a big deal.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Exit to userspace with -EFAULT / KVM_EXIT_MEMORY_FAULT if a private fault
triggers emulation of any kind, as KVM doesn't currently support emulating
access to guest private memory. Practically speaking, private faults and
emulation are already mutually exclusive, but there are many flow that
can result in KVM returning RET_PF_EMULATE, and adding one last check
to harden against weird, unexpected combinations and/or KVM bugs is
inexpensive.
Suggested-by: Yan Zhao <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The kvm_pi_irte_update tracepoint is called from both SVM and VMX vendor
code, and while the "posted interrupt" naming is also adopted by SVM in
several places, VT-d specifically refers to Intel's "Virtualization
Technology for Directed I/O".
Signed-off-by: Alejandro Jimenez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use the APICv enablement status to determine if APICV_INHIBIT_REASON_ABSENT
needs to be set, instead of unconditionally setting the reason during
initialization.
Specifically, in cases where AVIC is disabled via module parameter or lack
of hardware support, unconditionally setting an inhibit reason due to the
absence of an in-kernel local APIC can lead to a scenario where the reason
incorrectly remains set after a local APIC has been created by either
KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. This is
because the helpers in charge of removing the inhibit return early if
enable_apicv is not true, and therefore the bit remains set.
This leads to confusion as to the cause why APICv is not active, since an
incorrect reason will be reported by tracepoints and/or a debugging tool
that examines the currently set inhibit reasons.
Fixes: ef8b4b720368 ("KVM: ensure APICv is considered inactive if there is no APIC")
Signed-off-by: Alejandro Jimenez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Add full memory barriers in kvm_mmu_track_write() and account_shadowed()
to plug a (very, very theoretical) race where kvm_mmu_track_write() could
miss a 0->1 transition of indirect_shadow_pages and fail to zap relevant,
*stale* SPTEs.
Without the barriers, because modern x86 CPUs allow (per the SDM):
Reads may be reordered with older writes to different locations but not
with older writes to the same location.
it's possible that the following could happen (terms of values being
visible/resolved):
CPU0 CPU1
read memory[gfn] (=Y)
memory[gfn] Y=>X
read indirect_shadow_pages (=0)
indirect_shadow_pages 0=>1
or conversely:
CPU0 CPU1
indirect_shadow_pages 0=>1
read indirect_shadow_pages (=0)
read memory[gfn] (=Y)
memory[gfn] Y=>X
E.g. in the below scenario, CPU0 could fail to zap SPTEs, and CPU1 could
fail to retry the faulting instruction, resulting in a KVM entering the
guest with a stale SPTE (map PTE=X instead of PTE=Y).
PTE = X;
CPU0:
emulator_write_phys()
PTE = Y
kvm_page_track_write()
kvm_mmu_track_write()
// memory barrier missing here
if (indirect_shadow_pages)
zap();
CPU1:
FNAME(page_fault)
FNAME(walk_addr)
FNAME(walk_addr_generic)
gw->pte = PTE; // X
FNAME(fetch)
kvm_mmu_get_child_sp
kvm_mmu_get_shadow_page
__kvm_mmu_get_shadow_page
kvm_mmu_alloc_shadow_page
account_shadowed
indirect_shadow_pages++
// memory barrier missing here
if (FNAME(gpte_changed)) // if (PTE == X)
return RET_PF_RETRY;
In practice, this bug likely cannot be observed as both the 0=>1
transition and reordering of this scope are extremely rare occurrences.
Note, if the cost of the barrier (which is simply a locked ADD, see commit
450cbdd0125c ("locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE")),
is problematic, KVM could avoid the barrier by bailing earlier if checking
kvm_memslots_have_rmaps() is false. But the odds of the barrier being
problematic is extremely low, *and* the odds of the extra checks being
meaningfully faster overall is also low.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
When handling userspace writes to immutable feature MSRs for a vCPU that
has already run, fall through into the normal code to set the MSR instead
of immediately returning '0'. I.e. allow such writes, instead of ignoring
such writes. This fixes a bug where KVM incorrectly allows writes to the
VMX MSRs that enumerate which CR{0,4} can be set, but only if the vCPU has
already run.
The intent of returning '0' and thus ignoring the write, was to avoid any
side effects, e.g. refreshing the PMU and thus doing weird things with
perf events while the vCPU is running. That approach sounds nice in
theory, but in practice it makes it all but impossible to maintain a sane
ABI, e.g. all VMX MSRs return -EBUSY if the CPU is post-VMXON, and the VMX
MSRs for fixed-1 CR bits are never writable, etc.
As for refreshing the PMU, kvm_set_msr_common() explicitly skips the PMU
refresh if MSR_IA32_PERF_CAPABILITIES is being written with the current
value, specifically to avoid unwanted side effects. And if necessary,
adding similar logic for other MSRs is not difficult.
Fixes: 0094f62c7eaa ("KVM: x86: Disallow writes to immutable feature MSRs after KVM_RUN")
Reported-by: Jim Mattson <[email protected]>
Cc: Raghavendra Rao Ananta <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Mixture of bitfields and types is weird and really not intuitive, remove
bitfields and use typed data exclusively. Bitfields often result in
inferior machine code.
Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
|
|
To prepare native usage of posted interrupts, move the PID declarations out
of VMX code such that they can be shared.
Signed-off-by: Jacob Pan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Pull kvm fixes from Paolo Bonzini:
"This is a bit on the large side, mostly due to two changes:
- Changes to disable some broken PMU virtualization (see below for
details under "x86 PMU")
- Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched
return thunk in use. This should not happen!" when running KVM
selftests.
Everything else is small bugfixes and selftest changes:
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure
where KVM would allow userspace to refresh the cache with a bogus
GPA. The bug has existed for quite some time, but was exposed by a
new sanity check added in 6.9 (to ensure a cache is either
GPA-based or HVA-based).
- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that
got left behind during a 6.9 cleanup.
- Fix a math goof in x86's hugepage logic for
KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow
(detected by KASAN).
- Fix a bug where KVM incorrectly clears root_role.direct when
userspace sets guest CPUID.
- Fix a dirty logging bug in the where KVM fails to write-protect
SPTEs used by a nested guest, if KVM is using Page-Modification
Logging and the nested hypervisor is NOT using EPT.
x86 PMU:
- Drop support for virtualizing adaptive PEBS, as KVM's
implementation is architecturally broken without an obvious/easy
path forward, and because exposing adaptive PEBS can leak host LBRs
to the guest, i.e. can leak host kernel addresses to the guest.
- Set the enable bits for general purpose counters in
PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD
processors.
- Disable LBR virtualization on CPUs that don't support LBR
callstacks, as KVM unconditionally uses
PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and
would fail on such CPUs.
Tests:
- Fix a flaw in the max_guest_memory selftest that results in it
exhausting the supply of ucall structures when run with more than
256 vCPUs.
- Mark KVM_MEM_READONLY as supported for RISC-V in
set_memory_region_test"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits)
KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()
KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test
KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting
KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}()
KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status
KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update
KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks
perf/x86/intel: Expose existence of callback support to KVM
KVM: VMX: Snapshot LBR capabilities during module initialization
KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms
KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible
KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD
KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run()
KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area
KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area
KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted
KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run()
KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV
KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding
KVM: SVM: Remove a useless zeroing of allocated memory
...
|
|
To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the
suppress #VE bit in EPT entries selectively, in order to be able to trap
non-present conditions. However, #VE isn't used for VMX and it's a bug
if it happens. To be defensive and test that VMX case isn't broken
introduce an option ept_violation_ve_test and when it's set, BUG the vm.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Dump the contents of the #VE info data structure and assert that #VE does
not happen, but do not yet do anything with it.
No functional change intended, separated for clarity only.
Extracted from a patch by Isaku Yamahata <[email protected]>.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
TDX will use a different shadow PTE entry value for MMIO from VMX. Add a
member to kvm_arch and track value for MMIO per-VM instead of a global
variable. By using the per-VM EPT entry value for MMIO, the existing VMX
logic is kept working. Introduce a separate setter function so that guest
TD can use a different value later.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
To make use of the same value of shadow_mmio_mask and shadow_present_mask
for TDX and VMX, add Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common for both VMX and TDX.
TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPA so that EPT violation is triggered for
shared GPA. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO because the
spte value is defined so as to cause EPT misconfig.
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
is not able to access the private memory of TD guest and do the emulation.
Instead, TD guest expects to receive #VE when it accesses the MMIO and then
it can explicitly make hypercall to KVM to get the expected information.
To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control. And accordingly, for the MMIO spte for the shared GPA,
1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
violation happens on TD accessing MMIO range. 2. On EPT violation, KVM
sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
the #VE instead of EPT misconfiguration unlike VMX case. For the shared GPA
that is not populated yet, EPT violation need to be triggered when TD guest
accesses such shared GPA. The non-present SPTE value for shared GPA should
set "suppress #VE" bit.
Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The TDX support will need the "suppress #VE" bit (bit 63) set as the
initial value for SPTE. To reduce code change size, introduce a new macro
SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table
entry (SPTE) and replace hard-coded value 0 for it. Initialize shadow page
tables with their value.
The plan is to unconditionally set the "suppress #VE" bit for both AMD and
Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and
ignored for non-present SPTE; 2) for conventional VMX guests, KVM never
enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is
ignored by hardware.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <acdf09bf60cad12c495005bf3495c54f6b3069c9.1705965635.git.isaku.yamahata@intel.com>
[Remove unnecessary CONFIG_X86_64 check. - Paolo]
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Pull fix for SEV-SNP late disable bugs.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Clean up SVM's enter/exit assembly code so that it can be compiled
without OBJECT_FILES_NON_STANDARD. The "standard" __svm_vcpu_run() can't
be made 100% bulletproof, as RBP isn't restored on #VMEXIT, but that's
also the case for __vmx_vcpu_run(), and getting "close enough" is better
than not even trying.
As for SEV-ES, after yet another refresher on swap types, I realized
KVM can simply let the hardware restore registers after #VMEXIT, all
that's missing is storing the current values to the host save area
(they are swap type B). This should provide 100% accuracy when using
stack frames for unwinding, and requires less assembly.
In between, build the SEV-ES code iff CONFIG_KVM_AMD_SEV=y, and yank out
"support" for 32-bit kernels in __svm_sev_es_vcpu_run, which was
unnecessarily polluting the code for a configuration that is disabled
at build time.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
- Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM
would allow userspace to refresh the cache with a bogus GPA. The bug has
existed for quite some time, but was exposed by a new sanity check added in
6.9 (to ensure a cache is either GPA-based or HVA-based).
- Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left
behind during a 6.9 cleanup.
- Disable support for virtualizing adaptive PEBS, as KVM's implementation is
architecturally broken and can leak host LBRs to the guest.
- Fix a bug where KVM neglects to set the enable bits for general purpose
counters in PERF_GLOBAL_CTRL when initializing the virtual PMU. Both Intel
and AMD architectures require the bits to be set at RESET in order for v2
PMUs to be backwards compatible with software that was written for v1 PMUs,
i.e. for software that will never manually set the global enables.
- Disable LBR virtualization on CPUs that don't support LBR callstacks, as
KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the
virtual LBR perf event, i.e. KVM will always fail to create LBR events on
such CPUs.
- Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that
results in an array overflow (detected by KASAN).
- Fix a flaw in the max_guest_memory selftest that results in it exhausting
the supply of ucall structures when run with more than 256 vCPUs.
- Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test.
- Fix a bug where KVM incorrectly thinks a TDP MMU root is an indirect shadow
root due KVM unnecessarily clobbering root_role.direct when userspace sets
guest CPUID.
- Fix a dirty logging bug in the where KVM fails to write-protect TDP MMU
SPTEs used for L2 if Page-Modification Logging is enabled for L1 and the L1
hypervisor is NOT using EPT (if nEPT is enabled, KVM doesn't use the TDP MMU
to run L2). For simplicity, KVM always disables PML when running L2, but
the TDP MMU wasn't accounting for root-specific conditions that force write-
protect based dirty logging.
|
|
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
TDX uses different ABI to get information about VM exit. Pass intr_info to
the NMI and INTR handlers instead of pulling it from vcpu_vmx in
preparation for sharing the bulk of the handlers with TDX.
When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
exit qualification etc rather than the VMCS fields because VMM doesn't have
access to the VMCS. The eventual code will be
VMX:
- get exit reason, intr_info, exit_qualification, and etc from VMCS
- call NMI/INTR handlers (common code)
TDX:
- get exit reason, intr_info, exit_qualification, and etc from guest
registers
- call NMI/INTR handlers (common code)
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on VM. TDX doesn't allow VMM to operate VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to
indirectly operate those data structures. This means we must have a TDX
version of kvm_x86_ops.
The existing global struct kvm_x86_ops already defines an interface which
can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM
structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks
will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or
TDX at run time.
To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX. Use 'vt' for the naming scheme as a nod to
VT-x and as a concatenation of VmxTdx.
The eventually converted code will look like this:
vmx.c:
vmx_op() { ... }
VMX initialization
tdx.c:
tdx_op() { ... }
TDX initialization
x86_ops.h:
vmx_op();
tdx_op();
main.c:
static vt_op() { if (tdx) tdx_op() else vmx_op() }
static struct kvm_x86_ops vt_x86_ops = {
.op = vt_op,
initialization functions call both VMX and TDX initialization
Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <5134caa55ac3dec33fb2addb5545b52b3b52db02.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. The first source of variability
that was encountered is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
This series adds all the APIs that are needed to customize the features,
with room for future enhancements:
- a new /dev/kvm device attribute to retrieve the set of supported
features (right now, only debug swap)
- a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct,
replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT
It then puts the new op to work by including the VMSA features as a field
of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility; but I am considering
also making them use zero as the feature mask, and will gladly adjust the
patches if so requested.
In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that
I could as well make SEV and SEV-ES use VM types. This allows SEV-SNP
to reuse the KVM_SEV_INIT2 ioctl.
And while at it, KVM_SEV_INIT2 also includes two bugfixes. First of all,
SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT,
reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor
once the VMSA has been encrypted... which is how the API should have
always behaved. Second, they also synchronize the FPU and AVX state.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Drop the "If AD bits are enabled/disabled" verbiage from the comments
above kvm_tdp_mmu_clear_dirty_{slot,pt_masked}() since TDP MMU SPTEs may
need to be write-protected even when A/D bits are enabled. i.e. These
comments aren't technically correct.
No functional change intended.
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Drop the comments above clear_dirty_gfn_range() and
clear_dirty_pt_masked(), since each is word-for-word identical to the
comment above their parent function.
Leave the comment on the parent functions since they are APIs called by
the KVM/x86 MMU.
No functional change intended.
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Check kvm_mmu_page_ad_need_write_protect() when deciding whether to
write-protect or clear D-bits on TDP MMU SPTEs, so that the TDP MMU
accounts for any role-specific reasons for disabling D-bit dirty logging.
Specifically, TDP MMU SPTEs must be write-protected when the TDP MMU is
being used to run an L2 (i.e. L1 has disabled EPT) and PML is enabled.
KVM always disables PML when running L2, even when L1 and L2 GPAs are in
the some domain, so failing to write-protect TDP MMU SPTEs will cause
writes made by L2 to not be reflected in the dirty log.
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821
Fixes: 5982a5392663 ("KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot")
Cc: [email protected]
Cc: Vipin Sharma <[email protected]>
Cc: Sean Christopherson <[email protected]>
Signed-off-by: David Matlack <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage shortlog and changelog, tweak ternary op formatting]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Set kvm_mmu_page_role.invalid to mark the various MMU root_roles invalid
during CPUID update in order to force a refresh, instead of zeroing out
the entire role. This fixes a bug where kvm_mmu_free_roots() incorrectly
thinks a root is indirect, i.e. not a TDP MMU, due to "direct" being
zeroed, which in turn causes KVM to take mmu_lock for write instead of
read.
Note, paving over the entire role was largely unintentional, commit
7a458f0e1ba1 ("KVM: x86/mmu: remove extended bits from mmu_role, rename
field") simply missed that "invalid" could be set.
Fixes: 576a15de8d29 ("KVM: x86/mmu: Free TDP MMU roots while holding mmy_lock for read")
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Cc: Phi Nguyen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Disable LBR virtualization if the CPU doesn't support callstacks, which
were introduced in HSW (see commit e9d7f7cd97c4 ("perf/x86/intel: Add
basic Haswell LBR call stack support"), as KVM unconditionally configures
the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR
virtualization always fails on pre-HSW CPUs.
Simply disable LBR support on such CPUs, as it has never worked, i.e.
there is no risk of breaking an existing setup, and figuring out a way
to performantly context switch LBRs on old CPUs is not worth the effort.
Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Cc: Mingwei Zhang <[email protected]>
Cc: Jim Mattson <[email protected]>
Tested-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Snapshot VMX's LBR capabilities once during module initialization instead
of calling into perf every time a vCPU reconfigures its vPMU. This will
allow massaging the LBR capabilities, e.g. if the CPU doesn't support
callstacks, without having to remember to update multiple locations.
Opportunistically tag vmx_get_perf_capabilities() with __init, as it's
only called from vmx_set_cpu_caps().
Reviewed-by: Mingwei Zhang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <[email protected]>
Acked-by: Anup Patel <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Reviewed-by: Bibo Mao <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The DebugSwap feature of SEV-ES provides a way for confidential guests
to use data breakpoints. Its status is record in VMSA, and therefore
attestation signatures depend on whether it is enabled or not. In order
to avoid invalidating the signatures depending on the host machine, it
was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES
DebugSwap by default", 2024-03-09).
However, we now have a new API to create SEV VMs that allows enabling
DebugSwap based on what the user tells KVM to do, and we also changed the
legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore
possible to re-enable the feature without breaking compatibility with
kernels that pre-date the introduction of DebugSwap, so go ahead.
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The idea that no parameter would ever be necessary when enabling SEV or
SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's
already a parameter whether SEV or SEV-ES is desired. Another possible
source of variability is the desired set of VMSA features, as that affects
the measurement of the VM's initial state and cannot be changed
arbitrarily by the hypervisor.
Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct,
and put the new op to work by including the VMSA features as a field of the
struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of
supported VMSA features for backwards compatibility.
The struct also includes the usual bells and whistles for future
extensibility: a flags field that must be zero for now, and some padding
at the end.
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA.
Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark
FPU contents as confidential after it has been copied and encrypted into
the VMSA.
Since the XSAVE state for AVX is the first, it does not need the
compacted-state handling of get_xsave_addr(). However, there are other
parts of XSAVE state in the VMSA that currently are not handled, and
the validation logic of get_xsave_addr() is pointless to duplicate
in KVM, so move get_xsave_addr() to public FPU API; it is really just
a facility to operate on XSAVE state and does not expose any internal
details of arch/x86/kernel/fpu.
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES),
and also allows the vendor module to specify which VM types are supported.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some VM types have characteristics in common; in fact, the only use
of VM types right now is kvm_arch_has_private_mem and it assumes that
_all_ nonzero VM types have private memory.
We will soon introduce a VM type for SEV and SEV-ES VMs, and at that
point we will have two special characteristics of confidential VMs
that depend on the VM type: not just if memory is private, but
also whether guest state is protected. For the latter we have
kvm->arch.guest_state_protected, which is only set on a fully initialized
VM.
For VM types with protected guest state, we can actually fix a problem in
the SEV-ES implementation, where ioctls to set registers do not cause an
error even if the VM has been initialized and the guest state encrypted.
Make sure that when using VM types that will become an error.
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Right now, the set of features that are stored in the VMSA upon
initialization is fixed and depends on the module parameters for
kvm-amd.ko. However, the hypervisor cannot really change it at will
because the feature word has to match between the hypervisor and whatever
computes a measurement of the VMSA for attestation purposes.
Add a field to kvm_sev_info that holds the set of features to be stored
in the VMSA; and query it instead of referring to the module parameters.
Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this
does not yet introduce any functional change, but it paves the way for
an API that allows customization of the features per-VM.
Signed-off-by: Paolo Bonzini <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Compute the set of features to be stored in the VMSA when KVM is
initialized; move it from there into kvm_sev_info when SEV is initialized,
and then into the initial VMSA.
The new variable can then be used to return the set of supported features
to userspace, via the KVM_GET_DEVICE_ATTR ioctl.
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Allow vendor modules to provide their own attributes on /dev/kvm.
To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR
and KVM_GET_DEVICE_ATTR in terms of the same function. You're not
supposed to use KVM_GET_DEVICE_ATTR to do complicated computations,
especially on /dev/kvm.
Reviewed-by: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Isaku Yamahata <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
There is no danger to the kernel if 32-bit userspace provides a 64-bit
value that has the high bits set, but for whatever reason happens to
resolve to an address that has something mapped there. KVM uses the
checked version of get_user() and put_user(), so any faults are caught
properly.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Stop compiling sev.c when CONFIG_KVM_AMD_SEV=n, as the number of #ifdefs
in sev.c is getting ridiculous, and having #ifdefs inside of SEV helpers
is quite confusing.
To minimize #ifdefs in code flows, #ifdef away only the kvm_x86_ops hooks
and the #VMGEXIT handler. Stubs are also restricted to functions that
check sev_enabled and to the destruction functions sev_free_cpu() and
sev_vm_destroy(), where the style of their callers is to leave checks
to the callers. Most call sites instead rely on dead code elimination
to take care of functions that are guarded with sev_guest() or
sev_es_guest().
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Leave SEV and SEV_ES '0' in kvm_cpu_caps by default, and instead set them
in sev_set_cpu_caps() if SEV and SEV-ES support are fully enabled. Aside
from the fact that sev_set_cpu_caps() is wildly misleading when it *clears*
capabilities, this will allow compiling out sev.c without falsely
advertising SEV/SEV-ES support in KVM_GET_SUPPORTED_CPUID.
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Michael Roth <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
On AMD and Hygon platforms, the local APIC does not automatically set
the mask bit of the LVTPC register when handling a PMI and there is
no need to clear it in the kernel's PMI handler.
For guests, the mask bit is currently set by kvm_apic_local_deliver()
and unless it is cleared by the guest kernel's PMI handler, PMIs stop
arriving and break use-cases like sampling with perf record.
This does not affect non-PerfMonV2 guests because PMIs are handled in
the guest kernel by x86_pmu_handle_irq() which always clears the LVTPC
mask bit irrespective of the vendor.
Before:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.001 MB perf.data (1 samples) ]
After:
$ perf record -e cycles:u true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.002 MB perf.data (19 samples) ]
Fixes: a16eb25b09c0 ("KVM: x86: Mask LVTPC when handling a PMI")
Cc: [email protected]
Signed-off-by: Sandipan Das <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
[sean: use is_intel_compatible instead of !is_amd_or_hygon()]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add kvm_vcpu_arch.is_amd_compatible to cache if a vCPU's vendor model is
compatible with AMD, i.e. if the vCPU vendor is AMD or Hygon, along with
helpers to check if a vCPU is compatible AMD vs. Intel. To handle Intel
vs. AMD behavior related to masking the LVTPC entry, KVM will need to
check for vendor compatibility on every PMI injection, i.e. querying for
AMD will soon be a moderately hot path.
Note! This subtly (or maybe not-so-subtly) makes "Intel compatible" KVM's
default behavior, both if userspace omits (or never sets) CPUID 0x0 and if
userspace sets a completely unknown vendor. One could argue that KVM
should treat such vCPUs as not being compatible with Intel *or* AMD, but
that would add useless complexity to KVM.
KVM needs to do *something* in the face of vendor specific behavior, and
so unless KVM conjured up a magic third option, choosing to treat unknown
vendors as neither Intel nor AMD means that checks on AMD compatibility
would yield Intel behavior, and checks for Intel compatibility would yield
AMD behavior. And that's far worse as it would effectively yield random
behavior depending on whether KVM checked for AMD vs. Intel vs. !AMD vs.
!Intel. And practically speaking, all x86 CPUs follow either Intel or AMD
architecture, i.e. "supporting" an unknown third architecture adds no
value.
Deliberately don't convert any of the existing guest_cpuid_is_intel()
checks, as the Intel side of things is messier due to some flows explicitly
checking for exactly vendor==Intel, versus some flows assuming anything
that isn't "AMD compatible" gets Intel behavior. The Intel code will be
cleaned up in the future.
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
commit 37b2a6510a48("KVM: use __vcalloc for very large allocations")
replaced kvzalloc()/kvcalloc() with vcalloc(), but didn't replace kvfree()
with vfree().
Signed-off-by: Li RongQing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use the GuestPhysBits field in CPUID.0x80000008 to communicate the max
mappable GPA to userspace, i.e. the max GPA that is addressable by the
CPU itself. Typically this is identical to the max effective GPA, except
in the case where the CPU supports MAXPHYADDR > 48 but does not support
5-level TDP (the CPU consults bits 51:48 of the GPA only when walking the
fifth level TDP page table entry).
Enumerating the max mappable GPA via CPUID will allow guest firmware to
map resources like PCI bars in the highest possible address space, while
ensuring that the GPA is addressable by the CPU. Without precise
knowledge about the max mappable GPA, the guest must assume that 5-level
paging is unsupported and thus restrict its mappings to the lower 48 bits.
Advertise the max mappable GPA via KVM_GET_SUPPORTED_CPUID as userspace
doesn't have easy access to whether or not 5-level paging is supported,
and to play nice with userspace VMMs that reflect the supported CPUID
directly into the guest.
AMD's APM (3.35) defines GuestPhysBits (EAX[23:16]) as:
Maximum guest physical address size in bits. This number applies
only to guests using nested paging. When this field is zero, refer
to the PhysAddrSize field for the maximum guest physical address size.
Tom Lendacky confirmed that the purpose of GuestPhysBits is software use
and KVM can use it as described above. Real hardware always returns zero.
Leave GuestPhysBits as '0' when TDP is disabled in order to comply with
the APM's statement that GuestPhysBits "applies only to guest using nested
paging". As above, guest firmware will likely create suboptimal mappings,
but that is a very minor issue and not a functional concern.
Signed-off-by: Gerd Hoffmann <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Drop KVM's propagation of GuestPhysBits (CPUID leaf 80000008, EAX[23:16])
to HostPhysBits (same leaf, EAX[7:0]) when advertising the address widths
to userspace via KVM_GET_SUPPORTED_CPUID.
Per AMD, GuestPhysBits is intended for software use, and physical CPUs do
not set that field. I.e. GuestPhysBits will be non-zero if and only if
KVM is running as a nested hypervisor, and in that case, GuestPhysBits is
NOT guaranteed to capture the CPU's effective MAXPHYADDR when running with
TDP enabled.
E.g. KVM will soon use GuestPhysBits to communicate the CPU's maximum
*addressable* guest physical address, which would result in KVM under-
reporting PhysBits when running as an L1 on a CPU with MAXPHYADDR=52,
but without 5-level paging.
Signed-off-by: Gerd Hoffmann <[email protected]>
Cc: [email protected]
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: rewrite changelog with --verbose, Cc stable@]
Signed-off-by: Sean Christopherson <[email protected]>
|