Age | Commit message (Collapse) | Author | Files | Lines |
|
The BTS feature (including the ability to set the BTS and BTINT
bits in the DEBUGCTL MSR) is currently unsupported on KVM.
But we may try using the BTS facility on a PEBS enabled guest like this:
perf record -e branches:u -c 1 -d ls
and then we would encounter the following call trace:
[] unchecked MSR access error: WRMSR to 0x1d9 (tried to write 0x00000000000003c0)
at rIP: 0xffffffff810745e4 (native_write_msr+0x4/0x20)
[] Call Trace:
[] intel_pmu_enable_bts+0x5d/0x70
[] bts_event_add+0x54/0x70
[] event_sched_in+0xee/0x290
As it lacks any CPUID indicator or perf_capabilities valid bit
fields to prompt for this information, the platform would hint
the Intel BTS feature unavailable to guest by setting the
BTS_UNAVAIL bit in the IA32_MISC_ENABLE.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Pull KVM fixes from Paolo Bonzini:
- syzkaller NULL pointer dereference
- TDP MMU performance issue with disabling dirty logging
- 5.14 regression with SVM TSC scaling
- indefinite stall on applying live patches
- unstable selftest
- memory leak from wrong copy-and-paste
- missed PV TLB flush when racing with emulation
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: do not report a vCPU as preempted outside instruction boundaries
KVM: x86: do not set st->preempted when going back to user space
KVM: SVM: fix tsc scaling cache logic
KVM: selftests: Make hyperv_clock selftest more stable
KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging
x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()
KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()
entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set
KVM: Don't null dereference ops->destroy
|
|
There are cases that malicious virtual machines can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs.
VMM can enable notify VM exit that a VM exit generated if no event
window occurs in VM non-root mode for a specified amount of time (notify
window).
Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
can query and enable this feature in per-VM scope. The argument is a
64bit value: bits 63:32 are used for notify window, and bits 31:0 are
for flags. Current supported flags:
- KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
window provided.
- KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
threshold is added to vmcs.notify_window.
VM exit handling:
- Introduce a vcpu state notify_window_exits to records the count of
notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
bit is set.
Nested handling
- Nested notify VM exits are not supported yet. Keep the same notify
window control in vmcs02 as vmcs01, so that L1 can't escape the
restriction of notify VM exits through launching L2 VM.
Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.
Co-developed-by: Xiaoyao Li <[email protected]>
Signed-off-by: Xiaoyao Li <[email protected]>
Signed-off-by: Tao Xu <[email protected]>
Co-developed-by: Chenyi Qiang <[email protected]>
Signed-off-by: Chenyi Qiang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add kvm_caps to hold a variety of capabilites and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature. The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW,
which means that the table that maps perf_hw_id to event select values is
no longer useful, at least for AMD.
For Intel, the logic to check if the pmu event reported by Intel cpuid is
not available is still required, in which case pmc_perf_hw_id() could be
renamed to hw_event_is_unavail() and a bool value is returned to replace
the semantics of "PERF_COUNT_HW_MAX+1".
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Since reprogram_counter(), reprogram_{gp, fixed}_counter() currently have
the same incoming parameter "struct kvm_pmc *pmc", the callers can simplify
the conetxt by using uniformly exported interface, which makes reprogram_
{gp, fixed}_counter() static and eliminates EXPORT_SYMBOL_GPL.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Since afrer reprogram_fixed_counter() is called, it's bound to assign
the requested fixed_ctr_ctrl to pmu->fixed_ctr_ctrl, this assignment step
can be moved forward (the stale value for diff is saved extra early),
thus simplifying the passing of parameters.
No functional change intended.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Because inside reprogram_gp_counter() it is bound to assign the requested
eventel to pmc->eventsel, this assignment step can be moved forward, thus
simplifying the passing of parameters to "struct kvm_pmc *pmc" only.
No functional change intended.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Passing the reference "struct kvm_pmc *pmc" when creating
pmc->perf_event is sufficient. This change helps to simplify the
calling convention by replacing reprogram_{gp, fixed}_counter()
with reprogram_counter() seamlessly.
No functional change intended.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always
retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept
the PMU MSRs unconditionally in intel_is_valid_msr, if the access was
host-initiated.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, as is the case
for MSR_IA32_DS_AREA, it has to be always settable with KVM_SET_MSR.
Accept a zero value for these MSRs to obey the contract.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
MSR_CORE_PERF_GLOBAL_CTRL is introduced as part of Architecture PMU V2,
as indicated by Intel SDM 19.2.2 and the intel_is_valid_msr() function.
So in the absence of global_ctrl support, all PMCs are enabled as AMD does.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Assigning a value to pmu->global_ctrl just to set the value of
pmu->global_ctrl_mask is more readable but does not conform to the
specification. The value is reset to zero on Power up and Reset but
stays unchanged on INIT, like most other MSRs.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The CPUID features PDCM, DS and DTES64 are required for PEBS feature.
KVM would expose CPUID feature PDCM, DS and DTES64 to guest when PEBS
is supported in the KVM on the Ice Lake server platforms.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
For the same purpose, the leagcy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and remove the comment about LBR
use case can be deleted by the way.
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The information obtained from the interface perf_get_x86_pmu_capability()
doesn't change, so an exported "struct x86_pmu_capability" is introduced
for all guests in the KVM, and it's initialized before hardware_setup().
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The guest PEBS will be disabled when some users try to perf KVM and
its user-space through the same PEBS facility OR when the host perf
doesn't schedule the guest PEBS counter in a one-to-one mapping manner
(neither of these are typical scenarios).
The PEBS records in the guest DS buffer are still accurate and the
above two restrictions will be checked before each vm-entry only if
guest PEBS is deemed to be enabled.
Suggested-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The bit 12 represents "Processor Event Based Sampling Unavailable (RO)" :
1 = PEBS is not supported.
0 = PEBS is supported.
A write to this PEBS_UNAVL available bit will bring #GP(0) when guest PEBS
is enabled. Some PEBS drivers in guest may care about this bit.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.
Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG.By default, the record only contain the Basic group.
When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
be added to the perf_guest_switch_msr() and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.
According to Intel SDM, software is recommended to PEBS Baseline
when the following is true. IA32_PERF_CAPABILITIES.PEBS_BASELINE[14]
&& IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4.
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
to the linear address of the first byte of the DS buffer management area,
which is used to manage the PEBS records.
When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
perf_guest_switch_msr() and switched during the VMX transitions just like
CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0)
if the source register contains a non-canonical address.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and atomically switched during
the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
Based on whether the platform supports x86_pmu.pebs_ept, it has also
refactored the way to add more msrs to arr[] in intel_guest_get_msrs()
for extensibility.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The mask value of fixed counter control register should be dynamic
adjusted with the number of fixed counters. This patch introduces a
variable that includes the reserved bits of fixed counter control
registers. This is a generic code refactoring.
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
detect whether the processor supports performance monitoring facility.
It depends on the PMU is enabled for the guest, and a software write
operation to this available bit will be ignored. The proposal to ignore
the toggle in KVM is the way to go and that behavior matches bare metal.
Signed-off-by: Like Xu <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.
If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in target vCPU's PIR and may send a
notification (IPI) specified by NDST and NV fields in target vCPU's
Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
engine does when dealing with posted interrupt from devices.
A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
ID assigned for current VM session from userspace. Allocating memory
for PID-pointer table is deferred to vCPU creation, because irqchip
mode and VM-scope maximum APIC ID is settled at that point. KVM can
skip PID-pointer table allocation if !irqchip_in_kernel().
Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.
Note that IPI virtualization can only virualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
trap-like APIC-write VM-exit and need to be handled by VMM.
Signed-off-by: Chao Gao <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Remove the condition check cpu_has_secondary_exec_ctrls(). Calling
vmx_refresh_apicv_exec_ctrl() premises secondary controls activated
and VMCS fields related to APICv valid as well. If it's invoked in
wrong circumstance at the worst case, VMX operation will report
VMfailValid error without further harmful impact and just functions
as if all the secondary controls were 0.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add tertiary_exec_control field report in dump_vmcs(). Meanwhile,
reorganize the dump output of VMCS category as follows.
Before change:
*** Control State ***
PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb
EntryControls=0000d1ff ExitControls=002befff
After change:
*** Control State ***
CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010
PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Check VMX features on tertiary execution control in VMCS config setup.
Sub-features in tertiary execution control to be enabled are adjusted
according to hardware capabilities although no sub-feature is enabled
in this patch.
EVMCSv1 doesn't support tertiary VM-execution control, so disable it
when EVMCSv1 is in use. And define the auxiliary functions for Tertiary
control field here, using the new BUILD_CONTROLS_SHADOW().
Reviewed-by: Maxim Levitsky <[email protected]>
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The Tertiary VM-Exec Control, different from previous control fields, is 64
bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to
support both 32 bit and 64 bit fields' auxiliary functions building.
Suggested-by: Sean Christopherson <[email protected]>
Reviewed-by: Maxim Levitsky <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Robert Hoo <[email protected]>
Signed-off-by: Zeng Guang <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
"IRQs", i.e. interrupts that are reinjected after incomplete delivery of
a software interrupt from an INTn instruction. Tag reinjected interrupts
as such, even though the information is usually redundant since soft
interrupts are only ever reinjected by KVM. Though rare in practice, a
hard IRQ can be reinjected.
Signed-off-by: Sean Christopherson <[email protected]>
[MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access. A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.
To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary. A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.
It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed. However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses. Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.
Reported-by: Jann Horn <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Thomas Gleixner:
"A set of small x86 cleanups:
- Remove unused headers in the IDT code
- Kconfig indendation and comment fixes
- Fix all 'the the' typos in one go instead of waiting for bots to
fix one at a time"
* tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix all occurences of the "the the" typo
x86/idt: Remove unused headers
x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug
x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
|
|
Rather than waiting for the bots to fix these one-by-one,
fix all occurences of "the the" throughout arch/x86.
Signed-off-by: Bo Liu <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Change the printf format character from 'd' to 'u' for the
VM-instruction error in vmwrite_error().
Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface")
Reported-by: Sean Christopherson <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Include the value of the "VM-instruction error" field from the current
VMCS (if any) in the error message for VMCLEAR and VMPTRLD, since each
of these instructions may result in more than one VM-instruction
error. Previously, this field was only reported for VMWRITE errors.
Signed-off-by: David Matlack <[email protected]>
[Rebased and refactored code; dropped the error number for INVVPID and
INVEPT; reworded commit message.]
Signed-off-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When kernel handles the vm-exit caused by external interrupts and NMI,
it always sets kvm_intr_type to tell if it's dealing an IRQ or NMI. For
the PMI scenario, it could be IRQ or NMI.
However, intel_pt PMIs are only generated for HARDWARE perf events, and
HARDWARE events are always configured to generate NMIs. Use
kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI
came from the guest; this avoids false positives if an intel_pt PMI/NMI
arrives while the host is handling an unrelated IRQ VM-Exit.
Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI")
Signed-off-by: Yanfei Xu <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM/riscv changes for 5.19
- Added Sv57x4 support for G-stage page table
- Added range based local HFENCE functions
- Added remote HFENCE functions based on VCPU requests
- Added ISA extension registers in ONE_REG interface
- Updated KVM RISC-V maintainers entry to cover selftests support
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 5.19
- Add support for the ARMv8.6 WFxT extension
- Guard pages for the EL2 stacks
- Trap and emulate AArch32 ID registers to hide unsupported features
- Ability to select and save/restore the set of hypercalls exposed
to the guest
- Support for PSCI-initiated suspend in collaboration with userspace
- GICv3 register-based LPI invalidation support
- Move host PMU event merging into the vcpu data structure
- GICv3 ITS save/restore fixes
- The usual set of small-scale cleanups and fixes
[Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated
from 4 to 6. - Paolo]
|
|
The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an
accurate indicator on all CPUs of whether the VERW instruction will
overwrite fill buffers. FB_CLEAR enumeration in
IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not
vulnerable to MDS/TAA, indicating that microcode does overwrite fill
buffers.
Guests running in VMM environments may not be aware of all the
capabilities/vulnerabilities of the host CPU. Specifically, a guest may
apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable
to MDS/TAA even when the physical CPU is not. On CPUs that enumerate
FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill
buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS
during VMENTER and resetting on VMEXIT. For guests that enumerate
FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM
will not use FB_CLEAR_DIS.
Irrespective of guest state, host overwrites CPU buffers before VMENTER
to protect itself from an MMIO capable guest, as part of mitigation for
MMIO Stale Data vulnerabilities.
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Processor MMIO Stale Data is a class of vulnerabilities that may
expose data after an MMIO operation. For details please refer to
Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst.
These vulnerabilities are broadly categorized as:
Device Register Partial Write (DRPW):
Some endpoint MMIO registers incorrectly handle writes that are
smaller than the register size. Instead of aborting the write or only
copying the correct subset of bytes (for example, 2 bytes for a 2-byte
write), more bytes than specified by the write transaction may be
written to the register. On some processors, this may expose stale
data from the fill buffers of the core that created the write
transaction.
Shared Buffers Data Sampling (SBDS):
After propagators may have moved data around the uncore and copied
stale data into client core fill buffers, processors affected by MFBDS
can leak data from the fill buffer.
Shared Buffers Data Read (SBDR):
It is similar to Shared Buffer Data Sampling (SBDS) except that the
data is directly read into the architectural software-visible state.
An attacker can use these vulnerabilities to extract data from CPU fill
buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill
buffers using the VERW instruction before returning to a user or a
guest.
On CPUs not affected by MDS and TAA, user application cannot sample data
from CPU fill buffers using MDS or TAA. A guest with MMIO access can
still use DRPW or SBDR to extract data architecturally. Mitigate it with
VERW instruction to clear fill buffers before VMENTER for MMIO capable
guests.
Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control
the mitigation.
Signed-off-by: Pawan Gupta <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits to include them to shadow_zero_check.
Signed-off-by: Kai Huang <[email protected]>
Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Passing per_cpu() to list_for_each_entry() causes the macro to be
evaluated N+1 times for N sleeping vCPUs. This is a very small
inefficiency, and the code is cleaner if the address of the per-CPU
variable is loaded earlier. Do this for both the list and the spinlock.
Signed-off-by: Li RongQing <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Exit to userspace with an emulation error if KVM encounters an injected
exception with invalid guest state, in addition to the existing check of
bailing if there's a pending exception (KVM doesn't support emulating
exceptions except when emulating real mode via vm86).
In theory, KVM should never get to such a situation as KVM is supposed to
exit to userspace before injecting an exception with invalid guest state.
But in practice, userspace can intervene and manually inject an exception
and/or stuff registers to force invalid guest state while a previously
injected exception is awaiting reinjection.
Fixes: fc4fad79fc3d ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
Reported-by: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The helper function, vcpu_to_pi_desc(), is defined to get the posted
interrupt descriptor from vcpu. There is one place that doesn't use
it, and instead references vmx_vcpu->pi_desc directly. Remove the
inconsistency.
Signed-off-by: Yuan Yao <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
Message-Id: <ee7be7832bc424546fd4f05015a844a0205b5ba2.1646422845.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
root_role.level is always the same value as shadow_level:
- it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
- it's the level argument when going through kvm_init_shadow_ept_mmu
- it's assigned directly from new_role.base.level when going
through shadow_mmu_init_context
Remove the duplication and get the level directly from the role.
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
with an explicit, common workaround in kvm_inject_emulated_page_fault().
Aside from being a hack, the current approach is brittle and incomplete,
e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
and nVMX fails to apply the workaround when VMX is intercepting #PF due
to allow_smaller_maxphyaddr=1.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
NMI-watchdog is one of the favorite features of kernel developers,
but it does not work in AMD guest even with vPMU enabled and worse,
the system misrepresents this capability via /proc.
This is a PMC emulation error. KVM does not pass the latest valid
value to perf_event in time when guest NMI-watchdog is running, thus
the perf_event corresponding to the watchdog counter will enter the
old state at some point after the first guest NMI injection, forcing
the hardware register PMC0 to be constantly written to 0x800000000001.
Meanwhile, the running counter should accurately reflect its new value
based on the latest coordinated pmc->counter (from vPMC's point of view)
rather than the value written directly by the guest.
Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr")
Reported-by: Dongli Cao <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Reviewed-by: Yanan Wang <[email protected]>
Tested-by: Yanan Wang <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
becomes unhibited while L2 is active, KVM will set various APICv controls
in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless
running with nested_early_check=1, KVM blames L1 and chaos ensues.
In all cases, ignoring vmcs02 and always deferring the inhibition change
to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE
inhibitions cannot truly change while L2 is active (see below).
IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
Furthermore, only L2's APIC is accelerated/virtualized to the full extent
possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
interception will apply to the virtual APIC managed by KVM.
The exception is the SELF_IPI register when x2APIC is enabled, but that's
an acceptable hole.
Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
MSRs to L2, but for that to work in any sane capacity, L1 would need to
pass through IRQs to L2 as well, and IRQs must be intercepted to enable
virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling
VID for L2 are, for all intents and purposes, mutually exclusive.
Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form. But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs. If userspaces restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
VM creation.
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double
or triple fault. Per the SDM, a VM-Exit isn't considered to occur during
event delivery if the exit is due to an intercepted double fault or a
triple fault. Opportunistically move the default clearing (no event
"pending") into the helper so that it's more obvious that KVM does indeed
handle this case.
Note, the double fault case is worded rather wierdly in the SDM:
The original event results in a double-fault exception that causes the
VM exit directly.
Temporarily ignoring injected events, double faults can _only_ occur if
an exception occurs while attempting to deliver a different exception,
i.e. there's _always_ an original event. And for injected double fault,
while there's no original event, injected events are never subject to
interception.
Presumably the SDM is calling out that a the vectoring info will be valid
if a different exit occurs after a double fault, e.g. if a #PF occurs and
is intercepted while vectoring #DF, then the vectoring info will show the
double fault. In other words, the clause can simply be read as:
The VM exit is caused by a double-fault exception.
Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1")
Cc: Chenyi Qiang <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|