path: root/arch/x86/kvm/vmx
Age | Commit message | Author | Files, lines changed
2022-06-08 | KVM: x86/pmu: Avoid exposing Intel BTS feature | Like Xu | 1 file, -1/+2
The BTS feature (including the ability to set the BTS and BTINT bits in the DEBUGCTL MSR) is currently unsupported on KVM. But the BTS facility may still be exercised on a PEBS-enabled guest, for example:
  perf record -e branches:u -c 1 -d ls
which then triggers the following call trace:
  [] unchecked MSR access error: WRMSR to 0x1d9 (tried to write 0x00000000000003c0) at rIP: 0xffffffff810745e4 (native_write_msr+0x4/0x20)
  [] Call Trace:
  [] intel_pmu_enable_bts+0x5d/0x70
  [] bts_event_add+0x54/0x70
  [] event_sched_in+0xee/0x290
Since there is no CPUID indicator or perf_capabilities valid bit field to convey this information, the platform hints to the guest that the Intel BTS feature is unavailable by setting the BTS_UNAVAIL bit in IA32_MISC_ENABLE. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
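A minimal sketch of the kind of masking described above, assuming the MSR_IA32_MISC_ENABLE_BTS_UNAVAIL / MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL bit definitions from asm/msr-index.h; the helper name and call site are illustrative, not the actual KVM code.

#include <linux/types.h>
#include <asm/msr-index.h>

/* Illustrative helper: advertise BTS (and, symmetrically, PEBS) as
 * unavailable in the guest's IA32_MISC_ENABLE when the vPMU does not
 * support the corresponding facility. */
static u64 example_adjust_misc_enable(u64 misc_enable, bool bts_supported,
				      bool pebs_supported)
{
	if (!bts_supported)
		misc_enable |= MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;	/* bit 11 */
	if (!pebs_supported)
		misc_enable |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;	/* bit 12 */

	return misc_enable;
}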
2022-06-08 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 file, -0/+1
Pull KVM fixes from Paolo Bonzini: - syzkaller NULL pointer dereference - TDP MMU performance issue with disabling dirty logging - 5.14 regression with SVM TSC scaling - indefinite stall on applying live patches - unstable selftest - memory leak from wrong copy-and-paste - missed PV TLB flush when racing with emulation * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: do not report a vCPU as preempted outside instruction boundaries KVM: x86: do not set st->preempted when going back to user space KVM: SVM: fix tsc scaling cache logic KVM: selftests: Make hyperv_clock selftest more stable KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set KVM: Don't null dereference ops->destroy
2022-06-08 | KVM: VMX: Enable Notify VM exit | Tao Xu | 3 files, -2/+52
There are cases where a malicious virtual machine can cause the CPU to get stuck (because event windows never open up), e.g. an infinite loop in microcode when a nested #AC is raised (CVE-2015-5307). No event window means no event (NMI, SMI or IRQ) can be delivered, which leaves the CPU unavailable to the host or to other VMs. The VMM can enable notify VM exit, whereby a VM exit is generated if no event window occurs in VM non-root mode for a specified amount of time (the notify window). Feature enabling: - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to enable this feature. The VMM can set the NOTIFY_WINDOW vmcs field to adjust the expected notify window. - Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space can query and enable this feature in per-VM scope. The argument is a 64-bit value: bits 63:32 are used for the notify window, and bits 31:0 are for flags. Currently supported flags: - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify window provided. - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once such an exit happens. - It is safe to set the notify window even to zero, since an internal hardware threshold is added to vmcs.notify_window. VM exit handling: - Introduce a vcpu stat notify_window_exits to record the count of notify VM exits and expose it through debugfs. - A notify VM exit can happen incident to delivery of a vector event; allow it in KVM. - Exit to userspace unconditionally for handling when the VM_CONTEXT_INVALID bit is set. Nested handling: - Nested notify VM exits are not supported yet. Keep the same notify window control in vmcs02 as in vmcs01, so that L1 can't escape the restriction of notify VM exits by launching an L2 VM. Notify VM exit is defined in the latest Intel Architecture Instruction Set Extensions Programming Reference, chapter 9.2. Co-developed-by: Xiaoyao Li <[email protected]> Signed-off-by: Xiaoyao Li <[email protected]> Signed-off-by: Tao Xu <[email protected]> Co-developed-by: Chenyi Qiang <[email protected]> Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
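A rough userspace sketch of the per-VM enablement described above, assuming the KVM_CAP_X86_NOTIFY_VMEXIT capability and the flag names from this series are available in <linux/kvm.h>; the 0x20000-cycle window is an arbitrary example value.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: enable notify VM exits on a VM fd, with the notify window in
 * bits 63:32 of args[0] and the flags in bits 31:0, and ask KVM to exit
 * to userspace when such an exit fires. */
static int enable_notify_vmexit(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_NOTIFY_VMEXIT,
		.args[0] = ((__u64)0x20000 << 32) |
			   KVM_X86_NOTIFY_VMEXIT_ENABLED |
			   KVM_X86_NOTIFY_VMEXIT_USER,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}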
2022-06-08 | KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings | Sean Christopherson | 2 files, -13/+13
Add kvm_caps to hold a variety of capabilities and defaults that aren't handled by kvm_cpu_caps because they aren't CPUID bits, in order to reduce the amount of boilerplate code required to add a new feature. The vast majority (all?) of the caps interact with vendor code and are written only during initialization, i.e. should be tagged __read_mostly, declared extern in x86.h, and exported. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
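An illustrative sketch of the pattern being introduced (the field names below are examples, not necessarily the exact members of the real struct kvm_caps):

/* Illustrative only: one __read_mostly struct that vendor code fills in
 * during hardware setup, instead of a pile of individually exported
 * variables. */
struct kvm_caps {
	bool has_tsc_control;
	u64 supported_mce_cap;
	u64 supported_xcr0;
	u64 supported_perf_cap;
};

/* Declared extern in x86.h, defined and exported once in x86.c: */
extern struct kvm_caps kvm_caps;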
2022-06-08 | KVM: x86/pmu: Drop amd_event_mapping[] in the KVM context | Like Xu | 1 file, -7/+4
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW, which means that the table that maps perf_hw_id to event select values is no longer useful, at least for AMD. For Intel, the logic that checks whether the pmu event reported by Intel CPUID is unavailable is still required, in which case pmc_perf_hw_id() can be renamed to hw_event_is_unavail() and return a bool to replace the semantics of "PERF_COUNT_HW_MAX+1". Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Use only the uniform interface reprogram_counter() | Paolo Bonzini | 1 file, -2/+2
Since reprogram_counter() and reprogram_{gp, fixed}_counter() currently take the same incoming parameter "struct kvm_pmc *pmc", the callers can simplify the context by using the uniformly exported interface, which makes reprogram_{gp, fixed}_counter() static and eliminates EXPORT_SYMBOL_GPL. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()Like Xu1-7/+7
Since reprogram_fixed_counter() is bound to assign the requested fixed_ctr_ctrl to pmu->fixed_ctr_ctrl once it is called, this assignment step can be moved forward (the stale value for the diff is saved extra early), thus simplifying the passing of parameters. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Drop "u64 eventsel" for reprogram_gp_counter()Like Xu1-1/+2
Because reprogram_gp_counter() is bound to assign the requested eventsel to pmc->eventsel, this assignment step can be moved forward, thus simplifying the passing of parameters to "struct kvm_pmc *pmc" only. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Pass only "struct kvm_pmc *pmc" to reprogram_counter() | Like Xu | 1 file, -14/+18
Passing the reference "struct kvm_pmc *pmc" when creating pmc->perf_event is sufficient. This change helps to simplify the calling convention by replacing reprogram_{gp, fixed}_counter() with reprogram_counter() seamlessly. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86: always allow host-initiated writes to PMU MSRs | Paolo Bonzini | 1 file, -10/+17
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept the PMU MSRs unconditionally in intel_is_valid_msr, if the access was host-initiated. Signed-off-by: Paolo Bonzini <[email protected]>
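A simplified sketch of the shape this check takes; the helper names below are hypothetical stand-ins, not the real intel_is_valid_msr() internals.

/* Sketch: MSRs advertised via KVM_GET_MSR_INDEX_LIST must always be
 * accepted for host-initiated accesses, even if the current vCPU model
 * would make a guest access fail. */
static bool example_is_valid_pmu_msr(struct kvm_vcpu *vcpu, u32 msr,
				     bool host_initiated)
{
	if (host_initiated && msr_is_pmu_msr(msr))	/* hypothetical helper */
		return true;

	return guest_model_has_msr(vcpu, msr);		/* hypothetical helper */
}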
2022-06-08 | KVM: vmx, pmu: accept 0 for host-initiated write to MSR_IA32_DS_AREA | Paolo Bonzini | 1 file, -0/+2
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, as is the case for MSR_IA32_DS_AREA, it has to be always settable with KVM_SET_MSR. Accept a zero value for these MSRs to obey the contract. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Ignore pmu->global_ctrl check if vPMU doesn't support global_ctrl | Like Xu | 1 file, -0/+3
MSR_CORE_PERF_GLOBAL_CTRL is introduced as part of Architecture PMU V2, as indicated by Intel SDM 19.2.2 and the intel_is_valid_msr() function. So in the absence of global_ctrl support, all PMCs are enabled as AMD does. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshing | Like Xu | 1 file, -4/+5
Assigning a value to pmu->global_ctrl just to set the value of pmu->global_ctrl_mask is more readable but does not conform to the specification. The value is reset to zero on Power up and Reset but stays unchanged on INIT, like most other MSRs. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64 | Like Xu | 2 files, -11/+32
The CPUID features PDCM, DS and DTES64 are required for the PEBS feature. KVM exposes the CPUID features PDCM, DS and DTES64 to the guest when PEBS is supported by KVM on Ice Lake server platforms. Originally-by: Andi Kleen <[email protected]> Co-developed-by: Kan Liang <[email protected]> Signed-off-by: Kan Liang <[email protected]> Co-developed-by: Luwei Kang <[email protected]> Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/cpuid: Refactor host/guest CPU model consistency check | Like Xu | 3 files, -13/+2
For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be renamed for reuse by more callers, and the comment about the LBR use case can be deleted along the way. Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability | Like Xu | 1 file, -9/+8
The information obtained from the interface perf_get_x86_pmu_capability() doesn't change, so an exported "struct x86_pmu_capability" is introduced for all guests in the KVM, and it's initialized before hardware_setup(). Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations | Like Xu | 3 files, -0/+25
The guest PEBS will be disabled when some users try to perf KVM and its user-space through the same PEBS facility OR when the host perf doesn't schedule the guest PEBS counter in a one-to-one mapping manner (neither of these are typical scenarios). The PEBS records in the guest DS buffer are still accurate and the above two restrictions will be checked before each vm-entry only if guest PEBS is deemed to be enabled. Suggested-by: Wei Wang <[email protected]> Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled | Like Xu | 1 file, -0/+2
Bit 12 represents "Processor Event Based Sampling Unavailable (RO)": 1 = PEBS is not supported; 0 = PEBS is supported. A write to this PEBS_UNAVAIL available bit raises #GP(0) when guest PEBS is enabled. Some PEBS drivers in the guest may care about this bit. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS | Like Xu | 1 file, -1/+19
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, adaptive PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.FCx_Adaptive_Record) are also supported. Adaptive PEBS provides software the capability to configure the PEBS records to capture only the data of interest, keeping the record size compact. An overflow of PMCx results in generation of an adaptive PEBS record with state information based on the selections specified in MSR_PEBS_DATA_CFG. By default, the record only contains the Basic group. When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like the CORE_PERF_GLOBAL_CTRL MSR. According to the Intel SDM, software is recommended to use PEBS Baseline when the following is true: IA32_PERF_CAPABILITIES.PEBS_BASELINE[14] && IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4. Co-developed-by: Luwei Kang <[email protected]> Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
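A small sketch of the SDM recommendation quoted above, with the bit positions taken from the commit message (the macro and helper names here are illustrative):

/* IA32_PERF_CAPABILITIES: PEBS record format in bits 11:8, PEBS Baseline
 * in bit 14 (per the commit message above). */
#define EXAMPLE_PERF_CAP_PEBS_FMT	0x00000f00ULL
#define EXAMPLE_PERF_CAP_PEBS_BASELINE	(1ULL << 14)

static bool pebs_baseline_recommended(u64 perf_capabilities)
{
	u8 pebs_fmt = (perf_capabilities & EXAMPLE_PERF_CAP_PEBS_FMT) >> 8;

	return (perf_capabilities & EXAMPLE_PERF_CAP_PEBS_BASELINE) &&
	       pebs_fmt >= 4;
}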
2022-06-08 | KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS | Like Xu | 1 file, -0/+11
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points to the linear address of the first byte of the DS buffer management area, which is used to manage the PEBS records. When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the perf_guest_switch_msr() and switched during the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0) if the source register contains a non-canonical address. Originally-by: Andi Kleen <[email protected]> Co-developed-by: Kan Liang <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
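A minimal sketch of the #GP(0) condition described above for the WRMSR path, using KVM's existing is_noncanonical_address() helper; the surrounding handler shape is illustrative.

/* Illustrative shape of handling WRMSR(IA32_DS_AREA): reject non-canonical
 * addresses, otherwise remember the DS save area for the vPMU. */
static int example_set_ds_area(struct kvm_vcpu *vcpu, struct kvm_pmu *pmu,
			       u64 data)
{
	if (is_noncanonical_address(data, vcpu))
		return 1;	/* non-zero makes the write fail => #GP(0) */

	pmu->ds_area = data;
	return 0;
}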
2022-06-08 | KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS | Like Xu | 1 file, -0/+31
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE that enable generation of PEBS records. The general-purpose counter bits start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at bit IA32_PEBS_ENABLE[32]. When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be added to the perf_guest_switch_msr() and atomically switched during the VMX transitions just like the CORE_PERF_GLOBAL_CTRL MSR. Based on whether the platform supports x86_pmu.pebs_ept, the way more MSRs are added to arr[] in intel_guest_get_msrs() has also been refactored for extensibility. Originally-by: Andi Kleen <[email protected]> Co-developed-by: Kan Liang <[email protected]> Signed-off-by: Kan Liang <[email protected]> Co-developed-by: Luwei Kang <[email protected]> Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
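A sketch of the bit layout described above; the helper below derives the writable IA32_PEBS_ENABLE bits from the counter counts and is illustrative rather than the actual KVM code.

/* GP counter enable bits start at IA32_PEBS_ENABLE[0], fixed counter
 * enable bits start at IA32_PEBS_ENABLE[32]; everything else is reserved. */
static u64 example_pebs_enable_valid_bits(int nr_gp_counters,
					  int nr_fixed_counters)
{
	u64 gp_bits = (1ULL << nr_gp_counters) - 1;
	u64 fixed_bits = ((1ULL << nr_fixed_counters) - 1) << 32;

	return gp_bits | fixed_bits;
}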
2022-06-08 | KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter | Like Xu | 1 file, -1/+5
The mask value of the fixed counter control register should be dynamically adjusted based on the number of fixed counters. This patch introduces a variable that includes the reserved bits of fixed counter control registers. This is a generic code refactoring. Co-developed-by: Luwei Kang <[email protected]> Signed-off-by: Luwei Kang <[email protected]> Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
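A simplified sketch of how such a mask could be derived, assuming each fixed counter owns a 4-bit field in IA32_FIXED_CTR_CTRL and ignoring per-field reserved bits (e.g. AnyThread) that the real code also has to handle:

/* Bits [4*i+3 : 4*i] of IA32_FIXED_CTR_CTRL control fixed counter i;
 * control bits for counters the vPMU does not implement stay reserved. */
static u64 example_fixed_ctr_ctrl_reserved(int nr_fixed_counters)
{
	u64 implemented = (1ULL << (4 * nr_fixed_counters)) - 1;

	return ~implemented;
}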
2022-06-08 | KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled | Like Xu | 1 file, -0/+1
On Intel platforms, software can use the IA32_MISC_ENABLE[7] bit to detect whether the processor supports the performance monitoring facility. It depends on whether the PMU is enabled for the guest, and a software write operation to this available bit is ignored. Ignoring the toggle in KVM is the way to go, and that behavior matches bare metal. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values | Like Xu | 1 file, -1/+2
Splitting the logic for determining the guest values is unnecessarily confusing, and potentially fragile. Perf should have full knowledge and control of what values are loaded for the guest. If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can generate the full set of guest values by grabbing guest ds_area and pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so we don't see any reason to not just pass the pointer. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Like Xu <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: VMX: enable IPI virtualization | Chao Gao | 5 files, -6/+106
With IPI virtualization enabled, the processor emulates writes to APIC registers that would send IPIs. The processor sets the bit corresponding to the vector in the target vCPU's PIR and may send a notification (IPI) specified by the NDST and NV fields in the target vCPU's Posted-Interrupt Descriptor (PID). It is similar to what the IOMMU engine does when dealing with posted interrupts from devices. A PID-pointer table is used by the processor to locate the PID of a vCPU via the vCPU's APIC ID. The table size depends on the maximum APIC ID assigned for the current VM session from userspace. Allocating memory for the PID-pointer table is deferred to vCPU creation, because the irqchip mode and the VM-scope maximum APIC ID are settled at that point. KVM can skip PID-pointer table allocation if !irqchip_in_kernel(). Like VT-d PI, if a vCPU goes to the blocked state, the VMM needs to switch its notification vector to the wakeup vector. This ensures that when an IPI for a blocked vCPU arrives, the VMM can get control and wake up the blocked vCPU. And if a vCPU is preempted, its posted interrupt notification is suppressed. Note that IPI virtualization can only virtualize physical-addressing, flat mode, unicast IPIs. Sending other IPIs would still cause a trap-like APIC-write VM-exit and need to be handled by the VMM. Signed-off-by: Chao Gao <[email protected]> Signed-off-by: Zeng Guang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: VMX: Clean up vmx_refresh_apicv_exec_ctrl() | Zeng Guang | 1 file, -10/+9
Remove the condition check cpu_has_secondary_exec_ctrls(). Calling vmx_refresh_apicv_exec_ctrl() presumes that secondary controls are activated and that the APICv-related VMCS fields are valid. If it is invoked in the wrong circumstances, in the worst case the VMX operation reports a VMfailValid error without further harmful impact and simply functions as if all the secondary controls were 0. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Zeng Guang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: VMX: Report tertiary_exec_control field in dump_vmcs() | Robert Hoo | 1 file, -4/+13
Add tertiary_exec_control field report in dump_vmcs(). Meanwhile, reorganize the dump output of VMCS category as follows. Before change: *** Control State *** PinBased=0x000000ff CPUBased=0xb5a26dfa SecondaryExec=0x061037eb EntryControls=0000d1ff ExitControls=002befff After change: *** Control State *** CPUBased=0xb5a26dfa SecondaryExec=0x061037eb TertiaryExec=0x0000000000000010 PinBased=0x000000ff EntryControls=0000d1ff ExitControls=002befff Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Robert Hoo <[email protected]> Signed-off-by: Zeng Guang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: VMX: Detect Tertiary VM-Execution control when setup VMCS config | Robert Hoo | 6 files, -1/+40
Check VMX features on tertiary execution control in VMCS config setup. Sub-features in tertiary execution control to be enabled are adjusted according to hardware capabilities although no sub-feature is enabled in this patch. EVMCSv1 doesn't support tertiary VM-execution control, so disable it when EVMCSv1 is in use. And define the auxiliary functions for Tertiary control field here, using the new BUILD_CONTROLS_SHADOW(). Reviewed-by: Maxim Levitsky <[email protected]> Signed-off-by: Robert Hoo <[email protected]> Signed-off-by: Zeng Guang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: VMX: Extend BUILD_CONTROLS_SHADOW macro to support 64-bit variation | Robert Hoo | 1 file, -28/+28
The Tertiary VM-Exec Control, different from previous control fields, is 64 bit. So extend BUILD_CONTROLS_SHADOW() by adding a 'bit' parameter, to support both 32 bit and 64 bit fields' auxiliary functions building. Suggested-by: Sean Christopherson <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Robert Hoo <[email protected]> Signed-off-by: Zeng Guang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
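A condensed sketch of the macro pattern being extended; this is a simplified illustration of BUILD_CONTROLS_SHADOW(), where the new bits parameter merely selects the u32 vs. u64 accessors and VMCS write width.

/* Simplified illustration; the real macro generates more helpers (setbit,
 * clearbit, ...) and lives in vmx.h. */
#define EXAMPLE_CONTROLS_SHADOW(lname, uname, bits)			       \
static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val)    \
{									       \
	if (vmx->loaded_vmcs->controls_shadow.lname != val) {		       \
		vmcs_write##bits(uname, val);				       \
		vmx->loaded_vmcs->controls_shadow.lname = val;		       \
	}								       \
}									       \
static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx)	       \
{									       \
	return vmx->loaded_vmcs->controls_shadow.lname;			       \
}

/* 64-bit user, per this patch series: */
EXAMPLE_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)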
2022-06-08 | KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint | Sean Christopherson | 1 file, -2/+2
In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft "IRQs", i.e. interrupts that are reinjected after incomplete delivery of a software interrupt from an INTn instruction. Tag reinjected interrupts as such, even though the information is usually redundant since soft interrupts are only ever reinjected by KVM. Though rare in practice, a hard IRQ can be reinjected. Signed-off-by: Sean Christopherson <[email protected]> [MSS: change "kvm_inj_virq" event "reinjected" field type to bool] Signed-off-by: Maciej S. Szmigiero <[email protected]> Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08 | KVM: x86: do not report a vCPU as preempted outside instruction boundaries | Paolo Bonzini | 1 file, -0/+1
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry. To avoid this, only report a vCPU as preempted if it is certain that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* to be at an instruction boundary, except for external interrupts. It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory addresses, it's very much preferable to be conservative. Reported-by: Jann Horn <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-07 | Merge branch 'kvm-5.20-early-patches' into HEAD | Paolo Bonzini | 1 file, -3/+5
2022-06-05 | Merge tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 file, -1/+1
Pull x86 cleanups from Thomas Gleixner: "A set of small x86 cleanups: - Remove unused headers in the IDT code - Kconfig indentation and comment fixes - Fix all 'the the' typos in one go instead of waiting for bots to fix one at a time" * tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix all occurences of the "the the" typo x86/idt: Remove unused headers x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
2022-05-27 | x86: Fix all occurences of the "the the" typo | Bo Liu | 1 file, -1/+1
Rather than waiting for the bots to fix these one-by-one, fix all occurences of "the the" throughout arch/x86. Signed-off-by: Bo Liu <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Paolo Bonzini <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-25 | KVM: VMX: Print VM-instruction error as unsigned | Jim Mattson | 1 file, -1/+1
Change the printf format character from 'd' to 'u' for the VM-instruction error in vmwrite_error(). Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface") Reported-by: Sean Christopherson <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-05-25 | KVM: VMX: Print VM-instruction error when it may be helpful | David Matlack | 1 file, -2/+4
Include the value of the "VM-instruction error" field from the current VMCS (if any) in the error message for VMCLEAR and VMPTRLD, since each of these instructions may result in more than one VM-instruction error. Previously, this field was only reported for VMWRITE errors. Signed-off-by: David Matlack <[email protected]> [Rebased and refactored code; dropped the error number for INVVPID and INVEPT; reworded commit message.] Signed-off-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-05-25 | KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest | Yanfei Xu | 1 file, -1/+1
When the kernel handles the vm-exit caused by external interrupts and NMI, it always sets kvm_intr_type to tell if it's dealing with an IRQ or an NMI. For the PMI scenario, it could be an IRQ or an NMI. However, intel_pt PMIs are only generated for HARDWARE perf events, and HARDWARE events are always configured to generate NMIs. Use kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI came from the guest; this avoids false positives if an intel_pt PMI/NMI arrives while the host is handling an unrelated IRQ VM-Exit. Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI") Signed-off-by: Yanfei Xu <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2022-05-25 | Merge tag 'kvm-riscv-5.19-1' of https://github.com/kvm-riscv/linux into HEAD | Paolo Bonzini | 1 file, -1/+1
KVM/riscv changes for 5.19 - Added Sv57x4 support for G-stage page table - Added range based local HFENCE functions - Added remote HFENCE functions based on VCPU requests - Added ISA extension registers in ONE_REG interface - Updated KVM RISC-V maintainers entry to cover selftests support
2022-05-25 | Merge tag 'kvmarm-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD | Paolo Bonzini | 4 files, -6/+13
KVM/arm64 updates for 5.19 - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes [Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated from 4 to 6. - Paolo]
2022-05-21 | KVM: x86/speculation: Disable Fill buffer clear within guests | Pawan Gupta | 2 files, -0/+71
The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an accurate indicator on all CPUs of whether the VERW instruction will overwrite fill buffers. FB_CLEAR enumeration in IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not vulnerable to MDS/TAA, indicating that microcode does overwrite fill buffers. Guests running in VMM environments may not be aware of all the capabilities/vulnerabilities of the host CPU. Specifically, a guest may apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable to MDS/TAA even when the physical CPU is not. On CPUs that enumerate FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS during VMENTER and resetting on VMEXIT. For guests that enumerate FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM will not use FB_CLEAR_DIS. Irrespective of guest state, host overwrites CPU buffers before VMENTER to protect itself from an MMIO capable guest, as part of mitigation for MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
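A hedged sketch of the toggling described above, assuming the MSR_IA32_MCU_OPT_CTRL and FB_CLEAR_DIS definitions from asm/msr-index.h; whether a given vCPU is allowed to run with fill-buffer clearing disabled is left as a precomputed flag here.

#include <asm/msr-index.h>
#include <asm/msr.h>

/* Illustrative only: set FB_CLEAR_DIS before entering the guest and clear
 * it again after VM-exit, but only for vCPUs that do not enumerate
 * FB_CLEAR themselves. */
static void example_vmenter_fb_clear_dis(bool allowed)
{
	u64 msr;

	if (!allowed)
		return;

	rdmsrl(MSR_IA32_MCU_OPT_CTRL, msr);
	wrmsrl(MSR_IA32_MCU_OPT_CTRL, msr | FB_CLEAR_DIS);
}

static void example_vmexit_fb_clear_enable(bool allowed)
{
	u64 msr;

	if (!allowed)
		return;

	rdmsrl(MSR_IA32_MCU_OPT_CTRL, msr);
	wrmsrl(MSR_IA32_MCU_OPT_CTRL, msr & ~FB_CLEAR_DIS);
}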
2022-05-21 | x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data | Pawan Gupta | 1 file, -0/+3
Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Borislav Petkov <[email protected]>
2022-05-12 | KVM: VMX: Include MKTME KeyID bits in shadow_zero_check | Kai Huang | 1 file, -0/+31
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should never be set to SPTE. Set shadow_me_value to 0 and shadow_me_mask to include all MKTME KeyID bits to include them to shadow_zero_check. Signed-off-by: Kai Huang <[email protected]> Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2022-05-12 | KVM: VMX: clean up pi_wakeup_handler | Li RongQing | 1 file, -4/+5
Passing per_cpu() to list_for_each_entry() causes the macro to be evaluated N+1 times for N sleeping vCPUs. This is a very small inefficiency, and the code is cleaner if the address of the per-CPU variable is loaded earlier. Do this for both the list and the spinlock. Signed-off-by: Li RongQing <[email protected]> Message-Id: <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
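A sketch of the resulting shape of the handler (close to, but not necessarily identical to, the actual posted_intr.c code after this change):

/* Load the per-CPU list head and lock once, instead of re-evaluating
 * per_cpu() on every iteration of list_for_each_entry(). */
void pi_wakeup_handler(void)
{
	int cpu = smp_processor_id();
	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
	raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
	struct vcpu_vmx *vmx;

	raw_spin_lock(spinlock);
	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
		if (pi_test_on(&vmx->pi_desc))
			kvm_vcpu_wake_up(&vmx->vcpu);
	}
	raw_spin_unlock(spinlock);
}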
2022-05-06 | KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state | Sean Christopherson | 1 file, -1/+1
Exit to userspace with an emulation error if KVM encounters an injected exception with invalid guest state, in addition to the existing check of bailing if there's a pending exception (KVM doesn't support emulating exceptions except when emulating real mode via vm86). In theory, KVM should never get to such a situation as KVM is supposed to exit to userspace before injecting an exception with invalid guest state. But in practice, userspace can intervene and manually inject an exception and/or stuff registers to force invalid guest state while a previously injected exception is awaiting reinjection. Fixes: fc4fad79fc3d ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception") Reported-by: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-05-02 | KVM: VMX: Use vcpu_to_pi_desc() uniformly in posted_intr.c | Yuan Yao | 1 file, -1/+1
The helper function, vcpu_to_pi_desc(), is defined to get the posted interrupt descriptor from vcpu. There is one place that doesn't use it, and instead references vmx_vcpu->pi_desc directly. Remove the inconsistency. Signed-off-by: Yuan Yao <[email protected]> Signed-off-by: Isaku Yamahata <[email protected]> Message-Id: <ee7be7832bc424546fd4f05015a844a0205b5ba2.1646422845.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29 | KVM: x86/mmu: replace shadow_root_level with root_role.level | Paolo Bonzini | 1 file, -1/+1
root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-29 | KVM: x86: Clean up and document nested #PF workaround | Sean Christopherson | 1 file, -9/+6
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-21 | KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog | Like Xu | 1 file, -6/+2
NMI-watchdog is one of the favorite features of kernel developers, but it does not work in AMD guest even with vPMU enabled and worse, the system misrepresents this capability via /proc. This is a PMC emulation error. KVM does not pass the latest valid value to perf_event in time when guest NMI-watchdog is running, thus the perf_event corresponding to the watchdog counter will enter the old state at some point after the first guest NMI injection, forcing the hardware register PMC0 to be constantly written to 0x800000000001. Meanwhile, the running counter should accurately reflect its new value based on the latest coordinated pmc->counter (from vPMC's point of view) rather than the value written directly by the guest. Fixes: 168d918f2643 ("KVM: x86: Adjust counter sample period after a wrmsr") Reported-by: Dongli Cao <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Yanan Wang <[email protected]> Tested-by: Yanan Wang <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-21 | KVM: nVMX: Defer APICv updates while L2 is active until L1 is active | Sean Christopherson | 3 files, -0/+11
Defer APICv updates that occur while L2 is active until nested VM-Exit, i.e. until L1 regains control. vmx_refresh_apicv_exec_ctrl() assumes L1 is active and (a) stomps all over vmcs02 and (b) neglects to ever update vmcs01. E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv becomes uninhibited while L2 is active, KVM will set various APICv controls in vmcs02 and trigger a failed VM-Entry. The kicker is that, unless running with nested_early_check=1, KVM blames L1 and chaos ensues. In all cases, ignoring vmcs02 and always deferring the inhibition change to vmcs01 is correct (or at least acceptable). The ABSENT and DISABLE inhibitions cannot truly change while L2 is active (see below). IRQ_BLOCKING can change, but it is firmly a best effort debug feature. Furthermore, only L2's APIC is accelerated/virtualized to the full extent possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR interception will apply to the virtual APIC managed by KVM. The exception is the SELF_IPI register when x2APIC is enabled, but that's an acceptable hole. Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the MSRs to L2, but for that to work in any sane capacity, L1 would need to pass through IRQs to L2 as well, and IRQs must be intercepted to enable virtual interrupt delivery. I.e. exposing Auto EOI to L2 and enabling VID for L2 are, for all intents and purposes, mutually exclusive. Lack of dynamic toggling is also why this scenario is all but impossible to encounter in KVM's current form. But a future patch will pend an APICv update request _during_ vCPU creation to plug a race where a vCPU that's being created doesn't get included in the "all vCPUs request" because it's not yet visible to other vCPUs. If userspace restores L2 after VM creation (hello, KVM selftests), the first KVM_RUN will occur while L2 is active and thus service the APICv update request made during VM creation. Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-04-13 | KVM: nVMX: Clear IDT vectoring on nested VM-Exit for double/triple fault | Sean Christopherson | 2 files, -4/+33
Clear the IDT vectoring field in vmcs12 on next VM-Exit due to a double or triple fault. Per the SDM, a VM-Exit isn't considered to occur during event delivery if the exit is due to an intercepted double fault or a triple fault. Opportunistically move the default clearing (no event "pending") into the helper so that it's more obvious that KVM does indeed handle this case. Note, the double fault case is worded rather weirdly in the SDM: The original event results in a double-fault exception that causes the VM exit directly. Temporarily ignoring injected events, double faults can _only_ occur if an exception occurs while attempting to deliver a different exception, i.e. there's _always_ an original event. And for an injected double fault, while there's no original event, injected events are never subject to interception. Presumably the SDM is calling out that the vectoring info will be valid if a different exit occurs after a double fault, e.g. if a #PF occurs and is intercepted while vectoring #DF, then the vectoring info will show the double fault. In other words, the clause can simply be read as: The VM exit is caused by a double-fault exception. Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1") Cc: Chenyi Qiang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>