aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-02-04KVM: SVM: Add support for SVM instruction address check changeWei Huang2-0/+4
New AMD CPUs have a change that checks #VMEXIT intercept on special SVM instructions before checking their EAX against reserved memory region. This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT is triggered before #GP. KVM doesn't need to intercept and emulate #GP faults as #GP is supposed to be triggered. Co-developed-by: Bandan Das <[email protected]> Signed-off-by: Bandan Das <[email protected]> Signed-off-by: Wei Huang <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: Add emulation support for #GP triggered by SVM instructionsBandan Das1-18/+91
While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD CPUs check EAX against reserved memory regions (e.g. SMM memory on host) before checking VMCB's instruction intercept. If EAX falls into such memory areas, #GP is triggered before VMEXIT. This causes problem under nested virtualization. To solve this problem, KVM needs to trap #GP and check the instructions triggering #GP. For VM execution instructions, KVM emulates these instructions. Co-developed-by: Wei Huang <[email protected]> Signed-off-by: Wei Huang <[email protected]> Signed-off-by: Bandan Das <[email protected]> Message-Id: <[email protected]> [Conditionally enable #GP intercept. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86: Factor out x86 instruction emulation with decodingWei Huang2-23/+41
Move the instruction decode part out of x86_emulate_instruction() for it to be used in other places. Also kvm_clear_exception_queue() is moved inside the if-statement as it doesn't apply when KVM are coming back from userspace. Co-developed-by: Bandan Das <[email protected]> Signed-off-by: Bandan Das <[email protected]> Signed-off-by: Wei Huang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOWChenyi Qiang7-25/+38
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared to 0 when the condition (e.g. RTM) happens. The value can be used to initialize dr6 and also be the XOR mask between the #DB exit qualification (or payload) and DR6. Concerning that DR6_INIT is used as initial value only once, rename it to DR6_ACTIVE_LOW and apply it in other places, which would make the incoming changes for bus lock debug exception more simple. Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIESLike Xu5-1/+169
This test will check the effect of various CPUID settings on the MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes with KVM_SET_MSR is _not_ modified from the guest and can be retrieved with KVM_GET_MSR, and check that invalid LBR formats are rejected. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIESLike Xu1-1/+8
Userspace could enable guest LBR feature when the exactly supported LBR format value is initialized to the MSR_IA32_PERF_CAPABILITIES and the LBR is also compatible with vPMU version and host cpu model. The LBR could be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Release guest LBR event via lazy release mechanismLike Xu3-1/+24
The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to indicate whether a guest LBR event is still needed by the vcpu. If the vcpu no longer accesses LBR related registers within a scheduling time slice, and the enable bit of LBR has been unset, vPMU will treat the guest LBR event as a bland event of a vPMC counter and release it as usual. Also, the pass-through state of LBR records msrs is cancelled. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMILike Xu5-3/+39
The current vPMU only supports Architecture Version 2. According to Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual PMI and the KVM would emulate to clear the LBR bit (bit 0) in IA32_DEBUGCTL. Also, guest needs to re-enable IA32_DEBUGCTL.LBR to resume recording branches. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellationLike Xu2-0/+16
When the LBR records msrs has already been pass-through, there is no need to call vmx_update_intercept_for_lbr_msrs() again and again, and vice versa. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVELike Xu3-3/+135
In addition to DEBUGCTLMSR_LBR, any KVM trap caused by LBR msrs access will result in a creation of guest LBR event per-vcpu. If the guest LBR event is scheduled on with the corresponding vcpu context, KVM will pass-through all LBR records msrs to the guest. The LBR callstack mechanism implemented in the host could help save/restore the guest LBR records during the event context switches, which reduces a lot of overhead if we save/restore tens of LBR msrs (e.g. 32 LBR records entries) in the much more frequent VMX transitions. To avoid reclaiming LBR resources from any higher priority event on host, KVM would always check the exist of guest LBR event and its state before vm-entry as late as possible. A negative result would cancel the pass-through state, and it also prevents real registers accesses and potential data leakage. If host reclaims the LBR between two checks, the interception state and LBR records can be safely preserved due to native save/restore support from guest LBR event. The KVM emits a pr_warn() when the LBR hardware is unavailable to the guest LBR event. The administer is supposed to reminder users that the guest result may be inaccurate if someone is using LBR to record hypervisor on the host side. Suggested-by: Andi Kleen <[email protected]> Co-developed-by: Wei Wang <[email protected]> Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Create a guest LBR event when vcpu sets DEBUGCTLMSR_LBRLike Xu3-0/+76
When vcpu sets DEBUGCTLMSR_LBR in the MSR_IA32_DEBUGCTLMSR, the KVM handler would create a guest LBR event which enables the callstack mode and none of hardware counter is assigned. The host perf would schedule and enable this event as usual but in an exclusive way. The guest LBR event will be released when the vPMU is reset but soon, the lazy release mechanism would be applied to this event like a vPMC. Suggested-by: Andi Kleen <[email protected]> Co-developed-by: Wei Wang <[email protected]> Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabledLike Xu4-2/+25
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES MSR which tells about the record format stored in the LBR records. The LBR will be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabledPaolo Bonzini4-0/+43
Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES MSR which tells about the record format stored in the LBR records. The LBR will be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/pmu: preserve IA32_PERF_CAPABILITIES across CPUID refreshPaolo Bonzini1-6/+10
Once MSR_IA32_PERF_CAPABILITIES is changed via vmx_set_msr(), the value should not be changed by cpuid(). To ensure that the new value is kept, the default initialization path is moved to intel_pmu_init(). The effective value of the MSR will be 0 if PDCM is clear, however. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/vmx: Make vmx_set_intercept_for_msr() non-staticLike Xu2-1/+3
To make code responsibilities clear, we may resue and invoke the vmx_set_intercept_for_msr() in other vmx-specific files (e.g. pmu_intel.c), so expose it to passthrough LBR msrs later. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: VMX: read/write MSR_IA32_DEBUGCTLMSR from GUEST_IA32_DEBUGCTLLike Xu4-18/+28
SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the svm_get/set_msr, so the x86 common part can be safely moved to VMX. This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL. Add vmx_supported_debugctl() to refactor the throwing logic of #GP. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> [Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception to guest". - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: VMX: Use x2apic_mode to avoid RDMSR when querying PI stateSean Christopherson1-3/+3
Use x2apic_mode instead of x2apic_enabled() when adjusting the destination ID during Posted Interrupt updates. This avoids the costly RDMSR that is hidden behind x2apic_enabled(). Reported-by: luferry <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04x86/apic: Export x2apic_mode for use by KVM in "warm" pathSean Christopherson1-0/+1
Export x2apic_mode so that KVM can query whether x2APIC is active without having to incur the RDMSR in x2apic_enabled(). When Posted Interrupts are in use for a guest with an assigned device, KVM ends up checking for x2APIC at least once every time a vCPU halts. KVM could obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's rather silly given that x2apic_mode holds the exact info needed by KVM. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: X86: Add the Document for KVM_CAP_X86_BUS_LOCK_EXITChenyi Qiang1-3/+42
Introduce a new capability named KVM_CAP_X86_BUS_LOCK_EXIT, which is used to handle bus locks detected in guest. It allows the userspace to do custom throttling policies to mitigate the 'noisy neighbour' problem. Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: VMX: Enable bus lock VM exitChenyi Qiang10-4/+83
Virtual Machine can exploit bus locks to degrade the performance of system. Bus lock can be caused by split locked access to writeback(WB) memory or by using locks on uncacheable(UC) memory. The bus lock is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). To address the threat, bus lock VM exit is introduced to notify the VMM when a bus lock was acquired, allowing it to enforce throttling or other policy based mitigations. A VMM can enable VM exit due to bus locks by setting a new "Bus Lock Detection" VM-execution control(bit 30 of Secondary Processor-based VM execution controls). If delivery of this VM exit was preempted by a higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of exit reason in vmcs field is set to 1. In current implementation, the KVM exposes this capability through KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap (i.e. off and exit) and enable it explicitly (disabled by default). If bus locks in guest are detected by KVM, exit to user space even when current exit reason is handled by KVM internally. Set a new field KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there is a bus lock detected in guest. Document for Bus Lock VM exit is now available at the latest "Intel Architecture Instruction Set Extensions Programming Reference". Document Link: https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html Co-developed-by: Xiaoyao Li <[email protected]> Signed-off-by: Xiaoyao Li <[email protected]> Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: X86: Reset the vcpu->run->flags at the beginning of vcpu_runChenyi Qiang1-1/+4
Reset the vcpu->run->flags at the beginning of kvm_arch_vcpu_ioctl_run. It can avoid every thunk of code that needs to set the flag clear it, which increases the odds of missing a case and ending up with a flag in an undefined state. Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: VMX: Convert vcpu_vmx.exit_reason to a unionSean Christopherson3-49/+86
Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32). The full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in bits 15:0, and single-bit modifiers in bits 31:16. Historically, KVM has only had to worry about handling the "failed VM-Entry" modifier, which could only be set in very specific flows and required dedicated handling. I.e. manually stripping the FAILED_VMENTRY bit was a somewhat viable approach. But even with only a single bit to worry about, KVM has had several bugs related to comparing a basic exit reason against the full exit reason store in vcpu_vmx. Upcoming Intel features, e.g. SGX, will add new modifier bits that can be set on more or less any VM-Exit, as opposed to the significantly more restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off flows isn't scalable. Tracking exit reason in a union forces code to explicitly choose between consuming the full exit reason and the basic exit, and is a convenient way to document and access the modifiers. No functional change intended. Cc: Xiaoyao Li <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM/SVM: add support for SEV attestation commandBrijesh Singh5-0/+118
The SEV FW version >= 0.23 added a new command that can be used to query the attestation report containing the SHA-256 digest of the guest memory encrypted through the KVM_SEV_LAUNCH_UPDATE_{DATA, VMSA} commands and sign the report with the Platform Endorsement Key (PEK). See the SEV FW API spec section 6.8 for more details. Note there already exist a command (KVM_SEV_LAUNCH_MEASURE) that can be used to get the SHA-256 digest. The main difference between the KVM_SEV_LAUNCH_MEASURE and KVM_SEV_ATTESTATION_REPORT is that the latter can be called while the guest is running and the measurement value is signed with PEK. Cc: James Bottomley <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: David Rientjes <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: John Allen <[email protected]> Cc: Herbert Xu <[email protected]> Cc: [email protected] Reviewed-by: Tom Lendacky <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: James Bottomley <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Disable dirty logging with vCPUs runningBen Gardon1-5/+5
Disabling dirty logging is much more intestesting from a testing perspective if the vCPUs are still running. This also excercises the code-path in which collapsible SPTEs must be faulted back in at a higher level after disabling dirty logging. To: [email protected] CC: Peter Xu <[email protected]> CC: Andrew Jones <[email protected]> CC: Thomas Huth <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Add backing src parameter to dirty_log_perf_testBen Gardon8-15/+63
Add a parameter to control the backing memory type for dirty_log_perf_test so that the test can be run with hugepages. To: [email protected] CC: Peter Xu <[email protected]> CC: Andrew Jones <[email protected]> CC: Thomas Huth <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Add memslot modification stress testBen Gardon3-0/+213
Add a memslot modification stress test in which a memslot is repeatedly created and removed while vCPUs access memory in another memslot. Most userspaces do not create or remove memslots on running VMs which makes it hard to test races in adding and removing memslots without a dedicated test. Adding and removing a memslot also has the effect of tearing down the entire paging structure, which leads to more page faults and pressure on the page fault handling path than a one-and-done memory population test. Reviewed-by: Jacob Xu <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Add option to overlap vCPU memory accessBen Gardon4-18/+57
Add an option to overlap the ranges of memory each vCPU accesses instead of partitioning them. This option will increase the probability of multiple vCPUs faulting on the same page at the same time, and causing interesting races, if there are bugs in the page fault handler or elsewhere in the kernel. Reviewed-by: Jacob Xu <[email protected]> Reviewed-by: Makarand Sonare <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Fix population stage in dirty_log_perf_testBen Gardon1-3/+8
Currently the population stage in the dirty_log_perf_test does nothing as the per-vCPU iteration counters are not initialized and the loop does not wait for each vCPU. Remedy those errors. Reviewed-by: Jacob Xu <[email protected]> Reviewed-by: Makarand Sonare <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Convert iterations to int in dirty_log_perf_testBen Gardon1-13/+13
In order to add an iteration -1 to indicate that the memory population phase has not yet completed, convert the interations counters to ints. No functional change intended. Reviewed-by: Jacob Xu <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Avoid flooding debug log while populating memoryBen Gardon1-5/+4
Peter Xu pointed out that a log message printed while waiting for the memory population phase of the dirty_log_perf_test will flood the debug logs as there is no delay after printing the message. Since the message does not provide much value anyway, remove it. Reviewed-by: Jacob Xu <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Rename timespec_diff_now to timespec_elapsedBen Gardon4-13/+13
In response to some earlier comments from Peter Xu, rename timespec_diff_now to the much more sensible timespec_elapsed. No functional change intended. Reviewed-by: Jacob Xu <[email protected]> Reviewed-by: Makarand Sonare <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Remove the defunct update_pte() paging hookSean Christopherson3-51/+2
Remove the update_pte() shadow paging logic, which was obsoleted by commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"), but never removed. As pointed out by Yu, KVM never write protects leaf page tables for the purposes of shadow paging, and instead marks their associated shadow page as unsync so that the guest can write PTEs at will. The update_pte() path, which predates the unsync logic, optimizes COW scenarios by refreshing leaf SPTEs when they are written, as opposed to zapping the SPTE, restarting the guest, and installing the new SPTE on the subsequent fault. Since KVM no longer write-protects leaf page tables, update_pte() is unreachable and can be dropped. Reported-by: Yu Zhang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: selftests: Test IPI to halted vCPU in xAPIC while backing page movesPeter Shier4-0/+566
When a guest is using xAPIC KVM allocates a backing page for the required EPT entry for the APIC access address set in the VMCS. If mm decides to move that page the KVM mmu notifier will update the VMCS with the new HPA. This test induces a page move to test that APIC access continues to work correctly. It is a directed test for commit e649b3f0188f "KVM: x86: Fix APIC page invalidation race". Tested: ran for 1 hour on a skylake, migrating backing page every 1ms Depends on patch "selftests: kvm: Add exception handling to selftests" from [email protected] that has not yet been queued. Signed-off-by: Peter Shier <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Reviewed-by: Ricardo Koller <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: Expose AVX_VNNI instruction to gusetYang Zhong1-1/+1
Expose AVX (VEX-encoded) versions of the Vector Neural Network Instructions to guest. The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 4] AVX_VNNI The following instructions are available when this feature is present in the guest. 1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes 2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation 3. VPDPWSSD: Multiply and Add Signed Word Integers 4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation This instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual (SDM) in the future. Signed-off-by: Yang Zhong <[email protected]> Reviewed-by: Tony Luck <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04Enumerate AVX Vector Neural Network instructionsKyung Min Park1-0/+1
Add AVX version of the Vector Neural Network (VNNI) Instructions. A processor supports AVX VNNI instructions if CPUID.0x07.0x1:EAX[4] is present. The following instructions are available when this feature is present. 1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes 2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation 3. VPDPWSSD: Multiply and Add Signed Word Integers 4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation The only in-kernel usage of this is kvm passthrough. The CPU feature flag is shown as "avx_vnni" in /proc/cpuinfo. This instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual (SDM) in the future. Signed-off-by: Kyung Min Park <[email protected]> Signed-off-by: Yang Zhong <[email protected]> Reviewed-by: Tony Luck <[email protected]> Message-Id: <[email protected]> Acked-by: Borislav Petkov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04x86: kvm: style: Simplify bool comparisonYANG LI1-1/+1
Fix the following coccicheck warning: ./arch/x86/kvm/x86.c:8012:5-48: WARNING: Comparison to bool Signed-off-by: YANG LI <[email protected]> Reported-by: Abaci Robot <[email protected]> Message-Id: <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86: Zap the oldest MMU pages, not the newestSean Christopherson1-1/+1
Walk the list of MMU pages in reverse in kvm_mmu_zap_oldest_mmu_pages(). The list is FIFO, meaning new pages are inserted at the head and thus the oldest pages are at the tail. Using a "forward" iterator causes KVM to zap MMU pages that were just added, which obliterates guest performance once the max number of shadow MMU pages is reached. Fixes: 6b82ef2c9cf1 ("KVM: x86/mmu: Batch zap MMU pages when recycling oldest pages") Reported-by: Zdenek Kaspar <[email protected]> Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Use boolean returns for (S)PTE accessorsSean Christopherson2-9/+5
Return a 'bool' instead of an 'int' for various PTE accessors that are boolean in nature, e.g. is_shadow_present_pte(). Returning an int is goofy and potentially dangerous, e.g. if a flag being checked is moved into the upper 32 bits of a SPTE, then the compiler may silently squash the entire check since casting to an int is guaranteed to yield a return value of '0'. Opportunistically refactor is_last_spte() so that it naturally returns a bool value instead of letting it implicitly cast 0/1 to false/true. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: X86: use vzalloc() instead of vmalloc/memsetTian Tao1-2/+1
fixed the following warning: /virt/kvm/dirty_ring.c:70:20-27: WARNING: vzalloc should be used for ring -> dirty_gfns, instead of vmalloc/memset. Signed-off-by: Tian Tao <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86: Take KVM's SRCU lock only if steal time update is neededSean Christopherson1-9/+11
Enter a SRCU critical section for a memslots lookup during steal time update if and only if a steal time update is actually needed. Taking the lock can be avoided if steal time is disabled by the guest, or if KVM knows it has already flagged the vCPU as being preempted. Reword the comment to be more precise as to exactly why memslots will be queried. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86: Remove obsolete disabling of page faults in kvm_arch_vcpu_put()Sean Christopherson1-10/+0
Remove the disabling of page faults across kvm_steal_time_set_preempted() as KVM now accesses the steal time struct (shared with the guest) via a cached mapping (see commit b043138246a4, "x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed".) The cache lookup is flagged as atomic, thus it would be a bug if KVM tried to resolve a new pfn, i.e. we want the splat that would be reached via might_fault(). Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: do not assume PTE is writable after follow_pfnPaolo Bonzini1-3/+12
In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functinso. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this however KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable. Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older version have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable. Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page") Reported-by: David Stevens <[email protected]> Cc: [email protected] Cc: Jann Horn <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEsBen Gardon1-3/+3
There is a bug in the TDP MMU function to zap SPTEs which could be replaced with a larger mapping which prevents the function from doing anything. Fix this by correctly zapping the last level SPTEs. Cc: [email protected] Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU") Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-03KVM: x86: cleanup CR3 reserved bits checksPaolo Bonzini3-13/+5
If not in long mode, the low bits of CR3 are reserved but not enforced to be zero, so remove those checks. If in long mode, however, the MBZ bits extend down to the highest physical address bit of the guest, excluding the encryption bit. Make the checks consistent with the above, and match them between nested_vmcb_checks and KVM_SET_SREGS. Cc: [email protected] Fixes: 761e41693465 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests") Fixes: a780a3ea6282 ("KVM: X86: Fix reserved bits check for MOV to CR3") Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-03KVM: SVM: Treat SVM as unsupported when running as an SEV guestSean Christopherson2-0/+6
Don't let KVM load when running as an SEV guest, regardless of what CPUID says. Memory is encrypted with a key that is not accessible to the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll see garbage when reading the VMCB. Technically, KVM could decrypt all memory that needs to be accessible to the L0 and use shadow paging so that L0 does not need to shadow NPT, but exposing such information to L0 largely defeats the purpose of running as an SEV guest. This can always be revisited if someone comes up with a use case for running VMs inside SEV guests. Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM is doomed even if the SEV guest is debuggable and the hypervisor is willing to decrypt the VMCB. This may or may not be fixed on CPUs that have the SVME_ADDR_CHK fix. Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-02KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit modeSean Christopherson1-0/+2
Set the emulator context to PROT64 if SYSENTER transitions from 32-bit userspace (compat mode) to a 64-bit kernel, otherwise the RIP update at the end of x86_emulate_insn() will incorrectly truncate the new RIP. Note, this bug is mostly limited to running an Intel virtual CPU model on an AMD physical CPU, as other combinations of virtual and physical CPUs do not trigger full emulation. On Intel CPUs, SYSENTER in compatibility mode is legal, and unconditionally transitions to 64-bit mode. On AMD CPUs, SYSENTER is illegal in compatibility mode and #UDs. If the vCPU is AMD, KVM injects a #UD on SYSENTER in compat mode. If the pCPU is Intel, SYSENTER will execute natively and not trigger #UD->VM-Exit (ignoring guest TLB shenanigans). Fixes: fede8076aab4 ("KVM: x86: handle wrap around 32-bit address space") Cc: [email protected] Signed-off-by: Jonny Barker <[email protected]> [sean: wrote changelog] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-01KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID checkVitaly Kuznetsov1-0/+2
Commit 7a873e455567 ("KVM: selftests: Verify supported CR4 bits can be set before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even when PCID support is missing: ==== Test Assertion Failure ==== x86_64/set_sregs_test.c:41: rc pid=6956 tid=6956 - Invalid argument 1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41 2 0x00000000004014fc: main at set_sregs_test.c:119 3 0x00007f2d9346d041: ?? ??:0 4 0x000000000040164d: _start at ??:? KVM allowed unsupported CR4 bit (0x20000) Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make kvm_is_valid_cr4() fail. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-01KVM/x86: assign hva with the right value to vm_munmap the pagesZheng Zhan Liang1-1/+1
Cc: Paolo Bonzini <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Zheng Zhan Liang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-01KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=offPaolo Bonzini2-13/+30
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST will generally use the default value for MSR_IA32_ARCH_CAPABILITIES. When this happens and the host has tsx=on, it is possible to end up with virtual machines that have HLE and RTM disabled, but TSX_CTRL available. If the fleet is then switched to tsx=off, kvm_get_arch_capabilities() will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to use the tsx=off hosts as migration destinations, even though the guests do not have TSX enabled. To allow this migration, allow guests to write to their TSX_CTRL MSR, while keeping the host MSR unchanged for the entire life of the guests. This ensures that TSX remains disabled and also saves MSR reads and writes, and it's okay to do because with tsx=off we know that guests will not have the HLE and RTM features in their CPUID. (If userspace sets bogus CPUID data, we do not expect HLE and RTM to work in guests anyway). Cc: [email protected] Fixes: cbbaa2727aa3 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES") Signed-off-by: Paolo Bonzini <[email protected]>
2021-01-28Fix unsynchronized access to sev members through svm_register_enc_regionPeter Gonda1-7/+10
Grab kvm->lock before pinning memory when registering an encrypted region; sev_pin_memory() relies on kvm->lock being held to ensure correctness when checking and updating the number of pinned pages. Add a lockdep assertion to help prevent future regressions. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Brijesh Singh <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active") Signed-off-by: Peter Gonda <[email protected]> V2 - Fix up patch description - Correct file paths svm.c -> sev.c - Add unlock of kvm->lock on sev_pin_memory error V1 - https://lore.kernel.org/kvm/[email protected]/ Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>