Age  Commit message  Author  Files  Lines
2021-02-04  KVM: x86/mmu: Use an rwlock for the x86 MMU  (Ben Gardon, 10 files, -84/+118)
Add a read/write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults and other operations in parallel in future commits. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> [Introduce virt/kvm/mmu_lock.h - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  sched: Add cond_resched_rwlock  (Ben Gardon, 2 files, -0/+52)
Safely rescheduling while holding a spin lock is essential for keeping long running kernel operations running smoothly. Add the facility to cond_resched rwlocks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
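A minimal usage sketch of the new interface (cond_resched_rwlock_read()/cond_resched_rwlock_write() are the helpers this change adds; the walk function and process_entry() are illustrative only):

    static void walk_entries(rwlock_t *lock, unsigned long nr_entries)
    {
        unsigned long i;

        read_lock(lock);
        for (i = 0; i < nr_entries; i++) {
            process_entry(i);    /* hypothetical per-entry work */
            /* Drop and re-take the lock if a reschedule is due or the
             * lock is contended, instead of holding it throughout. */
            cond_resched_rwlock_read(lock);
        }
        read_unlock(lock);
    }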
2021-02-04  sched: Add needbreak for rwlocks  (Ben Gardon, 1 file, -0/+17)
Contention awareness while holding a spin lock is essential for reducing latency when long running kernel operations can hold that lock. Add the same contention detection interface for read/write spin locks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
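The shape of the helper is roughly the following sketch, mirroring spin_needbreak() (the exact preemption config guard used upstream may differ):

    static inline int rwlock_needbreak(rwlock_t *lock)
    {
    #ifdef CONFIG_PREEMPTION
        /* Report contention so a long-running lock holder can break out. */
        return rwlock_is_contended(lock);
    #else
        return 0;
    #endif
    }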
2021-02-04  locking/rwlocks: Add contention detection for rwlocks  (Ben Gardon, 2 files, -6/+25)
rwlocks do not currently have any facility to detect contention like spinlocks do. In order to allow users of rwlocks to better manage latency, add contention detection for queued rwlocks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Protect TDP MMU page table memory with RCU  (Ben Gardon, 4 files, -21/+103)
In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU. Protecting concurrent accesses to page table memory from use-after-free bugs could also have been accomplished using walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES, coupled with the barriers in a TLB flush. The use of RCU for this case has several distinct advantages over that approach.
1. Disabling interrupts for long running operations is not desirable. Future commits will allow operations besides page faults to operate without the exclusive protection of the MMU lock and those operations are too long to disable interrupts for their duration.
2. The use of RCU here avoids long blocking / spinning operations in performance critical paths. By freeing memory with an asynchronous RCU API we avoid the longer wait times TLB flushes experience when overlapping with a thread in walk_shadow_page_lockless_begin/end().
3. RCU provides a separation of concerns when removing memory from the paging structure. Because the RCU callback to free memory can be scheduled immediately after a TLB flush, there's no need for the thread to manually free a queue of pages later, as commit_zap_pages does.
Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
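A hedged sketch of the resulting pattern (the structure and helper names below are illustrative, not the exact TDP MMU symbols; the real code returns pages to KVM's MMU caches rather than kfree()ing them):

    struct pt_page {
        struct rcu_head rcu_head;
        u64 sptes[512];
    };

    /* Writer: unlink the child page table, flush TLBs for the range,
     * then let RCU free it once concurrent walkers are done. */
    static void zap_child_pt(struct pt_page __rcu **parent_slot, struct pt_page *child)
    {
        rcu_assign_pointer(*parent_slot, NULL);
        /* ... TLB flush for the affected range ... */
        kfree_rcu(child, rcu_head);
    }

    /* Reader: a lockless walker dereferences page table memory only
     * inside an RCU read-side critical section. */
    static u64 read_spte(struct pt_page __rcu **parent_slot, int i)
    {
        struct pt_page *pt;
        u64 spte = 0;

        rcu_read_lock();
        pt = rcu_dereference(*parent_slot);
        if (pt)
            spte = READ_ONCE(pt->sptes[i]);
        rcu_read_unlock();
        return spte;
    }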
2021-02-04  KVM: x86/mmu: Clear dirtied pages mask bit before early break  (Ben Gardon, 1 file, -2/+2)
In clear_dirty_pt_masked, the loop is intended to exit early after processing each of the GFNs with corresponding bits set in mask. This does not work as intended if another thread has already cleared the dirty bit or writable bit on the SPTE. In that case, the loop skips to the next iteration before clearing the bit in mask, so mask never becomes empty, the loop cannot exit early, and it pointlessly processes the remaining entries. Move the unsetting of the mask bit before the check for a no-op SPTE change. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
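Illustrative sketch of the reordering (the iteration and SPTE helpers below stand in for the real code): the bit must be consumed from mask before the no-op check can 'continue', otherwise mask never empties and the early break never fires:

    for (gfn = base_gfn; gfn < base_gfn + BITS_PER_LONG; gfn++) {
        if (!mask)
            break;                            /* all requested GFNs handled */

        if (!(mask & (1UL << (gfn - base_gfn))))
            continue;

        /* Clear the mask bit *before* checking for a no-op change, so a
         * racing thread that already cleared the SPTE bit cannot leave
         * this bit set and defeat the early break above. */
        mask &= ~(1UL << (gfn - base_gfn));

        if (spte_already_clear(gfn))          /* illustrative no-op check */
            continue;

        clear_dirty_or_writable_bit(gfn);     /* illustrative SPTE update */
    }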
2021-02-04  KVM: x86/mmu: Skip no-op changes in TDP MMU functions  (Ben Gardon, 1 file, -2/+4)
Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Yield in TDP MMU iter even if no SPTEs changed  (Ben Gardon, 1 file, -10/+22)
Given certain conditions, some TDP MMU functions may not yield reliably / frequently enough. For example, if a paging structure was very large but had few, if any writable entries, wrprot_gfn_range could traverse many entries before finding a writable entry and yielding because the check for yielding only happens after an SPTE is modified. Fix this issue by moving the yield to the beginning of the loop. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
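A sketch of the restructured loop, abbreviated from the write-protection path (iteration macro and checks simplified): the yield check now runs for every entry, not only after an SPTE write:

    for_each_tdp_pte(iter, root, root_level, start, end) {
        /* Check for a needed yield up front, so a long run of entries
         * that need no modification can still release the MMU lock. */
        if (tdp_mmu_iter_cond_resched(kvm, &iter, false))
            continue;

        if (!is_writable_pte(iter.old_spte))
            continue;                /* nothing to write-protect here */

        tdp_mmu_set_spte(kvm, &iter, iter.old_spte & ~PT_WRITABLE_MASK);
    }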
2021-02-04  KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter  (Ben Gardon, 3 files, -23/+23)
In some functions the TDP iter risks not making forward progress if two threads livelock yielding to one another. This is possible if two threads are trying to execute wrprot_gfn_range. Each could write protect an entry and then yield. This would reset the tdp_iter's walk over the paging structure and the loop would end up repeating the same entry over and over, preventing either thread from making forward progress. Fix this issue by only yielding if the loop has made forward progress since the last yield. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
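A hedged sketch of the guarded yield helper (close to, but not exactly, the upstream code; at this point mmu_lock is still a spinlock and the restart helper's name is approximate): the iterator remembers the gfn at which it last yielded and refuses to yield again until it has moved past it:

    static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
                                                 struct tdp_iter *iter, bool flush)
    {
        /* No forward progress since the last yield: don't yield again,
         * or two threads restarting each other's walks could livelock. */
        if (iter->next_last_level_gfn == iter->yielded_gfn)
            return false;

        if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
            if (flush)
                kvm_flush_remote_tlbs(kvm);

            cond_resched_lock(&kvm->mmu_lock);

            /* Remember where we yielded and restart the walk from the
             * current target gfn. */
            iter->yielded_gfn = iter->next_last_level_gfn;
            tdp_iter_refresh_walk(iter);    /* restart helper, name approximate */
            return true;
        }

        return false;
    }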
2021-02-04  KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn  (Ben Gardon, 2 files, -12/+12)
The goal_gfn field in tdp_iter can be misleading as it implies that it is the iterator's final goal. It is really a target for the lowest gfn mapped by the leaf level SPTE the iterator will traverse towards. Change the field's name to be more precise. Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched  (Ben Gardon, 1 file, -29/+13)
The flushing and non-flushing variants of tdp_mmu_iter_cond_resched have almost identical implementations. Merge the two functions and add a flush parameter. Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages  (Ben Gardon, 1 file, -2/+2)
No functional change intended. Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Factor out handling of removed page tables  (Ben Gardon, 1 file, -29/+42)
Factor out the code to handle a disconnected subtree of the TDP paging structure from the code to handle the change to an individual SPTE. Future commits will build on this to allow asynchronous page freeing. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory  (Ben Gardon, 1 file, -1/+0)
The KVM MMU caches already guarantee that shadow page table memory will be zeroed, so there is no reason to re-zero the page in the TDP MMU page fault handler. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE  (Ben Gardon, 1 file, -0/+2)
Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified under the MMU lock. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Add comment on __tdp_mmu_set_spte  (Ben Gardon, 1 file, -0/+16)
__tdp_mmu_set_spte is a very important function in the TDP MMU which already accepts several arguments and will take more in future commits. To offset this complexity, add a comment to the function describing each of the arguments. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched  (Ben Gardon, 1 file, -10/+29)
Currently the TDP MMU yield / cond_resched functions either return nothing or return true if the TLBs were not flushed. These are confusing semantics, especially when making control flow decisions in calling functions. To clean things up, change both functions to have the same return value semantics as cond_resched: true if the thread yielded, false if it did not. If the function yielded in the _flush_ version, then the TLBs will have been flushed. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86: move kvm_inject_gp up from kvm_set_xcr to callers  (Paolo Bonzini, 3 files, -14/+8)
Push the injection of #GP up to the callers, so that they can just use kvm_complete_insn_gp. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: cleanup DR6/DR7 reserved bits checks  (Paolo Bonzini, 1 file, -2/+2)
kvm_dr6_valid and kvm_dr7_valid check that bits 63:32 are zero. Using them makes it easier to review the code for inconsistencies. Signed-off-by: Paolo Bonzini <[email protected]>
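For reference, the checks are simple predicates along these lines (sketch of the helpers in KVM's x86 code):

    static inline bool kvm_dr6_valid(u64 data)
    {
        /* Bits 63:32 of DR6 are reserved and must be zero. */
        return !(data >> 32);
    }

    static inline bool kvm_dr7_valid(u64 data)
    {
        /* Likewise for DR7. */
        return !(data >> 32);
    }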
2021-02-04  KVM: move EXIT_FASTPATH_REENTER_GUEST to common code  (Paolo Bonzini, 3 files, -22/+15)
Now that KVM is using static calls, calling vmx_vcpu_run and vmx_sync_pir_to_irr no longer incurs the cost of a retpoline. Therefore there is no longer any need to handle EXIT_FASTPATH_REENTER_GUEST in vendor code. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  selftest: kvm: x86: test KVM_GET_CPUID2 and guest visible CPUIDs against KVM_GET_SUPPORTED_CPUID  (Vitaly Kuznetsov, 5 files, -0/+233)
Commit 181f494888d5 ("KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl") revealed that we're not testing the KVM_GET_CPUID2 ioctl at all. Add a test for it and also check that the CPUIDs visible from inside the guest are equal to its output. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86/mmu: Add '__func__' in rmap_printk()  (Stephen Zhang, 2 files, -11/+11)
Given the common pattern rmap_printk("%s:" ..., __func__, ...), we can simplify callers by adding '__func__' inside rmap_printk() itself. Signed-off-by: Stephen Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
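A sketch of the idea, with __func__ folded into the macro so callers stop passing it (the debug guard shown is approximate):

    #define rmap_printk(fmt, args...)                          \
        do {                                                   \
            if (dbg)                                           \
                printk("%s: " fmt, __func__, ## args);         \
        } while (0)

    /* Callers now simply write: rmap_printk("gfn %llx\n", gfn); */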
2021-02-04  KVM: SVM: Replace hard-coded value with #define  (Krish Sadhukhan, 1 file, -1/+1)
Replace the hard-coded value for bit 1 in EFLAGS with the available #define. Signed-off-by: Krish Sadhukhan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: use .prepare_guest_switch() to handle CPU register save/setup  (Michael Roth, 3 files, -47/+56)
Currently we save host state like user-visible host MSRs, and do some initial guest register setup for MSR_TSC_AUX and MSR_AMD64_TSC_RATIO in svm_vcpu_load(). Defer this until just before we enter the guest by moving the handling to kvm_x86_ops.prepare_guest_switch() similarly to how it is done for the VMX implementation. Additionally, since handling of saving/restoring host user MSRs is the same both with/without SEV-ES enabled, move that handling to common code. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: remove unneeded fields from host_save_users_msrs  (Michael Roth, 3 files, -21/+8)
Now that the set of host user MSRs that need to be individually saved/restored are the same with/without SEV-ES, we can drop the .sev_es_restored flag and just iterate through the list unconditionally for both cases. A subsequent patch can then move these loops to a common path. Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: use vmsave/vmload for saving/restoring additional host state  (Michael Roth, 3 files, -43/+10)
Using a guest workload which simply issues 'hlt' in a tight loop to generate VMEXITs, it was observed (on a recent EPYC processor) that a significant amount of the VMEXIT overhead measured on the host was the result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to perf:

  67.49%--kvm_arch_vcpu_ioctl_run
          |
          |--23.13%--vcpu_put
          |          kvm_arch_vcpu_put
          |          |
          |          |--21.31%--native_write_msr
          |          |
          |           --1.27%--svm_set_cr4
          |
          |--16.11%--vcpu_load
                     |
                      --15.58%--kvm_arch_vcpu_load
                                |
                                |--13.97%--svm_set_cr4
                                           |
                                           |--12.64%--native_read_msr

Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and can be saved/restored using 'vmsave'/'vmload' instructions rather than explicit MSR reads/writes. In doing so there is a significant reduction in the svm_vcpu_load/svm_vcpu_put overhead measured for the above workload:

  50.92%--kvm_arch_vcpu_ioctl_run
          |
          |--19.28%--disable_nmi_singlestep
          |
          |--13.68%--vcpu_load
          |          kvm_arch_vcpu_load
          |          |
          |          |--9.19%--svm_set_cr4
          |          |         |
          |          |          --6.44%--native_read_msr
          |          |
          |           --3.55%--native_write_msr
          |
          |--6.05%--kvm_inject_nmi
          |--2.80%--kvm_sev_es_mmio_read
          |--2.19%--vcpu_put
          |         |
          |          --1.25%--kvm_arch_vcpu_put
          |                   native_write_msr

Quantifying this further, if we look at the raw cycle counts for a normal iteration of the above workload (according to 'rdtscp'), kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with the current behavior. Using 'vmsave'/'vmload', this is reduced to ~2800 cycles, a savings of 39%. While this approach doesn't seem to manifest in any noticeable improvement for more realistic workloads like UnixBench, netperf, and kernel builds, likely due to their exit paths generally involving IO with comparatively high latencies, it does improve overall overhead of KVM_RUN significantly, which may still be noticeable for certain situations. It also simplifies some aspects of the code. With this change, explicit save/restore is no longer needed for the following host MSRs, since they are documented[1] as being part of the VMCB State Save Area:

  MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE,
  MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
  MSR_FS_BASE, MSR_GS_BASE

and only the following MSR needs individual handling in svm_vcpu_put/svm_vcpu_load: MSR_TSC_AUX. We could drop the host_save_user_msrs array/loop and instead handle MSR read/write of MSR_TSC_AUX directly, but we leave that for now as a potential follow-up. Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment registers (and associated hidden state)[2], some of the code previously used to handle this is no longer needed, so we drop it as well. The first public release of the SVM spec[3] also documents the same handling for the host state in question, so we make these changes unconditionally. Also worth noting is that we 'vmsave' to the same page that is subsequently used by 'vmrun' to record some additional host state. This is okay, since, in accordance with the spec[2], the additional state written to the page by 'vmrun' does not overwrite any fields written by 'vmsave'. This has also been confirmed through testing (for the above CPU, at least).
[1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2
[2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD
[3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01
Suggested-by: Tom Lendacky <[email protected]> Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
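A rough sketch of the resulting shape (helper and parameter names illustrative; vmsave()/vmload() stand for the wrappers around the instructions, and host_save_area_pa is the physical address of the per-CPU save area):

    static void host_state_save(unsigned long host_save_area_pa)
    {
        /* One VMSAVE captures the host's syscall/sysenter MSRs and
         * FS/GS/LDTR state, replacing a series of explicit MSR writes. */
        vmsave(host_save_area_pa);
    }

    static void host_state_restore(unsigned long host_save_area_pa)
    {
        /* One VMLOAD restores that state on the way back to the host,
         * leaving only MSR_TSC_AUX to be handled individually. */
        vmload(host_save_area_pa);
    }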
2021-02-04  KVM: SVM: Use asm goto to handle unexpected #UD on SVM instructions  (Sean Christopherson, 3 files, -16/+67)
Add svm_asm*() macros, a la the existing vmx_asm*() macros, to handle faults on SVM instructions instead of using the generic __ex(), a.k.a. __kvm_handle_fault_on_reboot(). Using asm goto generates slightly better code as it eliminates the in-line JMP+CALL sequences that are needed by __kvm_handle_fault_on_reboot() to avoid triggering BUG() from fixup (which generates bad stack traces). Using SVM specific macros also drops the last user of __ex() and the last asm linkage to kvm_spurious_fault(), and adds a helper for VMSAVE, which may gain an additional call site in the future (as part of optimizing the SVM context switching). Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
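A sketch of such a macro, modeled on the vmx_asm*() helpers (formatting and the exact set of macros differ upstream): the asm goto fault label branches straight to kvm_spurious_fault() without an inline JMP+CALL thunk:

    #define svm_asm1(insn, op1, clobber...)                           \
    do {                                                              \
        asm_volatile_goto("1: " __stringify(insn) " %0\n\t"           \
                          _ASM_EXTABLE(1b, %l[fault])                 \
                          :: op1 : clobber : fault);                  \
        return;                                                       \
    fault:                                                            \
        kvm_spurious_fault();                                         \
    } while (0)

    /* e.g. a VMSAVE helper built on top of it: */
    static __always_inline void vmsave(unsigned long pa)
    {
        svm_asm1(vmsave, "a" (pa), "memory");
    }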
2021-02-04  KVM: VMX: Use the kernel's version of VMXOFF  (Sean Christopherson, 2 files, -13/+9)
Drop kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff(). Modify the latter to return -EIO on fault so that KVM can invoke kvm_spurious_fault() when appropriate. In addition to the obvious code reuse, dropping kvm_cpu_vmxoff() also eliminates VMX's last usage of the __ex()/__kvm_handle_fault_on_reboot() macros, thus helping pave the way toward dropping them entirely. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: VMX: Move Intel PT shenanigans out of VMXON/VMXOFF flows  (Sean Christopherson, 1 file, -4/+7)
Move the Intel PT tracking outside of the VMXON/VMXOFF helpers so that a future patch can drop KVM's kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff() without an associated PT functional change, and without losing symmetry between the VMXON and VMXOFF flows. Barring undocumented behavior, this should have no meaningful effects as Intel PT behavior does not interact with CR4.VMXE. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM/nVMX: Use __vmx_vcpu_run in nested_vmx_check_vmentry_hw  (Uros Bizjak, 4 files, -32/+5)
Replace inline assembly in nested_vmx_check_vmentry_hw with a call to __vmx_vcpu_run. The function is not performance critical, so (double) GPR save/restore in __vmx_vcpu_run can be tolerated, as far as performance effects are concerned. Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Reviewed-and-tested-by: Sean Christopherson <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> [sean: dropped versioning info from changelog] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  x86/virt: Mark flags and memory as clobbered by VMXOFF  (David P. Reed, 1 file, -1/+2)
Explicitly tell the compiler that VMXOFF modifies flags (like all VMX instructions), and mark memory as clobbered since VMXOFF must not be reordered and also may have memory side effects (though the kernel really shouldn't be accessing the root VMCS anyways). Practically speaking, adding the clobbers is most likely a nop; the primary motivation is to properly document VMXOFF's behavior. For the flags clobber, both Clang and GCC automatically mark flags as clobbered; this is noted in commit 4b1e54786e48 ("KVM/x86: Use assembly instruction mnemonics instead of .byte streams"), which intentionally removed the previous clobber. But, neither Clang nor GCC documents this behavior, and there's no downside to including the clobber. For the memory clobber, the RFLAGS.IF and CR4.VMXE manipulations that immediately follow VMXOFF have compiler barriers of their own, i.e. VMXOFF can't get reordered after clearing CR4.VMXE, which is really what's of interest. Cc: Randy Dunlap <[email protected]> Signed-off-by: David P. Reed <[email protected]> [sean: rewrote changelog, dropped comment adjustments] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
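The change boils down to tightening the asm constraints, roughly (sketch):

    /* Before: asm volatile ("vmxoff");
     * After:  tell the compiler VMXOFF clobbers RFLAGS and may have
     *         memory side effects, so it cannot be reordered freely. */
    asm volatile ("vmxoff" : : : "cc", "memory");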
2021-02-04  x86/reboot: Force all cpus to exit VMX root if VMX is supported  (Sean Christopherson, 1 file, -20/+10)
Force all CPUs to do VMXOFF (via NMI shootdown) during an emergency reboot if VMX is _supported_, as VMX being off on the current CPU does not prevent other CPUs from being in VMX root (post-VMXON). This fixes a bug where a crash/panic reboot could leave other CPUs in VMX root and prevent them from being woken via INIT-SIPI-SIPI in the new kernel. Fixes: d176720d34c7 ("x86: disable VMX on all CPUs on reboot") Cc: [email protected] Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: David P. Reed <[email protected]> [sean: reworked changelog and further tweaked comment] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  x86/virt: Eat faults on VMXOFF in reboot flows  (Sean Christopherson, 1 file, -5/+12)
Silently ignore all faults on VMXOFF in the reboot flows as such faults are all but guaranteed to be due to the CPU not being in VMX root. Because (a) VMXOFF may be executed in NMI context, e.g. after VMXOFF but before CR4.VMXE is cleared, (b) there's no way to query the CPU's VMX state without faulting, and (c) the whole point is to get out of VMX root, eating faults is the simplest way to achieve the desired behavior. Technically, VMXOFF can fault (or fail) for other reasons, but all other fault and failure scenarios are mode related, i.e. the kernel would have to magically end up in RM, V86, compat mode, at CPL>0, or running with the SMI Transfer Monitor active. The kernel is beyond hosed if any of those scenarios are encountered; trying to do something fancy in the error path to handle them cleanly is pointless. Fixes: 1e9931146c74 ("x86: asm/virtext.h: add cpu_vmxoff() inline function") Reported-by: David P. Reed <[email protected]> Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86: use static calls to reduce kvm_x86_ops overhead  (Jason Baron, 13 files, -197/+193)
Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are covered here except for 'pmu_ops' and 'nested ops'. Here are some numbers running cpuid in a loop of 1 million calls averaged over 5 runs, measured in the vm (lower is better).

Intel Xeon 3000MHz:

             |default    |mitigations=off
  -------------------------------------
  vanilla    |.671s      |.486s
  static call|.573s(-15%)|.458s(-6%)

AMD EPYC 2500MHz:

             |default    |mitigations=off
  -------------------------------------
  vanilla    |.710s      |.609s
  static call|.664s(-6%) |.609s(0%)

Cc: Paolo Bonzini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Jason Baron <[email protected]> Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86: introduce definitions to support static calls for kvm_x86_ops  (Jason Baron, 3 files, -0/+149)
Use static calls to improve kvm_x86_ops performance. Introduce the definitions that will be used by a subsequent patch to actualize the savings. Add a new kvm-x86-ops.h header that can be used for the definition of static calls. This header is also intended to be used to simplify the definition of svm_x86_ops and vmx_x86_ops. Note that all functions in kvm_x86_ops are covered here except for 'pmu_ops' and 'nested ops'. I think they can be covered by static calls in a similar manner, but were omitted from this series to reduce scope and because I don't think they have as large of a performance impact. Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Jason Baron <[email protected]> Message-Id: <e5cc82ead7ab37b2dceb0837a514f3f8bea4f8d1.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <[email protected]>
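A condensed sketch of the static_call pattern involved, shown for a single 'run' hook (illustrative mini ops table; the real per-member boilerplate is macro-generated from kvm-x86-ops.h):

    #include <linux/static_call.h>

    /* Illustrative mini version of the ops table. */
    struct x86_ops_sketch {
        int (*run)(struct kvm_vcpu *vcpu);
    };

    /* Define one static call per ops member, initially NULL. */
    DEFINE_STATIC_CALL_NULL(kvm_x86_run, *(((struct x86_ops_sketch *)0)->run));

    static void ops_update(struct x86_ops_sketch *ops)
    {
        /* Patch the call site to jump directly to the vendor function. */
        static_call_update(kvm_x86_run, ops->run);
    }

    static int do_run(struct kvm_vcpu *vcpu)
    {
        /* static_call() replaces the indirect ops->run(), avoiding the
         * retpoline on the hot path. */
        return static_call(kvm_x86_run)(vcpu);
    }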
2021-02-04  KVM: X86: prepend vmx/svm prefix to additional kvm_x86_ops functions  (Jason Baron, 4 files, -27/+27)
A subsequent patch introduces macros in preparation for simplifying the definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform expands the coverage of the macros. Add vmx/svm prefix to the following functions: update_exception_bitmap(), enable_nmi_window(), enable_irq_window(), update_cr8_intercept and enable_smi_window(). Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Jason Baron <[email protected]> Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: Stop using deprecated jump label APIs  (Cun Li, 4 files, -32/+23)
The use of 'struct static_key' and 'static_key_false' is deprecated. Use the new API. Signed-off-by: Cun Li <[email protected]> Message-Id: <[email protected]> [Make it compile. While at it, rename kvm_no_apic_vcpu to kvm_has_noapic_vcpu; the former reads too much like "true if no vCPU has an APIC". - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: Fix #GP handling for doubly-nested virtualization  (Wei Huang, 1 file, -2/+18)
In the case of nested on nested (L0, L1, L2 are all hypervisors), we do not support emulation of the vVMLOAD/VMSAVE feature; the L0 hypervisor can inject the proper #VMEXIT to inform L1 of what is happening, and L1 can avoid invoking the #GP workaround. For this reason we turn on the guest VM's X86_FEATURE_SVME_ADDR_CHK bit, so that a KVM running inside the VM receives the notification and changes its behavior. Similarly, we check whether the vcpu is in guest mode before emulating the vmware-backdoor instructions. In the nested-on-nested case, we let the guest handle it. Co-developed-by: Bandan Das <[email protected]> Signed-off-by: Bandan Das <[email protected]> Signed-off-by: Wei Huang <[email protected]> Tested-by: Maxim Levitsky <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: Add support for SVM instruction address check change  (Wei Huang, 2 files, -0/+4)
New AMD CPUs have a change that checks the #VMEXIT intercept on special SVM instructions before checking their EAX against reserved memory regions. This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT is triggered before #GP, so KVM doesn't need to intercept and emulate #GP faults on such CPUs. Co-developed-by: Bandan Das <[email protected]> Signed-off-by: Bandan Das <[email protected]> Signed-off-by: Wei Huang <[email protected]> Reviewed-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: SVM: Add emulation support for #GP triggered by SVM instructions  (Bandan Das, 1 file, -18/+91)
While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD CPUs check EAX against reserved memory regions (e.g. SMM memory on the host) before checking the VMCB's instruction intercept. If EAX falls into such a memory area, #GP is triggered before VMEXIT. This causes problems under nested virtualization. To solve this problem, KVM needs to trap #GP and check the instruction that triggered it; if it is one of the VM execution instructions, KVM emulates it. Co-developed-by: Wei Huang <[email protected]> Signed-off-by: Wei Huang <[email protected]> Signed-off-by: Bandan Das <[email protected]> Message-Id: <[email protected]> [Conditionally enable #GP intercept. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: x86: Factor out x86 instruction emulation with decoding  (Wei Huang, 2 files, -23/+41)
Move the instruction decode part out of x86_emulate_instruction() so that it can be used in other places. Also, kvm_clear_exception_queue() is moved inside the if-statement as it doesn't apply when KVM is coming back from userspace. Co-developed-by: Bandan Das <[email protected]> Signed-off-by: Bandan Das <[email protected]> Signed-off-by: Wei Huang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW  (Chenyi Qiang, 7 files, -25/+38)
DR6_INIT contains the 1-reserved bits as well as the bit that is cleared to 0 when the condition (e.g. RTM) happens. The value can be used to initialize dr6 and also serves as the XOR mask between the #DB exit qualification (or payload) and DR6. Since DR6_INIT is used as an initial value only once, rename it to DR6_ACTIVE_LOW and apply it in other places as well, which makes the incoming changes for the bus lock debug exception simpler. Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
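Sketch of the resulting constants (DR6 bit values as architecturally defined; the DR6_FIXED_1 derivation follows the bracketed note above):

    #define DR6_ACTIVE_LOW  0xffff0ff0   /* 1-reserved bits plus active-low bits such as RTM */
    #define DR6_VOLATILE    0x0001e00f   /* bits reported per-exception (B0-B3, BD, BS, BT, RTM) */
    #define DR6_FIXED_1     (DR6_ACTIVE_LOW & ~DR6_VOLATILE)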
2021-02-04  selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIES  (Like Xu, 5 files, -1/+169)
This test will check the effect of various CPUID settings on the MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes with KVM_SET_MSR is _not_ modified from the guest and can be retrieved with KVM_GET_MSR, and check that invalid LBR formats are rejected. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES  (Like Xu, 1 file, -1/+8)
Userspace can enable the guest LBR feature when the exact LBR format value supported by the host is written to MSR_IA32_PERF_CAPABILITIES and LBR is also compatible with the vPMU version and host CPU model. The LBR can be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Release guest LBR event via lazy release mechanism  (Like Xu, 3 files, -1/+24)
The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to indicate whether a guest LBR event is still needed by the vcpu. If the vcpu no longer accesses LBR related registers within a scheduling time slice, and the enable bit of LBR has been unset, the vPMU will treat the guest LBR event as a bland event of a vPMC counter and release it as usual. Also, the pass-through state of the LBR record MSRs is cancelled. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMI  (Like Xu, 5 files, -3/+39)
The current vPMU only supports Architecture Version 2. According to Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual PMI and KVM emulates this by clearing the LBR bit (bit 0) in IA32_DEBUGCTL. The guest then needs to re-enable IA32_DEBUGCTL.LBR to resume recording branches. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
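A hedged sketch of the emulation on a virtualized PMI (helper name and call site illustrative; the DEBUGCTL bit definitions are the standard ones from msr-index.h):

    static void freeze_lbrs_on_pmi(struct kvm_vcpu *vcpu)
    {
        u64 debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);

        if (!(debugctl & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI))
            return;

        /* "Freeze": clear IA32_DEBUGCTL.LBR (bit 0) on the PMI; the guest
         * must set it again to resume branch recording. */
        debugctl &= ~DEBUGCTLMSR_LBR;
        vmcs_write64(GUEST_IA32_DEBUGCTL, debugctl);
    }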
2021-02-04  KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellation  (Like Xu, 2 files, -0/+16)
When the LBR record MSRs have already been passed through, there is no need to call vmx_update_intercept_for_lbr_msrs() again and again, and vice versa. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE  (Like Xu, 3 files, -3/+135)
In addition to DEBUGCTLMSR_LBR, any KVM trap caused by an LBR MSR access will result in the creation of a per-vcpu guest LBR event. If the guest LBR event is scheduled on with the corresponding vcpu context, KVM will pass through all LBR record MSRs to the guest. The LBR callstack mechanism implemented in the host helps save/restore the guest LBR records during the event context switches, which saves a lot of overhead compared to saving/restoring tens of LBR MSRs (e.g. 32 LBR record entries) in the much more frequent VMX transitions. To avoid reclaiming LBR resources from any higher priority event on the host, KVM always checks the existence of the guest LBR event and its state before vm-entry, as late as possible. A negative result cancels the pass-through state, and it also prevents real register accesses and potential data leakage. If the host reclaims the LBR between two checks, the interception state and LBR records can be safely preserved due to native save/restore support from the guest LBR event. KVM emits a pr_warn() when the LBR hardware is unavailable to the guest LBR event. The administrator is supposed to remind users that the guest results may be inaccurate if someone is using LBR to record the hypervisor on the host side. Suggested-by: Andi Kleen <[email protected]> Co-developed-by: Wei Wang <[email protected]> Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Create a guest LBR event when vcpu sets DEBUGCTLMSR_LBR  (Like Xu, 3 files, -0/+76)
When the vcpu sets DEBUGCTLMSR_LBR in MSR_IA32_DEBUGCTLMSR, the KVM handler creates a guest LBR event which enables the callstack mode and to which no hardware counter is assigned. The host perf schedules and enables this event as usual, but in an exclusive way. The guest LBR event will be released when the vPMU is reset, but soon the lazy release mechanism will be applied to this event like a vPMC. Suggested-by: Andi Kleen <[email protected]> Co-developed-by: Wei Wang <[email protected]> Signed-off-by: Wei Wang <[email protected]> Signed-off-by: Like Xu <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04  KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled  (Like Xu, 4 files, -2/+25)
Userspace can set bits [0, 5] of the IA32_PERF_CAPABILITIES MSR, which tell about the record format stored in the LBR records. The LBR will be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>