aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-02-04KVM: x86: declare Xen HVM shared info capability and add test caseDavid Woodhouse4-1/+175
Instead of adding a plethora of new KVM_CAP_XEN_FOO capabilities, just add bits to the return value of KVM_CAP_XEN_HVM. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Add event channel interrupt vector upcallDavid Woodhouse6-1/+74
It turns out that we can't handle event channels *entirely* in userspace by delivering them as ExtINT, because KVM is a bit picky about when it accepts ExtINT interrupts from a legacy PIC. The in-kernel local APIC has to have LVT0 configured in APIC_MODE_EXTINT and unmasked, which isn't necessarily the case for Xen guests especially on secondary CPUs. To cope with this, add kvm_xen_get_interrupt() which checks the evtchn_pending_upcall field in the Xen vcpu_info, and delivers the Xen upcall vector (configured by KVM_XEN_ATTR_TYPE_UPCALL_VECTOR) if it's set regardless of LAPIC LVT0 configuration. This gives us the minimum support we need for completely userspace-based implementation of event channels. This does mean that vcpu_enter_guest() needs to check for the evtchn_pending_upcall flag being set, because it can't rely on someone having set KVM_REQ_EVENT unless we were to add some way for userspace to do so manually. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: register vcpu time info regionJoao Martins4-0/+23
Allow the Xen emulated guest the ability to register secondary vcpu time information. On Xen guests this is used in order to be mapped to userspace and hence allow vdso gettimeofday to work. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: setup pvclock updatesJoao Martins2-15/+21
Parameterise kvm_setup_pvclock_page() a little bit so that it can be invoked for different gfn_to_hva_cache structures, and with different offsets. Then we can invoke it for the normal KVM pvclock and also for the Xen one in the vcpu_info. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: register vcpu infoJoao Martins3-1/+29
The vcpu info supersedes the per vcpu area of the shared info page and the guest vcpus will use this instead. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: Ankur Arora <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Add KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTRDavid Woodhouse4-0/+65
This will be used for per-vCPU setup such as runstate and vcpu_info. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: update wallclock regionJoao Martins3-8/+43
Wallclock on Xen is written in the shared_info page. To that purpose, export kvm_write_wall_clock() and pass on the GPA of its location to populate the shared_info wall clock data. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04xen: add wc_sec_hi to struct shared_infoDavid Woodhouse2-1/+6
Xen added this in 2015 (Xen 4.6). On x86_64 and Arm it fills what was previously a 32-bit hole in the generic shared_info structure; on i386 it had to go at the end of struct arch_shared_info. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: register shared_info pageJoao Martins3-4/+42
Add KVM_XEN_ATTR_TYPE_SHARED_INFO to allow hypervisor to know where the guest's shared info page is. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: add definitions of compat_shared_info, compat_vcpu_infoDavid Woodhouse1-0/+36
There aren't a lot of differences for the things that the kernel needs to care about, but there are a few. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: latch long_mode when hypercall page is set upDavid Woodhouse3-1/+24
Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTRJoao Martins4-0/+63
This will be used to set up shared info pages etc. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Add kvm_xen_enabled static keyDavid Woodhouse3-2/+27
The code paths for Xen support are all fairly lightweight but if we hide them behind this, they're even *more* lightweight for any system which isn't actually hosting Xen guests. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Move KVM_XEN_HVM_CONFIG handling to xen.cDavid Woodhouse3-13/+20
This is already more complex than the simple memcpy it originally had. Move it to xen.c with the rest of the Xen support. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Fix coexistence of Xen and Hyper-V hypercallsJoao Martins3-17/+68
Disambiguate Xen vs. Hyper-V calls by adding 'orl $0x80000000, %eax' at the start of the Hyper-V hypercall page when Xen hypercalls are also enabled. That bit is reserved in the Hyper-V ABI, and those hypercall numbers will never be used by Xen (because it does precisely the same trick). Switch to using kvm_vcpu_write_guest() while we're at it, instead of open-coding it. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: intercept xen hypercalls if enabledJoao Martins10-32/+369
Add a new exit reason for emulator to handle Xen hypercalls. Since this means KVM owns the ABI, dispense with the facility for the VMM to provide its own copy of the hypercall pages; just fill them in directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page. This behaviour is enabled by a new INTERCEPT_HCALL flag in the KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag being returned from the KVM_CAP_XEN_HVM check. Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it to the nascent xen.c while we're at it, and add a test case. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: Fix __user pointer handling for hypercall page installationDavid Woodhouse1-3/+5
The address we give to memdup_user() isn't correctly tagged as __user. This is harmless enough as it's a one-off use and we're doing exactly the right thing, but fix it anyway to shut the checker up. Otherwise it'll whine when the (now legacy) code gets moved around in a later patch. Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/xen: fix Xen hypercall page msr handlingJoao Martins1-2/+3
Xen usually places its MSR at 0x40000000 or 0x40000200 depending on whether it is running in viridian mode or not. Note that this is not ABI guaranteed, so it is possible for Xen to advertise the MSR some place else. Given the way xen_hvm_config() is handled, if the former address is selected, this will conflict with Hyper-V's MSR (HV_X64_MSR_GUEST_OS_ID) which unconditionally uses the same address. Given that the MSR location is arbitrary, move the xen_hvm_config() handling to the top of kvm_set_msr_common() before falling through. Signed-off-by: Joao Martins <[email protected]> Signed-off-by: David Woodhouse <[email protected]>
2021-02-04KVM: x86/mmu: Allow parallel page faults for the TDP MMUBen Gardon1-2/+10
Make the last few changes necessary to enable the TDP MMU to handle page faults in parallel while holding the mmu_lock in read mode. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Mark SPTEs in disconnected pages as removedBen Gardon1-6/+30
When clearing TDP MMU pages what have been disconnected from the paging structure root, set the SPTEs to a special non-present value which will not be overwritten by other threads. This is needed to prevent races in which a thread is clearing a disconnected page table, but another thread has already acquired a pointer to that memory and installs a mapping in an already cleared entry. This can lead to memory leaks and accounting errors. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handlerBen Gardon2-10/+74
When the TDP MMU is allowed to handle page faults in parallel there is the possiblity of a race where an SPTE is cleared and then imediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU mapBen Gardon3-34/+130
To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> [Document new locking rules. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Factor out functions to add/remove TDP MMU pagesBen Gardon1-8/+39
Move the work of adding and removing TDP MMU pages to/from "secondary" data structures to helper functions. These functions will be built on in future commits to enable MMU operations to proceed (mostly) in parallel. No functional change expected. Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Use an rwlock for the x86 MMUBen Gardon10-84/+118
Add a read / write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations in parallel in future commits. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> [Introduce virt/kvm/mmu_lock.h - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04sched: Add cond_resched_rwlockBen Gardon2-0/+52
Safely rescheduling while holding a spin lock is essential for keeping long running kernel operations running smoothly. Add the facility to cond_resched rwlocks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04sched: Add needbreak for rwlocksBen Gardon1-0/+17
Contention awareness while holding a spin lock is essential for reducing latency when long running kernel operations can hold that lock. Add the same contention detection interface for read/write spin locks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04locking/rwlocks: Add contention detection for rwlocksBen Gardon2-6/+25
rwlocks do not currently have any facility to detect contention like spinlocks do. In order to allow users of rwlocks to better manage latency, add contention detection for queued rwlocks. CC: Ingo Molnar <[email protected]> CC: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Acked-by: Waiman Long <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Protect TDP MMU page table memory with RCUBen Gardon4-21/+103
In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU. Protecting concurrent accesses to page table memory from use-after-free bugs could also have been acomplished using walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES, coupling with the barriers in a TLB flush. The use of RCU for this case has several distinct advantages over that approach. 1. Disabling interrupts for long running operations is not desirable. Future commits will allow operations besides page faults to operate without the exclusive protection of the MMU lock and those operations are too long to disable iterrupts for their duration. 2. The use of RCU here avoids long blocking / spinning operations in perfromance critical paths. By freeing memory with an asynchronous RCU API we avoid the longer wait times TLB flushes experience when overlapping with a thread in walk_shadow_page_lockless_begin/end(). 3. RCU provides a separation of concerns when removing memory from the paging structure. Because the RCU callback to free memory can be scheduled immediately after a TLB flush, there's no need for the thread to manually free a queue of pages later, as commit_zap_pages does. Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Clear dirtied pages mask bit before early breakBen Gardon1-2/+2
In clear_dirty_pt_masked, the loop is intended to exit early after processing each of the GFNs with corresponding bits set in mask. This does not work as intended if another thread has already cleared the dirty bit or writable bit on the SPTE. In that case, the loop would proceed to the next iteration early and the bit in mask would not be cleared. As a result the loop could not exit early and would proceed uselessly. Move the unsetting of the mask bit before the check for a no-op SPTE change. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Skip no-op changes in TDP MMU functionsBen Gardon1-2/+4
Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changedBen Gardon1-10/+22
Given certain conditions, some TDP MMU functions may not yield reliably / frequently enough. For example, if a paging structure was very large but had few, if any writable entries, wrprot_gfn_range could traverse many entries before finding a writable entry and yielding because the check for yielding only happens after an SPTE is modified. Fix this issue by moving the yield to the beginning of the loop. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iterBen Gardon3-23/+23
In some functions the TDP iter risks not making forward progress if two threads livelock yielding to one another. This is possible if two threads are trying to execute wrprot_gfn_range. Each could write protect an entry and then yield. This would reset the tdp_iter's walk over the paging structure and the loop would end up repeating the same entry over and over, preventing either thread from making forward progress. Fix this issue by only yielding if the loop has made forward progress since the last yield. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Rename goal_gfn to next_last_level_gfnBen Gardon2-12/+12
The goal_gfn field in tdp_iter can be misleading as it implies that it is the iterator's final goal. It is really a target for the lowest gfn mapped by the leaf level SPTE the iterator will traverse towards. Change the field's name to be more precise. Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_reschedBen Gardon1-29/+13
The flushing and non-flushing variants of tdp_mmu_iter_cond_resched have almost identical implementations. Merge the two functions and add a flush parameter. Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Fix braces in kvm_recover_nx_lpagesBen Gardon1-2/+2
No functional change intended. Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Factor out handling of removed page tablesBen Gardon1-29/+42
Factor out the code to handle a disconnected subtree of the TDP paging structure from the code to handle the change to an individual SPTE. Future commits will build on this to allow asynchronous page freeing. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Don't redundantly clear TDP MMU pt memoryBen Gardon1-1/+0
The KVM MMU caches already guarantee that shadow page table memory will be zeroed, so there is no reason to re-zero the page in the TDP MMU page fault handler. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTEBen Gardon1-0/+2
Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified under the MMU lock. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Add comment on __tdp_mmu_set_spteBen Gardon1-0/+16
__tdp_mmu_set_spte is a very important function in the TDP MMU which already accepts several arguments and will take more in future commits. To offset this complexity, add a comment to the function describing each of the arguemnts. No functional change intended. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: change TDP MMU yield function returns to match cond_reschedBen Gardon1-10/+29
Currently the TDP MMU yield / cond_resched functions either return nothing or return true if the TLBs were not flushed. These are confusing semantics, especially when making control flow decisions in calling functions. To clean things up, change both functions to have the same return value semantics as cond_resched: true if the thread yielded, false if it did not. If the function yielded in the _flush_ version, then the TLBs will have been flushed. Reviewed-by: Peter Feiner <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86: move kvm_inject_gp up from kvm_set_xcr to callersPaolo Bonzini3-14/+8
Push the injection of #GP up to the callers, so that they can just use kvm_complete_insn_gp. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: cleanup DR6/DR7 reserved bits checksPaolo Bonzini1-2/+2
kvm_dr6_valid and kvm_dr7_valid check that bits 63:32 are zero. Using them makes it easier to review the code for inconsistencies. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: move EXIT_FASTPATH_REENTER_GUEST to common codePaolo Bonzini3-22/+15
Now that KVM is using static calls, calling vmx_vcpu_run and vmx_sync_pir_to_irr does not incur anymore the cost of a retpoline. Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST in vendor code. Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04selftest: kvm: x86: test KVM_GET_CPUID2 and guest visible CPUIDs against ↵Vitaly Kuznetsov5-0/+233
KVM_GET_SUPPORTED_CPUID Commit 181f494888d5 ("KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl") revealed that we're not testing KVM_GET_CPUID2 ioctl at all. Add a test for it and also check that from inside the guest visible CPUIDs are equal to it's output. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: x86/mmu: Add '__func__' in rmap_printk()Stephen Zhang2-11/+11
Given the common pattern: rmap_printk("%s:"..., __func__,...) we could improve this by adding '__func__' in rmap_printk(). Signed-off-by: Stephen Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: Replace hard-coded value with #defineKrish Sadhukhan1-1/+1
Replace the hard-coded value for bit# 1 in EFLAGS, with the available #define. Signed-off-by: Krish Sadhukhan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: use .prepare_guest_switch() to handle CPU register save/setupMichael Roth3-47/+56
Currently we save host state like user-visible host MSRs, and do some initial guest register setup for MSR_TSC_AUX and MSR_AMD64_TSC_RATIO in svm_vcpu_load(). Defer this until just before we enter the guest by moving the handling to kvm_x86_ops.prepare_guest_switch() similarly to how it is done for the VMX implementation. Additionally, since handling of saving/restoring host user MSRs is the same both with/without SEV-ES enabled, move that handling to common code. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: remove uneeded fields from host_save_users_msrsMichael Roth3-21/+8
Now that the set of host user MSRs that need to be individually saved/restored are the same with/without SEV-ES, we can drop the .sev_es_restored flag and just iterate through the list unconditionally for both cases. A subsequent patch can then move these loops to a common path. Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: use vmsave/vmload for saving/restoring additional host stateMichael Roth3-43/+10
Using a guest workload which simply issues 'hlt' in a tight loop to generate VMEXITs, it was observed (on a recent EPYC processor) that a significant amount of the VMEXIT overhead measured on the host was the result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to perf: 67.49%--kvm_arch_vcpu_ioctl_run | |--23.13%--vcpu_put | kvm_arch_vcpu_put | | | |--21.31%--native_write_msr | | | --1.27%--svm_set_cr4 | |--16.11%--vcpu_load | | | --15.58%--kvm_arch_vcpu_load | | | |--13.97%--svm_set_cr4 | | | | | |--12.64%--native_read_msr Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and can be saved/restored using 'vmsave'/'vmload' instructions rather than explicit MSR reads/writes. In doing so there is a significant reduction in the svm_vcpu_load/svm_vcpu_put overhead measured for the above workload: 50.92%--kvm_arch_vcpu_ioctl_run | |--19.28%--disable_nmi_singlestep | |--13.68%--vcpu_load | kvm_arch_vcpu_load | | | |--9.19%--svm_set_cr4 | | | | | --6.44%--native_read_msr | | | --3.55%--native_write_msr | |--6.05%--kvm_inject_nmi |--2.80%--kvm_sev_es_mmio_read |--2.19%--vcpu_put | | | --1.25%--kvm_arch_vcpu_put | native_write_msr Quantifying this further, if we look at the raw cycle counts for a normal iteration of the above workload (according to 'rdtscp'), kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with the current behavior. Using 'vmsave'/'vmload', this is reduced to ~2800 cycles, a savings of 39%. While this approach doesn't seem to manifest in any noticeable improvement for more realistic workloads like UnixBench, netperf, and kernel builds, likely due to their exit paths generally involving IO with comparatively high latencies, it does improve overall overhead of KVM_RUN significantly, which may still be noticeable for certain situations. It also simplifies some aspects of the code. With this change, explicit save/restore is no longer needed for the following host MSRs, since they are documented[1] as being part of the VMCB State Save Area: MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_FS_BASE, MSR_GS_BASE and only the following MSR needs individual handling in svm_vcpu_put/svm_vcpu_load: MSR_TSC_AUX We could drop the host_save_user_msrs array/loop and instead handle MSR read/write of MSR_TSC_AUX directly, but we leave that for now as a potential follow-up. Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment registers (and associated hidden state)[2], some of the code previously used to handle this is no longer needed, so we drop it as well. The first public release of the SVM spec[3] also documents the same handling for the host state in question, so we make these changes unconditionally. Also worth noting is that we 'vmsave' to the same page that is subsequently used by 'vmrun' to record some host additional state. This is okay, since, in accordance with the spec[2], the additional state written to the page by 'vmrun' does not overwrite any fields written by 'vmsave'. This has also been confirmed through testing (for the above CPU, at least). [1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2 [2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD [3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01 Suggested-by: Tom Lendacky <[email protected]> Signed-off-by: Michael Roth <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-02-04KVM: SVM: Use asm goto to handle unexpected #UD on SVM instructionsSean Christopherson3-16/+67
Add svm_asm*() macros, a la the existing vmx_asm*() macros, to handle faults on SVM instructions instead of using the generic __ex(), a.k.a. __kvm_handle_fault_on_reboot(). Using asm goto generates slightly better code as it eliminates the in-line JMP+CALL sequences that are needed by __kvm_handle_fault_on_reboot() to avoid triggering BUG() from fixup (which generates bad stack traces). Using SVM specific macros also drops the last user of __ex() and the the last asm linkage to kvm_spurious_fault(), and adds a helper for VMSAVE, which may gain an addition call site in the future (as part of optimizing the SVM context switching). Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>