path: root/arch/x86/kvm
Age | Commit message | Author | Files | Lines
2018-10-29  x86: Clean up 'sizeof x' => 'sizeof(x)'  (Jordan Borgner, 3 files, -33/+33)
"sizeof(x)" is the canonical coding style used in arch/x86 most of the time. Fix the few places that didn't follow the convention. (Also do some whitespace cleanups in a few places while at it.) [ mingo: Rewrote the changelog. ] Signed-off-by: Jordan Borgner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/20181028125828.7rgammkgzep2wpam@JordanDesktop Signed-off-by: Ingo Molnar <[email protected]>
2018-10-25  Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 14 files, -1050/+2358)
Pull KVM updates from Radim Krčmář:

 "ARM:
   - Improved guest IPA space support (32 to 52 bits)
   - RAS event delivery for 32bit
   - PMU fixes
   - Guest entry hardening
   - Various cleanups
   - Port of dirty_log_test selftest

  PPC:
   - Nested HV KVM support for radix guests on POWER9. The performance
     is much better than with PR KVM. Migration and arbitrary level of
     nesting is supported.
   - Disable nested HV-KVM on early POWER9 chips that need a particular
     hardware bug workaround
   - One VM per core mode to prevent potential data leaks
   - PCI pass-through optimization
   - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

  s390:
   - Initial version of AP crypto virtualization via vfio-mdev
   - Improvement for vfio-ap
   - Set the host program identifier
   - Optimize page table locking

  x86:
   - Enable nested virtualization by default
   - Implement Hyper-V IPI hypercalls
   - Improve #PF and #DB handling
   - Allow guests to use Enlightened VMCS
   - Add migration selftests for VMCS and Enlightened VMCS
   - Allow coalesced PIO accesses
   - Add an option to perform nested VMCS host state consistency check
     through hardware
   - Automatic tuning of lapic_timer_advance_ns
   - Many fixes, minor improvements, and cleanups"

* tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
  Revert "kvm: x86: optimize dr6 restore"
  KVM: PPC: Optimize clearing TCEs for sparse tables
  x86/kvm/nVMX: tweak shadow fields
  selftests/kvm: add missing executables to .gitignore
  KVM: arm64: Safety check PSTATE when entering guest and handle IL
  KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
  arm/arm64: KVM: Enable 32 bits kvm vcpu events support
  arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
  KVM: arm64: Fix caching of host MDCR_EL2 value
  KVM: VMX: enable nested virtualization by default
  KVM/x86: Use 32bit xor to clear registers in svm.c
  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
  kvm: vmx: Defer setting of DR6 until #DB delivery
  kvm: x86: Defer setting of CR2 until #PF delivery
  kvm: x86: Add payload operands to kvm_multiple_exception
  kvm: x86: Add exception payload fields to kvm_vcpu_events
  kvm: x86: Add has_payload and payload to kvm_queued_exception
  KVM: Documentation: Fix omission in struct kvm_vcpu_events
  KVM: selftests: add Enlightened VMCS test
  ...
2018-10-24  KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned  (KarimAllah Ahmed, 1 file, -1/+1)
The spec only requires the posted interrupt descriptor address to be 64-bytes aligned (i.e. bits[0:5] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address without enforcing a page alignment. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Reviewed-by: Krish Sadhuhan <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
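A hedged sketch of the relaxed check the message describes; the function and parameter names are illustrative rather than the kernel's exact helpers:

```c
#include <stdbool.h>
#include <stdint.h>

/* 64-byte alignment means bits [5:0] of the address must be zero;
 * beyond that, only require the address to fit below the guest's
 * maximum physical address (maxphyaddr < 64 assumed). */
static bool pi_desc_addr_valid(uint64_t addr, unsigned int maxphyaddr)
{
    if (addr & 0x3f)
        return false;                 /* not 64-byte aligned */
    return (addr >> maxphyaddr) == 0; /* must not exceed max physical address */
}
```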
2018-10-24  Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace  (Linus Torvalds, 1 file, -10/+1)
Pull siginfo updates from Eric Biederman:

 "I have been slowly sorting out siginfo and this is the culmination of that work.

  The primary result is in several ways the signal infrastructure has been made less error prone. The code has been updated so that manually specifying SEND_SIG_FORCED is never necessary. The conversion to the new siginfo sending functions is now complete, which makes it difficult to send a signal without filling in the proper siginfo fields.

  At the tail end of the patchset comes the optimization of decreasing the size of struct siginfo in the kernel from 128 bytes to about 48 bytes on 64bit. The fundamental observation that enables this is by definition none of the known ways to use struct siginfo uses the extra bytes.

  This comes at the cost of a small user space observable difference. For the rare case of siginfo being injected into the kernel only what can be copied into kernel_siginfo is delivered to the destination, the rest of the bytes are set to 0. For cases where the signal and the si_code are known this is safe, because we know those bytes are not used. For cases where the signal and si_code combination is unknown the bits that won't fit into struct kernel_siginfo are tested to verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find anything that would break because of the above change. If it turns out I did break something it will take just the revert of a single change to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to sigqueueinfo over si->signo, so bit the bullet and added the complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal number is passed into the system calls. Something no sane application will do but something a malicious program or a fuzzer might do. So I have fixed the code that performs the bounds checks to ensure negative signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-23  Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -1/+10)
Pull x86 cpu updates from Ingo Molnar:

 "The main changes in this cycle were:

  - Add support for the "Dhyana" x86 CPUs by Hygon: these are licensed
    based on the AMD Zen architecture, and are built and sold in China,
    for domestic datacenter use. The code is pretty close to AMD
    support, mostly with a few quirks and enumeration differences. (Pu Wen)

  - Enable CPUID support on Cyrix 6x86/6x86L processors"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tools/cpupower: Add Hygon Dhyana support
  cpufreq: Add Hygon Dhyana support
  ACPI: Add Hygon Dhyana support
  x86/xen: Add Hygon Dhyana support to Xen
  x86/kvm: Add Hygon Dhyana support to KVM
  x86/mce: Add Hygon Dhyana support to the MCA infrastructure
  x86/bugs: Add Hygon Dhyana to the respective mitigation machinery
  x86/apic: Add Hygon Dhyana support
  x86/pci, x86/amd_nb: Add Hygon Dhyana support to PCI and northbridge
  x86/amd_nb: Check vendor in AMD-only functions
  x86/alternative: Init ideal_nops for Hygon Dhyana
  x86/events: Add Hygon Dhyana support to PMU infrastructure
  x86/smpboot: Do not use BSP INIT delay and MWAIT to idle on Dhyana
  x86/cpu/mtrr: Support TOP_MEM2 and get MTRR number
  x86/cpu: Get cache info and setup cache cpumap for Hygon Dhyana
  x86/cpu: Create Hygon Dhyana architecture support file
  x86/CPU: Change query logic so CPUID is enabled before testing
  x86/CPU: Use correct macros for Cyrix calls
2018-10-23  Revert "kvm: x86: optimize dr6 restore"  (Radim Krčmář, 1 file, -9/+4)
This reverts commit 0e0a53c551317654e2d7885fdfd23299fee99b6b. As Christian Ehrhardt noted: The most common case is that vcpu->arch.dr6 and the host's %dr6 value are not related at all because ->switch_db_regs is zero. To do this all correctly, we must handle the case where the guest leaves an arbitrary unused value in vcpu->arch.dr6 before disabling breakpoints again. However, this means that vcpu->arch.dr6 is not suitable to detect the need for a %dr6 clear. Signed-off-by: Radim Krčmář <[email protected]>
2018-10-19  x86/kvm/nVMX: tweak shadow fields  (Vitaly Kuznetsov, 2 files, -9/+6)
It seems we have some leftovers from times when 'unrestricted guest' wasn't exposed to L1. Stop shadowing GUEST_CS_{BASE,LIMIT,AR_SELECTOR} and GUEST_ES_BASE, shadow GUEST_SS_AR_BYTES as it was found that some hypervisors (e.g. Hyper-V without Enlightened VMCS) access it pretty often. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: VMX: enable nested virtualization by default  (Paolo Bonzini, 1 file, -1/+1)
With live migration support and finally a good solution for exception event injection, nested VMX should be ready for having a stable userspace ABI. The results of syzkaller fuzzing are not perfect but not horrible either (and might be partially due to running on GCE, so that effectively we're testing three-level nesting on a fork of upstream KVM!). Enabling it by default seems like a nice way to conclude the 4.20 pull request. :) Unfortunately, enabling nested SVM in 2009 (commit 4b6e4dca701) was a bit premature. However, until live migration support is in place we can reasonably expect that it does not offer much in terms of ABI guarantees. Therefore we are still in time to break things and conform as much as possible to the interface used for VMX. Suggested-by: Jim Mattson <[email protected]> Suggested-by: Liran Alon <[email protected]> Reviewed-by: Liran Alon <[email protected]> Celebrated-by: Liran Alon <[email protected]> Celebrated-by: Wanpeng Li <[email protected]> Celebrated-by: Wincy Van <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM/x86: Use 32bit xor to clear registers in svm.c  (Uros Bizjak, 2 files, -16/+18)
x86_64 zero-extends a 32bit xor operation to the full 64bit register. Also add a comment and remove an unnecessary instruction suffix in vmx.c.

Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD  (Jim Mattson, 1 file, -0/+5)
This is a per-VM capability which can be enabled by userspace so that the faulting linear address will be included with the information about a pending #PF in L2, and the "new DR6 bits" will be included with the information about a pending #DB in L2. With this capability enabled, the L1 hypervisor can now intercept #PF before CR2 is modified. Under VMX, the L1 hypervisor can now intercept #DB before DR6 and DR7 are modified. When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should generally provide an appropriate payload when injecting a #PF or #DB exception via KVM_SET_VCPU_EVENTS. However, to support restoring old checkpoints, this payload is not required. Note that bit 16 of the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. This is the reverse of DR6.RTM, which is cleared in this scenario. This capability also enables exception.pending in struct kvm_vcpu_events, which allows userspace to distinguish between pending and injected exceptions. Reported-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
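A minimal userspace sketch of switching on such a per-VM capability (assumes kernel headers new enough to define KVM_CAP_EXCEPTION_PAYLOAD; error handling omitted):

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Enable the per-VM capability on an existing VM fd. */
int enable_exception_payload(int vm_fd)
{
    struct kvm_enable_cap cap = { .cap = KVM_CAP_EXCEPTION_PAYLOAD };

    return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}
```

Once enabled, exception payloads flow through KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS as described above.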
2018-10-17  kvm: vmx: Defer setting of DR6 until #DB delivery  (Jim Mattson, 2 files, -31/+62)
When exception payloads are enabled by userspace (which is not yet possible) and a #DB is raised in L2, defer the setting of DR6 until later. Under VMX, this allows the L1 hypervisor to intercept the fault before DR6 is modified. Under SVM, DR6 is modified before L1 can intercept the fault (as has always been the case with DR7).

Note that the payload associated with a #DB exception includes only the "new DR6 bits." When the payload is delivered, DR6.B0-B3 will be cleared and DR6.RTM will be set prior to merging in the new DR6 bits.

Also note that bit 16 in the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. Though the reverse of DR6.RTM, this makes the #DB payload field compatible with both the pending debug exceptions field under VMX and the exit qualification for #DB exceptions under VMX.

Reported-by: Jim Mattson <[email protected]>
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
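A hedged sketch of that delivery-time merge; the constants follow the SDM, and the helper name is illustrative rather than the kernel's:

```c
#include <stdint.h>

#define DR6_B0_B3 0x000000000000000fULL  /* breakpoint-hit bits B0-B3 */
#define DR6_RTM   0x0000000000010000ULL  /* cleared for a #DB inside an RTM region */

/* Merge the "new DR6 bits" payload into DR6 at delivery time: clear B0-B3,
 * set RTM, or in the payload, then flip RTM if payload bit 16 was set
 * (payload bit 16 is the inverse of DR6.RTM). */
uint64_t merge_db_payload(uint64_t dr6, uint64_t payload)
{
    dr6 &= ~DR6_B0_B3;
    dr6 |= DR6_RTM;
    dr6 |= payload;
    dr6 ^= payload & DR6_RTM;
    return dr6;
}
```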
2018-10-17  kvm: x86: Defer setting of CR2 until #PF delivery  (Jim Mattson, 4 files, -21/+62)
When exception payloads are enabled by userspace (which is not yet possible) and a #PF is raised in L2, defer the setting of CR2 until the #PF is delivered. This allows the L1 hypervisor to intercept the fault before CR2 is modified. For backwards compatibility, when exception payloads are not enabled by userspace, kvm_multiple_exception modifies CR2 when the #PF exception is raised. Reported-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  kvm: x86: Add payload operands to kvm_multiple_exception  (Jim Mattson, 1 file, -7/+15)
kvm_multiple_exception now takes two additional operands: has_payload and payload, so that updates to CR2 (and DR6 under VMX) can be delayed until the exception is delivered. This is necessary to properly emulate VMX or SVM hardware behavior for nested virtualization. The new behavior is triggered by vcpu->kvm->arch.exception_payload_enabled, which will (later) be set by a new per-VM capability, KVM_CAP_EXCEPTION_PAYLOAD. Reported-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  kvm: x86: Add exception payload fields to kvm_vcpu_events  (Jim Mattson, 1 file, -16/+45)
The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a later commit) adds the following fields to struct kvm_vcpu_events: exception_has_payload, exception_payload, and exception.pending. With this capability set, all of the details of vcpu->arch.exception, including the payload for a pending exception, are reported to userspace in response to KVM_GET_VCPU_EVENTS. With this capability clear, the original ABI is preserved, and the exception.injected field is set for either pending or injected exceptions. When userspace calls KVM_SET_VCPU_EVENTS with KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer translated to exception.pending. KVM_SET_VCPU_EVENTS can now only establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set. Reported-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  kvm: x86: Add has_payload and payload to kvm_queued_exception  (Jim Mattson, 1 file, -0/+8)
The payload associated with a #PF exception is the linear address of the fault to be loaded into CR2 when the fault is delivered. The payload associated with a #DB exception is a mask of the DR6 bits to be set (or in the case of DR6.RTM, cleared) when the fault is delivered. Add fields has_payload and payload to kvm_queued_exception to track payloads for pending exceptions. The new fields are introduced here, but for now, they are just cleared. Reported-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
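As a sketch, the resulting bookkeeping might look like the following; only has_payload and payload come from this commit, the surrounding layout is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative shape of the queued-exception state. */
struct queued_exception {
    bool pending;          /* queued, not yet injected */
    bool injected;
    uint8_t nr;            /* vector number */
    bool has_error_code;
    uint32_t error_code;
    bool has_payload;      /* new: a payload accompanies this exception */
    uint64_t payload;      /* new: CR2 for #PF, DR6 bit mask for #DB */
};
```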
2018-10-17  x86/kvm/nVMX: nested state migration for Enlightened VMCS  (Vitaly Kuznetsov, 2 files, -20/+64)
Add support for get/set of nested state when Enlightened VMCS is in use. A new KVM_STATE_NESTED_EVMCS flag is added to indicate that eVMCS was enabled on the vCPU.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/nVMX: allow bare VMXON state migration  (Vitaly Kuznetsov, 1 file, -7/+8)
It is perfectly valid for a guest to do VMXON and not do VMPTRLD. This state needs to be preserved on migration. Cc: [email protected] Fixes: 8fcc4b5923af5de58b80b53a069453b135693304 Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/lapic: preserve gfn_to_hva_cache len on cache reinit  (Vitaly Kuznetsov, 1 file, -2/+10)
vcpu->arch.pv_eoi is accessible through both HV_X64_MSR_VP_ASSIST_PAGE and MSR_KVM_PV_EOI_EN, so on migration userspace may try to restore them in any order. The values match; however, kvm_lapic_enable_pv_eoi() uses a different length: for the Hyper-V case it is the whole struct hv_vp_assist_page, for the KVM-native case it is 8 bytes. If the KVM-native MSR is restored last, the cache is reinitialized with len=8, so trying to access the VP assist page beyond 8 bytes with kvm_read_guest_cached() will fail. Check whether we are re-initializing the cache for the same address and preserve the length in case it was greater.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
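A sketch of the fix as described, using illustrative types in place of gfn_to_hva_cache:

```c
#include <stddef.h>
#include <stdint.h>

struct pv_eoi_cache {      /* stand-in for gfn_to_hva_cache */
    uint64_t gpa;
    size_t len;
};

/* On reinit for the same guest address, keep the greater of the two
 * lengths so cached accesses beyond the shorter one keep working. */
void cache_reinit(struct pv_eoi_cache *c, uint64_t gpa, size_t len)
{
    if (c->gpa == gpa && c->len > len)
        len = c->len;
    c->gpa = gpa;
    c->len = len;
}
```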
2018-10-17  x86/kvm/hyperv: don't clear VP assist pages on init  (Vitaly Kuznetsov, 1 file, -1/+7)
VP assist pages may hold valuable data which needs to be preserved across migration. Clean only the PV EOI portion of the data on init; the guest is responsible for making sure there's no garbage in the rest. This will be used for nVMX migration: the eVMCS address needs to be preserved.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: optimize prepare_vmcs02{,_full} for Enlightened VMCS case  (Vitaly Kuznetsov, 1 file, -53/+65)
When Enlightened VMCS is in use by the L1 hypervisor we can avoid vmwriting VMCS fields which did not change.

Our first goal is to achieve minimal impact on the traditional VMCS case, so we're not wrapping each vmwrite() with an if-changed checker. We also can't utilize static keys as Enlightened VMCS usage is per-guest.

This patch implements the simplest solution: checking fields in groups. We skip single vmwrite() statements as doing the check will cost us something even in the non-evmcs case and the win is tiny.

Unfortunately, this makes the prepare_vmcs02{,_full}() code Enlightened VMCS-dependent (and a bit ugly).

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR  (Vitaly Kuznetsov, 1 file, -7/+108)
Per Hyper-V TLFS 5.0b: "The L1 hypervisor may choose to use enlightened VMCSs by writing 1 to the corresponding field in the VP assist page (see section 7.8.7). Another field in the VP assist page controls the currently active enlightened VMCS. Each enlightened VMCS is exactly one page (4 KB) in size and must be initially zeroed. No VMPTRLD instruction must be executed to make an enlightened VMCS active or current. After the L1 hypervisor performs a VM entry with an enlightened VMCS, the VMCS is considered active on the processor. An enlightened VMCS can only be active on a single processor at the same time. The L1 hypervisor can execute a VMCLEAR instruction to transition an enlightened VMCS from the active to the non-active state. Any VMREAD or VMWRITE instructions while an enlightened VMCS is active is unsupported and can result in unexpected behavior." Keep Enlightened VMCS structure for the current L2 guest permanently mapped from struct nested_vmx instead of mapping it every time. Suggested-by: Ladi Prosek <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: add enlightened VMCS state  (Vitaly Kuznetsov, 1 file, -17/+423)
Add an hv_evmcs pointer and implement copy_enlightened_to_vmcs12() and copy_vmcs12_to_enlightened(). The prepare_vmcs02()/prepare_vmcs02_full() separation is not valid for Enlightened VMCS; do a full sync for now.

Suggested-by: Ladi Prosek <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability  (Vitaly Kuznetsov, 3 files, -0/+61)
Enlightened VMCS is opt-in. The current version does not contain all fields supported by nested VMX so we must not advertise the corresponding VMX features if enlightened VMCS is enabled. Userspace is given the enlightened VMCS version supported by KVM as part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to be advertised to the nested hypervisor, currently done via a cpuid leaf for Hyper-V. Suggested-by: Ladi Prosek <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
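A hedged userspace sketch of the opt-in flow; that the supported version is returned through a pointer passed in args[0] is our reading of the commit, so verify against Documentation/virtual/kvm/api.txt before relying on it:

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Enable eVMCS for a vCPU; KVM is assumed to write the supported eVMCS
 * version through the user pointer passed in args[0]. */
int enable_evmcs(int vcpu_fd, uint16_t *evmcs_version)
{
    struct kvm_enable_cap cap = {
        .cap = KVM_CAP_HYPERV_ENLIGHTENED_VMCS,
        .args = { (uintptr_t)evmcs_version },
    };

    return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
```

The returned version is what userspace then advertises to the nested hypervisor, e.g. via the Hyper-V cpuid leaf mentioned above.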
2018-10-17  KVM: VMX: refactor evmcs_sanitize_exec_ctrls()  (Vitaly Kuznetsov, 1 file, -61/+47)
Split off EVMCS1_UNSUPPORTED_* macros so we can re-use them when enabling Enlightened VMCS for Hyper-V on KVM. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: hyperv: define VP assist page helpers  (Ladi Prosek, 5 files, -6/+29)
The state related to the VP assist page is still managed by the LAPIC code in the pv_eoi field. Signed-off-by: Ladi Prosek <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: x86: reintroduce pte_list_remove, but including mmu_spte_clear_track_bits  (Wei Yang, 1 file, -3/+9)
rmap_remove() removes the sptep after locating the correct rmap_head, but in several cases the caller already knows the correct rmap_head. This patch introduces a new pte_list_remove(); because it is known that the spte is present (or it would not have an rmap_head), it is safe to remove the tracking bits without any previous check.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
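Taken together with the rename in the entry below, the resulting helper plausibly looks like this (kernel-side fragment, not standalone; the two mmu.c internals named are from the commits, the rest is context):

```c
/* New pte_list_remove(): the caller already holds the right rmap_head,
 * and an spte that has an rmap_head is known to be present, so the
 * tracking bits can be cleared without a prior presence check. */
static void pte_list_remove(struct kvm_rmap_head *rmap_head, u64 *sptep)
{
    mmu_spte_clear_track_bits(sptep);
    __pte_list_remove(sptep, rmap_head);  /* the renamed old pte_list_remove() */
}
```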
2018-10-17  KVM: x86: rename pte_list_remove to __pte_list_remove  (Wei Yang, 1 file, -8/+8)
This is a preparatory patch for a further change.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  kvm/x86: fix some typos  (Peng Hao, 1 file, -2/+2)
Signed-off-by: Peng Hao <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM/VMX: Change hv flush logic when ept tables are mismatched  (Lan Tianyu, 1 file, -7/+8)
If EPT table pointers are mismatched, flushing the TLB for each vCPU via the hv flush interface still helps to reduce vmexits which are triggered by IPI and INVEPT emulation.

Signed-off-by: Lan Tianyu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM/x86: Use 32bit xor to clear register  (Uros Bizjak, 1 file, -2/+2)
x86_64 zero-extends a 32bit xor to the full 64bit register. Use the %k asm operand modifier to force a 32bit register and save 268 bytes in kvm.o.

Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
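An illustrative (not the kernel's) use of the %k operand modifier:

```c
#include <stdint.h>

/* %k selects the 32-bit name of the operand's register, so this emits
 * "xor %eax,%eax" (no REX prefix) instead of "xor %rax,%rax"; the upper
 * 32 bits of the destination are zeroed either way. */
static inline uint64_t clear_reg(void)
{
    uint64_t v;

    asm volatile("xor %k0, %k0" : "=r"(v));
    return v;
}
```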
2018-10-17  KVM/x86: Use assembly instruction mnemonics instead of .byte streams  (Uros Bizjak, 1 file, -26/+20)
Recently the minimum required version of binutils was changed to 2.20, which supports all VMX instruction mnemonics. The patch removes all .byte #defines and uses real instruction mnemonics instead.

The compiler is now able to pass a memory operand to the instruction, so there is no need for a memory clobber anymore. Also, the compiler adds the CC register clobber automatically to all extended asm clauses, so the patch also removes the explicit CC clobber.

The immediate benefit of the patch is removal of many unnecessary register moves, resulting in 1434 saved bytes in vmx.o:

   text    data     bss     dec     hex  filename
 151257   18246    8500  178003   2b753  vmx.o
 152691   18246    8500  179437   2bced  vmx-old.o

Some examples of improvement include removal of unneeded moves of %rsp to %rax in front of invept and invvpid instructions:

    a57e: b9 01 00 00 00          mov    $0x1,%ecx
    a583: 48 89 04 24             mov    %rax,(%rsp)
    a587: 48 89 e0                mov    %rsp,%rax
    a58a: 48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
    a591: 00 00
    a593: 66 0f 38 80 08          invept (%rax),%rcx

to:

    a45c: 48 89 04 24             mov    %rax,(%rsp)
    a460: b8 01 00 00 00          mov    $0x1,%eax
    a465: 48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
    a46c: 00 00
    a46e: 66 0f 38 80 04 24       invept (%rsp),%rax

and the ability to use more optimal registers and memory operands in the instruction:

    8faa: 48 8b 44 24 28          mov    0x28(%rsp),%rax
    8faf: 4c 89 c2                mov    %r8,%rdx
    8fb2: 0f 79 d0                vmwrite %rax,%rdx

to:

    8e7c: 44 0f 79 44 24 28       vmwrite 0x28(%rsp),%r8

Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
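A simplified, hedged sketch of the mnemonic-based form; the kernel's helper additionally reports VMFail, which is omitted here, and the names are illustrative:

```c
#include <stdint.h>

struct invept_operand { uint64_t eptp, gpa; };

/* With the real mnemonic, the memory operand is passed directly to the
 * instruction: no scratch-register shuffle, no explicit "cc" or memory
 * clobber. Must execute in VMX root operation or it will fault. */
static inline void invept_sketch(uint64_t type, uint64_t eptp)
{
    struct invept_operand op = { eptp, 0 };

    asm volatile("invept %0, %1" : : "m"(op), "r"(type));
}
```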
2018-10-17  KVM/x86: Fix invvpid and invept register operand size in 64-bit mode  (Uros Bizjak, 1 file, -2/+2)
The register operand of the invvpid and invept instructions always has a size of 64 bits in 64-bit mode. Adjust the inline function argument type to reflect the correct size.

Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()  (Vitaly Kuznetsov, 1 file, -0/+6)
We don't use the root page role for nested_mmu; however, optimizing out re-initialization when nothing changed is still valuable, as this is done for every nested vmentry.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed  (Vitaly Kuznetsov, 1 file, -34/+61)
MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be avoided if the source data used to configure it didn't change; enhance MMU extended role with the required fields and consolidate common code in kvm_calc_mmu_role_common(). Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/nVMX: introduce source data cache for kvm_init_shadow_ept_mmu()  (Vitaly Kuznetsov, 1 file, -15/+38)
MMU re-initialization is expensive; in particular, update_permission_bitmask() and update_pkru_bitmask() are. Cache the data used to set up the shadow EPT MMU and avoid full re-init when it is unchanged.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/mmu: make space for source data caching in struct kvm_mmu  (Vitaly Kuznetsov, 2 files, -7/+21)
In preparation for MMU reconfiguration avoidance we need a space to cache source data. As this partially intersects with kvm_mmu_page_role, create a 64-bit-sized union kvm_mmu_role holding both base and extended data. No functional change.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
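The shape of the union, sketched with stand-in field types (the real union pairs kvm_mmu_page_role with the new extended role):

```c
#include <stdint.h>

/* Illustrative layout only: as_u64 allows the whole role, base plus
 * extended source-data bits, to be compared in a single operation. */
union mmu_role {
    uint64_t as_u64;
    struct {
        uint32_t base;   /* stands in for union kvm_mmu_page_role */
        uint32_t ext;    /* stands in for union kvm_mmu_extended_role */
    };
};
```

Reconfiguration can then be skipped whenever the newly computed role's as_u64 matches the cached value, which is what the two "check if reconfiguration is needed" commits above build on.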
2018-10-17  x86/kvm/mmu: get rid of redundant kvm_mmu_setup()  (Paolo Bonzini, 2 files, -13/+1)
Just inline the contents into the sole caller; kvm_init_mmu is now public.

Suggested-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17  x86/kvm/mmu: introduce guest_mmu  (Vitaly Kuznetsov, 2 files, -18/+37)
When EPT is used for a nested guest we need to re-init the MMU as a shadow EPT MMU (nested_ept_init_mmu_context() does that). When we return from L2 to L1, kvm_mmu_reset_context() in nested_vmx_load_cr3() resets the MMU back to normal TDP mode.

Add a special 'guest_mmu' so we can use separate root caches; the improved hit rate is not very important for single vCPU performance, but it avoids contention on the mmu_lock for many vCPUs. On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit goes from 42k to 26k cycles.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()  (Vitaly Kuznetsov, 2 files, -5/+6)
Add an option to specify which MMU root we want to free. This will be used when nested and non-nested MMUs for L1 are split. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17  x86/kvm/mmu.c: set get_pdptr hook in kvm_init_shadow_ept_mmu()  (Vitaly Kuznetsov, 2 files, -0/+2)
kvm_init_shadow_ept_mmu() doesn't set the get_pdptr() hook, and this is not a problem only because the MMU context is already initialized and the hook already points to kvm_pdptr_read(). As we intend to use a dedicated MMU for the shadow EPT MMU, set this hook explicitly.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17  x86/kvm/mmu: make vcpu->mmu a pointer to the current MMU  (Vitaly Kuznetsov, 7 files, -123/+126)
As a preparation for a full MMU split between L1 and L2, make vcpu->arch.mmu a pointer to the currently used MMU. For now, this is always vcpu->arch.root_mmu. No functional change.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17  kvm: x86: optimize dr6 restore  (Paolo Bonzini, 1 file, -4/+9)
The quote from the comment almost says it all: we are currently zeroing the guest dr6 in kvm_arch_vcpu_put, because do_debug expects it. However, the host %dr6 is either:

 - zero because the guest hasn't run after kvm_arch_vcpu_load
 - written from vcpu->arch.dr6 by vcpu_enter_guest
 - written by the guest and copied to vcpu->arch.dr6 by ->sync_dirty_debug_regs().

Therefore, we can skip the write if vcpu->arch.dr6 is already zero. We may do extra useless writes if vcpu->arch.dr6 is nonzero but the guest hasn't run; however that is less important for performance.

Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: x86: hyperv: optimize sparse VP set processing  (Vitaly Kuznetsov, 1 file, -98/+67)
Rewrite kvm_hv_flush_tlb()/send_ipi_vcpus_mask() making them cleaner and somewhat more optimal. hv_vcpu_in_sparse_set() is converted to sparse_set_to_vcpu_mask() which copies sparse banks u64-at-a-time and then, depending on the num_mismatched_vp_indexes value, returns immediately or does vp index to vcpu index conversion by walking all vCPUs. To support the change and make kvm_hv_send_ipi() look similar to kvm_hv_flush_tlb() send_ipi_vcpus_mask() is introduced. Suggested-by: Roman Kagan <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: x86: hyperv: fix 'tlb_lush' typo  (Vitaly Kuznetsov, 1 file, -3/+3)
Regardless of whether your TLB is lush or not it still needs flushing. Reported-by: Roman Kagan <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: WARN if nested run hits VMFail with early consistency checks enabled  (Sean Christopherson, 1 file, -8/+10)
When early consistency checks are enabled, all VMFail conditions should be caught by nested_vmx_check_vmentry_hw(). Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: add option to perform early consistency checks via H/W  (Sean Christopherson, 1 file, -5/+137)
KVM defers many VMX consistency checks to the CPU, ostensibly for performance reasons[1], including checks that result in VMFail (as opposed to VMExit). This behavior may be undesirable for some users since this means KVM detects certain classes of VMFail only after it has processed guest state, e.g. emulated MSR load-on-entry.

Because there is a strict ordering between checks that cause VMFail and those that cause VMExit, i.e. all VMFail checks are performed before any checks that cause VMExit, we can detect (almost) all VMFail conditions via a dry run of sorts. The almost qualifier exists because some state in vmcs02 comes from L0, e.g. VPID, which means that hardware will never detect an invalid VPID in vmcs12 because it never sees said value. Software must (continue to) explicitly check such fields.

After preparing vmcs02 with all state needed to pass the VMFail consistency checks, optionally do a "test" VMEnter with an invalid GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest state), then we can safely say that the nested VMEnter should not VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must be due to an L0 bug. GUEST_RFLAGS is used to induce VMExit as it is unconditionally loaded on all implementations of VMX, has an invalid value that is writable on a 32-bit system and its consistency check is performed relatively early in all implementations (the exact order of consistency checks is micro-architectural).

Unfortunately, since the "passing" case causes a VMExit, KVM must be extra diligent to ensure that host state is restored, e.g. DR7 and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is particularly fatal. And of course the extra VMEnter and VMExit impacts performance. The raw overhead of the early consistency checks is ~6% on modern hardware (though this could easily vary based on configuration), while the added latency observed from the L1 VMM is ~10%. The early consistency checks do not occur in a vacuum, e.g. spending more time in L0 can lead to more interrupts being serviced while emulating VMEnter, thereby increasing the latency observed by L1.

Add a module param, early_consistency_checks, to provide control over whether or not VMX performs the early consistency checks. In addition to standard on/off behavior, the param accepts a value of -1, which is essentially an "auto" setting whereby KVM does the early checks only when it thinks it's running on bare metal. When running nested, doing early checks is of dubious value since the resulting behavior is heavily dependent on L0. In the future, the "auto" setting could also be used to default to skipping the early hardware checks for certain configurations/platforms if KVM reaches a state where it has 100% coverage of VMFail conditions.

[1] To my knowledge no one has implemented and tested full software emulation of the VMFail consistency checks. Until that happens, one can only speculate about the actual performance overhead of doing all VMFail consistency checks in software. Obviously any code is slower than no code, but in the grand scheme of nested virtualization it's entirely possible the overhead is negligible.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
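One plausible shape of the parameter plumbing described above (kernel-side fragment; the bare-metal probe shown is an assumption for illustration, not necessarily what the merged code uses):

```c
/* -1 == auto: do the early hardware checks only on bare metal. */
static int __read_mostly early_consistency_checks = -1;
module_param(early_consistency_checks, int, 0444);

static bool nested_early_checks_enabled(void)
{
    if (early_consistency_checks >= 0)
        return early_consistency_checks;
    /* auto: skip the dry-run VMEnter when running under a hypervisor */
    return !boot_cpu_has(X86_FEATURE_HYPERVISOR);
}
```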
2018-10-17  KVM: vmx: write HOST_IA32_EFER in vmx_set_constant_host_state()  (Sean Christopherson, 1 file, -1/+5)
EFER is constant in the host and writing it once during setup means we can skip writing the host value in add_atomic_switch_msr_special(). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: call kvm_skip_emulated_instruction in nested_vmx_{fail,succeed}  (Sean Christopherson, 1 file, -154/+91)
... as every invocation of nested_vmx_{fail,succeed} is immediately followed by a call to kvm_skip_emulated_instruction(). This saves a bit of code and eliminates some silly paths, e.g. nested_vmx_run() ended up with a goto label purely used to call and return kvm_skip_emulated_instruction(). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: do not call nested_vmx_succeed() for consistency check VMExit  (Sean Christopherson, 1 file, -1/+0)
EFLAGS is set to a fixed value on VMExit, so calling nested_vmx_succeed() is unnecessary and wrong.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17  KVM: nVMX: do not skip VMEnter instruction that succeeds  (Sean Christopherson, 1 file, -16/+5)
A successful VMEnter is essentially a fancy indirect branch that pulls the target RIP from the VMCS. Skipping the instruction is unnecessary (RIP will get overwritten by the VMExit handler) and is problematic because it can incorrectly suppress a #DB due to EFLAGS.TF when a VMFail is detected by hardware (which happens after we skip the instruction).

Now that nested_vmx_run() is not prematurely skipping the instruction, use the full kvm_skip_emulated_instruction() in the VMFail path of nested_vmx_vmexit(). We also need to explicitly update the GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state.

Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>