aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2018-10-17x86/kvm/mmu: get rid of redundant kvm_mmu_setup()Paolo Bonzini3-14/+1
Just inline the contents into the sole caller, kvm_init_mmu is now public. Suggested-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17x86/kvm/mmu: introduce guest_mmuVitaly Kuznetsov3-18/+40
When EPT is used for nested guest we need to re-init MMU as shadow EPT MMU (nested_ept_init_mmu_context() does that). When we return back from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use separate root caches; the improved hit rate is not very important for single vCPU performance, but it avoids contention on the mmu_lock for many vCPUs. On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit goes from 42k to 26k cycles. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()Vitaly Kuznetsov3-6/+8
Add an option to specify which MMU root we want to free. This will be used when nested and non-nested MMUs for L1 are split. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17x86/kvm/mmu.c: set get_pdptr hook in kvm_init_shadow_ept_mmu()Vitaly Kuznetsov2-0/+2
kvm_init_shadow_ept_mmu() doesn't set get_pdptr() hook and is this not a problem just because MMU context is already initialized and this hook points to kvm_pdptr_read(). As we're intended to use a dedicated MMU for shadow EPT MMU set this hook explicitly. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17x86/kvm/mmu: make vcpu->mmu a pointer to the current MMUVitaly Kuznetsov8-124/+130
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu a pointer to the currently used mmu. For now, this is always vcpu->arch.root_mmu. No functional change. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: Sean Christopherson <[email protected]>
2018-10-17kvm: x86: optimize dr6 restorePaolo Bonzini1-4/+9
The quote from the comment almost says it all: we are currently zeroing the guest dr6 in kvm_arch_vcpu_put, because do_debug expects it. However, the host %dr6 is either: - zero because the guest hasn't run after kvm_arch_vcpu_load - written from vcpu->arch.dr6 by vcpu_enter_guest - written by the guest and copied to vcpu->arch.dr6 by ->sync_dirty_debug_regs(). Therefore, we can skip the write if vcpu->arch.dr6 is already zero. We may do extra useless writes if vcpu->arch.dr6 is nonzero but the guest hasn't run; however that is less important for performance. Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: optimize sparse VP set processingVitaly Kuznetsov1-98/+67
Rewrite kvm_hv_flush_tlb()/send_ipi_vcpus_mask() making them cleaner and somewhat more optimal. hv_vcpu_in_sparse_set() is converted to sparse_set_to_vcpu_mask() which copies sparse banks u64-at-a-time and then, depending on the num_mismatched_vp_indexes value, returns immediately or does vp index to vcpu index conversion by walking all vCPUs. To support the change and make kvm_hv_send_ipi() look similar to kvm_hv_flush_tlb() send_ipi_vcpus_mask() is introduced. Suggested-by: Roman Kagan <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: fix 'tlb_lush' typoVitaly Kuznetsov2-4/+4
Regardless of whether your TLB is lush or not it still needs flushing. Reported-by: Roman Kagan <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: WARN if nested run hits VMFail with early consistency checks enabledSean Christopherson1-8/+10
When early consistency checks are enabled, all VMFail conditions should be caught by nested_vmx_check_vmentry_hw(). Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: add option to perform early consistency checks via H/WSean Christopherson1-5/+137
KVM defers many VMX consistency checks to the CPU, ostensibly for performance reasons[1], including checks that result in VMFail (as opposed to VMExit). This behavior may be undesirable for some users since this means KVM detects certain classes of VMFail only after it has processed guest state, e.g. emulated MSR load-on-entry. Because there is a strict ordering between checks that cause VMFail and those that cause VMExit, i.e. all VMFail checks are performed before any checks that cause VMExit, we can detect (almost) all VMFail conditions via a dry run of sorts. The almost qualifier exists because some state in vmcs02 comes from L0, e.g. VPID, which means that hardware will never detect an invalid VPID in vmcs12 because it never sees said value. Software must (continue to) explicitly check such fields. After preparing vmcs02 with all state needed to pass the VMFail consistency checks, optionally do a "test" VMEnter with an invalid GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest state), then we can safely say that the nested VMEnter should not VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must be due to an L0 bug. GUEST_RFLAGS is used to induce VMExit as it is unconditionally loaded on all implementations of VMX, has an invalid value that is writable on a 32-bit system and its consistency check is performed relatively early in all implementations (the exact order of consistency checks is micro-architectural). Unfortunately, since the "passing" case causes a VMExit, KVM must be extra diligent to ensure that host state is restored, e.g. DR7 and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is particularly fatal. And of course the extra VMEnter and VMExit impacts performance. The raw overhead of the early consistency checks is ~6% on modern hardware (though this could easily vary based on configuration), while the added latency observed from the L1 VMM is ~10%. The early consistency checks do not occur in a vacuum, e.g. spending more time in L0 can lead to more interrupts being serviced while emulating VMEnter, thereby increasing the latency observed by L1. Add a module param, early_consistency_checks, to provide control over whether or not VMX performs the early consistency checks. In addition to standard on/off behavior, the param accepts a value of -1, which is essentialy an "auto" setting whereby KVM does the early checks only when it thinks it's running on bare metal. When running nested, doing early checks is of dubious value since the resulting behavior is heavily dependent on L0. In the future, the "auto" setting could also be used to default to skipping the early hardware checks for certain configurations/platforms if KVM reaches a state where it has 100% coverage of VMFail conditions. [1] To my knowledge no one has implemented and tested full software emulation of the VMFail consistency checks. Until that happens, one can only speculate about the actual performance overhead of doing all VMFail consistency checks in software. Obviously any code is slower than no code, but in the grand scheme of nested virtualization it's entirely possible the overhead is negligible. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: vmx: write HOST_IA32_EFER in vmx_set_constant_host_state()Sean Christopherson1-1/+5
EFER is constant in the host and writing it once during setup means we can skip writing the host value in add_atomic_switch_msr_special(). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: call kvm_skip_emulated_instruction in nested_vmx_{fail,succeed}Sean Christopherson1-154/+91
... as every invocation of nested_vmx_{fail,succeed} is immediately followed by a call to kvm_skip_emulated_instruction(). This saves a bit of code and eliminates some silly paths, e.g. nested_vmx_run() ended up with a goto label purely used to call and return kvm_skip_emulated_instruction(). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: do not call nested_vmx_succeed() for consistency check VMExitSean Christopherson1-1/+0
EFLAGS is set to a fixed value on VMExit, calling nested_vmx_succeed() is unnecessary and wrong. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: do not skip VMEnter instruction that succeedsSean Christopherson1-16/+5
A successful VMEnter is essentially a fancy indirect branch that pulls the target RIP from the VMCS. Skipping the instruction is unnecessary (RIP will get overwritten by the VMExit handler) and is problematic because it can incorrectly suppress a #DB due to EFLAGS.TF when a VMFail is detected by hardware (happens after we skip the instruction). Now that vmx_nested_run() is not prematurely skipping the instr, use the full kvm_skip_emulated_instruction() in the VMFail path of nested_vmx_vmexit(). We also need to explicitly update the GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: do early preparation of vmcs02 before check_vmentry_postreqs()Sean Christopherson1-10/+12
In anticipation of using vmcs02 to do early consistency checks, move the early preparation of vmcs02 prior to checking the postreqs. The downside of this approach is that we'll unnecessary load vmcs02 in the case that check_vmentry_postreqs() fails, but that is essentially our slow path anyways (not actually slow, but it's the path we don't really care about optimizing). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: initialize vmcs02 constant exactly once (per VMCS)Sean Christopherson1-2/+11
Add a dedicated flag to track if vmcs02 has been initialized, i.e. the constant state for vmcs02 has been written to the backing VMCS. The launched flag (in struct loaded_vmcs) gets cleared on logical CPU migration to mirror hardware behavior[1], i.e. using the launched flag to determine whether or not vmcs02 constant state needs to be initialized results in unnecessarily re-initializing the VMCS when migrating between logical CPUS. [1] The active VMCS needs to be VMCLEARed before it can be migrated to a different logical CPU. Hardware's VMCS cache is per-CPU and is not coherent between CPUs. VMCLEAR flushes the cache so that any dirty data is written back to memory. A side effect of VMCLEAR is that it also clears the VMCS's internal launch flag, which KVM must mirror because VMRESUME must be used to run a previously launched VMCS. Suggested-by: Jim Mattson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: split pieces of prepare_vmcs02() to prepare_vmcs02_early()Sean Christopherson1-186/+224
Add prepare_vmcs02_early() and move pieces of prepare_vmcs02() to the new function. prepare_vmcs02_early() writes the bits of vmcs02 that a) must be in place to pass the VMFail consistency checks (assuming vmcs12 is valid) and b) are needed recover from a VMExit, e.g. host state that is loaded on VMExit. Splitting the functionality will enable KVM to leverage hardware to do VMFail consistency checks via a dry run of VMEnter and recover from a potential VMExit without having to fully initialize vmcs02. Add prepare_vmcs02_constant_state() to handle writing vmcs02 state that comes from vmcs01 and never changes, i.e. we don't need to rewrite any of the vmcs02 that is effectively constant once defined. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: VMX: remove ASSERT() on vmx->pml_pg validitySean Christopherson1-2/+0
vmx->pml_pg is allocated by vmx_create_vcpu() and is only nullified when the vCPU is destroyed by vmx_free_vcpu(). Remove the ASSERTs on vmx->pml_pg, there is no need to carry debug code that provides no value to the current code base. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: vVMX: rename label for post-enter_guest_mode consistency checkSean Christopherson1-3/+3
Rename 'fail' to 'vmentry_fail_vmexit_guest_mode' to make it more obvious that it's simply a different entry point to the VMExit path, whose purpose is unwind the updates done prior to calling prepare_vmcs02(). Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: assimilate nested_vmx_entry_failure() into ↵Sean Christopherson1-42/+36
nested_vmx_enter_non_root_mode() Handling all VMExits due to failed consistency checks on VMEnter in nested_vmx_enter_non_root_mode() consolidates all relevant code into a single location, and removing nested_vmx_entry_failure() eliminates a confusing function name and label. For a VMEntry, "fail" and its derivatives has a very specific meaning due to the different behavior of a VMEnter VMFail versus VMExit, i.e. it wasn't obvious that nested_vmx_entry_failure() handled VMExit scenarios. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: move check_vmentry_postreqs() call to ↵Sean Christopherson1-7/+3
nested_vmx_enter_non_root_mode() In preparation of supporting checkpoint/restore for nested state, commit ca0bde28f2ed ("kvm: nVMX: Split VMCS checks from nested_vmx_run()") modified check_vmentry_postreqs() to only perform the guest EFER consistency checks when nested_run_pending is true. But, in the normal nested VMEntry flow, nested_run_pending is only set after check_vmentry_postreqs(), i.e. the consistency check is being skipped. Alternatively, nested_run_pending could be set prior to calling check_vmentry_postreqs() in nested_vmx_run(), but placing the consistency checks in nested_vmx_enter_non_root_mode() allows us to split prepare_vmcs02() and interleave the preparation with the consistency checks without having to change the call sites of nested_vmx_enter_non_root_mode(). In other words, the rest of the consistency check code in nested_vmx_run() will be joining the postreqs checks in future patches. Fixes: ca0bde28f2ed ("kvm: nVMX: Split VMCS checks from nested_vmx_run()") Signed-off-by: Sean Christopherson <[email protected]> Cc: Jim Mattson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: rename enter_vmx_non_root_mode to nested_vmx_enter_non_root_modeSean Christopherson1-5/+5
...to be more consistent with the nested VMX nomenclature. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: try to set EFER bits correctly when initializing controlsSean Christopherson1-16/+35
VM_ENTRY_IA32E_MODE and VM_{ENTRY,EXIT}_LOAD_IA32_EFER will be explicitly set/cleared as needed by vmx_set_efer(), but attempt to get the bits set correctly when intializing the control fields. Setting the value correctly can avoid multiple VMWrites. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: vmx: do not unconditionally clear EFER switchingSean Christopherson1-2/+4
Do not unconditionally call clear_atomic_switch_msr() when updating EFER. This adds up to four unnecessary VMWrites in the case where guest_efer != host_efer, e.g. if the load_on_{entry,exit} bits were already set. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: reset cache/shadows when switching loaded VMCSSean Christopherson1-4/+4
Reset the vm_{entry,exit}_controls_shadow variables as well as the segment cache after loading a new VMCS in vmx_switch_vmcs(). The shadows/cache track VMCS data, i.e. they're stale every time we switch to a new VMCS regardless of reason. This fixes a bug where stale control shadows would be consumed after a nested VMExit due to a failed consistency check. Suggested-by: Jim Mattson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: use vm_exit_controls_init() to write exit controls for vmcs02Sean Christopherson1-1/+1
Write VM_EXIT_CONTROLS using vm_exit_controls_init() when configuring vmcs02, otherwise vm_exit_controls_shadow will be stale. EFER in particular can be corrupted if VM_EXIT_LOAD_IA32_EFER is not updated due to an incorrect shadow optimization, which can crash L0 due to EFER not being loaded on exit. This does not occur with the current code base simply because update_transition_efer() unconditionally clears VM_EXIT_LOAD_IA32_EFER before conditionally setting it, and because a nested guest always starts with VM_EXIT_LOAD_IA32_EFER clear, i.e. we'll only ever unnecessarily clear the bit. That is, until someone optimizes update_transition_efer()... Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: move vmcs12 EPTP consistency check to check_vmentry_prereqs()Sean Christopherson1-12/+9
An invalid EPTP causes a VMFail(VMXERR_ENTRY_INVALID_CONTROL_FIELD), not a VMExit. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: move host EFER consistency checks to VMFail pathSean Christopherson1-15/+16
Invalid host state related to loading EFER on VMExit causes a VMFail(VMXERR_ENTRY_INVALID_HOST_STATE_FIELD), not a VMExit. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: Always reflect #NM VM-exits to L1Jim Mattson1-8/+0
When bit 3 (corresponding to CR0.TS) of the VMCS12 cr0_guest_host_mask field is clear, the VMCS12 guest_cr0 field does not necessarily hold the current value of the L2 CR0.TS bit, so the code that checked for L2's CR0.TS bit being set was incorrect. Moreover, I'm not sure that the CR0.TS check was adequate. (What if L2's CR0.EM was set, for instance?) Fortunately, lazy FPU has gone away, so L0 has lost all interest in intercepting #NM exceptions. See commit bd7e5b0899a4 ("KVM: x86: remove code for lazy FPU handling"). Therefore, there is no longer any question of which hypervisor gets first dibs. The #NM VM-exit should always be reflected to L1. (Note that the corresponding bit must be set in the VMCS12 exception_bitmap field for there to be an #NM VM-exit at all.) Fixes: ccf9844e5d99c ("kvm, vmx: Really fix lazy FPU on nested guest") Reported-by: Abhiroop Dabral <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Peter Shier <[email protected]> Tested-by: Abhiroop Dabral <[email protected]> Reviewed-by: Liran Alon <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: implement PV IPI send hypercallsVitaly Kuznetsov3-0/+158
Using hypercall for sending IPIs is faster because this allows to specify any number of vCPUs (even > 64 with sparse CPU set), the whole procedure will take only one VMEXIT. Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi hypercall can't be 'fast' (passing parameters through registers) but apparently this is not true, Windows always uses it as 'fast' so we need to support that. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: optimize kvm_hv_flush_tlb() for vp_index == vcpu_idx caseVitaly Kuznetsov1-44/+52
VP inedx almost always matches VCPU and when it does it's faster to walk the sparse set instead of all vcpus. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: valid_bank_mask should be 'u64'Vitaly Kuznetsov1-2/+3
This probably doesn't matter much (KVM_MAX_VCPUS is much lower nowadays) but valid_bank_mask is really u64 and not unsigned long. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: keep track of mismatched VP indexesVitaly Kuznetsov2-3/+26
In most common cases VP index of a vcpu matches its vcpu index. Userspace is, however, free to set any mapping it wishes and we need to account for that when we need to find a vCPU with a particular VP index. To keep search algorithms optimal in both cases introduce 'num_mismatched_vp_indexes' counter showing how many vCPUs with mismatching VP index we have. In case the counter is zero we can assume vp_index == vcpu_idx. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: consistently use 'hv_vcpu' for 'struct kvm_vcpu_hv' variablesVitaly Kuznetsov1-9/+9
Rename 'hv' to 'hv_vcpu' in kvm_hv_set_msr/kvm_hv_get_msr(); 'hv' is 'reserved' for 'struct kvm_hv' variables across the file. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: optimize 'all cpus' case in kvm_hv_flush_tlb()Vitaly Kuznetsov1-19/+23
We can use 'NULL' to represent 'all cpus' case in kvm_make_vcpus_request_mask() and avoid building vCPU mask with all vCPUs. Suggested-by: Radim Krčmář <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: hyperv: enforce vp_index < KVM_MAX_VCPUSVitaly Kuznetsov1-3/+5
Hyper-V TLFS (5.0b) states: > Virtual processors are identified by using an index (VP index). The > maximum number of virtual processors per partition supported by the > current implementation of the hypervisor can be obtained through CPUID > leaf 0x40000005. A virtual processor index must be less than the > maximum number of virtual processors per partition. Forbid userspace to set VP_INDEX above KVM_MAX_VCPUS. get_vcpu_by_vpidx() can now be optimized to bail early when supplied vpidx is >= KVM_MAX_VCPUS. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Roman Kagan <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17kvm/x86: return meaningful value from KVM_SIGNAL_MSIPaolo Bonzini1-3/+3
If kvm_apic_map_get_dest_lapic() finds a disabled LAPIC, it will return with bitmap==0 and (*r == -1) will be returned to userspace. QEMU may then record "KVM: injection failed, MSI lost (Operation not permitted)" in its log, which is quite puzzling. Reported-by: Peng Hao <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: move definition PT_MAX_HUGEPAGE_LEVEL and KVM_NR_PAGE_SIZES togetherWei Yang2-6/+9
Currently, there are two definitions related to huge page, but a little bit far from each other and seems loosely connected: * KVM_NR_PAGE_SIZES defines the number of different size a page could map * PT_MAX_HUGEPAGE_LEVEL means the maximum level of huge page The number of different size a page could map equals the maximum level of huge page, which is implied by current definition. While current implementation may not be kind to readers and further developers: * KVM_NR_PAGE_SIZES looks like a stand alone definition at first sight * in case we need to support more level, two places need to change This patch tries to make these two definition more close, so that reader and developer would feel more comfortable to manipulate. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM/VMX: Remve unused function is_external_interrupt().Tianyu Lan1-6/+0
is_external_interrupt() is not used now and so remove it. Signed-off-by: Lan Tianyu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: return 0 in case kvm_mmu_memory_cache has min number of objectsWei Yang1-2/+2
The code tries to pre-allocate *min* number of objects, so it is ok to return 0 when the kvm_mmu_memory_cache meets the requirement. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17nVMX x86: Make nested_vmx_check_pml_controls() conciseKrish Sadhukhan1-8/+5
Suggested-by: Jim Mattson <[email protected]> Signed-off-by: Krish Sadhukhan <[email protected]> Reviewed-by: Mark Kanda <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: x86: adjust kvm_mmu_page member to save 8 bytesWei Yang1-2/+2
On a 64bits machine, struct is naturally aligned with 8 bytes. Since kvm_mmu_page member *unsync* and *role* are less then 4 bytes, we can rearrange the sequence to compace the struct. As the comment shows, *role* and *gfn* are used to key the shadow page. In order to keep the comment valid, this patch moves the *unsync* up and exchange the position of *role* and *gfn*. From /proc/slabinfo, it shows the size of kvm_mmu_page is 8 bytes less and with one more object per slap after applying this patch. # name <active_objs> <num_objs> <objsize> <objperslab> kvm_mmu_page_header 0 0 168 24 kvm_mmu_page_header 0 0 160 25 Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: restore host state in nested_vmx_vmexit for VMFailSean Christopherson1-20/+153
A VMEnter that VMFails (as opposed to VMExits) does not touch host state beyond registers that are explicitly noted in the VMFail path, e.g. EFLAGS. Host state does not need to be loaded because VMFail is only signaled for consistency checks that occur before the CPU starts to load guest state, i.e. there is no need to restore any state as nothing has been modified. But in the case where a VMFail is detected by hardware and not by KVM (due to deferring consistency checks to hardware), KVM has already loaded some amount of guest state. Luckily, "loaded" only means loaded to KVM's software model, i.e. vmcs01 has not been modified. So, unwind our software model to the pre-VMEntry host state. Not restoring host state in this VMFail path leads to a variety of failures because we end up with stale data in vcpu->arch, e.g. CR0, CR4, EFER, etc... will all be out of sync relative to vmcs01. Any significant delta in the stale data is all but guaranteed to crash L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong. An alternative to this "soft" reload would be to load host state from vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is wildly inconsistent with respect to the VMX architecture, e.g. an L1 VMM with separate VMExit and VMFail paths would explode. Note that this approach does not mean KVM is 100% accurate with respect to VMX hardware behavior, even at an architectural level (the exact order of consistency checks is microarchitecture specific). But 100% emulation accuracy isn't the goal (with this patch), rather the goal is to be consistent in the information delivered to L1, e.g. a VMExit should not fall-through VMENTER, and a VMFail should not jump to HOST_RIP. This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)", but retains the core aspects of that patch, just in an open coded form due to the need to pull state from vmcs01 instead of vmcs12. Restoring host state resolves a variety of issues introduced by commit "4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)", which remedied the incorrect behavior of treating VMFail like VMExit but in doing so neglected to restore arch state that had been modified prior to attempting nested VMEnter. A sample failure that occurs due to stale vcpu.arch state is a fault of some form while emulating an LGDT (due to emulated UMIP) from L1 after a failed VMEntry to L3, in this case when running the KVM unit test test_tpr_threshold_values in L1. L0 also hits a WARN in this case due to a stale arch.cr4.UMIP. L1: BUG: unable to handle kernel paging request at ffffc90000663b9e PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163 Oops: 0009 [#1] SMP CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:native_load_gdt+0x0/0x10 ... Call Trace: load_fixmap_gdt+0x22/0x30 __vmx_load_host_state+0x10e/0x1c0 [kvm_intel] vmx_switch_vmcs+0x2d/0x50 [kvm_intel] nested_vmx_vmexit+0x222/0x9c0 [kvm_intel] vmx_handle_exit+0x246/0x15a0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] do_vfs_ioctl+0x9f/0x600 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x4f/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 L0: WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel] ... CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76 Hardware name: Intel Corporation Kabylake Client platform/KBL S RIP: 0010:handle_desc+0x28/0x30 [kvm_intel] ... Call Trace: kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm] kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm] do_vfs_ioctl+0x9f/0x5e0 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x49/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure) Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly) Cc: Jim Mattson <[email protected]> Cc: Krish Sadhukhan <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim KrÄmář <[email protected]> Cc: Wanpeng Li <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: nVMX: Clear reserved bits of #DB exit qualificationJim Mattson2-2/+6
According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit qualification field for debug exceptions are reserved (cleared to 0). However, the SDM is incorrect about bit 16 (corresponding to DR6.RTM). This bit should be set if a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. Note that this is the opposite of DR6.RTM, which "indicates (when clear) that a debug exception (#DB) or breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled." There is still an issue with stale DR6 bits potentially being misreported for the current debug exception. DR6 should not have been modified before vectoring the #DB exception, and the "new DR6 bits" should be available somewhere, but it was and they aren't. Fixes: b96fb439774e1 ("KVM: nVMX: fixes to nested virt interrupt injection") Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-17KVM: LAPIC: Tune lapic_timer_advance_ns automaticallyWanpeng Li2-2/+25
In cloud environment, lapic_timer_advance_ns is needed to be tuned for every CPU generations, and every host kernel versions(the kvm-unit-tests/tscdeadline_latency.flat is 5700 cycles for upstream kernel and 9600 cycles for our 3.10 product kernel, both preemption_timer=N, Skylake server). This patch adds the capability to automatically tune lapic_timer_advance_ns step by step, the initial value is 1000ns as 'commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration")' recommended, it will be reduced when it is too early, and increased when it is too late. The guest_tsc and tsc_deadline are hard to equal, so we assume we are done when the delta is within a small scope e.g. 100 cycles. This patch reduces latency (kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer enabled) from ~2600 cyles to ~1200 cyles on our Skylake server. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Liran Alon <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-10-16locking/qspinlock, x86: Provide liveness guaranteePeter Zijlstra1-0/+15
On x86 we cannot do fetch_or() with a single instruction and thus end up using a cmpxchg loop, this reduces determinism. Replace the fetch_or() with a composite operation: tas-pending + load. Using two instructions of course opens a window we previously did not have. Consider the scenario: CPU0 CPU1 CPU2 1) lock trylock -> (0,0,1) 2) lock trylock /* fail */ 3) unlock -> (0,0,0) 4) lock trylock -> (0,0,1) 5) tas-pending -> (0,1,1) load-val <- (0,1,0) from 3 6) clear-pending-set-locked -> (0,0,1) FAIL: _2_ owners where 5) is our new composite operation. When we consider each part of the qspinlock state as a separate variable (as we can when _Q_PENDING_BITS == 8) then the above is entirely possible, because tas-pending will only RmW the pending byte, so the later load is able to observe prior tail and lock state (but not earlier than its own trylock, which operates on the whole word, due to coherence). To avoid this we need 2 things: - the load must come after the tas-pending (obviously, otherwise it can trivially observe prior state). - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for example, such that we cannot observe other state prior to setting pending. On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending followed by a regular load. Note that observing later state is not a problem: - if we fail to observe a later unlock, we'll simply spin-wait for that store to become visible. - if we observe a later xchg_tail(), there is no difference from that xchg_tail() having taken place before the tas-pending. Suggested-by: Will Deacon <[email protected]> Reported-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Will Deacon <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath") Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-16x86/asm: 'Simplify' GEN_*_RMWcc() macrosPeter Zijlstra7-53/+64
Currently the GEN_*_RMWcc() macros include a return statement, which pretty much mandates we directly wrap them in a (inline) function. Macros with return statements are tricky and, as per the above, limit use, so remove the return statement and make them statement-expressions. This allows them to be used more widely. Also, shuffle the arguments a bit. Place the @cc argument as 3rd, this makes it consistent between UNARY and BINARY, but more importantly, it makes the @arg0 argument last. Since the @arg0 argument is now last, we can do CPP trickery and make it an optional argument, simplifying the users; 17 out of 18 occurences do not need this argument. Finally, change to asm symbolic names, instead of the numeric ordering of operands, which allows us to get rid of __BINARY_RMWcc_ARG and get cleaner code overall. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-10-16perf/x86/intel: Export mem events only if there's PEBS supportJiri Olsa2-21/+56
Memory events depends on PEBS support and access to LDLAT MSR, but we display them in /sys/devices/cpu/events even if the CPU does not provide those, like for KVM guests. That brings the false assumption that those events should be available, while they fail event to open. Separating the mem-* events attributes and merging them with cpu_events only if there's PEBS support detected. We could also check if LDLAT MSR is available, but the PEBS check seems to cover the need now. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Petlan <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/20180906135748.GC9577@krava Signed-off-by: Ingo Molnar <[email protected]>
2018-10-14x86/boot: Add -Wno-pointer-sign to KBUILD_CFLAGSNathan Chancellor1-0/+1
When compiling the kernel with Clang, this warning appears even though it is disabled for the whole kernel because this folder has its own set of KBUILD_CFLAGS. It was disabled before the beginning of git history. In file included from arch/x86/boot/compressed/kaslr.c:29: In file included from arch/x86/boot/compressed/misc.h:21: In file included from ./include/linux/elf.h:5: In file included from ./arch/x86/include/asm/elf.h:77: In file included from ./arch/x86/include/asm/vdso.h:11: In file included from ./include/linux/mm_types.h:9: In file included from ./include/linux/spinlock.h:88: In file included from ./arch/x86/include/asm/spinlock.h:43: In file included from ./arch/x86/include/asm/qrwlock.h:6: ./include/asm-generic/qrwlock.h:101:53: warning: passing 'u32 *' (aka 'unsigned int *') to parameter of type 'int *' converts between pointers to integer types with different sign [-Wpointer-sign] if (likely(atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED))) ^~~~~ ./include/linux/compiler.h:76:40: note: expanded from macro 'likely' # define likely(x) __builtin_expect(!!(x), 1) ^ ./include/asm-generic/atomic-instrumented.h:69:66: note: passing argument to parameter 'old' here static __always_inline bool atomic_try_cmpxchg(atomic_t *v, int *old, int new) ^ Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Nick Desaulniers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2018-10-14x86/time: Correct the attribute on jiffies' definitionNathan Chancellor1-1/+1
Clang warns that the declaration of jiffies in include/linux/jiffies.h doesn't match the definition in arch/x86/time/kernel.c: arch/x86/kernel/time.c:29:42: warning: section does not match previous declaration [-Wsection] __visible volatile unsigned long jiffies __cacheline_aligned = INITIAL_JIFFIES; ^ ./include/linux/cache.h:49:4: note: expanded from macro '__cacheline_aligned' __section__(".data..cacheline_aligned"))) ^ ./include/linux/jiffies.h:81:31: note: previous attribute is here extern unsigned long volatile __cacheline_aligned_in_smp __jiffy_arch_data jiffies; ^ ./arch/x86/include/asm/cache.h:20:2: note: expanded from macro '__cacheline_aligned_in_smp' __page_aligned_data ^ ./include/linux/linkage.h:39:29: note: expanded from macro '__page_aligned_data' #define __page_aligned_data __section(.data..page_aligned) __aligned(PAGE_SIZE) ^ ./include/linux/compiler_attributes.h:233:56: note: expanded from macro '__section' #define __section(S) __attribute__((__section__(#S))) ^ 1 warning generated. The declaration was changed in commit 7c30f352c852 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp") but wasn't updated here. Make them match so Clang no longer warns. Fixes: 7c30f352c852 ("jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp") Signed-off-by: Nathan Chancellor <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]