Age  Commit message  Author  Files  Lines
2018-03-19KVM: arm64: Introduce framework for accessing deferred sysregsChristoffer Dall2-2/+39
We are about to defer saving and restoring some groups of system registers to vcpu_put and vcpu_load on supported systems. This means that we need some infrastructure to access system registers which supports either accessing the memory backing of the register or directly accessing the system registers, depending on the state of the system when we access the register. We do this by defining read/write accessor functions, which can handle both "immediate" and "deferrable" system registers. Immediate registers are always saved/restored in the world-switch path, but deferrable registers are only saved/restored in vcpu_put/vcpu_load when supported and sysregs_loaded_on_cpu will be set in that case. Note that we don't use the deferred mechanism yet in this patch, but only introduce infrastructure. This is to improve convenience of review in the subsequent patches where it is clear which registers become deferred. Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
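The accessor idea described above can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel implementation: the register indices, the `hw_regs` array standing in for the physical CPU registers, and the `reg_is_deferrable()` policy are all hypothetical. Only the accessor names and the `sysregs_loaded_on_cpu` flag come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register indices, for illustration only. */
enum { REG_IMMEDIATE, REG_DEFERRABLE, NR_REGS };

/* Stand-in for the physical CPU registers (real code would use mrs/msr). */
uint64_t hw_regs[NR_REGS];

struct vcpu {
	uint64_t sys_regs[NR_REGS];  /* in-memory copy of the registers */
	bool sysregs_loaded_on_cpu;  /* true between vcpu_load and vcpu_put */
};

/* Assumed policy: only REG_DEFERRABLE may live on the CPU. */
bool reg_is_deferrable(int reg)
{
	return reg == REG_DEFERRABLE;
}

uint64_t vcpu_read_sys_reg(const struct vcpu *vcpu, int reg)
{
	assert(reg >= 0 && reg < NR_REGS);
	if (vcpu->sysregs_loaded_on_cpu && reg_is_deferrable(reg))
		return hw_regs[reg];     /* live value is on the CPU */
	return vcpu->sys_regs[reg];      /* live value is in memory */
}

void vcpu_write_sys_reg(struct vcpu *vcpu, uint64_t val, int reg)
{
	assert(reg >= 0 && reg < NR_REGS);
	if (vcpu->sysregs_loaded_on_cpu && reg_is_deferrable(reg))
		hw_regs[reg] = val;
	else
		vcpu->sys_regs[reg] = val;
}
```

Callers never need to know where the live value is; an immediate register always resolves to the memory copy, regardless of the flag.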
2018-03-19KVM: arm64: Rewrite system register accessors to read/write functionsChristoffer Dall9-76/+101
Currently we access the system registers array via the vcpu_sys_reg() macro. However, we are about to change the behavior to sometimes modify the register file directly, so let's change this to two primitives:
* Accessor macros vcpu_write_sys_reg() and vcpu_read_sys_reg()
* Direct array access macro __vcpu_sys_reg()
The accessor macros should be used in places where the code needs to access the currently loaded VCPU's state as observed by the guest. For example, when trapping on cache related registers, a write to a system register should go directly to the VCPU version of the register. The direct array access macro can be used in places where the VCPU is known to never be running (for example userspace access) or for registers which are never context switched (for example all the PMU system registers). This rewrites all users of vcpu_sys_reg() to one of the macros described above. No functional change. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Change 32-bit handling of VM system registersChristoffer Dall2-13/+15
We currently handle 32-bit accesses to trapped VM system registers using the 32-bit index into the coproc array on the vcpu structure, which is a union of the coproc array and the sysreg array. Since all the 32-bit coproc indices are created to correspond to the architectural mapping between 64-bit system registers and 32-bit coprocessor registers, and because the AArch64 system registers are double the size of the AArch32 coprocessor registers, we can always find the system register entry that we must update by dividing the 32-bit coproc index by 2. This is going to make our lives much easier when we have to start accessing system registers that use deferred save/restore and might have to be read directly from the physical CPU. Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
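As a rough illustration of the index arithmetic above, the following sketch updates one 32-bit half of a 64-bit register slot. The function name and the array size are hypothetical. The rule that an odd coproc index selects the upper half is an assumption (it matches a little-endian overlay of the two arrays); the commit message itself only states the division by 2.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SYS_REGS 4  /* hypothetical size, for illustration */

/*
 * Write a 32-bit coprocessor value into the 64-bit system register
 * array. cp_idx / 2 selects the 64-bit entry; cp_idx % 2 is assumed
 * to select the half (0 = low 32 bits, 1 = high 32 bits).
 */
void sysreg_write_from_cp15(uint64_t sys_regs[], unsigned int cp_idx,
			    uint32_t val)
{
	unsigned int reg = cp_idx / 2;
	uint64_t cur;

	assert(reg < NR_SYS_REGS);
	cur = sys_regs[reg];
	if (cp_idx % 2)
		cur = (cur & UINT64_C(0xffffffff)) | ((uint64_t)val << 32);
	else
		cur = (cur & ~UINT64_C(0xffffffff)) | val;
	sys_regs[reg] = cur;
}
```

Because the lookup goes through the 64-bit index, the same path can later be routed through an accessor that reads the physical register instead of the array.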
2018-03-19KVM: arm64: Don't save the host ELR_EL2 and SPSR_EL2 on VHE systemsChristoffer Dall1-0/+13
On non-VHE systems we need to save the ELR_EL2 and SPSR_EL2 so that we can return to the host in EL1 in the same state and location where we issued a hypercall to EL2, but on VHE ELR_EL2 and SPSR_EL2 are not useful because we never enter a guest as a result of an exception entry that would be directly handled by KVM. The kernel entry code already saves ELR_EL1/SPSR_EL1 on exception entry, which is enough. Therefore, factor out these registers into separate save/restore functions, making it easy to exclude them from the VHE world-switch path later on. Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Unify non-VHE host/guest sysreg save and restore functionsChristoffer Dall3-25/+9
There is no need to have multiple identical functions with different names for saving host and guest state. When saving and restoring state for the host and guest, the state is the same for both contexts, and that's why we have the kvm_cpu_context structure. Delete one version and rename the other into simply save/restore. Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm/arm64: Remove leftover comment from kvm_vcpu_run_vheChristoffer Dall1-4/+0
The comment only applied to SPE on non-VHE systems, so we simply remove it. Suggested-by: Andrew Jones <[email protected]> Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Introduce separate VHE/non-VHE sysreg save/restore functionsChristoffer Dall3-22/+50
We are about to handle system registers quite differently between VHE and non-VHE systems. In preparation for that, we need to split some of the handling functions between VHE and non-VHE functionality. For now, we simply copy the non-VHE functions, but we do change the use of static keys for VHE and non-VHE functionality now that we have separate functions. Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Rewrite sysreg alternatives to static keysChristoffer Dall1-13/+4
As we are about to move calls around in the sysreg save/restore logic, let's first rewrite the alternative function callers, because it is going to make the next patches much easier to read. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Move userspace system registers into separate functionChristoffer Dall1-13/+35
There's a semantic difference between the EL1 registers that control operation of a kernel running in EL1 and EL1 registers that only control userspace execution in EL0. Since we can defer saving/restoring the latter, move them into their own function. The ARMv8 ARM (ARM DDI 0487C.a) Section D10.2.1 recommends that ACTLR_EL1 has no effect on the processor when running the VHE host, and we can therefore move this register into the EL1 state which is only saved/restored on vcpu_put/load for a VHE host. We also take this chance to rename the function saving/restoring the remaining system register to make it clear this function deals with the EL1 system registers. Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Remove noop calls to timer save/restore from VHE switchChristoffer Dall2-24/+22
The VHE switch function calls __timer_enable_traps and __timer_disable_traps which don't do anything on VHE systems. Therefore, simply remove these calls from the VHE switch function and make the functions non-conditional as they are now only called from the non-VHE switch path. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Don't deactivate VM on VHE systemsChristoffer Dall1-5/+3
There is no need to reset the VTTBR to zero when exiting the guest on VHE systems. VHE systems don't use stage 2 translations for the EL2&0 translation regime used by the host. Reviewed-by: Andrew Jones <[email protected]> Acked-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Remove kern_hyp_va() use in VHE switch functionChristoffer Dall1-3/+1
VHE kernels run completely in EL2 and therefore don't have a notion of kernel and hyp addresses, they are all just kernel addresses. Therefore don't call kern_hyp_va() in the VHE switch function. Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Introduce VHE-specific kvm_vcpu_runChristoffer Dall6-9/+87
So far this is mostly (see below) a copy of the legacy non-VHE switch function, but we will start reworking these functions in separate directions to work on VHE and non-VHE in the most optimal way in later patches. The only difference after this patch between the VHE and non-VHE run functions is that we omit the branch-predictor variant-2 hardening for QC Falkor CPUs, because this workaround is specific to a series of non-VHE ARMv8.0 CPUs. Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Factor out fault info population and gic workaroundsChristoffer Dall1-47/+57
The current world-switch function has functionality to detect a number of cases where we need to fixup some part of the exit condition and possibly run the guest again, before having restored the host state. This includes populating missing fault info, emulating GICv2 CPU interface accesses when mapped at unaligned addresses, and emulating the GICv3 CPU interface on systems that need it. As we are about to have an alternative switch function for VHE systems, but VHE systems still need the same early fixup logic, factor out this logic into a separate function that can be shared by both switch functions. No functional change. Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Improve debug register save/restore flowChristoffer Dall3-30/+42
Instead of having multiple calls from the world switch path to the debug logic, each figuring out if the dirty bit is set and if we should save/restore the debug registers, let's just provide two hooks to the debug save/restore functionality, one for switching to the guest context, and one for switching to the host context, and we get the benefit of only having to evaluate the dirty flag once on each path, plus we give the compiler some more room to inline some of this functionality. Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
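The two-hook structure described above might look roughly like the sketch below. All names (`hw_dbg`, `debug_switch_to_guest`, `debug_switch_to_host`, the struct fields) are hypothetical stand-ins rather than the kernel's identifiers, and a global struct plays the role of the physical debug registers. Clearing the dirty flag on the way back to the host is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

struct dbg_state {
	uint64_t bvr0;  /* one breakpoint register, for illustration */
};

/* Stand-in for the physical debug registers. */
struct dbg_state hw_dbg;

struct vcpu {
	bool debug_dirty;            /* guest debug state needs switching */
	struct dbg_state host_dbg;
	struct dbg_state guest_dbg;
};

/* Hook 1: called once on the way into the guest. */
void debug_switch_to_guest(struct vcpu *vcpu)
{
	if (!vcpu->debug_dirty)      /* single check on this path */
		return;
	vcpu->host_dbg = hw_dbg;     /* save host state */
	hw_dbg = vcpu->guest_dbg;    /* restore guest state */
}

/* Hook 2: called once on the way back to the host. */
void debug_switch_to_host(struct vcpu *vcpu)
{
	if (!vcpu->debug_dirty)      /* single check on this path */
		return;
	vcpu->guest_dbg = hw_dbg;    /* save guest state */
	hw_dbg = vcpu->host_dbg;     /* restore host state */
	vcpu->debug_dirty = false;   /* assumed: cleared unconditionally */
}
```

With the flag clear, both hooks return immediately and the debug registers are never touched on the world-switch path.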
2018-03-19KVM: arm64: Slightly improve debug save/restore functionsChristoffer Dall1-14/+12
The debug save/restore functions can be improved by using the has_vhe() static key instead of the instruction alternative. Using the static key uses the same paradigm as we're going to use elsewhere, it makes the code more readable, and it generates slightly better code (no stack setups and function calls unless necessary). We also use a static key on the restore path, because it will be marginally faster than loading a value from memory. Finally, we don't have to conditionally clear the debug dirty flag if it's set, we can just clear it. Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Move debug dirty flag calculation out of world switchChristoffer Dall2-6/+5
There is no need to figure out inside the world-switch if we should save/restore the debug registers or not, we might as well do that in the higher level debug setup code, making it easier to optimize down the line. Reviewed-by: Julien Thierry <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm/arm64: Introduce vcpu_el1_is_32bitChristoffer Dall4-12/+17
We have numerous checks around that check whether HCR_EL2 has the RW bit set to figure out if we're running an AArch64 or AArch32 VM. In some cases, directly checking the RW bit (given its unintuitive name) is a bit confusing, and that's not going to improve as we move logic around for the following patches that optimize KVM on AArch64 hosts with VHE. Therefore, introduce a helper, vcpu_el1_is_32bit, and replace existing direct checks of HCR_EL2.RW with the helper. Reviewed-by: Julien Grall <[email protected]> Reviewed-by: Julien Thierry <[email protected]> Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
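A minimal sketch of such a helper, assuming a simplified `struct vcpu` that holds only the shadow HCR_EL2 value (the real structure is much larger). HCR_EL2.RW is bit 31, and a set bit means EL1 executes in AArch64, which is why the helper negates the test.

```c
#include <stdbool.h>
#include <stdint.h>

#define HCR_RW (UINT64_C(1) << 31)  /* HCR_EL2.RW: 1 = EL1 is AArch64 */

/* Simplified stand-in for the vcpu structure. */
struct vcpu {
	uint64_t hcr_el2;
};

/* True when the guest's EL1 runs in AArch32 (RW bit clear). */
bool vcpu_el1_is_32bit(const struct vcpu *vcpu)
{
	return !(vcpu->hcr_el2 & HCR_RW);
}
```

A call site then reads `if (vcpu_el1_is_32bit(vcpu))` instead of testing a bit whose name says nothing about register width.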
2018-03-19KVM: arm/arm64: Add kvm_vcpu_load_sysregs and kvm_vcpu_put_sysregsChristoffer Dall4-0/+38
As we are about to move a bunch of save/restore logic for VHE kernels to the load and put functions, we need some infrastructure to do this. Reviewed-by: Andrew Jones <[email protected]> Acked-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm/arm64: Get rid of vcpu->arch.irq_linesChristoffer Dall10-37/+16
We currently have a separate read-modify-write of the HCR_EL2 on entry to the guest for the sole purpose of setting the VF and VI bits, if set. Since this is most rarely the case (only when using userspace IRQ chip and interrupts are in flight), let's get rid of this operation and instead modify the bits in the vcpu->arch.hcr[_el2] directly when needed. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Julien Thierry <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
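The change above amounts to flipping the virtual interrupt bits in the stored HCR_EL2 value at the moment the line level changes, rather than merging them in on every guest entry. A hedged sketch follows; the struct, the enum and the return convention are illustrative assumptions, while the bit positions (VF = bit 6, VI = bit 7) follow the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define HCR_VF (UINT64_C(1) << 6)  /* virtual FIQ pending */
#define HCR_VI (UINT64_C(1) << 7)  /* virtual IRQ pending */

enum cpu_irq { CPU_IRQ, CPU_FIQ };

/* Simplified stand-in for the vcpu structure. */
struct vcpu {
	uint64_t hcr_el2;
};

/*
 * Set or clear the virtual IRQ/FIQ line directly in the shadow
 * HCR_EL2 value. Returns true if the line level actually changed.
 */
bool vcpu_interrupt_line(struct vcpu *vcpu, enum cpu_irq which, bool level)
{
	uint64_t bit = (which == CPU_IRQ) ? HCR_VI : HCR_VF;
	bool was_set = (vcpu->hcr_el2 & bit) != 0;

	if (level)
		vcpu->hcr_el2 |= bit;
	else
		vcpu->hcr_el2 &= ~bit;
	return was_set != level;
}
```

The guest entry path can then write `hcr_el2` to the register as-is, with no extra read-modify-write.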
2018-03-19KVM: arm64: Move HCR_INT_OVERRIDE to default HCR_EL2 guest flagShih-Wei Li2-5/+2
We always set the IMO and FMO bits in the HCR_EL2 when running the guest, regardless if we use the vgic or not. By moving these flags to HCR_GUEST_FLAGS we can avoid one of the extra save/restore operations of HCR_EL2 in the world switch code, and we can also soon get rid of the other one. This is safe, because even though the IMO and FMO bits control both taking the interrupts to EL2 and remapping ICC_*_EL1 to ICV_*_EL1 when executed at EL1, as long as we ensure that these bits are clear when running the EL1 host, we're OK, because we reset the HCR_EL2 to only have the HCR_RW bit set when returning to EL1 on non-VHE systems. Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Shih-Wei Li <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Rework hyp_panic for VHE and non-VHEChristoffer Dall1-19/+23
VHE actually doesn't rely on clearing the VTTBR when returning to the host kernel, and that is the current key mechanism of hyp_panic to figure out how to attempt to return to a state good enough to print a panic statement. Therefore, we split the hyp_panic function into two functions, a VHE and a non-VHE, keeping the non-VHE version intact, but changing the VHE behavior. The vttbr_el2 check on VHE doesn't really make that much sense, because the only situation where we can get here on VHE is when the hypervisor assembly code actually called into hyp_panic, which only happens when VBAR_EL2 has been set to the KVM exception vectors. On VHE, we can always safely disable the traps and restore the host registers at this point, so we simply do that unconditionally and call into the panic function directly. Acked-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm64: Avoid storing the vcpu pointer on the stackChristoffer Dall7-27/+48
We already have the percpu area for the host cpu state, which points to the VCPU, so there's no need to store the VCPU pointer on the stack on every context switch. We can be a little more clever and just use tpidr_el2 for the percpu offset and load the VCPU pointer from the host context. This has the benefit of being able to retrieve the host context even when our stack is corrupted, and it has a potential performance benefit because we trade a store plus a load for an mrs and a load on a round trip to the guest. This does require us to calculate the percpu offset without including the offset from the kernel mapping of the percpu array to the linear mapping of the array (which is what we store in tpidr_el1), because a PC-relative generated address in EL2 is already giving us the hyp alias of the linear mapping of a kernel address. We do this in __cpu_init_hyp_mode() by using kvm_ksym_ref(). The code that accesses ESR_EL2 was previously using an alternative to use the _EL1 accessor on VHE systems, but this was actually unnecessary as the _EL1 accessor aliases the ESR_EL2 register on VHE, and the _EL2 accessor does the same thing on both systems. Cc: Ard Biesheuvel <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm/arm64: Move vcpu_load call after kvm_vcpu_first_run_initChristoffer Dall3-29/+8
Moving the call to vcpu_load() in kvm_arch_vcpu_ioctl_run() to after we've called kvm_vcpu_first_run_init() simplifies some of the vgic code, and there is also no need to do vcpu_load() for things such as handling the immediate_exit flag. Reviewed-by: Julien Grall <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUNChristoffer Dall2-12/+0
Calling vcpu_load() registers preempt notifiers for this vcpu and calls kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy lifting on arm/arm64 and will try to do things such as enabling the virtual timer and setting us up to handle interrupts from the timer hardware. Loading state onto hardware registers and enabling hardware to signal interrupts can be problematic when we're not actually about to run the VCPU, because it makes it difficult to establish the right context when handling interrupts from the timer, and it makes the register access code difficult to reason about. Luckily, now that we call vcpu_load in each ioctl implementation, we can simply remove the call from the non-KVM_RUN vcpu ioctls, and our kvm_arch_vcpu_load() is only used for loading vcpu content to the physical CPU when we're actually going to run the vcpu. Reviewed-by: Julien Grall <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Handle 1GB pages in radix page fault handlerPaul Mackerras1-36/+93
This adds code to the radix hypervisor page fault handler to handle the case where the guest memory is backed by 1GB hugepages, and put them into the partition-scoped radix tree at the PUD level. The code is essentially analogous to the code for 2MB pages. This also rearranges kvmppc_create_pte() to make it easier to follow. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Streamline setting of reference and change bitsPaul Mackerras1-33/+19
When using the radix MMU, we can get hypervisor page fault interrupts with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an attempt to set the R (reference) or C (change) bit in a PTE atomically failed. Previously we would find the corresponding Linux PTE and check the permission and dirty bits there, but this is not really necessary since we only need to do what the hardware was trying to do, namely set R or C atomically. This removes the code that reads the Linux PTE and just updates the partition-scoped PTE, having first checked that it is still present, and if the access is a write, that the PTE still has write permission. Furthermore, we now check whether any other relevant bits are set in DSISR, and if there are, then we proceed with the rest of the function in order to handle whatever condition they represent, instead of returning to the guest as we did previously. Signed-off-by: Paul Mackerras <[email protected]>
2018-03-19KVM: PPC: Book3S HV: Radix page fault handler optimizationsPaul Mackerras1-15/+27
This improves the handling of transparent huge pages in the radix hypervisor page fault handler. Previously, if a small page is faulted in to a 2MB region of guest physical space, that means that there is a page table pointer at the PMD level, which could never be replaced by a leaf (2MB) PMD entry. This adds the code to clear the PMD, invalidate the page walk cache and free the page table page in this situation, so that the leaf PMD entry can be created. This also adds code to check whether a PMD or PTE being inserted is the same as is already there (because of a race with another CPU that faulted on the same page) and if so, we don't replace the existing entry, meaning that we avoid invalidating the PTE or PMD and doing a TLB invalidation. Signed-off-by: Paul Mackerras <[email protected]>
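The "same entry already there" check can be illustrated with a toy model. This is not the PPC radix code: the `pte_t` typedef, the `install_pte()` helper and the flush counter are hypothetical and only show why an identical entry, installed by a racing CPU, needs neither replacement nor a TLB invalidation.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;  /* toy PTE: 0 means "not present" */

/*
 * Install new_pte into *slot. Returns true if the slot was modified.
 * *tlb_flushes counts the invalidations a real implementation would
 * have to issue when a valid old entry is replaced.
 */
bool install_pte(pte_t *slot, pte_t new_pte, unsigned int *tlb_flushes)
{
	if (*slot == new_pte)
		return false;        /* another CPU already installed it */
	if (*slot != 0)
		++*tlb_flushes;      /* old translation must be invalidated */
	*slot = new_pte;
	return true;
}
```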
2018-03-19KVM: PPC: Remove unused kvm_unmap_hva callbackPaul Mackerras10-46/+2
Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2", 2017-08-31), the MMU notifier code in KVM no longer calls the kvm_unmap_hva callback. This removes the PPC implementations of kvm_unmap_hva(). Signed-off-by: Paul Mackerras <[email protected]>
2018-03-16x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASEVitaly Kuznetsov3-2/+9
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so the context is pretty well defined and as we're past 'swapgs' MSR_GS_BASE should contain kernel's GS base which we point to irq_stack_union. Add a new kernelmode_gs_base() API; irq_stack_union needs to be exported as KVM can be built as a module. Acked-by: Andy Lutomirski <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->threadVitaly Kuznetsov3-3/+29
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so the context is pretty well defined. Read MSR_{FS,KERNEL_GS}_BASE from current->thread after calling save_fsgs(), which now takes care of the X86_BUG_NULL_SEG case and will do RD{FS,GS}BASE when FSGSBASE extensions are exposed to userspace (currently they are not). Acked-by: Andy Lutomirski <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: X86: Provide a capability to disable PAUSE interceptsWanpeng Li5-7/+27
Allow disabling pause loop exiting/pause filtering on a per-VM basis. If some VMs have dedicated host CPUs, they won't be negatively affected by needlessly intercepted PAUSE instructions. Thanks to Jan H. Schönherr's initial patch. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Jan H. Schönherr <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: X86: Provide a capability to disable HLT interceptsWanpeng Li7-2/+46
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT. This patch adds the per-VM capability to disable them. Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Jan H. Schönherr <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: X86: Provide a capability to disable MWAIT interceptsWanpeng Li8-25/+53
Allowing a guest to execute MWAIT without interception enables a guest to put a (physical) CPU into a power saving state, where it takes longer to return from than what may be desired by the host. Don't give a guest that power over a host by default. (Especially, since nothing prevents a guest from using MWAIT even when it is not advertised via CPUID.) Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Jan H. Schönherr <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
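The three per-VM capabilities above (PAUSE, HLT and MWAIT intercepts) share one pattern: userspace passes a bitmask, unknown bits are rejected, and each accepted bit turns off one intercept for that VM only. The sketch below models that pattern with hypothetical `DEMO_*` constants and a hypothetical `vm_disable_exits()` helper; it does not reproduce the kernel's ioctl path or its real flag names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_DISABLE_EXITS_MWAIT (UINT64_C(1) << 0)
#define DEMO_DISABLE_EXITS_HLT   (UINT64_C(1) << 1)
#define DEMO_DISABLE_EXITS_PAUSE (UINT64_C(1) << 2)
#define DEMO_DISABLE_VALID_EXITS (DEMO_DISABLE_EXITS_MWAIT | \
				  DEMO_DISABLE_EXITS_HLT |   \
				  DEMO_DISABLE_EXITS_PAUSE)

/* Per-VM policy; all false means every instruction is intercepted. */
struct vm_exit_policy {
	bool mwait_in_guest;
	bool hlt_in_guest;
	bool pause_in_guest;
};

/* Returns 0 on success, -1 if an unknown flag bit is set. */
int vm_disable_exits(struct vm_exit_policy *vm, uint64_t flags)
{
	if (flags & ~DEMO_DISABLE_VALID_EXITS)
		return -1;           /* reject unknown bits, change nothing */
	if (flags & DEMO_DISABLE_EXITS_MWAIT)
		vm->mwait_in_guest = true;
	if (flags & DEMO_DISABLE_EXITS_HLT)
		vm->hlt_in_guest = true;
	if (flags & DEMO_DISABLE_EXITS_PAUSE)
		vm->pause_in_guest = true;
	return 0;
}
```

A zero-initialized policy keeps all intercepts on, which matches the default stated in the MWAIT commit: the guest gets no such power unless the host opts in.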
2018-03-16Merge tag 'kvm-s390-next-4.17-1' of ↵Paolo Bonzini18-119/+249
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: fixes and features
- more kvm stat counters
- virtio gpu plumbing. The 3 non-KVM/s390 patches have Acks from Bartlomiej Zolnierkiewicz, Heiko Carstens and Greg Kroah-Hartman but all belong together to make virtio-gpu work as a tty. So I carried them in the KVM/s390 tree.
- document some KVM_CAPs
- cpu-model only facilities
- cleanups
2018-03-16KVM: x86: Add support for VMware backdoor Pseudo-PMCsArbel Moshe4-17/+75
VMware exposes the following Pseudo PMCs:
0x10000: Physical host TSC
0x10001: Elapsed real time in ns
0x10002: Elapsed apparent time in ns
For more info refer to: https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf VMware allows access to these Pseudo-PMCs even when read via RDPMC in Ring3 and CR4.PCE=0. Therefore, this commit modifies the x86 emulator to allow access to these PMCs in this situation. In addition, emulation of these PMCs was added to kvm_pmu_rdpmc(). Signed-off-by: Arbel Moshe <[email protected]> Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
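The permission rule above can be sketched as a small predicate. The three index values come from the commit message; the macro and function names are illustrative assumptions, and the baseline rule (RDPMC is allowed when CR4.PCE is set or the caller runs at CPL 0) is a simplified reading of the architecture for protected mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Index values from the commit message; macro names are hypothetical. */
#define VMWARE_PMC_HOST_TSC      UINT32_C(0x10000)
#define VMWARE_PMC_REAL_TIME     UINT32_C(0x10001)
#define VMWARE_PMC_APPARENT_TIME UINT32_C(0x10002)

bool is_vmware_backdoor_pmc(uint32_t idx)
{
	switch (idx) {
	case VMWARE_PMC_HOST_TSC:
	case VMWARE_PMC_REAL_TIME:
	case VMWARE_PMC_APPARENT_TIME:
		return true;
	default:
		return false;
	}
}

/* May RDPMC of counter idx proceed without raising #GP? */
bool rdpmc_permitted(uint32_t idx, bool cr4_pce, unsigned int cpl)
{
	if (is_vmware_backdoor_pmc(idx))
		return true;         /* pseudo-PMCs bypass the CR4.PCE check */
	return cr4_pce || cpl == 0;  /* simplified architectural rule */
}
```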
2018-03-16KVM: x86: SVM: Intercept #GP to support access to VMware backdoor portsLiran Alon1-0/+26
If KVM enable_vmware_backdoor module parameter is set, the commit changes SVM to now intercept #GP instead of it being directly delivered from CPU to guest. It is done to support access to VMware backdoor I/O ports even if TSS I/O permission denies it. In that case:
1. A #GP will be raised and intercepted.
2. The #GP intercept handler will simulate the I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware backdoor ports specifically even if the TSS I/O permission bitmap denies it.
Note that the above change introduces a slight performance hit as #GPs are now not delivered directly from CPU to guest but instead cause #VMExit and instruction emulation. However, this behavior is introduced only when enable_vmware_backdoor KVM module parameter is set. Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: x86: VMX: Intercept #GP to support access to VMware backdoor portsLiran Alon1-0/+24
If KVM enable_vmware_backdoor module parameter is set, the commit changes VMX to now intercept #GP instead of it being directly delivered from CPU to guest. It is done to support access to VMware backdoor I/O ports even if TSS I/O permission denies it. In that case:
1. A #GP will be raised and intercepted.
2. The #GP intercept handler will simulate the I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware backdoor ports specifically even if the TSS I/O permission bitmap denies it.
Note that the above change introduces a slight performance hit as #GPs are now not delivered directly from CPU to guest but instead cause #VMExit and instruction emulation. However, this behavior is introduced only when enable_vmware_backdoor KVM module parameter is set. Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: x86: Emulate only IN/OUT instructions when accessing VMware backdoorLiran Alon2-0/+29
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS instructions. These ports must be allowed access even if the TSS I/O permission bitmap doesn't allow it. To handle this, VMX/SVM will be changed in future commits to intercept #GP which was raised by such access and handle it by calling the x86 emulator to emulate the instruction. If it was one of these instructions, the x86 emulator already handles it correctly (since commit "KVM: x86: Always allow access to VMware backdoor I/O ports") by not checking these ports against the TSS I/O permission bitmap. One may wonder why checking for specific instructions is necessary as we can just forward all #GPs to the x86 emulator. There are multiple reasons for doing so:
1. We don't want the x86 emulator to be reached easily by the guest by just executing an instruction that raises #GP, as that exposes the x86 emulator as a bigger attack surface.
2. The x86 emulator is incomplete and therefore certain instructions that can cause #GP cannot be emulated. Such an example is "INT x" (opcode 0xcd) which reaches emulate_int() which can only emulate the instruction if vCPU is in real-mode.
Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Konrad Rzeszutek Wilk <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
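The instruction filter described above could be sketched as a check on the primary opcode byte. The opcode values are the standard one-byte x86 encodings of IN, OUT, INS and OUTS; the function name is an assumption, and prefix handling and two-byte opcodes are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the one-byte opcode is a port I/O instruction that
 * the #GP intercept should hand to the emulator. Anything else (for
 * example INT n, opcode 0xcd) is rejected so the #GP can be forwarded
 * to the guest unchanged.
 */
bool is_backdoor_io_opcode(uint8_t opcode)
{
	switch (opcode) {
	case 0xe4: case 0xe5:  /* IN  AL/eAX, imm8 */
	case 0xec: case 0xed:  /* IN  AL/eAX, DX   */
	case 0xe6: case 0xe7:  /* OUT imm8, AL/eAX */
	case 0xee: case 0xef:  /* OUT DX, AL/eAX   */
	case 0x6c: case 0x6d:  /* INSB / INSW/INSD */
	case 0x6e: case 0x6f:  /* OUTSB / OUTSW/OUTSD */
		return true;
	default:
		return false;
	}
}
```

Keeping the allowed set this small is what limits the emulator's exposure to guest-triggered #GPs.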
2018-03-16KVM: x86: Add emulation_type to not raise #UD on emulation failureLiran Alon2-3/+9
The next commits are going to introduce support for accessing VMware backdoor ports even though the guest's TSS I/O permissions bitmap doesn't allow access. This mimics VMware hypervisor behavior. In order to support this, the next commits will change VMX/SVM to intercept #GP which was raised by such access and handle it by calling the x86 emulator to emulate the instruction. Since commit "KVM: x86: Always allow access to VMware backdoor I/O ports", the x86 emulator handles access to these I/O ports by not checking these ports against the TSS I/O permission bitmap. However, there could be cases where the CPU raises a #GP on an instruction that fails to be disassembled by the x86 emulator (because of an incomplete implementation, for example). In those cases, we would like the #GP intercept to just forward the #GP as-is to the guest as if there was no intercept to begin with. However, the current emulator code always queues a #UD exception in case the emulator fails (including disassembly failures), which is not what is wanted in this flow. This commit addresses this issue by adding a new emulation_type flag that will allow the #GP intercept handler to specify that it wishes to be aware when instruction emulation fails and doesn't want a #UD exception to be queued. Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: x86: Always allow access to VMware backdoor I/O portsLiran Alon1-0/+11
VMware allows access to these ports even if denied by TSS I/O permission bitmap. Mimic behavior. Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
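A sketch of the resulting permission check follows. The commit message does not name the ports; the values 0x5658 and 0x5659 used here are an assumption based on the commonly documented VMware backdoor ports, and all identifiers are illustrative. The `backdoor_enabled` argument models the module parameter introduced elsewhere in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed port numbers; not stated in the commit message. */
#define VMWARE_PORT_VMPORT UINT16_C(0x5658)
#define VMWARE_PORT_VMRPC  UINT16_C(0x5659)

bool is_vmware_backdoor_port(uint16_t port)
{
	return port == VMWARE_PORT_VMPORT || port == VMWARE_PORT_VMRPC;
}

/*
 * Decide whether an emulated IN/OUT may touch the port. Backdoor
 * ports skip the TSS I/O permission bitmap result when the feature
 * is enabled; every other port keeps the normal outcome.
 */
bool io_port_access_allowed(uint16_t port, bool tss_bitmap_allows,
			    bool backdoor_enabled)
{
	if (backdoor_enabled && is_vmware_backdoor_port(port))
		return true;
	return tss_bitmap_allows;
}
```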
2018-03-16KVM: x86: Add module parameter for supporting VMware backdoorLiran Alon3-0/+9
Supporting access to the VMware backdoor requires KVM to intercept #GP exceptions from the guest, which introduces a slight performance hit. Therefore, control this support by a module parameter. Note that the module parameter is exported as it should be consumed by kvm_intel & kvm_amd to determine if they should intercept #GP or not. This commit doesn't change semantics. It is done as a preparation for future commits. Signed-off-by: Liran Alon <[email protected]> Reviewed-by: Nikita Leshenko <[email protected]> Reviewed-by: Radim Krčmář <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: x86: add kvm_fast_pio() to consolidate fast PIO codeSean Christopherson4-27/+24
Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM. Unexport kvm_fast_pio_in() and kvm_fast_pio_out(). Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: VMX: use kvm_fast_pio_in for handling IN I/OSean Christopherson1-3/+6
Fast emulation of processor I/O for IN was disabled on x86 (both VMX and SVM) some years ago due to a buggy implementation. The addition of kvm_fast_pio_in(), used by SVM, re-introduced (functional!) fast emulation of IN. Piggyback SVM's work and use kvm_fast_pio_in() on VMX instead of performing full emulation of IN. Reviewed-by: Paolo Bonzini <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: vVMX: signal failure for nested VMEntry if emulation_requiredSean Christopherson1-0/+15
Fail a nested VMEntry with EXIT_REASON_INVALID_STATE if L2 guest state is invalid, i.e. vmcs12 contained invalid guest state, and unrestricted guest is disabled in L0 (and by extension disabled in L1). WARN_ON_ONCE in handle_invalid_guest_state() if we're attempting to emulate L2, i.e. nested_run_pending is true, to aid debug in the (hopefully unlikely) scenario that we somehow skip the nested VMEntry consistency check, e.g. due to a L0 bug. Note: KVM relies on hardware to detect the scenario where unrestricted guest is enabled in L0 but disabled in L1 and vmcs12 contains invalid guest state, i.e. checking emulation_required in prepare_vmcs02 is required only to handle the case where unrestricted guest is disabled in L0 since L0 never actually attempts VMLAUNCH/VMRESUME with vmcs02. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2018-03-16KVM: VMX: WARN on a MOV CR3 exit w/ unrestricted guestSean Christopherson1-0/+2
CR3 load/store exiting are always off when unrestricted guest is enabled. WARN on the associated CR3 VMEXIT to detect code that would re-introduce CR3 load/store exiting for unrestricted guest. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: VMX: give unrestricted guest full control of CR3Sean Christopherson1-2/+2
Now that CR3 is not forced to a host-controlled value when paging is disabled in an unrestricted guest, CR3 load/store exiting can be left disabled (for an unrestricted guest). And because CR0.WP and CR4.PAE/PSE are also not forced to host-controlled values, all of ept_update_paging_mode_cr0() is no longer needed, i.e. skip ept_update_paging_mode_cr0() for an unrestricted guest. Because MOV CR3 no longer exits when paging is disabled for an unrestricted guest, vmx_decache_cr3() must always read GUEST_CR3 from the VMCS for an unrestricted guest. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: VMX: don't force CR4.PAE/PSE for unrestricted guestSean Christopherson1-14/+22
CR4.PAE - Unrestricted guest can only be enabled when EPT is enabled, and vmx_set_cr4() clears hardware CR4.PAE based on the guest's CR4.PAE, i.e. CR4.PAE always follows the guest's value when unrestricted guest is enabled. CR4.PSE - Unrestricted guest no longer uses the identity mapped IA32 page tables since CR0.PG can be cleared in hardware, thus there is no need to set CR4.PSE when paging is disabled in the guest (and EPT is enabled). Define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST (to X86_CR4_VMXE) and use it in lieu of KVM_*MODE_VM_CR4_ALWAYS_ON when unrestricted guest is enabled, which removes the forcing of CR4.PAE. Skip the manipulation of CR4.PAE/PSE for EPT when unrestricted guest is enabled, as CR4.PAE isn't forced and so doesn't need to be manually cleared, and CR4.PSE does not need to be set when paging is disabled since the identity mapped IA32 page tables are not used. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
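The CR4 handling described above can be modelled as a pure function. This is a simplified sketch and not the body of vmx_set_cr4(): the function name sketch_hw_cr4() and its parameters are hypothetical, the legacy always-on mask is assumed to be PAE plus VMXE (real-mode specifics are left out), and the bit positions are the architectural ones (PSE bit 4, PAE bit 5, VMXE bit 13).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CR4_PSE  (UINT64_C(1) << 4)
#define SKETCH_CR4_PAE  (UINT64_C(1) << 5)
#define SKETCH_CR4_VMXE (UINT64_C(1) << 13)

/*
 * Hypothetical model of the hardware CR4 value derived from the guest's
 * CR4. With unrestricted guest, only VMXE is forced and PAE/PSE follow
 * the guest. Otherwise (assumed legacy behavior) PAE is forced on and
 * then adjusted for EPT depending on whether the guest has paging on.
 */
static uint64_t sketch_hw_cr4(uint64_t guest_cr4, bool unrestricted_guest,
			      bool ept, bool guest_paging)
{
	uint64_t hw_cr4 = guest_cr4;

	if (unrestricted_guest)
		return hw_cr4 | SKETCH_CR4_VMXE;

	hw_cr4 |= SKETCH_CR4_PAE | SKETCH_CR4_VMXE;

	if (ept) {
		if (!guest_paging) {
			/* Identity-mapped IA32 page tables are in use. */
			hw_cr4 &= ~SKETCH_CR4_PAE;
			hw_cr4 |= SKETCH_CR4_PSE;
		} else if (!(guest_cr4 & SKETCH_CR4_PAE)) {
			hw_cr4 &= ~SKETCH_CR4_PAE;
		}
	}

	return hw_cr4;
}
```

In the unrestricted case the function never touches PAE or PSE, which is the point of the change: those bits simply track whatever the guest wrote.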
2018-03-16KVM: VMX: remove CR0.WP from ..._ALWAYS_ON_UNRESTRICTED_GUESTSean Christopherson1-3/+4
Unrestricted guest can only be enabled when EPT is enabled, and when EPT is enabled, ept_update_paging_mode_cr0() will clear hardware CR0.WP based on the guest's CR0.WP, i.e. CR0.WP always follows the guest's value when unrestricted guest is enabled. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>
2018-03-16KVM: VMX: don't configure EPT identity map for unrestricted guestSean Christopherson1-2/+3
An unrestricted guest can run with hardware CR0.PG==0, i.e. IA32 paging disabled, in which case there is no need to load the guest's CR3 with identity mapped IA32 page tables since hardware will effectively ignore CR3. If unrestricted guest is enabled, don't configure the identity mapped IA32 page table and always load the guest's desired CR3. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Radim Krčmář <[email protected]>