|
We are about to defer saving and restoring some groups of system
registers to vcpu_put and vcpu_load on supported systems. This means
that we need some infrastructure to access system registers which
supports either accessing the memory backing of the register or directly
accessing the system registers, depending on the state of the system
when we access the register.
We do this by defining read/write accessor functions, which can handle
both "immediate" and "deferrable" system registers. Immediate registers
are always saved/restored in the world-switch path, but deferrable
registers are only saved/restored in vcpu_put/vcpu_load when supported,
in which case sysregs_loaded_on_cpu will be set.
Note that we don't use the deferred mechanism yet in this patch, but only
introduce infrastructure. This makes the subsequent patches easier to
review, because each of them makes it clear which registers become
deferred.
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
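The accessor scheme described in this commit can be illustrated with a
minimal sketch. All names below (sketch_vcpu, sketch_hw_regs, the two
register indices) are hypothetical stand-ins and not the kernel's actual
definitions; the array sketch_hw_regs merely simulates the physical
system registers so that the logic can be exercised in userspace.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register indices for this sketch; not the kernel's enum. */
enum sketch_reg { REG_IMMEDIATE, REG_DEFERRABLE, NR_SKETCH_REGS };

struct sketch_vcpu {
	uint64_t sys_regs[NR_SKETCH_REGS]; /* memory backing of the registers */
	bool sysregs_loaded_on_cpu;        /* set between vcpu_load and vcpu_put */
};

/* Assumption: stands in for the physical CPU's system registers. */
static uint64_t sketch_hw_regs[NR_SKETCH_REGS];

/* Only deferrable registers may live on the CPU outside the world switch. */
static bool sketch_reg_is_deferrable(enum sketch_reg reg)
{
	return reg == REG_DEFERRABLE;
}

/* Read from the CPU if the register is currently loaded there, else from memory. */
static uint64_t sketch_read_sys_reg(const struct sketch_vcpu *vcpu,
				    enum sketch_reg reg)
{
	if (vcpu->sysregs_loaded_on_cpu && sketch_reg_is_deferrable(reg))
		return sketch_hw_regs[reg];
	return vcpu->sys_regs[reg];
}

/* Write to the CPU if the register is currently loaded there, else to memory. */
static void sketch_write_sys_reg(struct sketch_vcpu *vcpu,
				 enum sketch_reg reg, uint64_t val)
{
	if (vcpu->sysregs_loaded_on_cpu && sketch_reg_is_deferrable(reg)) {
		sketch_hw_regs[reg] = val;
		return;
	}
	vcpu->sys_regs[reg] = val;
}
```

In this sketch an immediate register always resolves to the memory
backing, which matches the description above: immediate registers are
saved and restored on every world switch, so the in-memory copy is
authoritative whenever the accessor runs.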
|
|
Currently we access the system registers array via the vcpu_sys_reg()
macro. However, we are about to change the behavior to sometimes
modify the register file directly, so let's change this to two
primitives:
* Accessor macros vcpu_write_sys_reg() and vcpu_read_sys_reg()
* Direct array access macro __vcpu_sys_reg()
The accessor macros should be used in places where the code needs to
access the currently loaded VCPU's state as observed by the guest. For
example, when trapping on cache related registers, a write to a system
register should go directly to the VCPU version of the register.
The direct array access macro can be used in places where the VCPU is
known to never be running (for example userspace access) or for
registers which are never context switched (for example all the PMU
system registers).
This rewrites all users of vcpu_sys_reg() to one of the macros described
above.
No functional change.
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
We currently handle 32-bit accesses to trapped VM system registers using
the 32-bit index into the coproc array on the vcpu structure, which is a
union of the coproc array and the sysreg array.
Since all the 32-bit coproc indices are created to correspond to the
architectural mapping between 64-bit system registers and 32-bit
coprocessor registers, and because the AArch64 system registers are
double the size of the AArch32 coprocessor registers, we can always find
the system register entry that we must update by dividing the 32-bit
coproc index by 2.
This is going to make our lives much easier when we have to start
accessing system registers that use deferred save/restore and might
have to be read directly from the physical CPU.
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
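The index arithmetic described in this commit can be sketched as
follows. The helper names are hypothetical, and the sketch assumes that
an even coproc index names the lower 32 bits of the 64-bit register and
the following odd index names the upper 32 bits; the commit message
itself only states the divide-by-2 relationship.

```c
#include <stdint.h>

/* Map a 32-bit coproc index to the 64-bit system register entry holding it. */
static unsigned int sketch_sysreg_index(unsigned int cp_idx)
{
	return cp_idx / 2;
}

/*
 * Update one 32-bit half of a 64-bit system register.
 * Assumption: odd coproc indices select the upper 32 bits.
 */
static void sketch_write_cp32(uint64_t *sys_regs, unsigned int cp_idx,
			      uint32_t val)
{
	unsigned int shift = (cp_idx & 1) ? 32 : 0;
	uint64_t mask = (uint64_t)0xffffffffu << shift;
	uint64_t *reg = &sys_regs[sketch_sysreg_index(cp_idx)];

	/* Clear the selected half, then merge in the new value. */
	*reg = (*reg & ~mask) | ((uint64_t)val << shift);
}
```

Using shifts on the 64-bit entry rather than aliasing the array through
a union keeps the sketch independent of host endianness.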
|
|
On non-VHE systems we need to save the ELR_EL2 and SPSR_EL2 so that we can
return to the host in EL1 in the same state and location where we issued a
hypercall to EL2, but on VHE ELR_EL2 and SPSR_EL2 are not useful because we
never enter a guest as a result of an exception entry that would be directly
handled by KVM. The kernel entry code already saves ELR_EL1/SPSR_EL1 on
exception entry, which is enough. Therefore, factor out these registers into
separate save/restore functions, making it easy to exclude them from the VHE
world-switch path later on.
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
There is no need to have multiple identical functions with different
names for saving host and guest state. When saving and restoring state
for the host and guest, the state is the same for both contexts, and
that's why we have the kvm_cpu_context structure. Delete one
version and rename the other into simply save/restore.
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
The comment only applied to SPE on non-VHE systems, so we simply remove
it.
Suggested-by: Andrew Jones <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
We are about to handle system registers quite differently between VHE
and non-VHE systems. In preparation for that, we need to split some of
the handling functions between VHE and non-VHE functionality.
For now, we simply copy the non-VHE functions, but we do change the use
of static keys for VHE and non-VHE functionality now that we have
separate functions.
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
As we are about to move calls around in the sysreg save/restore logic,
let's first rewrite the alternative function callers, because it is
going to make the next patches much easier to read.
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
There's a semantic difference between the EL1 registers that control
operation of a kernel running in EL1 and EL1 registers that only control
userspace execution in EL0. Since we can defer saving/restoring the
latter, move them into their own function.
The ARMv8 ARM (ARM DDI 0487C.a) Section D10.2.1 recommends that
ACTLR_EL1 has no effect on the processor when running the VHE host, and
we can therefore move this register into the EL1 state which is only
saved/restored on vcpu_put/load for a VHE host.
We also take this chance to rename the function saving/restoring the
remaining system registers to make it clear this function deals with
the EL1 system registers.
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
The VHE switch function calls __timer_enable_traps and
__timer_disable_traps which don't do anything on VHE systems.
Therefore, simply remove these calls from the VHE switch function and
make the functions non-conditional as they are now only called from the
non-VHE switch path.
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
There is no need to reset the VTTBR to zero when exiting the guest on
VHE systems. VHE systems don't use stage 2 translations for the EL2&0
translation regime used by the host.
Reviewed-by: Andrew Jones <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
VHE kernels run completely in EL2 and therefore don't have a notion of
kernel and hyp addresses; they are all just kernel addresses. Therefore,
don't call kern_hyp_va() in the VHE switch function.
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
So far this is mostly (see below) a copy of the legacy non-VHE switch
function, but we will start reworking these functions in separate
directions to work on VHE and non-VHE in the most optimal way in later
patches.
The only difference after this patch between the VHE and non-VHE run
functions is that we omit the branch-predictor variant-2 hardening for
QC Falkor CPUs, because this workaround is specific to a series of
non-VHE ARMv8.0 CPUs.
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
The current world-switch function has functionality to detect a number
of cases where we need to fixup some part of the exit condition and
possibly run the guest again, before having restored the host state.
This includes populating missing fault info, emulating GICv2 CPU
interface accesses when mapped at unaligned addresses, and emulating
the GICv3 CPU interface on systems that need it.
As we are about to have an alternative switch function for VHE systems,
but VHE systems still need the same early fixup logic, factor out this
logic into a separate function that can be shared by both switch
functions.
No functional change.
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
Instead of having multiple calls from the world switch path to the debug
logic, each figuring out if the dirty bit is set and if we should
save/restore the debug registers, let's just provide two hooks to the
debug save/restore functionality, one for switching to the guest
context, and one for switching to the host context, and we get the
benefit of only having to evaluate the dirty flag once on each path,
plus we give the compiler some more room to inline some of this
functionality.
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
The debug save/restore functions can be improved by using the has_vhe()
static key instead of the instruction alternative. Using the static key
uses the same paradigm as we're going to use elsewhere, it makes the
code more readable, and it generates slightly better code (no
stack setups and function calls unless necessary).
We also use a static key on the restore path, because it will be
marginally faster than loading a value from memory.
Finally, we don't have to conditionally clear the debug dirty flag if
it's set; we can just clear it.
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
There is no need to figure out inside the world-switch whether we should
save/restore the debug registers. We might as well do that in the
higher-level debug setup code, making it easier to optimize down the
line.
Reviewed-by: Julien Thierry <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
We have numerous checks that test whether HCR_EL2 has the RW bit
set to figure out if we're running an AArch64 or AArch32 VM. In some
cases, directly checking the RW bit (given its unintuitive name) is a
bit confusing, and that's not going to improve as we move logic around
for the following patches that optimize KVM on AArch64 hosts with VHE.
Therefore, introduce a helper, vcpu_el1_is_32bit, and replace existing
direct checks of HCR_EL2.RW with the helper.
Reviewed-by: Julien Grall <[email protected]>
Reviewed-by: Julien Thierry <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
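A minimal sketch of what such a helper looks like is shown below. The
struct and macro names are hypothetical simplifications; the only
architectural fact relied on is that HCR_EL2.RW is bit 31 and is set
when EL1 executes in AArch64 state.

```c
#include <stdbool.h>
#include <stdint.h>

/* HCR_EL2.RW is bit 31; when set, EL1 is AArch64. */
#define SKETCH_HCR_RW (UINT64_C(1) << 31)

/* Hypothetical, reduced stand-in for the per-vcpu arch state. */
struct sketch_vcpu_arch {
	uint64_t hcr_el2;
};

/* True when the guest's EL1 runs in AArch32 state, i.e. RW is clear. */
static bool sketch_vcpu_el1_is_32bit(const struct sketch_vcpu_arch *arch)
{
	return !(arch->hcr_el2 & SKETCH_HCR_RW);
}
```

The benefit is purely one of readability: call sites ask the question
they care about (is this a 32-bit guest?) instead of open-coding a
negated test of a bit whose name does not suggest its meaning.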
|
|
As we are about to move a bunch of save/restore logic for VHE kernels to
the load and put functions, we need some infrastructure to do this.
Reviewed-by: Andrew Jones <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
We currently have a separate read-modify-write of the HCR_EL2 on entry
to the guest for the sole purpose of setting the VF and VI bits, if set.
Since this is rarely the case (only when using a userspace IRQ chip
and interrupts are in flight), let's get rid of this operation and
instead modify the bits in the vcpu->arch.hcr[_el2] directly when
needed.
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Reviewed-by: Julien Thierry <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
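The idea of updating the cached HCR_EL2 value at the point where the
interrupt line changes, rather than on every guest entry, can be
sketched as below. The function name and the bare uint64_t standing in
for vcpu->arch.hcr_el2 are assumptions made for illustration; the bit
positions (VF is bit 6, VI is bit 7) come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_HCR_VF (UINT64_C(1) << 6) /* virtual FIQ pending */
#define SKETCH_HCR_VI (UINT64_C(1) << 7) /* virtual IRQ pending */

/*
 * Raise or lower a virtual interrupt line by editing the cached
 * HCR_EL2 value directly. The world switch then writes the cached
 * value as-is, with no separate read-modify-write on guest entry.
 */
static void sketch_set_virq_line(uint64_t *hcr_el2, uint64_t bit, bool level)
{
	if (level)
		*hcr_el2 |= bit;
	else
		*hcr_el2 &= ~bit;
}
```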
|
|
We always set the IMO and FMO bits in the HCR_EL2 when running the
guest, regardless of whether we use the vgic. By moving these flags to
HCR_GUEST_FLAGS we can avoid one of the extra save/restore operations of
HCR_EL2 in the world switch code, and we can also soon get rid of the
other one.
This is safe, because even though the IMO and FMO bits control both
taking the interrupts to EL2 and remapping ICC_*_EL1 to ICV_*_EL1 when
executed at EL1, as long as we ensure that these bits are clear when
running the EL1 host, we're OK, because we reset the HCR_EL2 to only
have the HCR_RW bit set when returning to EL1 on non-VHE systems.
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Shih-Wei Li <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
VHE actually doesn't rely on clearing the VTTBR when returning to the
host kernel, and that is the current key mechanism of hyp_panic to
figure out how to attempt to return to a state good enough to print a
panic statement.
Therefore, we split the hyp_panic function into two functions, a VHE
version and a non-VHE version, keeping the non-VHE version intact, but
changing the VHE behavior.
The vttbr_el2 check on VHE doesn't really make that much sense, because
the only situation where we can get here on VHE is when the hypervisor
assembly code actually called into hyp_panic, which only happens when
VBAR_EL2 has been set to the KVM exception vectors. On VHE, we can
always safely disable the traps and restore the host registers at this
point, so we simply do that unconditionally and call into the panic
function directly.
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
We already have the percpu area for the host cpu state, which points to
the VCPU, so there's no need to store the VCPU pointer on the stack on
every context switch. We can be a little more clever and just use
tpidr_el2 for the percpu offset and load the VCPU pointer from the host
context.
This has the benefit of being able to retrieve the host context even
when our stack is corrupted, and it has a potential performance benefit
because we trade a store plus a load for an mrs and a load on a round
trip to the guest.
This does require us to calculate the percpu offset without including
the offset from the kernel mapping of the percpu array to the linear
mapping of the array (which is what we store in tpidr_el1), because a
PC-relative generated address in EL2 is already giving us the hyp alias
of the linear mapping of a kernel address. We do this in
__cpu_init_hyp_mode() by using kvm_ksym_ref().
The code that accesses ESR_EL2 was previously using an alternative to
use the _EL1 accessor on VHE systems, but this was actually unnecessary
as the _EL1 accessor aliases the ESR_EL2 register on VHE, and the _EL2
accessor does the same thing on both systems.
Cc: Ard Biesheuvel <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
Moving the call to vcpu_load() in kvm_arch_vcpu_ioctl_run() to after
we've called kvm_vcpu_first_run_init() simplifies some of the vgic code,
and there is also no need to do vcpu_load() for things such as handling
the immediate_exit flag.
Reviewed-by: Julien Grall <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
Calling vcpu_load() registers preempt notifiers for this vcpu and calls
kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy
lifting on arm/arm64 and will try to do things such as enabling the
virtual timer and setting us up to handle interrupts from the timer
hardware.
Loading state onto hardware registers and enabling hardware to signal
interrupts can be problematic when we're not actually about to run the
VCPU, because it makes it difficult to establish the right context when
handling interrupts from the timer, and it makes the register access
code difficult to reason about.
Luckily, now that we call vcpu_load in each ioctl implementation, we can
simply remove the call from the non-KVM_RUN vcpu ioctls, and our
kvm_arch_vcpu_load() is only used for loading vcpu content to the
physical CPU when we're actually going to run the vcpu.
Reviewed-by: Julien Grall <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
This adds code to the radix hypervisor page fault handler to handle the
case where the guest memory is backed by 1GB hugepages, and put them
into the partition-scoped radix tree at the PUD level. The code is
essentially analogous to the code for 2MB pages. This also rearranges
kvmppc_create_pte() to make it easier to follow.
Signed-off-by: Paul Mackerras <[email protected]>
|
|
When using the radix MMU, we can get hypervisor page fault interrupts
with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an
attempt to set the R (reference) or C (change) bit in a PTE atomically
failed. Previously we would find the corresponding Linux PTE and
check the permission and dirty bits there, but this is not really
necessary since we only need to do what the hardware was trying to
do, namely set R or C atomically. This removes the code that reads
the Linux PTE and just updates the partition-scoped PTE, having first
checked that it is still present, and if the access is a write, that
the PTE still has write permission.
Furthermore, we now check whether any other relevant bits are set
in DSISR, and if there are, then we proceed with the rest of the
function in order to handle whatever condition they represent,
instead of returning to the guest as we did previously.
Signed-off-by: Paul Mackerras <[email protected]>
|
|
This improves the handling of transparent huge pages in the radix
hypervisor page fault handler. Previously, if a small page was faulted
in to a 2MB region of guest physical space, there would be a page
table pointer at the PMD level, which could never be replaced by a
leaf (2MB) PMD entry. This adds the code to clear the PMD, invalidate
the page walk cache and free the page table page in this situation,
so that the leaf PMD entry can be created.
This also adds code to check whether a PMD or PTE being inserted is
the same as is already there (because of a race with another CPU that
faulted on the same page) and if so, we don't replace the existing
entry, meaning that we neither invalidate the PTE or PMD nor do a TLB
invalidation.
Signed-off-by: Paul Mackerras <[email protected]>
|
|
Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback. This removes the PPC implementations of
kvm_unmap_hva().
Signed-off-by: Paul Mackerras <[email protected]>
|
|
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined and as we're past 'swapgs' MSR_GS_BASE
should contain kernel's GS base which we point to irq_stack_union.
Add a new kernelmode_gs_base() API. irq_stack_union needs to be exported
as KVM can be built as a module.
Acked-by: Andy Lutomirski <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined. Read MSR_{FS,KERNEL_GS}_BASE from
current->thread after calling save_fsgs() which takes care of
X86_BUG_NULL_SEG case now and will do RD{FS,GS}BASE when FSGSBASE
extensions are exposed to userspace (currently they are not).
Acked-by: Andy Lutomirski <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Allow disabling pause loop exit/pause filtering on a per-VM basis.
If some VMs have dedicated host CPUs, they won't be negatively affected
due to needlessly intercepted PAUSE instructions.
Thanks to Jan H. Schönherr's initial patch.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Jan H. Schönherr <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
This patch adds the per-VM capability to disable them.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Jan H. Schönherr <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Allowing a guest to execute MWAIT without interception enables a guest
to put a (physical) CPU into a power saving state from which it takes
longer to return than the host may desire.
Don't give a guest that power over a host by default. (Especially,
since nothing prevents a guest from using MWAIT even when it is not
advertised via CPUID.)
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Jan H. Schönherr <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: fixes and features
- more kvm stat counters
- virtio gpu plumbing. The 3 non-KVM/s390 patches have Acks from
Bartlomiej Zolnierkiewicz, Heiko Carstens and Greg Kroah-Hartman
but all belong together to make virtio-gpu work as a tty. So
I carried them in the KVM/s390 tree.
- document some KVM_CAPs
- cpu-model only facilities
- cleanups
|
|
VMware exposes the following Pseudo PMCs:
0x10000: Physical host TSC
0x10001: Elapsed real time in ns
0x10002: Elapsed apparent time in ns
For more info refer to:
https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
VMware allows access to these Pseudo-PMCs even when read via RDPMC
in Ring3 and CR4.PCE=0. Therefore, this commit modifies the x86 emulator
to allow access to these PMCs in this situation. In addition,
emulation of these PMCs was added to kvm_pmu_rdpmc().
Signed-off-by: Arbel Moshe <[email protected]>
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
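The permission rule described in this commit can be sketched as below.
The macro and function names are hypothetical, and the check is
deliberately simplified: it models only the CPL and CR4.PCE conditions
mentioned above and ignores everything else a real RDPMC check would
consider (such as counter index validity). The three index values are
the ones listed in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pseudo-PMC indices as listed in the commit message. */
#define SKETCH_PMC_HOST_TSC      0x10000u /* physical host TSC */
#define SKETCH_PMC_REAL_TIME     0x10001u /* elapsed real time in ns */
#define SKETCH_PMC_APPARENT_TIME 0x10002u /* elapsed apparent time in ns */

static bool sketch_is_vmware_pseudo_pmc(uint32_t idx)
{
	switch (idx) {
	case SKETCH_PMC_HOST_TSC:
	case SKETCH_PMC_REAL_TIME:
	case SKETCH_PMC_APPARENT_TIME:
		return true;
	default:
		return false;
	}
}

/*
 * Simplified RDPMC permission check (assumption: only CPL and CR4.PCE
 * matter). Pseudo-PMCs are exempt from the usual restriction that
 * RDPMC outside ring 0 requires CR4.PCE to be set.
 */
static bool sketch_rdpmc_allowed(uint32_t idx, unsigned int cpl, bool cr4_pce)
{
	if (sketch_is_vmware_pseudo_pmc(idx))
		return true;
	return cpl == 0 || cr4_pce;
}
```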
|
|
If the KVM enable_vmware_backdoor module parameter is set,
this commit changes VMX to intercept #GP instead of having it
delivered directly from the CPU to the guest.
It is done to support access to VMware Backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.
Note that the above change introduces a slight performance hit, as #GPs
are no longer delivered directly from the CPU to the guest but instead
cause a #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If the KVM enable_vmware_backdoor module parameter is set,
this commit changes VMX to intercept #GP instead of having it
delivered directly from the CPU to the guest.
It is done to support access to VMware backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.
Note that the above change introduces a slight performance hit, as #GPs
are no longer delivered directly from the CPU to the guest but instead
cause a #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS
instructions. These ports must be allowed access even if the TSS I/O
permission bitmap doesn't allow it.
To handle this, VMX/SVM will be changed in future commits
to intercept the #GP raised by such an access and
handle it by calling the x86 emulator to emulate the instruction.
If it was one of these instructions, the x86 emulator already handles
it correctly (Since commit "KVM: x86: Always allow access to VMware
backdoor I/O ports") by not checking these ports against TSS I/O
permission bitmap.
One may wonder why checking for specific instructions is necessary
as we can just forward all #GPs to the x86 emulator.
There are multiple reasons for doing so:
1. We don't want the x86 emulator to be reached easily
by guest by just executing an instruction that raises #GP as that
exposes the x86 emulator as a bigger attack surface.
2. The x86 emulator is incomplete and therefore certain instructions
that can cause #GP cannot be emulated. Such an example is "INT x"
(opcode 0xcd) which reaches emulate_int() which can only emulate
the instruction if vCPU is in real-mode.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
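The instruction filter motivated above can be sketched as a simple
opcode test. The function name is hypothetical, and the sketch only
looks at a single opcode byte that has already been decoded (prefixes
are assumed to have been skipped). The opcode values are the standard
x86 one-byte encodings of IN, OUT, INS and OUTS.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true only for the port I/O instructions that may target the
 * VMware backdoor ports. Anything else that raised #GP (for example
 * "INT x", opcode 0xCD) is rejected and never reaches the emulator.
 */
static bool sketch_is_port_io_opcode(uint8_t opcode)
{
	switch (opcode) {
	case 0xE4: case 0xE5: /* IN  AL/eAX, imm8 */
	case 0xEC: case 0xED: /* IN  AL/eAX, DX */
	case 0xE6: case 0xE7: /* OUT imm8, AL/eAX */
	case 0xEE: case 0xEF: /* OUT DX, AL/eAX */
	case 0x6C: case 0x6D: /* INSB, INSW/INSD */
	case 0x6E: case 0x6F: /* OUTSB, OUTSW/OUTSD */
		return true;
	default:
		return false;
	}
}
```

An allowlist of this kind addresses both concerns listed in the commit
message: it keeps the emulator's exposure small, and it avoids handing
the emulator instructions it cannot emulate.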
|
|
The next commits are going to introduce support for accessing VMware
backdoor ports even though the guest's TSS I/O permission bitmap doesn't
allow access. This mimics VMware hypervisor behavior.
In order to support this, next commits will change VMX/SVM to
intercept #GP which was raised by such access and handle it by calling
the x86 emulator to emulate the instruction. Since commit "KVM: x86:
Always allow access to VMware backdoor I/O ports", the x86 emulator
handles access to these I/O ports by not checking these ports against
the TSS I/O permission bitmap.
However, there could be cases where the CPU raises a #GP on an
instruction that fails to be disassembled by the x86 emulator (because
of an incomplete implementation, for example).
In those cases, we would like the #GP intercept to just forward #GP
as-is to guest as if there was no intercept to begin with.
However, current emulator code always queues #UD exception in case
emulator fails (including disassembly failures) which is not what is
wanted in this flow.
This commit addresses this issue by adding a new emulation_type flag
that will allow the #GP intercept handler to specify that it wishes
to be aware when instruction emulation fails and doesn't want #UD
exception to be queued.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
VMware allows access to these ports even if denied
by the TSS I/O permission bitmap. Mimic this behavior.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Supporting access to the VMware backdoor requires KVM to intercept #GP
exceptions from the guest, which introduces a slight performance hit.
Therefore, control this support by a module parameter.
Note that module parameter is exported as it should be consumed by
kvm_intel & kvm_amd to determine if they should intercept #GP or not.
This commit doesn't change semantics.
It is done as a preparation for future commits.
Signed-off-by: Liran Alon <[email protected]>
Reviewed-by: Nikita Leshenko <[email protected]>
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM.
Unexport kvm_fast_pio_in() and kvm_fast_pio_out().
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Fast emulation of processor I/O for IN was disabled on x86 (both VMX
and SVM) some years ago due to a buggy implementation. The addition
of kvm_fast_pio_in(), used by SVM, re-introduced (functional!) fast
emulation of IN. Piggyback SVM's work and use kvm_fast_pio_in() on
VMX instead of performing full emulation of IN.
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Fail a nested VMEntry with EXIT_REASON_INVALID_STATE if L2 guest state
is invalid, i.e. vmcs12 contained invalid guest state, and unrestricted
guest is disabled in L0 (and by extension disabled in L1).
WARN_ON_ONCE in handle_invalid_guest_state() if we're attempting to
emulate L2, i.e. nested_run_pending is true, to aid debug in the
(hopefully unlikely) scenario that we somehow skip the nested VMEntry
consistency check, e.g. due to a L0 bug.
Note: KVM relies on hardware to detect the scenario where unrestricted
guest is enabled in L0 but disabled in L1 and vmcs12 contains invalid
guest state, i.e. checking emulation_required in prepare_vmcs02 is
required only to handle the case where unrestricted guest is disabled
in L0 since L0 never actually attempts VMLAUNCH/VMRESUME with vmcs02.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
CR3 load/store exiting is always off when unrestricted guest
is enabled. WARN on the associated CR3 VMEXIT to detect code
that would re-introduce CR3 load/store exiting for unrestricted
guest.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Now that CR3 is not forced to a host-controlled value when paging is
disabled in an unrestricted guest, CR3 load/store exiting can be
left disabled (for an unrestricted guest). And because CR0.WP
and CR4.PAE/PSE are also not forced to host-controlled values,
all of ept_update_paging_mode_cr0() is no longer needed, i.e.
skip ept_update_paging_mode_cr0() for an unrestricted guest.
Because MOV CR3 no longer exits when paging is disabled for an
unrestricted guest, vmx_decache_cr3() must always read GUEST_CR3
from the VMCS for an unrestricted guest.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
CR4.PAE - Unrestricted guest can only be enabled when EPT is
enabled, and vmx_set_cr4() clears hardware CR4.PAE based on
the guest's CR4.PAE, i.e. CR4.PAE always follows the guest's
value when unrestricted guest is enabled.
CR4.PSE - Unrestricted guest no longer uses the identity mapped
IA32 page tables since CR0.PG can be cleared in hardware, thus
there is no need to set CR4.PSE when paging is disabled in the
guest (and EPT is enabled).
Define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST (to X86_CR4_VMXE)
and use it in lieu of KVM_*MODE_VM_CR4_ALWAYS_ON when unrestricted
guest is enabled, which removes the forcing of CR4.PAE.
Skip the manipulation of CR4.PAE/PSE for EPT when unrestricted
guest is enabled, as CR4.PAE isn't forced and so doesn't need to
be manually cleared, and CR4.PSE does not need to be set when
paging is disabled since the identity mapped IA32 page tables
are not used.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
Unrestricted guest can only be enabled when EPT is enabled, and
when EPT is enabled, ept_update_paging_mode_cr0() will clear
hardware CR0.WP based on the guest's CR0.WP, i.e. CR0.WP always
follows the guest's value when unrestricted guest is enabled.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|
|
An unrestricted guest can run with hardware CR0.PG==0, i.e.
IA32 paging disabled, in which case there is no need to load
the guest's CR3 with identity mapped IA32 page tables since
hardware will effectively ignore CR3. If unrestricted guest
is enabled, don't configure the identity mapped IA32 page
table and always load the guest's desired CR3.
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Radim Krčmář <[email protected]>
|