Age | Commit message (Collapse) | Author | Files | Lines |
|
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.
Signed-off-by: Gleb Natapov <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
* queue:
KVM: MMU: Eliminate pointless temporary 'ac'
KVM: MMU: Avoid access/dirty update loop if all is well
KVM: MMU: Eliminate eperm temporary
KVM: MMU: Optimize is_last_gpte()
KVM: MMU: Simplify walk_addr_generic() loop
KVM: MMU: Optimize pte permission checks
KVM: MMU: Update accessed and dirty bits after guest pagetable walk
KVM: MMU: Move gpte_access() out of paging_tmpl.h
KVM: MMU: Optimize gpte_access() slightly
KVM: MMU: Push clean gpte write protection out of gpte_access()
KVM: clarify kvmclock documentation
KVM: make processes waiting on vcpu mutex killable
KVM: SVM: Make use of asm.h
KVM: VMX: Make use of asm.h
KVM: VMX: Make lto-friendly
Signed-off-by: Avi Kivity <[email protected]>
|
|
'ac' essentially reconstructs the 'access' variable we already
have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK. As
these are not used by callees, just use 'access' directly.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Keep track of accessed/dirty bits; if they are all set, do not
enter the accessed/dirty update loop.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
'eperm' is no longer used in the walker loop, so we can eliminate it.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
The page table walk is coded as an infinite loop, with a special
case on the last pte.
Code it as an ordinary loop with a termination condition on the last
pte (large page or walk length exhausted), and put the last pte handling
code after the loop where it belongs.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.
Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).
The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.
The result is short, branch-free code.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
While unspecified, the behaviour of Intel processors is to first
perform the page table walk, then, if the walk was successful, to
atomically update the accessed and dirty bits of walked paging elements.
While we are not required to follow this exactly, doing so will allow us
to perform the access permissions check after the walk is complete, rather
than after each walk step.
(the tricky case is SMEP: a zero in any pte's U bit makes the referenced
page a supervisor page, so we can't fault on a one bit during the walk
itself).
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.
Rely on zero extension to 64 bits to get the correct nx behaviour.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
If nx is disabled, then is gpte[63] is set we will hit a reserved
bit set fault before checking permissions; so we can ignore the
setting of efer.nxe.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes. This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).
It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
- mention that system time needs to be added to wallclock time
- positive tsc_shift means left shift, not right
- mention additional 32bit right shift
Signed-off-by: Stefan Fritsch <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.
Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.
Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.
Catch callers that don't check return state by adding
__must_check.
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Use macros for bitness-insensitive register names, instead of
rolling our own.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Use macros for bitness-insensitive register names, instead of
rolling our own.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units. Use an external variable instead.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
find_highest_vector() and count_vectors():
- Instead of using magic values, define and use proper macros.
find_highest_vector():
- Remove likely() which is there only for historical reasons and not
doing correct branch predictions anymore. Using such heuristics
to optimize this function is not worth it now. Let CPUs predict
things instead.
- Stop checking word[0] separately. This was only needed for doing
likely() optimization.
- Use for loop, not while, to iterate over the register array to make
the code clearer.
Note that we actually confirmed that the likely() did wrong predictions
by inserting debug code.
Acked-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Takuya Yoshikawa <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Checking the return of kvm_mmu_get_page is unnecessary since it is
guaranteed by memory cache
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
KVM lapic timer and tsc deadline timer based on hrtimer,
setting a leftmost node to rb tree and then do hrtimer reprogram.
If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
do nothing and then make kvm lapic timer and tsc deadline timer fail.
Signed-off-by: Liu, Jinsong <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
interrupt_bitmap is KVM_NR_INTERRUPTS bits in size,
so just use that instead of hard-coded constants
and math.
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Optimize "rep ins" by allowing emulator to write back more than one
datum at a time. Introduce new operand type OP_MEM_STR which tells
writeback() that dst contains pointer to an array that should be written
back as opposite to just one data element.
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Remove unneeded segment argument. Address structure already has correct
segment which was put there during decode.
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Other arches do not need this.
Signed-off-by: Marcelo Tosatti <[email protected]>
v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <[email protected]>
|
|
PPC must flush all translations before the new memory slot
is visible.
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
We never modify direct_access_msrs[], msrpm_ranges[],
svm_exit_handlers[] or x86_intercept_map[] at runtime.
Mark them r/o.
Signed-off-by: Mathias Krause <[email protected]>
Cc: Joerg Roedel <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
We never change those, make them r/o.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
We never change emulate_ops[] at runtime so it should be r/o.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
The opcode tables never change at runtime, therefor mark them const.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
As the the compiler ensures that the memory operand is always aligned
to a 16 byte memory location, use the aligned variant of MOVDQ for
read_sse_reg() and write_sse_reg().
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Some fields can be constified and/or made static to reduce code and data
size.
Numbers for a 32 bit build:
text data bss dec hex filename
before: 3351 80 0 3431 d67 cpuid.o
after: 3391 0 0 3391 d3f cpuid.o
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
kvm_pic_reset() is not used anywhere. Move reset logic from
pic_ioport_write() there.
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Avi Kivity <[email protected]>
|
|
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
emulate_invalid_guest_state=1
emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do. So we need to return the real ones (maintained by vmx_set_segment).
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
We want the segment selector, nor segment number.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Segment limits are verified in real mode, not just protected mode.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
When loading a segment in real mode, only the base and selector must
be modified. The limit needs to be left alone, otherwise big real mode
users will hit a #GP due to limit checking (currently this is suppressed
because we don't check limits in real mode).
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Usually, big real mode uses large (4GB) segments. Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator. This results in significantly
faster execution, and less risk of hitting an unemulated instruction.
If the limit is less than 0xffff, we retain the existing behaviour.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Real mode is always entered from protected mode with dpl=0. Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Real processors don't change segment limits and attributes while in
real mode. Mimic that behaviour.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|
|
Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use. This leads to some simplification.
We replace save_rmode_seg() with a call to vmx_save_segment(). Since this depends
on rmode.vm86_active, we move the call to before setting the flag.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
|