Age | Commit message | Author | Files | Lines |
|
Document it in Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Document fast page fault in Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Document it in Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Document write_flooding_count in Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Document it in Documentation/virtual/kvm/mmu.txt
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Drop kvm_mmu_zap_mmio_sptes and use kvm_mmu_invalidate_zap_all_pages
instead to handle mmio generation number overflow
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Then it has the chance to trigger mmio generation number wrap-around
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
[Change from MMIO_MAX_GEN - 13 to MMIO_MAX_GEN - 150, because 13 is
very close to the number of calls to KVM_SET_USER_MEMORY_REGION
before the guest is started and before there is any chance to create
any spte. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It is useful for debugging mmio spte invalidation
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This patch introduces a very simple and scalable way to invalidate
all mmio sptes - it need not walk any shadow pages or hold mmu-lock.
KVM maintains a global mmio valid generation-number which is stored in
kvm->memslots.generation, and every mmio spte stores the current global
generation-number into its available bits when it is created.
When KVM needs to zap all mmio sptes, it simply increases the global
generation-number. When a guest does an mmio access, KVM intercepts the
MMIO #PF, walks the shadow page table and gets the mmio spte. If the
generation-number on the spte does not equal the global generation-number,
the fault goes to the normal #PF handler to update the mmio spte.
Since 19 bits are used to store the generation-number in the mmio spte,
we zap all mmio sptes when the number wraps around.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
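A minimal sketch of the generation scheme described in the commit above; the struct, field and function names here are illustrative assumptions, not the kernel's code, and the spte bit encoding is elided.

```c
#include <stdbool.h>

#define MMIO_GEN_BITS 19
#define MMIO_GEN_MASK ((1u << MMIO_GEN_BITS) - 1)

struct kvm_sketch {
	unsigned int memslots_generation;	/* global mmio generation */
};

/* Generation tag that a newly created mmio spte would carry. */
static unsigned int current_mmio_gen(const struct kvm_sketch *kvm)
{
	return kvm->memslots_generation & MMIO_GEN_MASK;
}

/* "Zapping" every mmio spte is just one increment of the global number. */
static void invalidate_all_mmio_sptes(struct kvm_sketch *kvm)
{
	kvm->memslots_generation++;
}

/* On an MMIO #PF: a mismatching generation means the spte is obsolete
 * and the normal #PF handler must rebuild it. */
static bool mmio_spte_is_stale(const struct kvm_sketch *kvm, unsigned int spte_gen)
{
	return spte_gen != current_mmio_gen(kvm);
}
```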
|
|
Define some meaningful names instead of raw code
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Store the generation-number in bit3 ~ bit11 and bit52 ~ bit61; in total
19 bits can be used, which should be enough for nearly all common cases.
In this patch the generation-number is always 0; it will be changed in
a later patch.
[Gleb: masking generation bits from spte in get_mmio_spte_gfn() and
get_mmio_spte_access()]
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
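For illustration, a sketch of how the 19-bit value described above can be split across the two spte bit ranges (9 bits in bits 3-11, 10 bits in bits 52-61); the macro and function names are assumptions, not the kernel's helpers.

```c
#include <stdint.h>

#define GEN_LOW_SHIFT   3
#define GEN_LOW_BITS    9
#define GEN_HIGH_SHIFT  52
#define GEN_HIGH_BITS   10
#define GEN_LOW_MASK    ((1ull << GEN_LOW_BITS) - 1)
#define GEN_HIGH_MASK   ((1ull << GEN_HIGH_BITS) - 1)

/* Scatter a 19-bit generation into spte bits 3-11 and 52-61. */
static uint64_t generation_to_spte_bits(uint64_t gen)
{
	return ((gen & GEN_LOW_MASK) << GEN_LOW_SHIFT) |
	       (((gen >> GEN_LOW_BITS) & GEN_HIGH_MASK) << GEN_HIGH_SHIFT);
}

/* Recover the generation from those two bit ranges. */
static uint64_t spte_bits_to_generation(uint64_t spte)
{
	return ((spte >> GEN_LOW_SHIFT) & GEN_LOW_MASK) |
	       (((spte >> GEN_HIGH_SHIFT) & GEN_HIGH_MASK) << GEN_LOW_BITS);
}
```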
|
|
Let the mmio spte use only bit62 and bit63 of the upper 32 bits, so that
bit 52 ~ bit 61 can be used for other purposes.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
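A tiny sketch of the layout change described above: keep the mmio marker in bits 62-63 so that bits 52-61 become free for the generation number added by the later patches. The macro names are illustrative only.

```c
#define MMIO_SPTE_MARK_MASK  (3ULL << 62)      /* mmio marker: bits 62-63 */
#define MMIO_SPTE_FREE_MASK  (0x3FFULL << 52)  /* freed for other use: bits 52-61 */
```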
|
|
Added some missing validity checks for the operands and fixed the
priority of exceptions for some function codes according to the
"Principles of Operation" document.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
LCTL and LCTLG are also privileged instructions, thus there is no need for
treating them separately from the other instructions in priv.c. So this
patch moves these two instructions to priv.c, adds a check for supervisor
state and simplifies the "handle_eb" instruction decoding by merging the
two eb_handlers jump tables from intercept.c and priv.c into one table only.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When a guest calls the TPI instruction, the second operand address could
point to an invalid location. In this case the problem should be signaled
to the guest by throwing an access exception.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
DIAGNOSE is a privileged instruction and thus we must make sure that we are
in supervisor mode before taking any other actions.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
We need more fine-grained control about the point in time when we check
for privileged instructions, since the exceptions that can happen during
an instruction have a well-defined priority. For example, for the PFMF
instruction, the check for PGM_PRIVILEGED_OP must happen after the check
for PGM_OPERATION since the latter has a higher precedence - thus the
check for privileged operation must not be done in kvm_s390_handle_b9()
already.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
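A sketch of the ordering requirement described above; the enum values and helper are illustrative, not the s390 KVM handlers. The point is simply that the operation exception is checked before the privileged-operation exception because it has the higher priority.

```c
enum pgm_sketch { PGM_NONE = 0, PGM_OPERATION_EX, PGM_PRIVILEGED_OP_EX };

static enum pgm_sketch pfmf_like_checks(int facility_installed, int problem_state)
{
	if (!facility_installed)
		return PGM_OPERATION_EX;	/* checked first: higher precedence */
	if (problem_state)
		return PGM_PRIVILEGED_OP_EX;	/* only after the operation check */
	return PGM_NONE;			/* go on to handle the instruction */
}
```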
|
|
TPROT is a privileged instruction and thus should generate a privileged
operation exception when the problem state bit is not cleared in the PSW.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Renamed the PGM_PRIVILEGED_OPERATION define to PGM_PRIVILEGED_OP, since this
define was much longer than the other PGM_* defines and often caused the code
to exceed the 80-column limit when not split across multiple lines.
Signed-off-by: Thomas Huth <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Update the document to match the current reverse mapping of
parent_pte
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Signed-off-by: Alexey Kardashevskiy <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The handle_epsw() function calculated the first register in the wrong way,
so that it always used r0 by mistake. Now the code uses the common helper
function for decoding the registers of RRE instructions instead, to avoid such
mistakes.
Signed-off-by: Thomas Huth <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
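A sketch of the kind of bug described above. For an RRE-format instruction the R1/R2 fields live in the SIE block's ipb word; the masks and shift amounts below reflect that layout as assumed here, not a copy of the kernel helper.

```c
#include <stdint.h>

static void decode_rre_regs(uint32_t ipb, int *r1, int *r2)
{
	*r1 = (ipb & 0x00f00000) >> 20;	/* correct: yields 0..15 */
	*r2 = (ipb & 0x000f0000) >> 16;
}

/*
 * A broken variant that shifts the R1 field by 24 instead of 20 makes
 * (ipb & 0x00f00000) >> 24 always evaluate to 0, so r0 is used
 * regardless of the register actually encoded in the instruction.
 */
```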
|
|
This patch is based on an original patch of David Hildenbrand.
The perf core implementation calls architecture-specific code in order
to ask for specific information for a particular sample:
perf_instruction_pointer()
When perf core code asks for the instruction pointer, architecture-specific
code must detect whether a KVM guest was running when the sample
was taken. A sample can be associated with a KVM guest when the PSW
supervisor state bit is set and the PSW instruction pointer part
contains the address of 'sie_exit'.
A KVM guest's instruction pointer information is then retrieved via the
gpsw entry pointed to by the SIE control block.
perf_misc_flags()
Perf core code calls this function in order to associate the kernel
vs. user state information with a particular sample. Architecture-specific
code must also first detect whether a KVM guest was running
at the time the sample was taken.
Signed-off-by: Heinz Graalfs <[email protected]>
Reviewed-by: Hendrik Brueckner <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
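A minimal sketch of the flow described above; the struct and function names are placeholders, not the s390 arch implementation. It shows only the decision: if the sample looks like a SIE exit, report the guest PSW address instead of the host one.

```c
struct sample_regs {
	unsigned long psw_addr;
	int           supervisor_state;
};

/* Guest sample: supervisor state set and IP pointing at sie_exit. */
static int sample_in_kvm_guest(const struct sample_regs *regs,
			       unsigned long sie_exit_addr)
{
	return regs->supervisor_state && regs->psw_addr == sie_exit_addr;
}

/* Return the guest gpsw address for guest samples, the host IP otherwise. */
static unsigned long sample_instruction_pointer(const struct sample_regs *regs,
						unsigned long sie_exit_addr,
						unsigned long guest_gpsw_addr)
{
	return sample_in_kvm_guest(regs, sie_exit_addr) ? guest_gpsw_addr
							: regs->psw_addr;
}
```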
|
|
Let's use the common waitqueue for kvm cpus on s390. By itself it is
just a cleanup, but it should also improve the accuracy of diag 0x44
which is implemented via kvm_vcpu_on_spin. kvm_vcpu_on_spin has
an explicit check for waiting on the waitqueue to optimize the
yielding.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Clean up arch-specific code to use the vcpu slab cache provided by common
code instead of kzalloc()-provided memory.
Signed-off-by: Michael Mueller <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This patch enables kvm to give large pages to the guest. The heavy
lifting is done by the hardware, the host only has to take care
of the PFMF instruction, which is also part of EDAT-1.
We also support the non-quiescing key setting facility if the host
supports it, to behave similarly to the interpretation of sske.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
From time to time we need to set the guest storage key. Let's
provide a helper function that handles the changes with all the
right locking and checking.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It's possible that idivl overflows (due to a large delta stored in usdiff,
a valid scenario).
Create an exception handler to catch the overflow exception (division by zero
is protected by the vcpu->arch.virtual_tsc_khz check), and interpret it
accordingly (the delta is larger than USEC_PER_SEC).
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=969644
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
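For illustration only, a portable sketch of the failure mode, not the kernel fix (which catches the #DE from idivl via an exception-table fixup): a 64/32-bit divide faults when the quotient does not fit in 32 bits, and the overflow case is interpreted as "delta larger than USEC_PER_SEC". The function name and bound check are assumptions.

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000L

/* Assumes usdiff * 1000 still fits in 64 bits, as in the scenario above. */
static int64_t scaled_delta_or_cap(int64_t usdiff, uint32_t virtual_tsc_khz)
{
	int64_t q = (usdiff * 1000) / (int64_t)virtual_tsc_khz;

	/* If the 32-bit quotient would overflow, treat the delta as
	 * "too large" instead of letting the narrow divide fault. */
	return (q > INT32_MAX || q < INT32_MIN) ? USEC_PER_SEC : q;
}
```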
|
|
Quote Gleb's mail:
| why don't we check for sp->role.invalid in
| kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()?
and
| Actually we can add check for is_obsolete_sp() there too since
| kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
| after incrementing mmu_valid_gen.
[ Xiao: add some comments and the check of is_obsolete_sp() ]
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
As Marcelo pointed out:
| "(retention of large number of pages while zapping)
| can be fatal, it can lead to OOM and host crash"
We introduce a list, kvm->arch.zapped_obsolete_pages, to link all
the pages which are deleted from the mmu cache but not actually
freed. When page reclaiming is needed, we always zap this kind of
pages first.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
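A plain-C sketch of the bookkeeping described above, with illustrative names: pages removed from the mmu cache but not yet freed sit on a separate list, and reclaim drains that list before touching live pages.

```c
struct sp_sketch {
	struct sp_sketch *next;
};

struct arch_sketch {
	struct sp_sketch *active_pages;          /* regular shadow pages */
	struct sp_sketch *zapped_obsolete_pages; /* deleted, not yet freed */
};

/* Reclaim path: prefer pages that are already unlinked from the cache. */
static struct sp_sketch *pick_page_to_free(struct arch_sketch *arch)
{
	if (arch->zapped_obsolete_pages)
		return arch->zapped_obsolete_pages;
	return arch->active_pages;	/* fall back to zapping live pages */
}
```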
|
|
kvm_zap_obsolete_pages uses a lock-break technique to zap pages,
and it flushes the tlb every time it does a lock-break.
We can reload the mmu on all vcpus after updating the generation
number so that the obsolete pages are not used on any vcpu;
after that we do not need to flush the tlb when obsolete pages
are zapped.
It will call kvm_mmu_prepare_zap_page many times and use one
kvm_mmu_commit_zap_page to collapse the tlb flushes; the side effect
is that obsolete pages are unlinked from the active list but left
on the hash list, so we add a comment around the hash-list walker.
Note: kvm_mmu_commit_zap_page is still needed before freeing
the pages, since other vcpus may be doing lockless shadow
page walks.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
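An illustrative outline of the pattern described above; the opaque context type and helper names are assumptions, not the kernel functions. Many prepare steps run under the lock, the lock can be broken without flushing (the obsolete pages are already unreachable), and a single commit at the end performs the deferred TLB flush before the pages are freed.

```c
struct zap_ctx;                                    /* opaque in this sketch */
int  next_obsolete_page(struct zap_ctx *c);        /* assumed helpers */
void prepare_zap_page(struct zap_ctx *c, int sp);  /* unlink, defer freeing */
void commit_zap_pages(struct zap_ctx *c);          /* flush TLBs, free pages */
int  need_lock_break(struct zap_ctx *c);
void cond_resched_lock_break(struct zap_ctx *c);

static void zap_obsolete_pages_sketch(struct zap_ctx *c)
{
	int sp;

	while ((sp = next_obsolete_page(c)) >= 0) {
		prepare_zap_page(c, sp);	/* no per-page TLB flush */
		if (need_lock_break(c))
			cond_resched_lock_break(c);
	}
	/* One commit for the whole run: flush TLBs, then free the pages,
	 * since other vcpus may still be walking them locklessly. */
	commit_zap_pages(c);
}
```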
|
|
Zap at least 10 pages before releasing mmu-lock to reduce the overhead
caused by reacquiring the lock.
After the patch, kvm_zap_obsolete_pages can always make forward progress,
so the comments are updated accordingly.
[ It improves by 0.6% ~ 1% the case of doing a kernel build while reading
the PCI ROM. ]
Note: I am not sure that "10" is the best value; I just guessed that
'10' keeps a vcpu from spending too long in kvm_zap_obsolete_pages
without starving mmu-lock.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
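A sketch of the batching described above, reusing the illustrative helper declarations from the previous sketch: only consider breaking the lock once at least a batch of pages has been prepared. The value 10 is the commit's guess, not a tuned constant.

```c
struct zap_ctx;                                    /* opaque in this sketch */
int  next_obsolete_page(struct zap_ctx *c);        /* assumed helpers */
void prepare_zap_page(struct zap_ctx *c, int sp);
void commit_zap_pages(struct zap_ctx *c);
int  need_lock_break(struct zap_ctx *c);
void cond_resched_lock_break(struct zap_ctx *c);

#define ZAP_BATCH 10

static void zap_with_batching_sketch(struct zap_ctx *c)
{
	int sp, batch = 0;

	while ((sp = next_obsolete_page(c)) >= 0) {
		if (batch >= ZAP_BATCH && need_lock_break(c)) {
			cond_resched_lock_break(c);
			batch = 0;		/* start a new batch */
		}
		prepare_zap_page(c, sp);
		batch++;
	}
	commit_zap_pages(c);	/* single flush, as before */
}
```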
|
|
The obsolete page will be zapped soon, so do not reuse it;
this reduces future page faults.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
It is useful for debugging and development
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Show sp->mmu_valid_gen
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Replace kvm_mmu_zap_all by kvm_mmu_invalidate_zap_all_pages
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
The current kvm_mmu_zap_all is really slow - it holds mmu-lock to
walk and zap all shadow pages one by one, and it also needs to zap all
the guest pages' rmaps and all shadow pages' parent spte lists. Things
become particularly bad if the guest uses more memory or vcpus; it does
not scale.
In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu valid generation-number which is stored in
kvm->arch.mmu_valid_gen, and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created.
When KVM needs to zap all shadow pages' sptes, it simply increases the
global generation-number and then reloads the root shadow pages on all vcpus.
Each vcpu will create a new shadow page table according to the current
generation-number. This ensures the old pages are not used any more.
The obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are then zapped using a lock-break technique.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
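A minimal sketch of the shadow-page generation scheme described above; the struct and function names are illustrative, not the kernel's.

```c
#include <stdbool.h>

struct kvm_arch_sketch {
	unsigned long mmu_valid_gen;	/* global shadow-page generation */
};

struct shadow_page_sketch {
	unsigned long mmu_valid_gen;	/* copied from the global value at creation */
};

/* Invalidate every shadow page: bump the global number; forcing each
 * vcpu to reload its root (not shown) makes new tables use the new value. */
static void invalidate_all_shadow_pages(struct kvm_arch_sketch *arch)
{
	arch->mmu_valid_gen++;
}

/* Pages created before the bump are obsolete and can be zapped lazily
 * with the lock-break technique. */
static bool sp_is_obsolete(const struct kvm_arch_sketch *arch,
			   const struct shadow_page_sketch *sp)
{
	return sp->mmu_valid_gen != arch->mmu_valid_gen;
}
```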
|
|
Keeping the mmu and tlbs consistent is the responsibility of
kvm_mmu_zap_all. It is also unnecessary after zapping all mmio
sptes, since no mmio spte exists on a root shadow page and so it
cannot be cached in the tlb.
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Quote Gleb's mail:
| Back then kvm->lock protected memslot access so code like:
|
| mutex_lock(&vcpu->kvm->lock);
| kvm_mmu_zap_all(vcpu->kvm);
| mutex_unlock(&vcpu->kvm->lock);
|
| which is what 7aa81cc0 does was enough to guaranty that no vcpu will
| run while code is patched. This is no longer the case and
| mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
| so now kvm_mmu_zap_all() there is useless and the code is incorrect.
So we drop it and it will be fixed later
Signed-off-by: Xiao Guangrong <[email protected]>
Reviewed-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
We can easily reach the 1000 limit by starting a VM with a couple
hundred I/O devices (multifunction=on). The hard-coded limit has
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).
In userspace, we already have the maximum file descriptor limit to
bound the ioeventfd count. But kvm_io_bus devices are also used
for the pit, pic, ioapic and coalesced_mmio; those cannot be limited
by the maximum file descriptor.
Currently only ioeventfds take up too many kvm_io_bus devices,
so just exclude them from the kvm_io_range limit counting.
Also fixed one indentation issue in kvm_host.h.
Signed-off-by: Amos Kong <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
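A sketch of the accounting change described above, with illustrative names: the bus still tracks how many ranges it holds, but only non-ioeventfd devices count against the fixed limit, since ioeventfds are already bounded by the file-descriptor limit in userspace.

```c
#define IO_BUS_HARD_LIMIT 1000

struct io_bus_sketch {
	int dev_count;           /* all registered ranges */
	int ioeventfd_count;     /* subset backed by ioeventfds */
};

/* Registration check: only non-ioeventfd devices count toward the limit. */
static int can_register_dev(const struct io_bus_sketch *bus)
{
	return (bus->dev_count - bus->ioeventfd_count) < IO_BUS_HARD_LIMIT;
}
```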
|
|
Providing a "devname:kvm" module alias enables automatic loading of
the kvm module when /dev/kvm is opened.
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
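For reference, a device-name alias of this kind is declared with the MODULE_ALIAS macro; a minimal sketch, with the surrounding misc-device registration for /dev/kvm omitted.

```c
#include <linux/module.h>

/* "devname:" alias: lets the module loader resolve the kvm module when
 * /dev/kvm is first opened (device node creation details omitted). */
MODULE_ALIAS("devname:kvm");
```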
|
|
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Since DIV and IDIV can generate exceptions, we need an additional output
parameter indicating whether an exception has occurred. To avoid increasing
register pressure on i386, we use %rsi, which is already allocated for
the fastop code pointer.
Gleb: added comment about fop usage as exception indication.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
This makes OpAccHi useful.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
Single-operand MUL and DIV access an extended accumulator: AX for byte
instructions, and DX:AX, EDX:EAX, or RDX:RAX for larger-sized instructions.
Add support for fetching the extended accumulator.
In order not to change things too much, RDX is loaded into Src2, which is
already loaded by fastop(). This avoids increasing register pressure on
i386.
Gleb: disable src writeback for ByteOp div/mul.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
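As background, the implicit accumulator pairs (AX, DX:AX, EDX:EAX, RDX:RAX) are an architectural fact of x86 single-operand MUL/DIV. A small illustrative sketch, not the emulator's decode code, showing how the 16/32-bit dividend is assembled from the pair (the 64-bit RDX:RAX case would need a 128-bit type):

```c
#include <stdint.h>

/* Byte-sized DIV uses AX alone; 2- and 4-byte DIV use DX:AX / EDX:EAX. */
static uint64_t combined_dividend(uint64_t rax, uint64_t rdx, int op_bytes)
{
	uint64_t mask = (op_bytes == 2) ? 0xffffULL : 0xffffffffULL;

	return ((rdx & mask) << (op_bytes * 8)) | (rax & mask);
}
```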
|
|
Some instructions write back the source operand, not just the destination.
Add support for doing this via the decode flags.
Gleb: add BUG_ON() to prevent the source from being a memory operand.
Signed-off-by: Avi Kivity <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
|
|
On heavy paging load some guest cpus started to loop in gmap_ipte_notify.
This was visible as stalled cpus inside the guest. The gmap_ipte_notifier
tries to map a user page and then makes sure that the pte is valid and
writable. It turns out that with software change-bit tracking the pte
can become read-only (and only software-writable) if the page is clean.
Since we loop in this code, the page would stay clean and, therefore,
never become writable again.
Let us just use fixup_user_fault, which guarantees a call to handle_mm_fault.
Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
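A rough plain-C sketch of the looping pattern described above; the helpers are hypothetical placeholders and the real fixup_user_fault signature is deliberately not reproduced here.

```c
int pte_is_writable(unsigned long addr);	/* assumed query */
int force_write_fault(unsigned long addr);	/* assumed: drives handle_mm_fault */

/* Broken pattern: nothing ever marks the clean page writable, so this
 * can spin forever under the conditions described in the commit. */
static int make_writable_broken(unsigned long addr)
{
	while (!pte_is_writable(addr))
		;
	return 0;
}

/* Fixed pattern: force a write fault so the mm layer resolves the pte,
 * which is what calling fixup_user_fault achieves in the kernel. */
static int make_writable_fixed(unsigned long addr)
{
	while (!pte_is_writable(addr)) {
		if (force_write_fault(addr))
			return -1;	/* access error reported to the caller */
	}
	return 0;
}
```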
|