|
In order to make the APIC access page migratable, stop pinning it in
memory.
And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page. When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.
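A minimal sketch of the pattern described above (placement and the exact
call chain are assumptions, not a quote of the patch):

    /* write the current physical address into the VMCS, then unpin */
    struct page *page;

    page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
    vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
    put_page(page);    /* drop the reference immediately; do not pin */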
Suggested-by: Gleb Natapov <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this for the case where there is no nested
virtualization, or where the nested guest does not have an APIC page of
its own. All accesses to kvm->arch.apic_access_page are changed to go
through kvm_vcpu_reload_apic_access_page.
If the APIC access page is invalidated when the host is running, we update
the VMCS in the next guest entry.
If it is invalidated when the guest is running, the MMU notifier will force
an exit, after which we will handle everything as in the previous case.
If it is invalidated when a nested guest is running, the request will update
either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the
next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Currently, the APIC access page is pinned by KVM for the entire life
of the guest. We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.
This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.
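A minimal sketch of the generic plumbing described above, as seen from
the vcpu run loop (the request name is an assumption; the hook is the
reload function named in the previous entry):

    if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
            kvm_vcpu_reload_apic_access_page(vcpu);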
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This will be used to let the guest run while the APIC access page is
not pinned. Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.
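A sketch of the empty x86 stub this adds (signature inferred from the
description, not quoted from the patch):

    /* arch/x86/kvm/x86.c -- deliberately empty; later patches fill it in */
    void kvm_arch_mmu_notifier_invalidate_page(struct kvm *kvm,
                                               unsigned long address)
    {
    }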
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Different architectures need different requests, and in fact we
will use this function in architecture-specific code later. This
will be outside kvm_main.c, so make it non-static and rename it to
kvm_make_all_cpus_request().
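The resulting helper's likely shape (signature inferred, not quoted):

    /* virt/kvm/kvm_main.c */
    bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);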
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_set_pfn_accessed). Simply call
clear_flush_young.
2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched; see the sketch
below.
3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing away the spte. This requires properly
synchronizing with MMU notifier consumers, like every other removal of
spte's does.
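The expanded notifier interface from point 2, roughly (argument names
assumed):

    /* include/linux/mmu_notifier.h */
    int (*clear_flush_young)(struct mmu_notifier *mn,
                             struct mm_struct *mm,
                             unsigned long start,
                             unsigned long end);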
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Callbacks don't have to do extra computation to learn what the caller
(kvm_handle_hva_range()) knows very well. Useful for debugging,
tracing, printk, and future callers.
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
On x86_64, kernel text mappings are mapped read-only with CONFIG_DEBUG_RODATA.
In that case, KVM will fail to patch VMCALL instructions to VMMCALL
as required on AMD processors.
The failure mode is currently a divide-by-zero exception, which obviously
is a KVM bug that has to be fixed. However, picking the right instruction
between VMCALL and VMMCALL will be faster and will help if you cannot upgrade
the hypervisor.
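A plausible shape of the fix, using the alternatives mechanism to pick
the right opcode at patch-in time instead of rewriting kernel text (the
feature-flag name is an assumption):

    /* vmcall is 0f 01 c1; vmmcall is 0f 01 d9 */
    #define KVM_HYPERCALL \
            ALTERNATIVE(".byte 0x0f,0x01,0xc1", ".byte 0x0f,0x01,0xd9", \
                        X86_FEATURE_VMMCALL)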
Reported-by: Chris Webb <[email protected]>
Tested-by: Chris Webb <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Avoid open coded calculations for bank MSRs by using well-defined
macros that hide the index of higher bank MSRs.
No semantic changes.
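The macros in question hide the stride of four MSRs per bank, roughly:

    #define MSR_IA32_MCx_CTL(x)    (MSR_IA32_MC0_CTL + 4*(x))
    #define MSR_IA32_MCx_STATUS(x) (MSR_IA32_MC0_STATUS + 4*(x))
    #define MSR_IA32_MCx_ADDR(x)   (MSR_IA32_MC0_ADDR + 4*(x))
    #define MSR_IA32_MCx_MISC(x)   (MSR_IA32_MC0_MISC + 4*(x))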
Signed-off-by: Chen Yucong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit 346874c9507a ("KVM: x86: Fix CR3 reserved bits") removed non-PAE
reserved bits which were not according to Intel SDM. However, residue was left
in a debug assertion (CR3_NONPAE_RESERVED_BITS). Remove it.
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
vcpu ioctls can hang the calling thread if issued while a vcpu is running.
However, invalid ioctls can happen when userspace tries to probe the kind
of file descriptor (e.g. isatty() calls ioctl(TCGETS)); in that case,
we know the ioctl is going to be rejected as invalid anyway and we can
fail before trying to take the vcpu mutex.
This patch does not change functionality, it just makes invalid ioctls
fail faster.
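A minimal sketch of the fast-fail check at the top of the vcpu ioctl
handler (exact placement assumed):

    if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
            return -EINVAL;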
Cc: [email protected]
Signed-off-by: David Matlack <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
has been swapped out or is behind a filemap, this will trigger async
readahead and return immediately. The rationale is that KVM will kick
back the guest with an "async page fault" and allow for some other
guest process to take over.
If async PFs are enabled the fault is retried asap from an async
workqueue. If not, it's retried immediately in the same code path. In
either case the retry will not relinquish the mmap semaphore and will
block on the IO. This is a bad thing, as other mmap semaphore users
now stall as a function of swap or filemap latency.
This patch ensures both the regular and async PF path re-enter the
fault allowing for the mmap semaphore to be relinquished in the case
of IO wait.
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Acked-by: Andrew Morton <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
s/drity/dirty and s/vmsc01/vmcs01
Signed-off-by: Tiejun Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
A guest that sets the PAT MSR to an invalid value should get a #GP.
Currently, if vmx supports loading the PAT MSR during entry, the value is
not checked. This patch adds the required check for that case.
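For illustration, a sketch of such a check; the helper names are
hypothetical, but the valid PAT memory types (0, 1, 4, 5, 6 and 7 in
each byte) come from the SDM:

    /* hypothetical helpers, not the exact patch */
    static bool pat_entry_valid(unsigned int type)
    {
            return type < 8 && ((1u << type) & 0xf3); /* UC, WC, WT, WP, WB, UC- */
    }

    static bool pat_valid(u64 pat)
    {
            int i;

            for (i = 0; i < 8; i++)
                    if (!pat_entry_valid((pat >> (8 * i)) & 0xff))
                            return false;
            return true;
    }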
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In 64-bit mode a #GP should be delivered to the guest "if the code segment
descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit
set and the D-bit clear." - Intel SDM "Interrupt 13—General Protection
Exception (#GP)".
This patch fixes the behavior of the CS loading emulation code. Although
the comment says that segment loading is not supported in long mode, this
function is executed in long mode, so the fix is necessary.
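A sketch of the check implied by the quoted SDM text (all names are
hypothetical): a code segment loaded through a 64-bit gate must have
L=1 and D=0, otherwise inject #GP:

    /* hypothetical, in the CS-loading path of the emulator */
    if (efer & EFER_LMA) {
            if (!cs_desc.l || cs_desc.d)
                    return emulate_gp(ctxt, selector);
    }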
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().
Signed-off-by: Liang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
- we count KVM_REQ_TLB_FLUSH requests, not actual flushes
(KVM can have multiple requests for one flush)
- flushes from kvm_flush_remote_tlbs aren't counted
- it's easy to make a direct request by mistake
Solve these by postponing the counting to kvm_check_request().
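A sketch of where the counting moves (the helper name is an assumption):

    /* x86 vcpu entry path: count only when the request is consumed */
    if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
            kvm_vcpu_flush_tlb(vcpu);   /* bumps vcpu->stat.tlb_flush */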
Signed-off-by: Radim Krčmář <[email protected]>
Signed-off-by: Liang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Initialization of an L2 guest with -cpu host, on an L1 guest with -cpu
host, triggers:
(qemu) KVM: entry failed, hardware error 0x7
...
nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported
Nested VMX MSR load/store support is not sufficient to
allow perf for the L2 guest.
Until properly fixed, trap CPUID and disable function 0xA.
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit fc3a9157d314 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to race-condition between a vmexit and the instruction emulator.
The same rationale also applies to userspace applications that are
permitted by the guest OS to access MMIO areas or perform PIO.
This patch extends the current behavior, injecting a #UD instead of
reporting the failure to userspace, to guest userspace code as well.
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In init_rmode_tss() there are two variables indicating the return
value, r and ret, and the function returns 0 on error and 1 on success.
It is only called by vmx_set_tss_addr(), and ret is redundant.
This patch removes the redundant variable by making init_rmode_tss()
return 0 on success, -errno on failure.
Reviewed-by: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The KVM emulator code assumes that the guest virtual address space (in 64-bit)
is 48 bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctls if
userspace tries to create a guest that does not obey this restriction.
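A sketch of the check (the leaf and bit layout come from the CPUID
spec; exact placement is assumed):

    struct kvm_cpuid_entry2 *best;

    /* CPUID 0x80000008, EAX bits 15:8 = virtual address width */
    best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
    if (best && ((best->eax >> 8) & 0xff) != 48)
            return -EINVAL;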
Signed-off-by: Nadav Amit <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
/me got confused between the kernel and QEMU. In the kernel, you can
only have one module_init function, and it will prevent unloading the
module unless you also have the corresponding module_exit function.
So, commit 80ce1639727e (KVM: VFIO: register kvm_device_ops dynamically,
2014-09-02) broke unloading of the kvm module, by adding a module_init
function and no module_exit.
Repair it by making kvm_vfio_ops_init weak, and checking it in
kvm_init.
Cc: Will Deacon <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Alex Williamson <[email protected]>
Fixes: 80ce1639727e ("KVM: VFIO: register kvm_device_ops dynamically")
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Commit c77dcac (KVM: Move more code under CONFIG_HAVE_KVM_IRQFD) added
functionality that depends on definitions in ioapic.h when
__KVM_HAVE_IOAPIC is defined.
At the same time, kvm-arm commit 0ba0951 (KVM: EVENTFD: remove inclusion
of irq.h) removed the inclusion of irq.h, an architecture-specific header
that is not present on ARM but which happened to include ioapic.h on x86.
Include ioapic.h directly in eventfd.c if __KVM_HAVE_IOAPIC is defined.
This fixes x86 and lets ARM use eventfd.c.
Signed-off-by: Christoffer Dall <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In init_rmode_identity_map() there are two variables indicating the
return value, r and ret, and the function returns 0 on error and 1 on
success. It is only called by vmx_create_vcpu(), and ret is redundant.
This patch removes the redundant variable and makes
init_rmode_identity_map() return 0 on success, -errno on failure.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page,
but it is never used to refer to the page at all.
In vcpu initialization, it indicates two things:
1. whether the ept page is allocated
2. whether a memory slot for the identity page is initialized
Actually, kvm_arch->ept_identity_pagetable_done is enough to tell whether
the ept identity pagetable is initialized, so we can remove
ept_identity_pagetable.
NOTE: In the original code, the ept identity pagetable page is pinned in
memory. As a result, it cannot be migrated/hot-removed. After this patch,
since kvm_arch->ept_identity_pagetable is removed, the ept identity
pagetable page is no longer pinned in memory, and it can be
migrated/hot-removed.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Now that we have a dynamic means to register kvm_device_ops, use that
for the VFIO kvm device, instead of relying on the static table.
This is achieved by a module_init call to register the ops with KVM.
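A sketch of that registration (names taken from this and surrounding
entries; note the entry higher up in this log, where the bare
module_init later had to be fixed to keep the kvm module unloadable):

    /* virt/kvm/vfio.c */
    static int __init kvm_vfio_ops_init(void)
    {
            return kvm_register_device_ops(&kvm_vfio_ops, KVM_DEV_TYPE_VFIO);
    }
    module_init(kvm_vfio_ops_init);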
Cc: Gleb Natapov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Acked-by: Alex Williamson <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Using the new kvm_register_device_ops() interface makes us get rid of
an #ifdef in common code.
Cc: Gleb Natapov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Signed-off-by: Cornelia Huck <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Now that we have a dynamic means to register kvm_device_ops, use that
for the ARM VGIC, instead of relying on the static table.
Cc: Gleb Natapov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
kvm_ioctl_create_device currently has knowledge of all the device types
and their associated ops. This is fairly inflexible when adding support
for new in-kernel device emulations, so move what we currently have out
into a table, which can support dynamic registration of ops by new
drivers for virtual hardware.
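A sketch of the table and registration hook this introduces (the bounds
check and error codes are assumptions):

    static struct kvm_device_ops *kvm_device_ops_table[KVM_DEV_TYPE_MAX];

    int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type)
    {
            if (type >= ARRAY_SIZE(kvm_device_ops_table))
                    return -ENOSPC;
            if (kvm_device_ops_table[type] != NULL)
                    return -EEXIST;
            kvm_device_ops_table[type] = ops;
            return 0;
    }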
Cc: Alex Williamson <[email protected]>
Cc: Alex Graf <[email protected]>
Cc: Gleb Natapov <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Marc Zyngier <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Reviewed-by: Christoffer Dall <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI writing and irq delivery. If we do not emulate this behavior, and
re-inject the interrupt immediately after the guest sends an EOI and re-enables
interrupts, a guest might spend all its time in the ISR if it has a broken
handler for a level-triggered interrupt.
Such livelock actually happens with Windows guests when resuming from
hibernation.
As there's no way to distinguish a broken handler from newly raised
interrupts, this patch delays the interrupt if 10,000 consecutive EOIs
found the interrupt still high. The guest can then make a little forward
progress, until a proper IRQ handler is set or until some detection
routine in the guest (such as Linux's note_interrupt()) recognizes the
situation.
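A sketch of the delay logic (the constant and field names are
assumptions):

    #define SUCCESSIVE_IRQ_MAX_COUNT 10000

    if (ioapic->irq_eoi[pin] == SUCCESSIVE_IRQ_MAX_COUNT) {
            /* too many back-to-back EOIs with the line still high:
             * defer the re-injection for ~10 ms instead of looping */
            ioapic->irq_eoi[pin] = 0;
            schedule_delayed_work(&ioapic->eoi_inject, HZ / 100);
    } else {
            ioapic->irq_eoi[pin]++;
            ioapic_service(ioapic, pin, false);
    }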
Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Signed-off-by: Zhang Haoyu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This patch replaces the set_bit method with kvm_make_request
to make the code more readable and consistent.
Signed-off-by: Guo Hui Liu <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case. However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the
address of the APIC access page, so use this macro instead of the
open-coded constant.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
KVM: s390: Fixes and features for next (3.18)
1. Crypto/CPACF support: To enable the MSA4 instructions we have to
provide a common control structure for each SIE control block
2. Two cleanups found by a static code checker: one redundant assignment
and one useless if
3. Fix the page handling of the diag10 ballooning interface. If the
guest freed the pages at absolute 0, some checks and frees were
incorrect
4. Limit guests to 16TB
5. Add __must_check to interrupt injection code
|
|
r is already initialized to 0.
Signed-off-by: Christian Borntraeger <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
|
|
The old handling of prefix pages was broken in the diag10 ballooner.
We now rely on gmap_discard to check for start > end and do a
slow path if the prefix swap pages are affected:
1. discard the pages from start to prefix
2. discard the absolute 0 pages
3. discard the pages after prefix swap to end
Signed-off-by: Christian Borntraeger <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
|
|
Due to the earlier check we know that ipte_lock_count must be 0.
No need to add a useless if. Let's make clear that we are always
going to wake up when we execute that code.
Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
|
|
We must not fall through if the conditions for external call are not met.
Signed-off-by: Christian Borntraeger <[email protected]>
Reviewed-by: Thomas Huth <[email protected]>
Cc: [email protected]
|
|
Currently we fill up a full 5 level page table to hold the guest
mapping. Since commit "support gmap page tables with less than 5
levels" we can do better.
Having more than 4 TB might be useful for some testing scenarios,
so let's just limit ourselves to 16TB guest size.
Having more than that is totally untested as I do not have enough
swap space/memory.
We continue to allow ucontrol VMs the full size.
Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Cornelia Huck <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
|
|
We now propagate interrupt injection errors back to the ioctl. We
should mark functions that might fail with __must_check.
Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Jens Freimann <[email protected]>
|
|
We have to provide a per guest crypto block for the CPUs to
enable MSA4 instructions. According to icainfo on z196 or
later this enables CCM-AES-128, CMAC-AES-128, CMAC-AES-192
and CMAC-AES-256.
Signed-off-by: Tony Krowiak <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Michael Mueller <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
[split MSA4/protected key into two patches]
|
|
It looks like when this was initially merged it got accidentally included
in the following section. I've just moved it back in the correct section
and re-numbered it as other ioctls have been added since.
Signed-off-by: Alex Bennée <[email protected]>
Acked-by: Borislav Petkov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In preparation for working on the ARM implementation I noticed the debug
interface was missing from the API document. I've pieced together the
expected behaviour from the code and commit messages and written it up
as best I can.
Signed-off-by: Alex Bennée <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
__kvm_set_memory_region sets r to -EINVAL very early.
Doing it again is not necessary. The same is true later on, where
r is assigned -ENOMEM twice.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The first statement of kvm_dev_ioctl is
long r = -EINVAL;
No need to reassign the same value.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The expression `vcpu->spin_loop.in_spin_loop' is always true,
because it is evaluated only when the condition
`!vcpu->spin_loop.in_spin_loop' is false.
Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault. To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpa instead operates on the "outer" (NPT)
instance.
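Post-patch, the hook's shape in struct kvm_mmu is roughly (a sketch
based on the description above):

    gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
                           struct x86_exception *exception);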
Reported-by: Valentine Sinitsyn <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault. However because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP. We do not write back
if an instruction caused an interception vmexit; do the same for page
faults.
Suggested-by: Gleb Natapov <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This is similar to what the EPT code does with the exit qualification.
This allows the guest to see a valid value for bits 33:32.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it
in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it.
Signed-off-by: Paolo Bonzini <[email protected]>
|