|
The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.
This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
and fixup_user_fault together help support it. The patch also supports
VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
reference counting.
Cc: Xiao Guangrong <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Radim Krčmář <[email protected]>
Tested-by: Neo Jia <[email protected]>
Reported-by: Kirti Wankhede <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
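A rough sketch of the idea (the helper name is illustrative and error
handling is simplified): if follow_pfn fails because the VMA has not been
populated yet, fault the page in so the driver's vm_ops->fault runs, then
retry the lookup.

    /* Illustrative only -- not the exact code added by this patch. */
    static int hva_to_pfn_remapped(struct vm_area_struct *vma,
                                   unsigned long addr, bool write_fault,
                                   kvm_pfn_t *p_pfn)
    {
            bool unlocked = false;
            unsigned long pfn;
            int r;

            r = follow_pfn(vma, addr, &pfn);
            if (r) {
                    /*
                     * Not mapped yet: run the fault handler (which may call
                     * remap_pfn_range) and retry.
                     */
                    r = fixup_user_fault(current, current->mm, addr,
                                         write_fault ? FAULT_FLAG_WRITE : 0,
                                         &unlocked);
                    if (r)
                            return r;
                    r = follow_pfn(vma, addr, &pfn);
                    if (r)
                            return r;
            }

            *p_pfn = pfn;
            return 0;
    }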
|
|
Handle VM_IO like VM_PFNMAP, as is common in the rest of Linux; extract
the formula to convert hva->pfn into a new function, which will soon
gain more capabilities.
Cc: Xiao Guangrong <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
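The extracted hva->pfn conversion boils down to the usual VMA offset
arithmetic; a sketch (the new function's actual name is not shown here):

    /* pfn backing a user address inside a VM_IO/VM_PFNMAP vma (sketch) */
    static kvm_pfn_t vma_to_pfn(struct vm_area_struct *vma, unsigned long addr)
    {
            return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
    }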
|
|
In case we have to emulate an instruction or part of it (instruction,
partial instruction, operation exception), we have to inject a PER
instruction-fetching event for that instruction, if hardware told us to do
so.
In case we retry an instruction, we must not inject the PER event.
Please note that we don't filter the events properly yet, so guest
debugging will be visible to the guest.
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We have both KERN_TO_HYP and kern_hyp_va, which do the exact same
thing. Let's standardize on the latter.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
This is more of a safety measure than anything else: if we end up
with an idmap page that intersects with the range picked for the
HYP VA space, abort the KVM setup, as it is unsafe to go
further.
I cannot imagine it happening on 64bit (we have a mechanism to
work around it), but could potentially occur on a 32bit system with
the kernel loaded high enough in memory so that it conflicts with
the kernel VA.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
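The check itself is simple; roughly (symbol names approximate):

    /* Sketch: refuse to initialise KVM if the idmap page sits inside
     * the VA range we are about to hand to HYP. */
    if (hyp_idmap_start >= kern_hyp_va(PAGE_OFFSET) &&
        hyp_idmap_start <  kern_hyp_va(~0UL)) {
            kvm_err("Idmap intersects with the HYP VA range\n");
            return -EINVAL;
    }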
|
|
We can now remove a number of dead #defines, thanks to the trampoline
code being gone.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
So far, KVM was getting in the way of kexec on 32bit (and the arm64
kexec hackers couldn't be bothered to fix it on 32bit...).
With simpler page tables, tearing KVM down becomes very easy, so
let's just do it.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
Just like for arm64, we can now make the HYP setup a lot simpler,
and initialise it in one go (instead of the two phases we currently
have).
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
There is no way to free the boot PGD, because it doesn't exist
anymore as a standalone entity.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
Since we now only have one set of page tables, the concept of
boot_pgd is useless and can be removed. We still keep it as
an element of the "extended idmap" thing.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
Now that we only have the "merged page tables" case to deal with,
there is a bunch of things we can simplify in the HYP code (both
at init and teardown time).
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
We're in a position where we can now always have "merged" page
tables, where both the runtime mapping and the idmap coexist.
This results in some code being removed, but there is more to come.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
Add the code that enables the switch to the lower HYP VA range.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
Declare the __hyp_text_start/end symbols in asm/virt.h so that
they can be reused without having to declare them locally.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
As we move towards a selectable HYP VA range, it is obvious that
we don't want to test a variable to find out if we need to use
the bottom VA range, the top VA range, or use the address as is
(for VHE).
Instead, we can expand our current helper to generate the right
mask or nop with code patching. We default to using the top VA
space, with alternatives to switch to the bottom one or to nop
out the instructions.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
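A sketch of the resulting helper on arm64 (simplified; the real code uses
the alternatives framework in the same spirit):

    /* Mask the kernel VA down into the HYP range; the "and" is patched
     * to a nop on VHE systems, where the address is used as-is. */
    static inline unsigned long __kern_hyp_va(unsigned long v)
    {
            asm volatile(ALTERNATIVE("and %0, %0, %1",
                                     "nop",
                                     ARM64_HAS_VIRT_HOST_EXTN)
                         : "+r" (v)
                         : "i" (HYP_PAGE_OFFSET_MASK));
            return v;
    }

    #define kern_hyp_va(v)  (typeof(v))(__kern_hyp_va((unsigned long)(v)))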
|
|
Define the two possible HYP VA regions in terms of VA_BITS,
and keep HYP_PAGE_OFFSET_MASK as a temporary compatibility
definition.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
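Roughly (definitions approximate):

    /* Top half of the HYP-capable VA space ... */
    #define HYP_PAGE_OFFSET_HIGH_MASK   ((UL(1) << VA_BITS) - 1)
    /* ... and the bottom half, used when the top one is not available */
    #define HYP_PAGE_OFFSET_LOW_MASK    ((UL(1) << (VA_BITS - 1)) - 1)

    /* Kept temporarily so existing users keep building */
    #define HYP_PAGE_OFFSET_MASK        HYP_PAGE_OFFSET_HIGH_MASK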
|
|
As we need to indicate to the rest of the kernel which region of
the HYP VA space is safe to use, add a capability that will
indicate that KVM should use the [VA_BITS-2:0] range.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
HYP_PAGE_OFFSET is not massively useful. And the way we use it
in KERN_HYP_VA is inconsistent with the equivalent operation in
EL2, where we use a mask instead.
Let's replace the uses of HYP_PAGE_OFFSET with HYP_PAGE_OFFSET_MASK,
and get rid of the pointless macro.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
hyp_kern_va is now completely unused, so let's remove it entirely.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
__hyp_panic_string is passed via the HYP panic code to the panic
function, and is being "upgraded" to a kernel address, as it is
referenced by the HYP code (in a PC-relative way).
This is a bit silly, and we'd be better off obtaining the kernel
address and not messing with it at all. This patch implements this
with a tiny bit of asm glue, by forcing the string pointer to be
read from the literal pool.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
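The asm glue is a one-liner; roughly:

    /* Load the address of __hyp_panic_string from the literal pool so we
     * get its kernel VA, not a HYP-relative address (sketch). */
    unsigned long str_va;

    asm volatile("ldr %0, =__hyp_panic_string" : "=r" (str_va));

    /* str_va is then handed to the panic path as the format string. */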
|
|
Since dealing with VA ranges tends to hurt my brain badly, let's
start with a bit of documentation that will hopefully help
understanding what comes next...
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
I don't think any single piece of the KVM/ARM code ever generated
as much hatred as the GIC emulation.
It was written by someone who had zero experience in modeling
hardware (me), was riddled with design flaws, should have been
scrapped and rewritten from scratch long before having a remote
chance of reaching mainline, and yet we supported it for a good
three years. No need to mention the names of those who suffered,
the git log is singing their praises.
Thankfully, we now have a much more maintainable implementation,
and we can safely put the grumpy old GIC to rest.
Fellow hackers, please raise your glass in memory of the GIC:
The GIC is dead, long live the GIC!
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
INFO: rcu_sched detected stalls on CPUs/tasks:
1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
(detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
Task dump for CPU 1:
qemu-system-x86 R running task 0 3529 3525 0x00080808
ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
Call Trace:
? kvm_write_guest_cached+0xb9/0x160 [kvm]
? __delay+0xf/0x20
? wait_lapic_expire+0x14a/0x200 [kvm]
? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
? __fget+0x5/0x210
? do_vfs_ioctl+0x96/0x6a0
? __fget_light+0x2a/0x90
? SyS_ioctl+0x79/0x90
? do_syscall_64+0x7c/0x1e0
? entry_SYSCALL64_slow_path+0x25/0x25
This can be reproduced readily by running a full dynticks guest (since the
hrtimer in the guest is heavily used) with lapic_timer_advance disabled.
If we fail to program the hardware preemption timer, we fall back to the
hrtimer-based method; however, a previously programmed preemption timer is
not cancelled in this scenario, so one hardware preemption timer and one
hrtimer-emulated TSC deadline timer end up running simultaneously. The
target guest deadline TSC can then be earlier than the guest TSC, the
computation in vmx_set_hv_timer underflows, delta_tsc is set to a huge
value, and the host soft-locks up as shown above.
This patch fixes it by cancelling the previously programmed preemption
timer once we fail to program the new preemption timer and fall back to
the hrtimer-based method.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Yunhong Jiang <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
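Roughly, the fix looks like this (function and helper names illustrative;
the cancel helper is factored out properly in the next patch):

    /* Try to arm the VMX preemption timer for the TSC deadline. */
    static bool lapic_try_hv_tscdeadline(struct kvm_lapic *apic)
    {
            u64 tscdeadline = apic->lapic_timer.tscdeadline;

            if (atomic_read(&apic->lapic_timer.pending) ||
                kvm_x86_ops->set_hv_timer(apic->vcpu, tscdeadline)) {
                    /* The fix: drop any timer armed earlier, so it cannot
                     * run concurrently with the hrtimer-based fallback. */
                    if (apic->lapic_timer.hv_timer_in_use)
                            cancel_hv_tscdeadline(apic);
                    return false;   /* caller falls back to the hrtimer */
            }

            apic->lapic_timer.hv_timer_in_use = true;
            return true;
    }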
|
|
Introduce cancel_hv_tscdeadline() to encapsulate preemption
timer cancel stuff.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Yunhong Jiang <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
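The helper is tiny; roughly:

    /* Sketch: tear down the hardware preemption timer and forget about it */
    static void cancel_hv_tscdeadline(struct kvm_lapic *apic)
    {
            kvm_x86_ops->cancel_hv_timer(apic->vcpu);
            apic->lapic_timer.hv_timer_in_use = false;
    }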
|
|
If the TSC deadline timer is programmed really close to the deadline or
even in the past, the computation in vmx_set_hv_timer can underflow and
cause delta_tsc to be set to a huge value. This generally results
in vmx_set_hv_timer returning -ERANGE, but we can fix it by limiting
delta_tsc to be positive or zero.
Reported-by: Wanpeng Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
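The clamp itself is a one-liner; a sketch (helper name hypothetical):

    /* Delta until the guest deadline, clamped so a deadline already in
     * the past yields 0 instead of an underflowing subtraction. */
    static u64 hv_timer_delta_tsc(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc)
    {
            u64 guest_tscl = kvm_read_l1_tsc(vcpu, rdtsc());

            return max(guest_deadline_tsc, guest_tscl) - guest_tscl;
    }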
|
|
This gains a few clock cycles per vmexit. On Intel there is no need
anymore to enable the interrupts in vmx_handle_external_intr, since
we are using the "acknowledge interrupt on exit" feature. AMD
needs to do that, and must be careful to avoid the interrupt shadow.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This is necessary to simplify handle_external_intr in the next patch.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use the functions from context_tracking.h directly.
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Structures that can be generally written to don't have any requirement
to be executable (quite the opposite). This includes the kvm and vcpu
structures, as well as the stacks.
Let's change the default to incorporate the XN flag.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
There should be no reason for mapping the HYP text read/write.
As such, let's have a new set of flags (PAGE_HYP_EXEC) that allows
execution, but maps the page as read-only, and update the two call
sites that deal with mapping code.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
In order to be able to use C code in HYP, we're now mapping the kernel's
rodata in HYP. It works absolutely fine, except that we're mapping it RWX,
which is not what it should be.
Add a new PAGE_HYP_RO protection, and pass it as the protection flags
when mapping the rodata section.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
|
|
EL2 page tables can be configured to deny code from being
executed, which is done by setting bit 54 in the page descriptor.
It is the same bit as PTE_UXN, but the "USER" reference felt odd
in the hypervisor code.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
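The new definition is a single bit; roughly:

    /* Execute-never for stage-1 HYP mappings (same bit position as PTE_UXN) */
    #define PTE_HYP_XN      (_AT(pteval_t, 1) << 54)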
|
|
Currently, create_hyp_mappings applies a "one size fits all" page
protection (PAGE_HYP). As we're heading towards separate protections
for different sections, let's make this protection a parameter, and
let the callers pass their preferred protection (PAGE_HYP for everyone
for the time being).
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
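The interface change, roughly:

    /* Protection is now supplied by the caller instead of being hardcoded */
    int create_hyp_mappings(void *from, void *to, pgprot_t prot);

    /* Existing callers simply pass PAGE_HYP for now, e.g. (illustrative): */
    err = create_hyp_mappings(kvm, kvm + 1, PAGE_HYP);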
|
|
Make kvm_guest_{enter,exit} and __kvm_guest_{enter,exit} trivial wrappers
around the code in context_tracking.h. Name the context_tracking.h functions
consistently with those used for kernel<->user switches.
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Combine the kvm_enter, kvm_reenter and kvm_out trace events into a
single kvm_transition event class to reduce duplication and bloat.
Suggested-by: Steven Rostedt <[email protected]>
Fixes: 93258604ab6d ("MIPS: KVM: Add guest mode switch trace events")
Signed-off-by: James Hogan <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>
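The consolidation uses the standard tracepoint event-class macros; a sketch
(the traced fields shown are illustrative, not necessarily what the MIPS
events record):

    DECLARE_EVENT_CLASS(kvm_transition,
            TP_PROTO(struct kvm_vcpu *vcpu),
            TP_ARGS(vcpu),
            TP_STRUCT__entry(
                    __field(unsigned long, pc)
            ),
            TP_fast_assign(
                    __entry->pc = vcpu->arch.pc;
            ),
            TP_printk("PC: 0x%08lx", __entry->pc)
    );

    DEFINE_EVENT(kvm_transition, kvm_enter,
            TP_PROTO(struct kvm_vcpu *vcpu), TP_ARGS(vcpu));
    DEFINE_EVENT(kvm_transition, kvm_reenter,
            TP_PROTO(struct kvm_vcpu *vcpu), TP_ARGS(vcpu));
    DEFINE_EVENT(kvm_transition, kvm_out,
            TP_PROTO(struct kvm_vcpu *vcpu), TP_ARGS(vcpu));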
|
|
KVM reads the current boottime value as a struct timespec in order to
calculate the guest wallclock time, resulting in an overflow in 2038
on 32-bit systems.
The data then gets passed as an unsigned 32-bit number to the guest,
and that in turn overflows in 2106.
We cannot do much about the second overflow, which affects both 32-bit
and 64-bit hosts, but we can ensure that they both behave the same
way and don't overflow until 2106, by using getboottime64() to read
a timespec64 value.
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
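The change boils down to reading the boot time through the 64-bit
interface; a sketch (surrounding code omitted):

    /* 2038-safe on 32-bit hosts: boot time as a timespec64 */
    struct timespec64 boot;

    getboottime64(&boot);
    /* boot.tv_sec/tv_nsec are then subtracted from the wall clock before
     * the result is handed to the guest. */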
|
|
On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.
Signed-off-by: Ashok Raj <[email protected]>
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
Only enable LMCE on Intel platform
Check MSR_IA32_FEATURE_CONTROL when handling guest
access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM currently does not check the value written to guest
MSR_IA32_FEATURE_CONTROL, though bits corresponding to disabled features
may be set. This patch makes KVM validate individual bits written to
guest MSR_IA32_FEATURE_CONTROL according to enabled features.
Signed-off-by: Haozhong Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
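Roughly, the validation looks like this (names approximate):

    /* Only accept FEATURE_CONTROL bits that match features exposed to the
     * guest; anything else rejects the write (injecting #GP). */
    static bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu, u64 data)
    {
            u64 valid_bits = FEATURE_CONTROL_LOCKED;

            if (nested_vmx_allowed(vcpu))
                    valid_bits |= FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;

            return !(data & ~valid_bits);
    }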
|
|
msr_ia32_feature_control will be used for LMCE and not depend only on
nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.
Signed-off-by: Haozhong Zhang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: vSIE (nested virtualization) feature for 4.8 (kvm/next)
With an updated QEMU this allows creating nested KVM guests
(KVM under KVM) on s390.
The s390 memory management changes are from Martin Schwidefsky or
acked by Martin. One common code memory management change (pageref)
is acked by Andrew Morton.
The feature has to be enabled with the nested module parameter.
|
|
Let's be careful first and allow nested virtualization only if enabled
by the system administrator. In addition, user space still has to
explicitly enable it via SCLP features for it to work.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We have certain SIE features that we cannot support for now.
Let's add these features now, so user space can already prepare to enable
them and we don't have to update yet another component.
In addition, add a comment block explaining why it is not possible for now
to forward/enable these features.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
Guest 2 sets up the epoch of guest 3 from its point of view. Therefore,
we have to add the guest 2 epoch to the guest 3 epoch. We also have to take
care of guest 2 epoch changes on STP syncs. This will work just fine by
also updating the guest 3 epoch when a vsie_block has been set for a VCPU.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
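In code terms this is essentially a single addition when shadowing the
control block (field names approximate):

    /* The guest 3 epoch in the shadow SCB is relative to guest 2, so fold
     * guest 2's own epoch in. */
    scb_s->epoch += vcpu->kvm->arch.epoch;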
|
|
Whenever a SIGP external call is injected via the SIGP external call
interpretation facility, the VCPU is not kicked. When a VCPU is currently
in the VSIE, the external call might not be processed immediately.
Therefore we have to provoke partial execution exceptions, which lead to a
kick of the VCPU and therefore also kick it out of the VSIE. This is done by
simulating the WAIT state. This bit has no other side effects.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
As we want to make use of CPUSTAT_WAIT also when a VCPU is not idle, but
only to force interception of external calls, let's check the idle bitmap instead.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
Whenever we want to wake up a VCPU (e.g. when injecting an IRQ), we
have to kick it out of vsie, so the request will be handled faster.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We can avoid one unneeded SIE entry after we reported a fault to g2.
Theoretically, g2 resolves the fault and we can create the shadow mapping
directly, instead of failing again when entering the SIE.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We can easily enable ibs for guest 2, so it can use it for guest 3.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We can easily enable cei for guest 2, so it can use it for guest 3.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|
|
We can easily enable intervention bypass for guest 2, so it can use it
for guest 3.
Acked-by: Christian Borntraeger <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: Christian Borntraeger <[email protected]>
|