|
Add decoding for INVEPT and reorder the list according to the reason
numbers.
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
After commit:
8827247ffcc ("x86: don't define __this_fixmap_does_not_exist()")
variable idx0 is no longer needed, so just remove it.
Signed-off-by: Jianguo Wu <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: Hanjun Guo <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras
Pull MCE-uncorrected-error fix from Tony Luck:
"Bit 12 may or may not be set in MCi_STATUS.MCACOD when
an uncorrected error is reported. Ignore it when checking
error signatures."
Signed-off-by: Ingo Molnar <[email protected]>
|
|
load_microcode_amd() (and the helper it uses) should not take a cpu
parameter. The microcode loading does not depend on the CPU with respect
to the patches loaded, since they end up in a global list for all CPUs
anyway.
The change from a cpu number to x86family in load_microcode_amd()
now allows dropping the code messing with cpu_data(cpu) from
collect_cpu_info_amd_early(), which is wrong anyway because at that
point the per-cpu cpu_info is not yet set up (these values would later be
overwritten by smp_store_boot_cpu_info() / smp_store_cpu_info()).
Fold the rest of collect_cpu_info_amd_early() into load_ucode_amd_ap(),
because it is only used in one place and, without the cpuinfo_x86
accesses, not much of it was left.
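A rough sketch of the resulting shape (the exact prototype is assumed, not
quoted from the patch): the loader now only needs the CPU family to pick
matching patches.

    /* All patches land in one global list; no per-CPU state is needed here. */
    static enum ucode_state load_microcode_amd(u8 family,
                                               const u8 *data, size_t size);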
Signed-off-by: Torsten Kaiser <[email protected]>
[ Fengguang: build fix ]
Signed-off-by: Fengguang Wu <[email protected]>
[ Boris: adapt it to current tree. ]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
cpu_has_amd_erratum() is buggy, because it uses the per-cpu cpu_info
before it is filled by smp_store_boot_cpu_info() / smp_store_cpu_info().
If early microcode loading is enabled, its collect_cpu_info_amd_early()
will fill ->x86, so the fallback to boot_cpu_data is not used. But
->x86_vendor was not filled and is still X86_VENDOR_INTEL, resulting in
no errata fixes getting applied, and my system hangs on boot.
Using cpu_info in cpu_has_amd_erratum() is wrong anyway: its only
caller, init_amd(), already has a struct cpuinfo_x86 as a parameter, and
the set_cpu_bug() that is controlled by cpu_has_amd_erratum() also only
uses that struct.
So pass the struct cpuinfo_x86 from init_amd() to cpu_has_amd_erratum(),
and the broken fallback can be dropped.
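A minimal sketch of the resulting prototype (assumed, not the verbatim
patch): init_amd() hands down the struct it already operates on.

    /* The caller passes its own cpuinfo_x86; no per-cpu fallback is needed. */
    static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu,
                                    const int *erratum);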
[ Boris: Drop WARN_ON() since we're called only from init_amd() ]
Signed-off-by: Torsten Kaiser <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
|
|
Pursue a single RAS/MCE topic branch on x86.
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Now that we have weak versions for each of the PCI MSI architecture
functions, we can actually build the MSI support for all platforms,
regardless of whether or not they provide architecture-specific
versions of those functions. For this reason, the ARCH_SUPPORTS_MSI
hidden kconfig boolean becomes useless, and this patch gets rid of it.
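The net Kconfig effect is roughly the following sketch (the exact option
text is assumed): PCI_MSI no longer needs an architecture opt-in.

    config PCI_MSI
            bool "Message Signaled Interrupts (MSI and MSI-X)"
            depends on PCI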
Signed-off-by: Thomas Petazzoni <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Benjamin Herrenschmidt <[email protected]>
Tested-by: Daniel Price <[email protected]>
Tested-by: Thierry Reding <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: Russell King <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: David S. Miller <[email protected]>
Cc: [email protected]
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Jason Cooper <[email protected]>
|
|
Until now, the MSI architecture-specific functions could be overloaded
using a fairly complex set of #define and compile-time
conditionals. In order to prepare for the introduction of the msi_chip
infrastructure, it is desirable to switch all those functions to use
the 'weak' mechanism. This commit converts all the architectures that
were overriding those MSI functions to use the new strategy.
Note that we keep two separate, non-weak, functions
default_teardown_msi_irqs() and default_restore_msi_irqs() for the
default behavior of the arch_teardown_msi_irqs() and
arch_restore_msi_irqs(), as the default behavior is needed by x86 PCI
code.
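A short sketch of the weak-default pattern this refers to (illustrative,
not the full patch):

    /* Generic weak default; an architecture may provide its own version. */
    void __weak arch_teardown_msi_irqs(struct pci_dev *dev)
    {
            return default_teardown_msi_irqs(dev);
    }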
Signed-off-by: Thomas Petazzoni <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
Acked-by: Benjamin Herrenschmidt <[email protected]>
Tested-by: Daniel Price <[email protected]>
Tested-by: Thierry Reding <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: [email protected]
Cc: Russell King <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: David S. Miller <[email protected]>
Cc: [email protected]
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Jason Cooper <[email protected]>
|
|
F15h, models 0x30 and later don't have a GART. Note that. Also check
CPUID leaf 0x80000006 for L3 presence because there are models which
don't sport an L3 cache.
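An illustrative version of the L3 check (not the verbatim patch; the
helper name is made up here):

    /*
     * CPUID leaf 0x80000006, EDX describes the L3 cache on AMD parts;
     * an all-zero EDX means this model has no L3 at all.
     */
    static bool amd_has_l3_cache(void)
    {
            unsigned int eax, ebx, ecx, edx;

            cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
            return edx != 0;
    }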
Signed-off-by: Aravind Gopalakrishnan <[email protected]>
[ Boris: rewrite commit message, cleanup comments. ]
Signed-off-by: Borislav Petkov <[email protected]>
|
|
This one was missed earlier.
Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
OpenFirmware wasn't quite following the protocol described in boot.txt
and the kernel has detected this through use of the sentinel value
in boot_params. OFW does zero out almost everything that it should,
but not the sentinel.
This causes the kernel to clear olpc_ofw_header, which breaks x86 OLPC
support.
OpenFirmware has now been fixed. However, it would be nice if we could
maintain Linux compatibility with old firmware versions. To do that, we just
have to avoid zeroing out olpc_ofw_header.
OFW does not write to any other parts of the header that are being zapped
by the sentinel-detection code, and all users of olpc_ofw_header are
somewhat protected through checking for the OLPC_OFW_SIG magic value
before using it. So this should not cause any problems for anyone.
Signed-off-by: Daniel Drake <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Yinghai Lu <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
Cc: <[email protected]> # v3.9+
|
|
In m2p_remove_override() when removing the grant map from the kernel
mapping and replacing with a mapping to the original page, the grant
unmap will already have flushed the TLB and it is not necessary to do
it again after updating the mapping.
Signed-off-by: David Vrabel <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
|
|
This is based on a patch that Zhenzhong Duan had sent - which
was missing some of the remaining pieces. The kernel has the
logic to handle Xen-type exceptions using the paravirt interface
in the assembler code (see PARAVIRT_ADJUST_EXCEPTION_FRAME -
pv_irq_ops.adjust_exception_frame and INTERRUPT_RETURN -
pv_cpu_ops.iret).
That means the NMI handler (and other exception handlers) use
the hypervisor iret.
The other changes necessary for this are to translate the
NMI_VECTOR to one of the entries on the ipi_vector and to make
xen_send_IPI_mask_allbutself use different events.
Fortunately for us commit 1db01b4903639fcfaec213701a494fe3fb2c490b
(xen: Clean up apic ipi interface) implemented this and we piggyback
on the cleanup such that the apic IPI interface will pass the right
vector value for NMI.
With this patch we can trigger NMIs within a PV guest (only tested on
x86_64).
For this to work with normal PV guests (not initial domain)
we need the domain to be able to use the APIC ops - they are
already implemented to use the Xen event channels. For that
to be turned on in a PV domU we need to remove the masking
of X86_FEATURE_APIC.
Incidentally that means kgdb will also now work within
a PV guest without using the 'nokgdbroundup' workaround.
Note that the 32-bit version is different and this patch
does not enable that.
CC: Lisa Nguyen <[email protected]>
CC: Ben Guthro <[email protected]>
CC: Zhenzhong Duan <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
[v1: Fixed up per David Vrabel comments]
Reviewed-by: Ben Guthro <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
|
|
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-14-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Suzuki Poulose <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Gleb Natapov <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
These are needed by both guest and host.
Originally-from: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-13-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Acked-by: Gleb Natapov <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
If interrupts were enabled when taking the spinlock, we can leave them
enabled while blocking to get the lock.
If we can enable interrupts while waiting for the lock to become
available, and we take an interrupt before entering the poll,
and the handler takes a spinlock which ends up going into
the slow state (invalidating the per-cpu "lock" and "want" values),
then when the interrupt handler returns the event channel will
remain pending so the poll will return immediately, causing it to
return out to the main spinlock loop.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-12-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Maintain a flag in the LSB of the ticket lock tail which indicates
whether anyone is in the lock slowpath and may need kicking when
the current holder unlocks. The flags are set when the first locker
enters the slowpath, and cleared when unlocking to an empty queue (ie,
no contention).
In the specific implementation of lock_spinning(), make sure to set
the slowpath flags on the lock just before blocking. We must do
this before the last-chance pickup test to prevent a deadlock
with the unlocker:
        Unlocker                     Locker
                                     test for lock pickup
                                             -> fail
        unlock
        test slowpath
                -> false
                                     set slowpath flags
                                     block
Whereas this works in any ordering:
        Unlocker                     Locker
                                     set slowpath flags
                                     test for lock pickup
                                             -> fail
                                     block
        unlock
        test slowpath
                -> true, kick
If the unlocker finds that the lock has the slowpath flag set but it is
actually uncontended (ie, head == tail, so nobody is waiting), then it
clears the slowpath flag.
The unlock code uses a locked add to update the head counter. This also
acts as a full memory barrier so that it's safe to subsequently
read back the slowpath flag state, knowing that the updated lock is visible
to the other CPUs. If it were an unlocked add, then the flag read may
just be forwarded from the store buffer before it was visible to the other
CPUs, which could result in a deadlock.
Unfortunately this means we need to do a locked instruction when
unlocking with PV ticketlocks. However, if PV ticketlocks are not
enabled, then the old non-locked "add" is the only unlocking code.
Note: this code relies on gcc making sure that unlikely() code is out of
line of the fastpath, which only happens when OPTIMIZE_SIZE=n. If it
doesn't, the generated code isn't too bad, but it's definitely suboptimal.
Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
version of this change, which has been folded in.
Thanks to Stephan Diestelhorst for commenting on some code which relied
on an inaccurate reading of the x86 memory ordering rules.
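Roughly, the unlock path described above looks like this sketch (helper
and constant names are assumed from the description, not quoted):

    static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
    {
            if (TICKET_SLOWPATH_FLAG &&
                static_key_false(&paravirt_ticketlocks_enabled)) {
                    arch_spinlock_t prev = *lock;

                    /* Locked add: releases the lock and acts as a full barrier. */
                    add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                    /* Safe to read the flag back only because of the barrier above. */
                    if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                            __ticket_unlock_slowpath(lock, prev);
            } else {
                    /* PV ticketlocks disabled: the old unlocked add is enough. */
                    __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
            }
    }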
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-11-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Cc: Stephan Diestelhorst <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Increment ticket head/tails by 2 rather than 1 to leave the LSB free
to store a "is in slowpath state" bit. This halves the number
of possible CPUs for a given ticket size, but this shouldn't matter
in practice - kernels built for 32k+ CPU systems are probably
specially built for the hardware rather than a generic distro
kernel.
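The idea boils down to something like this sketch (macro names assumed):

    #ifdef CONFIG_PARAVIRT_SPINLOCKS
    #define __TICKET_LOCK_INC       2       /* LSB reserved for the slowpath flag */
    #define TICKET_SLOWPATH_FLAG    ((__ticket_t)1)
    #else
    #define __TICKET_LOCK_INC       1
    #define TICKET_SLOWPATH_FLAG    ((__ticket_t)0)
    #endif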
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-9-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Tested-by: Attilio Rao <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Although the lock_spinning calls in the spinlock code are on the
uncommon path, their presence can cause the compiler to generate many
more register save/restores in the function pre/postamble, which is in
the fast path. To avoid this, convert it to using the pvops callee-save
calling convention, which defers all the save/restores until the actual
function is called, keeping the fastpath clean.
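At the call site this looks roughly like the following sketch (macro names
follow the existing pvops conventions; details assumed):

    static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
                                                       __ticket_t ticket)
    {
            /* Callee-save call: registers are only saved inside the target. */
            PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
    }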
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-8-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Tested-by: Attilio Rao <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-7-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Replace the old Xen implementation of PV spinlocks with an implementation
of xen_lock_spinning and xen_unlock_kick.
xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
adds itself to the waiting_cpus set, and blocks on an event channel
until the channel becomes pending.
xen_unlock_kick searches the cpus in waiting_cpus for the one
waiting on this lock with the next ticket, if any. If found,
it kicks it by making its event channel pending, which wakes it up.
We need to make sure interrupts are disabled while we're relying on the
contents of the per-cpu lock_waiting values, otherwise an interrupt
handler could come in, try to take some other lock, block, and overwrite
our values.
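A sketch of the kick side described above (per-cpu structure, field and
vector names are assumed):

    static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
    {
            int cpu;

            for_each_cpu(cpu, &waiting_cpus) {
                    const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);

                    /* Wake the cpu spinning on this lock with the next ticket. */
                    if (w->lock == lock && w->want == next) {
                            xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
                            break;
                    }
            }
    }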
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-6-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
[ Raghavendra: use function + enum instead of macro, cmpxchg for zero status reset;
  reintroduce break since we know the exact vCPU to send the IPI to, as suggested by Konrad. ]
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
There's no need to do it at very early init, and doing it there
makes it impossible to use the jump_label machinery.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-5-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Now that the paravirtualization layer doesn't exist at the spinlock
level any more, we can collapse the __ticket_ functions into the arch_
functions.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-4-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Tested-by: Attilio Rao <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
The code size expands somewhat, and it's better to just call
a function rather than inline it.
Thanks to Jeremy for the original version of the ARCH_NOINLINE_SPIN_UNLOCK
config patch, which has been simplified here.
Suggested-by: Linus Torvalds <[email protected]>
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-3-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow path (long spin on lock, unlocking
a contended lock).
Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments. They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.
(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
To address this, we add two hooks:
- __ticket_spin_lock, which is called after the cpu has been
spinning on the lock for a significant number of iterations but has
failed to take the lock (presumably because the cpu holding the lock
has been descheduled). The lock_spinning pvop is expected to block
the cpu until it has been kicked by the current lock holder.
- __ticket_spin_unlock, which, on releasing a contended lock
(there are more cpus with tail tickets), looks to see if the next
cpu is blocked and wakes it if so.
When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.
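For illustration, the disabled case reduces to empty inlines along these
lines (hook names here are illustrative):

    #ifndef CONFIG_PARAVIRT_SPINLOCKS
    /* No paravirt: the hooks compile away entirely. */
    static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
                                                       __ticket_t ticket) { }
    static __always_inline void __ticket_unlock_kick(arch_spinlock_t *lock,
                                                     __ticket_t ticket) { }
    #endif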
Results:
=======
setup: 32 core machine with 32 vcpu KVM guest (HT off) with 8GB RAM
base = 3.11-rc
patched = base + pvspinlock V12
+------------------+------------------+--------+
  dbench (Throughput in MB/sec. Higher is better)
+------------------+------------------+--------+
|  base (stdev %)  | patched (stdev %)|  %gain |
+------------------+------------------+--------+
|  15035.3  (0.3)  |  15150.0  (0.6)  |   0.8  |
|   1470.0  (2.2)  |   1713.7  (1.9)  |  16.6  |
|    848.6  (4.3)  |    967.8  (4.3)  |  14.0  |
|    652.9  (3.5)  |    685.3  (3.7)  |   5.0  |
+------------------+------------------+--------+
pvspinlock shows benefits for overcommit ratios > 1 in PLE-enabled cases,
and undercommit results are flat.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Link: http://lkml.kernel.org/r/1376058122-8248-2-git-send-email-raghavendra.kt@linux.vnet.ibm.com
Reviewed-by: Konrad Rzeszutek Wilk <[email protected]>
Tested-by: Attilio Rao <[email protected]>
[ Raghavendra: Changed SPIN_THRESHOLD, fixed redefinition of arch_spinlock_t]
Signed-off-by: Raghavendra K T <[email protected]>
Acked-by: Ingo Molnar <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
The next commit gets conflicts because it relies on patches which were
cc:stable and thus had to be merged into Linus' tree before the coming
merge window. So pull in master now.
Signed-off-by: Rusty Russell <[email protected]>
|
|
Moves the relocation handling into C, after decompression. This requires
that the decompressed size is passed to the decompression routine as
well so that relocations can be found. Only kernels that need relocation
support will use the code (currently just x86_32), but this is laying
the groundwork for 64-bit to use it in support of KASLR.
Based on work by Neill Clift and Michael Davidson.
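A sketch of the interface change implied here (the exact prototype is
assumed): the decompressor also receives the output length, so the
relocation table appended to the image can be located afterwards.

    asmlinkage void *decompress_kernel(void *rmode, memptr heap,
                                       unsigned char *input_data,
                                       unsigned long input_len,
                                       unsigned char *output,
                                       unsigned long output_len);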
Signed-off-by: Kees Cook <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Zhang Yanfei <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
|
|
* pm-cpufreq:
cpufreq: Remove unused function __cpufreq_driver_getavg()
cpufreq: Remove unused APERF/MPERF support
cpufreq: ondemand: Change the calculation of target frequency
|
|
Advertise VM_EXIT_SAVE_IA32_PAT and VM_EXIT_LOAD_IA32_PAT.
Signed-off-by: Arthur Chunqi Li <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Do not report that we can enter the guest in 64-bit mode if the host is
32-bit only. This is not supported by KVM.
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
At least WB must be possible.
Signed-off-by: Jan Kiszka <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When asking vmx to load the PAT MSR for us while switching from L1 to L2
or vice versa, we have to update arch.pat as well as it may later be
used again to load or read out the MSR content.
Reviewed-by: Gleb Natapov <[email protected]>
Tested-by: Arthur Chunqi Li <[email protected]>
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some trivial code cleanups not really related to nested EPT.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
If we let L1 use EPT, we should probably also support the INVEPT instruction.
In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if the last level
of a shadow page is unsync, not all of L1's changes to EPT12 are intercepted,
which means the roots need to be synced when L1 calls INVEPT. Global INVEPT
should not be different, since roots are synced by kvm_mmu_load() each
time EPTP02 changes.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).
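A sketch of such a switch-over (callback names follow the nested_ept_*()
convention mentioned above; the init helper and its arguments are assumed):

    static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
    {
            /* Build a shadow-EPT MMU for L2's guest-physical address space. */
            kvm_init_shadow_ept_mmu(vcpu, &vcpu->arch.mmu);

            vcpu->arch.mmu.set_cr3           = vmx_set_cr3;
            vcpu->arch.mmu.get_cr3           = nested_ept_get_cr3;
            vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;

            /* gva->gpa walks for L2 go through the nested MMU. */
            vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
            return 0;
    }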
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Inject the nEPT fault to the L1 guest. This patch is originally from Xinhao.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
need_remote_flush() assumes that the shadow page is in PT64 format, but
with the addition of nested EPT this is no longer always true. Fix it by
using bit definitions that depend on the host shadow page type.
Reported-by: Xiao Guangrong <[email protected]>
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Since nEPT doesn't support A/D bits, we should not set those bits
when building the shadow page table.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).
This patch adds EPT support to paging_tmpl.h.
paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.
There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.
So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with "EPT") which correctly read and write EPT tables.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Some guest paging modes do not support A/D bits. Add support for such
modes in shadow page code. For such modes PT_GUEST_DIRTY_MASK,
PT_GUEST_ACCESSED_MASK, PT_GUEST_DIRTY_SHIFT and PT_GUEST_ACCESSED_SHIFT
should be set to zero.
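For example, the EPT instantiation added elsewhere in this series can then
collapse the guest A/D handling like this sketch (exact values assumed):

    #if PTTYPE == PTTYPE_EPT
    #define PT_GUEST_ACCESSED_MASK  0
    #define PT_GUEST_DIRTY_MASK     0
    #define PT_GUEST_ACCESSED_SHIFT 0
    #define PT_GUEST_DIRTY_SHIFT    0
    #else
    #define PT_GUEST_ACCESSED_MASK  PT_ACCESSED_MASK
    #define PT_GUEST_DIRTY_MASK     PT_DIRTY_MASK
    #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
    #define PT_GUEST_DIRTY_SHIFT    PT_DIRTY_SHIFT
    #endif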
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
This patch makes the guest A/D bit definitions depend on the paging
mode, so that when EPT support is added they can be defined
differently.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In preparation, we just move gpte_access(), prefetch_invalid_gpte(),
is_rsvd_bits_set(), protect_clean_gpte() and is_dirty_gpte() from mmu.c
to paging_tmpl.h.
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.
As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.
Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.
Reviewed-by: Orit Wasserman <[email protected]>
Reviewed-by: Xiao Guangrong <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:
If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.
If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).
Reviewed-by: Xiao Guangrong <[email protected]>
Reviewed-by: Orit Wasserman <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.
To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).
Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.
Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.
Reviewed-by: Orit Wasserman <[email protected]>
Signed-off-by: Nadav Har'El <[email protected]>
Signed-off-by: Jun Nakajima <[email protected]>
Signed-off-by: Xinhao Xu <[email protected]>
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
The current code always uses arch.mmu to check the reserved bits on a guest
gpte, which is valid only for the L1 guest; we should use arch.nested_mmu
instead when we translate a gva to a gpa for the L2 guest.
Fix it by using @mmu instead, since it is adapted to the current mmu mode
automatically.
The bug can be triggered when nested NPT is used and the L1 guest and L2
guest use different mmu modes.
Reported-by: Jan Kiszka <[email protected]>
Reviewed-by: Gleb Natapov <[email protected]>
Signed-off-by: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
After commit 21feb4eb64e21f8dc91136b91ee886b978ce6421, the TR base is zeroed
during vmexit. Set it to L1's HOST_TR_BASE. This should fix
https://bugzilla.kernel.org/show_bug.cgi?id=60679
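The fix amounts to a one-liner along these lines in the nested VM-exit
host-state load path (exact placement assumed):

    /* Restore L1's TR base, which the VM-exit left zeroed. */
    vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);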
Reported-by: Yongjie Ren <[email protected]>
Reviewed-by: Arthur Chunqi Li <[email protected]>
Tested-by: Yongjie Ren <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When modifying text sections for jump labels, a paranoid check is
performed. If the check fails, the system "bugs". But why it failed
is not shown.
The BUG_ON()s in the jump label update code are replaced with bug_at(ip).
This is a function that shows which pointer failed, and what was
at the location of the failure that made the jump label code panic.
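A sketch of the helper as described (format string and exact details assumed):

    static void bug_at(unsigned char *ip, int line)
    {
            /* Show the failing address and the bytes found there before panicking. */
            pr_warning("Unexpected op at %pS [%p] (%5ph) %d\n", ip, ip, ip, line);
            BUG();
    }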
Signed-off-by: Steven Rostedt <[email protected]>
|