Age | Commit message (Collapse) | Author | Files | Lines |
|
When handling events, armv8pmu_handle_irq() calls perf_event_overflow(),
and subsequently calls irq_work_run() to handle any work queued by
perf_event_overflow(). As perf_event_overflow() raises IPI_IRQ_WORK when
queuing the work, this isn't strictly necessary and the work could be
handled as part of the IPI_IRQ_WORK handler.
In the common case the IPI handler will run immediately after the PMU IRQ
handler, and where the PE is heavily loaded with interrupts other handlers
may run first, widening the window where some counters are disabled.
In practice this window is unlikely to be a significant issue, and removing
the call to irq_work_run() would make the PMU IRQ handler NMI safe in
addition to making it simpler, so let's do that.
[Alexandru E.: Reworded commit message]
Signed-off-by: Julien Thierry <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
Cc: Julien Thierry <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
The PMU is disabled and enabled, and the counters are programmed from
contexts where interrupts or preemption is disabled.
The functions to toggle the PMU and to program the PMU counters access the
registers directly and don't access data modified by the interrupt handler.
That, and the fact that they're always called from non-preemptible
contexts, means that we don't need to disable interrupts or use a spinlock.
[Alexandru E.: Explained why locking is not needed, removed WARN_ONs]
Signed-off-by: Julien Thierry <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
Tested-by: Sumit Garg <[email protected]> (Developerbox)
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
Currently we access the counter registers and their respective type
registers indirectly. This requires us to write to PMSELR, issue an ISB,
then access the relevant PMXEV* registers.
This is unfortunate, because:
* Under virtualization, accessing one register requires two traps to
the hypervisor, even though we could access the register directly with
a single trap.
* We have to issue an ISB which we could otherwise avoid the cost of.
* When we use NMIs, the NMI handler will have to save/restore the select
register in case the code it preempted was attempting to access a
counter or its type register.
We can avoid these issues by directly accessing the relevant registers.
This patch adds helpers to do so.
In armv8pmu_enable_event() we still need the ISB to prevent the PE from
reordering the write to PMINTENSET_EL1 register. If the interrupt is
enabled before we disable the counter and the new event is configured,
we might get an interrupt triggered by the previously programmed event
overflowing, but which we wrongly attribute to the event that we are
enabling. Execute an ISB after we disable the counter.
In the process, remove the comment that refers to the ARMv7 PMU.
[Julien T.: Don't inline read/write functions to avoid big code-size
increase, remove unused read_pmevtypern function,
fix counter index issue.]
[Alexandru E.: Removed comment, removed trailing semicolons in macros,
added ISB]
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Julien Thierry <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
Tested-by: Sumit Garg <[email protected]> (Developerbox)
Cc: Julien Thierry <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
Writes to the PMXEVTYPER_EL0 register are not self-synchronising. In
armv8pmu_enable_event(), the PE can reorder configuring the event type
after we have enabled the counter and the interrupt. This can lead to an
interrupt being asserted because of the previous event type that we were
counting using the same counter, not the one that we've just configured.
The same rationale applies to writes to the PMINTENSET_EL1 register. The PE
can reorder enabling the interrupt at any point in the future after we have
enabled the event.
Prevent both situations from happening by adding an ISB just before we
enable the event counter.
Fixes: 030896885ade ("arm64: Performance counters support")
Reported-by: Julien Thierry <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
Tested-by: Sumit Garg <[email protected]> (Developerbox)
Cc: Julien Thierry <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
7f47d8cc039f ("x86, tracing, perf: Add trace point for MSR accesses") added
tracing of msr read and write, but because of complexity in having
tracepoints in headers, and even more so for a core header like msr.h, not
to mention the bloat a tracepoint adds to inline functions, a helper
function is needed to be called from the header.
Use the new tracepoint_enabled() macro in tracepoint-defs.h to test if the
tracepoint is active before calling the helper function, instead of open
coding the same logic, which requires knowing the internals of a tracepoint.
Cc: Andi Kleen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The unconditional selection of PCI_MSI_ARCH_FALLBACKS has an unmet
dependency because PCI_MSI_ARCH_FALLBACKS is defined in a 'if PCI' clause.
As it is only relevant when PCI_MSI is enabled, update the affected
architecture Kconfigs to make the selection of PCI_MSI_ARCH_FALLBACKS
depend on 'if PCI_MSI'.
Fixes: 077ee78e3928 ("PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable")
Reported-by: Qian Cai <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Links: https://lore.kernel.org/r/[email protected]
|
|
Mechanical switch from the legacy "mode" enumeration to the new
"engine type" enumeration in drivers and board files.
The device tree parsing is also updated to return the new enumeration
from the old strings.
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Link: https://lore.kernel.org/linux-mtd/[email protected]
|
|
The use of "syndrome" placement should not be encoded in the ECC
engine mode/type.
Create a "placement" field in NAND chip and change all occurrences of
the NAND_ECC_HW_SYNDROME enumeration to be just NAND_ECC_HW and
possibly a placement entry like NAND_ECC_PLACEMENT_INTERLEAVED.
Signed-off-by: Miquel Raynal <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Link: https://lore.kernel.org/linux-mtd/[email protected]
|
|
ARMv8.4-PMU introduces the PMMIR_EL1 registers and some new PMU events,
like STALL_SLOT etc, are related to it. Let's add a caps directory to
/sys/bus/event_source/devices/armv8_pmuv3_0/ and support slots from
PMMIR_EL1 registers in this entry. The user programs can get the slots
from sysfs directly.
/sys/bus/event_source/devices/armv8_pmuv3_0/caps/slots is exposed
under sysfs. Both ARMv8.4-PMU and STALL_SLOT event are implemented,
it returns the slots from PMMIR_EL1, otherwise it will return 0.
Signed-off-by: Shaokun Zhang <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Mark Rutland <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
|
|
The current binding for the RPi firmware uses the simple-bus compatible as
a fallback to benefit from its automatic probing of child nodes.
However, simple-bus also comes with some constraints, like having the ranges,
our case.
Let's switch to simple-mfd that provides the same probing logic without
those constraints.
Signed-off-by: Maxime Ripard <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>
|
|
KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
the same base for the TSC. This logic is complicated, and we
do not want it to have any effect once the VM is started.
In particular, if any guest started to synchronize its TSCs
with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
the additional effect of kvm_write_tsc code would be uncharted
territory.
Therefore, this patch makes writes to MSR_IA32_TSC behave
essentially the same as writes to MSR_IA32_TSC_ADJUST when
they come from the guest. A new selftest (which passes
both before and after the patch) checks the current semantics
of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
from both the host and the guest.
Upcoming work to remove the special side effects
of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
will be able to build onto this test, adjusting the host side
to use the new APIs and achieve the same effect.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Allow userspace to set up the memory map after KVM_SET_NESTED_STATE;
to do so, move the call to nested_svm_vmrun_msrpm inside the
KVM_REQ_GET_NESTED_STATE_PAGES handler (which is currently
not used by nSVM). This is similar to what VMX does already.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
We are going to use it for SVM too, so use a more generic name.
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
It's not desireable to have all MSRs always handled by KVM kernel space. Some
MSRs would be useful to handle in user space to either emulate behavior (like
uCode updates) or differentiate whether they are valid based on the CPU model.
To allow user space to specify which MSRs it wants to see handled by KVM,
this patch introduces a new ioctl to push filter rules with bitmaps into
KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
denied MSR events to user space to operate on.
If no filter is populated, MSR handling stays identical to before.
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
We will introduce the concept of MSRs that may not be handled in kernel
space soon. Some MSRs are directly passed through to the guest, effectively
making them handled by KVM from user space's point of view.
This patch introduces all logic required to ensure that MSRs that
user space wants trapped are not marked as direct access for guests.
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
[Replace "_idx" with "_slot". - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
We will introduce the concept of MSRs that may not be handled in kernel
space soon. Some MSRs are directly passed through to the guest, effectively
making them handled by KVM from user space's point of view.
This patch introduces all logic required to ensure that MSRs that
user space wants trapped are not marked as direct access for guests.
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
[Make terminology a bit more similar to VMX. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Prepare vmx and svm for a subsequent change that ensures the MSR permission
bitmap is set to allow an MSR that userspace is tracking to force a vmx_vmexit
in the guest.
Signed-off-by: Aaron Lewis <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
[agraf: rebase, adapt SVM scheme to nested changes that came in between]
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
MSRs are weird. Some of them are normal control registers, such as EFER.
Some however are registers that really are model specific, not very
interesting to virtualization workloads, and not performance critical.
Others again are really just windows into package configuration.
Out of these MSRs, only the first category is necessary to implement in
kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
certain CPU models and MSRs that contain information on the package level
are much better suited for user space to process. However, over time we have
accumulated a lot of MSRs that are not the first category, but still handled
by in-kernel KVM code.
This patch adds a generic interface to handle WRMSR and RDMSR from user
space. With this, any future MSR that is part of the latter categories can
be handled in user space.
Furthermore, it allows us to replace the existing "ignore_msrs" logic with
something that applies per-VM rather than on the full system. That way you
can run productive VMs in parallel to experimental ones where you don't care
about proper MSR handling.
Signed-off-by: Alexander Graf <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
When we find an MSR that we can not handle, bubble up that error code as
MSR error return code. Follow up patches will use that to expose the fact
that an MSR is not handled by KVM to user space.
Suggested-by: Aaron Lewis <[email protected]>
Signed-off-by: Alexander Graf <[email protected]>
Message-Id: <[email protected]>
Reviewed-by: Jim Mattson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename "index" to "slot" in struct vmx_uret_msr to align with the
terminology used by common x86's kvm_user_return_msrs, and to avoid
conflating "MSR's ECX index" with "MSR's index into an array".
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename "vmx_msr_index" to "vmx_uret_msrs_list" to associate it with the
uret MSRs array, and to avoid conflating "MSR's ECX index" with "MSR's
index into an array". Similarly, don't use "slot" in the name as that
terminology is claimed by the common x86 "user_return_msrs" mechanism.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add "uret" to vmx_set_guest_msr() to explicitly associate it with the
guest_uret_msrs array, and to differentiate it from vmx_set_msr() as
well as VMX's load/store MSRs.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename "find_msr_entry" to scope it to VMX and to associate it with
guest_uret_msrs. Drop the "entry" so that the function name pairs with
the existing __vmx_find_uret_msr(), which intentionally uses a double
underscore prefix instead of appending "index" or "slot" as those names
are already claimed by other pieces of the user return MSR stack.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add vmx_setup_uret_msr() to wrap the lookup and manipulation of the uret
MSRs array during setup_msrs(). In addition to consolidating code, this
eliminates move_msr_up(), which while being a very literally description
of the function, isn't exacly helpful in understanding the net effect of
the code.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move checking for the existence of MSR_EFER in the uret MSR array into
update_transition_efer() so that the lookup and manipulation of the
array in setup_msrs() occur back-to-back. This paves the way toward
adding a helper to wrap the lookup and manipulation.
To avoid unnecessary overhead, defer the lookup until the uret array
would actually be modified in update_transition_efer(). EFER obviously
exists on CPUs that support the dedicated VMCS fields for switching
EFER, and EFER must exist for the guest and host EFER.NX value to
diverge, i.e. there is no danger of attempting to read/write EFER when
it doesn't exist.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Check for RDTSCP support prior to checking if MSR_TSC_AUX is in the uret
MSRs array so that the array lookup and manipulation are back-to-back.
This paves the way toward adding a helper to wrap the lookup and
manipulation.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename "__find_msr_index" to scope it to VMX, associate it with
guest_uret_msrs, and to avoid conflating "MSR's ECX index" with "MSR's
array index". Similarly, don't use "slot" in the name so as to avoid
colliding the common x86's half of "user_return_msrs" (the slot in
kvm_user_return_msrs is not the same slot in guest_uret_msrs).
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add "uret" to "guest_msrs_ready" to explicitly associate it with the
"guest_uret_msrs" array, and replace "ready" with "loaded" to more
precisely reflect what it tracks, e.g. "ready" could be interpreted as
meaning ready for processing (setup_msrs() has run), which is wrong.
"loaded" also aligns with the similar "guest_state_loaded" field.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add "uret" into the name of "save_nmsrs" to explicitly associate it with
the guest_uret_msrs array, and replace "save" with "active" (for lack of
a better word) to better describe what is being tracked. While "save"
is more or less accurate when viewed as a literal description of the
field, e.g. it holds the number of MSRs that were saved into the array
the last time setup_msrs() was invoked, it can easily be misinterpreted
by the reader, e.g. as meaning the number of MSRs that were saved from
hardware at some point in the past, or as the number of MSRs that need
to be saved at some point in the future, both of which are wrong.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename vcpu_vmx.nsmrs to vcpu_vmx.nr_uret_msrs to explicitly associate
it with the guest_uret_msrs array.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename struct "shared_msr_entry" to "vmx_uret_msr" to align with x86's
rename of "shared_msrs" to "user_return_msrs", and to call out that the
struct is specific to VMX, i.e. not part of the generic "shared_msrs"
framework. Abbreviate "user_return" as "uret" to keep line lengths
marginally sane and code more or less readable.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add "loadstore" to vmx_find_msr_index() to differentiate it from the so
called shared MSRs helpers (which will soon be renamed), and replace
"index" with "slot" to better convey that the helper returns slot in the
array, not the MSR index (the value that gets stuffed into ECX).
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add "MAX" to the LOADSTORE and so called SHARED MSR defines to make it
more clear that the define controls the array size, as opposed to the
actual number of valid entries that are in the array.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Rename the "shared_msrs" mechanism, which is used to defer restoring
MSRs that are only consumed when running in userspace, to a more banal
but less likely to be confusing "user_return_msrs".
The "shared" nomenclature is confusing as it's not obvious who is
sharing what, e.g. reasonable interpretations are that the guest value
is shared by vCPUs in a VM, or that the MSR value is shared/common to
guest and host, both of which are wrong.
"shared" is also misleading as the MSR value (in hardware) is not
guaranteed to be shared/reused between VMs (if that's indeed the correct
interpretation of the name), as the ability to share values between VMs
is simply a side effect (albiet a very nice side effect) of deferring
restoration of the host value until returning from userspace.
"user_return" avoids the above confusion by describing the mechanism
itself instead of its effects.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move initialization of 'struct kvm_mmu' fields into alloc_mmu_pages() to
consolidate code, and rename the helper to __kvm_mmu_create().
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Read vmcs.EXIT_QUALIFICATION and vmcs.VM_EXIT_INTR_INFO only if the
VM-Exit is being reflected to L1 now that they are no longer passed
directly to the kvm_nested_vmexit tracepoint.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use the newly introduced TRACE_EVENT_KVM_EXIT to define the guts of
kvm_nested_vmexit so that it captures and prints the same information as
kvm_exit. This has the bonus side effect of fixing the interrupt info
and error code printing for the case where they're invalid, e.g. if the
exit was a failed VM-Entry. This also sets the stage for retrieving
EXIT_QUALIFICATION and VM_EXIT_INTR_INFO in nested_vmx_reflect_vmexit()
if and only if the VM-Exit is being routed to L1.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Macrofy the definition of kvm_exit so that the definition can be reused
verbatim by kvm_nested_vmexit.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it add IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add a helper, is_exception_with_error_code(), to provide the simple but
difficult to read code of checking for a valid exception with an error
code given a vmcs.VM_EXIT_INTR_INFO value. The helper will gain another
user, vmx_get_exit_info(), in a future patch.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Use kvm_rip_read() to read the guest's RIP for the nested VM-Exit
tracepoint instead of having the caller pass in an argument. Params
that are passed into a tracepoint are evaluated even if the tracepoint
is disabled, i.e. passing in RIP for VMX incurs a VMREAD and retpoline
to retrieve a value that may never be used, e.g. if the exit is due to a
hardware interrupt.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
tracepoint won't be hit.
Read RIP from within the tracepoint itself to avoid a potential VMREAD
and retpoline if the guest's RIP isn't available.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
WARN if KVM attempts to switch to the currently loaded VMCS. Now that
nested_vmx_free_vcpu() doesn't blindly call vmx_switch_vmcs(), all paths
that lead to vmx_switch_vmcs() are implicitly guarded by guest vs. host
mode, e.g. KVM should never emulate VMX instructions when guest mode is
active, and nested_vmx_vmexit() should never be called when host mode is
active.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Remove the explicit switch to vmcs01 and the call to free_nested() in
nested_vmx_free_vcpu(). free_nested(), which is called unconditionally
by vmx_leave_nested(), ensures vmcs01 is loaded prior to freeing vmcs02
and friends.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Add a WARN in free_nested() to ensure vmcs01 is loaded prior to freeing
vmcs02 and friends, and explicitly switch to vmcs01 if it's not. KVM is
supposed to keep is_guest_mode() and loaded_vmcs==vmcs02 synchronized,
but bugs happen and freeing vmcs02 while it's in use will escalate a KVM
error to a use-after-free and potentially crash the kernel.
Do the WARN and switch even in the !vmxon case to help detect latent
bugs. free_nested() is not a hot path, and the check is cheap.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Move free_nested() down below vmx_switch_vmcs() so that a future patch
can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not
the loaded VMCS when freeing nested resources.
No functional change intended.
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Call guest_state_valid() directly instead of querying emulation_required
when checking if L1 is attempting VM-Enter with invalid guest state.
If emulate_invalid_guest_state is false, KVM will fixup segment regs to
avoid emulation and will never set emulation_required, i.e. KVM will
incorrectly miss the associated consistency checks because the nested
path stuffs segments directly into vmcs02.
Opportunsitically add Consistency Check tracing to make future debug
suck a little less.
Fixes: 2bb8cafea80bf ("KVM: vVMX: signal failure for nested VMEntry if emulation_required")
Fixes: 3184a995f782c ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM
expects vmcs01 to be loaded when is_guest_mode() is false.
Fixes: 671ddc700fd08 ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
Cc: [email protected]
Cc: Dan Cross <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Peter Shier <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|
|
Explicitly reset the segment cache after stuffing guest segment regs in
prepare_vmcs02_rare(). Although the cache is reset when switching to
vmcs02, there is nothing that prevents KVM from re-populating the cache
prior to writing vmcs02 with vmcs12's values. E.g. if the vCPU is
preempted after switching to vmcs02 but before prepare_vmcs02_rare(),
kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl()
and cache the stale vmcs02 value. While the current code base only
caches stale data in the preemption case, it's theoretically possible
future code could read a segment register during the nested flow itself,
i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(),
although it did introduce the bug.
This manifests as an unexpected nested VM-Enter failure when running
with unrestricted guest disabled if the above preemption case coincides
with L1 switching L2's CPL, e.g. when switching from a L2 vCPU at CPL3
to to a L2 vCPU at CPL0. stack_segment_valid() will see the new SS_SEL
but the old SS_AR_BYTES and incorrectly mark the guest state as invalid
due to SS.dpl != SS.rpl.
Don't bother updating the cache even though prepare_vmcs02_rare() writes
every segment. With unrestricted guest, guest segments are almost never
read, let alone L2 guest segments. On the other hand, populating the
cache requires a large number of memory writes, i.e. it's unlikely to be
a net win. Updating the cache would be a win when unrestricted guest is
not supported, as guest_state_valid() will immediately cache all segment
registers. But, nested virtualization without unrestricted guest is
dirt slow, saving some VMREADs won't change that, and every CPU
manufactured in the last decade supports unrestricted guest. In other
words, the extra (minor) complexity isn't worth the trouble.
Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL
depending on when preemption occurs. This is "ok" in that the usage is
imperfect by nature, i.e. it's used heuristically to improve performance
but doesn't affect functionality. kvm_arch_vcpu_put() could be "fixed"
by also disabling preemption while loading segments, but that's
pointless and misleading as reading state from kvm_sched_{in,out}() is
guaranteed to see stale data in one form or another. E.g. even if all
the usage of regs_avail is fixed to call kvm_register_mark_available()
after the associated state is set, the individual state might still be
stale with respect to the overall vCPU state. I.e. making functional
decisions in an asynchronous hook is doomed from the get go. Thankfully
KVM doesn't do that.
Fixes: de63ad4cf4973 ("KVM: X86: implement the logic for spinlock optimization")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
|