Now that KVM has a framework for caching guest CPUID feature flags, add
a "rule" that IRQs must be enabled when doing guest CPUID lookups, and
enforce the rule via a lockdep assertion. CPUID lookups are slow, and
within KVM, IRQs are only ever disabled in hot paths, e.g. the core run
loop, fast page fault handling, etc. I.e. querying guest CPUID with IRQs
disabled, especially in the run loop, should be avoided.
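E.g. enforcement amounts to a lockdep assertion at the top of the lookup
path; a minimal sketch (the wrapper name here is illustrative, not the
exact diff):

/* Sketch: guest CPUID lookups are slow, so insist that IRQs are on. */
static struct kvm_cpuid_entry2 *lookup_guest_cpuid(struct kvm_vcpu *vcpu,
                                                   u32 function, u32 index)
{
        lockdep_assert_irqs_enabled();

        return kvm_find_cpuid_entry_index(vcpu, function, index);
}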
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "virtual NMI exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vnmi" param means that
the code isn't strictly equivalent, as vnmi_enabled could have been set
if nested=false, whereas the governed feature cannot be. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "virtual GIF exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vgif" param means that
the code isn't strictly equivalent, as vgif_enabled could have been set
if nested=false, whereas the governed feature cannot be. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "Pause Filtering is exposed to L1" via governed feature flags
instead of using dedicated bits/flags in vcpu_svm.
No functional change intended.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "LBR virtualization exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "lbrv" param means that
the code isn't strictly equivalent, as lbrv_enabled could have been set
if nested=false, whereas the governed feature cannot be. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Opportunistically add a comment explaining why KVM disallows virtual
VMLOAD/VMSAVE when the vCPU model is Intel.
No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "TSC scaling exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, this fixes a benign bug where KVM would mark TSC scaling as exposed
to L1 even if overall nested SVM support is disabled, i.e. KVM would let
L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR
support to userspace.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "NRIPS exposed to L1" via a governed feature flag instead of using
a dedicated bit/flag in vcpu_svm.
No functional change intended.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Track "VMX exposed to L1" via a governed feature flag instead of using a
dedicated helper to provide the same functionality. The main goal is to
drive convergence between VMX and SVM with respect to querying features
that are controllable via module param (SVM likes to cache nested
features), avoiding the guest CPUID lookups at runtime is just a bonus
and unlikely to provide any meaningful performance benefits.
Note, X86_FEATURE_VMX is set in kvm_cpu_caps if and only if "nested" is
true, and the CPU obviously supports VMX if KVM+VMX is running. I.e. the
check on "nested" is now implicitly down by the kvm_cpu_cap_has() check
in kvm_governed_feature_check_and_set().
No functional change intended.
Reviewed-by: Yuan Yao <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use the governed feature framework to track if XSAVES is "enabled", i.e.
if XSAVES can be used by the guest. Add a comment in the SVM code to
explain the very unintuitive logic of deliberately NOT checking if XSAVES
is enumerated in the guest CPUID model.
No functional change intended.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Rename the XSAVES secondary execution control to follow KVM's preferred
style so that XSAVES related logic can use common macros that depend on
KVM's preferred style.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Check KVM CPU capabilities instead of raw VMX support for XSAVES when
determining whether or not XSAVES can/should be exposed to the guest.
Practically speaking, it's nonsensical/impossible for a CPU to support
"enable XSAVES" without XSAVES being supported natively. The real
motivation for checking kvm_cpu_cap_has() is to allow using the governed
feature's standard check-and-set logic.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Recompute whether or not XSAVES is enabled for the guest only if the
guest's CPUID model changes instead of redoing the computation every time
KVM generates vmcs01's secondary execution controls. The boot_cpu_has()
and cpu_has_vmx_xsaves() checks should never change after KVM is loaded,
and if they do the kernel/KVM is hosed.
Opportunistically add a comment explaining _why_ XSAVES is effectively
exposed to the guest if and only if XSAVE is also exposed to the guest.
Practically speaking, no functional change intended (KVM will do fewer
computations, but should still see the same xsaves_enabled value whenever
KVM looks at it).
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use the governed feature framework to track whether or not the guest can
use 1GiB pages, and drop the one-off helper that wraps the surprisingly
non-trivial logic surrounding 1GiB page usage in the guest.
No functional change intended.
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Introduce yet another X86_FEATURE flag framework to manage and cache KVM
governed features (for lack of a better name). "Governed" in this case
means that KVM has some level of involvement and/or vested interest in
whether or not an X86_FEATURE can be used by the guest. The intent of the
framework is twofold: to simplify caching of guest CPUID flags that KVM
needs to frequently query, and to add clarity to such caching, e.g. it
isn't immediately obvious that SVM's bundle of flags for "optional nested
SVM features" track whether or not a flag is exposed to L1.
Begrudgingly define KVM_MAX_NR_GOVERNED_FEATURES for the size of the
bitmap to avoid exposing governed_features.h in arch/x86/include/asm/, but
add a FIXME to call out that it can and should be cleaned up once
"struct kvm_vcpu_arch" is no longer expose to the kernel at large.
Cc: Zeng Guang <[email protected]>
Reviewed-by: Binbin Wu <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Yuan Yao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
bitmap and khz are assigned before being read, so their initializers are
unnecessary; drop them.
Signed-off-by: Li zeming <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Drop the WARN in KVM_RUN that asserts that KVM isn't using the hypervisor
timer, a.k.a. the VMX preemption timer, for a vCPU that is in the
UNINITIALIZED activity state. The intent of the WARN is to sanity check
that KVM won't drop a timer interrupt due to an unexpected transition to
UNINITIALIZED, but unfortunately userspace can use various ioctl()s to
force the unexpected state.
Drop the sanity check instead of switching from the hypervisor timer to a
software based timer, as the only reason to switch to a software timer
when a vCPU is blocking is to ensure the timer interrupt wakes the vCPU,
but said interrupt isn't a valid wake event for vCPUs in UNINITIALIZED
state *and* the interrupt will be dropped in the end.
Reported-by: Yikebaer Aizezi <[email protected]>
Closes: https://lore.kernel.org/all/CALcu4rbFrU4go8sBHk3FreP+qjgtZCGcYNpSiEXOLm==qFv7iQ@mail.gmail.com
Reviewed-by: Paolo Bonzini <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Fix compiler warnings when compiling KVM with [-Wunreachable-code-break].
No functional change intended.
Cc: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Skip writes to MSR_AMD64_TSC_RATIO that are done in the context of a vCPU
if guest state isn't loaded, i.e. if KVM will update MSR_AMD64_TSC_RATIO
during svm_prepare_switch_to_guest() before entering the guest. Checking
guest_state_loaded may or may not be a net positive for performance as
the current_tsc_ratio cache will optimize away duplicate WRMSRs in the
vast majority of scenarios. However, the cost of the check is negligible,
and the real motivation is to document that KVM needs to load the vCPU's
value only when running the vCPU.
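For reference, the end state of the helper (together with the related
entries below) looks roughly like this sketch, not the verbatim diff:

/* Sketch: per-CPU cache elides duplicate WRMSRs to MSR_AMD64_TSC_RATIO. */
static void __svm_write_tsc_multiplier(u64 multiplier)
{
        if (multiplier == __this_cpu_read(current_tsc_ratio))
                return;

        wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
        __this_cpu_write(current_tsc_ratio, multiplier);
}

static void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu)
{
        preempt_disable();
        /* If guest state isn't loaded, switch-to-guest will do the WRMSR. */
        if (to_svm(vcpu)->guest_state_loaded)
                __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
        preempt_enable();
}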
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Drop the @offset and @multiplier params from the kvm_x86_ops hooks for
propagating TSC offsets/multipliers into hardware, and instead have the
vendor implementations pull the information directly from the vCPU
structure. The respective vCPU fields _must_ be written at the same
time in order to maintain consistent state, i.e. it's not random luck
that the value passed in by all callers is grabbed from the vCPU.
Explicitly grabbing the value from the vCPU field in SVM's implementation
in particular will allow for additional cleanup without introducing even
more subtle dependencies. Specifically, SVM can skip the WRMSR if guest
state isn't loaded, i.e. svm_prepare_switch_to_guest() will load the
correct value for the vCPU prior to entering the guest.
This also reconciles KVM's handling of related values that are stored in
the vCPU, as svm_write_tsc_offset() already assumes/requires the caller
to have updated l1_tsc_offset.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Explicitly disable preemption when writing MSR_AMD64_TSC_RATIO only in the
"outer" helper, as all direct callers of the "inner" helper now run with
preemption already disabled. And that isn't a coincidence, as the outer
helper requires a vCPU and is intended to be used when modifying guest
state and/or emulating guest instructions, which are typically done with
preemption enabled.
Direct use of the inner helper should be extremely limited, as the only
time KVM should modify MSR_AMD64_TSC_RATIO without a vCPU is when
sanitizing the MSR for a specific pCPU (currently done when {en,dis}abling
SVM). The other direct caller is svm_prepare_switch_to_guest(),
which does have a vCPU, but is a one-off special case: KVM is about to
enter the guest on a specific pCPU and thus must have preemption disabled.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
When emulating nested SVM transitions, use the outer helper for writing
the TSC multiplier for L2. Using the inner helper only for one-off cases,
i.e. for paths where KVM is NOT emulating or modifying vCPU state, will
allow for multiple cleanups:
- Explicitly disabling preemption only in the outer helper
- Getting the multiplier from the vCPU field in the outer helper
- Skipping the WRMSR in the outer helper if guest state isn't loaded
Opportunistically delete an extra newline.
No functional change intended.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
When emulating nested VM-Exit, load L1's TSC multiplier if L1's desired
ratio doesn't match the current ratio, not if the ratio L1 is using for
L2 diverges from the default. Functionally, the end result is the same
as KVM will run L2 with L1's multiplier if L2's multiplier is the default,
i.e. checking that L1's multiplier is loaded is equivalent to checking if
L2 has a non-default multiplier.
However, the assertion that TSC scaling is exposed to L1 is flawed, as
userspace can trigger the WARN at will by writing the MSR and then
updating guest CPUID to hide the feature (modifying guest CPUID is
allowed anytime before KVM_RUN). E.g. hacking KVM's state_test
selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 10 PID: 206939 at arch/x86/kvm/svm/nested.c:1105
nested_svm_vmexit+0x6af/0x720 [kvm_amd]
Call Trace:
nested_svm_exit_handled+0x102/0x1f0 [kvm_amd]
svm_handle_exit+0xb9/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
? trace_hardirqs_off+0x4d/0xa0
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Unlike the nested VMRUN path, hoisting the svm->tsc_scaling_enabled check
into the if-statement is wrong as KVM needs to ensure L1's multiplier is
loaded in the above scenario. Alternatively, the WARN_ON() could simply
be deleted, but that would make KVM's behavior even more subtle, e.g. it's
not immediately obvious why it's safe to write MSR_AMD64_TSC_RATIO when
checking only tsc_ratio_msr.
Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Check for nested TSC scaling support on nested SVM VMRUN instead of
asserting that TSC scaling is exposed to L1 if L1's MSR_AMD64_TSC_RATIO
has diverged from KVM's default. Userspace can trigger the WARN at will
by writing the MSR and then updating guest CPUID to hide the feature
(modifying guest CPUID is allowed anytime before KVM_RUN). E.g. hacking
KVM's state_test selftest to do
vcpu_set_msr(vcpu, MSR_AMD64_TSC_RATIO, 0);
vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_TSCRATEMSR);
after restoring state in a new VM+vCPU yields an endless supply of:
------------[ cut here ]------------
WARNING: CPU: 164 PID: 62565 at arch/x86/kvm/svm/nested.c:699
nested_vmcb02_prepare_control+0x3d6/0x3f0 [kvm_amd]
Call Trace:
<TASK>
enter_svm_guest_mode+0x114/0x560 [kvm_amd]
nested_svm_vmrun+0x260/0x330 [kvm_amd]
vmrun_interception+0x29/0x30 [kvm_amd]
svm_invoke_exit_handler+0x35/0x100 [kvm_amd]
svm_handle_exit+0xe7/0x180 [kvm_amd]
kvm_arch_vcpu_ioctl_run+0x1eab/0x2570 [kvm]
kvm_vcpu_ioctl+0x4c9/0x5b0 [kvm]
__se_sys_ioctl+0x7a/0xc0
__x64_sys_ioctl+0x21/0x30
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x45ca1b
Note, the nested #VMEXIT path has the same flaw, but needs a different
fix and will be handled separately.
Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling")
Cc: Maxim Levitsky <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
The latest Intel platform, GraniteRapids-D, introduces AMX-COMPLEX, which adds
two instructions to perform matrix multiplication of two tiles containing
complex elements and accumulate the results into a packed single precision
tile.
AMX-COMPLEX is enumerated via CPUID.(EAX=7,ECX=1):EDX[bit 8].
Advertise AMX_COMPLEX if it's supported in hardware. There are no VMX
controls for the feature, i.e. the instructions can't be intercepted, and
KVM advertises base AMX in CPUID if AMX is supported in hardware, even if
KVM doesn't advertise AMX as being supported in XCR0, e.g. because the
process didn't opt-in to allocating tile data.
Signed-off-by: Tao Su <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[sean: tweak last paragraph of changelog]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Bail from vmx_emergency_disable() without processing the list of loaded
VMCSes if CR4.VMXE=0, i.e. if the CPU can't be post-VMXON. It should be
impossible for the list to have entries if VMX is already disabled, and
even if that invariant doesn't hold, VMCLEAR will #UD anyways, i.e.
processing the list is pointless even if it somehow isn't empty.
Assuming no existing KVM bugs, this should be a glorified nop. The
primary motivation for the change is to avoid having code that looks like
it does VMCLEAR, but then skips VMXOFF, which is nonsensical.
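In outline, the resulting flow (a sketch; list plumbing as in vmx.c):

static void vmx_emergency_disable(void)
{
        int cpu = raw_smp_processor_id();
        struct loaded_vmcs *v;

        /* If the CPU can't be post-VMXON, there is nothing to VMCLEAR. */
        if (!(__read_cr4() & X86_CR4_VMXE))
                return;

        list_for_each_entry(v, &per_cpu(loaded_vmcss_on_cpu, cpu),
                            loaded_vmcss_on_cpu_link)
                vmcs_clear(v->vmcs);

        kvm_cpu_vmxoff();
}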
Suggested-by: Kai Huang <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Now that kvm_rebooting is guaranteed to be true prior to disabling SVM
in an emergency, use the existing stgi() helper instead of open coding
STGI. In effect, eat faults on STGI if and only if kvm_rebooting==true.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Set kvm_rebooting when virtualization is disabled in an emergency so that
KVM eats faults on virtualization instructions even if kvm_reboot() isn't
reached.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Move cpu_svm_disable() into KVM proper now that all hardware
virtualization management is routed through KVM. Remove the now-empty
virtext.h.
No functional change intended.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Disable migration when probing VMX support during module load to ensure
the CPU is stable, mostly to match similar SVM logic, where allowing
migration effectively requires deliberately writing buggy code. As a bonus,
KVM won't report the wrong CPU to userspace if VMX is unsupported, but in
practice that is a very, very minor bonus as the only way that reporting
the wrong CPU would actually matter is if hardware is broken or if the
system is misconfigured, i.e. if KVM gets migrated from a CPU that _does_
support VMX to a CPU that does _not_ support VMX.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Check "this" CPU instead of the boot CPU when querying SVM support so that
the per-CPU checks done during hardware enabling actually function as
intended, i.e. will detect issues where SVM isn't supported on all CPUs.
Disable migration for the use from svm_init() mostly so that the standard
accessors for the per-CPU data can be used without getting yelled at by
CONFIG_DEBUG_PREEMPT=y sanity checks. Preventing the "disabled by BIOS"
error message from reporting the wrong CPU is largely a bonus, as ensuring
a stable CPU during module load is a non-goal for KVM.
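The usage from svm_init() then looks roughly like the sketch below
(setup details trimmed):

static int __init svm_init(void)
{
        bool supported;

        /*
         * Disable migration so the per-CPU accessors in the support check
         * don't trip CONFIG_DEBUG_PREEMPT, and so any "disabled by BIOS"
         * message names the CPU the check actually ran on.
         */
        migrate_disable();
        supported = kvm_is_svm_supported();
        migrate_enable();

        if (!supported)
                return -EOPNOTSUPP;

        return kvm_x86_vendor_init(&svm_init_ops);
}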
Link: https://lore.kernel.org/all/[email protected]
Cc: Kai Huang <[email protected]>
Cc: Chao Gao <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Fold the guts of cpu_has_svm() into kvm_is_svm_supported(), its sole
remaining user.
No functional change intended.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Drop the explicit check on the extended CPUID level in cpu_has_svm(), as the
kernel's cached CPUID info will leave the entire SVM leaf unset if said
leaf is not supported by hardware. Prior to using cached information,
the check was needed to avoid false positives due to Intel's rather crazy
CPUID behavior of returning the values of the maximum supported leaf if
the specified leaf is unsupported.
Fixes: 682a8108872f ("x86/kvm/svm: Simplify cpu_has_svm()")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Make building KVM SVM support depend on support for AMD or Hygon. KVM
already effectively restricts SVM support to AMD and Hygon by virtue of
the vendor string checks in cpu_has_svm(), and KVM VMX support depends
on one of its three known vendors (Intel, Centaur, or Zhaoxin).
Add the CPU_SUP_HYGON clause even though CPU_SUP_HYGON selects CPU_SUP_AMD
to document that KVM SVM support isn't just for AMD CPUs, and to prevent
breakage should Hygon support ever become a standalone thing.
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Now that VMX is disabled in emergencies via the virt callbacks, move the
VMXOFF helpers into KVM, the only remaining user.
No functional change intended.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Fold the raw CPUID check for VMX into kvm_is_vmx_supported(), its sole
user. Keep the check even though KVM also checks X86_FEATURE_VMX, as the
intent is to provide a unique error message if VMX is unsupported by
hardware, whereas X86_FEATURE_VMX may be clear due to firmware and/or
kernel actions.
No functional change intended.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Expose the crash/reboot hooks used by KVM to disable virtualization in
hardware and unblock INIT only if there's a potential in-tree user,
i.e. either KVM_INTEL or KVM_AMD is enabled.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Attempt to disable virtualization during an emergency reboot if and only
if there is a registered virt callback, i.e. iff a hypervisor (KVM) is
active. If there's no active hypervisor, then the CPU can't be operating
with VMX or SVM enabled (barring an egregious bug).
Checking for a valid callback instead of simply for SVM or VMX support
also eliminates spurious NMIs by avoiding the unnecessary call to
nmi_shootdown_cpus_on_restart().
Note, IRQs are disabled, which prevents KVM from coming along and
enabling virtualization after the fact.
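I.e., in sketch form:

static void emergency_reboot_disable_virtualization(void)
{
        local_irq_disable();

        /*
         * Only shoot down other CPUs if a hypervisor registered a callback,
         * i.e. if virtualization can possibly be enabled.  With IRQs off,
         * KVM can't sneak in and register a callback after the check.
         */
        if (rcu_access_pointer(cpu_emergency_virt_callback)) {
                cpu_emergency_disable_virtualization();
                nmi_shootdown_cpus_on_restart();
        }
}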
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Move the various "disable virtualization" helpers above the emergency
reboot path so that emergency_reboot_disable_virtualization() can be
stubbed out in a future patch if neither KVM_INTEL nor KVM_AMD is enabled,
i.e. if there is no in-tree user of CPU virtualization.
No functional change intended.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Assert that IRQs are disabled when turning off virtualization in an
emergency. KVM enables hardware via on_each_cpu(), i.e. could re-enable
hardware if a pending IPI were delivered after disabling virtualization.
Remove a misleading comment from emergency_reboot_disable_virtualization()
about "just" needing to guarantee the CPU is stable (see above).
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use the virt callback to disable SVM (and set GIF=1) during an emergency
instead of blindly attempting to disable SVM. Like the VMX case, if a
hypervisor, i.e. KVM, isn't loaded/active, SVM can't be in use.
Acked-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use KVM VMX's reboot/crash callback to do VMXOFF in an emergency instead
of manually and blindly doing VMXOFF. There's no need to attempt VMXOFF
if a hypervisor, i.e. KVM, isn't loaded/active, i.e. if the CPU can't
possibly be post-VMXON.
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Provide dedicated helpers to (un)register virt hooks used during an
emergency crash/reboot, and WARN if there is an attempt to overwrite
the registered callback, or an attempt to do an unpaired unregister.
Opportunistically use rcu_assign_pointer() instead of RCU_INIT_POINTER(),
mainly so that the set/unset paths are more symmetrical, but also because
any performance gains from using RCU_INIT_POINTER() are meaningless for
this code.
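A sketch of the paired helpers (export annotations omitted):

void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback)
{
        if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback)))
                return;

        rcu_assign_pointer(cpu_emergency_virt_callback, callback);
}

void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback)
{
        if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback) != callback))
                return;

        rcu_assign_pointer(cpu_emergency_virt_callback, NULL);

        /* Wait for in-flight invocations of the old callback to finish. */
        synchronize_rcu();
}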
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
VMCLEAR active VMCSes before any emergency reboot, not just if the kernel
may kexec into a new kernel after a crash. Per Intel's SDM, the VMX
architecture doesn't require the CPU to flush the VMCS cache on INIT. If
an emergency reboot doesn't RESET CPUs, cached VMCSes could theoretically
be kept and only be written back to memory after the new kernel is booted,
i.e. could effectively corrupt memory after reboot.
Opportunistically remove the setting of the global pointer to NULL to make
checkpatch happy.
Cc: Andrew Cooper <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Retry the optimized APIC map recalculation if an APIC-enabled vCPU shows
up between allocating the map and filling in the map data. Conditionally
reschedule before retrying even though the number of vCPUs that can be
created is bounded by KVM. Retrying a few thousand times isn't so slow
as to be hugely problematic, but it's not blazing fast either.
Reset xapic_id_mismatch on each retry as a vCPU could change its xAPIC ID
between loops, but do NOT reset max_id. The map size also factors in
whether or not a vCPU's local APIC is hardware-enabled, i.e. userspace
and/or the guest can theoretically keep KVM retrying indefinitely. The
only downside is that KVM will allocate more memory than is strictly
necessary if the vCPU with the highest x2APIC ID disabled its APIC while
the recalculation was in-progress.
Refresh kvm->arch.apic_map_dirty to opportunistically change it from
DIRTY => UPDATE_IN_PROGRESS to avoid an unnecessary recalc from a
different task, i.e. if another task is waiting to attempt an update
(which is likely since a retry happens if and only if an update is
required).
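The retry flow in outline; map_size() and fill_map() below are
hypothetical stand-ins for the open-coded sizing and fill logic:

/* Sketch: retry the optimized map build if a vCPU races the allocation. */
static void recalculate_apic_map_sketch(struct kvm *kvm)
{
        struct kvm_apic_map *new;
        bool xapic_id_mismatch;
        u32 max_id = 255;
        int r;

retry:
        xapic_id_mismatch = false;      /* xAPIC IDs may change per attempt */

        new = kvzalloc(map_size(max_id), GFP_KERNEL_ACCOUNT);
        if (!new)
                return;

        r = fill_map(kvm, new, &max_id, &xapic_id_mismatch);
        if (r == -E2BIG) {              /* an APIC-enabled vCPU showed up */
                kvfree(new);
                cond_resched();         /* bounded by KVM, but not free */
                goto retry;             /* max_id is deliberately NOT reset */
        }

        rcu_assign_pointer(kvm->arch.apic_map, new);
}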
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Now that KVM snapshots the host's MSR_IA32_ARCH_CAPABILITIES, drop the
similar snapshot/cache of whether or not KVM is allowed to manipulate
MSR_IA32_MCU_OPT_CTRL.FB_CLEAR_DIS. The motivation for the cache was
presumably to avoid the RDMSR, e.g. boot_cpu_has_bug() is quite cheap, and
modifying the vCPU's MSR_IA32_ARCH_CAPABILITIES is an infrequent operation
and a relatively slow path.
Cc: Pawan Gupta <[email protected]>
Reviewed-by: Pawan Gupta <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Snapshot the host's MSR_IA32_ARCH_CAPABILITIES, if it's supported, instead
of reading the MSR every time KVM wants to query the host state, e.g. when
initializing the default value during vCPU creation. The paths that query
ARCH_CAPABILITIES aren't particularly performance sensitive, but creating
vCPUs is a frequent enough operation that burning 8 bytes is a good
trade-off.
Alternatively, KVM could add a field in kvm_caps and thus skip the
on-demand calculations entirely, but a pure snapshot isn't possible due to
the way KVM handles the l1tf_vmx_mitigation module param. And unlike the
other "supported" fields in kvm_caps, KVM doesn't enforce the "supported"
value, i.e. KVM treats ARCH_CAPABILITIES like a CPUID leaf and lets
userspace advertise whatever it wants. Those problems are solvable, but
it's not clear there is real benefit versus snapshotting the host value,
and grabbing the host value will allow additional cleanup of KVM's
FB_CLEAR_CTRL code.
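The snapshot itself is a one-time RDMSR at setup; a sketch (the wrapper
function is illustrative, the real read is inline in vendor init):

u64 __read_mostly host_arch_capabilities;

static void kvm_snapshot_arch_capabilities(void)
{
        /* Read once at module init; the host value never changes. */
        if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
                rdmsrl(MSR_IA32_ARCH_CAPABILITIES, host_arch_capabilities);
}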
Link: https://lore.kernel.org/all/[email protected]
Cc: Chao Gao <[email protected]>
Cc: Xiaoyao Li <[email protected]>
Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Advertise CPUID 0x80000005 (L1 cache and TLB info) to userspace so that
VMMs that reflect KVM_GET_SUPPORTED_CPUID into KVM_SET_CPUID2 will
enumerate sane cache/TLB information to the guest.
CPUID 0x80000006 (L2 cache and TLB and L3 cache info) has been returned
since commit 43d05de2bee7 ("KVM: pass through CPUID(0x80000006)").
Enumerating both 0x80000005 and 0x80000006 with KVM_GET_SUPPORTED_CPUID
is better than reporting one without the other, and 0x80000005 is useful
for VMMs to pass to KVM_SET_CPUID{,2} for the same reasons as 0x80000006.
Signed-off-by: Takahiro Itazuri <[email protected]>
Link: https://lore.kernel.org/all/ZK7NmfKI9xur%[email protected]
Link: https://lore.kernel.org/r/[email protected]
[sean: add link, massage changelog]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Remove x86_emulate_ops::guest_has_long_mode along with its implementation,
emulator_guest_has_long_mode(). It has been unused since commit
1d0da94cdafe ("KVM: x86: do not go through ctxt->ops when emulating rsm").
No functional change intended.
Signed-off-by: Michal Luczaj <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|
|
Use sysfs_emit() instead of sprintf() for sysfs entries. sysfs_emit()
knows the size of the temporary buffer used for outputting sysfs content
and avoids overrunning it.
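The conversion pattern, sketched on a made-up attribute (example_show and
example_value are illustrative, not from the actual diff):

static unsigned int example_value;

static ssize_t example_show(struct kobject *kobj,
                            struct kobj_attribute *attr, char *buf)
{
        /* sysfs_emit() bounds the write to the PAGE_SIZE sysfs buffer. */
        return sysfs_emit(buf, "%u\n", example_value);
}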
Signed-off-by: Like Xu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
|