aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2022-06-15KVM: x86/mmu: Drop unused CMPXCHG macro from paging_tmpl.hSean Christopherson1-6/+0
Drop the CMPXCHG macro from paging_tmpl.h, it's no longer used now that KVM uses a common uaccess helper to do 8-byte CMPXCHG. Fixes: f122dfe44768 ("KVM: x86: Use __try_cmpxchg_user() to update guest PTE A/D bits") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: X86/SVM: Use root_level in svm_load_mmu_pgd()Lai Jiangshan1-1/+1
Use root_level in svm_load_mmu_pg() rather that looking up the root level in vcpu->arch.mmu->root_role.level. svm_load_mmu_pgd() has only one caller, kvm_mmu_load_pgd(), which always passes vcpu->arch.mmu->root_role.level as root_level. Signed-off-by: Lai Jiangshan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: X86/MMU: Remove useless mmu_topup_memory_caches() in kvm_mmu_pte_write()Lai Jiangshan1-7/+0
Since the commit c5e2184d1544("KVM: x86/mmu: Remove the defunct update_pte() paging hook"), kvm_mmu_pte_write() no longer uses the rmap cache. So remove mmu_topup_memory_caches() in it. Cc: Sean Christopherson <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: X86/MMU: Remove unused PT32_DIR_BASE_ADDR_MASK from mmu.cLai Jiangshan1-2/+0
It is unused. Signed-off-by: Lai Jiangshan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: SVM: Hide SEV migration lockdep goo behind CONFIG_PROVE_LOCKINGSean Christopherson1-10/+7
Wrap the manipulation of @role and the manual mutex_{release,acquire}() invocations in CONFIG_PROVE_LOCKING=y to squash a clang-15 warning. When building with -Wunused-but-set-parameter and CONFIG_DEBUG_LOCK_ALLOC=n, clang-15 seees there's no usage of @role in mutex_lock_killable_nested() and yells. PROVE_LOCKING selects DEBUG_LOCK_ALLOC, and the only reason KVM manipulates @role is to make PROVE_LOCKING happy. To avoid true ugliness, use "i" and "j" to detect the first pass in the loops; the "idx" field that's used by kvm_for_each_vcpu() is guaranteed to be '0' on the first pass as it's simply the first entry in the vCPUs XArray, which is fully KVM controlled. kvm_for_each_vcpu() passes '0' for xa_for_each_range()'s "start", and xa_for_each_range() will not enter the loop if there's no entry at '0'. Fixes: 0c2c7c069285 ("KVM: SEV: Mark nested locking of vcpu->lock") Reported-by: kernel test robot <[email protected]> Cc: Peter Gonda <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: SEV: fix misplaced closing parenthesisPaolo Bonzini1-2/+2
This caused a warning on 32-bit systems, but undoubtedly would have acted funny on 64-bit as well. The fix was applied directly on merge in 5.19, see commit 24625f7d91fb ("Merge tag for-linus of git://git.kernel.org/pub/scm/virt/kvm/kvm"). Fixes: 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds8-125/+132
Pull kvm fixes from Paolo Bonzini: "While last week's pull request contained miscellaneous fixes for x86, this one covers other architectures, selftests changes, and a bigger series for APIC virtualization bugs that were discovered during 5.20 development. The idea is to base 5.20 development for KVM on top of this tag. ARM64: - Properly reset the SVE/SME flags on vcpu load - Fix a vgic-v2 regression regarding accessing the pending state of a HW interrupt from userspace (and make the code common with vgic-v3) - Fix access to the idreg range for protected guests - Ignore 'kvm-arm.mode=protected' when using VHE - Return an error from kvm_arch_init_vm() on allocation failure - A bunch of small cleanups (comments, annotations, indentation) RISC-V: - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry x86-64: - Fix error in page tables with MKTME enabled - Dirty page tracking performance test extended to running a nested guest - Disable APICv/AVIC in cases that it cannot implement correctly" [ This merge also fixes a misplaced end parenthesis bug introduced in commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") pointed out by Sean Christopherson ] Link: https://lore.kernel.org/all/[email protected]/ * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits) KVM: selftests: Restrict test region to 48-bit physical addresses when using nested KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 KVM: selftests: Clean up LIBKVM files in Makefile KVM: selftests: Link selftests directly with lib object files KVM: selftests: Drop unnecessary rule for STATIC_LIBS KVM: selftests: Add a helper to check EPT/VPID capabilities KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h KVM: selftests: Refactor nested_map() to specify target level KVM: selftests: Drop stale function parameter comment for nested_map() KVM: selftests: Add option to create 2M and 1G EPT mappings KVM: selftests: Replace x86_page_size with PG_LEVEL_XX KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking KVM: x86: disable preemption while updating apicv inhibition KVM: x86: SVM: fix avic_kick_target_vcpus_fast KVM: x86: SVM: remove avic's broken code that updated APIC ID KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base KVM: x86: document AVIC/APICv inhibit reasons KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs ...
2022-06-14Merge tag 'x86-bugs-2022-06-01' of ↵Linus Torvalds3-0/+77
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MMIO stale data fixes from Thomas Gleixner: "Yet another hw vulnerability with a software mitigation: Processor MMIO Stale Data. They are a class of MMIO-related weaknesses which can expose stale data by propagating it into core fill buffers. Data which can then be leaked using the usual speculative execution methods. Mitigations include this set along with microcode updates and are similar to MDS and TAA vulnerabilities: VERW now clears those buffers too" * tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mmio: Print SMT warning KVM: x86/speculation: Disable Fill buffer clear within guests x86/speculation/mmio: Reuse SRBDS mitigation for SBDS x86/speculation/srbds: Update SRBDS mitigation selection x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data x86/speculation/mmio: Enable CPU Fill buffer clearing on idle x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data x86/speculation: Add a common function for MD_CLEAR mitigation update x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Documentation: Add documentation for Processor MMIO Stale Data
2022-06-10KVM: x86: Bug the VM on an out-of-bounds data readSean Christopherson1-1/+2
Bug the VM and terminate emulation if an out-of-bounds read into the emulator's data cache occurs. Knowingly contuining on all but guarantees that KVM will overwrite random kernel data, which is far, far worse than killing the VM. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Bug the VM if the emulator generates a bogus exception vectorSean Christopherson1-2/+5
Bug the VM if KVM's emulator attempts to inject a bogus exception vector. The guest is likely doomed even if KVM continues on, and propagating a bad vector to the rest of KVM runs the risk of breaking other assumptions in KVM and thus triggering a more egregious bug. All existing users of emulate_exception() have hardcoded vector numbers (__load_segment_descriptor() uses a few different vectors, but they're all hardcoded), and future users are likely to follow suit, i.e. the change to emulate_exception() is a glorified nop. As for the ctxt->exception.vector check in x86_emulate_insn(), the few known times the WARN has been triggered in the past is when the field was not set when synthesizing a fault, i.e. for all intents and purposes the check protects against consumption of uninitialized data. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Bug the VM if the emulator accesses a non-existent GPRSean Christopherson3-2/+21
Bug the VM, i.e. kill it, if the emulator accesses a non-existent GPR, i.e. generates an out-of-bounds GPR index. Continuing on all but gaurantees some form of data corruption in the guest, e.g. even if KVM were to redirect to a dummy register, KVM would be incorrectly read zeros and drop writes. Note, bugging the VM doesn't completely prevent data corruption, e.g. the current round of emulation will complete before the vCPU bails out to userspace. But, the very act of killing the guest can also cause data corruption, e.g. due to lack of file writeback before termination, so taking on additional complexity to cleanly bail out of the emulator isn't justified, the goal is purely to stem the bleeding and alert userspace that something has gone horribly wrong, i.e. to avoid _silent_ data corruption. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Reduce the number of emulator GPRs to '8' for 32-bit KVMSean Christopherson2-5/+6
Reduce the number of GPRs emulated by 32-bit KVM from 16 to 8. KVM does not support emulating 64-bit mode on 32-bit host kernels, and so should never generate accesses to R8-15. Opportunistically use NR_EMULATOR_GPRS in rsm_load_state_{32,64}() now that it is precise and accurate for both flavors. Wrap the definition with full #ifdef ugliness; sadly, IS_ENABLED() doesn't guarantee a compile-time constant as far as BUILD_BUG_ON() is concerned. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kees Cook <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Use 16-bit fields to track dirty/valid emulator GPRsSean Christopherson2-2/+5
Use a u16 instead of a u32 to track the dirty/valid status of GPRs in the emulator. Unlike struct kvm_vcpu_arch, x86_emulate_ctxt tracks only the "true" GPRs, i.e. doesn't include RIP in its array, and so only needs to track 16 registers. Note, maxing out at 16 GPRs is a fundamental property of x86-64 and will not change barring a massive architecture update. Legacy x86 ModRM and SIB encodings use 3 bits for GPRs, i.e. support 8 registers. x86-64 uses a single bit in the REX prefix for each possible reference type to double the number of supported GPRs to 16 registers (4 bits). Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Omit VCPU_REGS_RIP from emulator's _regs arraySean Christopherson2-6/+17
Omit RIP from the emulator's _regs array, which is used only for GPRs, i.e. registers that can be referenced via ModRM and/or SIB bytes. The emulator uses the dedicated _eip field for RIP, and manually reads from _eip to handle RIP-relative addressing. To avoid an even bigger, slightly more dangerous change, hardcode the number of GPRs to 16 for the time being even though 32-bit KVM's emulator technically should only have 8 GPRs. Add a TODO to address that in a future commit. See also the comments above the read_gpr() and write_gpr() declarations, and obviously the handling in writeback_registers(). No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Kees Cook <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Harden _regs accesses to guard against buggy inputSean Christopherson1-0/+6
WARN and truncate the incoming GPR number/index when reading/writing GPRs in the emulator to guard against KVM bugs, e.g. to avoid out-of-bounds accesses to ctxt->_regs[] if KVM generates a bogus index. Truncate the index instead of returning e.g. zero, as reg_write() returns a pointer to the register, i.e. returning zero would result in a NULL pointer dereference. KVM could also force the index to any arbitrary GPR, but that's no better or worse, just different. Open code the restriction to 16 registers; RIP is handled via _eip and should never be accessed through reg_read() or reg_write(). See the comments above the declarations of reg_read() and reg_write(), and the behavior of writeback_registers(). The horrific open coded mess will be cleaned up in a future commit. There are no such bugs known to exist in the emulator, but determining that KVM is bug-free is not at all simple and requires a deep dive into the emulator. The code is so convoluted that GCC-12 with the recently enable -Warray-bounds spits out a false-positive due to a GCC bug: arch/x86/kvm/emulate.c:254:27: warning: array subscript 32 is above array bounds of 'long unsigned int[17]' [-Warray-bounds] 254 | return ctxt->_regs[nr]; | ~~~~~~~~~~~^~~~ In file included from arch/x86/kvm/emulate.c:23: arch/x86/kvm/kvm_emulate.h: In function 'reg_rmw': arch/x86/kvm/kvm_emulate.h:366:23: note: while referencing '_regs' 366 | unsigned long _regs[NR_VCPU_REGS]; | ^~~~~ Link: https://lore.kernel.org/all/[email protected] Link: https://bugzilla.kernel.org/show_bug.cgi?id=216026 Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105679 Reported-and-tested-by: Robert Dinse <[email protected]> Reported-by: Kees Cook <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-10KVM: x86: Grab regs_dirty in local 'unsigned long'Sean Christopherson1-1/+2
Capture ctxt->regs_dirty in a local 'unsigned long' instead of casting it to an 'unsigned long *' for use in for_each_set_bit(). The bitops helpers really do read the entire 'unsigned long', even though the walking of the read value is capped at the specified size. I.e. 64-bit KVM is reading memory beyond ctxt->regs_dirty, which is a u32 and thus 4 bytes, whereas an unsigned long is 8 bytes. Functionally it's not an issue because regs_dirty is in the middle of x86_emulate_ctxt, i.e. KVM is just reading its own memory, but relying on that coincidence is gross and unsafe. Reviewed-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09Merge branch 'kvm-5.20-early'Paolo Bonzini27-498/+1148
s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to show tests x86: * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Rewrite gfn-pfn cache refresh * Refuse starting the module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit
2022-06-09KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSEPaolo Bonzini2-20/+23
Commit 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") introduced passthrough support for nested pause filtering, (when the host doesn't intercept PAUSE) (either disabled with kvm module param, or disabled with '-overcommit cpu-pm=on') Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards, the feature was exposed as supported by KVM cpuid unconditionally, thus if L1 could try to use it even when the L0 KVM can't really support it. In this case the fallback caused KVM to intercept each PAUSE instruction; in some cases, such intercept can slow down the nested guest so much that it can fail to boot. Instead, before the problematic commit KVM was already setting both thresholds to 0 in vmcb02, but after the first userspace VM exit shrink_ple_window was called and would reset the pause_filter_count to the default value. To fix this, change the fallback strategy - ignore the guest threshold values, but use/update the host threshold values unless the guest specifically requests disabling PAUSE filtering (either simple or advanced). Also fix a minor bug: on nested VM exit, when PAUSE filter counter were copied back to vmcb01, a dirty bit was not set. Thanks a lot to Suravee Suthikulpanit for debugging this! Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Reported-by: Suravee Suthikulpanit <[email protected]> Tested-by: Suravee Suthikulpanit <[email protected]> Co-developed-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/putMaxim Levitsky3-27/+8
Now that these functions are always called with preemption disabled, remove the preempt_disable()/preempt_enable() pair inside them. No functional change intended. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86: disable preemption while updating apicv inhibitionMaxim Levitsky1-0/+2
Currently nothing prevents preemption in kvm_vcpu_update_apicv. On SVM, If the preemption happens after we update the vcpu->arch.apicv_active, the preemption itself will 'update' the inhibition since the AVIC will be first disabled on vCPU unload and then enabled, when the current task is loaded again. Then we will try to update it again, which will lead to a warning in __avic_vcpu_load, that the AVIC is already enabled. Fix this by disabling preemption in this code. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86: SVM: fix avic_kick_target_vcpus_fastMaxim Levitsky1-36/+69
There are two issues in avic_kick_target_vcpus_fast 1. It is legal to issue an IPI request with APIC_DEST_NOSHORT and a physical destination of 0xFF (or 0xFFFFFFFF in case of x2apic), which must be treated as a broadcast destination. Fix this by explicitly checking for it. Also don’t use ‘index’ in this case as it gives no new information. 2. It is legal to issue a logical IPI request to more than one target. Index field only provides index in physical id table of first such target and therefore can't be used before we are sure that only a single target was addressed. Instead, parse the ICRL/ICRH, double check that a unicast interrupt was requested, and use that info to figure out the physical id of the target vCPU. At that point there is no need to use the index field as well. In addition to fixing the above issues, also skip the call to kvm_apic_match_dest. It is possible to do this now, because now as long as AVIC is not inhibited, it is guaranteed that none of the vCPUs changed their apic id from its default value. This fixes boot of windows guest with AVIC enabled because it uses IPI with 0xFF destination and no destination shorthand. Fixes: 7223fd2d5338 ("KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible") Cc: [email protected] Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86: SVM: remove avic's broken code that updated APIC IDMaxim Levitsky1-35/+0
AVIC is now inhibited if the guest changes the apic id, and therefore this code is no longer needed. There are several ways this code was broken, including: 1. a vCPU was only allowed to change its apic id to an apic id of an existing vCPU. 2. After such change, the vCPU whose apic id entry was overwritten, could not correctly change its own apic id, because its own entry is already overwritten. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC baseMaxim Levitsky3-6/+29
Neither of these settings should be changed by the guest and it is a burden to support it in the acceleration code, so just inhibit this code instead. Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRsYuan Yao1-1/+1
Assign shadow_me_value, not shadow_me_mask, to PAE root entries, a.k.a. shadow PDPTRs, when host memory encryption is supported. The "mask" is the set of all possible memory encryption bits, e.g. MKTME KeyIDs, whereas "value" holds the actual value that needs to be stuffed into host page tables. Using shadow_me_mask results in a failed VM-Entry due to setting reserved PA bits in the PDPTRs, and ultimately causes an OOPS due to physical addresses with non-zero MKTME bits sending to_shadow_page() into the weeds: set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state. BUG: unable to handle page fault for address: ffd43f00063049e8 PGD 86dfd8067 P4D 0 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:mmu_free_root_page+0x3c/0x90 [kvm] kvm_mmu_free_roots+0xd1/0x200 [kvm] __kvm_mmu_unload+0x29/0x70 [kvm] kvm_mmu_unload+0x13/0x20 [kvm] kvm_arch_destroy_vm+0x8a/0x190 [kvm] kvm_put_kvm+0x197/0x2d0 [kvm] kvm_vm_release+0x21/0x30 [kvm] __fput+0x8e/0x260 ____fput+0xe/0x10 task_work_run+0x6f/0xb0 do_exit+0x327/0xa90 do_group_exit+0x35/0xa0 get_signal+0x911/0x930 arch_do_signal_or_restart+0x37/0x720 exit_to_user_mode_prepare+0xb2/0x140 syscall_exit_to_user_mode+0x16/0x30 do_syscall_64+0x4e/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e54f1ff244ac ("KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask") Signed-off-by: Yuan Yao <[email protected]> Reviewed-by: Kai Huang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-09Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini9-51/+43
into HEAD KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry
2022-06-08KVM: x86: PIT: Preserve state of speaker port data bitPaul Durrant2-4/+7
Currently the state of the speaker port (0x61) data bit (bit 1) is not saved in the exported state (kvm_pit_state2) and hence is lost when re-constructing guest state. This patch removes the 'speaker_data_port' field from kvm_kpit_state and instead tracks the state using a new KVM_PIT_FLAGS_SPEAKER_DATA_ON flag defined in the API. Signed-off-by: Paul Durrant <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: VMX: Reject kvm_intel if an inconsistent VMCS config is detectedSean Christopherson1-3/+17
Add an on-by-default module param, error_on_inconsistent_vmcs_config, to allow rejecting the load of kvm_intel if an inconsistent VMCS config is detected. Continuing on with an inconsistent, degraded config is undesirable in the vast majority of use cases, e.g. may result in a misconfigured VM, poor performance due to lack of fast MSR switching, or even security issues in the unlikely event the guest is relying on MPX. Practically speaking, an inconsistent VMCS config should never be encountered in a production quality environment, e.g. on bare metal it indicates a silicon defect (or a disturbing lack of validation by the hardware vendor), and in a virtualized machine (KVM as L1) it indicates a buggy/misconfigured L0 VMM/hypervisor. Provide a module param to override the behavior for testing purposes, or in the unlikely scenario that KVM is deployed on a flawed-but-usable CPU or virtual machine. Note, what is or isn't an inconsistency is somewhat subjective, e.g. one might argue that LOAD_EFER without SAVE_EFER is an inconsistency. KVM's unofficial guideline for an "inconsistency" is either scenarios that are completely nonsensical, e.g. the existing checks on having EPT/VPID knobs without EPT/VPID, and/or scenarios that prevent KVM from virtualizing or utilizing a feature, e.g. the unpaired entry/exit controls checks. Other checks that fall into one or both of the covered scenarios could be added in the future, e.g. asserting that a VMCS control exists available if and only if the associated feature is supported in bare metal. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: VMX: Sanitize VM-Entry/VM-Exit control pairs at kvm_intel load timeSean Christopherson2-7/+34
Sanitize the VM-Entry/VM-Exit control pairs (load+load or load+clear) during setup instead of checking both controls in a pair at runtime. If only one control is supported, KVM will report the associated feature as not available, but will leave the supported control bit set in the VMCS config, which could lead to corruption of host state. E.g. if only the VM-Entry control is supported and the feature is not dynamically toggled, KVM will set the control in all VMCSes and load zeros without restoring host state. Note, while this is technically a bug fix, practically speaking no sane CPU or VMM would support only one control. KVM's behavior of checking both controls is mostly pedantry. Cc: Chenyi Qiang <[email protected]> Cc: Lei Wang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Accept 0 for absent PMU MSRs when host-initiated if !enable_pmuLike Xu2-1/+18
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, as is the case for MSR_K7_EVNTSEL0 or MSR_F15H_PERF_CTL0, it has to be always retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept a zero value for these MSRs to obey the contract. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Restrict advanced features based on module enable_pmuLike Xu3-3/+12
Once vPMU is disabled, the KVM would not expose features like: PEBS (via clear kvm_pmu_cap.pebs_ept), legacy LBR and ARCH_LBR, CPUID 0xA leaf, PDCM bit and MSR_IA32_PERF_CAPABILITIES, plus PT_MODE_HOST_GUEST mode. What this group of features has in common is that their use relies on the underlying PMU counter and the host perf_event as a back-end resource requester or sharing part of the irq delivery path. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Avoid exposing Intel BTS featureLike Xu3-4/+8
The BTS feature (including the ability to set the BTS and BTINT bits in the DEBUGCTL MSR) is currently unsupported on KVM. But we may try using the BTS facility on a PEBS enabled guest like this: perf record -e branches:u -c 1 -d ls and then we would encounter the following call trace: [] unchecked MSR access error: WRMSR to 0x1d9 (tried to write 0x00000000000003c0) at rIP: 0xffffffff810745e4 (native_write_msr+0x4/0x20) [] Call Trace: [] intel_pmu_enable_bts+0x5d/0x70 [] bts_event_add+0x54/0x70 [] event_sched_in+0xee/0x290 As it lacks any CPUID indicator or perf_capabilities valid bit fields to prompt for this information, the platform would hint the Intel BTS feature unavailable to guest by setting the BTS_UNAVAIL bit in the IA32_MISC_ENABLE. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Update global enable_pmu when PMU is undetectedLike Xu1-5/+10
On some virt platforms (L1 guest w/o PMU), the value of module parameter 'enable_pmu' for nested L2 guests should be updated at initialisation. Considering that there is no concept of "architecture pmu" in AMD or Hygon and that the versions (prior to Zen 4) are all 0, but that the theoretical available counters are at least AMD64_NUM_COUNTERS, the utility check_hw_exists() is reused in the initialisation call path. Opportunistically update Intel specific comments. Fixes: 8eeac7e999e8 ("KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability") Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds10-36/+109
Pull KVM fixes from Paolo Bonzini: - syzkaller NULL pointer dereference - TDP MMU performance issue with disabling dirty logging - 5.14 regression with SVM TSC scaling - indefinite stall on applying live patches - unstable selftest - memory leak from wrong copy-and-paste - missed PV TLB flush when racing with emulation * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: do not report a vCPU as preempted outside instruction boundaries KVM: x86: do not set st->preempted when going back to user space KVM: SVM: fix tsc scaling cache logic KVM: selftests: Make hyperv_clock selftest more stable KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set KVM: Don't null dereference ops->destroy
2022-06-08KVM: VMX: Enable Notify VM exitTao Xu5-3/+80
There are cases that malicious virtual machines can cause CPU stuck (due to event windows don't open up), e.g., infinite loop in microcode when nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and IRQ) can be delivered. It leads the CPU to be unavailable to host or other VMs. VMM can enable notify VM exit that a VM exit generated if no event window occurs in VM non-root mode for a specified amount of time (notify window). Feature enabling: - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust the expected notify window. - Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space can query and enable this feature in per-VM scope. The argument is a 64bit value: bits 63:32 are used for notify window, and bits 31:0 are for flags. Current supported flags: - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify window provided. - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen. - It's safe to even set notify window to zero since an internal hardware threshold is added to vmcs.notify_window. VM exit handling: - Introduce a vcpu state notify_window_exits to records the count of notify VM exits and expose it through the debugfs. - Notify VM exit can happen incident to delivery of a vector event. Allow it in KVM. - Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID bit is set. Nested handling - Nested notify VM exits are not supported yet. Keep the same notify window control in vmcs02 as vmcs01, so that L1 can't escape the restriction of notify VM exits through launching L2 VM. Notify VM exit is defined in latest Intel Architecture Instruction Set Extensions Programming Reference, chapter 9.2. Co-developed-by: Xiaoyao Li <[email protected]> Signed-off-by: Xiaoyao Li <[email protected]> Signed-off-by: Tao Xu <[email protected]> Co-developed-by: Chenyi Qiang <[email protected]> Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86: Introduce "struct kvm_caps" to track misc caps/settingsSean Christopherson9-86/+94
Add kvm_caps to hold a variety of capabilites and defaults that aren't handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce the amount of boilerplate code required to add a new feature. The vast majority (all?) of the caps interact with vendor code and are written only during initialization, i.e. should be tagged __read_mostly, declared extern in x86.h, and exported. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple faultChenyi Qiang1-1/+20
For the triple fault sythesized by KVM, e.g. the RSM path or nested_vmx_abort(), if KVM exits to userspace before the request is serviced, userspace could migrate the VM and lose the triple fault. Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault with a new event KVM_VCPUEVENT_VALID_FAULT_FAULT so that userspace can save and restore the triple fault event. This extension is guarded by a new KVM capability KVM_CAP_TRIPLE_FAULT_EVENT. Note that in the set_vcpu_events path, userspace is able to set/clear the triple fault request through triple_fault.pending field. Signed-off-by: Chenyi Qiang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Drop amd_event_mapping[] in the KVM contextLike Xu4-64/+11
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW, which means that the table that maps perf_hw_id to event select values is no longer useful, at least for AMD. For Intel, the logic to check if the pmu event reported by Intel cpuid is not available is still required, in which case pmc_perf_hw_id() could be renamed to hw_event_is_unavail() and a bool value is returned to replace the semantics of "PERF_COUNT_HW_MAX+1". Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Replace pmc_perf_hw_id() with perf_get_hw_event_config()Like Xu1-7/+2
With the help of perf_get_hw_event_config(), KVM could query the correct EVENTSEL_{EVENT, UMASK} pair of a kernel-generic hw event directly from the different *_perfmon_event_map[] by the kernel's pre-defined perf_hw_id. Also extend the bit range of the comparison field to AMD64_RAW_EVENT_MASK_NB to prevent AMD from defining EventSelect[11:8] into perfmon_event_map[] one day. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Use PERF_TYPE_RAW to merge reprogram_{gp,fixed}counter()Like Xu1-58/+21
The code sketch for reprogram_{gp, fixed}_counter() is similar, while the fixed counter using the PERF_TYPE_HARDWAR type and the gp being able to use either PERF_TYPE_HARDWAR or PERF_TYPE_RAW type depending on the pmc->eventsel value. After 'commit 761875634a5e ("KVM: x86/pmu: Setup pmc->eventsel for fixed PMCs")', the pmc->eventsel of the fixed counter will also have been setup with the same semantic value and will not be changed during the guest runtime. The original story of using the PERF_TYPE_HARDWARE type is to emulate guest architecture PMU on a host without architecture PMU (the Pentium 4), for which the guest vPMC needs to be reprogrammed using the kernel generic perf_hw_id. But essentially, "the HARDWARE is just a convenience wrapper over RAW IIRC", quoated from Peterz. So it could be pretty safe to use the PERF_TYPE_RAW type only in practice to program both gp and fixed counters naturally in the reprogram_counter(). To make the gp and fixed counters more semantically symmetrical, the selection of EVENTSEL_{USER, OS, INT} bits is temporarily translated via fixed_ctr_ctrl before the pmc_reprogram_counter() call. Cc: Peter Zijlstra <[email protected]> Suggested-by: Jim Mattson <[email protected]> Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Use only the uniform interface reprogram_counter()Paolo Bonzini4-9/+5
Since reprogram_counter(), reprogram_{gp, fixed}_counter() currently have the same incoming parameter "struct kvm_pmc *pmc", the callers can simplify the conetxt by using uniformly exported interface, which makes reprogram_ {gp, fixed}_counter() static and eliminates EXPORT_SYMBOL_GPL. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()Like Xu3-15/+14
Since afrer reprogram_fixed_counter() is called, it's bound to assign the requested fixed_ctr_ctrl to pmu->fixed_ctr_ctrl, this assignment step can be moved forward (the stale value for diff is saved extra early), thus simplifying the passing of parameters. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Drop "u64 eventsel" for reprogram_gp_counter()Like Xu4-8/+10
Because inside reprogram_gp_counter() it is bound to assign the requested eventel to pmc->eventsel, this assignment step can be moved forward, thus simplifying the passing of parameters to "struct kvm_pmc *pmc" only. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Pass only "struct kvm_pmc *pmc" to reprogram_counter()Like Xu3-27/+24
Passing the reference "struct kvm_pmc *pmc" when creating pmc->perf_event is sufficient. This change helps to simplify the calling convention by replacing reprogram_{gp, fixed}_counter() with reprogram_counter() seamlessly. No functional change intended. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Extract check_pmu_event_filter() handling both GP and fixed ↵Like Xu1-26/+37
counters Checking the kvm->arch.pmu_event_filter policy in both gp and fixed code paths was somewhat redundant, so common parts can be extracted, which reduces code footprint and improves readability. Signed-off-by: Like Xu <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Update comments for AMD gp countersLike Xu1-2/+5
The obsolete comment could more accurately state that AMD platforms have two base MSR addresses and two different maximum numbers for gp counters, depending on the X86_FEATURE_PERFCTR_CORE feature. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86: always allow host-initiated writes to PMU MSRsPaolo Bonzini5-20/+27
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept the PMU MSRs unconditionally in intel_is_valid_msr, if the access was host-initiated. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: vmx, pmu: accept 0 for host-initiated write to MSR_IA32_DS_AREAPaolo Bonzini1-0/+2
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, as is the case for MSR_IA32_DS_AREA, it has to be always settable with KVM_SET_MSR. Accept a zero value for these MSRs to obey the contract. Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Ignore pmu->global_ctrl check if vPMU doesn't support global_ctrlLike Xu1-0/+3
MSR_CORE_PERF_GLOBAL_CTRL is introduced as part of Architecture PMU V2, as indicated by Intel SDM 19.2.2 and the intel_is_valid_msr() function. So in the absence of global_ctrl support, all PMCs are enabled as AMD does. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: Don't overwrite the pmu->global_ctrl when refreshingLike Xu1-4/+5
Assigning a value to pmu->global_ctrl just to set the value of pmu->global_ctrl_mask is more readable but does not conform to the specification. The value is reset to zero on Power up and Reset but stays unchanged on INIT, like most other MSRs. Signed-off-by: Like Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-08KVM: x86/pmu: remove useless prototypePaolo Bonzini1-1/+0
Signed-off-by: Paolo Bonzini <[email protected]>