aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)AuthorFilesLines
2024-11-08KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKENSean Christopherson1-1/+3
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: x86: Unconditionally set irr_pending when updating APICv stateSean Christopherson1-11/+18
Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Yong He <zhuangel570@gmail.com> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08kvm: svm: Fix gctx page leak on invalid inputsDionna Glaze1-4/+4
Ensure that snp gctx page allocation is adequately deallocated on failure during snp_launch_start. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson <seanjc@google.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: Ashish Kalra <ashish.kalra@amd.com> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: John Allen <john.allen@amd.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: "David S. Miller" <davem@davemloft.net> CC: Michael Roth <michael.roth@amd.com> CC: Luis Chamberlain <mcgrof@kernel.org> CC: Russ Weight <russ.weight@linux.dev> CC: Danilo Krummrich <dakr@redhat.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: "Rafael J. Wysocki" <rafael@kernel.org> CC: Tianfei zhang <tianfei.zhang@intel.com> CC: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-04KVM: SVM: Propagate error from snp_guest_req_init() to userspaceSean Christopherson1-2/+5
If snp_guest_req_init() fails, return the provided error code up the stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2 failed, as opposed to some random operation later in VM setup failing because SNP wasn't actually enabled for the VM. Note, KVM itself doesn't consult the return value from __sev_guest_init(), i.e. the fallout is purely that userspace may be confused. Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabledSean Christopherson2-6/+26
When getting the current VPID, e.g. to emulate a guest TLB flush, return vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled in vmcs12. Architecturally, if VPID is disabled, then the guest and host effectively share VPID=0. KVM emulates this behavior by using vpid01 when running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so KVM must also treat vpid01 as the current VPID while L2 is active. Unconditionally treating vpid02 as the current VPID when L2 is active causes KVM to flush TLB entries for vpid02 instead of vpid01, which results in TLB entries from L1 being incorrectly preserved across nested VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after nested VM-Exit flushes vpid01). The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM incorrectly retains TLB entries for the APIC-access page across a nested VM-Enter. Opportunisticaly add comments at various touchpoints to explain the architectural requirements, and also why KVM uses vpid01 instead of vpid02. All credit goes to Chao, who root caused the issue and identified the fix. Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com Fixes: 2b4a5a5d5688 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Debugged-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-10-20KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memorySean Christopherson1-1/+5
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()Maxim Levitsky1-3/+3
Reset the segment cache after segment initialization in vmx_vcpu_reset() to harden KVM against caching stale/uninitialized data. Without the recent fix to bypass the cache in kvm_arch_vcpu_put(), the following scenario is possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Invalidating the cache after writing the VMCS doesn't address the general issue of cache accesses from IRQ context being unsafe, but it does prevent KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a bug that results in an unsafe cache access. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") [sean: rework changelog to account for previous patch] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009175002.1118178-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()Sean Christopherson1-0/+11
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either mmu_invalidate_in_progress is elevated, or that the range is being zapped due to memslot removal (loosely detected by slots_lock being held). Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe as KVM's page fault path snapshots state before acquiring mmu_lock, and thus can create SPTEs with stale information if vCPUs aren't forced to retry faults (due to seeing an in-progress or past MMU invalidation). Memslot removal is a special case, as the memslot is retrieved outside of mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and instead relies on SRCU synchronization to ensure any in-flight page faults are fully resolved before zapping SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslotSean Christopherson1-10/+6
When performing a targeted zap on memslot removal, zap only MMU pages that shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't exist, because it doesn't do what most people would it expect it to do. The "round gfn for level" adjustment that is done for direct SPs (no gPTE) means that the exact gfn comparison will not get a match, even when a SP does "cover" a gfn, or was even created specifically for a gfn. For memslot deletion specifically, KVM's behavior will vary significantly based on the size and alignment of a memslot, and in weird ways. E.g. for a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned. And as described below, zapping SPs in the aligned case overzaps for direct MMUs, as odds are good the upper-level SPs are serving other memslots. To iterate over all potentially-relevant gfns, KVM would need to make a pass over the hash table for each level, with the gfn used for lookup rounded for said level. And then check that the SP is of the correct level, too, e.g. to avoid over-zapping. But even then, KVM would massively overzap, as processing every level is all but guaranteed to zap SPs that serve other memslots, especially if the memslot being removed is relatively small. KVM could mitigate that issue by processing only levels that can be possible guest huge pages, i.e. are less likely to be re-used for other memslot, but while somewhat logical, that's quite arbitrary and would be a bit of a mess to implement. So, zap only SPs with gPTEs, as the resulting behavior is easy to describe, is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped. Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241009192345.1148353-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-06Merge tag 'kvmarm-fixes-6.12-1' of ↵Paolo Bonzini3-19/+15
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.12, take #1 - Fix pKVM error path on init, making sure we do not change critical system registers as we're about to fail - Make sure that the host's vector length is at capped by a value common to all CPUs - Fix kvm_has_feat*() handling of "negative" features, as the current code is pretty broken - Promote Joey to the status of official reviewer, while James steps down -- hopefully only temporarly
2024-10-06KVM: x86: leave kvm.ko out of the build if no vendor module is requestedPaolo Bonzini2-4/+7
kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko. It provides no functionality on its own and it is unnecessary unless one of the vendor-specific module is compiled. In particular, /dev/kvm is not created until one of kvm-intel.ko or kvm-amd.ko is loaded. Use CONFIG_KVM to decide if it is built-in or a module, but use the vendor-specific modules for the actual decision on whether to build it. This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD are both disabled. The cpu_emergency_register_virt_callback() function is called from kvm.ko, but it is only defined if at least one of CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided. Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMUPaolo Bonzini1-14/+46
As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting for a shadow page is tied to the memslot, i.e. the shadow page holds a reference to the memslot, for all intents and purposes. Deleting the memslot without removing all relevant shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled, results in NULL pointer derefs when tearing down the VM. Reintroduce from that commit the code that walks the whole memslot when there are active shadow MMU pages. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds28-977/+1222
Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
2024-09-23Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
2024-09-17Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini11-37/+102
KVM VMX changes for 6.12: - Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical. - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2. - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures. - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible). - Minor SGX fix and a cleanup.
2024-09-17Merge tag 'kvm-x86-svm-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-22/+32
KVM SVM changes for 6.12: - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB. - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area. - Remove unnecessary accounting of temporary nested VMCB related allocations.
2024-09-17Merge tag 'kvm-x86-pat_vmx_msrs-6.12' of https://github.com/kvm-x86/linux ↵Paolo Bonzini6-45/+66
into HEAD KVM VMX and x86 PAT MSR macro cleanup for 6.12: - Add common defines for the x86 architectural memory types, i.e. the types that are shared across PAT, MTRRs, VMCSes, and EPTPs. - Clean up the various VMX MSR macros to make the code self-documenting (inasmuch as possible), and to make it less painful to add new macros.
2024-09-17Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini6-378/+360
KVM x86 MMU changes for 6.12: - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop. - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU. - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding MGLRU support in KVM. - Misc cleanups
2024-09-17Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini14-473/+594
KVM x86 misc changes for 6.12 - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon). - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs. This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work). - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value a the ICR offset. - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler. - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit. - Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is supposed to hit L1, not L2).
2024-09-17Merge branch 'kvm-redo-enable-virt' into HEADPaolo Bonzini5-30/+36
Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17Merge branch 'kvm-memslot-zap-quirk' into HEADPaolo Bonzini1-1/+41
Today whenever a memslot is moved or deleted, KVM invalidates the entire page tables and generates fresh ones based on the new memslot layout. This behavior traditionally was kept because of a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs. It generally does not have a huge overhead, because the old MMU is able to reuse cached page tables and the new one is more scalabale and can resolve EPT violations/nested page faults in parallel, but it has worse performance if the guest frequently deletes and adds small memslots, and it's entirely not viable for TDX. This is because TDX requires re-accepting of private pages after page dropping. For non-TDX VMs, this series therefore introduces the KVM_X86_QUIRK_SLOT_ZAP_ALL quirk, enabling users to control the behavior of memslot zapping when a memslot is moved/deleted. The quirk is turned on by default, leading to the zapping of all SPTEs when a memslot is moved/deleted; users however have the option to turn off the quirk, which limits the zapping only to those SPTEs hat lie within the range of memslot being moved/deleted. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-15Merge tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-11/+7
Pull kvm fix from Paolo Bonzini: "Do not always honor guest PAT on CPUs that support self-snoop. This triggers an issue in the bochsdrm driver, which used ioremap() instead of ioremap_wc() to map the video RAM. The revert lets video RAM use the WB memory type instead of the slower UC memory type" * tag 'for-linus-6.11' of git://git.kernel.org/pub/scm/virt/kvm/kvm: Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
2024-09-15Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"Paolo Bonzini2-11/+7
This reverts commit 377b2f359d1f71c75f8cc352b5c81f2210312d83. This caused a regression with the bochsdrm driver, which used ioremap() instead of ioremap_wc() to map the video RAM. After the commit, the WB memory type is used without the IGNORE_PAT, resulting in the slower UC memory type. In fact, UC is slow enough to basically cause guests to not boot... but only on new processors such as Sapphire Rapids and Cascade Lake. Coffee Lake for example works properly, though that might also be an effect of being on a larger, more NUMA system. The driver has been fixed but that does not help older guests. Until we figure out whether Cascade Lake and newer processors are working as intended, revert the commit. Long term we might add a quirk, but the details depend on whether the processors are working as intended: for example if they are, the quirk might reference bochs-compatible devices, e.g. in the name and documentation, so that userspace can disable the quirk by default and only leave it enabled if such a device is being exposed to the guest. If instead this is actually a bug in CLX+, then the actions we need to take are different and depend on the actual cause of the bug. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-10KVM: SVM: let alternatives handle the cases when RSB filling is requiredAmit Shah1-6/+2
Remove superfluous RSB filling after a VMEXIT when the CPU already has flushed the RSB after a VMEXIT when AutoIBRS is enabled. The initial implementation for adding RETPOLINES added an ALTERNATIVES implementation for filling the RSB after a VMEXIT in commit 117cc7a908c8 ("x86/retpoline: Fill return stack buffer on vmexit"). Later, X86_FEATURE_RSB_VMEXIT was added in commit 9756bba28470 ("x86/speculation: Fill RSB on vmexit for IBRS") to handle stuffing the RSB if RETPOLINE=y *or* KERNEL_IBRS=y, i.e. to also stuff the RSB if the kernel is configured to do IBRS mitigations on entry/exit. The AutoIBRS (on AMD) feature implementation added in commit e7862eda309e ("x86/cpu: Support AMD Automatic IBRS") used the already-implemented logic for EIBRS in spectre_v2_determine_rsb_fill_type_on_vmexit() -- but did not update the code at VMEXIT to act on the mode selected in that function -- resulting in VMEXITs continuing to clear the RSB when RETPOLINES are enabled, despite the presence of AutoIBRS. Signed-off-by: Amit Shah <amit.shah@amd.com> Link: https://lore.kernel.org/r/20240807123531.69677-1-amit@kernel.org [sean: massage changeloge, drop comment about AMD not needing RSB_VMEXIT_LITE] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is validSean Christopherson1-2/+3
Set PFERR_GUEST_{FINAL,PAGE}_MASK based on EPT_VIOLATION_GVA_TRANSLATED if and only if EPT_VIOLATION_GVA_IS_VALID is also set in exit qualification. Per the SDM, bit 8 (EPT_VIOLATION_GVA_TRANSLATED) is valid if and only if bit 7 (EPT_VIOLATION_GVA_IS_VALID) is set, and is '0' if bit 7 is '0'. Bit 7 (a.k.a. EPT_VIOLATION_GVA_IS_VALID) Set if the guest linear-address field is valid. The guest linear-address field is valid for all EPT violations except those resulting from an attempt to load the guest PDPTEs as part of the execution of the MOV CR instruction and those due to trace-address pre-translation Bit 8 (a.k.a. EPT_VIOLATION_GVA_TRANSLATED) If bit 7 is 1: • Set if the access causing the EPT violation is to a guest-physical address that is the translation of a linear address. • Clear if the access causing the EPT violation is to a paging-structure entry as part of a page walk or the update of an accessed or dirty bit. Reserved if bit 7 is 0 (cleared to 0). Failure to guard the logic on GVA_IS_VALID results in KVM marking the page fault as PFERR_GUEST_PAGE_MASK when there is no known GVA, which can put the vCPU into an infinite loop due to kvm_mmu_page_fault() getting false positive on its PFERR_NESTED_GUEST_PAGE logic (though only because that logic is also buggy/flawed). In practice, this is largely a non-issue because so GVA_IS_VALID is almost always set. However, when TDX comes along, GVA_IS_VALID will *never* be set, as the TDX Module deliberately clears bits 12:7 in exit qualification, e.g. so that the faulting virtual address and other metadata that aren't practically useful for the hypervisor aren't leaked to the untrusted host. When exit is due to EPT violation, bits 12-7 of the exit qualification are cleared to 0. Fixes: eebed2438923 ("kvm: nVMX: Add support for fast unprotection of nested guest page tables") Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalentSean Christopherson1-1/+1
Use KVM_PAGES_PER_HPAGE() instead of open coding equivalent logic that is anything but obvious. No functional change intended, and verified by compiling with the below assertions: BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K)) != KVM_PAGES_PER_HPAGE(PG_LEVEL_4K)); BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) != KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)); BUILD_BUG_ON((1UL << KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) != KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)); Link: https://lore.kernel.org/r/20240809194335.1726916-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literalsSean Christopherson1-15/+16
Replace all of the open coded '1' literals used to mark a PTE list as having many/multiple entries with a proper define. It's hard enough to read the code with one magic bit, and a future patch to support "locking" a single rmap will add another. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()Sean Christopherson1-28/+22
Fold mmu_spte_age() into its sole caller now that aging and testing for young SPTEs is handled in a common location, i.e. doesn't require more helpers. Opportunistically remove the use of mmu_spte_get_lockless(), as mmu_lock is held (for write!), and marking SPTEs for access tracking outside of mmu_lock is unsafe (at least, as written). I.e. using the lockless accessor is quite misleading. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helperSean Christopherson1-46/+22
Rework kvm_handle_gfn_range() into an aging-specic helper, kvm_rmap_age_gfn_range(). In addition to purging a bunch of unnecessary boilerplate code, this sets the stage for aging rmap SPTEs outside of mmu_lock. Note, there's a small functional change, as kvm_test_age_gfn() will now return immediately if a young SPTE is found, whereas previously KVM would continue iterating over other levels. Link: https://lore.kernel.org/r/20240809194335.1726916-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowedSean Christopherson1-10/+6
Convert kvm_unmap_gfn_range(), which is the helper that zaps rmap SPTEs in response to an mmu_notifier invalidation, to use __kvm_rmap_zap_gfn_range() and feed in range->may_block. In other words, honor NEED_RESCHED by way of cond_resched() when zapping rmaps. This fixes a long-standing issue where KVM could process an absurd number of rmap entries without ever yielding, e.g. if an mmu_notifier fired on a PUD (or larger) range. Opportunistically rename __kvm_zap_rmap() to kvm_zap_rmap(), and drop the old kvm_zap_rmap(). Ideally, the shuffling would be done in a different patch, but that just makes the compiler unhappy, e.g. arch/x86/kvm/mmu/mmu.c:1462:13: error: ‘kvm_zap_rmap’ defined but not used Reported-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20240809194335.1726916-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslotSean Christopherson1-3/+12
Add a dedicated helper to walk and zap rmaps for a given memslot so that the code can be shared between KVM-initiated zaps and mmu_notifier invalidations. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()Sean Christopherson1-4/+8
Add a @can_yield param to __walk_slot_rmaps() to control whether or not dropping mmu_lock and conditionally rescheduling is allowed. This will allow using __walk_slot_rmaps() and thus cond_resched() to handle mmu_notifier invalidations, which usually allow blocking/yielding, but not when invoked by the OOM killer. Link: https://lore.kernel.org/r/20240809194335.1726916-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()Sean Christopherson1-53/+53
Move walk_slot_rmaps() and friends up near for_each_slot_rmap_range() so that the walkers can be used to handle mmu_notifier invalidations, and so that similar function has some amount of locality in code. No functional change intended. Link: https://lore.kernel.org/r/20240809194335.1726916-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfnSean Christopherson1-10/+20
WARN if KVM gets an MMIO cache hit on a RET_PF_WRITE_PROTECTED fault, as KVM should return RET_PF_WRITE_PROTECTED if and only if there is a memslot, and creating a memslot is supposed to invalidate the MMIO cache by virtue of changing the memslot generation. Keep the code around mainly to provide a convenient location to document why emulated MMIO should be impossible. Suggested-by: Yuan Yao <yuan.yao@linux.intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Detect if unprotect will do anything based on invalid_listSean Christopherson1-4/+7
Explicitly query the list of to-be-zapped shadow pages when checking to see if unprotecting a gfn for retry has succeeded, i.e. if KVM should retry the faulting instruction. Add a comment to explain why the list needs to be checked before zapping, which is the primary motivation for this change. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() versionSean Christopherson1-20/+13
Fold kvm_mmu_unprotect_page() into kvm_mmu_unprotect_gfn_and_retry() now that all other direct usage is gone. No functional change intended. Link: https://lore.kernel.org/r/20240831001538.336683-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()Sean Christopherson1-5/+7
Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to make the intent and purpose of the helper much more obvious. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Update retry protection fields when forcing retry on emulation failureSean Christopherson2-6/+8
When retrying the faulting instruction after emulation failure, refresh the infinite loop protection fields even if no shadow pages were zapped, i.e. avoid hitting an infinite loop even when retrying the instruction as a last-ditch effort to avoid terminating the guest. Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Apply retry protection to "unprotect on failure" pathSean Christopherson1-19/+1
Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick up protection against infinite loops, e.g. if KVM somehow manages to encounter an unsupported instruction and unprotecting the gfn doesn't allow the vCPU to make forward progress. Other than that, the retry-on- failure logic is a functionally equivalent, open coded version of kvm_mmu_unprotect_gfn_and_retry(). Note, the emulation failure path still isn't fully protected, as KVM won't update the retry protection fields if no shadow pages are zapped (but this change is still a step forward). That flaw will be addressed in a future patch. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfnSean Christopherson1-9/+19
Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is set, as KVM will simply report the emulation failure to userspace. This will allow converting reexecute_instruction() to use kvm_mmu_unprotect_gfn_instead_retry() instead of kvm_mmu_unprotect_page(). Link: https://lore.kernel.org/r/20240831001538.336683-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulationSean Christopherson1-18/+0
Drop the manual pfn look when retrying an instruction that KVM failed to emulation in response to a #PF due to a write-protected gfn. Now that KVM sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write- protected gfn, i.e. if and only if there's a writable memslot, there's no need to redo the lookup to avoid retrying an instruction that failed on emulated MMIO (no slot, or a write to a read-only slot). I.e. KVM will never attempt to retry an instruction that failed on emulated MMIO, whereas that was not the case prior to the introduction of RET_PF_WRITE_PROTECTED. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Move event re-injection unprotect+retry into common pathSean Christopherson1-21/+9
Move the event re-injection unprotect+retry logic into kvm_mmu_write_protect_fault(), i.e. unprotect and retry if and only if the #PF actually hit a write-protected gfn. Note, there is a small possibility that the gfn was unprotected by a different tasking between hitting the #PF and acquiring mmu_lock, but in that case, KVM will resume the guest immediately anyways because KVM will treat the fault as spurious. As a bonus, unprotecting _after_ handling the page fault also addresses the case where the installing a SPTE to handle fault encounters a shadowed PTE, i.e. *creates* a read-only SPTE. Opportunstically add a comment explaining what on earth the intent of the code is, as based on the changelog from commit 577bdc496614 ("KVM: Avoid instruction emulation when event delivery is pending"). Link: https://lore.kernel.org/r/20240831001538.336683-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Always walk guest PTEs with WRITE access when unprotectingSean Christopherson1-1/+1
When getting a gpa from a gva to unprotect the associated gfn when an event is awating reinjection, walk the guest PTEs for WRITE as there's no point in unprotecting the gfn if the guest is unable to write the page, i.e. if write-protection can't trigger emulation. Note, the entire flow should be guarded on the access being a write, and even better should be conditioned on actually triggering a write-protect fault. This will be addressed in a future commit. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Don't try to unprotect an INVALID_GPASean Christopherson1-1/+6
If getting the gpa for a gva fails, e.g. because the gva isn't mapped in the guest page tables, don't try to unprotect the invalid gfn. This is mostly a performance fix (avoids unnecessarily taking mmu_lock), as for_each_gfn_valid_sp_with_gptes() won't explode on garbage input, it's simply pointless. Link: https://lore.kernel.org/r/20240831001538.336683-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Fold retry_instruction() into x86_emulate_instruction()Sean Christopherson1-21/+9
Now that retry_instruction() is reasonably tiny, fold it into its sole caller, x86_emulate_instruction(). In addition to getting rid of the absurdly confusing retry_instruction() name, handling the retry in x86_emulate_instruction() pairs it back up with the code that resets last_retry_{eip,address}. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()Sean Christopherson1-8/+5
Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of x86_emulate_instruction(). In addition to deduplicating a small amount of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and EMULTYPE_PF even more explicit, and will allow dropping retry_instruction() entirely. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Try "unprotect for retry" iff there are indirect SPsSean Christopherson1-0/+11
Try to unprotect shadow pages if and only if indirect_shadow_pages is non- zero, i.e. iff there is at least one protected such shadow page. Pre- checking indirect_shadow_pages avoids taking mmu_lock for write when the gfn is write-protected by a third party, i.e. not for KVM shadow paging, and in the *extremely* unlikely case that a different task has already unprotected the last shadow page. Link: https://lore.kernel.org/r/20240831001538.336683-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" pathSean Christopherson2-27/+39
Move the anti-infinite-loop protection provided by last_retry_{eip,addr} into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that never hits the emulator, as well as reexecute_instruction(), which is the last ditch "might as well try it" logic that kicks in when emulation fails on an instruction that faulted on a write-protected gfn. Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry fields and deduplicate other code (with more to come). Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retrySean Christopherson1-1/+2
Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not an unsigned long. This fixes a bug where 32-bit KVM would unprotect and retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug is functionally benign, as unprotecting the wrong gfn is purely a performance issue (thanks to the anti-infinite-loop logic). And of course, almost no one runs 32-bit KVM these days. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09KVM: x86: Get RIP from vCPU state when storing it to last_retry_eipSean Christopherson1-1/+1
Read RIP from vCPU state instead of pulling it from the emulation context when filling last_retry_eip, which is part of the anti-infinite-loop protection used when unprotecting and retrying instructions that hit a write-protected gfn. This will allow reusing the anti-infinite-loop protection in flows that never make it into the emulator. No functional change intended, as ctxt->eip is set to kvm_rip_read() in init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet). Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>