aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/kvm/mmu/spte.h
AgeCommit message (Collapse)AuthorFilesLines
2023-08-31KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON()Sean Christopherson1-4/+4
Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-24kvm: x86/mmu: Rename SPTE_TDP_AD_ENABLED_MASK to SPTE_TDP_AD_ENABLEDLai Jiangshan1-8/+8
SPTE_TDP_AD_ENABLED_MASK, SPTE_TDP_AD_DISABLED_MASK and SPTE_TDP_AD_WRPROT_ONLY_MASK are actual value, not mask. Remove "MASK" from their names. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Link: https://lore.kernel.org/r/20230105100204.6521-1-jiangshanlai@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-12-29KVM: x86: Unify pr_fmt to use module name for all KVM modulesSean Christopherson1-2/+2
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks use consistent formatting across common x86, Intel, and AMD code. In addition to providing consistent print formatting, using KBUILD_MODNAME, e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and SGX and ...) as technologies without generating weird messages, and without causing naming conflicts with other kernel code, e.g. "SEV: ", "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems. Opportunistically move away from printk() for prints that need to be modified anyways, e.g. to drop a manual "kvm: " prefix. Opportunistically convert a few SGX WARNs that are similarly modified to WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good that they would fire repeatedly and spam the kernel log without providing unique information in each print. Note, defining pr_fmt yields undesirable results for code that uses KVM's printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's wrappers is relatively limited in KVM x86 code. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paul Durrant <paul@xen.org> Message-Id: <20221130230934.1014142-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27kvm: x86/mmu: Remove duplicated "be split" in spte.hLai Jiangshan1-1/+1
"be split be split" -> "be split" Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120505.9175-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: Add helper to convert SPTE value to its shadow pageSean Christopherson1-0/+17
Add a helper to convert a SPTE to its shadow page to deduplicate a variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to get the shadow page for a SPTE without dropping high bits. Opportunistically add a comment in mmu_free_root_page() documenting why it treats the root HPA as a SPTE. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221019165618.927057-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09KVM: x86/mmu: fix some comment typosMiaohe Lin1-1/+1
Fix some typos in comments. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220913091725.35953-1-linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19kvm: x86: mmu: Always flush TLBs when enabling dirty loggingJunaid Shahid1-4/+10
When A/D bits are not available, KVM uses a software access tracking mechanism, which involves making the SPTEs inaccessible. However, the clear_young() MMU notifier does not flush TLBs. So it is possible that there may still be stale, potentially writable, TLB entries. This is usually fine, but can be problematic when enabling dirty logging, because it currently only does a TLB flush if any SPTEs were modified. But if all SPTEs are in access-tracked state, then there won't be a TLB flush, which means that the guest could still possibly write to memory and not have it reflected in the dirty bitmap. So just unconditionally flush the TLBs when enabling dirty logging. As an alternative, KVM could explicitly check the MMU-Writable bit when write-protecting SPTEs to decide if a flush is needed (instead of checking the Writable bit), but given that a flush almost always happens anyway, so just making it unconditional seems simpler. Signed-off-by: Junaid Shahid <junaids@google.com> Message-Id: <20220810224939.2611160-1-junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: x86/mmu: Add sanity check that MMIO SPTE mask doesn't overlap genSean Christopherson1-0/+14
Add compile-time and init-time sanity checks to ensure that the MMIO SPTE mask doesn't overlap the MMIO SPTE generation or the MMU-present bit. The generation currently avoids using bit 63, but that's as much coincidence as it is strictly necessarly. That will change in the future, as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask. Explicitly carve out the bits that are allowed in the mask so that any future shuffling of SPTE bits doesn't silently break MMIO caching (KVM has broken MMIO caching more than once due to overlapping the generation with other things). Suggested-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220805194133.86299-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: SVM: Disable SEV-ES support if MMIO caching is disableSean Christopherson1-2/+0
Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs generating #NPF(RSVD), which are reflected by the CPU into the guest as a #VC. With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access to the guest instruction stream or register state and so can't directly emulate in response to a #NPF on an emulated MMIO GPA. Disabling MMIO caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT), and those flavors of #NPF cause automatic VM-Exits, not #VC. Adjust KVM's MMIO masks to account for the C-bit location prior to doing SEV(-ES) setup, and document that dependency between adjusting the MMIO SPTE mask and SEV(-ES) setup. Fixes: b09763da4dd8 ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)") Reported-by: Michael Roth <michael.roth@amd.com> Tested-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220803224957.1285926-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: x86/mmu: Fully re-evaluate MMIO caching when SPTE masks changeSean Christopherson1-0/+1
Fully re-evaluate whether or not MMIO caching can be enabled when SPTE masks change; simply clearing enable_mmio_caching when a configuration isn't compatible with caching fails to handle the scenario where the masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit location, and toggle compatibility from false=>true. Snapshot the original module param so that re-evaluating MMIO caching preserves userspace's desire to allow caching. Use a snapshot approach so that enable_mmio_caching still reflects KVM's actual behavior. Fixes: 8b9e74bfbf8c ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled") Reported-by: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Tested-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20220803224957.1285926-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Add shadow mask for effective host MTRR memtypeSean Christopherson1-0/+1
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask instead of relying on TDP being enabled, as NPT doesn't need a non-zero mask. This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero for NPT anyways. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14KVM: x86/mmu: Add optimized helper to retrieve an SPTE's indexSean Christopherson1-0/+6
Add spte_index() to dedup all the code that calculates a SPTE's index into its parent's page table and/or spt array. Opportunistically tweak the calculation to avoid pointer arithmetic, which is subtle (subtract in 8-byte chunks) and less performant (requires the compiler to generate the subtraction). Suggested-by: David Matlack <dmatlack@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220712020724.1262121-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMUDavid Matlack1-2/+2
Currently make_huge_page_split_spte() assumes execute permissions can be granted to any 4K SPTE when splitting huge pages. This is true for the TDP MMU but is not necessarily true for the shadow MMU, since KVM may be shadowing a non-executable huge page. To fix this, pass in the role of the child shadow page where the huge page will be split and derive the execution permission from that. This is correct because huge pages are always split with direct shadow page and thus the shadow page role contains the correct access permissions. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-19-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basisBen Gardon1-1/+2
In some cases, the NX hugepage mitigation for iTLB multihit is not needed for all guests on a host. Allow disabling the mitigation on a per-VM basis to avoid the performance hit of NX hugepages on trusted workloads. In order to disable NX hugepages on a VM, ensure that the userspace actor has permission to reboot the system. Since disabling NX hugepages would allow a guest to crash the system, it is similar to reboot permissions. Ideally, KVM would require userspace to prove it has access to KVM's nx_huge_pages module param, e.g. so that userspace can opt out without needing full reboot permissions. But getting access to the module param file info is difficult because it is buried in layers of sysfs and module glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220613212523.3436117-9-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEsSean Christopherson1-15/+12
Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: x86/mmu: Dedup macros for computing various page table masksSean Christopherson1-4/+3
Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220614233328.3896033-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_maskKai Huang1-0/+1
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Don't attempt fast page fault just because EPT is in useSean Christopherson1-0/+11
Check for A/D bits being disabled instead of the access tracking mask being non-zero when deciding whether or not to attempt to fix a page fault vian the fast path. Originally, the access tracking mask was non-zero if and only if A/D bits were disabled by _KVM_ (including not being supported by hardware), but that hasn't been true since nVMX was fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause KVM to not use A/D bits while running L2 despite KVM using them while running L1. In other words, don't attempt the fast path just because EPT is enabled. Note, attempting the fast path for all !PRESENT faults can "fix" a very, _VERY_ tiny percentage of faults out of mmu_lock by detecting that the fault is spurious, i.e. has been fixed by a different vCPU, but again the odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM gets less than 10 successes out of 30k+ faults, and that's likely one of the more favorable scenarios. Disabling dirty logging can likely lead to a rash of collisions between vCPUs for some workloads that operate on a common set of pages, but penalizing _all_ !PRESENT faults for that one case is unlikely to be a net positive, not to mention that that problem is best solved by not zapping in the first place. The number of spurious faults does scale with the number of vCPUs, e.g. a 255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast path (again out of 30k), but that's all of 0.2% of faults. Using legacy shadow paging does get more spurious faults, and a few more detected out of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring faults that are reflected into the guest), i.e. the extra detections are purely due to the sheer number of faults observed. On the other hand, getting a "negative" in the fast path takes in the neighborhood of 150-250 cycles. So while it is tempting to keep/extend the current behavior, such a change needs to come with hard numbers showing that it's actually a win in the grand scheme, or any scheme for that matter. Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEADPaolo Bonzini1-1/+3
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock is held for write, as volatile SPTEs can be written by other tasks/vCPUs outside of mmu_lock. Attempting to prove that bug exposed another notable goof, which has been lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs as volatile, even though KVM never clears WRITABLE outside of MMU lock. As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to update writable SPTEs. The fix does not seem to have an easily-measurable affect on performance; page faults are so slow that wasting even a few hundred cycles is dwarfed by the base cost.
2022-05-03KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()Sean Christopherson1-0/+2
Move the is_shadow_present_pte() check out of spte_has_volatile_bits() and into its callers. Well, caller, since only one of its two callers doesn't already do the shadow-present check. Opportunistically move the helper to spte.c/h so that it can be used by the TDP MMU, which is also the primary motivation for the shadow-present change. Unlike the legacy MMU, the TDP MMU uses a single path for clear leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP MMU will need to check is_last_spte() prior to calling spte_has_volatile_bits(), and calling is_last_spte() without first calling is_shadow_present_spte() is at best odd, and at worst a violation of KVM's loosely defines SPTE rules. Note, mmu_spte_clear_track_bits() could likely skip the write entirely for SPTEs that are not shadow-present. Leave that cleanup for a future patch to avoid introducing a functional change, and because the shadow-present check can likely be moved further up the stack, e.g. drop_large_spte() appears to be the only path that doesn't already explicitly check for a shadow-present SPTE. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)Sean Christopherson1-1/+1
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as being volatile (unless they're volatile for other reasons, e.g. A/D bits). KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get cleared just because the SPTE is MMU-writable. Rename the wrapper of MMU-writable to be more literal, the previous name of spte_can_locklessly_be_made_writable() is wrong and misleading. Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabledSean Christopherson1-1/+3
Clear enable_mmio_caching if hardware can't support MMIO caching and use the dedicated flag to detect if MMIO caching is enabled instead of assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will use a zero value even when caching is enabled, and is_mmio_spte() isn't so hot that it needs to avoid an extra memory access, i.e. there's no reason to be super clever. And the clever approach may not even be more performant, e.g. gcc-11 lands the extra check on a non-zero value inline, but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra uops for non-MMIO SPTEs. Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420002747.3287931-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDRSean Christopherson1-6/+0
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is ↵David Matlack1-0/+1
enabled When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Move restore_acc_track_spte() to spte.hDavid Matlack1-0/+15
restore_acc_track_spte() is pure SPTE bit manipulation, making it a good fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed, there isn't any good reason to not inline it. This move also prepares for a follow-up commit that will need to call restore_acc_track_spte() from spte.c No functional change intended. Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-11-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Consolidate comments about {Host,MMU}-writableDavid Matlack1-49/+56
Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the large comment above is_writable_pte() into one comment. This comment explains the different reasons why an SPTE may be non-writable and KVM keeps track of that with the {Host,MMU}-writable bits. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230723.1701061-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Rename DEFAULT_SPTE_MMU_WRITEABLE to DEFAULT_SPTE_MMU_WRITABLEDavid Matlack1-4/+4
Both "writeable" and "writable" are valid, but we should be consistent about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230713.1700406-1-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Move is_writable_pte() to spte.hDavid Matlack1-1/+39
Move is_writable_pte() close to the other functions that check writability information about SPTEs. While here opportunistically replace the open-coded bit arithmetic in check_spte_writable_invariants() with a call to is_writable_pte(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Check SPTE writable invariants when setting leaf SPTEsDavid Matlack1-1/+1
Check SPTE writable invariants when setting SPTEs rather than in spte_can_locklessly_be_made_writable(). By the time KVM checks spte_can_locklessly_be_made_writable(), the SPTE has long been since corrupted. Note that these invariants only apply to shadow-present leaf SPTEs (i.e. not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the restriction and only instrument the code paths that set shadow-present leaf SPTEs. To account for access tracking, also check the SPTE writable invariants when marking an SPTE as an access track SPTE. This also lets us remove a redundant WARN from mark_spte_for_access_track(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-10KVM: x86/mmu: Move SPTE writable invariant checks to a helper functionDavid Matlack1-7/+13
Move the WARNs in spte_can_locklessly_be_made_writable() to a separate helper function. This is in preparation for moving these checks to the places where SPTEs are set. Opportunistically add warning error messages that include the SPTE to make future debugging of these warnings easier. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-01-19KVM: x86/mmu: Document and enforce MMU-writable and Host-writable invariantsDavid Matlack1-6/+36
SPTEs are tagged with software-only bits to indicate if it is "MMU-writable" and "Host-writable". These bits are used to determine why KVM has marked an SPTE as read-only. Document these bits and their invariants, and enforce the invariants with new WARNs in spte_can_locklessly_be_made_writable() to ensure they are not accidentally violated in the future. Opportunistically move DEFAULT_SPTE_{MMU,HOST}_WRITABLE next to EPT_SPTE_{MMU,HOST}_WRITABLE since the new documentation applies to both. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220113233020.3986005-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: x86/mmu: Propagate memslot const qualifierBen Gardon1-1/+1
In preparation for implementing in-place hugepage promotion, various functions will need to be called from zap_collapsible_spte_range, which has the const qualifier on its memslot argument. Propagate the const qualifier to the various functions which will be needed. This just serves to simplify the following patch. No functional change intended. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20211115234603.2908381-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22KVM: x86/mmu: clean up prefetch/prefault/speculative namingPaolo Bonzini1-1/+1
"prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18KVM: x86: avoid warning with -Wbitwise-instead-of-logicalPaolo Bonzini1-6/+1
This is a new warning in clang top-of-tree (will be clang 14): In file included from arch/x86/kvm/mmu/mmu.c:27: arch/x86/kvm/mmu/spte.h:318:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] return __is_bad_mt_xwr(rsvd_check, spte) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || arch/x86/kvm/mmu/spte.h:318:9: note: cast one or both operands to int to silence this warning The code is fine, but change it anyway to shut up this clever clogs of a compiler. Reported-by: torvic9@mailbox.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01KVM: x86/mmu: Avoid memslot lookup in make_spte and mmu_try_to_unsync_pagesDavid Matlack1-0/+1
mmu_try_to_unsync_pages checks if page tracking is active for the given gfn, which requires knowing the memslot. We can pass down the memslot via make_spte to avoid this lookup. The memslot is also handy for make_spte's marking of the gfn as dirty: we can test whether dirty page tracking is enabled, and if so ensure that pages are mapped as writable with 4K granularity. Apart from the warning, no functional change is intended. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-7-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01KVM: MMU: pass kvm_mmu_page struct to make_sptePaolo Bonzini1-4/+4
The level and A/D bit support of the new SPTE can be found in the role, which is stored in the kvm_mmu_page struct. This merges two arguments into one. For the TDP MMU, the kvm_mmu_page was not used (kvm_tdp_mmu_map does not use it if the SPTE is already present) so we fetch it just before calling make_spte. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-01KVM: MMU: clean up make_spte return valuePaolo Bonzini1-6/+1
Now that make_spte is called directly by the shadow MMU (rather than wrapped by set_spte), it only has to return one boolean value. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-14KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAsSean Christopherson1-6/+0
Ignore "dynamic" host adjustments to the physical address mask when generating the masks for guest PTEs, i.e. the guest PA masks. The host physical address space and guest physical address space are two different beasts, e.g. even though SEV's C-bit is the same bit location for both host and guest, disabling SME in the host (which clears shadow_me_mask) does not affect the guest PTE->GPA "translation". For non-SEV guests, not dropping bits is the correct behavior. Assuming KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits that are lost as collateral damage from memory encryption are treated as reserved bits, i.e. KVM will never get to the point where it attempts to generate a gfn using the affected bits. And if userspace wants to create a bogus vCPU, then userspace gets to deal with the fallout of hardware doing odd things with bad GPAs. For SEV guests, not dropping the C-bit is technically wrong, but it's a moot point because KVM can't read SEV guest's page tables in any case since they're always encrypted. Not to mention that the current KVM code is also broken since sme_me_mask does not have to be non-zero for SEV to be supported by KVM. The proper fix would be to teach all of KVM to correctly handle guest private memory, but that's a task for the future. Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-5-seanjc@google.com> [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-24KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMUSean Christopherson1-0/+32
Extract the reserved SPTE check and print helpers in get_mmio_spte() to new helpers so that KVM can also WARN on reserved badness when making a SPTE. Tag the checking helper with __always_inline to improve the probability of the compiler generating optimal code for the checking loop, e.g. gcc appears to avoid using %rbp when the helper is tagged with a vanilla "inline". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-48-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Use low available bits for removed SPTEsSean Christopherson1-4/+7
Use low "available" bits to tag REMOVED SPTEs. Using a high bit is moderately costly as it often causes the compiler to generate a 64-bit immediate. More importantly, this makes it very clear REMOVED_SPTE is a value, not a flag. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Tweak auditing WARN for A/D bits to !PRESENT (was MMIO)Sean Christopherson1-9/+9
Tweak the MMU_WARN that guards against weirdness when querying A/D status to fire on a !MMU_PRESENT SPTE, as opposed to a MMIO SPTE. Attempting to query A/D status on any kind of !MMU_PRESENT SPTE, MMIO or otherwise, indicates a KVM bug. Case in point, several now-fixed bugs were identified by enabling this new WARN. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Use a dedicated bit to track shadow/MMU-present SPTEsSean Christopherson1-6/+17
Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from the MMU's perspective. Checking for shadow-present SPTEs is a very common operation for the MMU, particularly in hot paths such as page faults. With the addition of "removed" SPTEs for the TDP MMU, identifying shadow-present SPTEs is quite costly especially since it requires checking multiple 64-bit values. On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes. On 32-bit KVM, this increases the footprint by ~200 bytes, but only because gcc now inlines several more MMU helpers, e.g. drop_parent_pte(). We now need to drop bit 11, used for the MMU_PRESENT flag, from the set of bits used to store the generation number in MMIO SPTEs. Otherwise MMIO SPTEs with bit 11 set would get false positives for is_shadow_present_spte() and lead to a variety of fireworks, from oopses to likely hangs of the host kernel. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Use high bits for host/mmu writable masks for EPT SPTEsSean Christopherson1-15/+33
Use bits 57 and 58 for HOST_WRITABLE and MMU_WRITABLE when using EPT. This will allow using bit 11 as a constant MMU_PRESENT, which is desirable as checking for a shadow-present SPTE is one of the most common SPTE operations in KVM, particular in hot paths such as page faults. EPT is short on low available bits; currently only bit 11 is the only always-available bit. Bit 10 is also available, but only while KVM doesn't support mode-based execution. On the other hand, PAE paging doesn't have _any_ high available bits. Thus, using bit 11 is the only feasible option for MMU_PRESENT. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Make Host-writable and MMU-writable bit locations dynamicSean Christopherson1-7/+6
Make the location of the HOST_WRITABLE and MMU_WRITABLE configurable for a given KVM instance. This will allow EPT to use high available bits, which in turn will free up bit 11 for a constant MMU_PRESENT bit. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Document dependency bewteen TDP A/D type and saved bitsSean Christopherson1-3/+5
Document that SHADOW_ACC_TRACK_SAVED_BITS_SHIFT is directly dependent on bits 53:52 being used to track the A/D type. Remove PT64_SECOND_AVAIL_BITS_SHIFT as it is at best misleading, and at worst wrong. For PAE paging, which arguably is a variant of PT64, the bits are reserved. For MMIO SPTEs the bits are not available as they're used for the MMIO generation. For access tracked SPTEs, they are also not available as bits 56:54 are used to store the original RX bits. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Use MMIO SPTE bits 53 and 52 for the MMIO generationSean Christopherson1-4/+4
Use bits 53 and 52 for the MMIO generation now that they're not used to identify MMIO SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEsSean Christopherson1-8/+26
Rename the various A/D status defines to explicitly associated them with TDP. There is a subtle dependency on the bits in question never being set when using PAE paging, as those bits are reserved, not available. I.e. using these bits outside of TDP (technically EPT) would cause explosions. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-15KVM: x86/mmu: Stop using software available bits to denote MMIO SPTEsSean Christopherson1-6/+4
Stop tagging MMIO SPTEs with specific available bits and instead detect MMIO SPTEs by checking for their unique SPTE value. The value is guaranteed to be unique on shadow paging and NPT as setting reserved physical address bits on any other type of SPTE would consistute a KVM bug. Ditto for EPT, as creating a WX non-MMIO would also be a bug. Note, this approach is also future-compatibile with TDX, which will need to reflect MMIO EPT violations as #VEs into the guest. To create an EPT violation instead of a misconfig, TDX EPTs will need to have RWX=0, But, MMIO SPTEs will also be the only case where KVM clears SUPPRESS_VE, so MMIO SPTEs will still be guaranteed to have a unique value within a given MMU context. The main motivation is to make it easier to reason about which types of SPTEs use which available bits. As a happy side effect, this frees up two more bits for storing the MMIO generation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210225204749.1512652-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handlerBen Gardon1-1/+20
When the TDP MMU is allowed to handle page faults in parallel there is the possiblity of a race where an SPTE is cleared and then imediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-22-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04KVM: x86/mmu: Use boolean returns for (S)PTE accessorsSean Christopherson1-8/+4
Return a 'bool' instead of an 'int' for various PTE accessors that are boolean in nature, e.g. is_shadow_present_pte(). Returning an int is goofy and potentially dangerous, e.g. if a flag being checked is moved into the upper 32 bits of a SPTE, then the compiler may silently squash the entire check since casting to an int is guaranteed to yield a return value of '0'. Opportunistically refactor is_last_spte() so that it naturally returns a bool value instead of letting it implicitly cast 0/1 to false/true. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210123003003.3137525-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>