path: root/arch/x86/kvm
Age  Commit message  Author  Files  Lines
2022-06-24KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page()David Matlack1-0/+10
Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync indirect shadow pages, so it's safe to pass in NULL when looking up direct shadow pages. This will be used for doing eager page splitting, which allocates direct shadow pages from the context of a VM ioctl without access to a vCPU pointer. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page()David Matlack1-13/+15
Get the kvm pointer from the caller, rather than deriving it from vcpu->kvm, and plumb the kvm pointer all the way from kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer is only needed to sync indirect shadow pages. In other words, __kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages without a vcpu pointer. This enables eager page splitting, which needs to allocate direct shadow pages during VM ioctls. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page()David Matlack1-6/+6
The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Pass memory caches to allocate SPs separatelyDavid Matlack1-7/+29
Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it will allocate the various pieces of memory for shadow pages as a parameter, rather than deriving them from the vcpu pointer. This will be useful in a future commit where shadow pages are allocated during VM ioctls for eager page splitting, and thus will use a different set of caches. Preemptively pull the caches out all the way to kvm_mmu_get_shadow_page() since eager page splitting will not be calling kvm_mmu_alloc_shadow_page() directly. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
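A rough sketch of the shape of this refactor follows; the struct and function names are illustrative stand-ins, not the actual KVM identifiers.

    /* Hypothetical sketch of passing caches explicitly; simplified, not
     * the real KVM definitions. */
    struct kvm;
    struct kvm_mmu_page;

    struct shadow_page_caches {
            void *page_header_cache;
            void *shadow_page_cache;
            void *shadowed_info_cache;
    };

    /* Before: the allocator took a vcpu and pulled the caches out of it.
     * After: the caller chooses the caches, so a VM ioctl path (eager
     * page splitting) can supply its own set without a vcpu. */
    struct kvm_mmu_page *alloc_shadow_page(struct kvm *kvm,
                                           struct shadow_page_caches *caches,
                                           unsigned long gfn);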
2022-06-24KVM: x86/mmu: Move guest PT write-protection to account_shadowed()David Matlack1-4/+4
Move the code that write-protects newly-shadowed guest page tables into account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a more logical place for this code to live. But most importantly, this reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct kvm_vcpu pointer, which will be necessary when creating new shadow pages during VM ioctls for eager page splitting. Note, it is safe to drop the role.level == PG_LEVEL_4K check since account_shadowed() returns early if role.level > PG_LEVEL_4K. No functional change intended. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Rename shadow MMU functions that deal with shadow pagesDavid Matlack1-6/+7
Rename 2 functions: kvm_mmu_get_page() -> kvm_mmu_get_shadow_page() kvm_mmu_free_page() -> kvm_mmu_free_shadow_page() This change makes it clear that these functions deal with shadow pages rather than struct pages. It also aligns these functions with the naming scheme for kvm_mmu_find_shadow_page() and kvm_mmu_alloc_shadow_page(). Prefer "shadow_page" over the shorter "sp" since these are core functions and the line lengths aren't terrible. No functional change intended. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Consolidate shadow page allocation and initializationDavid Matlack1-22/+17
Consolidate kvm_mmu_alloc_page() and kvm_mmu_alloc_shadow_page() under the latter so that all shadow page allocation and initialization happens in one place. No functional change intended. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Decompose kvm_mmu_get_page() into separate functionsDavid Matlack1-13/+39
Decompose kvm_mmu_get_page() into separate helper functions to increase readability and prepare for allocating shadow pages without a vcpu pointer. Specifically, pull the guts of kvm_mmu_get_page() into 2 helper functions: kvm_mmu_find_shadow_page() - Walks the page hash checking for any existing mmu pages that match the given gfn and role. kvm_mmu_alloc_shadow_page() - Allocates and initializes an entirely new kvm_mmu_page. This currently requires a vcpu pointer for allocation and looking up the memslot but that will be removed in a future commit. No functional change intended. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
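As a sketch, the resulting control flow looks roughly like the following (types and names abbreviated; hash-walk details and error handling omitted, so this is not the actual KVM code):

    /* Illustrative decomposition; simplified stand-ins for KVM's types. */
    struct kvm_vcpu;
    struct kvm_mmu_page;
    union mmu_page_role { unsigned int word; };

    struct kvm_mmu_page *find_shadow_page(struct kvm_vcpu *vcpu,
                                          unsigned long gfn,
                                          union mmu_page_role role);
    struct kvm_mmu_page *alloc_shadow_page(struct kvm_vcpu *vcpu,
                                           unsigned long gfn,
                                           union mmu_page_role role);

    static struct kvm_mmu_page *get_shadow_page(struct kvm_vcpu *vcpu,
                                                unsigned long gfn,
                                                union mmu_page_role role)
    {
            struct kvm_mmu_page *sp;

            /* Walk the page hash for an existing page with this gfn+role. */
            sp = find_shadow_page(vcpu, gfn, role);
            if (!sp)
                    /* Miss: allocate and initialize a brand new page. */
                    sp = alloc_shadow_page(vcpu, gfn, role);
            return sp;
    }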
2022-06-24KVM: x86/mmu: Always pass 0 for @quadrant when gptes are 8 bytesDavid Matlack1-6/+14
The quadrant is only used when gptes are 4 bytes, but mmu_alloc_{direct,shadow}_roots() pass in a non-zero quadrant for PAE page directories regardless. Make this less confusing by only passing in a non-zero quadrant when it is actually necessary. Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Derive shadow MMU page role from parentDavid Matlack2-52/+71
Instead of computing the shadow page role from scratch for every new page, derive most of the information from the parent shadow page. This eliminates the dependency on the vCPU root role to allocate shadow page tables, and reduces the number of parameters to kvm_mmu_get_page(). Preemptively split out the role calculation to a separate function for use in a following commit. Note that when calculating the MMU root role, we can take @role.passthrough, @role.direct, and @role.access directly from @vcpu->arch.mmu->root_role. Only @role.level and @role.quadrant still must be overridden for PAE page directories, when shadowing 32-bit guest page tables with PAE page tables. No functional change intended. Reviewed-by: Peter Xu <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
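A simplified sketch of deriving a child page's role from its parent; the bitfield layout below is invented for illustration and is not KVM's actual kvm_mmu_page_role.

    /* Illustrative only; the real role bits live in union kvm_mmu_page_role. */
    union page_role {
            struct {
                    unsigned int level:4;
                    unsigned int direct:1;
                    unsigned int access:3;
                    unsigned int quadrant:2;
            };
            unsigned int word;
    };

    struct shadow_page { union page_role role; };

    static union page_role sketch_child_role(const struct shadow_page *parent,
                                             unsigned int access,
                                             unsigned int quadrant)
    {
            union page_role role = parent->role;    /* inherit most fields */

            role.level = parent->role.level - 1;    /* one level further down */
            role.access = access;                   /* from the guest PTE */
            role.quadrant = quadrant;               /* 4-byte gptes only */
            return role;
    }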
2022-06-24KVM: x86/mmu: Stop passing "direct" to mmu_alloc_root()David Matlack1-5/+6
The "direct" argument is vcpu->arch.mmu->root_role.direct, because unlike non-root page tables, it's impossible to have a direct root in an indirect MMU. So just use that. Suggested-by: Lai Jiangshan <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Use a bool for directDavid Matlack1-2/+2
The parameter "direct" can either be true or false, and all of the callers pass in a bool variable or true/false literal, so just use the type bool. No functional change intended. Reviewed-by: Lai Jiangshan <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/mmu: Optimize MMU page cache lookup for all direct SPsDavid Matlack1-2/+2
Commit fb58a9c345f6 ("KVM: x86/mmu: Optimize MMU page cache lookup for fully direct MMUs") skipped the unsync checks and write flood clearing for full direct MMUs. We can extend this further to skip the checks for all direct shadow pages. Direct shadow pages in indirect MMUs (i.e. shadow paging) are used when shadowing a guest huge page with smaller pages. Such direct shadow pages, like their counterparts in fully direct MMUs, are never marked unsynced or have a non-zero write-flooding count. Checking sp->role.direct also generates better code than checking direct_map because, due to register pressure, direct_map has to get shoved onto the stack and then pulled back off. No functional change intended. Reviewed-by: Lai Jiangshan <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basisBen Gardon5-8/+41
In some cases, the NX hugepage mitigation for iTLB multihit is not needed for all guests on a host. Allow disabling the mitigation on a per-VM basis to avoid the performance hit of NX hugepages on trusted workloads. In order to disable NX hugepages on a VM, ensure that the userspace actor has permission to reboot the system. Since disabling NX hugepages would allow a guest to crash the system, it is similar to reboot permissions. Ideally, KVM would require userspace to prove it has access to KVM's nx_huge_pages module param, e.g. so that userspace can opt out without needing full reboot permissions. But getting access to the module param file info is difficult because it is buried in layers of sysfs and module glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases. Suggested-by: Jim Mattson <[email protected]> Reviewed-by: David Matlack <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
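A minimal sketch of the permission check this describes, assuming a generic per-VM enable-cap handler; the names below (capable_sys_boot(), disable_nx_huge_pages) are placeholders rather than the exact KVM identifiers.

    #include <stdbool.h>
    #include <errno.h>

    /* Stand-in for the kernel's capable(CAP_SYS_BOOT) check. */
    bool capable_sys_boot(void);

    struct vm_arch_sketch { bool disable_nx_huge_pages; };

    static int sketch_disable_nx_huge_pages(struct vm_arch_sketch *arch,
                                            unsigned long long flags)
    {
            if (flags)
                    return -EINVAL;
            /* Turning off the mitigation lets a malicious guest trigger the
             * iTLB multihit issue, so require reboot-equivalent privilege. */
            if (!capable_sys_boot())
                    return -EPERM;

            arch->disable_nx_huge_pages = true;
            return 0;
    }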
2022-06-24KVM: x86: Fix errant brace in KVM capability handlingBen Gardon1-1/+1
The braces around the KVM_CAP_XSAVE2 block also surround the KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply move the curly brace back to where it belongs. Fixes: ba7bb663f5547 ("KVM: x86: Provide per VM capability for disabling PMU virtualization") Reviewed-by: David Matlack <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Ben Gardon <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: SEV: Init target VMCBs in sev_migrate_fromPeter Gonda3-33/+48
The target VMCBs during an intra-host migration need to be correctly set up for running SEV and SEV-ES guests. Add a sev_init_vmcb() function and make sev_es_init_vmcb() static. sev_init_vmcb() uses the now private function to init SEV-ES guests' VMCBs when needed. Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Signed-off-by: Peter Gonda <[email protected]> Cc: Marc Orr <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: [email protected] Cc: [email protected] Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-24KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()Mingwei Zhang1-2/+2
Add the accounting flag when allocating pages within the SEV function, since these memory pages should belong to the individual VM. No functional change intended. Signed-off-by: Mingwei Zhang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behaviorSean Christopherson1-11/+19
Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT instructions as NOPs regardless of whether or not they are supported in guest CPUID. KVM's current behavior was likely motivated by a certain fruity operating system that expects MONITOR/MWAIT to be supported unconditionally and blindly executes MONITOR/MWAIT without first checking CPUID. And because KVM does NOT advertise MONITOR/MWAIT to userspace, that's effectively the default setup for any VMM that regurgitates KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2. Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT. The behavior is actually desirable, as userspace VMMs that want to unconditionally hide MONITOR/MWAIT from the guest can leave the MISC_ENABLE quirk enabled. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Ignore benign host writes to "unsupported" F15H_PERF_CTL MSRsSean Christopherson1-0/+1
Ignore host userspace writes of '0' to F15H_PERF_CTL MSRs that KVM reports in the MSR-to-save list but ultimately does not support. All MSRs in said list must be writable by userspace, e.g. if userspace sends the list back at KVM without filtering out the MSRs it doesn't need. Note, reads of said MSRs already have the desired behavior. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Ignore benign host accesses to "unsupported" PEBS and BTS MSRsSean Christopherson1-0/+17
Ignore host userspace reads and writes of '0' to PEBS and BTS MSRs that KVM reports in the MSR-to-save list but ultimately does not support. All MSRs in said list must be writable by userspace, e.g. if userspace sends the list back at KVM without filtering out the MSRs it doesn't need. Fixes: 8183a538cd95 ("KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS") Fixes: 902caeb6841a ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS") Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: VMX: Use vcpu_get_perf_capabilities() to get guest-visible valueSean Christopherson1-4/+7
Use vcpu_get_perf_capabilities() when querying MSR_IA32_PERF_CAPABILITIES from the guest's perspective, e.g. to update the vPMU and to determine which MSRs exist. If userspace ignores MSR_IA32_PERF_CAPABILITIES but clears X86_FEATURE_PDCM, the guest should see '0'. Fixes: 902caeb6841a ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS") Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20Revert "KVM: x86: always allow host-initiated writes to PMU MSRs"Sean Christopherson5-27/+20
Revert the hack to allow host-initiated accesses to all "PMU" MSRs, as intel_is_valid_msr() returns true for _all_ MSRs, regardless of whether or not it has a snowball's chance in hell of actually being a PMU MSR. That mostly gets papered over by the actual get/set helpers only handling MSRs that they know about, except there's the minor detail that kvm_pmu_{g,s}et_msr() eat reads and writes when the PMU is disabled. I.e. KVM will happily allow reads and writes to _any_ MSR if the PMU is disabled, either via module param or capability. This reverts commit d1c88a4020567ba4da52f778bcd9619d87e4ea75. Fixes: d1c88a402056 ("KVM: x86: always allow host-initiated writes to PMU MSRs") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20Revert "KVM: x86/pmu: Accept 0 for absent PMU MSRs when host-initiated if ↵Sean Christopherson2-18/+1
!enable_pmu" Eating reads and writes to all "PMU" MSRs when there is no PMU is wildly broken as it results in allowing accesses to _any_ MSR on Intel CPUs as intel_is_valid_msr() returns true for all host_initiated accesses. A revert of commit d1c88a402056 ("KVM: x86: always allow host-initiated writes to PMU MSRs") will soon follow. This reverts commit 8e6a58e28b34e8d247e772159b8fa8f6bae39192. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: VMX: Give host userspace full control of MSR_IA32_PERF_CAPABILITIESSean Christopherson1-2/+0
Do not manipulate MSR_IA32_PERF_CAPABILITIES in intel_pmu_refresh(), i.e. give userspace full control over capability/read-only MSRs. KVM is not a babysitter; it is userspace's responsibility to provide a valid and coherent vCPU model. Attempting to "help" the guest by forcing a consistent model creates edge cases, and ironically leads to inconsistent behavior. Example #1: KVM doesn't do intel_pmu_refresh() when userspace writes the MSR. Example #2: KVM doesn't clear the bits when the PMU is disabled, or when there's no architectural PMU. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Give host userspace full control of MSR_IA32_MISC_ENABLESSean Christopherson2-18/+11
Give userspace full control of the read-only bits in MISC_ENABLES, i.e. do not modify bits on PMU refresh and do not preserve existing bits when userspace writes MISC_ENABLES. With a few exceptions where KVM doesn't expose the necessary controls to userspace _and_ there is a clear cut association with CPUID, e.g. reserved CR4 bits, KVM does not own the vCPU and should not manipulate the vCPU model on behalf of "dummy user space". The argument that KVM is doing userspace a favor because "the order of setting vPMU capabilities and MSR_IA32_MISC_ENABLE is not strictly guaranteed" is specious, as attempting to configure MSRs on behalf of userspace inevitably leads to edge cases precisely because KVM does not prescribe a specific order of initialization. Example #1: intel_pmu_refresh() consumes and modifies the vCPU's MSR_IA32_PERF_CAPABILITIES, and so assumes userspace initializes config MSRs before setting the guest CPUID model. If userspace sets CPUID first, then KVM will mark PEBS as available when arch.perf_capabilities is initialized with a non-zero PEBS format, thus creating a bad vCPU model if userspace later disables PEBS by writing PERF_CAPABILITIES. Example #2: intel_pmu_refresh() does not clear PERF_CAP_PEBS_MASK in MSR_IA32_PERF_CAPABILITIES if there is no vPMU, making KVM inconsistent in its desire to be consistent. Example #3: intel_pmu_refresh() does not clear MSR_IA32_MISC_ENABLE_EMON if KVM_SET_CPUID2 is called multiple times, first with a vPMU, then without a vPMU. While slightly contrived, it's plausible a VMM could reflect KVM's default vCPU and then operate on KVM's copy of CPUID to later clear the vPMU settings, e.g. see KVM's selftests. Example #4: Enumerating an Intel vCPU on an AMD host will not call into intel_pmu_refresh() at any point, and so the BTS and PEBS "unavailable" bits will be left clear, without any way for userspace to set them. Keep the "R" behavior of the bit 7, "EMON available", for the guest. Unlike the BTS and PEBS bits, which are fully "RO", the EMON bit can be written with a different value, but that new value is ignored. Cc: Like Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Reported-by: kernel test robot <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86/mmu: Shove refcounted page dependency into host_pfn_mapping_level()Sean Christopherson2-7/+11
Move the check that restricts mapping huge pages into the guest to pfns that are backed by refcounted 'struct page' memory into the helper that actually "requires" a 'struct page', host_pfn_mapping_level(). In addition to deduplicating code, moving the check to the helper eliminates the subtle requirement that the caller check that the incoming pfn is backed by a refcounted struct page, and as an added bonus avoids an extra pfn_to_page() lookup. Note, the is_error_noslot_pfn() check in kvm_mmu_hugepage_adjust() needs to stay where it is, as it guards against dereferencing a NULL memslot in the kvm_slot_dirty_track_enabled() that follows. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()Sean Christopherson2-7/+10
Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() to better reflect what KVM is actually checking, and to eliminate extra pfn_to_page() lookups. The kvm_release_pfn_*() and kvm_try_get_pfn() helpers in particular benefit from "refcounted" nomenclature, as it's not all that obvious why KVM needs to get/put refcounts for some PG_reserved pages (ZERO_PAGE and ZONE_DEVICE). Add a comment to call out that the list of exceptions to PG_reserved is all but guaranteed to be incomplete. The list has mostly been compiled by people throwing noodles at KVM and finding out they stick a little too well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages didn't get freed. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page()Sean Christopherson1-2/+3
Operate on a 'struct page' instead of a pfn when checking if a page is a ZONE_DEVICE page, and rename the helper accordingly. Generally speaking, KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do anything special for ZONE_DEVICE memory. Rather, KVM wants to treat ZONE_DEVICE memory like regular memory, and the need to identify ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In other words, KVM should only ever check for ZONE_DEVICE memory after KVM has already verified that there is a struct page associated with the pfn. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: nVMX: Use kvm_vcpu_map() to get/pin vmcs12's APIC-access pageSean Christopherson2-28/+13
Use kvm_vcpu_map() to get/pin the backing for vmcs12's APIC-access page, there's no reason it has to be restricted to 'struct page' backing. The APIC-access page actually doesn't need to be backed by anything, which is ironically why it got left behind by the series which introduced kvm_vcpu_map()[1]; the plan was to shove a dummy pfn into vmcs02[2], but that code never got merged. Switching the APIC-access page to kvm_vcpu_map() doesn't preclude using a magic pfn in the future, and will allow a future patch to drop kvm_vcpu_gpa_to_page(). [1] https://lore.kernel.org/all/[email protected] [2] https://lore.kernel.org/lkml/[email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86/mmu: Use common logic for computing the 32/64-bit base PA maskSean Christopherson3-14/+1
Use common logic for computing PT_BASE_ADDR_MASK for 32-bit, 64-bit, and EPT paging. Both PAGE_MASK and the new common logic are supersets of what is actually needed for 32-bit paging. PAGE_MASK sets bits 63:12 and the former GUEST_PT64_BASE_ADDR_MASK sets bits 51:12, so regardless of which value is used, the result will always be bits 31:12. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86/mmu: Truncate paging32's PT_BASE_ADDR_MASK to 32 bitsSean Christopherson1-1/+1
Truncate paging32's PT_BASE_ADDR_MASK to a pt_element_t, i.e. to 32 bits. Ignoring PSE huge pages, the mask is only used in conjunction with gPTEs, which are 32 bits, and so the address is limited to bits 31:12. PSE huge pages encode PA bits 39:32 in PTE bits 20:13, i.e. need custom logic to handle their funky encoding regardless of PT_BASE_ADDR_MASK. Note, PT_LVL_OFFSET_MASK is somewhat confusing in that it computes the offset of the _gfn_, not of the gpa, i.e. not having bits 63:32 set in PT_BASE_ADDR_MASK is again correct. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
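To make the bit arithmetic from the two commits above concrete, here is a small standalone check; the macro names are illustrative, not the exact ones from KVM's paging headers.

    #include <stdint.h>

    /* Common 64-bit/EPT-style mask: bits 51:12. */
    #define SKETCH_PT64_BASE_ADDR_MASK  (((1ULL << 52) - 1) & ~0xfffULL)
    /* paging32: truncate to the 32-bit pt_element_t -> bits 31:12. */
    #define SKETCH_PT32_BASE_ADDR_MASK  ((uint32_t)SKETCH_PT64_BASE_ADDR_MASK)

    /* Whether you start from PAGE_MASK (bits 63:12) or from the 51:12 mask,
     * truncation to 32 bits yields the same 0xfffff000 value. */
    _Static_assert(SKETCH_PT32_BASE_ADDR_MASK == 0xfffff000u,
                   "32-bit gpte base address mask is bits 31:12");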
2022-06-20KVM: x86/mmu: Use common macros to compute 32/64-bit paging masksPaolo Bonzini2-37/+11
Dedup the code for generating (most of) the per-type PT_* masks in paging_tmpl.h. The relevant macros only vary based on the number of bits per level, and that smidge of info is already provided in a common form as PT_LEVEL_BITS. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86/mmu: Use separate namespaces for guest PTEs and shadow PTEsSean Christopherson8-59/+70
Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs (PT64). SPTE and PT64 are _mostly_ the same, but the few differences are quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is very much specific to SPTEs. Opportunistically (and temporarily) move most guest macros into paging.h to clearly associate them with shadow paging, and to ensure that they're not used as of this commit. A future patch will eliminate them entirely. Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant calculation is hot enough (when using shadow paging with 32-bit guests) that adding a per-context helper is undesirable, and burying the computation in paging_tmpl.h with a forward declaration isn't exactly an improvement. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86/mmu: Dedup macros for computing various page table masksSean Christopherson5-19/+29
Provide common helper macros to generate various masks, shifts, etc... for 32-bit vs. 64-bit page tables. Only the inputs differ, the actual calculations are identical. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
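The idea can be sketched as macros whose only input is the number of bits per level (10 for 32-bit paging, 9 for 64-bit and EPT); the names below are illustrative rather than the exact helpers added by the commit.

    #define SKETCH_PAGE_SHIFT 12

    /* Shift of a given level's index field within a virtual address. */
    #define SKETCH_PT_LEVEL_SHIFT(level, bits_per_level) \
            (SKETCH_PAGE_SHIFT + ((level) - 1) * (bits_per_level))

    /* Index into the page table at the given level. */
    #define SKETCH_PT_INDEX(addr, level, bits_per_level) \
            (((addr) >> SKETCH_PT_LEVEL_SHIFT(level, bits_per_level)) & \
             ((1ULL << (bits_per_level)) - 1))

    /* 64-bit paging, 9 bits/level: level 2 index starts at bit 21. */
    _Static_assert(SKETCH_PT_LEVEL_SHIFT(2, 9) == 21, "9 bits per level");
    /* 32-bit paging, 10 bits/level: level 2 index starts at bit 22. */
    _Static_assert(SKETCH_PT_LEVEL_SHIFT(2, 10) == 22, "10 bits per level");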
2022-06-20KVM: x86/mmu: Bury 32-bit PSE paging helpers in paging_tmpl.hSean Christopherson3-13/+17
Move a handful of one-off macros and helpers for 32-bit PSE paging into paging_tmpl.h and hide them behind "PTTYPE == 32". Under no circumstance should anything but 32-bit shadow paging care about PSE paging. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: VMX: Refactor 32-bit PSE PT creation to avoid using MMU macroSean Christopherson1-1/+1
Compute the number of PTEs to be filled for the 32-bit PSE page tables using the page size and the size of each entry. While using the MMU's PT32_ENT_PER_PAGE macro is arguably better in isolation, removing VMX's usage will allow a future namespacing cleanup to move the guest page table macros into paging_tmpl.h, out of the reach of code that isn't directly related to shadow paging. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
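The arithmetic being swapped in is simply page size divided by entry size; a tiny standalone check, with the constants assumed here rather than taken from the VMX code itself:

    #include <stdint.h>

    #define SKETCH_PAGE_SIZE 4096

    /* A 32-bit PSE page table holds PAGE_SIZE / sizeof(u32) = 1024 entries,
     * which is the loop bound the commit computes in place of the MMU's
     * PT32_ENT_PER_PAGE macro. */
    _Static_assert(SKETCH_PAGE_SIZE / sizeof(uint32_t) == 1024,
                   "1024 entries per 32-bit PSE page table");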
2022-06-20KVM: x86: Use lapic_in_kernel() to query in-kernel APIC in APICv helperSean Christopherson1-1/+1
Use lapic_in_kernel() in kvm_vcpu_apicv_active() to take advantage of the kvm_has_noapic_vcpu static branch. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Move "apicv_active" into "struct kvm_lapic"Sean Christopherson5-29/+23
Move the per-vCPU apicv_active flag into KVM's local APIC instance. APICv is fully dependent on an in-kernel local APIC, but that's not at all clear when reading the current code due to the flag being stored in the generic kvm_vcpu_arch struct. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Check for in-kernel xAPIC when querying APICv for directed yieldSean Christopherson1-1/+2
Use kvm_vcpu_apicv_active() to check if APICv is active when seeing if a vCPU is a candidate for directed yield due to a pending APICv interrupt. This will allow moving apicv_active into kvm_lapic without introducing a potential NULL pointer deref (kvm_vcpu_apicv_active() effectively adds a pre-check on the vCPU having an in-kernel APIC). No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: x86: Drop @vcpu parameter from kvm_x86_ops.hwapic_isr_update()Sean Christopherson2-5/+5
Drop the unused @vcpu parameter from hwapic_isr_update(). AMD/AVIC is unlikely to implement the helper, and VMX/APICv doesn't need the vCPU as it operates on the current VMCS. The result is somewhat odd, but allows for a decent amount of (future) cleanup in the APIC code. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: SVM: Drop unused AVIC / kvm_x86_ops declarationsSean Christopherson1-4/+0
Drop a handful of unused AVIC function declarations whose implementations were removed during the conversion to optional static calls. No functional change intended. Fixes: abb6d479e226 ("KVM: x86: make several APIC virtualization callbacks optional") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: nVMX: Update vmcs12 on BNDCFGS write, not at vmcs02=>vmcs12 syncSean Christopherson2-3/+6
Update vmcs12->guest_bndcfgs on intercepted writes to BNDCFGS from L2 instead of waiting until vmcs02 is synchronized to vmcs12. KVM always intercepts BNDCFGS accesses, so the only way the value in vmcs02 can change is via KVM's explicit VMWRITE during emulation. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: nVMX: Save BNDCFGS to vmcs12 iff relevant controls are exposed to L1Sean Christopherson1-1/+2
Save BNDCFGS to vmcs12 (from vmcs02) if and only if at least one of the load-on-entry or clear-on-exit fields for BNDCFGS is enumerated as an allowed-1 bit in vmcs12. Skipping the field avoids an unnecessary VMREAD when MPX is supported but not exposed to L1. Per Intel's SDM: If the processor supports either the 1-setting of the "load IA32_BNDCFGS" VM-entry control or that of the "clear IA32_BNDCFGS" VM-exit control, the contents of the IA32_BNDCFGS MSR are saved into the corresponding field. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
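A sketch of the resulting condition, using the architectural control-bit positions from the SDM; the function and struct names are placeholders, and in practice the allowed-1 values come from the VMX capability MSRs KVM exposes to L1.

    #include <stdint.h>

    #define SKETCH_VM_ENTRY_LOAD_BNDCFGS  (1u << 16)  /* VM-entry ctrl bit 16 */
    #define SKETCH_VM_EXIT_CLEAR_BNDCFGS  (1u << 23)  /* VM-exit ctrl bit 23 */

    struct vmcs12_sketch { uint64_t guest_bndcfgs; };

    uint64_t sketch_vmread_guest_bndcfgs(void);  /* stand-in for the VMREAD */

    static void sketch_sync_bndcfgs(struct vmcs12_sketch *vmcs12,
                                    uint32_t entry_ctls_allowed1,
                                    uint32_t exit_ctls_allowed1)
    {
            /* Skip the VMREAD unless L1 is allowed to load BNDCFGS on
             * VM-entry or clear it on VM-exit, per the SDM rule quoted
             * in the commit message. */
            if ((entry_ctls_allowed1 & SKETCH_VM_ENTRY_LOAD_BNDCFGS) ||
                (exit_ctls_allowed1 & SKETCH_VM_EXIT_CLEAR_BNDCFGS))
                    vmcs12->guest_bndcfgs = sketch_vmread_guest_bndcfgs();
    }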
2022-06-20KVM: nVMX: Rename nested.vmcs01_* fields to nested.pre_vmenter_*Sean Christopherson3-7/+23
Rename the fields in struct nested_vmx used to snapshot pre-VM-Enter values to reflect that they can hold L2's values when restoring nested state, e.g. if userspace restores MSRs before nested state. As crazy as it seems, restoring MSRs before nested state actually works (because KVM goes out of its way to make it work), even though the initial MSR writes will hit vmcs01 despite holding L2 values. Add a related comment to vmx_enter_smm() to call out that using the common VM-Exit and VM-Enter helpers to emulate SMI and RSM is wrong and broken. The few MSRs that have snapshots _could_ be fixed by taking a snapshot prior to the forced VM-Exit instead of at forced VM-Enter, but that's just the tip of the iceberg as the rather long list of MSRs that aren't snapshotted (hello, VM-Exit MSR load list) can't be handled this way. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: nVMX: Snapshot pre-VM-Enter DEBUGCTL for !nested_run_pending caseSean Christopherson1-1/+2
If a nested run isn't pending, snapshot vmcs01.GUEST_IA32_DEBUGCTL irrespective of whether or not VM_ENTRY_LOAD_DEBUG_CONTROLS is set in vmcs12. When restoring nested state, e.g. after migration, without a nested run pending, prepare_vmcs02() will propagate nested.vmcs01_debugctl to vmcs02, i.e. will load garbage/zeros into vmcs02.GUEST_IA32_DEBUGCTL. If userspace restores nested state before MSRs, then loading garbage is a non-issue as loading DEBUGCTL will also update vmcs02. But if userspace restores MSRs first, then KVM is responsible for propagating L2's value, which is actually thrown into vmcs01, into vmcs02. Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state is all kinds of bizarre and ideally would not be supported. Sadly, some VMMs do exactly that and rely on KVM to make things work. Note, there's still a lurking SMM bug, as propagating vmcs01's DEBUGCTL to vmcs02 across RSM may corrupt L2's DEBUGCTL. But KVM's entire VMX+SMM emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor. Link: https://lore.kernel.org/all/[email protected] Fixes: 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-20KVM: nVMX: Snapshot pre-VM-Enter BNDCFGS for !nested_run_pending caseSean Christopherson1-1/+2
If a nested run isn't pending, snapshot vmcs01.GUEST_BNDCFGS irrespective of whether or not VM_ENTRY_LOAD_BNDCFGS is set in vmcs12. When restoring nested state, e.g. after migration, without a nested run pending, prepare_vmcs02() will propagate nested.vmcs01_guest_bndcfgs to vmcs02, i.e. will load garbage/zeros into vmcs02.GUEST_BNDCFGS. If userspace restores nested state before MSRs, then loading garbage is a non-issue as loading BNDCFGS will also update vmcs02. But if userspace restores MSRs first, then KVM is responsible for propagating L2's value, which is actually thrown into vmcs01, into vmcs02. Restoring L2 MSRs into vmcs01, i.e. loading all MSRs before nested state is all kinds of bizarre and ideally would not be supported. Sadly, some VMMs do exactly that and rely on KVM to make things work. Note, there's still a lurking SMM bug, as propagating vmcs01.GUEST_BNDCFGS to vmcs02 across RSM may corrupt L2's BNDCFGS. But KVM's entire VMX+SMM emulation is flawed as SMI+RSM should not touch _any_ VMCS when using the "default treatment of SMIs", i.e. when not using an SMI Transfer Monitor. Link: https://lore.kernel.org/all/[email protected] Fixes: 62cf9bd8118c ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: [email protected] Cc: Lei Wang <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: x86/mmu: Use try_cmpxchg64 in fast_pf_fix_direct_spteUros Bizjak1-1/+1
Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in fast_pf_fix_direct_spte. cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Signed-off-by: Uros Bizjak <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
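For readers unfamiliar with the kernel primitive, the pattern change can be sketched in userspace with C11 atomics standing in for cmpxchg64/try_cmpxchg64; this is an illustration of the idiom, not the actual KVM code.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Kernel-style cmpxchg64(): returns the value found in memory, so the
     * caller must compare it against the expected value itself. */
    static uint64_t cmpxchg64_like(_Atomic uint64_t *p, uint64_t old,
                                   uint64_t new)
    {
            uint64_t expected = old;
            atomic_compare_exchange_strong(p, &expected, new);
            return expected;
    }

    static bool fix_spte_old(_Atomic uint64_t *sptep, uint64_t old, uint64_t new)
    {
            /* Extra compare after the cmpxchg, as described above. */
            return cmpxchg64_like(sptep, old, new) == old;
    }

    static bool fix_spte_new(_Atomic uint64_t *sptep, uint64_t *old, uint64_t new)
    {
            /* try_cmpxchg-style: success is reported directly, and *old is
             * updated in place on failure (which is what lets the tdp_mmu
             * patch below drop its explicit iter->old_spte reassignment). */
            return atomic_compare_exchange_strong(sptep, old, new);
    }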
2022-06-15KVM: VMX: Use try_cmpxchg64 in pi_try_set_controlUros Bizjak1-1/+1
Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in pi_try_set_control. cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg):

    b9: 88 44 24 60          mov    %al,0x60(%rsp)
    bd: 48 89 c8             mov    %rcx,%rax
    c0: c6 44 24 62 f2       movb   $0xf2,0x62(%rsp)
    c5: 48 8b 74 24 60       mov    0x60(%rsp),%rsi
    ca: f0 49 0f b1 34 24    lock cmpxchg %rsi,(%r12)
    d0: 48 39 c1             cmp    %rax,%rcx
    d3: 75 cf                jne    a4 <vmx_vcpu_pi_load+0xa4>

patched:

    c1: 88 54 24 60          mov    %dl,0x60(%rsp)
    c5: c6 44 24 62 f2       movb   $0xf2,0x62(%rsp)
    ca: 48 8b 54 24 60       mov    0x60(%rsp),%rdx
    cf: f0 48 0f b1 13       lock cmpxchg %rdx,(%rbx)
    d4: 75 d5                jne    ab <vmx_vcpu_pi_load+0xab>

Signed-off-by: Uros Bizjak <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Reported-by: kernel test robot <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: x86/mmu: Use try_cmpxchg64 in tdp_mmu_set_spte_atomicUros Bizjak1-11/+1
Use try_cmpxchg64 instead of cmpxchg64 (*ptr, old, new) != old in tdp_mmu_set_spte_atomic. cmpxchg returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, remove the explicit assignment to iter->old_spte when cmpxchg fails; this is what try_cmpxchg does implicitly. Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Reviewed-by: David Matlack <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-06-15KVM: VMX: Skip filter updates for MSRs that KVM is already interceptingSean Christopherson1-7/+11
When handling userspace MSR filter updates, recompute interception for possible passthrough MSRs if and only if KVM wants to disable interception. If KVM wants to intercept accesses, i.e. the associated bit is set in vmx->shadow_msr_intercept, then there's no need to set the intercept again as KVM will intercept the MSR regardless of userspace's wants. No functional change intended; the call to vmx_enable_intercept_for_msr() really is just a gigantic nop. Suggested-by: Aaron Lewis <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
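A sketch of the short-circuit being described; the names are simplified stand-ins for the vmx->shadow_msr_intercept bookkeeping and the real helper.

    #include <stdbool.h>

    struct msr_sketch {
            bool kvm_wants_intercept;  /* stand-in for the shadow_msr_intercept bit */
    };

    /* Stand-in for vmx_disable_intercept_for_msr(), which also consults the
     * userspace filter before actually clearing the intercept bit. */
    void sketch_disable_intercept_for_msr(struct msr_sketch *m);

    static void sketch_recompute_intercept(struct msr_sketch *m)
    {
            /* If KVM itself wants to intercept this MSR, a filter update
             * cannot make it pass-through; re-enabling interception would
             * be the "gigantic nop" the commit removes. */
            if (m->kvm_wants_intercept)
                    return;

            sketch_disable_intercept_for_msr(m);
    }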