Age | Commit message | Author | Files | Lines
2022-11-30 | KVM: x86: Remove unused argument in gpc_unmap_khva() | Michal Luczaj | 1 | -4/+4
Remove the unused @kvm argument from gpc_unmap_khva(). Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-30 | KVM: Shorten gfn_to_pfn_cache function names | Michal Luczaj | 4 | -40/+39
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Michal Luczaj <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-30 | KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary | David Woodhouse | 2 | -20/+97
Torture test the cases where the runstate crosses a page boundary, and especially the case where it's configured in 32-bit mode and doesn't, but then switching to 64-bit mode makes it go onto the second page. To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl also update the guest runstate area. It already did so if the actual runstate changed, as a side-effect of kvm_xen_update_runstate(). So doing it in the plain adjustment case is making it more consistent, as well as giving us a nice way to trigger the update without actually running the vCPU again and changing the values. Signed-off-by: David Woodhouse <[email protected]> Reviewed-by: Paul Durrant <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-30 | KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured | David Woodhouse | 6 | -20/+93
Closer inspection of the Xen code shows that we aren't supposed to be using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall. If we randomly set the top bit of ->state_entry_time for a guest that hasn't asked for it and doesn't expect it, that could make the runtimes fail to add up and confuse the guest. Without the flag it's perfectly safe for a vCPU to read its own vcpu_runstate_info; just not for one vCPU to read *another's*. I briefly pondered adding a word for the whole set of VMASST_TYPE_* flags but the only one we care about for HVM guests is this, so it seemed a bit pointless. Signed-off-by: David Woodhouse <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
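For context, the guest-side opt-in mentioned above is a single vm_assist hypercall. Below is a minimal, hypothetical guest-side sketch; the constants and the HYPERVISOR_vm_assist() wrapper are assumed to come from the Xen public interface headers and may differ slightly in the guest's environment.

    /* Hypothetical guest opt-in sketch; header names and constants are
     * assumptions based on the Xen public interface. */
    #include <xen/interface/xen.h>
    #include <asm/xen/hypercall.h>

    static void guest_enable_runstate_update_flag(void)
    {
            /* After this call the hypervisor sets XEN_RUNSTATE_UPDATE in
             * state_entry_time while the runstate area is being updated. */
            HYPERVISOR_vm_assist(VMASST_CMD_enable,
                                 VMASST_TYPE_runstate_update_flag);
    }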
2022-11-30 | KVM: x86/xen: Compatibility fixes for shared runstate area | David Woodhouse | 4 | -113/+276
The guest runstate area can be arbitrarily byte-aligned. In fact, even when a sane 32-bit guest aligns the overall structure nicely, the 64-bit fields in the structure end up being unaligned due to the fact that the 32-bit ABI only aligns them to 32 bits. So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE is buggy, because if it's unaligned then we can't update the whole field atomically; the low bytes might be observable before the _UPDATE bit is. Xen actually updates the *byte* containing that top bit, on its own. KVM should do the same. In addition, we cannot assume that the runstate area fits within a single page. One option might be to make the gfn_to_pfn cache cope with regions that cross a page — but getting a contiguous virtual kernel mapping of a discontiguous set of IOMEM pages is a distinctly non-trivial exercise, and it seems this is the *only* current use case for the GPC which would benefit from it. An earlier version of the runstate code did use a gfn_to_hva cache for this purpose, but it still had the single-page restriction because it used the uhva directly — because it needs to be able to do so atomically when the vCPU is being scheduled out, so it used pagefault_disable() around the accesses and didn't just use kvm_write_guest_cached() which has a fallback path. So... use a pair of GPCs for the first and potential second page covering the runstate area. We can get away with locking both at once because nothing else takes more than one GPC lock at a time so we can invent a trivial ordering rule. The common case where it's all in the same page is kept as a fast path, but in both cases, the actual guest structure (compat or not) is built up from the fields in @vx, following preset pointers to the state and times fields. The only difference is whether those pointers point to the kernel stack (in the split case) or to guest memory directly via the GPC. The fast path is also fixed to use a byte access for the XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual memcpy. Finally, Xen also does write the runstate area immediately when it's configured. Flip the kvm_xen_update_runstate() and …_guest() functions and call the latter directly when the runstate area is set. This means that other ioctls which modify the runstate also write it immediately to the guest when they do so, which is also intended. Update the xen_shinfo_test to exercise the pathological case where the XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is actually in a different page to the rest of the 64-bit word. Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
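To illustrate the byte-access trick described above, here is a rough, non-authoritative sketch of flagging an update in progress without assuming the 64-bit state_entry_time field is aligned. The helper name and the rs pointer are hypothetical; only the idea (touch the single byte that holds bit 63) reflects the commit message.

    /* Sketch under stated assumptions; not the exact kernel diff. */
    static void mark_runstate_update(struct vcpu_runstate_info *rs, bool begin)
    {
            /* Touch only the byte holding XEN_RUNSTATE_UPDATE (bit 63) so the
             * store is single-copy atomic even if the u64 is misaligned. */
            u8 *update_bit = (u8 *)&rs->state_entry_time + 7;   /* top byte, LE */

            if (begin) {
                    *update_bit |= XEN_RUNSTATE_UPDATE >> 56;
                    smp_wmb();      /* flag visible before the new times   */
            } else {
                    smp_wmb();      /* new times visible before clearing it */
                    *update_bit &= ~(u8)(XEN_RUNSTATE_UPDATE >> 56);
            }
    }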
2022-11-29 | KVM: selftests: Build access_tracking_perf_test for arm64 | Oliver Upton | 1 | -0/+1
Does exactly what it says on the tin. Reviewed-by: Gavin Shan <[email protected]> Signed-off-by: Oliver Upton <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | KVM: selftests: Have perf_test_util signal when to stop vCPUs | Oliver Upton | 4 | -12/+8
Signal that a test run is complete through perf_test_args instead of having tests open code a similar solution. Ensure that the field resets to false at the beginning of a test run as the structure is reused between test runs, eliminating a couple of bugs: access_tracking_perf_test hangs indefinitely on a subsequent test run, as 'done' remains true. The bug doesn't amount to much right now, as x86 supports a single guest mode. However, this is a precondition of enabling the test for other architectures with >1 guest mode, like arm64. memslot_modification_stress_test has the exact opposite problem, where subsequent test runs complete immediately as 'run_vcpus' remains false. Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> [oliver: added commit message, preserve spin_wait_for_next_iteration()] Signed-off-by: Oliver Upton <[email protected]> Reviewed-by: Gavin Shan <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | Documentation: document the ABI changes for KVM_CAP_ARM_MTE | Peter Collingbourne | 1 | -2/+3
Document both the restriction on VM_MTE_ALLOWED mappings and the relaxation for shared mappings. Signed-off-by: Peter Collingbourne <[email protected]> Acked-by: Catalin Marinas <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | KVM: arm64: permit all VM_MTE_ALLOWED mappings with MTE enabled | Peter Collingbourne | 1 | -8/+0
Certain VMMs such as crosvm have features (e.g. sandboxing) that depend on being able to map guest memory as MAP_SHARED. The current restriction on sharing MAP_SHARED pages with the guest is preventing the use of those features with MTE. Now that the races between tasks concurrently clearing tags on the same page have been fixed, remove this restriction. Note that this is a relaxation of the ABI. Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Steven Price <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | KVM: arm64: unify the tests for VMAs in memslots when MTE is enabled | Peter Collingbourne | 1 | -9/+16
Previously we allowed creating a memslot containing a private mapping that was not VM_MTE_ALLOWED, but would later reject KVM_RUN with -EFAULT. Now we reject the memory region at memslot creation time. Since this is a minor tweak to the ABI (a VMM that created one of these memslots would fail later anyway), no VMM to my knowledge has MTE support yet, and the hardware with the necessary features is not generally available, we can probably make this ABI change at this point. Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Catalin Marinas <[email protected]> Reviewed-by: Steven Price <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | arm64: mte: Lock a page for MTE tag initialisation | Catalin Marinas | 9 | -29/+60
Initialising the tags and setting PG_mte_tagged flag for a page can race between multiple set_pte_at() on shared pages or setting the stage 2 pte via user_mem_abort(). Introduce a new PG_mte_lock flag as PG_arch_3 and set it before attempting page initialisation. Given that PG_mte_tagged is never cleared for a page, consider setting this flag to mean page unlocked and wait on this bit with acquire semantics if the page is locked: - try_page_mte_tagging() - lock the page for tagging, return true if it can be tagged, false if already tagged. No acquire semantics if it returns true (PG_mte_tagged not set) as there is no serialisation with a previous set_page_mte_tagged(). - set_page_mte_tagged() - set PG_mte_tagged with release semantics. The two-bit locking is based on Peter Collingbourne's idea. Signed-off-by: Catalin Marinas <[email protected]> Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Steven Price <[email protected]> Cc: Will Deacon <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Peter Collingbourne <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
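A sketch of the intended caller pattern for the two helpers described above, based purely on the description (the real users are the set_pte_at() and user_mem_abort() paths); treat it as illustrative rather than the exact kernel code.

    /* Illustrative only: initialise tags exactly once per page. */
    if (try_page_mte_tagging(page)) {
            /* We won the race: the page is "locked" for tag initialisation. */
            mte_clear_page_tags(page_address(page));
            set_page_mte_tagged(page);      /* release: tags are now valid */
    }
    /* Otherwise another thread initialised (or is initialising) the tags;
     * page_mte_tagged() observes the flag with acquire semantics. */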
2022-11-29 | mm: Add PG_arch_3 page flag | Peter Collingbourne | 5 | -0/+5
As with PG_arch_2, this flag is only allowed on 64-bit architectures due to the shortage of bits available. It will be used by the arm64 MTE code in subsequent patches. Signed-off-by: Peter Collingbourne <[email protected]> Cc: Will Deacon <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Steven Price <[email protected]> [[email protected]: added flag preserving in __split_huge_page_tail()] Signed-off-by: Catalin Marinas <[email protected]> Reviewed-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | KVM: arm64: Simplify the sanitise_mte_tags() logic | Catalin Marinas | 1 | -25/+15
Currently sanitise_mte_tags() checks if it's an online page before attempting to sanitise the tags. Such detection should be done in the caller via the VM_MTE_ALLOWED vma flag. Since kvm_set_spte_gfn() does not have the vma, leave the page unmapped if not already tagged. Tag initialisation will be done on a subsequent access fault in user_mem_abort(). Signed-off-by: Catalin Marinas <[email protected]> [[email protected]: fix the page initializer] Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Steven Price <[email protected]> Cc: Will Deacon <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Peter Collingbourne <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-29 | arm64: mte: Fix/clarify the PG_mte_tagged semantics | Catalin Marinas | 11 | -18/+56
Currently the PG_mte_tagged page flag mostly means the page contains valid tags and it should be set after the tags have been cleared or restored. However, in mte_sync_tags() it is set before setting the tags to avoid, in theory, a race with concurrent mprotect(PROT_MTE) for shared pages. However, a concurrent mprotect(PROT_MTE) with a copy on write in another thread can cause the new page to have stale tags. Similarly, tag reading via ptrace() can read stale tags if the PG_mte_tagged flag is set before actually clearing/restoring the tags. Fix the PG_mte_tagged semantics so that it is only set after the tags have been cleared or restored. This is safe for swap restoring into a MAP_SHARED or CoW page since the core code takes the page lock. Add two functions to test and set the PG_mte_tagged flag with acquire and release semantics. The downside is that concurrent mprotect(PROT_MTE) on a MAP_SHARED page may cause tag loss. This is already the case for KVM guests if a VMM changes the page protection while the guest triggers a user_mem_abort(). Signed-off-by: Catalin Marinas <[email protected]> [[email protected]: fix build with CONFIG_ARM64_MTE disabled] Signed-off-by: Peter Collingbourne <[email protected]> Reviewed-by: Cornelia Huck <[email protected]> Reviewed-by: Steven Price <[email protected]> Cc: Will Deacon <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Peter Collingbourne <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
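The acquire/release pairing described above could be sketched roughly as follows; this is a simplified illustration of the barrier placement, not the authoritative arm64 header code.

    static inline void set_page_mte_tagged(struct page *page)
    {
            /* Make the tag writes visible before the flag is. */
            smp_wmb();
            set_bit(PG_mte_tagged, &page->flags);
    }

    static inline bool page_mte_tagged(struct page *page)
    {
            bool ret = test_bit(PG_mte_tagged, &page->flags);

            /* Pairs with set_page_mte_tagged(): if we observe the flag,
             * we also observe the tags written before it. */
            if (ret)
                    smp_rmb();
            return ret;
    }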
2022-11-29 | mm: Do not enable PG_arch_2 for all 64-bit architectures | Catalin Marinas | 6 | -7/+16
Commit 4beba9486abd ("mm: Add PG_arch_2 page flag") introduced a new page flag for all 64-bit architectures. However, even if an architecture is 64-bit, it may still have limited spare bits in the 'flags' member of 'struct page'. This may happen if an architecture enables SPARSEMEM without SPARSEMEM_VMEMMAP as is the case with the newly added loongarch. This architecture port needs 19 more bits for the sparsemem section information and, while it is currently fine with PG_arch_2, adding any more PG_arch_* flags will trigger build-time warnings. Add a new CONFIG_ARCH_USES_PG_ARCH_X option which can be selected by architectures that need more PG_arch_* flags beyond PG_arch_1. Select it on arm64. Signed-off-by: Catalin Marinas <[email protected]> [[email protected]: fix build with CONFIG_ARM64_MTE disabled] Signed-off-by: Peter Collingbourne <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Steven Price <[email protected]> Reviewed-by: Steven Price <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-28 | Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD | Paolo Bonzini | 19 | -162/+603
- Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of an unused function
2022-11-28 | KVM: x86: Advertise PREFETCHIT0/1 CPUID to user space | Jiaxi Chen | 2 | -1/+2
Latest Intel platform Granite Rapids has introduced a new instruction - PREFETCHIT0/1, which moves code to memory (cache) closer to the processor depending on specific hints. The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 14] PREFETCHIT0/1 is on a KVM-only subleaf. Plus an x86_FEATURE definition for this feature bit to direct it to the KVM entry. Advertise PREFETCHIT0/1 to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Jiaxi Chen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
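Several of the following entries advertise bits on the same CPUID.(EAX=7,ECX=1) subleaf. As a hedged userspace illustration (using PREFETCHIT0/1 as the example; the bit positions are taken from the commit messages, and the __get_cpuid_count() helper is the GCC/clang builtin assumed here):

    #include <stdbool.h>
    #include <cpuid.h>          /* GCC/clang CPUID helpers assumed available */

    static bool cpu_has_prefetchit01(void)
    {
            unsigned int eax, ebx, ecx, edx;

            /* Leaf 7, subleaf 1; returns 0 if the leaf is unsupported. */
            if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
                    return false;
            return edx & (1u << 14);    /* CPUID.(EAX=7,ECX=1):EDX[14] */
    }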
2022-11-28 | KVM: x86: Advertise AVX-NE-CONVERT CPUID to user space | Jiaxi Chen | 2 | -1/+2
AVX-NE-CONVERT is a new set of instructions which can convert low precision floating point like BF16/FP16 to high precision floating point FP32, and can also convert FP32 elements to BF16. This instruction allows the platform to have improved AI capabilities and better compatibility. The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 5] AVX-NE-CONVERT is on a KVM-only subleaf. Plus an x86_FEATURE definition for this feature bit to direct it to the KVM entry. Advertise AVX-NE-CONVERT to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Jiaxi Chen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: x86: Advertise AVX-VNNI-INT8 CPUID to user space | Jiaxi Chen | 2 | -1/+10
AVX-VNNI-INT8 is a new set of instructions in the latest Intel platform Sierra Forest, aimed at giving the platform superior AI capabilities. This instruction multiplies the individual bytes of two signed or unsigned source operands, then adds and accumulates the results into the destination dword element size operand. The bit definition: CPUID.(EAX=7,ECX=1):EDX[bit 4] AVX-VNNI-INT8 is on a new and sparse CPUID leaf, and none of the bits on this leaf have any real kernel use case for now. Given that, and to save space for kernel feature bits, move this new leaf to a KVM-only subleaf and add an x86_FEATURE definition for AVX-VNNI-INT8 to direct it to the KVM entry. Advertise AVX-VNNI-INT8 to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Jiaxi Chen <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | x86: KVM: Advertise AVX-IFMA CPUID to user space | Jiaxi Chen | 2 | -1/+3
AVX-IFMA is a new instruction in the latest Intel platform Sierra Forest. This instruction packed-multiplies unsigned 52-bit integers and adds the low/high 52-bit products to Qword Accumulators. The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 23] AVX-IFMA is on an expected-dense CPUID leaf and some other bits on this leaf have kernel usages. Given that, define this feature bit as X86_FEATURE_<name> in the kernel. Considering AVX-IFMA itself has no real kernel usage and /proc/cpuinfo already has too many unreadable flags, hide this one in /proc/cpuinfo. Advertise AVX-IFMA to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Jiaxi Chen <[email protected]> Acked-by: Borislav Petkov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | x86: KVM: Advertise AMX-FP16 CPUID to user space | Chang S. Bae | 2 | -1/+2
Latest Intel platform Granite Rapids has introduced a new instruction - AMX-FP16, which performs dot-products of two FP16 tiles and accumulates the results into a packed single precision tile. AMX-FP16 adds FP16 capability and also allows an FP16 GPU-trained model to run faster without loss of accuracy or added SW overhead. The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 21] AMX-FP16 is on an expected-dense CPUID leaf and some other bits on this leaf have kernel usages. Given that, define this feature bit as X86_FEATURE_<name> in the kernel. Considering AMX-FP16 itself has no real kernel usage and /proc/cpuinfo already has too many unreadable flags, hide this one in /proc/cpuinfo. Advertise AMX-FP16 to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Chang S. Bae <[email protected]> Signed-off-by: Jiaxi Chen <[email protected]> Acked-by: Borislav Petkov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | x86: KVM: Advertise CMPccXADD CPUID to user space | Jiaxi Chen | 2 | -1/+2
CMPccXADD is a new set of instructions in the latest Intel platform Sierra Forest. This new instruction set includes a semaphore operation that can compare and add the operands if the condition is met, which can improve database performance. The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 7] CMPccXADD is on an expected-dense CPUID leaf and some other bits on this leaf have kernel usages. Given that, define this feature bit as X86_FEATURE_<name> in the kernel. Considering CMPccXADD itself has no real kernel usage and /proc/cpuinfo already has too many unreadable flags, hide this one in /proc/cpuinfo. Advertise CMPCCXADD to KVM userspace. This is safe because there are no new VMX controls or additional host enabling required for guests to use this feature. Signed-off-by: Jiaxi Chen <[email protected]> Acked-by: Borislav Petkov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: x86: Update KVM-only leaf handling to allow for 100% KVM-only leafs | Sean Christopherson | 2 | -7/+19
Rename kvm_cpu_cap_init_scattered() to kvm_cpu_cap_init_kvm_defined() in anticipation of adding KVM-only CPUID leafs that aren't recognized by the kernel and thus not scattered, i.e. for leafs that are 100% KVM-defined. Adjust/add comments to kvm_only_cpuid_leafs and KVM_X86_FEATURE to document how to create new kvm_only_cpuid_leafs entries for scattered features as well as features that are entirely unknown to the kernel. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: x86: Add BUILD_BUG_ON() to detect bad usage of "scattered" flags | Sean Christopherson | 1 | -1/+7
Add a compile-time assert in the SF() macro to detect improper usage, i.e. to detect passing in an X86_FEATURE_* flag that isn't actually scattered by the kernel. Upcoming feature flags will be 100% KVM-only and will have X86_FEATURE_* macros that point at a kvm_only_cpuid_leafs word, not a kernel-defined word. Using SF() and thus boot_cpu_has() for such feature flags would access memory beyond x86_capability[NCAPINTS] and at best incorrectly hide a feature, and at worst leak kernel state to userspace. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
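A rough sketch of the guard being described (names approximate; the point is that the compile-time assert rejects feature flags whose word lives outside the kernel's x86_capability[] array, since boot_cpu_has() can only index kernel-tracked words):

    /* Sketch only: SF() may be used exclusively with features the kernel
     * itself tracks, because boot_cpu_has() indexes x86_capability[]. */
    #define SF(name)                                                    \
    ({                                                                  \
            BUILD_BUG_ON(X86_FEATURE_##name >= MAX_CPU_FEATURES);      \
            (boot_cpu_has(X86_FEATURE_##name) ? F(name) : 0);          \
    })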
2022-11-28 | MAINTAINERS: Add KVM x86/xen maintainer list | David Woodhouse | 1 | -0/+10
Adding Paul as co-maintainer of Xen support to help ensure that things don't fall through the cracks when I spend three months at a time travelling... Signed-off-by: David Woodhouse <[email protected]> Reviewed-by: Paul Durrant <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: x86/xen: Add CPL to Xen hypercall tracepoint | David Woodhouse | 2 | -7/+10
Signed-off-by: David Woodhouse <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: always declare prototype for kvm_arch_irqchip_in_kernel | Paolo Bonzini | 1 | -1/+1
Architecture code might want to use it even if CONFIG_HAVE_KVM_IRQ_ROUTING is false; for example PPC XICS has KVM_IRQ_LINE and wants to use kvm_arch_irqchip_in_kernel from there, but it does not have KVM_SET_GSI_ROUTING so the prototype was not provided. Fixes: d663b8a28598 ("KVM: replace direct irq.h inclusion") Reported-by: kernel test robot <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-28 | KVM: arm64: PMU: Sanitise PMCR_EL0.LP on first vcpu run | Marc Zyngier | 2 | -4/+10
Userspace can play some dirty tricks on us by selecting a given PMU version (such as PMUv3p5), restoring a PMCR_EL0 value that has PMCR_EL0.LP set, and then switching the PMU version to PMUv3p1, for example. In this situation, we end up with PMCR_EL0.LP being set and spreading havoc in the PMU emulation. This is especially hard as the first two steps can be done on one vcpu and the third step on another, meaning that we need to sanitise *all* vcpus when the PMU version is changed. In order to avoid a pretty complicated locking situation, defer the sanitisation of PMCR_EL0 to the point where the vcpu is actually run for the first time, using the existing KVM_REQ_RELOAD_PMU request that calls into kvm_pmu_handle_pmcr(). There is still an obscure corner case where userspace could do the above trick, and then save the VM without running it. They would then observe an inconsistent state (PMUv3.1 + LP set), but that state will be fixed on the first run anyway whenever the guest gets restored on a host. Reported-by: Reiji Watanabe <[email protected]> Signed-off-by: Marc Zyngier <[email protected]>
2022-11-28 | KVM: arm64: PMU: Simplify PMCR_EL0 reset handling | Marc Zyngier | 1 | -12/+6
Resetting PMCR_EL0 is a pretty involved process that includes poisoning some of the writable bits, just because we can. It makes it hard to reason about what gets configured, and just resetting things to 0 seems like a much saner option. Reduce reset_pmcr() to just preserving PMCR_EL0.N from the host, and setting PMCR_EL0.LC if we don't support AArch32. Signed-off-by: Marc Zyngier <[email protected]>
2022-11-28 | KVM: arm64: PMU: Replace version number '0' with ID_AA64DFR0_EL1_PMUVer_NI | Anshuman Khandual | 1 | -2/+3
kvm_host_pmu_init() returns when the detected PMU is either not implemented or implementation defined. kvm_pmu_probe_armpmu() also has a similar situation. The extracted ID_AA64DFR0_EL1_PMUVer value when the PMU is not implemented is '0', which can be replaced with ID_AA64DFR0_EL1_PMUVer_NI, defined as '0b0000'. Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Anshuman Khandual <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-23 | Merge branch 'kvm-dwmw2-fixes' into HEAD | Paolo Bonzini | 2 | -10/+29
This brings in a few important fixes for Xen emulation. While nobody should be enabling it, the bug effectively allows userspace to read arbitrary memory. Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-23 | KVM: Update gfn_to_pfn_cache khva when it moves within the same page | David Woodhouse | 1 | -1/+6
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95ea180 ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by: David Woodhouse <[email protected]> Reviewed-by: Paul Durrant <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
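The fix being described boils down to recomputing the in-page offset even when the kernel mapping itself is reused. A hypothetical sketch (field names follow struct gfn_to_pfn_cache; this is not the exact diff):

    /* Sketch: the pfn, and hence the kernel mapping, is unchanged, but the
     * cached address must still point at the new offset within the page. */
    unsigned long page_offset = offset_in_page(gpc->gpa);

    gpc->khva = (void *)((unsigned long)gpc->khva & PAGE_MASK) + page_offset;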
2022-11-23 | KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0 | David Woodhouse | 1 | -1/+11
There are almost no hypercalls which are valid from CPL > 0, and definitely none which are handled by the kernel. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Reported-by: Michal Luczaj <[email protected]> Signed-off-by: David Woodhouse <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
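The shape of the check, as a hedged sketch; the CPL accessor call and the bail-out label are assumptions used for illustration only.

    /* Sketch: only accelerate hypercalls issued from CPL0 (ring 0); anything
     * else is punted back to the userspace VMM as before. */
    if (static_call(kvm_x86_get_cpl)(vcpu) != 0)
            goto out_to_userspace;      /* hypothetical label */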
2022-11-23 | KVM: x86/xen: Validate port number in SCHEDOP_poll | David Woodhouse | 1 | -8/+12
We shouldn't allow guests to poll on arbitrary port numbers off the end of the event channel table. Fixes: 1a65105a5aba ("KVM: x86/xen: handle PV spinlocks slowpath") [dwmw2: my bug though; the original version did check the validity as a side-effect of an idr_find() which I ripped out in refactoring.] Reported-by: Michal Luczaj <[email protected]> Signed-off-by: David Woodhouse <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
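As a hedged illustration of the bounds check: max_evtchn_port() is assumed here to return the size of the guest's event channel table for its current ABI, and the surrounding variables are illustrative.

    /* Sketch: reject out-of-range ports before doing any lookup. */
    for (i = 0; i < sched_poll.nr_ports; i++) {
            if (ports[i] >= max_evtchn_port(vcpu->kvm)) {
                    ret = -EINVAL;
                    goto out;           /* hypothetical cleanup label */
            }
    }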
2022-11-23 | KVM: s390: remove unused gisa_clear_ipm_gisc() function | Heiko Carstens | 1 | -5/+0
clang warns about an unused function: arch/s390/kvm/interrupt.c:317:20: error: unused function 'gisa_clear_ipm_gisc' [-Werror,-Wunused-function] static inline void gisa_clear_ipm_gisc(struct kvm_s390_gisa *gisa, u32 gisc) Remove gisa_clear_ipm_gisc(), since it is unused and get rid of this warning. Signed-off-by: Heiko Carstens <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
2022-11-23 | s390/vfio-ap: GISA: sort out physical vs virtual pointers usage | Nico Boehr | 1 | -1/+1
Fix virtual vs physical address confusion (which currently are the same) for the GISA when enabling the IRQ. Signed-off-by: Nico Boehr <[email protected]> Reviewed-by: Halil Pasic <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
2022-11-23 | KVM: s390: pv: module parameter to fence asynchronous destroy | Claudio Imbrenda | 1 | -1/+7
Add the module parameter "async_destroy", to allow the asynchronous destroy mechanism to be switched off. This might be useful for debugging purposes. The parameter is enabled by default since the feature is opt-in anyway. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
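A minimal sketch of what such an opt-out parameter looks like in kernel code; the exact type, permissions and description string in the real patch may differ.

    /* Illustrative: switch off with something like kvm.async_destroy=0. */
    static bool async_destroy = true;
    module_param(async_destroy, bool, 0444);
    MODULE_PARM_DESC(async_destroy,
                     "Use asynchronous destroy for protected guests (default: on)");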
2022-11-23 | KVM: s390: pv: support for Destroy fast UVC | Claudio Imbrenda | 2 | -8/+63
Add support for the Destroy Secure Configuration Fast Ultravisor call, and take advantage of it for asynchronous destroy. When supported, the protected guest is destroyed immediately using the new UVC, leaving only the memory to be cleaned up asynchronously. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
2022-11-23 | KVM: s390: pv: avoid export before import if possible | Claudio Imbrenda | 1 | -0/+7
If the appropriate UV feature bit is set, there is no need to perform an export before import. The misc feature indicates, among other things, that importing a shared page from a different protected VM will automatically also transfer its ownership. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
2022-11-23 | KVM: s390: pv: add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE | Claudio Imbrenda | 2 | -0/+4
Add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE to signal that the KVM_PV_ASYNC_DISABLE and KVM_PV_ASYNC_DISABLE_PREPARE commands for the KVM_S390_PV_COMMAND ioctl are available. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
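From the VMM side, a hypothetical probe for the new capability using the standard KVM_CHECK_EXTENSION ioctl; kvm_fd is assumed to be an open /dev/kvm file descriptor.

    int ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                    KVM_CAP_S390_PROTECTED_ASYNC_DISABLE);

    /* A positive return means KVM_PV_ASYNC_DISABLE{,_PREPARE} are usable. */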
2022-11-23 | KVM: s390: pv: api documentation for asynchronous destroy | Claudio Imbrenda | 1 | -4/+37
Add documentation for the new commands added to the KVM_S390_PV_COMMAND ioctl. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
2022-11-23 | KVM: s390: pv: asynchronous destroy for reboot | Claudio Imbrenda | 5 | -18/+333
Until now, destroying a protected guest was an entirely synchronous operation that could potentially take a very long time, depending on the size of the guest, due to the time needed to clean up the address space from protected pages. This patch implements an asynchronous destroy mechanism, that allows a protected guest to reboot significantly faster than previously. This is achieved by clearing the pages of the old guest in background. In case of reboot, the new guest will be able to run in the same address space almost immediately. The old protected guest is then only destroyed when all of its memory has been destroyed or otherwise made non protected. Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl: KVM_PV_ASYNC_CLEANUP_PREPARE: set aside the current protected VM for later asynchronous teardown. The current KVM VM will then continue immediately as non-protected. If a protected VM had already been set aside for asynchronous teardown, but without starting the teardown process, this call will fail. There can be at most one VM set aside at any time. Once it is set aside, the protected VM only exists in the context of the Ultravisor, it is not associated with the KVM VM anymore. Its protected CPUs have already been destroyed, but not its memory. This command can be issued again immediately after starting KVM_PV_ASYNC_CLEANUP_PERFORM, without having to wait for completion. KVM_PV_ASYNC_CLEANUP_PERFORM: tears down the protected VM previously set aside using KVM_PV_ASYNC_CLEANUP_PREPARE. Ideally the KVM_PV_ASYNC_CLEANUP_PERFORM PV command should be issued by userspace from a separate thread. If a fatal signal is received (or if the process terminates naturally), the command will terminate immediately without completing. All protected VMs whose teardown was interrupted will be put in the need_cleanup list. The rest of the normal KVM teardown process will take care of properly cleaning up all remaining protected VMs, including the ones on the need_cleanup list. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Nico Boehr <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Steffen Eiden <[email protected]> Link: https://lore.kernel.org/r/[email protected] Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]>
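For orientation, a hypothetical VMM-side reboot path using the two new commands; struct kvm_pv_cmd and KVM_S390_PV_COMMAND are the existing UAPI, vm_fd is assumed to be the VM file descriptor, and error handling is omitted.

    struct kvm_pv_cmd cmd = { .cmd = KVM_PV_ASYNC_CLEANUP_PREPARE };

    if (ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd) == 0) {
            /* The VM is non-protected again and can reboot right away;
             * the old protected VM's memory is torn down separately. */
            cmd.cmd = KVM_PV_ASYNC_CLEANUP_PERFORM;
            ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);  /* ideally on its own thread */
    }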
2022-11-22 | KVM: arm64: Reject shared table walks in the hyp code | Oliver Upton | 2 | -3/+19
Exclusive table walks are the only supported table walk in the hyp, as there is no construct like RCU available in the hypervisor code. Reject any attempt to do a shared table walk by returning an error and allowing the caller to clean up the mess. Suggested-by: Will Deacon <[email protected]> Signed-off-by: Oliver Upton <[email protected]> Acked-by: Will Deacon <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-22 | KVM: arm64: Don't acquire RCU read lock for exclusive table walks | Oliver Upton | 2 | -8/+10
Marek reported a BUG resulting from the recent parallel faults changes, as the hyp stage-1 map walker attempted to allocate table memory while holding the RCU read lock: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 2 locks held by swapper/0/1: #0: ffff80000a8a44d0 (kvm_hyp_pgd_mutex){+.+.}-{3:3}, at: __create_hyp_mappings+0x80/0xc4 #1: ffff80000a927720 (rcu_read_lock){....}-{1:2}, at: kvm_pgtable_walk+0x0/0x1f4 CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc3+ #5918 Hardware name: Raspberry Pi 3 Model B (DT) Call trace: dump_backtrace.part.0+0xe4/0xf0 show_stack+0x18/0x40 dump_stack_lvl+0x8c/0xb8 dump_stack+0x18/0x34 __might_resched+0x178/0x220 __might_sleep+0x48/0xa0 prepare_alloc_pages+0x178/0x1a0 __alloc_pages+0x9c/0x109c alloc_page_interleave+0x1c/0xc4 alloc_pages+0xec/0x160 get_zeroed_page+0x1c/0x44 kvm_hyp_zalloc_page+0x14/0x20 hyp_map_walker+0xd4/0x134 kvm_pgtable_visitor_cb.isra.0+0x38/0x5c __kvm_pgtable_walk+0x1a4/0x220 kvm_pgtable_walk+0x104/0x1f4 kvm_pgtable_hyp_map+0x80/0xc4 __create_hyp_mappings+0x9c/0xc4 kvm_mmu_init+0x144/0x1cc kvm_arch_init+0xe4/0xef4 kvm_init+0x3c/0x3d0 arm_init+0x20/0x30 do_one_initcall+0x74/0x400 kernel_init_freeable+0x2e0/0x350 kernel_init+0x24/0x130 ret_from_fork+0x10/0x20 Since the hyp stage-1 table walkers are serialized by kvm_hyp_pgd_mutex, RCU protection really doesn't add anything. Don't acquire the RCU read lock for an exclusive walk. Reported-by: Marek Szyprowski <[email protected]> Signed-off-by: Oliver Upton <[email protected]> Acked-by: Will Deacon <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
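The resulting behaviour can be sketched as follows (simplified from the description; KVM_PGTABLE_WALK_SHARED is the flag from the parallel-faults series, and the helper bodies here are illustrative):

    /* Sketch: exclusive walks (e.g. hyp stage-1, serialized by a mutex) no
     * longer enter an RCU read-side critical section, so they may sleep. */
    static inline void kvm_pgtable_walk_begin(struct kvm_pgtable_walker *walker)
    {
            if (walker->flags & KVM_PGTABLE_WALK_SHARED)
                    rcu_read_lock();
    }

    static inline void kvm_pgtable_walk_end(struct kvm_pgtable_walker *walker)
    {
            if (walker->flags & KVM_PGTABLE_WALK_SHARED)
                    rcu_read_unlock();
    }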
2022-11-22 | KVM: arm64: Take a pointer to walker data in kvm_dereference_pteref() | Oliver Upton | 2 | -74/+76
Rather than passing through the state of the KVM_PGTABLE_WALK_SHARED flag, just take a pointer to the whole walker structure instead. Move around struct kvm_pgtable and the RCU indirection such that the associated ifdeffery remains in one place while ensuring the walker + flags definitions precede their use. No functional change intended. Signed-off-by: Oliver Upton <[email protected]> Acked-by: Will Deacon <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-11-21 | KVM: selftests: Rename 'evmcs_test' to 'hyperv_evmcs' | Vitaly Kuznetsov | 3 | -2/+2
Conform to the rest of Hyper-V emulation selftests which have 'hyperv' prefix. Get rid of '_test' suffix as well as the purpose of this code is fairly obvious. Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-21 | KVM: selftests: hyperv_svm_test: Introduce L2 TLB flush test | Vitaly Kuznetsov | 2 | -5/+58
Enable Hyper-V L2 TLB flush and check that Hyper-V TLB flush hypercalls from L2 don't exit to L1 unless 'TlbLockCount' is set in the Partition assist page. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-21 | KVM: selftests: evmcs_test: Introduce L2 TLB flush test | Vitaly Kuznetsov | 2 | -2/+50
Enable Hyper-V L2 TLB flush and check that Hyper-V TLB flush hypercalls from L2 don't exit to L1 unless 'TlbLockCount' is set in the Partition assist page. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-21 | KVM: selftests: Introduce rdmsr_from_l2() and use it for MSR-Bitmap tests | Vitaly Kuznetsov | 2 | -21/+23
Hyper-V MSR-Bitmap tests do RDMSR from L2 to exit to L1. While 'evmcs_test' correctly clobbers all GPRs (which are not preserved), 'hyperv_svm_test' does not. Introduce a more generic rdmsr_from_l2() to avoid code duplication and remove hardcoding of MSRs. Do not put it in common code because it is really just a selftests bug rather than a processor feature that requires it. Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2022-11-21 | KVM: selftests: Stuff RAX/RCX with 'safe' values in vmmcall()/vmcall() | Vitaly Kuznetsov | 3 | -10/+24
vmmcall()/vmcall() are used to exit from L2 to L1 and no concrete hypercall ABI is currently followed. With the introduction of Hyper-V L2 TLB flush it becomes (theoretically) possible that L0 will take responsibility for handling the call and no L1 exit will happen. Prevent this by stuffing RAX (KVM ABI) and RCX (Hyper-V ABI) with 'safe' values. While on it, convert vmmcall() to 'static inline', make it set up a stack frame and move it to include/x86_64/svm_util.h. Signed-off-by: Vitaly Kuznetsov <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
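A hedged sketch of the stuffing idea; the function name and the constants are illustrative, and the point is only that neither the KVM ABI (RAX) nor the Hyper-V ABI (RCX) will see a plausible hypercall number before the exit to L1.

    /* Sketch for an SVM guest; a VMX guest would use "vmcall" instead. */
    static inline void vmmcall_safe(void)
    {
            asm volatile("vmmcall"
                         : /* no outputs */
                         : "a" (0xdeadbeefUL),   /* RAX: not a valid KVM hypercall     */
                           "c" (0xbeefdeadUL)    /* RCX: not a valid Hyper-V hypercall */
                         : "memory");
    }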