2021-11-11  KVM: VMX: Macrofy the MSR bitmap getters and setters  (Sean Christopherson; 1 file, -60/+25)
Add builder macros to generate the MSR bitmap helpers to reduce the amount of copy-paste code, especially with respect to all the magic numbers needed to calc the correct bit location. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
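A sketch of the approach (the offsets follow the SDM's MSR-bitmap layout, with read bitmaps at 0x0/0x400 and write bitmaps at 0x800/0xC00; exact macro spelling may differ from the patch):

    #define __BUILD_VMX_MSR_BITMAP_HELPER(rtype, action, bitop, access, base)      \
    static inline rtype vmx_##action##_msr_bitmap_##access(unsigned long *bitmap,  \
                                                           u32 msr)                \
    {                                                                              \
            int f = sizeof(unsigned long);                                         \
                                                                                   \
            if (msr <= 0x1fff)                                                     \
                    return bitop##_bit(msr, bitmap + base / f);                    \
            else if (msr >= 0xc0000000 && msr <= 0xc0001fff)                       \
                    return bitop##_bit(msr & 0x1fff, bitmap + (base + 0x400) / f); \
            return (rtype)true;                                                    \
    }

    /* Generates vmx_{test,clear,set}_msr_bitmap_{read,write}(). */
    #define BUILD_VMX_MSR_BITMAP_HELPERS(rtype, action, bitop)                     \
            __BUILD_VMX_MSR_BITMAP_HELPER(rtype, action, bitop, read,  0x0)        \
            __BUILD_VMX_MSR_BITMAP_HELPER(rtype, action, bitop, write, 0x800)

    BUILD_VMX_MSR_BITMAP_HELPERS(bool, test, test)
    BUILD_VMX_MSR_BITMAP_HELPERS(void, clear, __clear)
    BUILD_VMX_MSR_BITMAP_HELPERS(void, set, __set)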
2021-11-11  KVM: nVMX: Handle dynamic MSR intercept toggling  (Sean Christopherson; 3 files, -110/+111)
Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2, and always update the relevant bits in vmcs02. This fixes two distinct, but intertwined bugs related to dynamic MSR bitmap modifications. The first issue is that KVM fails to enable MSR interception in vmcs02 for the FS/GS base MSRs if L1 first runs L2 with interception disabled, and later enables interception. The second issue is that KVM fails to honor userspace MSR filtering when preparing vmcs02. Fix both issues simultaneously, as fixing only one of them (it doesn't matter which) would create a mess that no one should have to bisect. Fixing only the first bug would exacerbate the MSR filtering issue as userspace would see inconsistent behavior depending on the whims of L1. Fixing only the second bug (MSR filtering) effectively requires fixing the first, as the nVMX code only knows how to transition vmcs02's bitmap from 1->0. Move the various accessors/mutators that are currently buried in vmx.c into vmx.h so that they can be shared by the nested code. Fixes: 1a155254ff93 ("KVM: x86: Introduce MSR filtering") Fixes: d69129b4e46a ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible") Cc: [email protected] Cc: Alexander Graf <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11  KVM: nVMX: Query current VMCS when determining if MSR bitmaps are in use  (Sean Christopherson; 1 file, -4/+4)
Check the current VMCS controls to determine if an MSR write will be intercepted due to MSR bitmaps being disabled. In the nested VMX case, KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or if KVM can't map L1's bitmaps for whatever reason. Note, the bad behavior is relatively benign in the current code base as KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and only if L0 KVM also disables interception of an MSR, and only uses the buggy helper for MSR_IA32_SPEC_CTRL. Because KVM explicitly tests WRMSR before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it isn't strictly necessary. Tag the fix for stable in case a future fix wants to use msr_write_intercepted(), in which case a buggy implementation in older kernels could prove subtly problematic. Fixes: d28b387fb74d ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
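A minimal sketch of the corrected helper (assuming the bitmap accessors from the "macrofy" patch above); the key point is consulting the controls of the currently loaded VMCS rather than a VM-wide capability:

    static bool msr_write_intercepted(struct vcpu_vmx *vmx, u32 msr)
    {
            /* No bitmap in use: every MSR access is intercepted. */
            if (!(exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS))
                    return true;

            return vmx_test_msr_bitmap_write(vmx->loaded_vmcs->msr_bitmap, msr);
    }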
2021-11-11  KVM: x86: Don't update vcpu->arch.pv_eoi.msr_val when a bogus value was written to MSR_KVM_PV_EOI_EN  (Vitaly Kuznetsov; 1 file, -8/+13)
When the kvm_gfn_to_hva_cache_init() call from kvm_lapic_set_pv_eoi() fails, the MSR write to MSR_KVM_PV_EOI_EN results in a #GP, so it is reasonable to expect that the value KVM keeps internally is not updated. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
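Roughly, the fix orders the update so the internal value is only committed once the cache init has succeeded (a simplified sketch, not the full patch):

    int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
    {
            u64 addr = data & ~KVM_MSR_ENABLED;

            if (data & KVM_MSR_ENABLED) {
                    if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.pv_eoi.data,
                                                  addr, len))
                            return 1;       /* #GP: leave msr_val untouched */
            }

            /* Commit only after the cache has been (re)initialized. */
            vcpu->arch.pv_eoi.msr_val = data;
            return 0;
    }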
2021-11-11  KVM: x86: Rename kvm_lapic_enable_pv_eoi()  (Vitaly Kuznetsov; 4 files, -5/+5)
kvm_lapic_enable_pv_eoi() is a misnomer as the function is also used to disable PV EOI. Rename it to kvm_lapic_set_pv_eoi(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11  KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES  (Paul Durrant; 5 files, -8/+47)
Currently, when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introduces a new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves, and a function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: Paul Durrant <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
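A sketch of the lookup (signature constant and iteration macro as described above; the signature lives in EBX/ECX/EDX of the base leaf):

    #define KVM_SIGNATURE "KVMKVMKVM\0\0\0"

    /* Candidate hypervisor base leaves: 0x40000000, 0x40000100, ... */
    #define for_each_possible_hypervisor_cpuid_base(function) \
            for (function = 0x40000000; function < 0x40010000; function += 0x100)

    static void kvm_update_kvm_cpuid_base(struct kvm_vcpu *vcpu)
    {
            u32 function, signature[3];
            struct kvm_cpuid_entry2 *entry;

            vcpu->arch.kvm_cpuid_base = 0;

            for_each_possible_hypervisor_cpuid_base(function) {
                    entry = kvm_find_cpuid_entry(vcpu, function, 0);
                    if (!entry)
                            continue;

                    signature[0] = entry->ebx;
                    signature[1] = entry->ecx;
                    signature[2] = entry->edx;
                    if (!memcmp(signature, KVM_SIGNATURE, sizeof(signature))) {
                            vcpu->arch.kvm_cpuid_base = function;
                            break;
                    }
            }
    }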
2021-11-11  KVM: x86: Add helper to consolidate core logic of SET_CPUID{2} flows  (Sean Christopherson; 1 file, -23/+24)
Move the core logic of SET_CPUID and SET_CPUID2 to a common helper, the only difference between the two ioctls() is the format of the userspace struct. A future fix will add yet more code to the core logic. No functional change intended. Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11  kvm: mmu: Use fast PF path for access tracking of huge pages when possible  (Junaid Shahid; 1 file, -5/+5)
The fast page fault path bails out on write faults to huge pages in order to accommodate dirty logging. This change adds a check to do that only when dirty logging is actually enabled, so that access tracking for huge pages can still use the fast path for write faults in the common case. Signed-off-by: Junaid Shahid <[email protected]> Reviewed-by: Ben Gardon <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
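The gist of the change, as a sketch (assuming a kvm_slot_dirty_track_enabled()-style helper on the faulting slot):

    /*
     * Bail from the fast path on a write fault to a huge page only when
     * dirty logging may force the huge page to be demoted; otherwise,
     * e.g. for pure access-tracking faults, the fast path remains safe.
     */
    if (sp->role.level > PG_LEVEL_4K &&
        kvm_slot_dirty_track_enabled(fault->slot))
            break;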
2021-11-11  KVM: x86/mmu: Properly dereference rcu-protected TDP MMU sptep iterator  (Sean Christopherson; 1 file, -1/+1)
Wrap the read of iter->sptep in tdp_mmu_map_handle_target_level() with rcu_dereference(). Shadow pages in the TDP MMU, and thus their SPTEs, are protected by rcu. This fixes a Sparse warning at tdp_mmu.c:900:51:

warning: incorrect type in argument 1 (different address spaces)
    expected unsigned long long [usertype] *sptep
    got unsigned long long [noderef] [usertype] __rcu *[usertype] sptep

Fixes: 7158bee4b475 ("KVM: MMU: pass kvm_mmu_page struct to make_spte") Cc: Ben Gardon <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
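In essence (illustrative):

    /* iter->sptep is annotated __rcu; fetch it via rcu_dereference(). */
    u64 *sptep = rcu_dereference(iter->sptep);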
2021-11-11  KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active  (Maxim Levitsky; 4 files, -2/+25)
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected via KVM's standard inject_pending_event path, and not via APICv/AVIC. Since this is a debug feature, just inhibit APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU. Fixes: 61e5f69ef0837 ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ") Reported-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Tested-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-11  kvm: x86: Convert return type of *is_valid_rdpmc_ecx() to bool  (Jim Mattson; 5 files, -11/+11)
These function names sound like predicates, and they have siblings, *is_valid_msr(), which _are_ predicates. Moreover, there are comments that essentially warn that these functions behave unexpectedly. Flip the polarity of the return values, so that they become predicates, and convert the boolean result to a success/failure code at the outer call site. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
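The shape of the conversion, as a sketch (ops plumbing and call-site names abbreviated/illustrative):

    /* Now a genuine predicate... */
    bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
    {
            return kvm_x86_ops.pmu_ops->is_valid_rdpmc_ecx(vcpu, idx);
    }

    /* ...with the bool converted back to a success/failure code at the
     * outer call site: */
    static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt, u32 pmc)
    {
            if (kvm_pmu_is_valid_rdpmc_ecx(emul_to_vcpu(ctxt), pmc))
                    return 0;
            return -EINVAL;
    }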
2021-11-11  KVM: x86: Fix recording of guest steal time / preempted status  (David Woodhouse; 2 files, -31/+76)
In commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems.

Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set.

Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change.

In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty.

Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial.

Cc: [email protected] Fixes: b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") Signed-off-by: David Woodhouse <[email protected]> Message-Id: <[email protected]> [I didn't entirely agree with David's assessment of the usefulness of the gfn_to_pfn cache, and integrated the outcome of the discussion in the above commit message. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2021-11-02  Merge tag 'kvm-riscv-5.16-2' of https://github.com/kvm-riscv/linux into HEAD  (Paolo Bonzini; 5 files, -17/+17)
Minor cocci warning fixes:
1) Bool return warning fix
2) Unneeded semicolon warning fix
2021-11-01  RISC-V: KVM: fix boolreturn.cocci warnings  (Bixuan Cui; 1 file, -9/+9)
Fix boolreturn.cocci warnings:

./arch/riscv/kvm/mmu.c:603:9-10: WARNING: return of 0/1 in function 'kvm_age_gfn' with return type bool
./arch/riscv/kvm/mmu.c:582:9-10: WARNING: return of 0/1 in function 'kvm_set_spte_gfn' with return type bool
./arch/riscv/kvm/mmu.c:621:9-10: WARNING: return of 0/1 in function 'kvm_test_age_gfn' with return type bool
./arch/riscv/kvm/mmu.c:568:9-10: WARNING: return of 0/1 in function 'kvm_unmap_gfn_range' with return type bool

Signed-off-by: Bixuan Cui <[email protected]> Signed-off-by: Anup Patel <[email protected]>
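The mechanical change in each flagged function (return bool literals from bool-returning handlers), e.g.:

    -	return 0;
    +	return false;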
2021-11-01  RISC-V: KVM: remove unneeded semicolon  (ran jianping; 4 files, -8/+8)
Eliminate the following coccinelle check warnings:

./arch/riscv/kvm/vcpu_sbi.c:169:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu_exit.c:397:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu_exit.c:687:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu_exit.c:645:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu.c:247:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu.c:284:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu_timer.c:123:2-3: Unneeded semicolon
./arch/riscv/kvm/vcpu_timer.c:170:2-3: Unneeded semicolon

Reported-by: Zeal Robot <[email protected]> Signed-off-by: ran jianping <[email protected]> Signed-off-by: Anup Patel <[email protected]>
2021-10-31  Merge tag 'kvm-s390-next-5.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD  (Paolo Bonzini; 12 files, -77/+200)
KVM: s390: Fixes and Features for 5.16
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
2021-10-31  RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions  (Anup Patel; 2 files, -4/+5)
The parameter passed to the HFENCE.GVMA instruction in the rs1 register is the guest physical address right shifted by 2 (i.e. divided by 4). Unfortunately, we overlooked the semantics of the rs1 register for the HFENCE.GVMA instruction and never right shifted the guest physical address by 2. This issue did not manifest for hypervisors till now because:

1) Currently, only __kvm_riscv_hfence_gvma_all() and SBI HFENCE calls are used to invalidate TLB.
2) All H-extension implementations (such as QEMU, Spike, Rocket Core FPGA, etc) that we tried till now were conservatively flushing everything upon any HFENCE.GVMA instruction.

This patch fixes the GPA passed to the __kvm_riscv_hfence_gvma_vmid_gpa() and __kvm_riscv_hfence_gvma_gpa() functions. Fixes: fd7bb4a251df ("RISC-V: KVM: Implement VMID allocator") Reported-by: Ian Huang <[email protected]> Signed-off-by: Anup Patel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
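The call-site fix, schematically (function names per the commit; argument order illustrative):

    /* rs1 of HFENCE.GVMA expects (guest physical address) >> 2. */
    -	__kvm_riscv_hfence_gvma_vmid_gpa(gpa, vmid);
    +	__kvm_riscv_hfence_gvma_vmid_gpa(gpa >> 2, vmid);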
2021-10-31  RISC-V: KVM: Factor-out FP virtualization into separate sources  (Anup Patel; 5 files, -176/+228)
The timer and SBI virtualization is already in separate sources. In future, we will have vector and AIA virtualization also added as separate sources. To align with above described modularity, we factor-out FP virtualization into separate sources. Signed-off-by: Anup Patel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-31  Merge tag 'kvmarm-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini; 731 files, -4234/+10961)
KVM/arm64 updates for Linux 5.16
- More progress on the protected VM front, now with the full fixed feature set as well as the limitation of some hypercalls after initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly complicated
- Fixes for the vgic placement in the IPA space, together with a bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
2021-10-27  KVM: s390: add debug statement for diag 318 CPNC data  (Collin Walling; 1 file, -0/+1)
The diag 318 data contains values that denote information regarding the guest's environment. Currently, it is unnecessarily difficult to observe this value (either manually-inserted debug statements, gdb stepping, mem dumping etc). It's useful to observe this information to obtain an at-a-glance view of the guest's environment, so let's add a simple VCPU event that prints the CPNC to the s390dbf logs. Signed-off-by: Collin Walling <[email protected]> Acked-by: Christian Borntraeger <[email protected]> Link: https://lore.kernel.org/r/[email protected] [[email protected]]: change debug level to 3 Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-27  KVM: s390: pv: properly handle page flags for protected guests  (Claudio Imbrenda; 4 files, -7/+50)
Introduce variants of the convert and destroy page functions that also clear the PG_arch_1 bit used to mark them as secure pages. The PG_arch_1 flag is always allowed to overindicate; using the new functions introduced here makes it possible to reduce the extent of overindication and thus improve performance. These new functions can only be called on pages for which a reference is already being held. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-27  KVM: s390: Fix handle_sske page fault handling  (Janis Schoetterl-Glausch; 1 file, -0/+2)
If handle_sske cannot set the storage key, because there is no page table entry or no present large page entry, it calls fixup_user_fault. However, currently, if the call succeeds, handle_sske returns -EAGAIN, without having set the storage key. Instead, retry by continue'ing the loop without incrementing the address. The same issue in handle_pfmf was fixed by a11bdb1a6b78 ("KVM: s390: Fix pfmf and conditional skey emulation"). Fixes: bd096f644319 ("KVM: s390: Add skey emulation fault handling") Signed-off-by: Janis Schoetterl-Glausch <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  Merge branch 'kvm-pvclock-raw-spinlock' into HEAD  (Paolo Bonzini; 0 files, -0/+0)
pvclock_gtod_sync_lock is completely gone in Linux 5.16. Include this fix into the kvm/next history to record that the syzkaller report is not valid there. Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-25  KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock  (David Woodhouse; 2 files, -15/+15)
On the preemption path when updating a Xen guest's runstate times, this lock is taken inside the scheduler rq->lock, which is a raw spinlock. This was shown in a lockdep warning:

[ 89.138354] =============================
[ 89.138356] [ BUG: Invalid wait context ]
[ 89.138358] 5.15.0-rc5+ #834 Tainted: G S I E
[ 89.138360] -----------------------------
[ 89.138361] xen_shinfo_test/2575 is trying to lock:
[ 89.138363] ffffa34a0364efd8 (&kvm->arch.pvclock_gtod_sync_lock){....}-{3:3}, at: get_kvmclock_ns+0x1f/0x130 [kvm]
[ 89.138442] other info that might help us debug this:
[ 89.138444] context-{5:5}
[ 89.138445] 4 locks held by xen_shinfo_test/2575:
[ 89.138447] #0: ffff972bdc3b8108 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x6f0 [kvm]
[ 89.138483] #1: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_ioctl_run+0xdc/0x8b0 [kvm]
[ 89.138526] #2: ffff97331fdbac98 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xff/0xbd0
[ 89.138534] #3: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x26/0x170 [kvm]
...
[ 89.138695] get_kvmclock_ns+0x1f/0x130 [kvm]
[ 89.138734] kvm_xen_update_runstate+0x14/0x90 [kvm]
[ 89.138783] kvm_xen_update_runstate_guest+0x15/0xd0 [kvm]
[ 89.138830] kvm_arch_vcpu_put+0xe6/0x170 [kvm]
[ 89.138870] kvm_sched_out+0x2f/0x40 [kvm]
[ 89.138900] __schedule+0x5de/0xbd0

Cc: [email protected] Reported-by: [email protected] Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information") Signed-off-by: David Woodhouse <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
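The shape of the fix, as representative excerpts (not the full diff):

    /* struct kvm_arch */
    -	spinlock_t		pvclock_gtod_sync_lock;
    +	raw_spinlock_t		pvclock_gtod_sync_lock;

    /* e.g. in get_kvmclock_ns(), reachable from the preemption path
     * under rq->lock: */
    -	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
    +	raw_spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);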
2021-10-25  KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol  (David Edmondson; 1 file, -11/+5)
When passing the failing address and size out to user space, SGX must ensure not to trample on the earlier fields of the emulation_failure sub-union of struct kvm_run. Signed-off-by: David Edmondson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-25  KVM: x86: On emulation failure, convey the exit reason, etc. to userspace  (David Edmondson; 4 files, -18/+69)
Should instruction emulation fail, include the VM exit reason, etc. in the emulation_failure data passed to userspace, in order that the VMM can report it as a debugging aid when describing the failure. Suggested-by: Joao Martins <[email protected]> Signed-off-by: David Edmondson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-25  KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info  (David Edmondson; 5 files, -13/+19)
Extend the get_exit_info static call to provide the reason for the VM exit. Modify relevant trace points to use this rather than extracting the reason in the caller. Signed-off-by: David Edmondson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-25  KVM: x86: Clarify the kvm_run.emulation_failure structure layout  (David Edmondson; 1 file, -2/+6)
Until more flags for kvm_run.emulation_failure flags are defined, it is undetermined whether new payload elements corresponding to those flags will be additive or alternative. As a hint to userspace that an alternative is possible, wrap the current payload elements in a union. Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: David Edmondson <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
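The resulting layout, roughly (uapi sketch; flag name illustrative):

    struct {
            __u32 suberror;
            __u32 ndata;
            __u64 flags;
            union {
                    /* payload when the instruction-bytes flag is set */
                    struct {
                            __u8 insn_size;
                            __u8 insn_bytes[15];
                    };
                    /* future flags may define alternative payloads */
            };
    } emulation_failure;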
2021-10-25  KVM: s390: Add a routine for setting userspace CPU state  (Eric Farman; 2 files, -3/+12)
This capability exists, but we don't record anything when userspace enables it. Let's refactor that code so that a note can be made in the debug logs that it was enabled. Signed-off-by: Eric Farman <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  KVM: s390: Simplify SIGP Set Arch handling  (Eric Farman; 1 file, -13/+1)
The Principles of Operations describe the various reasons that each individual SIGP orders might be rejected, and the status bit that are set for each condition. For example, for the Set Architecture order, it states: "If it is not true that all other CPUs in the configu- ration are in the stopped or check-stop state, ... bit 54 (incorrect state) ... is set to one." However, it also states: "... if the CZAM facility is installed, ... bit 55 (invalid parameter) ... is set to one." Since the Configuration-z/Architecture-Architectural Mode (CZAM) facility is unconditionally presented, there is no need to examine each VCPU to determine if it is started/stopped. It can simply be rejected outright with the Invalid Parameter bit. Fixes: b697e435aeee ("KVM: s390: Support Configuration z/Architecture Mode") Signed-off-by: Eric Farman <[email protected]> Reviewed-by: Thomas Huth <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  KVM: s390: pv: avoid stalls when making pages secure  (Claudio Imbrenda; 2 files, -6/+28)
Improve make_secure_pte to avoid stalls when the system is heavily overcommitted. This was especially problematic in kvm_s390_pv_unpack, because of the loop over all pages that needed unpacking. Due to the locks being held, it was not possible to simply replace uv_call with uv_call_sched. A more complex approach was needed, in which uv_call is replaced with __uv_call, which does not loop. When the UVC needs to be executed again, -EAGAIN is returned, and the caller (or its caller) will try again. When -EAGAIN is returned, the path is the same as when the page is in writeback (and the writeback check is also performed, which is harmless). Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests") Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
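Sketched, the retry contract looks like this (simplified; UVC_CC_* values per the "add macros for UVC CC values" commit below, error mapping abbreviated):

    int cc = __uv_call(0, (u64)uvcb);       /* single attempt, no busy-loop */

    if (cc == UVC_CC_BUSY)
            return -EAGAIN;                 /* caller drops locks, retries */

    return cc == UVC_CC_OK ? 0 : -EINVAL;   /* error handling simplified */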
2021-10-25  KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm  (Claudio Imbrenda; 1 file, -1/+1)
When the system is heavily overcommitted, kvm_s390_pv_init_vm might generate stall notifications. Fix this by using uv_call_sched instead of just uv_call. This is ok because we are not holding spinlocks. Signed-off-by: Claudio Imbrenda <[email protected]> Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests") Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  KVM: s390: pv: avoid double free of sida page  (Claudio Imbrenda; 1 file, -10/+9)
If kvm_s390_pv_destroy_cpu is called more than once, we risk calling free_page on a random page, since the sidad field is aliased with the gbea, which is not guaranteed to be zero. This can happen, for example, if userspace calls the KVM_PV_DISABLE IOCTL, and it fails, and then userspace calls the same IOCTL again. This scenario is only possible if KVM has some serious bug or if the hardware is broken. The solution is to simply return success immediately if the vCPU was already non-secure. Signed-off-by: Claudio Imbrenda <[email protected]> Fixes: 19e1227768863a1469797c13ef8fea1af7beac2c ("KVM: S390: protvirt: Introduce instruction data area bounce buffer") Reviewed-by: Janosch Frank <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  KVM: s390: pv: add macros for UVC CC values  (Claudio Imbrenda; 1 file, -0/+5)
Add macros to describe the 4 possible CC values returned by the UVC instruction. Signed-off-by: Claudio Imbrenda <[email protected]> Reviewed-by: Christian Borntraeger <[email protected]> Reviewed-by: Janosch Frank <[email protected]> Message-Id: <[email protected]> Signed-off-by: Janosch Frank <[email protected]> Signed-off-by: Christian Borntraeger <[email protected]>
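I.e., roughly:

    #define UVC_CC_OK       0       /* UVC completed successfully */
    #define UVC_CC_ERROR    1       /* UVC failed, see header rc */
    #define UVC_CC_BUSY     2       /* UVC busy, retry later */
    #define UVC_CC_PARTIAL  3       /* UVC partially done, re-issue */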
2021-10-25  s390/mm: optimize reset_guest_reference_bit()  (David Hildenbrand; 1 file, -2/+12)
We already optimize get_guest_storage_key() to assume that if we don't have a PTE table and don't have a huge page mapped that the storage key is 0. Similarly, optimize reset_guest_reference_bit() to simply do nothing if there is no PTE table and no huge page mapped. Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/mm: optimize set_guest_storage_key()  (David Hildenbrand; 1 file, -2/+12)
We already optimize get_guest_storage_key() to assume that if we don't have a PTE table and don't have a huge page mapped that the storage key is 0. Similarly, optimize set_guest_storage_key() to simply do nothing in case the key to set is 0. Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present  (David Hildenbrand; 1 file, -12/+3)
pte_offset_map_lock() is sufficient. Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/uv: fully validate the VMA before calling follow_page()  (David Hildenbrand; 1 file, -1/+1)
We should not walk/touch page tables outside of VMA boundaries when holding only the mmap sem in read mode. Evil user space can modify the VMA layout just before this function runs and e.g., trigger races with page table removal code since commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"). find_vma() does not check if the address is >= the VMA start address; use vma_lookup() instead. Fixes: 214d9bbcd3a6 ("s390/mm: provide memory management functions for protected KVM guests") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
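The distinction that matters here, as a sketch:

    /* find_vma() returns the first VMA *ending* above addr; addr may
     * still lie below vma->vm_start, i.e. in a hole. */
    vma = find_vma(mm, addr);

    /* vma_lookup() returns a VMA only if addr actually falls inside it
     * (vm_start <= addr < vm_end), else NULL. */
    vma = vma_lookup(mm, addr);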
2021-10-25  s390/mm: fix VMA and page table handling code in storage key handling functions  (David Hildenbrand; 1 file, -18/+39)
There are multiple things broken about our storage key handling functions:

1. We should not walk/touch page tables outside of VMA boundaries when holding only the mmap sem in read mode. Evil user space can modify the VMA layout just before this function runs and e.g., trigger races with page table removal code since commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"). gfn_to_hva() will only translate using KVM memory regions, but won't validate the VMA.

2. We should not allocate page tables outside of VMA boundaries: if evil user space decides to map hugetlbfs to these ranges, bad things will happen because we suddenly have PTE or PMD page tables where we shouldn't have them.

3. We don't handle large PUDs that might suddenly appear inside our page table hierarchy.

Don't manually allocate page tables, properly validate that we have a VMA and bail out on pud_large(). All callers of page table handling functions, except get_guest_storage_key(), call fixup_user_fault() in case they receive an -EFAULT and retry; this will allocate the necessary page tables if required. To keep get_guest_storage_key() working as expected and not requiring kvm_s390_get_skeys() to call fixup_user_fault(), distinguish between "there is simply no page table or huge page yet and the key is assumed to be 0" and "this is a fault to be reported".

Although commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling") introduced most of the affected code, it was actually already broken before when using get_locked_pte() without any VMA checks.

Note: Ever since commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling") we can no longer set a guest storage key (for example from QEMU during VM live migration) without actually resolving a fault. Although we would have created most page tables, we would choke on the !pmd_present(), requiring a call to fixup_user_fault(). I would have thought that this is problematic in combination with postcopy live migration ... but nobody noticed and this patch doesn't change the situation. So maybe it's just fine.

Fixes: 9fcf93b5de06 ("KVM: S390: Create helper function get_guest_storage_key") Fixes: 24d5dd0208ed ("s390/kvm: Provide function for setting the guest storage key") Fixes: a7e19ab55ffd ("KVM: s390: handle missing storage-key facility") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/mm: validate VMA in PGSTE manipulation functions  (David Hildenbrand; 1 file, -0/+13)
We should not walk/touch page tables outside of VMA boundaries when holding only the mmap sem in read mode. Evil user space can modify the VMA layout just before this function runs and e.g., trigger races with page table removal code since commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"). gfn_to_hva() will only translate using KVM memory regions, but won't validate the VMA. Further, we should not allocate page tables outside of VMA boundaries: if evil user space decides to map hugetlbfs to these ranges, bad things will happen because we suddenly have PTE or PMD page tables where we shouldn't have them. Similarly, we have to check if we suddenly find a hugetlbfs VMA, before calling get_locked_pte(). Fixes: 2d42f9477320 ("s390/kvm: Add PGSTE manipulation functions") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/gmap: don't unconditionally call pte_unmap_unlock() in __gmap_zap()  (David Hildenbrand; 1 file, -2/+3)
... otherwise we will try unlocking a spinlock that was never locked via a garbage pointer. At the time we reach this code path, we usually successfully looked up a PGSTE already; however, evil user space could have manipulated the VMA layout in the meantime and triggered removal of the page table. Fixes: 1e133ab296f3 ("s390/mm: split arch/s390/mm/pgtable.c") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-25  s390/gmap: validate VMA in __gmap_zap()  (David Hildenbrand; 1 file, -0/+6)
We should not walk/touch page tables outside of VMA boundaries when holding only the mmap sem in read mode. Evil user space can modify the VMA layout just before this function runs and e.g., trigger races with page table removal code since commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap"). The mere presence of an entry in our guest_to_host radix tree does not imply that there is a VMA. Further, we should not allocate page tables (via get_locked_pte()) outside of VMA boundaries: if evil user space decides to map hugetlbfs to these ranges, bad things will happen because we suddenly have PTE or PMD page tables where we shouldn't have them. Similarly, we have to check if we suddenly find a hugetlbfs VMA, before calling get_locked_pte(). Note that gmap_discard() is different: zap_page_range()->unmap_single_vma() makes sure to stay within VMA boundaries. Fixes: b31288fa83b2 ("s390/kvm: support collaborative memory management") Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Claudio Imbrenda <[email protected]> Acked-by: Heiko Carstens <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Borntraeger <[email protected]>
2021-10-22  KVM: selftests: Fix nested SVM tests when built with clang  (Jim Mattson; 1 file, -1/+13)
Though gcc conveniently compiles a simple memset to "rep stos," clang prefers to call the libc version of memset. If a test is dynamically linked, the libc memset isn't available in L1 (nor is the PLT or the GOT, for that matter). Even if the test is statically linked, the libc memset may choose to use some CPU features, like AVX, which may not be enabled in L1. Note that __builtin_memset doesn't solve the problem, because (a) the compiler is free to call memset anyway, and (b) __builtin_memset may also choose to use features like AVX, which may not be available in L1. To avoid a myriad of problems, use an explicit "rep stos" to clear the VMCB in generic_svm_setup(), which is called both from L0 and L1. Reported-by: Ricardo Koller <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Fixes: 20ba262f8631a ("selftests: KVM: AMD Nested test infrastructure") Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
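A sketch of the idea (helper name illustrative): an explicit rep stosb touches neither libc nor any vector state, so it is safe in both L0 and L1.

    static inline void rep_stos_bytes(void *dst, int val, size_t n)
    {
            asm volatile("rep stosb"
                         : "+D" (dst), "+c" (n)
                         : "a" (val)
                         : "memory");
    }

    /* In generic_svm_setup(), instead of memset(vmcb, 0, ...): */
    rep_stos_bytes(vmcb, 0, sizeof(*vmcb));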
2021-10-22  kvm: x86: Remove stale declaration of kvm_no_apic_vcpu  (Jim Mattson; 1 file, -2/+0)
This variable was renamed to kvm_has_noapic_vcpu in commit 6e4e3b4df4e3 ("KVM: Stop using deprecated jump label APIs"). Signed-off-by: Jim Mattson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-22  KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup  (Sean Christopherson; 1 file, -2/+5)
Unregister KVM's posted interrupt wakeup handler during unsetup so that a spurious interrupt that arrives after kvm_intel.ko is unloaded doesn't call into freed memory. Fixes: bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-22  x86/irq: Ensure PI wakeup handler is unregistered before module unload  (Sean Christopherson; 1 file, -1/+3)
Add a synchronize_rcu() after clearing the posted interrupt wakeup handler to ensure all readers, i.e. in-flight IRQ handlers, see the new handler before returning to the caller. If the caller is an exiting module and is unregistering its handler, failure to wait could result in the IRQ handler jumping into an unloaded module. The registration path doesn't require synchronization, as it's the caller's responsibility to not generate interrupts it cares about until after its handler is registered. Fixes: f6b3c72c2366 ("x86/irq: Define a global vector for VT-d Posted-Interrupts") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
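Roughly (a sketch close to the x86 handler-registration code; dummy_handler stands in for the default no-op):

    void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
    {
            if (handler) {
                    kvm_posted_intr_wakeup_handler = handler;
            } else {
                    kvm_posted_intr_wakeup_handler = dummy_handler;
                    /* Wait for in-flight IRQ handlers that may still be
                     * running the old handler before letting the caller
                     * (e.g. an exiting module) proceed. */
                    synchronize_rcu();
            }
    }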
2021-10-22  KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism  (Sean Christopherson; 3 files, -8/+10)
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that vCPUs responding to requests can take the lock for read and run in parallel. Using a mutex forces serialization of vCPUs even though kvm_vcpu_update_apicv() only touches data local to that vCPU or is protected by a different lock, e.g. SVM's ir_list_lock. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
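Schematically (lock name as in the description; surrounding plumbing abbreviated):

    /* Updaters toggling APICv activation serialize on the write side... */
    down_write(&kvm->arch.apicv_update_lock);
    kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
    /* ... toggle the VM-wide state ... */
    up_write(&kvm->arch.apicv_update_lock);

    /* ...while vCPUs servicing the request take it shared, so they can
     * update their local APICv state in parallel: */
    down_read(&vcpu->kvm->arch.apicv_update_lock);
    kvm_vcpu_update_apicv(vcpu);
    up_read(&vcpu->kvm->arch.apicv_update_lock);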
2021-10-22  KVM: x86: Move SVM's APICv sanity check to common x86  (Sean Christopherson; 2 files, -2/+20)
Move SVM's assertion that a vCPU's APICv state is consistent with its VM's state out of svm_vcpu_run() and into x86's common inner run loop. The assertion and underlying logic are not unique to SVM, it's just that SVM has more inhibiting conditions and thus is more likely to run headfirst into any KVM bugs. Add relevant comments to document exactly why the update path has unusual ordering between the update and the kick, why said ordering is safe, and also the basic rules behind the assertion in the run loop. Cc: Maxim Levitsky <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-22  riscv: do not select non-existing config ANON_INODES  (Lukas Bulwahn; 1 file, -1/+0)
Commit 99cdc6c18c2d ("RISC-V: Add initial skeletal KVM support") selects the config ANON_INODES in config KVM, but the config ANON_INODES has been removed since commit 5dd50aaeb185 ("Make anon_inodes unconditional") in 2018. Hence, ./scripts/checkkconfigsymbols.py warns on non-existing symbols:

ANON_INODES
Referencing files: arch/riscv/kvm/Kconfig

Remove selecting the non-existing config ANON_INODES. Signed-off-by: Lukas Bulwahn <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2021-10-22  KVM: x86/mmu: Extract zapping of rmaps for gfn range to separate helper  (Sean Christopherson; 1 file, -22/+30)
Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a separate helper to clean up the unholy mess that kvm_zap_gfn_range() has become. In addition to deep nesting, the rmaps zapping spreads out the declaration of several variables and is generally a mess. Clean up the mess now so that future work to improve the memslots implementation doesn't need to deal with it. Cc: Maciej S. Szmigiero <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>