path: root/arch/x86/kvm/x86.c
Age | Commit message | Author | Files | Lines
2020-03-16 | KVM: x86: Move kvm_emulate.h into KVM's private directory | Sean Christopherson | 1 | -0/+1
Now that the emulation context is dynamically allocated and not embedded in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public asm directory and into KVM's private x86 directory. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Dynamically allocate per-vCPU emulation context | Sean Christopherson | 1 | -11/+54
Allocate the emulation context instead of embedding it in struct kvm_vcpu_arch. Dynamic allocation provides several benefits:
- Shrinks the size of x86 vCPUs by ~2.5k bytes, dropping them back below the PAGE_ALLOC_COSTLY_ORDER threshold.
- Allows dropping the include of kvm_emulate.h from asm/kvm_host.h and moving kvm_emulate.h into KVM's private directory.
- Allows reducing KVM's attack surface by shrinking the amount of vCPU data that is exposed to usercopy.
- Allows a future patch to disable the emulator entirely, which may or may not be a realistic endeavor.
Mark the entire struct as valid for usercopy to maintain existing behavior with respect to hardened usercopy. Future patches can shrink the usercopy range to cover only what is necessary. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
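As a rough illustration of the allocation pattern (a hedged sketch, not the patch verbatim; the helper name and error handling are illustrative):

    static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
    {
            struct x86_emulate_ctxt *ctxt;

            /* one context per vCPU, accounted to the creating task's cgroup */
            ctxt = kzalloc(sizeof(*ctxt), GFP_KERNEL_ACCOUNT);
            if (!ctxt)
                    pr_err("kvm: failed to allocate vcpu's emulator\n");
            return ctxt;
    }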
2020-03-16 | KVM: x86: Explicitly pass an exception struct to check_intercept | Sean Christopherson | 1 | -1/+2
Explicitly pass an exception struct when checking for intercept from the emulator, which eliminates the last reference to arch.emulate_ctxt in vendor specific code. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Refactor I/O emulation helpers to provide vcpu-only variant | Sean Christopherson | 1 | -15/+24
Add variants of the I/O helpers that take a vCPU instead of an emulation context. This will eventually allow KVM to limit use of the emulation context to the full emulation path. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Fix warning due to implicit truncation on 32-bit KVM | Sean Christopherson | 1 | -2/+6
Explicitly cast the integer literal to an unsigned long when stuffing a non-canonical value into the host virtual address during private memslot deletion. The explicit cast fixes a warning that gets promoted to an error when running with KVM's newfangled -Werror setting. arch/x86/kvm/x86.c:9739:9: error: large integer implicitly truncated to unsigned type [-Werror=overflow] Fixes: a3e967c0b87d3 ("KVM: Terminate memslot walks via used_slots") Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
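A minimal illustration of the warning and the fix (illustrative values only, not necessarily the exact literal from the memslot code):

    /* on 32-bit, 'unsigned long' is 32 bits, so the 64-bit literal is
     * implicitly truncated and gcc complains under -Werror=overflow */
    unsigned long hva = 0xdeadull << 48;

    /* explicit cast: same truncation, but now it is intentional and silent */
    unsigned long hva_ok = (unsigned long)(0xdeadull << 48);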
2020-03-16 | KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd() | Sean Christopherson | 1 | -1/+1
Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to note that it will return something other than CR3 when nested EPT is in use. Hopefully the new name will also make it more obvious that L1's nested_cr3 is returned in SVM's nested NPT case. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: nVMX: Properly handle userspace interrupt window request | Sean Christopherson | 1 | -5/+5
Return true for vmx_interrupt_allowed() if the vCPU is in L2 and L1 has external interrupt exiting enabled. IRQs are never blocked in hardware if the CPU is in the guest (L2 from L1's perspective) when IRQs trigger VM-Exit. The new check percolates up to kvm_vcpu_ready_for_interrupt_injection() and thus vcpu_run(), and so KVM will exit to userspace if userspace has requested an interrupt window (to inject an IRQ into L1). Remove the @external_intr param from vmx_check_nested_events(), which is actually an indicator that userspace wants an interrupt window, e.g. it's named @req_int_win further up the stack. Injecting a VM-Exit into L1 to try and bounce out to L0 userspace is all kinds of broken and is no longer necessary. Remove the hack in nested_vmx_vmexit() that attempted to work around the breakage in vmx_check_nested_events() by only filling interrupt info if there's an actual interrupt pending. The hack actually made things worse because it caused KVM to _never_ fill interrupt info when the LAPIC resides in userspace (kvm_cpu_has_interrupt() queries interrupt.injected, which is always cleared by prepare_vmcs12() before reaching the hack in nested_vmx_vmexit()). Fixes: 6550c4df7e50 ("KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"") Cc: [email protected] Cc: Liran Alon <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: X86: trigger kvmclock sync request just once on VM creation | Wanpeng Li | 1 | -5/+3
During vCPU creation, a kvmclock sync worker is queued to the global workqueue before each vCPU's creation completes. The workqueue subsystem guarantees that already-queued work is not queued again; however, we can make the logic clearer by having just one leader trigger the kvmclock sync request, which also saves the cacheline bouncing caused by test_and_set_bit. Signed-off-by: Wanpeng Li <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: LAPIC: Recalculate apic map in batch | Wanpeng Li | 1 | -0/+1
In the vCPU reset and set APIC_BASE MSR paths, the apic map is recalculated several times; each recalculation consumes 10+ us (observed by ftrace in my non-overcommitted environment) because of the expensive memory allocation/mutex/RCU operations involved. This patch optimizes it by recalculating the apic map in a batch. I hope this can benefit serverless scenarios, which frequently create/destroy VMs.
Before patch: kvm_lapic_reset ~27us
After patch:  kvm_lapic_reset ~14us
Observed by ftrace, an improvement of ~48%. Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: enable dirty log gradually in small chunks | Jay Zhou | 1 | -4/+17
Enabling dirty logging for the first time can take kvm->mmu_lock for an extended period of time. The main cost is clearing all the D-bits of last-level SPTEs. This situation can benefit from manual dirty log protect as well, which can reduce the time mmu_lock is held. The sequence is like this (see the userspace sketch after this entry):
1. Initialize all the bits of the dirty bitmap to 1 when enabling dirty log for the first time
2. Only write protect the huge pages
3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
4. KVM_CLEAR_DIRTY_LOG will clear the D-bit for each of the leaf level SPTEs gradually in small chunks
Under the Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz environment, I did some tests with a 128G Windows VM and counted the time taken by memory_global_dirty_log_start; here are the numbers:
VM Size    Before    After optimization
128G       460ms     10ms
Signed-off-by: Jay Zhou <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
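From the userspace side, steps 3 and 4 of the sequence above look roughly like the following sketch (assumes KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 has been enabled on the VM; names and chunk size are illustrative):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void clear_dirty_in_chunks(int vm_fd, uint32_t slot,
                                      uint64_t npages, uint64_t *bitmap)
    {
            const uint64_t chunk = 64;   /* first_page must be 64-page aligned */

            for (uint64_t first = 0; first < npages; first += chunk) {
                    struct kvm_clear_dirty_log clear = {
                            .slot         = slot,
                            .first_page   = first,
                            .num_pages    = npages - first < chunk ?
                                            npages - first : chunk,
                            .dirty_bitmap = bitmap + first / 64,
                    };

                    /* clear D-bits / re-write-protect just this chunk */
                    ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
            }
    }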
2020-03-16 | KVM: x86: Consolidate VM allocation and free for VMX and SVM | Sean Christopherson | 1 | -0/+7
Move the VM allocation and free code to common x86 as the logic is more or less identical across SVM and VMX. Note, although hyperv.hv_pa_pg is part of the common kvm->arch, it's (currently) only allocated by VMX VMs. But, since kfree() plays nice when passed a NULL pointer, the superfluous call for SVM is harmless and avoids future churn if SVM gains support for HyperV's direct TLB flush. Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> [Make vm_size a field instead of a function. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86/mmu: Move kvm_arch_flush_remote_tlbs_memslot() to mmu.c | Sean Christopherson | 1 | -11/+0
Move kvm_arch_flush_remote_tlbs_memslot() from x86.c to mmu.c in preparation for calling kvm_flush_remote_tlbs_with_address() instead of kvm_flush_remote_tlbs(). The with_address() variant is statically defined in mmu.c, arguably kvm_arch_flush_remote_tlbs_memslot() belongs in mmu.c anyways, and defining kvm_arch_flush_remote_tlbs_memslot() in mmu.c will allow the compiler to inline said function when a future patch consolidates open coded variants of the function. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: Terminate memslot walks via used_slots | Sean Christopherson | 1 | -7/+8
Refactor memslot handling to treat the number of used slots as the de facto size of the memslot array, e.g. return NULL from id_to_memslot() when an invalid index is provided instead of relying on npages==0 to detect an invalid memslot. Rework the sorting and walking of memslots in advance of dynamically sizing memslots to aid bisection and debug, e.g. with luck, a bug in the refactoring will bisect here and/or hit a WARN instead of randomly corrupting memory. Alternatively, a global null/invalid memslot could be returned, i.e. so callers of id_to_memslot() don't have to explicitly check for a NULL memslot, but that approach runs the risk of introducing difficult-to-debug issues, e.g. if the global null slot is modified. Constifying the return from id_to_memslot() to combat such issues is possible, but would require a massive refactoring of arch specific code and would still be susceptible to casting shenanigans. Add function comments to update_memslots() and search_memslots() to explicitly (and loudly) state how memslots are sorted. Opportunistically stuff @hva with a non-canonical value when deleting a private memslot on x86 to detect bogus usage of the freed slot. No functional change intended. Tested-by: Christoffer Dall <[email protected]> Tested-by: Marc Zyngier <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: Provide common implementation for generic dirty log functions | Sean Christopherson | 1 | -57/+4
Move the implementations of KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG for CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT into common KVM code. The arch specific implementations are extremely similar, differing only in whether the dirty log needs to be sync'd from hardware (x86) and how the TLBs are flushed. Add new arch hooks to handle sync and TLB flush; the sync will also be used for non-generic dirty log support in a future patch (s390). The ulterior motive for providing a common implementation is to eliminate the dependency between arch and common code with respect to the memslot referenced by the dirty log, i.e. to make it obvious in the code that the validity of the memslot is guaranteed, as a future patch will rework memslot handling such that id_to_memslot() can return NULL. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: Simplify kvm_free_memslot() and all its descendents | Sean Christopherson | 1 | -13/+8
Now that all callers of kvm_free_memslot() pass NULL for @dont, remove the param from the top-level routine and all arch's implementations. No functional change intended. Tested-by: Christoffer Dall <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Free arrays for old memslot when moving memslot's base gfn | Sean Christopherson | 1 | -0/+4
Explicitly free the metadata arrays (stored in slot->arch) in the old memslot structure when moving the memslot's base gfn is committed. This eliminates x86's dependency on kvm_free_memslot() being called when a memslot move is committed, and paves the way for removing the funky code in kvm_free_memslot() that conditionally frees structures based on its @dont param. Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16KVM: Drop "const" attribute from old memslot in commit_memory_region()Sean Christopherson1-1/+1
Drop the "const" attribute from @old in kvm_arch_commit_memory_region() to allow arch specific code to free arch specific resources in the old memslot without having to cast away the attribute. Freeing resources in kvm_arch_commit_memory_region() paves the way for simplifying kvm_free_memslot() by eliminating the last usage of its @dont param. Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: Drop kvm_arch_create_memslot() | Sean Christopherson | 1 | -6/+0
Remove kvm_arch_create_memslot() now that all arch implementations are effectively nops. Removing kvm_arch_create_memslot() eliminates the possibility for arch specific code to allocate memory prior to setting a memslot, which sets the stage for simplifying kvm_free_memslot(). Cc: Janosch Frank <[email protected]> Acked-by: Christian Borntraeger <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Allocate memslot resources during prepare_memory_region() | Sean Christopherson | 1 | -4/+9
Allocate the various metadata structures associated with a new memslot during kvm_arch_prepare_memory_region(), which paves the way for removing kvm_arch_create_memslot() altogether. Moving x86's memory allocation only changes the order of kernel memory allocations between x86 and common KVM code. Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Allocate new rmap and large page tracking when moving memslot | Sean Christopherson | 1 | -0/+11
Reallocate the rmap arrays and recalculate large page compatibility when moving an existing memslot to correctly handle the alignment properties of the new memslot. The number of rmap entries required at each level is dependent on the alignment of the memslot's base gfn with respect to that level, e.g. moving a large-page aligned memslot so that it becomes unaligned will increase the number of rmap entries needed at the now unaligned level.

Not updating the rmap array is the most obvious bug, as KVM accesses garbage data beyond the end of the rmap. KVM interprets the bad data as pointers, leading to non-canonical #GPs, unexpected #PFs, etc...

general protection fault: 0000 [#1] SMP
CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:rmap_get_first+0x37/0x50 [kvm]
Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3
RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246
RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012
RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570
RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8
R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000
R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8
FS:  00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0
Call Trace:
 kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm]
 __kvm_set_memory_region.part.64+0x559/0x960 [kvm]
 kvm_set_memory_region+0x45/0x60 [kvm]
 kvm_vm_ioctl+0x30f/0x920 [kvm]
 do_vfs_ioctl+0xa1/0x620
 ksys_ioctl+0x66/0x70
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4c/0x170
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f7ed9911f47
Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47
RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004
RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700
R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000
R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000
Modules linked in: kvm_intel kvm irqbypass
---[ end trace 0c5f570b3358ca89 ]---

The disallow_lpage tracking is more subtle. Failure to update results in KVM creating large pages when it shouldn't, either due to stale data or again due to indexing beyond the end of the metadata arrays, which can lead to memory corruption and/or leaking data to guest/userspace.

Note, the arrays for the old memslot are freed by the unconditional call to kvm_free_memslot() in __kvm_set_memory_region(). Fixes: 05da45583de9b ("KVM: MMU: large page support") Cc: [email protected] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Peter Xu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Move gpa_val and gpa_available into the emulator context | Sean Christopherson | 1 | -7/+6
Move the GPA tracking into the emulator context now that the context is guaranteed to be initialized via __init_emulate_ctxt() prior to dereferencing gpa_{available,val}, i.e. now that seeing a stale gpa_available will also trigger a WARN due to an invalid context. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: Add EMULTYPE_PF when emulation is triggered by a page fault | Sean Christopherson | 1 | -6/+19
Add a new emulation type flag to explicitly mark emulation related to a page fault. Move the propagation of the GPA into the emulator from the page fault handler into x86_emulate_instruction, using EMULTYPE_PF as an indicator that cr2 is valid. Similarly, don't propagate cr2 into the exception.address when it's *not* valid. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-16 | KVM: x86: eliminate some unreachable code | Miaohe Lin | 1 | -3/+0
This code is unreachable; remove it. Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Krish Sadhukhan <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-03-02 | KVM: X86: Fix dereference null cpufreq policy | Wanpeng Li | 1 | -3/+5
Naresh Kamboju reported:

Linux version 5.6.0-rc4 (oe-user@oe-host) (gcc version (GCC)) #1 SMP Sun Mar 1 22:59:08 UTC 2020
kvm: no hardware support
BUG: kernel NULL pointer dereference, address: 000000000000028c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 04/01/2014
RIP: 0010:kobject_put+0x12/0x1c0
Call Trace:
 cpufreq_cpu_put+0x15/0x20
 kvm_arch_init+0x1f6/0x2b0
 kvm_init+0x31/0x290
 ? svm_check_processor_compat+0xd/0xd
 ? svm_check_processor_compat+0xd/0xd
 svm_init+0x21/0x23
 do_one_initcall+0x61/0x2f0
 ? rdinit_setup+0x30/0x30
 ? rcu_read_lock_sched_held+0x4f/0x80
 kernel_init_freeable+0x219/0x279
 ? rest_init+0x250/0x250
 kernel_init+0xe/0x110
 ret_from_fork+0x27/0x50
Modules linked in:
CR2: 000000000000028c
---[ end trace 239abf40c55c409b ]---
RIP: 0010:kobject_put+0x12/0x1c0

The cpufreq policy obtained via cpufreq_cpu_get() can be NULL on failure; this patch takes care of it. Fixes: aaec7c03de ("KVM: x86: avoid useless copy of cpufreq policy") Reported-by: Naresh Kamboju <[email protected]> Cc: Naresh Kamboju <[email protected]> Signed-off-by: Wanpeng Li <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
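The fix boils down to checking for NULL before dereferencing the policy; a minimal sketch of the pattern (the surrounding kvm_timer_init() logic is elided and the max_tsc_khz assignment is illustrative):

    struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);

    if (policy) {                        /* cpufreq_cpu_get() may return NULL */
            if (policy->cpuinfo.max_freq)
                    max_tsc_khz = policy->cpuinfo.max_freq;
            cpufreq_cpu_put(policy);     /* drop the reference taken by _get() */
    }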
2020-02-28 | kvm: x86: Limit the number of "kvm: disabled by bios" messages | Erwan Velu | 1 | -2/+2
In older versions of systemd (219), at boot time, udevadm is called with: "/usr/bin/udevadm trigger --type=devices --action=add". This writes "add" to /sys/devices/system/cpu/cpu<x>/uevent, leading to the "kvm: disabled by bios" message in case your BIOS disabled the virtualization extensions. On a modern system running up to 256 CPU threads, this pollutes the kernel logs. This patch ratelimits the message so that a userspace program triggering this uevent does not print it too often. This patch is only a workaround, but it greatly reduces the pollution without breaking the current behavior of printing a message if someone tries to instantiate KVM on a system that doesn't support it. Note that recent versions of systemd (>239) do not trigger this behavior. This patch will be useful at least for those using older systemd with recent kernels. Signed-off-by: Erwan Velu <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
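The change itself is essentially a switch to the kernel's ratelimited printk helper, e.g. (a sketch, not the exact diff):

    /* repeated uevent-triggered probes no longer flood the log */
    pr_err_ratelimited("kvm: disabled by bios\n");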
2020-02-28 | KVM: x86: avoid useless copy of cpufreq policy | Paolo Bonzini | 1 | -5/+5
struct cpufreq_policy is quite big and it is not a good idea to allocate one on the stack. Just use cpufreq_cpu_get and cpufreq_cpu_put which is even simpler. Reported-by: Christoph Hellwig <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-23 | KVM: nVMX: Emulate MTF when performing instruction emulation | Oliver Upton | 1 | -0/+2
Since commit 5f3d45e7f282 ("kvm/x86: add support for MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap flag processor-based execution control for its L2 guest. KVM simply forwards any MTF VM-exits to the L1 guest, which works for normal instruction execution. However, when KVM needs to emulate an instruction on the behalf of an L2 guest, the monitor trap flag is not emulated. Add the necessary logic to kvm_skip_emulated_instruction() to synthesize an MTF VM-exit to L1 upon instruction emulation for L2. Fixes: 5f3d45e7f282 ("kvm/x86: add support for MONITOR_TRAP_FLAG") Signed-off-by: Oliver Upton <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-17 | x86/vdso: Use generic VDSO clock mode storage | Thomas Gleixner | 1 | -11/+11
Switch to the generic VDSO clock mode storage. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> (VDSO parts) Acked-by: Juergen Gross <[email protected]> (Xen parts) Acked-by: Paolo Bonzini <[email protected]> (KVM parts) Link: https://lkml.kernel.org/r/[email protected]
2020-02-12 | KVM: x86: fix WARN_ON check of an unsigned less than zero | Paolo Bonzini | 1 | -1/+1
The check cpu->hv_clock.system_time < 0 is redundant since system_time is a u64 and hence can never be less than zero. But what was actually meant is to check that the result is positive, since kernel_ns and v->kvm->arch.kvmclock_offset are both s64. Reported-by: Colin King <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Addresses-Coverity: ("Macro compares unsigned to 0") Reviewed-by: Miaohe Lin <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
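In other words, the check has to be performed on the signed intermediate value before it is stored into the u64 field; a sketch of the intended logic (names follow the commit message, the exact code may differ):

    s64 ns = kernel_ns + v->kvm->arch.kvmclock_offset;

    WARN_ON_ONCE(ns < 0);               /* meaningful: ns is signed */
    vcpu->hv_clock.system_time = ns;    /* checking the u64 field would always pass */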
2020-02-12 | KVM: x86/mmu: Avoid retpoline on ->page_fault() with TDP | Sean Christopherson | 1 | -1/+1
Wrap calls to ->page_fault() with a small shim to directly invoke the TDP fault handler when the kernel is using retpolines and TDP is being used. Single out the TDP fault handler and annotate the TDP path as likely to coerce the compiler into preferring it over the indirect function call. Rename tdp_page_fault() to kvm_tdp_page_fault(), as it's exposed outside of mmu.c to allow inlining the shim. Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-12 | KVM: x86: remove duplicated KVM_REQ_EVENT request | Miaohe Lin | 1 | -1/+0
The KVM_REQ_EVENT request is already made in kvm_set_rflags(). We should not make it again. Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-12 | KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS | Oliver Upton | 1 | -13/+16
KVM allows the deferral of exception payloads when a vCPU is in guest mode to allow the L1 hypervisor to intercept certain events (#PF, #DB) before register state has been modified. However, this behavior is incompatible with the KVM_{GET,SET}_VCPU_EVENTS ABI, as userspace expects register state to have been immediately modified. Userspace may opt in to the payload deferral behavior with the KVM_CAP_EXCEPTION_PAYLOAD per-VM capability. As such, kvm_multiple_exception() will immediately manipulate guest registers if the capability hasn't been requested. Since the deferral is only necessary if a userspace ioctl were to be serviced at the same time as a payload-bearing exception is recognized, this behavior can be relaxed. Instead, opportunistically defer the payload from kvm_multiple_exception() and deliver the payload before completing a KVM_GET_VCPU_EVENTS ioctl. Signed-off-by: Oliver Upton <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-12 | KVM: x86: Mask off reserved bit from #DB exception payload | Oliver Upton | 1 | -0/+8
KVM defines the #DB payload as compatible with the 'pending debug exceptions' field under VMX, not DR6. Mask off bit 12 when applying the payload to DR6, as it is reserved on DR6 but not the 'pending debug exceptions' field. Fixes: f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery") Signed-off-by: Oliver Upton <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
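The masking itself is a one-bit affair; a sketch of the idea (the surrounding payload-delivery code is elided):

    /* bit 12 ("enabled breakpoint") exists in the VMX 'pending debug
     * exceptions' format but is reserved in DR6, so strip it before
     * merging the payload into DR6 */
    vcpu->arch.dr6 |= payload & ~BIT_ULL(12);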
2020-02-05 | KVM: x86: Mark CR4.UMIP as reserved based on associated CPUID bit | Sean Christopherson | 1 | -0/+2
Re-add code to mark CR4.UMIP as reserved if UMIP is not supported by the host. The UMIP handling was unintentionally dropped during a recent refactoring. Not flagging CR4.UMIP allows the guest to set its CR4.UMIP regardless of host support or userspace desires. On CPUs with UMIP support, including emulated UMIP, this allows the guest to enable UMIP against the wishes of the userspace VMM. On CPUs without any form of UMIP, this results in a failed VM-Enter due to invalid guest state. Fixes: 345599f9a2928 ("KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync") Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | KVM: x86: use raw clock values consistently | Paolo Bonzini | 1 | -15/+23
Commit 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") changed kvmclock to use tkr_raw instead of tkr_mono. However, the default kvmclock_offset for the VM was still based on the monotonic clock and, if the raw clock drifted enough from the monotonic clock, this could cause a negative system_time to be written to the guest's struct pvclock. RHEL5 does not like it and (if it boots fast enough to observe a negative time value) it hangs. There is another thing to be careful about: getboottime64 returns the host boot time with tkr_mono frequency, and subtracting the tkr_raw-based kvmclock value will cause the wallclock to be off if tkr_raw drifts from tkr_mono. To avoid this, compute the wallclock delta from the current time instead of being clever and using getboottime64. Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") Cc: [email protected] Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | KVM: x86: reorganize pvclock_gtod_data members | Paolo Bonzini | 1 | -17/+12
We will need a copy of tk->offs_boot in the next patch. Store it and clean up the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot included, store the raw value in struct pvclock_clock and sum it in do_monotonic_raw and do_realtime. tk->tkr_xxx.xtime_nsec also moves to struct pvclock_clock. While at it, fix a (usually harmless) typo in do_monotonic_raw, which was using gtod->clock.shift instead of gtod->raw_clock.shift. Fixes: 53fafdbb8b21f ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") Cc: [email protected] Reviewed-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: hyperv: Use APICv update request interface | Suravee Suthikulpanit | 1 | -13/+0
Since disabling APICv has to be done for all vcpus on AMD-based systems, adopt the newly introduced kvm_request_apicv_update() interface, and introduce a new APICV_INHIBIT_REASON_HYPERV. Also, remove kvm_vcpu_deactivate_apicv() since it is no longer used. Cc: Roman Kagan <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: Introduce x86 ops hook for pre-update APICv | Suravee Suthikulpanit | 1 | -0/+2
AMD SVM AVIC needs to update the APIC backing page mapping before changing the APICv mode. Introduce a struct kvm_x86_ops.pre_update_apicv_exec_ctrl function hook to be called prior to the KVM APICv update request to each vcpu. Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: Introduce APICv x86 ops for checking APIC inhibit reasons | Suravee Suthikulpanit | 1 | -0/+4
Inhibit reason bits are used to determine if APICv deactivation is applicable for a particular hardware virtualization architecture. Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: Add APICv (de)activate request trace points | Suravee Suthikulpanit | 1 | -0/+2
Add trace points when sending request to (de)activate APICv. Suggested-by: Alexander Graf <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: Add support for dynamic APICv activation | Suravee Suthikulpanit | 1 | -0/+37
Certain runtime conditions require APICv to be temporarily deactivated. The current implementation only supports runtime deactivation of APICv when Hyper-V SynIC is enabled, which is not temporary. In addition, for AMD, when APICv is (de)activated at runtime, all vcpus in the VM have to operate in the same mode. Thus the requesting vcpu must notify the others. So, introduce the following:
* A new KVM_REQ_APICV_UPDATE request bit
* Interfaces to request all vcpus to update APICv status
* A new interface to update APICv-related parameters for each vcpu
Signed-off-by: Suravee Suthikulpanit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-02-05 | kvm: x86: Introduce APICv inhibit reason bits | Suravee Suthikulpanit | 1 | -1/+19
There are several reasons for which a VM needs to deactivate APICv, e.g. APICv is disabled via a module parameter at load time, or Hyper-V SynIC support is enabled. Additional inhibit reasons will be introduced later on when dynamic APICv is supported. Introduce KVM APICv inhibit reason bits along with a new variable, apicv_inhibit_reasons, to help keep track of the APICv state for each VM. Initially, the APICV_INHIBIT_REASON_DISABLE bit is used to indicate the case where APICv is disabled during KVM module load (e.g. insmod kvm_amd avic=0 or insmod kvm_intel enable_apicv=0). Signed-off-by: Suravee Suthikulpanit <[email protected]> [Do not use get_enable_apicv; consider irqchip_split in svm.c. - Paolo] Signed-off-by: Paolo Bonzini <[email protected]>
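Conceptually the tracking is just a per-VM bit mask: APICv stays active only while the mask is zero. A hedged sketch of that idea (field and helper names are illustrative, not necessarily the patch's exact ones):

    static bool apicv_activated(struct kvm *kvm)
    {
            /* active only while no inhibit reason is set */
            return READ_ONCE(kvm->arch.apicv_inhibit_reasons) == 0;
    }

    static void apicv_set_inhibit(struct kvm *kvm, unsigned long reason)
    {
            set_bit(reason, &kvm->arch.apicv_inhibit_reasons);
    }

    static void apicv_clear_inhibit(struct kvm *kvm, unsigned long reason)
    {
            clear_bit(reason, &kvm->arch.apicv_inhibit_reasons);
    }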
2020-01-31 | Merge tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds | 1 | -217/+352
Pull KVM updates from Paolo Bonzini:
 "This is the first batch of KVM changes.
  ARM:
  - cleanups and corner case fixes.
  PPC:
  - Bugfixes
  x86:
  - Support for mapping DAX areas with large nested page table entries.
  - Cleanups and bugfixes here too. A particularly important one is a fix for FPU load when the thread has TIF_NEED_FPU_LOAD. There is also a race condition which could be used in guest userspace to exploit the guest kernel, for which the embargo expired today.
  - Fast path for IPI delivery vmexits, shaving about 200 clock cycles from IPI latency.
  - Protect against "Spectre-v1/L1TF" (bring data in the cache via speculative out of bound accesses, use L1TF on the sibling hyperthread to read it), which unfortunately is an even bigger whack-a-mole game than SpectreV1.
  Sean continues his mission to rewrite KVM. In addition to a sizable number of x86 patches, this time he contributed a pretty large refactoring of vCPU creation that affects all architectures but should not have any visible effect.
  s390 will come next week together with some more x86 patches"
* tag 'kvm-5.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  x86/KVM: Clean up host's steal time structure
  x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed
  x86/kvm: Cache gfn to pfn translation
  x86/kvm: Introduce kvm_(un)map_gfn()
  x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit
  KVM: PPC: Book3S PR: Fix -Werror=return-type build failure
  KVM: PPC: Book3S HV: Release lock on page-out failure path
  KVM: arm64: Treat emulated TVAL TimerValue as a signed 32-bit integer
  KVM: arm64: pmu: Only handle supported event counters
  KVM: arm64: pmu: Fix chained SW_INCR counters
  KVM: arm64: pmu: Don't mark a counter as chained if the odd one is disabled
  KVM: arm64: pmu: Don't increment SW_INCR if PMCR.E is unset
  KVM: x86: Use a typedef for fastop functions
  KVM: X86: Add 'else' to unify fastop and execute call path
  KVM: x86: inline memslot_valid_for_gpte
  KVM: x86/mmu: Use huge pages for DAX-backed files
  KVM: x86/mmu: Remove lpage_is_disallowed() check from set_spte()
  KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust()
  KVM: x86/mmu: Zap any compound page when collapsing sptes
  KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch)
  ...
2020-01-30 | Merge branch 'cve-2019-3016' into kvm-next-5.6 | Paolo Bonzini | 1 | -26/+43
From Boris Ostrovsky:
The KVM hypervisor may provide a guest with the ability to defer a remote TLB flush when the remote VCPU is not running. When this feature is used, the TLB flush will happen only when the remote VCPU is scheduled to run again. This will avoid unnecessary (and expensive) IPIs.
Under certain circumstances, when a guest initiates such a deferred action, the hypervisor may miss the request. It is also possible that the guest may mistakenly assume that it has already marked a remote VCPU as needing a flush when in fact that request had already been processed by the hypervisor. In both cases this will result in an invalid translation being present in a vCPU, potentially allowing accesses to memory locations in that guest's address space that should not be accessible. Note that only intra-guest memory is vulnerable.
The five patches address both of these problems:
1. The first patch makes sure the hypervisor doesn't accidentally clear a guest's remote flush request
2. The rest of the patches prevent the race between the hypervisor acknowledging a remote flush request and the guest issuing a new one.
Conflicts:
 arch/x86/kvm/x86.c [move from kvm_arch_vcpu_free to kvm_arch_vcpu_destroy]
2020-01-30 | x86/KVM: Clean up host's steal time structure | Boris Ostrovsky | 1 | -8/+3
Now that we are mapping kvm_steal_time from the guest directly we don't need to keep a copy of it in kvm_vcpu_arch.st. The same is true for the stime field. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2020-01-30 | x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed | Boris Ostrovsky | 1 | -21/+30
There is a potential race in record_steal_time() between setting host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing KVM_VCPU_PREEMPTED) and propagating this value to the guest with kvm_write_guest_cached(). Between those two events the guest may still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set KVM_VCPU_FLUSH_TLB and assume that the hypervisor will do the right thing. Which it won't. Instead of copying, we should map kvm_steal_time and that will guarantee atomicity of accesses to @preempted. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2020-01-30 | x86/kvm: Cache gfn to pfn translation | Boris Ostrovsky | 1 | -0/+10
__kvm_map_gfn()'s call to gfn_to_pfn_memslot() is:
* relatively expensive
* in certain cases (such as when done from atomic context) cannot be called
Stashing the gfn-to-pfn mapping should help with both cases. This is part of CVE-2019-3016. Signed-off-by: Boris Ostrovsky <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2020-01-30 | x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit | Boris Ostrovsky | 1 | -0/+3
kvm_steal_time_set_preempted() may accidentally clear the KVM_VCPU_FLUSH_TLB bit if it is called more than once while the VCPU is preempted. This is part of CVE-2019-3016. (This bug was also independently discovered by Jim Mattson <[email protected]>) Signed-off-by: Boris Ostrovsky <[email protected]> Reviewed-by: Joao Martins <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
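The difference between the buggy and the safe pattern is small but important (KVM_VCPU_PREEMPTED and KVM_VCPU_FLUSH_TLB are the kvm_steal_time flag names from asm/kvm_para.h; the surrounding code is a sketch):

    /* buggy: wipes KVM_VCPU_FLUSH_TLB if the guest set it meanwhile */
    st->preempted = 0;

    /* safe: clear only the preempted flag, preserving FLUSH_TLB */
    st->preempted &= ~KVM_VCPU_PREEMPTED;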
2020-01-27 | kvm/x86: export kvm_vector_hashing_enabled() is unnecessary | Peng Hao | 1 | -1/+0
kvm_vector_hashing_enabled() is only called from within the kvm.ko module, so there is no need to export it. Signed-off-by: Peng Hao <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-01-27 | KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests | Krish Sadhukhan | 1 | -1/+1
According to the section "Checks on Guest Control Registers, Debug Registers, and MSRs" in Intel SDM vol 3C, the following check is performed on vmentry of nested guests: If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7 field must be 0. In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the latter synthesizes a #GP if any bit in the high dword of the former is set. Hence this field needs to be checked in software. Signed-off-by: Krish Sadhukhan <[email protected]> Reviewed-by: Karl Heubaum <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
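A sketch of such a software check (the helper name is illustrative; upstream added a similar DR7-validity helper, but treat the exact form as an assumption):

    static bool dr7_valid(u64 dr7)
    {
            /* bits 63:32 of DR7 are reserved and must be zero */
            return !(dr7 >> 32);
    }

    /* in the vmcs12 guest-state checks, roughly: */
    if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) &&
        !dr7_valid(vmcs12->guest_dr7))
            return -EINVAL;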