path: root/arch/x86/kvm/svm
Age | Commit message | Author | Files | Lines
2020-08-02KVM: SVM: Fix sev_pin_memory() error handlingDan Carpenter1-10/+10
The sev_pin_memory() function was modified to return error pointers instead of NULL but there are two problems. The first problem is that if "npages" is zero then it still returns NULL. Secondly, several of the callers were not updated to check for error pointers instead of NULL. Either one of these issues will lead to an Oops. Fixes: a8d908b5873c ("KVM: x86: report sev_pin_memory errors with PTR_ERR") Signed-off-by: Dan Carpenter <[email protected]> Message-Id: <20200714142351.GA315374@mwanda> Reviewed-by: Sean Christopherson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
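For illustration, a hedged sketch of the calling convention the fix enforces: callers must test for error pointers rather than NULL (argument and variable names are illustrative, not the exact diff):

    /* Caller pattern after the ERR_PTR() conversion: check for error pointers, not NULL. */
    pages = sev_pin_memory(kvm, params.uaddr, params.len, &n, 1);
    if (IS_ERR(pages))
            return PTR_ERR(pages);  /* previously: if (!pages) return -ENOMEM; */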
2020-07-31KVM: SVM: Fix disable pause loop exit/pause filtering capability on SVMWanpeng Li1-3/+6
Commit 8566ac8b8e7c ("KVM: SVM: Implement pause loop exit logic in SVM") dropped the ability to disable the pause loop exit/pause filtering capability completely; this is probably a merge fault by Radim, since the disable-vmexits-capabilities and SVM pause-loop-exit patchsets were merged at the same time. This patch reintroduces support for disabling the pause loop exit/pause filtering capability. Reported-by: Haiwei Li <[email protected]> Tested-by: Haiwei Li <[email protected]> Fixes: 8566ac8b ("KVM: SVM: Implement pause loop exit logic in SVM") Signed-off-by: Wanpeng Li <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-30KVM: x86: Specify max TDP level via kvm_configure_mmu()Sean Christopherson1-2/+1
Capture the max TDP level during kvm_configure_mmu() instead of using a kvm_x86_ops hook to do it at every vCPU creation. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-30KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDRSean Christopherson1-2/+2
Calculate the desired TDP level on the fly using the max TDP level and MAXPHYADDR instead of doing the same when CPUID is updated. This avoids the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior across x86. Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-30KVM: x86: Pull the PGD's level from the MMU instead of recalculating itSean Christopherson1-1/+2
Use the shadow_root_level from the current MMU as the root level for the PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between VMX and the MMU where both must independently calculate the same root level for things to work correctly. Temporarily keep VMX's calculation of the level and use it to WARN if the incoming level diverges. Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell, and rename a 'cr3' param in the load_mmu_pgd prototype that managed to survive the cr3 purge. No functional change intended. Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-30KVM: nSVM: Correctly set the shadow NPT root level in its MMU roleSean Christopherson1-1/+0
Move the initialization of shadow NPT MMU's shadow_root_level into kvm_init_shadow_npt_mmu() and explicitly set the level in the shadow NPT MMU's role to be the TDP level. This ensures the role and MMU levels are synchronized and also initialized before __kvm_mmu_new_pgd(), which consumes the level when attempting a fast PGD switch. Cc: Vitaly Kuznetsov <[email protected]> Fixes: 9fa72119b24db ("kvm: x86: Introduce kvm_mmu_calc_root_page_role()") Fixes: a506fdd223426 ("KVM: nSVM: implement nested_svm_load_cr3() and use it for host->guest switch") Signed-off-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Tested-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPFPaolo Bonzini1-7/+0
The "if" that drops the present bit from the page structure fauls makes no sense. It was added by yours truly in order to be bug-compatible with pre-existing code and in order to make the tests pass; however, the tests are wrong. The behavior after this patch matches bare metal. Reported-by: Nadav Amit <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR supportMohammed Gamal1-0/+15
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X). The complications in this patch are due to unexpected (but documented) behaviour we see with NPF vmexit handling on AMD processors. If SVM is modified to add guest physical address checks in the NPF and guest #PF paths, we see the following error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001 Dump mapping: address: 0x123400000000 ------L4: 24c3027 ------L3: 24c4027 ------L2: 24c5021 ------L1: 1002000021 This is because the PTE's accessed bit is set by the CPU hardware before the NPF vmexit. This is handled completely by hardware and cannot be fixed in software. Therefore, availability of the new capability depends on a boolean variable allow_smaller_maxphyaddr which is set individually by the VMX and SVM init routines. On VMX it's always set to true; on SVM it's only set to true when NPT is not enabled. CC: Tom Lendacky <[email protected]> CC: Babu Moger <[email protected]> Signed-off-by: Mohammed Gamal <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
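A minimal userspace sketch (not part of the patch) showing how a VMM could probe the new capability; it assumes a <linux/kvm.h> recent enough to define KVM_CAP_SMALLER_MAXPHYADDR:

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Returns 1 if KVM can handle GUEST_MAXPHYADDR < HOST_MAXPHYADDR, 0 if not, -1 on error. */
    static int supports_smaller_maxphyaddr(void)
    {
            int kvm_fd = open("/dev/kvm", O_RDWR);
            int ret;

            if (kvm_fd < 0)
                    return -1;
            ret = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR);
            close(kvm_fd);
            return ret > 0;
    }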
2020-07-10KVM: x86: rename update_bp_intercept to update_exception_bitmapPaolo Bonzini1-4/+3
We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Reviewed-by: Jim Mattson <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: use nested_svm_load_cr3() on guest->host switchVitaly Kuznetsov1-10/+7
Make the nSVM code resemble nVMX, where nested_vmx_load_cr3() is used on both guest->host and host->guest transitions. Also, we can now eliminate the unconditional kvm_mmu_reset_context() and speed things up. Note, nVMX has two different paths: load_vmcs12_host_state() and nested_vmx_restore_host_state(); the latter is used to restore from a 'partial' switch to L2 and always uses kvm_mmu_reset_context(). nSVM doesn't have this yet. Also, nested_svm_vmexit()'s return value is almost always ignored nowadays. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: implement nested_svm_load_cr3() and use it for host->guest switchVitaly Kuznetsov1-9/+29
An undesired triple fault gets injected into the L1 guest on SVM when L2 is launched with certain CR3 values. The #TF is raised by the mmu_check_root() check in fast_pgd_switch(), and the root cause is that when kvm_set_cr3() is called from nested_prepare_vmcb_save() with NPT enabled, CR3 points to an nGPA, so we can't check it with kvm_is_visible_gfn(). Using the generic kvm_set_cr3() when switching to the nested guest is not a great idea, as we'll have to distinguish between 'real' CR3s and 'nested' CR3s to e.g. not call kvm_mmu_new_pgd() with an nGPA. Following nVMX, implement a nested-specific nested_svm_load_cr3() that does the job. To support the change, nested_svm_load_cr3() needs to be re-ordered with nested_svm_init_mmu_context(). Note: the current implementation is sub-optimal as we always do a TLB flush/MMU sync, but this is still an improvement as we at least stop doing kvm_mmu_reset_context(). Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path") Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: move kvm_set_cr3() after nested_svm_uninit_mmu_context()Vitaly Kuznetsov1-6/+8
kvm_mmu_new_pgd() refers to arch.mmu and at this point it still references arch.guest_mmu while arch.root_mmu is expected. Note, the change is effectively a nop: when !npt_enabled, nested_svm_uninit_mmu_context() does nothing (as we don't do nested_svm_init_mmu_context()) and with npt_enabled we don't do kvm_set_cr3(). However, it will matter when we move the call to kvm_mmu_new_pgd into nested_svm_load_cr3(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: introduce nested_svm_load_cr3()/nested_npt_enabled()Vitaly Kuznetsov1-2/+19
As a preparatory change for implementing an nSVM-specific PGD switch (following nVMX's nested_vmx_load_cr3()), introduce nested_svm_load_cr3() instead of relying on kvm_set_cr3(). No functional change intended. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
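A sketch of what the new nested_npt_enabled() helper likely looks like; the cached-control field names are assumptions based on the surrounding nSVM work:

    /* Sketch: whether L1 enabled nested paging for L2 (field names assumed). */
    static bool nested_npt_enabled(struct vcpu_svm *svm)
    {
            return svm->nested.ctl.nested_ctl & SVM_NESTED_CTL_NP_ENABLE;
    }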
2020-07-10KVM: nSVM: prepare to handle errors from enter_svm_guest_mode()Vitaly Kuznetsov3-14/+22
Some operations in enter_svm_guest_mode() may fail; e.g. we currently suppress the kvm_set_cr3() return value. Prepare the code to propagate errors. No functional change intended. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: reset nested_run_pending upon nested_svm_vmrun_msrpm() failureVitaly Kuznetsov1-0/+2
WARN_ON_ONCE(svm->nested.nested_run_pending) in nested_svm_vmexit() will fire if nested_run_pending remains '1', but it doesn't really need to: we are already failing and not going to run the nested guest. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: nSVM: split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu()Vitaly Kuznetsov1-1/+2
As a preparatory change for moving kvm_mmu_new_pgd() from nested_prepare_vmcb_save() to nested_svm_init_mmu_context() split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu(). This also makes the code look more like nVMX (kvm_init_shadow_ept_mmu()). No functional change intended. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-10KVM: x86: move MSR_IA32_PERF_CAPABILITIES emulation to common x86 codeVitaly Kuznetsov1-0/+2
state_test/smm_test selftests are failing on AMD with: "Unexpected result from KVM_GET_MSRS, r: 51 (failed MSR was 0x345)" MSR_IA32_PERF_CAPABILITIES is an emulated MSR on Intel but it is not known to the AMD code, so we can move the emulation to common x86 code. For AMD, we basically just allow the host to read and write zero to the MSR. Fixes: 27461da31089 ("KVM: x86/pmu: Support full width counting") Suggested-by: Jim Mattson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-09x86/kvm/svm: Use uninstrumented wrmsrl() to restore GSThomas Gleixner1-1/+1
On guest exit MSR_GS_BASE contains whatever the guest wrote to it and the first action after returning from the ASM code is to set it to the host kernel value. This uses wrmsrl(), which is interesting to say the least. wrmsrl() is either using native_write_msr() or the paravirt variant. The XEN_PV code is uninteresting as nested SVM in a XEN_PV guest does not work. But native_write_msr() can be placed out of line by the compiler, especially when paravirtualization is enabled in the kernel configuration. The function is marked notrace, but it can still be probed if CONFIG_KPROBE_EVENTS_ON_NOTRACE is enabled. That would be a fatal problem as kprobe events use per-CPU variables which are GS based and would be accessed with the guest GS. Depending on the GS value this would either explode in colorful ways or lead to completely undebuggable data corruption. Aside from that, native_write_msr() contains a tracepoint which objtool complains about as it is invoked from the noinstr section. As this cannot run inside a XEN_PV guest there is no point in using wrmsrl(). Use native_wrmsrl() instead, which is just a plain native WRMSR without tracing or anything else attached. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Juergen Gross <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
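The fix boils down to a one-line substitution at the call site; a sketch (the svm->host.gs_base field name is inferred from context):

    /* Before: goes through paravirt/tracing machinery while the guest's GS may still be live. */
    wrmsrl(MSR_GS_BASE, svm->host.gs_base);

    /* After: a plain WRMSR, no tracepoint and no out-of-line call. */
    native_wrmsrl(MSR_GS_BASE, svm->host.gs_base);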
2020-07-09x86/kvm/svm: Move guest enter/exit into .noinstr.textThomas Gleixner2-44/+56
Move the functions which are inside the RCU off region into the non-instrumentable text section. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-09x86/kvm/svm: Add hardirq tracing on guest enter/exitThomas Gleixner1-5/+22
Entering guest mode is more or less the same as returning to user space. From an instrumentation point of view both leave kernel mode and the transition to guest or user mode reenables interrupts on the host. In user mode an interrupt is served directly and in guest mode it causes a VM exit which then handles or reinjects the interrupt. The transition from guest mode or user mode to kernel mode disables interrupts, which needs to be recorded in instrumentation to set the correct state again. This is important for e.g. latency analysis because otherwise the execution time in guest or user mode would be wrongly accounted as interrupt disabled and could trigger false positives. Add hardirq tracing to guest enter/exit functions in the same way as it is done in the user mode enter/exit code, respecting the RCU requirements. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
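A hedged sketch of the bracketing this describes, mirroring the user-mode enter/exit code; the placement around the low-level VMRUN call is simplified and the exact ordering may differ slightly from the actual patch:

    /* Guest entry: record "irqs on" for instrumentation, then stop instrumenting. */
    instrumentation_begin();
    trace_hardirqs_on_prepare();
    lockdep_hardirqs_on_prepare(CALLER_ADDR0);
    instrumentation_end();
    guest_enter_irqoff();
    lockdep_hardirqs_on(CALLER_ADDR0);

    /* ... low-level VMRUN happens here, with host interrupts physically disabled ... */

    /* Guest exit: record "irqs off" again before any instrumentable code runs. */
    lockdep_hardirqs_off(CALLER_ADDR0);
    guest_exit_irqoff();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    instrumentation_end();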
2020-07-09x86/kvm: Move context tracking where it belongsThomas Gleixner1-0/+16
Context tracking for KVM happens way too early in the vcpu_run() code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() cannot use RCU and should also be not instrumented. The current way of doing this covers way too much code. Move it closer to the actual vmenter/exit code. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Paolo Bonzini <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-09kvm: x86: replace kvm_spec_ctrl_test_value with runtime test on the hostMaxim Levitsky1-1/+1
To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when the host CPU supports the STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD CPUs); in this case we were giving the guest a #GP when it tried to use STIBP. The reason why we can do the host test is that the IA32_SPEC_CTRL MSR is passed through to the guest after the guest sets it to a non-zero value for the first time (due to performance reasons), and as a result of this, it is pointless to emulate the #GP condition on this first access in a different way than what the host CPU does. This is based on a patch from Sean Christopherson, who suggested this idea. Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: [email protected] Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Maxim Levitsky <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
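A sketch of the runtime test this describes: write the guest's value with the exception-safe MSR accessors, restore the old contents, and report whether the write faulted (this approximates rather than reproduces the final helper):

    /* Returns 0 if the host CPU accepts 'value' in IA32_SPEC_CTRL, non-zero if it would #GP. */
    static int spec_ctrl_test_value(u64 value)
    {
            u64 saved;
            unsigned long flags;
            int err = 0;

            local_irq_save(flags);
            if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved))
                    err = 1;
            else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
                    err = 1;
            else
                    wrmsrl(MSR_IA32_SPEC_CTRL, saved);  /* restore the host value */
            local_irq_restore(flags);
            return err;
    }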
2020-07-09KVM: x86: Rename cpuid_update() callback to vcpu_after_set_cpuid()Xiaoyao Li1-2/+2
The name of the cpuid_update() callback is misleading: it is not about updating the CPUID settings of the vcpu but about updating the configuration of the vcpu based on the CPUIDs. So rename it to vcpu_after_set_cpuid(). Signed-off-by: Xiaoyao Li <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guestsKrish Sadhukhan2-3/+28
According to section "Canonicalization and Consistency Checks" in APM vol. 2 the following guest state is illegal: "Any MBZ bit of CR3 is set." "Any MBZ bit of CR4 is set." Suggested-by: Paolo Bonzini <[email protected]> Signed-off-by: Krish Sadhukhan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: SVM: Rename svm_nested_virtualize_tpr() to nested_svm_virtualize_tpr()Joerg Roedel2-4/+4
Match the naming with other nested svm functions. No functional changes. Signed-off-by: Joerg Roedel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: SVM: Add svm_ prefix to set/clr/is_intercept()Joerg Roedel3-48/+48
Make clear the symbols belong to the SVM code when they are built-in. No functional changes. Signed-off-by: Joerg Roedel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: SVM: Add vmcb_ prefix to mark_*() functionsJoerg Roedel5-31/+31
Make it more clear what data structure these functions operate on. No functional changes. Signed-off-by: Joerg Roedel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: SVM: Rename struct nested_state to svm_nested_stateJoerg Roedel1-2/+2
Renaming is only needed in the svm.h header file. No functional changes. Signed-off-by: Joerg Roedel <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08kvm: x86: Set last_vmentry_cpu in vcpu_enter_guestJim Mattson1-1/+0
Since this field is now in kvm_vcpu_arch, clean things up a little by setting it in vendor-agnostic code: vcpu_enter_guest. Note that it must be set after the call to kvm_x86_ops.run(), since it can't be updated before pre_sev_run(). Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
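A sketch of the vendor-agnostic placement this describes, inside vcpu_enter_guest(); the surrounding code and the exact form of the run callback invocation are assumptions:

    exit_fastpath = kvm_x86_ops.run(vcpu);

    /* Per the commit message, this must come after kvm_x86_ops.run():
     * it cannot be updated before pre_sev_run(). */
    vcpu->arch.last_vmentry_cpu = vcpu->cpu;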
2020-07-08kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpuJim Mattson3-7/+4
Both the vcpu_vmx structure and the vcpu_svm structure have a 'last_cpu' field. Move the common field into the kvm_vcpu_arch structure. For clarity, rename it to 'last_vmentry_cpu.' Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08kvm: x86: Add "last CPU" to some KVM_EXIT informationJim Mattson1-1/+3
More often than not, a failed VM-entry in an x86 production environment is induced by a defective CPU. To help identify the bad hardware, include the id of the last logical CPU to run a vCPU in the information provided to userspace on a KVM exit for failed VM-entry or for KVM internal errors not associated with emulation. The presence of this additional information is indicated by a new capability, KVM_CAP_LAST_CPU. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
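A hedged userspace-side sketch of consuming the new information on a failed VM-entry; the 'cpu' field name in struct kvm_run's fail_entry is inferred from the commit description and should be checked against the uapi header:

    #include <stdio.h>
    #include <linux/kvm.h>

    /* Called after KVM_RUN returns; 'run' is the mmap'ed struct kvm_run. */
    static void report_failed_entry(const struct kvm_run *run)
    {
            if (run->exit_reason == KVM_EXIT_FAIL_ENTRY)
                    fprintf(stderr, "VM-entry failed: reason 0x%llx on pCPU %u\n",
                            (unsigned long long)run->fail_entry.hardware_entry_failure_reason,
                            run->fail_entry.cpu);
    }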
2020-07-08kvm: svm: Always set svm->last_cpu on VMRUNJim Mattson2-1/+1
Previously, this field was only set when using SEV. Set it for all vCPU configurations, so that it can be communicated to userspace for diagnosing potential hardware errors. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Reviewed-by: Peter Shier <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08kvm: svm: Prefer vcpu->cpu to raw_smp_processor_id()Jim Mattson1-6/+3
The current logical processor id is cached in vcpu->cpu. Use it instead of raw_smp_processor_id() when a kvm_vcpu struct is available. Signed-off-by: Jim Mattson <[email protected]> Reviewed-by: Oliver Upton <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: x86: report sev_pin_memory errors with PTR_ERRPaolo Bonzini1-10/+12
Callers of sev_pin_memory() treat NULL differently: sev_launch_secret()/svm_register_enc_region() return -ENOMEM, while sev_dbg_crypt() returns -EFAULT. Switching to ERR_PTR() preserves the error and enables cleaner reporting of different kinds of failures. Suggested-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: SVM: convert get_user_pages() --> pin_user_pages()John Hubbard1-3/+3
This code was using get_user_pages*(), in a "Case 2" scenario (DMA/RDMA), using the categorization from [1]. That means that it's time to convert the get_user_pages*() + put_page() calls to pin_user_pages*() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: John Hubbard <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
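A sketch of the conversion pattern being applied (flag usage and variable names are illustrative, not the exact diff):

    /* Before: acquire with get_user_pages_fast(), release page by page with put_page(). */
    npinned = get_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
    /* ... use the pages ... */
    for (i = 0; i < npages; i++)
            put_page(pages[i]);

    /* After: acquire with pin_user_pages_fast(), release the whole array at once. */
    npinned = pin_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
    /* ... use the pages ... */
    unpin_user_pages(pages, npages);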
2020-07-08KVM: SVM: fix sev_pin_memory()'s use of get_user_pages_fast()John Hubbard1-1/+5
There are two problems in sev_pin_memory(): 1) The return value of get_user_pages_fast() is stored in an unsigned long, although the declared return value is of type int. This will not cause any symptoms, but it is misleading. Fix this by changing the type of npinned to "int". 2) The number of pages passed into get_user_pages_fast() is stored in an unsigned long, even though get_user_pages_fast() accepts an int. This means that it is possible to silently overflow the number of pages. Fix this by adding a WARN_ON_ONCE() and an early error return. The npages variable is left as an unsigned long for convenience in checking for overflow. Fixes: 89c505809052 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_UPDATE_DATA command") Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Vitaly Kuznetsov <[email protected]> Cc: Wanpeng Li <[email protected]> Cc: Jim Mattson <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: John Hubbard <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
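A sketch of the two fixes described above (variable and parameter names follow the commit text; surrounding code is elided, and the error style predates the later ERR_PTR() conversion):

    unsigned long npages;   /* page count derived from the user-supplied length */
    int npinned;            /* was unsigned long; get_user_pages_fast() returns int */

    if (WARN_ON_ONCE(npages > INT_MAX))
            return NULL;    /* early error return instead of silently truncating */

    npinned = get_user_pages_fast(uaddr, npages, write ? FOLL_WRITE : 0, pages);
    if (npinned != npages)
            goto err;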
2020-07-08KVM: nSVM: Check that DR6[63:32] and DR7[63:32] are not set on vmrun of nested guestsKrish Sadhukhan1-0/+3
According to section "Canonicalization and Consistency Checks" in APM vol. 2 the following guest state is illegal: "DR6[63:32] are not zero." "DR7[63:32] are not zero." "Any MBZ bit of EFER is set." Signed-off-by: Krish Sadhukhan <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-07-08KVM: X86: Do the same ignore_msrs check for feature msrsPeter Xu1-1/+1
Logically the ignore_msrs and report_ignored_msrs should also apply to feature MSRs. Add them in. Signed-off-by: Peter Xu <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-15kvm/svm: disable KCSAN for svm_vcpu_run()Qian Cai1-1/+1
For some reason, running a simple qemu-kvm command with KCSAN will reset AMD hosts. It turns out svm_vcpu_run() could not be instrumented. Disable it for now. # /usr/libexec/qemu-kvm -name ubuntu-18.04-server-cloudimg -cpu host -smp 2 -m 2G -hda ubuntu-18.04-server-cloudimg.qcow2 === console output === Kernel 5.6.0-next-20200408+ on an x86_64 hp-dl385g10-05 login: <...host reset...> HPE ProLiant System BIOS A40 v1.20 (03/09/2018) (C) Copyright 1982-2018 Hewlett Packard Enterprise Development LP Early system initialization, please wait... Signed-off-by: Qian Cai <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-13Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 months ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to be exposed to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code paths enforceable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner"). That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistles in ASM. - exception entry/exit got the equivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow putting the whole entry instrumentation and RCU handling into safe places instead of the previous pray-that-it-is-correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside from the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed getting rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init, which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought, but under the aspect that MCE is a complete trainwreck by design and the probability of surviving it is low, this was not high on the priority list. - Paravirtualization: When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straightforward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM: KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE: Pretty much the same problems can be found in the low level idle code, especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain-melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Marco Elver, Paolo Bonzini, Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid *SAN instrumentation x86/entry: __always_inline arch_atomic_* for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...
2020-06-11x86/entry: Convert Machine Check to IDTENTRY_ISTThomas Gleixner1-1/+1
Convert #MC to IDTENTRY_MCE: - Implement the C entry points with DEFINE_IDTENTRY_MCE - Emit the ASM stub with DECLARE_IDTENTRY_MCE - Remove the ASM idtentry in 64bit - Remove the open coded ASM entry code in 32bit - Fixup the XEN/PV code - Remove the old prototypes - Remove the error code from *machine_check_vector() as it is always 0 and not used by any of the functions it can point to. Fixup all the functions as well. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Andy Lutomirski <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-08KVM: SVM: fix calls to is_interceptPaolo Bonzini2-2/+4
is_intercept takes an INTERCEPT_* constant, not SVM_EXIT_*; because of this, the compiler was removing the body of the conditionals, as if is_intercept returned 0. This unveils a latent bug: when clearing the VINTR intercept, int_ctl must also be changed in the L1 VMCB (svm->nested.hsave), just like the intercept itself is also changed in the L1 VMCB. Otherwise V_IRQ remains set and, due to the VINTR intercept being clear, we get a spurious injection of a vector 0 interrupt on the next L2->L1 vmexit. Reported-by: Qian Cai <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
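A sketch of the mix-up being fixed (is_intercept() is the helper's name at this point in the tree; a later commit adds the svm_ prefix):

    /* Broken: SVM_EXIT_VINTR is an exit code, not an intercept bit index,
     * so the compiler treated the test as always false and dropped the body. */
    if (is_intercept(svm, SVM_EXIT_VINTR)) {
            /* never reached */
    }

    /* Fixed: use the INTERCEPT_* constant the helper actually expects. */
    if (is_intercept(svm, INTERCEPT_VINTR)) {
            /* handle the pending virtual interrupt window */
    }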
2020-06-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-416/+714
Pull kvm updates from Paolo Bonzini: "ARM: - Move the arch-specific code into arch/arm64/kvm - Start the post-32bit cleanup - Cherry-pick a few non-invasive pre-NV patches x86: - Rework of TLB flushing - Rework of event injection, especially with respect to nested virtualization - Nested AMD event injection facelift, building on the rework of generic code and fixing a lot of corner cases - Nested AMD live migration support - Optimization for TSC deadline MSR writes and IPIs - Various cleanups - Asynchronous page fault cleanups (from tglx, common topic branch with tip tree) - Interrupt-based delivery of asynchronous "page ready" events (host side) - Hyper-V MSRs and hypercalls for guest debugging - VMX preemption timer fixes s390: - Cleanups Generic: - switch vCPU thread wakeup from swait to rcuwait The other architectures, and the guest side of the asynchronous page fault work, will come next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits) KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test KVM: check userspace_addr for all memslots KVM: selftests: update hyperv_cpuid with SynDBG tests x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls x86/kvm/hyper-v: enable hypercalls regardless of hypercall page x86/kvm/hyper-v: Add support for synthetic debugger interface x86/hyper-v: Add synthetic debugger definitions KVM: selftests: VMX preemption timer migration test KVM: nVMX: Fix VMX preemption timer migration x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit KVM: x86/pmu: Support full width counting KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT KVM: x86: acknowledgment mechanism for async pf page ready notifications KVM: x86: interrupt based APF 'page ready' event delivery KVM: introduce kvm_read_guest_offset_cached() KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously" KVM: VMX: Replace zero-length array with flexible-array ...
2020-06-02mm: remove the pgprot argument to __vmallocChristoph Hellwig1-2/+1
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Michael Kelley <[email protected]> [hyperv] Acked-by: Gao Xiang <[email protected]> [erofs] Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Wei Liu <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Haiyang Zhang <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "K. Y. Srinivasan" <[email protected]> Cc: Laura Abbott <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nitin Gupta <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Sakari Ailus <[email protected]> Cc: Stephen Hemminger <[email protected]> Cc: Sumit Semwal <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
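The interface change at the SVM call site amounts to dropping the last argument; a generic sketch (the size and GFP flags here are illustrative, not the actual call site):

    /* Before: pgprot was an explicit argument, but it was always PAGE_KERNEL. */
    buf = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO, PAGE_KERNEL);

    /* After: the pgprot argument is gone from __vmalloc(). */
    buf = __vmalloc(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);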
2020-06-01KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' inWei Wang1-3/+4
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated field from the struct could be used by get_msr. This also makes this API consistent with kvm_pmu_set_msr. No functional changes. Signed-off-by: Wei Wang <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
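A sketch of the prototype change (the struct msr_data fields listed are the standard ones; consult the tree for the exact prototype):

    /* Before */
    int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);

    /* After: mirrors kvm_pmu_set_msr(); msr_info->index selects the MSR,
     * msr_info->data receives the value, and msr_info->host_initiated is
     * now visible to the vendor get_msr callbacks. */
    int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);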
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov2-2/+3
Currently, the APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extend the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While at it, rename 'reason' to 'flags'. This doesn't change the semantics, as we only have reasons '1' and '2' and these can be treated as bit flags, but KVM_PV_REASON_PAGE_READY is going away with interrupt-based delivery, making the 'reason' name misleading. The newly introduced apf_put_user_ready() temporarily puts both flags and token information; this will be changed to put the token only when we switch to interrupt-based notifications. Signed-off-by: Vitaly Kuznetsov <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-01KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATEPaolo Bonzini2-0/+148
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-01KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmuPaolo Bonzini1-1/+4
This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-01KVM: nSVM: leave guest mode when clearing EFER.SVMEPaolo Bonzini3-2/+25
According to the AMD manual, the effect of turning off EFER.SVME while a guest is running is undefined. We make it leave guest mode immediately, similar to the effect of clearing the VMX bit in MSR_IA32_FEAT_CTL. Signed-off-by: Paolo Bonzini <[email protected]>
2020-06-01KVM: nSVM: split nested_vmcb_check_controlsPaolo Bonzini1-9/+14
The authoritative state does not come from the VMCB once in guest mode, but KVM_SET_NESTED_STATE can still perform checks on L1's provided SVM controls because we get them from userspace. Therefore, split out a function to do them. Signed-off-by: Paolo Bonzini <[email protected]>