path: root/arch/x86
Age  Commit message  Author  Files  Lines
2024-08-22KVM: x86: Funnel all fancy MSR return value handling into a common helperSean Christopherson1-42/+42
Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that are type and access specific, and more importantly to handle errors that are returned from the leaf APIs. I.e. turn kvm_msr_ignored_check() from a helper that is called on an error, into a trampoline that detects errors *and* applies relevant side effects, e.g. logging unimplemented accesses. Because the leaf APIs are used for guest accesses, userspace accesses, and KVM accesses, and because KVM supports restricting access to MSRs from userspace via filters, the error handling is subtly non-trivial. E.g. KVM has had at least one bug escape due to making each "outer" function handle errors. See commit 3376ca3f1a20 ("KVM: x86: Fix KVM_GET_MSRS stack info leak"). Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
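A minimal sketch of the funneling idea described above; the helper name matches the changelog, but the exact signature and error plumbing here are assumptions rather than the patch itself:

static int kvm_do_msr_access(struct kvm_vcpu *vcpu, u32 msr, u64 *data,
                             bool host_initiated, bool write,
                             int (*leaf)(struct kvm_vcpu *, u32, u64 *, bool))
{
        /* Invoke the type/access specific "leaf" API... */
        int ret = leaf(vcpu, msr, data, host_initiated);

        /* ...and apply the common side effects (e.g. log and optionally
         * ignore unimplemented accesses) in exactly one place. */
        if (ret == KVM_MSR_RET_UNSUPPORTED)
                ret = kvm_msr_ignored_check(msr, *data, write);

        return ret;
}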
2024-08-22KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entrySean Christopherson1-16/+13
Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as separate parameters, along with a vCPU pointer, i.e. to give it the same prototype as kvm_{g,s}et_msr_ignored_check(). This will allow using a common inner helper for handling accesses to "regular" and feature MSRs. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()Sean Christopherson7-14/+14
Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entrySean Christopherson5-15/+13
Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTEDSean Christopherson4-12/+19
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean that the value being written via WRMSR is invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Move MSR_TYPE_{R,W,RW} values from VMX to x86, as enumsSean Christopherson2-4/+6
Move VMX's MSR_TYPE_{R,W,RW} #defines to x86.h, as enums, so that they can be used by common x86 code, e.g. instead of doing "bool write". Opportunistically tweak the definitions to make it more obvious that the values are bitmasks, not arbitrary ascending values. No functional change intended. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
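The resulting definitions are presumably along these lines (bitmask-style values, per the changelog):

enum {
        MSR_TYPE_R  = BIT(0),
        MSR_TYPE_W  = BIT(1),
        MSR_TYPE_RW = MSR_TYPE_R | MSR_TYPE_W,
};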
2024-08-22KVM: SVM: Disallow guest from changing userspace's MSR_AMD64_DE_CFG valueSean Christopherson1-2/+7
Inject a #GP if the guest attempts to change MSR_AMD64_DE_CFG from its *current* value, not if the guest attempts to write a value other than KVM's set of supported bits. As per the comment and the changelog of the original code, the intent is to effectively make MSR_AMD64_DE_CFG read- only for the guest. Opportunistically use a more conventional equality check instead of an exclusive-OR check to detect attempts to change bits. Fixes: d1d93fa90f1a ("KVM: SVM: Add MSR-based feature support for serializing LFENCE") Cc: Tom Lendacky <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
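A hedged sketch of the corrected check in svm_set_msr(); the field name (svm->msr_decfg) and surrounding plumbing are assumptions:

        case MSR_AMD64_DE_CFG:
                /* Effectively read-only for the guest: any attempted change
                 * to the current value is met with a #GP. */
                if (!msr->host_initiated && data != svm->msr_decfg)
                        return 1;
                svm->msr_decfg = data;
                break;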
2024-08-22KVM: x86/mmu: Clean up function comments for dirty logging APIsSean Christopherson1-33/+15
Rework the function comment for kvm_arch_mmu_enable_log_dirty_pt_masked() into the body of the function, as it has gotten a bit stale, is harder to read without the code context, and is the last source of warnings for W=1 builds in KVM x86 due to using a kernel-doc comment without documenting all parameters. Opportunistically subsume the function comments for kvm_mmu_write_protect_pt_masked() and kvm_mmu_clear_dirty_pt_masked(), as there is no value in regurgitating similar information at a higher level, and capturing the differences between write-protection and PML-based dirty logging is best done in a common location. No functional change intended. Cc: David Matlack <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_onlineLi Chen1-2/+1
Use this_cpu_ptr() instead of open coding the equivalent in kvm_user_return_msr_cpu_online. Signed-off-by: Li Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
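Illustrative before/after (the per-CPU variable name is an assumption):

        /* before: */
        struct kvm_user_return_msrs *msrs = per_cpu_ptr(user_return_msrs, smp_processor_id());
        /* after: */
        struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);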
2024-08-22KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()Vitaly Kuznetsov1-0/+8
GCC 12.3.0 complains about a potential NULL pointer dereference in evmcs_load() as hv_get_vp_assist_page() can return NULL. In fact, this cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a valid VP assist page and aborts enabling the feature otherwise. CPU onlining path is also checked in vmx_hardware_enable(). To make the compiler happy and to future proof the code, add a KVM_BUG_ON() sentinel. It doesn't seem to be possible (and logical) to observe evmcs_load() happening without an active vCPU so it is presumed that kvm_get_running_vcpu() can't return NULL. No functional change intended. Reported-by: Mirsad Todorovac <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Vitaly Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
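A sketch of the described sentinel, assuming the existing hv_get_vp_assist_page() lookup in evmcs_load():

        struct hv_vp_assist_page *vp_ap = hv_get_vp_assist_page(smp_processor_id());

        /* Cannot happen per hv_init_evmcs()/vmx_hardware_enable(), but bail
         * (and mark the VM as bugged) instead of dereferencing NULL. */
        if (KVM_BUG_ON(!vp_ap, kvm_get_running_vcpu()->kvm))
                return;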
2024-08-22KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalentMaxim Levitsky3-5/+7
In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of setting segment_cache.bitmask directly. Using the helper minimizes the chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g. if KVM ends up doing more than just zero the bitmask when purging the cache. No functional change intended. Signed-off-by: Maxim Levitsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: massage changelog] Signed-off-by: Sean Christopherson <[email protected]>
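The change presumably boils down to:

        /* before: */
        vmx->segment_cache.bitmask = 0;
        /* after: */
        vmx_segment_cache_clear(vmx);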
2024-08-22KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-ExitSean Christopherson3-8/+12
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated to the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <[email protected]> Cc: Jim Mattson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: VMX: Do not account for temporary memory allocation in ECREATE emulationKai Huang1-1/+1
In handle_encls_ecreate(), a page is allocated to store a copy of SECS structure used by the ENCLS[ECREATE] leaf from the guest. This page is only used temporarily and is freed after use in handle_encls_ecreate(). Don't account for the memory allocation of this page per [1]. Link: https://lore.kernel.org/kvm/[email protected]/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1] Signed-off-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the vmcs_check16 functionQiang Liu1-1/+1
According to the SDM, the meaning of field bit 0 is: Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit, and natural-width fields. So there is no 32-bit high field here, it should be a 32-bit field instead. Signed-off-by: Qiang Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
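A hedged sketch of the corrected assertion; the field-encoding mask and value (width bits 14:13 == 10b for 32-bit fields) are reconstructed from the VMCS encoding rules, not copied from the patch:

        BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0x4000,
                         "16-bit accessor invalid for 32-bit field");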
2024-08-22KVM: SVM: Remove unnecessary GFP_KERNEL_ACCOUNT in svm_set_nested_state()Yongqiang Liu1-2/+2
The fixed-size temporary variables vmcb_control_area and vmcb_save_area allocated in svm_set_nested_state() are released when the function exits. Meanwhile, svm_set_nested_state() also has the vCPU mutex held, which prevents massive concurrent allocations, so we don't need to use GFP_KERNEL_ACCOUNT. Signed-off-by: Yongqiang Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
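I.e., presumably (local variable names assumed):

        ctl  = kzalloc(sizeof(*ctl),  GFP_KERNEL);      /* was GFP_KERNEL_ACCOUNT */
        save = kzalloc(sizeof(*save), GFP_KERNEL);      /* was GFP_KERNEL_ACCOUNT */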
2024-08-22KVM: nVMX: Use macros and #defines in vmx_restore_vmx_misc()Xin Li1-7/+20
Use macros in vmx_restore_vmx_misc() instead of open coding everything using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits and reserved bits into separate variables, and add a comment explaining the subset logic (it's not immediately obvious that the set of feature bits is NOT the set of _supported_ feature bits). Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog, drop #defines] Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Zhao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: VMX: Open code VMX preemption timer rate mask in its accessorXin Li2-3/+2
Use vmx_misc_preemption_timer_rate() to get the rate in hardware_setup(), and open code the rate's bitmask in vmx_misc_preemption_timer_rate() so that the function looks like all the helpers that grab values from VMX_BASIC and VMX_MISC MSR values. No functional change intended. Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog] Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
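A sketch of what the accessor presumably looks like afterwards, assuming the rate lives in bits 4:0 of VMX_MISC:

static inline int vmx_misc_preemption_timer_rate(u64 vmx_misc)
{
        return vmx_misc & GENMASK_ULL(4, 0);
}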
2024-08-22KVM: VMX: Move MSR_IA32_VMX_MISC bit defines to asm/vmx.hSean Christopherson5-16/+16
Move the handful of MSR_IA32_VMX_MISC bit defines that are currently in msr-index.h to vmx.h so that all of the VMX_MISC defines and wrappers can be found in a single location. Opportunistically use BIT_ULL() instead of open coding hex values, add defines for feature bits that are architecturally defined, and move the defines down in the file so that they are colocated with the helpers for getting fields from VMX_MISC. No functional change intended. Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog] Reviewed-by: Zhao Liu <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: nVMX: Add a helper to encode VMCS info in MSR_IA32_VMX_BASICSean Christopherson2-7/+8
Add a helper to encode the VMCS revision, size, and supported memory types in MSR_IA32_VMX_BASIC, i.e. when synthesizing KVM's supported BASIC MSR value, and delete the now unused VMCS size and memtype shift macros. For a variety of reasons, KVM has shifted (pun intended) to using helpers to *get* information from the VMX MSRs, as opposed to defined MASK and SHIFT macros for direct use. Provide a similar helper for the nested VMX code, which needs to *set* information, so that KVM isn't left with a mix of SHIFT macros and dedicated helpers. Reported-by: Xiaoyao Li <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
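A sketch of such an encoder based on the architectural layout of MSR_IA32_VMX_BASIC (revision ID in bits 30:0, VMCS size in bits 44:32, memory type in bits 53:50); the helper name is an assumption:

static inline u64 vmx_basic_encode_vmcs_info(u32 revision, u16 size, u8 memtype)
{
        return revision | ((u64)size << 32) | ((u64)memtype << 50);
}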
2024-08-22KVM: nVMX: Use macros and #defines in vmx_restore_vmx_basic()Xin Li1-7/+18
Use macros in vmx_restore_vmx_basic() instead of open coding everything using BIT_ULL() and GENMASK_ULL(). Opportunistically split feature bits and reserved bits into separate variables, and add a comment explaining the subset logic (it's not immediately obvious that the set of feature bits is NOT the set of _supported_ feature bits). Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog, drop #defines] Reviewed-by: Zhao Liu <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: VMX: Track CPU's MSR_IA32_VMX_BASIC as a single 64-bit valueXin Li3-18/+21
Track the "basic" capabilities VMX MSR as a single u64 in vmcs_config instead of splitting it across three fields, that obviously don't combine into a single 64-bit value, so that KVM can use the macros that define MSR bits using their absolute position. Replace all open coded shifts and masks, many of which are relative to the "high" half, with the appropriate macro. Opportunistically use VMX_BASIC_32BIT_PHYS_ADDR_ONLY instead of an open coded equivalent, and clean up the related comment to not reference a specific SDM section (to the surprise of no one, the comment is stale). No functional change intended (though obviously the code generation will be quite different). Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog] Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Zhao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: VMX: Move MSR_IA32_VMX_BASIC bit defines to asm/vmx.hXin Li2-8/+7
Move the bit defines for MSR_IA32_VMX_BASIC from msr-index.h to vmx.h so that they are colocated with other VMX MSR bit defines, and with the helpers that extract specific information from an MSR_IA32_VMX_BASIC value. Opportunistically use BIT_ULL() instead of open coding hex values. Opportunistically rename VMX_BASIC_64 to VMX_BASIC_32BIT_PHYS_ADDR_ONLY, as "VMX_BASIC_64" is wildly misleading. The flag enumerates that addresses are limited to 32 bits, not that 64-bit addresses are allowed. Last but not least, opportunistically #define DUAL_MONITOR_TREATMENT so that all known single-bit feature flags are defined (this will allow replacing open-coded literals in the future). Cc: Shan Kang <[email protected]> Cc: Kai Huang <[email protected]> Signed-off-by: Xin Li <[email protected]> [sean: split to separate patch, write changelog] Reviewed-by: Zhao Liu <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Stuff vCPU's PAT with default value at RESET, not creationSean Christopherson1-2/+2
Move the stuffing of the vCPU's PAT to the architectural "default" value from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event, to better capture that the default value is the value "Following Power-up or Reset". E.g. setting PAT only during creation would break if KVM were to expose a RESET ioctl() to userspace (which is unlikely, but that's not a good reason to have unintuitive code). No functional change. Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Jim Mattson <[email protected]> Reviewed-by: Zhao Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22x86/cpu: KVM: Move macro to encode PAT value to common headerSean Christopherson3-11/+11
Move pat/memtype.c's PAT() macro to msr-index.h as PAT_VALUE(), and use it in KVM to define the default (Power-On / RESET) PAT value instead of open coding an inscrutable magic number. No functional change intended. Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.)Sean Christopherson6-26/+37
Add defines for the architectural memory types that can be shoved into various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs, etc. While most MSRs/registers support only a subset of all memory types, the values themselves are architectural and identical across all users. Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi header, but add compile-time assertions to connect the dots (and sanity check that the msr-index.h values didn't get fat-fingered). Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the EPTP holds a single memory type in 3 of its 64 bits; those bits just happen to be 2:0, i.e. don't need to be shifted. Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in setup_vmcs_config(). No functional change intended. Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Kai Huang <[email protected]> Reviewed-by: Xiaoyao Li <[email protected]> Reviewed-by: Kai Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
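A hedged sketch of how the common memory-type defines and the PAT encoding from the previous entry fit together; the values follow the SDM encodings, but the exact macro spellings are assumptions:

#define X86_MEMTYPE_UC          0ULL    /* Uncacheable */
#define X86_MEMTYPE_WC          1ULL    /* Write Combining */
#define X86_MEMTYPE_WT          4ULL    /* Write Through */
#define X86_MEMTYPE_WP          5ULL    /* Write Protected */
#define X86_MEMTYPE_WB          6ULL    /* Write Back */
#define X86_MEMTYPE_UC_MINUS    7ULL    /* UC, but can be overridden by MTRRs */

#define PAT_VALUE(p0, p1, p2, p3, p4, p5, p6, p7)                       \
        ((X86_MEMTYPE_ ## p0)       | (X86_MEMTYPE_ ## p1 << 8)  |      \
         (X86_MEMTYPE_ ## p2 << 16) | (X86_MEMTYPE_ ## p3 << 24) |      \
         (X86_MEMTYPE_ ## p4 << 32) | (X86_MEMTYPE_ ## p5 << 40) |      \
         (X86_MEMTYPE_ ## p6 << 48) | (X86_MEMTYPE_ ## p7 << 56))

/* Power-up/RESET PAT, i.e. the old magic number 0x0007040600070406: */
#define PAT_RESET_VALUE PAT_VALUE(WB, WT, UC_MINUS, UC, WB, WT, UC_MINUS, UC)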
2024-08-22KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASEMaxim Levitsky1-0/+12
If these MSRs are read by the emulator (e.g. due to the 'force emulation' prefix), SVM code currently fails to extract the corresponding segment bases and return them to the emulator. Fix that. Cc: [email protected] Signed-off-by: Maxim Levitsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
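A hedged sketch of the fix in SVM's MSR read path; the exact VMCB plumbing is an assumption:

        case MSR_FS_BASE:
                msr_info->data = svm->vmcb01.ptr->save.fs.base;
                break;
        case MSR_GS_BASE:
                msr_info->data = svm->vmcb01.ptr->save.gs.base;
                break;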
2024-08-22KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTSSean Christopherson1-0/+2
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS. ============================= WARNING: suspicious RCU usage 6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted ----------------------------- include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by repro/1071: #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm] stack backtrace: CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x13f/0x1a0 kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel] load_vmcs12_host_state+0x432/0xb40 [kvm_intel] vmx_leave_nested+0x30/0x40 [kvm_intel] kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm] kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm] ? mark_held_locks+0x49/0x70 ? kvm_vcpu_ioctl+0x7d/0x970 [kvm] ? kvm_vcpu_ioctl+0x497/0x970 [kvm] kvm_vcpu_ioctl+0x497/0x970 [kvm] ? lock_acquire+0xba/0x2d0 ? find_held_lock+0x2b/0x80 ? do_user_addr_fault+0x40c/0x6f0 ? lock_release+0xb7/0x270 __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7ff11eb1b539 </TASK> Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
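The fix presumably amounts to wrapping the handler invocation in the KVM_SET_VCPU_EVENTS ioctl path, e.g.:

        kvm_vcpu_srcu_read_lock(vcpu);
        r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
        kvm_vcpu_srcu_read_unlock(vcpu);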
2024-08-22KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEsSean Christopherson1-1/+3
Error out if kvm_mmu_reload() fails when pre-faulting memory, as trying to fault-in SPTEs will fail miserably due to root.hpa pointing at garbage. Note, kvm_mmu_reload() can return -EIO and thus trigger the WARN on -EIO in kvm_vcpu_pre_fault_memory(), but all such paths also WARN, i.e. the WARN isn't user-triggerable and won't run afoul of warn-on-panic because the kernel would already be panicking. BUG: unable to handle page fault for address: 000029ffffffffe8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP CPU: 22 PID: 1069 Comm: pre_fault_memor Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #548 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:is_page_fault_stale+0x3e/0xe0 [kvm] RSP: 0018:ffffc9000114bd48 EFLAGS: 00010206 RAX: 00003fffffffffc0 RBX: ffff88810a07c080 RCX: ffffc9000114bd78 RDX: ffff88810a07c080 RSI: ffffea0000000000 RDI: ffff88810a07c080 RBP: ffffc9000114bd78 R08: 00007fa3c8c00000 R09: 8000000000000225 R10: ffffea00043d7d80 R11: 0000000000000000 R12: ffff88810a07c080 R13: 0000000100000000 R14: ffffc9000114be58 R15: 0000000000000000 FS: 00007fa3c9da0740(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000029ffffffffe8 CR3: 000000011d698000 CR4: 0000000000352eb0 Call Trace: <TASK> kvm_tdp_page_fault+0xcc/0x160 [kvm] kvm_mmu_do_page_fault+0xfb/0x1f0 [kvm] kvm_arch_vcpu_pre_fault_memory+0xd0/0x1a0 [kvm] kvm_vcpu_ioctl+0x761/0x8c0 [kvm] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Modules linked in: kvm_intel kvm CR2: 000029ffffffffe8 ---[ end trace 0000000000000000 ]--- Fixes: 6e01b7601dfe ("KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()") Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected] Reviewed-by: Kai Huang <[email protected]> Tested-by: xingwei lee <[email protected]> Tested-by: yuxin wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
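I.e., presumably something like this early in kvm_arch_vcpu_pre_fault_memory():

        r = kvm_mmu_reload(vcpu);
        if (r)
                return r;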
2024-08-22KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE renameYan Zhao3-8/+8
Replace "removed" with "frozen" in comments as appropriate to complete the rename of REMOVED_SPTE to FROZEN_SPTE. Fixes: 964cea817196 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE") Signed-off-by: Yan Zhao <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: write changelog] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Advertise AVX10.1 CPUID to userspaceTao Su3-2/+37
Advertise AVX10.1 related CPUIDs, i.e. report AVX10 support bit via CPUID.(EAX=07H, ECX=01H):EDX[bit 19] and new CPUID leaf 0x24H so that guest OS and applications can query the AVX10.1 CPUIDs directly. Intel AVX10 represents the first major new vector ISA since the introduction of Intel AVX512, which will establish a common, converged vector instruction set across all Intel architectures[1]. AVX10.1 is an early version of AVX10, that enumerates the Intel AVX512 instruction set at 128, 256, and 512 bits which is enabled on Granite Rapids. I.e., AVX10.1 is only a new CPUID enumeration with no new functionality. New features, e.g. Embedded Rounding and Suppress All Exceptions (SAE) will be introduced in AVX10.2. Advertising AVX10.1 is safe because there is nothing to enable for AVX10.1, i.e. it's purely a new way to enumerate support, thus there will never be anything for the kernel to enable. Note just the CPUID checking is changed when using AVX512 related instructions, e.g. if using one AVX512 instruction needs to check (AVX512 AND AVX512DQ), it can check ((AVX512 AND AVX512DQ) OR AVX10.1) after checking XCR0[7:5]. The versions of AVX10 are expected to be inclusive, e.g. version N+1 is a superset of version N. Per the spec, the version can never be 0, just advertise AVX10.1 if it's supported in hardware. Moreover, advertising AVX10_{128,256,512} needs to land in the same commit as advertising basic AVX10.1 support, otherwise KVM would advertise an impossible CPU model. E.g. a CPU with AVX512 but not AVX10.1/512 is impossible per the SDM. As more and more AVX related CPUIDs are added (it would have resulted in around 40-50 CPUID flags when developing AVX10), the versioning approach is introduced. But incrementing version numbers are bad for virtualization. E.g. if AVX10.2 has a feature that shouldn't be enumerated to guests for whatever reason, then KVM can't enumerate any "later" features either, because the only way to hide the problematic AVX10.2 feature is to set the version to AVX10.1 or lower[2]. But most AVX features are just passed through and don't have virtualization controls, so AVX10 should not be problematic in practice, so long as Intel honors their promise that future versions will be supersets of past versions. [1] https://cdrdv2.intel.com/v1/dl/getContent/784267 [2] https://lore.kernel.org/all/[email protected]/ Suggested-by: Sean Christopherson <[email protected]> Signed-off-by: Tao Su <[email protected]> Link: https://lore.kernel.org/r/[email protected] [sean: minor changelog tweaks] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22KVM: x86: Optimize local variable in start_sw_tscdeadline()Thorsten Blum1-1/+1
Change the data type of the local variable this_tsc_khz to u32 because virtual_tsc_khz is also declared as u32. Since do_div() casts the divisor to u32 anyway, changing the data type of this_tsc_khz to u32 also removes the following Coccinelle/coccicheck warning reported by do_div.cocci: WARNING: do_div() does a 64-by-32 division, please consider using div64_ul instead Signed-off-by: Thorsten Blum <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov31-158/+217
Cross-merge bpf fixes after downstream PR including important fixes (from bpf-next point of view): commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp") commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map") No conflicts. Adjacent changes in: include/linux/bpf_verifier.h kernel/bpf/verifier.c tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/bpf/[email protected]/ Signed-off-by: Alexei Starovoitov <[email protected]>
2024-08-21x86/PCI: Check pcie_find_root_port() return for NULLSamasth Norway Ananda1-2/+2
If pcie_find_root_port() is unable to locate a Root Port, it will return NULL. Check the pointer for NULL before dereferencing it. This particular case is in a quirk for devices that are always below a Root Port, so this won't avoid a problem and doesn't need to be backported, but check as a matter of style and to prevent copy/paste mistakes. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Samasth Norway Ananda <[email protected]> [bhelgaas: drop Fixes: and explain why there's no problem in this case] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Mario Limonciello <[email protected]>
2024-08-20x86/kaslr: Expose and use the end of the physical memory address spaceThomas Gleixner4-6/+35
iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require knowing the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hiccups, add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <[email protected]> Reported-by: Alistair Popple <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-By: Max Ramanouski <[email protected]> Tested-by: Alistair Popple <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
2024-08-19x86: do the user address masking outside the user access areaLinus Torvalds1-1/+3
In any normal situation this really shouldn't matter, but in case the address passed in to masked_user_access_begin() were to be some complex expression, we should evaluate it fully before doing the 'stac' instruction. And even without that issue (which objdump would pick up on for any really bad case), just in general we should strive to minimize the amount of code we run with user accesses enabled. For example, even for the trivial pselect6() case, the code generation diff with this change (obviously with a non-debug build) ends up being - stac mov %rax,%rcx sar $0x3f,%rcx or %rax,%rcx + stac mov (%rcx),%r13 mov 0x8(%rcx),%r14 clac so the area delimited by the 'stac / clac' pair is now literally just the two user access instructions, and the address generation has been moved out to before that code. This will be much more noticeable if we end up deciding that we can go back to just inlining "get_user()" using the new masked user access model. The get_user() pointers can often be more complex expressions involving kernel memory accesses or even function calls. Signed-off-by: Linus Torvalds <[email protected]>
2024-08-19x86: support user address masking instead of non-speculative conditionalLinus Torvalds1-0/+8
The Spectre-v1 mitigations made "access_ok()" much more expensive, since it has to serialize execution with the test for a valid user address. All the normal user copy routines avoid this by just masking the user address with a data-dependent mask instead, but the fast "unsafe_user_read()" kind of patterns that were supposed to be a fast case got slowed down. This introduces a notion of using src = masked_user_access_begin(src); to do the user address sanity check using a data-dependent mask instead of the more traditional conditional if (user_read_access_begin(src, len)) { model. This model only works for dense accesses that start at 'src' and on architectures that have a guard region that is guaranteed to fault in between the user space and the kernel space area. With this, the user access doesn't need to be manually checked, because a bad address is guaranteed to fault (by some architecture masking trick: on x86-64 this involves just turning an invalid user address into all ones, since we don't map the top of address space). This only converts a couple of examples for now. Example x86-64 code generation for loading two words from user space: stac mov %rax,%rcx sar $0x3f,%rcx or %rax,%rcx mov (%rcx),%r13 mov 0x8(%rcx),%r14 clac where all the error handling and -EFAULT is now purely handled out of line by the exception path. Of course, if the micro-architecture does badly at 'clac' and 'stac', the above is still pitifully slow. But at least we did as well as we could. Signed-off-by: Linus Torvalds <[email protected]>
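A usage sketch of the new model; the struct and field names here are purely illustrative:

struct two_words { unsigned long a, b; };

static int read_two_words(const struct two_words __user *from,
                          unsigned long *a, unsigned long *b)
{
        if (can_do_masked_user_access())
                from = masked_user_access_begin(from);
        else if (!user_read_access_begin(from, sizeof(*from)))
                return -EFAULT;

        unsafe_get_user(*a, &from->a, Efault);
        unsafe_get_user(*b, &from->b, Efault);
        user_read_access_end();
        return 0;

Efault:
        user_read_access_end();
        return -EFAULT;
}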
2024-08-19runtime constants: move list of constants to vmlinux.lds.hJann Horn1-2/+1
Refactor the list of constant variables into a macro. This should make it easier to add more constants in the future. Signed-off-by: Jann Horn <[email protected]> Reviewed-by: Alexander Gordeev <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Will Deacon <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2024-08-18x86/rust: support MITIGATION_RETHUNKMiguel Ojeda1-0/+5
The Rust compiler added support for `-Zfunction-return=thunk-extern` [1] in 1.76.0 [2], i.e. the equivalent of `-mfunction-return=thunk-extern`. Thus add support for `MITIGATION_RETHUNK`. Without this, `objtool` would warn if enabled for Rust and already warns under IBT builds, e.g.: samples/rust/rust_print.o: warning: objtool: _R...init+0xa5c: 'naked' return found in RETHUNK build Link: https://github.com/rust-lang/rust/issues/116853 [1] Link: https://github.com/rust-lang/rust/pull/116892 [2] Reviewed-by: Gary Guo <[email protected]> Tested-by: Alice Ryhl <[email protected]> Tested-by: Benno Lossin <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://github.com/Rust-for-Linux/linux/issues/945 Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Miguel Ojeda <[email protected]>
2024-08-18x86/rust: support MITIGATION_RETPOLINEMiguel Ojeda1-1/+1
Support `MITIGATION_RETPOLINE` by enabling the target features that Clang does. The existing target feature being enabled was a leftover from our old `rust` branch, and it is not enough: the target feature `retpoline-external-thunk` only implies `retpoline-indirect-calls`, but not `retpoline-indirect-branches` (see LLVM's `X86.td`), unlike Clang's flag of the same name `-mretpoline-external-thunk` which does imply both (see Clang's `lib/Driver/ToolChains/Arch/X86.cpp`). Without this, `objtool` would complain if enabled for Rust, e.g.: rust/core.o: warning: objtool: _R...escape_default+0x13: indirect jump found in RETPOLINE build In addition, change the comment to note that LLVM is the one disabling jump tables when retpoline is enabled, thus we do not need to use `-Zno-jump-tables` for Rust here -- see commit c58f2166ab39 ("Introduce the "retpoline" x86 mitigation technique ...") [1]: The goal is simple: avoid generating code which contains an indirect branch that could have its prediction poisoned by an attacker. In many cases, the compiler can simply use directed conditional branches and a small search tree. LLVM already has support for lowering switches in this way and the first step of this patch is to disable jump-table lowering of switches and introduce a pass to rewrite explicit indirectbr sequences into a switch over integers. As well as a live example at [2]. These should be eventually enabled via `-Ctarget-feature` when `rustc` starts recognizing them (or via a new dedicated flag) [3]. Cc: Daniel Borkmann <[email protected]> Link: https://github.com/llvm/llvm-project/commit/c58f2166ab3987f37cb0d7815b561bff5a20a69a [1] Link: https://godbolt.org/z/G4YPr58qG [2] Link: https://github.com/rust-lang/rust/issues/116852 [3] Reviewed-by: Gary Guo <[email protected]> Tested-by: Alice Ryhl <[email protected]> Tested-by: Benno Lossin <[email protected]> Link: https://github.com/Rust-for-Linux/linux/issues/945 Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Miguel Ojeda <[email protected]>
2024-08-14x86/mm: Remove duplicate check from build_cr3()Yuntao Wang1-1/+0
There is already a check for 'asid > MAX_ASID_AVAILABLE' in kern_pcid(), so it is unnecessary to perform this check in build_cr3() right before calling kern_pcid(). Remove it. Signed-off-by: Yuntao Wang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-14x86/fpu: Avoid writing LBR bit to IA32_XSS unless supportedMitchell Levy3-2/+12
There are two distinct CPU features related to the use of XSAVES and LBR: whether LBR is itself supported and whether XSAVES supports LBR. The LBR subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the XSTATE subsystem does not. The LBR bit is only removed from xfeatures_mask_independent when LBR is not supported by the CPU, but there is no validation of XSTATE support. If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault, leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled with a warning and the boot continues. Consequently the next XRSTORS which tries to restore supervisor state fails with #GP because the RFBM has zero for all supervisor features, which does not match the XCOMP_BV field. As XFEATURE_MASK_FPSTATE includes supervisor features setting up the FPU causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails due to the same problem resulting in recursive #GPs until the kernel runs out of stack space and double faults. Prevent this by storing the supported independent features in fpu_kernel_cfg during XSTATE initialization and use that cached value for retrieving the independent feature bits to be written into IA32_XSS. [ tglx: Massaged change log ] Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR") Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Mitchell Levy <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/[email protected]
2024-08-14KVM: x86/mmu: Introduce a quirk to control memslot zap behaviorYan Zhao3-2/+44
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM VMs. Make sure KVM behaves as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs. The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options: - when enabled: Invalidate/zap all SPTEs ("zap-all"), - when disabled: Precisely zap only the leaf SPTEs within the range of the moving/deleting memory slot ("zap-slot-leafs-only"). "zap-all" is today's KVM behavior to work around a bug [1] where changing the zapping behavior of memslot move/deletion would cause VM instability for VMs with an Nvidia GPU assigned; while "zap-slot-leafs-only" allows for more precise zapping of SPTEs within the memory slot range, improving performance in certain scenarios [2], and meeting the functional requirements for TDX. Previous attempts to select "zap-slot-leafs-only" include a per-VM capability approach [3] (which was not preferred because the root cause of the bug remained unidentified) and a per-memslot flag approach [4]. Sean and Paolo finally recommended the implementation of this quirk and explained that it's the least bad option [5]. By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use "zap-all". Users have the option to disable the quirk to select "zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are unaffected by this bug. For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is always selected without the user's opt-in, regardless of whether the user opts for "zap-all". This is because it is assumed until proven otherwise that non-KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most importantly, it's because TDX must have "zap-slot-leafs-only" always selected. In TDX's case a memslot's GPA range can be a mixture of "private" or "shared" memory. Shared is roughly analogous to how EPT is handled for normal VMs, but private GPAs need lots of special treatment: 1) "zap-all" would require zapping the private root page or non-leaf entries, or at least leaf entries beyond the deleting memslot scope. However, TDX demands that the root page of the private page table remains unchanged, with leaf entries being zapped before non-leaf entries, and any dropped private guest pages must be re-accepted by the guest. 2) if "zap-all" zaps only shared page tables, it would result in private pages still being mapped when the memslot is gone. This may affect even other processes if later the gmem fd was whole punched, causing the pages to be freed on the host while still mapped in the TD, because there's no pgoff-to-gfn information with which to zap the private page table after the memslot is gone. So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or complicating the quirk disabling interface (the current quirk disabling interface is limited, with no way to query quirks or force them to be disabled). Add a new function kvm_mmu_zap_memslot_leafs() to implement "zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(), bypassing special handling of APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as 1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can it be moved. It is only deleted by KVM when APICv is permanently inhibited. 2) kvm_vcpu_reload_apic_access_page() effectively does nothing when APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted. 3) Avoiding the KVM_REQ_APIC_PAGE_RELOAD request on all CPUs saves on costly IPIs.
Suggested-by: Kai Huang <[email protected]> Suggested-by: Sean Christopherson <[email protected]> Suggested-by: Paolo Bonzini <[email protected]> Link: https://patchwork.kernel.org/project/kvm/patch/[email protected] [1] Link: https://patchwork.kernel.org/project/kvm/patch/[email protected]/#25054908 [2] Link: https://lore.kernel.org/kvm/[email protected]/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3] Link: https://lore.kernel.org/all/[email protected] [4] Link: https://lore.kernel.org/all/[email protected] [5] Link: https://lore.kernel.org/all/[email protected] [6] Co-developed-by: Rick Edgecombe <[email protected]> Signed-off-by: Rick Edgecombe <[email protected]> Signed-off-by: Yan Zhao <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
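A condensed sketch of the resulting dispatch; kvm_mmu_zap_memslot_leafs() comes from the changelog, while the wrapper and the "zap-all" callee are assumptions:

static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
        if (kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL))
                kvm_mmu_zap_all_fast(kvm);              /* legacy "zap-all" */
        else
                kvm_mmu_zap_memslot_leafs(kvm, slot);   /* precise zap */
}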
2024-08-14KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)Sean Christopherson1-0/+2
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't directly emulate instructions for ES/SNP, and instead the guest must explicitly request emulation. Unless the guest explicitly requests emulation without accessing memory, ES/SNP relies on KVM creating an MMIO SPTE, with the subsequent #NPF being reflected into the guest as a #VC. But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs, because except for ES/SNP, doing so requires setting reserved bits in the SPTE, i.e. the SPTE can't be readable while also generating a #VC on writes. Because KVM never creates MMIO SPTEs and jumps directly to emulation, the guest never gets a #VC. And since KVM simply resumes the guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU into an infinite #NPF loop if the vCPU attempts to write read-only memory. Disallow read-only memory for all VMs with protected state, i.e. for upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible to support read-only memory, as TDX uses EPT Violation #VE to reflect the fault into the guest, e.g. KVM could configure read-only SPTEs with RX protections and SUPPRESS_VE=0. But there is no strong use case for supporting read-only memslots on TDX, e.g. the main historical usage is to emulate option ROMs, but TDX disallows executing from shared memory. And if someone comes along with a legitimate, strong use case, the restriction can always be lifted for TDX. Don't bother trying to retroactively apply the restriction to SEV-ES VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't possibly work for SEV-ES, i.e. disallowing such memslots really just means reporting an error to userspace instead of silently hanging vCPUs. Trying to deal with the ordering between KVM_SEV_INIT and memslot creation isn't worth the marginal benefit it would provide userspace. Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES") Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") Cc: Peter Gonda <[email protected]> Cc: Michael Roth <[email protected]> Cc: Vishal Annapurve <[email protected]> Cc: Ackerly Tng <[email protected]> Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-08-14x86/mm: Remove unused NX related declarationsYue Haibing1-2/+0
Since commit 4763ed4d4552 ("x86, mm: Clean up and simplify NX enablement") these declarations have been unused and can be removed. Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-14x86/platform/uv: Remove unused declaration uv_irq_2_mmr_info()Yue Haibing1-1/+0
Commit 43fe1abc18a2 ("x86/uv: Use hierarchical irqdomain to manage UV interrupts") removed the implementation but left declaration. Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-13x86/fred: Enable FRED right after init_mem_mapping()Xin Li (Intel)5-6/+23
On 64-bit, init_mem_mapping() relies on the minimal page fault handler provided by the early IDT mechanism. The real page fault handler is installed right afterwards into the IDT. This is problematic on CPUs which have X86_FEATURE_FRED set because the real page fault handler retrieves the faulting address from the FRED exception stack frame and not from CR2, but that obviously does not work when FRED is not yet enabled in the CPU. To prevent this, enable FRED right after init_mem_mapping() without interrupt stacks. Those are enabled later in trap_init() after the CPU entry area is set up. [ tglx: Encapsulate the FRED details ] Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Reported-by: Hou Wenlong <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-13x86/fred: Move FRED RSP initialization into a separate functionXin Li (Intel)3-11/+25
To enable FRED earlier, move the RSP initialization out of cpu_init_fred_exceptions() into cpu_init_fred_rsps(). This is required as the FRED RSP initialization depends on the availability of the CPU entry areas, which are set up late in trap_init(). No functional change intended. Marked with Fixes as it's a dependency for the real fix. Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code") Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-13x86/fred: Parse cmdline param "fred=" in cpu_parse_early_param()Xin Li (Intel)2-26/+5
Depending on whether FRED is enabled, sysvec_install() installs a system interrupt handler either into FRED's system vector dispatch table or into the IDT. However, FRED can be disabled later in trap_init(), after sysvec_install() has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is registered with sysvec_install() in kvm_guest_init(), which is called in setup_arch() but way before trap_init(). IOW, there is a gap between when FRED becomes available and when it is disabled. As a result, when FRED is available but disabled, early sysvec_install() invocations fail to install the IDT handler, resulting in spurious interrupts. Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to ensure that FRED is disabled before the first sysvec_install() invocation. Fixes: 3810da12710a ("x86/fred: Add a fred= cmdline param") Reported-by: Hou Wenlong <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Xin Li (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-13KVM: x86: Make x2APIC ID 100% readonlySean Christopherson1-7/+15
Ignore the userspace provided x2APIC ID when fixing up APIC state for KVM_SET_LAPIC, i.e. make the x2APIC fully readonly in KVM. Commit a92e2543d6a8 ("KVM: x86: use hardware-compatible format for APIC ID register"), which added the fixup, didn't intend to allow userspace to modify the x2APIC ID. In fact, that commit is when KVM first started treating the x2APIC ID as readonly, apparently to fix some race: static inline u32 kvm_apic_id(struct kvm_lapic *apic) { - return (kvm_lapic_get_reg(apic, APIC_ID) >> 24) & 0xff; + /* To avoid a race between apic_base and following APIC_ID update when + * switching to x2apic_mode, the x2apic mode returns initial x2apic id. + */ + if (apic_x2apic_mode(apic)) + return apic->vcpu->vcpu_id; + + return kvm_lapic_get_reg(apic, APIC_ID) >> 24; } Furthermore, KVM doesn't support delivering interrupts to vCPUs with a modified x2APIC ID, but KVM *does* return the modified value on a guest RDMSR and for KVM_GET_LAPIC. I.e. no remotely sane setup can actually work with a modified x2APIC ID. Making the x2APIC ID fully readonly fixes a WARN in KVM's optimized map calculation, which expects the LDR to align with the x2APIC ID. WARNING: CPU: 2 PID: 958 at arch/x86/kvm/lapic.c:331 kvm_recalculate_apic_map+0x609/0xa00 [kvm] CPU: 2 PID: 958 Comm: recalc_apic_map Not tainted 6.4.0-rc3-vanilla+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 04/01/2014 RIP: 0010:kvm_recalculate_apic_map+0x609/0xa00 [kvm] Call Trace: <TASK> kvm_apic_set_state+0x1cf/0x5b0 [kvm] kvm_arch_vcpu_ioctl+0x1806/0x2100 [kvm] kvm_vcpu_ioctl+0x663/0x8a0 [kvm] __x64_sys_ioctl+0xb8/0xf0 do_syscall_64+0x56/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fade8b9dd6f Unfortunately, the WARN can still trigger for other CPUs than the current one by racing against KVM_SET_LAPIC, so remove it completely. Reported-by: Michal Luczaj <[email protected]> Closes: https://lore.kernel.org/all/[email protected] Reported-by: Haoyu Wu <[email protected]> Closes: https://lore.kernel.org/all/[email protected] Reported-by: [email protected] Closes: https://lore.kernel.org/all/[email protected] Signed-off-by: Sean Christopherson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-08-13KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())Isaku Yamahata1-4/+2
Use this_cpu_ptr() instead of open coding the equivalent in various user return MSR helpers. Signed-off-by: Isaku Yamahata <[email protected]> Reviewed-by: Chao Gao <[email protected]> Reviewed-by: Yuan Yao <[email protected]> [sean: massage changelog] Signed-off-by: Sean Christopherson <[email protected]> Reviewed-by: Pankaj Gupta <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>