aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2024-11-29archlinux_defconfig: Enable support for external schedulersBlaster43851-0/+8
2024-11-27archlinux_defconfig: Switch to CONFIG_HZ_1000Blaster43851-1/+1
2024-11-27archlinux_defconfig: Update defconfigBlaster43851-16/+1
- Enable MGLRU - Disable Intel DRM drivers - Disable some unneeded configs - Enable TMPFS
2024-11-27archlinux-local_defconfig: Update defconfigBlaster43851-1/+17
- Add support for NAT
2024-11-27{archlinux|archlinux-local}_defconfig: Enable full clang ltoBlaster43852-5/+2
2024-11-27{archlinux|archlinux-local}_defconfig: Switch zram compression to lz4Blaster43852-2/+2
- lz4 is supposedly faster than zstd
2024-11-27archlinux-local_defconfig: Add a local defconfigBlaster43851-0/+674
2024-11-27archlinux_defconfig: Enable CONFIG_RD_ZSTDBlaster43851-1/+0
- Support initial ramdisk/ramfs compressed using ZSTD
2024-11-27archlinux_defconfig: Enable CONFIG_SENSORS_K10TEMPBlaster43851-0/+1
- This is needed to monitor CPU temps
2024-11-27archlinux_defconfig: Nuke unnecessary debuggingBlaster43851-22/+0
2024-11-27archlinux_defconfig: Clean up and optimizeBlaster43851-317/+35
2024-11-27archlinux_defconfig: Rebrand to IllusionXBlaster43851-0/+2
2024-11-27archlinux_defconfig: Initial Arch Linux defconfigBlaster43851-0/+5815
2024-11-17Merge tag 'x86_urgent_for_v6.12' of ↵Linus Torvalds7-4/+42
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure a kdump kernel with CONFIG_IMA_KEXEC enabled and booted on an AMD SME enabled hardware properly decrypts the ima_kexec buffer information passed to it from the previous kernel - Fix building the kernel with Clang where a non-TLS definition of the stack protector guard cookie leads to bogus code generation - Clear a wrongly advertised virtualized VMLOAD/VMSAVE feature flag on some Zen4 client systems as those insns are not supported on client * tag 'x86_urgent_for_v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Fix a kdump kernel failure on SME system when CONFIG_IMA_KEXEC=y x86/stackprotector: Work around strict Clang TLS symbol requirements x86/CPU/AMD: Clear virtualized VMLOAD/VMSAVE on Zen4 client
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-14crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32Dave Vasilevsky1-0/+3
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware. On these machines, the kernel refuses to boot from non-zero PHYSICAL_START, which occurs when CRASH_DUMP is on. Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should default to off for them. Users booting via some other mechanism can still turn it on explicitly. Does not change the default on any other architectures for the time being. Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca Fixes: 75bc255a7444 ("crash: clean up kdump related config items") Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca> Reported-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-13x86/mm: Fix a kdump kernel failure on SME system when CONFIG_IMA_KEXEC=yBaoquan He1-2/+4
The kdump kernel is broken on SME systems with CONFIG_IMA_KEXEC=y enabled. Debugging traced the issue back to b69a2afd5afc ("x86/kexec: Carry forward IMA measurement log on kexec"). Testing was previously not conducted on SME systems with CONFIG_IMA_KEXEC enabled, which led to the oversight, with the following incarnation: ... ima: No TPM chip found, activating TPM-bypass! Loading compiled-in module X.509 certificates Loaded X.509 cert 'Build time autogenerated kernel key: 18ae0bc7e79b64700122bb1d6a904b070fef2656' ima: Allocated hash algorithm: sha256 Oops: general protection fault, probably for non-canonical address 0xcfacfdfe6660003e: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc2+ #14 Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.20.0 05/03/2023 RIP: 0010:ima_restore_measurement_list Call Trace: <TASK> ? show_trace_log_lvl ? show_trace_log_lvl ? ima_load_kexec_buffer ? __die_body.cold ? die_addr ? exc_general_protection ? asm_exc_general_protection ? ima_restore_measurement_list ? vprintk_emit ? ima_load_kexec_buffer ima_load_kexec_buffer ima_init ? __pfx_init_ima init_ima ? __pfx_init_ima do_one_initcall do_initcalls ? __pfx_kernel_init kernel_init_freeable kernel_init ret_from_fork ? __pfx_kernel_init ret_from_fork_asm </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- ... Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Rebooting in 10 seconds.. Adding debug printks showed that the stored addr and size of ima_kexec buffer are not decrypted correctly like: ima: ima_load_kexec_buffer, buffer:0xcfacfdfe6660003e, size:0xe48066052d5df359 Three types of setup_data info — SETUP_EFI, - SETUP_IMA, and - SETUP_RNG_SEED are passed to the kexec/kdump kernel. Only the ima_kexec buffer experienced incorrect decryption. Debugging identified a bug in early_memremap_is_setup_data(), where an incorrect range calculation occurred due to the len variable in struct setup_data ended up only representing the length of the data field, excluding the struct's size, and thus leading to miscalculation. Address a similar issue in memremap_is_setup_data() while at it. [ bp: Heavily massage. ] Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20240911081615.262202-3-bhe@redhat.com
2024-11-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-24/+56
Pull kvm fixes from Paolo Bonzini: "x86 and selftests fixes. x86: - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure. - Move the Intel PT virtualization (i.e. outputting host trace to host buffer and guest trace to guest buffer) behind CONFIG_BROKEN. - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START - Fix a bug where KVM fails to inject an interrupt from the IRR after KVM_SET_LAPIC. Selftests: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - Fix broken compilation of kvm selftests after a header sync in tools/" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN KVM: x86: Unconditionally set irr_pending when updating APICv state kvm: svm: Fix gctx page leak on invalid inputs KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB KVM: SVM: Propagate error from snp_guest_req_init() to userspace KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported KVM: selftests: Disable strict aliasing KVM: selftests: fix unintentional noop test in guest_memfd_test.c KVM: selftests: memslot_perf_test: increase guest sync timeout
2024-11-08x86/stackprotector: Work around strict Clang TLS symbol requirementsArd Biesheuvel5-2/+27
GCC and Clang both implement stack protector support based on Thread Local Storage (TLS) variables, and this is used in the kernel to implement per-task stack cookies, by copying a task's stack cookie into a per-CPU variable every time it is scheduled in. Both now also implement -mstack-protector-guard-symbol=, which permits the TLS variable to be specified directly. This is useful because it will allow to move away from using a fixed offset of 40 bytes into the per-CPU area on x86_64, which requires a lot of special handling in the per-CPU code and the runtime relocation code. However, while GCC is rather lax in its implementation of this command line option, Clang actually requires that the provided symbol name refers to a TLS variable (i.e., one declared with __thread), although it also permits the variable to be undeclared entirely, in which case it will use an implicit declaration of the right type. The upshot of this is that Clang will emit the correct references to the stack cookie variable in most cases, e.g., 10d: 64 a1 00 00 00 00 mov %fs:0x0,%eax 10f: R_386_32 __stack_chk_guard However, if a non-TLS definition of the symbol in question is visible in the same compilation unit (which amounts to the whole of vmlinux if LTO is enabled), it will drop the per-CPU prefix and emit a load from a bogus address. Work around this by using a symbol name that never occurs in C code, and emit it as an alias in the linker script. Fixes: 3fb0fdb3bbe7 ("x86/stackprotector/32: Make the canary into a regular percpu variable") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Cc: stable@vger.kernel.org Link: https://github.com/ClangBuiltLinux/linux/issues/1854 Link: https://lore.kernel.org/r/20241105155801.1779119-2-brgerst@gmail.com
2024-11-08KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKENSean Christopherson1-1/+3
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: x86: Unconditionally set irr_pending when updating APICv stateSean Christopherson1-11/+18
Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Yong He <zhuangel570@gmail.com> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08kvm: svm: Fix gctx page leak on invalid inputsDionna Glaze1-4/+4
Ensure that snp gctx page allocation is adequately deallocated on failure during snp_launch_start. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson <seanjc@google.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: Ashish Kalra <ashish.kalra@amd.com> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: John Allen <john.allen@amd.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: "David S. Miller" <davem@davemloft.net> CC: Michael Roth <michael.roth@amd.com> CC: Luis Chamberlain <mcgrof@kernel.org> CC: Russ Weight <russ.weight@linux.dev> CC: Danilo Krummrich <dakr@redhat.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: "Rafael J. Wysocki" <rafael@kernel.org> CC: Tianfei zhang <tianfei.zhang@intel.com> CC: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Merge tag 'kvm-x86-fixes-6.12-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-8/+31
KVM x86 and selftests fixes for 6.12: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure.
2024-11-06ACPI: processor: Move arch_init_invariance_cppc() call laterMario Limonciello2-6/+6
arch_init_invariance_cppc() is called at the end of acpi_cppc_processor_probe() in order to configure frequency invariance based upon the values from _CPC. This however doesn't work on AMD CPPC shared memory designs that have AMD preferred cores enabled because _CPC needs to be analyzed from all cores to judge if preferred cores are enabled. This issue manifests to users as a warning since commit 21fb59ab4b97 ("ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"): ``` Could not retrieve highest performance (-19) ``` However the warning isn't the cause of this, it was actually commit 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") which exposed the issue. To fix this problem, change arch_init_invariance_cppc() into a new weak symbol that is called at the end of acpi_processor_driver_init(). Each architecture that supports it can declare the symbol to override the weak one. Define it for x86, in arch/x86/kernel/acpi/cppc.c, and for all of the architectures using the generic arch_topology.c code. Fixes: 279f838a61f9 ("x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()") Reported-by: Ivan Shapovalov <intelfx@intelfx.name> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219431 Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20241104222855.3959267-1-superm1@kernel.org [ rjw: Changelog edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-11-05x86/CPU/AMD: Clear virtualized VMLOAD/VMSAVE on Zen4 clientMario Limonciello1-0/+11
A number of Zen4 client SoCs advertise the ability to use virtualized VMLOAD/VMSAVE, but using these instructions is reported to be a cause of a random host reboot. These instructions aren't intended to be advertised on Zen4 client so clear the capability. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=219009
2024-11-04KVM: SVM: Propagate error from snp_guest_req_init() to userspaceSean Christopherson1-2/+5
If snp_guest_req_init() fails, return the provided error code up the stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2 failed, as opposed to some random operation later in VM setup failing because SNP wasn't actually enabled for the VM. Note, KVM itself doesn't consult the return value from __sev_guest_init(), i.e. the fallout is purely that userspace may be confused. Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabledSean Christopherson2-6/+26
When getting the current VPID, e.g. to emulate a guest TLB flush, return vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled in vmcs12. Architecturally, if VPID is disabled, then the guest and host effectively share VPID=0. KVM emulates this behavior by using vpid01 when running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so KVM must also treat vpid01 as the current VPID while L2 is active. Unconditionally treating vpid02 as the current VPID when L2 is active causes KVM to flush TLB entries for vpid02 instead of vpid01, which results in TLB entries from L1 being incorrectly preserved across nested VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after nested VM-Exit flushes vpid01). The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM incorrectly retains TLB entries for the APIC-access page across a nested VM-Enter. Opportunisticaly add comments at various touchpoints to explain the architectural requirements, and also why KVM uses vpid01 instead of vpid02. All credit goes to Chao, who root caused the issue and identified the fix. Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com Fixes: 2b4a5a5d5688 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Debugged-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-03Merge tag 'x86-urgent-2024-11-03' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "A trivial compile test fix for x86: When CONFIG_AMD_NB is not set a COMPILE_TEST of an AMD specific driver fails due to a missing inline stub. Add the stub to cure it" * tag 'x86-urgent-2024-11-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_nb: Fix compile-testing without CONFIG_AMD_NB
2024-10-29x86/amd_nb: Fix compile-testing without CONFIG_AMD_NBArnd Bergmann1-1/+4
node_to_amd_nb() is defined to NULL in non-AMD configs: drivers/platform/x86/amd/hsmp/plat.c: In function 'init_platform_device': drivers/platform/x86/amd/hsmp/plat.c:165:68: error: dereferencing 'void *' pointer [-Werror] 165 | sock->root = node_to_amd_nb(i)->root; | ^~ drivers/platform/x86/amd/hsmp/plat.c:165:68: error: request for member 'root' in something not a structure or union Users of the interface who also allow COMPILE_TEST will cause the above build error so provide an inline stub to fix that. [ bp: Massage commit message. ] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20241029092329.3857004-1-arnd@kernel.org
2024-10-28x86/traps: move kmsan check after instrumentation_beginSabyrzhan Tasbolatov1-6/+6
During x86_64 kernel build with CONFIG_KMSAN, the objtool warns following: AR built-in.a AR vmlinux.a LD vmlinux.o vmlinux.o: warning: objtool: handle_bug+0x4: call to kmsan_unpoison_entry_regs() leaves .noinstr.text section OBJCOPY modules.builtin.modinfo GEN modules.builtin MODPOST Module.symvers CC .vmlinux.export.o Moving kmsan_unpoison_entry_regs() _after_ instrumentation_begin() fixes the warning. There is decode_bug(regs->ip, &imm) is left before KMSAN unpoisoining, but it has the return condition and if we include it after instrumentation_begin() it results the warning "return with instrumentation enabled", hence, I'm concerned that regs will not be KMSAN unpoisoned if `ud_type == BUG_NONE` is true. Link: https://lkml.kernel.org/r/20241016152407.3149001-1-snovitoll@gmail.com Fixes: ba54d194f8da ("x86/traps: avoid KMSAN bugs originating from handle_bug()") Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-27Merge tag 'x86_urgent_for_v6.12_rc5' of ↵Linus Torvalds3-16/+38
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Prevent a certain range of pages which get marked as hypervisor-only, to get allocated to a CoCo (SNP) guest which cannot use them and thus fail booting - Fix the microcode loader on AMD to pay attention to the stepping of a patch and to handle the case where a BIOS config option splits the machine into logical NUMA nodes per L3 cache slice - Disable LAM from being built by default due to security concerns * tag 'x86_urgent_for_v6.12_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Ensure that RMP table fixups are reserved x86/microcode/AMD: Split load_microcode_amd() x86/microcode/AMD: Pay attention to the stepping dynamically x86/lam: Disable ADDRESS_MASKING in most cases
2024-10-25x86: fix whitespace in runtime-const assembler outputLinus Torvalds1-2/+2
The x86 user pointer validation changes made me look at compiler output a lot, and the wrong indentation for the ".popsection" in the generated assembler triggered me. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-25x86: fix user address masking non-canonical speculation issueLinus Torvalds4-21/+42
It turns out that AMD has a "Meltdown Lite(tm)" issue with non-canonical accesses in kernel space. And so using just the high bit to decide whether an access is in user space or kernel space ends up with the good old "leak speculative data" if you have the right gadget using the result: CVE-2020-12965 “Transient Execution of Non-Canonical Accesses“ Now, the kernel surrounds the access with a STAC/CLAC pair, and those instructions end up serializing execution on older Zen architectures, which closes the speculation window. But that was true only up until Zen 5, which renames the AC bit [1]. That improves performance of STAC/CLAC a lot, but also means that the speculation window is now open. Note that this affects not just the new address masking, but also the regular valid_user_address() check used by access_ok(), and the asm version of the sign bit check in the get_user() helpers. It does not affect put_user() or clear_user() variants, since there's no speculative result to be used in a gadget for those operations. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/all/80d94591-1297-4afb-b510-c665efd37f10@citrix.com/ Link: https://lore.kernel.org/all/20241023094448.GAZxjFkEOOF_DM83TQ@fat_crate.local/ [1] Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html Link: https://arxiv.org/pdf/2108.10771 Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> # LAM case Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") Fixes: 6014bc27561f ("x86-64: make access_ok() independent of LAM") Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-23x86/sev: Ensure that RMP table fixups are reservedAshish Kalra1-0/+2
The BIOS reserves RMP table memory via e820 reservations. This can still lead to RMP page faults during kexec if the host tries to access memory within the same 2MB region. Commit 400fea4b9651 ("x86/sev: Add callback to apply RMP table fixups for kexec" adjusts the e820 reservations for the RMP table so that the entire 2MB range at the start/end of the RMP table is marked reserved. The e820 reservations are then passed to firmware via SNP_INIT where they get marked HV-Fixed. The RMP table fixups are done after the e820 ranges have been added to memblock, allowing the fixup ranges to still be allocated and used by the system. The problem is that this memory range is now marked reserved in the e820 tables and during SNP initialization these reserved ranges are marked as HV-Fixed. This means that the pages cannot be used by an SNP guest, only by the hypervisor. However, the memory management subsystem does not make this distinction and can allocate one of those pages to an SNP guest. This will ultimately result in RMPUPDATE failures associated with the guest, causing it to fail to start or terminate when accessing the HV-Fixed page. The issue is captured below with memblock=debug: [ 0.000000] SEV-SNP: *** DEBUG: snp_probe_rmptable_info:352 - rmp_base=0x280d4800000, rmp_end=0x28357efffff ... [ 0.000000] BIOS-provided physical RAM map: ... [ 0.000000] BIOS-e820: [mem 0x00000280d4800000-0x0000028357efffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000028357f00000-0x0000028357ffffff] usable ... ... [ 0.183593] memblock add: [0x0000028357f00000-0x0000028357ffffff] e820__memblock_setup+0x74/0xb0 ... [ 0.203179] MEMBLOCK configuration: [ 0.207057] memory size = 0x0000027d0d194000 reserved size = 0x0000000009ed2c00 [ 0.215299] memory.cnt = 0xb ... [ 0.311192] memory[0x9] [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes flags: 0x0 ... ... [ 0.419110] SEV-SNP: Reserving start/end of RMP table on a 2MB boundary [0x0000028357e00000] [ 0.428514] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved [ 0.428517] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved [ 0.428520] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved ... ... [ 5.604051] MEMBLOCK configuration: [ 5.607922] memory size = 0x0000027d0d194000 reserved size = 0x0000000011faae02 [ 5.616163] memory.cnt = 0xe ... [ 5.754525] memory[0xc] [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes on node 0 flags: 0x0 ... ... [ 10.080295] Early memory node ranges[ 10.168065] ... node 0: [mem 0x0000028357f00000-0x0000028357ffffff] ... ... [ 8149.348948] SEV-SNP: RMPUPDATE failed for PFN 28357f7c, pg_level: 1, ret: 2 As shown above, the memblock allocations show 1MB after the end of the RMP as available for allocation, which is what the RMP table fixups have reserved. This memory range subsequently gets allocated as SNP guest memory, resulting in an RMPUPDATE failure. This can potentially be fixed by not reserving the memory range in the e820 table, but that causes kexec failures when using the KEXEC_FILE_LOAD syscall. The solution is to use memblock_reserve() to mark the memory reserved for the system, ensuring that it cannot be allocated to an SNP guest. Since HV-Fixed memory is still readable/writable by the host, this only ends up being a problem if the memory in this range requires a page state change, which generally will only happen when allocating memory in this range to be used for running SNP guests, which is now possible with the SNP hypervisor support in kernel 6.11. Backporter note: Fixes tag points to a 6.9 change but as the last paragraph above explains, this whole thing can happen after 6.11 received SNP HV support, therefore backporting to 6.9 is not really necessary. [ bp: Massage commit message. ] Fixes: 400fea4b9651 ("x86/sev: Add callback to apply RMP table fixups for kexec") Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> # 6.11, see Backporter note above. Link: https://lore.kernel.org/r/20240815221630.131133-1-Ashish.Kalra@amd.com
2024-10-22x86/microcode/AMD: Split load_microcode_amd()Borislav Petkov (AMD)1-8/+17
This function should've been split a long time ago because it is used in two paths: 1) On the late loading path, when the microcode is loaded through the request_firmware interface 2) In the save_microcode_in_initrd() path which collects all the microcode patches which are relevant for the current system before the initrd with the microcode container has been jettisoned. In that path, it is not really necessary to iterate over the nodes on a system and match a patch however it didn't cause any trouble so it was left for a later cleanup However, that later cleanup was expedited by the fact that Jens was enabling "Use L3 as a NUMA node" in the BIOS setting in his machine and so this causes the NUMA CPU masks used in cpumask_of_node() to be generated *after* 2) above happened on the first node. Which means, all those masks were funky, wrong, uninitialized and whatnot, leading to explosions when dereffing c->microcode in load_microcode_amd(). So split that function and do only the necessary work needed at each stage. Fixes: 94838d230a6c ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
2024-10-22x86/microcode/AMD: Pay attention to the stepping dynamicallyBorislav Petkov (AMD)1-8/+18
Commit in Fixes changed how a microcode patch is loaded on Zen and newer but the patch matching needs to happen with different rigidity, depending on what is being done: 1) When the patch is added to the patches cache, the stepping must be ignored because the driver still supports different steppings per system 2) When the patch is matched for loading, then the stepping must be taken into account because each CPU needs the patch matching its exact stepping Take care of that by making the matching smarter. Fixes: 94838d230a6c ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
2024-10-21x86/lam: Disable ADDRESS_MASKING in most casesPawan Gupta1-0/+1
Linear Address Masking (LAM) has a weakness related to transient execution as described in the SLAM paper[1]. Unless Linear Address Space Separation (LASS) is enabled this weakness may be exploitable. Until kernel adds support for LASS[2], only allow LAM for COMPILE_TEST, or when speculation mitigations have been disabled at compile time, otherwise keep LAM disabled. There are no processors in market that support LAM yet, so currently nobody is affected by this issue. [1] SLAM: https://download.vusec.net/papers/slam_sp24.pdf [2] LASS: https://lore.kernel.org/lkml/20230609183632.48706-1-alexander.shishkin@linux.intel.com/ [ dhansen: update SPECULATION_MITIGATIONS -> CPU_MITIGATIONS ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta%40linux.intel.com
2024-10-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds4-14/+29
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correcly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled RISCV: - RISCV: KVM: use raw_spinlock for critical section in imsic x86: - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window. - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds - Fix case in which an out-of-date cached value for the segments could by returned by KVM_GET_SREGS. - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests. Generic: - Remove a couple of unused functions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits) RISCV: KVM: use raw_spinlock for critical section in imsic KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups KVM: selftests: x86: Avoid using SSE/AVX instructions KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset() KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot x86/kvm: Override default caching mode for SEV-SNP and TDX KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic KVM: Remove unused kvm_vcpu_gfn_to_pfn KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS KVM: arm64: Fix shift-out-of-bounds bug KVM: arm64: Shave a few bytes from the EL2 idmap code KVM: arm64: Don't eagerly teardown the vgic on init error KVM: arm64: Expose S1PIE to guests KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule KVM: arm64: nv: Punt stage-2 recycling to a vCPU request KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed ...
2024-10-20Merge tag 'x86_urgent_for_v6.12_rc4' of ↵Linus Torvalds7-16/+47
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Explicitly disable the TSC deadline timer when going idle to address some CPU errata in that area - Do not apply the Zenbleed fix on anything else except AMD Zen2 on the late microcode loading path - Clear CPU buffers later in the NMI exit path on 32-bit to avoid register clearing while they still contain sensitive data, for the RDFS mitigation - Do not clobber EFLAGS.ZF with VERW on the opportunistic SYSRET exit path on 32-bit - Fix parsing issues of memory bandwidth specification in sysfs for resctrl's memory bandwidth allocation feature - Other small cleanups and improvements * tag 'x86_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Always explicitly disarm TSC-deadline timer x86/CPU/AMD: Only apply Zenbleed fix for Zen2 during late microcode load x86/bugs: Use code segment selector for VERW operand x86/entry_32: Clear CPU buffers after register restore in NMI return x86/entry_32: Do not clobber user EFLAGS.ZF x86/resctrl: Annotate get_mem_config() functions as __init x86/resctrl: Avoid overflow in MB settings in bw_validate() x86/amd_nb: Add new PCI ID for AMD family 1Ah model 20h
2024-10-20KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memorySean Christopherson1-1/+5
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()Maxim Levitsky1-3/+3
Reset the segment cache after segment initialization in vmx_vcpu_reset() to harden KVM against caching stale/uninitialized data. Without the recent fix to bypass the cache in kvm_arch_vcpu_put(), the following scenario is possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Invalidating the cache after writing the VMCS doesn't address the general issue of cache accesses from IRQ context being unsafe, but it does prevent KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a bug that results in an unsafe cache access. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") [sean: rework changelog to account for previous patch] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009175002.1118178-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()Sean Christopherson1-0/+11
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either mmu_invalidate_in_progress is elevated, or that the range is being zapped due to memslot removal (loosely detected by slots_lock being held). Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe as KVM's page fault path snapshots state before acquiring mmu_lock, and thus can create SPTEs with stale information if vCPUs aren't forced to retry faults (due to seeing an in-progress or past MMU invalidation). Memslot removal is a special case, as the memslot is retrieved outside of mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and instead relies on SRCU synchronization to ensure any in-flight page faults are fully resolved before zapping SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslotSean Christopherson1-10/+6
When performing a targeted zap on memslot removal, zap only MMU pages that shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't exist, because it doesn't do what most people would it expect it to do. The "round gfn for level" adjustment that is done for direct SPs (no gPTE) means that the exact gfn comparison will not get a match, even when a SP does "cover" a gfn, or was even created specifically for a gfn. For memslot deletion specifically, KVM's behavior will vary significantly based on the size and alignment of a memslot, and in weird ways. E.g. for a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned. And as described below, zapping SPs in the aligned case overzaps for direct MMUs, as odds are good the upper-level SPs are serving other memslots. To iterate over all potentially-relevant gfns, KVM would need to make a pass over the hash table for each level, with the gfn used for lookup rounded for said level. And then check that the SP is of the correct level, too, e.g. to avoid over-zapping. But even then, KVM would massively overzap, as processing every level is all but guaranteed to zap SPs that serve other memslots, especially if the memslot being removed is relatively small. KVM could mitigate that issue by processing only levels that can be possible guest huge pages, i.e. are less likely to be re-used for other memslot, but while somewhat logical, that's quite arbitrary and would be a bit of a mess to implement. So, zap only SPs with gPTEs, as the resulting behavior is easy to describe, is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped. Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241009192345.1148353-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20x86/kvm: Override default caching mode for SEV-SNP and TDXKirill A. Shutemov1-0/+4
AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not advertised in CPUID or it cannot be programmed (on TDX, due to #VE on CR0.CD clear). This results in guests using uncached mappings where it shouldn't and pmd/pud_set_huge() failures due to non-uniform memory type reported by mtrr_type_lookup(). Override MTRR state, making it WB by default as the kernel does for Hyper-V guests. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Binbin Wu <binbin.wu@intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241015095818.357915-1-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17Merge tag 'x86_bugs_post_ibpb' of ↵Linus Torvalds4-1/+43
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 IBPB fixes from Borislav Petkov: "This fixes the IBPB implementation of older AMDs (< gen4) that do not flush the RSB (Return Address Stack) so you can still do some leaking when using a "=ibpb" mitigation for Retbleed or SRSO. Fix it by doing the flushing in software on those generations. IBPB is not the default setting so this is not likely to affect anybody in practice" * tag 'x86_bugs_post_ibpb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Do not use UNTRAIN_RET with IBPB on entry x86/bugs: Skip RSB fill at VMEXIT x86/entry: Have entry_ibpb() invalidate return predictions x86/cpufeatures: Add a IBPB_NO_RET BUG flag x86/cpufeatures: Define X86_FEATURE_AMD_IBPB_RET
2024-10-15x86/apic: Always explicitly disarm TSC-deadline timerZhang Rui1-1/+13
New processors have become pickier about the local APIC timer state before entering low power modes. These low power modes are used (for example) when you close your laptop lid and suspend. If you put your laptop in a bag and it is not in this low power mode, it is likely to get quite toasty while it quickly sucks the battery dry. The problem boils down to some CPUs' inability to power down until the CPU recognizes that the local APIC timer is shut down. The current kernel code works in one-shot and periodic modes but does not work for deadline mode. Deadline mode has been the supported and preferred mode on Intel CPUs for over a decade and uses an MSR to drive the timer instead of an APIC register. Disable the TSC Deadline timer in lapic_timer_shutdown() by writing to MSR_IA32_TSC_DEADLINE when in TSC-deadline mode. Also avoid writing to the initial-count register (APIC_TMICT) which is ignored in TSC-deadline mode. Note: The APIC_LVTT|=APIC_LVT_MASKED operation should theoretically be enough to tell the hardware that the timer will not fire in any of the timer modes. But mitigating AMD erratum 411[1] also requires clearing out APIC_TMICT. Solely setting APIC_LVT_MASKED is also ineffective in practice on Intel Lunar Lake systems, which is the motivation for this change. 1. 411 Processor May Exit Message-Triggered C1E State Without an Interrupt if Local APIC Timer Reaches Zero - https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/revision-guides/41322_10h_Rev_Gd.pdf Fixes: 279f1461432c ("x86: apic: Use tsc deadline for oneshot when available") Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Todd Brandt <todd.e.brandt@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241015061522.25288-1-rui.zhang%40intel.com
2024-10-11Merge tag 'for-linus-6.12a-rc3-tag' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fix from Juergen Gross: "A fix for topology information of Xen PV guests" * tag 'for-linus-6.12a-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: mark boot CPU of PV guest in MSR_IA32_APICBASE
2024-10-11x86/CPU/AMD: Only apply Zenbleed fix for Zen2 during late microcode loadJohn Allen1-1/+2
Commit f69759be251d ("x86/CPU/AMD: Move Zenbleed check to the Zen2 init function") causes a bit in the DE_CFG MSR to get set erroneously after a microcode late load. The microcode late load path calls into amd_check_microcode() and subsequently zen2_zenbleed_check(). Since the above commit removes the cpu_has_amd_erratum() call from zen2_zenbleed_check(), this will cause all non-Zen2 CPUs to go through the function and set the bit in the DE_CFG MSR. Call into the Zenbleed fix path on Zen2 CPUs only. [ bp: Massage commit message, use cpu_feature_enabled(). ] Fixes: f69759be251d ("x86/CPU/AMD: Move Zenbleed check to the Zen2 init function") Signed-off-by: John Allen <john.allen@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20240923164404.27227-1-john.allen@amd.com
2024-10-10x86/bugs: Do not use UNTRAIN_RET with IBPB on entryJohannes Wikner1-0/+17
Since X86_FEATURE_ENTRY_IBPB will invalidate all harmful predictions with IBPB, no software-based untraining of returns is needed anymore. Currently, this change affects retbleed and SRSO mitigations so if either of the mitigations is doing IBPB and the other one does the software sequence, the latter is not needed anymore. [ bp: Massage commit message. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Johannes Wikner <kwikner@ethz.ch> Cc: <stable@kernel.org>
2024-10-10x86/bugs: Skip RSB fill at VMEXITJohannes Wikner1-0/+15
entry_ibpb() is designed to follow Intel's IBPB specification regardless of CPU. This includes invalidating RSB entries. Hence, if IBPB on VMEXIT has been selected, entry_ibpb() as part of the RET untraining in the VMEXIT path will take care of all BTB and RSB clearing so there's no need to explicitly fill the RSB anymore. [ bp: Massage commit message. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Johannes Wikner <kwikner@ethz.ch> Cc: <stable@kernel.org>