path: root/arch/x86/kernel
2024-04-11  x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'  (Ingo Molnar, 3 files, -42/+42)
So we are using the 'ia32_cap' value in a number of places, which got its name from MSR_IA32_ARCH_CAPABILITIES MSR register. But there's very little 'IA32' about it - this isn't 32-bit only code, nor does it originate from there, it's just a historic quirk that many Intel MSR names are prefixed with IA32_. This is already clear from the helper method around the MSR: x86_read_arch_cap_msr(), which doesn't have the IA32 prefix. So rename 'ia32_cap' to 'x86_arch_cap_msr' to be consistent with its role and with the naming of the helper function. Signed-off-by: Ingo Molnar <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Nikolay Borisov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
2024-04-11  x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES  (Josh Poimboeuf, 1 file, -15/+7)
There's no need to keep reading MSR_IA32_ARCH_CAPABILITIES over and over. It's even read in the BHI sysfs function which is a big no-no. Just read it once and cache it. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Sean Christopherson <[email protected]> Link: https://lore.kernel.org/r/9592a18a814368e75f8f4b9d74d3883aa4fd1eaf.1712813475.git.jpoimboe@kernel.org
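A minimal sketch of the read-once pattern described above, using the existing x86_read_arch_cap_msr() helper and the variable name from the rename commit; where exactly bugs.c stores and fills the cached value is assumed here, not taken from the patch:

  /* Read MSR_IA32_ARCH_CAPABILITIES once and let later mitigation selection
   * and sysfs reporting reuse the cached value. Sketch only. */
  static u64 x86_arch_cap_msr;

  void __init cpu_select_mitigations(void)
  {
          x86_arch_cap_msr = x86_read_arch_cap_msr();
          /* ... subsequent checks use x86_arch_cap_msr instead of re-reading the MSR ... */
  }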
2024-04-10  x86/topology: Don't update cpu_possible_map in topo_set_cpuids()  (Thomas Gleixner, 1 file, -2/+5)
topo_set_cpuids() updates cpu_present_map and cpu_possible_map. It is invoked during enumeration and "physical hotplug" operations. In the latter case this results in a kernel crash because cpu_possible_map is marked read only after init completes. There is no reason to update cpu_possible_map in that function. During enumeration cpu_possible_map is not relevant and gets fully initialized after enumeration has completed. On "physical hotplug" the bit is already set because the kernel allows only CPUs to be plugged which have been enumerated and associated to a CPU number during early boot. Remove the bogus update of cpu_possible_map. Fixes: 0e53e7b656cf ("x86/cpu/topology: Sanitize the APIC admission logic") Reported-by: Jonathan Cameron <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/87ttkc6kwx.ffs@tglx
2024-04-10  x86/bugs: Fix return type of spectre_bhi_state()  (Daniel Sneddon, 1 file, -1/+1)
The definition of spectre_bhi_state() incorrectly returns a const char * const. This causes a compiler warning when building with W=1:

  warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
  2812 | static const char * const spectre_bhi_state(void)

Remove the const qualifier from the pointer. Fixes: ec9404e40e8f ("x86/bhi: Add BHI mitigation knob") Reported-by: Sean Christopherson <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
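For reference, the qualifier the warning complains about sits on the return value itself, where it has no effect; a before/after sketch of the declaration:

  /* Before: the outer 'const' qualifies the returned pointer value, which is
   * meaningless and triggers -Wignored-qualifiers under W=1. */
  static const char * const spectre_bhi_state(void);

  /* After: only the pointed-to string stays const. */
  static const char *spectre_bhi_state(void);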
2024-04-10  x86/cpu: Improve readability of per-CPU cpumask initialization code  (Ingo Molnar, 2 files, -14/+17)
In smp_prepare_cpus_common() and x2apic_prepare_cpu():

 - use 'cpu' instead of 'i'
 - use 'node' instead of 'n'
 - use vertical alignment to improve readability
 - better structure basic blocks
 - reduce col80 checkpatch damage

Signed-off-by: Ingo Molnar <[email protected]> Cc: [email protected]
2024-04-10  x86/cpu: Take NUMA node into account when allocating per-CPU cpumasks  (Li RongQing, 2 files, -7/+9)
per-CPU cpumasks are dominantly accessed from their own local CPUs, so allocate them node-local to improve performance. [ mingo: Rewrote the changelog. ] Signed-off-by: Li RongQing <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
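A minimal sketch of node-local allocation using the stock cpumask API; which masks smp_prepare_cpus_common() actually allocates is assumed here for illustration:

  for_each_possible_cpu(cpu) {
          int node = cpu_to_node(cpu);

          /* Allocate each CPU's masks on that CPU's own NUMA node. */
          zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu), GFP_KERNEL, node);
          zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu), GFP_KERNEL, node);
  }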
2024-04-09  x86/alternatives: Sort local vars in apply_alternatives()  (Borislav Petkov (AMD), 1 file, -2/+2)
In a reverse x-mas tree. No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  x86/alternatives: Optimize optimize_nops()  (Borislav Petkov (AMD), 1 file, -0/+4)
Return early if NOPs have already been optimized. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  x86/alternatives: Get rid of __optimize_nops()  (Borislav Petkov (AMD), 1 file, -43/+16)
There's no need to carve out bits of the NOP optimization functionality and look at JMP opcodes - simply do one more NOPs optimization pass at the end of patching. A lot simpler code. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  x86/alternatives: Use a temporary buffer when optimizing NOPs  (Borislav Petkov (AMD), 2 files, -45/+48)
Instead of optimizing NOPs in-place, use a temporary buffer like the usual alternatives patching flow does. This obviates the need to grab locks when patching, see 6778977590da ("x86/alternatives: Disable interrupts and sync when optimizing NOPs in place") While at it, add nomenclature definitions clarifying and simplifying the naming of function-local variables in the alternatives code. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  x86/alternatives: Catch late X86_FEATURE modifiers  (Borislav Petkov (AMD), 1 file, -0/+3)
After alternatives have been patched, changes to the X86_FEATURE flags won't take effect and could potentially even be wrong. Warn about it. This is something which has been long overdue. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Srikanth Aithal <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  Merge tag 'v6.9-rc3' into locking/core, to pick up fixes  (Ingo Molnar, 24 files, -201/+185)
Signed-off-by: Ingo Molnar <[email protected]>
2024-04-09  x86/mce: Implement recovery for errors in TDX/SEAM non-root mode  (Tony Luck, 2 files, -2/+32)
Machine check SMIs (MSMI) signaled during SEAM operation (typically inside TDX guests), on a system with Intel eMCA enabled, might eventually be reported to the kernel #MC handler with the saved RIP on the stack pointing to the instruction in kernel code after the SEAMCALL instruction that entered the SEAM operation. Linux currently says that is a fatal error and shuts down.

There is a new bit in IA32_MCG_STATUS that, when set to 1, indicates that the machine check didn't originally occur at that saved RIP, but during SEAM non-root operation.

Add new entries to the severity table to detect this for both data load and instruction fetch that set the severity to "AR" (action required). Increase the width of the mcgmask/mcgres fields in "struct severity" from unsigned char to unsigned short since the new bit is in position 12. Action required for these errors is just to mark the page as poisoned and return from the machine check handler.

HW ABI notes:
=============
The SEAM_NR bit in IA32_MCG_STATUS hasn't yet made it into the Intel Software Developers' Manual. But it is described in section 16.5.2 of "Intel(R) Trust Domain Extensions (Intel(R) TDX) Module Base Architecture Specification" downloadable from: https://cdrdv2.intel.com/v1/dl/getContent/733575

Backport notes:
===============
Little value in backporting this patch to stable or LTS kernels as this is only relevant with support for TDX, which I assume won't be backported. But for anyone taking this to v6.1 or older, you also need commit: a51cbd0d86d3 ("x86/mce: Use severity table to handle uncorrected errors in kernel")

Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09  Merge tag 'v6.9-rc3' into x86/cpu, to pick up fixes  (Ingo Molnar, 24 files, -201/+185)
Signed-off-by: Ingo Molnar <[email protected]>
2024-04-08  Merge tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 3 files, -21/+125)
Pull x86 mitigations from Thomas Gleixner:
 "Mitigations for the native BHI hardware vulnerability:

  Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes.

  Add mitigations against it either with the help of microcode or with software sequences for the affected CPUs"

[ This also ends up enabling the full mitigation by default despite the system call hardening, because apparently there are other indirect calls that are still sufficiently reachable, and the 'auto' case just isn't hardened enough. We'll have some more inevitable tweaking in the future - Linus ]

* tag 'nativebhi' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM: x86: Add BHI_NO
  x86/bhi: Mitigate KVM by default
  x86/bhi: Add BHI mitigation knob
  x86/bhi: Enumerate Branch History Injection (BHI) bug
  x86/bhi: Define SPEC_CTRL_BHI_DIS_S
  x86/bhi: Add support for clearing branch history at syscall entry
  x86/syscall: Don't force use of indirect calls for system calls
  x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file
2024-04-08  x86/bhi: Mitigate KVM by default  (Pawan Gupta, 1 file, -1/+8)
BHI mitigation mode spectre_bhi=auto does not deploy the software mitigation by default. In a cloud environment, it is a likely scenario where userspace is trusted but the guests are not trusted. Deploying system wide mitigation in such cases is not desirable. Update the auto mode to unconditionally mitigate against malicious guests. Deploy the software sequence at VMexit in auto mode also, when hardware mitigation is not available. Unlike the force =on mode, software sequence is not deployed at syscalls in auto mode. Suggested-by: Alexandre Chartre <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]>
2024-04-08  x86/bhi: Add BHI mitigation knob  (Pawan Gupta, 1 file, -1/+89)
Branch history clearing software sequences and hardware control BHI_DIS_S were defined to mitigate Branch History Injection (BHI). Add cmdline spectre_bhi={on|off|auto} to control BHI mitigation:

  auto - Deploy the hardware mitigation BHI_DIS_S, if available.
  on   - Deploy the hardware mitigation BHI_DIS_S, if available, otherwise deploy the software sequence at syscall entry and VMexit.
  off  - Turn off BHI mitigation.

The default is auto mode which does not deploy the software sequence mitigation. This is because of the hardening done in the syscall dispatch path, which is the likely target of BHI.

Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]>
2024-04-08  x86/bhi: Enumerate Branch History Injection (BHI) bug  (Pawan Gupta, 1 file, -8/+16)
Mitigation for BHI is selected based on the bug enumeration. Add bits needed to enumerate BHI bug. Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]>
2024-04-08  x86/bhi: Define SPEC_CTRL_BHI_DIS_S  (Daniel Sneddon, 1 file, -0/+1)
Newer processors support a hardware control BHI_DIS_S to mitigate Branch History Injection (BHI). Setting BHI_DIS_S protects the kernel from userspace BHI attacks without having to manually overwrite the branch history. Define MSR_SPEC_CTRL bit BHI_DIS_S and its enumeration CPUID.BHI_CTRL. Mitigation is enabled later. Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Pawan Gupta <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexandre Chartre <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]>
2024-04-08  x86/bugs: Change commas to semicolons in 'spectre_v2' sysfs file  (Josh Poimboeuf, 1 file, -12/+12)
Change the format of the 'spectre_v2' vulnerabilities sysfs file slightly by converting the commas to semicolons, so that mitigations for future variants can be grouped together and separated by commas. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2024-04-06  Merge branch 'linus' into x86/urgent, to pick up dependent commit  (Ingo Molnar, 2 files, -7/+8)
We want to fix: 0e110732473e ("x86/retpoline: Do the necessary fixup to the Zen3/4 srso return thunk for !SRSO") So merge in Linus's latest into x86/urgent to have it available. Signed-off-by: Ingo Molnar <[email protected]>
2024-04-05  x86/microcode/AMD: Remove unused PATCH_MAX_SIZE macro  (Borislav Petkov (AMD), 1 file, -2/+0)
Orphaned after 05e91e721138 ("x86/microcode/AMD: Rip out static buffers") No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2024-04-05  x86/microcode/AMD: Avoid -Wformat warning with clang-15  (Arnd Bergmann, 1 file, -1/+1)
Older versions of clang show a warning for amd.c after a fix for a gcc warning:

  arch/x86/kernel/cpu/microcode/amd.c:478:47: error: format specifies type 'unsigned char' but the argument has type 'u16' (aka 'unsigned short') [-Werror,-Wformat]
          "amd-ucode/microcode_amd_fam%02hhxh.bin", family);
                                      ~~~~~~        ^~~~~~
                                      %02hx

In clang-16 and higher, this warning is disabled by default, but clang-15 is still supported, and it's trivial to avoid by adapting the types according to the range of the passed data and the format string. [ bp: Massage commit message. ] Fixes: 2e9064faccd1 ("x86/microcode/amd: Fix snprintf() format string warning in W=1 build") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
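The mismatch and clang's suggested change, side by side; whether the final patch adapts the format string or the variable's type is not spelled out above, so treat this as an illustration only:

  /* 'family' is a u16 here, but "%02hhx" promises an unsigned char. */
  snprintf(fw_name, sizeof(fw_name),
           "amd-ucode/microcode_amd_fam%02hhxh.bin", family);

  /* clang's own suggestion: make the format match the argument's type. */
  snprintf(fw_name, sizeof(fw_name),
           "amd-ucode/microcode_amd_fam%02hxh.bin", family);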
2024-04-04  Merge tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds, 1 file, -2/+2)
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter, bluetooth and bpf.

  Fairly usual collection of driver and core fixes. The large selftest accompanying one of the fixes is also becoming a common occurrence.

  Current release - regressions:
   - ipv6: fix infinite recursion in fib6_dump_done()
   - net/rds: fix possible null-deref in newly added error path

  Current release - new code bugs:
   - net: do not consume a full cacheline for system_page_pool
   - bpf: fix bpf_arena-related file descriptor leaks in the verifier
   - drv: ice: fix freeing uninitialized pointers, fixing misuse of the newfangled __free() auto-cleanup

  Previous releases - regressions:
   - x86/bpf: fixes the BPF JIT with retbleed=stuff
   - xen-netfront: add missing skb_mark_for_recycle, fix page pool accounting leaks, revealed by recently added explicit warning
   - tcp: fix bind() regression for v6-only wildcard and v4-mapped-v6 non-wildcard addresses
   - Bluetooth:
      - replace "hci_qca: Set BDA quirk bit if fwnode exists in DT" with better workarounds to un-break some buggy Qualcomm devices
      - set conn encrypted before conn establishes, fix re-connecting to some headsets which use slightly unusual sequence of msgs
   - mptcp:
      - prevent BPF accessing lowat from a subflow socket
      - don't account accept() of non-MPC client as fallback to TCP
   - drv: mana: fix Rx DMA datasize and skb_over_panic
   - drv: i40e: fix VF MAC filter removal

  Previous releases - always broken:
   - gro: various fixes related to UDP tunnels - netns crossing problems, incorrect checksum conversions, and incorrect packet transformations which may lead to panics
   - bpf: support deferring bpf_link dealloc to after RCU grace period
   - nf_tables:
      - release batch on table validation from abort path
      - release mutex after nft_gc_seq_end from abort path
      - flush pending destroy work before exit_net release
   - drv: r8169: skip DASH fw status checks when DASH is disabled"

* tag 'net-6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
  netfilter: validate user input for expected length
  net/sched: act_skbmod: prevent kernel-infoleak
  net: usb: ax88179_178a: avoid the interface always configured as random address
  net: dsa: sja1105: Fix parameters order in sja1110_pcs_mdio_write_c45()
  net: ravb: Always update error counters
  net: ravb: Always process TX descriptor ring
  netfilter: nf_tables: discard table flag update with pending basechain deletion
  netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get()
  netfilter: nf_tables: reject new basechain after table flag update
  netfilter: nf_tables: flush pending destroy work before exit_net release
  netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path
  netfilter: nf_tables: release batch on table validation from abort path
  Revert "tg3: Remove residual error handling in tg3_suspend"
  tg3: Remove residual error handling in tg3_suspend
  net: mana: Fix Rx DMA datasize and skb_over_panic
  net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()
  net: phy: micrel: lan8814: Fix when enabling/disabling 1-step timestamping
  net: stmmac: fix rx queue priority assignment
  net: txgbe: fix i2c dev name cannot match clkdev
  net: fec: Set mac_managed_pm during probe
  ...
2024-04-04  x86/mce: Make sure to grab mce_sysfs_mutex in set_bank()  (Borislav Petkov (AMD), 1 file, -1/+3)
Modifying a MCA bank's MCA_CTL bits which control which error types to be reported is done over

  /sys/devices/system/machinecheck/
  ├── machinecheck0
  │   ├── bank0
  │   ├── bank1
  │   ├── bank10
  │   ├── bank11
  ...

sysfs nodes by writing the new bit mask of events to enable. When the write is accepted, the kernel deletes all current timers and reinits all banks.

Doing that in parallel can lead to initializing a timer which is already armed and in the timer wheel, i.e., in use already:

  ODEBUG: init active (active state 0) object: ffff888063a28000 object type: timer_list hint: mce_timer_fn+0x0/0x240 arch/x86/kernel/cpu/mce/core.c:2642
  WARNING: CPU: 0 PID: 8120 at lib/debugobjects.c:514 debug_print_object+0x1a0/0x2a0 lib/debugobjects.c:514

Fix that by grabbing the sysfs mutex as the rest of the MCA sysfs code does.

Reported-by: Yue Sun <[email protected]> Reported-by: xingwei lee <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/CAEkJfYNiENwQY8yV1LYJ9LjJs%[email protected]
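A sketch of the serialization described above; the local names follow the usual shape of the sysfs store handler and mce_restart() is the existing re-init path, but the exact diff is assumed:

  mutex_lock(&mce_sysfs_mutex);
  b->ctl = new;           /* update the bank's MCA_CTL mask */
  mce_restart();          /* deletes all current timers and reinits all banks */
  mutex_unlock(&mce_sysfs_mutex);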
2024-04-04  x86/extable: Remove unused fixup type EX_TYPE_COPY  (Tong Tiangen, 1 file, -1/+0)
After 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function") rewrote __copy_user_nocache() to use EX_TYPE_UACCESS instead of the EX_TYPE_COPY exception type, there are no more EX_TYPE_COPY users, so remove it. [ bp: Massage commit message. ] Signed-off-by: Tong Tiangen <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-04  x86/CPU/AMD: Track SNP host status with cc_platform_*()  (Borislav Petkov (AMD), 3 files, -26/+24)
The host SNP worthiness can be determined later, after alternatives have been patched, in snp_rmptable_init() depending on cmdline options like iommu=pt which is incompatible with SNP, for example. Which means that one cannot use X86_FEATURE_SEV_SNP and will need to have a special flag for that control. Use that newly added CC_ATTR_HOST_SEV_SNP in the appropriate places. Move kdump_sev_callback() to its rightful place, while at it. Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support") Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Tested-by: Srikanth Aithal <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-04  x86/coco: Require seeding RNG with RDRAND on CoCo systems  (Jason A. Donenfeld, 1 file, -0/+2)
There are few uses of CoCo that don't rely on working cryptography and hence a working RNG. Unfortunately, the CoCo threat model means that the VM host cannot be trusted and may actively work against guests to extract secrets or manipulate computation. Since a malicious host can modify or observe nearly all inputs to guests, the only remaining source of entropy for CoCo guests is RDRAND. If RDRAND is broken -- due to CPU hardware fault -- the RNG as a whole is meant to gracefully continue on gathering entropy from other sources, but since there aren't other sources on CoCo, this is catastrophic. This is mostly a concern at boot time when initially seeding the RNG, as after that the consequences of a broken RDRAND are much more theoretical. So, try at boot to seed the RNG using 256 bits of RDRAND output. If this fails, panic(). This will also trigger if the system is booted without RDRAND, as RDRAND is essential for a safe CoCo boot. Add this deliberately to be "just a CoCo x86 driver feature" and not part of the RNG itself. Many device drivers and platforms have some desire to contribute something to the RNG, and add_device_randomness() is specifically meant for this purpose. Any driver can call it with seed data of any quality, or even garbage quality, and it can only possibly make the quality of the RNG better or have no effect, but can never make it worse. Rather than trying to build something into the core of the RNG, consider the particular CoCo issue just a CoCo issue, and therefore separate it all out into driver (well, arch/platform) code. [ bp: Massage commit message. ] Signed-off-by: Jason A. Donenfeld <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Elena Reshetova <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Reviewed-by: Theodore Ts'o <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
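A sketch of the boot-time seeding just described, assuming the rdrand_long() wrapper and a simple fail-hard policy; the real retry and error handling in the arch/x86 CoCo code may differ:

  unsigned long seed[32 / sizeof(long)];  /* 256 bits of RDRAND output */
  size_t i;

  for (i = 0; i < ARRAY_SIZE(seed); i++) {
          if (!rdrand_long(&seed[i]))
                  panic("RDRAND is defective and no other CoCo entropy source exists");
  }
  add_device_randomness(seed, sizeof(seed));
  memzero_explicit(seed, sizeof(seed));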
2024-04-04  x86/fpu: Update fpu_swap_kvm_fpu() uses in comments as well  (Li RongQing, 1 file, -2/+2)
The following commit: d69c1382e1b7 ("x86/kvm: Convert FPU handling to a single swap buffer") reworked KVM FPU handling, but forgot to update the comments in xstate_op_valid(): fpu_swap_kvm_fpu() doesn't exist anymore, fpu_swap_kvm_fpstate() is used instead. Update the comments accordingly. [ mingo: Improved the changelog. ] Signed-off-by: Li RongQing <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-03  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 1 file, -5/+6)
Pull KVM fixes from Paolo Bonzini: "ARM: - Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration - Restore out-of-range handler for stage-2 translation faults - Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches - Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set - Correct a format specifier warning in the arch_timer selftest - Make the KVM banner message correctly handle all of the possible configurations RISC-V: - Remove redundant semicolon in num_isa_ext_regs() - Fix APLIC setipnum_le/be write emulation - Fix APLIC in_clrip[x] read emulation x86: - Fix a bug in KVM_SET_CPUID{2,} where KVM looks at the wrong CPUID entries (old vs. new) and ultimately neglects to clear PV_UNHALT from vCPUs with HLT-exiting disabled - Documentation fixes for SEV - Fix compat ABI for KVM_MEMORY_ENCRYPT_OP - Fix a 14-year-old goof in a declaration shared by host and guest; the enabled field used by Linux when running as a guest pushes the size of "struct kvm_vcpu_pv_apf_data" from 64 to 68 bytes. This is really unconsequential because KVM never consumes anything beyond the first 64 bytes, but the resulting struct does not match the documentation Selftests: - Fix spelling mistake in arch_timer selftest" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: arm64: Rationalise KVM banner output arm64: Fix early handling of FEAT_E2H0 not being implemented KVM: arm64: Ensure target address is granule-aligned for range TLBI KVM: arm64: Use TLBI_TTL_UNKNOWN in __kvm_tlb_flush_vmid_range() KVM: arm64: Don't pass a TLBI level hint when zapping table entries KVM: arm64: Don't defer TLB invalidation when zapping table entries KVM: selftests: Fix __GUEST_ASSERT() format warnings in ARM's arch timer test KVM: arm64: Fix out-of-IPA space translation fault handling KVM: arm64: Fix host-programmed guest events in nVHE RISC-V: KVM: Fix APLIC in_clrip[x] read emulation RISC-V: KVM: Fix APLIC setipnum_le/be write emulation RISC-V: KVM: Remove second semicolon KVM: selftests: Fix spelling mistake "trigged" -> "triggered" Documentation: kvm/sev: clarify usage of KVM_MEMORY_ENCRYPT_OP Documentation: kvm/sev: separate description of firmware KVM: SEV: fix compat ABI for KVM_MEMORY_ENCRYPT_OP KVM: selftests: Check that PV_UNHALT is cleared when HLT exiting is disabled KVM: x86: Use actual kvm_cpuid.base for clearing KVM_FEATURE_PV_UNHALT KVM: x86: Introduce __kvm_get_hypervisor_cpuid() helper KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES ...
2024-04-03  x86/apic: Improve data types to fix Coccinelle warnings  (Thorsten Blum, 1 file, -4/+4)
Given that acpi_pm_read_early() returns a u32 (masked to 24 bits), several variables that store its return value are improved by adjusting their data types from unsigned long to u32. Specifically, change deltapm's type from long to u32 because its value fits into 32 bits and it cannot be negative. These data type improvements resolve the following two Coccinelle/ coccicheck warnings reported by do_div.cocci: arch/x86/kernel/apic/apic.c:734:1-7: WARNING: do_div() does a 64-by-32 division, please consider using div64_long instead. arch/x86/kernel/apic/apic.c:742:2-8: WARNING: do_div() does a 64-by-32 division, please consider using div64_long instead. Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/20240318104721.117741-3-thorsten.blum%40toblux.com
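For context, the do_div() contract that Coccinelle checks: a 64-bit dividend lvalue divided by a 32-bit divisor. With the ACPI PM timer value held in a u32, the 64-by-32 shape is explicit; the arithmetic below is only an illustration, not the calibration math from apic.c:

  u32 deltapm = acpi_pm_read_early();     /* 24-bit PM timer tick count */
  u64 res = (u64)deltapm * USEC_PER_SEC;

  do_div(res, PMTMR_TICKS_PER_SEC);       /* ticks -> microseconds */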
2024-04-03  x86/rtc: Remove unused intel-mid.h  (Andy Shevchenko, 1 file, -1/+0)
The rtc driver used to be disabled with a direct check for Intel MID platforms. But that direct check was replaced long ago (see second link). Remove the (unused since 2016) include. [ dhansen: rewrite changelog to include some history ] Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/20240305161024.1364098-1-andriy.shevchenko%40linux.intel.com Link: https://lore.kernel.org/all/[email protected]
2024-04-03  x86/resctrl: Fix uninitialized memory read when last CPU of domain goes offline  (Reinette Chatre, 1 file, -1/+2)
Tony encountered this OOPS when the last CPU of a domain goes offline while running a kernel built with CONFIG_NO_HZ_FULL:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  ...
  RIP: 0010:__find_nth_andnot_bit+0x66/0x110
  ...
  Call Trace:
   <TASK>
   ? __die()
   ? page_fault_oops()
   ? exc_page_fault()
   ? asm_exc_page_fault()
   cpumask_any_housekeeping()
   mbm_setup_overflow_handler()
   resctrl_offline_cpu()
   resctrl_arch_offline_cpu()
   cpuhp_invoke_callback()
   cpuhp_thread_fun()
   smpboot_thread_fn()
   kthread()
   ret_from_fork()
   ret_from_fork_asm()
   </TASK>

The NULL pointer dereference is encountered while searching for another online CPU in the domain (of which there are none) that can be used to run the MBM overflow handler.

Because the kernel is configured with CONFIG_NO_HZ_FULL the search for another CPU (in its effort to prefer those CPUs that aren't marked nohz_full) consults the mask representing the nohz_full CPUs, tick_nohz_full_mask. On a kernel with CONFIG_CPUMASK_OFFSTACK=y tick_nohz_full_mask is not allocated unless the kernel is booted with the "nohz_full=" parameter and because of that any access to tick_nohz_full_mask needs to be guarded with tick_nohz_full_enabled().

Replace the IS_ENABLED(CONFIG_NO_HZ_FULL) with tick_nohz_full_enabled(). The latter ensures tick_nohz_full_mask can be accessed safely and can be used whether kernel is built with CONFIG_NO_HZ_FULL enabled or not.

[ Use Ingo's suggestion that combines the two NO_HZ checks into one. ]

Fixes: a4846aaf3945 ("x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow") Reported-by: Tony Luck <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/ff8dfc8d3dcb04b236d523d1e0de13d2ef585223.1711993956.git.reinette.chatre@intel.com Closes: https://lore.kernel.org/lkml/ZgIFT5gZgIQ9A9G7@agluck-desk3/
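A sketch of why the replacement is safe; the helper name below is made up for illustration, the point is the ordering of the two checks:

  /* tick_nohz_full_enabled() implies tick_nohz_full_mask has been allocated,
   * so this never dereferences an unallocated offstack mask. */
  static bool cpu_is_tick_nohz_full(int cpu)
  {
          if (!tick_nohz_full_enabled())
                  return false;
          return cpumask_test_cpu(cpu, tick_nohz_full_mask);
  }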
2024-04-03  x86/of: Change x86_dtb_parse_smp_config() to static  (Saurabh Sengar, 1 file, -9/+9)
x86_dtb_parse_smp_config() is called locally only, change it to static. Signed-off-by: Saurabh Sengar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-03  x86/of: Map NUMA node to CPUs as per DeviceTree  (Saurabh Sengar, 1 file, -0/+2)
Currently for DeviceTree bootup, x86 code does the default mapping of CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARNs on SMT enabled systems:

  CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
  WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d
  match_smt+0xf6/0xfc
  set_cpu_sibling_map.cold+0x24f/0x512
  start_secondary+0x5c/0x110

Call the set_apicid_to_node() function in dtb_cpu_setup() for setting the NUMA to CPU mapping for DeviceTree platforms.

Signed-off-by: Saurabh Sengar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
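A sketch of the dtb_cpu_setup() addition; 'dn' and 'apic_id' stand in for locals assumed to exist in that function, and the exact placement is not taken from the patch:

  int node = of_node_to_nid(dn);

  if (node != NUMA_NO_NODE)
          set_apicid_to_node(apic_id, node);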
2024-04-03  x86/of: Set the parse_smp_cfg for all the DeviceTree platforms by default  (Saurabh Sengar, 1 file, -2/+4)
x86_dtb_parse_smp_config() must be set by DeviceTree platform for parsing SMP configuration. Set the parse_smp_cfg pointer to x86_dtb_parse_smp_config() by default so that all the dtb platforms need not to assign it explicitly. Today there are only two platforms using DeviceTree in x86, ce4100 and hv_vtl. Remove the explicit assignment of x86_dtb_parse_smp_config() function from these. Signed-off-by: Saurabh Sengar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-02  Merge tag 'kvmarm-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 12 files, -128/+114)
KVM/arm64 fixes for 6.9, part #1

 - Ensure perf events programmed to count during guest execution are actually enabled before entering the guest in the nVHE configuration.
 - Restore out-of-range handler for stage-2 translation faults.
 - Several fixes to stage-2 TLB invalidations to avoid stale translations, possibly including partial walk caches.
 - Fix early handling of architectural VHE-only systems to ensure E2H is appropriately set.
 - Correct a format specifier warning in the arch_timer selftest.
 - Make the KVM banner message correctly handle all of the possible configurations.
2024-04-01  x86/bpf: Fix IP for relocating call depth accounting  (Joan Bruguera Micó, 1 file, -2/+2)
The commit: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") made PER_CPU_VAR() to use rip-relative addressing, hence INCREMENT_CALL_DEPTH macro and skl_call_thunk_template got rip-relative asm code inside of it. A follow up commit: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template") changed x86_call_depth_emit_accounting() to use apply_relocation(), but mistakenly assumed that the code is being patched in-place (where the destination of the relocation matches the address of the code), using *pprog as the destination ip. This is not true for the call depth accounting, emitted by the BPF JIT, so the calculated address was wrong, JIT-ed BPF progs on kernels with call depth tracking got broken and usually caused a page fault. Pass the destination IP when the BPF JIT emits call depth accounting. Fixes: 17bce3b2ae2d ("x86/callthunks: Handle %rip-relative relocations in call thunk template") Signed-off-by: Joan Bruguera Micó <[email protected]> Reviewed-by: Uros Bizjak <[email protected]> Acked-by: Ingo Molnar <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-03-31  Merge tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -0/+1)
Pull x86 perf fixes from Borislav Petkov:

 - Define the correct set of default hw events on AMD Zen4
 - Use the correct stalled cycles PMCs on AMD Zen2 and newer
 - Fix detection of the LBR freeze feature on AMD

* tag 'perf_urgent_for_v6.9_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd/core: Define a proper ref-cycles event for Zen 4 and later
  perf/x86/amd/core: Update and fix stalled-cycles-* events for Zen 2 and later
  perf/x86/amd/lbr: Use freeze based on availability
  x86/cpufeatures: Add new word for scattered features
2024-03-29  x86/boot: Move kernel cmdline setup earlier in the boot process (again)  (Julian Stecklina, 1 file, -16/+16)
When split_lock_detect=off (or similar) is specified in CONFIG_CMDLINE, its effect is lost. The flow is currently this:

  setup_arch():
    -> early_cpu_init()
      -> early_identify_cpu()
        -> sld_setup()
          -> sld_state_setup()
            -> Looks for split_lock_detect in boot_command_line
    -> e820__memory_setup()
    -> Assemble final command line: boot_command_line = builtin_cmdline + boot_cmdline
    -> parse_early_param()

There were earlier attempts at fixing this in:

  8d48bf8206f7 ("x86/boot: Pull up cmdline preparation and early param parsing")

later reverted in:

  fbe618399854 ("Revert "x86/boot: Pull up cmdline preparation and early param parsing"")

... because parse_early_param() can't be called before e820__memory_setup().

In this patch, we just move the command line concatenation to the beginning of early_cpu_init(). This should fix sld_state_setup(), while not running into the same issues as the earlier attempt. The order is now:

  setup_arch():
    -> Assemble final command line: boot_command_line = builtin_cmdline + boot_cmdline
    -> early_cpu_init()
      -> early_identify_cpu()
        -> sld_setup()
          -> sld_state_setup()
            -> Looks for split_lock_detect in boot_command_line
    -> e820__memory_setup()
    -> parse_early_param()

Signed-off-by: Julian Stecklina <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Kees Cook <[email protected]> Cc: [email protected]
2024-03-27  x86/dumpstack: Use uniform "Oops: " prefix for die() messages  (Alex Shi, 1 file, -2/+2)
panic() prints a uniform prompt: "Kernel panic - not syncing:", but die() messages don't have any of that, the message is the raw user-defined message with no prefix. There are companies that collect thousands of die() messages per week, but w/o a prompt in dmesg, it's hard to write scripts to collect and analyze the reasons. Add a uniform "Oops:" prefix like other architectures. [ mingo: Rewrote changelog. ] Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-26  x86/sev: Skip ROM range scans and validation for SEV-SNP guests  (Kevin Loughlin, 5 files, -28/+17)
SEV-SNP requires encrypted memory to be validated before access. Because the ROM memory range is not part of the e820 table, it is not pre-validated by the BIOS. Therefore, if a SEV-SNP guest kernel wishes to access this range, the guest must first validate the range. The current SEV-SNP code does indeed scan the ROM range during early boot and thus attempts to validate the ROM range in probe_roms(). However, this behavior is neither sufficient nor necessary for the following reasons: * With regards to sufficiency, if EFI_CONFIG_TABLES are not enabled and CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK is set, the kernel will attempt to access the memory at SMBIOS_ENTRY_POINT_SCAN_START (which falls in the ROM range) prior to validation. For example, Project Oak Stage 0 provides a minimal guest firmware that currently meets these configuration conditions, meaning guests booting atop Oak Stage 0 firmware encounter a problematic call chain during dmi_setup() -> dmi_scan_machine() that results in a crash during boot if SEV-SNP is enabled. * With regards to necessity, SEV-SNP guests generally read garbage (which changes across boots) from the ROM range, meaning these scans are unnecessary. The guest reads garbage because the legacy ROM range is unencrypted data but is accessed via an encrypted PMD during early boot (where the PMD is marked as encrypted due to potentially mapping actually-encrypted data in other PMD-contained ranges). In one exceptional case, EISA probing treats the ROM range as unencrypted data, which is inconsistent with other probing. Continuing to allow SEV-SNP guests to use garbage and to inconsistently classify ROM range encryption status can trigger undesirable behavior. For instance, if garbage bytes appear to be a valid signature, memory may be unnecessarily reserved for the ROM range. Future code or other use cases may result in more problematic (arbitrary) behavior that should be avoided. While one solution would be to overhaul the early PMD mapping to always treat the ROM region of the PMD as unencrypted, SEV-SNP guests do not currently rely on data from the ROM region during early boot (and even if they did, they would be mostly relying on garbage data anyways). As a simpler solution, skip the ROM range scans (and the otherwise- necessary range validation) during SEV-SNP guest early boot. The potential SEV-SNP guest crash due to lack of ROM range validation is thus avoided by simply not accessing the ROM range. In most cases, skip the scans by overriding problematic x86_init functions during sme_early_init() to SNP-safe variants, which can be likened to x86_init overrides done for other platforms (ex: Xen); such overrides also avoid the spread of cc_platform_has() checks throughout the tree. In the exceptional EISA case, still use cc_platform_has() for the simplest change, given (1) checks for guest type (ex: Xen domain status) are already performed here, and (2) these checks occur in a subsys initcall instead of an x86_init function. [ bp: Massage commit message, remove "we"s. ] Fixes: 9704c07bf9f7 ("x86/kernel: Validate ROM memory before accessing when SEV-SNP is active") Signed-off-by: Kevin Loughlin <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-26  x86/mce: Dynamically size space for machine check records  (Tony Luck, 1 file, -15/+25)
Systems with a large number of CPUs may generate a large number of machine check records when things go seriously wrong. But Linux has a fixed-size buffer that can only capture a few dozen errors. Allocate space based on the number of CPUs (with a minimum value based on the historical fixed buffer that could store 80 records). [ bp: Rename local var from tmpp to something more telling: gpool. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Sohil Mehta <[email protected]> Reviewed-by: Avadhut Naik <[email protected]> Link: https://lore.kernel.org/r/[email protected]
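The sizing rule, sketched; only the 80-record floor is stated above, so the two-records-per-CPU scaling factor below is an assumption:

  #define MCE_MIN_ENTRIES 80
  #define MCE_PER_CPU     2       /* assumed scaling factor, not from the patch */

  static unsigned int mce_numrecords(void)
  {
          return max_t(unsigned int, MCE_MIN_ENTRIES,
                       num_possible_cpus() * MCE_PER_CPU);
  }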
2024-03-26  x86/nmi: Upgrade NMI backtrace stall checks & messages  (Paul E. McKenney, 1 file, -10/+14)
The commit to improve NMI stall debuggability: 344da544f177 ("x86/nmi: Print reasons why backtrace NMIs are ignored") ... has shown value, but widespread use has also identified a few opportunities for improvement. The systems have (as usual) shown far more creativity than that commit's author, demonstrating yet again that failing CPUs can do whatever they want. In addition, the current message format is less friendly than one might like to those attempting to use these messages to identify failing CPUs. Therefore, separately flag CPUs that, during the full time that the stack-backtrace request was waiting, were always in an NMI handler, were never in an NMI handler, or exited one NMI handler. Also, split the message identifying the CPU and the time since that CPU's last NMI-related activity so that a single line identifies the CPU without any other variable information, greatly reducing the processing overhead required to identify repeat-offender CPUs. Co-developed-by: Breno Leitao <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/ab4d70c8-c874-42dc-b206-643018922393@paulmck-laptop
2024-03-26  x86/cpu: Clear TME feature flag if TME is not enabled by BIOS  (Bingsong Si, 1 file, -0/+1)
When TME is disabled by BIOS, the dmesg output is: x86/tme: not enabled by BIOS ... and TME functionality is not enabled by the kernel, but the TME feature is still shown in /proc/cpuinfo. Clear it. [ mingo: Clarified changelog ] Signed-off-by: Bingsong Si <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Cc: "Huang, Kai" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
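A sketch of the detect_tme() change described above; the MSR-decoding macro name is illustrative and the surrounding error path is abbreviated:

  if (!TME_ACTIVATE_ENABLED(tme_activate)) {
          pr_info_once("x86/tme: not enabled by BIOS\n");
          clear_cpu_cap(c, X86_FEATURE_TME);      /* drop it from /proc/cpuinfo */
          return;
  }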
2024-03-25  x86/head: Simplify relative include path to xen-head.S  (Yuntao Wang, 2 files, -2/+2)
Fix the relative path specification in the include directives adding xen-head.S to the kernel's head_*.S files since they both have "arch/x86/" as prefix. [ bp: Rewrite commit message. ] Signed-off-by: Yuntao Wang <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-03-25  x86/CPU/AMD: Improve the erratum 1386 workaround  (Borislav Petkov (AMD), 1 file, -0/+12)
Disable XSAVES only on machines which haven't loaded the microcode revision containing the erratum fix. This will come in handy when running archaic OSes as guests. OSes whose brilliant programmers thought that CPUID is overrated and one should not query it but use features directly, ala shoot first, ask questions later... but only if you're alive after the shooting. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: "Maciej S. Szmigiero" <[email protected]> Cc: Boris Ostrovsky <[email protected]> Link: https://lore.kernel.org/r/20240324200525.GBZgCHhYFsBj12PrKv@fat_crate.local
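A sketch of the microcode-revision gate; the family/model/revision values in the table below are placeholders, not the values from the patch, and the generic cpu_device_id helpers are assumed to be the mechanism used:

  static const struct x86_cpu_desc erratum_1386_microcode[] = {
          AMD_CPU_DESC(0x17, 0x31, 0x0, 0x08301000),      /* placeholder entry */
          {},
  };

  /* Only keep XSAVES disabled when the fixed microcode is not loaded. */
  if (!x86_cpu_has_min_microcode_rev(erratum_1386_microcode))
          clear_cpu_cap(c, X86_FEATURE_XSAVES);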
2024-03-25  perf/x86/amd/lbr: Use freeze based on availability  (Sandipan Das, 1 file, -0/+1)
Currently, the LBR code assumes that LBR Freeze is supported on all processors when X86_FEATURE_AMD_LBR_V2 is available i.e. CPUID leaf 0x80000022[EAX] bit 1 is set. This is incorrect as the availability of the feature is additionally dependent on CPUID leaf 0x80000022[EAX] bit 2 being set, which may not be set for all Zen 4 processors. Define a new feature bit for LBR and PMC freeze and set the freeze enable bit (FLBRI) in DebugCtl (MSR 0x1d9) conditionally. It should still be possible to use LBR without freeze for profile-guided optimization of user programs by using an user-only branch filter during profiling. When the user-only filter is enabled, branches are no longer recorded after the transition to CPL 0 upon PMI arrival. When branch entries are read in the PMI handler, the branch stack does not change. E.g. $ perf record -j any,u -e ex_ret_brn_tkn ./workload Since the feature bit is visible under flags in /proc/cpuinfo, it can be used to determine the feasibility of use-cases which require LBR Freeze to be supported by the hardware such as profile-guided optimization of kernels. Fixes: ca5b7c0d9621 ("perf/x86/amd/lbr: Add LbrExtV2 branch record support") Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/69a453c97cfd11c6f2584b19f937fe6df741510f.1711091584.git.sandipan.das@amd.com
2024-03-24  Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 11 files, -80/+69)
Pull x86 fixes from Thomas Gleixner:

 - Ensure that the encryption mask at boot is properly propagated on 5-level page tables, otherwise the PGD entry is incorrectly set to non-encrypted, which causes system crashes during boot.
 - Undo the deferred 5-level page table setup as it cannot work with memory encryption enabled.
 - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset to the default value but the cached variable is not, so subsequent comparisons might yield the wrong result and as a consequence the result prevents updating the MSR.
 - Register the local APIC address only once in the MPPARSE enumeration to prevent triggering the related WARN_ONs() in the APIC and topology code.
 - Handle the case where no APIC is found gracefully by registering a fake APIC in the topology code. That makes all related topology functions work correctly and does not affect the actual APIC driver code at all.
 - Don't evaluate logical IDs during early boot as the local APIC IDs are not yet enumerated and the invoked function returns an error code. Nothing requires the logical IDs before the final CPUID enumeration takes place, which happens after the enumeration.
 - Cure the fallout of the per CPU rework on UP which misplaced the copying of boot_cpu_data to per CPU data so that the final update to boot_cpu_data got lost which caused inconsistent state and boot crashes.
 - Use copy_from_kernel_nofault() in the kprobes setup as there is no guarantee that the address can be safely accessed.
 - Reorder struct members in struct saved_context to work around another kmemleak false positive
 - Remove the buggy code which tries to update the E820 kexec table for setup_data as that is never passed to the kexec kernel.
 - Update the resource control documentation to use the proper units.
 - Fix a Kconfig warning observed with tinyconfig

* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot/64: Move 5-level paging global variable assignments back
  x86/boot/64: Apply encryption mask to 5-level pagetable update
  x86/cpu: Add model number for another Intel Arrow Lake mobile processor
  x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
  Documentation/x86: Document that resctrl bandwidth control units are MiB
  x86/mpparse: Register APIC address only once
  x86/topology: Handle the !APIC case gracefully
  x86/topology: Don't evaluate logical IDs during early boot
  x86/cpu: Ensure that CPU info updates are propagated on UP
  kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
  x86/pm: Work around false positive kmemleak report in msr_build_context()
  x86/kexec: Do not update E820 kexec table for setup_data
  x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
2024-03-24  x86/boot/64: Move 5-level paging global variable assignments back  (Tom Lendacky, 1 file, -9/+7)
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") moved assignment of 5-level global variables to later in the boot in order to avoid having to use RIP relative addressing in order to set them. However, when running with 5-level paging and SME active (mem_encrypt=on), the variables are needed as part of the page table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(), etc.). Since the variables haven't been set, the page table manipulation is done as if 4-level paging is active, causing the system to crash on boot. While only a subset of the assignments that were moved need to be set early, move all of the assignments back into check_la57_support() so that these assignments aren't spread between two locations. Instead of just reverting the fix, this uses the new RIP_REL_REF() macro when assigning the variables. Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com