path: root/arch/x86
Age | Commit message | Author | Files | Lines
2024-08-02uretprobe: change syscall number, againArnd Bergmann1-1/+1
Despite multiple attempts to get the syscall number assignment right for the newly added uretprobe syscall, we ended up with a bit of a mess: - The number is defined as 467 based on the assumption that the xattrat family of syscalls would use 463 through 466, but those did not make it into 6.11. - The include/uapi/asm-generic/unistd.h file still lists the number 463, but the new scripts/syscall.tbl that was supposed to have the same data lists 467 instead as the number for arc, arm64, csky, hexagon, loongarch, nios2, openrisc and riscv. None of these architectures actually provide a uretprobe syscall. - All the other architectures (powerpc, arm, mips, ...) don't list this syscall at all. There are two ways to make it consistent again: either list it with the same syscall number on all architectures, or only list it on x86 but not in scripts/syscall.tbl and asm-generic/unistd.h. Based on the most recent discussion, it seems like we won't need it anywhere else, so just remove the inconsistent assignment and instead move the x86 number to the next available one in the architecture specific range, which is 335. Fixes: 5c28424e9a34 ("syscalls: Fix to add sys_uretprobe to syscall.tbl") Fixes: 190fec72df4a ("uprobe: Wire up uretprobe system call") Fixes: 63ded110979b ("uprobe: Change uretprobe syscall scope and number") Acked-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2024-08-02x86/pkeys: Restore altstack access in sigreturn()Aruna Ramakrishna1-3/+3
A process can disable access to the alternate signal stack by not enabling the altstack's PKEY in the PKRU register. Nevertheless, the kernel updates the PKRU temporarily for signal handling. However, in sigreturn(), restore_sigcontext() will restore the PKRU to the user-defined PKRU value. This will cause restore_altstack() to fail with a SIGSEGV as it needs read access to the altstack which is prohibited by the user-defined PKRU value. Fix this by restoring altstack before restoring PKRU. Signed-off-by: Aruna Ramakrishna <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-02x86/pkeys: Update PKRU to enable all pkeys before XSAVEAruna Ramakrishna2-4/+19
If the alternate signal stack is protected by a different PKEY than the current execution stack, copying XSAVE data to the sigaltstack will fail if its PKEY is not enabled in the PKRU register. It's unknown which pkey was used by the application for the altstack, so enable all PKEYS before XSAVE. But this updated PKRU value is also pushed onto the sigframe, which means the register value restored from sigcontext will be different from the user-defined one, which is incorrect. Fix that by overwriting the PKRU value on the sigframe with the original, user-defined PKRU. Signed-off-by: Aruna Ramakrishna <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-02x86/pkeys: Add helper functions to update PKRU on the sigframeAruna Ramakrishna4-0/+43
In the case where a user thread sets up an alternate signal stack protected by the default PKEY (i.e. PKEY 0), while the thread's stack is protected by a non-zero PKEY, both these PKEYS have to be enabled in the PKRU register for the signal to be delivered to the application correctly. However, the PKRU value restored after handling the signal must not enable this extra PKEY (i.e. PKEY 0) - i.e., the PKRU value in the sigframe has to be overwritten with the user-defined value. Add helper functions that will update PKRU value in the sigframe after XSAVE. Note that sig_prepare_pkru() makes no assumption about which PKEY could be used to protect the altstack (i.e. it may not be part of init_pkru), and so enables all PKEYS. No functional change. Signed-off-by: Aruna Ramakrishna <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-02x86/pkeys: Add PKRU as a parameter in signal handling functionsAruna Ramakrishna3-5/+6
Assume there's a multithreaded application that runs untrusted user code. Each thread has its stack/code protected by a non-zero PKEY, and the PKRU register is set up such that only that particular non-zero PKEY is enabled. Each thread also sets up an alternate signal stack to handle signals, which is protected by PKEY zero. The PKEYs man page documents that the PKRU will be reset to init_pkru when the signal handler is invoked, which means that PKEY zero access will be enabled. But this reset happens after the kernel attempts to push fpu state to the alternate stack, which is not (yet) accessible by the kernel, which leads to a new SIGSEGV being sent to the application, terminating it. Enabling both the non-zero PKEY (for the thread) and PKEY zero in userspace will not work for this use case. It cannot have the alt stack writeable by all - the rationale here is that the code running in that thread (using a non-zero PKEY) is untrusted and should not have access to the alternate signal stack (that uses PKEY zero), to prevent the return address of a function from being changed. The expectation is that kernel should be able to set up the alternate signal stack and deliver the signal to the application even if PKEY zero is explicitly disabled by the application. The signal handler accessibility should not be dictated by whatever PKRU value the thread sets up. The PKRU register is managed by XSAVE, which means the sigframe contents must match the register contents - which is not the case here. It's required that the signal frame contains the user-defined PKRU value (so that it is restored correctly from sigcontext) but the actual register must be reset to init_pkru so that the alt stack is accessible and the signal can be delivered to the application. It seems that the proper fix here would be to remove PKRU from the XSAVE framework and manage it separately, which is quite complicated. As a workaround, do this: orig_pkru = rdpkru(); wrpkru(orig_pkru & init_pkru_value); xsave_to_user_sigframe(); put_user(pkru_sigframe_addr, orig_pkru) In preparation for writing PKRU to sigframe, pass PKRU as an additional parameter down the call chain from get_sigframe(). No functional change. Signed-off-by: Aruna Ramakrishna <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
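The ordering matters here; the following stand-alone C sketch (with stubbed rdpkru()/wrpkru()/XSAVE helpers and a made-up init_pkru value, not the kernel implementation) illustrates the relax-save-overwrite sequence described in the workaround above:

    /* Stand-alone sketch of the ordering above; rdpkru()/wrpkru()/XSAVE are
     * stubs and init_pkru_value is a made-up "all pkeys enabled" mask. */
    #include <stdio.h>
    #include <stdint.h>

    static uint32_t fake_pkru = 0xfffffff0;      /* user-defined PKRU: most pkeys disabled (stub) */
    static const uint32_t init_pkru_value = 0x0; /* stand-in mask that leaves all pkeys enabled */
    static uint32_t sigframe_pkru;               /* stand-in for the PKRU slot in the XSAVE sigframe */

    static uint32_t rdpkru(void)   { return fake_pkru; }
    static void wrpkru(uint32_t v) { fake_pkru = v; }
    static void xsave_to_user_sigframe(void) { sigframe_pkru = fake_pkru; } /* XSAVE dumps current PKRU */

    int main(void)
    {
        uint32_t orig_pkru = rdpkru();       /* remember the user-defined value */
        wrpkru(orig_pkru & init_pkru_value); /* open up access so the altstack is writable */
        xsave_to_user_sigframe();            /* sigframe now holds the relaxed PKRU... */
        sigframe_pkru = orig_pkru;           /* ...so overwrite it with the original value */
        printf("register PKRU %#x, sigframe PKRU %#x\n",
               (unsigned)fake_pkru, (unsigned)sigframe_pkru);
        return 0;
    }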
2024-08-02Merge branch 'linus' into x86/mmThomas Gleixner237-10378/+12376
Bring x86 and selftests up to date
2024-08-02perf,x86: avoid missing caller address in stack traces captured in uprobeAndrii Nakryiko1-0/+63
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often the `push %rbp` instruction in the function preamble, which means that the frame pointer hasn't been established yet within that function. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which would be the caller's frame, containing the address of the caller's caller). So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes an x86-64-specific heuristic to detect when the traced instruction is `push %rbp` (`push %ebp` on a 32-bit architecture). Given that the entire kernel implementation of user space stack trace capturing works under the assumption that user space code was compiled with frame pointer register (%rbp/%ebp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, the return address is still pointed to by %rsp/%esp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using frame pointer-based logic. We also check for `endbr64` (for 64-bit modes) as another common pattern for function entry, as suggested by Josh Poimboeuf. Even if we get this wrong sometimes for uprobes not attached at the function entry, it's OK because the stack trace will still be overall meaningful, just with one extra bogus entry. If we don't detect this, the caller's entry is guaranteed to be missing from the stack trace, which is worse overall. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
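A stand-alone illustration of the opcode check such a heuristic relies on (`push %rbp`/`push %ebp` is opcode 0x55, `endbr64` is f3 0f 1e fa); this mirrors the idea described above and is not the perf_callchain_user() code itself:

    /* The first byte(s) at the probed address are checked for `push %rbp`
     * (0x55) or `endbr64` (f3 0f 1e fa); illustration only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static bool looks_like_function_entry(const uint8_t *insn)
    {
        static const uint8_t endbr64[] = { 0xf3, 0x0f, 0x1e, 0xfa };

        return insn[0] == 0x55 ||                       /* push %rbp / push %ebp */
               !memcmp(insn, endbr64, sizeof(endbr64)); /* CET marker at function start */
    }

    int main(void)
    {
        const uint8_t entry[]    = { 0x55, 0x48, 0x89, 0xe5 }; /* push %rbp; mov %rsp,%rbp */
        const uint8_t mid_func[] = { 0x48, 0x8b, 0x07, 0x90 }; /* mov (%rdi),%rax; nop */

        printf("entry: %d, mid-function: %d\n",
               looks_like_function_entry(entry),
               looks_like_function_entry(mid_func));
        return 0;
    }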
2024-08-01x86/uaccess: Zero the 8-byte get_range case on failure on 32-bitDavid Gow1-1/+3
While zeroing the upper 32 bits of an 8-byte getuser on 32-bit x86 was fixed by commit 8c860ed825cb ("x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checking") it was broken again in commit 8a2462df1547 ("x86/uaccess: Improve the 8-byte getuser() case"). This is because the register which holds the upper 32 bits (%ecx) is being cleared _after_ the check_range, so if the range check fails, %ecx is never cleared. This can be reproduced with: ./tools/testing/kunit/kunit.py run --arch i386 usercopy Instead, clear %ecx _before_ check_range in the 8-byte case. This reintroduces a bit of the ugliness we were trying to avoid by adding another #ifndef CONFIG_X86_64, but at least keeps check_range from needing a separate bad_get_user_8 jump. Fixes: 8a2462df1547 ("x86/uaccess: Improve the 8-byte getuser() case") Signed-off-by: David Gow <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-08-01KVM: x86/mmu: fix determination of max NPT mapping level for private pagesAckerley Tng1-1/+1
The `if (req_max_level)` test was meant to ignore req_max_level if PG_LEVEL_NONE was returned. Hence, this function should return max_level instead of the ignored req_max_level. This is only a latent issue for now, since guest_memfd does not support large pages. Signed-off-by: Ackerley Tng <[email protected]> Message-ID: <[email protected]> Fixes: f32fb32820b1 ("KVM: x86: Add hook for determining max NPT mapping level") Signed-off-by: Paolo Bonzini <[email protected]>
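A schematic of the intended clamping logic (hypothetical helper, not the actual KVM function): a vendor-reported limit of PG_LEVEL_NONE means "no extra limit", so the clamped max_level must be returned rather than req_max_level:

    /* Hypothetical helper mirroring the description above; not the KVM code. */
    #include <stdio.h>

    #define PG_LEVEL_NONE 0

    static int clamp_mapping_level(int max_level, int req_max_level)
    {
        if (req_max_level != PG_LEVEL_NONE && req_max_level < max_level)
            max_level = req_max_level;
        return max_level;        /* the bug was returning req_max_level here */
    }

    int main(void)
    {
        printf("%d\n", clamp_mapping_level(2, PG_LEVEL_NONE)); /* 2, not 0 */
        printf("%d\n", clamp_mapping_level(2, 1));             /* 1 */
        return 0;
    }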
2024-08-01x86/mce: Use mce_prep_record() helpers for apei_smca_report_x86_error()Yazen Ghannam1-8/+8
Current AMD systems can report MCA errors using the ACPI Boot Error Record Table (BERT). The BERT entries for MCA errors will be an x86 Common Platform Error Record (CPER) with an MSR register context that matches the MCAX/SMCA register space. However, the BERT will not necessarily be processed on the CPU that reported the MCA errors. Therefore, the correct CPU number needs to be determined and the information saved in struct mce. Use the newly defined mce_prep_record_*() helpers to get the correct data. Also, add an explicit check to verify that a valid CPU number was found from the APIC ID search. Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2024-08-01x86/mce: Define mce_prep_record() helpers for common and per-CPU fieldsYazen Ghannam2-11/+25
Generally, MCA information for an error is gathered on the CPU that reported the error. In this case, CPU-specific information from the running CPU will be correct. However, this will be incorrect if the MCA information is gathered while running on a CPU that didn't report the error. One example is creating an MCA record using mce_prep_record() for errors reported from ACPI. Split mce_prep_record() so that there is a helper function to gather common, i.e. not CPU-specific, information and another helper for CPU-specific information. Leave mce_prep_record() defined as-is for the common case when running on the reporting CPU. Get MCG_CAP in the global helper even though the register is per-CPU. This value is not already cached per-CPU like other values. And it does not assist with any per-CPU decoding or handling. Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2024-08-01x86/mce: Rename mce_setup() to mce_prep_record()Yazen Ghannam4-7/+7
There is no MCE "setup" done in mce_setup(). Rather, this function initializes and prepares an MCE record. Rename the function to highlight what it does. No functional change is intended. Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2024-08-01x86/mm: Fix pti_clone_entry_text() for i386Peter Zijlstra1-1/+1
While x86_64 has PMD aligned text sections, i386 does not have this luxury. Notably ALIGN_ENTRY_TEXT_END is empty and _etext has PAGE alignment. This means that text on i386 can be page granular at the tail end, which in turn means that the PTI text clones should consistently account for this. Make pti_clone_entry_text() consistent with pti_clone_kernel_text(). Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2024-08-01x86/mm: Fix pti_clone_pgtable() alignment assumptionPeter Zijlstra1-3/+3
Guenter reported dodgy crashes on an i386-nosmp build using GCC-11 that had the form of endless traps until entry stack exhaust and then #DF from the stack guard. It turned out that pti_clone_pgtable() had alignment assumptions on the start address, notably it hard assumes start is PMD aligned. This is true on x86_64, but very much not true on i386. These assumptions can cause the end condition to malfunction, leading to a 'short' clone. Guess what happens when the user mapping has a short copy of the entry text? Use the correct increment form for addr to avoid alignment assumptions. Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit") Reported-by: Guenter Roeck <[email protected]> Tested-by: Guenter Roeck <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
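For illustration only, the arithmetic difference between the two increment forms for an unaligned start address (the exact form used in the kernel fix may differ):

    /* Adding a fixed PMD_SIZE to an unaligned address keeps it unaligned,
     * while rounding up to the next boundary makes no alignment assumption.
     * Hypothetical illustration of the arithmetic only. */
    #include <stdio.h>

    #define PMD_SIZE       (2UL * 1024 * 1024)
    #define round_up(x, a) ((((x) + (a) - 1) / (a)) * (a))

    int main(void)
    {
        unsigned long addr = 0xc1a00000UL + 0x1000; /* page-aligned, not PMD-aligned */

        printf("fixed step: %#lx -> %#lx (still unaligned)\n", addr, addr + PMD_SIZE);
        printf("round_up:   %#lx -> %#lx (next PMD boundary)\n", addr, round_up(addr + 1, PMD_SIZE));
        return 0;
    }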
2024-07-31x86/setup: Parse the builtin command line before mergingBorislav Petkov (AMD)3-8/+23
Commit in Fixes was added as a catch-all for cases where the cmdline is parsed before being merged with the builtin one. And promptly one issue appeared, see Link below. The microcode loader really needs to parse it that early, but the merging happens later. Reshuffling the early boot nightmare^W code to handle that properly would be a painful exercise for another day so do the chicken thing and parse the builtin cmdline too before it has been merged. Fixes: 0c40b1c7a897 ("x86/setup: Warn when option parsing is done too early") Reported-by: Mike Lothian <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local
2024-07-31x86/tsc: Use topology_max_packages() to get package numberFeng Tang1-5/+3
Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC on qualified platorms") was introduced to solve a problem where the TSC clocksource is sometimes wrongly judged as unstable by a watchdog like 'jiffies', HPET, etc. In it, the hardware package number is a key factor for judging whether to disable the watchdog for TSC, and 'nr_online_nodes' was chosen because, at that time (kernel v5.1x), it was available in the early boot phase before the 'tsc-early' clocksource is registered, when the non-boot CPUs have not been brought up yet. Dave and Rui pointed out there are many cases in which 'nr_online_nodes' can be fooled and is not accurate, like: * SNC (sub-numa cluster) mode enabled * numa emulation (numa=fake=8 etc.) * numa=off * platforms with CPU-less HBM nodes, CPU-less Optane memory nodes. * 'maxcpus=' cmdline setup, where chopped CPUs could be onlined later * 'nr_cpus=', 'possible_cpus=' cmdline setup, where chopped CPUs cannot be onlined after boot The SNC case is the most user-visible case, as many CSP (Cloud Service Provider) enable this feature in their server fleets. When SNC3 is enabled, a 2-socket machine will appear to have 6 NUMA nodes and is impacted by the issue in practice. Thomas' recent patchset refactoring the x86 topology code improves topology_max_packages() greatly, by making it more accurate and available in the early boot phase, which works well in most of the above cases. The only exceptions are the 'nr_cpus=' and 'possible_cpus=' setups, which may under-estimate the package number, because during topology setup the boot CPU iterates through all enumerated APIC IDs and either accepts or rejects each APIC ID. For accepted IDs, it figures out which bits of the ID map to the package number. It tracks which package numbers have been seen in a bitmap. topology_max_packages() just returns the number of bits set in that bitmap. 'nr_cpus=' and 'possible_cpus=' can cause more APIC IDs to be rejected and can artificially lower the number of bits in the package bitmap and thus topology_max_packages(). This means that, for example, a system with 8 physical packages might reject all the CPUs on 6 of those packages and be left with only 2 packages and 2 bits set in the package bitmap. It needs the TSC watchdog, but would disable it anyway. This isn't ideal, but it only happens for debug-oriented options. This is fixable by tracking the package numbers for rejected CPUs. But it's not worth the trouble for debugging. So use topology_max_packages() to replace nr_online_nodes(). Reported-by: Dave Hansen <[email protected]> Signed-off-by: Feng Tang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Waiman Long <[email protected]> Link: https://lore.kernel.org/all/[email protected] Closes: https://lore.kernel.org/lkml/[email protected]/
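A stand-alone sketch of the bookkeeping described above, with made-up names: accepted APIC IDs mark their package number in a bitmap, and the maximum package count is simply the number of bits set:

    /* Made-up names; accepted APIC IDs mark their package in a bitmap and
     * the package count is the number of bits set. */
    #include <stdio.h>

    #define MAX_PKGS 64

    static unsigned long long pkg_seen_bitmap;

    static void topology_account_apicid(unsigned int apicid, unsigned int pkg_shift)
    {
        unsigned int pkg = apicid >> pkg_shift;  /* which bits of the APIC ID map to the package */

        pkg_seen_bitmap |= 1ULL << (pkg % MAX_PKGS);
    }

    static unsigned int topo_max_packages(void)
    {
        return (unsigned int)__builtin_popcountll(pkg_seen_bitmap);
    }

    int main(void)
    {
        /* Two packages' worth of accepted APIC IDs; package bits start at bit 4 here. */
        const unsigned int accepted[] = { 0x00, 0x01, 0x10, 0x11 };

        for (unsigned int i = 0; i < sizeof(accepted) / sizeof(accepted[0]); i++)
            topology_account_apicid(accepted[i], 4);

        printf("packages seen: %u\n", topo_max_packages()); /* 2 */
        return 0;
    }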
2024-07-31perf/x86: Fix smp_processor_id()-in-preemptible warningsLi Huafei1-10/+12
The following bug was triggered on a system built with CONFIG_DEBUG_PREEMPT=y: # echo p > /proc/sysrq-trigger BUG: using smp_processor_id() in preemptible [00000000] code: sh/117 caller is perf_event_print_debug+0x1a/0x4c0 CPU: 3 UID: 0 PID: 117 Comm: sh Not tainted 6.11.0-rc1 #109 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x4f/0x60 check_preemption_disabled+0xc8/0xd0 perf_event_print_debug+0x1a/0x4c0 __handle_sysrq+0x140/0x180 write_sysrq_trigger+0x61/0x70 proc_reg_write+0x4e/0x70 vfs_write+0xd0/0x430 ? handle_mm_fault+0xc8/0x240 ksys_write+0x9c/0xd0 do_syscall_64+0x96/0x190 entry_SYSCALL_64_after_hwframe+0x4b/0x53 This is because the commit d4b294bf84db ("perf/x86: Hybrid PMU support for counters") took smp_processor_id() outside the irq critical section. If a preemption occurs in perf_event_print_debug() and the task is migrated to another cpu, we may get incorrect pmu debug information. Move smp_processor_id() back inside the irq critical section to fix this issue. Fixes: d4b294bf84db ("perf/x86: Hybrid PMU support for counters") Signed-off-by: Li Huafei <[email protected]> Reviewed-and-tested-by: K Prateek Nayak <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/CPU/AMD: Add models 0x60-0x6f to the Zen5 rangePerry Yuan1-1/+1
Add some new Zen5 models for the 0x1A family. [ bp: Merge the 0x60 and 0x70 ranges. ] Signed-off-by: Perry Yuan <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for GDSBreno Leitao2-1/+12
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create a new kernel config that allows GDS to be completely disabled, similarly to the "gather_data_sampling=off" or "mitigations=off" kernel command-line options. Now, there are two options for GDS mitigation: * CONFIG_MITIGATION_GDS=n -> Mitigation disabled (New) * CONFIG_MITIGATION_GDS=y -> Mitigation enabled (GDS_MITIGATION_FULL) Suggested-by: Josh Poimboeuf <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Remove GDS Force Kconfig optionBreno Leitao2-23/+0
Remove the MITIGATION_GDS_FORCE Kconfig option, which aggressively disables AVX as a mitigation for Gather Data Sampling (GDS) vulnerabilities. This option is not widely used by distros. While removing the Kconfig option, retain the runtime configuration ability through the `gather_data_sampling=force` kernel parameter. This allows users to still enable this aggressive mitigation if needed, without baking it into the kernel configuration. Simplify the kernel configuration while maintaining flexibility for runtime mitigation choices. Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Daniel Sneddon <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for SSBBreno Leitao2-4/+16
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the SSB CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for Spectre V2Breno Leitao2-4/+17
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the Spectre V2 CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for SRBDSBreno Leitao2-1/+16
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the SRBDS CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for Spectre v1Breno Leitao2-1/+12
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the Spectre v1 CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for RETBLEEDBreno Leitao2-1/+14
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the RETBLEED CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for L1TFBreno Leitao2-1/+12
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the L1TF CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for MMIO Stale DataBreno Leitao2-1/+14
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the MMIO Stale data CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for TAABreno Leitao2-1/+13
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the TAA CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/bugs: Add a separate config for MDSBreno Leitao2-1/+11
Currently, CONFIG_SPECULATION_MITIGATIONS is only halfway populated: some mitigations have entries in Kconfig and can be modified, while other mitigations do not have Kconfig entries and cannot be controlled at build time. Create an entry for the MDS CPU mitigation under CONFIG_SPECULATION_MITIGATIONS. This allows users to enable or disable it at compilation time. Signed-off-by: Breno Leitao <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/sev: Fix __reserved field in sev_configPavan Kumar Paluri1-1/+1
sev_config currently has debug, ghcbs_initialized, and use_cas fields. However, the __reserved count has not been updated to match. Fix this. Fixes: 34ff65901735 ("x86/sev: Use kernel provided SVSM Calling Areas") Signed-off-by: Pavan Kumar Paluri <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-30x86/microcode/AMD: Fix a -Wsometimes-uninitialized clang false positiveBorislav Petkov (AMD)1-1/+1
Initialize equiv_id in order to shut up: arch/x86/kernel/cpu/microcode/amd.c:714:6: warning: variable 'equiv_id' is \ used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] if (x86_family(bsp_cpuid_1_eax) < 0x17) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ because clang doesn't do interprocedural analysis for warnings to see that this variable won't be used uninitialized. Fixes: 94838d230a6c ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2024-07-29bpf, x64: Fix tailcall hierarchyLeon Hwang1-28/+79
This patch fixes a tailcall issue caused by abusing the tailcall in bpf2bpf feature. As we know, tail_call_cnt propagates by rax from caller to callee when calling a subprog in tailcall context. But, as in the following example, MAX_TAIL_CALL_CNT won't work because of missing tail_call_cnt back-propagation from callee to caller. #include <linux/bpf.h> #include <bpf/bpf_helpers.h> #include "bpf_legacy.h" struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); int count = 0; static __noinline int subprog_tail1(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } static __noinline int subprog_tail2(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { volatile int ret = 1; count++; subprog_tail1(skb); subprog_tail2(skb); return ret; } char __license[] SEC("license") = "GPL"; At run time, the tail_call_cnt in entry() will be propagated to subprog_tail1() and subprog_tail2(). But when the tail_call_cnt in subprog_tail1() is updated by bpf_tail_call_static(), the tail_call_cnt in entry() won't be updated at the same time. As a result, in entry(), when tail_call_cnt in entry() is less than MAX_TAIL_CALL_CNT and subprog_tail1() returns because of the MAX_TAIL_CALL_CNT limit, bpf_tail_call_static() in subprog_tail2() is able to run because the tail_call_cnt in subprog_tail2() propagated from entry() is less than MAX_TAIL_CALL_CNT. So, how many tailcalls are there for this case if no error happens? From a top-down view, it looks like a hierarchy, layer upon layer. With this view, there will be 2+4+8+...+2^33 = 2^34 - 2 = 17,179,869,182 tailcalls for this case. What if there are N subprog_tail() calls in entry()? There will be almost N^34 tailcalls. This patch resolves this case on x86_64. Instead of propagating tail_call_cnt from caller to callee, it propagates a pointer to it, tail_call_cnt_ptr, tcc_ptr for short. However, where does it store tail_call_cnt? It stores tail_call_cnt on the stack of the main prog. When a tail call happens in a subprog, it increments tail_call_cnt through tcc_ptr. Meanwhile, it stores tail_call_cnt_ptr on the stack of the main prog, too. And, before jumping to the tail callee, it has to pop tail_call_cnt and tail_call_cnt_ptr. Then, at the prologue of a subprog, it must not make rax a tail_call_cnt_ptr again; it has to reuse the tail_call_cnt_ptr from the caller. As a result, at run time, the prologue has to recognize whether rax is tail_call_cnt or tail_call_cnt_ptr by: 1. rax is tail_call_cnt if rax is <= MAX_TAIL_CALL_CNT. 2. rax is tail_call_cnt_ptr if rax is > MAX_TAIL_CALL_CNT, because a pointer won't be <= MAX_TAIL_CALL_CNT. Here's an example of the JITed dump.
struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 1); __uint(key_size, sizeof(__u32)); __uint(value_size, sizeof(__u32)); } jmp_table SEC(".maps"); int count = 0; static __noinline int subprog_tail(struct __sk_buff *skb) { bpf_tail_call_static(skb, &jmp_table, 0); return 0; } SEC("tc") int entry(struct __sk_buff *skb) { int ret = 1; count++; subprog_tail(skb); subprog_tail(skb); return ret; } When bpftool p d j id 42: int entry(struct __sk_buff * skb): bpf_prog_0c0f4c2413ef19b1_entry: ; int entry(struct __sk_buff *skb) 0: endbr64 4: nopl (%rax,%rax) 9: xorq %rax, %rax ;; rax = 0 (tail_call_cnt) c: pushq %rbp d: movq %rsp, %rbp 10: endbr64 14: cmpq $33, %rax ;; if rax > 33, rax = tcc_ptr 18: ja 0x20 ;; if rax > 33 goto 0x20 ---+ 1a: pushq %rax ;; [rbp - 8] = rax = 0 | 1b: movq %rsp, %rax ;; rax = rbp - 8 | 1e: jmp 0x21 ;; ---------+ | 20: pushq %rax ;; <--------|---------------+ 21: pushq %rax ;; <--------+ [rbp - 16] = rax 22: pushq %rbx ;; callee saved 23: movq %rdi, %rbx ;; rbx = skb (callee saved) ; count++; 26: movabsq $-82417199407104, %rdi 30: movl (%rdi), %esi 33: addl $1, %esi 36: movl %esi, (%rdi) ; subprog_tail(skb); 39: movq %rbx, %rdi ;; rdi = skb 3c: movq -16(%rbp), %rax ;; rax = tcc_ptr 43: callq 0x80 ;; call subprog_tail() ; subprog_tail(skb); 48: movq %rbx, %rdi ;; rdi = skb 4b: movq -16(%rbp), %rax ;; rax = tcc_ptr 52: callq 0x80 ;; call subprog_tail() ; return ret; 57: movl $1, %eax 5c: popq %rbx 5d: leave 5e: retq int subprog_tail(struct __sk_buff * skb): bpf_prog_3a140cef239a4b4f_subprog_tail: ; int subprog_tail(struct __sk_buff *skb) 0: endbr64 4: nopl (%rax,%rax) 9: nopl (%rax) ;; do not touch tail_call_cnt c: pushq %rbp d: movq %rsp, %rbp 10: endbr64 14: pushq %rax ;; [rbp - 8] = rax (tcc_ptr) 15: pushq %rax ;; [rbp - 16] = rax (tcc_ptr) 16: pushq %rbx ;; callee saved 17: pushq %r13 ;; callee saved 19: movq %rdi, %rbx ;; rbx = skb ; asm volatile("r1 = %[ctx]\n\t" 1c: movabsq $-105487587488768, %r13 ;; r13 = jmp_table 26: movq %rbx, %rdi ;; 1st arg, skb 29: movq %r13, %rsi ;; 2nd arg, jmp_table 2c: xorl %edx, %edx ;; 3rd arg, index = 0 2e: movq -16(%rbp), %rax ;; rax = [rbp - 16] (tcc_ptr) 35: cmpq $33, (%rax) 39: jae 0x4e ;; if *tcc_ptr >= 33 goto 0x4e --------+ 3b: jmp 0x4e ;; jmp bypass, toggled by poking | 40: addq $1, (%rax) ;; (*tcc_ptr)++ | 44: popq %r13 ;; callee saved | 46: popq %rbx ;; callee saved | 47: popq %rax ;; undo rbp-16 push | 48: popq %rax ;; undo rbp-8 push | 49: nopl (%rax,%rax) ;; tail call target, toggled by poking | ; return 0; ;; | 4e: popq %r13 ;; restore callee saved <--------------+ 50: popq %rbx ;; restore callee saved 51: leave 52: retq Furthermore, when trampoline is the caller of bpf prog, which is tail_call_reachable, it is required to propagate rax through trampoline. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Eduard Zingerman <[email protected]> Signed-off-by: Leon Hwang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29x86/aperfmperf: Fix deadlock on cpu_hotplug_lockJonathan Cameron1-2/+4
The broken patch results in a call to init_freq_invariance_cppc() in a CPU hotplug handler in both the path for initially present CPUs and the path for those hotplugged later. That function includes a one-time call to amd_set_max_freq_ratio(), which in turn calls freq_invariance_enable(), which has a static_branch_enable() that takes the cpu_hotplug_lock, which is already held. Avoid the deadlock by using static_branch_enable_cpuslocked() as the lock will always be already held. The equivalent path on Intel does not already hold this lock, so take it around the call to freq_invariance_enable(), which results in it being held over the call to register_syscall_ops, which looks to be safe to do. Fixes: c1385c1f0ba3 ("ACPI: processor: Simplify initial onlining to use same path for cold and hotplug") Closes: https://lore.kernel.org/all/CABXGCsPvqBfL5hQDOARwfqasLRJ_eNPBbCngZ257HOe=xbWDkA@mail.gmail.com/ Reported-by: Mikhail Gavrilov <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Jonathan Cameron <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Tested-by: Mikhail Gavrilov <[email protected]> Tested-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-29perf/x86: Add hw_perf_event::aux_configPeter Zijlstra1-7/+7
Start a new section for AUX PMUs in hw_perf_event. Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29perf/x86/intel/pt: Fix sampling synchronizationAdrian Hunter1-8/+7
pt_event_snapshot_aux() uses pt->handle_nmi to determine if tracing needs to be stopped, however tracing can still be going because pt->handle_nmi is set to zero before tracing is stopped in pt_event_stop, whereas pt_event_snapshot_aux() requires that tracing must be stopped in order to copy a sample of trace from the buffer. Instead call pt_config_stop() always, which anyway checks config for RTIT_CTL_TRACEEN and does nothing if it is already clear. Note pt_event_snapshot_aux() can continue to use pt->handle_nmi to determine if the trace needs to be restarted afterwards. Fixes: 25e8920b301c ("perf/x86/intel/pt: Add sampling support") Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2024-07-29perf/x86/intel/cstate: Add pkg C2 residency counter for Sierra ForestZhenyu Wang1-2/+3
The package C2 residency counter is also available on Sierra Forest, so add support for it in srf_cstates. Fixes: 3877d55a0db2 ("perf/x86/intel/cstate: Add Sierra Forest support") Signed-off-by: Zhenyu Wang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kan Liang <[email protected]> Tested-by: Wendy Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-29x86/elf: Add a new FPU buffer layout info to x86 core filesVignesh Balasubramanian3-0/+106
Add a new .note section containing type, size, offset and flags of every xfeature that is present. This information will be used by debuggers to understand the XSAVE layout of the machine where the core file has been dumped, and to read XSAVE registers, especially during cross-platform debugging. The XSAVE layouts of modern AMD and Intel CPUs differ, especially since Memory Protection Keys and the AVX-512 features have been inculcated into the AMD CPUs. Since AMD never adopted (and hence never left room in the XSAVE layout for) the Intel MPX feature, tools like GDB had assumed a fixed XSAVE layout matching that of Intel (based on the XCR0 mask). Hence, core dumps from AMD CPUs didn't match the known size for the XCR0 mask. This resulted in GDB and other tools not being able to access the values of the AVX-512 and PKRU registers on AMD CPUs. To solve this, an interim solution has been accepted into GDB, and is already a part of GDB 14, see https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html. But it depends on heuristics based on the total XSAVE register set size and the XCR0 mask to infer the layouts of the various register blocks for core dumps, and hence, is not a foolproof mechanism to determine the layout of the XSAVE area. Therefore, add a new core dump note in order to allow GDB/LLDB and other relevant tools to determine the layout of the XSAVE area of the machine where the corefile was dumped. The new core dump note (which is being proposed as a per-process .note section), NT_X86_XSAVE_LAYOUT (0x205) contains an array of structures. Each structure describes an individual extended feature containing offset, size and flags in this format: struct x86_xfeat_component { u32 type; u32 size; u32 offset; u32 flags; }; and in an independent manner, allowing for future extensions without depending on hw arch specifics like CPUID etc. [ bp: Massage commit message, zap trailing whitespace. ] Co-developed-by: Jini Susan George <[email protected]> Signed-off-by: Jini Susan George <[email protected]> Co-developed-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Signed-off-by: Vignesh Balasubramanian <[email protected]> Link: https://lore.kernel.org/r/[email protected]
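A consumer-side sketch of how a debugger might walk such a note: the struct layout is the one quoted above, while the sample payload and everything else is made up:

    /* The struct layout is quoted from the commit; the sample payload is made up. */
    #include <stdint.h>
    #include <stdio.h>

    struct x86_xfeat_component {
        uint32_t type;
        uint32_t size;
        uint32_t offset;
        uint32_t flags;
    };

    int main(void)
    {
        /* Hypothetical note payload: x87, SSE and PKRU components. */
        const struct x86_xfeat_component layout[] = {
            { .type = 0, .size = 160, .offset = 0,    .flags = 0 },
            { .type = 1, .size = 256, .offset = 160,  .flags = 0 },
            { .type = 9, .size = 8,   .offset = 2688, .flags = 0 },
        };

        for (size_t i = 0; i < sizeof(layout) / sizeof(layout[0]); i++)
            printf("xfeature %u: %u bytes at offset %u in the XSAVE area\n",
                   (unsigned)layout[i].type, (unsigned)layout[i].size,
                   (unsigned)layout[i].offset);
        return 0;
    }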
2024-07-29x86/microcode/AMD: Use the family,model,stepping encoded in the patch IDBorislav Petkov1-32/+158
On Zen and newer, the family, model and stepping are part of the microcode patch ID, so the equivalence table the driver has been using is not needed anymore. So switch the driver to use that from now on. The equivalence table in the microcode blob should still remain in case there's a need to pass some additional information to the kernel loader. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/20240725112037.GBZqI1BbUk1KMlOJ_D@fat_crate.local
2024-07-29x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60hShyam Sundar S K1-0/+3
Add new PCI device IDs into the root IDs and miscellaneous IDs lists to provide support for the latest generation of AMD 1Ah family 60h processor models. Signed-off-by: Shyam Sundar S K <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Yazen Ghannam <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-29treewide: context_tracking: Rename CONTEXT_* into CT_STATE_*Valentin Schneider1-1/+1
Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix. Suggested-by: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Signed-off-by: Neeraj Upadhyay <[email protected]>
2024-07-28minmax: add a few more MIN_T/MAX_T usersLinus Torvalds1-1/+1
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes), and (b) the type sanity checking; MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/[email protected]/ Cc: David Laight <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
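A simplified stand-in showing the idea (the real macros live in include/linux/minmax.h and are defined differently): both arguments are force-cast to the given type, the result stays a C constant expression, and the usual multiple-evaluation caveat applies:

    /* Simplified stand-ins; the real kernel macros differ in detail. */
    #include <stdio.h>

    #define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))
    #define MAX_T(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

    static char buf[MAX_T(int, 64, 16)]; /* usable where a C constant expression is required */

    int main(void)
    {
        unsigned int len = 300;

        printf("%zu %u\n", sizeof(buf), MIN_T(unsigned int, len, 255u)); /* 64 255 */
        return 0;
    }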
2024-07-27Merge tag 'for-linus-6.11-rc1a-tag' of ↵Linus Torvalds6-64/+74
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Two fixes for issues introduced in this merge window: - fix enhanced debugging in the Xen multicall handling - two patches fixing a boot failure when running as dom0 in PVH mode" * tag 'for-linus-6.11-rc1a-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: fix memblock_reserve() usage on PVH x86/xen: move xen_reserve_extra_memory() xen: fix multicall debug data referencing
2024-07-26minmax: avoid overly complex min()/max() macro arguments in xenLinus Torvalds1-2/+3
We have some very fancy min/max macros that have tons of sanity checking to warn about mixed signedness etc. This is all things that a sane compiler should warn about, but there are no sane compiler interfaces for this, and '-Wsign-compare' is broken [1] and not useful. So then we compensate (some would say over-compensate) by doing the checks manually with some truly horrid macro games. And no, we can't just use __builtin_types_compatible_p(), because the whole question of "does it make sense to compare these two values" is a lot more complicated than that. For example, it makes a ton of sense to compare unsigned values with simple constants like "5", even if that is indeed a signed type. So we have these very strange macros to try to make sensible type checking decisions on the arguments to 'min()' and 'max()'. But that can cause enormous code expansion if the min()/max() macros are used with complicated expressions, and particularly if you nest these things so that you get the first big expansion then expanded again. The xen setup.c file ended up ballooning to over 50MB of preprocessed noise that takes 15s to compile (obviously depending on the build host), largely due to one single line. So let's split that one single line to just be simpler. I think it ends up being more legible to humans too at the same time. Now that single file compiles in under a second. Reported-and-reviewed-by: Lorenzo Stoakes <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Link: https://staticthinking.wordpress.com/2023/07/25/wsign-compare-is-garbage/ [1] Cc: David Laight <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-07-26KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfnsPaolo Bonzini1-7/+0
This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: extend kvm_range_has_memory_attributes() to check subset of attributesPaolo Bonzini1-1/+1
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: guest_memfd: move check for already-populated page to common codePaolo Bonzini1-1/+1
Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP. Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results. Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate(). Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use the same errno that kvm_gmem_populate() is using. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: remove kvm_arch_gmem_prepare_needed()Paolo Bonzini1-5/+0
It is enough to return 0 if a guest need not do any preparation. This is in fact how sev_gmem_prepare() works for non-SNP guests, and it extends naturally to Intel hosts: the x86 callback for gmem_prepare is optional and returns 0 if not defined. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*Paolo Bonzini2-4/+4
Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: x86: disallow pre-fault for SNP VMs before initializationPaolo Bonzini5-0/+16
KVM_PRE_FAULT_MEMORY for an SNP guest can race with sev_gmem_post_populate() in bad ways. The following sequence for instance can potentially trigger an RMP fault: thread A, sev_gmem_post_populate: called thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i); thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE); RMP #PF Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's initial private memory contents have been finalized via KVM_SEV_SNP_LAUNCH_FINISH. Beyond fixing this issue, it just sort of makes sense to enforce this, since the KVM_PRE_FAULT_MEMORY documentation states: "KVM maps memory as if the vCPU generated a stage-2 read page fault" which sort of implies we should be acting on the same guest state that a vCPU would see post-launch after the initial guest memory is all set up. Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-07-26KVM: x86: Eliminate log spam from limited APIC timer periodsJim Mattson1-1/+1
SAP's vSMP MemoryONE continuously requests a local APIC timer period less than 500 us, resulting in the following kernel log spam: kvm: vcpu 15: requested 70240 ns lapic timer period limited to 500000 ns kvm: vcpu 19: requested 52848 ns lapic timer period limited to 500000 ns kvm: vcpu 15: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70208 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 387520 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70160 ns lapic timer period limited to 500000 ns kvm: vcpu 66: requested 205744 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70224 ns lapic timer period limited to 500000 ns kvm: vcpu 9: requested 70256 ns lapic timer period limited to 500000 ns limit_periodic_timer_frequency: 7569 callbacks suppressed ... To eliminate this spam, change the pr_info_ratelimited() in limit_periodic_timer_frequency() to pr_info_once(). Reported-by: James Houghton <[email protected]> Signed-off-by: Jim Mattson <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
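A user-space analogue of the behavioural change (the kernel's pr_info_once() has its own implementation in printk.h; this only illustrates print-once vs. rate-limited):

    /* Per-call-site "print once" guard; behavioural illustration only. */
    #include <stdio.h>

    #define pr_info_once(...)          \
        do {                           \
            static int printed;        \
            if (!printed) {            \
                printed = 1;           \
                printf(__VA_ARGS__);   \
            }                          \
        } while (0)

    int main(void)
    {
        for (int vcpu = 0; vcpu < 5; vcpu++)
            pr_info_once("vcpu %d: requested lapic timer period limited to 500000 ns\n", vcpu);
        /* Only the first request is logged; later ones stay silent. */
        return 0;
    }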