path: root/arch/x86/kernel
2024-09-11  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()  (Mario Limonciello; 1 file, +16/-0)

The special case in amd_pstate_highest_perf_set() is the value used for calculating the boost numerator. Merge this into amd_get_boost_ratio_numerator() and then use that to calculate boost ratio. This allows dropping more special casing of the highest perf value.

Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-11  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()  (Mario Limonciello; 1 file, +83/-10)

AMD systems that support preferred cores will use "166" as their numerator for max frequency calculations instead of "255".

Add a function for detecting preferred cores by looking at the highest perf value on all cores. If preferred cores are enabled, return 166; if disabled, return the value in the highest perf register.

As the function will be called multiple times, cache the boost numerator and whether preferred cores are enabled in global variables.

Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
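A minimal sketch of the flow described above; the helper names (amd_detect_prefcore(), amd_read_highest_perf()) and the caching layout are assumptions for illustration, not necessarily the exact symbols from the patch:

    /* Assumed helpers, for illustration only */
    int amd_detect_prefcore(bool *detected);
    u64 amd_read_highest_perf(unsigned int cpu);

    /* Cached across calls: the numerator only needs to be computed once */
    static u64 boost_numerator;

    int amd_get_boost_ratio_numerator(unsigned int cpu, u64 *numerator)
    {
            bool prefcore;
            int ret;

            if (boost_numerator) {
                    *numerator = boost_numerator;
                    return 0;
            }

            /* Detect preferred cores by comparing highest perf on all cores */
            ret = amd_detect_prefcore(&prefcore);
            if (ret)
                    return ret;

            if (prefcore)
                    boost_numerator = 166;  /* fixed numerator with preferred cores */
            else
                    boost_numerator = amd_read_highest_perf(cpu); /* register value */

            *numerator = boost_numerator;
            return 0;
    }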
2024-09-11  x86/amd: Move amd_get_highest_perf() out of amd-pstate  (Mario Limonciello; 1 file, +30/-0)

amd_pstate_get_highest_perf() is a helper used to get the highest perf value on AMD systems. It's used in amd-pstate as part of preferred core handling, but applicable for acpi-cpufreq as well.

Move it out to cppc handling code as amd_get_highest_perf().

Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-11  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn  (Mario Limonciello; 1 file, +3/-3)

If the boost ratio isn't calculated properly for the system for any reason this can cause other problems that are non-obvious. Raise all messages to warn instead.

Suggested-by: Perry Yuan <[email protected]>
Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-11  ACPI: CPPC: Drop check for non zero perf ratio  (Mario Limonciello; 1 file, +1/-6)

perf_ratio is a u64 and SCHED_CAPACITY_SCALE is a large number. Shifting by one will never have a zero value. Drop the check.

Suggested-by: Gautham R. Shenoy <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-11  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()  (Mario Limonciello; 1 file, +32/-12)

The function name is ambiguous because it returns an intermediate value for calculating maximum frequency rather than the CPPC 'Highest Perf' register. Rename the function to clarify its use and allow the function to return errors. Adjust the consumer in acpi-cpufreq to catch errors.

Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-11  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c  (Mario Limonciello; 2 files, +16/-16)

To prepare to let amd_get_highest_perf() detect preferred cores it will require CPPC functions. Move amd_get_highest_perf() to cppc.c to prepare for 'preferred core detection' rework.

No functional changes intended.

Reviewed-by: Perry Yuan <[email protected]>
Reviewed-by: Gautham R. Shenoy <[email protected]>
Signed-off-by: Mario Limonciello <[email protected]>
2024-09-10  Merge branch 'linus' into timers/core  (Thomas Gleixner; 10 files, +30/-17)

To update with the latest fixes.
2024-09-09  mm: make arch_get_unmapped_area() take vm_flags by default  (Mark Brown; 1 file, +3/-18)

Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2.

As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside its own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature.

On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list:

  https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
  https://lore.kernel.org/lkml/[email protected]/

Given the addition of the use of vm_flags in the generic implementation we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures, but since the parameter is ignored the change is straightforward and the simplification for the generic code seems worth it.

This patch (of 3):

When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there.

Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter. The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have a vm_flags available is mm_get_unmapped_area(); as for the x86 implementation and the wrapper used on other architectures, this is modified to supply no flags.

No functional changes.

Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Acked-by: Helge Deller <[email protected]> [parisc]
Cc: Alexander Gordeev <[email protected]>
Cc: Andreas Larsson <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: "Edgecombe, Rick P" <[email protected]>
Cc: Gerald Schaefer <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: John Paul Adrian Glaubitz <[email protected]>
Cc: Matt Turner <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Naveen N Rao <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Rich Felker <[email protected]>
Cc: Russell King <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: Thomas Bogendoerfer <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Vineet Gupta <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: Yoshinori Sato <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  Merge tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux  (Linus Torvalds; 1 file, +18/-3)

Pull hyperv fixes from Wei Liu:

 - Add a documentation overview of Confidential Computing VM support (Michael Kelley)
 - Use lapic timer in a TDX VM without paravisor (Dexuan Cui)
 - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency (Michael Kelley)
 - Fix a kexec crash due to VP assist page corruption (Anirudh Rayabharam)
 - Python3 compatibility fix for lsvmbus (Anthony Nandaa)
 - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li)

* tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hv: vmbus: Constify struct kobj_type and struct attribute_group
  tools: hv: rm .*.cmd when make clean
  x86/hyperv: fix kexec crash due to VP assist page corruption
  Drivers: hv: vmbus: Fix the misplaced function description
  tools: hv: lsvmbus: change shebang to use python3
  x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
  Documentation: hyperv: Add overview of Confidential Computing VM support
  clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor
  Drivers: hv: Remove deprecated hv_fcopy declarations
2024-09-08  treewide: Fix wrong singular form of jiffies in comments  (Anna-Maria Behnsen; 1 file, +1/-1)

There are several comments all over the place which use a wrong singular form of jiffies. Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-05  x86/sgx: Log information when a node lacks an EPC section  (Aaron Lu; 1 file, +7/-0)

For optimized performance, firmware typically distributes EPC sections evenly across different NUMA nodes. However, there are scenarios where a node may have both CPUs and memory but no EPC section configured. For example, in an 8-socket system with a Sub-Numa-Cluster=2 setup, there are a total of 16 nodes. Given that the maximum number of supported EPC sections is 8, it is simply not feasible to assign one EPC section to each node. This configuration is not incorrect - SGX will still operate correctly; it is just not optimized from a NUMA standpoint.

For this reason, log a message when a node with both CPUs and memory lacks an EPC section. This will provide users with a hint as to why they might be experiencing less-than-ideal performance when running SGX enclaves.

Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Acked-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/all/20240905080855.1699814-3-aaron.lu%40intel.com
2024-09-05  x86/sgx: Fix deadlock in SGX NUMA node search  (Aaron Lu; 1 file, +14/-13)

When the current node doesn't have an EPC section configured by firmware and all other EPC sections are used up, the CPU can get stuck inside the while loop that looks for an available EPC page from remote nodes indefinitely, leading to a soft lockup. Note how nid_of_current will never be equal to nid in that while loop because nid_of_current is not set in sgx_numa_mask.

Also worth mentioning is that it's perfectly fine for the firmware not to set up an EPC section on a node. While setting up an EPC section on each node can enhance performance, it is not a requirement for functionality.

Rework the loop to start and end on *a* node that has SGX memory. This avoids the deadlock looking for the current SGX-lacking node to show up in the loop when it never will.

Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
Reported-by: "Molina Sabido, Gerardo" <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Tested-by: Zhimin Luo <[email protected]>
Link: https://lore.kernel.org/all/20240905080855.1699814-2-aaron.lu%40intel.com
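A sketch of the reworked node walk described above; the per-node allocation helper is assumed for illustration:

    /* Assumed helper: allocate an EPC page from one node's sections */
    struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid);

    static struct sgx_epc_page *__sgx_alloc_epc_page(void)
    {
            struct sgx_epc_page *page;
            int nid_start, nid;

            /* Start from a node known to have SGX memory */
            if (node_isset(numa_node_id(), sgx_numa_mask))
                    nid_start = numa_node_id();
            else
                    nid_start = next_node_in(numa_node_id(), sgx_numa_mask);

            nid = nid_start;
            do {
                    page = __sgx_alloc_epc_page_from_node(nid);
                    if (page)
                            return page;
                    nid = next_node_in(nid, sgx_numa_mask);
            } while (nid != nid_start);     /* terminates: nid_start is in the mask */

            return NULL;
    }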
2024-09-05  x86/bugs: Fix handling when SRSO mitigation is disabled  (David Kaplan; 1 file, +5/-9)

When the SRSO mitigation is disabled, either via mitigations=off or spec_rstack_overflow=off, the warning about the lack of IBPB-enhancing microcode is printed anyway. This is unnecessary since the user has turned off the mitigation.

[ bp: Massage, drop SBPB rationale as it doesn't matter because when mitigations are disabled x86_pred_cmd is not being used anyway. ]

Signed-off-by: David Kaplan <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-09-05  x86/bugs: Add missing NO_SSB flag  (Daniel Sneddon; 1 file, +2/-2)

The Moorefield and Lightning Mountain Atom processors are missing the NO_SSB flag in the vulnerabilities whitelist. This will cause unaffected parts to incorrectly be reported as vulnerable. Add the missing flag.

These parts are currently out of service and were verified internally with archived documentation that they need the NO_SSB flag.

Closes: https://lore.kernel.org/lkml/CAEJ9NQdhh+4GxrtG1DuYgqYhvc0hi-sKZh-2niukJ-MyFLntAA@mail.gmail.com/
Reported-by: Shanavas.K.S <[email protected]>
Signed-off-by: Daniel Sneddon <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-09-05  x86/hyperv: fix kexec crash due to VP assist page corruption  (Anirudh Rayabharam (Microsoft); 1 file, +2/-2)

commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") introduces a new cpuhp state for hyperv initialization.

cpuhp_setup_state() returns the state number if state is CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states. For the hyperv case, since a new cpuhp state was introduced it would return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and so hv_cpu_die() won't be called on all CPUs. This means the VP assist page won't be reset. When the kexec kernel tries to setup the VP assist page again, the hypervisor corrupts the memory region of the old VP assist page causing a panic in case the kexec kernel is using that memory elsewhere. This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec panic/hang issues").

Get rid of hyperv_init_cpuhp entirely since we are no longer using a dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with cpuhp_remove_state().

Cc: [email protected]
Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline")
Signed-off-by: Anirudh Rayabharam (Microsoft) <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
Message-ID: <[email protected]>
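A condensed sketch of the resulting shutdown path: with a fixed cpuhp state there is no returned state number to store, so the state constant is used directly:

    static void hv_machine_shutdown(void)
    {
            if (kexec_in_progress)
                    /* Runs hv_cpu_die() on each CPU, resetting the VP assist pages */
                    cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);

            native_machine_shutdown();
    }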
2024-09-04  x86/sched: Add basic support for CPU capacity scaling  (Rafael J. Wysocki; 1 file, +87/-2)

In order to be able to compute the sizes of tasks consistently across all CPUs in a hybrid system, it is necessary to provide CPU capacity scaling information to the scheduler via arch_scale_cpu_capacity(). Moreover, the value returned by arch_scale_freq_capacity() for the given CPU must correspond to the arch_scale_cpu_capacity() return value for it, or utilization computations will be inaccurate.

Add support for this through per-CPU variables holding the capacity and maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will be returned by arch_scale_cpu_capacity() and used by scale_freq_tick() to compute arch_freq_scale for the current CPU, respectively.

In order to avoid adding measurable overhead for non-hybrid x86 systems, which are the vast majority in the field, whether or not the new hybrid CPU capacity scaling will be in effect is controlled by a static key. This static key is set by calling arch_enable_hybrid_capacity_scale() which also allocates memory for the per-CPU data and initializes it. Next, arch_set_cpu_capacity() is used to set the per-CPU variables mentioned above for each CPU and arch_rebuild_sched_domains() needs to be called for the scheduler to realize that capacity-aware scheduling can be used going forward.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ricardo Neri <[email protected]>
Tested-by: Ricardo Neri <[email protected]> # scale invariance
Link: https://patch.msgid.link/[email protected]
[ rjw: Added parens to function kerneldoc comments ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
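A condensed sketch of the per-CPU data and the fast path described above; names follow the changelog, but details are illustrative rather than the literal patch:

    static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);

    struct arch_hybrid_cpu_scale {
            unsigned long capacity;       /* arch_scale_cpu_capacity() value */
            unsigned long freq_ratio;     /* max-to-base ratio * SCHED_CAPACITY_SCALE */
    };

    static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;

    unsigned long arch_scale_cpu_capacity(int cpu)
    {
            /* Static key keeps this a constant branch on non-hybrid systems */
            if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
                    return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);

            return SCHED_CAPACITY_SCALE;
    }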
2024-09-03  x86/cpu/intel: Replace PAT erratum model/family magic numbers with symbolic IFM references  (Dave Hansen; 1 file, +10/-8)

There's an erratum that prevents the PAT from working correctly:

  https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf
  # Document 316515 Version 010

The kernel currently disables PAT support on those CPUs, but it does it with some magic numbers. Replace the magic numbers with the new "IFM" macros. Make the check refer to the last affected CPU (INTEL_CORE_YONAH) rather than the first fixed one. This makes it easier to find the documentation of the erratum since Intel documents where it is broken and not where it is fixed.

I don't think the Pentium Pro (or Pentium II) is actually affected. But the old check included them, so it can't hurt to keep doing the same. I'm also not completely sure about the "Pentium M" CPUs (models 0x9 and 0xd). But, again, they were included in the old checks and were close Pentium III derivatives, so are likely affected.

While we're at it, revise the comment referring to the erratum name and make sure it is a quote of the language from the actual errata doc. That should make it easier to find in the future when the URL inevitably changes.

Why bother with this in the first place? It actually gets rid of one of the very few remaining direct references to c->x86{,_model}.

No change in functionality intended.

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Len Brown <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2024-08-29  x86/EISA: Dereference memory directly instead of using readl()  (Maciej W. Rozycki; 1 file, +2/-2)

Sparse expects an __iomem pointer, but after converting the EISA probe to memremap() the pointer is a regular memory pointer. Access it directly instead.

[ tglx: Converted it to fix the already applied version ]

Fixes: 80a4da05642c ("x86/EISA: Use memremap() to probe for the EISA BIOS signature")
Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-28  x86/resctrl: Fix arch_mbm_* array overrun on SNC  (Peter Newman; 1 file, +8/-0)

When using resctrl on systems with Sub-NUMA Clustering enabled, monitoring groups may be allocated RMID values which would overrun the arch_mbm_{local,total} arrays. This is due to inconsistencies in whether the SNC-adjusted num_rmid value or the unadjusted value in resctrl_arch_system_num_rmid_idx() is used.

The num_rmid value for the L3 resource is currently:

  resctrl_arch_system_num_rmid_idx() / snc_nodes_per_l3_cache

As a simple fix, make resctrl_arch_system_num_rmid_idx() return the SNC-adjusted, L3 num_rmid value on x86.

Fixes: e13db55b5a0d ("x86/resctrl: Introduce snc_nodes_per_l3_cache")
Signed-off-by: Peter Newman <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
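A sketch of what the fix amounts to; the max-RMID field name is an assumption for illustration:

    /*
     * Return the SNC-adjusted number of RMIDs so that all users,
     * including arch_mbm_{local,total} array sizing, agree on the count.
     */
    u32 resctrl_arch_system_num_rmid_idx(void)
    {
            return (boot_cpu_data.x86_cache_max_rmid + 1) /
                   snc_nodes_per_l3_cache;
    }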
2024-08-25  x86/entry: Set FRED RSP0 on return to userspace instead of context switch  (Xin Li (Intel); 1 file, +3/-0)

The FRED RSP0 MSR points to the top of the kernel stack for user level event delivery. As this is the task stack it needs to be updated when a task is scheduled in.

The update is done at context switch. That means it's also done when switching to kernel threads, which is pointless as those never go out to user space. For KVM threads this means there are two writes to FRED_RSP0 as KVM has to switch to the guest value before VMENTER.

Defer the update to the exit to user space path and cache the per CPU FRED_RSP0 value, so redundant writes can be avoided. Provide fred_sync_rsp0() for KVM to keep the cache in sync with the actual MSR value after returning from guest to host mode.

[ tglx: Massage change log ]

Suggested-by: Sean Christopherson <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
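A minimal sketch of the deferred update, assuming an illustrative per-CPU cache variable name; only the structure, not the literal patch:

    static DEFINE_PER_CPU(unsigned long, fred_rsp0_cache);

    /* Called on the exit-to-user-space path instead of at context switch */
    static __always_inline void fred_update_rsp0(void)
    {
            unsigned long rsp0 = (unsigned long)task_stack_page(current) + THREAD_SIZE;

            /* Skip the MSR write when the cached value is already correct */
            if (this_cpu_read(fred_rsp0_cache) != rsp0) {
                    wrmsrns(MSR_IA32_FRED_RSP0, rsp0);
                    this_cpu_write(fred_rsp0_cache, rsp0);
            }
    }

    /* fred_sync_rsp0(guest_rsp0) would refresh the cache after a VM exit. */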
2024-08-25  x86/msr: Switch between WRMSRNS and WRMSR with the alternatives mechanism  (Andrew Cooper; 1 file, +0/-1)

Per the discussion about FRED MSR writes with the WRMSRNS instruction [1], use the alternatives mechanism to choose WRMSRNS when it's available, otherwise fall back to WRMSR. Remove the dependency on X86_FEATURE_WRMSRNS as WRMSRNS is no longer dependent on FRED.

[1] https://lore.kernel.org/lkml/[email protected]/

Use a DS prefix to pad WRMSR instead of a NOP. The prefix is ignored. At least that's the current information from the hardware folks.

Signed-off-by: Andrew Cooper <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
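A sketch of the alternatives-based selection. WRMSRNS encodes as 0f 01 c6, so a DS-prefixed WRMSR pads the 2-byte WRMSR to the same 3 bytes and no NOP is needed; treat this as illustrative, not the exact patch text:

    static __always_inline void wrmsrns(u32 msr, u64 val)
    {
            /* Plain (DS-prefixed) WRMSR unless WRMSRNS is enumerated */
            asm volatile(ALTERNATIVE("ds wrmsr",
                                     ".byte 0x0f,0x01,0xc6", /* WRMSRNS */
                                     X86_FEATURE_WRMSRNS)
                         : : "c" (msr), "a" ((u32)val), "d" ((u32)(val >> 32))
                         : "memory");
    }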
2024-08-25  x86/fred: Set SS to __KERNEL_DS when enabling FRED  (Xin Li (Intel); 1 file, +14/-0)

SS is initialized to NULL during boot time and not explicitly set to __KERNEL_DS.

With FRED enabled, if a kernel event is delivered before a CPU goes to user level for the first time, its SS is NULL thus NULL is pushed into the SS field of the FRED stack frame. But before ERETS is executed, the CPU may context switch to another task and go to user level. Then when the CPU comes back to kernel mode, SS is changed to __KERNEL_DS. Later when ERETS is executed to return from the kernel event handler, a #GP fault is generated because SS doesn't match the SS saved in the FRED stack frame.

Initialize SS to __KERNEL_DS when enabling FRED to prevent that.

Note, IRET doesn't check if SS matches the SS saved in its stack frame, thus IDT doesn't have this problem. For IDT it doesn't matter whether SS is set to __KERNEL_DS or not, because it's set to NULL upon interrupt or exception delivery and __KERNEL_DS upon SYSCALL. Thus it's pointless to initialize SS for IDT.

Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-25  x86/amd_nb: Add new PCI IDs for AMD family 1Ah model 60h-70h  (Richard Gong; 1 file, +4/-0)

Add new PCI IDs for Device 18h and Function 4 to enable the amd_atl driver on those systems.

Signed-off-by: Richard Gong <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Yazen Ghannam <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-25  x86/boot/64: Strip percpu address space when setting up GDT descriptors  (Uros Bizjak; 1 file, +2/-1)

init_per_cpu_var() returns a pointer in the percpu address space while rip_rel_ptr() expects a pointer in the generic address space. When strict address space checks are enabled, GCC's named address space checks fail:

  asm.h:124:63: error: passing argument 1 of 'rip_rel_ptr' from pointer to non-enclosed address space

Add an explicit cast to remove the address space of the returned pointer.

Fixes: 11e36b0f7c21 ("x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]")
Signed-off-by: Uros Bizjak <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-25  x86/cpu: Clarify the error message when BIOS does not support SGX  (WangYuli; 1 file, +1/-1)

When SGX is not supported by the BIOS, the kernel log contains the error 'SGX disabled by BIOS', which can be confusing since there might not be an SGX-related option in the BIOS settings.

For the kernel it's difficult to distinguish between the BIOS not supporting SGX and the BIOS supporting SGX but having it disabled. Therefore, update the error message to 'SGX disabled or unsupported by BIOS' to make it easier for those reading kernel logs to understand what's happening.

Reported-by: Bo Wu <[email protected]>
Co-developed-by: Zelong Xiang <[email protected]>
Signed-off-by: Zelong Xiang <[email protected]>
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Closes: https://github.com/linuxdeepin/developer-center/issues/10032
2024-08-25  x86/kexec: Add comments around swap_pages() assembly to improve readability  (Kai Huang; 1 file, +6/-2)

The current assembly around swap_pages() in relocate_kernel() takes some time to follow because it is easy to lose track of which register holds what across long stretches of assembly. Add a couple of comments to clarify the code around swap_pages() to improve readability.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/8b52b0b8513a34b2a02fb4abb05c6700c2821475.1724573384.git.kai.huang@intel.com
2024-08-25  x86/kexec: Fix a comment of swap_pages() assembly  (Kai Huang; 1 file, +1/-1)

When relocate_kernel() gets called, %rdi holds 'indirection_page' and %rsi holds 'page_list'. And %rdi always holds 'indirection_page' when swap_pages() is called. Therefore the comment on the first line of code in swap_pages()

  movq %rdi, %rcx /* Put the page_list in %rcx */

isn't correct because it actually moves the 'indirection_page' to %rcx. Fix it.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/adafdfb1421c88efce04420fc9a996c0e2ca1b34.1724573384.git.kai.huang@intel.com
2024-08-25  x86/sgx: Fix a W=1 build warning in function comment  (Kai Huang; 1 file, +1/-1)

Building the SGX code with W=1 generates below warning:

  arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'low' not described in 'sgx_calc_section_metric'
  arch/x86/kernel/cpu/sgx/main.c:741: warning: Function parameter or struct member 'high' not described in 'sgx_calc_section_metric'
  ...

The function sgx_calc_section_metric() is a simple helper which is only used in sgx/main.c. There's no need to use kernel-doc style comment for it. Downgrade to a normal comment.

Signed-off-by: Kai Huang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-25  x86/EISA: Use memremap() to probe for the EISA BIOS signature  (Maciej W. Rozycki; 1 file, +3/-3)

The area at the 0x0FFFD9 physical location in the PC memory space is regular memory, traditionally ROM BIOS and more recently a copy of BIOS code and data in RAM, write-protected.

Therefore use memremap() to get access to it rather than ioremap(), avoiding issues in virtualization scenarios and complementing changes such as commit f7750a795687 ("x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use memremap() for RAM mappings") or commit 5997efb96756 ("x86/boot: Use memremap() to map the MPF and MPC data").

Reported-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Maciej W. Rozycki <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Closes: https://lore.kernel.org/r/[email protected]
2024-08-22  x86/cpu: KVM: Add common defines for architectural memory types (PAT, MTRRs, etc.)  (Sean Christopherson; 1 file, +6/-0)

Add defines for the architectural memory types that can be shoved into various MSRs and registers, e.g. MTRRs, PAT, VMX capabilities MSRs, EPTPs, etc. While most MSRs/registers support only a subset of all memory types, the values themselves are architectural and identical across all users.

Leave the goofy MTRR_TYPE_* definitions as-is since they are in a uapi header, but add compile-time assertions to connect the dots (and sanity check that the msr-index.h values didn't get fat-fingered).

Keep the VMX_EPTP_MT_* defines so that it's slightly more obvious that the EPTP holds a single memory type in 3 of its 64 bits; those bits just happen to be 2:0, i.e. don't need to be shifted.

Opportunistically use X86_MEMTYPE_WB instead of an open coded '6' in setup_vmcs_config().

No functional change intended.

Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Kai Huang <[email protected]>
Reviewed-by: Xiaoyao Li <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>
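A sketch of the common defines; the encodings are the architectural PAT/MTRR memory type values from the SDM, though the exact macro spelling in the patch may differ:

    #define X86_MEMTYPE_UC          0ull    /* Uncacheable */
    #define X86_MEMTYPE_WC          1ull    /* Write Combining */
    #define X86_MEMTYPE_WT          4ull    /* Write Through */
    #define X86_MEMTYPE_WP          5ull    /* Write Protected */
    #define X86_MEMTYPE_WB          6ull    /* Write Back */
    #define X86_MEMTYPE_UC_MINUS    7ull    /* UC, but overridable by MTRRs */

    /* Compile-time assertion tying the uapi MTRR types to the new defines */
    static_assert(MTRR_TYPE_WRBACK == X86_MEMTYPE_WB);

setup_vmcs_config() can then use X86_MEMTYPE_WB instead of the open coded '6'.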
2024-08-19  runtime constants: move list of constants to vmlinux.lds.h  (Jann Horn; 1 file, +1/-2)

Refactor the list of constant variables into a macro. This should make it easier to add more constants in the future.

Signed-off-by: Jann Horn <[email protected]>
Reviewed-by: Alexander Gordeev <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
2024-08-14  x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported  (Mitchell Levy; 2 files, +5/-2)

There are two distinct CPU features related to the use of XSAVES and LBR: whether LBR is itself supported and whether XSAVES supports LBR. The LBR subsystem correctly checks both in intel_pmu_arch_lbr_init(), but the XSTATE subsystem does not.

The LBR bit is only removed from xfeatures_mask_independent when LBR is not supported by the CPU, but there is no validation of XSTATE support. If XSAVES does not support LBR the write to IA32_XSS causes a #GP fault, leaving the state of IA32_XSS unchanged, i.e. zero. The fault is handled with a warning and the boot continues.

Consequently the next XRSTORS which tries to restore supervisor state fails with #GP because the RFBM has zero for all supervisor features, which does not match the XCOMP_BV field. As XFEATURE_MASK_FPSTATE includes supervisor features, setting up the FPU causes a #GP, which ends up in fpu_reset_from_exception_fixup(). That fails due to the same problem, resulting in recursive #GPs until the kernel runs out of stack space and double faults.

Prevent this by storing the supported independent features in fpu_kernel_cfg during XSTATE initialization and use that cached value for retrieving the independent feature bits to be written into IA32_XSS.

[ tglx: Massaged change log ]

Fixes: f0dccc9da4c0 ("x86/fpu/xstate: Support dynamic supervisor feature for LBR")
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Mitchell Levy <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
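A sketch of the fix as described: cache the XSAVES-supported independent features once at init and consult the cached mask when composing the IA32_XSS value. Field and helper names follow the changelog; treat the details as illustrative:

    /* During XSTATE initialization: keep only features XSAVES really supports */
    static void __init xstate_cache_independent_features(void)
    {
            fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features &
                                                  XFEATURE_MASK_INDEPENDENT;
    }

    /* Used when writing IA32_XSS, instead of the raw constant mask */
    static unsigned int xfeatures_mask_independent(void)
    {
            if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
                    return fpu_kernel_cfg.independent_features &
                           ~XFEATURE_MASK_LBR;

            return fpu_kernel_cfg.independent_features;
    }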
2024-08-13  x86/fred: Enable FRED right after init_mem_mapping()  (Xin Li (Intel); 4 files, +21/-5)

On 64-bit, init_mem_mapping() relies on the minimal page fault handler provided by the early IDT mechanism. The real page fault handler is installed right afterwards into the IDT.

This is problematic on CPUs which have X86_FEATURE_FRED set because the real page fault handler retrieves the faulting address from the FRED exception stack frame and not from CR2, but that does obviously not work when FRED is not yet enabled in the CPU.

To prevent this, enable FRED right after init_mem_mapping() without interrupt stacks. Those are enabled later in trap_init() after the CPU entry area is set up.

[ tglx: Encapsulate the FRED details ]

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/fred: Move FRED RSP initialization into a separate function  (Xin Li (Intel); 2 files, +23/-11)

To enable FRED earlier, move the RSP initialization out of cpu_init_fred_exceptions() into cpu_init_fred_rsps(). This is required as the FRED RSP initialization depends on the availability of the CPU entry areas which are set up late in trap_init().

No functional change intended. Marked with Fixes as it's a dependency for the real fix.

Fixes: 14619d912b65 ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/fred: Parse cmdline param "fred=" in cpu_parse_early_param()  (Xin Li (Intel); 2 files, +5/-26)

Depending on whether FRED is enabled, sysvec_install() installs a system interrupt handler either into FRED's system vector dispatch table or into the IDT.

However, FRED can be disabled later in trap_init(), after sysvec_install() has been invoked already; e.g., the HYPERVISOR_CALLBACK_VECTOR handler is registered with sysvec_install() in kvm_guest_init(), which is called in setup_arch() but way before trap_init(). IOW, there is a gap between the point where FRED becomes available and the point where it may still be disabled. As a result, when FRED is available but ends up disabled, early sysvec_install() invocations fail to install the IDT handler, resulting in spurious interrupts.

Fix it by parsing the cmdline param "fred=" in cpu_parse_early_param() to ensure that FRED is disabled before the first sysvec_install() invocations.

Fixes: 3810da12710a ("x86/fred: Add a fred= cmdline param")
Reported-by: Hou Wenlong <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Xin Li (Intel) <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-13  x86/apic: Make x2apic_disable() work correctly  (Yuntao Wang; 1 file, +6/-5)

x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel from disabling it, thereby creating inconsistent state. Due to the early state check for X2APIC_ON, the code path which warns about a locked X2APIC cannot be reached.

Test for state < X2APIC_ON instead and move the clearing of the state and mode variables to the place which actually disables X2APIC.

[ tglx: Massaged change log. Added Fixes tag. Moved clearing so it's at the right place for back ports ]

Fixes: a57e456a7b28 ("x86/apic: Fix fallout from x2apic cleanup")
Signed-off-by: Yuntao Wang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
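A condensed sketch of the corrected flow; the hardware disable sequence is elided and the details are illustrative:

    static __init void x2apic_disable(void)
    {
            /* Previously: returned unless state == X2APIC_ON, so the
             * LOCKED case below was unreachable. */
            if (x2apic_state < X2APIC_ON)
                    return;

            if (x2apic_state == X2APIC_ON_LOCKED) {
                    pr_warn("Cannot disable locked x2APIC\n");
                    return;
            }

            /* Clear state/mode only where X2APIC is actually disabled */
            x2apic_state = X2APIC_DISABLED;
            x2apic_mode = 0;
            /* ... hardware disable sequence omitted in this sketch ... */
    }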
2024-08-12  introduce fd_file(), convert all accessors to it.  (Al Viro; 1 file, +2/-2)

For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that.

This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()).

NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...).

[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
2024-08-09  x86/apic: Remove logical destination mode for 64-bit  (Thomas Gleixner; 1 file, +7/-112)

Logical destination mode of the local APIC is used for systems with up to 8 CPUs. It has an advantage over physical destination mode as it allows targeting multiple CPUs at once with IPIs. That advantage was definitely worth it when systems with up to 8 CPUs were state of the art for servers and workstations, but that's history.

Aside from that there are systems which fail to work with logical destination mode as the ACPI/DMI quirks show, and there are AMD Zen1 systems out there which fail when interrupt remapping is enabled, as reported by Rob and Christian. The latter problem can be cured by firmware updates, but not all OEMs distribute the required changes.

Physical destination mode is guaranteed to work because it is the only way to get a CPU up and running via the INIT/INIT/STARTUP sequence.

As the number of CPUs keeps increasing, logical destination mode becomes a less used code path, so there is no real good reason to keep it around. Therefore remove logical destination mode support for 64-bit and default to physical destination mode.

Reported-by: Rob Newcater <[email protected]>
Reported-by: Christian Heusel <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Rob Newcater <[email protected]>
Link: https://lore.kernel.org/all/877cd5u671.ffs@tglx
2024-08-08  x86: Ignore stack unwinding in KCOV  (Dmitry Vyukov; 1 file, +8/-0)

Stack unwinding produces large amounts of uninteresting coverage. It's called from KASAN kmalloc/kfree hooks, fault injection, etc. It's not particularly useful and is not a function of system call args. Ignore that code.

Signed-off-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Andrey Konovalov <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/all/eaf54b8634970b73552dcd38bf9be6ef55238c10.1718092070.git.dvyukov@google.com
2024-08-08  x86/mtrr: Check if fixed MTRRs exist before saving them  (Andi Kleen; 1 file, +1/-1)

MTRRs have an obsolete fixed variant for fine grained caching control of the 640K-1MB region that uses separate MSRs. This fixed variant has a separate capability bit in the MTRR capability MSR.

So far all x86 CPUs which support MTRR have this separate bit set, so it went unnoticed that mtrr_save_state() does not check the capability bit before accessing the fixed MTRR MSRs. On a CPU that does not support the fixed MTRR capability this results in a #GP. The #GP itself is harmless because the RDMSR fault is handled gracefully, but it results in a WARN_ON().

Add the missing capability check to prevent this.

Fixes: 2b1f6278d77c ("[PATCH] x86: Save the MTRRs of the BSP before booting an AP")
Signed-off-by: Andi Kleen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
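A sketch of the added guard, assuming the capability bit from the MTRR capability MSR is mirrored in mtrr_state.have_fixed; names are illustrative:

    void mtrr_save_fixed_ranges(void *info)
    {
            /* Only touch the fixed-range MSRs if the CPU enumerates them */
            if (mtrr_state.have_fixed)
                    get_fixed_ranges(mtrr_state.fixed_ranges);
    }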
2024-08-07  x86/paravirt: Fix incorrect virt spinlock setting on bare metal  (Chen Yu; 1 file, +3/-4)

The kernel can change spinlock behavior when running as a guest. But this guest-friendly behavior causes performance problems on bare metal. The kernel uses a static key to switch between the two modes.

In theory, the static key is enabled by default (run in guest mode) and should be disabled for bare metal (and in some guests that want native behavior or paravirt spinlock).

A performance drop is reported when running encode/decode workload and BenchSEE cache sub-workload. Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is disabled the virt_spin_lock_key is incorrectly set to true on bare metal. The qspinlock degenerates to a test-and-set spinlock, which decreases the performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM, enable this key. Later during VM initialization, if another more efficient spinlock is preferred (e.g. paravirt spinlock), or the user wants the native qspinlock (via the nopvspin boot commandline), the virt_spin_lock_key is disabled accordingly.

This results in the following decision matrix:

  X86_FEATURE_HYPERVISOR     Y    Y    Y    N
  CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
  PV spinlock                Y    N    N    Y/N
  virt_spin_lock_key         N    Y/N  Y    N

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Reported-by: Prem Nath Dey <[email protected]>
Reported-by: Xiaoping Zhou <[email protected]>
Suggested-by: Dave Hansen <[email protected]>
Suggested-by: Qiuxu Zhuo <[email protected]>
Suggested-by: Nikolay Borisov <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Nikolay Borisov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/[email protected]
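A minimal sketch of the corrected default matching the decision matrix above: the key starts off and is only enabled when running as a guest; later paravirt-spinlock setup or 'nopvspin' can disable it again:

    DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);

    void __init native_pv_lock_init(void)
    {
            /* Bare metal keeps the key off, preserving native qspinlock */
            if (cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
                    static_branch_enable(&virt_spin_lock_key);
    }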
2024-08-07  x86/acpi: Remove __ro_after_init from acpi_mp_wake_mailbox  (Zhiquan Li; 1 file, +1/-1)

On a platform using the "Multiprocessor Wakeup Structure"[1] to start up secondary CPUs, the control processor needs to memremap() the physical address of the MP Wakeup Structure mailbox to the variable acpi_mp_wake_mailbox, which holds the virtual address of the mailbox. To wake up the AP, the control processor writes the APIC ID of the AP, the wakeup vector and the ACPI_MP_WAKE_COMMAND_WAKEUP command into the mailbox.

The current implementation marks acpi_mp_wake_mailbox read-only after init and therefore doesn't consider the case where boot time CPU bringup is restricted to 1 with the kernel parameter "maxcpus=1" and the other CPUs are brought online later from user space. In that case, when the first AP is brought online after init, the attempt to update the variable results in a kernel panic.

The memremap() call that initializes the variable cannot be moved into acpi_parse_mp_wake() because memremap() is not functional at that point in the boot process. Also, as the APs might never be brought up, keep the memremap() call in acpi_wakeup_cpu() so that the operation only takes place when needed.

Fixes: 24dd05da8c79 ("x86/apic: Mark acpi_mp_wake_* variables as __ro_after_init")
Signed-off-by: Zhiquan Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup remaining coding style issues  (Thomas Gleixner; 1 file, +12/-11)

Add missing new lines and reorder variable definitions.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup line breaks  (Thomas Gleixner; 1 file, +18/-37)

80 character limit is history.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup bracket usage  (Thomas Gleixner; 1 file, +18/-16)

Add brackets around if/for constructs as required by coding style or remove pointless line breaks to make them true single-line statements which do not require brackets.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup comments  (Thomas Gleixner; 1 file, +37/-49)

Use proper comment styles and shrink comments to their scope where applicable.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Move replace_pin_at_irq_node() to the call site  (Thomas Gleixner; 1 file, +18/-22)

It's only used by check_timer().

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/mpparse: Cleanup apic_printk()s  (Thomas Gleixner; 1 file, +6/-7)

Use the new apic_pr_verbose() helper.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
2024-08-07  x86/ioapic: Cleanup guarded debug printk()s  (Thomas Gleixner; 1 file, +29/-38)

Clean up the APIC printk()s which are inside an APIC verbosity guarded region by using apic_dbg() for the KERN_DEBUG level prints.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Qiuxu Zhuo <[email protected]>
Tested-by: Breno Leitao <[email protected]>
Link: https://lore.kernel.org/all/[email protected]