aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2024-09-11cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into ↵Mario Limonciello1-0/+16
amd_get_boost_ratio_numerator() The special case in amd_pstate_highest_perf_set() is the value used for calculating the boost numerator. Merge this into amd_get_boost_ratio_numerator() and then use that to calculate boost ratio. This allows dropping more special casing of the highest perf value. Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()Mario Limonciello1-10/+83
AMD systems that support preferred cores will use "166" as their numerator for max frequency calculations instead of "255". Add a function for detecting preferred cores by looking at the highest perf value on all cores. If preferred cores are enabled return 166 and if disabled the value in the highest perf register. As the function will be called multiple times, cache the values for the boost numerator and if preferred cores will be enabled in global variables. Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11x86/amd: Move amd_get_highest_perf() out of amd-pstateMario Limonciello1-0/+30
amd_pstate_get_highest_perf() is a helper used to get the highest perf value on AMD systems. It's used in amd-pstate as part of preferred core handling, but applicable for acpi-cpufreq as well. Move it out to cppc handling code as amd_get_highest_perf(). Reviewed-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warnMario Limonciello1-3/+3
If the boost ratio isn't calculated properly for the system for any reason this can cause other problems that are non-obvious. Raise all messages to warn instead. Suggested-by: Perry Yuan <Perry.Yuan@amd.com> Reviewed-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11ACPI: CPPC: Drop check for non zero perf ratioMario Limonciello1-6/+1
perf_ratio is a u64 and SCHED_CAPACITY_SCALE is a large number. Shifting by one will never have a zero value. Drop the check. Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Gautham R. Shenoy <gautham.sheoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()Mario Limonciello2-15/+32
The function name is ambiguous because it returns an intermediate value for calculating maximum frequency rather than the CPPC 'Highest Perf' register. Rename the function to clarify its use and allow the function to return errors. Adjust the consumer in acpi-cpufreq to catch errors. Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11x86/amd: Move amd_get_highest_perf() from amd.c to cppc.cMario Limonciello2-16/+16
To prepare to let amd_get_highest_perf() detect preferred cores it will require CPPC functions. Move amd_get_highest_perf() to cppc.c to prepare for 'preferred core detection' rework. No functional changes intended. Reviewed-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11platform/x86: intel_scu_ipc: Move intel_scu_ipc.h out of arch/x86/include/asmMika Westerberg3-70/+3
This is a platform/x86 library that is mostly being used by other drivers not directly under arch/x86 anyway (with the exception of the Intel MID setup code) so it makes sense that it lives under the platform_data/x86/ directory instead. No functional changes intended. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240909124952.1152017-3-andriy.shevchenko@linux.intel.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-09-10xen: add capability to remap non-RAM pages to different PFNsJuergen Gross2-0/+66
When running as a Xen PV dom0 it can happen that the kernel is being loaded to a guest physical address conflicting with the host memory map. In order to be able to resolve this conflict, add the capability to remap non-RAM areas to different guest PFNs. A function to use this remapping information for other purposes than doing the remap will be added when needed. As the number of conflicts should be rather low (currently only machines with max. 1 conflict are known), save the remap data in a small statically allocated array. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-10Merge branch 'linus' into timers/coreThomas Gleixner38-126/+256
To update with the latest fixes.
2024-09-10Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar1-5/+42
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-09-10perf/x86/intel: Allow to setup LBR for counting event for BPFKan Liang1-2/+6
The BPF subsystem may capture LBR data on a counting event. However, the current implementation assumes that LBR can/should only be used with sampling events. For instance, retsnoop tool ([0]) makes an extensive use of this functionality and sets up perf event as follows: struct perf_event_attr attr; memset(&attr, 0, sizeof(attr)); attr.size = sizeof(attr); attr.type = PERF_TYPE_HARDWARE; attr.config = PERF_COUNT_HW_CPU_CYCLES; attr.sample_type = PERF_SAMPLE_BRANCH_STACK; attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL; To limit the LBR for a sampling event is to avoid unnecessary branch stack setup for a counting event in the sample read. Because LBR is only read in the sampling event's overflow. Although in most cases LBR is used in sampling, there is no HW limit to bind LBR to the sampling mode. Allow an LBR setup for a counting event unless in the sample read mode. Fixes: 85846b27072d ("perf/x86: Add PERF_X86_EVENT_NEEDS_BRANCH_STACK flag") Closes: https://lore.kernel.org/lkml/20240905180055.1221620-1-andrii@kernel.org/ Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Andrii Nakryiko <andrii@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240909155848.326640-1-kan.liang@linux.intel.com
2024-09-10perf/x86/intel/cstate: Clean up cpumask and hotplugKan Liang1-137/+5
There are three cstate PMUs with different scopes, core, die and module. The scopes are supported by the generic perf_event subsystem now. Set the scope for each PMU and remove all the cpumask and hotplug codes. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240802151643.1691631-4-kan.liang@linux.intel.com
2024-09-10xen: move max_pfn in xen_memory_setup() out of function scopeJuergen Gross1-26/+26
Instead of having max_pfn as a local variable of xen_memory_setup(), make it a static variable in setup.c instead. This avoids having to pass it to subfunctions, which will be needed in more cases in future. Rename it to ini_nr_pages, as the value denotes the currently usable number of memory pages as passed from the hypervisor at boot time. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-10xen: move checks for e820 conflicts further upJuergen Gross1-22/+22
Move the checks for e820 memory map conflicts using the xen_chk_is_e820_usable() helper further up in order to prepare resolving some of the possible conflicts by doing some e820 map modifications, which must happen before evaluating the RAM layout. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-10xen: introduce generic helper checking for memory map conflictsJuergen Gross3-11/+31
When booting as a Xen PV dom0 the memory layout of the dom0 is modified to match that of the host, as this requires less changes in the kernel for supporting Xen. There are some cases, though, which are problematic, as it is the Xen hypervisor selecting the kernel's load address plus some other data, which might conflict with the host's memory map. These conflicts are detected at boot time and result in a boot error. In order to support handling at least some of these conflicts in future, introduce a generic helper function which will later gain the ability to adapt the memory layout when possible. Add the missing check for the xen_start_info area. Note that possible p2m map and initrd memory conflicts are handled already by copying the data to memory areas not conflicting with the memory map. The initial stack allocated by Xen doesn't need to be checked, as early boot code is switching to the statically allocated initial kernel stack. Initial page tables and the kernel itself will be handled later. Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-10xen: use correct end address of kernel for conflict checkingJuergen Gross1-1/+1
When running as a Xen PV dom0 the kernel is loaded by the hypervisor using a different memory map than that of the host. In order to minimize the required changes in the kernel, the kernel adapts its memory map to that of the host. In order to do that it is checking for conflicts of its load address with the host memory map. Unfortunately the tested memory range does not include the .brk area, which might result in crashes or memory corruption when this area does conflict with the memory map of the host. Fix the test by using the _end label instead of __bss_stop. Fixes: 808fdb71936c ("xen: check for kernel memory conflicting with memory layout") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
2024-09-09mm: make arch_get_unmapped_area() take vm_flags by defaultMark Brown2-19/+3
Patch series "mm: Care about shadow stack guard gap when getting an unmapped area", v2. As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow stack guard gap during placement") our current mmap() implementation does not take care to ensure that a new mapping isn't placed with existing mappings inside it's own guard gaps. This is particularly important for shadow stacks since if two shadow stacks end up getting placed adjacent to each other then they can overflow into each other which weakens the protection offered by the feature. On x86 there is a custom arch_get_unmapped_area() which was updated by the above commit to cover this case by specifying a start_gap for allocations with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and use the generic implementation of arch_get_unmapped_area() so let's make the equivalent change there so they also don't get shadow stack pages placed without guard pages. The arm64 and RISC-V shadow stack implementations are currently on the list: https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743 https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/ Given the addition of the use of vm_flags in the generic implementation we also simplify the set of possibilities that have to be dealt with in the core code by making arch_get_unmapped_area() take vm_flags as standard. This is a bit invasive since the prototype change touches quite a few architectures but since the parameter is ignored the change is straightforward, the simplification for the generic code seems worth it. This patch (of 3): When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm: introduce arch_get_unmapped_area_vmflags()") we did so as part of properly supporting guard pages for shadow stacks on x86_64, which uses a custom arch_get_unmapped_area(). Equivalent features are also present on both arm64 and RISC-V, both of which use the generic implementation of arch_get_unmapped_area() and will require equivalent modification there. Rather than continue to deal with having two versions of the functions let's bite the bullet and have all implementations of arch_get_unmapped_area() take vm_flags as a parameter. The new parameter is currently ignored by all implementations other than x86. The only caller that doesn't have a vm_flags available is mm_get_unmapped_area(), as for the x86 implementation and the wrapper used on other architectures this is modified to supply no flags. No functional changes. Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Helge Deller <deller@gmx.de> [parisc] Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N Rao <naveen@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09Merge tag 'hyperv-fixes-signed-20240908' of ↵Linus Torvalds3-8/+19
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Add a documentation overview of Confidential Computing VM support (Michael Kelley) - Use lapic timer in a TDX VM without paravisor (Dexuan Cui) - Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency (Michael Kelley) - Fix a kexec crash due to VP assist page corruption (Anirudh Rayabharam) - Python3 compatibility fix for lsvmbus (Anthony Nandaa) - Misc fixes (Rachel Menge, Roman Kisel, zhang jiao, Hongbo Li) * tag 'hyperv-fixes-signed-20240908' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hv: vmbus: Constify struct kobj_type and struct attribute_group tools: hv: rm .*.cmd when make clean x86/hyperv: fix kexec crash due to VP assist page corruption Drivers: hv: vmbus: Fix the misplaced function description tools: hv: lsvmbus: change shebang to use python3 x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency Documentation: hyperv: Add overview of Confidential Computing VM support clocksource: hyper-v: Use lapic timer in a TDX VM without paravisor Drivers: hv: Remove deprecated hv_fcopy declarations
2024-09-08treewide: Fix wrong singular form of jiffies in commentsAnna-Maria Behnsen1-1/+1
There are several comments all over the place, which uses a wrong singular form of jiffies. Replace 'jiffie' by 'jiffy'. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-06KVM: x86: don't fall through case statements without annotationsLinus Torvalds1-0/+1
clang warns on this because it has an unannotated fall-through between cases: arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] and while we could annotate it as a fallthrough, the proper fix is to just add the break for this case, instead of falling through to the default case and the break there. gcc also has that warning, but it looks like gcc only warns for the cases where they fall through to "real code", rather than to just a break. Odd. Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-13/+34
Pull x86 kvm fixes from Paolo Bonzini: "Many small fixes that accumulated while I was on vacation... - Fixup missed comments from the REMOVED_SPTE => FROZEN_SPTE rename - Ensure a root is successfully loaded when pre-faulting SPTEs - Grab kvm->srcu when handling KVM_SET_VCPU_EVENTS to guard against accessing memslots if toggling SMM happens to force a VM-Exit - Emulate MSR_{FS,GS}_BASE on SVM even though interception is always disabled, so that KVM does the right thing if KVM's emulator encounters {RD,WR}MSR - Explicitly clear BUS_LOCK_DETECT from KVM's caps on AMD, as KVM doesn't yet virtualize BUS_LOCK_DETECT on AMD - Cleanup the help message for CONFIG_KVM_AMD_SEV, and call out that KVM now supports SEV-SNP too - Specialize return value of KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM), based on VM type - Remove unnecessary dependency on CONFIG_HIGH_RES_TIMERS - Note an RCU quiescent state on guest exit. This avoids a call to rcu_core() if there was a grace period request while guest was running" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: Remove HIGH_RES_TIMERS dependency kvm: Note an RCU quiescent state on guest exit KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM KVM: SEV: Update KVM_AMD_SEV Kconfig entry and mention SEV-SNP KVM: SVM: Don't advertise Bus Lock Detect to guest if SVM support is missing KVM: SVM: fix emulation of msr reads/writes of MSR_FS_BASE and MSR_GS_BASE KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS KVM: x86/mmu: Check that root is valid/loaded when pre-faulting SPTEs KVM: x86/mmu: Fixup comments missed by the REMOVED_SPTE=>FROZEN_SPTE rename
2024-09-06crypto: x86/aesni - update docs for aesni-intel moduleEric Biggers2-3/+7
Update the kconfig help and module description to reflect that VAES instructions are now used in some cases. Also fix XTR => XCTR. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-09-05x86/sgx: Log information when a node lacks an EPC sectionAaron Lu1-0/+7
For optimized performance, firmware typically distributes EPC sections evenly across different NUMA nodes. However, there are scenarios where a node may have both CPUs and memory but no EPC section configured. For example, in an 8-socket system with a Sub-Numa-Cluster=2 setup, there are a total of 16 nodes. Given that the maximum number of supported EPC sections is 8, it is simply not feasible to assign one EPC section to each node. This configuration is not incorrect - SGX will still operate correctly; it is just not optimized from a NUMA standpoint. For this reason, log a message when a node with both CPUs and memory lacks an EPC section. This will provide users with a hint as to why they might be experiencing less-than-ideal performance when running SGX enclaves. Suggested-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/20240905080855.1699814-3-aaron.lu%40intel.com
2024-09-05x86/sgx: Fix deadlock in SGX NUMA node searchAaron Lu1-13/+14
When the current node doesn't have an EPC section configured by firmware and all other EPC sections are used up, CPU can get stuck inside the while loop that looks for an available EPC page from remote nodes indefinitely, leading to a soft lockup. Note how nid_of_current will never be equal to nid in that while loop because nid_of_current is not set in sgx_numa_mask. Also worth mentioning is that it's perfectly fine for the firmware not to setup an EPC section on a node. While setting up an EPC section on each node can enhance performance, it is not a requirement for functionality. Rework the loop to start and end on *a* node that has SGX memory. This avoids the deadlock looking for the current SGX-lacking node to show up in the loop when it never will. Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Reported-by: "Molina Sabido, Gerardo" <gerardo.molina.sabido@intel.com> Signed-off-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Zhimin Luo <zhimin.luo@intel.com> Link: https://lore.kernel.org/all/20240905080855.1699814-2-aaron.lu%40intel.com
2024-09-05KVM: Remove HIGH_RES_TIMERS dependencySteven Rostedt1-1/+0
Commit 92b5265d38f6a ("KVM: Depend on HIGH_RES_TIMERS") added a dependency to high resolution timers with the comment: KVM lapic timer and tsc deadline timer based on hrtimer, setting a leftmost node to rb tree and then do hrtimer reprogram. If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram do nothing and then make kvm lapic timer and tsc deadline timer fail. That was back in 2012, where hrtimer_start_range_ns() would do the reprogramming with hrtimer_enqueue_reprogram(). But as that was a nop with high resolution timers disabled, this did not work. But a lot has changed in the last 12 years. For example, commit 49a2a07514a3a ("hrtimer: Kick lowres dynticks targets on timer enqueue") modifies __hrtimer_start_range_ns() to work with low res timers. There's been lots of other changes that make low res work. ChromeOS has tested this before as well, and it hasn't seen any issues with running KVM with high res timers disabled. There could be problems, especially at low HZ, for guests that do not support kvmclock and rely on precise delivery of periodic timers to keep their clock running. This can be the APIC timer (provided by the kernel), the RTC (provided by userspace), or the i8254 (choice of kernel/userspace). These guests are few and far between these days, and in the case of the APIC timer + Intel hosts we can use the preemption timer (which is TSC-based and has better latency _and_ accuracy). In KVM, only x86 is requiring CONFIG_HIGH_RES_TIMERS; perhaps a "depends on HIGH_RES_TIMERS || EXPERT" could be added to virt/kvm, or a pr_warn could be added to kvm_init if HIGH_RES_TIMERS are not enabled. But in general, it seems that there must be other code in the kernel (maybe sound/?) that is relying on having high-enough HZ or hrtimers but that's not documented anywhere. Whenever you disable it you probably need to know what you're doing and what your workload is; so the dependency is not particularly interesting, and we can just remove it. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Message-ID: <20240821095127.45d17b19@gandalf.local.home> [Added the last two paragraphs to the commit message. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-05Merge tag 'hwmon-for-v6.11-rc7' into review-hansHans de Goede23-91/+139
Merge "hwmon fixes for v6.11-rc7" into review-hans to bring in commit a54da9df75cd ("hwmon: (hp-wmi-sensors) Check if WMI event data exists"). This is a dependency for a set of WMI event data refactoring changes.
2024-09-05perf/x86/rapl: Fix the energy-pkg event for AMD CPUsDhananjay Ugwekar1-5/+42
After commit: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") ... on AMD processors that support extended CPUID leaf 0x80000026, the topology_die_cpumask() and topology_logical_die_id() macros no longer return the package cpumask and package ID, instead they return the CCD (Core Complex Die) mask and ID respectively. This leads to the energy-pkg event scope to be modified to CCD instead of package. So, change the PMU scope for AMD and Hygon back to package. On a 12 CCD 1 Package AMD Zen4 Genoa machine: Before: $ cat /sys/devices/power/cpumask 0,8,16,24,32,40,48,56,64,72,80,88. The expected cpumask here is supposed to be just "0", as it is a package scope event, only one CPU will be collecting the event for all the CPUs in the package. After: $ cat /sys/devices/power/cpumask 0 [ mingo: Cleaned up the changelog ] Signed-off-by: Dhananjay Ugwekar <Dhananjay.Ugwekar@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20240904100934.3260-1-Dhananjay.Ugwekar@amd.com
2024-09-05x86/bugs: Fix handling when SRSO mitigation is disabledDavid Kaplan1-9/+5
When the SRSO mitigation is disabled, either via mitigations=off or spec_rstack_overflow=off, the warning about the lack of IBPB-enhancing microcode is printed anyway. This is unnecessary since the user has turned off the mitigation. [ bp: Massage, drop SBPB rationale as it doesn't matter because when mitigations are disabled x86_pred_cmd is not being used anyway. ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20240904150711.193022-1-david.kaplan@amd.com
2024-09-05Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar23-91/+157
This also refreshes the -rc1 based branch to -rc5. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-09-05x86/bugs: Add missing NO_SSB flagDaniel Sneddon1-2/+2
The Moorefield and Lightning Mountain Atom processors are missing the NO_SSB flag in the vulnerabilities whitelist. This will cause unaffected parts to incorrectly be reported as vulnerable. Add the missing flag. These parts are currently out of service and were verified internally with archived documentation that they need the NO_SSB flag. Closes: https://lore.kernel.org/lkml/CAEJ9NQdhh+4GxrtG1DuYgqYhvc0hi-sKZh-2niukJ-MyFLntAA@mail.gmail.com/ Reported-by: Shanavas.K.S <shanavasks@gmail.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240829192437.4074196-1-daniel.sneddon@linux.intel.com
2024-09-05x86/hyperv: fix kexec crash due to VP assist page corruptionAnirudh Rayabharam (Microsoft)3-7/+3
commit 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") introduces a new cpuhp state for hyperv initialization. cpuhp_setup_state() returns the state number if state is CPUHP_AP_ONLINE_DYN or CPUHP_BP_PREPARE_DYN and 0 for all other states. For the hyperv case, since a new cpuhp state was introduced it would return 0. However, in hv_machine_shutdown(), the cpuhp_remove_state() call is conditioned upon "hyperv_init_cpuhp > 0". This will never be true and so hv_cpu_die() won't be called on all CPUs. This means the VP assist page won't be reset. When the kexec kernel tries to setup the VP assist page again, the hypervisor corrupts the memory region of the old VP assist page causing a panic in case the kexec kernel is using that memory elsewhere. This was originally fixed in commit dfe94d4086e4 ("x86/hyperv: Fix kexec panic/hang issues"). Get rid of hyperv_init_cpuhp entirely since we are no longer using a dynamic cpuhp state and use CPUHP_AP_HYPERV_ONLINE directly with cpuhp_remove_state(). Cc: stable@vger.kernel.org Fixes: 9636be85cc5b ("x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline") Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20240828112158.3538342-1-anirudh@anirudhrb.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20240828112158.3538342-1-anirudh@anirudhrb.com>
2024-09-04bpf, x64: Fix a jit convergence issueYonghong Song1-2/+52
Daniel Hodges reported a jit error when playing with a sched-ext program. The error message is: unexpected jmp_cond padding: -4 bytes But further investigation shows the error is actual due to failed convergence. The following are some analysis: ... pass4, final_proglen=4391: ... 20e: 48 85 ff test rdi,rdi 211: 74 7d je 0x290 213: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 289: 48 85 ff test rdi,rdi 28c: 74 17 je 0x2a5 28e: e9 7f ff ff ff jmp 0x212 293: bf 03 00 00 00 mov edi,0x3 Note that insn at 0x211 is 2-byte cond jump insn for offset 0x7d (-125) and insn at 0x28e is 5-byte jmp insn with offset -129. pass5, final_proglen=4392: ... 20e: 48 85 ff test rdi,rdi 211: 0f 84 80 00 00 00 je 0x297 217: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 28d: 48 85 ff test rdi,rdi 290: 74 1a je 0x2ac 292: eb 84 jmp 0x218 294: bf 03 00 00 00 mov edi,0x3 Note that insn at 0x211 is 6-byte cond jump insn now since its offset becomes 0x80 based on previous round (0x293 - 0x213 = 0x80). At the same time, insn at 0x292 is a 2-byte insn since its offset is -124. pass6 will repeat the same code as in pass4. pass7 will repeat the same code as in pass5, and so on. This will prevent eventual convergence. Passes 1-14 are with padding = 0. At pass15, padding is 1 and related insn looks like: 211: 0f 84 80 00 00 00 je 0x297 217: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 24d: 48 85 d2 test rdx,rdx The similar code in pass14: 211: 74 7d je 0x290 213: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0] ... 249: 48 85 d2 test rdx,rdx 24c: 74 21 je 0x26f 24e: 48 01 f7 add rdi,rsi ... Before generating the following insn, 250: 74 21 je 0x273 "padding = 1" enables some checking to ensure nops is either 0 or 4 where #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp))) nops = INSN_SZ_DIFF - 2 In this specific case, addrs[i] = 0x24e // from pass14 addrs[i-1] = 0x24d // from pass15 prog - temp = 3 // from 'test rdx,rdx' in pass15 so nops = -4 and this triggers the failure. To fix the issue, we need to break cycles of je <-> jmp. For example, in the above case, we have 211: 74 7d je 0x290 the offset is 0x7d. If 2-byte je insn is generated only if the offset is less than 0x7d (<= 0x7c), the cycle can be break and we can achieve the convergence. I did some study on other cases like je <-> je, jmp <-> je and jmp <-> jmp which may cause cycles. Those cases are not from actual reproducible cases since it is pretty hard to construct a test case for them. the results show that the offset <= 0x7b (0x7b = 123) should be enough to cover all cases. This patch added a new helper to generate 8-bit cond/uncond jmp insns only if the offset range is [-128, 123]. Reported-by: Daniel Hodges <hodgesd@meta.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240904221251.37109-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-04x86/platform/geode: switch GPIO buttons and LEDs to software propertiesDmitry Torokhov7-209/+228
Convert GPIO-connected buttons and LEDs in Geode boards to software nodes/properties, so that support for platform data can be removed from gpio-keys driver (which will rely purely on generic device properties for configuration). To avoid repeating the same data structures over and over and over factor them out into a new geode-common.c file. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/ZsV6MNS_tUPPSffJ@google.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2024-09-04x86/cpu/vfm: Delete all the *_FAM6_ CPU #definesTony Luck1-83/+2
All code has been converted to use the vendor/family/model versions. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240903173443.7962-4-tony.luck@intel.com
2024-09-04x86/cpu/vfm: Delete X86_MATCH_INTEL_FAM6_MODEL[_STEPPING]() macrosTony Luck1-20/+0
These macros have been replaced by X86_MATCH_VFM[_STEPPING]() Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240903173443.7962-3-tony.luck@intel.com
2024-09-04x86/mm: add ARCH_PKEY_BITS to KconfigJoey Gouly1-0/+4
The new config option specifies how many bits are in each PKEY. Signed-off-by: Joey Gouly <joey.gouly@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20240822151113.1479789-3-joey.gouly@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2024-09-04x86/sched: Add basic support for CPU capacity scalingRafael J. Wysocki2-2/+100
In order be able to compute the sizes of tasks consistently across all CPUs in a hybrid system, it is necessary to provide CPU capacity scaling information to the scheduler via arch_scale_cpu_capacity(). Moreover, the value returned by arch_scale_freq_capacity() for the given CPU must correspond to the arch_scale_cpu_capacity() return value for it, or utilization computations will be inaccurate. Add support for it through per-CPU variables holding the capacity and maximum-to-base frequency ratio (times SCHED_CAPACITY_SCALE) that will be returned by arch_scale_cpu_capacity() and used by scale_freq_tick() to compute arch_freq_scale for the current CPU, respectively. In order to avoid adding measurable overhead for non-hybrid x86 systems, which are the vast majority in the field, whether or not the new hybrid CPU capacity scaling will be in effect is controlled by a static key. This static key is set by calling arch_enable_hybrid_capacity_scale() which also allocates memory for the per-CPU data and initializes it. Next, arch_set_cpu_capacity() is used to set the per-CPU variables mentioned above for each CPU and arch_rebuild_sched_domains() needs to be called for the scheduler to realize that capacity-aware scheduling can be used going forward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance Link: https://patch.msgid.link/10523497.nUPlyArG6x@rjwysocki.net [ rjw: Added parens to function kerneldoc comments ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-09-03x86: remove PG_uncachedMatthew Wilcox (Oracle)2-8/+5
Convert x86 to use PG_arch_2 instead of PG_uncached and remove PG_uncached. Link: https://lkml.kernel.org/r/20240821193445.2294269-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: make range-to-target_node lookup facility a part of numa_memblksMike Rapoport (Microsoft)2-47/+0
The x86 implementation of range-to-target_node lookup (i.e. phys_to_target_node() and memory_add_physaddr_to_nid()) relies on numa_memblks. Since numa_memblks are now part of the generic code, move these functions from x86 to mm/numa_memblks.c and select CONFIG_NUMA_KEEP_MEMINFO when CONFIG_NUMA_MEMBLKS=y for dax and cxl. [rppt@kernel.org: fix build] Link: https://lkml.kernel.org/r/ZtVfSt_zloPdDqVB@kernel.org Link: https://lkml.kernel.org/r/20240807064110.1003856-26-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: numa_memblks: introduce numa_memblks_initMike Rapoport (Microsoft)1-36/+4
Move most of x86::numa_init() to numa_memblks so that the latter will be more self-contained. With this numa_memblk data structures should not be exposed to the architecture specific code. Link: https://lkml.kernel.org/r/20240807064110.1003856-21-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: introduce numa_emulationMike Rapoport (Microsoft)5-605/+0
Move numa_emulation code from arch/x86 to mm/numa_emulation.c This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-20-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: move numa_distance and related code from x86 to numa_memblksMike Rapoport (Microsoft)3-108/+0
Move code dealing with numa_distance array from arch/x86 to mm/numa_memblks.c This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-19-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: introduce numa_memblksMike Rapoport (Microsoft)6-383/+10
Move code dealing with numa_memblks from arch/x86 to mm/ and add Kconfig options to let x86 select it in its Kconfig. This code will be later reused by arch_numa. No functional changes. Link: https://lkml.kernel.org/r/20240807064110.1003856-18-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa: numa_{add,remove}_cpu: make cpu parameter unsignedMike Rapoport (Microsoft)3-15/+15
CPU id cannot be negative. Making it unsigned also aligns with declarations in include/asm-generic/numa.h used by arm64 and riscv and allows sharing numa emulation code with these architectures. Link: https://lkml.kernel.org/r/20240807064110.1003856-17-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa_emu: use a helper function to get MAX_DMA32_PFNMike Rapoport (Microsoft)3-2/+8
This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/20240807064110.1003856-16-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa_emu: split __apicid_to_node update to a helper functionMike Rapoport (Microsoft)3-13/+25
This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Link: https://lkml.kernel.org/r/20240807064110.1003856-15-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa_emu: simplify allocation of phys_distMike Rapoport (Microsoft)1-6/+2
By the time numa_emulation() is called, all physical memory is already mapped in the direct map and there is no need to define limits for memblock allocation. Replace memblock_phys_alloc_range() with memblock_alloc(). Link: https://lkml.kernel.org/r/20240807064110.1003856-14-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa: move FAKE_NODE_* defines to numa_emuMike Rapoport (Microsoft)2-2/+3
The definitions of FAKE_NODE_MIN_SIZE and FAKE_NODE_MIN_HASH_MASK are only used by numa emulation code, make them local to arch/x86/mm/numa_emulation.c Link: https://lkml.kernel.org/r/20240807064110.1003856-13-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03x86/numa: use get_pfn_range_for_nid to verify that node spans memoryMike Rapoport (Microsoft)1-10/+7
Instead of looping over numa_meminfo array to detect node's start and end addresses use get_pfn_range_for_init(). This is shorter and make it easier to lift numa_memblks to generic code. Link: https://lkml.kernel.org/r/20240807064110.1003856-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>