aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2024-07-04perf/x86/amd/uncore: Fix DF and UMC domain identificationSandipan Das1-3/+3
For uncore PMUs, a single context is shared across all CPUs in a domain. The domain can be a CCX, like in the case of the L3 PMU, or a socket, like in the case of DF and UMC PMUs. This information is available via the PMU's cpumask. For contexts shared across a socket, the domain is currently determined from topology_die_id() which is incorrect after the introduction of commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") as it now returns a CCX identifier on Zen 4 and later systems which support CPUID leaf 0x80000026. Use topology_logical_package_id() instead as it always returns a socket identifier irrespective of the availability of CPUID leaf 0x80000026. Fixes: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf") Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/amd/uncore: Avoid PMU registration if counters are unavailableSandipan Das1-8/+14
X86_FEATURE_PERFCTR_NB and X86_FEATURE_PERFCTR_LLC are derived from CPUID leaf 0x80000001 ECX bits 24 and 28 respectively and denote the availability of DF and L3 counters. When these bits are not set, the corresponding PMUs have no counters and hence, should not be registered. Fixes: 07888daa056e ("perf/x86/amd/uncore: Move discovery and registration") Signed-off-by: Sandipan Das <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel: Support Perfmon MSRs aliasingKan Liang4-5/+32
The architectural performance monitoring V6 supports a new range of counters' MSRs in the 19xxH address range. They include all the GP counter MSRs, the GP control MSRs, and the fixed counter MSRs. The step between each sibling counter is 4. Add intel_pmu_addr_offset() to calculate the correct offset. Add fixedctr in struct x86_pmu to store the address of the fixed counter 0. It can be used to calculate the rest of the fixed counters. The MSR address of the fixed counter control is not changed. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel: Support PERFEVTSEL extensionKan Liang2-4/+69
Two new fields (the unit mask2, and the equal flag) are added in the IA32_PERFEVTSELx MSRs. They can be enumerated by the CPUID.23H.0.EBX. Update the config_mask in x86_pmu and x86_hybrid_pmu for the true layout of the PERFEVTSEL. Expose the new formats into sysfs if they are available. The umask extension reuses the same format attr name "umask" as the previous umask. Add umask2_show to determine/display the correct format for the current machine. Co-developed-by: Dapeng Mi <[email protected]> Signed-off-by: Dapeng Mi <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86: Add config_mask to represent EVENTSEL bitmaskKan Liang3-1/+12
Different vendors may support different fields in EVENTSEL MSR, such as Intel would introduce new fields umask2 and eq bits in EVENTSEL MSR since Perfmon version 6. However, a fixed mask X86_RAW_EVENT_MASK is used to filter the attr.config. Introduce a new config_mask to record the real supported EVENTSEL bitmask. Only apply it to the existing code now. No functional change. Co-developed-by: Dapeng Mi <[email protected]> Signed-off-by: Dapeng Mi <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel: Support new data source for Lunar LakeKan Liang3-3/+109
A new PEBS data source format is introduced for the p-core of Lunar Lake. The data source field is extended to 8 bits with new encodings. A new layout is introduced into the union intel_x86_pebs_dse. Introduce the lnl_latency_data() to parse the new format. Enlarge the pebs_data_source[] accordingly to include new encodings. Only the mem load and the mem store events can generate the data source. Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and INTEL_HYBRID_STLAT_CONSTRAINT to mark them. Add two new bits for the new cache-related data src, L2_MHB and MSC. The L2_MHB is short for L2 Miss Handling Buffer, which is similar to LFB (Line Fill Buffer), but to track the L2 Cache misses. The MSC stands for the memory-side cache. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel: Rename model-specific pebs_latency_data functionsKan Liang3-16/+16
The model-specific pebs_latency_data functions of ADL and MTL use the "small" as a postfix to indicate the e-core. The postfix is too generic for a model-specific function. It cannot provide useful information that can directly map it to a specific uarch, which can facilitate the development and maintenance. Use the abbr of the uarch to rename the model-specific functions. Suggested-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86: Add Lunar Lake and Arrow Lake supportKan Liang4-0/+147
From PMU's perspective, Lunar Lake and Arrow Lake are similar to the previous generation Meteor Lake. Both are hybrid platforms, with e-core and p-core. The key differences include: - The e-core supports 3 new fixed counters - The p-core supports an updated PEBS Data Source format - More GP counters (Updated event constraint table) - New Architectural performance monitoring V6 (New Perfmon MSRs aliasing, umask2, eq). - New PEBS format V6 (Counters Snapshotting group) - New RDPMC metrics clear mode The legacy features, the 3 new fixed counters and updated event constraint table are enabled in this patch. The new PEBS data source format, the architectural performance monitoring V6, the PEBS format V6, and the new RDPMC metrics clear mode are supported in the following patches. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86: Support counter maskKan Liang9-179/+199
The current perf assumes that both GP and fixed counters are contiguous. But it's not guaranteed on newer Intel platforms or in a virtualization environment. Use the counter mask to replace the number of counters for both GP and the fixed counters. For the other ARCHs or old platforms which don't support a counter mask, using GENMASK_ULL(num_counter - 1, 0) to replace. There is no functional change for them. The interface to KVM is not changed. The number of counters still be passed to KVM. It can be updated later separately. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel: Support the PEBS event maskKan Liang4-13/+26
The current perf assumes that the counters that support PEBS are contiguous. But it's not guaranteed with the new leaf 0x23 introduced. The counters are enumerated with a counter mask. There may be holes in the counter mask for future platforms or in a virtualization environment. Store the PEBS event mask rather than the maximum number of PEBS counters in the x86 PMU structures. Signed-off-by: Kan Liang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Andi Kleen <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/cstate: Add Lunarlake supportZhang Rui1-7/+19
Compared with previous client platforms, PC8 is removed from Lunarlake. It supports CC1/CC6/CC7 and PC2/PC3/PC6/PC10 residency counters. Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/cstate: Add Arrowlake supportZhang Rui1-8/+12
Like Alderlake, Arrowlake supports CC1/CC6/CC7 and PC2/PC3/PC6/PC8/PC10. Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/cstate: Fix Alderlake/Raptorlake/MeteorlakeZhang Rui1-5/+2
For Alderlake, the spec changes after the patch submitted and PC7/PC9 are removed. Raptorlake and Meteorlake, which copy the Alderlake cstate PMU, also don't have PC7/PC9. Remove PC7/PC9 support for Alderlake/Raptorlake/Meteorlake. Fixes: d0ca946bcf84 ("perf/x86/cstate: Add Alder Lake CPU support") Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-04Merge branch 'tip/x86/cpu'Peter Zijlstra235-4109/+5393
The Lunarlake patches rely on the new VFM stuff. Signed-off-by: Peter Zijlstra <[email protected]>
2024-07-04perf/x86/intel/pt: Fix pt_topa_entry_for_page() address calculationAdrian Hunter1-1/+1
Currently, perf allocates an array of page pointers which is limited in size by MAX_PAGE_ORDER. That in turn limits the maximum Intel PT buffer size to 2GiB. Should that limitation be lifted, the Intel PT driver can support larger sizes, except for one calculation in pt_topa_entry_for_page(), which is limited to 32-bits. Fix pt_topa_entry_for_page() address calculation by adding a cast. Fixes: 39152ee51b77 ("perf/x86/intel/pt: Get rid of reverse lookup table for ToPA") Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/pt: Fix a topa_entry base address calculationAdrian Hunter1-1/+1
topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit type, even if the underlying type is 64-bit, and so, if necessary, must be cast to a larger type when calculations are done. Fix a topa_entry->base address calculation by adding a cast. Without the cast, the address was limited to 36-bits i.e. 64GiB. The address calculation is used on systems that do not support Multiple Entry ToPA (only Broadwell), and affects physical addresses on or above 64GiB. Instead of writing to the correct address, the address comprising the first 36 bits would be written to. Intel PT snapshot and sampling modes are not affected. Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Reported-by: Dave Hansen <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/pt: Fix topa_entry base lengthMarco Cavenati1-2/+2
topa_entry->base needs to store a pfn. It obviously needs to be large enough to store the largest possible x86 pfn which is MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small. Increase the size of topa_entry->base from 36 bits to 40 bits. Note, systems where physical addresses can be 256TiB or more are affected. [ Adrian: Amend commit message as suggested by Dave Hansen ] Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Signed-off-by: Marco Cavenati <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-07-03arch/x86: do not explicitly clear Reserved flag in free_pagetableOscar Salvador1-2/+0
In free_pagetable() we use the non-atomic version for clearing the PageReserved bit from the page. free_pagetable() will either call free_reserved_page() or put_page_bootmem(), which will eventually end up calling free_reserved_page(), and in there we already clear the PageReserved flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: drop leftover comment references to pxx_huge()Peter Xu1-2/+2
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide: remove pXd_huge()"), however there are still three comments referencing the API that got overlooked. Remove them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reported-by: Christophe Leroy <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm/mm_init: use node's number of cpus in deferred_page_init_max_threadsEric Chanudet1-12/+0
x86_64 is already using the node's cpu as maximum threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior making the function arch-specific with commit ecd096506922 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions: | | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 | | | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB | | | 8cpus | 8cpus | 8cpus | 32cpus | |---------|-------------|--------------|-----------------|--------------| | threads | ms (%) | ms (%) | ms (%) | ms (%) | |---------|-------------|--------------|-----------------|--------------| | 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) | | cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) | Michael Ellerman reported: : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives: : : [ 0.500124] node 2 deferred pages initialised in 210ms : [ 0.515790] node 3 deferred pages initialised in 230ms : [ 0.516061] node 0 deferred pages initialised in 230ms : [ 0.516522] node 7 deferred pages initialised in 230ms : [ 0.516672] node 4 deferred pages initialised in 230ms : [ 0.516798] node 6 deferred pages initialised in 230ms : [ 0.517051] node 5 deferred pages initialised in 230ms : [ 0.523887] node 1 deferred pages initialised in 240ms : : vs with the patch: : : [ 0.379613] node 0 deferred pages initialised in 90ms : [ 0.380388] node 1 deferred pages initialised in 90ms : [ 0.380540] node 4 deferred pages initialised in 100ms : [ 0.390239] node 6 deferred pages initialised in 100ms : [ 0.390249] node 2 deferred pages initialised in 100ms : [ 0.390786] node 3 deferred pages initialised in 110ms : [ 0.396721] node 5 deferred pages initialised in 110ms : [ 0.397095] node 7 deferred pages initialised in 110ms : : Which is a nice speedup. [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric Chanudet <[email protected]> Tested-by: Michael Ellerman <[email protected]> (powerpc) Reviewed-by: Baoquan He <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03x86/vdso: Remove unused includeAnna-Maria Behnsen1-1/+0
Including hrtimer.h is not required and is probably a historical leftover. Remove it. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03x86/vgtod: Remove unused typedef gtod_long_tAnna-Maria Behnsen1-5/+0
The typedef gtod_long_t is not used anymore so remove it. The header file contains then only includes dependent on CONFIG_GENERIC_GETTIMEOFDAY to not break ARCH=um. Nevertheless, keep the header file only with those includes to prevent spreading ifdeffery all over the place. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03x86/vdso: Fix function reference in commentAnna-Maria Behnsen1-3/+2
Replace the reference to the non-existent function arch_vdso_cycles_valid() by the proper function arch_vdso_cycles_ok(). Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03um: Enable preemption in UMLAnton Ivanov1-1/+0
Since userspace state is saved in the MM process, kernel using FPU still doesn't really need to do anything, so this really is as simple as enabling preemption. The irq critical section in sigio_handler() needs preempt_disable()/preempt_enable(). Signed-off-by: Anton Ivanov <[email protected]> Link: https://patch.msgid.link/20240702102549.d2fcea450854.I12f5a53d80ec1e425e66ef272b1e95cb523b608e@changeid [rebase, remove FPU save/restore, fix x86/um Makefile, rewrite commit message] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: remove copy_context_skas0Benjamin Berg3-51/+0
The kernel flushes the memory ranges anyway for CoW and does not assume that the userspace process has anything set up already. So, start with a fresh process for the new mm context. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: remove LDT supportBenjamin Berg4-438/+2
The current LDT code has a few issues that mean it should be redone in a different way once we always start with a fresh MM even when cloning. In a new and better world, the kernel would just ensure its own LDT is clear at startup. At that point, all that is needed is a simple function to populate the LDT from another MM in arch_dup_mmap combined with some tracking of the installed LDT entries for each MM. Note that the old implementation was even incorrect with regard to reading, as it copied out the LDT entries in the internal format rather than converting them to the userspace structure. Removal should be fine as the LDT is not used for thread-local storage anymore. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Rework syscall handlingBenjamin Berg5-134/+16
Rework syscall handling to be platform independent. Also create a clean split between queueing of syscalls and flushing them out, removing the need to keep state in the code that triggers the syscalls. The code adds syscall_data_len to the global mm_id structure. This will be used later to allow surrounding code to track whether syscalls still need to run and if errors occurred. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Add generic stub_syscall6 functionBenjamin Berg2-0/+38
This function will be used by the new syscall handling code. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Remove stub-data.h include from common-offsets.hBenjamin Berg2-6/+8
Further commits will require values from common-offsets.h inside stub-data.h. Resolve the possible circular dependency and simply use offsetof() inside stub_32.h and stub_64.h. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03x86/bhi: Avoid warning in #DB handler due to BHI mitigationAlexandre Chartre1-4/+10
When BHI mitigation is enabled, if SYSENTER is invoked with the TF flag set then entry_SYSENTER_compat() uses CLEAR_BRANCH_HISTORY and calls the clear_bhb_loop() before the TF flag is cleared. This causes the #DB handler (exc_debug_kernel()) to issue a warning because single-step is used outside the entry_SYSENTER_compat() function. To address this issue, entry_SYSENTER_compat() should use CLEAR_BRANCH_HISTORY after making sure the TF flag is cleared. The problem can be reproduced with the following sequence: $ cat sysenter_step.c int main() { asm("pushf; pop %ax; bts $8,%ax; push %ax; popf; sysenter"); } $ gcc -o sysenter_step sysenter_step.c $ ./sysenter_step Segmentation fault (core dumped) The program is expected to crash, and the #DB handler will issue a warning. Kernel log: WARNING: CPU: 27 PID: 7000 at arch/x86/kernel/traps.c:1009 exc_debug_kernel+0xd2/0x160 ... RIP: 0010:exc_debug_kernel+0xd2/0x160 ... Call Trace: <#DB> ? show_regs+0x68/0x80 ? __warn+0x8c/0x140 ? exc_debug_kernel+0xd2/0x160 ? report_bug+0x175/0x1a0 ? handle_bug+0x44/0x90 ? exc_invalid_op+0x1c/0x70 ? asm_exc_invalid_op+0x1f/0x30 ? exc_debug_kernel+0xd2/0x160 exc_debug+0x43/0x50 asm_exc_debug+0x1e/0x40 RIP: 0010:clear_bhb_loop+0x0/0xb0 ... </#DB> <TASK> ? entry_SYSENTER_compat_after_hwframe+0x6e/0x8d </TASK> [ bp: Massage commit message. ] Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry") Reported-by: Suman Maity <[email protected]> Signed-off-by: Alexandre Chartre <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Andrew Cooper <[email protected]> Reviewed-by: Pawan Gupta <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03arch: um: rust: Use the generated target.json againDavid Gow1-0/+1
The Rust compiler can take a target config from 'target.json', which is generated by scripts/generate_rust_target.rs. It used to be that all Linux architectures used this to generate a target.json, but now architectures must opt-in to this, or they will default to the Rust compiler's built-in target definition. This is mostly okay for (64-bit) x86 and UML, except that it can generate SSE instructions, which we can't use in the kernel. So re-instate the custom target.json, which disables SSE (and generally enables the 'soft-float' feature). This fixes the following compile error: error: <unknown>:0:0: in function _RNvMNtCs5QSdWC790r4_4core3f32f7next_up float (float): SSE register return with SSE disabled Fixes: f82811e22b48 ("rust: Refactor the build target to allow the use of builtin targets") Signed-off-by: David Gow <[email protected]> Reviewed-by: Boqun Feng <[email protected]> Tested-by: Miguel Ojeda <[email protected]> Reviewed-by: Miguel Ojeda <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-02x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checkingYosry Ahmed1-7/+4
There are two separate checks in prctl_enable_tagged_addr() that nr_bits is in the correct range. The checks are arranged such the correct case is sandwiched between both error cases, which do exactly the same thing. Simplify the if condition and pull the correct case outside with the rest of the success code path. Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
2024-07-02x86/mm: Fix LAM inconsistency during context switchYosry Ahmed4-11/+20
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. That means that a context-switching kernel thread can race and observe the mm's LAM metadata (mm->context.lam_cr3_mask) change. The context switch code does two logical things with that metadata: populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race, 'cpu_tlbstate.lam' and CR3 can end up out of sync. This de-synchronization is currently harmless. But it is confusing and might lead to warnings or real bugs. Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask instead of an mm_struct pointer, and while we are at it, rename it to cpu_tlbstate_update_lam(). This should also make it clearer that we are updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once and use it for both the cpu_tlbstate update and the CR3 update. Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
2024-07-02x86/mm: Use IPIs to synchronize LAM enablementYosry Ahmed2-7/+29
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. If LAM is enabled by a userspace process while a kthread is using its mm, the kthread will not observe LAM enablement (i.e. LAM will be disabled in CR3). This could be fine for the kthread itself, as LAM only affects userspace addresses. However, if the kthread context switches to a thread in the same userspace process, CR3 may or may not be updated because the mm_struct doesn't change (based on pending TLB flushes). If CR3 is not updated, the userspace thread will run incorrectly with LAM disabled, which may cause page faults when using tagged addresses. Example scenario: CPU 1 CPU 2 /* kthread */ kthread_use_mm() /* user thread */ prctl_enable_tagged_addr() /* LAM enabled on CPU 2 */ /* LAM disabled on CPU 1 */ context_switch() /* to CPU 1 */ /* Switching to user thread */ switch_mm_irqs_off() /* CR3 not updated */ /* LAM is still disabled on CPU 1 */ Synchronize LAM enablement by sending an IPI to all CPUs running with the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1 in the above scenario before prctl_enable_tagged_addr() returns and userspace starts using tagged addresses, and before it's possible to run the userspace process on CPU 1. In switch_mm_irqs_off(), move reading the LAM mask until after mm_cpumask() is updated. This ensures that if an outdated LAM mask is written to CR3, an IPI is received to update it right after IRQs are re-enabled. [ dhansen: Add a LAM enabling helper and comment it ] Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch") Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
2024-07-02x86/resctrl: Detect Sub-NUMA Cluster (SNC) modeTony Luck1-0/+66
There isn't a simple hardware bit that indicates whether a CPU is running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in the same NUMA node as CPU0. Add the missing definition of pr_fmt() to monitor.c. This wasn't noticed before as there are only "can't happen" console messages from this file. [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systemsTony Luck4-0/+25
Hardware has two RMID configuration options for SNC systems. The default mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and two SNC nodes per L3 cache RMIDs 0..99 are used on node 0, and 100..199 on node 1. This isn't compatible with Linux resctrl usage. On this example system a process using RMID 5 would only update monitor counters while running on SNC node 0. The other mode is "RMID Sharing Mode". This is enabled by clearing bit 0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode the number of logical RMIDs is the number of physical RMIDs (from CPUID leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A process can use the same RMID across different SNC nodes. See the "Intel Resource Director Technology Architecture Specification" for additional details. When SNC is enabled, update the MSR when a monitor domain is marked online. Technically this is overkill. It only needs to be done once per L3 cache instance rather than per SNC domain. But there is no harm in doing it more than once, and this is not in a critical path. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Make __mon_event_count() handle sum domainsTony Luck1-9/+42
Legacy resctrl monitor files must provide the sum of event values across all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance. There are now two cases: 1) A specific domain is provided in struct rmid_read This is either a non-SNC system, or the request is to read data from just one SNC node. 2) Domain pointer is NULL. In this case the cacheinfo field in struct rmid_read indicates that all SNC nodes that share that L3 cache instance should have the event read and return the sum of all values. Update the CPU sanity check. The existing check that an event is read from a CPU in the requested domain still applies when reading a single domain. But when summing across domains a more relaxed check that the current CPU is in the scope of the L3 cache instance is appropriate since the MSRs to read events are scoped at L3 cache level. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counterTony Luck3-10/+34
mon_event_read() fills out most fields of the struct rmid_read that is passed via an smp_call*() function to a CPU that is part of the correct domain to read the monitor counters. With Sub-NUMA Cluster (SNC) mode there are now two cases to handle: 1) Reading a file that returns a value for a single domain. + Choose the CPU to execute from the domain cpu_mask 2) Reading a file that must sum across domains sharing an L3 cache instance. + Indicate to called code that a sum is needed by passing a NULL rdt_mon_domain pointer. + Choose the CPU from the L3 shared_cpu_map. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) modeTony Luck1-6/+29
In SNC mode, there are multiple subdirectories in each L3 level monitor directory (one for each SNC node). If all the CPUs in an SNC node are taken offline, just remove the SNC directory for that node. In non-SNC mode, or when the last SNC node directory is removed, remove the L3 monitor directory. Add a helper function to avoid duplicated code. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor filesTony Luck1-16/+46
When SNC mode is enabled, create subdirectories and files to monitor at the SNC node granularity. Legacy behavior is preserved by tagging the monitor files at the L3 granularity with the "sum" attribute. When the user reads these files the kernel will read monitor data from all SNC nodes that share the same L3 cache instance and return the aggregated value to the user. Note that the "domid" field for files that must sum across SNC domains has the L3 cache instance id, while non-summing files use the domain id. The "sum" files do not need to make a call to mon_event_read() to initialize the MBM counters. This will be handled by initializing the individual SNC nodes that share the L3. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Allocate a new field in union mon_data_bitsTony Luck1-7/+13
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files must report the sum of the data from all of the SNC nodes that share the L3 cache that is referenced by the monitor file. Resctrl squeezes all the attributes of these files into 32 bits so they can be stored in the "priv" field of struct kernfs_node. Currently, only three monitor events are defined by enum resctrl_event_id so reducing it from 8 bits to 7 bits still provides more than enough space to represent all the known event types. But note that this choice was arbitrary. The "rid" field is also far wider than needed for the current number of resource id types. This structure is purely internal to resctrl, no ABI issues with modifying it. Subsequent changes may rearrange the allocation of bits between each of the fields as needed. Give the bit to a new "sum" field that indicates that reading this file must sum across SNC nodes. This bit also indicates that the domid field is the id of an L3 cache (instead of a domain id) to find which domains must be summed. Fix up other issues in the kerneldoc description for mon_data_bits. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Refactor mkdir_mondata_subdir() with a helper functionTony Luck1-17/+28
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor files in the original "mon_L3_XX" directories and also in each of the "mon_sub_L3_YY" directories. Refactor mkdir_mondata_subdir() to move the creation of monitoring files into a helper function to avoid the need to duplicate code later. No functional change. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Initialize on-stack struct rmid_read instancesTony Luck3-5/+3
New semantics rely on some struct rmid_read members having NULL values to distinguish between the SNC and non-SNC scenarios. resctrl can thus no longer rely on this struct not being initialized properly. Initialize all on-stack declarations of struct rmid_read: rdtgroup_mondata_show() mbm_update() mkdir_mondata_subdir() to ensure that garbage values from the stack are not passed down to other functions. [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Add a new field to struct rmid_read for summation of domainsTony Luck1-0/+19
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read() to package up all the required details into an rmid_read structure which is passed across the smp_call*() infrastructure to code that will read data from hardware and return the value (or error status) in the rmid_read structure. Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the smp_call-ed code to sum event data from all domains that share an L3 cache. Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data collection routines to use to pick the domains to be summed. [ Reinette: the rmid_read structure has become complex enough so document each of its fields and provide the kerneldoc documentation for struct rmid_read. ] Co-developed-by: Reinette Chatre <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor filesTony Luck3-3/+6
When SNC is enabled, monitoring data is collected at the SNC node granularity, but must be reported at L3-cache granularity for backwards compatibility in addition to reporting at the node level. Add a "ci" field to the rdt_mon_domain structure to save the cache information about the enclosing L3 cache for the domain. This provides: 1) The cache id which is needed to compose the name of the legacy monitoring directory, and to determine which domains should be summed to provide L3-scoped data. 2) The shared_cpu_map which is needed to determine which CPUs can be used to read the RMID counters with the MSR interface. This is the first step to an eventual goal of monitor reporting files like this (for a system with two SNC nodes per L3): $ cd /sys/fs/resctrl/mon_data $ tree mon_L3_00 mon_L3_00 <- 00 here is L3 cache id ├── llc_occupancy \ These files provide legacy support ├── mbm_local_bytes > for non-SNC aware monitor apps ├── mbm_total_bytes / that expect data at L3 cache level ├── mon_sub_L3_00 <- 00 here is SNC node id │   ├── llc_occupancy \ These files are finer grained │   ├── mbm_local_bytes > data from each SNC node │   └── mbm_total_bytes / └── mon_sub_L3_01 ├── llc_occupancy \ ├── mbm_local_bytes > As above, but for node 1. └── mbm_total_bytes / [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) ↵Tony Luck1-3/+9
systems When SNC is enabled there is a mismatch between the MBA control function which operates at L3 cache scope and the MBM monitor functions which measure memory bandwidth on each SNC node. Block use of the mba_MBps when scopes for MBA/MBM do not match. Improve user diagnostics by adding invalfc() message when mba_MBps is not supported. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Introduce snc_nodes_per_l3_cacheTony Luck1-6/+50
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores and memory controllers on a socket into two or more groups. These are presented to the operating system as NUMA nodes. This may enable some workloads to have slightly lower latency to memory as the memory controller(s) in an SNC node are electrically closer to the CPU cores on that SNC node. This cost may be offset by lower bandwidth since the memory accesses for each core can only be interleaved between the memory controllers on the same SNC node. Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks to track L3 cache occupancy and memory bandwidth. There is an MSR that controls how the RMIDs are shared between SNC nodes. The default mode divides them numerically. E.g. when there are two SNC nodes on a socket the lower number half of the RMIDs are given to the first node, the remainder to the second node. This would be difficult to use with the Linux resctrl interface as specific RMID values assigned to resctrl groups are not visible to users. RMID sharing mode divides the physical RMIDs evenly between SNC nodes but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two SNC nodes per L3 cache instance would have 100 logical RMIDs available for Linux to use. A task running on SNC node 0 with RMID 5 would accumulate LLC occupancy and MBM bandwidth data in physical RMID 5. Another task using RMID 5, but running on SNC node 1 would accumulate data in physical RMID 105. Even with this renumbering SNC mode requires several changes in resctrl behavior for correct operation. Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate how many SNC domains share an L3 cache instance. Initialize this to "1". Runtime detection of SNC mode will adjust this value. Update all places to take appropriate action when SNC mode is enabled: 1) The number of logical RMIDs per L3 cache available for use is the number of physical RMIDs divided by the number of SNC nodes. 2) Likewise the "mon_scale" value must be divided by the number of SNC nodes. 3) Add a function to convert from logical RMID values (assigned to tasks and loaded into the IA32_PQR_ASSOC MSR on context switch) to physical RMID values to load into IA32_QM_EVTSEL MSR when reading counters on each SNC node. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Add node-scope to the options for feature scopeTony Luck1-0/+2
Currently supported resctrl features are all domain scoped the same as the scope of the L2 or L3 caches. Add RESCTRL_L3_NODE as a new option for features that are scoped at the same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA Cluster (SNC) feature where monitoring features are divided between nodes that share an L3 cache. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Split the rdt_domain and rdt_hw_domain structuresTony Luck6-125/+146
The same rdt_domain structure is used for both control and monitor functions. But this results in wasted memory as some of the fields are only used by control functions, while most are only used for monitor functions. Split into separate rdt_ctrl_domain and rdt_mon_domain structures with just the fields required for control and monitoring respectively. Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain and rdt_hw_mon_domain. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for different scope for control/monitor operationsTony Luck6-90/+221
Resctrl assumes that control and monitor operations on a resource are performed at the same scope. Prepare for systems that use different scope (specifically Intel needs to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control and NODE scope for cache occupancy and memory bandwidth monitoring). Create separate domain lists for control and monitor operations. Note that errors during initialization of either control or monitor functions on a domain would previously result in that domain being excluded from both control and monitor operations. Now the domains are allocated independently it is no longer required to disable both control and monitor operations if either fail. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]