Age | Commit message (Collapse) | Author | Files | Lines |
|
For uncore PMUs, a single context is shared across all CPUs in a domain.
The domain can be a CCX, like in the case of the L3 PMU, or a socket,
like in the case of DF and UMC PMUs. This information is available via
the PMU's cpumask.
For contexts shared across a socket, the domain is currently determined
from topology_die_id() which is incorrect after the introduction of
commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD
0x80000026 leaf") as it now returns a CCX identifier on Zen 4 and later
systems which support CPUID leaf 0x80000026.
Use topology_logical_package_id() instead as it always returns a socket
identifier irrespective of the availability of CPUID leaf 0x80000026.
Fixes: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf")
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
X86_FEATURE_PERFCTR_NB and X86_FEATURE_PERFCTR_LLC are derived from
CPUID leaf 0x80000001 ECX bits 24 and 28 respectively and denote the
availability of DF and L3 counters. When these bits are not set, the
corresponding PMUs have no counters and hence, should not be registered.
Fixes: 07888daa056e ("perf/x86/amd/uncore: Move discovery and registration")
Signed-off-by: Sandipan Das <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The architectural performance monitoring V6 supports a new range of
counters' MSRs in the 19xxH address range. They include all the GP
counter MSRs, the GP control MSRs, and the fixed counter MSRs.
The step between each sibling counter is 4. Add intel_pmu_addr_offset()
to calculate the correct offset.
Add fixedctr in struct x86_pmu to store the address of the fixed counter
0. It can be used to calculate the rest of the fixed counters.
The MSR address of the fixed counter control is not changed.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Two new fields (the unit mask2, and the equal flag) are added in the
IA32_PERFEVTSELx MSRs. They can be enumerated by the CPUID.23H.0.EBX.
Update the config_mask in x86_pmu and x86_hybrid_pmu for the true layout
of the PERFEVTSEL.
Expose the new formats into sysfs if they are available. The umask
extension reuses the same format attr name "umask" as the previous
umask. Add umask2_show to determine/display the correct format
for the current machine.
Co-developed-by: Dapeng Mi <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Different vendors may support different fields in EVENTSEL MSR, such as
Intel would introduce new fields umask2 and eq bits in EVENTSEL MSR
since Perfmon version 6. However, a fixed mask X86_RAW_EVENT_MASK is
used to filter the attr.config.
Introduce a new config_mask to record the real supported EVENTSEL
bitmask.
Only apply it to the existing code now. No functional change.
Co-developed-by: Dapeng Mi <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
A new PEBS data source format is introduced for the p-core of Lunar
Lake. The data source field is extended to 8 bits with new encodings.
A new layout is introduced into the union intel_x86_pebs_dse.
Introduce the lnl_latency_data() to parse the new format.
Enlarge the pebs_data_source[] accordingly to include new encodings.
Only the mem load and the mem store events can generate the data source.
Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and
INTEL_HYBRID_STLAT_CONSTRAINT to mark them.
Add two new bits for the new cache-related data src, L2_MHB and MSC.
The L2_MHB is short for L2 Miss Handling Buffer, which is similar to
LFB (Line Fill Buffer), but to track the L2 Cache misses.
The MSC stands for the memory-side cache.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The model-specific pebs_latency_data functions of ADL and MTL use the
"small" as a postfix to indicate the e-core. The postfix is too generic
for a model-specific function. It cannot provide useful information that
can directly map it to a specific uarch, which can facilitate the
development and maintenance.
Use the abbr of the uarch to rename the model-specific functions.
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
From PMU's perspective, Lunar Lake and Arrow Lake are similar to the
previous generation Meteor Lake. Both are hybrid platforms, with e-core
and p-core.
The key differences include:
- The e-core supports 3 new fixed counters
- The p-core supports an updated PEBS Data Source format
- More GP counters (Updated event constraint table)
- New Architectural performance monitoring V6
(New Perfmon MSRs aliasing, umask2, eq).
- New PEBS format V6 (Counters Snapshotting group)
- New RDPMC metrics clear mode
The legacy features, the 3 new fixed counters and updated event
constraint table are enabled in this patch.
The new PEBS data source format, the architectural performance
monitoring V6, the PEBS format V6, and the new RDPMC metrics clear mode
are supported in the following patches.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The current perf assumes that both GP and fixed counters are contiguous.
But it's not guaranteed on newer Intel platforms or in a virtualization
environment.
Use the counter mask to replace the number of counters for both GP and
the fixed counters. For the other ARCHs or old platforms which don't
support a counter mask, using GENMASK_ULL(num_counter - 1, 0) to
replace. There is no functional change for them.
The interface to KVM is not changed. The number of counters still be
passed to KVM. It can be updated later separately.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The current perf assumes that the counters that support PEBS are
contiguous. But it's not guaranteed with the new leaf 0x23 introduced.
The counters are enumerated with a counter mask. There may be holes in
the counter mask for future platforms or in a virtualization
environment.
Store the PEBS event mask rather than the maximum number of PEBS
counters in the x86 PMU structures.
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Reviewed-by: Ian Rogers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Compared with previous client platforms, PC8 is removed from Lunarlake.
It supports CC1/CC6/CC7 and PC2/PC3/PC6/PC10 residency counters.
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Like Alderlake, Arrowlake supports CC1/CC6/CC7 and PC2/PC3/PC6/PC8/PC10.
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
For Alderlake, the spec changes after the patch submitted and PC7/PC9
are removed.
Raptorlake and Meteorlake, which copy the Alderlake cstate PMU, also
don't have PC7/PC9.
Remove PC7/PC9 support for Alderlake/Raptorlake/Meteorlake.
Fixes: d0ca946bcf84 ("perf/x86/cstate: Add Alder Lake CPU support")
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The Lunarlake patches rely on the new VFM stuff.
Signed-off-by: Peter Zijlstra <[email protected]>
|
|
Currently, perf allocates an array of page pointers which is limited in
size by MAX_PAGE_ORDER. That in turn limits the maximum Intel PT buffer
size to 2GiB. Should that limitation be lifted, the Intel PT driver can
support larger sizes, except for one calculation in
pt_topa_entry_for_page(), which is limited to 32-bits.
Fix pt_topa_entry_for_page() address calculation by adding a cast.
Fixes: 39152ee51b77 ("perf/x86/intel/pt: Get rid of reverse lookup table for ToPA")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit
type, even if the underlying type is 64-bit, and so, if necessary, must
be cast to a larger type when calculations are done.
Fix a topa_entry->base address calculation by adding a cast.
Without the cast, the address was limited to 36-bits i.e. 64GiB.
The address calculation is used on systems that do not support Multiple
Entry ToPA (only Broadwell), and affects physical addresses on or above
64GiB. Instead of writing to the correct address, the address comprising
the first 36 bits would be written to.
Intel PT snapshot and sampling modes are not affected.
Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
Reported-by: Dave Hansen <[email protected]>
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
topa_entry->base needs to store a pfn. It obviously needs to be
large enough to store the largest possible x86 pfn which is
MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small.
Increase the size of topa_entry->base from 36 bits to 40 bits.
Note, systems where physical addresses can be 256TiB or more are affected.
[ Adrian: Amend commit message as suggested by Dave Hansen ]
Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
Signed-off-by: Marco Cavenati <[email protected]>
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Adrian Hunter <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
In free_pagetable() we use the non-atomic version for clearing the
PageReserved bit from the page. free_pagetable() will either call
free_reserved_page() or put_page_bootmem(), which will eventually end up
calling free_reserved_page(), and in there we already clear the
PageReserved flag.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide:
remove pXd_huge()"), however there are still three comments referencing
the API that got overlooked. Remove them.
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Xu <[email protected]>
Reported-by: Christophe Leroy <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
x86_64 is already using the node's cpu as maximum threads. Make that the
default for all archs setting DEFERRED_STRUCT_PAGE_INIT.
This returns to the behavior prior making the function arch-specific with
commit ecd096506922 ("mm: make deferred init's max threads
arch-specific").
Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms
shows faster deferred_init_memmap completions:
| | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
| | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB |
| | 8cpus | 8cpus | 8cpus | 32cpus |
|---------|-------------|--------------|-----------------|--------------|
| threads | ms (%) | ms (%) | ms (%) | ms (%) |
|---------|-------------|--------------|-----------------|--------------|
| 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) |
| cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) |
Michael Ellerman reported:
: On a machine here (1TB, 40 cores, 4KB pages) the existing code gives:
:
: [ 0.500124] node 2 deferred pages initialised in 210ms
: [ 0.515790] node 3 deferred pages initialised in 230ms
: [ 0.516061] node 0 deferred pages initialised in 230ms
: [ 0.516522] node 7 deferred pages initialised in 230ms
: [ 0.516672] node 4 deferred pages initialised in 230ms
: [ 0.516798] node 6 deferred pages initialised in 230ms
: [ 0.517051] node 5 deferred pages initialised in 230ms
: [ 0.523887] node 1 deferred pages initialised in 240ms
:
: vs with the patch:
:
: [ 0.379613] node 0 deferred pages initialised in 90ms
: [ 0.380388] node 1 deferred pages initialised in 90ms
: [ 0.380540] node 4 deferred pages initialised in 100ms
: [ 0.390239] node 6 deferred pages initialised in 100ms
: [ 0.390249] node 2 deferred pages initialised in 100ms
: [ 0.390786] node 3 deferred pages initialised in 110ms
: [ 0.396721] node 5 deferred pages initialised in 110ms
: [ 0.397095] node 7 deferred pages initialised in 110ms
:
: Which is a nice speedup.
[[email protected]: v3]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Eric Chanudet <[email protected]>
Tested-by: Michael Ellerman <[email protected]> (powerpc)
Reviewed-by: Baoquan He <[email protected]>
Acked-by: Alexander Gordeev <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov (AMD) <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
|
|
Including hrtimer.h is not required and is probably a historical
leftover. Remove it.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The typedef gtod_long_t is not used anymore so remove it.
The header file contains then only includes dependent on
CONFIG_GENERIC_GETTIMEOFDAY to not break ARCH=um. Nevertheless, keep the
header file only with those includes to prevent spreading ifdeffery all
over the place.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Replace the reference to the non-existent function arch_vdso_cycles_valid()
by the proper function arch_vdso_cycles_ok().
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Since userspace state is saved in the MM process, kernel using
FPU still doesn't really need to do anything, so this really
is as simple as enabling preemption. The irq critical section
in sigio_handler() needs preempt_disable()/preempt_enable().
Signed-off-by: Anton Ivanov <[email protected]>
Link: https://patch.msgid.link/20240702102549.d2fcea450854.I12f5a53d80ec1e425e66ef272b1e95cb523b608e@changeid
[rebase, remove FPU save/restore, fix x86/um Makefile,
rewrite commit message]
Signed-off-by: Johannes Berg <[email protected]>
|
|
The kernel flushes the memory ranges anyway for CoW and does not assume
that the userspace process has anything set up already. So, start with a
fresh process for the new mm context.
Signed-off-by: Benjamin Berg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
The current LDT code has a few issues that mean it should be redone in a
different way once we always start with a fresh MM even when cloning.
In a new and better world, the kernel would just ensure its own LDT is
clear at startup. At that point, all that is needed is a simple function
to populate the LDT from another MM in arch_dup_mmap combined with some
tracking of the installed LDT entries for each MM.
Note that the old implementation was even incorrect with regard to
reading, as it copied out the LDT entries in the internal format rather
than converting them to the userspace structure.
Removal should be fine as the LDT is not used for thread-local storage
anymore.
Signed-off-by: Benjamin Berg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Rework syscall handling to be platform independent. Also create a clean
split between queueing of syscalls and flushing them out, removing the
need to keep state in the code that triggers the syscalls.
The code adds syscall_data_len to the global mm_id structure. This will
be used later to allow surrounding code to track whether syscalls still
need to run and if errors occurred.
Signed-off-by: Benjamin Berg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
This function will be used by the new syscall handling code.
Signed-off-by: Benjamin Berg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Further commits will require values from common-offsets.h inside
stub-data.h. Resolve the possible circular dependency and simply use
offsetof() inside stub_32.h and stub_64.h.
Signed-off-by: Benjamin Berg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
When BHI mitigation is enabled, if SYSENTER is invoked with the TF flag set
then entry_SYSENTER_compat() uses CLEAR_BRANCH_HISTORY and calls the
clear_bhb_loop() before the TF flag is cleared. This causes the #DB handler
(exc_debug_kernel()) to issue a warning because single-step is used outside the
entry_SYSENTER_compat() function.
To address this issue, entry_SYSENTER_compat() should use CLEAR_BRANCH_HISTORY
after making sure the TF flag is cleared.
The problem can be reproduced with the following sequence:
$ cat sysenter_step.c
int main()
{ asm("pushf; pop %ax; bts $8,%ax; push %ax; popf; sysenter"); }
$ gcc -o sysenter_step sysenter_step.c
$ ./sysenter_step
Segmentation fault (core dumped)
The program is expected to crash, and the #DB handler will issue a warning.
Kernel log:
WARNING: CPU: 27 PID: 7000 at arch/x86/kernel/traps.c:1009 exc_debug_kernel+0xd2/0x160
...
RIP: 0010:exc_debug_kernel+0xd2/0x160
...
Call Trace:
<#DB>
? show_regs+0x68/0x80
? __warn+0x8c/0x140
? exc_debug_kernel+0xd2/0x160
? report_bug+0x175/0x1a0
? handle_bug+0x44/0x90
? exc_invalid_op+0x1c/0x70
? asm_exc_invalid_op+0x1f/0x30
? exc_debug_kernel+0xd2/0x160
exc_debug+0x43/0x50
asm_exc_debug+0x1e/0x40
RIP: 0010:clear_bhb_loop+0x0/0xb0
...
</#DB>
<TASK>
? entry_SYSENTER_compat_after_hwframe+0x6e/0x8d
</TASK>
[ bp: Massage commit message. ]
Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
Reported-by: Suman Maity <[email protected]>
Signed-off-by: Alexandre Chartre <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Andrew Cooper <[email protected]>
Reviewed-by: Pawan Gupta <[email protected]>
Reviewed-by: Josh Poimboeuf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The Rust compiler can take a target config from 'target.json', which is
generated by scripts/generate_rust_target.rs. It used to be that all
Linux architectures used this to generate a target.json, but now
architectures must opt-in to this, or they will default to the Rust
compiler's built-in target definition.
This is mostly okay for (64-bit) x86 and UML, except that it can
generate SSE instructions, which we can't use in the kernel. So
re-instate the custom target.json, which disables SSE (and generally
enables the 'soft-float' feature). This fixes the following compile
error:
error: <unknown>:0:0: in function _RNvMNtCs5QSdWC790r4_4core3f32f7next_up float (float): SSE register return with SSE disabled
Fixes: f82811e22b48 ("rust: Refactor the build target to allow the use of builtin targets")
Signed-off-by: David Gow <[email protected]>
Reviewed-by: Boqun Feng <[email protected]>
Tested-by: Miguel Ojeda <[email protected]>
Reviewed-by: Miguel Ojeda <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
There are two separate checks in prctl_enable_tagged_addr() that nr_bits
is in the correct range. The checks are arranged such the correct case
is sandwiched between both error cases, which do exactly the same thing.
Simplify the if condition and pull the correct case outside with the
rest of the success code path.
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
|
|
LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm. That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.
The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.
This de-synchronization is currently harmless. But it is confusing and
might lead to warnings or real bugs.
Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask
instead of an mm_struct pointer, and while we are at it, rename it to
cpu_tlbstate_update_lam(). This should also make it clearer that we are
updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once
and use it for both the cpu_tlbstate update and the CR3 update.
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
|
|
LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm.
If LAM is enabled by a userspace process while a kthread is using its
mm, the kthread will not observe LAM enablement (i.e. LAM will be
disabled in CR3). This could be fine for the kthread itself, as LAM only
affects userspace addresses. However, if the kthread context switches to
a thread in the same userspace process, CR3 may or may not be updated
because the mm_struct doesn't change (based on pending TLB flushes). If
CR3 is not updated, the userspace thread will run incorrectly with LAM
disabled, which may cause page faults when using tagged addresses.
Example scenario:
CPU 1 CPU 2
/* kthread */
kthread_use_mm()
/* user thread */
prctl_enable_tagged_addr()
/* LAM enabled on CPU 2 */
/* LAM disabled on CPU 1 */
context_switch() /* to CPU 1 */
/* Switching to user thread */
switch_mm_irqs_off()
/* CR3 not updated */
/* LAM is still disabled on CPU 1 */
Synchronize LAM enablement by sending an IPI to all CPUs running with
the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1
in the above scenario before prctl_enable_tagged_addr() returns and
userspace starts using tagged addresses, and before it's possible to
run the userspace process on CPU 1.
In switch_mm_irqs_off(), move reading the LAM mask until after
mm_cpumask() is updated. This ensures that if an outdated LAM mask is
written to CR3, an IPI is received to update it right after IRQs are
re-enabled.
[ dhansen: Add a LAM enabling helper and comment it ]
Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch")
Suggested-by: Andy Lutomirski <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Kirill A. Shutemov <[email protected]>
Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
|
|
There isn't a simple hardware bit that indicates whether a CPU is running in
Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the number of CPUs
sharing the L3 cache with CPU0 to the number of CPUs in the same NUMA node as
CPU0.
Add the missing definition of pr_fmt() to monitor.c. This wasn't noticed
before as there are only "can't happen" console messages from this file.
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Hardware has two RMID configuration options for SNC systems. The default
mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and
two SNC nodes per L3 cache RMIDs 0..99 are used on node 0, and 100..199
on node 1. This isn't compatible with Linux resctrl usage. On this
example system a process using RMID 5 would only update monitor counters
while running on SNC node 0.
The other mode is "RMID Sharing Mode". This is enabled by clearing bit
0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode
the number of logical RMIDs is the number of physical RMIDs (from CPUID
leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A
process can use the same RMID across different SNC nodes.
See the "Intel Resource Director Technology Architecture Specification"
for additional details.
When SNC is enabled, update the MSR when a monitor domain is marked
online. Technically this is overkill. It only needs to be done once
per L3 cache instance rather than per SNC domain. But there is no harm
in doing it more than once, and this is not in a critical path.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Legacy resctrl monitor files must provide the sum of event values across
all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance.
There are now two cases:
1) A specific domain is provided in struct rmid_read
This is either a non-SNC system, or the request is to read data
from just one SNC node.
2) Domain pointer is NULL. In this case the cacheinfo field in struct
rmid_read indicates that all SNC nodes that share that L3 cache
instance should have the event read and return the sum of all
values.
Update the CPU sanity check. The existing check that an event is read
from a CPU in the requested domain still applies when reading a single
domain. But when summing across domains a more relaxed check that the
current CPU is in the scope of the L3 cache instance is appropriate
since the MSRs to read events are scoped at L3 cache level.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
mon_event_read() fills out most fields of the struct rmid_read that is passed
via an smp_call*() function to a CPU that is part of the correct domain to
read the monitor counters.
With Sub-NUMA Cluster (SNC) mode there are now two cases to handle:
1) Reading a file that returns a value for a single domain.
+ Choose the CPU to execute from the domain cpu_mask
2) Reading a file that must sum across domains sharing an L3 cache
instance.
+ Indicate to called code that a sum is needed by passing a NULL
rdt_mon_domain pointer.
+ Choose the CPU from the L3 shared_cpu_map.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In SNC mode, there are multiple subdirectories in each L3 level monitor
directory (one for each SNC node). If all the CPUs in an SNC node are taken
offline, just remove the SNC directory for that node. In non-SNC mode, or when
the last SNC node directory is removed, remove the L3 monitor directory.
Add a helper function to avoid duplicated code.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When SNC mode is enabled, create subdirectories and files to monitor at the SNC
node granularity. Legacy behavior is preserved by tagging the monitor files at
the L3 granularity with the "sum" attribute. When the user reads these files
the kernel will read monitor data from all SNC nodes that share the same L3
cache instance and return the aggregated value to the user.
Note that the "domid" field for files that must sum across SNC domains has the
L3 cache instance id, while non-summing files use the domain id.
The "sum" files do not need to make a call to mon_event_read() to initialize
the MBM counters. This will be handled by initializing the individual SNC nodes
that share the L3.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files
must report the sum of the data from all of the SNC nodes that share the L3
cache that is referenced by the monitor file.
Resctrl squeezes all the attributes of these files into 32 bits so they can be
stored in the "priv" field of struct kernfs_node.
Currently, only three monitor events are defined by enum resctrl_event_id so
reducing it from 8 bits to 7 bits still provides more than enough space to
represent all the known event types.
But note that this choice was arbitrary. The "rid" field is also far wider
than needed for the current number of resource id types. This structure is
purely internal to resctrl, no ABI issues with modifying it. Subsequent changes
may rearrange the allocation of bits between each of the fields as needed.
Give the bit to a new "sum" field that indicates that reading this file must
sum across SNC nodes. This bit also indicates that the domid field is the id of
an L3 cache (instead of a domain id) to find which domains must be summed.
Fix up other issues in the kerneldoc description for mon_data_bits.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor
files in the original "mon_L3_XX" directories and also in each
of the "mon_sub_L3_YY" directories.
Refactor mkdir_mondata_subdir() to move the creation of monitoring files
into a helper function to avoid the need to duplicate code later.
No functional change.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
New semantics rely on some struct rmid_read members having NULL values to
distinguish between the SNC and non-SNC scenarios. resctrl can thus no longer
rely on this struct not being initialized properly.
Initialize all on-stack declarations of struct rmid_read:
rdtgroup_mondata_show()
mbm_update()
mkdir_mondata_subdir()
to ensure that garbage values from the stack are not passed down to other
functions.
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read()
to package up all the required details into an rmid_read structure which is
passed across the smp_call*() infrastructure to code that will read data from
hardware and return the value (or error status) in the rmid_read structure.
Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the
smp_call-ed code to sum event data from all domains that share an L3 cache.
Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data
collection routines to use to pick the domains to be summed.
[ Reinette: the rmid_read structure has become complex enough so document each
of its fields and provide the kerneldoc documentation for struct rmid_read. ]
Co-developed-by: Reinette Chatre <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When SNC is enabled, monitoring data is collected at the SNC node granularity,
but must be reported at L3-cache granularity for backwards compatibility in
addition to reporting at the node level.
Add a "ci" field to the rdt_mon_domain structure to save the cache information
about the enclosing L3 cache for the domain. This provides:
1) The cache id which is needed to compose the name of the legacy monitoring
directory, and to determine which domains should be summed to provide
L3-scoped data.
2) The shared_cpu_map which is needed to determine which CPUs can be used to
read the RMID counters with the MSR interface.
This is the first step to an eventual goal of monitor reporting files like this
(for a system with two SNC nodes per L3):
$ cd /sys/fs/resctrl/mon_data
$ tree mon_L3_00
mon_L3_00 <- 00 here is L3 cache id
├── llc_occupancy \ These files provide legacy support
├── mbm_local_bytes > for non-SNC aware monitor apps
├── mbm_total_bytes / that expect data at L3 cache level
├── mon_sub_L3_00 <- 00 here is SNC node id
│ ├── llc_occupancy \ These files are finer grained
│ ├── mbm_local_bytes > data from each SNC node
│ └── mbm_total_bytes /
└── mon_sub_L3_01
├── llc_occupancy \
├── mbm_local_bytes > As above, but for node 1.
└── mbm_total_bytes /
[ bp: Massage commit message. ]
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
systems
When SNC is enabled there is a mismatch between the MBA control function
which operates at L3 cache scope and the MBM monitor functions which
measure memory bandwidth on each SNC node.
Block use of the mba_MBps when scopes for MBA/MBM do not match.
Improve user diagnostics by adding invalfc() message when mba_MBps
is not supported.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
and memory controllers on a socket into two or more groups. These are
presented to the operating system as NUMA nodes.
This may enable some workloads to have slightly lower latency to memory
as the memory controller(s) in an SNC node are electrically closer to the
CPU cores on that SNC node. This cost may be offset by lower bandwidth
since the memory accesses for each core can only be interleaved between
the memory controllers on the same SNC node.
Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
to track L3 cache occupancy and memory bandwidth. There is an MSR that
controls how the RMIDs are shared between SNC nodes.
The default mode divides them numerically. E.g. when there are two SNC
nodes on a socket the lower number half of the RMIDs are given to the
first node, the remainder to the second node. This would be difficult
to use with the Linux resctrl interface as specific RMID values assigned
to resctrl groups are not visible to users.
RMID sharing mode divides the physical RMIDs evenly between SNC nodes
but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
SNC nodes per L3 cache instance would have 100 logical RMIDs available
for Linux to use. A task running on SNC node 0 with RMID 5 would
accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
Another task using RMID 5, but running on SNC node 1 would accumulate
data in physical RMID 105.
Even with this renumbering SNC mode requires several changes in resctrl
behavior for correct operation.
Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
how many SNC domains share an L3 cache instance. Initialize this to
"1". Runtime detection of SNC mode will adjust this value.
Update all places to take appropriate action when SNC mode is enabled:
1) The number of logical RMIDs per L3 cache available for use is the
number of physical RMIDs divided by the number of SNC nodes.
2) Likewise the "mon_scale" value must be divided by the number of SNC
nodes.
3) Add a function to convert from logical RMID values (assigned to
tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
to physical RMID values to load into IA32_QM_EVTSEL MSR when
reading counters on each SNC node.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Currently supported resctrl features are all domain scoped the same as the
scope of the L2 or L3 caches.
Add RESCTRL_L3_NODE as a new option for features that are scoped at the
same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA
Cluster (SNC) feature where monitoring features are divided between
nodes that share an L3 cache.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The same rdt_domain structure is used for both control and monitor
functions. But this results in wasted memory as some of the fields are
only used by control functions, while most are only used for monitor
functions.
Split into separate rdt_ctrl_domain and rdt_mon_domain structures with
just the fields required for control and monitoring respectively.
Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain
and rdt_hw_mon_domain.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Resctrl assumes that control and monitor operations on a resource are
performed at the same scope.
Prepare for systems that use different scope (specifically Intel needs
to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control
and NODE scope for cache occupancy and memory bandwidth monitoring).
Create separate domain lists for control and monitor operations.
Note that errors during initialization of either control or monitor
functions on a domain would previously result in that domain being
excluded from both control and monitor operations. Now the domains are
allocated independently it is no longer required to disable both control
and monitor operations if either fail.
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Babu Moger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|