aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2024-07-04perf/x86/intel/pt: Fix a topa_entry base address calculationAdrian Hunter1-1/+1
topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit type, even if the underlying type is 64-bit, and so, if necessary, must be cast to a larger type when calculations are done. Fix a topa_entry->base address calculation by adding a cast. Without the cast, the address was limited to 36-bits i.e. 64GiB. The address calculation is used on systems that do not support Multiple Entry ToPA (only Broadwell), and affects physical addresses on or above 64GiB. Instead of writing to the correct address, the address comprising the first 36 bits would be written to. Intel PT snapshot and sampling modes are not affected. Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Reported-by: Dave Hansen <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-07-04perf/x86/intel/pt: Fix topa_entry base lengthMarco Cavenati1-2/+2
topa_entry->base needs to store a pfn. It obviously needs to be large enough to store the largest possible x86 pfn which is MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small. Increase the size of topa_entry->base from 36 bits to 40 bits. Note, systems where physical addresses can be 256TiB or more are affected. [ Adrian: Amend commit message as suggested by Dave Hansen ] Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver") Signed-off-by: Marco Cavenati <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-07-03arch/x86: do not explicitly clear Reserved flag in free_pagetableOscar Salvador1-2/+0
In free_pagetable() we use the non-atomic version for clearing the PageReserved bit from the page. free_pagetable() will either call free_reserved_page() or put_page_bootmem(), which will eventually end up calling free_reserved_page(), and in there we already clear the PageReserved flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oscar Salvador <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Dave Hansen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm: drop leftover comment references to pxx_huge()Peter Xu1-2/+2
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide: remove pXd_huge()"), however there are still three comments referencing the API that got overlooked. Remove them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Reported-by: Christophe Leroy <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03mm/mm_init: use node's number of cpus in deferred_page_init_max_threadsEric Chanudet1-12/+0
x86_64 is already using the node's cpu as maximum threads. Make that the default for all archs setting DEFERRED_STRUCT_PAGE_INIT. This returns to the behavior prior making the function arch-specific with commit ecd096506922 ("mm: make deferred init's max threads arch-specific"). Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms shows faster deferred_init_memmap completions: | | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 | | | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB | | | 8cpus | 8cpus | 8cpus | 32cpus | |---------|-------------|--------------|-----------------|--------------| | threads | ms (%) | ms (%) | ms (%) | ms (%) | |---------|-------------|--------------|-----------------|--------------| | 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) | | cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) | Michael Ellerman reported: : On a machine here (1TB, 40 cores, 4KB pages) the existing code gives: : : [ 0.500124] node 2 deferred pages initialised in 210ms : [ 0.515790] node 3 deferred pages initialised in 230ms : [ 0.516061] node 0 deferred pages initialised in 230ms : [ 0.516522] node 7 deferred pages initialised in 230ms : [ 0.516672] node 4 deferred pages initialised in 230ms : [ 0.516798] node 6 deferred pages initialised in 230ms : [ 0.517051] node 5 deferred pages initialised in 230ms : [ 0.523887] node 1 deferred pages initialised in 240ms : : vs with the patch: : : [ 0.379613] node 0 deferred pages initialised in 90ms : [ 0.380388] node 1 deferred pages initialised in 90ms : [ 0.380540] node 4 deferred pages initialised in 100ms : [ 0.390239] node 6 deferred pages initialised in 100ms : [ 0.390249] node 2 deferred pages initialised in 100ms : [ 0.390786] node 3 deferred pages initialised in 110ms : [ 0.396721] node 5 deferred pages initialised in 110ms : [ 0.397095] node 7 deferred pages initialised in 110ms : : Which is a nice speedup. [[email protected]: v3] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric Chanudet <[email protected]> Tested-by: Michael Ellerman <[email protected]> (powerpc) Reviewed-by: Baoquan He <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Acked-by: Mike Rapoport (IBM) <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-07-03x86/vdso: Remove unused includeAnna-Maria Behnsen1-1/+0
Including hrtimer.h is not required and is probably a historical leftover. Remove it. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03x86/vgtod: Remove unused typedef gtod_long_tAnna-Maria Behnsen1-5/+0
The typedef gtod_long_t is not used anymore so remove it. The header file contains then only includes dependent on CONFIG_GENERIC_GETTIMEOFDAY to not break ARCH=um. Nevertheless, keep the header file only with those includes to prevent spreading ifdeffery all over the place. Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03x86/vdso: Fix function reference in commentAnna-Maria Behnsen1-3/+2
Replace the reference to the non-existent function arch_vdso_cycles_valid() by the proper function arch_vdso_cycles_ok(). Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Vincenzo Frascino <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03um: Enable preemption in UMLAnton Ivanov1-1/+0
Since userspace state is saved in the MM process, kernel using FPU still doesn't really need to do anything, so this really is as simple as enabling preemption. The irq critical section in sigio_handler() needs preempt_disable()/preempt_enable(). Signed-off-by: Anton Ivanov <[email protected]> Link: https://patch.msgid.link/20240702102549.d2fcea450854.I12f5a53d80ec1e425e66ef272b1e95cb523b608e@changeid [rebase, remove FPU save/restore, fix x86/um Makefile, rewrite commit message] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: remove copy_context_skas0Benjamin Berg3-51/+0
The kernel flushes the memory ranges anyway for CoW and does not assume that the userspace process has anything set up already. So, start with a fresh process for the new mm context. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: remove LDT supportBenjamin Berg4-438/+2
The current LDT code has a few issues that mean it should be redone in a different way once we always start with a fresh MM even when cloning. In a new and better world, the kernel would just ensure its own LDT is clear at startup. At that point, all that is needed is a simple function to populate the LDT from another MM in arch_dup_mmap combined with some tracking of the installed LDT entries for each MM. Note that the old implementation was even incorrect with regard to reading, as it copied out the LDT entries in the internal format rather than converting them to the userspace structure. Removal should be fine as the LDT is not used for thread-local storage anymore. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Rework syscall handlingBenjamin Berg5-134/+16
Rework syscall handling to be platform independent. Also create a clean split between queueing of syscalls and flushing them out, removing the need to keep state in the code that triggers the syscalls. The code adds syscall_data_len to the global mm_id structure. This will be used later to allow surrounding code to track whether syscalls still need to run and if errors occurred. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Add generic stub_syscall6 functionBenjamin Berg2-0/+38
This function will be used by the new syscall handling code. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03um: Remove stub-data.h include from common-offsets.hBenjamin Berg2-6/+8
Further commits will require values from common-offsets.h inside stub-data.h. Resolve the possible circular dependency and simply use offsetof() inside stub_32.h and stub_64.h. Signed-off-by: Benjamin Berg <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-03x86/bhi: Avoid warning in #DB handler due to BHI mitigationAlexandre Chartre1-4/+10
When BHI mitigation is enabled, if SYSENTER is invoked with the TF flag set then entry_SYSENTER_compat() uses CLEAR_BRANCH_HISTORY and calls the clear_bhb_loop() before the TF flag is cleared. This causes the #DB handler (exc_debug_kernel()) to issue a warning because single-step is used outside the entry_SYSENTER_compat() function. To address this issue, entry_SYSENTER_compat() should use CLEAR_BRANCH_HISTORY after making sure the TF flag is cleared. The problem can be reproduced with the following sequence: $ cat sysenter_step.c int main() { asm("pushf; pop %ax; bts $8,%ax; push %ax; popf; sysenter"); } $ gcc -o sysenter_step sysenter_step.c $ ./sysenter_step Segmentation fault (core dumped) The program is expected to crash, and the #DB handler will issue a warning. Kernel log: WARNING: CPU: 27 PID: 7000 at arch/x86/kernel/traps.c:1009 exc_debug_kernel+0xd2/0x160 ... RIP: 0010:exc_debug_kernel+0xd2/0x160 ... Call Trace: <#DB> ? show_regs+0x68/0x80 ? __warn+0x8c/0x140 ? exc_debug_kernel+0xd2/0x160 ? report_bug+0x175/0x1a0 ? handle_bug+0x44/0x90 ? exc_invalid_op+0x1c/0x70 ? asm_exc_invalid_op+0x1f/0x30 ? exc_debug_kernel+0xd2/0x160 exc_debug+0x43/0x50 asm_exc_debug+0x1e/0x40 RIP: 0010:clear_bhb_loop+0x0/0xb0 ... </#DB> <TASK> ? entry_SYSENTER_compat_after_hwframe+0x6e/0x8d </TASK> [ bp: Massage commit message. ] Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry") Reported-by: Suman Maity <[email protected]> Signed-off-by: Alexandre Chartre <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Andrew Cooper <[email protected]> Reviewed-by: Pawan Gupta <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-03arch: um: rust: Use the generated target.json againDavid Gow1-0/+1
The Rust compiler can take a target config from 'target.json', which is generated by scripts/generate_rust_target.rs. It used to be that all Linux architectures used this to generate a target.json, but now architectures must opt-in to this, or they will default to the Rust compiler's built-in target definition. This is mostly okay for (64-bit) x86 and UML, except that it can generate SSE instructions, which we can't use in the kernel. So re-instate the custom target.json, which disables SSE (and generally enables the 'soft-float' feature). This fixes the following compile error: error: <unknown>:0:0: in function _RNvMNtCs5QSdWC790r4_4core3f32f7next_up float (float): SSE register return with SSE disabled Fixes: f82811e22b48 ("rust: Refactor the build target to allow the use of builtin targets") Signed-off-by: David Gow <[email protected]> Reviewed-by: Boqun Feng <[email protected]> Tested-by: Miguel Ojeda <[email protected]> Reviewed-by: Miguel Ojeda <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2024-07-02x86/mm: Cleanup prctl_enable_tagged_addr() nr_bits error checkingYosry Ahmed1-7/+4
There are two separate checks in prctl_enable_tagged_addr() that nr_bits is in the correct range. The checks are arranged such the correct case is sandwiched between both error cases, which do exactly the same thing. Simplify the if condition and pull the correct case outside with the rest of the success code path. Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-4-yosryahmed%40google.com
2024-07-02x86/mm: Fix LAM inconsistency during context switchYosry Ahmed4-11/+20
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. That means that a context-switching kernel thread can race and observe the mm's LAM metadata (mm->context.lam_cr3_mask) change. The context switch code does two logical things with that metadata: populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race, 'cpu_tlbstate.lam' and CR3 can end up out of sync. This de-synchronization is currently harmless. But it is confusing and might lead to warnings or real bugs. Update set_tlbstate_lam_mode() to take in the LAM mask and untag mask instead of an mm_struct pointer, and while we are at it, rename it to cpu_tlbstate_update_lam(). This should also make it clearer that we are updating cpu_tlbstate. In switch_mm_irqs_off(), read the LAM mask once and use it for both the cpu_tlbstate update and the CR3 update. Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-3-yosryahmed%40google.com
2024-07-02x86/mm: Use IPIs to synchronize LAM enablementYosry Ahmed2-7/+29
LAM can only be enabled when a process is single-threaded. But _kernel_ threads can temporarily use a single-threaded process's mm. If LAM is enabled by a userspace process while a kthread is using its mm, the kthread will not observe LAM enablement (i.e. LAM will be disabled in CR3). This could be fine for the kthread itself, as LAM only affects userspace addresses. However, if the kthread context switches to a thread in the same userspace process, CR3 may or may not be updated because the mm_struct doesn't change (based on pending TLB flushes). If CR3 is not updated, the userspace thread will run incorrectly with LAM disabled, which may cause page faults when using tagged addresses. Example scenario: CPU 1 CPU 2 /* kthread */ kthread_use_mm() /* user thread */ prctl_enable_tagged_addr() /* LAM enabled on CPU 2 */ /* LAM disabled on CPU 1 */ context_switch() /* to CPU 1 */ /* Switching to user thread */ switch_mm_irqs_off() /* CR3 not updated */ /* LAM is still disabled on CPU 1 */ Synchronize LAM enablement by sending an IPI to all CPUs running with the mm_struct to enable LAM. This makes sure LAM is enabled on CPU 1 in the above scenario before prctl_enable_tagged_addr() returns and userspace starts using tagged addresses, and before it's possible to run the userspace process on CPU 1. In switch_mm_irqs_off(), move reading the LAM mask until after mm_cpumask() is updated. This ensures that if an outdated LAM mask is written to CR3, an IPI is received to update it right after IRQs are re-enabled. [ dhansen: Add a LAM enabling helper and comment it ] Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch") Suggested-by: Andy Lutomirski <[email protected]> Signed-off-by: Yosry Ahmed <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reviewed-by: Kirill A. Shutemov <[email protected]> Link: https://lore.kernel.org/all/20240702132139.3332013-2-yosryahmed%40google.com
2024-07-02x86/resctrl: Detect Sub-NUMA Cluster (SNC) modeTony Luck1-0/+66
There isn't a simple hardware bit that indicates whether a CPU is running in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in the same NUMA node as CPU0. Add the missing definition of pr_fmt() to monitor.c. This wasn't noticed before as there are only "can't happen" console messages from this file. [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Enable shared RMID mode on Sub-NUMA Cluster (SNC) systemsTony Luck4-0/+25
Hardware has two RMID configuration options for SNC systems. The default mode divides RMID counters between SNC nodes. E.g. with 200 RMIDs and two SNC nodes per L3 cache RMIDs 0..99 are used on node 0, and 100..199 on node 1. This isn't compatible with Linux resctrl usage. On this example system a process using RMID 5 would only update monitor counters while running on SNC node 0. The other mode is "RMID Sharing Mode". This is enabled by clearing bit 0 of the RMID_SNC_CONFIG (0xCA0) model specific register. In this mode the number of logical RMIDs is the number of physical RMIDs (from CPUID leaf 0xF) divided by the number of SNC nodes per L3 cache instance. A process can use the same RMID across different SNC nodes. See the "Intel Resource Director Technology Architecture Specification" for additional details. When SNC is enabled, update the MSR when a monitor domain is marked online. Technically this is overkill. It only needs to be done once per L3 cache instance rather than per SNC domain. But there is no harm in doing it more than once, and this is not in a critical path. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Make __mon_event_count() handle sum domainsTony Luck1-9/+42
Legacy resctrl monitor files must provide the sum of event values across all Sub-NUMA Cluster (SNC) domains that share an L3 cache instance. There are now two cases: 1) A specific domain is provided in struct rmid_read This is either a non-SNC system, or the request is to read data from just one SNC node. 2) Domain pointer is NULL. In this case the cacheinfo field in struct rmid_read indicates that all SNC nodes that share that L3 cache instance should have the event read and return the sum of all values. Update the CPU sanity check. The existing check that an event is read from a CPU in the requested domain still applies when reading a single domain. But when summing across domains a more relaxed check that the current CPU is in the scope of the L3 cache instance is appropriate since the MSRs to read events are scoped at L3 cache level. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Fill out rmid_read structure for smp_call*() to read a counterTony Luck3-10/+34
mon_event_read() fills out most fields of the struct rmid_read that is passed via an smp_call*() function to a CPU that is part of the correct domain to read the monitor counters. With Sub-NUMA Cluster (SNC) mode there are now two cases to handle: 1) Reading a file that returns a value for a single domain. + Choose the CPU to execute from the domain cpu_mask 2) Reading a file that must sum across domains sharing an L3 cache instance. + Indicate to called code that a sum is needed by passing a NULL rdt_mon_domain pointer. + Choose the CPU from the L3 shared_cpu_map. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Handle removing directories in Sub-NUMA Cluster (SNC) modeTony Luck1-6/+29
In SNC mode, there are multiple subdirectories in each L3 level monitor directory (one for each SNC node). If all the CPUs in an SNC node are taken offline, just remove the SNC directory for that node. In non-SNC mode, or when the last SNC node directory is removed, remove the L3 monitor directory. Add a helper function to avoid duplicated code. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Create Sub-NUMA Cluster (SNC) monitor filesTony Luck1-16/+46
When SNC mode is enabled, create subdirectories and files to monitor at the SNC node granularity. Legacy behavior is preserved by tagging the monitor files at the L3 granularity with the "sum" attribute. When the user reads these files the kernel will read monitor data from all SNC nodes that share the same L3 cache instance and return the aggregated value to the user. Note that the "domid" field for files that must sum across SNC domains has the L3 cache instance id, while non-summing files use the domain id. The "sum" files do not need to make a call to mon_event_read() to initialize the MBM counters. This will be handled by initializing the individual SNC nodes that share the L3. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Allocate a new field in union mon_data_bitsTony Luck1-7/+13
When Sub-NUMA Cluster (SNC) mode is enabled, the legacy monitor reporting files must report the sum of the data from all of the SNC nodes that share the L3 cache that is referenced by the monitor file. Resctrl squeezes all the attributes of these files into 32 bits so they can be stored in the "priv" field of struct kernfs_node. Currently, only three monitor events are defined by enum resctrl_event_id so reducing it from 8 bits to 7 bits still provides more than enough space to represent all the known event types. But note that this choice was arbitrary. The "rid" field is also far wider than needed for the current number of resource id types. This structure is purely internal to resctrl, no ABI issues with modifying it. Subsequent changes may rearrange the allocation of bits between each of the fields as needed. Give the bit to a new "sum" field that indicates that reading this file must sum across SNC nodes. This bit also indicates that the domid field is the id of an L3 cache (instead of a domain id) to find which domains must be summed. Fix up other issues in the kerneldoc description for mon_data_bits. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Refactor mkdir_mondata_subdir() with a helper functionTony Luck1-17/+28
In Sub-NUMA Cluster (SNC) mode Linux must create the monitor files in the original "mon_L3_XX" directories and also in each of the "mon_sub_L3_YY" directories. Refactor mkdir_mondata_subdir() to move the creation of monitoring files into a helper function to avoid the need to duplicate code later. No functional change. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Initialize on-stack struct rmid_read instancesTony Luck3-5/+3
New semantics rely on some struct rmid_read members having NULL values to distinguish between the SNC and non-SNC scenarios. resctrl can thus no longer rely on this struct not being initialized properly. Initialize all on-stack declarations of struct rmid_read: rdtgroup_mondata_show() mbm_update() mkdir_mondata_subdir() to ensure that garbage values from the stack are not passed down to other functions. [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Add a new field to struct rmid_read for summation of domainsTony Luck1-0/+19
When a user reads a monitor file rdtgroup_mondata_show() calls mon_event_read() to package up all the required details into an rmid_read structure which is passed across the smp_call*() infrastructure to code that will read data from hardware and return the value (or error status) in the rmid_read structure. Sub-NUMA Cluster (SNC) mode adds files with new semantics. These require the smp_call-ed code to sum event data from all domains that share an L3 cache. Add a pointer to the L3 "cacheinfo" structure to struct rmid_read for the data collection routines to use to pick the domains to be summed. [ Reinette: the rmid_read structure has become complex enough so document each of its fields and provide the kerneldoc documentation for struct rmid_read. ] Co-developed-by: Reinette Chatre <[email protected]> Signed-off-by: Reinette Chatre <[email protected]> Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor filesTony Luck3-3/+6
When SNC is enabled, monitoring data is collected at the SNC node granularity, but must be reported at L3-cache granularity for backwards compatibility in addition to reporting at the node level. Add a "ci" field to the rdt_mon_domain structure to save the cache information about the enclosing L3 cache for the domain. This provides: 1) The cache id which is needed to compose the name of the legacy monitoring directory, and to determine which domains should be summed to provide L3-scoped data. 2) The shared_cpu_map which is needed to determine which CPUs can be used to read the RMID counters with the MSR interface. This is the first step to an eventual goal of monitor reporting files like this (for a system with two SNC nodes per L3): $ cd /sys/fs/resctrl/mon_data $ tree mon_L3_00 mon_L3_00 <- 00 here is L3 cache id ├── llc_occupancy \ These files provide legacy support ├── mbm_local_bytes > for non-SNC aware monitor apps ├── mbm_total_bytes / that expect data at L3 cache level ├── mon_sub_L3_00 <- 00 here is SNC node id │   ├── llc_occupancy \ These files are finer grained │   ├── mbm_local_bytes > data from each SNC node │   └── mbm_total_bytes / └── mon_sub_L3_01 ├── llc_occupancy \ ├── mbm_local_bytes > As above, but for node 1. └── mbm_total_bytes / [ bp: Massage commit message. ] Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Block use of mba_MBps mount option on Sub-NUMA Cluster (SNC) ↵Tony Luck1-3/+9
systems When SNC is enabled there is a mismatch between the MBA control function which operates at L3 cache scope and the MBM monitor functions which measure memory bandwidth on each SNC node. Block use of the mba_MBps when scopes for MBA/MBM do not match. Improve user diagnostics by adding invalfc() message when mba_MBps is not supported. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Introduce snc_nodes_per_l3_cacheTony Luck1-6/+50
Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores and memory controllers on a socket into two or more groups. These are presented to the operating system as NUMA nodes. This may enable some workloads to have slightly lower latency to memory as the memory controller(s) in an SNC node are electrically closer to the CPU cores on that SNC node. This cost may be offset by lower bandwidth since the memory accesses for each core can only be interleaved between the memory controllers on the same SNC node. Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks to track L3 cache occupancy and memory bandwidth. There is an MSR that controls how the RMIDs are shared between SNC nodes. The default mode divides them numerically. E.g. when there are two SNC nodes on a socket the lower number half of the RMIDs are given to the first node, the remainder to the second node. This would be difficult to use with the Linux resctrl interface as specific RMID values assigned to resctrl groups are not visible to users. RMID sharing mode divides the physical RMIDs evenly between SNC nodes but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two SNC nodes per L3 cache instance would have 100 logical RMIDs available for Linux to use. A task running on SNC node 0 with RMID 5 would accumulate LLC occupancy and MBM bandwidth data in physical RMID 5. Another task using RMID 5, but running on SNC node 1 would accumulate data in physical RMID 105. Even with this renumbering SNC mode requires several changes in resctrl behavior for correct operation. Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate how many SNC domains share an L3 cache instance. Initialize this to "1". Runtime detection of SNC mode will adjust this value. Update all places to take appropriate action when SNC mode is enabled: 1) The number of logical RMIDs per L3 cache available for use is the number of physical RMIDs divided by the number of SNC nodes. 2) Likewise the "mon_scale" value must be divided by the number of SNC nodes. 3) Add a function to convert from logical RMID values (assigned to tasks and loaded into the IA32_PQR_ASSOC MSR on context switch) to physical RMID values to load into IA32_QM_EVTSEL MSR when reading counters on each SNC node. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Add node-scope to the options for feature scopeTony Luck1-0/+2
Currently supported resctrl features are all domain scoped the same as the scope of the L2 or L3 caches. Add RESCTRL_L3_NODE as a new option for features that are scoped at the same granularity as NUMA nodes. This is needed for Intel's Sub-NUMA Cluster (SNC) feature where monitoring features are divided between nodes that share an L3 cache. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Split the rdt_domain and rdt_hw_domain structuresTony Luck6-125/+146
The same rdt_domain structure is used for both control and monitor functions. But this results in wasted memory as some of the fields are only used by control functions, while most are only used for monitor functions. Split into separate rdt_ctrl_domain and rdt_mon_domain structures with just the fields required for control and monitoring respectively. Similar split of the rdt_hw_domain structure into rdt_hw_ctrl_domain and rdt_hw_mon_domain. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for different scope for control/monitor operationsTony Luck6-90/+221
Resctrl assumes that control and monitor operations on a resource are performed at the same scope. Prepare for systems that use different scope (specifically Intel needs to split the RDT_RESOURCE_L3 resource to use L3 scope for cache control and NODE scope for cache occupancy and memory bandwidth monitoring). Create separate domain lists for control and monitor operations. Note that errors during initialization of either control or monitor functions on a domain would previously result in that domain being excluded from both control and monitor operations. Now the domains are allocated independently it is no longer required to disable both control and monitor operations if either fail. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare to split rdt_domain structureTony Luck5-69/+69
The rdt_domain structure is used for both control and monitor features. It is about to be split into separate structures for these two usages because the scope for control and monitoring features for a resource will be different for future resources. To allow for common code that scans a list of domains looking for a specific domain id, move all the common fields ("list", "id", "cpu_mask") into their own structure within the rdt_domain structure. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/resctrl: Prepare for new domain scopeTony Luck4-17/+42
Resctrl resources operate on subsets of CPUs in the system with the defining attribute of each subset being an instance of a particular level of cache. E.g. all CPUs sharing an L3 cache would be part of the same domain. In preparation for features that are scoped at the NUMA node level, change the code from explicit references to "cache_level" to a more generic scope. At this point the only options for this scope are groups of CPUs that share an L2 cache or L3 cache. Clean up the error handling when looking up domains. Report invalid ids before calling rdt_find_domain() in preparation for better messages when scope can be other than cache scope. This means that rdt_find_domain() will never return an error. So remove checks for error from the call sites. Signed-off-by: Tony Luck <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-07-02x86/xen: Convert comma to semicolonChen Ni1-2/+2
Replace a comma between expression statements by a semicolon. Fixes: 8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis") Signed-off-by: Chen Ni <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2024-07-02x86/xen/time: Reduce Xen timer tickFrediano Ziglio1-1/+1
Current timer tick is causing some deadline to fail. The current high value constant was probably due to an old bug in the Xen timer implementation causing errors if the deadline was in the future. This was fixed in Xen commit: 19c6cbd90965 xen/vcpu: ignore VCPU_SSHOTTMR_future Usage of VCPU_SSHOTTMR_future in Linux kernel was removed by: c06b6d70feb3 xen/x86: don't lose event interrupts Signed-off-by: Frediano Ziglio <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2024-07-02x86/efi: Drop support for fake EFI memory mapsArd Biesheuvel8-270/+10
Between kexec and confidential VM support, handling the EFI memory maps correctly on x86 is already proving to be rather difficult (as opposed to other EFI architectures which manage to never modify the EFI memory map to begin with) EFI fake memory map support is essentially a development hack (for testing new support for the 'special purpose' and 'more reliable' EFI memory attributes) that leaked into production code. The regions marked in this manner are not actually recognized as such by the firmware itself or the EFI stub (and never have), and marking memory as 'more reliable' seems rather futile if the underlying memory is just ordinary RAM. Marking memory as 'special purpose' in this way is also dubious, but may be in use in production code nonetheless. However, the same should be achievable by using the memmap= command line option with the ! operator. EFI fake memmap support is not enabled by any of the major distros (Debian, Fedora, SUSE, Ubuntu) and does not exist on other architectures, so let's drop support for it. Acked-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Dan Williams <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2024-07-01x86/alternatives, kvm: Fix a couple of CALLs without a frame pointerBorislav Petkov (AMD)4-7/+10
objtool complains: arch/x86/kvm/kvm.o: warning: objtool: .altinstr_replacement+0xc5: call without frame pointer save/setup vmlinux.o: warning: objtool: .altinstr_replacement+0x2eb: call without frame pointer save/setup Make sure %rSP is an output operand to the respective asm() statements. The test_cc() hunk and ALT_OUTPUT_SP() courtesy of peterz. Also from him add some helpful debugging info to the documentation. Now on to the explanations: tl;dr: The alternatives macros are pretty fragile. If I do ALT_OUTPUT_SP(output) in order to be able to package in a %rsp reference for objtool so that a stack frame gets properly generated, the inline asm input operand with positional argument 0 in clear_page(): "0" (page) gets "renumbered" due to the added : "+r" (current_stack_pointer), "=D" (page) and then gcc says: ./arch/x86/include/asm/page_64.h:53:9: error: inconsistent operand constraints in an ‘asm’ The fix is to use an explicit "D" constraint which points to a singleton register class (gcc terminology) which ends up doing what is expected here: the page pointer - input and output - should be in the same %rdi register. Other register classes have more than one register in them - example: "r" and "=r" or "A": ‘A’ The ‘a’ and ‘d’ registers. This class is used for instructions that return double word results in the ‘ax:dx’ register pair. Single word values will be allocated either in ‘ax’ or ‘dx’. so using "D" and "=D" just works in this particular case. And yes, one would say, sure, why don't you do "+D" but then: : "+r" (current_stack_pointer), "+D" (page) : [old] "i" (clear_page_orig), [new1] "i" (clear_page_rep), [new2] "i" (clear_page_erms), : "cc", "memory", "rax", "rcx") now find the Waldo^Wcomma which throws a wrench into all this. Because that silly macro has an "input..." consume-all last macro arg and in it, one is supposed to supply input *and* clobbers, leading to silly syntax snafus. Yap, they need to be cleaned up, one fine day... Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Reported-by: kernel test robot <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Sean Christopherson <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/20240625112056.GDZnqoGDXgYuWBDUwu@fat_crate.local
2024-06-30x86-32: fix cmpxchg8b_emu build error with clangLinus Torvalds1-7/+5
The kernel test robot reported that clang no longer compiles the 32-bit x86 kernel in some configurations due to commit 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions"). The build fails with arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available and the reason seems to be that not only does the cmpxchg8b instruction need four fixed registers (EDX:EAX and ECX:EBX), with the emulation fallback the inline asm also wants a fifth fixed register for the address (it uses %esi for that, but that's just a software convention with cmpxchg8b_emu). Avoiding using another pointer input to the asm (and just forcing it to use the "0(%esi)" addressing that we end up requiring for the sw fallback) seems to fix the issue. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions") Link: https://lore.kernel.org/all/[email protected]/ Suggested-by: Uros Bizjak <[email protected]> Reviewed-and-Tested-by: Uros Bizjak <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-06-29x86/cpu/intel: Drop stray FAM6 check with new Intel CPU model definesAndrew Cooper1-11/+7
The outer if () should have been dropped when switching to c->x86_vfm. Fixes: 6568fc18c2f6 ("x86/cpu/intel: Switch to new Intel CPU model defines") Signed-off-by: Andrew Cooper <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-28Merge tag 'hardening-v6.10-rc6' of ↵Linus Torvalds1-9/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Remove invalid tty __counted_by annotation (Nathan Chancellor) - Add missing MODULE_DESCRIPTION()s for KUnit string tests (Jeff Johnson) - Remove non-functional per-arch kstack entropy filtering * tag 'hardening-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: tty: mxser: Remove __counted_by from mxser_board.ports[] randomize_kstack: Remove non-functional per-arch entropy filtering string: kunit: add missing MODULE_DESCRIPTION() macros
2024-06-28x86: stop playing stack games in profile_pc()Linus Torvalds1-19/+1
The 'profile_pc()' function is used for timer-based profiling, which isn't really all that relevant any more to begin with, but it also ends up making assumptions based on the stack layout that aren't necessarily valid. Basically, the code tries to account the time spent in spinlocks to the caller rather than the spinlock, and while I support that as a concept, it's not worth the code complexity or the KASAN warnings when no serious profiling is done using timers anyway these days. And the code really does depend on stack layout that is only true in the simplest of cases. We've lost the comment at some point (I think when the 32-bit and 64-bit code was unified), but it used to say: Assume the lock function has either no stack frame or a copy of eflags from PUSHF. which explains why it just blindly loads a word or two straight off the stack pointer and then takes a minimal look at the values to just check if they might be eflags or the return pc: Eflags always has bits 22 and up cleared unlike kernel addresses but that basic stack layout assumption assumes that there isn't any lock debugging etc going on that would complicate the code and cause a stack frame. It causes KASAN unhappiness reported for years by syzkaller [1] and others [2]. With no real practical reason for this any more, just remove the code. Just for historical interest, here's some background commits relating to this code from 2006: 0cb91a229364 ("i386: Account spinlocks to the caller during profiling for !FP kernels") 31679f38d886 ("Simplify profile_pc on x86-64") and a code unification from 2009: ef4512882dbe ("x86: time_32/64.c unify profile_pc") but the basics of this thing actually goes back to before the git tree. Link: https://syzkaller.appspot.com/bug?extid=84fe685c02cd112a2ac3 [1] Link: https://lore.kernel.org/all/CAK55_s7Xyq=nh97=K=G1sxueOFrJDAvPOJAL4TPTCAYvmxO9_A@mail.gmail.com/ [2] Signed-off-by: Linus Torvalds <[email protected]>
2024-06-28x86/cpufeatures: Add HWP highest perf change feature flagSrinivas Pandruvada1-0/+1
When CPUID[6].EAX[15] is set to 1, this CPU supports notification for HWP (Hardware P-states) highest performance change. Add a feature flag to check if the CPU supports HWP highest performance change. Signed-off-by: Srinivas Pandruvada <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-06-28KVM: X86: Remove unnecessary GFP_KERNEL_ACCOUNT for temporary variablesPeng Hao1-5/+4
Some variables allocated in kvm_arch_vcpu_ioctl are released when the function exits, so there is no need to set GFP_KERNEL_ACCOUNT. Signed-off-by: Peng Hao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-28KVM: x86/pmu: Introduce distinct macros for GP/fixed counter max numberDapeng Mi6-25/+31
Refine the macros which define maximum General Purpose (GP) and fixed counter numbers. Currently the macro KVM_INTEL_PMC_MAX_GENERIC is used to represent the maximum supported General Purpose (GP) counter number ambiguously across Intel and AMD platforms. This would cause issues if AMD begins to support more GP counters than Intel. Thus a bunch of new macros including vendor specific and vendor independent are introduced to replace the old macros. The vendor independent macros are used in x86 common code to hide vendor difference and eliminate the ambiguity. No logic changes are introduced in this patch. Signed-off-by: Dapeng Mi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Co-developed-by: Sean Christopherson <[email protected]> Signed-off-by: Sean Christopherson <[email protected]>
2024-06-28KVM: x86: WARN if a vCPU gets a valid wakeup that KVM can't yet injectSean Christopherson1-1/+4
WARN if a blocking vCPU is awakened by a valid wake event that KVM can't inject, e.g. because KVM needs to complete a nested VM-enter, or needs to re-inject an exception. For the nested VM-Enter case, KVM is supposed to clear "nested_run_pending" if L1 puts L2 into HLT, i.e. entering HLT "completes" the nested VM-Enter. And for already-injected exceptions, it should be impossible for the vCPU to be in a blocking state if a VM-Exit occurred while an exception was being vectored. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>
2024-06-28KVM: nVMX: Fold requested virtual interrupt check into has_nested_events()Sean Christopherson7-33/+5
Check for a Requested Virtual Interrupt, i.e. a virtual interrupt that is pending delivery, in vmx_has_nested_events() and drop the one-off kvm_x86_ops.guest_apic_has_interrupt() hook. In addition to dropping a superfluous hook, this fixes a bug where KVM would incorrectly treat virtual interrupts _for L2_ as always enabled due to kvm_arch_interrupt_allowed(), by way of vmx_interrupt_blocked(), treating IRQs as enabled if L2 is active and vmcs12 is configured to exit on IRQs, i.e. KVM would treat a virtual interrupt for L2 as a valid wake event based on L1's IRQ blocking status. Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>