path: root/arch/x86/kernel/cpu
Age | Commit message | Author | Files | Lines
2023-07-27 | x86/microcode/AMD: Rip out static buffers | Borislav Petkov (AMD) | 2 | -66/+29
Load straight from the containers (initrd or builtin, for example). There's no need to cache the patch per node. This even simplifies the code a bit with the opportunity for more cleanups later. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: John Allen <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-22 | x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled | Kim Phillips | 1 | -6/+9
Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not provide protection to processes running at CPL3/user mode, see section "Extended Feature Enable Register (EFER)" in the APM v2 at https://bugzilla.kernel.org/attachment.cgi?id=304652

Explicitly enable STIBP to protect against cross-thread CPL3 branch target injections on systems with Automatic IBRS enabled.

Also update the relevant documentation.

Fixes: e7862eda309e ("x86/cpu: Support AMD Automatic IBRS")
Reported-by: Tom Lendacky <[email protected]>
Signed-off-by: Kim Phillips <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
2023-07-22 | x86/MCE/AMD: Decrement threshold_bank refcount when removing threshold blocks | Yazen Ghannam | 1 | -2/+2
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs. Therefore, the threshold_bank structure for bank 4, and its threshold_block structures, will be initialized once at boot time. And the kobject for the shared bank will be added to each of the CPUs that share it. Furthermore, the threshold_blocks for the shared bank will be added again to the bank's kobject. These additions will increase the refcount for the bank's kobject.

For example, a shared bank with two blocks and shared across two CPUs will be set up like this:

CPU0 init
  bank create and add; bank refcount = 1; threshold_create_bank()
    block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
    block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
CPU1 init
  bank add; bank refcount = 3; threshold_create_bank()
    block 0 add; bank refcount = 4; __threshold_add_blocks()
    block 1 add; bank refcount = 5; __threshold_add_blocks()

Currently in threshold_remove_bank(), if the bank is shared then __threshold_remove_blocks() is called. Here the shared bank's kobject and the bank's blocks' kobjects are deleted. This is done on the first call even while the structures are still shared. Subsequent calls from other CPUs that share the structures will attempt to delete the kobjects.

During kobject_del(), kobject->sd is removed. If the kobject is not part of a kset with default_groups, then subsequent kobject_del() calls seem safe even with kobject->sd == NULL.

Originally, the AMD MCA thresholding structures did not use default_groups. And so the above behavior was not apparent.

However, a recent change implemented default_groups for the thresholding structures. Therefore, kobject_del() will go down the sysfs_remove_groups() code path. In this case, the first kobject_del() may succeed and remove kobject->sd. But subsequent kobject_del() calls will give a WARNing in kernfs_remove_by_name_ns() since kobject->sd == NULL.

Use kobject_put() on the shared bank's kobject when "removing" blocks. This decrements the bank's refcount while keeping kobjects enabled until the bank is no longer shared. At that point, kobject_put() will be called on the blocks which drives their refcount to 0 and deletes them and also decrementing the bank's refcount. And finally kobject_put() will be called on the bank driving its refcount to 0 and deleting it.

The same example above:

CPU1 shutdown
  bank is shared; bank refcount = 5; threshold_remove_bank()
    block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
    block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
CPU0 shutdown
  bank is no longer shared; bank refcount = 3; threshold_remove_bank()
    block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
    block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
  put bank; bank refcount = 0; threshold_remove_bank()

Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
Reported-by: Mikulas Patocka <[email protected]>
Signed-off-by: Yazen Ghannam <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: Mikulas Patocka <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
2023-07-21 | KVM: Add GDS_NO support to KVM | Daniel Sneddon | 1 | -0/+7
Gather Data Sampling (GDS) is a transient execution attack using gather instructions from the AVX2 and AVX512 extensions. This attack allows malicious code to infer data that was previously stored in vector registers. Systems that are not vulnerable to GDS will set the GDS_NO bit of the IA32_ARCH_CAPABILITIES MSR. This is useful for VM guests that may think they are on vulnerable systems that are, in fact, not affected. Guests that are running on affected hosts where the mitigation is enabled are protected as if they were running on an unaffected system. On all hosts that are not affected or that are mitigated, set the GDS_NO bit. Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Josh Poimboeuf <[email protected]>
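The rule in the last sentence of that message can be written down as a one-line predicate. The sketch below is a userspace model, not the KVM code: `guest_arch_capabilities()` is a hypothetical helper, and the GDS_NO bit position (taken here to be bit 26 of IA32_ARCH_CAPABILITIES) is an assumption that should be checked against the kernel's msr-index.h.

```python
# Assumed position of GDS_NO within IA32_ARCH_CAPABILITIES; verify before reuse.
ARCH_CAP_GDS_NO = 1 << 26


def guest_arch_capabilities(host_caps, host_affected, host_mitigated):
    """Model of the value a guest would read for IA32_ARCH_CAPABILITIES.

    GDS_NO is reported when the host is not affected, or when it is
    affected but the mitigation is active.
    """
    caps = host_caps
    if not host_affected or host_mitigated:
        caps |= ARCH_CAP_GDS_NO
    return caps


# Affected and unmitigated host: the guest must not see GDS_NO.
print(bool(guest_arch_capabilities(0, True, False) & ARCH_CAP_GDS_NO))   # False
# Affected but mitigated host: the guest is told it is safe.
print(bool(guest_arch_capabilities(0, True, True) & ARCH_CAP_GDS_NO))    # True
```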
2023-07-21 | x86/speculation: Add Kconfig option for GDS | Daniel Sneddon | 1 | -0/+4
Gather Data Sampling (GDS) is mitigated in microcode. However, on systems that haven't received the updated microcode, disabling AVX can act as a mitigation. Add a Kconfig option that uses the microcode mitigation if available and disables AVX otherwise. Setting this option has no effect on systems not affected by GDS. This is the equivalent of setting gather_data_sampling=force. Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Josh Poimboeuf <[email protected]>
2023-07-21 | x86/speculation: Add force option to GDS mitigation | Daniel Sneddon | 1 | -1/+19
The Gather Data Sampling (GDS) vulnerability allows malicious software to infer stale data previously stored in vector registers. This may include sensitive data such as cryptographic keys. GDS is mitigated in microcode, and systems with up-to-date microcode are protected by default. However, any affected system that is running with older microcode will still be vulnerable to GDS attacks. Since the gather instructions used by the attacker are part of the AVX2 and AVX512 extensions, disabling these extensions prevents gather instructions from being executed, thereby mitigating the system from GDS. Disabling AVX2 is sufficient, but we don't have the granularity to do this. The XCR0[2] disables AVX, with no option to just disable AVX2. Add a kernel parameter gather_data_sampling=force that will enable the microcode mitigation if available, otherwise it will disable AVX on affected systems. This option will be ignored if cmdline mitigations=off. This is a *big* hammer. It is known to break buggy userspace that uses incomplete, buggy AVX enumeration. Unfortunately, such userspace does exist in the wild: https://www.mail-archive.com/[email protected]/msg33046.html [ dhansen: add some more ominous warnings about disabling AVX ] Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Josh Poimboeuf <[email protected]>
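The decision that gather_data_sampling=force makes can be sketched as a small function. This is an illustrative model only: `apply_gds_force()` and its return strings are hypothetical, and the effect of "disabling AVX" is modelled as clearing XCR0[2], the bit the message names, rather than reproducing how the kernel actually withdraws the feature.

```python
XCR0_AVX = 1 << 2   # XCR0[2], the AVX state bit named in the commit message


def apply_gds_force(affected, microcode_mitigation, mitigations_off, xcr0):
    """Return (action, new_xcr0) for a system booted with the force option."""
    if mitigations_off:
        # mitigations=off on the command line wins over the force option.
        return "ignored", xcr0
    if not affected:
        return "not affected", xcr0
    if microcode_mitigation:
        # Updated microcode is present: use it and leave AVX alone.
        return "microcode", xcr0
    # No microcode mitigation: fall back to the big hammer.
    return "avx disabled", xcr0 & ~XCR0_AVX


# x87 | SSE | AVX enabled, affected CPU, old microcode:
print(apply_gds_force(True, False, False, 0b111))   # ('avx disabled', 3)
```

Because AVX2 and AVX512 both depend on the AVX state bit, clearing it removes the gather instructions along with everything else in those extensions, which is why the message warns about broken userspace.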
2023-07-21 | x86/mce: Prevent duplicate error records | Borislav Petkov (AMD) | 3 | -2/+27
A legitimate use case of the MCA infrastructure is to have the firmware log all uncorrectable errors and also, have the OS see all correctable errors. The uncorrectable, UCNA errors are usually configured to be reported through an SMI. CMCI, which is the correctable error reporting interrupt, uses SMI too and having both enabled, leads to unnecessary overhead. So what ends up happening is, people disable CMCI in the wild and leave on only the UCNA SMI. When CMCI is disabled, the MCA infrastructure resorts to polling the MCA banks. If a MCA MSR is shared between the logical threads, one error ends up getting logged multiple times as the polling runs on every logical thread. Therefore, introduce locking on the Intel side of the polling routine to prevent such duplicate error records from appearing. Based on a patch by Aristeu Rozanski <[email protected]>. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Tony Luck <[email protected]> Acked-by: Aristeu Rozanski <[email protected]> Link: https://lore.kernel.org/r/[email protected]
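Why a lock removes the duplicates can be shown with a toy model in which several pollers look at one shared status register. Everything here is hypothetical scaffolding (`SharedBank`, `poll_bank()`, `run_pollers()`); the only idea taken from the commit is that reading, logging and clearing a shared bank happens under a lock.

```python
import threading


class SharedBank:
    """Toy MCA bank whose status is visible to several logical threads."""

    def __init__(self, status=None):
        self.status = status


def poll_bank(bank, lock, records, cpu):
    # Read, log and clear as one critical section, so only the first
    # poller to arrive sees the pending error.
    with lock:
        if bank.status is not None:
            records.append((cpu, bank.status))
            bank.status = None


def run_pollers(n_threads, error="corrected error"):
    bank = SharedBank(error)
    lock = threading.Lock()
    records = []
    threads = [
        threading.Thread(target=poll_bank, args=(bank, lock, records, cpu))
        for cpu in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return records


print(len(run_pollers(4)))   # 1: one error, one record
```

Without the lock, two pollers could both observe the status before either clears it, and the same error would be recorded twice.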
2023-07-19 | x86/speculation: Add Gather Data Sampling mitigation | Daniel Sneddon | 3 | -9/+155
Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Josh Poimboeuf <[email protected]>
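Checking the status file named above from userspace might look like the following sketch. The path is the one given in the commit message; `read_gds_status()` and `classify()` are hypothetical helpers, and classifying by the prefixes "Mitigation", "Vulnerable" and "Not affected" is an assumption based on how the other files in that directory are usually worded.

```python
from pathlib import Path

GDS_SYSFS = "/sys/devices/system/cpu/vulnerabilities/gather_data_sampling"


def read_gds_status(path=GDS_SYSFS):
    """Return the status line, or None if the file cannot be read
    (for example on a kernel that predates the GDS patches)."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def classify(status):
    """Coarse classification by prefix; the prefixes are an assumption."""
    if status is None:
        return "unknown"
    if status.startswith("Mitigation"):
        return "mitigated"
    if status.startswith("Not affected"):
        return "not affected"
    if status.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


status = read_gds_status()
print(status, "->", classify(status))
```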
2023-07-17 | x86/cpu/amd: Add a Zenbleed fix | Borislav Petkov (AMD) | 2 | -0/+62
Add a fix for the Zen2 VZEROUPPER data corruption bug where under certain circumstances executing VZEROUPPER can cause register corruption or leak data. The optimal fix is through microcode but in the case the proper microcode revision has not been applied, enable a fallback fix using a chicken bit. Signed-off-by: Borislav Petkov (AMD) <[email protected]>
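The "microcode if possible, chicken bit otherwise" policy can be modelled in a few lines. This is a sketch under stated assumptions: `zenbleed_de_cfg()` is a hypothetical helper, the chicken bit is taken to be bit 9 of the DE_CFG MSR, clearing the bit once fixed microcode is present is an assumption about the kernel's behaviour, and the revision numbers in the example are made up.

```python
# Assumed position of the Zen2 chicken bit in the DE_CFG MSR; verify before reuse.
ZEN2_FP_BACKUP_FIX = 1 << 9


def zenbleed_de_cfg(de_cfg, microcode_rev, fixed_rev):
    """Return the DE_CFG value to program on a Zen2 CPU.

    microcode_rev is the running revision; fixed_rev is the first revision
    that carries the fix for this model (both supplied by the caller).
    """
    if microcode_rev >= fixed_rev:
        # Proper microcode is loaded: the fallback is not needed.
        return de_cfg & ~ZEN2_FP_BACKUP_FIX
    # Old microcode: enable the fallback fix.
    return de_cfg | ZEN2_FP_BACKUP_FIX


# Hypothetical revisions: running 0x0ff, fix first shipped in 0x100.
print(hex(zenbleed_de_cfg(0, 0x0FF, 0x100)))   # 0x200
```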
2023-07-17 | x86/cpu/amd: Move the errata checking functionality up | Borislav Petkov (AMD) | 1 | -72/+67
Avoid new and remove old forward declarations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-07-11 | x86/cpufeatures: Add CPU feature flags for shadow stacks | Rick Edgecombe | 1 | -0/+1
The Control-Flow Enforcement Technology contains two related features, one of which is Shadow Stacks. Future patches will utilize this feature for shadow stack support in KVM, so add a CPU feature flag for Shadow Stacks (CPUID.(EAX=7,ECX=0):ECX[bit 7]).

To protect shadow stack state from malicious modification, the registers are only accessible in supervisor mode. This implementation context-switches the registers with XSAVES. Make X86_FEATURE_SHSTK depend on XSAVES.

The shadow stack feature, enumerated by the CPUID bit described above, encompasses both supervisor and userspace support for shadow stack. In near future patches, only userspace shadow stack will be enabled. In expectation of future supervisor shadow stack support, create a software CPU capability to enumerate kernel utilization of userspace shadow stack support. This user shadow stack bit should depend on the HW "shstk" capability and that logic will be implemented in future patches.

Co-developed-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Yu-cheng Yu <[email protected]>
Signed-off-by: Rick Edgecombe <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Tested-by: Pengfei Xu <[email protected]>
Tested-by: John Allen <[email protected]>
Tested-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/all/20230613001108.3040476-9-rick.p.edgecombe%40intel.com
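The two dependencies described in that message (the hardware bit needs XSAVES, and the software "user shadow stack" capability needs the hardware bit) can be expressed as a pair of predicates. The helper names below are hypothetical and the sketch takes a raw ECX value as input instead of executing CPUID.

```python
# CPUID.(EAX=7,ECX=0):ECX[bit 7], as quoted in the commit message.
CPUID_7_0_ECX_SHSTK = 1 << 7


def hw_shstk_usable(cpuid_7_0_ecx, has_xsaves):
    """Hardware shadow stack counts as present only if XSAVES is too,
    because the shadow stack registers are context-switched with XSAVES."""
    enumerated = bool(cpuid_7_0_ecx & CPUID_7_0_ECX_SHSTK)
    return enumerated and has_xsaves


def user_shstk_enabled(hw_shstk, kernel_opt_in):
    """Software capability: the kernel uses userspace shadow stacks only
    when the hardware capability is usable and the kernel opts in."""
    return hw_shstk and kernel_opt_in


hw = hw_shstk_usable(0x80, has_xsaves=True)
print(hw, user_shstk_enabled(hw, kernel_opt_in=True))   # True True
```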
2023-06-28 | Merge tag 'mm-stable-2023-06-24-19-15' of ↵ | Linus Torvalds | 1 | -1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull mm updates from Andrew Morton:

- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning
- Lorenzo Stoakes has done some maintenance work on the get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code
- Baolin Wang has provided some compaction cleanups
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch

* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
  mm/hugetlb: remove hugetlb_set_page_subpool()
  mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
  hugetlb: revert use of page_cache_next_miss()
  Revert "page cache: fix page_cache_next/prev_miss off by one"
  mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
  mm: memcg: rename and document global_reclaim()
  mm: kill [add|del]_page_to_lru_list()
  mm: compaction: convert to use a folio in isolate_migratepages_block()
  mm: zswap: fix double invalidate with exclusive loads
  mm: remove unnecessary pagevec includes
  mm: remove references to pagevec
  mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
  mm: remove struct pagevec
  net: convert sunrpc from pagevec to folio_batch
  i915: convert i915_gpu_error to use a folio_batch
  pagevec: rename fbatch_count()
  mm: remove check_move_unevictable_pages()
  drm: convert drm_gem_put_pages() to use a folio_batch
  i915: convert shmem_sg_free_table() to use a folio_batch
  scatterlist: add sg_set_folio()
  ...
2023-06-27 | Merge tag 'locking-core-2023-06-27' of ↵ | Linus Torvalds | 1 | -8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: - Introduce cmpxchg128() -- aka. the demise of cmpxchg_double() The cmpxchg128() family of functions is basically & functionally the same as cmpxchg_double(), but with a saner interface. Instead of a 6-parameter horror that forced u128 - u64/u64-halves layout details on the interface and exposed users to complexity, fragility & bugs, use a natural 3-parameter interface with u128 types. - Restructure the generated atomic headers, and add kerneldoc comments for all of the generic atomic{,64,_long}_t operations. The generated definitions are much cleaner now, and come with documentation. - Implement lock_set_cmp_fn() on lockdep, for defining an ordering when taking multiple locks of the same type. This gets rid of one use of lockdep_set_novalidate_class() in the bcache code. - Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable shadowing generating garbage code on Clang on certain ARM builds. 
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg() locking/atomic: treewide: delete arch_atomic_*() kerneldoc locking/atomic: docs: Add atomic operations to the driver basic API documentation locking/atomic: scripts: generate kerneldoc comments docs: scripts: kernel-doc: accept bitwise negation like ~@var locking/atomic: scripts: simplify raw_atomic*() definitions locking/atomic: scripts: simplify raw_atomic_long*() definitions locking/atomic: scripts: split pfx/name/sfx/order locking/atomic: scripts: restructure fallback ifdeffery locking/atomic: scripts: build raw_atomic_long*() directly locking/atomic: treewide: use raw_atomic*_<op>() locking/atomic: scripts: add trivial raw_atomic*_<op>() locking/atomic: scripts: factor out order template generation locking/atomic: scripts: remove leftover "${mult}" locking/atomic: scripts: remove bogus order parameter locking/atomic: xtensa: add preprocessor symbols locking/atomic: x86: add preprocessor symbols locking/atomic: sparc: add preprocessor symbols locking/atomic: sh: add preprocessor symbols ...
2023-06-27 | Merge tag 'x86_sgx_for_v6.5' of ↵ | Linus Torvalds | 1 | -1/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull SGX update from Borislav Petkov: - A fix to avoid using a list iterator variable after the loop it is used in * tag 'x86_sgx_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Avoid using iterator after loop in sgx_mmu_notifier_release()
2023-06-27 | Merge tag 'x86_mtrr_for_v6.5' of ↵ | Linus Torvalds | 9 | -447/+659
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mtrr updates from Borislav Petkov: "A serious scrubbing of the MTRR code including adding a new map mechanism in order to look up the memory type of a region easily. Also address memory range lookup issues like returning an invalid memory type. Furthermore, this handles the decoupling of PAT from MTRR more naturally. All work by Juergen Gross" * tag 'x86_mtrr_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/xen: Set default memory type for PV guests to WB x86/mtrr: Unify debugging printing x86/mtrr: Remove unused code x86/mm: Only check uniform after calling mtrr_type_lookup() x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALID x86/mtrr: Use new cache_map in mtrr_type_lookup() x86/mtrr: Add mtrr=debug command line option x86/mtrr: Construct a memory map with cache modes x86/mtrr: Add get_effective_type() service function x86/mtrr: Allocate mtrr_value array dynamically x86/mtrr: Move 32-bit code from mtrr.c to legacy.c x86/mtrr: Have only one set_mtrr() variant x86/mtrr: Replace vendor tests in MTRR code x86/xen: Set MTRR state when running as Xen PV initial domain x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest x86/mtrr: Support setting MTRR state for software defined MTRRs x86/mtrr: Replace size_or_mask and size_and_mask with a much easier concept x86/mtrr: Remove physical address size calculation
2023-06-27 | Merge tag 'x86_microcode_for_v6.5' of ↵ | Linus Torvalds | 1 | -9/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 microcode loader updates from Borislav Petkov: - Load late on both SMT threads on AMD, just like it is being done in the early loading procedure - Cleanups * tag 'x86_microcode_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/microcode/AMD: Load late on both threads too x86/microcode/amd: Remove unneeded pointer arithmetic x86/microcode/AMD: Get rid of __find_equiv_id()
2023-06-26 | Merge tag 'x86_cpu_for_v6.5' of ↵ | Linus Torvalds | 2 | -7/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Compute the purposeful misalignment of zen_untrain_ret automatically and assert __x86_return_thunk's alignment so that future changes to the symbol macros do not accidentally break them. - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is pointless * tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/retbleed: Add __x86_return_thunk alignment checks x86/cpu: Remove X86_FEATURE_NAMES x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
2023-06-26 | Merge tag 'x86_cache_for_v6.5' of ↵ | Linus Torvalds | 1 | -15/+156
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: - Implement a rename operation in resctrlfs to facilitate handling of application containers with dynamically changing task lists - When reading the tasks file, show the tasks' pid which are only in the current namespace as opposed to showing the pids from the init namespace too - Other fixes and improvements * tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation/x86: Documentation for MON group move feature x86/resctrl: Implement rename op for mon groups x86/resctrl: Factor rdtgroup lock for multi-file ops x86/resctrl: Only show tasks' pid in current pid namespace
2023-06-26 | Merge tag 'ras_core_for_v6.5' of ↵ | Linus Torvalds | 2 | -3/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

- Add initial support for RAS hardware found on AMD server GPUs (MI200). Those GPUs and CPUs are connected together through the coherent fabric and the GPU memory controllers report errors through x86's MCA, so EDAC needs to support them. The amd64_edac driver now supports HBM (High Bandwidth Memory) and thus such heterogeneous memory controller systems
- Other small cleanups and improvements

* tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/amd64: Cache and use GPU node map
  EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
  EDAC/amd64: Document heterogeneous system enumeration
  x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
  x86/amd_nb: Re-sort and re-indent PCI defines
  x86/amd_nb: Add MI200 PCI IDs
  ras/debugfs: Fix error checking for debugfs_create_dir()
  x86/MCE: Check a hw error's address to determine proper recovery action
2023-06-26 | Merge tag 'smp-core-2023-06-26' of ↵ | Linus Torvalds | 2 | -54/+18
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull SMP updates from Thomas Gleixner:

"A large update for SMP management:

- Parallel CPU bringup

  The reason why people are interested in parallel bringup is to shorten the (kexec) reboot time of cloud servers to reduce the downtime of the VM tenants.

  The current fully serialized bringup does the following per AP:

    1) Prepare callbacks (allocate, initialize, create threads)
    2) Kick the AP alive (e.g. INIT/SIPI on x86)
    3) Wait for the AP to report alive state
    4) Let the AP continue through the atomic bringup
    5) Let the AP run the threaded bringup to full online state

  There are two significant delays:

    #3 The time for an AP to report alive state in start_secondary() on x86 has been measured in the range between 350us and 3.5ms depending on vendor and CPU type, BIOS microcode size etc.

    #4 The atomic bringup does the microcode update. This has been measured to take up to ~8ms on the primary threads depending on the microcode patch size to apply.

  On a two socket SKL server with 56 cores (112 threads) the boot CPU spends on current mainline about 800ms busy waiting for the APs to come up and apply microcode. That's more than 80% of the actual onlining procedure.

  This can be reduced significantly by splitting the bringup mechanism into two parts:

    1) Run the prepare callbacks and kick the AP alive for each AP which needs to be brought up.

       The APs wake up, do their firmware initialization and run the low level kernel startup code including microcode loading in parallel up to the first synchronization point. (#1 and #2 above)

    2) Run the rest of the bringup code strictly serialized per CPU (#3 - #5 above) as it's done today.

       Parallelizing that stage of the CPU bringup might be possible in theory, but it's questionable whether required surgery would be justified for a pretty small gain.

  If the system is large enough the first AP is already waiting at the first synchronization point when the boot CPU finished the wake-up of the last AP. That reduces the AP bringup time on that SKL from ~800ms to ~80ms, i.e. by a factor ~10x.

  The actual gain varies wildly depending on the system, CPU, microcode patch size and other factors. There are some opportunities to reduce the overhead further, but that needs some deep surgery in the x86 CPU bringup code.

  For now this is only enabled on x86, but the core functionality obviously works for all SMP capable architectures.

- Enhancements for SMP function call tracing so it is possible to locate the scheduling and the actual execution points. That allows IPI delivery time to be measured precisely"

* tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
  trace,smp: Add tracepoints for scheduling remotelly called functions
  trace,smp: Add tracepoints around remotelly called functions
  MAINTAINERS: Add CPU HOTPLUG entry
  x86/smpboot: Fix the parallel bringup decision
  x86/realmode: Make stack lock work in trampoline_compat()
  x86/smp: Initialize cpu_primary_thread_mask late
  cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
  x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
  x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
  x86/smpboot: Support parallel startup of secondary CPUs
  x86/smpboot: Implement a bit spinlock to protect the realmode stack
  x86/apic: Save the APIC virtual base address
  cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
  x86/apic: Provide cpu_primary_thread mask
  x86/smpboot: Enable split CPU startup
  cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
  cpu/hotplug: Reset task stack state in _cpu_up()
  cpu/hotplug: Remove unused state functions
  riscv: Switch to hotplug core state synchronization
  parisc: Switch to hotplug core state synchronization
  ...
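The order-of-magnitude claim in that message (roughly 800ms serialized versus roughly 80ms with the split bringup) can be reproduced with a back-of-the-envelope timing model. This is only a sketch: the two functions are hypothetical, the 111 APs follow from the 112 threads mentioned in the message, and the per-step costs (0.1ms to kick, 7ms for wake-up plus microcode, 0.6ms for the serialized remainder) are invented numbers chosen to land in the same range, not measurements.

```python
def serial_bringup_ms(n_aps, kick, wake, rest):
    """Fully serialized: the boot CPU pays every step for every AP in turn."""
    return n_aps * (kick + wake + rest)


def parallel_bringup_ms(n_aps, kick, wake, rest):
    """Split bringup: kick all APs first, then finish them one by one.

    AP i is kicked by time (i + 1) * kick and needs `wake` ms on its own
    (firmware init, low level startup, microcode) before it reaches the
    first synchronization point. The boot CPU then runs the serialized
    remainder per AP, waiting only if that AP is not ready yet.
    """
    t = n_aps * kick                       # boot CPU finished kicking everyone
    for i in range(n_aps):
        ready = (i + 1) * kick + wake      # when AP i reaches the sync point
        t = max(t, ready) + rest
    return t


N_APS = 111                                # 112 threads minus the boot CPU
serial = serial_bringup_ms(N_APS, kick=0.1, wake=7.0, rest=0.6)
parallel = parallel_bringup_ms(N_APS, kick=0.1, wake=7.0, rest=0.6)
print(round(serial, 1), round(parallel, 1))   # 854.7 77.7
```

With these made-up costs the wake-up latency overlaps almost entirely with the time the boot CPU spends kicking the remaining APs, so only the kick and the serialized remainder stay on the critical path.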
2023-06-26 | Merge tag 'x86-boot-2023-06-26' of ↵ | Linus Torvalds | 3 | -57/+74
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Thomas Gleixner: "Initialize FPU late. Right now FPU is initialized very early during boot. There is no real requirement to do so. The only requirement is to have it done before alternatives are patched. That's done in check_bugs() which does way more than what the function name suggests. So first rename check_bugs() to arch_cpu_finalize_init() which makes it clear what this is about. Move the invocation of arch_cpu_finalize_init() earlier in start_kernel() as it has to be done before fork_init() which needs to know the FPU register buffer size. With those prerequisites the FPU initialization can be moved into arch_cpu_finalize_init(), which removes it from the early and fragile part of the x86 bringup" * tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build x86/fpu: Move FPU initialization into arch_cpu_finalize_init() x86/fpu: Mark init functions __init x86/fpu: Remove cpuinfo argument from init functions x86/init: Initialize signal frame size late init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init() init: Invoke arch_cpu_finalize_init() earlier init: Remove check_bugs() leftovers um/cpu: Switch to arch_cpu_finalize_init() sparc/cpu: Switch to arch_cpu_finalize_init() sh/cpu: Switch to arch_cpu_finalize_init() mips/cpu: Switch to arch_cpu_finalize_init() m68k/cpu: Switch to arch_cpu_finalize_init() loongarch/cpu: Switch to arch_cpu_finalize_init() ia64/cpu: Switch to arch_cpu_finalize_init() ARM: cpu: Switch to arch_cpu_finalize_init() x86/cpu: Switch to arch_cpu_finalize_init() init: Provide arch_cpu_finalize_init()
2023-06-16 | x86/fpu: Move FPU initialization into arch_cpu_finalize_init() | Thomas Gleixner | 1 | -4/+8
Initializing the FPU during the early boot process is a pointless exercise. Early boot is convoluted and fragile enough. Nothing requires that the FPU is set up early. It has to be initialized before fork_init() because the task_struct size depends on the FPU register buffer size. Move the initialization to arch_cpu_finalize_init() which is the perfect place to do so. No functional change. This allows to remove quite some of the custom early command line parsing, but that's subject to the next installment. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16 | x86/fpu: Remove cpuinfo argument from init functions | Thomas Gleixner | 1 | -1/+1
Nothing in the call chain requires it Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16 | x86/init: Initialize signal frame size late | Thomas Gleixner | 1 | -3/+0
No point in doing this during really early boot. Move it to an early initcall so that it is set up before possible user mode helpers are started during device initialization. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16 | init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init() | Thomas Gleixner | 1 | -0/+11
Invoke the X86ism mem_encrypt_init() from X86 arch_cpu_finalize_init() and remove the weak fallback from the core code. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16 | x86/cpu: Switch to arch_cpu_finalize_init() | Thomas Gleixner | 3 | -50/+55
check_bugs() is a dumping ground for finalizing the CPU bringup. Only parts of it have to do with actual CPU bugs.

Split it apart into arch_cpu_finalize_init() and cpu_select_mitigations().

Fix up the bogus 32bit comments while at it.

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Borislav Petkov (AMD) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2023-06-13 | x86/sgx: Avoid using iterator after loop in sgx_mmu_notifier_release() | Jakob Koschel | 1 | -1/+3
If &encl_mm->encl->mm_list does not contain the searched 'encl_mm', 'tmp' will not point to a valid sgx_encl_mm struct. Linus proposed to avoid any use of the list iterator variable after the loop, in the attempt to move the list iterator variable declaration into the macro to avoid any potential misuse after the loop. Using it in a pointer comparison after the loop is undefined behavior and should be omitted if possible, see Link tag. Instead, just use a 'found' boolean to indicate if an element was found. [ bp: Massage, fix typos. ] Signed-off-by: Jakob Koschel <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Jarkko Sakkinen <[email protected]> Acked-by: Dave Hansen <[email protected]> Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Link: https://lore.kernel.org/r/[email protected]
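The hazard and the 'found' flag pattern can be illustrated with a userspace model of a circular list with a head sentinel. This is a sketch, not the SGX code: `ListHead`, `list_add_tail()`, `list_del()` and `remove_entry()` are hypothetical Python stand-ins named after the kernel helpers they imitate.

```python
class ListHead:
    """Node of a circular doubly linked list; the head is a bare sentinel."""

    def __init__(self, owner=None):
        self.owner = owner
        self.next = self
        self.prev = self


def list_add_tail(node, head):
    node.prev = head.prev
    node.next = head
    head.prev.next = node
    head.prev = node


def list_del(node):
    node.prev.next = node.next
    node.next.prev = node.prev
    node.next = node.prev = node


def remove_entry(head, target):
    """Unlink target if it is on the list; report the outcome via a flag."""
    found = False
    cur = head.next
    while cur is not head:
        if cur is target:
            list_del(cur)
            found = True
            break
        cur = cur.next
    # If the loop ran to completion, `cur` is the head sentinel rather than
    # a real entry, so nothing after the loop looks at it. Only `found`
    # carries the result.
    return found


head = ListHead()
a, b = ListHead("a"), ListHead("b")
list_add_tail(a, head)
print(remove_entry(head, a), remove_entry(head, b))   # True False
```

In the C original the post-loop cursor is worse than a sentinel, since it is computed from the list head as if the head were embedded in an entry, which is why comparing it against anything is best avoided.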
2023-06-12x86/microcode/AMD: Load late on both threads tooBorislav Petkov (AMD)1-1/+1
Do the same as early loading - load on both threads. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-09mm/gup: remove unused vmas parameter from get_user_pages()Lorenzo Stoakes1-1/+1
Patch series "remove the vmas parameter from GUP APIs", v6. (pin_/get)_user_pages[_remote]() each provide an optional output parameter for an array of VMA objects associated with each page in the input range. These provide the means for VMAs to be returned, as long as mm->mmap_lock is never released during the GUP operation (i.e. the internal flag FOLL_UNLOCKABLE is not specified). In addition, these VMAs can only be accessed with the mmap_lock held and become invalidated the moment it is released. The vast majority of invocations do not use this functionality and of those that do, all but one case retrieve a single VMA to perform checks upon. It is not egregious in the single VMA cases to simply replace the operation with a vma_lookup(). In these cases we duplicate the (fast) lookup on a slow path already under the mmap_lock, abstracted to a new get_user_page_vma_remote() inline helper function which also performs error checking and reference count maintenance. The special case is io_uring, where io_pin_pages() specifically needs to assert that the VMAs underlying the range do not result in broken long-term GUP file-backed mappings. As GUP now internally asserts that FOLL_LONGTERM mappings are not file-backed in a broken fashion (i.e. requiring dirty tracking) - as implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings" - this logic is no longer required and so we can simply remove it altogether from io_uring. Eliminating the vmas parameter eliminates an entire class of dangling pointer errors that might have occurred should the lock have been incorrectly released. In addition, the API is simplified and now clearly expresses what it is intended for - applying the specified GUP flags and (if pinning) returning pinned pages. This change additionally opens the door to further potential improvements in GUP and the possible marrying of disparate code paths. I have run this series against gup_test with no issues.
Thanks to Matthew Wilcox for suggesting this refactoring! This patch (of 6): No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <[email protected]> Suggested-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Acked-by: Christian König <[email protected]> (for radeon parts) Acked-by: Jarkko Sakkinen <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Acked-by: Sean Christopherson <[email protected]> (KVM) Cc: Catalin Marinas <[email protected]> Cc: Dennis Dalessandro <[email protected]> Cc: Janosch Frank <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sakari Ailus <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-07x86/resctrl: Implement rename op for mon groupsPeter Newman1-0/+128
To change the resources allocated to a large group of tasks, such as an application container, a container manager must write all of the tasks' IDs into the tasks file interface of the new control group. This is challenging when the container's task list is always changing. In addition, if the container manager is using monitoring groups to separately track the bandwidth of containers assigned to the same control group, when moving a container, it must first move the container's tasks to the default monitoring group of the new control group before it can move these tasks into the container's replacement monitoring group under the destination control group. This is undesirable because it makes bandwidth usage during the move unattributable to the correct tasks and resets monitoring event counters and cache usage information for the group. Implement the rename operation only for resctrlfs monitor groups to enable users to move a monitoring group from one control group to another. This effects a change in resources allocated to all the tasks in the monitoring group while otherwise leaving the monitoring data intact. Signed-off-by: Peter Newman <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-07x86/resctrl: Factor rdtgroup lock for multi-file opsPeter Newman1-13/+22
rdtgroup_kn_lock_live() can only release a kernfs reference for a single file before waiting on the rdtgroup_mutex, limiting its usefulness for operations on multiple files, such as rename. Factor the work needed to respectively break and unbreak active protection on an individual file into rdtgroup_kn_{get,put}(). No functional change. Signed-off-by: Peter Newman <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Reinette Chatre <[email protected]> Tested-by: Babu Moger <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-05x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errorsYazen Ghannam1-2/+4
The MI200 (Aldebaran) series of devices introduced a new SMCA bank type for Unified Memory Controllers. The MCE subsystem already has support for this new type. The MCE decoder module will decode the common MCA error information for the new bank type, but it will not pass the information to the AMD64 EDAC module for detailed memory error decoding. Have the MCE decoder module recognize the new bank type as an SMCA UMC memory error and pass the MCA information to AMD64 EDAC. Signed-off-by: Yazen Ghannam <[email protected]> Co-developed-by: Muralidhara M K <[email protected]> Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-05locking/atomic: treewide: use raw_atomic*_<op>()Mark Rutland1-8/+8
Now that we have raw_atomic*_<op>() definitions, there's no need to use arch_atomic*_<op>() definitions outside of the low-level atomic definitions. Move treewide users of arch_atomic*_<op>() over to the equivalent raw_atomic*_<op>(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-01x86/mtrr: Unify debugging printingBorislav Petkov (AMD)4-40/+29
Put all the debugging output behind "mtrr=debug" and get rid of "mtrr_cleanup_debug" which wasn't even documented anywhere. No functional changes. Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/20230531174857.GDZHeIib57h5lT5Vh1@fat_crate.local
2023-06-01x86/mtrr: Remove unused codeJuergen Gross1-9/+0
mtrr_centaur_report_mcr() isn't used by anyone, so it can be removed. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Don't let mtrr_type_lookup() return MTRR_TYPE_INVALIDJuergen Gross1-2/+2
mtrr_type_lookup() should always return a valid memory type. In case there is no information available, it should return the default UC. This will remove the last case where mtrr_type_lookup() can return MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h. Note that removing the MTRR_TYPE_INVALID #define from that header could break user code, so it has to stay. At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR case should set uniform to 1, as if the memory range would be covered by no MTRR at all. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Use new cache_map in mtrr_type_lookup()Juergen Gross1-194/+43
Instead of crawling through the MTRR register state, use the new cache_map for looking up the cache type(s) of a memory region. This allows now to set the uniform parameter according to the uniformity of the cache mode of the region, instead of setting it only if the complete region is mapped by a single MTRR. This now includes even the region covered by the fixed MTRR registers. Make sure uniform is always set. [ bp: Massage. ] [ jgross: Explain mtrr_type_lookup() logic. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Add mtrr=debug command line optionJuergen Gross1-19/+45
Add a new command line option "mtrr=debug" for getting debug output after building the new cache mode map. The output will include MTRR register values and the resulting map. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Construct a memory map with cache modesJuergen Gross3-2/+314
After MTRR initialization construct a memory map with cache modes from MTRR values. This will speed up lookups via mtrr_type_lookup() especially in case of overlapping MTRRs. This will be needed when switching the semantics of the "uniform" parameter of mtrr_type_lookup() from "only covered by one MTRR" to "memory range has a uniform cache mode", which is the data the callers really want to know. Today this information is not easily available, in case MTRRs are not well sorted regarding base address. The map will be built in __initdata. When memory management is up, the map will be moved to dynamically allocated memory, in order to avoid the need of an overly large array. The size of this array is calculated using the number of variable MTRR registers and the needed size for fixed entries. Only add the map creation and expansion for now. The lookup will be added later. When writing new MTRR entries in the running system rebuild the map inside the call from mtrr_rendezvous_handler() in order to avoid nasty race conditions with concurrent lookups. [ bp: Move out rebuild_map() call and rename it. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Add get_effective_type() service functionJuergen Gross1-20/+19
Add a service function for obtaining the effective cache mode of overlapping MTRR registers. Make use of that function in check_type_overlap(). Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
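How two overlapping MTRR types can be combined into one effective cache mode might look like the sketch below. It follows the architectural precedence rules (UC always wins, WB combined with WT yields WT) and, as an assumption of this sketch, conservatively treats every other mismatch as UC. The function name effective_type() is hypothetical and this is not claimed to be the body of the kernel's get_effective_type(); the type encodings are the architectural MTRR values.

```c
#include <stdint.h>

/* Architectural MTRR memory type encodings. */
enum {
	MTRR_TYPE_UNCACHABLE = 0,
	MTRR_TYPE_WRCOMB     = 1,
	MTRR_TYPE_WRTHROUGH  = 4,
	MTRR_TYPE_WRPROT     = 5,
	MTRR_TYPE_WRBACK     = 6,
};

/* Hypothetical helper: combine the types of two overlapping MTRRs. */
uint8_t effective_type(uint8_t type1, uint8_t type2)
{
	/* UC overrides any other type. */
	if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
		return MTRR_TYPE_UNCACHABLE;

	/* WB overlapping WT results in WT, in either order. */
	if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) ||
	    (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK))
		return MTRR_TYPE_WRTHROUGH;

	/* Assumption of this sketch: any other conflict falls back to UC. */
	if (type1 != type2)
		return MTRR_TYPE_UNCACHABLE;

	return type1;
}
```

Because the combination is symmetric, a range covered by more than two MTRRs can be handled by folding the helper over all of them in any order.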
2023-06-01x86/mtrr: Allocate mtrr_value array dynamicallyJuergen Gross1-1/+7
The mtrr_value[] array is a static variable which is used only in a few configurations. Consuming 6kB is ridiculous for this case, especially as the array doesn't need to be that large and it can easily be allocated dynamically. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Move 32-bit code from mtrr.c to legacy.cJuergen Gross4-75/+98
There is some code in mtrr.c which is relevant for old 32-bit CPUs only. Move it to a new source legacy.c. While modifying mtrr_init_finalize() fix spelling of its name. Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Have only one set_mtrr() variantJuergen Gross1-20/+8
Today there are two variants of set_mtrr(): one calling stop_machine() and one calling stop_machine_cpuslocked(). The first one (set_mtrr()) has only one caller, and this caller is running only when resuming from suspend when the interrupts are still off and only one CPU is active. Additionally this code is used only on rather old 32-bit CPUs not supporting SMP. For these reasons the first variant can be replaced by a simple call of mtrr_if->set(). Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that there is only one variant left, in order to have a shorter function name. Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Replace vendor tests in MTRR codeJuergen Gross7-29/+31
Modern CPUs all share the same MTRR interface implemented via generic_mtrr_ops. At several places in MTRR code this generic interface is deduced via is_cpu(INTEL) tests, which is only working due to X86_VENDOR_INTEL being 0 (the is_cpu() macro is testing mtrr_if->vendor, which isn't explicitly set in generic_mtrr_ops). Test the generic CPU feature X86_FEATURE_MTRR instead. The only other place where the .vendor member of struct mtrr_ops is being used is in set_num_var_ranges(), where depending on the vendor the number of MTRR registers is determined. This can easily be changed by replacing .vendor with the static number of MTRR registers. It should be noted that the test "is_cpu(HYGON)" wasn't ever returning true, as there is no struct mtrr_ops with that vendor information. [ bp: Use mtrr_enabled() before doing mtrr_if-> accesses, esp. in mtrr_trim_uncached_memory() which gets called independently from whether mtrr_if is set or not. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
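The reason the is_cpu(INTEL) tests only worked by accident can be shown with a small sketch. All names here (struct mtrr_ops_sketch, ops_is_vendor(), VENDOR_OTHER) are hypothetical; the only fact taken from the message above is that the Intel vendor constant is 0 and that the generic ops structure never sets its vendor member.

```c
#include <stdbool.h>

/* Intel being 0 comes from the commit message; VENDOR_OTHER is made up. */
enum { VENDOR_INTEL = 0, VENDOR_OTHER = 1 };

/* Hypothetical, trimmed-down stand-in for struct mtrr_ops. */
struct mtrr_ops_sketch {
	int vendor;
	const char *name;
};

/* 'vendor' is not initialized explicitly, so C zero-initializes it. */
const struct mtrr_ops_sketch generic_ops = { .name = "generic" };

/* Hypothetical equivalent of the is_cpu() comparison. */
bool ops_is_vendor(const struct mtrr_ops_sketch *ops, int vendor)
{
	return ops->vendor == vendor;
}
```

Here ops_is_vendor(&generic_ops, VENDOR_INTEL) is true on any CPU using the generic ops, regardless of who made it, which is why testing a CPU feature bit is the more robust check.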
2023-06-01x86/mtrr: Support setting MTRR state for software defined MTRRsJuergen Gross2-2/+72
When running virtualized, MTRR access can be reduced (e.g. in Xen PV guests or when running as a SEV-SNP guest under Hyper-V). Typically, the hypervisor will not advertize the MTRR feature in CPUID data, resulting in no MTRR memory type information being available for the kernel. This has turned out to result in problems (Link tags below): - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't - Xen PV dom0 mapping memory as WB which should be UC- instead Solve those problems by allowing an MTRR static state override, overwriting the empty state used today. In case such a state has been set, don't call get_mtrr_state() in mtrr_bp_init(). The set state will only be used by mtrr_type_lookup(), as in all other cases mtrr_enabled() is being checked, which will return false. Accept the overwrite call only for selected cases when running as a guest. Disable X86_FEATURE_MTRR in order to avoid any MTRR modifications by just refusing them. [ bp: Massage. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Tested-by: Michael Kelley <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-06-01x86/mtrr: Replace size_or_mask and size_and_mask with a much easier conceptJuergen Gross4-43/+35
Replace size_or_mask and size_and_mask with the much easier concept of high reserved bits. While at it, instead of using constants in the MTRR code, use some new [ bp: - Drop mtrr_set_mask() - Unbreak long lines - Move struct mtrr_state_type out of the uapi header as it doesn't belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’" as Reported-by: kernel test robot <[email protected]> - Massage. ] Signed-off-by: Juergen Gross <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Borislav Petkov (AMD) <[email protected]>
2023-05-30x86/resctrl: Only show tasks' pid in current pid namespaceShawn Wang1-2/+6
When writing a task id to the "tasks" file in an rdtgroup, rdtgroup_tasks_write() treats the pid as a number in the current pid namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows the list of global pids from the init namespace, which is confusing and incorrect. To be more robust, let the "tasks" file only show pids in the current pid namespace. Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Signed-off-by: Shawn Wang <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Acked-by: Reinette Chatre <[email protected]> Acked-by: Fenghua Yu <[email protected]> Tested-by: Reinette Chatre <[email protected]> Link: https://lore.kernel.org/all/[email protected]/
2023-05-25x86/topology: Fix erroneous smp_num_siblings on Intel Hybrid platformsZhang Rui1-2/+3
Traditionally, all CPUs in a system have identical numbers of SMT siblings. That changes with hybrid processors where some logical CPUs have a sibling and others have none. Today, the CPU boot code sets the global variable smp_num_siblings when every CPU thread is brought up. The last thread to boot will overwrite it with the number of siblings of *that* thread. That last thread to boot will "win". If the thread is a Pcore, smp_num_siblings == 2. If it is an Ecore, smp_num_siblings == 1. smp_num_siblings describes if the *system* supports SMT. It should specify the maximum number of SMT threads among all cores. Ensure that smp_num_siblings represents the system-wide maximum number of siblings by always increasing its value. Never allow it to decrease. On MeteorLake-P platform, this fixes a problem that the Ecore CPUs are not updated in any cpu sibling map because the system is treated as an UP system when probing Ecore CPUs. Below shows part of the CPU topology information before and after the fix, for both Pcore and Ecore CPU (cpu0 is Pcore, cpu 12 is Ecore). ... -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11 +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21 ... -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000 -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12 +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21 Notice that the "before" 'package_cpus_list' has only one CPU. This means that userspace tools like lscpu will see a little laptop like an 11-socket system: -Core(s) per socket: 1 -Socket(s): 11 +Core(s) per socket: 16 +Socket(s): 1 This is also expected to make the scheduler do rather wonky things too. 
[ dhansen: remove CPUID detail from changelog, add end user effects ] CC: [email protected] Fixes: bbb65d2d365e ("x86: use cpuid vector 0xb when available for detecting cpu topology") Fixes: 95f3d39ccf7a ("x86/cpu/topology: Provide detect_extended_topology_early()") Suggested-by: Len Brown <[email protected]> Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/all/20230323015640.27906-1-rui.zhang%40intel.com
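The "only ever increase" rule described in this fix can be sketched as a simple running maximum. The variable num_siblings and the function record_cpu_siblings() are hypothetical stand-ins for illustration, not the kernel's smp_num_siblings handling.

```c
/* Hypothetical stand-in for the system-wide sibling count. */
unsigned int num_siblings = 1;

/*
 * Called once per CPU thread as it is brought up. The global value is
 * raised when a core reports more SMT threads, and never lowered, so
 * the boot order of P-cores and E-cores no longer matters.
 */
void record_cpu_siblings(unsigned int threads_on_this_core)
{
	if (threads_on_this_core > num_siblings)
		num_siblings = threads_on_this_core;
}
```

With this shape, booting a two-thread core followed by a single-thread core leaves the global value at 2, instead of letting the last CPU to boot overwrite it.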
2023-05-16x86/MCE: Check a hw error's address to determine proper recovery actionYazen Ghannam1-1/+1
Make sure that machine check errors with a usable address are properly marked as poison. This is needed for errors that occur on memory which have MCG_STATUS[RIPV] clear - i.e., the interrupted process cannot be restarted reliably. One example is data poison consumption through the instruction fetch units on AMD Zen-based systems. The MF_MUST_KILL flag is passed to memory_failure() when MCG_STATUS[RIPV] is not set. So the associated process will still be killed. What this does, practically, is get rid of one more check to kill_current_task with the eventual goal to remove it completely. Also, make the handling identical to what is done on the notifier path (uc_decode_notifier() does that address usability check too). The scenario described above occurs when hardware can precisely identify the address of poisoned memory, but execution cannot reliably continue for the interrupted hardware thread. [ bp: Massage commit message. ] Signed-off-by: Yazen Ghannam <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Reviewed-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-05-15x86/cpu: Remove X86_FEATURE_NAMESLukas Bulwahn2-7/+1
While discussing changing the visibility of X86_FEATURE_NAMES (see Link) in order to remove CONFIG_EMBEDDED, Boris suggested simply making the X86_FEATURE_NAMES functionality unconditional. As the need for really tiny kernel images has gone away and kernel images with !X86_FEATURE_NAMES are hardly tested, remove this config and the whole ifdeffery in the source code. Suggested-by: Borislav Petkov <[email protected]> Signed-off-by: Lukas Bulwahn <[email protected]> Signed-off-by: Borislav Petkov (AMD) <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/r/[email protected]