aboutsummaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2024-05-03arch: Rename fbdev header and source filesThomas Zimmermann2-5/+5
The per-architecture fbdev code has no dependencies on fbdev and can be used for any video-related subsystem. Rename the files to 'video'. Use video-sti.c on parisc as the source file depends on CONFIG_STI_CORE. On arc, arm, arm64, sh, and um the asm header file is an empty wrapper around the file in asm-generic. Let Kbuild generate the file. The build system does this automatically. Only um needs to generate video.h explicitly, so that it overrides the host architecture's header. The latter would otherwise interfere with the build. Further update all includes statements, include guards, and Makefiles. Also update a few strings and comments to refer to video instead of fbdev. v3: - arc, arm, arm64, sh: generate asm header via build system (Sam, Helge, Arnd) - um: rename fb.h to video.h - fix typo in commit message (Sam) Signed-off-by: Thomas Zimmermann <[email protected]> Reviewed-by: Sam Ravnborg <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Huacai Chen <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Helge Deller <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: [email protected] Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2024-05-03powerpc/dexcr: Reset DEXCR value across execBenjamin Gray4-1/+32
Inheriting the DEXCR across exec can have security and usability concerns. If a program is compiled with hash instructions it generally expects to run with NPHIE enabled. But if the parent process disables NPHIE then if it's not careful it will be disabled for any children too and the protection offered by hash checks is basically worthless. This patch introduces a per-process reset value that new execs in a particular process tree are initialized with. This enables fine grained control over what DEXCR value child processes run with by default. For example, containers running legacy binaries that expect hash instructions to act as NOPs could configure the reset value of the container root to control the default reset value for all members of the container. Signed-off-by: Benjamin Gray <[email protected]> [mpe: Add missing SPDX tag on dexcr.c] Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-05-03powerpc/dexcr: Track the DEXCR per-processBenjamin Gray3-6/+12
Add capability to make the DEXCR act as a per-process SPR. We do not yet have an interface for changing the values per task. We also expect the kernel to use a single DEXCR value across all tasks while in privileged state, so there is no need to synchronize after changing it (the userspace aspects will synchronize upon returning to userspace). Signed-off-by: Benjamin Gray <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-05-03powerpc/module: Remove arch specific module bug stuffDr. David Alan Gilbert2-7/+0
The last function to reference module_bug_list went in 2008's commit b9754568ef17 ("powerpc: Remove dead module_find_bug code") but I don't think that was called since 2006's commit 73c9ceab40b1 ("[POWERPC] Generic BUG for powerpc") Now that the list has gone, I think we can also clean up the bug entries in mod_arch_specific. Lightly boot tested. Signed-off-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-05-03powerpc/crash: remove unnecessary NULL check before kvfree()Sourabh Jain1-2/+1
Fix the following coccicheck build warning: arch/powerpc/kexec/crash.c:488:2-8: WARNING: NULL check before some freeing functions is not needed. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Sourabh Jain <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-05-02arch: use $(obj)/ instead of $(src)/ for preprocessed linker scriptsMasahiro Yamada1-2/+2
These are generated files. Prefix them with $(obj)/ instead of $(src)/. Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Helge Deller <[email protected]> Reviewed-by: Nicolas Schier <[email protected]>
2024-04-30powerpc: Mark memory_limit as initdataMichael Ellerman1-1/+1
The `memory_limit` variable should only be used during boot, enforce that by marking it initdata. Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/pseries/vio: Don't return ENODEV if node or compatible missingLidong Zhong1-6/+2
We noticed the following nuisance messages during boot process: vio vio: uevent: failed to send synthetic uevent vio 4000: uevent: failed to send synthetic uevent vio 4001: uevent: failed to send synthetic uevent vio 4002: uevent: failedto send synthetic uevent vio 4004: uevent: failed to send synthetic uevent It's caused by either vio_register_device_node() failing to set dev->of_node or the node is missing a "compatible" property. To match the definition of modalias in modalias_show(), remove the return of ENODEV in such cases. The failure messages is also suppressed with this change. Signed-off-by: Lidong Zhong <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/pseries: Enforce hcall result buffer validity and sizeNathan Lynch1-4/+4
plpar_hcall(), plpar_hcall9(), and related functions expect callers to provide valid result buffers of certain minimum size. Currently this is communicated only through comments in the code and the compiler has no idea. For example, if I write a bug like this: long retbuf[PLPAR_HCALL_BUFSIZE]; // should be PLPAR_HCALL9_BUFSIZE plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, ...); This compiles with no diagnostics emitted, but likely results in stack corruption at runtime when plpar_hcall9() stores results past the end of the array. (To be clear this is a contrived example and I have not found a real instance yet.) To make this class of error less likely, we can use explicitly-sized array parameters instead of pointers in the declarations for the hcall APIs. When compiled with -Warray-bounds[1], the code above now provokes a diagnostic like this: error: array argument is too small; is of size 32, callee requires at least 72 [-Werror,-Warray-bounds] 60 | plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, | ^ ~~~~~~ [1] Enabled for LLVM builds but not GCC for now. See commit 0da6e5fd6c37 ("gcc: disable '-Warray-bounds' for gcc-13 too") and related changes. Signed-off-by: Nathan Lynch <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/dart: Drop unnecessary call to kmemleak_no_scan()Michael Ellerman1-4/+0
Erhard reported that kmemleak was showing a warning at boot: kmemleak: Not scanning unknown object at 0xc00000007f000000 CPU: 0 PID: 0 Comm: swapper Not tainted 5.19.0-rc3-PMacG5+ #2 Call Trace: .dump_stack_lvl+0x7c/0xc4 (unreliable) .kmemleak_no_scan+0xe0/0x100 .iommu_init_early_dart+0x2f0/0x924 .pmac_probe+0x1b0/0x20c .setup_arch+0x1b8/0x674 .start_kernel+0xdc/0xb74 start_here_common+0x1c/0x44 DART table allocated at: (____ptrval____) Which he bisected to a change in kmemleak, commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()"). Because pmac_probe() is called before mem_topology_setup(), the min/ max PFN variables are still zero. That causes kmemleak_alloc_phys() to ignore the allocation, because the checks against the PFN fail. Then kmemleak_no_scan() can't find the allocation and prints warning. Given that kmemleak_alloc_phys() is ignoring the allocation to begin with, there's no need to call kmemleak_no_scan() at all, which avoids the warning. Reported-by: Erhard Furtner <[email protected]> Closes: https://lore.kernel.org/all/[email protected]%2F/ Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/eeh: Permanently disable the removed deviceGanesh Goudar2-3/+21
When a device is hot removed on powernv, the hotplug driver clears the device's state. However, on pseries, if a device is removed by phyp after reaching the error threshold, the kernel remains unaware, leading to the device not being torn down. This prevents necessary remediation actions like failover. Permanently disable the device if the presence check fails. Also, in eeh_dev_check_failure in we may consider the error as false positive if the device is hotpluged out as the get_state call returns EEH_STATE_NOT_SUPPORT and we may end up not clearing the device state, so log the event if the state is not moved to permanent failure state. Signed-off-by: Ganesh Goudar <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/fadump: add hotplug_ready sysfs interfaceSourabh Jain1-0/+14
The elfcorehdr describes the CPUs and memory of the crashed kernel to the kernel that captures the dump, known as the second or fadump kernel. The elfcorehdr needs to be updated if the system's memory changes due to memory hotplug or online/offline events. Currently, memory hotplug events are monitored in userspace by udev rules, and fadump is re-registered, which recreates the elfcorehdr with the latest available memory in the system. However, the previous patch ("powerpc: make fadump resilient with memory add/remove events") moved the creation of elfcorehdr to the second or fadump kernel. This eliminates the need to regenerate the elfcorehdr during memory hotplug or online/offline events. Create a sysfs entry at /sys/kernel/fadump/hotplug_ready to let userspace know that fadump re-registration is not required for memory add/remove events. Signed-off-by: Sourabh Jain <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc: make fadump resilient with memory add/remove eventsSourabh Jain4-206/+242
Due to changes in memory resources caused by either memory hotplug or online/offline events, the elfcorehdr, which describes the CPUs and memory of the crashed kernel to the kernel that collects the dump (known as second/fadump kernel), becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to failed or inaccurate dump collection. Memory hotplug or online/offline events is referred as memory add/remove events in reset of the commit message. The current solution to address the aforementioned issue is as follows: Monitor memory add/remove events in userspace using udev rules, and re-register fadump whenever there are changes in memory resources. This leads to the creation of a new elfcorehdr with updated system memory information. There are several notable issues associated with re-registering fadump for every memory add/remove events. 1. Bulk memory add/remove events with udev-based fadump re-registration can lead to race conditions and, more importantly, it creates a wide window during which fadump is inactive until all memory add/remove events are settled. 2. Re-registering fadump for every memory add/remove event is inefficient. 3. The memory for elfcorehdr is allocated based on the memblock regions available during early boot and remains fixed thereafter. However, if elfcorehdr is later recreated with additional memblock regions, its size will increase, potentially leading to memory corruption. Address the aforementioned challenges by shifting the creation of elfcorehdr from the first kernel (also referred as the crashed kernel), where it was created and frequently recreated for every memory add/remove event, to the fadump kernel. As a result, the elfcorehdr only needs to be created once, thus eliminating the necessity to re-register fadump during memory add/remove events. At present, the first kernel prepares fadump header and stores it in the fadump reserved area. The fadump header includes the start address of the elfcorehdr, crashing CPU details, and other relevant information. In the event of a crash in the first kernel, the second/fadump boots and accesses the fadump header prepared by the first kernel. It then performs the following steps in a platform-specific function [rtas|opal]_fadump_process: 1. Sanity check for fadump header 2. Update CPU notes in elfcorehdr Along with the above, update the setup_fadump()/fadump.c to create elfcorehdr and set its address to the global variable elfcorehdr_addr for the vmcore module to process it in the second/fadump kernel. Section below outlines the information required to create the elfcorehdr and the changes made to make it available to the fadump kernel if it's not already. To create elfcorehdr, the following crashed kernel information is required: CPU notes, vmcoreinfo, and memory ranges. At present, the CPU notes are already prepared in the fadump kernel, so no changes are needed in that regard. The fadump kernel has access to all crashed kernel memory regions, including boot memory regions that are relocated by firmware to fadump reserved areas, so no changes for that either. However, it is necessary to add new members to the fadump header, i.e., the 'fadump_crash_info_header' structure, in order to pass the crashed kernel's vmcoreinfo address and its size to fadump kernel. In addition to the vmcoreinfo address and size, there are a few other attributes also added to the fadump_crash_info_header structure. 1. version: It stores the fadump header version, which is currently set to 1. This provides flexibility to update the fadump crash info header in the future without changing the magic number. For each change in the fadump header, the version will be increased. This will help the updated kernel determine how to handle kernel dumps from older kernels. The magic number remains relevant for checking fadump header corruption. 2. pt_regs_sz/cpu_mask_sz: Store size of pt_regs and cpu_mask structure of first kernel. These attributes are used to prevent dump processing if the sizes of pt_regs or cpu_mask structure differ between the first and fadump kernels. Note: if either first/crashed kernel or second/fadump kernel do not have the changes introduced here then kernel fail to collect the dump and prints relevant error message on the console. Signed-off-by: Sourabh Jain <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/pseries: Add failure related checks for h_get_mpp and h_get_pppShrikanth Hegde3-7/+7
Couple of Minor fixes: - hcall return values are long. Fix that for h_get_mpp, h_get_ppp and parse_ppp_data - If hcall fails, values set should be at-least zero. It shouldn't be uninitialized values. Fix that for h_get_mpp and h_get_ppp Signed-off-by: Shrikanth Hegde <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-29powerpc/pseries: Add pool idle time at LPAR bootShrikanth Hegde1-9/+30
When there are no options specified for lparstat, it is expected to give reports since LPAR(Logical Partition) boot. APP(Available Processor Pool) is an indicator of how many cores in the shared pool are free to use in Shared Processor LPAR(SPLPAR). APP is derived using pool_idle_time which is obtained using H_PIC call. The interval based reports show correct APP value while since boot report shows very high APP values. This happens because in that case APP is obtained by dividing pool idle time by LPAR uptime. Since pool idle time is reported by the PowerVM hypervisor since its boot, it need not align with LPAR boot. To fix that export boot pool idle time in lparcfg and powerpc-utils will use this info to derive APP as below for since boot reports. APP = (pool idle time - boot pool idle time) / (uptime * timebase) Results:: Observe APP values. ====================== Shared LPAR ================================ lparstat System Configuration type=Shared mode=Uncapped smt=8 lcpu=12 mem=15573440 kB cpus=37 ent=12.00 reboot stress-ng --cpu=$(nproc) -t 600 sleep 600 So in this case app is expected to close to 37-6=31. ====== 6.9-rc1 and lparstat 1.3.10 ============= %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 47.48 0.01 0.00 52.51 0.00 0.00 47.49 69099.72 541547 21 === With this patch and powerpc-utils patch to do the above equation === %user %sys %wait %idle physc %entc lbusy app vcsw phint ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 47.48 0.01 0.00 52.51 5.73 47.75 47.49 31.21 541753 21 ===================================================================== Note: physc, purr/idle purr being inaccurate is being handled in a separate patch in powerpc-utils tree. Signed-off-by: Shrikanth Hegde <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-25mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FASTDavid Hildenbrand1-1/+1
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25powerpc: mm: accelerate pagefault when badaccessKefeng Wang1-13/+20
The access_[pkey]_error() of vma already checked under per-VMA lock, if it is a bad access, directly handle error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_access_pkey()/bad_access(), if mm is NULL, release vma lock, or release mmap_lock. Since the page faut is handled under per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russell King <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25powerpc: use initializer for struct vm_unmapped_area_infoRick Edgecombe1-11/+9
Future changes will need to add a new member to struct vm_unmapped_area_info. This would cause trouble for any call site that doesn't initialize the struct. Currently every caller sets each member manually, so if new members are added they will be uninitialized and the core code parsing the struct will see garbage in the new member. It could be possible to initialize the new member manually to 0 at each call site. This and a couple other options were discussed, and a working consensus (see links) was that in general the best way to accomplish this would be via static initialization with designated member initiators. Having some struct vm_unmapped_area_info instances not zero initialized will put those sites at risk of feeding garbage into vm_unmapped_area() if the convention is to zero initialize the struct and any new member addition misses a call site that initializes each member manually. It could be possible to leave the code mostly untouched, and just change the line: struct vm_unmapped_area_info info to: struct vm_unmapped_area_info info = {}; However, that would leave cleanup for the members that are manually set to zero, as it would no longer be required. So to be reduce the chance of bugs via uninitialized members, instead simply continue the process to initialize the struct this way tree wide. This will zero any unspecified members. Move the member initializers to the struct declaration when they are known at that time. Leave the members out that were manually initialized to zero, as this would be redundant for designated initializers. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/lkml/202402280912.33AEE7A9CF@keescook/#t Link: https://lore.kernel.org/lkml/j7bfvig3gew3qruouxrh7z7ehjjafrgkbcmg6tcghhfh3rhmzi@wzlcoecgy5rs/ Signed-off-by: Rick Edgecombe <[email protected]> Acked-by: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Dan Williams <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Deepak Gupta <[email protected]> Cc: Guo Ren <[email protected]> Cc: Helge Deller <[email protected]> Cc: H. Peter Anvin (Intel) <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Mark Brown <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/mm_init.c: remove arch_reserved_kernel_pages()Baoquan He2-9/+0
Since the current calculation of calc_nr_kernel_pages() has taken into consideration of kernel reserved memory, no need to have arch_reserved_kernel_pages() any more. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25change alloc_pages name in dma_map_ops to avoid name conflictsSuren Baghdasaryan3-4/+4
After redefining alloc_pages, all uses of that name are being replaced. Change the conflicting names to prevent preprocessor from replacing them when it's not intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Tested-by: Kees Cook <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Gaynor <[email protected]> Cc: Alice Ryhl <[email protected]> Cc: Andreas Hindborg <[email protected]> Cc: Benno Lossin <[email protected]> Cc: "Björn Roy Baron" <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Gary Guo <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Pasha Tatashin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25fix missing vmalloc.h includesKent Overstreet2-0/+2
Patch series "Memory allocation profiling", v6. Overview: Low overhead [1] per-callsite memory allocation profiling. Not just for debug kernels, overhead low enough to be deployed in production. Example output: root@moria-kvm:~# sort -rn /proc/allocinfo 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext 56373248 4737 mm/slub.c:2259 func:alloc_slab_page 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start 3940352 962 mm/memory.c:4214 func:alloc_anon_folio 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node ... Usage: kconfig options: - CONFIG_MEM_ALLOC_PROFILING - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT - CONFIG_MEM_ALLOC_PROFILING_DEBUG adds warnings for allocations that weren't accounted because of a missing annotation sysctl: /proc/sys/vm/mem_profiling Runtime info: /proc/allocinfo Notes: [1]: Overhead To measure the overhead we are comparing the following configurations: (1) Baseline with CONFIG_MEMCG_KMEM=n (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1) (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y && CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y Performance overhead: To evaluate performance we implemented an in-kernel test executing multiple get_free_page/free_page and kmalloc/kfree calls with allocation sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU affinity set to a specific CPU to minimize the noise. Below are results from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on 56 core Intel Xeon: kmalloc pgalloc (1 baseline) 6.764s 16.902s (2 default disabled) 6.793s (+0.43%) 17.007s (+0.62%) (3 default enabled) 7.197s (+6.40%) 23.666s (+40.02%) (4 runtime enabled) 7.405s (+9.48%) 23.901s (+41.41%) (5 memcg) 13.388s (+97.94%) 48.460s (+186.71%) (6 def disabled+memcg) 13.332s (+97.10%) 48.105s (+184.61%) (7 def enabled+memcg) 13.446s (+98.78%) 54.963s (+225.18%) Memory overhead: Kernel size: text data bss dec diff (1) 26515311 18890222 17018880 62424413 (2) 26524728 19423818 16740352 62688898 264485 (3) 26524724 19423818 16740352 62688894 264481 (4) 26524728 19423818 16740352 62688898 264485 (5) 26541782 18964374 16957440 62463596 39183 Memory consumption on a 56 core Intel CPU with 125GB of memory: Code tags: 192 kB PageExts: 262144 kB (256MB) SlabExts: 9876 kB (9.6MB) PcpuExts: 512 kB (0.5MB) Total overhead is 0.2% of total memory. Benchmarks: Hackbench tests run 100 times: hackbench -s 512 -l 200 -g 15 -f 25 -P baseline disabled profiling enabled profiling avg 0.3543 0.3559 (+0.0016) 0.3566 (+0.0023) stdev 0.0137 0.0188 0.0077 hackbench -l 10000 baseline disabled profiling enabled profiling avg 6.4218 6.4306 (+0.0088) 6.5077 (+0.0859) stdev 0.0933 0.0286 0.0489 stress-ng tests: stress-ng --class memory --seq 4 -t 60 stress-ng --class cpu --seq 4 -t 60 Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/ [2] https://lore.kernel.org/all/[email protected]/ This patch (of 37): The next patch drops vmalloc.h from a system header in order to fix a circular dependency; this adds it to all the files that were pulling it in implicitly. [[email protected]: fix arch/alpha/lib/memcpy.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arch/x86/mm/numa_32.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: a few places were depending on sizes.h] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix mm/kasan/hw_tags.c] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix arc build] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kent Overstreet <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Tested-by: Kees Cook <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alex Gaynor <[email protected]> Cc: Alice Ryhl <[email protected]> Cc: Andreas Hindborg <[email protected]> Cc: Benno Lossin <[email protected]> Cc: "Björn Roy Baron" <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Dennis Zhou <[email protected]> Cc: Gary Guo <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/treewide: remove pXd_huge()Peter Xu3-45/+0
This API is not used anymore, drop it for the whole tree. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fabio Estevam <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Konrad Dybcio <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Lucas Stach <[email protected]> Cc: Mark Salter <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Russell King <[email protected]> Cc: Shawn Guo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/treewide: replace pXd_huge() with pXd_leaf()Peter Xu1-3/+3
Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(), reuse it. Luckily, pXd_huge() isn't widely used. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fabio Estevam <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Konrad Dybcio <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Lucas Stach <[email protected]> Cc: Mark Salter <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Russell King <[email protected]> Cc: Shawn Guo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/powerpc: redefine pXd_huge() with pXd_leaf()Peter Xu2-27/+14
PowerPC book3s 4K mostly has the same definition on both, except pXd_huge() constantly returns 0 for hash MMUs. As Michael Ellerman pointed out [1], it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so it will keep returning false. As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create such huge mappings for 4K hash MMUs. Meanwhile, the major powerpc hugetlb pgtable walker __find_linux_pte() already used pXd_leaf() to check leaf hugetlb mappings. The goal should be that we will have one API pXd_leaf() to detect all kinds of huge mappings (hugepd is still special in this case, though). AFAICT we need to use the pXd_leaf() impl (rather than pXd_huge()'s) to make sure ie. THPs on hash MMU will also return true. This helps to simplify a follow up patch to drop pXd_huge() treewide. NOTE: *_leaf() definition need to be moved before the inclusion of asm/book3s/64/pgtable-4k.h, which defines pXd_huge() with it. [1] https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peter Xu <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: "Aneesh Kumar K.V" <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Alistair Popple <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bjorn Andersson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fabio Estevam <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Konrad Dybcio <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Lucas Stach <[email protected]> Cc: Mark Salter <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Muchun Song <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Russell King <[email protected]> Cc: Shawn Guo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25powerpc/papr_scm: Move duplicate definitions to common header filesShivaprasad G Bhat2-206/+2
papr_scm and ndtest share common PDSM payload structs like nd_papr_pdsm_health. Presently these structs are duplicated across papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially arch independent and can run on platforms other than PPC64, a way needs to be deviced to avoid redundancy and duplication of PDSM structs in future. So the patch proposes moving the PDSM header from arch/powerpc/include- -/uapi/ to the generic include/uapi/linux directory. Also, there are some #defines common between papr_scm and ndtest which are not exported to the user space. So, move them to a header file which can be shared across ndtest and papr_scm via newly introduced include/linux/papr_scm.h. Signed-off-by: Shivaprasad G Bhat <[email protected]> Signed-off-by: Vaibhav Jain <[email protected]> Suggested-by: Aneesh Kumar K.V <[email protected]> Link: https://lore.kernel.org/r/170638176942.112443.2937254675538057083.stgit@ltcd48-lp2.aus.stglab.ibm.com Signed-off-by: Ira Weiny <[email protected]>
2024-04-23Merge 6.9-rc5 into driver-core-nextGreg Kroah-Hartman3-7/+11
We want the kernfs fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-04-23powerpc/crash: add crash memory hotplug supportSourabh Jain5-2/+202
Extend the arch crash hotplug handler, as introduced by the patch title ("powerpc: add crash CPU hotplug support"), to also support memory add/remove events. Elfcorehdr describes the memory of the crash kernel to capture the kernel; hence, it needs to be updated if memory resources change due to memory add/remove events. Therefore, arch_crash_handle_hotplug_event() is updated to recreate the elfcorehdr and replace it with the previous one on memory add/remove events. The memblock list is used to prepare the elfcorehdr. In the case of memory hot remove, the memblock list is updated after the arch crash hotplug handler is triggered, as depicted in Figure 1. Thus, the hot-removed memory is explicitly removed from the crash memory ranges to ensure that the memory ranges added to elfcorehdr do not include the hot-removed memory. Memory remove | v Offline pages | v Initiate memory notify call <----> crash hotplug handler chain for MEM_OFFLINE event | v Update memblock list Figure 1 There are two system calls, `kexec_file_load` and `kexec_load`, used to load the kdump image. A few changes have been made to ensure that the kernel can safely update the elfcorehdr component of the kdump image for both system calls. For the kexec_file_load syscall, kdump image is prepared in the kernel. To support an increasing number of memory regions, the elfcorehdr is built with extra buffer space to ensure that it can accommodate additional memory ranges in future. For the kexec_load syscall, the elfcorehdr is updated only if the KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag is passed to the kernel by the kexec tool. Passing this flag to the kernel indicates that the elfcorehdr is built to accommodate additional memory ranges and the elfcorehdr segment is not considered for SHA calculation, making it safe to update. The changes related to this feature are kept under the CRASH_HOTPLUG config, and it is enabled by default. Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-23powerpc/crash: add crash CPU hotplug supportSourabh Jain5-1/+134
Due to CPU/Memory hotplug or online/offline events, the elfcorehdr (which describes the CPUs and memory of the crashed kernel) and FDT (Flattened Device Tree) of kdump image becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr or FDT can lead to failed or inaccurate dump collection. Going forward, CPU hotplug or online/offline events are referred as CPU/Memory add/remove events. The current solution to address the above issue involves monitoring the CPU/Memory add/remove events in userspace using udev rules and whenever there are changes in CPU and memory resources, the entire kdump image is loaded again. The kdump image includes kernel, initrd, elfcorehdr, FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to CPU/Memory add/remove events, reloading the entire kdump image is inefficient. More importantly, kdump remains inactive for a substantial amount of time until the kdump reload completes. To address the aforementioned issue, commit 247262756121 ("crash: add generic infrastructure for crash hotplug support") added a generic infrastructure that allows architectures to selectively update the kdump image component during CPU or memory add/remove events within the kernel itself. In the event of a CPU or memory add/remove events, the generic crash hotplug event handler, `crash_handle_hotplug_event()`, is triggered. It then acquires the necessary locks to update the kdump image and invokes the architecture-specific crash hotplug handler, `arch_crash_handle_hotplug_event()`, to update the required kdump image components. This patch adds crash hotplug handler for PowerPC and enable support to update the kdump image on CPU add/remove events. Support for memory add/remove events is added in a subsequent patch with the title "powerpc: add crash memory hotplug support" As mentioned earlier, only the elfcorehdr and FDT kdump image components need to be updated in the event of CPU or memory add/remove events. However, on PowerPC architecture crash hotplug handler only updates the FDT to enable crash hotplug support for CPU add/remove events. Here's why. The elfcorehdr on PowerPC is built with possible CPUs, and thus, it does not need an update on CPU add/remove events. On the other hand, the FDT needs to be updated on CPU add events to include the newly added CPU. If the FDT is not updated and the kernel crashes on a newly added CPU, the kdump kernel will fail to boot due to the unavailability of the crashing CPU in the FDT. During the early boot, it is expected that the boot CPU must be a part of the FDT; otherwise, the kernel will raise a BUG and fail to boot. For more information, refer to commit 36ae37e3436b0 ("powerpc: Make boot_cpuid common between 32 and 64-bit"). Since it is okay to have an offline CPU in the kdump FDT, no action is taken in case of CPU removal. There are two system calls, `kexec_file_load` and `kexec_load`, used to load the kdump image. Few changes have been made to ensure kernel can safely update the FDT of kdump image loaded using both system calls. For kexec_file_load syscall the kdump image is prepared in kernel. So to support an increasing number of CPUs, the FDT is constructed with extra buffer space to ensure it can accommodate a possible number of CPU nodes. Additionally, a call to fdt_pack (which trims the unused space once the FDT is prepared) is avoided if this feature is enabled. For the kexec_load syscall, the FDT is updated only if the KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag is passed to the kernel by userspace (kexec tools). When userspace passes this flag to the kernel, it indicates that the FDT is built to accommodate possible CPUs, and the FDT segment is excluded from SHA calculation, making it safe to update. The changes related to this feature are kept under the CRASH_HOTPLUG config, and it is enabled by default. Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-23powerpc/kexec: make the update_cpus_node() function publicSourabh Jain3-87/+95
Move the update_cpus_node() from kexec/{file_load_64.c => core_64.c} to allow other kexec components to use it. Later in the series, this function is used for in-kernel updates to the kdump image during CPU/memory hotplug or online/offline events for both kexec_load and kexec_file_load syscalls. No functional changes are intended. Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-23powerpc/kexec: move *_memory_ranges functions to ranges.cSourabh Jain4-216/+224
Move the following functions form kexec/{file_load_64.c => ranges.c} and make them public so that components other than KEXEC_FILE can also use these functions. 1. get_exclude_memory_ranges 2. get_reserved_memory_ranges 3. get_crash_memory_ranges 4. get_usable_memory_ranges Later in the series get_crash_memory_ranges function is utilized for in-kernel updates to kdump image during CPU/Memory hotplug or online/offline events for both kexec_load and kexec_file_load syscalls. Since the above functions are moved to ranges.c, some of the helper functions in ranges.c are no longer required to be public. Mark them as static and removed them from kexec_ranges.h header file. Finally, remove the CONFIG_KEXEC_FILE build dependency for range.c because it is required for other config, such as CONFIG_CRASH_DUMP. No functional changes are intended. Signed-off-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-23powerpc/pseries/iommu: LPAR panics during boot up with a frozen PEGaurav Batra1-0/+8
At the time of LPAR boot up, partition firmware provides Open Firmware property ibm,dma-window for the PE. This property is provided on the PCI bus the PE is attached to. There are execptions where the partition firmware might not provide this property for the PE at the time of LPAR boot up. One of the scenario is where the firmware has frozen the PE due to some error condition. This PE is frozen for 24 hours or unless the whole system is reinitialized. Within this time frame, if the LPAR is booted, the frozen PE will be presented to the LPAR but ibm,dma-window property could be missing. Today, under these circumstances, the LPAR oopses with NULL pointer dereference, when configuring the PCI bus the PE is attached to. BUG: Kernel NULL pointer dereference on read at 0x000000c8 Faulting instruction address: 0xc0000000001024c0 Oops: Kernel access of bad area, sig: 7 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: Supported: Yes CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1 Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_023) hv:phyp pSeries NIP: c0000000001024c0 LR: c0000000001024b0 CTR: c000000000102450 REGS: c0000000037db5c0 TRAP: 0300 Not tainted (6.4.0-150600.9-default) MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 28000822 XER: 00000000 CFAR: c00000000010254c DAR: 00000000000000c8 DSISR: 00080000 IRQMASK: 0 ... NIP [c0000000001024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0 LR [c0000000001024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 Call Trace: pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable) pcibios_setup_bus_self+0x1c0/0x370 __of_scan_bus+0x2f8/0x330 pcibios_scan_phb+0x280/0x3d0 pcibios_init+0x88/0x12c do_one_initcall+0x60/0x320 kernel_init_freeable+0x344/0x3e4 kernel_init+0x34/0x1d0 ret_from_kernel_user_thread+0x14/0x1c Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of ibm,dma-window") Signed-off-by: Gaurav Batra <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-22powerpc/pseries: make max polling consistent for longer H_CALLsNayna Jain2-8/+7
Currently, plpks_confirm_object_flushed() function polls for 5msec in total instead of 5sec. Keep max polling time consistent for all the H_CALLs, which take longer than expected, to be 5sec. Also, make use of fsleep() everywhere to insert delay. Reported-by: Nageswara R Sastry <[email protected]> Fixes: 2454a7af0f2a ("powerpc/pseries: define driver for Platform KeyStore") Signed-off-by: Nayna Jain <[email protected]> Tested-by: Nageswara R Sastry <[email protected]> Reviewed-by: Andrew Donnellan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-20Merge tag 'powerpc-6.9-3' of ↵Linus Torvalds2-5/+10
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix wireguard loading failure on pre-Power10 due to Power10 crypto routines - Fix papr-vpd selftest failure due to missing variable initialization - Avoid unnecessary get/put in spapr_tce_platform_iommu_attach_dev() Thanks to Geetika Moolchandani, Jason Gunthorpe, Michal Suchánek, Nathan Lynch, and Shivaprasad G Bhat. * tag 'powerpc-6.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: selftests/powerpc/papr-vpd: Fix missing variable initialization powerpc/crypto/chacha-p10: Fix failure on non Power10 powerpc/iommu: Refactor spapr_tce_platform_iommu_attach_dev()
2024-04-19powerpc/mm: Update the memory limit based on direct mapping restrictionsAneesh Kumar K.V (IBM)1-3/+2
memory limit value specified by the user are further updated such that the value is 16MB aligned. This is because hash translation mode use 16MB as direct mapping page size. Make sure we update the global variable 'memory_limit' with the 16MB aligned value such that all kernel components will see the new aligned value of the memory limit. Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-19powerpc/fadump: Don't update the user-specified memory limitAneesh Kumar K.V (IBM)1-16/+0
If the user specifies the memory limit, the kernel should honor it such that all allocation and reservations are made within the memory limit specified. fadump was breaking that rule. Remove the code which updates the memory limit such that fadump reservations are done within the limit specified. Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-19powerpc/mm: Align memory_limit value specified using mem= kernel parameterAneesh Kumar K.V (IBM)2-4/+7
The value specified for the memory limit is used to set a restriction on memory usage. It is important to ensure that this restriction is within the linear map kernel address space range. The hash page table translation uses a 16MB page size to map the kernel linear map address space. htab_bolt_mapping() function aligns down the size of the range while mapping kernel linear address space. Since the memblock limit is enforced very early during boot, before we can detect the type of memory translation (radix vs hash), we align the memory limit value specified as a kernel parameter to 16MB. This alignment value will work for both hash and radix translations. Signed-off-by: Aneesh Kumar K.V (IBM) <[email protected]> Acked-by: Joel Savitz <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-18powerpc/ptdump: Fix walk_vmemmap() to also print first vmemmap entryRitesh Harjani (IBM)1-1/+1
Currently walk_vmemmap() skips the first vmemmap entry pointed to by vmemmap_list pointer itself. Fix that. With the fix applied the vmemmap entry at 0xc00c000000000000 for hash is displayed: $ cat /sys/kernel/debug/kernel_hash_pagetable ... 0xc00c000000010000: AVPN:cd7bd4e0000 ssize: 1T ... 0xc00c000000000000: AVPN:cd7bd4e0000 ssize: 1T ... Signed-off-by: Ritesh Harjani (IBM) <[email protected]> [mpe: Tweak change log wording and add example output] Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/a19ee3dc2b304d39da364a592d5cd167449f8c4a.1713365940.git.ritesh.list@gmail.com
2024-04-17sched/vtime: Do not include <asm/vtime.h> headerAlexander Gordeev1-1/+0
There is no architecture-specific code or data left that generic <linux/vtime.h> needs to know about. Thus, avoid the inclusion of <asm/vtime.h> header. Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Nicholas Piggin <[email protected]> Link: https://lore.kernel.org/r/f7cd245668b9ae61a55184871aec494ec9199c4a.1712760275.git.agordeev@linux.ibm.com
2024-04-17sched/vtime: Get rid of generic vtime_task_switch() implementationAlexander Gordeev2-13/+22
The generic vtime_task_switch() implementation gets built only if __ARCH_HAS_VTIME_TASK_SWITCH is not defined, but requires an architecture to implement arch_vtime_task_switch() callback at the same time, which is confusing. Further, arch_vtime_task_switch() is implemented for 32-bit PowerPC architecture only and vtime_task_switch() generic variant is rather superfluous. Simplify the whole vtime_task_switch() wiring by moving the existing generic implementation to PowerPC. Signed-off-by: Alexander Gordeev <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]> Acked-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/2cb6e3caada93623f6d4f78ad938ac6cd0e2fda8.1712760275.git.agordeev@linux.ibm.com
2024-04-15Replace macro "ARCH_HAVE_EXTRA_ELF_NOTES" with kconfigVignesh Balasubramanian2-2/+1
"ARCH_HAVE_EXTRA_ELF_NOTES" enables an extra note section in the core dump. Kconfig variable is preferred over ARCH_HAVE_* macro. Co-developed-by: Jini Susan George <[email protected]> Signed-off-by: Jini Susan George <[email protected]> Signed-off-by: Vignesh Balasubramanian <[email protected]> Acked-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-04-15powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.Mahesh Salgaonkar3-0/+22
nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel crash when invoked during real mode interrupt handling (e.g. early HMI/MCE interrupt handler) if percpu allocation comes from vmalloc area. Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI() wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when percpu allocation is from the embedded first chunk. However with CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu allocation can come from the vmalloc area. With kernel command line "percpu_alloc=page" we can force percpu allocation to come from vmalloc area and can see kernel crash in machine_check_early: [ 1.215714] NIP [c000000000e49eb4] rcu_nmi_enter+0x24/0x110 [ 1.215717] LR [c0000000000461a0] machine_check_early+0xf0/0x2c0 [ 1.215719] --- interrupt: 200 [ 1.215720] [c000000fffd73180] [0000000000000000] 0x0 (unreliable) [ 1.215722] [c000000fffd731b0] [0000000000000000] 0x0 [ 1.215724] [c000000fffd73210] [c000000000008364] machine_check_early_common+0x134/0x1f8 Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu first chunk is not embedded. Reviewed-by: Christophe Leroy <[email protected]> Tested-by: Shirisha Ganta <[email protected]> Signed-off-by: Mahesh Salgaonkar <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-15powerpc: Add static_key_feature_checks_initialized flagNicholas Miehlbradt4-2/+12
JUMP_LABEL_FEATURE_CHECK_DEBUG used static_key_intialized to determine whether {cpu,mmu}_has_feature() is used before static keys were initialized. However, {cpu,mmu}_has_feature() should not be used before setup_feature_keys() is called but static_key_initialized is set well before this by the call to jump_label_init() in early_init_devtree(). This creates a window in which JUMP_LABEL_FEATURE_CHECK_DEBUG will not detect misuse and report errors. Add a flag specifically to indicate when {cpu,mmu}_has_feature() is safe to use. Signed-off-by: Nicholas Miehlbradt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-12genirq: Convert kstat_irqs to a structBitao Hu1-1/+1
The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is only capable of counting. A snapshot mechanism for interrupt statistics will be added soon, which requires an additional variable to store the snapshot. To facilitate expansion, convert kstat_irqs here to a struct containing only the count. Originally-by: Thomas Gleixner <[email protected]> Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-11KVM: delete .change_pte MMU notifier callbackPaolo Bonzini7-33/+0
The .change_pte() MMU notifier callback was intended as an optimization. The original point of it was that KSM could tell KVM to flip its secondary PTE to a new location without having to first zap it. At the time there was also an .invalidate_page() callback; both of them were *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(), and .invalidate_page() also doubled as a fallback implementation of .change_pte(). Later on, however, both callbacks were changed to occur within an invalidate_range_start/end() block. In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end", 2012-10-09) did so to remove the fallback from .invalidate_page() to .change_pte() and allow sleepable .invalidate_page() hooks. This however made KVM's usage of the .change_pte() callback completely moot, because KVM unmaps the sPTEs during .invalidate_range_start() and therefore .change_pte() has no hope of finding a sPTE to change. Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as well as all the architecture specific implementations. Signed-off-by: Paolo Bonzini <[email protected]> Acked-by: Anup Patel <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Reviewed-by: Bibo Mao <[email protected]> Message-ID: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-11treewide: Use sysfs_bin_attr_simple_read() helperLukas Wunner1-9/+1
Deduplicate ->read() callbacks of bin_attributes which are backed by a simple buffer in memory: Use the newly introduced sysfs_bin_attr_simple_read() helper instead, either by referencing it directly or by declaring such bin_attributes with BIN_ATTR_SIMPLE_RO() or BIN_ATTR_SIMPLE_ADMIN_RO(). Aside from a reduction of LoC, this shaves off a few bytes from vmlinux (304 bytes on an x86_64 allyesconfig). No functional change intended. Signed-off-by: Lukas Wunner <[email protected]> Acked-by: Zhi Wang <[email protected]> Acked-by: Michael Ellerman <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/92ee0a0e83a5a3f3474845db6c8575297698933a.1712410202.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-04-08powerpc/52xx: Replace of_gpio.h by proper oneAndy Shevchenko2-3/+1
of_gpio.h is deprecated and subject to remove. The driver doesn't use it directly, replace it with what is really being used. Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-08vdso: Consolidate vdso_calc_delta()Adrian Hunter1-15/+11
Consolidate vdso_calc_delta(), in preparation for further simplification. Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Adrian Hunter <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-08powerpc: Fix fatal warnings flag for LLVM's integrated assemblerNathan Chancellor1-2/+2
When building with LLVM_IAS=1, there is an error because '-fatal-warnings' is not recognized as a valid flag: clang: error: unsupported argument '-fatal-warnings' to option '-Wa,' Use the double hyphen version of the flag, '--fatal-warnings', which works with both the GNU assembler and LLVM's integrated assembler. Fixes: 608d4a5ca563 ("powerpc: Error on assembly warnings") Signed-off-by: Nathan Chancellor <[email protected]> Reviewed-by: Justin Stitt <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/20240405-ppc-fix-wa-fatal-warnings-clang-v1-1-bdcd969f2ef0@kernel.org
2024-04-05powerpc/crypto/chacha-p10: Fix failure on non Power10Michael Ellerman1-1/+7
The chacha-p10-crypto module provides optimised chacha routines for Power10. It also selects CRYPTO_ARCH_HAVE_LIB_CHACHA which says it provides chacha_crypt_arch() to generic code. Notably the module needs to provide chacha_crypt_arch() regardless of whether it is loaded on Power10 or an older CPU. The implementation of chacha_crypt_arch() already has a fallback to chacha_crypt_generic(), however the module as a whole fails to load on pre-Power10, because of the use of module_cpu_feature_match(). This breaks for example loading wireguard: jostaberry-1:~ # modprobe -v wireguard insmod /lib/modules/6.8.0-lp155.8.g7e0e887-default/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko.zst modprobe: ERROR: could not insert 'wireguard': No such device Fix it by removing module_cpu_feature_match(), and instead check the CPU feature manually. If the CPU feature is not found, the module still loads successfully, but doesn't register the Power10 specific algorithms. That allows chacha_crypt_generic() to remain available for use, fixing the problem. [root@fedora ~]# modprobe -v wireguard insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/net/ipv4/udp_tunnel.ko insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/net/ipv6/ip6_udp_tunnel.ko insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/lib/crypto/libchacha.ko insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/arch/powerpc/crypto/chacha-p10-crypto.ko insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/lib/crypto/libchacha20poly1305.ko insmod /lib/modules/6.8.0-00001-g786a790c4d79/kernel/drivers/net/wireguard/wireguard.ko [ 18.910452][ T721] wireguard: allowedips self-tests: pass [ 18.914999][ T721] wireguard: nonce counter self-tests: pass [ 19.029066][ T721] wireguard: ratelimiter self-tests: pass [ 19.029257][ T721] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information. [ 19.029361][ T721] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <[email protected]>. All Rights Reserved. Reported-by: Michal Suchánek <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Acked-by: Herbert Xu <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
2024-04-03vdso: Use CONFIG_PAGE_SHIFT in vdso/datapage.hArnd Bergmann1-2/+1
Both the vdso rework and the CONFIG_PAGE_SHIFT changes were merged during the v6.9 merge window, so it is now possible to use CONFIG_PAGE_SHIFT instead of including asm/page.h in the vdso. This avoids the workaround for arm64 - commit 8b3843ae3634 ("vdso/datapage: Quick fix - use asm/page-def.h for ARM64") and addresses a build warning for powerpc64: In file included from <built-in>:4: In file included from /home/arnd/arm-soc/arm-soc/lib/vdso/gettimeofday.c:5: In file included from ../include/vdso/datapage.h:25: arch/powerpc/include/asm/page.h:230:9: error: result of comparison of constant 13835058055282163712 with expression of type 'unsigned long' is always true [-Werror,-Wtautological-constant-out-of-range-compare] 230 | return __pa(kaddr) >> PAGE_SHIFT; | ^~~~~~~~~~~ arch/powerpc/include/asm/page.h:217:37: note: expanded from macro '__pa' 217 | VIRTUAL_WARN_ON((unsigned long)(x) < PAGE_OFFSET); \ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~ arch/powerpc/include/asm/page.h:202:73: note: expanded from macro 'VIRTUAL_WARN_ON' 202 | #define VIRTUAL_WARN_ON(x) WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && (x)) | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~ arch/powerpc/include/asm/bug.h:88:25: note: expanded from macro 'WARN_ON' 88 | int __ret_warn_on = !!(x); \ | ^ Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Link: https://lore.kernel.org/r/[email protected]