aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-09-03sched_ext: Don't call put_prev_task_scx() before picking the next taskTejun Heo1-68/+62
fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of pick_next_task() from: pick_next_task() := pick_task() + set_next_task(.first = true) to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) making invoking put_prev_task() pick_next_task()'s responsibility. This reordering allows pick_task() to be shared between regular and core-sched paths and put_prev_task() to know the next task. sched_ext depended on put_prev_task_scx() enqueueing the current task before pick_next_task_scx() is called. While pulling sched/core changes, 70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx() before picking the first task as a workaround. Clean it up and adopt the conventions that other sched classes are following. The operation of keeping running the current task was spread and required the task to be put on the local DSQ before picking: - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still runnable, hasn't exhausted its slice, and thus should keep running. - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the only runnable task. do_enqueue_task() in turn decided whether to use the local DSQ depending on SCX_OPS_ENQ_LAST. Consolidate the logic in balance_one() as it always knows whether it is going to keep the current task. balance_one() now considers all conditions where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell pick_next_task_scx() to keep the current task instead of picking one from the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is updated to pick the current task if SCX_TASK_BAL_KEEP is set. The workaround put_prev_task[_scx]() calls are replaced with put_prev_set_next_task(). This causes two behavior changes observable from the BPF scheduler: - When a task keep running, it no longer goes through enqueue/dequeue cycle and thus ops.stopping/running() transitions. The new behavior is better and all the existing schedulers should be able to handle the new behavior. - The BPF scheduler cannot keep executing the current task by enqueueing SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the BPF scheduler is responsible for resuming execution after each SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling decisions are not made on the local CPU - e.g. central or userspace-driven schedulin - and the new behavior is more logical and shouldn't pose any problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it doesn't fit that well anymore and the last task handling is moved to the end of qmap_dispatch(). Signed-off-by: Tejun Heo <[email protected]> Cc: David Vernet <[email protected]> Cc: Andrea Righi <[email protected]> Cc: Changwoo Min <[email protected]> Cc: Daniel Hodges <[email protected]> Cc: Dan Schatzberg <[email protected]>
2024-09-03sched/numa: Fix the vma scan starving issueYujie Liu1-0/+9
Problem statement: Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the Numa vma scan overhead has been reduced a lot. Meanwhile, the reducing of the vma scan might create less Numa page fault information. The insufficient information makes it harder for the Numa balancer to make decision. Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") are found to bring back part of the performance. Recently when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a long duration of remote Numa node read was observed by PMU events: A few cores having ~500MB/s remote memory access for ~20 seconds. It causes high core-to-core variance and performance penalty. After the investigation, it is found that many vmas are skipped due to the active PID check. According to the trace events, in most cases, vma_is_accessed() returns false because the history access info stored in pids_active array has been cleared. Proposal: The main idea is to adjust vma_is_accessed() to let it return true easier. Thus compare the diff between mm->numa_scan_seq and vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold, scan the vma. This patch especially helps the cases where there are small number of threads, like the process-based SPECcpu. Without this patch, if the SPECcpu process access the vma at the beginning, then sleeps for a long time, the pid_active array will be cleared. A a result, if this process is woken up again, it never has a chance to set prot_none anymore. Because only the first 2 times of access is granted for vma scan: (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2 to be worse, no other threads within the task can help set the prot_none. This causes information lost. Raghavendra helped test current patch and got the positive result on the AMD platform: autonumabench NUMA01 base patched Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%* Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%* Duration User 380345.36 368252.04 Duration System 1358.89 1156.23 Duration Elapsed 2277.45 2213.25 autonumabench NUMA02 Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%* Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%* Duration User 1513.23 1575.48 Duration System 8.33 8.13 Duration Elapsed 28.59 29.71 kernbench Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%* Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%* Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%* Duration User 68816.41 67615.74 Duration System 21873.94 22848.08 Duration Elapsed 506.66 504.55 Intel 256 CPUs/2 Sockets: autonuma benchmark also shows improvements: v6.10-rc5 v6.10-rc5 +patch Amean syst-NUMA01 245.85 ( 0.00%) 230.84 * 6.11%* Amean syst-NUMA01_THREADLOCAL 205.27 ( 0.00%) 191.86 * 6.53%* Amean syst-NUMA02 18.57 ( 0.00%) 18.09 * 2.58%* Amean syst-NUMA02_SMT 2.63 ( 0.00%) 2.54 * 3.47%* Amean elsp-NUMA01 517.17 ( 0.00%) 526.34 * -1.77%* Amean elsp-NUMA01_THREADLOCAL 99.92 ( 0.00%) 100.59 * -0.67%* Amean elsp-NUMA02 15.81 ( 0.00%) 15.72 * 0.59%* Amean elsp-NUMA02_SMT 13.23 ( 0.00%) 12.89 * 2.53%* v6.10-rc5 v6.10-rc5 +patch Duration User 1064010.16 1075416.23 Duration System 3307.64 3104.66 Duration Elapsed 4537.54 4604.73 The SPECcpu remote node access issue disappears with the patch applied. Link: https://lkml.kernel.org/r/[email protected] Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") Signed-off-by: Chen Yu <[email protected]> Co-developed-by: Chen Yu <[email protected]> Signed-off-by: Yujie Liu <[email protected]> Reported-by: Xiaoping Zhou <[email protected]> Reviewed-and-tested-by: Raghavendra K T <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: "Chen, Tim C" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Raghavendra K T <[email protected]> Cc: Vincent Guittot <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03mm: support only one page_type per pageMatthew Wilcox (Oracle)1-4/+4
By using a few values in the top byte, users of page_type can store up to 24 bits of additional data in page_type. It also reduces the code size as (with replacement of READ_ONCE() with data_race()), the kernel can check just a single byte. eg: ffffffff811e3a79: 8b 47 30 mov 0x30(%rdi),%eax ffffffff811e3a7c: 55 push %rbp ffffffff811e3a7d: 48 89 e5 mov %rsp,%rbp ffffffff811e3a80: 25 00 00 00 82 and $0x82000000,%eax ffffffff811e3a85: 3d 00 00 00 80 cmp $0x80000000,%eax ffffffff811e3a8a: 74 4d je ffffffff811e3ad9 <folio_mapping+0x69> becomes: ffffffff811e3a69: 80 7f 33 f5 cmpb $0xf5,0x33(%rdi) ffffffff811e3a6d: 55 push %rbp ffffffff811e3a6e: 48 89 e5 mov %rsp,%rbp ffffffff811e3a71: 74 4d je ffffffff811e3ac0 <folio_mapping+0x60> replacing three instructions with one. [[email protected]: fix ubsan warnings] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Kent Overstreet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-03mm: move kernel/numa.c to mm/Mike Rapoport (Microsoft)2-27/+0
Patch series "mm: introduce numa_memblks", v4. Following the discussion about handling of CXL fixed memory windows on arm64 [1] I decided to bite the bullet and move numa_memblks from x86 to the generic code so they will be available on arm64/riscv and maybe on loongarch sometime later. While it could be possible to use memblock to describe CXL memory windows, it currently lacks notion of unpopulated memory ranges and numa_memblks does implement this. Another reason to make numa_memblks generic is that both arch_numa (arm64 and riscv) and loongarch use trimmed copy of x86 code although there is no fundamental reason why the same code cannot be used on all these platforms. Having numa_memblks in mm/ will make it's interaction with ACPI and FDT more consistent and I believe will reduce maintenance burden. And with generic numa_memblks it is (almost) straightforward to enable NUMA emulation on arm64 and riscv. The first 9 commits in this series are cleanups that are not strictly related to numa_memblks. Commits 10-16 slightly reorder code in x86 to allow extracting numa_memblks and NUMA emulation to the generic code. Commits 17-19 actually move the code from arch/x86/ to mm/ and commits 20-22 does some aftermath cleanups. Commit 23 updates of_numa_init() to return error of no NUMA nodes were found in the device tree. Commit 24 switches arch_numa to numa_memblks. Commit 25 enables usage of phys_to_target_node() and memory_add_physaddr_to_nid() with numa_memblks. Commit 26 moves the description for numa=fake from x86 to admin-guide. [1] https://lore.kernel.org/all/[email protected]/ This patch (of 26): The stub functions in kernel/numa.c belong to mm/ rather than to kernel/ Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mike Rapoport (Microsoft) <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Tested-by: Zi Yan <[email protected]> # for x86_64 and arm64 Tested-by: Jonathan Cameron <[email protected]> [arm64 + CXL via QEMU] Acked-by: Dan Williams <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Andreas Larsson <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: David S. Miller <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiaxun Yang <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Rob Herring (Arm) <[email protected]> Cc: Samuel Holland <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-04dma-direct: optimize page freeing when it is not addressableChen Yu1-1/+1
When the CMA allocation succeeds but isn't addressable, its buffer has already been released and the page is set to NULL. So later when the normal page allocation succeeds but isn't addressable, __free_pages() can be used to free that normal page rather than using dma_free_contiguous that does extra checks that are not needed. Signed-off-by: Chen Yu <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-09-04dma-mapping: clearly mark DMA ops as an architecture featureChristoph Hellwig2-8/+3
DMA ops are a helper for architectures and not for drivers to override the DMA implementation. Unfortunately driver authors keep ignoring this. Make the fact more clear by renaming the symbol to ARCH_HAS_DMA_OPS and having the two drivers overriding their dma_ops depend on that. These drivers should probably be marked broken, but we can give them a bit of a grace period for that. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Acked-by: Sakari Ailus <[email protected]> # for IPU6 Acked-by: Robin Murphy <[email protected]>
2024-09-03Merge branch 'tip/sched/core' into for-6.12Tejun Heo8-217/+180
- Resolve trivial context conflicts from dl_server clearing being moved around. - Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to match sched/core. - Merge sched_class->switch_class() addition from sched_ext with tip/sched/core changes in __pick_next_task(). - Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous behavior where sched_class->put_prev_task() was called before sched_class->pick_next_task(). While this makes sched_ext build and function, the behavior is not in line with other sched classes. The follow-up patches will address the discrepancies and remove sched_class->switch_class(). Signed-off-by: Tejun Heo <[email protected]>
2024-09-03audit: Make use of str_enabled_disabled() helperHongbo Li1-1/+1
Use str_enabled_disabled() helper instead of open coding the same. Signed-off-by: Hongbo Li <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2024-09-03uprobes: Use kzalloc to allocate xol areaSven Schnelle1-2/+1
To prevent unitialized members, use kzalloc to allocate the xol area. Fixes: b059a453b1cf1 ("x86/vdso: Add mremap hook to vm_special_mapping") Signed-off-by: Sven Schnelle <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Add put_prev_task(.next)Peter Zijlstra6-8/+8
In order to tell the previous sched_class what the next task is, add put_prev_task(.next). Notable SCX will use this to: 1) determine the next task will leave the SCX sched class and push the current task to another CPU if possible. 2) statistics on how often and which other classes preempt it Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Rework dl_serverPeter Zijlstra4-34/+32
When a task is selected through a dl_server, it will have p->dl_server set, such that it can account runtime to the dl_server, see update_curr_task(). Currently p->dl_server is set in pick*task() whenever it goes through the dl_server, clearing it is a bit of a mess though. The trivial solution is clearing it on the final put (now that we have this location). However, this gives a problem when: p = pick_task(rq); if (p) put_prev_set_next_task(rq, prev, next); picks the same task but through a different path, notably when it goes from picking through the dl_server to a direct pick or vice-versa. In that case we cannot readily determine wether we should clear or preserve p->dl_server. An additional complication is pick_*task() setting p->dl_server for a remote pick, it might still need to update runtime before it schedules the core_pick. Close all these holes and remove all the random clearing of p->dl_server by: - having pick_*task() manage rq->dl_server - having the final put_prev_task() clear p->dl_server - having the first set_next_task() set p->dl_server = rq->dl_server - complicate the core_sched code to save/restore rq->dl_server where appropriate. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Combine the last put_prev_task() and the first set_next_task()Peter Zijlstra3-16/+14
Ensure the last put_prev_task() and the first set_next_task() always go together. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Rework pick_next_task()Peter Zijlstra7-74/+37
The current rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) And many classes implement it directly as such. Change things around to make pick_next_task() optional while also changing the definition to: pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true) The reason is that sched_ext would like to have a 'final' call that knows the next task. By placing put_prev_task() right next to set_next_task() (as it already is for sched_core) this becomes trivial. As a bonus, this is a nice cleanup on its own. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Split up put_prev_task_balance()Peter Zijlstra1-6/+6
With the goal of pushing put_prev_task() after pick_task() / into pick_next_task(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Clean up DL server vs core schedPeter Zijlstra3-24/+8
Abide by the simple rule: pick_next_task() := pick_task() + set_next_task(.first = true) This allows us to trivially get rid of server_pick_next() and things collapse nicely. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Fixup set_next_task() implementationsPeter Zijlstra2-34/+34
The rule is that: pick_next_task() := pick_task() + set_next_task(.first = true) Turns out, there's still a few things in pick_next_task() that are missing from that combination. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched: Use set_next_task(.first) where requiredPeter Zijlstra2-2/+6
Turns out the core_sched bits forgot to use the set_next_task(.first=true) variant. Notably: pick_next_task() := pick_task() + set_next_task(.first = true) Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-03sched/fair: Properly deactivate sched_delayed task upon class changeValentin Schneider1-8/+17
__sched_setscheduler() goes through an enqueue/dequeue cycle like so: flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK; prev_class->dequeue_task(rq, p, flags); new_class->enqueue_task(rq, p, flags); when prev_class := fair_sched_class, this is followed by: dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); the idea being that since the task has switched classes, we need to drop the sched_delayed logic and have that task be deactivated per its previous dequeue_task(..., DEQUEUE_SLEEP). Unfortunately, this leaves the task on_rq. This is missing the tail end of dequeue_entities() that issues __block_task(), which __sched_setscheduler() won't have done due to not using DEQUEUE_DELAYED - not that it should, as it is pretty much a fair_sched_class specific thing. Make switched_from_fair() properly deactivate sched_delayed tasks upon class changes via __block_task(), as if a dequeue_task(..., DEQUEUE_DELAYED) had been issued. Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue") Reported-by: "Paul E. McKenney" <[email protected]> Reported-by: Chen Yu <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-09-03sched/deadline: Fix schedstats vs deadline serversHuang Shijie1-22/+16
In dl_server_start(), when schedstats is enabled, the following happens: dl_server_start() dl_se->dl_server = 1; enqueue_dl_entity() update_stats_enqueue_dl() __schedstats_from_dl_se() dl_task_of() BUG_ON(dl_server(dl_se)); Since only tasks have schedstats and internal entries do not, avoid trying to update stats in this case. Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers") Signed-off-by: Huang Shijie <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juri Lelli <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-09-01kthread: fix task state in kthread worker if being frozenChen Yu1-1/+9
When analyzing a kernel waring message, Peter pointed out that there is a race condition when the kworker is being frozen and falls into try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze(). Although the root cause is not related to freeze()[1], it is still worthy to fix this issue ahead. One possible race scenario: CPU 0 CPU 1 ----- ----- // kthread_worker_fn set_current_state(TASK_INTERRUPTIBLE); suspend_freeze_processes() freeze_processes static_branch_inc(&freezer_active); freeze_kernel_threads pm_nosig_freezing = true; if (work) { //false __set_current_state(TASK_RUNNING); } else if (!freezing(current)) //false, been frozen freezing(): if (static_branch_unlikely(&freezer_active)) if (pm_nosig_freezing) return true; schedule() } // state is still TASK_INTERRUPTIBLE try_to_freeze() might_sleep() <--- warning Fix this by explicitly set the TASK_RUNNING before entering try_to_freeze(). Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1] Link: https://lkml.kernel.org/r/[email protected] Fixes: b56c0d8937e6 ("kthread: implement kthread_worker") Signed-off-by: Chen Yu <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Suggested-by: Andrew Morton <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: David Gow <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mickaël Salaün <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01Document/kexec: generalize crash hotplug descriptionSourabh Jain1-13/+20
Commit 79365026f869 ("crash: add a new kexec flag for hotplug support") generalizes the crash hotplug support to allow architectures to update multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr. Therefore, update the relevant kernel documentation to reflect the same. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sourabh Jain <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Petr Tesarik <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01fault-inject: improve build for CONFIG_FAULT_INJECTION=nJani Nikula1-0/+1
The fault-inject.h users across the kernel need to add a lot of #ifdef CONFIG_FAULT_INJECTION to cater for shortcomings in the header. Make fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and should_fail() to allow removal of conditional compilation. [[email protected]: repair fallout from no longer including debugfs.h into fault-inject.h] [[email protected]: fix drivers/misc/xilinx_tmr_inject.c] [[email protected]: Add debugfs.h inclusion to more files, per Stephen] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6ff1cb355e62 ("[PATCH] fault-injection capabilities infrastructure") Signed-off-by: Jani Nikula <[email protected]> Cc: Akinobu Mita <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Himal Prasad Ghimiray <[email protected]> Cc: Lucas De Marchi <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Thomas Hellström <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01watchdog: handle the ENODEV failure case of lockup_detector_delay_init() ↵Waiman Long1-1/+4
separately When watchdog_hardlockup_probe() is being called by lockup_detector_delay_init(), an error return of -ENODEV will happen for the arm64 arch when arch_perf_nmi_is_available() returns false. This means that NMI is not usable by the hard lockup detector and so has to be disabled. This can be considered a deficiency in that particular arm64 chip, but there is nothing we can do about it. That also means the following error will always be reported when the kernel boot up. watchdog: Delayed init of the lockup detector failed: -19 The word "failed" itself has a connotation that there is something wrong with the kernel which is not really the case here. Handle this special ENODEV case separately and explain the reason behind disabling hard lockup detector without causing anxiety for those users who read the above message and wonder about it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Waiman Long <[email protected]> Cc: Douglas Anderson <[email protected]> Cc: Joel Granados <[email protected]> Cc: Li Zhe <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Thomas Weißschuh <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01locking/ww_mutex/test: add MODULE_DESCRIPTION()Jeff Johnson1-0/+1
Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/locking/test-ww_mutex.o Link: https://lkml.kernel.org/r/20240730-module_description_orphans-v1-5-7094088076c8@quicinc.com Signed-off-by: Jeff Johnson <[email protected]> Acked-by: Waiman Long <[email protected]> Cc: Alexandre Torgue <[email protected]> Cc: Alistar Popple <[email protected]> Cc: Andrew Jeffery <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Eddie James <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jeremy Kerr <[email protected]> Cc: Joel Stanley <[email protected]> Cc: Karol Herbst <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Maxime Coquelin <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Naveen N Rao <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Nouveau <[email protected]> Cc: Pekka Paalanen <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01crash: fix crash memory reserve exceed system memory bugJinjie Ruan1-0/+3
On x86_32 Qemu machine with 1GB memory, the cmdline "crashkernel=4G" is ok as below: crashkernel reserved: 0x0000000020000000 - 0x0000000120000000 (4096 MB) It's similar on other architectures, such as ARM32 and RISCV32. The cause is that the crash_size is parsed and printed with "unsigned long long" data type which is 8 bytes but allocated used with "phys_addr_t" which is 4 bytes in memblock_phys_alloc_range(). Fix it by checking if crash_size is greater than system RAM size and return error if so. After this patch, there is no above confusing reserve success info. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jinjie Ruan <[email protected]> Suggested-by: Mike Rapoport <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Albert Ou <[email protected]> Cc: Dave Young <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Vivek Goyal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01kexec: use atomic_try_cmpxchg_acquire() in kexec_trylock()Uros Bizjak1-1/+2
Use atomic_try_cmpxchg_acquire(*ptr, &old, new) instead of atomic_cmpxchg_acquire(*ptr, old, new) == old in kexec_trylock(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uros Bizjak <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: create promo_wmark_pages and clean up open-coded sitesKaiyang Zhao1-1/+1
Patch series "mm: print the promo watermark in zoneinfo", v2. This patch (of 2): Define promo_wmark_pages and convert current call sites of wmark_pages with fixed WMARK_PROMO to using it instead. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kaiyang Zhao <[email protected]> Cc: Johannes Weiner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm, memcg: cg2 memory{.swap,}.peak write handlersDavid Finkel2-0/+9
Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7. This patch (of 2): Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API. For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS. This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local. Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD. Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter. Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons. This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems. Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case. As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Finkel <[email protected]> Suggested-by: Johannes Weiner <[email protected]> Suggested-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Roman Gushchin <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig optionsDavid Hildenbrand1-2/+2
Patch series "mm: split PTE/PMD PT table Kconfig cleanups+clarifications". This series is a follow up to the fixes: "[PATCH v1 0/2] mm/hugetlb: fix hugetlb vs. core-mm PT locking" When working on the fixes, I wondered why 8xx is fine (-> never uses split PT locks) and how PT locking even works properly with PMD page table sharing (-> always requires split PMD PT locks). Let's improve the split PT lock detection, make hugetlb properly depend on it and make 8xx bail out if it would ever get enabled by accident. As an alternative to patch #3 we could extend the Kconfig SPLIT_PTE_PTLOCKS option from patch #2 -- but enforcing it closer to the code that actually implements it feels a bit nicer for documentation purposes, and there is no need to actually disable it because it should always be disabled (!SMP). Did a bunch of cross-compilations to make sure that split PTE/PMD PT locks are still getting used where we would expect them. [1] https://lkml.kernel.org/r/[email protected] This patch (of 3): Let's clean that up a bit and prepare for depending on CONFIG_SPLIT_PMD_PTLOCKS in other Kconfig options. More cleanups would be reasonable (like the arch-specific "depends on" for CONFIG_SPLIT_PTE_PTLOCKS), but we'll leave that for another day. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mike Rapoport (Microsoft) <[email protected]> Reviewed-by: Russell King (Oracle) <[email protected]> Reviewed-by: Qi Zheng <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Muchun Song <[email protected]> Cc: "Naveen N. Rao" <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Peter Xu <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01task_stack: uninline stack_not_usedPasha Tatashin2-3/+20
Given that stack_not_used() is not performance critical function uninline it. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01vmstat: kernel stack usage histogramPasha Tatashin1-0/+38
As part of the dynamic kernel stack project, we need to know the amount of data that can be saved by reducing the default kernel stack size [1]. Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact. The histogram data is presented in /proc/vmstat with entries like "kstack_1k", "kstack_2k", and so on, indicating the number of threads that exited with stack usage falling within each respective bucket. Example outputs: Intel: $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 ARM with 64K page_size: $ grep kstack /proc/vmstat kstack_1k 1 kstack_2k 340 kstack_4k 25212 kstack_8k 1659 kstack_16k 0 kstack_32k 0 kstack_64k 0 Note: once the dynamic kernel stack is implemented it will depend on the implementation the usability of this feature: On hardware that supports faults on kernel stacks, we will have other metrics that show the total number of pages allocated for stacks. On hardware where faults are not supported, we will most likely have some optimization where only some threads are extended, and for those, these metrics will still be very useful. [1] https://lwn.net/Articles/974367 Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Pasha Tatashin <[email protected]> Reviewed-by: Kent Overstreet <[email protected]> Acked-by: Shakeel Butt <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01memory tiering: introduce folio_use_access_time() checkZi Yan1-2/+1
If memory tiering mode is on and a folio is not in the top tier memory, folio's cpupid field is repurposed to store page access time. Instead of an open coded check, use a function to encapsulate the check. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zi Yan <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Kefeng Wang <[email protected]> Cc: Baolin Wang <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01mm: kvmalloc: align kvrealloc() with krealloc()Danilo Krummrich1-2/+1
Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [[email protected]: document concurrency restrictions] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Danilo Krummrich <[email protected]> Acked-by: Michal Hocko <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Christian König <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David Rientjes <[email protected]> Cc: Hyeonggon Yoo <[email protected]> Cc: Joonsoo Kim <[email protected]> Cc: Kees Cook <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Oliver Upton <[email protected]> Cc: Pekka Enberg <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Uladzislau Rezki <[email protected]> Cc: Wedson Almeida Filho <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=yPetr Tesarik1-1/+1
Fix the condition to exclude the elfcorehdr segment from the SHA digest calculation. The j iterator is an index into the output sha_regions[] array, not into the input image->segment[] array. Once it reaches image->elfcorehdr_index, all subsequent segments are excluded. Besides, if the purgatory segment precedes the elfcorehdr segment, the elfcorehdr may be wrongly included in the calculation. Link: https://lkml.kernel.org/r/[email protected] Fixes: f7cc804a9fd4 ("kexec: exclude elfcorehdr from the segment digest") Signed-off-by: Petr Tesarik <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Eric Biederman <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Eric DeVolder <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-01Merge tag 'x86-urgent-2024-09-01' of ↵Linus Torvalds1-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - x2apic_disable() clears x2apic_state and x2apic_mode unconditionally, even when the state is X2APIC_ON_LOCKED, which prevents the kernel to disable it thereby creating inconsistent state. Reorder the logic so it actually works correctly - The XSTATE logic for handling LBR is incorrect as it assumes that XSAVES supports LBR when the CPU supports LBR. In fact both conditions need to be true. Otherwise the enablement of LBR in the IA32_XSS MSR fails and subsequently the machine crashes on the next XRSTORS operation because IA32_XSS is not initialized. Cache the XSTATE support bit during init and make the related functions use this cached information and the LBR CPU feature bit to cure this. - Cure a long standing bug in KASLR KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. - Prevent a data leak in mmio_read(). The TDVMCALL exposes the value of an initialized variabled on the stack to the VMM. The variable is only required as output value, so it does not have to exposed to the VMM in the first place. - Prevent an array overrun in the resource control code on systems with Sub-NUMA Clustering enabled because the code failed to adjust the index by the number of SNC nodes per L3 cache. * tag 'x86-urgent-2024-09-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/resctrl: Fix arch_mbm_* array overrun on SNC x86/tdx: Fix data leak in mmio_read() x86/kaslr: Expose and use the end of the physical memory address space x86/fpu: Avoid writing LBR bit to IA32_XSS unless supported x86/apic: Make x2apic_disable() work correctly
2024-09-01Merge tag 'locking-urgent-2024-08-25' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Thomas Gleixner: "A single fix for rt_mutex. The deadlock detection code drops into an infinite scheduling loop while still holding rt_mutex::wait_lock, which rightfully triggers a 'scheduling in atomic' warning. Unlock it before that" * tag 'locking-urgent-2024-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Drop rt_mutex::wait_lock before scheduling
2024-09-01tinyconfig: remove unnecessary 'is not set' for choice blocksMasahiro Yamada1-6/+0
This reverts the following commits: - 236dec051078 ("kconfig: tinyconfig: provide whole choice blocks to avoid warnings") - b0f269728ccd ("x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'") Since commit f79dc03fe68c ("kconfig: refactor choice value calculation"), it is no longer necessary to disable the remaining options in choice blocks. Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Thomas Gleixner <[email protected]>
2024-08-30sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in ↵Tejun Heo1-2/+6
touch_core_sched() Since 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost function. This is simpler and in line with what other balance functions are doing but it loses control over rq->clock_update_flags which makes assert_clock_udpated() trigger if other CPUs pins the rq lock. The only place this matters is touch_core_sched() which uses the timestamp to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may be better to use per-core dispatch sequence number. v2: Use sched_clock_cpu() instead of ktime_get_ns() per David. Signed-off-by: Tejun Heo <[email protected]> Fixes: 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()") Acked-by: David Vernet <[email protected]> Cc: Peter Zijlstra <[email protected]>
2024-08-30sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()Tejun Heo1-20/+20
When deciding whether a task can be migrated to a CPU, dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online() tests instead of using task_can_run_on_remote_rq(). This had two problems. - It was missing is_migration_disabled() check and thus could try to migrate a task which shouldn't leading to assertion and scheduling failures. - It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu() and thus failed to consider ISA compatibility. Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq(): - Move scx_ops_error() triggering into task_can_run_on_remote_rq(). - When migration isn't allowed, fall back to the global DSQ instead of the source DSQ by returning DTL_INVALID. This is both simpler and an overall better behavior. Signed-off-by: Tejun Heo <[email protected]> Cc: Peter Zijlstra <[email protected]> Acked-by: David Vernet <[email protected]>
2024-08-30cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1Chen Ridong3-1/+16
This patch introduces CONFIG_CPUSETS_V1 and guard cpuset-v1 code under CONFIG_CPUSETS_V1. The default value of CONFIG_CPUSETS_V1 is N, so that user who adopted v2 don't have 'pay' for cpuset v1. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: rename functions shared between v1 and v2Chen Ridong3-56/+56
Some functions name declared in cpuset-internel.h are generic. To avoid confilicting with other variables for the same name, rename these functions with cpuset_/cpuset1_ prefix to make them unique to cpuset. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move v1 interfaces to cpuset-v1.cChen Ridong3-199/+199
Move legacy cpuset controller interfaces files and corresponding code into cpuset-v1.c. 'update_flag', 'cpuset_write_resmask' and 'cpuset_common_seq_show' are also used for v1, so declare them in cpuset-internal.h. 'cpuset_write_s64', 'cpuset_read_s64' and 'fmeter_getrate' are only used cpuset-v1.c now, make it static. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move validate_change_legacy to cpuset-v1.cChen Ridong3-73/+74
The validate_change_legacy functions is used for v1, move it to cpuset-v1.c. And two micro 'cpuset_for_each_child' and 'cpuset_for_each_descendant_pre' are common for v1 and v2, move them to cpuset-internal.h. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move legacy hotplug update to cpuset-v1.cChen Ridong3-94/+98
There are some differents about hotplug update between cpuset v1 and cpuset v2. Move the legacy code to cpuset-v1.c. 'update_tasks_cpumask' and 'update_tasks_nodemask' are both used in cpuset v1 and cpuset v2, declare them in cpuset-internal.h. The change from original code is that use callback_lock helpers to get callback_lock lock/unlock. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: add callback_lock helperChen Ridong2-0/+12
To modify cpuset, both cpuset_mutex and callback_lock are needed. Add helpers for cpuset-v1 to get callback_lock. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move memory_spread to cpuset-v1.cChen Ridong3-42/+45
'memory_spread' is only set in cpuset v1. move corresponding code into cpuset-v1.c. Currently, 'cpuset_update_task_spread_flags' and 'update_tasks_flags' are exposed to cpuset.c. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move relax_domain_level to cpuset-v1.cChen Ridong3-61/+66
Setting domain level is not supported at cpuset v2, so move corresponding code into cpuset-v1.c. The 'cpuset_write_s64' and 'cpuset_read_s64' are only used for setting domain level, move them to cpuset-v1.c. Currently, expose to cpuset.c. After cpuset legacy interface files are move to cpuset-v1.c, they can be static. The 'rebuild_sched_domains_locked' is exposed to cpuset-v1.c. The change from original code is that using 'cpuset_lock' and 'cpuset_unlock' functions to lock or unlock cpuset_mutex. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move memory_pressure to cpuset-v1.cChen Ridong3-134/+141
Collection of memory_pressure can be enabled by writing 1 to the cpuset file 'memory_pressure_enabled', which is only for cpuset-v1. Therefore, move the corresponding code to cpuset-v1.c. Currently, the 'fmeter_init' and 'fmeter_getrate' functions are called at cpuset.c, so expose them to cpuset.c. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: move common code to cpuset-internal.hChen Ridong2-235/+236
Move some declarations that will be used for cpuset v1 and v2, including 'cpuset struct', 'cpuset_flagbits_t', cpuset_filetype_t,etc. No logical change. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-08-30cgroup/cpuset: introduce cpuset-v1.cChen Ridong3-1/+10
This patch introduces the cgroup/cpuset-v1.c source file which will be used for all legacy (cgroup v1) cpuset cgroup code. It also introduces cgroup/cpuset-internal.h to keep declarations shared between cgroup/cpuset.c and cpuset/cpuset-v1.c. As of now, let's compile it if CONFIG_CPUSET is set. Later on it can be switched to use a separate config option, so that the legacy code won't be compiled if not required. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>