path: root/kernel
Age         Commit message                                   (Author, files changed, lines -/+)
2024-09-10  sched/debug: Fix the runnable tasks output  (Huang Shijie, 1 file, -8/+23)

The current runnable tasks output looks like:

    runnable tasks:
     S            task   PID       tree-key  switches  prio    wait-time    sum-exec    sum-sleep
    -------------------------------------------------------------------------------------------------------------
     Ikworker/R-rcu_g     4      0.129049 E   0.620179   0.750000   0.002920     2   100   0.000000   0.002920   0.000000   0.000000   0 0   /
     Ikworker/R-sync_     5      0.125328 E   0.624147   0.750000   0.001840     2   100   0.000000   0.001840   0.000000   0.000000   0 0   /
     Ikworker/R-slub_     6      0.120835 E   0.628680   0.750000   0.001800     2   100   0.000000   0.001800   0.000000   0.000000   0 0   /
     Ikworker/R-netns     7      0.114294 E   0.634701   0.750000   0.002400     2   100   0.000000   0.002400   0.000000   0.000000   0 0   /
     I    kworker/0:1     9    508.781746 E 511.754666   3.000000 151.575240   224   120   0.000000 151.575240   0.000000   0.000000   0 0   /

Which is messy. Remove the duplicate printing of sum_exec_runtime and tidy up the layout to make it look like:

    runnable tasks:
     S           task   PID     vruntime   eligible    deadline    slice    sum-exec   switches  prio   wait-time   sum-sleep  sum-block  node  group-id  group-path
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     I    kworker/0:3  1698   295.001459 E 297.977619   3.000000   38.862920     9   120   0.000000   0.000000   0.000000   0   0   /
     I    kworker/0:4  1702   278.026303 E 281.026303   3.000000    9.918760     3   120   0.000000   0.000000   0.000000   0   0   /
     S NetworkManager  2646     0.377936 E   2.598104   3.000000   98.535880   314   120   0.000000   0.000000   0.000000   0   0   /system.slice/NetworkManager.service
     S      virtqemud  2689     0.541016 E   2.440104   3.000000   50.967960    80   120   0.000000   0.000000   0.000000   0   0   /system.slice/virtqemud.service
     S  gsd-smartcard  3058    73.604144 E  76.475904   3.000000   74.033320    88   120   0.000000   0.000000   0.000000   0   0   /user.slice/user-42.slice/session-c1.scope

Reviewed-by: Christoph Lameter (Ampere) <[email protected]>
Signed-off-by: Huang Shijie <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2024-09-10  sched: Fix sched_delayed vs sched_core  (Peter Zijlstra, 1 file, -0/+6)

Completely analogous to commit dfa0a574cbc4 ("sched/uclamg: Handle delayed dequeue"), avoid double dequeue for the sched_core entries.

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

2024-09-10  kernel/sched: Fix util_est accounting for DELAY_DEQUEUE  (Dietmar Eggemann, 1 file, -7/+9)

Remove delayed tasks from util_est even if they are runnable. Exclude delayed tasks which are (a) migrating between rq's or (b) in a SAVE/RESTORE dequeue/enqueue.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2024-09-10  kthread: Fix task state in kthread worker if being frozen  (Chen Yu, 1 file, -1/+9)

When analyzing a kernel warning message, Peter pointed out that there is a race condition when the kworker is being frozen and falls into try_to_freeze() with TASK_INTERRUPTIBLE, which could trigger a might_sleep() warning in try_to_freeze(). Although the root cause is not related to freeze() [1], it is still worth fixing this issue ahead of time.

One possible race scenario:

    CPU 0                                       CPU 1
    -----                                       -----
    // kthread_worker_fn
    set_current_state(TASK_INTERRUPTIBLE);
                                                suspend_freeze_processes()
                                                  freeze_processes
                                                    static_branch_inc(&freezer_active);
                                                  freeze_kernel_threads
                                                    pm_nosig_freezing = true;
    if (work) { //false
      __set_current_state(TASK_RUNNING);
    } else if (!freezing(current)) //false, been frozen
                                                freezing():
                                                  if (static_branch_unlikely(&freezer_active))
                                                    if (pm_nosig_freezing)
                                                      return true;
        schedule()
    }

    // state is still TASK_INTERRUPTIBLE
    try_to_freeze()
      might_sleep() <--- warning

Fix this by explicitly setting TASK_RUNNING before entering try_to_freeze().

Fixes: b56c0d8937e6 ("kthread: implement kthread_worker")
Suggested-by: Peter Zijlstra <[email protected]>
Suggested-by: Andrew Morton <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/lkml/Zs2ZoAcUsZMX2B%2FI@chenyu5-mobl2/ [1]

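For illustration, a minimal sketch of the fixed wait path in kthread_worker_fn(); the surrounding worker logic is abbreviated and this is a hedged approximation, not the exact diff:

    set_current_state(TASK_INTERRUPTIBLE);

    if (work) {
            __set_current_state(TASK_RUNNING);
            /* run the work item */
    } else if (!freezing(current)) {
            schedule();
    } else {
            /*
             * Handle the case where the worker stayed in
             * TASK_INTERRUPTIBLE while freezing: try_to_freeze()
             * expects current to be TASK_RUNNING.
             */
            __set_current_state(TASK_RUNNING);
    }

    try_to_freeze();
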
2024-09-10  sched/pelt: Use rq_clock_task() for hw_pressure  (Chen Yu, 1 file, -1/+2)

Commit 97450eb90965 ("sched/pelt: Remove shift of thermal clock") removed the decay_shift for hw_pressure. That commit uses rq_clock_task() in sched_tick() while it uses rq_clock_pelt() in __update_blocked_others(). This could bring an inconsistency.

One possible scenario I can think of is in ___update_load_sum():

    u64 delta = now - sa->last_update_time

'now' could be calculated by rq_clock_pelt() from __update_blocked_others(), and last_update_time was calculated by rq_clock_task() previously from sched_tick(). Usually the former lags behind the latter, so this causes a very large 'delta' and brings unexpected behavior.

Fixes: 97450eb90965 ("sched/pelt: Remove shift of thermal clock")
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Vincent Guittot <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

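A hedged sketch of the resulting call in __update_blocked_others(), assuming update_hw_load_avg() takes the clock as its first argument (signature per the description above; check the tree):

    u64 now = rq_clock_pelt(rq);    /* still used for the rt/dl/irq signals */

    /* hw_pressure now decays with the task clock, matching sched_tick(): */
    decayed |= update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
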
2024-09-10  sched/fair: Move effective_cpu_util() and sched_cpu_util() in fair.c  (Vincent Guittot, 2 files, -101/+99)

Move the effective_cpu_util() and sched_cpu_util() functions into fair.c, together with the other utilization-related functions. No functional change.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2024-09-10  sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()  (Peter Zijlstra, 1 file, -19/+26)

Since commit b2a02fc43a1f ("smp: Optimize send_call_function_single_ipi()") an idle CPU in TIF_POLLING_NRFLAG mode can be pulled out of idle by setting the TIF_NEED_RESCHED flag to service an IPI without actually sending an interrupt. Even in cases where the IPI handler does not queue a task on the idle CPU, do_idle() will call __schedule() since need_resched() returns true in these cases.

Introduce and use SM_IDLE to identify calls to __schedule() from schedule_idle() and shorten the idle re-entry time by skipping pick_next_task() when nr_running is 0 and the previous task is the idle task.

With the SM_IDLE fast-path, the time taken to complete a fixed set of IPIs using ipistorm improves noticeably. Following are the numbers from a dual socket Intel Ice Lake Xeon server (2 x 32C/64T) and a 3rd Generation AMD EPYC system (2 x 64C/128T) (boost on, C2 disabled) running ipistorm between CPU8 and CPU16:

    cmdline: insmod ipistorm.ko numipi=100000 single=1 offset=8 cpulist=8 wait=1

    ==================================================================
    Test          : ipistorm (modified)
    Units         : Normalized runtime
    Interpretation: Lower is better
    Statistic     : AMean
    ======================= Intel Ice Lake Xeon ======================
    kernel:                         time [pct imp]
    tip:sched/core                  1.00 [baseline]
    tip:sched/core + SM_IDLE        0.80 [20.51%]
    ==================== 3rd Generation AMD EPYC =====================
    kernel:                         time [pct imp]
    tip:sched/core                  1.00 [baseline]
    tip:sched/core + SM_IDLE        0.90 [10.17%]
    ==================================================================

[ kprateek: Commit message, SM_RTLOCK_WAIT fix ]

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: K Prateek Nayak <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Vincent Guittot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

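A hedged fragment of what the fast path near the top of __schedule() looks like, per the description above (surrounding code omitted):

    /* SM_IDLE: idle re-entry fast path */
    if (sched_mode == SM_IDLE && !rq->nr_running) {
            next = prev;            /* prev is the idle task */
            goto picked;            /* bypass pick_next_task() */
    }
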
2024-09-10  dma-mapping: add tracing for dma-mapping API calls  (Sean Anderson, 1 file, -1/+23)

When debugging drivers, it can often be useful to trace when memory gets (un)mapped for DMA (and can be accessed by the device). Add some tracepoints for this purpose.

Use u64 instead of phys_addr_t and dma_addr_t (and similarly %llx instead of %pa) because libtraceevent can't handle typedefs in all cases.

Signed-off-by: Sean Anderson <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

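A hedged sketch of such a tracepoint; the event and field names here are hypothetical, but the addresses are carried as u64 so libtraceevent can decode them:

    TRACE_EVENT(dma_map_sketch,
            TP_PROTO(u64 phys_addr, u64 dma_addr, size_t size),
            TP_ARGS(phys_addr, dma_addr, size),
            TP_STRUCT__entry(
                    __field(u64, phys_addr)
                    __field(u64, dma_addr)
                    __field(size_t, size)
            ),
            TP_fast_assign(
                    __entry->phys_addr = phys_addr;
                    __entry->dma_addr = dma_addr;
                    __entry->size = size;
            ),
            TP_printk("phys=0x%llx dma=0x%llx size=%zu",
                      __entry->phys_addr, __entry->dma_addr, __entry->size)
    );
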
2024-09-09  user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocations  (Jinjie Ruan, 1 file, -3/+2)

Let kmemdup_array() take care of the multiplication and possible overflows.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Li zeming <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

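A hedged before/after sketch of the change; 'src' and 'count' are illustrative, and the argument order shown follows the kcalloc()-style convention (check the prototype in your tree):

    /* before: open-coded size multiplication, which can overflow */
    tmp = kmemdup(src, count * sizeof(*src), GFP_KERNEL);

    /* after: the helper performs an overflow-checked multiplication */
    tmp = kmemdup_array(src, count, sizeof(*src), GFP_KERNEL);
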
2024-09-09  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()  (Tejun Heo, 1 file, -3/+229)

Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated: scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ: local DSQs, the global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which doesn't hold a rq lock, including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ:

    http://lkml.kernel.org/r/[email protected]

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Hodges <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Changwoo Min <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Dan Schatzberg <[email protected]>

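A hedged BPF-side sketch of using the new kfunc from a DSQ iterator; MY_DSQ is an illustrative DSQ id and the iterator-handle macro follows the scx BPF helper headers:

    struct task_struct *p;

    bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
            /* move @p out of MY_DSQ onto this CPU's local DSQ */
            if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                          SCX_DSQ_LOCAL, 0))
                    break;
    }
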
2024-09-09  sched_ext: Compact struct bpf_iter_scx_dsq_kern  (Tejun Heo, 1 file, -11/+12)

struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was using 5 of them. We want to add two more u64 fields but it's better if we do so while staying within bpf_iter_scx_dsq to maintain binary compatibility.

The way bpf_iter_scx_dsq_kern is laid out is rather inefficient: the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits, freeing up 48 bits (16 bits for flags, 32 bits for a u32) for use by struct bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field, reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <[email protected]>

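A hedged sketch of the compacted node layout; field names approximate the description above and may differ from the tree:

    struct scx_dsq_list_node {
            struct list_head        node;   /* two u64's */
            u32                     flags;  /* only the lower 16 bits used */
            u32                     priv;   /* u32 now available to the iterator */
    };
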
2024-09-09  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()  (Tejun Heo, 1 file, -16/+26)

- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Move consume_local_task() upward  (Tejun Heo, 1 file, -17/+14)

So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()  (Tejun Heo, 1 file, -4/+3)

All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into task_unlink_from_dsq(). Also move the sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Reorder args for consume_local/remote_task()  (Tejun Heo, 1 file, -7/+7)

Reorder args for consistency in the order of: current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>

2024-09-09  sched_ext: Restructure dispatch_to_local_dsq()  (Tejun Heo, 1 file, -50/+46)

Now that there's nothing left after the big if block, flip the if condition and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify control can't reach the end of dispatch_to_local_dsq() in UP kernels per David.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling  (Tejun Heo, 1 file, -31/+10)

With the preceding update, the only return value which makes a meaningful difference is DTL_INVALID, for which one caller, finish_dispatch(), falls back to the global DSQ and the other, process_ddsp_deferred_locals(), doesn't do anything. It should always fall back to the global DSQ.

Move the global DSQ fallback into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON  (Tejun Heo, 1 file, -50/+40)

find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON. Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move the SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove the duplicate code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Refactor consume_remote_task()  (Tejun Heo, 1 file, -69/+76)

The tricky p->scx.holding_cpu handling was split across the consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration, making it easier to use in other places.

- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The usage is updated accordingly. This makes the local and remote cases more symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate  (Tejun Heo, 1 file, -33/+33)

Sleepables don't need to be in their own kfunc set as each is tagged with KF_SLEEPABLE. Rename it to scx_kfunc_set_unlocked, indicating that the rq lock is not held, and relocate it right above the "any" set. This will be used to add kfuncs that are allowed to be called from SYSCALL but not TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>

2024-09-09  cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU  (Kinsey Ho, 1 file, -7/+9)

Patch series "Improve mem_cgroup_iter()", v4.

Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability.

[1] https://lore.kernel.org/[email protected]/

This patch (of 5):

Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos.

The following changes in this patchset rely on this clarification for simplification in memcg iteration code.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Suggested-by: Yosry Ahmed <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Signed-off-by: Kinsey Ho <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: T.J. Mercier <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

2024-09-09  uprobes: use vm_special_mapping close() functionality  (Sven Schnelle, 2 files, -20/+17)

The following KASAN splat was shown:

    [   44.505448] ================================================================== 20:37:27 [3421/145075]
    [   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
    [   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
    [   44.505479]
    [   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
    [   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
    [   44.505508] Call Trace:
    [   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
    [   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
    [   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
    [   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
    [   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
    [   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
    [   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
    [   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
    [   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
    [   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
    [   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
    [   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
    [   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
    [   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
    [   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
    [   44.505614]
    [   44.505616] Allocated by task 1384:
    [   44.505621]  kasan_save_stack+0x40/0x70
    [   44.505630]  kasan_save_track+0x28/0x40
    [   44.505636]  __kasan_kmalloc+0xa0/0xc0
    [   44.505642]  __create_xol_area+0xfa/0x410
    [   44.505648]  get_xol_area+0xb0/0xf0
    [   44.505652]  uprobe_notify_resume+0x27a/0x470
    [   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
    [   44.505664]  pgm_check_handler+0x122/0x170
    [   44.505670]
    [   44.505672] Freed by task 1384:
    [   44.505676]  kasan_save_stack+0x40/0x70
    [   44.505682]  kasan_save_track+0x28/0x40
    [   44.505687]  kasan_save_free_info+0x4a/0x70
    [   44.505693]  __kasan_slab_free+0x5a/0x70
    [   44.505698]  kfree+0xe8/0x3f0
    [   44.505704]  __mmput+0x20/0x370
    [   44.505709]  exit_mm+0x240/0x340
    [   44.505713]  do_exit+0x548/0xd70
    [   44.505718]  do_group_exit+0x132/0x390
    [   44.505722]  __s390x_sys_exit_group+0x56/0x60
    [   44.505727]  do_syscall+0x2f6/0x430
    [   44.505732]  __do_syscall+0xa4/0x170
    [   44.505738]  system_call+0x74/0x98

The problem is that uprobe_clear_state() kfree's struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma().

__mmput() reads:

    static inline void __mmput(struct mm_struct *mm)
    {
            VM_BUG_ON(atomic_read(&mm->mm_users));

            uprobe_clear_state(mm);
            exit_aio(mm);
            ksm_exit(mm);
            khugepaged_exit(mm); /* must run before exit_mmap */
            exit_mmap(mm);
            ...
    }

So uprobe_clear_state() in the beginning frees the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping()).

Fix this by moving uprobe_clear_state() to uprobes.c and using it as the close() callback.

[[email protected]: remove unneeded condition]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Sven Schnelle <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

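A hedged sketch of the idea: let the special mapping's close() callback free the xol area state when the VMA goes away, instead of freeing it before exit_mmap() runs. The exact wiring here is illustrative, not the upstream diff:

    static void uprobe_special_mapping_close(const struct vm_special_mapping *sm,
                                             struct vm_area_struct *vma)
    {
            /* free the xol area state only when the VMA is torn down */
            uprobe_clear_state(vma->vm_mm);
    }

    static struct vm_special_mapping uprobes_mapping_sketch = {
            .name  = "[uprobes]",
            .close = uprobe_special_mapping_close,
    };
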
2024-09-09  sched_ext: Add missing static to scx_dump_data  (Tejun Heo, 1 file, -1/+1)

scx_dump_data is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

2024-09-09  bpf: Fix error message on kfunc arg type mismatch  (Maxim Mikityanskiy, 1 file, -1/+2)

When the "arg#%d expected pointer to ctx, but got %s" error is printed, both template parts actually point to the type of the argument, so the message will also say "but got PTR" regardless of the actual register type.

Fix the message to print the register type in the second part of the template, change the existing test to adapt to the new format, and add a new test for the case where the arg is a pointer to context but the reg is a scalar.

Fixes: 00b85860feb8 ("bpf: Rewrite kfunc argument handling")
Signed-off-by: Maxim Mikityanskiy <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2024-09-09  tracing: Drop unused helper function to fix the build  (Andy Shevchenko, 1 file, -4/+0)

A helper function is defined but not used. This, in particular, prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

    kernel/trace/trace.c:2229:19: error: unused function 'run_tracer_selftest' [-Werror,-Wunused-function]
     2229 | static inline int run_tracer_selftest(struct tracer *type)
          |                   ^~~~~~~~~~~~~~~~~~~

Fix this by dropping the unused function.

See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build").

Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Cc: Nick Desaulniers <[email protected]>
Cc: Bill Wendling <[email protected]>
Cc: Justin Stitt <[email protected]>
Link: https://lore.kernel.org/[email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2024-09-09  tracing/osnoise: Fix build when timerlat is not enabled  (Steven Rostedt, 1 file, -5/+5)

To fix some critical section races, the interface_lock was added to a few locations. One of those locations was above where the interface_lock was declared, so the declaration was moved up before that usage. Unfortunately, where it was placed was inside a CONFIG_TIMERLAT_TRACER ifdef block. As the interface_lock is used outside that config, this broke the build when CONFIG_OSNOISE_TRACER was enabled but CONFIG_TIMERLAT_TRACER was not.

Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: "Helena Anna" <[email protected]>
Cc: "Luis Claudio R. Goncalves" <[email protected]>
Cc: Tomas Glozar <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: e6a53481da29 ("tracing/timerlat: Only clear timer if a kthread exists")
Reported-by: "Bityutskiy, Artem" <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2024-09-09  printk: Export match_devname_and_update_preferred_console()  (Yu Liao, 1 file, -0/+1)

When building serial_base as a module, modpost fails with the following error message:

    ERROR: modpost: "match_devname_and_update_preferred_console" [drivers/tty/serial/serial_base.ko] undefined!

Export the symbol to allow using it from modules.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Fixes: 12c91cec3155 ("serial: core: Add serial_base_match_and_update_preferred_console()")
Signed-off-by: Yu Liao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>

2024-09-08  treewide: Fix wrong singular form of jiffies in comments  (Anna-Maria Behnsen, 5 files, -11/+11)

There are several comments all over the place which use the wrong singular form of jiffies. Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de

2024-09-08  cpu: Use already existing usleep_range()  (Anna-Maria Behnsen, 1 file, -1/+1)

usleep_range() is a wrapper around usleep_range_state() which hands in TASK_UNINTERRUPTIBLE as the state argument. Use the already existing wrapper usleep_range(). No functional change.

Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-2-e98760256370@linutronix.de

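A before/after sketch of the substitution (the range values here are illustrative):

    /* before: open-coded state argument */
    usleep_range_state(10000, 20000, TASK_UNINTERRUPTIBLE);

    /* after: usleep_range() already implies TASK_UNINTERRUPTIBLE */
    usleep_range(10000, 20000);
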
2024-09-08  timers: Rename next_expiry_recalc() to be unique  (Anna-Maria Behnsen, 1 file, -3/+3)

next_expiry_recalc is the name of a function as well as the name of a struct member of struct timer_base. This might lead to confusion. Rename next_expiry_recalc() to timer_recalc_next_expiry(). No functional change.

Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-1-e98760256370@linutronix.de

2024-09-09  Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a  (Neeraj Upadhyay, 16 files, -473/+692)

2024-09-09  rcu: Defer printing stall-warning backtrace when holding rcu_node lock  (Paul E. McKenney, 1 file, -0/+2)

rcu_dump_cpu_stacks() holds the leaf rcu_node structure's ->lock when dumping the stacks of any CPUs stalling the current grace period. This lock is held to prevent confusion that would otherwise occur when the stalled CPU reported its quiescent state (and then went on to do unrelated things) just as the backtrace NMI was heading towards it.

This has worked well, but on larger systems has recently been observed to cause severe lock contention resulting in CSD-lock stalls and other general unhappiness.

This commit therefore does printk_deferred_enter() before acquiring the lock and printk_deferred_exit() after releasing it, thus deferring the overhead of actually outputting the stack trace out of that lock's critical section.

Reported-by: Rik van Riel <[email protected]>
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: "Paul E. McKenney" <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>

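The resulting pattern, roughly (lock helper names follow the RCU tree's rcu_node convention; this is a hedged sketch, not the exact diff):

    printk_deferred_enter();
    raw_spin_lock_irqsave_rcu_node(rnp, flags);
    /* ... dump the stacks of CPUs stalling the current grace period ... */
    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
    printk_deferred_exit();
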
2024-09-09  rcu/nocb: Remove superfluous memory barrier after bypass enqueue  (Frederic Weisbecker, 1 file, -1/+1)

Pre-GP accesses performed by the update side must be ordered against post-GP accesses performed by the readers. This is ensured by the bypass or nocb locking on enqueue time, followed by the fully ordered rnp locking initiated while callbacks are accelerated, and then propagated throughout the whole GP lifecycle associated with the callbacks.

Therefore the explicit barrier advertising ordering between bypass enqueue and rcuo wakeup is superfluous. If anything, it would even only order the first bypass callback enqueue against the rcuo wakeup and ignore all the subsequent ones.

Remove the needless barrier.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>

2024-09-09  rcu/nocb: Conditionally wake up rcuo if not already waiting on GP  (Frederic Weisbecker, 1 file, -4/+1)

A callback enqueuer currently wakes up the rcuo kthread if it is adding the first non-done callback of a CPU, whether the kthread is waiting on a grace period or not (unless the CPU is offline).

This looks like a desired behaviour because then the rcuo kthread doesn't wait for the end of the current grace period to handle the callback. It is accelerated right away and assigned to the next grace period. The GP kthread is notified about that fact and iterates with the upcoming GP without sleeping in-between.

However this best-case scenario is contradicted by a few details, depending on the situation:

1) If the callback is a non-bypass one queued with IRQs enabled, the wake up only occurs if no other pending callbacks are on the list. Therefore the theoretical "optimization" actually applies on rare occasions.

2) If the callback is a non-bypass one queued with IRQs disabled, the situation is similar with even more uncertainty due to the deferred wake up.

3) If the callback is lazy, a few jiffies don't make any difference.

4) If the callback is bypass, the wake up timer is programmed 2 jiffies ahead by rcuo in case the regular pending queue has been handled in the meantime. The rare storm of callbacks can otherwise wait for the currently elapsing grace period to be flushed and handled.

For all those reasons, the optimization is only theoretical and occasional. Therefore it is reasonable that callback enqueuers only wake up the rcuo kthread when it is not already waiting on a grace period to complete.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>

2024-09-09  rcu/nocb: Fix RT throttling hrtimer armed from offline CPU  (Frederic Weisbecker, 1 file, -1/+4)

After a CPU is marked offline and until it reaches its final trip to idle, rcuo has several opportunities to be woken up, either because a callback has been queued in the meantime or because rcutree_report_cpu_dead() has issued the final deferred NOCB wake up.

If RCU-boosting is enabled, RCU kthreads are set to SCHED_FIFO policy. And if RT-bandwidth is enabled, the related hrtimer might be armed. However this then happens after hrtimers have been migrated at the CPUHP_AP_HRTIMERS_DYING stage, which is broken as reported by the following warning:

    Call trace:
     enqueue_hrtimer+0x7c/0xf8
     hrtimer_start_range_ns+0x2b8/0x300
     enqueue_task_rt+0x298/0x3f0
     enqueue_task+0x94/0x188
     ttwu_do_activate+0xb4/0x27c
     try_to_wake_up+0x2d8/0x79c
     wake_up_process+0x18/0x28
     __wake_nocb_gp+0x80/0x1a0
     do_nocb_deferred_wakeup_common+0x3c/0xcc
     rcu_report_dead+0x68/0x1ac
     cpuhp_report_idle_dead+0x48/0x9c
     do_idle+0x288/0x294
     cpu_startup_entry+0x34/0x3c
     secondary_start_kernel+0x138/0x158

Fix this by waking up rcuo using an IPI if necessary. Since the existing API to deal with this situation only handles swait queues, rcuo is only woken up from offline CPUs if it's not already waiting on a grace period. In the worst case some callbacks will just wait for a grace period to complete before being assigned to a subsequent one.

Reported-by: "Cheng-Jui Wang (王正睿)" <[email protected]>
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>

2024-09-09  rcu/nocb: Simplify (de-)offloading state machine  (Frederic Weisbecker, 3 files, -84/+67)

Now that the (de-)offloading process can only apply to offline CPUs, there is no more concurrency between rcu_core and nocb kthreads. Also the mutation now happens on empty queues. Therefore the state machine can be reduced to a single bit called SEGCBLIST_OFFLOADED. Simplify the transition as follows:

* Upon offloading: queue the rdp to be added to the rcuog list and wait for the rcuog kthread to set the SEGCBLIST_OFFLOADED bit. Unpark the rcuo kthread.

* Upon de-offloading: park the rcuo kthread. Queue the rdp to be removed from the rcuog list and wait for the rcuog kthread to clear the SEGCBLIST_OFFLOADED bit.

Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Signed-off-by: Neeraj Upadhyay <[email protected]>

2024-09-08  Merge tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 4 files, -8/+16)

Pull perf fixes from Borislav Petkov:

 - Fix perf's AUX buffer serialization

 - Prevent uninitialized struct members in perf's uprobes handling

* tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/aux: Fix AUX buffer serialization
  uprobes: Use kzalloc to allocate xol area

2024-09-08  genirq: Use cpumask_intersects()  (Costa Shulyupin, 2 files, -3/+3)

Replace `cpumask_any_and(a, b) >= nr_cpu_ids` and `cpumask_any_and(a, b) < nr_cpu_ids` with the more readable `!cpumask_intersects(a, b)` and `cpumask_intersects(a, b)`.

Signed-off-by: Costa Shulyupin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

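A before/after sketch of the substitution (mask names illustrative):

    /* before: relies on the nr_cpu_ids sentinel */
    if (cpumask_any_and(mask_a, mask_b) >= nr_cpu_ids)
            return;         /* no CPU in common */

    /* after: states the intent directly */
    if (!cpumask_intersects(mask_a, mask_b))
            return;         /* no CPU in common */
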
2024-09-06  sched_ext: Add missing static to scx_has_op[]  (Tejun Heo, 1 file, -1/+1)

scx_has_op[] is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

2024-09-06  sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()  (Tejun Heo, 1 file, -1/+16)

pick_task_scx() must be preceded by balance_scx() but there currently is a bug where fair could say yes on balance() but no on pick_task(), which then ends up calling pick_task_scx() without a preceding balance_scx(). Work around it by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.

This isn't great and can theoretically lead to stalls. However, for switch_all cases, this happens only while a BPF scheduler is being loaded or unloaded, and, for partial cases, fair will likely keep triggering this CPU.

This will be reverted once the fair behavior is fixed.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>

2024-09-06  static_call: Replace pointless WARN_ON() in static_call_module_notify()  (Thomas Gleixner, 1 file, -1/+1)

static_call_module_notify() triggers a WARN_ON() when memory allocation fails in __static_call_add_module(). That's not really justified, because the failure case must be correctly handled by the well-known call chain and the error code is passed through to the initiating userspace application.

A memory allocation failure is not a fatal problem, but the WARN_ON() takes the machine out when panic_on_warn is set.

Replace it with a pr_warn().

Fixes: 9183c3f9ed71 ("static_call: Add inline static call infrastructure")
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/8734mf7pmb.ffs@tglx

2024-09-06  static_call: Handle module init failure correctly in static_call_del_module()  (Thomas Gleixner, 1 file, -0/+11)

Module insertion invokes static_call_add_module() to initialize the static calls in a module. static_call_add_module() invokes __static_call_init(), which allocates a struct static_call_mod to either encapsulate the built-in static call sites of the associated key into it so further modules can be added, or to append the module to the module chain.

If that allocation fails, the function returns with an error code and the module core invokes static_call_del_module() to clean up eventually added static_call_mod entries.

This works correctly when all keys used by the module were converted over to a module chain before the failure. If not, then static_call_del_module() causes a #GP as it blindly assumes that key::mods points to a valid struct static_call_mod.

The problem is that key::mods is not an individual struct member of struct static_call_key, it's part of a union to save space:

    union {
            /* bit 0: 0 = mods, 1 = sites */
            unsigned long type;
            struct static_call_mod *mods;
            struct static_call_site *sites;
    };

key::sites is a pointer to the list of built-in usage sites of the static call. The type of the pointer is differentiated by bit 0. A mods pointer has the bit clear, the sites pointer has the bit set.

As static_call_del_module() blindly assumes that the pointer is a valid static_call_mod type, it fails to check for this failure case and dereferences the pointer to the list of built-in call sites, which is obviously bogus.

Cure it by checking whether the key has a sites or a mods pointer. If it's a sites pointer, then the key is not to be touched. As the sites are walked in the same order as in __static_call_init(), the site walk can be terminated because all subsequent sites have not been touched by the init code due to the error exit.

If it was converted before the allocation failure, then the inner loop which searches for a module match will find nothing.

A failure in the second allocation in __static_call_init() is harmless and does not require special treatment. The first allocation succeeded and converted the key to a module chain. That first entry has mod::mod == NULL and mod::next == NULL, so the inner loop of static_call_del_module() will neither find a module match nor a module chain. The next site in the walk was either already converted, but can't match the module, or it will exit the outer loop because it has a static_call_site pointer and not a static_call_mod pointer.

Fixes: 9183c3f9ed71 ("static_call: Add inline static call infrastructure")
Closes: https://lore.kernel.org/all/[email protected]
Reported-by: Jinjie Ruan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Jinjie Ruan <[email protected]>
Link: https://lore.kernel.org/r/87zfon6b0s.ffs@tglx

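A hedged sketch of the cured walk in static_call_del_module(); the helper name for the bit-0 test follows the description above and may differ from the tree:

    for (site = start; site < stop; site++) {
            key = static_call_key(site);
            if (key == prev_key)
                    continue;
            prev_key = key;

            /*
             * If the key still carries a static_call_site pointer
             * (bit 0 set), __static_call_init() never converted it;
             * all subsequent sites are untouched too, so stop here.
             */
            if (!static_call_key_has_mods(key))
                    break;

            /* ... otherwise search the module chain for a match ... */
    }
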
2024-09-06  genirq/cpuhotplug: Use cpumask_intersects()  (Costa Shulyupin, 1 file, -2/+2)

Replace `cpumask_any_and(a, b) >= nr_cpu_ids` with the more readable `!cpumask_intersects(a, b)`.

[ tglx: Massaged change log ]

Signed-off-by: Costa Shulyupin <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/all/[email protected]

2024-09-05  Merge tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds, 1 file, -2/+4)

Pull bpf fixes from Alexei Starovoitov:

 - Fix crash when btf_parse_base() returns an error (Martin Lau)

 - Fix out of bounds access in btf_name_valid_section() (Jeongjun Park)

* tag 'bpf-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add a selftest to check for incorrect names
  bpf: add check for invalid name in btf_name_valid_section()
  bpf: Fix a crash when btf_parse_base() returns an error pointer

2024-09-05  bpf: allow kfuncs within tracepoint and perf event programs  (JP Kobryn, 1 file, -0/+2)

Associate tracepoint and perf event program types with the kfunc tracing hook. This allows calling kfuncs within these types of programs.

Signed-off-by: JP Kobryn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

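A hedged sketch of what this association looks like in btf.c's program-type-to-hook mapping (switch contents abbreviated):

    static enum btf_kfunc_hook bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
    {
            switch (prog_type) {
            case BPF_PROG_TYPE_TRACING:
            case BPF_PROG_TYPE_TRACEPOINT:  /* newly mapped */
            case BPF_PROG_TYPE_PERF_EVENT:  /* newly mapped */
                    return BTF_KFUNC_HOOK_TRACING;
            /* ... other program types ... */
            default:
                    return BTF_KFUNC_HOOK_MAX;
            }
    }
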
2024-09-05  bpf: change int cmd argument in __sys_bpf into typed enum bpf_cmd  (Andrii Nakryiko, 1 file, -1/+1)

This improves the BTF data recorded for this function and makes debugging/tracing better, because the command can now be displayed as a symbolic name instead of an obscure number.

Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

2024-09-05  Merge tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds, 4 files, -34/+72)

Pull tracing fixes from Steven Rostedt:

 - Fix adding a new fgraph callback after function graph tracing has already started.

   If the new caller does not initialize its hash before registering the fgraph_ops, it can cause a NULL pointer dereference. Fix this by adding a new parameter to ftrace_graph_enable_direct(), passing in the newly added gops directly and not relying on the fgraph_array[], as entries in the fgraph_array[] must be initialized.

   Assign the new gops to the fgraph_array[] after it goes through ftrace_startup_subops(), as that will properly initialize the gops->ops and initialize its hashes.

 - Fix a memory leak in the fgraph storage memory test.

   If the "multiple fgraph storage on a function" boot up selftest fails in the registering of the function graph tracer, it will not free the memory it allocated for the filter. Break the loop up into two where it allocates the filters first and then registers the functions, where any errors will do the appropriate clean ups.

 - Only clear the timerlat timers if they have an associated kthread.

   In the rtla tool that uses timerlat, if it was killed just as it was shutting down, the signals can free the kthread and the timer. But the closing of the timerlat files could cause hrtimer_cancel() to be called on the already freed timer. As the kthread variable is set to NULL when the kthreads are stopped and the timers are freed, it can be used to know not to call hrtimer_cancel() on the timer if the kthread variable is NULL.

 - Use a cpumask to keep track of osnoise/timerlat kthreads.

   The timerlat tracer can use user space threads for its analysis. With the killing of the rtla tool, the kernel can get confused between whether it is using a user space thread to analyze or one of its own kernel threads. When this confusion happens, kthread_stop() can be called on a user space thread and bad things happen. As the kernel threads are per-cpu, a bitmask can be used to know when a kernel thread is used or when a user space thread is used.

 - Add the missing interface_lock to osnoise/timerlat stop_kthread().

   The stop_kthread() function in osnoise/timerlat clears the osnoise kthread variable, and if it was a user space thread does a put_task on it. But this can race with the closing of the timerlat files that also does a put_task on the kthread, and if the race happens the task will have put_task called on it twice and oops.

 - Add cond_resched() to the tracing_iter_reset() loop.

   The latency tracers keep writing to the ring buffer without resetting when a new "start" event is issued (like interrupts being disabled). When reading the buffer with an iterator, tracing_iter_reset() sets its pointer to that start event by walking through all the events in the buffer until it gets to the time stamp of the start event. In the case of a very large buffer, the loop that looks for the start event has been reported taking so long with a non-preempt kernel that it can trigger a soft lockup warning. Add a cond_resched() into that loop to make sure that doesn't happen.

 - Use list_del_rcu() for the eventfs ei->list variable.

   It was reported that running loops of creating and deleting kprobe events could cause a crash due to the eventfs list iteration hitting a LIST_POISON variable. This is because the list is protected by SRCU, but when an item is deleted from the list, it was using list_del(), which poisons the "next" pointer. This is what list_del_rcu() was made to prevent.

* tag 'trace-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()
  tracing/timerlat: Only clear timer if a kthread exists
  tracing/osnoise: Use a cpumask to know what threads are kthreads
  eventfs: Use list_del_rcu() for SRCU protected list variable
  tracing: Avoid possible softlockup in tracing_iter_reset()
  tracing: Fix memory leak in fgraph storage selftest
  tracing: fgraph: Fix to add new fgraph_ops to array after ftrace_startup_subops()

2024-09-05  bpf: use type_may_be_null() helper for nullable-param check  (Shung-Hsi Yu, 1 file, -5/+0)

Commit 980ca8ceeae6 ("bpf: check bpf_dummy_struct_ops program params for test runs") does a bitwise AND between reg_type and PTR_MAYBE_NULL, which is correct, but due to the type difference the compiler complains:

    net/bpf/bpf_dummy_struct_ops.c:118:31: warning: bitwise operation between different enumeration types ('const enum bpf_reg_type' and 'enum bpf_type_flag') [-Wenum-enum-conversion]
      118 |         if (info && (info->reg_type & PTR_MAYBE_NULL))
          |                      ~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~

Work around the warning by moving the type_may_be_null() helper from verifier.c into bpf_verifier.h, and reuse it here to check whether the param is nullable.

Fixes: 980ca8ceeae6 ("bpf: check bpf_dummy_struct_ops program params for test runs")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Signed-off-by: Shung-Hsi Yu <[email protected]>
Acked-by: Eduard Zingerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

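For reference, the helper is a one-liner that takes the type as a plain u32, which sidesteps the enum/enum mix at the call site:

    static inline bool type_may_be_null(u32 type)
    {
            return type & PTR_MAYBE_NULL;
    }

    /* call site becomes: if (info && type_may_be_null(info->reg_type)) */
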
2024-09-05  bpf: Fix uprobe multi pid filter check  (Jiri Olsa, 1 file, -1/+1)

The uprobe multi link does its own process (thread leader) filtering before running the bpf program, by comparing the tasks' mm pointers. But as Oleg pointed out, there can be processes sharing the mm (CLONE_VM), so we can't just compare task->mm pointers; instead we need to use the same_thread_group() call.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

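A hedged before/after sketch of the filter check; 'link' refers to the uprobe-multi link context described above:

    /* before: misses CLONE_VM processes sharing the mm */
    if (link->task && link->task->mm != current->mm)
            return 0;

    /* after: match the target process as a whole */
    if (link->task && !same_thread_group(link->task, current))
            return 0;
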
2024-09-05  tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()  (Steven Rostedt, 1 file, -7/+6)

The timerlat interface will get and put the task that is part of the "kthread" field of the osn_var to keep it around until all references are released. But there's a race in the stop_kthread() code that will call put_task_struct() on the kthread if it is not a kernel thread. This can race with the releasing of the references to that task struct, and then put_task_struct() can be called twice when it should have been called just once.

Take the interface_lock in stop_kthread() to synchronize this change. But to do so, the function stop_per_cpu_kthreads() needs to change the loop from for_each_online_cpu() to for_each_possible_cpu() and remove the cpus_read_lock(), as the interface_lock cannot be taken while the cpu locks are held. The only side effect of this change is that it may do some extra work, as the per_cpu variables of the offline CPUs would not be set anyway, and would simply be skipped in the loop.

Remove the unneeded "return;" in stop_kthread().

Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Tomas Glozar <[email protected]>
Cc: John Kacur <[email protected]>
Cc: "Luis Claudio R. Goncalves" <[email protected]>
Link: https://lore.kernel.org/[email protected]
Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface")
Signed-off-by: Steven Rostedt (Google) <[email protected]>

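A hedged sketch of the loop change in stop_per_cpu_kthreads(); since the interface_lock is now taken inside stop_kthread(), the cpu hotplug lock can no longer be held around the loop:

    static void stop_per_cpu_kthreads(void)
    {
            int cpu;

            /* was: cpus_read_lock() around for_each_online_cpu(cpu) */
            for_each_possible_cpu(cpu)
                    stop_kthread(cpu);
    }
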