aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-06-05irqdomain: Add missing parameter descriptions in kernel-doc commentsHerve Codina1-0/+27
During compilation, several warning of the following form were raised: Function parameter or struct member 'x' not described in 'yyy' Add the missing function parameter descriptions. Signed-off-by: Herve Codina <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-05sched/balance: Skip unnecessary updates to idle load balancer's flagsTim Chen1-0/+7
We observed that the overhead on trigger_load_balance(), now renamed sched_balance_trigger(), has risen with a system's core counts. For an OLTP workload running 6.8 kernel on a 2 socket x86 systems having 96 cores/socket, we saw that 0.7% cpu cycles are spent in trigger_load_balance(). On older systems with fewer cores/socket, this function's overhead was less than 0.1%. The cause of this overhead was that there are multiple cpus calling kick_ilb(flags), updating the balancing work needed to a common idle load balancer cpu. The ilb_cpu's flags field got updated unconditionally with atomic_fetch_or(). The atomic read and writes to ilb_cpu's flags causes much cache bouncing and cpu cycles overhead. This is seen in the annotated profile below. kick_ilb(): if (ilb_cpu < 0) test %r14d,%r14d ↑ js 6c flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu)); mov $0x2d600,%rdi movslq %r14d,%r8 mov %rdi,%rdx add -0x7dd0c3e0(,%r8,8),%rdx arch_atomic_read(): 0.01 mov 0x64(%rdx),%esi 35.58 add $0x64,%rdx arch_atomic_fetch_or(): static __always_inline int arch_atomic_fetch_or(int i, atomic_t *v) { int val = arch_atomic_read(v); do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); 0.03 157: mov %r12d,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.00 mov %esi,%eax arch_atomic_fetch_or(): do { } while (!arch_atomic_try_cmpxchg(v, &val, val | i)); or %esi,%ecx arch_atomic_try_cmpxchg(): return arch_try_cmpxchg(&v->counter, old, new); 0.01 lock cmpxchg %ecx,(%rdx) 42.96 ↓ jne 2d2 kick_ilb(): With instrumentation, we found that 81% of the updates do not result in any change in the ilb_cpu's flags. That is, multiple cpus are asking the ilb_cpu to do the same things over and over again, before the ilb_cpu has a chance to run NOHZ load balance. Skip updates to ilb_cpu's flags if no new work needs to be done. Such updates do not change ilb_cpu's NOHZ flags. This requires an extra atomic read but it is less expensive than frequent unnecessary atomic updates that generate cache bounces. We saw that on the OLTP workload, cpu cycles from trigger_load_balance() (or sched_balance_trigger()) got reduced from 0.7% to 0.2%. Signed-off-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Chen Yu <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-06-05idle: Remove stale RCU commentChristian Loehle1-6/+0
The call of rcu_idle_enter() from within cpuidle_idle_call() was removed in commit 1098582a0f6c ("sched,idle,rcu: Push rcu_idle deeper into the idle path") which makes the comment out of place. Signed-off-by: Christian Loehle <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-06-05perf/core: Fix missing wakeup when waiting for context referenceHaifeng Xu1-0/+13
In our production environment, we found many hung tasks which are blocked for more than 18 hours. Their call traces are like this: [346278.191038] __schedule+0x2d8/0x890 [346278.191046] schedule+0x4e/0xb0 [346278.191049] perf_event_free_task+0x220/0x270 [346278.191056] ? init_wait_var_entry+0x50/0x50 [346278.191060] copy_process+0x663/0x18d0 [346278.191068] kernel_clone+0x9d/0x3d0 [346278.191072] __do_sys_clone+0x5d/0x80 [346278.191076] __x64_sys_clone+0x25/0x30 [346278.191079] do_syscall_64+0x5c/0xc0 [346278.191083] ? syscall_exit_to_user_mode+0x27/0x50 [346278.191086] ? do_syscall_64+0x69/0xc0 [346278.191088] ? irqentry_exit_to_user_mode+0x9/0x20 [346278.191092] ? irqentry_exit+0x19/0x30 [346278.191095] ? exc_page_fault+0x89/0x160 [346278.191097] ? asm_exc_page_fault+0x8/0x30 [346278.191102] entry_SYSCALL_64_after_hwframe+0x44/0xae The task was waiting for the refcount become to 1, but from the vmcore, we found the refcount has already been 1. It seems that the task didn't get woken up by perf_event_release_kernel() and got stuck forever. The below scenario may cause the problem. Thread A Thread B ... ... perf_event_free_task perf_event_release_kernel ... acquire event->child_mutex ... get_ctx ... release event->child_mutex acquire ctx->mutex ... perf_free_event (acquire/release event->child_mutex) ... release ctx->mutex wait_var_event acquire ctx->mutex acquire event->child_mutex # move existing events to free_list release event->child_mutex release ctx->mutex put_ctx ... ... In this case, all events of the ctx have been freed, so we couldn't find the ctx in free_list and Thread A will miss the wakeup. It's thus necessary to add a wakeup after dropping the reference. Fixes: 1cf8dfe8a661 ("perf/core: Fix race between close() and fork()") Signed-off-by: Haifeng Xu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2024-06-05sched/headers: Move struct pre-declarations to the beginning of the headerIngo Molnar1-10/+6
There's a random number of structure pre-declaration lines in kernel/sched/sched.h, some of which are unnecessary duplicates. Move them to the head & order them a bit for readability. Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected]
2024-06-05sched/core: Clean up kernel/sched/sched.h a bitIngo Molnar1-132/+180
- Fix whitespace noise - Fix col80 linebreak damage where possible - Apply CodingStyle consistently - Use consistent #else and #endif comments - Use consistent vertical alignment - Use 'extern' consistently Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected]
2024-06-05sched/core: Simplify prefetch_curr_exec_start()Ingo Molnar1-2/+2
Remove unnecessary use of the address operator. Signed-off-by: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected]
2024-06-04function_graph: Use static_call and branch to optimize return functionSteven Rostedt (Google)1-5/+18
In most cases function graph is used by a single user. Instead of calling a loop to call function graph callbacks in this case, call the function return callback directly. Use the static_key that is set when the function graph tracer has less than 2 callbacks registered. It will do the direct call in that case, and will do the loop over all callers when there are 2 or more callbacks registered. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Use static_call and branch to optimize entry functionSteven Rostedt (Google)1-11/+66
In most cases function graph is used by a single user. Instead of calling a loop to call function graph callbacks in this case, call the function entry callback directly. Add a static_key that will be used to set the function graph logic to either do the loop (when more than one callback is registered) or to call the callback directly if there is only one registered callback. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Use bitmask to loop on fgraph entrySteven Rostedt (Google)1-1/+7
Instead of looping through all the elements of fgraph_array[] to see if there's an gops attached to one and then calling its gops->func(). Create a fgraph_array_bitmask that sets bits when an index in the array is reserved (via the simple lru algorithm). Then only the bits set in this bitmask needs to be looked at where only elements in the array that have ops registered need to be looked at. Note, we do not care about races. If a bit is set before the gops is assigned, it only wastes time looking at the element and ignoring it (as it did before this bitmask is added). Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Use for_each_set_bit() in __ftrace_return_to_handler()Steven Rostedt (Google)1-3/+2
Instead of iterating through the entire fgraph_array[] and seeing if one of the bitmap bits are set to know to call the array's retfunc() function, use for_each_set_bit() on the bitmap itself. This will only iterate for the number of set bits. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04ftrace: Add multiple fgraph storage selftestMasami Hiramatsu (Google)1-45/+126
Add a selftest for multiple function graph tracer with storage on a same function. In this case, the shadow stack entry will be shared among those fgraph with different data storage. So this will ensure the fgraph will not mixed those storage data. Link: https://lore.kernel.org/linux-trace-kernel/171509111465.162236.3795819216426570800.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Suggested-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Add selftest for passing local variablesSteven Rostedt (VMware)1-0/+169
Add boot up selftest that passes variables from a function entry to a function exit, and make sure that they do get passed around. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509110271.162236.11047551496319744627.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Implement fgraph_reserve_data() and fgraph_retrieve_data()Steven Rostedt (VMware)1-8/+183
Added functions that can be called by a fgraph_ops entryfunc and retfunc to store state between the entry of the function being traced to the exit of the same function. The fgraph_ops entryfunc() may call fgraph_reserve_data() to store up to 32 words onto the task's shadow ret_stack and this then can be retrieved by fgraph_retrieve_data() called by the corresponding retfunc(). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509109089.162236.11372474169781184034.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Move graph notrace bit to shadow stack global varSteven Rostedt (VMware)2-4/+15
The use of the task->trace_recursion for the logic used for the function graph no-trace was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509107907.162236.6564679266777519065.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Move graph depth stored data to shadow stack global varSteven Rostedt (VMware)1-2/+32
The use of the task->trace_recursion for the logic used for the function graph depth was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509106728.162236.2398372644430125344.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Move set_graph_function tests to shadow stack global varSteven Rostedt (VMware)4-18/+28
The use of the task->trace_recursion for the logic used for the set_graph_function was a bit of an abuse of that variable. Now that there exists global vars that are per stack for registered graph traces, use that instead. Link: https://lore.kernel.org/linux-trace-kernel/171509105520.162236.10339831553995971290.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Add "task variables" per task for fgraph_opsSteven Rostedt (VMware)1-1/+73
Add a "task variables" array on the tasks shadow ret_stack that is the size of longs for each possible registered fgraph_ops. That's a total of 16, taking up 8 * 16 = 128 bytes (out of a page size 4k). This will allow for fgraph_ops to do specific features on a per task basis having a way to maintain state for each task. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509104383.162236.12239656156685718550.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Use a simple LRU for fgraph_array index numberMasami Hiramatsu (Google)1-21/+50
Since the fgraph_array index is used for the bitmap on the shadow stack, it may leave some entries after a function_graph instance is removed. Thus if another instance reuses the fgraph_array index soon after releasing it, the fgraph may confuse to call the newer callback for the entries which are pushed by the older instance. To avoid reusing the fgraph_array index soon after releasing, introduce a simple LRU table for managing the index number. This will reduce the possibility of this confusion. Link: https://lore.kernel.org/linux-trace-kernel/171509103267.162236.6885097397289135378.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Add pid tracing back to function graph tracerSteven Rostedt (Google)3-2/+45
Now that the function_graph has a main callback that handles the function graph subops tracing, it no longer honors the pid filtering of ftrace. Add back this logic in the function_graph code to update the gops callback for the entry function to test if it should trace the current task or not. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Have the instances use their own ftrace_ops for filteringSteven Rostedt (VMware)5-41/+67
Allow for instances to have their own ftrace_ops part of the fgraph_ops that makes the funtion_graph tracer filter on the set_ftrace_filter file of the instance and not the top instance. This uses the new ftrace_startup_subops(), by using graph_ops as the "manager ops" that defines the callback function and adds the functions defined by the filters of the ops for each trace instance. The callback defined by the manager ops will call the registered fgraph ops that were added to the fgraph_array. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509102088.162236.15758883237657317789.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04ftrace: Allow subops filtering to be modifiedSteven Rostedt (Google)1-28/+113
The subops filters use a "manager" ops to enable and disable its filters. The manager ops can handle more than one subops, and its filter is what controls what functions get set. Add a ftrace_hash_move_and_update_subops() function that will update the manager ops when the subops filters change. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04ftrace: Add subops logic to allow one ops to manage manySteven Rostedt (Google)4-2/+404
There are cases where a single system will use a single function callback to handle multiple users. For example, to allow function_graph tracer to have multiple users where each can trace their own set of functions, it is useful to only have one ftrace_ops registered to ftrace that will call a function by the function_graph tracer to handle the multiplexing with the different registered function_graph tracers. Add a "subop_list" to the ftrace_ops that will hold a list of other ftrace_ops that the top ftrace_ops will manage. The function ftrace_startup_subops() that takes the manager ftrace_ops and a subop ftrace_ops it will manage. If there are no subops with the ftrace_ops yet, it will copy the ftrace_ops subop filters to the manager ftrace_ops and register that with ftrace_startup(), and adds the subop to its subop_list. If the manager ops already has something registered, it will then merge the new subop filters with what it has and enable the new functions that covers all the subops it has. To remove a subop, ftrace_shutdown_subops() is called which will use the subop_list of the manager ops to rebuild all the functions it needs to trace, and update the ftrace records to only call the functions it now has registered. If there are no more functions registered, it will then call ftrace_shutdown() to disable itself completely. Note, it is up to the manager ops callback to always make sure that the subops callbacks are called if its filter matches, as there are times in the update where the callback could be calling more functions than those that are currently registered. This could be updated to handle other systems other than function_graph, for example, fprobes could use this (but will need an interface to call ftrace_startup_subops()). Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04ftrace: Allow function_graph tracer to be enabled in instancesSteven Rostedt (VMware)5-28/+63
Now that function graph tracing can handle more than one user, allow it to be enabled in the ftrace instances. Note, the filtering of the functions is still joined by the top level set_ftrace_filter and friends, as well as the graph and nograph files. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509099743.162236.1699959255446248163.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04ftrace/function_graph: Pass fgraph_ops to function graph callbacksSteven Rostedt (VMware)7-21/+33
Pass the fgraph_ops structure to the function graph callbacks. This will allow callbacks to add a descriptor to a fgraph_ops private field that wil be added in the future and use it for the callbacks. This will be useful when more than one callback can be registered to the function graph tracer. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509098588.162236.4787930115997357578.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Remove logic around ftrace_graph_entry and returnSteven Rostedt (VMware)3-56/+15
The function pointers ftrace_graph_entry and ftrace_graph_return are no longer called via the function_graph tracer. Instead, an array structure is now used that will allow for multiple users of the function_graph infrastructure. The variables are still used by the architecture code for non dynamic ftrace configs, where a test is made against them to see if they point to the default stub function or not. This is how the static function tracing knows to call into the function graph tracer infrastructure or not. Two new stub functions are made. entry_run() and return_run(). The ftrace_graph_entry and ftrace_graph_return are set to them respectively when the function graph tracer is enabled, and this will trigger the architecture specific function graph code to be executed. This also requires checking the global_ops hash for all calls into the function_graph tracer. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509097408.162236.17387844142114638932.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Handle tail calls for stack unwindingMasami Hiramatsu (Google)1-3/+16
For the tail-call, there would be 2 or more ftrace_ret_stacks on the ret_stack, which records "return_to_handler" as the return address except for the last one. But on the real stack, there should be 1 entry because tail-call reuses the return address on the stack and jump to the next function. In ftrace_graph_ret_addr() that is used for stack unwinding, skip tail calls as a real stack unwinder would do. Link: https://lore.kernel.org/linux-trace-kernel/171509096221.162236.8806372072523195752.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Allow multiple users to attach to function graphSteven Rostedt (VMware)1-77/+302
Allow for multiple users to attach to function graph tracer at the same time. Only 16 simultaneous users can attach to the tracer. This is because there's an array that stores the pointers to the attached fgraph_ops. When a function being traced is entered, each of the ftrace_ops entryfunc is called and if it returns non zero, its index into the array will be added to the shadow stack. On exit of the function being traced, the shadow stack will contain the indexes of the ftrace_ops on the array that want their retfunc to be called. Because a function may sleep for a long time (if a task sleeps itself), the return of the function may be literally days later. If the ftrace_ops is removed, its place on the array is replaced with a ftrace_ops that contains the stub functions and that will be called when the function finally returns. If another ftrace_ops is added that happens to get the same index into the array, its return function may be called. But that's actually the way things current work with the old function graph tracer. If one tracer is removed and another is added, the new one will get the return calls of the function traced by the previous one, thus this is not a regression. This can be fixed by adding a counter to each time the array item is updated and save that on the shadow stack as well, such that it won't be called if the index saved does not match the index on the array. Note, being able to filter functions when both are called is not completely handled yet, but that shouldn't be too hard to manage. Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509096221.162236.8806372072523195752.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Add an array structure that will allow multiple callbacksSteven Rostedt (VMware)1-33/+81
Add an array structure that will eventually allow the function graph tracer to have up to 16 simultaneous callbacks attached. It's an array of 16 fgraph_ops pointers, that is assigned when one is registered. On entry of a function the entry of the first item in the array is called, and if it returns zero, then the callback returns non zero if it wants the return callback to be called on exit of the function. The array will simplify the process of having more than one callback attached to the same function, as its index into the array can be stored on the shadow stack. We need to only save the index, because this will allow the fgraph_ops to be freed before the function returns (which may happen if the function call schedule for a long time). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509095075.162236.8272148192748284581.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04fgraph: Use BUILD_BUG_ON() to make sure we have structures divisible by longSteven Rostedt (VMware)1-1/+5
Instead of using "ALIGN()", use BUILD_BUG_ON() as the structures should always be divisible by sizeof(long). Co-developed with Masami Hiramatsu: Link: https://lore.kernel.org/linux-trace-kernel/171509093949.162236.14518699447151894536.stgit@devnote2 Link: http://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-04function_graph: Convert ret_stack to a series of longsSteven Rostedt (VMware)1-54/+82
In order to make it possible to have multiple callbacks registered with the function_graph tracer, the retstack needs to be converted from an array of ftrace_ret_stack structures to an array of longs. This will allow to store the list of callbacks on the stack for the return side of the functions. Link: https://lore.kernel.org/linux-trace-kernel/171509092742.162236.4427737821399314856.stgit@devnote2 Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Florent Revest <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: bpf <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Alan Maguire <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Guo Ren <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-06-03bpf: limit the number of levels of a nested struct type.Kui-Feng Lee1-11/+19
Limit the number of levels looking into struct types to avoid running out of stack space. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03bpf: look into the types of the fields of a struct type recursively.Kui-Feng Lee1-23/+77
The verifier has field information for specific special types, such as kptr, rbtree root, and list head. These types are handled differently. However, we did not previously examine the types of fields of a struct type variable. Field information records were not generated for the kptrs, rbtree roots, and linked_list heads that are not located at the outermost struct type of a variable. For example, struct A { struct task_struct __kptr * task; }; struct B { struct A mem_a; } struct B var_b; It did not examine "struct A" so as not to generate field information for the kptr in "struct A" for "var_b". This patch enables BPF programs to define fields of these special types in a struct type other than the direct type of a variable or in a struct type that is the type of a field in the value type of a map. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03bpf: create repeated fields for arrays.Kui-Feng Lee1-4/+58
The verifier uses field information for certain special types, such as kptr, rbtree root, and list head. These types are treated differently. However, we did not previously support these types in arrays. This update examines arrays and duplicates field information the same number of times as the length of the array if the element type is one of the special types. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03bpf: refactor btf_find_struct_field() and btf_find_datasec_var().Kui-Feng Lee1-101/+79
Move common code of the two functions to btf_find_field_one(). Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03bpf: Remove unnecessary call to btf_field_type_size().Kui-Feng Lee2-2/+2
field->size has been initialized by bpf_parse_fields() with the value returned by btf_field_type_size(). Use it instead of calling btf_field_type_size() again. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03bpf: Remove unnecessary checks on the offset of btf_field.Kui-Feng Lee1-1/+1
reg_find_field_offset() always return a btf_field with a matching offset value. Checking the offset of the returned btf_field is unnecessary. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-06-03rcutorture: Make rcutorture support srcu double call testZqiang1-20/+26
This commit allows rcutorture to test double-call_srcu() when the CONFIG_DEBUG_OBJECTS_RCU_HEAD Kconfig option is enabled. The non-raw sdp structure's ->spinlock will be acquired in call_srcu(), hence this commit also removes the current IRQ and preemption disabling so as to avoid lockdep complaints. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Zqiang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-06-03Revert "rcu-tasks: Fix synchronize_rcu_tasks() VS zap_pid_ns_processes()"Frederic Weisbecker2-30/+3
This reverts commit 28319d6dc5e2ffefa452c2377dd0f71621b5bff0. The race it fixed was subject to conditions that don't exist anymore since: 1612160b9127 ("rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks") This latter commit removes the use of SRCU that used to cover the RCU-tasks blind spot on exit between the tasklist's removal and the final preemption disabling. The task is now placed instead into a temporary list inside which voluntary sleeps are accounted as RCU-tasks quiescent states. This would disarm the deadlock initially reported against PID namespace exit. Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-06-03rcu/nocb: Remove buggy bypass lock contention mitigationFrederic Weisbecker2-27/+6
The bypass lock contention mitigation assumes there can be at most 2 contenders on the bypass lock, following this scheme: 1) One kthread takes the bypass lock 2) Another one spins on it and increment the contended counter 3) A third one (a bypass enqueuer) sees the contended counter on and busy loops waiting on it to decrement. However this assumption is wrong. There can be only one CPU to find the lock contended because call_rcu() (the bypass enqueuer) is the only bypass lock acquire site that may not already hold the NOCB lock beforehand, all the other sites must first contend on the NOCB lock. Therefore step 2) is impossible. The other problem is that the mitigation assumes that contenders all belong to the same rdp CPU, which is also impossible for a raw spinlock. In theory the warning could trigger if the enqueuer holds the bypass lock and another CPU flushes the bypass queue concurrently but this is prevented from all flush users: 1) NOCB kthreads only flush if they successfully _tried_ to lock the bypass lock. So no contention management here. 2) Flush on callbacks migration happen remotely when the CPU is offline. No concurrency against bypass enqueue. 3) Flush on deoffloading happen either locally with IRQs disabled or remotely when the CPU is not yet online. No concurrency against bypass enqueue. 4) Flush on barrier entrain happen either locally with IRQs disabled or remotely when the CPU is offline. No concurrency against bypass enqueue. For those reasons, the bypass lock contention mitigation isn't needed and is even wrong. Remove it but keep the warning reporting a contended bypass lock on a remote CPU, to keep unexpected contention awareness. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-06-03rcu/nocb: Use kthread parking instead of ad-hoc implementationFrederic Weisbecker2-83/+36
Upon NOCB deoffloading, the rcuo kthread must be forced to sleep until the corresponding rdp is ever offloaded again. The deoffloader clears the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which then notices that change and clears in turn its SEGCBLIST_KTHREAD_CB flag before going to sleep, until it ever sees the SEGCBLIST_OFFLOADED flag again, should a re-offloading happen. Upon NOCB offloading, the rcuo kthread must be forced to wake up and handle callbacks until the corresponding rdp is ever deoffloaded again. The offloader sets the SEGCBLIST_OFFLOADED flag, wakes up the rcuo kthread which then notices that change and sets in turn its SEGCBLIST_KTHREAD_CB flag before going to check callbacks, until it ever sees the SEGCBLIST_OFFLOADED flag cleared again, should a de-offloading happen again. This is all a crude ad-hoc and error-prone kthread (un-)parking re-implementation. Consolidate the behaviour with the appropriate API instead. [ paulmck: Apply Qiang Zhang feedback provided in Link: below. ] Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2024-06-03cgroup/cpuset: Optimize isolated partition only generate_sched_domains() callsWaiman Long1-0/+8
If only isolated partitions are being created underneath the cgroup root, there will only be one sched domain with top_cpuset.effective_cpus. We can skip the unnecessary sched domains scanning code and save some cycles. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-06-03bpf: Fix a potential use-after-free in bpf_link_free()Cong Wang1-5/+6
After commit 1a80dbcb2dba, bpf_link can be freed by link->ops->dealloc_deferred, but the code still tests and uses link->ops->dealloc afterward, which leads to a use-after-free as reported by syzbot. Actually, one of them should be sufficient, so just call one of them instead of both. Also add a WARN_ON() in case of any problematic implementation. Fixes: 1a80dbcb2dba ("bpf: support deferring bpf_link dealloc to after RCU grace period") Reported-by: [email protected] Signed-off-by: Cong Wang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-06-03printk: Rename console_replay_all() and update contextSreenath Vijayan1-3/+3
Rename console_replay_all() to console_try_replay_all() to make clear that the implementation is best effort. Also, the function should not be called in NMI context as it takes locks, so update the comment in code. Fixes: 693f75b91a91 ("printk: Add function to replay kernel log on consoles") Fixes: 1b743485e27f ("tty/sysrq: Replay kernel log messages on consoles via sysrq") Suggested-by: Petr Mladek <[email protected]> Signed-off-by: Shimoyashiki Taichi <[email protected]> Signed-off-by: Sreenath Vijayan <[email protected]> Link: https://lore.kernel.org/r/Zlguq/[email protected]@sony.com Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2024-06-03bpf, devmap: Remove unnecessary if check in for loopThorsten Blum1-3/+0
The iterator variable dst cannot be NULL and the if check can be removed. Remove it and fix the following Coccinelle/coccicheck warning reported by itnull.cocci: ERROR: iterator variable bound on line 762 cannot be NULL Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-06-03fs: don't block i_writecount during execChristian Brauner1-23/+3
Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesistant because I thought that this might cause issues in userspace. But even back then I had started taking some notes on what could potentially depend on this and I didn't come up with a lot so I've changed my mind and I would like to try this. Here are some of the notes that I took: (1) The deny_write_access() mechanism is causing really pointless issues such as [1]. If a thread in a thread-group opens a file writable, then writes some stuff, then closing the file descriptor and then calling execve() they can fail the execve() with ETXTBUSY because another thread in the thread-group could have concurrently called fork(). Multi-threaded libraries such as go suffer from this. (2) There are userspace attacks that rely on overwriting the binary of a running process. These attacks are _mitigated_ but _not at all prevented_ from ocurring by the deny_write_access() mechanism. I'll go over some details. The clearest example of such attacks was the attack against runC in CVE-2019-5736 (cf. [3]). An attack could compromise the runC host binary from inside a _privileged_ runC container. The malicious binary could then be used to take over the host. (It is crucial to note that this attack is _not_ possible with unprivileged containers. IOW, the setup here is already insecure.) The attack can be made when attaching to a running container or when starting a container running a specially crafted image. For example, when runC attaches to a container the attacker can trick it into executing itself. This could be done by replacing the target binary inside the container with a custom binary pointing back at the runC binary itself. As an example, if the target binary was /bin/bash, this could be replaced with an executable script specifying the interpreter path #!/proc/self/exe. As such when /bin/bash is executed inside the container, instead the target of /proc/self/exe will be executed. That magic link will point to the runc binary on the host. The attacker can then proceed to write to the target of /proc/self/exe to try and overwrite the runC binary on the host. However, this will not succeed because of deny_write_access(). Now, one might think that this would prevent the attack but it doesn't. To overcome this, the attacker has multiple ways: * Open a file descriptor to /proc/self/exe using the O_PATH flag and then proceed to reopen the binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it in a busy loop from a separate process. Ultimately it will succeed when the runC binary exits. After this the runC binary is compromised and can be used to attack other containers or the host itself. * Use a malicious shared library annotating a function in there with the constructor attribute making the malicious function run as an initializor. The malicious library will then open /proc/self/exe for creating a new entry under /proc/self/fd/<nr>. It'll then call exec to a) force runC to exit and b) hand the file descriptor off to a program that then reopens /proc/self/fd/<nr> for writing (which is now possible because runC has exited) and overwriting that binary. To sum up: the deny_write_access() mechanism doesn't prevent such attacks in insecure setups. It just makes them minimally harder. That's all. The only way back then to prevent this is to create a temporary copy of the calling binary itself when it starts or attaches to containers. So what I did back then for LXC (and Aleksa for runC) was to create an anonymous, in-memory file using the memfd_create() system call and to copy itself into the temporary in-memory file, which is then sealed to prevent further modifications. This sealed, in-memory file copy is then executed instead of the original on-disk binary. Any compromising write operations from a privileged container to the host binary will then write to the temporary in-memory binary and not to the host binary on-disk, preserving the integrity of the host binary. Also as the temporary, in-memory binary is sealed, writes to this will also fail. The point is that deny_write_access() is uselss to prevent these attacks. (3) Denying write access to an inode because it's currently used in an exec path could easily be done on an LSM level. It might need an additional hook but that should be about it. (4) The MAP_DENYWRITE flag for mmap() has been deprecated a long time ago so while we do protect the main executable the bigger portion of the things you'd think need protecting such as the shared libraries aren't. IOW, we let anyone happily overwrite shared libraries. (5) We removed all remaining uses of VM_DENYWRITE in [2]. That means: (5.1) We removed the legacy uselib() protection for preventing overwriting of shared libraries. Nobody cared in 3 years. (5.2) We allow write access to the elf interpreter after exec completed treating it on a par with shared libraries. Yes, someone in userspace could potentially be relying on this. It's not completely out of the realm of possibility but let's find out if that's actually the case and not guess. Link: https://github.com/golang/go/issues/22315 [1] Link: 49624efa65ac ("Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux") [2] Link: https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736 [3] Link: https://lwn.net/Articles/866493 Link: https://github.com/golang/go/issues/22220 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/buildid.go#L724 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/work/exec.go#L1493 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/script/cmds.go#L457 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/cmd/go/internal/test/test.go#L1557 Link: https://github.com/golang/go/blob/5bf8c0cf09ee5c7e5a37ab90afcce154ab716a97/src/os/exec/lp_linux_test.go#L61 Link: https://github.com/buildkite/agent/pull/2736 Link: https://github.com/rust-lang/rust/issues/114554 Link: https://bugs.openjdk.org/browse/JDK-8068370 Link: https://github.com/dotnet/runtime/issues/58964 Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-06-03sysctl: Add module description to sysctl-testingJeff Johnson1-0/+1
Added a module description to sysctl Kunit self test module to fix the 'make W=1' warning (" WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/sysctl-test.o") Signed-off-by: Jeff Johnson <[email protected]> Signed-off-by: Joel Granados <[email protected]>
2024-06-03sysctl: constify ctl_table arguments of utility functionThomas Weißschuh1-10/+11
In a future commit the proc_handlers themselves will change to "const struct ctl_table". As a preparation for that adapt the internal helper. Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Joel Granados <[email protected]>
2024-06-03utsname: constify ctl_table arguments of utility functionThomas Weißschuh1-1/+1
The sysctl core is preparing to only expose instances of struct ctl_table as "const". This will also affect the ctl_table argument of sysctl handlers. As the function prototype of all sysctl handlers throughout the tree needs to stay consistent that change will be done in one commit. To reduce the size of that final commit, switch utility functions which are not bound by "typedef proc_handler" to "const struct ctl_table". No functional change. Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Joel Granados <[email protected]> Signed-off-by: Joel Granados <[email protected]>
2024-06-03sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_arrayWen Yang2-8/+51
Move boundary checking for proc_dou8ved_minmax into module loading, thereby reporting errors in advance. And add a kunit test case ensuring the boundary check is done correctly. The boundary check in proc_dou8vec_minmax done to the extra elements in the ctl_table struct is currently performed at runtime. This allows buggy kernel modules to be loaded normally without any errors only to fail when used. This is a buggy example module: #include <linux/kernel.h> #include <linux/module.h> #include <linux/sysctl.h> static struct ctl_table_header *_table_header = NULL; static unsigned char _data = 0; struct ctl_table table[] = { { .procname = "foo", .data = &_data, .maxlen = sizeof(u8), .mode = 0644, .proc_handler = proc_dou8vec_minmax, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE_THOUSAND, }, }; static int init_demo(void) { _table_header = register_sysctl("kernel", table); if (!_table_header) return -ENOMEM; return 0; } module_init(init_demo); MODULE_LICENSE("GPL"); And this is the result: # insmod test.ko # cat /proc/sys/kernel/foo cat: /proc/sys/kernel/foo: Invalid argument Suggested-by: Joel Granados <[email protected]> Signed-off-by: Wen Yang <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Kees Cook <[email protected]> Cc: Joel Granados <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Christian Brauner <[email protected]> Cc: [email protected] Reviewed-by: Joel Granados <[email protected]> Signed-off-by: Joel Granados <[email protected]>