path: root/kernel
2024-08-02  uprobes: shift put_uprobe() from delete_uprobe() to uprobe_unregister()  (Oleg Nesterov; 1 file, -5/+9)
Kill the extra get_uprobe() + put_uprobe() in uprobe_unregister() and move the possibly final put_uprobe() from delete_uprobe() to its only caller, uprobe_unregister(). Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: "Masami Hiramatsu (Google)" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  uprobes: fold __uprobe_unregister() into uprobe_unregister()  (Oleg Nesterov; 1 file, -15/+10)
Fold __uprobe_unregister() into its single caller, uprobe_unregister(). A separate patch to simplify the next change. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: "Masami Hiramatsu (Google)" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  uprobes: change uprobe_register() to use uprobe_unregister() instead of __uprobe_unregister()  (Oleg Nesterov; 1 file, -5/+7)
If register_for_each_vma() fails, uprobe_register() can safely drop uprobe->register_rwsem and use uprobe_unregister(). There is no worry about races with another register/unregister: consumer_add() was already called, so this case doesn't differ from _unregister() right after a successful _register(). Yes, this means an extra up_write() + down_write(), but this is the slow and unlikely case anyway. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: "Masami Hiramatsu (Google)" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  uprobes: make uprobe_register() return struct uprobe *  (Oleg Nesterov; 3 files, -60/+47)
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *" rather than inode + offset. This simplifies the code and makes it possible to avoid the unnecessary find_uprobe() + put_uprobe() in these functions. TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that this uprobe can't be freed before up_write(&uprobe->register_rwsem). Co-developed-by: Andrii Nakryiko <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected]
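A minimal sketch of the caller pattern this enables; the signatures are approximated from the description above, and inode/offset/consumer setup is assumed to exist elsewhere:

    /* Sketch only: uprobe_register() now hands back the uprobe (or an
     * ERR_PTR on failure), and teardown takes that pointer directly,
     * with no inode + offset lookup needed.
     */
    struct uprobe *u;

    u = uprobe_register(inode, offset, ref_ctr_offset, &consumer);
    if (IS_ERR(u))
            return PTR_ERR(u);

    /* ... later ... */
    uprobe_unregister(u, &consumer);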
2024-08-02  uprobes: kill uprobe_register_refctr()  (Oleg Nesterov; 3 files, -28/+11)
It doesn't make any sense to have two versions of _register(). Note that trace_uprobe_enable(), the only user of uprobe_register(), doesn't need to check tu->ref_ctr_offset to decide which one should be used; it could safely pass ref_ctr_offset == 0 to uprobe_register_refctr(). Add this argument to uprobe_register(), update the callers, and kill uprobe_register_refctr(). Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  uprobes: simplify error handling for alloc_uprobe()  (Andrii Nakryiko; 1 file, -3/+1)
Return -ENOMEM instead of NULL, which makes caller's error handling just a touch simpler. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: "Masami Hiramatsu (Google)" <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected]
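For readers less familiar with the kernel idiom, the simplification comes from the ERR_PTR()/IS_ERR() convention; a generic sketch (not the actual uprobe code, 'foo' is a placeholder type):

    #include <linux/err.h>
    #include <linux/slab.h>

    struct foo { int x; };

    static struct foo *alloc_foo(void)
    {
            struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);

            if (!f)
                    return ERR_PTR(-ENOMEM);   /* encode the errno in the pointer */
            return f;
    }

    static int use_foo(void)
    {
            struct foo *f = alloc_foo();

            if (IS_ERR(f))
                    return PTR_ERR(f);         /* one check instead of NULL + errno juggling */
            /* ... */
            kfree(f);
            return 0;
    }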
2024-08-02  uprobes: is_trap_at_addr: don't use get_user_pages_remote()  (Oleg Nesterov; 1 file, -7/+1)
get_user_pages_remote() and the comment above it make no sense. There is no task_struct passed into get_user_pages_remote() anymore, and nowadays mm_account_fault() increments the current->min/maj_flt counters regardless of FAULT_FLAG_REMOTE. Reported-by: Andrii Nakryiko <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  uprobes: document the usage of mm->mmap_lock  (Oleg Nesterov; 1 file, -2/+8)
The comment above uprobe_write_opcode() is wrong: unapply_uprobe() calls it under mmap_read_lock() and this is correct. It is also completely unclear why register_for_each_vma() takes mmap_lock for writing, so add a comment to explain that mmap_write_lock() is needed to avoid the following race:
- A task T hits the bp installed by a uprobe and calls find_active_uprobe()
- uprobe_unregister() removes this uprobe/bp
- T calls find_uprobe() which returns NULL
- another uprobe_register() installs the bp at the same address
- T calls is_trap_at_addr() which returns true
- T returns to handle_swbp() and gets SIGTRAP.
Reported-by: Andrii Nakryiko <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: "Masami Hiramatsu (Google)" <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-02  perf,x86: avoid missing caller address in stack traces captured in uprobe  (Andrii Nakryiko; 1 file, -0/+2)
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be the `push %rbp` instruction in the function preamble, which means that within that function the frame pointer hasn't been established yet. This leads to consistently missing the actual caller of the traced function, because perf_callchain_user() only records the current IP (capturing the traced function) and then follows the frame pointer chain (which would be the caller's frame, containing the address of the caller's caller). So when we have a target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, the captured stack trace will report a target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes an x86-64-specific heuristic to detect the `push %rbp` (`push %ebp` on 32-bit) instruction being traced. Given the entire kernel implementation of user space stack trace capturing works under the assumption that user space code was compiled with frame pointer register (%rbp/%ebp) preservation, it seems pretty reasonable to use this instruction as a strong indicator that this is the entry to the function. In that case, the return address is still pointed to by %rsp/%esp, so we fetch it and add it to the stack trace before proceeding to unwind the rest using frame pointer-based logic. We also check for `endbr64` (for 64-bit mode) as another common pattern for function entry, as suggested by Josh Poimboeuf. Even if we get this wrong sometimes for uprobes attached somewhere other than the function entry, it's OK because the stack trace will still be overall meaningful, just with one extra bogus entry. If we don't detect this, the caller's entry is guaranteed to be missing from the stack trace, which is worse overall. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
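A rough illustration of the heuristic; function and variable names here are illustrative, not the actual perf code. The opcode checks (0x55 for push %rbp/%ebp, f3 0f 1e fa for endbr64) are the facts the message relies on:

    /* Sketch: if the probed instruction is 'push %rbp' or 'endbr64', the
     * frame is not set up yet, so the return address is still at the top
     * of the user stack; record it before frame-pointer unwinding.
     */
    static bool at_function_entry(const u8 *insn)
    {
            if (insn[0] == 0x55)                            /* push %rbp / push %ebp */
                    return true;
            if (insn[0] == 0xf3 && insn[1] == 0x0f &&
                insn[2] == 0x1e && insn[3] == 0xfa)         /* endbr64 */
                    return true;
            return false;
    }

    /* Usage idea (hypothetical call site):
     *   if (at_function_entry(ip_bytes) && !get_user(ret_addr, (u64 __user *)sp))
     *           perf_callchain_store(entry, ret_addr);
     */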
2024-08-02  perf: Support PERF_SAMPLE_READ with inherit  (Ben Gainey; 1 file, -14/+41)
This change allows events to use PERF_SAMPLE_READ with inherit so long as PERF_SAMPLE_TID is also set. This enables sample based profiling of a group of counters over a hierarchy of processes or threads. This is useful, for example, for collecting per-thread counters/metrics, event based sampling of multiple counters as a unit, access to the enabled and running time when using multiplexing and so on. Prior to this, users were restricted to either collecting aggregate statistics for a multi-threaded/-process application (e.g. with "perf stat"), or to sample individual threads, or to profile the entire system (which requires root or CAP_PERFMON, and may produce much more data than is required). Theoretically a tool could poll for or otherwise monitor thread/process creation and construct whatever events the user is interested in using perf_event_open, for each new thread or process, but this is racy, can lead to file-descriptor exhaustion, and ultimately just replicates the behaviour of inherit, but in userspace. This configuration differs from inherit without PERF_SAMPLE_READ in that the accumulated event count, and consequently any sample (such as if triggered by overflow of sample_period) will be on a per-thread rather than on an aggregate basis. The meaning of read_format::value field of both PERF_RECORD_READ and PERF_RECORD_SAMPLE is changed such that if the sampled event uses this new configuration then the values reported will be per-thread rather than the global aggregate value. This is a change from the existing semantics of read_format (where PERF_SAMPLE_READ is used without inherit), but it is necessary to expose the per-thread counter values, and it avoids reinventing a separate "read_format_thread" field that otherwise replicates the same behaviour. This change should not break existing tools, since this configuration was not previously valid and was rejected by the kernel. Tools that opt into this new mode will need to account for this when calculating the counter delta for a given sample. Tools that wish to have both the per-thread and aggregate value can perform the global aggregation themselves from the per-thread values. The change to read_format::value does not affect existing valid perf_event_attr configurations, nor does it change the behaviour of calls to "read" on an event descriptor. Both continue to report the aggregate value for the entire thread/process hierarchy. The difference between the results reported by "read" and PERF_RECORD_SAMPLE in this new configuration is justified on the basis that it is not (easily) possible for "read" to target a specific thread (the caller only has the fd for the original parent event). Signed-off-by: Ben Gainey <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
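A hedged userspace sketch of opening such an event: only the inherit + PERF_SAMPLE_READ + PERF_SAMPLE_TID combination comes from the text above; the event type, period, and read_format are illustrative, and older kernels will reject this attr with EINVAL:

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Sample-based counting inherited by children, with per-thread values
     * reported in PERF_RECORD_SAMPLE.
     */
    int open_inherited_counter(pid_t pid)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.sample_period = 1000000;
            attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_READ; /* TID is mandatory here */
            attr.read_format = PERF_FORMAT_ID;
            attr.inherit = 1;

            return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
    }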
2024-08-02  perf: Rename perf_event_context.nr_pending to nr_no_switch_fast  (Ben Gainey; 1 file, -6/+6)
nr_pending counts the number of events in the context that have either pending_sigtrap or pending_work set, but it is used to prevent taking the fast path in perf_event_context_sched_out(). Rename it to reflect what it is used for rather than what it counts. This allows the field to be used to track other event properties that also require skipping the fast path, without possible confusion over the name. Signed-off-by: Ben Gainey <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-08-01  kcsan: Use min() to fix Coccinelle warning  (Thorsten Blum; 1 file, -1/+1)
Fixes the following Coccinelle/coccicheck warning reported by minmax.cocci:

  WARNING opportunity for min()

Use const size_t instead of int for the result of min(). Compile-tested with CONFIG_KCSAN=y. Reviewed-by: Marco Elver <[email protected]> Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
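The pattern in question, shown generically; the variable and function names below are illustrative, not the actual KCSAN code:

    #include <linux/minmax.h>

    static size_t copy_len(size_t buf_size, size_t bytes_left)
    {
            /* What minmax.cocci suggests instead of an open-coded ternary;
             * the result is a const size_t rather than an int.
             */
            const size_t len = min(buf_size, bytes_left);

            return len;
    }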
2024-08-01  sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]  (Tejun Heo; 2 files, -1/+4)
scx_dump_task() uses stack_trace_save_tsk() which is only available when CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if the support is available and skip capturing stack trace if !CONFIG_STACKTRACE. Signed-off-by: Tejun Heo <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Acked-by: David Vernet <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
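A sketch of the compile-time guard described above (the Kconfig side would select STACKTRACE when STACKTRACE_SUPPORT is available); the surrounding dump function and buffer size are illustrative, not the actual scx_dump_task() code:

    static void dump_task_stack(struct task_struct *p)
    {
    #ifdef CONFIG_STACKTRACE
            unsigned long entries[32];
            unsigned int n;

            /* stack_trace_save_tsk() only exists with CONFIG_STACKTRACE=y */
            n = stack_trace_save_tsk(p, entries, ARRAY_SIZE(entries), 0);
            stack_trace_print(entries, n, 0);
    #else
            pr_info("stack trace unavailable: kernel built without CONFIG_STACKTRACE\n");
    #endif
    }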
2024-07-31  scx: Allow calling sleepable kfuncs from BPF_PROG_TYPE_SYSCALL  (David Vernet; 1 file, -16/+11)
We currently only allow calling sleepable scx kfuncs (i.e. scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here was that we'd never have to call scx_bpf_create_dsq() outside of a sched_ext struct_ops callback, but that might not actually be true. For example, a scheduler could do something like the following:
1. Open and load (not yet attach) a scheduler skel
2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space. For example, to initialize an LLC domain, or some other global, read-only state.
3. Attach the skel, which actually enables the scheduler
The advantage of doing this is that it can preclude having to do pretty ugly boilerplate like initializing a read-only, statically sized array of u64[]'s which the kernel consumes literally once at init time to then create struct bpf_cpumask objects which are actually queried at runtime. Doing the above is already possible given that we can invoke core BPF kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs (e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well. Signed-off-by: David Vernet <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
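A hedged sketch of the kind of setup program this enables; the kfunc declaration mirrors how the scx headers commonly expose it, and the DSQ id is arbitrary:

    /* BPF side: a syscall program run once from user space before attach. */
    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* Declaration normally provided by the scx common headers. */
    s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;

    SEC("syscall")
    int setup_dsqs(void *ctx)
    {
            /* Sleepable kfunc, now callable from BPF_PROG_TYPE_SYSCALL too. */
            return scx_bpf_create_dsq(/*dsq_id=*/100, /*node=*/-1);
    }

    char _license[] SEC("license") = "GPL";

User space would run this once (e.g. via bpf_prog_test_run_opts() on the loaded skeleton) between load and attach.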
2024-07-31  cgroup: Show # of subsystem CSSes in cgroup.stat  (Waiman Long; 1 file, -2/+53)
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to help manage different structures in various cgroup subsystems by being an embedded element inside a larger structure like cpuset or mem_cgroup. The /proc/cgroups file shows the number of cgroups for each of the subsystems. With cgroup v1, the number of CSSes is the same as the number of cgroups. That is not the case anymore with cgroup v2. The /proc/cgroups file cannot show the actual number of CSSes for the subsystems that are bound to cgroup v2. So if a v2 cgroup subsystem is leaking cgroups (usually the memory cgroup), we can't tell by looking at /proc/cgroups which cgroup subsystems may be responsible. As cgroup v2 has deprecated the use of /proc/cgroups, the hierarchical cgroup.stat file is now being extended to show the number of live and dying CSSes associated with all the non-inhibited cgroup subsystems that have been bound to cgroup v2. The number includes CSSes in the current cgroup as well as in all the descendants underneath it. This will help us pinpoint which subsystems are responsible for the increasing number of dying (nr_dying_descendants) cgroups. The CSS dying counts are stored in the cgroup structure itself instead of inside the CSS, as suggested by Johannes. This will allow us to accurately track dying counts of cgroup subsystems that have recently been disabled in a cgroup. It is now possible that a zero subsystem number is coupled with a non-zero dying subsystem number. The cgroup-v2.rst file is updated to discuss this new behavior. With this patch applied, a sample output from the root cgroup.stat file is shown below.

  nr_descendants 56
  nr_subsys_cpuset 1
  nr_subsys_cpu 43
  nr_subsys_io 43
  nr_subsys_memory 56
  nr_subsys_perf_event 57
  nr_subsys_hugetlb 1
  nr_subsys_pids 56
  nr_subsys_rdma 1
  nr_subsys_misc 1
  nr_dying_descendants 30
  nr_dying_subsys_cpuset 0
  nr_dying_subsys_cpu 0
  nr_dying_subsys_io 0
  nr_dying_subsys_memory 30
  nr_dying_subsys_perf_event 0
  nr_dying_subsys_hugetlb 0
  nr_dying_subsys_pids 0
  nr_dying_subsys_rdma 0
  nr_dying_subsys_misc 0

Another sample output, from system.slice/cgroup.stat, is:

  nr_descendants 34
  nr_subsys_cpuset 0
  nr_subsys_cpu 32
  nr_subsys_io 32
  nr_subsys_memory 34
  nr_subsys_perf_event 35
  nr_subsys_hugetlb 0
  nr_subsys_pids 34
  nr_subsys_rdma 0
  nr_subsys_misc 0
  nr_dying_descendants 30
  nr_dying_subsys_cpuset 0
  nr_dying_subsys_cpu 0
  nr_dying_subsys_io 0
  nr_dying_subsys_memory 30
  nr_dying_subsys_perf_event 0
  nr_dying_subsys_hugetlb 0
  nr_dying_subsys_pids 0
  nr_dying_subsys_rdma 0
  nr_dying_subsys_misc 0

Note that the 'debug' controller wasn't used to provide this information because that controller is not recommended in production kernels, and many of them won't enable CONFIG_CGROUP_DEBUG by default. Similar information could be retrieved with debuggers like drgn, but that's also not always available (e.g. lockdown) and the additional cost of runtime tracking here is deemed marginal. tj: Added Michal's paragraphs on why this is not added to the debug controller to the commit message. Signed-off-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Acked-by: Roman Gushchin <[email protected]> Reviewed-by: Kamalesh Babulal <[email protected]> Cc: Michal Koutný <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Tejun Heo <[email protected]>
2024-07-31  jump_label: Fix the fix, brown paper bags galore  (Peter Zijlstra; 1 file, -2/+2)
Per the example of:

  !atomic_cmpxchg(&key->enabled, 0, 1)

the inverse was written as:

  atomic_cmpxchg(&key->enabled, 1, 0)

except of course, that while !old is only true for old == 0, old is true for everything except old == 0. Fix it to read:

  atomic_cmpxchg(&key->enabled, 1, 0) == 1

such that only the 1->0 transition returns true and goes on to disable the keys. Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()") Reported-by: Darrick J. Wong <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Darrick J. Wong <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-07-31  tick/broadcast: Move per CPU pointer access into the atomic section  (Thomas Gleixner; 1 file, -1/+2)
The recent fix for making the takeover of the broadcast timer more reliable retrieves a per-CPU pointer in preemptible context. This went unnoticed as compilers hoist the access into the non-preemptible region where the pointer is actually used. But of course it's valid that the compiler keeps it at the place where the code puts it, which rightfully triggers:

  BUG: using smp_processor_id() in preemptible [00000000] code:
  caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0

Move it to the actual usage site, which is in a non-preemptible region. Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable") Reported-by: David Wang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Yu Liao <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
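The general shape of the bug and the fix, as an illustration only (the lock, per-CPU variable, and helper names are made up, not the actual tick-broadcast code):

    static DEFINE_RAW_SPINLOCK(demo_lock);
    static DEFINE_PER_CPU(struct clock_event_device *, demo_dev);

    static void broken(void)
    {
            /* Per-CPU access while preemptible: the task may migrate before
             * the lock is taken, which triggers the smp_processor_id() BUG.
             */
            struct clock_event_device **dev = this_cpu_ptr(&demo_dev);
            unsigned long flags;

            raw_spin_lock_irqsave(&demo_lock, flags);
            /* ... use *dev ... */
            raw_spin_unlock_irqrestore(&demo_lock, flags);
    }

    static void fixed(void)
    {
            struct clock_event_device **dev;
            unsigned long flags;

            raw_spin_lock_irqsave(&demo_lock, flags);   /* preemption now disabled */
            dev = this_cpu_ptr(&demo_dev);              /* pinned to one CPU */
            /* ... use *dev ... */
            raw_spin_unlock_irqrestore(&demo_lock, flags);
    }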
2024-07-30  cpuset: use Union-Find to optimize the merging of cpumasks  (Xavier; 1 file, -70/+44)
The process of constructing scheduling domains involves multiple loops and repeated evaluations, leading to numerous redundant and ineffective assessments that impact code efficiency. Here, we use union-find to optimize the merging of cpumasks. By employing path compression and union by rank, we effectively reduce the number of lookups and merge comparisons. Signed-off-by: Xavier <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
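For reference, union-find with path compression and union by rank looks roughly like this (generic C, not the cpuset code itself):

    /* parent[] and rank[] are per-element arrays; parent[i] == i marks a root. */
    static int uf_find(int *parent, int x)
    {
            while (parent[x] != x) {
                    parent[x] = parent[parent[x]];  /* path compression (halving) */
                    x = parent[x];
            }
            return x;
    }

    static void uf_union(int *parent, int *rank, int a, int b)
    {
            int ra = uf_find(parent, a), rb = uf_find(parent, b);

            if (ra == rb)
                    return;
            if (rank[ra] < rank[rb]) {              /* union by rank: attach shorter tree */
                    int tmp = ra; ra = rb; rb = tmp;
            }
            parent[rb] = ra;
            if (rank[ra] == rank[rb])
                    rank[ra]++;
    }

Each cpumask that must share a scheduling domain is united with its peers, so deciding whether two masks already belong to the same domain becomes two near-constant-time find operations instead of repeated pairwise scans.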
2024-07-30  cgroup/cpuset: add decrease attach_in_progress helpers  (Chen Ridong; 1 file, -15/+24)
Several functions decrease attach_in_progress and wake up cpuset_attach_wq when attach_in_progress reaches zero. Add a helper to make this concise. Signed-off-by: Chen Ridong <[email protected]> Reviewed-by: Kamalesh Babulal <[email protected]> Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
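The helper presumably looks something like this; a sketch based only on the description above, so the exact name and locking in cpuset.c may differ:

    static void dec_attach_in_progress_locked(struct cpuset *cs)
    {
            lockdep_assert_held(&cpuset_mutex);

            cs->attach_in_progress--;
            if (!cs->attach_in_progress)
                    wake_up(&cpuset_attach_wq);  /* last in-flight attach finished */
    }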
2024-07-30  cgroup/cpuset: Remove cpuset_slab_spread_rotor  (Xiu Jianfeng; 2 files, -14/+0)
Since the SLAB implementation was removed in v6.8, cpuset_slab_spread_rotor is no longer used and can be removed. Signed-off-by: Xiu Jianfeng <[email protected]> Reviewed-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-30  cgroup/pids: Avoid spurious event notification  (Xiu Jianfeng; 1 file, -11/+7)
Currently, when a process in a group forks and fails due to its parent's max restriction, all the cgroups from 'pids_forking' to the root generate event notifications, but only the cgroups from 'pids_over_limit' to the root increase the PIDCG_MAX counter. Consider this scenario: there are four groups A, B, C, and D with the relationship root->A->B->C->D, and a user is watching C.pids.events. When a process in D forks and fails due to B's max restriction, the user gets a spurious event notification: on waking up and reading C.pids.events, they find that its content has not changed. To address this issue, only the cgroups from 'pids_over_limit' to the root now have their PIDCG_MAX counters increased and event notifications generated. Fixes: 385a635cacfe ("cgroup/pids: Make event counters hierarchical") Signed-off-by: Xiu Jianfeng <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-30  cgroup/cpuset: remove child_ecpus_count  (Chen Ridong; 1 file, -19/+3)
The child_ecpus_count variable was previously used to update the sibling cpumask when the parent's effective_cpus was updated. However, it became obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") and should be removed. tj: Restored {} for style consistency. Signed-off-by: Chen Ridong <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-07-30  Merge tag 'v6.11-rc1' into for-6.12  (Tejun Heo; 147 files, -2464/+5000)
Linux 6.11-rc1
2024-07-29  profiling: remove stale percpu flip buffer variables  (Linus Torvalds; 1 file, -6/+0)
For some reason I didn't see this issue on my arm64 or x86-64 builds, but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer") left these static variables around, and the powerpc build is unhappy about them:

  kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
     52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
        |                            ^~~~~~~~~~~~~~~~
  ..

So remove these stale left-over remnants too. Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-07-29  bpf, x86, riscv, arm: no_caller_saved_registers for bpf_get_smp_processor_id()  (Eduard Zingerman; 2 files, -2/+10)
The function bpf_get_smp_processor_id() is processed in a different way, depending on the arch:
- on x86 the verifier replaces the call to bpf_get_smp_processor_id() with a sequence of instructions that modify only r0;
- on riscv64 the JIT replaces the call to bpf_get_smp_processor_id() with a sequence of instructions that modify only r0;
- on arm64 the JIT replaces the call to bpf_get_smp_processor_id() with a sequence of instructions that modify only r0 and tmp registers.
These rewrites satisfy the no_caller_saved_registers attribute contract. Allow rewrite of no_caller_saved_registers patterns for bpf_get_smp_processor_id() in order to use this function as a canary for no_caller_saved_registers tests. Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf: no_caller_saved_registers attribute for helper calls  (Eduard Zingerman; 1 file, -3/+314)
GCC and LLVM define a no_caller_saved_registers function attribute. This attribute means that function scratches only some of the caller saved registers defined by ABI. For BPF the set of such registers could be defined as follows: - R0 is scratched only if function is non-void; - R1-R5 are scratched only if corresponding parameter type is defined in the function prototype. This commit introduces flag bpf_func_prot->allow_nocsr. If this flag is set for some helper function, verifier assumes that it follows no_caller_saved_registers calling convention. The contract between kernel and clang allows to simultaneously use such functions and maintain backwards compatibility with old kernels that don't understand no_caller_saved_registers calls (nocsr for short): - clang generates a simple pattern for nocsr calls, e.g.: r1 = 1; r2 = 2; *(u64 *)(r10 - 8) = r1; *(u64 *)(r10 - 16) = r2; call %[to_be_inlined] r2 = *(u64 *)(r10 - 16); r1 = *(u64 *)(r10 - 8); r0 = r1; r0 += r2; exit; - kernel removes unnecessary spills and fills, if called function is inlined by verifier or current JIT (with assumption that patch inserted by verifier or JIT honors nocsr contract, e.g. does not scratch r3-r5 for the example above), e.g. the code above would be transformed to: r1 = 1; r2 = 2; call %[to_be_inlined] r0 = r1; r0 += r2; exit; Technically, the transformation is split into the following phases: - function mark_nocsr_patterns(), called from bpf_check() searches and marks potential patterns in instruction auxiliary data; - upon stack read or write access, function check_nocsr_stack_contract() is used to verify if stack offsets, presumably reserved for nocsr patterns, are used only from those patterns; - function remove_nocsr_spills_fills(), called from bpf_check(), applies the rewrite for valid patterns. See comment in mark_nocsr_pattern_for_call() for more details. Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf: add a get_helper_proto() utility function  (Eduard Zingerman; 1 file, -7/+18)
Extract part of check_helper_call() into a utility function that allows querying 'struct bpf_func_proto' for a specific helper function id. Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
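Presumably the extracted helper is along these lines; a sketch that relies on the verifier's existing env->ops->get_func_proto() hook, with the exact signature and error codes in verifier.c possibly differing:

    /* Look up the bpf_func_proto for a helper id, or fail verification. */
    static int get_helper_proto(struct bpf_verifier_env *env, int func_id,
                                const struct bpf_func_proto **ptr)
    {
            if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID)
                    return -ERANGE;

            if (!env->ops->get_func_proto)
                    return -EINVAL;

            *ptr = env->ops->get_func_proto(func_id, env->prog);
            return *ptr ? 0 : -EINVAL;
    }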
2024-07-29  bpf: Get better reg range with ldsx and 32bit compare  (Yonghong Song; 1 file, -0/+38)
With latest llvm19, the selftest iters/iter_arr_with_actual_elem_count failed with -mcpu=v4. The following are the details: 0: R1=ctx() R10=fp0 ; int iter_arr_with_actual_elem_count(const void *ctx) @ iters.c:1420 0: (b4) w7 = 0 ; R7_w=0 ; int i, n = loop_data.n, sum = 0; @ iters.c:1422 1: (18) r1 = 0xffffc90000191478 ; R1_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144) 3: (61) r6 = *(u32 *)(r1 +128) ; R1_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) ; if (n > ARRAY_SIZE(loop_data.data)) @ iters.c:1424 4: (26) if w6 > 0x20 goto pc+27 ; R6_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) 5: (bf) r8 = r10 ; R8_w=fp0 R10=fp0 6: (07) r8 += -8 ; R8_w=fp-8 ; bpf_for(i, 0, n) { @ iters.c:1427 7: (bf) r1 = r8 ; R1_w=fp-8 R8_w=fp-8 8: (b4) w2 = 0 ; R2_w=0 9: (bc) w3 = w6 ; R3_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R6_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) 10: (85) call bpf_iter_num_new#45179 ; R0=scalar() fp-8=iter_num(ref_id=2,state=active,depth=0) refs=2 11: (bf) r1 = r8 ; R1=fp-8 R8=fp-8 refs=2 12: (85) call bpf_iter_num_next#45181 13: R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R6=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=0 R8=fp-8 R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=1) refs=2 ; bpf_for(i, 0, n) { @ iters.c:1427 13: (15) if r0 == 0x0 goto pc+2 ; R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) refs=2 14: (81) r1 = *(s32 *)(r0 +0) ; R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R1_w=scalar(smin=0xffffffff80000000,smax=0x7fffffff) refs=2 15: (ae) if w1 < w6 goto pc+4 20: R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R1=scalar(smin=0xffffffff80000000,smax=smax32=umax32=31,umax=0xffffffff0000001f,smin32=0,var_off=(0x0; 0xffffffff0000001f)) R6=scalar(id=1,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=0 R8=fp-8 R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=1) refs=2 ; sum += loop_data.data[i]; @ iters.c:1429 20: (67) r1 <<= 2 ; R1_w=scalar(smax=0x7ffffffc0000007c,umax=0xfffffffc0000007c,smin32=0,smax32=umax32=124,var_off=(0x0; 0xfffffffc0000007c)) refs=2 21: (18) r2 = 0xffffc90000191478 ; R2_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144) refs=2 23: (0f) r2 += r1 math between map_value pointer and register with unbounded min value is not allowed The source code: int iter_arr_with_actual_elem_count(const void *ctx) { int i, n = loop_data.n, sum = 0; if (n > ARRAY_SIZE(loop_data.data)) return 0; bpf_for(i, 0, n) { /* no rechecking of i against ARRAY_SIZE(loop_data.n) */ sum += loop_data.data[i]; } return sum; } The insn #14 is a sign-extenstion load which is related to 'int i'. The insn #15 did a subreg comparision. Note that smin=0xffffffff80000000 and this caused later insn #23 failed verification due to unbounded min value. Actually insn #15 R1 smin range can be better. Before insn #15, we have R1_w=scalar(smin=0xffffffff80000000,smax=0x7fffffff) With the above range, we know for R1, upper 32bit can only be 0xffffffff or 0. Otherwise, the value range for R1 could be beyond [smin=0xffffffff80000000,smax=0x7fffffff]. After insn #15, for the true patch, we know smin32=0 and smax32=32. With the upper 32bit 0xffffffff, then the corresponding value is [0xffffffff00000000, 0xffffffff00000020]. The range is obviously beyond the original range [smin=0xffffffff80000000,smax=0x7fffffff] and the range is not possible. So the upper 32bit must be 0, which implies smin = smin32 and smax = smax32. 
This patch fixes the issue by adding additional register deduction after a 32-bit compare insn: if the signed 32-bit register range is non-negative and the 64-bit smin is in the range [S32_MIN, S32_MAX], then the actual 64-bit smin/smax should be the same as the 32-bit smin32/smax32. With this patch, iters/iter_arr_with_actual_elem_count succeeds with a better register range: from 15 to 20: R0=rdonly_mem(id=7,ref_obj_id=2,sz=4) R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R6=scalar(id=1,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R8=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=3) refs=2 Acked-by: Eduard Zingerman <[email protected]> Acked-by: Shung-Hsi Yu <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf: Fail verification for sign-extension of packet data/data_end/data_meta  (Yonghong Song; 1 file, -2/+3)
syzbot reported a kernel crash due to commit 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses"). The reason is sign-extension of a 32-bit load of the packet data/data_end/data_meta uapi fields. The original code looks like:

  r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
  r3 = *(u32 *)(r1 + 80) /* load __sk_buff->data_end */
  r0 = r2
  r0 += 8
  if r3 > r0 goto +1
  ...

Note that the __sk_buff->data load has 32-bit sign extension. After verification and convert_ctx_accesses(), the final asm code looks like:

  r2 = *(u64 *)(r1 +208)
  r2 = (s32)r2
  r3 = *(u64 *)(r1 +80)
  r0 = r2
  r0 += 8
  if r3 > r0 goto pc+1
  ...

Note that 'r2 = (s32)r2' may make the kernel __sk_buff->data address invalid, which may cause a runtime failure. Currently, in C code, we typically have

  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;
  ...

and it will generate

  r2 = *(u64 *)(r1 +208)
  r3 = *(u64 *)(r1 +80)
  r0 = r2
  r0 += 8
  if r3 > r0 goto pc+1

If we allow sign-extension,

  void *data = (void *)(long)(int)skb->data;
  void *data_end = (void *)(long)skb->data_end;
  ...

the generated code looks like

  r2 = *(u64 *)(r1 +208)
  r2 <<= 32
  r2 s>>= 32
  r3 = *(u64 *)(r1 +80)
  r0 = r2
  r0 += 8
  if r3 > r0 goto pc+1

and this will cause a verification failure since "r2 <<= 32" is not allowed because "r2" is a packet pointer. To fix this issue for the case

  r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */

this patch adds an additional check in the is_valid_access() callback function for packet data/data_end/data_meta access. If those accesses are done with sign-extension, the verification will fail. [1] https://lore.kernel.org/bpf/[email protected]/ Reported-by: [email protected] Fixes: 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses") Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf: Fix compare error in function retval_range_within  (Xu Kuohai; 1 file, -5/+11)
After checking lsm hook return range in verifier, the test case "test_progs -t test_lsm" failed, and the failure log says: libbpf: prog 'test_int_hook': BPF program load failed: Invalid argument libbpf: prog 'test_int_hook': -- BEGIN PROG LOAD LOG -- 0: R1=ctx() R10=fp0 ; int BPF_PROG(test_int_hook, struct vm_area_struct *vma, @ lsm.c:89 0: (79) r0 = *(u64 *)(r1 +24) ; R0_w=scalar(smin=smin32=-4095,smax=smax32=0) R1=ctx() [...] 24: (b4) w0 = -1 ; R0_w=0xffffffff ; int BPF_PROG(test_int_hook, struct vm_area_struct *vma, @ lsm.c:89 25: (95) exit At program exit the register R0 has smin=4294967295 smax=4294967295 should have been in [-4095, 0] It can be seen that instruction "w0 = -1" zero extended -1 to 64-bit register r0, setting both smin and smax values of r0 to 4294967295. This resulted in a false reject when r0 was checked with range [-4095, 0]. Given bpf lsm does not return 64-bit values, this patch fixes it by changing the compare between r0 and return range from 64-bit operation to 32-bit operation for bpf lsm. Fixes: 8fa4ecd49b81 ("bpf: enforce exact retval range on subprog/callback exit") Signed-off-by: Xu Kuohai <[email protected]> Acked-by: Shung-Hsi Yu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf: Prevent tail call between progs attached to different hooks  (Xu Kuohai; 1 file, -3/+18)
bpf progs can be attached to kernel functions, and the attached functions can take different parameters or return different return values. If prog attached to one kernel function tail calls prog attached to another kernel function, the ctx access or return value verification could be bypassed. For example, if prog1 is attached to func1 which takes only 1 parameter and prog2 is attached to func2 which takes two parameters. Since verifier assumes the bpf ctx passed to prog2 is constructed based on func2's prototype, verifier allows prog2 to access the second parameter from the bpf ctx passed to it. The problem is that verifier does not prevent prog1 from passing its bpf ctx to prog2 via tail call. In this case, the bpf ctx passed to prog2 is constructed from func1 instead of func2, that is, the assumption for ctx access verification is bypassed. Another example, if BPF LSM prog1 is attached to hook file_alloc_security, and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier knows the return value rules for these two hooks, e.g. it is legal for bpf_lsm_audit_rule_known to return positive number 1, and it is illegal for file_alloc_security to return positive number. So verifier allows prog2 to return positive number 1, but does not allow prog1 to return positive number. The problem is that verifier does not prevent prog1 from calling prog2 via tail call. In this case, prog2's return value 1 will be used as the return value for prog1's hook file_alloc_security. That is, the return value rule is bypassed. This patch adds restriction for tail call to prevent such bypasses. Signed-off-by: Xu Kuohai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf, lsm: Add check for BPF LSM return value  (Xu Kuohai; 3 files, -11/+88)
A bpf prog returning a positive number attached to file_alloc_security hook makes kernel panic. This happens because file system can not filter out the positive number returned by the LSM prog using IS_ERR, and misinterprets this positive number as a file pointer. Given that hook file_alloc_security never returned positive number before the introduction of BPF LSM, and other BPF LSM hooks may encounter similar issues, this patch adds LSM return value check in verifier, to ensure no unexpected value is returned. Fixes: 520b7aa00d8c ("bpf: lsm: Initialize the BPF LSM hooks") Reported-by: Xin Liu <[email protected]> Signed-off-by: Xu Kuohai <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  bpf, lsm: Add disabled BPF LSM hook list  (Xu Kuohai; 1 file, -2/+29)
Add a disabled hooks list for BPF LSM. progs being attached to the listed hooks will be rejected by the verifier. Suggested-by: KP Singh <[email protected]> Signed-off-by: Xu Kuohai <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]>
2024-07-29  signal: Replace BUG_ON()s  (Thomas Gleixner; 1 file, -4/+7)
These really can be handled gracefully without killing the machine. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
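The general shape of such a conversion, as an illustration only (the concrete conditions being checked live in kernel/signal.c; 'q' here is a made-up sigqueue pointer):

    /* Before: any unexpected state takes the whole machine down. */
    BUG_ON(!list_empty(&q->list));

    /* After: complain loudly once, then recover gracefully. */
    if (WARN_ON_ONCE(!list_empty(&q->list)))
            return;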
2024-07-29  signal: Remove task argument from dequeue_signal()  (Thomas Gleixner; 1 file, -13/+10)
The task pointer which is handed to dequeue_signal() is always current. The argument along with the first comment about signalfd in that function is confusing at best. Remove it and use current internally. Update the stale comment for dequeue_signal() while at it. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-timers: Consolidate signal queueing  (Thomas Gleixner; 4 files, -19/+15)
Rename posix_timer_event() to posix_timer_queue_signal() as this is what the function is about. Consolidate the requeue pending and deactivation updates into that function as there is no point in doing this in all incarnations of posix timers. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Make k_itimer::it_active consistent  (Thomas Gleixner; 1 file, -0/+4)
Posix CPU timers are not updating k_itimer::it_active which makes it impossible to base decisions in the common posix timer code on it. Update it when queueing or dequeueing posix CPU timers. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-timers: Consolidate timer setup  (Thomas Gleixner; 3 files, -20/+21)
hrtimer based and CPU timers have their own way to install the new interval and to reset overrun and signal handling related data. Create a helper function and do the same operation for all variants. This also makes the handling of the interval consistent. It's only stored when the timer is actually armed, i.e. timer->it_value != 0. Before that it was stored unconditionally for posix CPU timers and conditionally for the other posix timers. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-timers: Convert timer list to hlist  (Thomas Gleixner; 2 files, -12/+9)
No requirement for a real list. Spare a few bytes. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
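The usual shape of such a conversion, shown generically (the struct and function names below are illustrative, not the posix-timer code): a two-pointer list_head at the head becomes a single-pointer hlist_head, saving one pointer per head while keeping O(1) add and delete.

    #include <linux/list.h>

    struct timer_bucket {
            struct hlist_head head;         /* was: struct list_head (two pointers) */
    };

    struct demo_timer {
            struct hlist_node list;         /* was: struct list_head */
    };

    static void demo_timer_add(struct timer_bucket *b, struct demo_timer *t)
    {
            hlist_add_head(&t->list, &b->head);     /* was: list_add() */
    }

    static void demo_timer_del(struct demo_timer *t)
    {
            hlist_del_init(&t->list);               /* was: list_del_init() */
    }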
2024-07-29  posix-timers: Clear overrun in common_timer_set()  (Thomas Gleixner; 1 file, -0/+1)
Keeping the overrun count of the previous setup around is just wrong. The new setting has nothing to do with the previous one and has to start from a clean slate. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-timers: Retrieve interval in common timer_settime() code  (Thomas Gleixner; 2 files, -9/+6)
No point in doing this all over the place. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Simplify posix_cpu_timer_set()  (Thomas Gleixner; 1 file, -27/+17)
Avoid the late sighand lock/unlock dance when a timer is not armed to enforce reevaluation of the timer base so that the process wide CPU timer sampling can be disabled. Do it right at the point where the arming decision is made which already has sighand locked. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Remove incorrect comment in posix_cpu_timer_set()  (Thomas Gleixner; 1 file, -6/+1)
A leftover from historical code which describes fiction. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Use @now instead of @val for clarity  (Thomas Gleixner; 1 file, -13/+9)
posix_cpu_timer_set() uses @val as variable for the current time. That's confusing at best. Use @now as anywhere else and rewrite the confusing comment about clock sampling. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Do not arm SIGEV_NONE timers  (Thomas Gleixner; 1 file, -16/+13)
There is no point in arming SIGEV_NONE timers as they never deliver a signal. timer_gettime() is handling the expiry time correctly and that's all SIGEV_NONE timers care about. Prevent arming them and remove the expiry handler code which just disarms them. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Replace old expiry retrieval in posix_cpu_timer_set()  (Thomas Gleixner; 1 file, -31/+6)
Reuse the split out __posix_cpu_timer_get() function which does already the right thing. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_set()  (Thomas Gleixner; 1 file, -1/+11)
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in timer_get(), but the posix CPU timer implementation returns 1 nsec. Add the missing conditional. This will be cleaned up in a follow up patch. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Handle SIGEV_NONE timers correctly in timer_get()  (Thomas Gleixner; 1 file, -5/+9)
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in timer_get(), but the posix CPU timer implementation returns 1 nsec. Add the missing conditional. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Handle interval timers correctly in timer_get()  (Thomas Gleixner; 1 file, -1/+17)
timer_gettime() must return the remaining time to the next expiry of a timer or 0 if the timer is not armed and no signal pending, but posix CPU timers fail to forward a timer which is already expired. Add the required logic to address that. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2024-07-29  posix-cpu-timers: Save interval only for armed timers  (Thomas Gleixner; 1 file, -8/+6)
There is no point to return the interval for timers which have been disarmed. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>