path: root/kernel
Age | Commit message | Author | Files | Lines
2023-07-28 | tracing: Fix kernel-doc warnings in trace_seq.c | Gaosheng Cui | 1 | -0/+1
Fix kernel-doc warning: kernel/trace/trace_seq.c:142: warning: Function parameter or member 'args' not described in 'trace_seq_vprintf' Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Gaosheng Cui <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
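For reference, a fix for this class of warning is simply the missing parameter line in the function's kernel-doc block; a minimal sketch (the description text here is illustrative, the '@args' line is the point):

    /**
     * trace_seq_vprintf - sequence printing of trace information
     * @s: trace sequence descriptor
     * @fmt: printf format string
     * @args: va_list holding the arguments for @fmt
     */
    void trace_seq_vprintf(struct trace_seq *s, const char *fmt, va_list args);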
2023-07-28 | tracing: Fix kernel-doc warnings in trace_events_trigger.c | Gaosheng Cui | 1 | -0/+2
Fix kernel-doc warnings: kernel/trace/trace_events_trigger.c:59: warning: Function parameter or member 'buffer' not described in 'event_triggers_call' kernel/trace/trace_events_trigger.c:59: warning: Function parameter or member 'event' not described in 'event_triggers_call' Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Gaosheng Cui <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-28 | tracing/synthetic: Fix kernel-doc warnings in trace_events_synth.c | Gaosheng Cui | 1 | -0/+1
Fix kernel-doc warning: kernel/trace/trace_events_synth.c:1257: warning: Function parameter or member 'mod' not described in 'synth_event_gen_cmd_array_start' Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Gaosheng Cui <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-28 | ring-buffer: Fix kernel-doc warnings in ring_buffer.c | Gaosheng Cui | 1 | -2/+1
Fix kernel-doc warnings: kernel/trace/ring_buffer.c:954: warning: Function parameter or member 'cpu' not described in 'ring_buffer_wake_waiters' kernel/trace/ring_buffer.c:3383: warning: Excess function parameter 'event' description in 'ring_buffer_unlock_commit' kernel/trace/ring_buffer.c:5359: warning: Excess function parameter 'cpu' description in 'ring_buffer_reset_online_cpus' Link: https://lkml.kernel.org/r/[email protected] Cc: <[email protected]> Signed-off-by: Gaosheng Cui <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-28 | ring-buffer: Fix wrong stat of cpu_buffer->read | Zheng Yejian | 1 | -10/+12
When pages are removed in rb_remove_pages(), 'cpu_buffer->read' is set to 0 in order to make sure any read iterators reset themselves. However, this messes up the 'entries' stats, as the following steps show:
  # cd /sys/kernel/tracing/
  # 1. Enlarge the ring buffer to prepare for reducing it later:
  # echo 20 > per_cpu/cpu0/buffer_size_kb
  # 2. Write a log into the ring buffer of cpu0:
  # taskset -c 0 echo "hello1" > trace_marker
  # 3. Read the log:
  # cat per_cpu/cpu0/trace_pipe
       <...>-332     [000] .....    62.406844: tracing_mark_write: hello1
  # 4. Stop reading and check the stats: now 0 entries, and 1 event read:
  # cat per_cpu/cpu0/stats
   entries: 0
   [...]
   read events: 1
  # 5. Reduce the ring buffer:
  # echo 7 > per_cpu/cpu0/buffer_size_kb
  # 6. 'entries' unexpectedly becomes 1 even though there are actually no entries:
  # cat per_cpu/cpu0/stats
   entries: 1
   [...]
   read events: 0
To fix it, introduce a 'page_removed' field to count the total number of removed pages since the last reset, then use it to let read iterators reset themselves instead of changing the 'read' pointer. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: <[email protected]> Cc: <[email protected]> Fixes: 83f40318dab0 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-28 | bpf: Non-atomically allocate freelist during prefill | YiFei Zhu | 1 | -4/+8
In internal testing of test_maps, we sometimes observed failures like: test_maps: test_maps.c:173: void test_hashmap_percpu(unsigned int, void *): Assertion `bpf_map_update_elem(fd, &key, value, BPF_ANY) == 0' failed. where the errno is ENOMEM. After some troubleshooting and enabling the warnings, we saw: [ 91.304708] percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left [ 91.304716] CPU: 51 PID: 24145 Comm: test_maps Kdump: loaded Tainted: G N 6.1.38-smp-DEV #7 [ 91.304719] Hardware name: Google Astoria/astoria, BIOS 0.20230627.0-0 06/27/2023 [ 91.304721] Call Trace: [ 91.304724] <TASK> [ 91.304730] [<ffffffffa7ef83b9>] dump_stack_lvl+0x59/0x88 [ 91.304737] [<ffffffffa7ef83f8>] dump_stack+0x10/0x18 [ 91.304738] [<ffffffffa75caa0c>] pcpu_alloc+0x6fc/0x870 [ 91.304741] [<ffffffffa75ca302>] __alloc_percpu_gfp+0x12/0x20 [ 91.304743] [<ffffffffa756785e>] alloc_bulk+0xde/0x1e0 [ 91.304746] [<ffffffffa7566c02>] bpf_mem_alloc_init+0xd2/0x2f0 [ 91.304747] [<ffffffffa7547c69>] htab_map_alloc+0x479/0x650 [ 91.304750] [<ffffffffa751d6e0>] map_create+0x140/0x2e0 [ 91.304752] [<ffffffffa751d413>] __sys_bpf+0x5a3/0x6c0 [ 91.304753] [<ffffffffa751c3ec>] __x64_sys_bpf+0x1c/0x30 [ 91.304754] [<ffffffffa7ef847a>] do_syscall_64+0x5a/0x80 [ 91.304756] [<ffffffffa800009b>] entry_SYSCALL_64_after_hwframe+0x63/0xcd This makes sense, because in atomic context, percpu allocation would not create new chunks; it would only create in non-atomic contexts. And if during prefill all precpu chunks are full, -ENOMEM would happen immediately upon next unit_alloc. Prefill phase does not actually run in atomic context, so we can use this fact to allocate non-atomically with GFP_KERNEL instead of GFP_NOWAIT. This avoids the immediate -ENOMEM. GFP_NOWAIT has to be used in unit_alloc when bpf program runs in atomic context. Even if bpf program runs in non-atomic context, in most cases, rcu read lock is enabled for the program so GFP_NOWAIT is still needed. This is often also the case for BPF_MAP_UPDATE_ELEM syscalls. Signed-off-by: YiFei Zhu <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
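A hedged sketch of the idea (simplified, not the actual kernel/bpf/memalloc.c code): let the allocation path take its GFP flags from the caller, so that prefill, which runs in process context, can pass GFP_KERNEL and allow the percpu allocator to create new chunks, while runtime refill keeps GFP_NOWAIT:

    /* illustrative helper; 'atomic' would be derived from the call site */
    static void __percpu *alloc_one_unit(size_t size, bool atomic)
    {
            gfp_t gfp = atomic ? (GFP_NOWAIT | __GFP_NOWARN) : GFP_KERNEL;

            /* GFP_KERNEL may sleep and create new percpu chunks instead of
             * returning NULL when the existing chunks are full.
             */
            return __alloc_percpu_gfp(size, 8, gfp);
    }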
2023-07-28 | bpf: Fix compilation warning with -Wparentheses | Yonghong Song | 2 | -4/+4
The kernel test robot reported compilation warnings when -Wparentheses is added to KBUILD_CFLAGS with gcc compiler. The following is the error message:

  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_reg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5901:14: error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s64_max >= 0 == s64_min >= 0) {
        ~~~~~~~~^~~~
  .../bpf-next/kernel/bpf/verifier.c: In function ‘coerce_subreg_to_size_sx’:
  .../bpf-next/kernel/bpf/verifier.c:5965:14: error: suggest parentheses around comparison in operand of ‘==’ [-Werror=parentheses]
    if (s32_min >= 0 == s32_max >= 0) {
        ~~~~~~~~^~~~

To fix the issue, add proper parentheses for the above '>=' condition to silence the warning/error. I tried a few clang compilers like clang16 and clang18 and they do not emit such warnings with -Wparentheses. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Yonghong Song <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
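The fix itself is purely syntactic; roughly:

    /* before: gcc cannot tell whether the chained comparison is intentional */
    if (s64_max >= 0 == s64_min >= 0)

    /* after: parentheses make the intended grouping explicit */
    if ((s64_max >= 0) == (s64_min >= 0))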
2023-07-27 | bpf: Add kernel/bpftool asm support for new instructions | Yonghong Song | 1 | -6/+51
Add asm support for new instructions so kernel verifier and bpftool xlated insn dumps can have proper asm syntax for new instructions. Acked-by: Eduard Zingerman <[email protected]> Acked-by: Quentin Monnet <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-27 | bpf: Support new 32bit offset jmp instruction | Yonghong Song | 2 | -13/+38
Add interpreter/jit/verifier support for 32bit offset jmp instruction. If a conditional jmp instruction needs more than 16bit offset, it can be simulated with a conditional jmp + a 32bit jmp insn. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
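As a rough illustration of the simulation mentioned above (pseudo-assembly in a comment; 'gotol' is the mnemonic recent toolchains use for the 32-bit offset jump):

    /*
     * A conditional jump whose target does not fit in the signed 16-bit
     * insn->off field, e.g.
     *
     *     if r1 > r2 goto +70000      ; offset too large for off
     *
     * can be emitted as an inverted short conditional jump over an
     * unconditional jump carrying the 32-bit offset:
     *
     *     if r1 <= r2 goto +1         ; fits in 16 bits, skips the gotol
     *     gotol +70000                ; new 32-bit offset jump insn
     */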
2023-07-27 | bpf: Fix jit blinding with new sdiv/smov insns | Yonghong Song | 1 | -2/+2
Handle new insns properly in bpf_jit_blind_insn() function. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-27 | bpf: Support new signed div/mod instructions. | Yonghong Song | 2 | -18/+98
Add interpreter/jit support for new signed div/mod insns. The new signed div/mod instructions are encoded with unsigned div/mod instructions plus insn->off == 1. Also add basic verifier support to ensure new insns get accepted. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
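Given that encoding, constructing such an instruction is just a matter of setting insn->off on the existing opcode; a hedged sketch:

    /* sketch: signed 64-bit division, dst_reg = dst_reg s/ src_reg */
    struct bpf_insn sdiv = {
            .code    = BPF_ALU64 | BPF_DIV | BPF_X,
            .dst_reg = BPF_REG_1,
            .src_reg = BPF_REG_2,
            .off     = 1,           /* off == 1 selects the signed variant */
            .imm     = 0,
    };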
2023-07-27 | bpf: Support new unconditional bswap instruction | Yonghong Song | 2 | -2/+19
The existing 'be' and 'le' insns do a conditional bswap depending on the host endianness. This patch implements unconditional bswap insns. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-27 | bpf: Handle sign-extenstin ctx member accesses | Yonghong Song | 1 | -0/+6
Currently, if user accesses a ctx member with signed types, the compiler will generate an unsigned load followed by necessary left and right shifts. With the introduction of sign-extension load, compiler may just emit a ldsx insn instead. Let us do a final movsx sign extension to the final unsigned ctx load result to satisfy original sign extension requirement. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-27 | bpf: Support new sign-extension mov insns | Yonghong Song | 2 | -28/+157
Add interpreter/jit support for new sign-extension mov insns. The original 'MOV' insn is extended to support reg-to-reg signed version for both ALU and ALU64 operations. For ALU mode, the insn->off value of 8 or 16 indicates sign-extension from 8- or 16-bit value to 32-bit value. For ALU64 mode, the insn->off value of 8/16/32 indicates sign-extension from 8-, 16- or 32-bit value to 64-bit value. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
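The semantics described above amount to a sign-extending cast; roughly (illustrative C, not the interpreter source):

    /* BPF_ALU64 | BPF_MOV | BPF_X, insn->off selects the source width */
    switch (insn->off) {
    case 8:  dst = (s64)(s8)src;  break;    /* sign-extend from 8 bits  */
    case 16: dst = (s64)(s16)src; break;    /* sign-extend from 16 bits */
    case 32: dst = (s64)(s32)src; break;    /* sign-extend from 32 bits */
    }

    /* BPF_ALU (32-bit) | BPF_MOV | BPF_X: result truncated to 32 bits */
    switch (insn->off) {
    case 8:  dst = (u32)(s32)(s8)src;  break;
    case 16: dst = (u32)(s32)(s16)src; break;
    }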
2023-07-27 | bpf: Support new sign-extension load insns | Yonghong Song | 2 | -24/+137
Add interpreter/jit support for new sign-extension load insns which adds a new mode (BPF_MEMSX). Also add verifier support to recognize these insns and to do proper verification with new insns. In verifier, besides to deduce proper bounds for the dst_reg, probed memory access is also properly handled. Acked-by: Eduard Zingerman <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
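For reference, the new BPF_MEMSX mode sign-extends the loaded value into the full 64-bit destination register; an illustrative sketch of the semantics (not the interpreter source):

    /* BPF_LDX | BPF_MEMSX | BPF_{B,H,W}: dst = sign-extended *(src + off) */
    case BPF_B: regs[dst] = (s64)*(s8  *)(unsigned long)(regs[src] + insn->off); break;
    case BPF_H: regs[dst] = (s64)*(s16 *)(unsigned long)(regs[src] + insn->off); break;
    case BPF_W: regs[dst] = (s64)*(s32 *)(unsigned long)(regs[src] + insn->off); break;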
2023-07-27 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 5 | -7/+21
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-07-26 | mm: suppress mm fault logging if fatal signal already pending | Linus Torvalds | 1 | -0/+4
Commit eda0047296a1 ("mm: make the page fault mmap locking killable") intentionally made it much easier to trigger the "page fault fails because a fatal signal is pending" situation, by having the mmap locking fail early in that case. We have long aborted page faults in other fatal cases when the actual IO for a page is interrupted by SIGKILL - which is particularly useful for the traditional case of NFS hanging due to network issues, but local filesystems could cause it too if you happened to get the SIGKILL while waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()). So aborting the page fault wasn't a new condition - but it now triggers earlier, before we even get to 'handle_mm_fault()'. And as a result the error doesn't go through our 'fault_signal_pending()' logic, and doesn't get filtered away there. Normally you'd never even notice, because if a fatal signal is pending, the new SIGSEGV we send ends up being ignored anyway. But it turns out that there is one very noticeable exception: if you enable 'show_unhandled_signals', the aborted page fault will be logged in the kernel messages, and you'll get a scary line looking something like this in your logs: pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0) which is rather misleading. It's not really a segfault at all, it's just "the thread was killed before the page fault completed, so we aborted the page fault". Fix this by just making it clear that a pending fatal signal means that any new signal coming in after that is implicitly handled. This will avoid the misleading logging, since now the signal isn't 'unhandled' any more. Reported-and-tested-by: Fiona Ebner <[email protected]> Tested-by: Thomas Lamprecht <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Acked-by: Oleg Nesterov <[email protected]> Fixes: eda0047296a1 ("mm: make the page fault mmap locking killable") Signed-off-by: Linus Torvalds <[email protected]>
2023-07-26 | tracing/probes: Fix to add NULL check for BTF APIs | Masami Hiramatsu (Google) | 1 | -4/+4
Since find_btf_func_param() and btf_type_by_id() can return NULL, the caller must check the return value correctly. Link: https://lore.kernel.org/all/169024903951.395371.11361556840733470934.stgit@devnote2/ Fixes: b576e09701c7 ("tracing/probes: Support function parameters if BTF is available") Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
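The shape of the fix is the usual defensive check on the lookup result; a minimal sketch (error code chosen for illustration):

    const struct btf_type *t = btf_type_by_id(btf, id);

    if (!t)         /* the id may not resolve to a valid type */
            return -ENOENT;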
2023-07-25 | bpf: work around -Wuninitialized warning | Arnd Bergmann | 1 | -6/+6
Splitting these out into separate helper functions means that we actually pass an uninitialized variable into another function call if dec_active() happens to not be inlined, and CONFIG_PREEMPT_RT is disabled:

  kernel/bpf/memalloc.c: In function 'add_obj_to_free_list':
  kernel/bpf/memalloc.c:200:9: error: 'flags' is used uninitialized [-Werror=uninitialized]
    200 |         dec_active(c, flags);

Avoid this by passing the flags by reference, so they either get initialized and dereferenced through a pointer, or the pointer never gets accessed at all. Fixes: 18e027b1c7c6d ("bpf: Factor out inc/dec of active flag into helpers.") Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
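A sketch of the by-reference variant described above (simplified; the real helpers also handle the CONFIG_PREEMPT_RT case):

    static void inc_active(struct bpf_mem_cache *c, unsigned long *flags)
    {
            local_irq_save(*flags);         /* always initializes *flags */
    }

    static void dec_active(struct bpf_mem_cache *c, unsigned long *flags)
    {
            local_irq_restore(*flags);      /* only reads what was written */
    }

    /* caller */
    unsigned long flags;

    inc_active(c, &flags);
    /* ... add the object to the free list ... */
    dec_active(c, &flags);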
2023-07-25 | bpf: Disable preemption in bpf_event_output | Jiri Olsa | 1 | -1/+5
We received report [1] of kernel crash, which is caused by using nesting protection without disabled preemption. The bpf_event_output can be called by programs executed by bpf_prog_run_array_cg function that disabled migration but keeps preemption enabled. This can cause task to be preempted by another one inside the nesting protection and lead eventually to two tasks using same perf_sample_data buffer and cause crashes like: BUG: kernel NULL pointer dereference, address: 0000000000000001 #PF: supervisor instruction fetch in kernel mode #PF: error_code(0x0010) - not-present page ... ? perf_output_sample+0x12a/0x9a0 ? finish_task_switch.isra.0+0x81/0x280 ? perf_event_output+0x66/0xa0 ? bpf_event_output+0x13a/0x190 ? bpf_event_output_data+0x22/0x40 ? bpf_prog_dfc84bbde731b257_cil_sock4_connect+0x40a/0xacb ? xa_load+0x87/0xe0 ? __cgroup_bpf_run_filter_sock_addr+0xc1/0x1a0 ? release_sock+0x3e/0x90 ? sk_setsockopt+0x1a1/0x12f0 ? udp_pre_connect+0x36/0x50 ? inet_dgram_connect+0x93/0xa0 ? __sys_connect+0xb4/0xe0 ? udp_setsockopt+0x27/0x40 ? __pfx_udp_push_pending_frames+0x10/0x10 ? __sys_setsockopt+0xdf/0x1a0 ? __x64_sys_connect+0xf/0x20 ? do_syscall_64+0x3a/0x90 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc Fixing this by disabling preemption in bpf_event_output. [1] https://github.com/cilium/cilium/issues/26756 Cc: [email protected] Reported-by: Oleg "livelace" Popov <[email protected]> Closes: https://github.com/cilium/cilium/issues/26756 Fixes: 2a916f2f546c ("bpf: Use migrate_disable/enable in array macros and cgroup/lirc code.") Acked-by: Hou Tao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
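The shape of the fix, as a hedged sketch (identifiers approximate): keep the task on this CPU for the whole nesting-protected section, so two tasks can never end up sharing the same per-CPU sample buffer:

    preempt_disable();
    nest_level = this_cpu_inc_return(bpf_event_output_nest_level);
    /* ... pick the per-CPU perf_sample_data slot for this nest level,
     *     fill it in and call perf_event_output() ...
     */
    this_cpu_dec(bpf_event_output_nest_level);
    preempt_enable();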
2023-07-25 | bpf: Disable preemption in bpf_perf_event_output | Jiri Olsa | 1 | -4/+7
The nesting protection in bpf_perf_event_output relies on disabled preemption, which is guaranteed for kprobes and tracepoints. However bpf_perf_event_output can be also called from uprobes context through bpf_prog_run_array_sleepable function which disables migration, but keeps preemption enabled. This can cause task to be preempted by another one inside the nesting protection and lead eventually to two tasks using same perf_sample_data buffer and cause crashes like: kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle page fault for address: ffffffff82be3eea ... Call Trace: ? __die+0x1f/0x70 ? page_fault_oops+0x176/0x4d0 ? exc_page_fault+0x132/0x230 ? asm_exc_page_fault+0x22/0x30 ? perf_output_sample+0x12b/0x910 ? perf_event_output+0xd0/0x1d0 ? bpf_perf_event_output+0x162/0x1d0 ? bpf_prog_c6271286d9a4c938_krava1+0x76/0x87 ? __uprobe_perf_func+0x12b/0x540 ? uprobe_dispatcher+0x2c4/0x430 ? uprobe_notify_resume+0x2da/0xce0 ? atomic_notifier_call_chain+0x7b/0x110 ? exit_to_user_mode_prepare+0x13e/0x290 ? irqentry_exit_to_user_mode+0x5/0x30 ? asm_exc_int3+0x35/0x40 Fixing this by disabling preemption in bpf_perf_event_output. Cc: [email protected] Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps") Acked-by: Hou Tao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-25 | workqueue: Scale up wq_cpu_intensive_thresh_us if BogoMIPS is below 4000 | Tejun Heo | 1 | -1/+42
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items. Once detected, they're excluded from concurrency management to prevent them from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is enabled, repeat offenders are also reported so that the code can be updated. The default threshold is 10ms, which is long enough to do a fair bit of work on modern CPUs while short enough to be usually not noticeable. This unfortunately leads to a lot of, arguably spurious, detections on very slow CPUs. Using the same threshold across CPUs whose performance levels may be orders of magnitude apart doesn't make a whole lot of sense. This patch scales up wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS is below 4000. This is obviously very inaccurate, but it doesn't have to be accurate to be useful. The mechanism is still useful when the threshold is fully scaled up, and the benefits of reports are usually shared with everyone regardless of who's reporting, so as long as there are a sufficient number of fast machines reporting, we don't lose much. Some (or is it all?) ARM CPUs systematically report significantly lower BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs are, it's at least a missed opportunity and it probably would be a good idea to teach workqueue about it. Signed-off-by: Tejun Heo <[email protected]> Reported-and-Tested-by: Geert Uytterhoeven <[email protected]>
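A sketch of what such scaling could look like (the formula and constants here are illustrative, not necessarily what the patch uses; 'bogomips' stands in for a value derived from loops_per_jiffy):

    /* keep the 10ms default on fast CPUs, scale up to 1s below 4000 BogoMIPS */
    unsigned long thresh = 10 * USEC_PER_MSEC;

    if (bogomips && bogomips < 4000)
            thresh = clamp_val(thresh * 4000 / bogomips,
                               10 * USEC_PER_MSEC, USEC_PER_SEC);

    wq_cpu_intensive_thresh_us = thresh;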
2023-07-23 | Merge tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | 4 | -7/+17
Pull tracing fixes from Steven Rostedt:
 - Swapping the ring buffer for snapshotting (for things like irqsoff) can crash if the ring buffer is being resized. Disable swapping when this happens. The missed swap will be reported to the tracer
 - Report error if the histogram fails to be created due to an error in adding a histogram variable, in event_hist_trigger_parse()
 - Remove unused declaration of tracing_map_set_field_descr()

* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/histograms: Return an error if we fail to add histogram to hist_vars list
  ring-buffer: Do not swap cpu_buffer during resize process
  tracing: Remove unused extern declaration tracing_map_set_field_descr()
2023-07-23 | tracing/histograms: Return an error if we fail to add histogram to hist_vars list | Mohamed Khalfella | 1 | -1/+2
Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") added a check to fail histogram creation if save_hist_vars() failed to add the histogram to the hist_vars list. But the commit failed to set ret to the failing return code before jumping to unregister the histogram; fix it. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables") Signed-off-by: Mohamed Khalfella <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-23 | ring-buffer: Do not swap cpu_buffer during resize process | Chen Lin | 2 | -2/+15
When ring_buffer_swap_cpu was called during resize process, the cpu buffer was swapped in the middle, resulting in incorrect state. Continuing to run in the wrong state will result in oops. This issue can be easily reproduced using the following two scripts: /tmp # cat test1.sh //#! /bin/sh for i in `seq 0 100000` do echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb sleep 0.5 done /tmp # cat test2.sh //#! /bin/sh for i in `seq 0 100000` do echo irqsoff > /sys/kernel/debug/tracing/current_tracer sleep 1 echo nop > /sys/kernel/debug/tracing/current_tracer sleep 1 done /tmp # ./test1.sh & /tmp # ./test2.sh & A typical oops log is as follows, sometimes with other different oops logs. [ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8 [ 231.713375] Modules linked in: [ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15 [ 231.716750] Hardware name: linux,dummy-virt (DT) [ 231.718152] Workqueue: events update_pages_handler [ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 231.721171] pc : rb_update_pages+0x378/0x3f8 [ 231.722212] lr : rb_update_pages+0x25c/0x3f8 [ 231.723248] sp : ffff800082b9bd50 [ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0 [ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a [ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000 [ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510 [ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002 [ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558 [ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001 [ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000 [ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208 [ 231.744196] Call trace: [ 231.744892] rb_update_pages+0x378/0x3f8 [ 231.745893] update_pages_handler+0x1c/0x38 [ 231.746893] process_one_work+0x1f0/0x468 [ 231.747852] worker_thread+0x54/0x410 [ 231.748737] kthread+0x124/0x138 [ 231.749549] ret_from_fork+0x10/0x20 [ 231.750434] ---[ end trace 0000000000000000 ]--- [ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 233.721696] Mem abort info: [ 233.721935] ESR = 0x0000000096000004 [ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits [ 233.722596] SET = 0, FnV = 0 [ 233.722805] EA = 0, S1PTW = 0 [ 233.723026] FSC = 0x04: level 0 translation fault [ 233.723458] Data abort info: [ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000 [ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 233.726720] Modules linked in: [ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15 [ 233.727777] Hardware name: linux,dummy-virt (DT) [ 233.728225] Workqueue: events update_pages_handler [ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 233.729054] pc : rb_update_pages+0x1a8/0x3f8 [ 233.729334] lr : 
rb_update_pages+0x154/0x3f8 [ 233.729592] sp : ffff800082b9bd50 [ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000 [ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418 [ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003 [ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58 [ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001 [ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000 [ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c [ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0 [ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000 [ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000 [ 233.734418] Call trace: [ 233.734593] rb_update_pages+0x1a8/0x3f8 [ 233.734853] update_pages_handler+0x1c/0x38 [ 233.735148] process_one_work+0x1f0/0x468 [ 233.735525] worker_thread+0x54/0x410 [ 233.735852] kthread+0x124/0x138 [ 233.736064] ret_from_fork+0x10/0x20 [ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060) [ 233.736959] ---[ end trace 0000000000000000 ]--- After analysis, the seq of the error is as follows [1-5]: int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, int cpu_id) { for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //1. get cpu_buffer, aka cpu_buffer(A) ... ... schedule_work_on(cpu, &cpu_buffer->update_pages_work); //2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to // update_pages_handler, do the update process, set 'update_done' in // complete(&cpu_buffer->update_done) and to wakeup resize process. //----> //3. Just at this moment, ring_buffer_swap_cpu is triggered, //cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer. //ring_buffer_swap_cpu is called as the 'Call trace' below. Call trace: dump_backtrace+0x0/0x2f8 show_stack+0x18/0x28 dump_stack+0x12c/0x188 ring_buffer_swap_cpu+0x2f8/0x328 update_max_tr_single+0x180/0x210 check_critical_timing+0x2b4/0x2c8 tracer_hardirqs_on+0x1c0/0x200 trace_hardirqs_on+0xec/0x378 el0_svc_common+0x64/0x260 do_el0_svc+0x90/0xf8 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x180/0x1c0 //<---- /* wait for all the updates to complete */ for_each_buffer_cpu(buffer, cpu) { cpu_buffer = buffer->buffers[cpu]; //4. get cpu_buffer, cpu_buffer(B) is used in the following process, //the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong. //for example, cpu_buffer(A)->update_done will leave be set 1, and will //not 'wait_for_completion' at the next resize round. if (!cpu_buffer->nr_pages_to_update) continue; if (cpu_online(cpu)) wait_for_completion(&cpu_buffer->update_done); cpu_buffer->nr_pages_to_update = 0; } ... } //5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong, //Continuing to run in the wrong state, then oops occurs. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Signed-off-by: Chen Lin <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-23 | tracing: Remove unused extern declaration tracing_map_set_field_descr() | YueHaibing | 1 | -4/+0
Since commit 08d43a5fa063 ("tracing: Add lock-free tracing_map"), this is never used, so can be removed. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: <[email protected]> Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-20 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 14 | -56/+87
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-07-20 | Merge tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 1 | -7/+25
Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF, netfilter, bluetooth and CAN.

  Current release - regressions:
   - eth: r8169: multiple fixes for PCIe ASPM-related problems
   - vrf: fix RCU lockdep splat in output path

  Previous releases - regressions:
   - gso: fall back to SW segmenting with GSO_UDP_L4 dodgy bit set
   - dsa: mv88e6xxx: do a final check before timing out when polling
   - nf_tables: fix sleep in atomic in nft_chain_validate

  Previous releases - always broken:
   - sched: fix undoing tcf_bind_filter() in multiple classifiers
   - bpf, arm64: fix BTI type used for freplace attached functions
   - can: gs_usb: fix time stamp counter initialization
   - nft_set_pipapo: fix improper element removal (leading to UAF)

  Misc:
   - net: support STP on bridge in non-root netns, STP prevents packet loops so not supporting it results in freezing systems of unsuspecting users, and in turn very upset noises being made
   - fix kdoc warnings
   - annotate various bits of TCP state to prevent data races"

* tag 'net-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
  net: phy: prevent stale pointer dereference in phy_init()
  tcp: annotate data-races around fastopenq.max_qlen
  tcp: annotate data-races around icsk->icsk_user_timeout
  tcp: annotate data-races around tp->notsent_lowat
  tcp: annotate data-races around rskq_defer_accept
  tcp: annotate data-races around tp->linger2
  tcp: annotate data-races around icsk->icsk_syn_retries
  tcp: annotate data-races around tp->keepalive_probes
  tcp: annotate data-races around tp->keepalive_intvl
  tcp: annotate data-races around tp->keepalive_time
  tcp: annotate data-races around tp->tsoffset
  tcp: annotate data-races around tp->tcp_tx_delay
  Bluetooth: MGMT: Use correct address for memcpy()
  Bluetooth: btusb: Fix bluetooth on Intel Macbook 2014
  Bluetooth: SCO: fix sco_conn related locking and validity issues
  Bluetooth: hci_conn: return ERR_PTR instead of NULL when there is no link
  Bluetooth: hci_sync: Avoid use-after-free in dbg for hci_remove_adv_monitor()
  Bluetooth: coredump: fix building with coredump disabled
  Bluetooth: ISO: fix iso_conn related locking and validity issues
  Bluetooth: hci_event: call disconnect callback before deleting conn
  ...
2023-07-19 | bpf, net: Introduce skb_pointer_if_linear(). | Alexei Starovoitov | 1 | -1/+4
Network drivers always call skb_header_pointer() with non-null buffer. Remove !buffer check to prevent accidental misuse of skb_header_pointer(). Introduce skb_pointer_if_linear() instead. Reported-by: Jakub Kicinski <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: Add fd-based tcx multi-prog infra with link support | Daniel Borkmann | 4 | -13/+419
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. 
BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/[email protected] Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: Add generic attach/detach/query API for multi-progs | Daniel Borkmann | 2 | -1/+446
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. 
The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/[email protected] [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: allow any program to use the bpf_map_sum_elem_count kfunc | Anton Protopopov | 1 | -1/+1
Register the bpf_map_sum_elem_count func for all programs, and update the map_ptr subtest of the test_progs test to test the new functionality. The usage is allowed as long as the pointer to the map is trusted (when using tracing programs) or is a const pointer to map, as in the following example:

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            ...
    } hash SEC(".maps");

    ...

    static inline int some_bpf_prog(void)
    {
            struct bpf_map *map = (struct bpf_map *)&hash;
            __s64 count;

            count = bpf_map_sum_elem_count(map);

            ...
    }

Signed-off-by: Anton Protopopov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: make an argument const in the bpf_map_sum_elem_count kfunc | Anton Protopopov | 1 | -1/+1
We use the map pointer only to read the counter values, no locking involved, so mark the argument as const. Signed-off-by: Anton Protopopov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_map | Anton Protopopov | 2 | -2/+2
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs. This, in turn, allows users to execute trusted kfuncs which accept `struct bpf_map *` arguments from non-tracing programs. While exporting the btf_bpf_map_id variable, save some bytes by defining it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST (which is u32[64]). Signed-off-by: Anton Protopopov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-19 | bpf: consider types listed in reg2btf_ids as trusted | Anton Protopopov | 1 | -9/+12
The reg2btf_ids array contains a list of types for which we can (and need) to find a corresponding static BTF id. All the types in the list can be considered as trusted for purposes of kfuncs. Signed-off-by: Anton Protopopov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-18 | bpf: Add 'owner' field to bpf_{list,rb}_node | Dave Marchevsky | 1 | -4/+25
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates using this ownership information - in order to prevent races. This patch adds an 'owner' field to struct bpf_list_node and bpf_rb_node to implement such runtime tracking. The owner field is a void * that describes the ownership state of a node. It can have the following values: NULL - the node is not owned by any data structure BPF_PTR_POISON - the node is in the process of being added to a data structure ptr_to_root - the pointee is a data structure 'root' (bpf_rb_root / bpf_list_head) which owns this node The field is initially NULL (set by bpf_obj_init_field default behavior) and transitions states in the following sequence: Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root Removal: ptr_to_root -> NULL Before a node has been successfully inserted, it is not protected by any root's lock, and therefore two programs can attempt to add the same node to different roots simultaneously. For this reason the intermediate BPF_PTR_POISON state is necessary. For removal, the node is protected by some root's lock so this intermediate hop isn't necessary. Note that bpf_list_pop_{front,back} helpers don't need to check owner before removing as the node-to-be-removed is not passed in as input and is instead taken directly from the list. Do the check anyways and WARN_ON_ONCE in this unexpected scenario. Selftest changes in this patch are entirely mechanical: some BTF tests have hardcoded struct sizes for structs that contain bpf_{list,rb}_node fields, those were adjusted to account for the new sizes. Selftest additions to validate the owner field are added in a further patch in the series. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2 Signed-off-by: Dave Marchevsky <[email protected]> Suggested-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-18 | bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_node | Dave Marchevsky | 1 | -10/+13
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touch their fields - nor does the verifier allow them to do so. Currently these structs are simple wrappers around structs rb_node and list_head and linked_list / rbtree implementation just casts and passes to library functions for those data structures. Later patches in this series, though, will add an "owner" field to bpf_{rb,list}_node, such that they're not just wrapping an underlying node type. Moreover, the bpf linked_list and rbtree implementations will deal with these owner pointers directly in a few different places. To avoid having to do void *owner = (void*)bpf_list_node + sizeof(struct list_head) with opaque UAPI node types, add bpf_{list,rb}_node_kern struct definitions to internal headers and modify linked_list and rbtree to use the internal types where appropriate. Signed-off-by: Dave Marchevsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
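Conceptually, the internal mirror types described above look roughly like this (a sketch; the layout has to stay compatible with the opaque uapi structs):

    /* kernel-internal view of the uapi-opaque nodes */
    struct bpf_list_node_kern {
            struct list_head list_head;
            void *owner;    /* NULL, BPF_PTR_POISON, or the owning root */
    } __attribute__((aligned(8)));

    struct bpf_rb_node_kern {
            struct rb_node rb_node;
            void *owner;    /* see bpf_list_node_kern */
    } __attribute__((aligned(8)));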
2023-07-18 | bpf: Repeat check_max_stack_depth for async callbacks | Kumar Kartikeya Dwivedi | 1 | -2/+19
While the check_max_stack_depth function explores call chains emanating from the main prog, which is typically enough to cover all possible call chains, it doesn't explore those rooted at async callbacks unless the async callback will have been directly called, since unlike non-async callbacks it skips their instruction exploration as they don't contribute to stack depth. It could be the case that the async callback leads to a callchain which exceeds the stack depth, but this is never reachable while only exploring the entry point from main subprog. Hence, repeat the check for the main subprog *and* all async callbacks marked by the symbolic execution pass of the verifier, as execution of the program may begin at any of them. Consider functions with following stack depths: main: 256 async: 256 foo: 256 main: rX = async bpf_timer_set_callback(...) async: foo() Here, async is not descended as it does not contribute to stack depth of main (since it is referenced using bpf_pseudo_func and not bpf_pseudo_call). However, when async is invoked asynchronously, it will end up breaching the MAX_BPF_STACK limit by calling foo. Hence, in addition to main, we also need to explore call chains beginning at all async callback subprogs in a program. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-18 | bpf: Fix subprog idx logic in check_max_stack_depth | Kumar Kartikeya Dwivedi | 1 | -5/+6
The assignment to idx in check_max_stack_depth happens once we see a bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of the code performs a few checks and then pushes the frame to the frame stack, except the case of async callbacks. If the async callback case causes the loop iteration to be skipped, the idx assignment will be incorrect on the next iteration of the loop. The value stored in the frame stack (as the subprogno of the current subprog) will be incorrect. This leads to incorrect checks and incorrect tail_call_reachable marking. Save the target subprog in a new variable and only assign to idx once we are done with the is_async_cb check which may skip pushing of frame to the frame stack and subsequent stack depth checks and tail call markings. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-18 | Merge tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Linus Torvalds | 1 | -5/+5
Pull hotfixes from Andrew Morton:
 "Seven hotfixes, six of which are cc:stable and one of which addresses a post-6.5 issue"

* tag 'mm-hotfixes-stable-2023-07-18-12-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  maple_tree: fix node allocation testing on 32 bit
  maple_tree: fix 32 bit mas_next testing
  selftests/mm: mkdirty: fix incorrect position of #endif
  maple_tree: set the node limit when creating a new root node
  mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()
  prctl: move PR_GET_AUXV out of PR_MCE_KILL
  selftests/mm: give scripts execute permission
2023-07-17 | prctl: move PR_GET_AUXV out of PR_MCE_KILL | Miguel Ojeda | 1 | -5/+5
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was applied [1]. Thus move it out of the switch, to the place the patch added it. In the recently released v6.4 kernel some user could, in principle, be already using this feature by mapping the right page and passing the PR_GET_AUXV constant as a pointer: prctl(PR_MCE_KILL, PR_GET_AUXV, ...) So this does change the behavior for users. We could keep the bug since the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET) do not overlap. However, v6.4 may be recent enough (2 weeks old) that moving the lines (rather than just adding a new case) does not break anybody? Moreover, the documentation in man-pages was just committed today [2]. Link: https://lkml.kernel.org/r/[email protected] Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace") Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1] Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2] Signed-off-by: Miguel Ojeda <[email protected]> Cc: Josh Triplett <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-07-17 | locking/rtmutex: Fix task->pi_waiters integrity | Peter Zijlstra | 4 | -76/+155
Henry reported that rt_mutex_adjust_prio_check() has an ordering problem and puts the lie to the comment in [7]. Sharing the sort key between lock->waiters and owner->pi_waiters *does* create problems, since unlike what the comment claims, holding [L] is insufficient. Notably, consider: A / \ M1 M2 | | B C That is, task A owns both M1 and M2, B and C block on them. In this case a concurrent chain walk (B & C) will modify their resp. sort keys in [7] while holding M1->wait_lock and M2->wait_lock. So holding [L] is meaningless, they're different Ls. This then gives rise to a race condition between [7] and [11], where the requeue of pi_waiters will observe an inconsistent tree order. B C (holds M1->wait_lock, (holds M2->wait_lock, holds B->pi_lock) holds A->pi_lock) [7] waiter_update_prio(); ... [8] raw_spin_unlock(B->pi_lock); ... [10] raw_spin_lock(A->pi_lock); [11] rt_mutex_enqueue_pi(); // observes inconsistent A->pi_waiters // tree order Fixing this means either extending the range of the owner lock from [10-13] to [6-13], with the immediate problem that this means [6-8] hold both blocked and owner locks, or duplicating the sort key. Since the locking in chain walk is horrible enough without having to consider pi_lock nesting rules, duplicate the sort key instead. By giving each tree their own sort key, the above race becomes harmless, if C sees B at the old location, then B will correct things (if they need correcting) when it walks up the chain and reaches A. Fixes: fb00aca47440 ("rtmutex: Turn the plist into an rb-tree") Reported-by: Henry Wu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Tested-by: Henry Wu <[email protected]> Link: https://lkml.kernel.org/r/20230707161052.GF2883469%40hirez.programming.kicks-ass.net
2023-07-16 | Merge tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 3 | -10/+23
Pull scheduler fixes from Borislav Petkov:
 - Remove a cgroup from under a polling process properly
 - Fix the idle sibling selection

* tag 'sched_urgent_for_v6.5_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: use kernfs polling functions for PSI trigger polling
  sched/fair: Use recent_used_cpu to test p->cpus_ptr
2023-07-16 | Merge tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux | Linus Torvalds | 1 | -3/+2
Pull hardening fixes from Kees Cook:
 - Remove LTO-only suffixes from promoted global function symbols (Yonghong Song)
 - Remove unused .text..refcount section from vmlinux.lds.h (Petr Pavlu)
 - Add missing __always_inline to sparc __arch_xchg() (Arnd Bergmann)
 - Claim maintainership of string routines

* tag 'hardening-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  sparc: mark __arch_xchg() as __always_inline
  MAINTAINERS: Foolishly claim maintainership of string routines
  kallsyms: strip LTO-only suffixes from promoted global functions
  vmlinux.lds.h: Remove a reference to no longer used sections .text..refcount
2023-07-16Merge tag 'probes-fixes-v6.5-rc1-2' of ↵Linus Torvalds6-29/+24
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe fixes from Masami Hiramatsu: - fprobe: Add a comment explaining why fprobe will be skipped if another kprobe is running in fprobe_kprobe_handler(). - probe-events: Fix some issues related to fetch-arguments: - Fix double counting of the string length for user-string and symstr. This will require a longer buffer in the array case. - Fix not to count the error code (a negative value) toward the total used length of an array argument. This makes the total used length shorter. - Fix to update the dynamic used data size counter only if the fetcharg actually uses dynamic-size data. Previously the counter could be mis-updated and corrupt data. - Revert "tracing: Add "(fault)" name injection to kernel probes" because it did not work correctly due to a bug, and we agreed the current '(fault)' output (instead of '"(fault)"' like a string) explains what happened more clearly. - Fix to record a 0-length (meaning a faulted access) data_loc in the fetch function itself, instead of in store_trace_args(). When recording an array of strings, this ensures fault-access data is saved correctly for each entry of the array. * tag 'probes-fixes-v6.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if fails Revert "tracing: Add "(fault)" name injection to kernel probes" tracing/probes: Fix to update dynamic data counter if fetcharg uses it tracing/probes: Fix not to count error code to total length tracing/probes: Fix to avoid double count of the string length on the array fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is running
2023-07-14Merge tag 'pm-6.5-rc2' of ↵Linus Torvalds2-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix hibernation (after recent changes), frequency QoS and the sparc cpufreq driver. Specifics: - Unbreak the /sys/power/resume interface after recent changes (Azat Khuzhin). - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai Yang). - Remove __init from cpufreq callbacks in the sparc driver, because they may be called after initialization too (Viresh Kumar)" * tag 'pm-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: sparc: Don't mark cpufreq callbacks with __init PM: QoS: Restore support for default value on frequency QoS PM: hibernate: Fix writing maj:min to /sys/power/resume
2023-07-14Merge branches 'pm-sleep' and 'pm-qos'Rafael J. Wysocki2-2/+8
Merge a PM QoS fix and a hibernation fix for 6.5-rc2. - Unbreak the /sys/power/resume interface after recent changes (Azat Khuzhin). - Allow PM_QOS_DEFAULT_VALUE to be used with frequency QoS (Chungkai Yang). * pm-sleep: PM: hibernate: Fix writing maj:min to /sys/power/resume * pm-qos: PM: QoS: Restore support for default value on frequency QoS
2023-07-14tracing/probes: Fix to record 0-length data_loc in fetch_store_string*() if ↵Masami Hiramatsu (Google)3-12/+14
fails Fix to record 0-length data to data_loc in fetch_store_string*() if it fails to get the string data. Currently those functions expect data_loc to be updated by store_trace_args() if they return an error code. However, that does not work correctly if the argument is an array of strings. In that case, store_trace_args() only clears the first entry of the array (which may have no error) and leaves the other entries untouched. So the entry should be cleared by fetch_store_string*() itself. Also, 'dyndata' and 'maxlen' in store_trace_args() should be updated only if they are used (ret > 0 and the argument is dynamic data). Link: https://lore.kernel.org/all/168908496683.123124.4761206188794205601.stgit@devnote2/ Fixes: 40b53b771806 ("tracing: probeevent: Add array type support") Cc: [email protected] Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
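For context, each dynamic (string) argument stores a 32-bit data_loc in the event entry that packs the string's length and its offset within the entry. Below is a minimal user-space sketch of the fixed behaviour, where the fetch routine itself records a zero-length data_loc on a faulted copy; the helper names and the exact data_loc packing are assumptions for illustration, not the kernel's trace_probe code.

  /*
   * Sketch of a fetch_store_string-like routine that records a 0-length
   * data_loc itself when the copy faults.  Names and packing are
   * illustrative assumptions, not the kernel implementation.
   */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Assumed packing: length in the high 16 bits, offset in the low 16 bits. */
  static uint32_t make_data_loc(uint32_t len, uint32_t offs)
  {
          return (len << 16) | (offs & 0xffff);
  }

  /* Stand-in for a user-memory copy; NULL simulates a faulting address. */
  static int copy_user_string(const char *uaddr, char *dst, int maxlen)
  {
          if (!uaddr)
                  return -14;                     /* -EFAULT */
          strncpy(dst, uaddr, maxlen - 1);
          dst[maxlen - 1] = '\0';
          return (int)strlen(dst);
  }

  static int fetch_store_string_sketch(const char *uaddr, void *dest, void *base)
  {
          char *str = (char *)dest + sizeof(uint32_t);    /* string area after data_loc */
          uint32_t offs = (uint32_t)(str - (char *)base);
          int ret = copy_user_string(uaddr, str, 64);

          if (ret < 0) {
                  /* Fault: record a 0-length entry so array slots stay consistent. */
                  *(uint32_t *)dest = make_data_loc(0, offs);
                  return ret;
          }
          *(uint32_t *)dest = make_data_loc((uint32_t)ret, offs);
          return ret;
  }

  int main(void)
  {
          uint32_t entry[32] = { 0 };

          fetch_store_string_sketch(NULL, entry, entry);  /* faulting read */
          printf("data_loc=0x%x\n", entry[0]);            /* length bits are 0 */
          return 0;
  }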
2023-07-13Merge tag 'for-netdev' of ↵Jakub Kicinski15-438/+897
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-13 We've added 67 non-merge commits during the last 15 day(s) which contain a total of 106 files changed, 4444 insertions(+), 619 deletions(-). The main changes are: 1) Fix bpftool build in presence of stale vmlinux.h, from Alexander Lobakin. 2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress, from Alexei Starovoitov. 3) Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling, from Andrii Nakryiko. 4) Introduce bpf map element count, from Anton Protopopov. 5) Check skb ownership against full socket, from Kui-Feng Lee. 6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong. 7) Export rcu_request_urgent_qs_task, from Paul E. McKenney. 8) Fix BTF walking of unions, from Yafang Shao. 9) Extend link_info for kprobe_multi and perf_event links, from Yafang Shao. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits) selftests/bpf: Add selftest for PTR_UNTRUSTED bpf: Fix an error in verifying a field in a union selftests/bpf: Add selftests for nested_trust bpf: Fix an error around PTR_UNTRUSTED selftests/bpf: add testcase for TRACING with 6+ arguments bpf, x86: allow function arguments up to 12 for TRACING bpf, x86: save/restore regs with BPF_DW size bpftool: Use "fallthrough;" keyword instead of comments bpf: Add object leak check. bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu(). selftests/bpf: Improve test coverage of bpf_mem_alloc. rcu: Export rcu_request_urgent_qs_task() bpf: Allow reuse from waiting_for_gp_ttrace list. bpf: Add a hint to allocated objects. bpf: Change bpf_mem_cache draining process. bpf: Further refactor alloc_bulk(). bpf: Factor out inc/dec of active flag into helpers. bpf: Refactor alloc_bulk(). bpf: Let free_all() return the number of freed elements. ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-07-13bpf: Fix an error in verifying a field in a unionYafang Shao1-1/+1
We are utilizing BPF LSM to monitor BPF operations within our container environment. When we added support for raw_tracepoint, it hit the error below:

  ; (const void *)attr->raw_tracepoint.name);
  27: (79) r3 = *(u64 *)(r2 +0)
  access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8

It can be reproduced with the BPF prog below:

  SEC("lsm/bpf")
  int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size)
  {
          switch (cmd) {
          case BPF_RAW_TRACEPOINT_OPEN:
                  bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name);
                  break;
          default:
                  break;
          }
          return 0;
  }

The reason is that when accessing a field in a union, such as bpf_attr, if the field is located within a nested struct that is not the first member of the union, it can result in incorrect field verification:

  union bpf_attr {
          struct {
                  __u32 map_type;     <<<< Actually it will find that field.
                  __u32 key_size;
                  __u32 value_size;
                  ...
          };
          ...
          struct {
                  __u64 name;         <<<< We want to verify this field.
                  __u32 prog_fd;
          } raw_tracepoint;
  };

Considering the potential deep nesting levels, finding a perfect solution to address this issue has proven challenging. Therefore, I propose a solution where we simply skip the verification process if the field in question is located within a union. Fixes: 7e3617a72df3 ("bpf: Add array support to btf_struct_access") Signed-off-by: Yafang Shao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
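The overlap that trips the verifier is easy to see from user space: both the anonymous map-creation struct and raw_tracepoint start at offset 0 of the union, so an offset-based field lookup finds map_type (4 bytes) before raw_tracepoint.name (8 bytes). A small illustrative check, assuming the uapi <linux/bpf.h> header is available:

  /* Shows that map_type and raw_tracepoint.name share offset 0 in union bpf_attr. */
  #include <linux/bpf.h>
  #include <stddef.h>
  #include <stdio.h>

  int main(void)
  {
          printf("offsetof(map_type)            = %zu\n",
                 offsetof(union bpf_attr, map_type));
          printf("offsetof(raw_tracepoint.name) = %zu\n",
                 offsetof(union bpf_attr, raw_tracepoint.name));
          /* Both print 0, which is why the 8-byte load aimed at
           * raw_tracepoint.name was checked against map_type (size 4). */
          return 0;
  }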