aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-07-13bpf: Fix an error around PTR_UNTRUSTEDYafang Shao2-11/+14
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been cleared when we start to walk a new struct, because the struct in question may be a struct nested in a union. We should also check and set this flag before we walk its each member, in case itself is a union. We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL. Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.") Signed-off-by: Yafang Shao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-13Merge tag 'net-6.5-rc2' of ↵Linus Torvalds2-18/+27
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter, wireless and ebpf. Current release - regressions: - netfilter: conntrack: gre: don't set assured flag for clash entries - wifi: iwlwifi: remove 'use_tfh' config to fix crash Previous releases - regressions: - ipv6: fix a potential refcount underflow for idev - icmp6: ifix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev() - bpf: fix max stack depth check for async callbacks - eth: mlx5e: - check for NOT_READY flag state after locking - fix page_pool page fragment tracking for XDP - eth: igc: - fix tx hang issue when QBV gate is closed - fix corner cases for TSN offload - eth: octeontx2-af: Move validation of ptp pointer before its usage - eth: ena: fix shift-out-of-bounds in exponential backoff Previous releases - always broken: - core: prevent skb corruption on frag list segmentation - sched: - cls_fw: fix improper refcount update leads to use-after-free - sch_qfq: account for stab overhead in qfq_enqueue - netfilter: - report use refcount overflow - prevent OOB access in nft_byteorder_eval - wifi: mt7921e: fix init command fail with enabled device - eth: ocelot: fix oversize frame dropping for preemptible TCs - eth: fec: recycle pages for transmitted XDP frames" * tag 'net-6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits) selftests: tc-testing: add test for qfq with stab overhead net/sched: sch_qfq: account for stab overhead in qfq_enqueue selftests: tc-testing: add tests for qfq mtu sanity check net/sched: sch_qfq: reintroduce lmax bound check for MTU wifi: cfg80211: fix receiving mesh packets without RFC1042 header wifi: rtw89: debug: fix error code in rtw89_debug_priv_send_h2c_set() net: txgbe: fix eeprom calculation error net/sched: make psched_mtu() RTNL-less safe net: ena: fix shift-out-of-bounds in exponential backoff netdevsim: fix uninitialized data in nsim_dev_trap_fa_cookie_write() net/sched: flower: Ensure both minimum and maximum ports are specified MAINTAINERS: Add another mailing list for QUALCOMM ETHQOS ETHERNET DRIVER docs: netdev: update the URL of the status page wifi: iwlwifi: remove 'use_tfh' config to fix crash xdp: use trusted arguments in XDP hints kfuncs bpf: cpumap: Fix memory leak in cpu_map_update_elem wifi: airo: avoid uninitialized warning in airo_get_rate() octeontx2-pf: Add additional check for MCAM rules net: dsa: Removed unneeded of_node_put in felix_parse_ports_node net: fec: use netdev_err_once() instead of netdev_err() ...
2023-07-13Merge tag 'trace-v6.5-rc1-3' of ↵Linus Torvalds8-30/+81
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix some missing-prototype warnings - Fix user events struct args (did not include size of struct) When creating a user event, the "struct" keyword is to denote that the size of the field will be passed in. But the parsing failed to handle this case. - Add selftest to struct sizes for user events - Fix sample code for direct trampolines. The sample code for direct trampolines attached to handle_mm_fault(). But the prototype changed and the direct trampoline sample code was not updated. Direct trampolines needs to have the arguments correct otherwise it can fail or crash the system. - Remove unused ftrace_regs_caller_ret() prototype. - Quiet false positive of FORTIFY_SOURCE Due to backward compatibility, the structure used to save stack traces in the kernel had a fixed size of 8. This structure is exported to user space via the tracing format file. A change was made to allow more than 8 functions to be recorded, and user space now uses the size field to know how many functions are actually in the stack. But the structure still has size of 8 (even though it points into the ring buffer that has the required amount allocated to hold a full stack. This was fine until the fortifier noticed that the memcpy(&entry->caller, stack, size) was greater than the 8 functions and would complain at runtime about it. Hide this by using a pointer to the stack location on the ring buffer instead of using the address of the entry structure caller field. - Fix a deadloop in reading trace_pipe that was caused by a mismatch between ring_buffer_empty() returning false which then asked to read the data, but the read code uses rb_num_of_entries() that returned zero, and causing a infinite "retry". - Fix a warning caused by not using all pages allocated to store ftrace functions, where this can happen if the linker inserts a bunch of "NULL" entries, causing the accounting of how many pages needed to be off. - Fix histogram synthetic event crashing when the start event is removed and the end event is still using a variable from it - Fix memory leak in freeing iter->temp in tracing_release_pipe() * tag 'trace-v6.5-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix memory leak of iter->temp when reading trace_pipe tracing/histograms: Add histograms to hist_vars if they have referenced variables tracing: Stop FORTIFY_SOURCE complaining about stack trace caller ftrace: Fix possible warning on checking all pages used in ftrace_process_locs() ring-buffer: Fix deadloop issue on reading trace_pipe tracing: arm64: Avoid missing-prototype warnings selftests/user_events: Test struct size match cases tracing/user_events: Fix struct arg size match check x86/ftrace: Remove unsued extern declaration ftrace_regs_caller_ret() arm64: ftrace: Add direct call trampoline samples support samples: ftrace: Save required argument registers in sample trampolines
2023-07-14Revert "tracing: Add "(fault)" name injection to kernel probes"Masami Hiramatsu (Google)3-26/+9
This reverts commit 2e9906f84fc7c99388bb7123ade167250d50f1c0. It was turned out that commit 2e9906f84fc7 ("tracing: Add "(fault)" name injection to kernel probes") did not work correctly and probe events still show just '(fault)' (instead of '"(fault)"'). Also, current '(fault)' is more explicit that it faulted. This also moves FAULT_STRING macro to trace.h so that synthetic event can keep using it, and uses it in trace_probe.c too. Link: https://lore.kernel.org/all/168908495772.123124.1250788051922100079.stgit@devnote2/ Link: https://lore.kernel.org/all/[email protected]/ Cc: [email protected] Cc: Andrew Morton <[email protected]> Cc: Tom Zanussi <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-14tracing/probes: Fix to update dynamic data counter if fetcharg uses itMasami Hiramatsu (Google)1-5/+7
Fix to update dynamic data counter ('dyndata') and max length ('maxlen') only if the fetcharg uses the dynamic data. Also get out arg->dynamic from unlikely(). This makes dynamic data address wrong if process_fetch_insn() returns error on !arg->dynamic case. Link: https://lore.kernel.org/all/168908494781.123124.8160245359962103684.stgit@devnote2/ Suggested-by: Steven Rostedt <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Fixes: 9178412ddf5a ("tracing: probeevent: Return consumed bytes of dynamic area") Cc: [email protected] Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-14tracing/probes: Fix not to count error code to total lengthMasami Hiramatsu (Google)1-0/+2
Fix not to count the error code (which is minus value) to the total used length of array, because it can mess up the return code of process_fetch_insn_bottom(). Also clear the 'ret' value because it will be used for calculating next data_loc entry. Link: https://lore.kernel.org/all/168908493827.123124.2175257289106364229.stgit@devnote2/ Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 9b960a38835f ("tracing: probeevent: Unify fetch_insn processing common part") Cc: [email protected] Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-14tracing/probes: Fix to avoid double count of the string length on the arrayMasami Hiramatsu (Google)1-2/+2
If an array is specified with the ustring or symstr, the length of the strings are accumlated on both of 'ret' and 'total', which means the length is double counted. Just set the length to the 'ret' value for avoiding double counting. Link: https://lore.kernel.org/all/168908492917.123124.15076463491122036025.stgit@devnote2/ Reported-by: Dan Carpenter <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Fixes: 88903c464321 ("tracing/probe: Add ustring type for user-space string") Cc: [email protected] Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-14fprobes: Add a comment why fprobe_kprobe_handler exits if kprobe is runningMasami Hiramatsu (Google)1-0/+6
Add a comment the reason why fprobe_kprobe_handler() exits if any other kprobe is running. Link: https://lore.kernel.org/all/168874788299.159442.2485957441413653858.stgit@devnote2/ Suggested-by: Steven Rostedt <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-13tracing: Fix memory leak of iter->temp when reading trace_pipeZheng Yejian1-0/+1
kmemleak reports: unreferenced object 0xffff88814d14e200 (size 256): comm "cat", pid 336, jiffies 4294871818 (age 779.490s) hex dump (first 32 bytes): 04 00 01 03 00 00 00 00 08 00 00 00 00 00 00 00 ................ 0c d8 c8 9b ff ff ff ff 04 5a ca 9b ff ff ff ff .........Z...... backtrace: [<ffffffff9bdff18f>] __kmalloc+0x4f/0x140 [<ffffffff9bc9238b>] trace_find_next_entry+0xbb/0x1d0 [<ffffffff9bc9caef>] trace_print_lat_context+0xaf/0x4e0 [<ffffffff9bc94490>] print_trace_line+0x3e0/0x950 [<ffffffff9bc95499>] tracing_read_pipe+0x2d9/0x5a0 [<ffffffff9bf03a43>] vfs_read+0x143/0x520 [<ffffffff9bf04c2d>] ksys_read+0xbd/0x160 [<ffffffff9d0f0edf>] do_syscall_64+0x3f/0x90 [<ffffffff9d2000aa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 when reading file 'trace_pipe', 'iter->temp' is allocated or relocated in trace_find_next_entry() but not freed before 'trace_pipe' is closed. To fix it, free 'iter->temp' in tracing_release_pipe(). Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: ff895103a84ab ("tracing: Save off entry when peeking at next entry") Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12tracing/histograms: Add histograms to hist_vars if they have referenced ↵Mohamed Khalfella1-3/+5
variables Hist triggers can have referenced variables without having direct variables fields. This can be the case if referenced variables are added for trigger actions. In this case the newly added references will not have field variables. Not taking such referenced variables into consideration can result in a bug where it would be possible to remove hist trigger with variables being refenced. This will result in a bug that is easily reproducable like so $ cd /sys/kernel/tracing $ echo 'synthetic_sys_enter char[] comm; long id' >> synthetic_events $ echo 'hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger $ echo 'hist:keys=common_pid.execname,id.syscall:onmatch(raw_syscalls.sys_enter).synthetic_sys_enter($comm, id)' >> events/raw_syscalls/sys_enter/trigger $ echo '!hist:keys=common_pid.execname,id.syscall:vals=hitcount:comm=common_pid.execname' >> events/raw_syscalls/sys_enter/trigger [ 100.263533] ================================================================== [ 100.264634] BUG: KASAN: slab-use-after-free in resolve_var_refs+0xc7/0x180 [ 100.265520] Read of size 8 at addr ffff88810375d0f0 by task bash/439 [ 100.266320] [ 100.266533] CPU: 2 PID: 439 Comm: bash Not tainted 6.5.0-rc1 #4 [ 100.267277] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-20220807_005459-localhost 04/01/2014 [ 100.268561] Call Trace: [ 100.268902] <TASK> [ 100.269189] dump_stack_lvl+0x4c/0x70 [ 100.269680] print_report+0xc5/0x600 [ 100.270165] ? resolve_var_refs+0xc7/0x180 [ 100.270697] ? kasan_complete_mode_report_info+0x80/0x1f0 [ 100.271389] ? resolve_var_refs+0xc7/0x180 [ 100.271913] kasan_report+0xbd/0x100 [ 100.272380] ? resolve_var_refs+0xc7/0x180 [ 100.272920] __asan_load8+0x71/0xa0 [ 100.273377] resolve_var_refs+0xc7/0x180 [ 100.273888] event_hist_trigger+0x749/0x860 [ 100.274505] ? kasan_save_stack+0x2a/0x50 [ 100.275024] ? kasan_set_track+0x29/0x40 [ 100.275536] ? __pfx_event_hist_trigger+0x10/0x10 [ 100.276138] ? ksys_write+0xd1/0x170 [ 100.276607] ? do_syscall_64+0x3c/0x90 [ 100.277099] ? entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.277771] ? destroy_hist_data+0x446/0x470 [ 100.278324] ? event_hist_trigger_parse+0xa6c/0x3860 [ 100.278962] ? __pfx_event_hist_trigger_parse+0x10/0x10 [ 100.279627] ? __kasan_check_write+0x18/0x20 [ 100.280177] ? mutex_unlock+0x85/0xd0 [ 100.280660] ? __pfx_mutex_unlock+0x10/0x10 [ 100.281200] ? kfree+0x7b/0x120 [ 100.281619] ? ____kasan_slab_free+0x15d/0x1d0 [ 100.282197] ? event_trigger_write+0xac/0x100 [ 100.282764] ? __kasan_slab_free+0x16/0x20 [ 100.283293] ? __kmem_cache_free+0x153/0x2f0 [ 100.283844] ? sched_mm_cid_remote_clear+0xb1/0x250 [ 100.284550] ? __pfx_sched_mm_cid_remote_clear+0x10/0x10 [ 100.285221] ? event_trigger_write+0xbc/0x100 [ 100.285781] ? __kasan_check_read+0x15/0x20 [ 100.286321] ? __bitmap_weight+0x66/0xa0 [ 100.286833] ? _find_next_bit+0x46/0xe0 [ 100.287334] ? task_mm_cid_work+0x37f/0x450 [ 100.287872] event_triggers_call+0x84/0x150 [ 100.288408] trace_event_buffer_commit+0x339/0x430 [ 100.289073] ? ring_buffer_event_data+0x3f/0x60 [ 100.292189] trace_event_raw_event_sys_enter+0x8b/0xe0 [ 100.295434] syscall_trace_enter.constprop.0+0x18f/0x1b0 [ 100.298653] syscall_enter_from_user_mode+0x32/0x40 [ 100.301808] do_syscall_64+0x1a/0x90 [ 100.304748] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 100.307775] RIP: 0033:0x7f686c75c1cb [ 100.310617] Code: 73 01 c3 48 8b 0d 65 3c 10 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 21 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 3c 10 00 f7 d8 64 89 01 48 [ 100.317847] RSP: 002b:00007ffc60137a38 EFLAGS: 00000246 ORIG_RAX: 0000000000000021 [ 100.321200] RAX: ffffffffffffffda RBX: 000055f566469ea0 RCX: 00007f686c75c1cb [ 100.324631] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000000000000000a [ 100.328104] RBP: 00007ffc60137ac0 R08: 00007f686c818460 R09: 000000000000000a [ 100.331509] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009 [ 100.334992] R13: 0000000000000007 R14: 000000000000000a R15: 0000000000000007 [ 100.338381] </TASK> We hit the bug because when second hist trigger has was created has_hist_vars() returned false because hist trigger did not have variables. As a result of that save_hist_vars() was not called to add the trigger to trace_array->hist_vars. Later on when we attempted to remove the first histogram find_any_var_ref() failed to detect it is being used because it did not find the second trigger in hist_vars list. With this change we wait until trigger actions are created so we can take into consideration if hist trigger has variable references. Also, now we check the return value of save_hist_vars() and fail trigger creation if save_hist_vars() fails. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers") Signed-off-by: Mohamed Khalfella <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12kallsyms: strip LTO-only suffixes from promoted global functionsYonghong Song1-3/+2
Commit 6eb4bd92c1ce ("kallsyms: strip LTO suffixes from static functions") stripped all function/variable suffixes started with '.' regardless of whether those suffixes are generated at LTO mode or not. In fact, as far as I know, in LTO mode, when a static function/variable is promoted to the global scope, '.llvm.<...>' suffix is added. The existing mechanism breaks live patch for a LTO kernel even if no <symbol>.llvm.<...> symbols are involved. For example, for the following kernel symbols: $ grep bpf_verifier_vlog /proc/kallsyms ffffffff81549f60 t bpf_verifier_vlog ffffffff8268b430 d bpf_verifier_vlog._entry ffffffff8282a958 d bpf_verifier_vlog._entry_ptr ffffffff82e12a1f d bpf_verifier_vlog.__already_done 'bpf_verifier_vlog' is a static function. '_entry', '_entry_ptr' and '__already_done' are static variables used inside 'bpf_verifier_vlog', so llvm promotes them to file-level static with prefix 'bpf_verifier_vlog.'. Note that the func-level to file-level static function promotion also happens without LTO. Given a symbol name 'bpf_verifier_vlog', with LTO kernel, current mechanism will return 4 symbols to live patch subsystem which current live patching subsystem cannot handle it. With non-LTO kernel, only one symbol is returned. In [1], we have a lengthy discussion, the suggestion is to separate two cases: (1). new symbols with suffix which are generated regardless of whether LTO is enabled or not, and (2). new symbols with suffix generated only when LTO is enabled. The cleanup_symbol_name() should only remove suffixes for case (2). Case (1) should not be changed so it can work uniformly with or without LTO. This patch removed LTO-only suffix '.llvm.<...>' so live patching and tracing should work the same way for non-LTO kernel. The cleanup_symbol_name() in scripts/kallsyms.c is also changed to have the same filtering pattern so both kernel and kallsyms tool have the same expectation on the order of symbols. [1] https://lore.kernel.org/live-patching/[email protected]/T/#u Fixes: 6eb4bd92c1ce ("kallsyms: strip LTO suffixes from static functions") Reported-by: Song Liu <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Acked-by: Song Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2023-07-12tracing: Stop FORTIFY_SOURCE complaining about stack trace callerSteven Rostedt (Google)1-2/+19
The stack_trace event is an event created by the tracing subsystem to store stack traces. It originally just contained a hard coded array of 8 words to hold the stack, and a "size" to know how many entries are there. This is exported to user space as: name: kernel_stack ID: 4 format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:int size; offset:8; size:4; signed:1; field:unsigned long caller[8]; offset:16; size:64; signed:0; print fmt: "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n\t=> %ps\n" "\t=> %ps\n\t=> %ps\n",i (void *)REC->caller[0], (void *)REC->caller[1], (void *)REC->caller[2], (void *)REC->caller[3], (void *)REC->caller[4], (void *)REC->caller[5], (void *)REC->caller[6], (void *)REC->caller[7] Where the user space tracers could parse the stack. The library was updated for this specific event to only look at the size, and not the array. But some older users still look at the array (note, the older code still checks to make sure the array fits inside the event that it read. That is, if only 4 words were saved, the parser would not read the fifth word because it will see that it was outside of the event size). This event was changed a while ago to be more dynamic, and would save a full stack even if it was greater than 8 words. It does this by simply allocating more ring buffer to hold the extra words. Then it copies in the stack via: memcpy(&entry->caller, fstack->calls, size); As the entry is struct stack_entry, that is created by a macro to both create the structure and export this to user space, it still had the caller field of entry defined as: unsigned long caller[8]. When the stack is greater than 8, the FORTIFY_SOURCE code notices that the amount being copied is greater than the source array and complains about it. It has no idea that the source is pointing to the ring buffer with the required allocation. To hide this from the FORTIFY_SOURCE logic, pointer arithmetic is used: ptr = ring_buffer_event_data(event); entry = ptr; ptr += offsetof(typeof(*entry), caller); memcpy(ptr, fstack->calls, size); Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Reported-by: Sven Schnelle <[email protected]> Tested-by: Sven Schnelle <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12ftrace: Fix possible warning on checking all pages used in ftrace_process_locs()Zheng Yejian1-14/+31
As comments in ftrace_process_locs(), there may be NULL pointers in mcount_loc section: > Some architecture linkers will pad between > the different mcount_loc sections of different > object files to satisfy alignments. > Skip any NULL pointers. After commit 20e5227e9f55 ("ftrace: allow NULL pointers in mcount_loc"), NULL pointers will be accounted when allocating ftrace pages but skipped before adding into ftrace pages, this may result in some pages not being used. Then after commit 706c81f87f84 ("ftrace: Remove extra helper functions"), warning may occur at: WARN_ON(pg->next); To fix it, only warn for case that no pointers skipped but pages not used up, then free those unused pages after releasing ftrace_lock. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: 706c81f87f84 ("ftrace: Remove extra helper functions") Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12bpf: Add object leak check.Hou Tao1-0/+35
The object leak check is cheap. Do it unconditionally to spot difficult races in bpf_mem_alloc. Signed-off-by: Hou Tao <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.Alexei Starovoitov1-14/+6
Convert bpf_cpumask to bpf_mem_cache_free_rcu. Note that migrate_disable() in bpf_cpumask_release() is still necessary, since bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do migrate_disable() there in a follow up. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: David Vernet <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().Alexei Starovoitov1-3/+126
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu(). Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into per-cpu free list the _rcu() flavor waits for RCU grace period and then moves objects into free_by_rcu_ttrace list where they are waiting for RCU task trace grace period to be freed into slab. The life cycle of objects: alloc: dequeue free_llist free: enqeueu free_llist free_rcu: enqueue free_by_rcu -> waiting_for_gp free_llist above high watermark -> free_by_rcu_ttrace after RCU GP waiting_for_gp -> free_by_rcu_ttrace free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12rcu: Export rcu_request_urgent_qs_task()Paul E. McKenney1-2/+0
If a CPU is executing a long series of non-sleeping system calls, RCU grace periods can be delayed for on the order of a couple hundred milliseconds. This is normally not a problem, but if each system call does a call_rcu(), those callbacks can stack up. RCU will eventually notice this callback storm, but use of rcu_request_urgent_qs_task() allows the code invoking call_rcu() to give RCU a heads up. This function is not for general use, not yet, anyway. Reported-by: Alexei Starovoitov <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Allow reuse from waiting_for_gp_ttrace list.Alexei Starovoitov1-6/+10
alloc_bulk() can reuse elements from free_by_rcu_ttrace. Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc(). Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Add a hint to allocated objects.Alexei Starovoitov1-19/+31
To address OOM issue when one cpu is allocating and another cpu is freeing add a target bpf_mem_cache hint to allocated objects and when local cpu free_llist overflows free to that bpf_mem_cache. The hint addresses the OOM while maintaining the same performance for common case when alloc/free are done on the same cpu. Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case, since do_call_rcu_ttrace() is called not only for current cpu. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Change bpf_mem_cache draining process.Alexei Starovoitov1-9/+9
The next patch will introduce cross-cpu llist access and existing irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add objects back to llist of cpu A. Add 'bool draining' flag. The modified sequence looks like: for_each_cpu: WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more irq_work_sync(); // wait for irq_work callback (free_bulk) to finish drain_mem_cache(); // free all objects rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Further refactor alloc_bulk().Alexei Starovoitov1-12/+18
In certain scenarios alloc_bulk() might be taking free objects mainly from free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are redundant, but they show up in perf profile. Split the loop and only set memcg when allocating from slab. No performance difference in this patch alone, but it helps in combination with further patches. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Factor out inc/dec of active flag into helpers.Alexei Starovoitov1-12/+18
Factor out local_inc/dec_return(&c->active) into helpers. No functional changes. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Refactor alloc_bulk().Alexei Starovoitov1-20/+26
Factor out inner body of alloc_bulk into separate helper. No functional changes. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Let free_all() return the number of freed elements.Alexei Starovoitov1-2/+6
Let free_all() helper return the number of freed elements. It's not used in this patch, but helps in debug/development of bpf_mem_alloc. For example this diff for __free_rcu(): - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); + printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(), + free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size)); would show how busy RCU tasks trace is. In artificial benchmark where one cpu is allocating and different cpu is freeing the RCU tasks trace won't be able to keep up and the list of objects would keep growing from thousands to millions and eventually OOMing. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Simplify code of destroy_mem_alloc() with kmemdup().Alexei Starovoitov1-5/+2
Use kmemdup() to simplify the code. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12bpf: Rename few bpf_mem_alloc fields.Alexei Starovoitov1-28/+29
Rename: - struct rcu_head rcu; - struct llist_head free_by_rcu; - struct llist_head waiting_for_gp; - atomic_t call_rcu_in_progress; + struct llist_head free_by_rcu_ttrace; + struct llist_head waiting_for_gp_ttrace; + struct rcu_head rcu_ttrace; + atomic_t call_rcu_ttrace_in_progress; ... - static void do_call_rcu(struct bpf_mem_cache *c) + static void do_call_rcu_ttrace(struct bpf_mem_cache *c) to better indicate intended use. The 'tasks trace' is shortened to 'ttrace' to reduce verbosity. No functional changes. Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-07-12Merge tag 'probes-fixes-v6.5-rc1' of ↵Linus Torvalds4-14/+40
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - Fix fprobe's rethook release issues: - Release rethook after ftrace_ops is unregistered so that the rethook is not accessed after free. - Stop rethook before ftrace_ops is unregistered so that the rethook is NOT used after exiting unregister_fprobe() - Fix eprobe cleanup logic. If it attaches to multiple events and failes to enable one of them, rollback all enabled events correctly. - Fix fprobe to unlock ftrace recursion lock correctly when it missed by another running kprobe. - Cleanup kprobe to remove unnecessary NULL. - Cleanup kprobe to remove unnecessary 0 initializations. * tag 'probes-fixes-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Ensure running fprobe_exit_handler() finished before calling rethook_free() kernel: kprobes: Remove unnecessary ‘0’ values kprobes: Remove unnecessary ‘NULL’ values from correct_ret_addr fprobe: add unlock to match a succeeded ftrace_test_recursion_trylock kernel/trace: Fix cleanup logic of enable_trace_eprobe fprobe: Release rethook after the ftrace_ops is unregistered
2023-07-12ring-buffer: Fix deadloop issue on reading trace_pipeZheng Yejian1-9/+15
Soft lockup occurs when reading file 'trace_pipe': watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [cat:4488] [...] RIP: 0010:ring_buffer_empty_cpu+0xed/0x170 RSP: 0018:ffff88810dd6fc48 EFLAGS: 00000246 RAX: 0000000000000000 RBX: 0000000000000246 RCX: ffffffff93d1aaeb RDX: ffff88810a280040 RSI: 0000000000000008 RDI: ffff88811164b218 RBP: ffff88811164b218 R08: 0000000000000000 R09: ffff88815156600f R10: ffffed102a2acc01 R11: 0000000000000001 R12: 0000000051651901 R13: 0000000000000000 R14: ffff888115e49500 R15: 0000000000000000 [...] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8d853c2000 CR3: 000000010dcd8000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __find_next_entry+0x1a8/0x4b0 ? peek_next_entry+0x250/0x250 ? down_write+0xa5/0x120 ? down_write_killable+0x130/0x130 trace_find_next_entry_inc+0x3b/0x1d0 tracing_read_pipe+0x423/0xae0 ? tracing_splice_read_pipe+0xcb0/0xcb0 vfs_read+0x16b/0x490 ksys_read+0x105/0x210 ? __ia32_sys_pwrite64+0x200/0x200 ? switch_fpu_return+0x108/0x220 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Through the vmcore, I found it's because in tracing_read_pipe(), ring_buffer_empty_cpu() found some buffer is not empty but then it cannot read anything due to "rb_num_of_entries() == 0" always true, Then it infinitely loop the procedure due to user buffer not been filled, see following code path: tracing_read_pipe() { ... ... waitagain: tracing_wait_pipe() // 1. find non-empty buffer here trace_find_next_entry_inc() // 2. loop here try to find an entry __find_next_entry() ring_buffer_empty_cpu(); // 3. find non-empty buffer peek_next_entry() // 4. but peek always return NULL ring_buffer_peek() rb_buffer_peek() rb_get_reader_page() // 5. because rb_num_of_entries() == 0 always true here // then return NULL // 6. user buffer not been filled so goto 'waitgain' // and eventually leads to an deadloop in kernel!!! } By some analyzing, I found that when resetting ringbuffer, the 'entries' of its pages are not all cleared (see rb_reset_cpu()). Then when reducing the ringbuffer, and if some reduced pages exist dirty 'entries' data, they will be added into 'cpu_buffer->overrun' (see rb_remove_pages()), which cause wrong 'overrun' count and eventually cause the deadloop issue. To fix it, we need to clear every pages in rb_reset_cpu(). Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Fixes: a5fb833172eca ("ring-buffer: Fix uninitialized read_stamp") Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12tracing: arm64: Avoid missing-prototype warningsArnd Bergmann3-2/+7
These are all tracing W=1 warnings in arm64 allmodconfig about missing prototypes: kernel/trace/trace_kprobe_selftest.c:7:5: error: no previous prototype for 'kprobe_trace_selftest_target' [-Werror=missing-pro totypes] kernel/trace/ftrace.c:329:5: error: no previous prototype for '__register_ftrace_function' [-Werror=missing-prototypes] kernel/trace/ftrace.c:372:5: error: no previous prototype for '__unregister_ftrace_function' [-Werror=missing-prototypes] kernel/trace/ftrace.c:4130:15: error: no previous prototype for 'arch_ftrace_match_adjust' [-Werror=missing-prototypes] kernel/trace/fgraph.c:243:15: error: no previous prototype for 'ftrace_return_to_handler' [-Werror=missing-prototypes] kernel/trace/fgraph.c:358:6: error: no previous prototype for 'ftrace_graph_sleep_time_control' [-Werror=missing-prototypes] arch/arm64/kernel/ftrace.c:460:6: error: no previous prototype for 'prepare_ftrace_return' [-Werror=missing-prototypes] arch/arm64/kernel/ptrace.c:2172:5: error: no previous prototype for 'syscall_trace_enter' [-Werror=missing-prototypes] arch/arm64/kernel/ptrace.c:2195:6: error: no previous prototype for 'syscall_trace_exit' [-Werror=missing-prototypes] Move the declarations to an appropriate header where they can be seen by the caller and callee, and make sure the headers are included where needed. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Will Deacon <[email protected]> Cc: Kees Cook <[email protected]> Cc: Florent Revest <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Catalin Marinas <[email protected]> [ Fixed ftrace_return_to_handler() to handle CONFIG_HAVE_FUNCTION_GRAPH_RETVAL case ] Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-12bpf: teach verifier actual bounds of bpf_get_smp_processor_id() resultAndrii Nakryiko1-11/+26
bpf_get_smp_processor_id() helper returns current CPU on which BPF program runs. It can't return value that is bigger than maximum allowed number of CPUs (minus one, due to zero indexing). Teach BPF verifier to recognize that. This makes it possible to use bpf_get_smp_processor_id() result to index into arrays without extra checks, as demonstrated in subsequent selftests/bpf patch. Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Support ->fill_link_info for perf_eventYafang Shao2-3/+153
By introducing support for ->fill_link_info to the perf_event link, users gain the ability to inspect it using `bpftool link show`. While the current approach involves accessing this information via `bpftool perf show`, consolidating link information for all link types in one place offers greater convenience. Additionally, this patch extends support to the generic perf event, which is not currently accommodated by `bpftool perf show`. While only the perf type and config are exposed to userspace, other attributes such as sample_period and sample_freq are ignored. It's important to note that if kptr_restrict is not permitted, the probed address will not be exposed, maintaining security measures. A new enum bpf_perf_event_type is introduced to help the user understand which struct is relevant. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Add a common helper bpf_copy_to_user()Yafang Shao1-14/+20
Add a common helper bpf_copy_to_user(), which will be used at multiple places. No functional change. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Expose symbol's respective addressYafang Shao1-12/+4
Since different symbols can share the same name, it is insufficient to only expose the symbol name. It is essential to also expose the symbol address so that users can accurately identify which one is being probed. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Clear the probe_addr for uprobeYafang Shao2-2/+3
To avoid returning uninitialized or random values when querying the file descriptor (fd) and accessing probe_addr, it is necessary to clear the variable prior to its use. Fixes: 41bdc4b40ed6 ("bpf: introduce bpf subcommand BPF_TASK_FD_QUERY") Signed-off-by: Yafang Shao <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Protect probed address based on kptr_restrict settingYafang Shao1-1/+4
The probed address can be accessed by userspace through querying the task file descriptor (fd). However, it is crucial to adhere to the kptr_restrict setting and refrain from exposing the address if it is not permitted. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: Support ->fill_link_info for kprobe_multiYafang Shao1-0/+37
With the addition of support for fill_link_info to the kprobe_multi link, users will gain the ability to inspect it conveniently using the `bpftool link show`. This enhancement provides valuable information to the user, including the count of probed functions and their respective addresses. It's important to note that if the kptr_restrict setting is not permitted, the probed address will not be exposed, ensuring security. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11bpf: cpumap: Fix memory leak in cpu_map_update_elemPu Lehui1-16/+24
Syzkaller reported a memory leak as follows: BUG: memory leak unreferenced object 0xff110001198ef748 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00 ....J........... 00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff ........(....... backtrace: [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff110001198ef528 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff1100010fd93d68 (size 8): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170 [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 In the cpu_map_update_elem flow, when kthread_stop is called before calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit of kthread has been set by kthread_stop, the threadfn of rcpu->kthread will never be executed, and rcpu->refcnt will never be 0, which will lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be released. Calling kthread_stop before executing kthread's threadfn will return -EINTR. We can complete the release of memory resources in this state. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Signed-off-by: Pu Lehui <[email protected]> Acked-by: Jesper Dangaard Brouer <[email protected]> Acked-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-07-11PM: QoS: Restore support for default value on frequency QoSChungkai Yang1-2/+7
Commit 8d36694245f2 ("PM: QoS: Add check to make sure CPU freq is non-negative") makes sure CPU freq is non-negative to avoid negative value converting to unsigned data type. However, when the value is PM_QOS_DEFAULT_VALUE, pm_qos_update_target specifically uses c->default_value which is set to FREQ_QOS_MIN/MAX_DEFAULT_VALUE when cpufreq_policy_alloc is executed, for this case handling. Adding check for PM_QOS_DEFAULT_VALUE to let default setting work will fix this problem. Fixes: 8d36694245f2 ("PM: QoS: Add check to make sure CPU freq is non-negative") Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/[email protected]/ Link: https://lore.kernel.org/lkml/CAJZ5v0gxNOWhC58PHeUhW_tgf6d1fGJVZ1x91zkDdht11yUv-A@mail.gmail.com/ Signed-off-by: Chungkai Yang <[email protected]> Cc: 6.0+ <[email protected]> # 6.0+ Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-07-11PM: hibernate: Fix writing maj:min to /sys/power/resumeAzat Khuzhin1-0/+1
resume_store() first calls lookup_bdev() and after tries to handle maj:min, but it does not reset the error before, hence if you will write maj:min you will get ENOENT: # echo 259:2 >| /sys/power/resume bash: echo: write error: No such file or directory This also should fix hiberation via systemd, since it uses this way. Fixes: 1e8c813b083c4 ("PM: hibernate: don't use early_lookup_bdev in resume_store") Signed-off-by: Azat Khuzhin <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-07-10tracing/user_events: Fix struct arg size match checkBeau Belgrave1-0/+3
When users register an event the name of the event and it's argument are checked to ensure they match if the event already exists. Normally all arguments are in the form of "type name", except for when the type starts with "struct ". In those cases, the size of the struct is passed in addition to the name, IE: "struct my_struct a 20" for an argument that is of type "struct my_struct" with a field name of "a" and has the size of 20 bytes. The current code does not honor the above case properly when comparing a match. This causes the event register to fail even when the same string was used for events that contain a struct argument within them. The example above "struct my_struct a 20" generates a match string of "struct my_struct a" omitting the size field. Add the struct size of the existing field when generating a comparison string for a struct field to ensure proper match checking. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: e6f89a149872 ("tracing/user_events: Ensure user provided strings are safely formatted") Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-07-11fprobe: Ensure running fprobe_exit_handler() finished before calling ↵Masami Hiramatsu (Google)2-0/+16
rethook_free() Ensure running fprobe_exit_handler() has finished before calling rethook_free() in the unregister_fprobe() so that caller can free the fprobe right after unregister_fprobe(). unregister_fprobe() ensured that all running fprobe_entry/exit_handler() have finished by calling unregister_ftrace_function() which synchronizes RCU. But commit 5f81018753df ("fprobe: Release rethook after the ftrace_ops is unregistered") changed to call rethook_free() after unregister_ftrace_function(). So call rethook_stop() to make rethook disabled before unregister_ftrace_function() and ensure it again. Here is the possible code flow that can call the exit handler after unregister_fprobe(). ------ CPU1 CPU2 call unregister_fprobe(fp) ... __fprobe_handler() rethook_hook() on probed function unregister_ftrace_function() return from probed function rethook hooks find rh->handler == fprobe_exit_handler call fprobe_exit_handler() rethook_free(): set rh->handler = NULL; return from unreigster_fprobe; call fp->exit_handler() <- (*) ------ (*) At this point, the exit handler is called after returning from unregister_fprobe(). This fixes it as following; ------ CPU1 CPU2 call unregister_fprobe() ... rethook_stop(): set rh->handler = NULL; __fprobe_handler() rethook_hook() on probed function unregister_ftrace_function() return from probed function rethook hooks find rh->handler == NULL return from rethook rethook_free() return from unreigster_fprobe; ------ Link: https://lore.kernel.org/all/168873859949.156157.13039240432299335849.stgit@devnote2/ Fixes: 5f81018753df ("fprobe: Release rethook after the ftrace_ops is unregistered") Cc: [email protected] Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]>
2023-07-11kernel: kprobes: Remove unnecessary ‘0’ valuesLi zeming1-3/+3
it is assigned first, so it does not need to initialize the assignment. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Li zeming <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-07-11kprobes: Remove unnecessary ‘NULL’ values from correct_ret_addrLi zeming1-1/+1
The 'correct_ret_addr' pointer is always set in the later code, no need to initialize it at definition time. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Li zeming <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-07-11fprobe: add unlock to match a succeeded ftrace_test_recursion_trylockZe Gao1-1/+3
Unlock ftrace recursion lock when fprobe_kprobe_handler() is failed because of some running kprobe. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 3cc4e2c5fbae ("fprobe: make fprobe_kprobe_handler recursion free") Reported-by: Yafang <[email protected]> Closes: https://lore.kernel.org/linux-trace-kernel/CALOAHbC6UpfFOOibdDiC7xFc5YFUgZnk3MZ=3Ny6we=AcrNbew@mail.gmail.com/ Signed-off-by: Ze Gao <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Yafang Shao <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-07-11kernel/trace: Fix cleanup logic of enable_trace_eprobeTzvetomir Stoyanov (VMware)1-2/+16
The enable_trace_eprobe() function enables all event probes, attached to given trace probe. If an error occurs in enabling one of the event probes, all others should be roll backed. There is a bug in that roll back logic - instead of all event probes, only the failed one is disabled. Link: https://lore.kernel.org/all/[email protected]/ Reported-by: Dan Carpenter <[email protected]> Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Tzvetomir Stoyanov (VMware) <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2023-07-10sched/psi: use kernfs polling functions for PSI trigger pollingSuren Baghdasaryan2-9/+22
Destroying psi trigger in cgroup_file_release causes UAF issues when a cgroup is removed from under a polling process. This is happening because cgroup removal causes a call to cgroup_file_release while the actual file is still alive. Destroying the trigger at this point would also destroy its waitqueue head and if there is still a polling process on that file accessing the waitqueue, it will step on the freed pointer: do_select vfs_poll do_rmdir cgroup_rmdir kernfs_drain_open_files cgroup_file_release cgroup_pressure_release psi_trigger_destroy wake_up_pollfree(&t->event_wait) // vfs_poll is unblocked synchronize_rcu kfree(t) poll_freewait -> UAF access to the trigger's waitqueue head Patch [1] fixed this issue for epoll() case using wake_up_pollfree(), however the same issue exists for synchronous poll() case. The root cause of this issue is that the lifecycles of the psi trigger's waitqueue and of the file associated with the trigger are different. Fix this by using kernfs_generic_poll function when polling on cgroup-specific psi triggers. It internally uses kernfs_open_node->poll waitqueue head with its lifecycle tied to the file's lifecycle. This also renders the fix in [1] obsolete, so revert it. [1] commit c2dbe32d5db5 ("sched/psi: Fix use-after-free in ep_remove_wait_queue()") Fixes: 0e94682b73bf ("psi: introduce psi monitor") Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Lu Jialin <[email protected]> Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-07-10sched/fair: Use recent_used_cpu to test p->cpus_ptrMiaohe Lin1-1/+1
When checking whether a recently used CPU can be a potential idle candidate, recent_used_cpu should be used to test p->cpus_ptr as p->recent_used_cpu is not equal to recent_used_cpu and candidate decision is made based on recent_used_cpu here. Fixes: 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu") Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Phil Auld <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-07-09Merge tag 'dma-mapping-6.5-2023-07-09' of ↵Linus Torvalds1-11/+35
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - swiotlb area sizing fixes (Petr Tesarik) * tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: reduce the number of areas to match actual memory pool size swiotlb: always set the number of areas before allocating the pool
2023-07-09Merge tag 'irq_urgent_for_v6.5_rc1' of ↵Linus Torvalds1-3/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq update from Borislav Petkov: - Optimize IRQ domain's name assignment * tag 'irq_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqdomain: Use return value of strreplace()
2023-07-08fork: lock VMAs of the parent process when forkingSuren Baghdasaryan1-0/+1
When forking a child process, the parent write-protects anonymous pages and COW-shares them with the child being forked using copy_present_pte(). We must not take any concurrent page faults on the source vma's as they are being processed, as we expect both the vma and the pte's behind it to be stable. For example, the anon_vma_fork() expects the parents vma->anon_vma to not change during the vma copy. A concurrent page fault on a page newly marked read-only by the page copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the source vma, defeating the anon_vma_clone() that wasn't done because the parent vma originally didn't have an anon_vma, but we now might end up copying a pte entry for a page that has one. Before the per-vma lock based changes, the mmap_lock guaranteed exclusion with concurrent page faults. But now we need to do a vma_start_write() to make sure no concurrent faults happen on this vma while it is being processed. This fix can potentially regress some fork-heavy workloads. Kernel build time did not show noticeable regression on a 56-core machine while a stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows ~5% regression. If such fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic. Suggested-by: David Hildenbrand <[email protected]> Reported-by: Jiri Slaby <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Holger Hoffstätte <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: Jacob Young <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") Cc: [email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>