path: root/kernel
Age | Commit message | Author | Files | Lines
2024-05-08  kernel/watchdog_perf.c: tidy up kerneldoc  (Andrew Morton, 1 file, -3/+0)
It is unconventional to have a blank line between name-of-function and description-of-args. Cc: Peter Zijlstra <[email protected]> Cc: Song Liu <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Cc: Ryusuke Konishi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-05-08  watchdog: allow nmi watchdog to use raw perf event  (Song Liu, 2 files, -0/+48)
NMI watchdog permanently consumes one hardware counter per CPU on the system. For systems that use many hardware counters, this causes more aggressive time multiplexing of perf events. OTOH, some CPUs (mostly Intel) support the "ref-cycles" event, which is rarely used. Add kernel cmdline arg nmi_watchdog=rNNN to configure the watchdog to use a raw event. For example, on Intel CPUs, we can use "r300" to configure the watchdog to use the ref-cycles event. If the raw event does not work, fall back to using "cycles". [[email protected]: fix kerneldoc] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Matthew Wilcox (Oracle)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
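For illustration, a minimal sketch of how an "rNNN" string could be parsed into a raw perf config; the identifiers below are assumptions made for the example, not names taken from the actual watchdog code:

    /* Hypothetical parser: accepts "r<hex>" (e.g. "r300"); on failure the
     * caller keeps using the default "cycles" event. */
    static unsigned long long watchdog_raw_event;   /* assumed storage */

    static int parse_nmi_watchdog_event(const char *str)
    {
            unsigned long long config;

            if (str[0] != 'r' || kstrtoull(str + 1, 16, &config))
                    return -EINVAL;

            watchdog_raw_event = config;    /* later used as a PERF_TYPE_RAW config */
            return 0;
    }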
2024-05-08  watchdog: handle comma separated nmi_watchdog command line  (Song Liu, 1 file, -0/+7)
Per the documentation, the kernel can accept a comma separated command line like nmi_watchdog=nopanic,0. However, the code doesn't really handle it. Fix the kernel to handle it properly. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Song Liu <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
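A hedged sketch of the comma handling described above, using strsep(); the watchdog variables are recalled from kernel/watchdog.c and may differ, and the setup helper itself is illustrative:

    /* Illustrative only: split "nopanic,0" style input and act on each token. */
    static int __init sample_nmi_watchdog_setup(char *str)
    {
            char *tok;

            while ((tok = strsep(&str, ",")) != NULL) {
                    if (!strcmp(tok, "nopanic"))
                            hardlockup_panic = 0;           /* assumed existing knob */
                    else if (!strcmp(tok, "0"))
                            nmi_watchdog_user_enabled = 0;  /* assumed existing knob */
                    else if (!strcmp(tok, "1"))
                            nmi_watchdog_user_enabled = 1;
            }
            return 1;
    }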
2024-05-08  crash: add prefix for crash dumping messages  (Baoquan He, 2 files, -2/+4)
Add pr_fmt() to kernel/crash_core.c so that the module name is printed as a prefix on its debugging messages. Also add the 'crashkernel:' prefix to two message-printing lines in kernel/crash_reserve.c. In kernel/crash_reserve.c, almost all debugging messages already carry the 'crashkernel:' prefix or contain the keyword crashkernel at the beginning or in the middle, so adding pr_fmt() there would be redundant. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Cc: Dave Young <[email protected]> Cc: Jiri Slaby <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
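For reference, the pr_fmt() mechanism works roughly as below; treat the "crash_core: " prefix string as a placeholder rather than the exact string used by the patch:

    /* Must be defined before the printk headers are included, typically at
     * the very top of the .c file. */
    #define pr_fmt(fmt) "crash_core: " fmt

    #include <linux/printk.h>

    /* later, inside some function: */
    pr_debug("reserving %lu bytes\n", size);
    /* is emitted as: "crash_core: reserving <size> bytes" */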
2024-05-08  timers/migration: Prevent out of bounds access on failure  (Levi Yun, 1 file, -2/+2)
When tmigr_setup_groups() fails the level 0 group allocation, the cleanup dereferences index -1 of the local stack array. Prevent this by checking the loop condition first. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Levi Yun <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Anna-Maria Behnsen <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-05-07  bpf: Remove redundant page mask of vmf->address  (Haiyue Wang, 1 file, -1/+1)
As the comment described in "struct vm_fault": ".address" : 'Faulting virtual address - masked' ".real_address" : 'Faulting virtual address - unmasked' The link [1] says: "Whatever the routes, all architectures end up to the invocation of handle_mm_fault() which, in turn, (likely) ends up calling __handle_mm_fault() to carry out the actual work of allocating the page tables." __handle_mm_fault() does the address assignment: .address = address & PAGE_MASK, .real_address = address, This is a debug dump from running `./test_progs -a "*arena*"`:

  [ 69.767494] arena fault: vmf->address = 10000001d000, vmf->real_address = 10000001d008
  [ 69.767496] arena fault: vmf->address = 10000001c000, vmf->real_address = 10000001c008
  [ 69.767499] arena fault: vmf->address = 10000001b000, vmf->real_address = 10000001b008
  [ 69.767501] arena fault: vmf->address = 10000001a000, vmf->real_address = 10000001a008
  [ 69.767504] arena fault: vmf->address = 100000019000, vmf->real_address = 100000019008
  [ 69.769388] arena fault: vmf->address = 10000001e000, vmf->real_address = 10000001e1e8

So we can use the value of 'vmf->address' to do the BPF arena kernel address space cast directly. [1] https://docs.kernel.org/mm/page_tables.html Signed-off-by: Haiyue Wang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-05-07  kcsan, compiler_types: Introduce __data_racy type qualifier  (Marco Elver, 1 file, -0/+17)
Based on the discussion at [1], it would be helpful to mark certain variables as explicitly "data racy", which would result in KCSAN not reporting data races involving any accesses on such variables. To do that, introduce the __data_racy type qualifier: struct foo { ... int __data_racy bar; ... }; In KCSAN kernels, __data_racy turns into volatile, which KCSAN already treats specially by considering such accesses "marked". In non-KCSAN kernels the type qualifier turns into a no-op. The difference in generated code between KCSAN-instrumented kernels and non-KCSAN kernels is already huge (calls into the runtime are inserted for every memory access), so the extra generated code (if any) due to volatile for a few such __data_racy variables is unlikely to have a measurable impact on performance. Link: https://lore.kernel.org/all/CAHk-=wi3iondeh_9V2g3Qz5oHTRjLsOpoy83hb58MVh=nRZe0A@mail.gmail.com/ [1] Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Marco Elver <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Tetsuo Handa <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
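A sketch of how such a qualifier can be wired up in compiler_types.h, mirroring the behaviour described above (volatile under KCSAN, no-op otherwise); this is not a verbatim copy of the patch:

    #ifdef __SANITIZE_THREAD__
    /* KCSAN-instrumented build: volatile accesses are treated as "marked". */
    # define __data_racy   volatile
    #else
    /* Non-KCSAN build: the qualifier compiles away entirely. */
    # define __data_racy
    #endif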
2024-05-07  Merge tag 'kvm-riscv-6.10-1' of https://github.com/kvm-riscv/linux into HEAD  (Paolo Bonzini, 4 files, -16/+58)
KVM/riscv changes for 6.10 - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak
2024-05-07  dma: avoid redundant calls for sync operations  (Alexander Lobakin, 2 files, -12/+49)
Quite often, devices do not need dma_sync operations, on x86_64 at least. Indeed, when dev_is_dma_coherent(dev) is true and dev_use_swiotlb(dev) is false, iommu_dma_sync_single_for_cpu() and friends do nothing. However, indirectly calling them when CONFIG_RETPOLINE=y consumes about 10% of cycles on a CPU receiving packets from softirq at ~100Gbit rate. Even if/when CONFIG_RETPOLINE is not set, there is a cost of about 3%. Add a dev->need_dma_sync boolean and turn it off during device initialization (dma_set_mask()) depending on the setup: dev_is_dma_coherent() for direct DMA, !(sync_single_for_device || sync_single_for_cpu), or the new dma_map_ops flag, %DMA_F_CAN_SKIP_SYNC, advertised for non-NULL DMA ops. Then later, if/when swiotlb is used for the first time, the flag is reset back to on, from swiotlb_tbl_map_single(). On iavf, the UDP trafficgen with XDP_DROP in skb mode test shows a +3-5% increase for direct DMA. Suggested-by: Christoph Hellwig <[email protected]> # direct DMA shortcut Co-developed-by: Eric Dumazet <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
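Conceptually, the resulting fast path looks like the sketch below; the field and helper names follow the commit text and are not guaranteed to match the final mainline naming:

    /* Hedged sketch: skip the indirect dma_map_ops call entirely when the
     * device was determined at init time to never need syncs. */
    static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                                               size_t size, enum dma_data_direction dir)
    {
            if (READ_ONCE(dev->need_dma_sync))      /* turned back on by swiotlb */
                    __dma_sync_single_for_cpu(dev, addr, size, dir);
    }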
2024-05-07  dma: compile-out DMA sync op calls when not used  (Alexander Lobakin, 2 files, -10/+17)
Some platforms do have DMA, but DMA there is always direct and coherent. Currently, even on such platforms DMA sync operations are compiled and called. Add a new hidden Kconfig symbol, DMA_NEED_SYNC, and set it only when either sync operations are needed, or there are DMA ops, or swiotlb or DMA debug is enabled. Compile the global dma_sync_*() and dma_need_sync() only when it's set, otherwise provide empty inline stubs. The change allows for future optimizations of DMA sync calls depending on runtime conditions. Signed-off-by: Alexander Lobakin <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-05-07  swiotlb: remove alloc_size argument to swiotlb_tbl_map_single()  (Michael Kelley, 1 file, -14/+42)
Currently swiotlb_tbl_map_single() takes alloc_align_mask and alloc_size arguments to specify an swiotlb allocation that is larger than mapping_size. This larger allocation is used solely by iommu_dma_map_single() to handle untrusted devices that should not have DMA visibility to memory pages that are partially used for unrelated kernel data. Having two arguments to specify the allocation is redundant. While alloc_align_mask naturally specifies the alignment of the starting address of the allocation, it can also implicitly specify the size by rounding up the mapping_size to that alignment. Additionally, the current approach has an edge case bug. iommu_dma_map_page() already does the rounding up to compute the alloc_size argument. But swiotlb_tbl_map_single() then calculates the alignment offset based on the DMA min_align_mask, and adds that offset to alloc_size. If the offset is non-zero, the addition may result in a value that is larger than the max the swiotlb can allocate. If the rounding up is done _after_ the alignment offset is added to the mapping_size (and the original mapping_size conforms to the value returned by swiotlb_max_mapping_size), then the max that the swiotlb can allocate will not be exceeded. In view of these issues, simplify the swiotlb_tbl_map_single() interface by removing the alloc_size argument. Most call sites pass the same value for mapping_size and alloc_size, and they pass alloc_align_mask as zero. Just remove the redundant argument from these callers, as they will see no functional change. For iommu_dma_map_page() also remove the alloc_size argument, and have swiotlb_tbl_map_single() compute the alloc_size by rounding up mapping_size after adding the offset based on min_align_mask. This has the side effect of fixing the edge case bug but with no other functional change. Also add a sanity test on the alloc_align_mask. While IOMMU code currently ensures the granule is not larger than PAGE_SIZE, if that guarantee were to be removed in the future, the downstream effect on the swiotlb might go unnoticed until strange allocation failures occurred. Tested on an ARM64 system with 16K page size and some kernel test-only hackery to allow modifying the DMA min_align_mask and the granule size that becomes the alloc_align_mask. Tested these combinations with a variety of original memory addresses and sizes, including those that reproduce the edge case bug: * 4K granule and 0 min_align_mask * 4K granule and 0xFFF min_align_mask (4K - 1) * 16K granule and 0xFFF min_align_mask * 64K granule and 0xFFF min_align_mask * 64K granule and 0x3FFF min_align_mask (16K - 1) With the changes, all combinations pass. Signed-off-by: Michael Kelley <[email protected]> Reviewed-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
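The new sizing computation amounts to roughly the following sketch (variable names are illustrative); the round-up happens only after the min_align_mask offset has been added, so the swiotlb maximum cannot be exceeded:

    /* Inside swiotlb_tbl_map_single(), approximately: */
    unsigned int offset = orig_addr & dma_get_min_align_mask(dev);
    size_t alloc_size   = ALIGN(mapping_size + offset, alloc_align_mask + 1);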
2024-05-07  printk: cleanup deprecated uses of strncpy/strcpy  (Justin Stitt, 1 file, -13/+13)
Cleanup some deprecated uses of strncpy() and strcpy() [1]. There don't seem to be any bugs with the current code, but the readability of this code could benefit from a quick makeover while removing some deprecated stuff as a benefit. The most interesting replacement made in this patch involves concatenating "ttyS" with a digit-led user-supplied string. Instead of doing two distinct string copies with carefully managed offsets and lengths, let's use the more robust and self-explanatory scnprintf(). scnprintf() will 1) respect the bounds of @buf, 2) NUL-terminate @buf, 3) do the concatenation. This allows us to drop the manual NUL-byte assignment. Also, since isdigit() is used about a dozen lines after the open-coded version, we'll replace it for uniformity's sake. All the strcpy() --> strscpy() replacements are trivial as the source strings are literals and much smaller than the destination size. No behavioral change here. Use the new 2-argument version of strscpy() introduced in commit e6584c3964f2f ("string: Allow 2-argument strscpy()"). However, to make this work fully (since the size must be known at compile time), also update the extern-qualified declaration to have the proper size information. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://github.com/KSPP/linux/issues/90 [2] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [3] Cc: [email protected] Signed-off-by: Justin Stitt <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/20240429-strncpy-kernel-printk-printk-c-v1-1-4da7926d7b69@google.com [[email protected]: Removed obsolete brackets and added empty lines.] Signed-off-by: Petr Mladek <[email protected]>
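The scnprintf() form described above is, in outline (buffer and variable names are assumed for the example, not taken from the patch):

    char ttyname[16];

    if (isdigit(*devname))
            /* bounds-checked, NUL-terminated, and concatenated in one call */
            scnprintf(ttyname, sizeof(ttyname), "ttyS%s", devname);
    else
            strscpy(ttyname, devname);  /* 2-argument strscpy(): size comes from the array */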
2024-05-06  bpf/verifier: relax MUL range computation check  (Cupertino Miranda, 1 file, -5/+1)
The MUL instruction required that src_reg be a known value (i.e. a constant). This condition can be relaxed, since the range computation algorithm used in the current code already supports proper range computation for any valid range value on its operands. Signed-off-by: Cupertino Miranda <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Cc: Yonghong Song <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: David Faust <[email protected]> Cc: Jose Marchesi <[email protected]> Cc: Elena Zannoni <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-05-06  bpf/verifier: improve XOR and OR range computation  (Cupertino Miranda, 1 file, -2/+2)
Range computation for the XOR and OR operators was not attempted unless src_reg resolved to a single value, i.e. a known constant value. This condition is unnecessary, and the following XOR/OR operator handling can compute a possibly better range. Signed-off-by: Cupertino Miranda <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Cc: Yonghong Song <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: David Faust <[email protected]> Cc: Jose Marchesi <[email protected]> Cc: Elena Zannoni <[email protected]> Cc: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-05-06  bpf/verifier: refactor checks for range computation  (Cupertino Miranda, 1 file, -64/+45)
Split the range computation checks into their own function, isolating the pessimistic range setting for dst_reg and the failing return to a single point. Signed-off-by: Cupertino Miranda <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Cc: Yonghong Song <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: David Faust <[email protected]> Cc: Jose Marchesi <[email protected]> Cc: Elena Zannoni <[email protected]> Cc: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-05-06  bpf/verifier: replace calls to mark_reg_unknown.  (Cupertino Miranda, 1 file, -5/+4)
In order to further simplify the code in adjust_scalar_min_max_vals all the calls to mark_reg_unknown are replaced by __mark_reg_unknown.

    static void mark_reg_unknown(struct bpf_verifier_env *env,
                                 struct bpf_reg_state *regs, u32 regno)
    {
            if (WARN_ON(regno >= MAX_BPF_REG)) {
                    ... mark all regs not init ...
                    return;
            }
            __mark_reg_unknown(env, regs + regno);
    }

The 'regno >= MAX_BPF_REG' does not apply to adjust_scalar_min_max_vals(), because it is only called from the following stack:

  - check_alu_op
  - adjust_reg_min_max_vals
  - adjust_scalar_min_max_vals

The check_alu_op() does check_reg_arg() which verifies that both src and dst register numbers are within bounds. Signed-off-by: Cupertino Miranda <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Cc: Yonghong Song <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: David Faust <[email protected]> Cc: Jose Marchesi <[email protected]> Cc: Elena Zannoni <[email protected]> Cc: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-05-06  kunit: Handle test faults  (Mickaël Salaün, 1 file, -0/+1)
Previously, when a kernel test thread crashed (e.g. NULL pointer dereference, general protection fault), the KUnit test hung for 30 seconds and exited with a timeout error. Fix this issue by waiting on task_struct->vfork_done instead of the custom kunit_try_catch.try_completion, and track the execution state by initially setting try_result with -EINTR and only setting it to 0 if the test passed. Fix the kunit_generic_run_threadfn_adapter() signature by returning 0 instead of calling kthread_complete_and_exit(). Because the thread's exit code is never checked, always set it to 0 to make it clear. To make this explicit, export kthread_exit() for KUnit tests built as a module. Fix the -EINTR error message, which couldn't be reached until now. This is tested with a following patch. Cc: Brendan Higgins <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Shuah Khan <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: David Gow <[email protected]> Tested-by: Rae Moar <[email protected]> Signed-off-by: Mickaël Salaün <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Shuah Khan <[email protected]>
2024-05-06  printk: Change type of CONFIG_BASE_SMALL to bool  (Yoann Congal, 2 files, -2/+2)
CONFIG_BASE_SMALL is currently of type int but is only used as a boolean. So, change its type to bool and adapt all usages: CONFIG_BASE_SMALL == 0 becomes !IS_ENABLED(CONFIG_BASE_SMALL) and CONFIG_BASE_SMALL != 0 becomes IS_ENABLED(CONFIG_BASE_SMALL). Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Masahiro Yamada <[email protected]> Signed-off-by: Yoann Congal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
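As an example of the conversion pattern, shown on a well-known definition (not necessarily a line touched by this particular patch):

    /* Before: CONFIG_BASE_SMALL was an int and was used directly in expressions. */
    #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)

    /* After: the symbol is a bool, so it is tested with IS_ENABLED(). */
    #define PID_MAX_DEFAULT (IS_ENABLED(CONFIG_BASE_SMALL) ? 0x1000 : 0x8000)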
2024-05-06  powerpc/dexcr: Add DEXCR prctl interface  (Benjamin Gray, 1 file, -0/+16)
Now that we track a DEXCR on a per-task basis, individual tasks are free to configure it as they like. The interface is a pair of getter/setter prctl's that work on a single aspect at a time (multiple aspects at once is more difficult if there are different rules applied for each aspect, now or in future). The getter shows the current state of the process config, and the setter allows setting/clearing the aspect. Signed-off-by: Benjamin Gray <[email protected]> [mpe: Account for PR_RISCV_SET_ICACHE_FLUSH_CTX, shrink some longs lines] Signed-off-by: Michael Ellerman <[email protected]> Link: https://msgid.link/[email protected]
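From userspace the getter/setter pair would be used along these lines; the prctl() call itself is standard, but the PR_PPC_* constant names below are assumptions based on the description above and may differ from the merged ABI:

    #include <sys/prctl.h>

    /* Placeholder constant names, not verified against the final uapi header. */
    long ctl = prctl(PR_PPC_GET_DEXCR, PR_PPC_DEXCR_NPHIE, 0, 0, 0);
    if (ctl >= 0 && (ctl & PR_PPC_DEXCR_CTRL_EDITABLE))
            prctl(PR_PPC_SET_DEXCR, PR_PPC_DEXCR_NPHIE, PR_PPC_DEXCR_CTRL_SET, 0, 0);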
2024-05-05  Merge tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 1 file, -4/+8)
Pull irq fix from Ingo Molnar: "Fix suspicious RCU usage in __do_softirq()"
* tag 'irq-urgent-2024-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  softirq: Fix suspicious RCU usage in __do_softirq()
2024-05-05  Merge tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds, 1 file, -1/+1)
Pull probes fix from Masami Hiramatsu:
 - probe-events: Fix memory leak in parsing probe argument. There is a memory leak (forget to free an allocated buffer) in a memory allocation failure path. Fix it to jump to the correct error handling code.
* tag 'probes-fixes-v6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
2024-05-05  Merge tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds, 1 file, -0/+12)
Pull tracing and tracefs fixes from Steven Rostedt:
 - Fix RCU callback of freeing an eventfs_inode. The freeing of the eventfs_inode from the kref going to zero freed the contents of the eventfs_inode and then used kfree_rcu() to free the inode itself. But the contents should also be protected by RCU. Switch to a call_rcu() that calls a function to free all of the eventfs_inode after the RCU synchronization.
 - The tracing subsystem maps its own descriptor to a file represented by eventfs. The freeing of this descriptor needs to know when the last reference of an eventfs_inode is released, but currently there is no interface for that. Add a "release" callback to the eventfs_inode entry array that allows for freeing of data that can be referenced by the eventfs_inode being opened. Then increment the ref counter for this descriptor when the eventfs_inode file is created, and decrement/free it when the last reference to the eventfs_inode is released and the file is removed. This prevents races between freeing the descriptor and the opening of the eventfs file.
 - Fix the permission processing of eventfs. The change to make the permissions of eventfs default to the mount point but keep track of when changes were made had a side effect that could cause security concerns. When the tracefs is remounted with a given gid or uid, all the files within it should inherit that gid or uid. But if the admin had changed the permission of some file within the tracefs file system, it would not get updated by the remount. This caused the kselftest of file permissions to fail the second time it is run. The first time, all changes would look fine, but the second time, because the changes were "saved", the remount did not reset them. Create a link list of all existing tracefs inodes, and clear the saved flags on them on a remount if the remount changes the corresponding gid or uid fields. This also simplifies the code by removing the distinction between the toplevel eventfs and an instance eventfs. They should both act the same. They were different because of a misconception due to the remount not resetting the flags. Now that remount resets all the files and directories to default to the root node if a uid/gid is specified, it makes the logic simpler to implement.
* tag 'trace-v6.9-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  eventfs: Have "events" directory get permissions from its parent
  eventfs: Do not treat events directory different than other directories
  eventfs: Do not differentiate the toplevel events directory
  tracefs: Still use mount point as default permissions for instances
  tracefs: Reset permissions on remount if permissions are options
  eventfs: Free all of the eventfs_inode after RCU
  eventfs/tracing: Add callback for release of an eventfs_inode
2024-05-05  Merge tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds, 1 file, -0/+1)
Pull dma-mapping fix from Christoph Hellwig:
 - fix the combination of restricted pools and dynamic swiotlb (Will Deacon)
* tag 'dma-mapping-6.9-2024-05-04' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y
2024-05-04  eventfs/tracing: Add callback for release of an eventfs_inode  (Steven Rostedt (Google), 1 file, -0/+12)
Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing:

Script 'A':
  echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
  while :
  do
    echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
  done

Script 'B':
  echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has:

  {
    struct trace_event_file *file = inode->i_private;
    int ret;

    ret = tracing_check_open_get_tr(file->tr);
  [..]

But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows the creation of the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while an eventfs file that references it is being opened. Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu <[email protected]> Tested-by: Tze-nan Wu (吳澤南) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-05-03  stackleak: Use a copy of the ctl_table argument  (Thomas Weißschuh, 1 file, -3/+3)
Sysctl handlers are not supposed to modify the ctl_table passed to them. Adapt the logic to work with a temporary variable, similar to how it is done in other parts of the kernel. This is also a prerequisite to enforce the immutability of the argument through the callbacks. Reviewed-by: Luis Chamberlain <[email protected]> Signed-off-by: Thomas Weißschuh <[email protected]> Reviewed-by: Tycho Andersen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
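The pattern is the usual "operate on a local copy" idiom, sketched below; the real handler in kernel/stackleak.c may differ in detail:

    static int stack_erasing_sysctl(struct ctl_table *table, int write,
                                    void *buffer, size_t *lenp, loff_t *ppos)
    {
            struct ctl_table table_copy = *table;   /* never write to *table */
            int ret, state = 0;     /* 'state' would be seeded from the current setting */

            table_copy.data = &state;
            ret = proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos);
            /* ... act on 'state' only when this was a successful write ... */
            return ret;
    }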
2024-05-02  swsusp: don't bother with setting block size  (Al Viro, 1 file, -6/+1)
same as with the swap... Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Signed-off-by: Al Viro <[email protected]>
2024-05-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski, 9 files, -72/+105)
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/filter.h kernel/bpf/core.c 66e13b615a0c ("bpf: verifier: prevent userspace memory access") d503a04f8bc0 ("bpf: Add support for certain atomics in bpf_arena to x86 JIT") https://lore.kernel.org/all/[email protected]/ No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2024-05-02  Merge tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds, 2 files, -2/+40)
Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf. Relatively calm week, likely due to public holiday in most places. No known outstanding regressions.
  Current release - regressions:
   - rxrpc: fix wrong alignmask in __page_frag_alloc_align()
   - eth: e1000e: change usleep_range to udelay in PHY mdic access
  Previous releases - regressions:
   - gro: fix udp bad offset in socket lookup
   - bpf: fix incorrect runtime stat for arm64
   - tipc: fix UAF in error path
   - netfs: fix a potential infinite loop in extract_user_to_sg()
   - eth: ice: ensure the copied buf is NUL terminated
   - eth: qeth: fix kernel panic after setting hsuid
  Previous releases - always broken:
   - bpf:
      - verifier: prevent userspace memory access
      - xdp: use flags field to disambiguate broadcast redirect
   - bridge: fix multicast-to-unicast with fraglist GSO
   - mptcp: ensure snd_nxt is properly initialized on connect
   - nsh: fix outer header access in nsh_gso_segment().
   - eth: bcmgenet: fix racing registers access
   - eth: vxlan: fix stats counters.
  Misc:
   - a bunch of MAINTAINERS file updates"
* tag 'net-6.9-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
  MAINTAINERS: mark MYRICOM MYRI-10G as Orphan
  MAINTAINERS: remove Ariel Elior
  net: gro: add flush check in udp_gro_receive_segment
  net: gro: fix udp bad offset in socket lookup by adding {inner_}network_offset to napi_gro_cb
  ipv4: Fix uninit-value access in __ip_make_skb()
  s390/qeth: Fix kernel panic after setting hsuid
  vxlan: Pull inner IP header in vxlan_rcv().
  tipc: fix a possible memleak in tipc_buf_append
  tipc: fix UAF in error path
  rxrpc: Clients must accept conn from any address
  net: core: reject skb_copy(_expand) for fraglist GSO skbs
  net: bridge: fix multicast-to-unicast with fraglist GSO
  mptcp: ensure snd_nxt is properly initialized on connect
  e1000e: change usleep_range to udelay in PHY mdic access
  net: dsa: mv88e6xxx: Fix number of databases for 88E6141 / 88E6341
  cxgb4: Properly lock TX queue for the selftest.
  rxrpc: Fix using alignmask being zero for __page_frag_alloc_align()
  vxlan: Add missing VNI filter counter update in arp_reduce().
  vxlan: Fix racy device stats updates.
  net: qede: use return from qede_parse_actions()
  ...
2024-05-02  swiotlb: initialise restricted pool list_head when SWIOTLB_DYNAMIC=y  (Will Deacon, 1 file, -0/+1)
Using restricted DMA pools (CONFIG_DMA_RESTRICTED_POOL=y) in conjunction with dynamic SWIOTLB (CONFIG_SWIOTLB_DYNAMIC=y) leads to the following crash when initialising the restricted pools at boot-time:

  | Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
  | pc : rmem_swiotlb_device_init+0xfc/0x1ec
  | lr : rmem_swiotlb_device_init+0xf0/0x1ec
  | Call trace:
  |  rmem_swiotlb_device_init+0xfc/0x1ec
  |  of_reserved_mem_device_init_by_idx+0x18c/0x238
  |  of_dma_configure_id+0x31c/0x33c
  |  platform_dma_configure+0x34/0x80

faddr2line reveals that the crash is in the list validation code:

  include/linux/list.h:83
  include/linux/rculist.h:79
  include/linux/rculist.h:106
  kernel/dma/swiotlb.c:306
  kernel/dma/swiotlb.c:1695

because add_mem_pool() is trying to list_add_rcu() to a NULL 'mem->pools'. Fix the crash by initialising the 'mem->pools' list_head in rmem_swiotlb_device_init() before calling add_mem_pool(). Reported-by: Nikita Ioffe <[email protected]> Tested-by: Nikita Ioffe <[email protected]> Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full") Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
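The fix boils down to initialising the list head before the first list_add_rcu() on it, roughly:

    /* In rmem_swiotlb_device_init(), before publishing the pool: */
    INIT_LIST_HEAD_RCU(&mem->pools);    /* previously left uninitialised */
    add_mem_pool(mem, pool);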
2024-05-02  vmlinux: Avoid weak reference to notes section  (Ard Biesheuvel, 1 file, -2/+2)
Weak references are references that are permitted to remain unsatisfied in the final link. This means they cannot be implemented using place relative relocations, resulting in GOT entries when using position independent code generation. The notes section should always exist, so the weak annotations can be omitted. Acked-by: Arnd Bergmann <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2024-05-02  kallsyms: Avoid weak references for kallsyms symbols  (Ard Biesheuvel, 2 files, -24/+12)
kallsyms is a directory of all the symbols in the vmlinux binary, and so creating it is somewhat of a chicken-and-egg problem, as its non-zero size affects the layout of the binary, and therefore the values of the symbols. For this reason, the kernel is linked more than once, and the first pass does not include any kallsyms data at all. For the linker to accept this, the symbol declarations describing the kallsyms metadata are emitted as having weak linkage, so they can remain unsatisfied. During the subsequent passes, the weak references are satisfied by the kallsyms metadata that was constructed based on information gathered from the preceding passes. Weak references lead to somewhat worse codegen, because taking their address may need to produce NULL (if the reference was unsatisfied), and this is not usually supported by RIP or PC relative symbol references. Given that these references are ultimately always satisfied in the final link, let's drop the weak annotation, and instead, provide fallback definitions in the linker script that are only emitted if an unsatisfied reference exists. While at it, drop the FRV specific annotation that these symbols reside in .rodata - FRV is long gone. Tested-by: Nick Desaulniers <[email protected]> # Boot Reviewed-by: Nick Desaulniers <[email protected]> Reviewed-by: Kees Cook <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Link: https://lkml.kernel.org/r/20230504174320.3930345-1-ardb%40kernel.org Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Masahiro Yamada <[email protected]>
2024-05-01  bpf: crypto: fix build when CONFIG_CRYPTO=m  (Vadim Fedorenko, 1 file, -1/+1)
The crypto subsystem can be built as a module. In this case we still have to build the BPF crypto framework, otherwise the build will fail. Fixes: 3e1c6f35409f ("bpf: make common crypto API for TC/XDP programs") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Vadim Fedorenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2024-05-01  hardening: Enable KCFI and some other options  (Kees Cook, 1 file, -0/+8)
Add some stuff that got missed along the way: - CONFIG_UNWIND_PATCH_PAC_INTO_SCS=y so SCS vs PAC is hardware selectable. - CONFIG_X86_KERNEL_IBT=y while a default, just be sure. - CONFIG_CFI_CLANG=y globally. - CONFIG_PAGE_TABLE_CHECK=y for userspace mapping sanity. Reviewed-by: Nathan Chancellor <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2024-05-01  rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()  (Andrii Nakryiko, 1 file, -0/+2)
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. One notable exception where we force the rcu_is_watching() check is the CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use the old-style int3-based workflow instead of relying on ftrace, making the RCU watching check important to validate. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/ Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  ftrace: make extra rcu_is_watching() validation check optional  (Andrii Nakryiko, 1 file, -0/+13)
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/CAEf4BzauQ2WKMjZdc9s0rBWa01BYbgwHN6aNDXQSHYia47pQ-w@mail.gmail.com/ Link: https://lore.kernel.org/all/[email protected]/ Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul E. McKenney <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
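In effect the low-level check becomes conditional, along the lines of this sketch (the helper name is made up for illustration):

    static __always_inline bool sample_rcu_is_watching(void)
    {
            if (!IS_ENABLED(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING))
                    return true;            /* skip the check and its overhead */

            return rcu_is_watching();
    }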
2024-05-01  uprobes: reduce contention on uprobes_tree access  (Jonathan Haslam, 1 file, -11/+11)
Active uprobes are stored in an RB tree and accesses to this tree are dominated by read operations. Currently these accesses are serialized by a spinlock but this leads to enormous contention when large numbers of threads are executing active probes. This patch converts the spinlock used to serialize access to the uprobes_tree RB tree into a reader-writer spinlock. This lock type aligns naturally with the overwhelmingly read-only nature of the tree usage here. Although the addition of reader-writer spinlocks is discouraged [0], this fix is proposed as an interim solution while an RCU based approach is implemented (that work is in a nascent form). This fix also has the benefit of being trivial, self contained and therefore simple to backport. We have used a uprobe benchmark from the BPF selftests [1] to estimate the improvements. Each block of results below shows 1 line per execution of the benchmark (the "Summary" line) and each line is a run with one more thread added - a thread is a "producer". The lines are edited to remove extraneous output. The tests were executed with this driver script:

  for num_threads in {1..20}
  do
    sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary
  done

SPINLOCK (BEFORE)
==================
  Summary: hits 1.396 ± 0.007M/s ( 1.396M/prod)
  Summary: hits 1.656 ± 0.016M/s ( 0.828M/prod)
  Summary: hits 2.246 ± 0.008M/s ( 0.749M/prod)
  Summary: hits 2.114 ± 0.010M/s ( 0.529M/prod)
  Summary: hits 2.013 ± 0.009M/s ( 0.403M/prod)
  Summary: hits 1.753 ± 0.008M/s ( 0.292M/prod)
  Summary: hits 1.847 ± 0.001M/s ( 0.264M/prod)
  Summary: hits 1.889 ± 0.001M/s ( 0.236M/prod)
  Summary: hits 1.833 ± 0.006M/s ( 0.204M/prod)
  Summary: hits 1.900 ± 0.003M/s ( 0.190M/prod)
  Summary: hits 1.918 ± 0.006M/s ( 0.174M/prod)
  Summary: hits 1.925 ± 0.002M/s ( 0.160M/prod)
  Summary: hits 1.837 ± 0.001M/s ( 0.141M/prod)
  Summary: hits 1.898 ± 0.001M/s ( 0.136M/prod)
  Summary: hits 1.799 ± 0.016M/s ( 0.120M/prod)
  Summary: hits 1.850 ± 0.005M/s ( 0.109M/prod)
  Summary: hits 1.816 ± 0.002M/s ( 0.101M/prod)
  Summary: hits 1.787 ± 0.001M/s ( 0.094M/prod)
  Summary: hits 1.764 ± 0.002M/s ( 0.088M/prod)

RW SPINLOCK (AFTER)
===================
  Summary: hits 1.444 ± 0.020M/s ( 1.444M/prod)
  Summary: hits 2.279 ± 0.011M/s ( 1.139M/prod)
  Summary: hits 3.422 ± 0.014M/s ( 1.141M/prod)
  Summary: hits 3.565 ± 0.017M/s ( 0.891M/prod)
  Summary: hits 2.671 ± 0.013M/s ( 0.534M/prod)
  Summary: hits 2.409 ± 0.005M/s ( 0.401M/prod)
  Summary: hits 2.485 ± 0.008M/s ( 0.355M/prod)
  Summary: hits 2.496 ± 0.003M/s ( 0.312M/prod)
  Summary: hits 2.585 ± 0.002M/s ( 0.287M/prod)
  Summary: hits 2.908 ± 0.011M/s ( 0.291M/prod)
  Summary: hits 2.346 ± 0.016M/s ( 0.213M/prod)
  Summary: hits 2.804 ± 0.004M/s ( 0.234M/prod)
  Summary: hits 2.556 ± 0.001M/s ( 0.197M/prod)
  Summary: hits 2.754 ± 0.004M/s ( 0.197M/prod)
  Summary: hits 2.482 ± 0.002M/s ( 0.165M/prod)
  Summary: hits 2.412 ± 0.005M/s ( 0.151M/prod)
  Summary: hits 2.710 ± 0.003M/s ( 0.159M/prod)
  Summary: hits 2.826 ± 0.005M/s ( 0.157M/prod)
  Summary: hits 2.718 ± 0.001M/s ( 0.143M/prod)
  Summary: hits 2.844 ± 0.006M/s ( 0.142M/prod)

The numbers in parenthesis give averaged throughput per thread which is of greatest interest here as a measure of scalability. Improvements are in the order of 22 - 68% with this particular benchmark (mean = 43%). V2: - Updated commit message to include benchmark results.
[0] https://docs.kernel.org/locking/spinlocks.html [1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jonathan Haslam <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
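In outline, the conversion turns the tree lock into a reader-writer spinlock so lookups can proceed in parallel; the snippet below is a sketch of the resulting locking pattern, not a verbatim diff of the patch:

    static DEFINE_RWLOCK(uprobes_treelock);     /* previously DEFINE_SPINLOCK() */

    /* hot, read-mostly lookup path */
    read_lock(&uprobes_treelock);
    uprobe = __find_uprobe(inode, offset);
    read_unlock(&uprobes_treelock);

    /* rare insert/remove path */
    write_lock(&uprobes_treelock);
    /* ... rb_link_node()/rb_erase() on uprobes_tree ... */
    write_unlock(&uprobes_treelock);

Readers no longer serialize against each other, which is what removes the contention measured above; writers still exclude everyone, matching the rare registration/unregistration path.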
2024-05-01  rethook: Remove warning messages printed for finding return address of a frame.  (Kui-Feng Lee, 1 file, -1/+1)
The function rethook_find_ret_addr() prints a warning message and returns 0 when the target task is running and is not the "current" task, in order to prevent returning an incorrect address, although it still may return an incorrect address. However, the warning message turns into noise when BPF profiling programs call bpf_get_task_stack() on running tasks in a firm with a large number of hosts. The callers should be aware of, and willing to take, the risk of receiving an incorrect return address from a task that is currently running and is not the "current" one. A warning is not needed here, as the callers knowingly accept that risk. Link: https://lore.kernel.org/all/[email protected]/ Acked-by: Andrii Nakryiko <[email protected]> Acked-by: John Fastabend <[email protected]> Signed-off-by: Kui-Feng Lee <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  tracing/probes: support '%pD' type for print struct file's name  (Ye Bin, 2 files, -23/+36)
Just as with the '%pd' type, this patch supports the print type '%pD' for printing a file's name. For example "name=$arg1:%pD" casts the `$arg1` as (struct file*), dereferences the "file.f_path.dentry.d_name.name" field and stores it to the "name" argument as a kernel string. Here is an example:

  [tracing]# echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_event
  [tracing]# echo 1 > events/kprobes/testprobe/enable
  [tracing]# grep -q "1" events/kprobes/testprobe/enable
  [tracing]# echo 0 > events/kprobes/testprobe/enable
  [tracing]# grep "vfs_read" trace | grep "enable"
              grep-15108 [003] ..... 5228.328609: testprobe: (vfs_read+0x4/0xbb0) name="enable"

Note that this expects the given argument (e.g. $arg1) is an address of struct file. The user must ensure it. Link: https://lore.kernel.org/all/[email protected]/ [Masami: replaced "previous patch" with '%pd' type] Signed-off-by: Ye Bin <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  tracing/probes: support '%pd' type for print struct dentry's name  (Ye Bin, 5 files, -1/+65)
During fault locating, the file name needs to be printed based on the dentry address. The offset needs to be calculated each time, which is troublesome. Similar to printk, kprobes now support the print type '%pd' for printing a dentry's name. For example "name=$arg1:%pd" casts the `$arg1` as (struct dentry *), dereferences the "d_name.name" field and stores it to the "name" argument as a kernel string. Here is an example:

  [tracing]# echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
  [tracing]# echo 1 > events/kprobes/testprobe/enable
  [tracing]# grep -q "1" events/kprobes/testprobe/enable
  [tracing]# echo 0 > events/kprobes/testprobe/enable
  [tracing]# cat trace | grep "enable"
              bash-14844 [002] ..... 16912.889543: testprobe: (dput+0x4/0x30) name="enable"
              grep-15389 [003] ..... 16922.834182: testprobe: (dput+0x4/0x30) name="enable"
              grep-15389 [003] ..... 16922.836103: testprobe: (dput+0x4/0x30) name="enable"
              bash-14844 [001] ..... 16931.820909: testprobe: (dput+0x4/0x30) name="enable"

Note that this expects the given argument (e.g. $arg1) is an address of struct dentry. The user must ensure it. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Ye Bin <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  uprobes: add speculative lockless system-wide uprobe filter check  (Andrii Nakryiko, 1 file, -3/+7)
It's very common with BPF-based uprobe/uretprobe use cases to have system-wide (not PID specific) probes used. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and actual filtering is short circuited at the time when the uprobe/uretprobe is triggered. This is a great optimization, and the only issue with it is that to even get to checking this counter the uprobe subsystem is taking the read-side trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when a uprobe is triggered on multiple CPUs simultaneously. This patch moves this nr_systemwide check outside of the filter list's rwlock scope, as the rwlock is meant to protect list modification, while the nr_systemwide-based check is speculative and racy already, despite the lock (as discussed in [0]). trace_uprobe_filter_remove() and trace_uprobe_filter_add() already check for filter->nr_systemwide explicitly outside of __uprobe_perf_filter, so no modifications are required there. Confirming with BPF selftests' based benchmarks.

BEFORE (based on changes in previous patch)
===========================================
  uprobe-nop      :    2.732 ± 0.022M/s
  uprobe-push     :    2.621 ± 0.016M/s
  uprobe-ret      :    1.105 ± 0.007M/s
  uretprobe-nop   :    1.396 ± 0.007M/s
  uretprobe-push  :    1.347 ± 0.008M/s
  uretprobe-ret   :    0.800 ± 0.006M/s

AFTER
=====
  uprobe-nop      :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
  uprobe-push     :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
  uprobe-ret      :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
  uretprobe-nop   :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
  uretprobe-push  :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
  uretprobe-ret   :    0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, the first percentage value is based on top of the previous patch (lazy uprobe buffer optimization), while the "total" percentage is based on a kernel without any of the changes in this patch set. As can be seen, we get about 4% - 10% speed up, in total, with both lazy uprobe buffer and speculative filter check optimizations. [0] https://lore.kernel.org/bpf/[email protected]/ Reviewed-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
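The resulting check order is roughly the following sketch, simplified from the description above:

    /* Speculative, lock-free fast path for system-wide probes; racy by design. */
    if (READ_ONCE(filter->nr_systemwide))
            return true;

    /* Otherwise fall back to the rwlock-protected walk of the filter list. */
    read_lock(&filter->rwlock);
    ret = __uprobe_perf_filter(filter, mm);
    read_unlock(&filter->rwlock);
    return ret;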
2024-05-01  uprobes: prepare uprobe args buffer lazily  (Andrii Nakryiko, 1 file, -21/+28)
uprobe_cpu_buffer and the corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events. BPF is yet another user of the uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, for a lot of cases applications are very sensitive to performance overheads, as they might be tracing very high frequency functions like malloc()/free(), so every bit of performance improvement matters. All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobes has to pay. This patch changes this by making uprobe_cpu_buffer preparation optional. It will happen only if either a tracefs-based or perf event-based uprobe/uretprobe consumer is registered for a given uprobe/uretprobe. For BPF-only use cases this step will be skipped. We used the uprobe/uretprobe benchmark which is part of the BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary the instruction that is replaced by a uprobe: nop (fastest uprobe case), `push rbp` (typical case), and a non-simulated `ret` instruction (slowest case). The benchmark thread is constantly calling a user space function in a tight loop. The user space function has an attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count the number of triggering calls. The benchmark emits throughput in millions of executions per second.

BEFORE these changes
====================
  uprobe-nop      :    2.657 ± 0.024M/s
  uprobe-push     :    2.499 ± 0.018M/s
  uprobe-ret      :    1.100 ± 0.006M/s
  uretprobe-nop   :    1.356 ± 0.004M/s
  uretprobe-push  :    1.317 ± 0.019M/s
  uretprobe-ret   :    0.785 ± 0.007M/s

AFTER these changes
===================
  uprobe-nop      :    2.732 ± 0.022M/s (+2.8%)
  uprobe-push     :    2.621 ± 0.016M/s (+4.9%)
  uprobe-ret      :    1.105 ± 0.007M/s (+0.5%)
  uretprobe-nop   :    1.396 ± 0.007M/s (+2.9%)
  uretprobe-push  :    1.347 ± 0.008M/s (+2.3%)
  uretprobe-ret   :    0.800 ± 0.006M/s (+1.9%)

So the improvements on this particular machine seem to be between 2% and 5%. [0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c Reviewed-by: Jiri Olsa <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  uprobes: encapsulate preparation of uprobe args buffer  (Andrii Nakryiko, 1 file, -37/+41)
Move the logic of fetching temporary per-CPU uprobe buffer and storing uprobes args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch. Link: https://lore.kernel.org/all/[email protected]/ [Masami: update for v6.9-rc3 kernel] Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Jiri Olsa <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
2024-05-01  Merge branches 'fixes.2024.04.15a', 'misc.2024.04.12a', 'rcu-sync-normal-improve.2024.04.15a', 'rcu-tasks.2024.04.15a' and 'rcutorture.2024.04.15a' into rcu-merge.2024.04.15a  (Uladzislau Rezki (Sony), 12 files, -77/+479)
fixes.2024.04.15a: RCU fixes
misc.2024.04.12a: Miscellaneous fixes
rcu-sync-normal-improve.2024.04.15a: Improving synchronize_rcu() call
rcu-tasks.2024.04.15a: Tasks RCU updates
rcutorture.2024.04.15a: Torture-test updates
2024-04-30  bpf: Add BPF_PROG_TYPE_CGROUP_SKB attach type enforcement in BPF_LINK_CREATE  (Stanislav Fomichev, 1 file, -0/+5)
bpf_prog_attach uses attach_type_to_prog_type to enforce proper attach type for BPF_PROG_TYPE_CGROUP_SKB. link_create uses bpf_prog_get and relies on bpf_prog_attach_check_attach_type to properly verify prog_type <> attach_type association. Add missing attach_type enforcement for the link_create case. Otherwise, it's currently possible to attach cgroup_skb prog types to other cgroup hooks. Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program attachment") Link: https://lore.kernel.org/bpf/[email protected]/ Reported-by: [email protected] Signed-off-by: Stanislav Fomichev <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2024-04-30  Merge patch series "riscv: Create and document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl"  (Palmer Dabbelt, 1 file, -0/+6)
Charlie Jenkins <[email protected]> says: Improve the performance of icache flushing by creating a new prctl flag PR_RISCV_SET_ICACHE_FLUSH_CTX. The interface is left generic to allow for future expansions such as with the proposed J extension [1]. Documentation is also provided to explain the use case. Patch sent to add PR_RISCV_SET_ICACHE_FLUSH_CTX to man-pages [2]. [1] https://github.com/riscv/riscv-j-extension [2] https://lore.kernel.org/linux-man/[email protected]
* b4-shazam-merge:
  cpumask: Add assign cpu
  documentation: Document PR_RISCV_SET_ICACHE_FLUSH_CTX prctl
  riscv: Include riscv_set_icache_flush_ctx prctl
  riscv: Remove unnecessary irqflags processor.h include
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
2024-04-30  bpf: Add support for kprobe session cookie  (Jiri Olsa, 2 files, -3/+23)
Adding support for a cookie within the session of kprobe multi entry and return programs. The session cookie is a u64 value and can be retrieved by the new kfunc bpf_session_cookie, which returns a pointer to the cookie value. The bpf program can use the pointer to store (on entry) and load (on return) the value. The cookie value is implemented via the fprobe feature that allows sharing values between entry and return ftrace fprobe callbacks. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-04-30  bpf: Add support for kprobe session context  (Jiri Olsa, 2 files, -7/+63)
Adding a struct bpf_session_run_ctx object to hold session related data, which at the moment is an is_return bool, with a data pointer coming in following changes. Placing the bpf_session_run_ctx layer in between bpf_run_ctx and bpf_kprobe_multi_run_ctx so the session data can be retrieved regardless of whether it's a kprobe_multi or uprobe_multi link, support for which is coming in the future. This way both kprobe_multi and uprobe_multi can use the same kfuncs to access the session data. Adding the bpf_session_is_return kfunc that returns true if the bpf program is executed from the exit probe of the kprobe multi link attached in wrapper mode. It returns false otherwise. Adding a new kprobe hook for the kprobe program type. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
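Taken together with the session cookie above, a BPF-side user could look roughly like the sketch below; the SEC() name and the kfunc declarations follow common libbpf conventions and are assumptions, not text taken from these patches:

    // Hedged sketch of a kprobe session program measuring function latency.
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    extern __u64 *bpf_session_cookie(void) __ksym;
    extern bool bpf_session_is_return(void) __ksym;

    SEC("kprobe.session/do_nanosleep")
    int handle_session(struct pt_regs *ctx)
    {
            __u64 *cookie = bpf_session_cookie();

            if (!bpf_session_is_return()) {
                    *cookie = bpf_ktime_get_ns();   /* stash a value on entry */
                    return 0;                       /* 0: also run the return probe */
            }

            bpf_printk("do_nanosleep took %llu ns", bpf_ktime_get_ns() - *cookie);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";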
2024-04-30  bpf: Add support for kprobe session attach  (Jiri Olsa, 2 files, -9/+26)
Adding support to attach a bpf program for both the entry and return probe of the same function. This is a common use case which at the moment requires creating two kprobe multi links. Adding a new BPF_TRACE_KPROBE_SESSION attach type that instructs the kernel to attach a single link program to both the entry and exit probe. It's possible to control execution of the bpf program on the return probe simply by returning zero or non-zero from the entry bpf program execution: the return-probe program is executed or skipped, respectively. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-04-30  bpf: Do not walk twice the hash map on free  (Benjamin Tissoires, 1 file, -36/+13)
If someone stores both a timer and a workqueue in a hash map, on free, we would walk it twice. Add a check in htab_free_malloced_timers_or_wq and free the timers and workqueues if they are present. Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps") Signed-off-by: Benjamin Tissoires <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2024-04-30  bpf: Do not walk twice the map on free  (Benjamin Tissoires, 1 file, -7/+8)
If someone stores both a timer and a workqueue in a map, on free we would walk it twice. Add a check in array_map_free_timers_wq and free the timers and workqueues if they are present. Fixes: 246331e3f1ea ("bpf: allow struct bpf_wq to be embedded in arraymaps and hashmaps") Signed-off-by: Benjamin Tissoires <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]