path: root/kernel
2023-08-25  bpf: Consider non-owning refs to refcounted nodes RCU protected  (Dave Marchevsky; 1 file, -1/+12)
An earlier patch in the series ensures that the underlying memory of nodes with bpf_refcount - which can have multiple owners - is not reused until RCU grace period has elapsed. This prevents use-after-free with non-owning references that may point to recently-freed memory. While RCU read lock is held, it's safe to dereference such a non-owning ref, as by definition RCU GP couldn't have elapsed and therefore underlying memory couldn't have been reused. From the perspective of verifier "trustedness" non-owning refs to refcounted nodes are now trusted only in RCU CS and therefore should no longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them MEM_RCU in order to reflect this new state. Signed-off-by: Dave Marchevsky <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
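To make the new rule concrete, here is a minimal BPF-side sketch, assuming a refcounted rbtree node type along the lines of the bpf_refcount selftests (the node type and the glock/groot names are illustrative, not taken from the patch):

  struct node_data {
          long key;
          struct bpf_refcount ref;
          struct bpf_rb_node node;
  };

  bpf_spin_lock(&glock);
  n = bpf_rbtree_first(&groot);   /* non-owning ref, now MEM_RCU */
  if (n)
          key = n->key;           /* ok: the lock region is an RCU CS */
  bpf_spin_unlock(&glock);
  /* dereferencing n past this point is rejected by the verifier */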
2023-08-25  bpf: Reenable bpf_refcount_acquire  (Dave Marchevsky; 1 file, -4/+1)
Now that all reported issues are fixed, bpf_refcount_acquire can be turned back on. Also reenable all bpf_refcount-related tests which were disabled. This is a revert of:

  * commit f3514a5d6740 ("selftests/bpf: Disable newly-added 'owner' field test until refcount re-enabled")
  * commit 7deca5eae833 ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed")

Signed-off-by: Dave Marchevsky <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-25  bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes  (Dave Marchevsky; 1 file, -1/+5)
This is the final fix for the use-after-free scenario described in commit 7793fc3babe9 ("bpf: Make bpf_refcount_acquire fallible for non-owning refs"). That commit, by virtue of changing bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed the "refcount incr on 0" splat. The not_zero check in refcount_inc_not_zero, though, still occurs on memory that could have been free'd and reused, so the commit didn't properly fix the root cause. This patch actually fixes the issue by free'ing using the recently-added bpf_mem_free_rcu, which ensures that the memory is not reused until RCU grace period has elapsed. If that has happened then there are no non-owning references alive that point to the recently-free'd memory, so it can be safely reused. Signed-off-by: Dave Marchevsky <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
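A hedged sketch of the shape of this fix inside the bpf_obj_drop path (the field and variable names are assumptions, not the literal hunk):

  /* If the object is refcounted, defer reuse until an RCU grace
   * period has elapsed so stale non-owning refs never observe
   * recycled memory; otherwise the immediate free path is fine. */
  if (rec && rec->refcount_off >= 0)
          bpf_mem_free_rcu(&bpf_global_ma, p);
  else
          bpf_mem_free(&bpf_global_ma, p);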
2023-08-25  bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire  (Dave Marchevsky; 1 file, -0/+14)
It's straightforward to prove that kptr_struct_meta must be non-NULL for any valid call to these kfuncs:

  * btf_parse_struct_metas in btf.c creates a btf_struct_meta for any struct in user BTF with a special field (e.g. bpf_refcount, {rb,list}_node). These are stored in that BTF's struct_meta_tab.
  * __process_kf_arg_ptr_to_graph_node in verifier.c ensures that nodes have a {rb,list}_node field and that it's at the correct offset. Similarly, check_kfunc_args ensures bpf_refcount field existence for the node param to bpf_refcount_acquire.
  * So a btf_struct_meta must have been created for the struct type of the node param to these kfuncs.
  * That BTF and its struct_meta_tab are guaranteed to still be around. Any arbitrary {rb,list} node the BPF program interacts with either came from bpf_obj_new or a collection removal kfunc in the same program, in which case the BTF is associated with the program and still around; or came from bpf_kptr_xchg, in which case the BTF was associated with the map and is still around.

Instead of silently continuing with a NULL struct_meta, which caused confusing bugs such as those addressed by commit 2140a6e3422d ("bpf: Set kptr_struct_meta for node param to list and rbtree insert funcs"), let's error out. Then, at runtime, we can confidently say that the implementations of these kfuncs were given a non-NULL kptr_struct_meta, meaning that special-field-specific functionality like bpf_obj_free_fields and the bpf_obj_drop change introduced later in this series are guaranteed to execute. This patch doesn't change functionality, just makes it easier to reason about existing functionality. Signed-off-by: Dave Marchevsky <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-25  kernel/fork: group allocation/free of per-cpu counters for mm struct  (Mateusz Guzik; 1 file, -11/+4)
A trivial execve scalability test which tries to be very friendly (statically linked binaries, all separate) is predominantly bottlenecked by back-to-back per-cpu counter allocations which serialize on global locks. Ease the pain by allocating and freeing them in one go. Bench can be found here: http://apollo.backplane.com/DFlyMisc/doexec.c

  $ cc -static -O2 -o static-doexec doexec.c
  $ ./static-doexec $(nproc)

Even at a very modest scale of 26 cores (ops/s):

  before: 133543.63
  after:  186061.81 (+39%)

While with the patch these allocations remain a significant problem, the primary bottleneck shifts to page release handling. Signed-off-by: Mateusz Guzik <[email protected]> Link: https://lore.kernel.org/r/[email protected] [Dennis: reflowed 1 line] Signed-off-by: Dennis Zhou <[email protected]>
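The change relies on batch APIs for percpu counters; a sketch of the idea as applied to mm_struct, assuming the percpu_counter_init_many()/percpu_counter_destroy_many() helpers this series introduces:

  /* in mm_init(): one backing allocation for all NR_MM_COUNTERS
   * counters instead of NR_MM_COUNTERS serialized percpu allocations */
  if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
                               NR_MM_COUNTERS))
          goto fail_pcpu;

  /* in __mmdrop(): freed in one go as well */
  percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);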
2023-08-24  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()  (Eric DeVolder; 1 file, -2/+2)
The function crash_prepare_elf64_headers() generates the elfcorehdr which describes the CPUs and memory in the system for the crash kernel. In particular, it writes out ELF PT_NOTEs for memory regions and the CPUs in the system. With respect to the CPUs, the current implementation utilizes for_each_present_cpu() which means that as CPUs are added and removed, the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu() is:

  - At kernel boot time, all percpu crash_notes are allocated for all possible CPUs; that is, crash_notes are not allocated dynamically when CPUs are plugged/unplugged. Thus the crash_notes for each possible CPU are always available.
  - crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU. Changing to for_each_possible_cpu() is valid as the crash_notes pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

  kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr       /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

  - Upon panic, each CPU is sent an IPI and shuts itself down, recording its state in its crash_notes. When all CPUs are shutdown, the crash kernel is launched with a pointer to the elfcorehdr.
  - The crash kernel via linux/fs/proc/vmcore.c does not examine or use the contents of the PT_NOTEs; it exposes them via /proc/vmcore.
  - The makedumpfile utility uses /proc/vmcore and reads the CPU PT_NOTEs to craft a nr_cpus variable, which is reported in a header but otherwise generally unused. Makedumpfile creates the vmcore.
  - The 'crash' dump analyzer does not appear to reference the CPU PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask symbols and directly examines those structure contents from vmcore memory. From that information it is able to determine which CPUs are present and online, and locate the corresponding crash_notes. Said differently, it appears that the 'crash' analyzer does not rely on the ELF PT_NOTEs for CPUs; rather it obtains the information directly via kernel symbols and the memory within the vmcore. (There may be other vmcore generating and analysis tools that do use these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most common solution.)

This results in the benefit of having all CPUs described in the elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE for not-present-but-possible CPUs.

On systems where the kexec_file_load() syscall is utilized, all the above is valid. On systems where the kexec_load() syscall is utilized, there may be the need for the elfcorehdr to be regenerated once. The reason being that some archs only populate the 'present' CPUs from the /sys/devices/system/cpus entries, which the userspace 'kexec' utility uses to generate the userspace-supplied elfcorehdr. In this situation, one memory or CPU change will rewrite the elfcorehdr via the crash_prepare_elf64_headers() function and now all possible CPUs will be described, just as with the kexec_file_load() syscall.
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Suggested-by: Sourabh Jain <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
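The core of the change is a one-line substitution in the CPU PT_NOTE loop of crash_prepare_elf64_headers(); roughly (a simplified sketch, not the verbatim hunk):

  /* was: for_each_present_cpu(cpu) */
  for_each_possible_cpu(cpu) {
          phdr->p_type = PT_NOTE;
          notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
          phdr->p_offset = phdr->p_paddr = notes_addr;
          phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
          (ehdr->e_phnum)++;
          phdr++;
  }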
2023-08-24  crash: hotplug support for kexec_load()  (Eric DeVolder; 4 files, -0/+55)
The hotplug support for kexec_load() requires changes to the userspace kexec-tools and a little extra help from the kernel. Given a kdump capture kernel loaded via kexec_load(), and a subsequent hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites it to reflect the hotplug change. That is the desired outcome; however, at kernel panic time, the purgatory integrity check fails (because the elfcorehdr changed), and the capture kernel does not boot and no vmcore is generated. Therefore, the userspace kexec-tools/kexec must indicate to the kernel that the elfcorehdr can be modified (because the kexec excluded the elfcorehdr from the digest, and sized the elfcorehdr memory buffer appropriately).

To facilitate hotplug support with kexec_load():

  - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is safe for the kernel to modify the kexec_load()'d elfcorehdr
  - the /sys/kernel/crash_elfcorehdr_size node communicates the preferred size of the elfcorehdr memory buffer
  - the sysfs crash_hotplug nodes (i.e. /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically take into account kexec_file_load() vs kexec_load() and KEXEC_UPDATE_ELFCOREHDR. This is critical so that the udev rule processing of crash_hotplug is all that is needed to determine if the userspace unload-then-load of the kdump image is to be skipped, or not.

The proposed udev rule change looks like:

  # The kernel updates the crash elfcorehdr for CPU and memory changes
  SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
  SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The table below indicates the behavior of kexec_load()'d kdump image updates (with the new udev crash_hotplug rule in place):

  Kernel |Kexec
  -------+-----+----
     Old |Old  |New
         |  a  |  a
  -------+-----+----
     New |  a  |  b
  -------+-----+----

where kexec 'old' and 'new' delineate whether kexec-tools has the needed modifications for the crash hotplug feature, and kernel 'old' and 'new' delineate whether the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump image. For the kexec 'old' column, the unload-then-reload occurs due to the missing flag KEXEC_UPDATE_ELFCOREHDR. An 'old' kernel (with 'new' kexec) does not present the crash_hotplug sysfs node, which leads to the unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel directly modifying the elfcorehdr and avoiding the unload-then-reload of the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then no matter which combination of kernel or kexec is new or old, the kdump image continues to be unload-then-reloaded on hotplug changes.

To fully support the crash hotplug feature, there needs to be a rollout of kernel, kexec-tools and udev rule changes. However, the order of the rollout of these pieces does not matter; kexec_load()'d kdump images still function for hotplug as-is.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Suggested-by: Hari Bathini <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24  kexec: exclude elfcorehdr from the segment digest  (Eric DeVolder; 1 file, -0/+6)
When a crash kernel is loaded via the kexec_file_load() syscall, the kernel places the various segments (ie crash kernel, crash initrd, boot_params, elfcorehdr, purgatory, etc) in memory. For those architectures that utilize purgatory, a hash digest of the segments is calculated for integrity checking. The digest is embedded into the purgatory image prior to placing in memory. Updates to the elfcorehdr in response to CPU and memory changes would cause the purgatory integrity checking to fail (at crash time, and no vmcore created). Therefore, the elfcorehdr segment is explicitly excluded from the purgatory digest, enabling updates to the elfcorehdr while also avoiding the need to recompute the hash digest and reload purgatory. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Suggested-by: Baoquan He <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
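A sketch of how the exclusion can look inside the digest loop of kexec_calculate_store_digests(), assuming the elfcorehdr's segment index is recorded at load time (the elfcorehdr_index field name is an assumption):

  for (i = 0; i < image->nr_segments; i++) {
          struct kexec_segment *ksegment = &image->segment[i];

          /* the elfcorehdr may be rewritten on hotplug; keep it out
           * of the digest so purgatory's integrity check still passes */
          if (i == image->elfcorehdr_index)
                  continue;

          ret = crypto_shash_update(desc, ksegment->kbuf, ksegment->bufsz);
          if (ret)
                  break;
  }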
2023-08-24  crash: add generic infrastructure for crash hotplug support  (Eric DeVolder; 3 files, -0/+179)
To support crash hotplug, a mechanism is needed to update the crash elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining). The crash elfcorehdr describes the CPUs and memory to be written into the vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN). The crash hotplug elfcorehdr update has no explicit ordering requirement (relative to other cpuhp states), so it meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a new state for crash hotplug. Also, CPUHP_BP_PREPARE_DYN is the last state in the PREPARE group, just prior to the STARTING group, which is very close to the CPU starting up in a plug/online situation, or stopping in an unplug/offline situation. This minimizes the window of time during an actual plug/online or unplug/offline situation in which the elfcorehdr would be inaccurate. Note that for a CPU being unplugged or offlined, the CPU will still be present in the list of CPUs generated by crash_prepare_elf64_headers(). However, there is no need to explicitly omit the CPU; see the justification in 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the memblock MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event() which performs needed tasks and then dispatches the event to the architecture specific arch_crash_handle_hotplug_event() to update the elfcorehdr with the current state of CPUs and memory. During the process, the kexec_lock is held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
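A condensed sketch of the registration described above (the notifier block and handler names are illustrative):

  static int __init crash_hotplug_init(void)
  {
          int result = 0;

          if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
                  register_memory_notifier(&crash_memhp_nb);

          if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
                  result = cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN,
                                  "crash/cpuhp", crash_cpuhp_online,
                                  crash_cpuhp_offline);

          return result;
  }
  subsys_initcall(crash_hotplug_init);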
2023-08-24  crash: move a few code bits to setup support of crash hotplug  (Eric DeVolder; 3 files, -218/+218)
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28. Once the kdump service is loaded, if changes to CPUs or memory occur, either by hot un/plug or off/onlining, the crash elfcorehdr must also be updated. The elfcorehdr describes to kdump the CPUs and memory in the system, and any inaccuracies can result in a vmcore with missing CPU context or memory regions. The current solution utilizes udev to initiate an unload-then-reload of the kdump image (eg. kernel, initrd, boot_params, purgatory and elfcorehdr) by the userspace kexec utility. In the original post I outlined the significant performance problems related to offloading this activity to userspace. This patchset introduces a generic crash handler that registers with the CPU and memory notifiers. Upon CPU or memory changes, from either hot un/plug or off/onlining, this generic handler is invoked and performs important housekeeping, for example obtaining the appropriate lock, and then invokes an architecture specific handler to do the appropriate elfcorehdr update. Note the description in patch 'crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that enables further optimizations related to CPU plug/unplug/online/offline performance of elfcorehdr updates. In the case of x86_64, the arch specific handler generates a new elfcorehdr, and overwrites the old one in memory; thus no involvement with userspace needed. To realize the benefits/test this patchset, one must make a couple of minor changes to userspace: - Prevent udev from updating kdump crash kernel on hot un/plug changes. Add the following as the first lines to the RHEL udev rule file /usr/lib/udev/rules.d/98-kexec.rules: # The kernel updates the crash elfcorehdr for CPU and memory changes SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end" With this changeset applied, the two rules evaluate to false for CPU and memory change events and thus skip the userspace unload-then-reload of kdump. - Change to the kexec_file_load for loading the kdump kernel: Eg. on RHEL: in /usr/bin/kdumpctl, change to: standard_kexec_args="-p -d -s" which adds the -s to select kexec_file_load() syscall. This kernel patchset also supports kexec_load() with a modified kexec userspace utility. A working changeset to the kexec userspace utility is posted to the kexec-tools mailing list here: http://lists.infradead.org/pipermail/kexec/2023-May/027049.html To use the kexec-tools patch, apply, build and install kexec-tools, then change the kdumpctl's standard_kexec_args to replace the -s with --hotplug. The removal of -s reverts to the kexec_load syscall and the addition of --hotplug invokes the changes put forth in the kexec-tools patch. This patch (of 8): The crash hotplug support leans on the work for the kexec_file_load() syscall. To also support the kexec_load() syscall, a few bits of code need to be move outside of CONFIG_KEXEC_FILE. As such, these bits are moved out of kexec_file.c and into a common location crash_core.c. In addition, struct crash_mem and crash_notes were moved to new locales so that PROC_KCORE, which sets CRASH_CORE alone, builds correctly. No functionality change intended. 
Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Reviewed-by: Sourabh Jain <[email protected]> Acked-by: Hari Bathini <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Akhil Raj <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Mimi Zohar <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: Takashi Iwai <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas Weißschuh <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-24  kallsyms: Add more debug output for selftest  (Kees Cook; 1 file, -4/+18)
While debugging a recent kallsyms_selftest failure[1], I needed more details on what specifically was failing. This adds those details for each failure state that is checked. [1] https://lore.kernel.org/all/[email protected]/ Cc: Luis Chamberlain <[email protected]> Cc: Yonghong Song <[email protected]> Cc: "Erhard F." <[email protected]> Cc: Zhen Lei <[email protected]> Cc: kernel test robot <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Yang Li <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-08-24  bpf: Remove a WARN_ON_ONCE warning related to local kptr  (Yonghong Song; 1 file, -1/+0)
Currently, in function bpf_obj_free_fields(), for local kptr, a warning will be issued if the struct does not contain any special fields. But actually the kernel seems totally okay with a local kptr without any special fields. Permitting no special fields also aligns with future percpu kptr which also allows no special fields. Acked-by: Dave Marchevsky <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-23  bpf: Fix issue in verifying allow_ptr_leaks  (Yafang Shao; 1 file, -8/+9)
After we converted the capabilities of our networking-bpf program from cap_sys_admin to cap_net_admin+cap_bpf, our networking-bpf program failed to start. It failed the bpf verifier, and the error log is "R3 pointer comparison prohibited". A simple reproducer follows:

  SEC("cls-ingress")
  int ingress(struct __sk_buff *skb)
  {
          struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);

          if ((long)(iph + 1) > (long)skb->data_end)
                  return TC_ACT_STOLEN;
          return TC_ACT_OK;
  }

Per discussion with Yonghong and Alexei [1], comparison of two packet pointers is not a pointer leak. This patch fixes it. Our local kernel is 6.1.y and we expect this fix to be backported to 6.1.y, so stable is CCed.

[1] https://lore.kernel.org/bpf/CAADnVQ+Nmspr7Si+pxWn8zkE7hX-7s93ugwC+94aXSy4uQ9vBg@mail.gmail.com/

Suggested-by: Yonghong Song <[email protected]> Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-23  entry: Remove empty addr_limit_user_check()  (Mark Rutland; 1 file, -2/+1)
Back when set_fs() was a generic API for altering the address limit, addr_limit_user_check() was a safety measure to prevent userspace being able to issue syscalls with an unbound limit. With the removal of set_fs() as a generic API, the last user of addr_limit_user_check() was removed in commit b5a5a01d8e9a44ec ("arm64: uaccess: remove addr_limit_user_check()"); since that commit, no architecture defines TIF_FSCHECK, and hence addr_limit_user_check() always expands to nothing. Remove addr_limit_user_check(), updating the comment in exit_to_user_mode_prepare() to no longer refer to it. At the same time, the comment is reworded to be a little more generic so as to cover kmap_assert_nomap() in addition to lockdep_sys_exit(). No functional change. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-08-23  tracing/fprobe-event: Assume fprobe is a return event by $retval  (Masami Hiramatsu (Google); 1 file, -14/+44)
Assume the fprobe event is a return event if $retval is used in the probe's arguments without %return. E.g.

  echo 'f:myevent vfs_read $retval' >> dynamic_events

then 'myevent' is a return probe event. Link: https://lore.kernel.org/all/169272160261.160970.13613040161560998787.stgit@devnote2/ Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/probes: Add string type check with BTF  (Masami Hiramatsu (Google); 2 files, -3/+89)
Add string type checking with BTF information if possible. This checks whether the given BTF argument (and field) is a signed char array or a pointer to signed char. If not, it rejects the 'string' type. If it is a pointer to signed char, it adds a dereference operation so that it can correctly fetch the string data from memory.

  # echo 'f getname_flags%return retval->name:string' >> dynamic_events
  # echo 't sched_switch next->comm:string' >> dynamic_events

In the above cases, 'struct filename::name' is 'char *' and 'struct task_struct::comm' is 'char []'. But in both cases, the user can specify ':string' to fetch the string data. Link: https://lore.kernel.org/all/169272159250.160970.1881112937198526188.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/probes: Support BTF field access from $retval  (Masami Hiramatsu (Google); 2 files, -102/+86)
Support BTF argument on '$retval' for function return events, including kretprobe and fprobe, for accessing the return value. This also allows the user to access its fields if the return value is a pointer to a data structure. E.g.

  # echo 'f getname_flags%return +0($retval->name):string' > dynamic_events
  # echo 1 > events/fprobes/getname_flags__exit/enable
  # ls > /dev/null
  # head -n 40 trace | tail
     ls-87  [000] ...1.  8067.616101: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./function_profile_enabled"
     ls-87  [000] ...1.  8067.616108: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./trace_stat"
     ls-87  [000] ...1.  8067.616115: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_notrace"
     ls-87  [000] ...1.  8067.616122: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_graph_function"
     ls-87  [000] ...1.  8067.616129: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace"
     ls-87  [000] ...1.  8067.616135: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_filter"
     ls-87  [000] ...1.  8067.616143: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./touched_functions"
     ls-87  [000] ...1.  8067.616237: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./enabled_functions"
     ls-87  [000] ...1.  8067.616245: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./available_filter_functions"
     ls-87  [000] ...1.  8067.616253: getname_flags__exit: (vfs_fstatat+0x3c/0x70 <- getname_flags) arg1="./set_ftrace_notrace_pid"

Link: https://lore.kernel.org/all/169272158234.160970.2446691104240645205.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/probes: Support BTF based data structure field access  (Masami Hiramatsu (Google); 3 files, -28/+216)
Use BTF to access the fields of a data structure. You can access a field with the '->' or '.' operator on a BTF argument.

  # echo 't sched_switch next=next->pid vruntime=next->se.vruntime' > dynamic_events
  # echo 1 > events/tracepoints/sched_switch/enable
  # head -n 40 trace | tail
     <idle>-0        [000] d..3.   272.565382: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956533179
     kcompactd0-26   [000] d..3.   272.565406: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
     <idle>-0        [000] d..3.   273.069441: sched_switch: (__probestub_sched_switch+0x4/0x10) next=9 vruntime=956533179
     kworker/0:1-9   [000] d..3.   273.069464: sched_switch: (__probestub_sched_switch+0x4/0x10) next=26 vruntime=956579181
     kcompactd0-26   [000] d..3.   273.069480: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
     <idle>-0        [000] d..3.   273.141434: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956533179
     kworker/u2:1-22 [000] d..3.   273.141461: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0
     <idle>-0        [000] d..3.   273.480872: sched_switch: (__probestub_sched_switch+0x4/0x10) next=22 vruntime=956585857
     kworker/u2:1-22 [000] d..3.   273.480905: sched_switch: (__probestub_sched_switch+0x4/0x10) next=70 vruntime=959533179
     sh-70           [000] d..3.   273.481102: sched_switch: (__probestub_sched_switch+0x4/0x10) next=0 vruntime=0

Link: https://lore.kernel.org/all/169272157251.160970.9318175874130965571.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Alan Maguire <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/probes: Add a function to search a member of a struct/union  (Masami Hiramatsu (Google); 2 files, -0/+73)
Add a btf_find_struct_member() API to search for a member of a given data structure or union by the member's name. Link: https://lore.kernel.org/all/169272156248.160970.8868479822371129043.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Reviewed-by: Alan Maguire <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
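A hedged usage sketch; the exact signature should be checked against the patch, and the anonymous-member offset out-parameter in particular is an assumption:

  const struct btf_member *member;
  u32 anon_offs = 0;

  /* look up "comm" within the given struct type, descending into
   * anonymous structs/unions; anon_offs would accumulate their
   * extra offset */
  member = btf_find_struct_member(btf, type, "comm", &anon_offs);
  if (!member)
          return -ENOENT;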
2023-08-23  tracing/probes: Move finding func-proto API and getting func-param API to trace_btf  (Masami Hiramatsu (Google); 4 files, -40/+72)
Move the generic function-proto find API and the function-parameter get API from trace_probe.c to the BTF library code. This avoids redundant effort across different features. Link: https://lore.kernel.org/all/169272155255.160970.719426926348706349.stgit@devnote2/ Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/probes: Support BTF argument on module functions  (Masami Hiramatsu (Google); 7 files, -49/+74)
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in the vmlinux, a BTF argument is not available on functions in modules. Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind() so that the BTF argument can find the correct struct btf and btf_type in it. With this fix, fprobe events can use `$arg*` on module functions as below:

  # grep nf_log_ip_packet /proc/kallsyms
  ffffffffa0005c00 t nf_log_ip_packet    [nf_log_syslog]
  ffffffffa0005bf0 t __pfx_nf_log_ip_packet    [nf_log_syslog]
  # echo 'f nf_log_ip_packet $arg*' > dynamic_events
  # cat dynamic_events
  f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix

To support module btf, which is removable, the struct btf needs to be ref-counted. So this also records the btf in the traceprobe_parse_context and releases the refcount when parsing is done. Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/ Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23  tracing/eprobe: Iterate trace_eprobe directly  (Chuang Wang; 1 file, -9/+9)
Refer to the description in [1], we can skip "container_of()" following "list_for_each_entry()" by using "list_for_each_entry()" with "struct trace_eprobe" and "tp.list". Also, this patch defines "for_each_trace_eprobe_tp" to simplify the code of the same logic. [1] https://lore.kernel.org/all/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com/ Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Chuang Wang <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
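Reconstructed from the description (not copied from the patch), the helper macro boils down to something like:

  /* iterate struct trace_eprobe directly via its embedded tp.list,
   * instead of iterating struct trace_probe and calling
   * container_of() on every node */
  #define for_each_trace_eprobe_tp(ep, _tp) \
          list_for_each_entry(ep, trace_probe_probe_list(_tp), tp.list)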
2023-08-23  kernel: kprobes: Use struct_size()  (Ruan Jinjie; 1 file, -4/+2)
Use struct_size() instead of hand-writing it, when allocating a structure with a flex array. This is less verbose. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Ruan Jinjie <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
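For illustration, the general shape of such a conversion (generic names, not the exact kprobes hunk):

  struct obj {
          int cnt;
          struct item items[];    /* flexible array member */
  };

  /* before: hand-written size arithmetic, no overflow checking */
  p = kzalloc(sizeof(*p) + cnt * sizeof(struct item), GFP_KERNEL);

  /* after: struct_size() from <linux/overflow.h> saturates on overflow */
  p = kzalloc(struct_size(p, items, cnt), GFP_KERNEL);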
2023-08-22  bpf: Fix check_func_arg_reg_off bug for graph root/node  (Kumar Kartikeya Dwivedi; 1 file, -11/+0)
The commit being fixed introduced a hunk into check_func_arg_reg_off that bypasses reg->off == 0 enforcement when offset points to a graph node or root. This might possibly be done for treating bpf_rbtree_remove and others as KF_RELEASE and then later check correct reg->off in helper argument checks. But this is not the case, those helpers are already not KF_RELEASE and permit non-zero reg->off and verify it later to match the subobject in BTF type. However, this logic leads to bpf_obj_drop permitting free of register arguments with non-zero offset when they point to a graph root or node within them, which is not ok. For instance:

  struct foo {
          int i;
          int j;
          struct bpf_rb_node node;
  };

  struct foo *f = bpf_obj_new(typeof(*f));
  if (!f) ...
  bpf_obj_drop(f);        // OK
  bpf_obj_drop(&f->i);    // still ok from verifier PoV
  bpf_obj_drop(&f->node); // Not OK, but permitted right now

Fix this by dropping the whole part of code altogether. Fixes: 6a3cd3318ff6 ("bpf: Migrate release_on_unlock logic to non-owning ref semantics") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-22  PM: QoS: Add check to make sure CPU latency is non-negative  (Clive Lin; 1 file, -2/+7)
CPU latency should never be negative; a negative value would become incorrectly high when converted to an unsigned data type. Commit 8d36694245f2 ("PM: QoS: Add check to make sure CPU freq is non-negative") makes sure CPU frequency is non-negative to fix incorrect behavior in frequency QoS. Add an analogous check to make sure CPU latency is non-negative so as to prevent this problem from happening in CPU latency QoS. Signed-off-by: Clive Lin <[email protected]> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
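A sketch of the guard, by analogy with the frequency check referenced above (the helper name is an assumption):

  static bool cpu_latency_qos_value_invalid(s32 value)
  {
          /* PM_QOS_DEFAULT_VALUE (-1) remains valid as "no request" */
          return value < 0 && value != PM_QOS_DEFAULT_VALUE;
  }

Entry points such as cpu_latency_qos_add_request() would then bail out early when this predicate is true.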
2023-08-22  bpf: Fix a bpf_kptr_xchg() issue with local kptr  (Yonghong Song; 1 file, -10/+15)
When reviewing local percpu kptr support, Alexei discovered a bug where a bpf_kptr_xchg() may succeed even if the map value kptr type and the locally allocated obj type do not match ([1]). A missed struct btf_id comparison is the reason for the bug. This patch adds such a struct btf_id comparison and will flag verification failure if the types do not match. [1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t Reported-by: Alexei Starovoitov <[email protected]> Fixes: 738c96d5e2e3 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg") Signed-off-by: Yonghong Song <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
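Conceptually, the verifier gains a comparison along these lines (a sketch, not the literal hunk; the field names are assumptions):

  /* the map value's kptr type and the exchanged object's type must
   * match by both BTF object and type id, not merely by layout */
  if (kptr_field->kptr.btf != reg->btf ||
      kptr_field->kptr.btf_id != reg->btf_id)
          return -EINVAL;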
2023-08-22  tracing/user_events: Optimize safe list traversals  (Eric Vaughn; 1 file, -7/+8)
Several of the list traversals in the user_events facility use safe list traversals where they could be using the unsafe versions instead. Replace these safe traversals with their unsafe counterparts in the interest of optimization. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Suggested-by: Beau Belgrave <[email protected]> Signed-off-by: Eric Vaughn <[email protected]> Acked-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
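The distinction, in a generic sketch (names are illustrative): the _safe variant maintains a lookahead cursor so the current entry may be freed mid-walk, which pure lookups do not need:

  /* deletion during iteration: the _safe variant is required */
  list_for_each_entry_safe(user, tmp, &group->node, link)
          if (user->stale)
                  list_del(&user->link);

  /* read-only lookup: the plain variant suffices and is cheaper */
  list_for_each_entry(user, &group->node, link)
          if (user->index == index)
                  return user;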
2023-08-22  tracing: Remove unused function declarations  (Yue Haibing; 1 file, -2/+0)
Commit 9457158bbc0e ("tracing: Fix reset of time stamps during trace_clock changes") left behind the tracing_reset_current() declaration. Also, commit 6954e415264e ("tracing: Place trace_pid_list logic into abstract functions") removed the trace_free_pid_list() implementation but left the declaration. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: <[email protected]> Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Further optimise scalar vs cpumask comparison  (Valentin Schneider; 1 file, -6/+20)
Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU  (Valentin Schneider; 1 file, -2/+7)
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. In this case we can directly re-use filter_pred_cpu(); we just need to transform '&' into '==' before executing it. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU  (Valentin Schneider; 1 file, -1/+6)
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU  (Valentin Schneider; 1 file, -1/+34)
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. Reuse do_filter_scalar_cpumask() when the input mask has a weight of one. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Enable filtering the CPU common field by a cpumask  (Valentin Schneider; 1 file, -0/+14)
The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event). A previous commit added filtering scalar fields by a user-given cpumask; make this work with the CPU common field as well. This enables doing things like:

  $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
        -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Enable filtering a scalar field by a cpumask  (Valentin Schneider; 1 file, -11/+81)
Several events use a scalar field to denote a CPU:

  o sched_wakeup.target_cpu
  o sched_migrate_task.orig_cpu,dest_cpu
  o sched_move_numa.src_cpu,dst_cpu
  o ipi_send_cpu.cpu
  o ...

Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids. Allow these to be filtered by a user-provided cpumask, which enables e.g.:

  $ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Enable filtering a cpumask field by another cpumask  (Valentin Schneider; 1 file, -2/+95)
The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions. Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask. The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression. This enables e.g.:

  $ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22  tracing/filters: Dynamically allocate filter_pred.regex  (Valentin Schneider; 1 file, -25/+39)
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field. A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated. While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline. No change in behaviour intended. The kfree()'s were patched via Coccinelle:

  @@
  struct filter_pred *pred;
  @@
  -kfree(pred);
  +free_predicate(pred);

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-21  bpf: Add bpf_get_func_ip helper support for uprobe link  (Jiri Olsa; 1 file, -3/+30)
Adding support for bpf_get_func_ip helper being called from ebpf program attached by uprobe_multi link. It returns the ip of the uprobe. Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21  bpf: Add pid filter support for uprobe_multi link  (Jiri Olsa; 2 files, -1/+34)
Adding support to specify a pid for the uprobe_multi link, so that the uprobes are created only for the task with the given pid value. Using the consumer.filter callback for that, so the task gets filtered during the uprobe installation. We still need to check the task during runtime in the uprobe handler, because the handler could get executed if there's another system wide consumer on the same uprobe (thanks Oleg for the insight). Cc: Oleg Nesterov <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21  bpf: Add cookies support for uprobe_multi link  (Jiri Olsa; 2 files, -5/+42)
Adding support to specify a cookies array for the uprobe_multi link. The cookies array shares indexes and length with the other uprobe_multi arrays (offsets/ref_ctr_offsets). The cookies[i] value defines the cookie for the i-th uprobe and will be returned by the bpf_get_attach_cookie helper when called from an ebpf program hooked to that specific uprobe. Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21  bpf: Add multi uprobe link  (Jiri Olsa; 2 files, -3/+244)
Adding a new multi uprobe link that allows attaching a bpf program to multiple uprobes. Uprobes to attach are specified via the new link_create uprobe_multi union:

  struct {
          __aligned_u64   path;
          __aligned_u64   offsets;
          __aligned_u64   ref_ctr_offsets;
          __u32           cnt;
          __u32           flags;
  } uprobe_multi;

Uprobes are defined for a single binary specified in path and multiple calling sites specified in the offsets array, with optional reference counters specified in the ref_ctr_offsets array. All specified arrays have a length of 'cnt'. The 'flags' field supports a single bit for now, which marks the uprobe as a return probe. Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
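On the userspace side, attaching through this link can look roughly like the following sketch using libbpf's bpf_link_create(); the option field names are assumptions based on the matching libbpf support:

  LIBBPF_OPTS(bpf_link_create_opts, opts,
          .uprobe_multi.path = "/usr/lib/libc.so.6",  /* example binary */
          .uprobe_multi.offsets = offsets,            /* cnt entries */
          .uprobe_multi.cnt = cnt);

  link_fd = bpf_link_create(prog_fd, 0 /* target_fd */,
                            BPF_TRACE_UPROBE_MULTI, &opts);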
2023-08-21  bpf: Add attach_type checks under bpf_prog_attach_check_attach_type  (Jiri Olsa; 1 file, -68/+52)
Add extra attach_type checks from link_create under bpf_prog_attach_check_attach_type. Suggested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21  bpf, cpumap: Clean up bpf_cpu_map_entry directly in cpu_map_free  (Hou Tao; 1 file, -9/+8)
After synchronize_rcu(), both the detached XDP program and xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry will be cpu_map_kthread_run(), so instead of calling __cpu_map_entry_replace() to stop the kthread and clean up the entry after an RCU grace period, do these things directly. Signed-off-by: Hou Tao <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21  bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier()  (Hou Tao; 1 file, -69/+27)
As of now, __cpu_map_entry_replace() uses call_rcu() to wait for the in-flight xdp program to exit the RCU read critical section, and then launches the kworker cpu_map_kthread_stop() to call kthread_stop() to flush all pending xdp frames or skbs. But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to wait for the completion of __cpu_map_entry_free(), because rcu_barrier() will wait for all pending RCU callbacks and cpu_map_kthread_stop() only needs to wait for the completion of a specific __cpu_map_entry_free(). So use queue_rcu_work() to replace call_rcu(), schedule_work() and rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free() kworker after an RCU grace period. Because __cpu_map_entry_free() runs in a kworker context, it is OK to do all of these freeing procedures, including kthread_stop(), in it. After the update, there is no need to do reference-counting for bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in __cpu_map_entry_free(), so just remove it. Signed-off-by: Hou Tao <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
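The replacement pattern in a nutshell (a sketch; the rcu_work field name is an assumption):

  /* before: call_rcu() into __cpu_map_entry_free(), plus
   * schedule_work() and rcu_barrier() games to flush everything */

  /* after: one kworker queued once an RCU grace period has elapsed */
  INIT_RCU_WORK(&rcpu->free_work, __cpu_map_entry_free);
  queue_rcu_work(system_wq, &rcpu->free_work);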
2023-08-21  mm: free up a word in the first tail page  (Matthew Wilcox (Oracle); 1 file, -1/+0)
Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21  mm: add large_rmappable page flag  (Matthew Wilcox (Oracle); 1 file, -1/+0)
Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21  mm: remove HUGETLB_PAGE_DTOR  (Matthew Wilcox (Oracle); 1 file, -1/+1)
We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21  treewide: drop CONFIG_EMBEDDED  (Randy Dunlap; 1 file, -1/+1)
There is only one Kconfig user of CONFIG_EMBEDDED and it can be switched to EXPERT or "if !ARCH_MULTIPLATFORM" (suggested by Arnd). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Palmer Dabbelt <[email protected]> [RISC-V] Acked-by: Greg Ungerer <[email protected]> Acked-by: Jason A. Donenfeld <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Cc: Russell King <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Brian Cain <[email protected]> Cc: Michal Simek <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Albert Ou <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Max Filippov <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Masahiro Yamada <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21  lockdep: fix static memory detection even more  (Helge Deller; 1 file, -22/+14)
On the parisc architecture, lockdep reports this warning for all static objects which are in the __initdata section (e.g. "setup_done" in devtmpfs, "kthreadd_done" in init/main.c):

  INFO: trying to register non-static key.

The warning itself is wrong, because those objects are in the __initdata section, but that section is, on parisc, outside of the range from _stext to _end, which is why the static_obj() function returns a wrong answer. While fixing this issue, I noticed that the whole existing check can be simplified a lot. Instead of checking against the _stext and _end symbols (which include code areas too), just check for the .data and .bss segments (since we check a data object). This can be done with the existing is_kernel_core_data() macro. In addition, objects in the __initdata section can be checked with init_section_contains(), and is_kernel_rodata() allows keys to be in the _ro_after_init section. This partly reverts and simplifies commit bac59d18c701 ("x86/setup: Fix static memory detection"). Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100 Fixes: bac59d18c701 ("x86/setup: Fix static memory detection") Signed-off-by: Helge Deller <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
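Reconstructed from the description above (an approximation, not the verbatim patch), the simplified check reads:

  static int static_obj(const void *obj)
  {
          unsigned long addr = (unsigned long) obj;

          if (is_kernel_core_data(addr))              /* .data and .bss */
                  return 1;
          if (init_section_contains((void *)obj, 1))  /* __initdata */
                  return 1;
          if (is_kernel_rodata(addr))                 /* incl. __ro_after_init */
                  return 1;

          /* module static objects are handled separately */
          return is_module_address(addr) || is_module_percpu_address(addr);
  }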
2023-08-21kernel/fork: stop playing lockless games for exe_file replacementMateusz Guzik1-13/+9
xchg originated in 6e399cd144d8 ("prctl: avoid using mmap_sem for exe_file serialization"). While the commit message does not explain *why* the change was made, I found the original submission [1], which ultimately claims it cleans things up by removing the dependency of exe_file on the semaphore.

However, fe69d560b5bd ("kernel/fork: always deny write access to current MM exe_file") added a semaphore up/down cycle to synchronize the state of exe_file against fork, defeating the point of the original change. This is on top of the semaphore trips already present both in the replacing function and in prctl (the only consumer).

Normally, replacing exe_file does not happen for busy processes, so write-locking is not an impediment to performance in the intended use case. If someone keeps invoking the routine for a busy process, they are trying to play dirty, and that's another reason to avoid any trickery. As such, I think the atomic here only adds complexity for no benefit. Just write-lock around the replacement.

I also note that the replacement races against the mapping-check loop, as nothing synchronizes the actual assignment with said checks, but I am not addressing that in this patch. (Is the loop of any use to begin with?)

Link: https://lore.kernel.org/linux-mm/[email protected]/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Mateusz Guzik <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: "Christian Brauner (Microsoft)" <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
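A minimal sketch of the write-locked replacement described above; the mapping-check loop and the deny/allow-write-access handling are elided, so this is the shape of the change rather than the verbatim kernel function:

int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
{
        struct file *old_exe_file;

        /* ... mapping-check loop elided ... */

        get_file(new_exe_file);        /* caller passes a valid file */

        /* plain pointer swap under the mmap write lock, no xchg games */
        mmap_write_lock(mm);
        old_exe_file = rcu_dereference_raw(mm->exe_file);
        rcu_assign_pointer(mm->exe_file, new_exe_file);
        mmap_write_unlock(mm);

        if (old_exe_file)
                fput(old_exe_file);
        return 0;
}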
2023-08-21memfd: replace ratcheting feature from vm.memfd_noexec with hierarchyAleksa Sarai3-19/+18
This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2):

  % cat mount-memfd.c
  #define _GNU_SOURCE /* asprintf() */
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mount.h> /* fsopen()/fsconfig()/fsmount(), glibc >= 2.36 */

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
          int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
          assert(fsfd >= 0);
          /* FSCONFIG_CMD_CREATE takes no key/value and requires aux == 0 */
          assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));

          int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
          assert(dfd >= 0);

          int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
          assert(execfd >= 0);
          assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
          assert(!close(execfd));

          char *execpath = NULL;
          char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
          execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
          assert(execfd >= 0);
          assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
          assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache for admins when there doesn't appear to be an actual security benefit to blocking this. There appear to be concerns about confused-deputy-esque attacks [2], but a confused deputy that can write to arbitrary sysctls is a bigger security issue than executable memfds.

/* New API */

The primary requirement from the original author appears to be the ability to restrict an entire system in a hierarchical manner [3], such that child namespaces cannot re-enable executable memfds. So, implement that behaviour explicitly: the vm.memfd_noexec scope is evaluated up the pidns tree to &init_pid_ns, and the most restrictive value found is applied to you (sketched after this commit message). The lowest value you can now set vm.memfd_noexec to is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective vm.memfd_noexec setting at unshare() time. This matches the existing behaviour, and it also ensures that a pidns will never have its vm.memfd_noexec setting *lowered* behind its back (but it will be raised if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the setting at all, there are no backwards-compatibility issues with this aspect of the change. However, it should be noted that the setting is now completely hierarchical. Previously, a cloned pidns would just copy the current pidns setting, meaning that if the parent's vm.memfd_noexec was changed it wouldn't propagate to existing pid namespaces. Now, the restriction applies recursively.
This is a uAPI change. However:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

Thus it seems incredibly unlikely that any real users would run into this issue. In the worst case, if this causes userspace issues, we could make it so that modifying the setting follows the hierarchical rules while the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <[email protected]>
Cc: Dominique Martinet <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Daniel Verkamp <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
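A minimal sketch of the hierarchical evaluation described above, assuming a per-pidns memfd_noexec_scope field (close to the patch's helper, but not guaranteed verbatim):

static int pidns_memfd_noexec_scope(struct pid_namespace *ns)
{
        int scope = MEMFD_NOEXEC_SCOPE_EXEC;

        /* walk up to &init_pid_ns; the most restrictive setting wins */
        for (; ns; ns = ns->parent)
                scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));

        return scope;
}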