path: root/kernel
Age | Commit message | Author | Files | Lines
2023-08-23 | tracing/probes: Support BTF argument on module functions | Masami Hiramatsu (Google) | 7 | -49/+74
Since the btf returned from bpf_get_btf_vmlinux() only covers functions in vmlinux, BTF arguments are not available for functions in modules. Use bpf_find_btf_id() instead of bpf_get_btf_vmlinux()+btf_find_name_kind() so that the BTF argument code can find the correct struct btf and the btf_type within it. With this fix, fprobe events can use `$arg*` on module functions as below:

  # grep nf_log_ip_packet /proc/kallsyms
  ffffffffa0005c00 t nf_log_ip_packet	[nf_log_syslog]
  ffffffffa0005bf0 t __pfx_nf_log_ip_packet	[nf_log_syslog]
  # echo 'f nf_log_ip_packet $arg*' > dynamic_events
  # cat dynamic_events
  f:fprobes/nf_log_ip_packet__entry nf_log_ip_packet net=net pf=pf hooknum=hooknum skb=skb in=in out=out loginfo=loginfo prefix=prefix

To support module btf, which is removable, the struct btf needs to be ref-counted. So this also records the btf in the traceprobe_parse_context and drops the reference when parsing is done. Link: https://lore.kernel.org/all/169272154223.160970.3507930084247934031.stgit@devnote2/ Suggested-by: Alexei Starovoitov <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]>
2023-08-23 | tracing/eprobe: Iterate trace_eprobe directly | Chuang Wang | 1 | -9/+9
Per the description in [1], we can skip the "container_of()" call that follows "list_for_each_entry()" by using "list_for_each_entry()" directly with "struct trace_eprobe" and "tp.list". This patch also defines "for_each_trace_eprobe_tp" to simplify the code sharing this logic. [1] https://lore.kernel.org/all/CAHk-=wjakjw6-rDzDDBsuMoDCqd+9ogifR_EE1F0K-jYek1CdA@mail.gmail.com/ Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Chuang Wang <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
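As a rough illustration of the change (the list-head name "probe_list" is a simplifying assumption, not the exact tracefs code), the before/after pattern looks like:

  /* Before: iterate the embedded struct trace_probe, then convert
   * each node back to its containing struct trace_eprobe: */
  struct trace_probe *pos;

  list_for_each_entry(pos, probe_list, list) {
          struct trace_eprobe *ep = container_of(pos, struct trace_eprobe, tp);
          /* ... use ep ... */
  }

  /* After: let list_for_each_entry() do the container arithmetic by
   * naming the member path through the embedded structure: */
  #define for_each_trace_eprobe_tp(ep, probe_list) \
          list_for_each_entry(ep, probe_list, tp.list)

  struct trace_eprobe *ep;

  for_each_trace_eprobe_tp(ep, probe_list) {
          /* ... use ep ... */
  }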
2023-08-23 | kernel: kprobes: Use struct_size() | Ruan Jinjie | 1 | -4/+2
Use struct_size() instead of hand-writing it, when allocating a structure with a flex array. This is less verbose. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Ruan Jinjie <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
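A minimal sketch of the pattern (hypothetical struct, not the kprobes code):

  #include <linux/overflow.h>
  #include <linux/slab.h>
  #include <linux/types.h>

  struct entry { u64 addr; };

  struct table {
          unsigned int cnt;
          struct entry ents[];    /* flexible array member */
  };

  static struct table *table_alloc(size_t n)
  {
          /* Instead of kzalloc(sizeof(*t) + n * sizeof(struct entry), ...),
           * where the multiplication can silently overflow, struct_size()
           * saturates on overflow so the allocation fails cleanly instead
           * of being undersized: */
          struct table *t = kzalloc(struct_size(t, ents, n), GFP_KERNEL);

          if (t)
                  t->cnt = n;
          return t;
  }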
2023-08-22 | bpf: Fix check_func_arg_reg_off bug for graph root/node | Kumar Kartikeya Dwivedi | 1 | -11/+0
The commit being fixed introduced a hunk into check_func_arg_reg_off that bypasses the reg->off == 0 enforcement when the offset points to a graph node or root. This might have been done to treat bpf_rbtree_remove and others as KF_RELEASE and then check the correct reg->off later in the helper argument checks. But that is not the case: those helpers are already not KF_RELEASE, and they permit a non-zero reg->off and verify later that it matches the subobject in the BTF type. However, this logic lets bpf_obj_drop free register arguments with a non-zero offset when they point to a graph root or node within them, which is not ok. For instance:

  struct foo {
          int i;
          int j;
          struct bpf_rb_node node;
  };

  struct foo *f = bpf_obj_new(typeof(*f));
  if (!f) ...
  bpf_obj_drop(f);        // OK
  bpf_obj_drop(&f->i);    // still ok from verifier PoV
  bpf_obj_drop(&f->node); // Not OK, but permitted right now

Fix this by dropping that part of the code altogether. Fixes: 6a3cd3318ff6 ("bpf: Migrate release_on_unlock logic to non-owning ref semantics") Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-22 | PM: QoS: Add check to make sure CPU latency is non-negative | Clive Lin | 1 | -2/+7
CPU latency should never be negative; a negative value becomes incorrectly large when converted to an unsigned data type. Commit 8d36694245f2 ("PM: QoS: Add check to make sure CPU freq is non-negative") makes sure CPU frequency is non-negative to fix incorrect behavior in frequency QoS. Add an analogous check to make sure CPU latency is non-negative so as to prevent this problem from happening in CPU latency QoS. Signed-off-by: Clive Lin <[email protected]> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
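The shape of such a check is simple; a sketch of the idea (the wrapper function here is illustrative, only cpu_latency_qos_update_request() is the real API from linux/pm_qos.h):

  /* Reject negative latency requests up front: once a negative s32 is
   * treated as an unsigned value it turns into a huge latency bound. */
  static int latency_qos_set(struct pm_qos_request *req, s32 value)
  {
          if (value < 0)
                  return -EINVAL;
          cpu_latency_qos_update_request(req, value);
          return 0;
  }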
2023-08-22 | bpf: Fix a bpf_kptr_xchg() issue with local kptr | Yonghong Song | 1 | -10/+15
When reviewing local percpu kptr support, Alexei discovered a bug where a bpf_kptr_xchg() may succeed even if the map value kptr type and the locally allocated obj type do not match ([1]). The reason for the bug is a missing struct btf_id comparison. This patch adds the struct btf_id comparison and flags a verification failure if the types do not match. [1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t Reported-by: Alexei Starovoitov <[email protected]> Fixes: 738c96d5e2e3 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg") Signed-off-by: Yonghong Song <[email protected]> Acked-by: Kumar Kartikeya Dwivedi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-22 | tracing/user_events: Optimize safe list traversals | Eric Vaughn | 1 | -7/+8
Several of the list traversals in the user_events facility use safe list traversals where they could be using the unsafe versions instead. Replace these safe traversals with their unsafe counterparts in the interest of optimization. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Suggested-by: Beau Belgrave <[email protected]> Signed-off-by: Eric Vaughn <[email protected]> Acked-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
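For reference, the difference between the two traversal forms (a generic sketch with hypothetical entry type and helpers, not the user_events code):

  /* list_for_each_entry_safe() caches the next node up front, so the
   * current entry may be freed inside the loop body: */
  struct user_event *ev, *next;

  list_for_each_entry_safe(ev, next, &event_list, link) {
          if (should_remove(ev)) {          /* hypothetical helper */
                  list_del(&ev->link);
                  kfree(ev);
          }
  }

  /* When the loop never removes entries, the plain form avoids the
   * extra "next" bookkeeping: */
  list_for_each_entry(ev, &event_list, link)
          inspect(ev);                      /* hypothetical helper */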
2023-08-22 | tracing: Remove unused function declarations | Yue Haibing | 1 | -2/+0
Commit 9457158bbc0e ("tracing: Fix reset of time stamps during trace_clock changes") left behind the tracing_reset_current() declaration. Also, commit 6954e415264e ("tracing: Place trace_pid_list logic into abstract functions") removed the trace_free_pid_list() implementation but left behind its declaration. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: <[email protected]> Signed-off-by: Yue Haibing <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22 | tracing/filters: Further optimise scalar vs cpumask comparison | Valentin Schneider | 1 | -6/+20
Per the previous commits, we now only enter do_filter_scalar_cpumask() with a mask of weight greater than one. Optimise the equality checks. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
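The key observation, sketched below with illustrative names modeled on the filter op enum, is that once the user mask is known to contain more than one CPU, a scalar CPU field can never be equal to it, so the EQ/NE outcomes are constant:

  /* Reached only when cpumask_weight(mask) > 1, per the previous
   * commits: '==' is statically false and '!=' statically true, so
   * only the '&' (membership) case needs to consult the mask. */
  static bool scalar_vs_mask(int op, unsigned int cpu,
                             const struct cpumask *mask)
  {
          switch (op) {
          case OP_EQ:
                  return false;
          case OP_NE:
                  return true;
          default:
                  return cpumask_test_cpu(cpu, mask);  /* '&' */
          }
  }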
2023-08-22 | tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU | Valentin Schneider | 1 | -2/+7
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. In this case we can directly re-use filter_pred_cpu(); we just need to transform '&' into '==' before executing it. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
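A sketch of the degenerate-case rewrite at filter-setup time (the filter_pred field names are simplified assumptions; filter_pred_cpu() is named in the changelog above):

  /* If the user mask has exactly one CPU, compare the event's CPU
   * against that one value with the existing scalar predicate: */
  if (cpumask_weight(mask) == 1) {
          pred->val = cpumask_first(mask);  /* the single CPU id */
          pred->op  = OP_EQ;                /* '&' degenerates to '==' */
          pred->fn  = filter_pred_cpu;
  }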
2023-08-22 | tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU | Valentin Schneider | 1 | -1/+6
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. When the mask contains a single CPU, directly re-use the unsigned field predicate functions. Transform '&' into '==' beforehand. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22 | tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU | Valentin Schneider | 1 | -1/+34
Steven noted that when the user-provided cpumask contains a single CPU, then the filtering function can use a scalar as input instead of a full-fledged cpumask. Reuse do_filter_scalar_cpumask() when the input mask has a weight of one. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22 | tracing/filters: Enable filtering the CPU common field by a cpumask | Valentin Schneider | 1 | -0/+14
The tracing_cpumask lets us specify which CPUs are traced in a buffer instance, but doesn't let us do this on a per-event basis (unless one creates an instance per event). A previous commit added filtering scalar fields by a user-given cpumask; make this work with the CPU common field as well. This enables doing things like:

  $ trace-cmd record -e 'sched_switch' -f 'CPU & CPUS{12-52}' \
                     -e 'sched_wakeup' -f 'target_cpu & CPUS{12-52}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22 | tracing/filters: Enable filtering a scalar field by a cpumask | Valentin Schneider | 1 | -11/+81
Several events use a scalar field to denote a CPU:
  o sched_wakeup.target_cpu
  o sched_migrate_task.orig_cpu,dest_cpu
  o sched_move_numa.src_cpu,dst_cpu
  o ipi_send_cpu.cpu
  o ...
Filtering these currently requires using arithmetic comparison functions, which can be tedious when dealing with interleaved SMT or NUMA CPU ids. Allow these to be filtered by a user-provided cpumask, which enables e.g.:

  $ trace-cmd record -e 'sched_wakeup' -f 'target_cpu & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-22 | tracing/filters: Enable filtering a cpumask field by another cpumask | Valentin Schneider | 1 | -2/+95
The recently introduced ipi_send_cpumask trace event contains a cpumask field, but it currently cannot be used in filter expressions. Make event filtering aware of cpumask fields, and allow these to be filtered by a user-provided cpumask. The user-provided cpumask is to be given in cpulist format and wrapped as: "CPUS{$cpulist}". The use of curly braces instead of parentheses is to prevent predicate_parse() from parsing the contents of CPUS{...} as a full-fledged predicate subexpression. This enables e.g.:

  $ trace-cmd record -e 'ipi_send_cpumask' -f 'cpumask & CPUS{2,4,6,8-32}'

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
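Parsing the user-supplied "CPUS{...}" token can lean on the existing cpulist helper; a fragment sketching the idea (the surrounding filter plumbing is omitted and variable names are illustrative):

  cpumask_var_t mask;

  if (!alloc_cpumask_var(&mask, GFP_KERNEL))
          return -ENOMEM;

  /* str holds the "$cpulist" text from inside CPUS{...}, e.g.
   * "2,4,6,8-32"; cpulist_parse() fills the mask, returning 0 on
   * success: */
  if (cpulist_parse(str, mask)) {
          free_cpumask_var(mask);
          return -EINVAL;
  }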
2023-08-22 | tracing/filters: Dynamically allocate filter_pred.regex | Valentin Schneider | 1 | -25/+39
Every predicate allocation includes a MAX_FILTER_STR_VAL (256) char array in the regex field, even if the predicate function does not use the field. A later commit will introduce a dynamically allocated cpumask to struct filter_pred, which will require a dedicated freeing function. Bite the bullet and make filter_pred.regex dynamically allocated. While at it, reorder the fields of filter_pred to fill in the byte holes. The struct now fits on a single cacheline. No change in behaviour intended. The kfree()'s were patched via Coccinelle:

  @@
  struct filter_pred *pred;
  @@
  -kfree(pred);
  +free_predicate(pred);

Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Daniel Bristot de Oliveira <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Leonardo Bras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
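With regex now a pointer, freeing a predicate becomes a two-step affair; a plausible shape for the dedicated helper named in the Coccinelle patch above (a sketch consistent with the changelog, not necessarily the exact code):

  static void free_predicate(struct filter_pred *pred)
  {
          if (!pred)
                  return;
          kfree(pred->regex);  /* dynamically allocated since this change */
          kfree(pred);
  }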
2023-08-21 | bpf: Add bpf_get_func_ip helper support for uprobe link | Jiri Olsa | 1 | -3/+30
Adding support for the bpf_get_func_ip helper to be called from an ebpf program attached by a uprobe_multi link. It returns the ip of the uprobe. Acked-by: Andrii Nakryiko <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21 | bpf: Add pid filter support for uprobe_multi link | Jiri Olsa | 2 | -1/+34
Adding support to specify a pid for a uprobe_multi link, so that uprobes are created only for the task with the given pid value. This uses the consumer.filter callback, so the task gets filtered during the uprobe installation. We still need to check the task during runtime in the uprobe handler, because the handler could get executed if there's another system-wide consumer on the same uprobe (thanks Oleg for the insight). Cc: Oleg Nesterov <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21 | bpf: Add cookies support for uprobe_multi link | Jiri Olsa | 2 | -5/+42
Adding support to specify a cookies array for a uprobe_multi link. The cookies array shares indexes and length with the other uprobe_multi arrays (offsets/ref_ctr_offsets). The cookies[i] value defines the cookie for the i-th uprobe and will be returned by the bpf_get_attach_cookie helper when called from an ebpf program hooked to that specific uprobe. Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21 | bpf: Add multi uprobe link | Jiri Olsa | 2 | -3/+244
Adding a new multi uprobe link that allows attaching a bpf program to multiple uprobes. Uprobes to attach are specified via the new link_create uprobe_multi union:

  struct {
          __aligned_u64   path;
          __aligned_u64   offsets;
          __aligned_u64   ref_ctr_offsets;
          __u32           cnt;
          __u32           flags;
  } uprobe_multi;

Uprobes are defined for a single binary specified in path and multiple calling sites specified in the offsets array, with optional reference counters specified in the ref_ctr_offsets array. All specified arrays have a length of 'cnt'. For now, 'flags' supports a single bit that marks the uprobe as a return probe. Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
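From userspace, attaching through this union could look roughly like the following hedged sketch, using the raw bpf(2) syscall (the BPF_TRACE_UPROBE_MULTI attach type comes from this series; the binary path and offsets[] values are placeholders):

  #include <linux/bpf.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int attach_uprobe_multi(int prog_fd)
  {
          /* Placeholder call-site offsets within the target binary: */
          static __u64 offsets[] = { 0x1234, 0x5678 };
          union bpf_attr attr = {};

          attr.link_create.prog_fd = prog_fd;   /* loaded BPF program */
          attr.link_create.attach_type = BPF_TRACE_UPROBE_MULTI;
          attr.link_create.uprobe_multi.path =
                  (__u64)(unsigned long)"/usr/bin/target";
          attr.link_create.uprobe_multi.offsets =
                  (__u64)(unsigned long)offsets;
          attr.link_create.uprobe_multi.cnt = 2;

          /* Returns a link fd on success: */
          return syscall(SYS_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
  }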
2023-08-21 | bpf: Add attach_type checks under bpf_prog_attach_check_attach_type | Jiri Olsa | 1 | -68/+52
Add extra attach_type checks from link_create under bpf_prog_attach_check_attach_type. Suggested-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Yafang Shao <[email protected]> Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21 | bpf, cpumap: Clean up bpf_cpu_map_entry directly in cpu_map_free | Hou Tao | 1 | -9/+8
After synchronize_rcu(), both the detached XDP program and xdp_do_flush() have completed, and the only remaining user of bpf_cpu_map_entry will be cpu_map_kthread_run(). So instead of calling __cpu_map_entry_replace() to stop the kthread and clean up the entry after an RCU grace period, do these things directly. Signed-off-by: Hou Tao <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-08-21 | bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier() | Hou Tao | 1 | -69/+27
Currently __cpu_map_entry_replace() uses call_rcu() to wait for the in-flight xdp program to exit the RCU read-side critical section, and then launches the kworker cpu_map_kthread_stop() to call kthread_stop() to flush all pending xdp frames or skbs. But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to wait for the completion of __cpu_map_entry_free(), because rcu_barrier() waits for all pending RCU callbacks while cpu_map_kthread_stop() only needs to wait for the completion of one specific __cpu_map_entry_free(). So use queue_rcu_work() to replace call_rcu(), schedule_work() and rcu_barrier(): queue_rcu_work() queues a __cpu_map_entry_free() kworker after an RCU grace period. Because __cpu_map_entry_free() runs in a kworker context, it is OK to do all of the freeing procedures, including kthread_stop(), in it. After the update there is no need to do reference-counting for bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in __cpu_map_entry_free(), so just remove it. Signed-off-by: Hou Tao <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
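queue_rcu_work() combines the grace-period wait and the work queueing in a single call; a minimal sketch of the pattern (the free_work field name is an assumption, __cpu_map_entry_free() is named in the changelog above):

  #include <linux/workqueue.h>

  /* Embed an rcu_work in the entry and point it at the free routine: */
  INIT_RCU_WORK(&rcpu->free_work, __cpu_map_entry_free);

  /* Queues the work only after a full RCU grace period has elapsed,
   * so by the time __cpu_map_entry_free() runs, no RCU reader can
   * still hold a reference to the entry: */
  queue_rcu_work(system_wq, &rcpu->free_work);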
2023-08-21 | mm: free up a word in the first tail page | Matthew Wilcox (Oracle) | 1 | -1/+0
Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm: add large_rmappable page flag | Matthew Wilcox (Oracle) | 1 | -1/+0
Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | mm: remove HUGETLB_PAGE_DTOR | Matthew Wilcox (Oracle) | 1 | -1/+1
We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Sidhartha Kumar <[email protected]> Cc: Yanteng Si <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | treewide: drop CONFIG_EMBEDDED | Randy Dunlap | 1 | -1/+1
There is only one Kconfig user of CONFIG_EMBEDDED and it can be switched to EXPERT or "if !ARCH_MULTIPLATFORM" (suggested by Arnd). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Randy Dunlap <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Palmer Dabbelt <[email protected]> [RISC-V] Acked-by: Greg Ungerer <[email protected]> Acked-by: Jason A. Donenfeld <[email protected]> Acked-by: Michael Ellerman <[email protected]> [powerpc] Cc: Russell King <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Brian Cain <[email protected]> Cc: Michal Simek <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Dinh Nguyen <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Albert Ou <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Rich Felker <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Max Filippov <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Masahiro Yamada <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | lockdep: fix static memory detection even more | Helge Deller | 1 | -22/+14
On the parisc architecture, lockdep reports this warning for all static objects which are in the __initdata section (e.g. "setup_done" in devtmpfs, "kthreadd_done" in init/main.c):

  INFO: trying to register non-static key.

The warning itself is wrong, because those objects are in the __initdata section, but on parisc that section is outside the range from _stext to _end, which is why the static_obj() function returns a wrong answer. While fixing this issue, I noticed that the whole existing check can be simplified a lot. Instead of checking against the _stext and _end symbols (which include code areas too), just check for the .data and .bss segments (since we check a data object). This can be done with the existing is_kernel_core_data() macro. In addition, objects in the __initdata section can be checked with init_section_contains(), and is_kernel_rodata() allows keys to be in the _ro_after_init section. This partly reverts and simplifies commit bac59d18c701 ("x86/setup: Fix static memory detection"). Link: https://lkml.kernel.org/r/ZNqrLRaOi/3wPAdp@p100 Fixes: bac59d18c701 ("x86/setup: Fix static memory detection") Signed-off-by: Helge Deller <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
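Putting the three helpers named above together, the simplified check has roughly this shape (a sketch; the real function also has to handle per-cpu and module areas, omitted here):

  static int static_obj(const void *obj)
  {
          unsigned long addr = (unsigned long)obj;

          if (is_kernel_core_data(addr))             /* .data and .bss */
                  return 1;
          if (init_section_contains((void *)obj, 1)) /* __initdata keys */
                  return 1;
          if (is_kernel_rodata(addr))                /* __ro_after_init etc. */
                  return 1;

          /* per-cpu and module data handling omitted in this sketch */
          return 0;
  }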
2023-08-21 | kernel/fork: stop playing lockless games for exe_file replacement | Mateusz Guzik | 1 | -13/+9
xchg originated in 6e399cd144d8 ("prctl: avoid using mmap_sem for exe_file serialization"). While the commit message does not explain *why* the change was made, I found the original submission [1], which ultimately claims it cleans things up by removing the dependency of exe_file on the semaphore. However, fe69d560b5bd ("kernel/fork: always deny write access to current MM exe_file") added a semaphore up/down cycle to synchronize the state of exe_file against fork, defeating the point of the original change. This is on top of semaphore trips already present both in the replacing function and in prctl (the only consumer). Normally replacing exe_file does not happen for busy processes, thus write-locking is not an impediment to performance in the intended use case. If someone keeps invoking the routine for a busy process they are trying to play dirty, and that's another reason to avoid any trickery. As such, I think the atomic here only adds complexity for no benefit. Just write-lock around the replacement. I also note that the replacement races against the mapping check loop, as nothing synchronizes the actual assignment with said checks, but I am not addressing that in this patch. (Is the loop of any use to begin with?) Link: https://lore.kernel.org/linux-mm/[email protected]/ [1] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: "Christian Brauner (Microsoft)" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mateusz Guzik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy | Aleksa Sarai | 3 | -19/+18
This sysctl has the very unusual behaviour of not allowing any user (even CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you were to set this sysctl to a more restrictive option in the host pidns you would need to reboot your machine in order to reset it. The justification given in [1] is that this is a security feature and thus it should not be possible to disable. Aside from the fact that we have plenty of security-related sysctls that can be disabled after being enabled (fs.protected_symlinks for instance), the protection provided by the sysctl is to stop users from being able to create a binary and then execute it. A user with CAP_SYS_ADMIN can trivially do this without memfd_create(2):

  % cat mount-memfd.c
  #include <assert.h>
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
          int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
          assert(fsfd >= 0);
          assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

          int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
          assert(dfd >= 0);

          int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0755);
          assert(execfd >= 0);
          assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
          assert(!close(execfd));

          char *execpath = NULL;
          char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
          execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
          assert(execfd >= 0);
          assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
          assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable binaries without memfd_create(2) and without touching the host filesystem (not to mention the many other things a CAP_SYS_ADMIN process would be able to do that would be equivalent or worse), it seems strange to cause a fair amount of headache for admins when there doesn't appear to be an actual security benefit to blocking this. There appear to be concerns about confused-deputy-esque attacks [2], but a confused deputy that can write to arbitrary sysctls is a bigger security issue than executable memfds.

/* New API */

The primary requirement from the original author appears to be more based on the need to be able to restrict an entire system in a hierarchical manner [3], such that child namespaces cannot re-enable executable memfds. So, implement that behaviour explicitly -- the vm.memfd_noexec scope is evaluated up the pidns tree to &init_pid_ns and you have the most restrictive value applied to you. The new lower limit you can set vm.memfd_noexec to is whatever limit applies to your parent. Note that a pidns will inherit a copy of the parent pidns's effective vm.memfd_noexec setting at unshare() time. This matches the existing behaviour, and it also ensures that a pidns will never have its vm.memfd_noexec setting *lowered* behind its back (but it will be raised if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the setting at all, there are no backwards compatibility issues with this aspect of the change. However, it should be noted that the setting is now completely hierarchical. Previously, a cloned pidns would just copy the current pidns setting, meaning that if the parent's vm.memfd_noexec was changed it wouldn't propagate to existing pid namespaces. Now, the restriction applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into this issue. In the worst case, if this causes userspace issues, we could make it so that modifying the setting follows the hierarchical rules but the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Link: https://lkml.kernel.org/r/[email protected] Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") Signed-off-by: Aleksa Sarai <[email protected]> Cc: Dominique Martinet <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Verkamp <[email protected]> Cc: Jeff Xu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | perf/core: use vma_is_initial_stack() and vma_is_initial_heap() | Kefeng Wang | 1 | -22/+11
Use the helpers to simplify the code; also kill the unneeded goto cpy_name. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Alex Deucher <[email protected]> Cc: Christian Göttsche <[email protected]> Cc: "Christian König" <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: David Airlie <[email protected]> Cc: Eric Paris <[email protected]> Cc: Felix Kuehling <[email protected]> Cc: "Pan, Xinhui" <[email protected]> Cc: Paul Moore <[email protected]> Cc: Stephen Smalley <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | kernel/iomem.c: remove __weak ioremap_cache helper | Arnd Bergmann | 1 | -8/+4
No portable code calls into this function any more, and on architectures that don't use or define their own, it causes a warning:

  kernel/iomem.c:10:22: warning: no previous prototype for 'ioremap_cache' [-Wmissing-prototypes]
     10 | __weak void __iomem *ioremap_cache(resource_size_t offset, unsigned long size)

Fold it into the only caller that uses it on architectures without the #define. Note that the fallback to ioremap is probably still wrong on those architectures, but this is what it's always done there. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21 | nsproxy: Convert nsproxy.count to refcount_t | Elena Reshetova | 1 | -2/+2
atomic_t variables are currently used to implement reference counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further increments aren't allowed
 - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to a use-after-free situation and be exploitable. The variable nsproxy.count is used as a pure reference counter. Convert it to refcount_t and fix up the operations.

**Important note for maintainers:

Some functions from the refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For nsproxy.count it might make a difference in the following places:
 - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. the fully ordered atomic counterpart

Suggested-by: Kees Cook <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
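The conversion pattern itself is mechanical; a sketch of the before/after (free_nsproxy() is the real release routine in kernel/nsproxy.c, the surrounding code is abridged):

  /* Before: plain atomic_t, which silently wraps on overflow/underflow. */
  atomic_set(&ns->count, 1);
  atomic_inc(&ns->count);
  if (atomic_dec_and_test(&ns->count))
          free_nsproxy(ns);

  /* After: refcount_t saturates and WARNs instead of wrapping. */
  refcount_set(&ns->count, 1);
  refcount_inc(&ns->count);
  if (refcount_dec_and_test(&ns->count))
          free_nsproxy(ns);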
2023-08-21 | tracing: Introduce pipe_cpumask to avoid race on trace_pipes | Zheng Yejian | 2 | -7/+50
There is a race issue when concurrently splice_read()ing the main trace_pipe and the per_cpu trace_pipes, which results in the data read out being different from what was actually written. As suggested by Steven:

> I believe we should add a ref count to trace_pipe and the per_cpu
> trace_pipes, where if they are opened, nothing else can read it.
>
> Opening trace_pipe locks all per_cpu ref counts, if any of them are
> open, then the trace_pipe open will fail (and releases any ref counts
> it had taken).
>
> Opening a per_cpu trace_pipe will up the ref count for just that
> CPU buffer. This will allow multiple tasks to read different per_cpu
> trace_pipe files, but will prevent the main trace_pipe file from
> being opened.

But because we only need to know whether a per_cpu trace_pipe is open or not, using a cpumask instead of a ref count may be easier. After this patch, users will find that:
 - The main trace_pipe can be opened by only one user, and if it is opened, all per_cpu trace_pipes cannot be opened;
 - Per_cpu trace_pipes can be opened by multiple users, but each per_cpu trace_pipe can only be opened by one user. And if one of them is opened, the main trace_pipe cannot be opened.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Suggested-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Zheng Yejian <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
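A cpumask gives an atomic test-and-claim primitive for exactly this; roughly (the pipe_cpumask field name is taken from the patch title, error codes and placement are assumptions):

  /* Per-cpu trace_pipe open: claim this CPU's slot, fail if taken. */
  if (cpumask_test_and_set_cpu(cpu, tr->pipe_cpumask))
          return -EBUSY;

  /* Main trace_pipe open: only allowed when no per-cpu pipe is open;
   * claiming every bit then blocks subsequent per-cpu opens. */
  if (!cpumask_empty(tr->pipe_cpumask))
          return -EBUSY;
  cpumask_setall(tr->pipe_cpumask);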
2023-08-20 | Merge commit b320441c04c9 ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next | Greg Kroah-Hartman | 4 | -21/+76
We need the serial-core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-08-18 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 1 | -1/+1
Cross-merge networking fixes after downstream PR.

Conflicts:

  drivers/net/ethernet/sfc/tc.c
    fa165e194997 ("sfc: don't unregister flow_indr if it was never registered")
    3bf969e88ada ("sfc: add MAE table machinery for conntrack table")
  https://lore.kernel.org/all/[email protected]/

No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-18 | watchdog/hardlockup: avoid large stack frames in watchdog_hardlockup_check() | Douglas Anderson | 1 | -7/+2
After commit 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") we started storing a `struct cpumask` on the stack in watchdog_hardlockup_check(). On systems with CONFIG_NR_CPUS set to 8192 this takes up 1K on the stack. That triggers warnings with `CONFIG_FRAME_WARN` set to 1024. We'll use the new trigger_allbutcpu_cpu_backtrace() to avoid needing to use a CPU mask at all. Link: https://lkml.kernel.org/r/20230804065935.v4.2.I501ab68cb926ee33a7c87e063d207abf09b9943c@changeid Fixes: 77c12fc95980 ("watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()") Signed-off-by: Douglas Anderson <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/r/[email protected] Acked-by: Michal Hocko <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Pingfan Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | nmi_backtrace: allow excluding an arbitrary CPU | Douglas Anderson | 1 | -1/+1
The APIs that allow backtracing across CPUs have always had a way to exclude the current CPU. This convenience means callers didn't need to find a place to allocate a CPU mask just to handle the common case. Let's extend the API to take a CPU ID to exclude instead of just a boolean. This isn't any more complex for the API to handle and allows the hardlockup detector to exclude a different CPU (the one it already did a trace for) without needing to find space for a CPU mask. Arguably, this new API also encourages safer behavior. Specifically if the caller wants to avoid tracing the current CPU (maybe because they already traced the current CPU) this makes it more obvious to the caller that they need to make sure that the current CPU ID can't change. [[email protected]: fix trigger_allbutcpu_cpu_backtrace() stub] Link: https://lkml.kernel.org/r/20230804065935.v4.1.Ia35521b91fc781368945161d7b28538f9996c182@changeid Signed-off-by: Douglas Anderson <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: kernel test robot <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
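In use, the extended API reads like this (a sketch; trigger_allbutcpu_cpu_backtrace() is named in the watchdog entry above, trigger_allbutself_cpu_backtrace() was the pre-existing boolean-style helper):

  /* Before: could only exclude the calling CPU. */
  trigger_allbutself_cpu_backtrace();

  /* After: exclude an arbitrary CPU id, e.g. one already traced.
   * Excluding the current CPU now reads as: */
  trigger_allbutcpu_cpu_backtrace(smp_processor_id());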
2023-08-18 | kthread: unexport __kthread_should_park() | Greg Kroah-Hartman | 1 | -2/+1
There are no in-kernel users of __kthread_should_park() so mark it as static and do not export it. Link: https://lkml.kernel.org/r/2023080450-handcuff-stump-1d6e@gregkh Signed-off-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: John Stultz <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: "Arve Hjønnevåg" <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Andrew Morton <[email protected]> Cc: "Christian Brauner (Microsoft)" <[email protected]> Cc: Mike Christie <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Zqiang <[email protected]> Cc: Prathu Baronia <[email protected]> Cc: Sami Tolvanen <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | gcov: shut up missing prototype warnings for internal stubs | Arnd Bergmann | 1 | -0/+2
gcov uses global functions that are called from generated code, but these have no prototype in a header, which causes a W=1 build warning:

  kernel/gcov/gcc_base.c:12:6: error: no previous prototype for '__gcov_init' [-Werror=missing-prototypes]
  kernel/gcov/gcc_base.c:40:6: error: no previous prototype for '__gcov_flush' [-Werror=missing-prototypes]
  kernel/gcov/gcc_base.c:46:6: error: no previous prototype for '__gcov_merge_add' [-Werror=missing-prototypes]
  kernel/gcov/gcc_base.c:52:6: error: no previous prototype for '__gcov_merge_single' [-Werror=missing-prototypes]

Just turn off these warnings unconditionally for the two files that contain them. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Tested-by: Peter Oberparleiter <[email protected]> Acked-by: Peter Oberparleiter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | kernel: relay: remove unnecessary NULL values from relay_open_buf | Li kunyu | 1 | -1/+1
buf is assigned before it is used, so it does not need an initial NULL assignment. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Li kunyu <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | remove ARCH_DEFAULT_KEXEC from Kconfig.kexec | Eric DeVolder | 1 | -1/+0
This patch is a minor cleanup to the series "refactor Kconfig to consolidate KEXEC and CRASH options". In that series, a new option ARCH_DEFAULT_KEXEC was introduced in order to obtain the equivalent behavior of s390 original Kconfig settings for KEXEC. As it turns out, this new option did not fully provide the equivalent behavior, rather a "select KEXEC" did. As such, the ARCH_DEFAULT_KEXEC is not needed anymore, so remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Acked-by: Alexander Gordeev <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | kexec: rename ARCH_HAS_KEXEC_PURGATORY | Eric DeVolder | 1 | -3/+3
The Kconfig refactor to consolidate KEXEC and CRASH options utilized option names of the form ARCH_SUPPORTS_<option>. Thus rename ARCH_HAS_KEXEC_PURGATORY to ARCH_SUPPORTS_KEXEC_PURGATORY to follow the same convention. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | kexec: consolidate kexec and crash options into kernel/Kconfig.kexec | Eric DeVolder | 1 | -0/+116
Patch series "refactor Kconfig to consolidate KEXEC and CRASH options", v6. The Kconfig is refactored to consolidate KEXEC and CRASH options from various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec. The Kconfig.kexec is now a submenu titled "Kexec and crash features" located under "General Setup". The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_IMAGE_VERIFY_SIG - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP Over time, these options have been copied between Kconfig files and are very similar to one another, but with slight differences. The following architectures are impacted by the refactor (because of use of one or more KEXEC/CRASH options): - arm - arm64 - ia64 - loongarch - m68k - mips - parisc - powerpc - riscv - s390 - sh - x86 More information: In the patch series "crash: Kernel handling of CPU and memory hot un/plug" https://lore.kernel.org/lkml/[email protected]/ the new kernel feature introduces the config option CRASH_HOTPLUG. In reviewing, Thomas Gleixner requested that the new config option not be placed in x86 Kconfig. Rather the option needs a generic/common home. To Thomas' point, the KEXEC and CRASH options have largely been duplicated in the various arch/<arch>/Kconfig files, with minor differences. This kind of proliferation is to be avoid/stopped. https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/ To that end, I have refactored the arch Kconfigs so as to consolidate the various KEXEC and CRASH options. Generally speaking, this work has the following themes: - KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec - These items from arch/Kconfig: CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC - These items from arch/x86/Kconfig form the common options: KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP - These items from arch/arm64/Kconfig form the common options: KEXEC_IMAGE_VERIFY_SIG - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec - The Kconfig.kexec is now a submenu titled "Kexec and crash features" and is now listed in "General Setup" submenu from init/Kconfig. - To control the common options, each has a new ARCH_SUPPORTS_<option> option. These gateway options determine whether the common options options are valid for the architecture. - To account for the slight differences in the original architecture coding of the common options, each now has a corresponding ARCH_SELECTS_<option> which are used to elicit the same side effects as the original arch/<arch>/Kconfig files for KEXEC and CRASH options. An example, 'make menuconfig' illustrating the submenu: > General setup > Kexec and crash features [*] Enable kexec system call [*] Enable kexec file based system call [*] Verify kernel signature during kexec_file_load() syscall [ ] Require a valid signature in kexec_file_load() syscall [ ] Enable bzImage signature verification support [*] kexec jump [*] kernel crash dumps [*] Update the crash elfcorehdr on system configuration changes In the process of consolidating the common options, I encountered slight differences in the coding of these options in several of the architectures. As a result, I settled on the following solution: - Each of the common options has a 'depends on ARCH_SUPPORTS_<option>' statement. For example, the KEXEC_FILE option has a 'depends on ARCH_SUPPORTS_KEXEC_FILE' statement. 
This approach is needed on all common options so as to prevent options from appearing for architectures which previously did not allow/enable them. For example, arm supports KEXEC but not KEXEC_FILE. The arch/arm/Kconfig does not provide ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options are not available to arm. - The boolean ARCH_SUPPORTS_<option> in effect allows the arch to determine when the feature is allowed. Archs which don't have the feature simply do not provide the corresponding ARCH_SUPPORTS_<option>. For each arch, where there previously were KEXEC and/or CRASH options, these have been replaced with the corresponding boolean ARCH_SUPPORTS_<option>, and an appropriate def_bool statement. For example, if the arch supports KEXEC_FILE, then the ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits the KEXEC_FILE option to be available. If the arch has a 'depends on' statement in its original coding of the option, then that expression becomes part of the def_bool expression. For example, arm64 had: config KEXEC depends on PM_SLEEP_SMP and in this solution, this converts to: config ARCH_SUPPORTS_KEXEC def_bool PM_SLEEP_SMP - In order to account for the architecture differences in the coding for the common options, the ARCH_SELECTS_<option> in the arch/<arch>/Kconfig is used. This option has a 'depends on <option>' statement to couple it to the main option, and from there can insert the differences from the common option and the arch original coding of that option. For example, a few archs enable CRYPTO and CRYTPO_SHA256 for KEXEC_FILE. These require a ARCH_SELECTS_KEXEC_FILE and 'select CRYPTO' and 'select CRYPTO_SHA256' statements. Illustrating the option relationships: For each of the common KEXEC and CRASH options: ARCH_SUPPORTS_<option> <- <option> <- ARCH_SELECTS_<option> <option> # in Kconfig.kexec ARCH_SUPPORTS_<option> # in arch/<arch>/Kconfig, as needed ARCH_SELECTS_<option> # in arch/<arch>/Kconfig, as needed For example, KEXEC: ARCH_SUPPORTS_KEXEC <- KEXEC <- ARCH_SELECTS_KEXEC KEXEC # in Kconfig.kexec ARCH_SUPPORTS_KEXEC # in arch/<arch>/Kconfig, as needed ARCH_SELECTS_KEXEC # in arch/<arch>/Kconfig, as needed To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be enabled, and the ARCH_SELECTS_<option> handles side effects (ie. select statements). Examples: A few examples to show the new strategy in action: ===== x86 (minus the help section) ===== Original: config KEXEC bool "kexec system call" select KEXEC_CORE config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE select HAVE_IMA_KEXEC if IMA depends on X86_64 depends on CRYPTO=y depends on CRYPTO_SHA256=y config ARCH_HAS_KEXEC_PURGATORY def_bool KEXEC_FILE config KEXEC_SIG bool "Verify kernel signature during kexec_file_load() syscall" depends on KEXEC_FILE config KEXEC_SIG_FORCE bool "Require a valid signature in kexec_file_load() syscall" depends on KEXEC_SIG config KEXEC_BZIMAGE_VERIFY_SIG bool "Enable bzImage signature verification support" depends on KEXEC_SIG depends on SIGNED_PE_FILE_VERIFICATION select SYSTEM_TRUSTED_KEYRING config CRASH_DUMP bool "kernel crash dumps" depends on X86_64 || (X86_32 && HIGHMEM) config KEXEC_JUMP bool "kexec jump" depends on KEXEC && HIBERNATION help becomes... 
New: config ARCH_SUPPORTS_KEXEC def_bool y config ARCH_SUPPORTS_KEXEC_FILE def_bool X86_64 && CRYPTO && CRYPTO_SHA256 config ARCH_SELECTS_KEXEC_FILE def_bool y depends on KEXEC_FILE select HAVE_IMA_KEXEC if IMA config ARCH_SUPPORTS_KEXEC_PURGATORY def_bool KEXEC_FILE config ARCH_SUPPORTS_KEXEC_SIG def_bool y config ARCH_SUPPORTS_KEXEC_SIG_FORCE def_bool y config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG def_bool y config ARCH_SUPPORTS_KEXEC_JUMP def_bool y config ARCH_SUPPORTS_CRASH_DUMP def_bool X86_64 || (X86_32 && HIGHMEM) ===== powerpc (minus the help section) ===== Original: config KEXEC bool "kexec system call" depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP) select KEXEC_CORE config KEXEC_FILE bool "kexec file based system call" select KEXEC_CORE select HAVE_IMA_KEXEC if IMA select KEXEC_ELF depends on PPC64 depends on CRYPTO=y depends on CRYPTO_SHA256=y config ARCH_HAS_KEXEC_PURGATORY def_bool KEXEC_FILE config CRASH_DUMP bool "Build a dump capture kernel" depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) select RELOCATABLE if PPC64 || 44x || PPC_85xx becomes... New: config ARCH_SUPPORTS_KEXEC def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP) config ARCH_SUPPORTS_KEXEC_FILE def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y config ARCH_SUPPORTS_KEXEC_PURGATORY def_bool KEXEC_FILE config ARCH_SELECTS_KEXEC_FILE def_bool y depends on KEXEC_FILE select KEXEC_ELF select HAVE_IMA_KEXEC if IMA config ARCH_SUPPORTS_CRASH_DUMP def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) config ARCH_SELECTS_CRASH_DUMP def_bool y depends on CRASH_DUMP select RELOCATABLE if PPC64 || 44x || PPC_85xx Testing Approach and Results There are 388 config files in the arch/<arch>/configs directories. For each of these config files, a .config is generated both before and after this Kconfig series, and checked for equivalence. This approach allows for a rather rapid check of all architectures and a wide variety of configs wrt/ KEXEC and CRASH, and avoids requiring compiling for all architectures and running kernels and run-time testing. For each config file, the olddefconfig, allnoconfig and allyesconfig targets are utilized. In testing the randconfig has revealed problems as well, but is not used in the before and after equivalence check since one can not generate the "same" .config for before and after, even if using the same KCONFIG_SEED since the option list is different. As such, the following script steps compare the before and after of 'make olddefconfig'. The new symbols introduced by this series are filtered out, but otherwise the config files are PASS only if they were equivalent, and FAIL otherwise. 
The script performs the test by doing the following: # Obtain the "golden" .config output for given config file # Reset test sandbox git checkout master git branch -D test_Kconfig git checkout -B test_Kconfig master make distclean # Write out updated config cp -f <config file> .config make ARCH=<arch> olddefconfig # Track each item in .config, LHSB is "golden" scoreboard .config # Obtain the "changed" .config output for given config file # Reset test sandbox make distclean # Apply this Kconfig series git am <this Kconfig series> # Write out updated config cp -f <config file> .config make ARCH=<arch> olddefconfig # Track each item in .config, RHSB is "changed" scoreboard .config # Determine test result # Filter-out new symbols introduced by this series # Filter-out symbol=n which not in either scoreboard # Compare LHSB "golden" and RHSB "changed" scoreboards and issue PASS/FAIL The script was instrumental during the refactoring of Kconfig as it continually revealed problems. The end result being that the solution presented in this series passes all configs as checked by the script, with the following exceptions: - arch/ia64/configs/zx1_config with olddefconfig This config file has: # CONFIG_KEXEC is not set CONFIG_CRASH_DUMP=y and this refactor now couples KEXEC to CRASH_DUMP, so it is not possible to enable CRASH_DUMP without KEXEC. - arch/sh/configs/* with allyesconfig The arch/sh/Kconfig codes CRASH_DUMP as dependent upon BROKEN_ON_MMU (which clearly is not meant to be set). This symbol is not provided but with the allyesconfig it is set to yes which enables CRASH_DUMP. But KEXEC is coded as dependent upon MMU, and is set to no in arch/sh/mm/Kconfig, so KEXEC is not enabled. This refactor now couples KEXEC to CRASH_DUMP, so it is not possible to enable CRASH_DUMP without KEXEC. While the above exceptions are not equivalent to their original, the config file produced is valid (and in fact better wrt/ CRASH_DUMP handling). This patch (of 14) The config options for kexec and crash features are consolidated into new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a new submenu "Kexec and crash handling". All the kexec and crash options that were once in the arch-dependent submenu "Processor type and features" are now consolidated in the new submenu. The following options are impacted: - KEXEC - KEXEC_FILE - KEXEC_SIG - KEXEC_SIG_FORCE - KEXEC_BZIMAGE_VERIFY_SIG - KEXEC_JUMP - CRASH_DUMP The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP. Architectures specify support of certain KEXEC and CRASH features with similarly named new ARCH_SUPPORTS_<option> config options. Architectures can utilize the new ARCH_SELECTS_<option> config options to specify additional components when <option> is enabled. To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be enabled, and the ARCH_SELECTS_<option> handles side effects (ie. select statements). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Eric DeVolder <[email protected]> Cc: Albert Ou <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Cc. "H. 
Peter Anvin" <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> # for x86 Cc: Frederic Weisbecker <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hari Bathini <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Juerg Haefliger <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Linus Walleij <[email protected]> Cc: Marc Aurèle La France <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Masami Hiramatsu (Google) <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Miguel Ojeda <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Russell King (Oracle) <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Sebastian Reichel <[email protected]> Cc: Sourabh Jain <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: WANG Xuerui <[email protected]> Cc: Will Deacon <[email protected]> Cc: Xin Li <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Zhen Lei <[email protected]> Cc: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | acct: replace all non-returning strlcpy with strscpy | Azeem Shaikh | 1 | -1/+1
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Azeem Shaikh <[email protected]> Cc: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
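The replacement pattern is mechanical (a generic sketch; acct.c's actual buffer names aren't shown here):

  /* strlcpy() would first scan all of src to compute its return value,
   * reading past the destination size and past an unterminated src.
   * strscpy() stops at the destination size and NUL-terminates: */
  char comm[TASK_COMM_LEN];

  strscpy(comm, src, sizeof(comm));  /* return value unused, as in acct.c */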
2023-08-18 | signal: print comm and exe name on fatal signals | Vincent Whitchurch | 1 | -1/+12
Make the print-fatal-signals message more useful by printing the comm and the exe name for the process which received the fatal signal:

Before:

  potentially unexpected fatal signal 4
  potentially unexpected fatal signal 11

After:

  buggy-program: pool: potentially unexpected fatal signal 4
  some-daemon: gdbus: potentially unexpected fatal signal 11

comm used to be present but was removed in commit 681a90ffe829b8ee25d ("arc, print-fatal-signals: reduce duplicated information") because it's also included as part of the later stack trace. Having the comm as part of the main "unexpected fatal..." print is rather useful though when analysing logs, and the exe name is also valuable, as shown in the examples above where the comm ends up having some generic name like "pool". [[email protected]: don't include linux/file.h twice] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vincent Whitchurch <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Vineet Gupta <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | cred: convert printks to pr_<level> | tiozhang | 1 | -12/+15
Use current logging style. Link: https://lkml.kernel.org/r/20230625033452.GA22858@didi-ThinkCentre-M930t-N000 Signed-off-by: tiozhang <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joe Perches <[email protected]> Cc: Kees Cook <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Paulo Alcantara <[email protected]> Cc: Weiping Zhang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
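The conversion follows the usual printk-to-pr_* mapping; e.g. (the message text below is illustrative, not a quote from cred.c):

  /* before */
  printk(KERN_WARNING "CRED: example warning message\n");
  /* after: pr_warn() also picks up a pr_fmt() prefix if one is defined */
  pr_warn("CRED: example warning message\n");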
2023-08-18 | mmu_notifiers: don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end() | Alistair Popple | 1 | -1/+1
Secondary TLBs are now invalidated from the architecture-specific TLB invalidation functions. Therefore there is no need to explicitly notify or invalidate as part of the range end functions. This means we can remove mmu_notifier_invalidate_range_end_only() and some of the ptep_*_notify() functions. Link: https://lkml.kernel.org/r/90d749d03cbab256ca0edeb5287069599566d783.1690292440.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Cc: Andrew Donnellan <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chaitanya Kumar Borah <[email protected]> Cc: Frederic Barrat <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Kevin Tian <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Nicolin Chen <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Sean Christopherson <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Will Deacon <[email protected]> Cc: Zhi Wang <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | mm: move is_ioremap_addr() into new header file | Baoquan He | 1 | -0/+1
Now is_ioremap_addr() is only used in kernel/iomem.c and gonna be used in mm/ioremap.c. Move it into its own new header file linux/ioremap.h. Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Christoph Hellwig <[email protected]> Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Brian Cain <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David Laight <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Jonas Bonn <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Mike Rapoport (IBM) <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Niklas Schnelle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Stafford Horne <[email protected]> Cc: Stefan Kristiansson <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vineet Gupta <[email protected]> Cc: Will Deacon <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-18 | mm/mm_init.c: remove obsolete macro HASH_SMALL | Miaohe Lin | 1 | -2/+1
HASH_SMALL only works when parameter numentries is 0. But the sole caller futex_init() never calls alloc_large_system_hash() with numentries set to 0. So HASH_SMALL is obsolete and remove it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Miaohe Lin <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: André Almeida <[email protected]> Cc: Darren Hart <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>