path: root/kernel
Age | Commit message | Author | Files | Lines
2023-04-04  bpf: Remove unused arguments from btf_struct_access().  (Alexei Starovoitov; 1 file; -2/+2)

Remove unused arguments from btf_struct_access() callback.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2023-04-04  bpf: Invoke btf_struct_access() callback only for writes.  (Alexei Starovoitov; 1 file; -1/+1)

Remove the duplicated "if (atype == BPF_READ) btf_struct_access()" logic from each btf_struct_access() callback and invoke the callback only for writes. This is possible because the custom btf_struct_access() callbacks currently always delegate to the generic btf_struct_access() helper for BPF_READ accesses.

Signed-off-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Acked-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2023-04-04  srcu: Fix long lines in srcu_funnel_gp_start()  (Paul E. McKenney; 1 file; -13/+14)

This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Fix long lines in srcu_gp_end()  (Paul E. McKenney; 1 file; -20/+21)

This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Fix long lines in cleanup_srcu_struct()  (Paul E. McKenney; 1 file; -10/+11)

This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Cc: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Fix long lines in srcu_get_delay()  (Paul E. McKenney; 1 file; -5/+6)

This commit creates an srcu_usage pointer named "sup" as a shorter synonym for the "ssp->srcu_sup" that was bloating several lines of code.

Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Check for readers at module-exit time  (Paul E. McKenney; 1 file; -1/+2)

If a given statically allocated in-module srcu_struct structure was ever used for updates, srcu_module_going() will invoke cleanup_srcu_struct() at module-exit time. This will check for the error case of SRCU readers persisting past module-exit time. On the other hand, if this srcu_struct structure never went through a grace period, srcu_module_going() only invokes free_percpu(), which would result in strange failures if SRCU readers persisted past module-exit time.

This commit therefore adds a srcu_readers_active() check to srcu_module_going(), splatting if readers have persisted and refraining from invoking free_percpu() in that case. Better to leak memory than to suffer silent memory corruption!

[ paulmck: Apply Zhang, Qiang1 feedback on memory leak. ]

Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

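A simplified sketch of the resulting module-exit logic; the function body and the srcu_was_ever_used() predicate here are illustrative stand-ins, not the exact kernel code:

  static void srcu_module_going_sketch(struct srcu_struct *ssp)
  {
          if (srcu_was_ever_used(ssp)) {          /* hypothetical predicate */
                  /* cleanup_srcu_struct() already splats on readers */
                  cleanup_srcu_struct(ssp);
          } else if (!WARN_ON(srcu_readers_active(ssp))) {
                  /* Free only when no readers persist; on a splat we
                   * deliberately leak rather than risk corruption. */
                  free_percpu(ssp->sda);
          }
  }
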
2023-04-04  srcu: Move work-scheduling fields from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -19/+22)

This commit moves the ->reschedule_jiffies, ->reschedule_count, and ->work fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

However, this means that the container_of() calls cannot get a pointer to the srcu_struct because they are no longer in the srcu_struct. This issue is addressed by adding a ->srcu_ssp field in the srcu_usage structure that references the corresponding srcu_struct structure. And given the presence of the sup pointer to the srcu_usage structure, replace some ssp->srcu_sup-> instances with sup->.

[ paulmck: Apply feedback from kernel test robot. ]

Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

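A self-contained sketch of the back-pointer pattern described above, with structure members trimmed to the essentials and names suffixed to mark them as illustrative:

  #include <linux/kernel.h>
  #include <linux/workqueue.h>

  struct srcu_struct_sketch;

  struct srcu_usage_sketch {
          struct work_struct work;                /* moved here from srcu_struct */
          struct srcu_struct_sketch *srcu_ssp;    /* back-pointer, set at init */
  };

  struct srcu_struct_sketch {
          struct srcu_usage_sketch *srcu_sup;
  };

  static void process_srcu_sketch(struct work_struct *work)
  {
          /* container_of() now yields the srcu_usage, not the srcu_struct... */
          struct srcu_usage_sketch *sup =
                  container_of(work, struct srcu_usage_sketch, work);
          /* ...so the back-pointer recovers the srcu_struct. */
          struct srcu_struct_sketch *ssp = sup->srcu_ssp;

          (void)ssp;      /* grace-period processing would go here */
  }
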
2023-04-04  srcu: Move srcu_barrier() fields from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -19/+19)

This commit moves the ->srcu_barrier_seq, ->srcu_barrier_mutex, ->srcu_barrier_completion, and ->srcu_barrier_cpu_cnt fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->sda_is_static from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -4/+4)

This commit moves the ->sda_is_static field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move heuristics fields from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -9/+9)

This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries, and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move grace-period fields from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -64/+64)

This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed, ->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -7/+7)

This commit moves the ->srcu_gp_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->lock from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -28/+28)

This commit moves the ->lock field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->lock initialization after srcu_usage allocation  (Paul E. McKenney; 1 file; -2/+2)

Currently, both __init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=y kernels and init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=n kernels initialize the srcu_struct structure's ->lock before the srcu_usage structure has been allocated. This of course prevents the ->lock from being moved to the srcu_usage structure, so this commit moves the initialization into init_srcu_struct_fields() after the srcu_usage structure has been allocated.

Cc: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -3/+3)

This commit moves the ->srcu_cb_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->srcu_size_state from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -18/+19)

This commit moves the ->srcu_size_state field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Move ->level from srcu_struct to srcu_usage  (Paul E. McKenney; 1 file; -7/+7)

This commit moves the ->level[] array from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality.

Suggested-by: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Begin offloading srcu_struct fields to srcu_update  (Paul E. McKenney; 2 files; -11/+23)

The current srcu_struct structure is on the order of 200 bytes in size (depending on architecture and .config), which is much better than the old-style 26K bytes, but still all too inconvenient when one is trying to achieve good cache locality on a fastpath involving SRCU readers.

However, only a few fields in srcu_struct are used by SRCU readers. The remaining fields could be offloaded to a new srcu_update structure, thus shrinking the srcu_struct structure down to a few tens of bytes. This commit begins this noble quest, a quest that is complicated by open-coded initialization of the srcu_struct within the srcu_notifier_head structure. This complication is addressed by updating the srcu_notifier_head structure's open coding, given that there does not appear to be a straightforward way of abstracting that initialization.

This commit moves only the ->node pointer to srcu_update. Later commits will move additional fields.

[ paulmck: Fold in [email protected]'s memory-leak fix. ]

Link: https://lore.kernel.org/all/[email protected]/
Suggested-by: Christoph Hellwig <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: "Michał Mirosław" <[email protected]>
Cc: Dmitry Osipenko <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  srcu: Use static init for statically allocated in-module srcu_struct  (Paul E. McKenney; 1 file; -6/+13)

Further shrinking the srcu_struct structure is eased by requiring that in-module srcu_struct structures rely more heavily on static initialization. In particular, this preserves the property that a module-load-time srcu_struct initialization can fail only due to memory-allocation failure of the per-CPU srcu_data structures. It might also slightly improve robustness by keeping the number of memory allocations that must succeed down to the percpu_alloc() call.

This is in preparation for splitting an srcu_usage structure out of the srcu_struct structure.

[ paulmck: Fold in [email protected] feedback. ]

Cc: Christoph Hellwig <[email protected]>
Tested-by: Sachin Sant <[email protected]>
Tested-by: "Zhang, Qiang1" <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-04  rcu-tasks: Fix warning for unused tasks_rcu_exit_srcu  (Paul E. McKenney; 1 file; -0/+2)

The tasks_rcu_exit_srcu variable is used only by kernels built with CONFIG_TASKS_RCU=y, but is defined for all kernels with CONFIG_TASKS_RCU_GENERIC=y. Therefore, in kernels built with CONFIG_TASKS_RCU_GENERIC=y but CONFIG_TASKS_RCU=n, this gives a "defined but not used" warning. This commit therefore moves this variable under CONFIG_TASKS_RCU.

Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

2023-04-03  bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call  (Dave Marchevsky; 1 file; -5/+9)

bpf_obj_drop_impl has a void return type. In check_kfunc_call, the "else if" which sets insn_aux->kptr_struct_meta for bpf_obj_drop_impl is surrounded by a larger if statement which checks btf_type_is_ptr. As a result:

  * The bpf_obj_drop_impl-specific code will never execute
  * The btf_struct_meta input to bpf_obj_drop is always NULL
  * __bpf_obj_drop_impl will always see a NULL btf_record when called from BPF program, and won't call bpf_obj_free_fields
  * program-allocated kptrs which have fields that should be cleaned up by bpf_obj_free_fields may instead leak resources

This patch adds a btf_type_is_void branch to the larger if and moves special handling for bpf_obj_drop_impl there, fixing the issue.

Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
Cc: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Dave Marchevsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

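A paraphrased sketch of the fixed control flow in check_kfunc_call(); the branch structure follows the description above, but the surrounding verifier logic is elided and the field names should be treated as approximations of the actual patch:

  t = btf_type_by_id(desc_btf, func_proto->type);     /* kfunc return type */

  if (btf_type_is_ptr(t)) {
          /* pointer-returning kfuncs: acquire / ret-null handling,
           * struct_meta recording for constructors, etc. */
  } else if (btf_type_is_void(t)) {
          /* bpf_obj_drop_impl returns void, so its struct_meta lookup
           * must live here, not under the btf_type_is_ptr() branch */
          if (meta.func_id == special_kfunc_list[KF_bpf_obj_drop_impl])
                  insn_aux->kptr_struct_meta =
                          btf_find_struct_meta(meta.arg_obj_drop.btf,
                                               meta.arg_obj_drop.btf_id);
  }
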
2023-04-03  driver core: class: remove struct class_interface * from callbacks  (Greg Kroah-Hartman; 1 file; -2/+1)

The add_dev and remove_dev callbacks in struct class_interface currently pass in a pointer back to the class_interface structure that is calling them, but none of the callback implementations actually use this pointer as it is pointless (the structure is known, the driver passed it in in the first place if it is really needed again.) So clean this up and just remove the pointer from the callbacks and fix up all callback functions.

Cc: Jean Delvare <[email protected]>
Cc: Guenter Roeck <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Kurt Schwemmer <[email protected]>
Cc: Jon Mason <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Allen Hubbe <[email protected]>
Cc: Dominik Brodowski <[email protected]>
Cc: Matt Porter <[email protected]>
Cc: Alexandre Bounine <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Doug Gilbert <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Hans de Goede <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Wang Weiyang <[email protected]>
Cc: Yang Yingliang <[email protected]>
Cc: Jakob Koschel <[email protected]>
Cc: Cai Xinchen <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Acked-by: Logan Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/2023040250-pushover-platter-509c@gregkh
Signed-off-by: Greg Kroah-Hartman <[email protected]>

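The resulting callback shape, sketched from the description above; the struct is abbreviated and the "old" lines are shown only for contrast:

  struct class_interface {
          struct list_head node;
          /* ... other members elided ... */

          /* old: int  (*add_dev)(struct device *dev, struct class_interface *intf); */
          /* old: void (*remove_dev)(struct device *dev, struct class_interface *intf); */
          int  (*add_dev)(struct device *dev);
          void (*remove_dev)(struct device *dev);
  };
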
2023-04-03  tracing/osnoise: Fix notify new tracing_max_latency  (Daniel Bristot de Oliveira; 1 file; -1/+1)

osnoise/timerlat tracers are reporting new max latency on instances where the tracing is off, creating inconsistencies between the max reported values in the trace and in the tracing_max_latency. Thus only report new tracing_max_latency on active tracing instances.

Link: https://lkml.kernel.org/r/ecd109fde4a0c24ab0f00ba1e9a144ac19a91322.1680104184.git.bristot@kernel.org
Cc: [email protected]
Fixes: dae181349f1e ("tracing/osnoise: Support a list of trace_array *tr")
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-04-03  tracing/timerlat: Notify new max thread latency  (Daniel Bristot de Oliveira; 1 file; -0/+2)

timerlat is not reporting a new tracing_max_latency for the thread latency. The reason is that it is not calling notify_new_max_latency() after the new thread latency is sampled. Call notify_new_max_latency() after computing the thread latency.

Link: https://lkml.kernel.org/r/16e18d61d69073d0192ace07bf61e405cca96e9c.1680104184.git.bristot@kernel.org
Cc: [email protected]
Fixes: dae181349f1e ("tracing/osnoise: Support a list of trace_array *tr")
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-04-03  ring-buffer: Fix race while reader and writer are on the same page  (Zheng Yejian; 1 file; -1/+12)

When a user reads the 'trace_pipe' file, the kernel keeps printing logs that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in rb_get_reader_page(). It looks as if there is an infinite loop in tracing_read_pipe(). This problem occurred several times on the arm64 platform when testing v5.10 and below.

Call trace:
 rb_get_reader_page+0x248/0x1300
 rb_buffer_peek+0x34/0x160
 ring_buffer_peek+0xbc/0x224
 peek_next_entry+0x98/0xbc
 __find_next_entry+0xc4/0x1c0
 trace_find_next_entry_inc+0x30/0x94
 tracing_read_pipe+0x198/0x304
 vfs_read+0xb4/0x1e0
 ksys_read+0x74/0x100
 __arm64_sys_read+0x24/0x30
 el0_svc_common.constprop.0+0x7c/0x1bc
 do_el0_svc+0x2c/0x94
 el0_svc+0x20/0x30
 el0_sync_handler+0xb0/0xb4
 el0_sync+0x160/0x180

Then I dumped the vmcore and looked into the problematic per_cpu ring_buffer, and found that tail_page/commit_page/reader_page are on the same page while reader_page->read is obviously abnormal:

  tail_page == commit_page == reader_page == {
    .write = 0x100d20,
    .read = 0x8f9f4805,      // Far greater than 0xd20, obviously abnormal!!!
    .entries = 0x10004c,
    .real_end = 0x0,
    .page = {
      .time_stamp = 0x857257416af0,
      .commit = 0xd20,       // This page hasn't been fully filled.
                             // .data[0...0xd20] seems normal.
    }
  }

The root cause is most likely a race in which the reader and writer are on the same page and the reader sees an event that has not yet been fully committed by the writer. To fix this, add memory barriers to make sure the reader can see the content of what is committed. Since commit a0fcaaed0c46 ("ring-buffer: Fix race between reset page and reading page") has already added the read barrier in rb_get_reader_page(), here we just need to add the write barrier.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: [email protected]
Fixes: 77ae365eca89 ("ring-buffer: make lockless")
Suggested-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Zheng Yejian <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

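A hedged, generic sketch of the barrier pairing being added; the structure and helpers are illustrative, not the exact ring-buffer code:

  struct buf_page_sketch {
          unsigned long write;    /* bytes committed on this page */
          char data[4096];
  };

  /* Writer side (the part this patch adds): make the event data
   * visible before publishing the new write index. */
  static void publish_sketch(struct buf_page_sketch *p,
                             const char *ev, size_t len, size_t off)
  {
          memcpy(p->data + off, ev, len);
          smp_wmb();                      /* order data before index */
          WRITE_ONCE(p->write, off + len);
  }

  /* Reader side (barrier already added by commit a0fcaaed0c46): a
   * reader that observes the index also observes the data. */
  static size_t snapshot_sketch(struct buf_page_sketch *p)
  {
          size_t w = READ_ONCE(p->write);

          smp_rmb();                      /* pairs with the smp_wmb() above */
          return w;                       /* p->data[0..w) is now safe to read */
  }
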
2023-04-03  tracing/synthetic: Fix races on freeing last_cmd  (Tze-nan Wu; 1 file; -4/+15)

Currently, the "last_cmd" variable can be accessed by multiple processes asynchronously when multiple users manipulate the synthetic_events node at the same time, which could lead to a use-after-free or a double-free. This patch adds "lastcmd_mutex" to prevent "last_cmd" from being accessed asynchronously.

================================================================

It's easy to reproduce in a KASAN environment by running the two scripts below in different shells.

script 1:

  while :
  do
    echo -n -e '\x88' > /sys/kernel/tracing/synthetic_events
  done

script 2:

  while :
  do
    echo -n -e '\xb0' > /sys/kernel/tracing/synthetic_events
  done

================================================================

double-free scenario:

  process A                   process B
  -------------------         ---------------
  1. kstrdup last_cmd
                              2. free last_cmd
  3. free last_cmd (double-free)

================================================================

use-after-free scenario:

  process A                   process B
  -------------------         ---------------
  1. kstrdup last_cmd
                              2. free last_cmd
  3. tracing_log_err (use-after-free)

================================================================

Appendix 1. KASAN report, double-free:

  BUG: KASAN: double-free in kfree+0xdc/0x1d4
  Free of addr ***** by task sh/4879
  Call trace:
   ...
   kfree+0xdc/0x1d4
   create_or_delete_synth_event+0x60/0x1e8
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...
  Allocated by task 4879:
   ...
   kstrdup+0x5c/0x98
   create_or_delete_synth_event+0x6c/0x1e8
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...
  Freed by task 5464:
   ...
   kfree+0xdc/0x1d4
   create_or_delete_synth_event+0x60/0x1e8
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...

================================================================

Appendix 2. KASAN report, use-after-free:

  BUG: KASAN: use-after-free in strlen+0x5c/0x7c
  Read of size 1 at addr ***** by task sh/5483
  sh: CPU: 7 PID: 5483 Comm: sh
   ...
   __asan_report_load1_noabort+0x34/0x44
   strlen+0x5c/0x7c
   tracing_log_err+0x60/0x444
   create_or_delete_synth_event+0xc4/0x204
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...
  Allocated by task 5483:
   ...
   kstrdup+0x5c/0x98
   create_or_delete_synth_event+0x80/0x204
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...
  Freed by task 5480:
   ...
   kfree+0xdc/0x1d4
   create_or_delete_synth_event+0x74/0x204
   trace_parse_run_command+0x2bc/0x4b8
   synth_events_write+0x20/0x30
   vfs_write+0x200/0x830
   ...

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Fixes: 27c888da9867 ("tracing: Remove size restriction on synthetic event cmd error logging")
Cc: [email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Matthias Brugger <[email protected]>
Cc: AngeloGioacchino Del Regno <[email protected]>
Cc: "Tom Zanussi" <[email protected]>
Signed-off-by: Tze-nan Wu <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-04-03  printk: Remove obsoleted check for non-existent "user" object  (Stanislav Kinsburskii; 1 file; -12/+1)

The original check for a non-null "user" object was introduced by commit e11fea92e13f ("kmsg: export printk records to the /dev/kmsg interface"), when "user" could be NULL if /dev/kmsg was opened for writing. Subsequent change 750afe7babd1 ("printk: add kernel parameter to control writes to /dev/kmsg") made the "user" context required for files opened for write, but didn't remove the now-redundant checks for it to be non-NULL. This patch removes the dead code while preserving the current logic.

Signed-off-by: Stanislav Kinsburskii <[email protected]>
CC: Petr Mladek <[email protected]>
CC: Sergey Senozhatsky <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: John Ogness <[email protected]>
CC: [email protected]
Reviewed-by: Sergey Senozhatsky <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/167929571877.2810.9926967619100618792.stgit@skinsburskii.localdomain

2023-04-03  fork: use pidfd_prepare()  (Christian Brauner; 1 file; -11/+2)

Stop open-coding get_unused_fd_flags() and anon_inode_getfile(). That's brittle just for keeping the flags between both calls in sync. Use the dedicated helper.

Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

2023-04-03  pid: add pidfd_prepare()  (Christian Brauner; 2 files; -12/+92)

Add a new helper that allows the caller to reserve a pidfd and allocate a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open-code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called.

Reviewed-by: Jan Kara <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

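A hedged sketch of the calling pattern this enables; the later setup step and error label are invented for illustration, while pidfd_prepare(), fput(), put_unused_fd(), and fd_install() are the real interfaces involved:

  struct file *pidfd_file;
  int pidfd, err;

  pidfd = pidfd_prepare(pid, 0, &pidfd_file);     /* reserves fd + file */
  if (pidfd < 0)
          return pidfd;

  err = do_more_setup();                          /* hypothetical later step */
  if (err) {
          /* the fd was never installed, so no close_fd() dance is needed */
          fput(pidfd_file);
          put_unused_fd(pidfd);
          return err;
  }

  fd_install(pidfd, pidfd_file);                  /* publish only on success */
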
2023-04-03  Merge 6.3-rc5 into driver-core-next  (Greg Kroah-Hartman; 19 files; -59/+117)

We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

2023-04-02  bpf: compute hashes in bloom filter similar to hashmap  (Anton Protopopov; 1 file; -15/+2)

If the value size in a bloom filter is a multiple of 4, then the jhash2() function is used to compute hashes. The length parameter of this function equals the number of 32-bit words in the input. Compute it in the hot path instead of pre-computing it, as this translates to one extra shift to divide the length by four vs. one extra memory load of a pre-computed length.

Signed-off-by: Anton Protopopov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

2023-04-01  bpf: optimize hashmap lookups when key_size is divisible by 4  (Anton Protopopov; 1 file; -0/+2)

The BPF hashmap uses the jhash() hash function. There is an optimized version of this hash function which may be used if the hash size is a multiple of 4. Apply this optimization to the hashmap in a similar way as it is done in the bloom filter map.

In practice the optimization is only noticeable for smaller key sizes, which, however, is sufficient for many applications. An example is listed in the following table of measurements (a hashmap of 65536 elements was used):

  --------------------------------------------------------------------
  | key_size | fullness | lookups /sec | lookups (opt) /sec |  gain  |
  --------------------------------------------------------------------
  |        4 |      25% |      42.990M |            46.000M |   7.0% |
  |        4 |      50% |      37.910M |            39.094M |   3.1% |
  |        4 |      75% |      34.486M |            36.124M |   4.7% |
  |        4 |     100% |      31.760M |            32.719M |   3.0% |
  --------------------------------------------------------------------
  |        8 |      25% |      43.855M |            49.626M |  13.2% |
  |        8 |      50% |      38.328M |            42.152M |  10.0% |
  |        8 |      75% |      34.483M |            38.088M |  10.5% |
  |        8 |     100% |      31.306M |            34.686M |  10.8% |
  --------------------------------------------------------------------
  |       12 |      25% |      38.398M |            43.770M |  14.0% |
  |       12 |      50% |      33.336M |            37.712M |  13.1% |
  |       12 |      75% |      29.917M |            34.440M |  15.1% |
  |       12 |     100% |      27.322M |            30.480M |  11.6% |
  --------------------------------------------------------------------
  |       16 |      25% |      41.491M |            41.921M |   1.0% |
  |       16 |      50% |      36.206M |            36.474M |   0.7% |
  |       16 |      75% |      32.529M |            33.027M |   1.5% |
  |       16 |     100% |      29.581M |            30.325M |   2.5% |
  --------------------------------------------------------------------
  |       20 |      25% |      34.240M |            36.787M |   7.4% |
  |       20 |      50% |      30.328M |            32.663M |   7.7% |
  |       20 |      75% |      27.536M |            29.354M |   6.6% |
  |       20 |     100% |      24.847M |            26.505M |   6.7% |
  --------------------------------------------------------------------
  |       24 |      25% |      36.329M |            40.608M |  11.8% |
  |       24 |      50% |      31.444M |            35.059M |  11.5% |
  |       24 |      75% |      28.426M |            31.452M |  10.6% |
  |       24 |     100% |      26.278M |            28.741M |   9.4% |
  --------------------------------------------------------------------
  |       28 |      25% |      31.540M |            31.944M |   1.3% |
  |       28 |      50% |      27.739M |            28.063M |   1.2% |
  |       28 |      75% |      24.993M |            25.814M |   3.3% |
  |       28 |     100% |      23.513M |            23.500M |  -0.1% |
  --------------------------------------------------------------------
  |       32 |      25% |      32.116M |            33.953M |   5.7% |
  |       32 |      50% |      28.879M |            29.859M |   3.4% |
  |       32 |      75% |      26.227M |            26.948M |   2.7% |
  |       32 |     100% |      23.829M |            24.613M |   3.3% |
  --------------------------------------------------------------------
  |       64 |      25% |      22.535M |            22.554M |   0.1% |
  |       64 |      50% |      20.471M |            20.675M |   1.0% |
  |       64 |      75% |      19.077M |            19.146M |   0.4% |
  |       64 |     100% |      17.710M |            18.131M |   2.4% |
  --------------------------------------------------------------------

The following script was used to gather the results (SMT & frequency off):

  cd tools/testing/selftests/bpf
  for key_size in 4 8 12 16 20 24 28 32 64; do
    for nr_entries in `seq 16384 16384 65536`; do
      fullness=$(printf '%3s' $((nr_entries*100/65536)))
      echo -n "key_size=$key_size: $fullness% full: "
      sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=$key_size --nr_entries=$nr_entries --max_entries=65536 --nr_loops=2000000 --map_flags=0x40 | grep cpu
    done
    echo
  done

Signed-off-by: Anton Protopopov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

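A sketch of the resulting hash dispatch; the function shape is simplified relative to the in-kernel htab code, but jhash() and jhash2() are the real helpers from linux/jhash.h:

  #include <linux/jhash.h>

  static inline u32 htab_hash_sketch(const void *key, u32 key_size, u32 hashrnd)
  {
          /* jhash2() consumes 32-bit words, so it applies only when the
           * key size is word-aligned; its length argument is in words,
           * hence the division by 4 */
          if (likely((key_size & 3) == 0))
                  return jhash2(key, key_size / 4, hashrnd);
          return jhash(key, key_size, hashrnd);
  }
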
2023-04-01  bpf: Remove now-defunct task kfuncs  (David Vernet; 1 file; -69/+0)

In commit 22df776a9a86 ("tasks: Extract rcu_users out of union"), the 'refcount_t rcu_users' field was extracted out of a union with the 'struct rcu_head rcu' field. This allows us to safely perform a refcount_inc_not_zero() on task->rcu_users when acquiring a reference on a task struct. A prior patch leveraged this by making struct task_struct an RCU-protected object in the verifier, and by bpf_task_acquire() to use the task->rcu_users field for synchronization.

Now that we can use RCU to protect tasks, we no longer need bpf_task_kptr_get(), or bpf_task_acquire_not_zero(). bpf_task_kptr_get() is truly completely unnecessary, as we can just use RCU to get the object. bpf_task_acquire_not_zero() is now equivalent to bpf_task_acquire(). In addition to these changes, this patch also updates the associated selftests to no longer use these kfuncs.

Signed-off-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

2023-04-01  bpf: Make struct task_struct an RCU-safe type  (David Vernet; 2 files; -4/+8)

struct task_struct objects are a bit interesting in terms of how their lifetime is protected by refcounts. task structs have two refcount fields:

1. refcount_t usage: Protects the memory backing the task struct. When this refcount drops to 0, the task is immediately freed, without waiting for an RCU grace period to elapse. This is the field that most callers in the kernel currently use to ensure that a task remains valid while it's being referenced, and is what's currently tracked with bpf_task_acquire() and bpf_task_release().

2. refcount_t rcu_users: A refcount field which, when it drops to 0, schedules an RCU callback that drops a reference held on the 'usage' field above (which is acquired when the task is first created). This field therefore provides a form of RCU protection on the task by ensuring that at least one 'usage' refcount will be held until an RCU grace period has elapsed. The qualifier "a form of" is important here, as a task can remain valid after task->rcu_users has dropped to 0 and the subsequent RCU gp has elapsed.

In terms of BPF, we want to use task->rcu_users to protect tasks that function as referenced kptrs, and to allow tasks stored as referenced kptrs in maps to be accessed with RCU protection.

Let's first determine whether we can safely use task->rcu_users to protect tasks stored in maps. All of the bpf_task* kfuncs can only be called from tracepoint, struct_ops, or BPF_PROG_TYPE_SCHED_CLS program types. For tracepoint and struct_ops programs, the struct task_struct passed to a program handler will always be trusted, so it will always be safe to call bpf_task_acquire() with any task passed to a program. Note, however, that we must update bpf_task_acquire() to be KF_RET_NULL, as it is possible that the task has exited by the time the program is invoked, even if the pointer is still currently valid because the main kernel holds a task->usage refcount. For BPF_PROG_TYPE_SCHED_CLS, tasks should never be passed as an argument to any of the program handlers, so it should not be relevant.

The second question is whether it's safe to use RCU to access a task that was acquired with bpf_task_acquire(), and stored in a map. Because bpf_task_acquire() now uses task->rcu_users, it follows that if the task is present in the map, then it must have had at least one task->rcu_users refcount by the time the current RCU cs was started. Therefore, it's safe to access that task until the end of the current RCU cs.

With all that said, this patch makes struct task_struct an RCU-protected object. In doing so, we also change bpf_task_acquire() to be KF_ACQUIRE | KF_RCU | KF_RET_NULL, and adjust any selftests as necessary. A subsequent patch will remove bpf_task_kptr_get() and bpf_task_acquire_not_zero().

Signed-off-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

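A minimal sketch of the acquire semantics described above, with the kfunc plumbing omitted; refcount_inc_not_zero() and the task->rcu_users field are the real interfaces:

  #include <linux/refcount.h>
  #include <linux/sched.h>

  /* Returns NULL if the task is already past the point where rcu_users
   * dropped to zero, which is why the kfunc must be KF_RET_NULL. */
  static struct task_struct *task_acquire_sketch(struct task_struct *p)
  {
          if (p && refcount_inc_not_zero(&p->rcu_users))
                  return p;       /* valid at least until rcu_users hits 0
                                   * plus one RCU grace period */
          return NULL;
  }
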
2023-03-31  iommu/sva: Move PASID helpers to sva code  (Jacob Pan; 1 file; -0/+1)

In preparation for removing the IOASID infrastructure, move PASID management under the SVA code and decouple the mm code from IOASID.

Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

2023-03-30  Merge tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds; 1 file; -13/+16)

Pull dma-mapping fixes from Christoph Hellwig:

 - fix for swiotlb deadlock due to wrong alignment checks (GuoRui.Yu, Petr Tesarik)

* tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix slot alignment checks
  swiotlb: use wrap_area_index() instead of open-coding it
  swiotlb: fix the deadlock in swiotlb_do_find_slots

2023-03-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 5 files; -19/+51)

Conflicts:

drivers/net/ethernet/mediatek/mtk_ppe.c
  3fbe4d8c0e53 ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting")
  924531326e2d ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow")

Signed-off-by: Jakub Kicinski <[email protected]>

2023-03-30  bpf: Handle PTR_MAYBE_NULL case in PTR_TO_BTF_ID helper call arg  (David Vernet; 1 file; -0/+4)

When validating a helper function argument, we use check_reg_type() to ensure that the register containing the argument is of the correct type. When the register's base type is PTR_TO_BTF_ID, there is some supplemental logic where we do extra checks for various combinations of PTR_TO_BTF_ID type modifiers. For example, for PTR_TO_BTF_ID, PTR_TO_BTF_ID | PTR_TRUSTED, and PTR_TO_BTF_ID | MEM_RCU, we call map_kptr_match_type() for bpf_kptr_xchg() calls, and btf_struct_ids_match() for other helper calls.

When an unhandled PTR_TO_BTF_ID type modifier combination is passed to check_reg_type(), the verifier fails with an internal verifier error message. This can currently be triggered by passing a PTR_MAYBE_NULL pointer to helper functions (currently just bpf_kptr_xchg()) with an ARG_PTR_TO_BTF_ID_OR_NULL arg type. For example, by calling bpf_kptr_xchg(&v->kptr, bpf_cpumask_create()).

Whether or not passing a PTR_MAYBE_NULL arg to an ARG_PTR_TO_BTF_ID_OR_NULL argument is valid is an interesting question. In a vacuum, it seems fine. A helper function with an ARG_PTR_TO_BTF_ID_OR_NULL arg would seem to be implying that it can handle either a NULL or non-NULL arg, and has logic in place to detect and gracefully handle each. This is the case for bpf_kptr_xchg(), which of course simply does an xchg(). On the other hand, bpf_kptr_xchg() also specifies OBJ_RELEASE, and refcounting semantics for a PTR_MAYBE_NULL pointer are different than handling it for a NULL _OR_ non-NULL pointer. For example, with a non-NULL arg, we should always fail if there was not a nonzero refcount for the value in the register being passed to the helper. For PTR_MAYBE_NULL on the other hand, it's unclear. If the pointer is NULL it would be fine, but if it's not NULL, it would be incorrect to load the program.

The current solution to this is to just fail if PTR_MAYBE_NULL is passed, and to instead require programs to have a NULL check to explicitly handle the NULL and non-NULL cases. This seems reasonable. Not only would it possibly be quite complicated to correctly handle PTR_MAYBE_NULL refcounting in the verifier, but it's also an arguably odd programming pattern in general to not explicitly handle the NULL case anyways. For example, it seems odd to not care about whether a pointer you're passing to bpf_kptr_xchg() was successfully allocated in a program such as the following:

  private(MASK) static struct bpf_cpumask __kptr * global_mask;

  SEC("tp_btf/task_newtask")
  int BPF_PROG(example, struct task_struct *task, u64 clone_flags)
  {
          struct bpf_cpumask *prev;

          /* bpf_cpumask_create() returns PTR_MAYBE_NULL */
          prev = bpf_kptr_xchg(&global_mask, bpf_cpumask_create());
          if (prev)
                  bpf_cpumask_release(prev);

          return 0;
  }

This patch therefore updates the verifier to explicitly check for PTR_MAYBE_NULL in check_reg_type(), and fail gracefully if it's observed. This isn't really "fixing" anything unsafe or incorrect. We're just updating the verifier to fail gracefully, and explicitly handle this pattern rather than unintentionally falling back to an internal verifier error path. A subsequent patch will update selftests.

Signed-off-by: David Vernet <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

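For contrast, a hedged sketch of the pattern the verifier now asks for, adapted from the example above; the explicit NULL check replaces the rejected PTR_MAYBE_NULL argument:

  SEC("tp_btf/task_newtask")
  int BPF_PROG(example_fixed, struct task_struct *task, u64 clone_flags)
  {
          struct bpf_cpumask *mask, *prev;

          mask = bpf_cpumask_create();
          if (!mask)
                  return 0;       /* handle the NULL case explicitly */

          /* mask is now a plain referenced kptr, not PTR_MAYBE_NULL */
          prev = bpf_kptr_xchg(&global_mask, mask);
          if (prev)
                  bpf_cpumask_release(prev);

          return 0;
  }
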
2023-03-30  dma-debug: Use %pa to format phys_addr_t  (Geert Uytterhoeven; 1 file; -4/+4)

On 32-bit without LPAE:

  kernel/dma/debug.c: In function ‘debug_dma_dump_mappings’:
  kernel/dma/debug.c:537:7: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 9 has type ‘phys_addr_t’ {aka ‘unsigned int’} [-Wformat=]
  kernel/dma/debug.c: In function ‘dump_show’:
  kernel/dma/debug.c:568:59: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 11 has type ‘phys_addr_t’ {aka ‘unsigned int’} [-Wformat=]

Fixes: bd89d69a529fbef3 ("dma-debug: add cacheline to user/kernel space dump messages")
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reported-by: [email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>

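A minimal illustration of the format-specifier rule involved; the function and variable names are made up, but %pa taking a pointer to phys_addr_t is the documented printk convention:

  #include <linux/printk.h>
  #include <linux/types.h>

  static void show_mapping_sketch(phys_addr_t phys)
  {
          /* %pa dereferences a pointer to phys_addr_t and prints the
           * correct width on both 32-bit and 64-bit kernels */
          pr_info("mapping at %pa\n", &phys);

          /* by contrast, "%llx" with a bare phys_addr_t triggers the
           * -Wformat warning above on 32-bit kernels without LPAE,
           * where phys_addr_t is only 32 bits wide */
  }
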
2023-03-29  cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated  (Waiman Long; 1 file; -9/+14)

Similar to commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified, as the hotplug code won't update cpumasks for tasks in the top cpuset when CPUs become online or offline.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2023-03-29  cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset  (Waiman Long; 1 file; -0/+3)

If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2023-03-29  Merge branch 'for-6.3-fixes' into for-6.4  (Tejun Heo; 2 files; -5/+11)

To receive 292fd843de26 ("cgroup/cpuset: Fix partition root's cpuset.cpus update bug") in preparation for further cpuset updates.

2023-03-29  cgroup/cpuset: Fix partition root's cpuset.cpus update bug  (Waiman Long; 1 file; -2/+10)

It was found that commit 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated. It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all, as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead.

Also update update_cpumask() to make sure that trialcs->cpus_allowed will not be corrupted until it is no longer needed.

Fixes: 7a2127e66a00 ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task")
Signed-off-by: Waiman Long <[email protected]>
Cc: [email protected] # v6.2+
Signed-off-by: Tejun Heo <[email protected]>

2023-03-29  tracing: Unbreak user events  (Steven Rostedt (Google); 1 file; -1/+0)

The user events feature was added a bit prematurely, and there were a few kernel developers that had issues with it. The API also needed a bit of work to make sure it would be stable. It was decided to make user events "broken" until this was settled. Now it has a new API that appears to be as stable as it will be without the use of a crystal ball. It's being used within Microsoft as is, which means the API has had some testing in real world use cases. It went through many discussions in the bi-weekly tracing meetings, and there's been no more comments about updates. I feel this is good to go.

Cc: Mathieu Desnoyers <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-03-29  tracing/user_events: Use print_format_fields() for trace output  (Steven Rostedt (Google); 1 file; -6/+3)

Currently, user events are shown using the "hex" output for "safety" reasons, as one cannot trust user events behaving nicely. But the hex output is not the only utility for safe outputting of trace events. The print_event_fields() is just as safe and gives user readable output.

Before:

  example-839 [001] ..... 43.222244: 00000000: b1 06 00 00 47 03 00 00 00 00 00 00  ....G.......
  example-839 [001] ..... 43.564433: 00000000: b1 06 00 00 47 03 00 00 01 00 00 00  ....G.......
  example-839 [001] ..... 43.763917: 00000000: b1 06 00 00 47 03 00 00 02 00 00 00  ....G.......
  example-839 [001] ..... 43.967929: 00000000: b1 06 00 00 47 03 00 00 03 00 00 00  ....G.......

After:

  example-837 [006] .....  55.739249: test: count=0x0 (0)
  example-837 [006] ..... 111.104784: test: count=0x1 (1)
  example-837 [006] ..... 111.268444: test: count=0x2 (2)
  example-837 [006] ..... 111.416533: test: count=0x3 (3)
  example-837 [006] ..... 111.542859: test: count=0x4 (4)

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-03-29  tracing/user_events: Align structs with tabs for readability  (Beau Belgrave; 1 file; -41/+41)

Add tabs to make struct members easier to read and unify the style of the code.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-03-29  tracing/user_events: Limit global user_event count  (Beau Belgrave; 1 file; -0/+47)

Operators want to be able to ensure enough tracepoints exist on the system for kernel components as well as for user components. Since there are only up to 64K events, by default allow up to half to be used by user events.

Add a kernel sysctl parameter (kernel.user_events_max) to set a global limit that is honored among all groups on the system. This ensures hard limits can be set up to prevent user processes from consuming all event IDs on the system.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-03-29  tracing/user_events: Charge event allocs to cgroups  (Beau Belgrave; 1 file; -10/+10)

Operators need a way to limit how much memory cgroups use. User events need to be included into that accounting. Fix this by using GFP_KERNEL_ACCOUNT for allocations generated by user programs for user_event tracing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>

2023-03-29  tracing/user_events: Add ioctl for disabling addresses  (Beau Belgrave; 1 file; -2/+95)

Enablements are now tracked by the lifetime of the task/mm. User processes need to be able to disable their addresses if tracing is requested to be turned off. Previously, unmapping the page would suffice; however, we now need a stronger contract. Add an ioctl to enable this.

A new flag bit, "freeing", is added to user_event_enabler to ensure that if the event is attempted to be removed while a fault is being handled, the removal is delayed until after the fault is reattempted.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Beau Belgrave <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
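
A hedged userspace sketch of the new unregister call; the struct and ioctl names follow the user_events uAPI added in this series, but treat the details as an approximation:

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/user_events.h>

  /* the word the kernel was previously asked to set/clear */
  static uint32_t enabled;

  static int disable_address_sketch(int data_fd, int bit)
  {
          struct user_unreg unreg = {
                  .size = sizeof(unreg),
                  .disable_bit = bit,
                  .disable_addr = (uint64_t)(uintptr_t)&enabled,
          };

          /* asks the kernel to stop updating this enable address */
          return ioctl(data_fd, DIAG_IOCSUNREG, &unreg);
  }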