path: root/kernel
2021-04-17  sched/debug: Rename the sched_debug parameter to sched_verbose  (Peter Zijlstra, 3 files, -9/+9)

CONFIG_SCHED_DEBUG is the build-time Kconfig knob; the boot param sched_debug and the /debug/sched/debug_enabled knobs control the sched_debug_enabled variable, but what they really do is make SCHED_DEBUG more verbose, so rename the lot.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

2021-04-16  gcov: clang: fix clang-11+ build  (Johannes Berg, 1 file, -1/+1)

With clang-11+, the code is broken due to my kvmalloc() conversion (which predated the clang-11 support code) leaving one vmalloc() in place. Fix that.

Link: https://lkml.kernel.org/r/20210412214210.6e1ecca9cdc5.I24459763acf0591d5e6b31c7e3a59890d802f79c@changeid
Signed-off-by: Johannes Berg <[email protected]>
Reviewed-by: Nick Desaulniers <[email protected]>
Tested-by: Nick Desaulniers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2021-04-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller, 1 file, -74/+156)

Daniel Borkmann says:

====================
pull-request: bpf 2021-04-17

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 9 day(s) which contain a total of 8 files changed, 175 insertions(+), 111 deletions(-).

The main changes are:

1) Fix a potential NULL pointer dereference in libbpf's xsk umem handling, from Ciara Loftus.

2) Mitigate a speculative oob read of up to map value size by tightening the masking window, from Daniel Borkmann.
====================

Signed-off-by: David S. Miller <[email protected]>

2021-04-16  bpf: Tighten speculative pointer arithmetic mask  (Daniel Borkmann, 1 file, -29/+44)

This work tightens the offset mask we use for unprivileged pointer arithmetic in order to mitigate a corner case reported by Piotr and Benedict where, in the speculative domain, it is possible to advance, for example, the map value pointer by up to value_size-1 out-of-bounds in order to leak kernel memory via side-channel to user space.

Before this change, the computed ptr_limit for the retrieve_ptr_limit() helper represents the largest valid distance when moving the pointer to the right or left, which is then fed as aux->alu_limit to generate masking instructions against the offset register. After the change, the derived aux->alu_limit represents the largest potential value of the offset register which we mask against, which is just a narrower subset of the former limit.

For minimal complexity, we call sanitize_ptr_alu() from 2 observation points in adjust_ptr_min_max_vals(), that is, before and after the simulated alu operation. In the first step, we retrieve the alu_state and alu_limit before the operation, and we branch off a verifier path and push it to the verification stack as we did before, which checks the dst_reg under truncation, in other words, when the speculative domain would attempt to move the pointer out-of-bounds. In the second step, we retrieve the new alu_limit and calculate the absolute distance between both. Moreover, we commit the alu_state and final alu_limit via update_alu_sanitation_state() to the env's instruction aux data, and bail out from there if there is a mismatch due to coming from different verification paths with different states.

Reported-by: Piotr Krysiuk <[email protected]>
Reported-by: Benedict Schlueter <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Tested-by: Benedict Schlueter <[email protected]>

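As a rough illustration of what the emitted masking achieves, here is a minimal C sketch; it is not the verifier's actual BPF instruction sequence, and the helper name is invented for exposition:

    #include <stdint.h>

    /*
     * Clamp a speculative offset 'off' so that anything above 'alu_limit'
     * (the largest potential value of the offset register) collapses to 0
     * instead of reaching out-of-bounds memory. Branchless on purpose: a
     * conditional jump would reopen the speculation window. Assumes
     * arithmetic right shift on signed integers.
     */
    static int64_t mask_speculative_offset(int64_t off, int64_t alu_limit)
    {
        int64_t m = alu_limit - off; /* negative iff off > alu_limit */
        m |= off;                    /* also negative iff off itself is negative */
        m = ~(m >> 63);              /* all-ones if off is in range, else 0 */
        return off & m;             /* off if in range, else 0 */
    }
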
2021-04-16  bpf: Move sanitize_val_alu out of op switch  (Daniel Borkmann, 1 file, -6/+11)

Add a small sanitize_needed() helper function and move sanitize_val_alu() out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu() out of its opcode switch as well, so this helps to streamline both.

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Refactor and streamline bounds check into helper  (Daniel Borkmann, 1 file, -16/+33)

Move the bounds check in adjust_ptr_min_max_vals() into a small helper named sanitize_check_bounds() in order to simplify the former a bit.

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Improve verifier error messages for users  (Daniel Borkmann, 1 file, -23/+63)

Consolidate all error handling and provide more user-friendly error messages from sanitize_ptr_alu() and sanitize_val_alu().

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Rework ptr_limit into alu_limit and add common error path  (Daniel Borkmann, 1 file, -8/+13)

Small refactor with no semantic changes in order to consolidate the max ptr_limit boundary check.

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Ensure off_reg has no mixed signed bounds for all types  (Daniel Borkmann, 1 file, -10/+9)

The mixed signed bounds check really belongs into retrieve_ptr_limit() instead of outside of it in adjust_ptr_min_max_vals(). The check is not tied to PTR_TO_MAP_VALUE only, but applies to all pointer types handled in retrieve_ptr_limit(), and since errors from the latter propagate back to adjust_ptr_min_max_vals() and lead to rejection of the program, it is the better place for the check to reside so that nothing slips through for future types.

The reason why we must reject such off_reg is that we otherwise would not be able to derive a mask; see details in 9d7eceede769 ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged").

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Move off_reg into sanitize_ptr_alu  (Daniel Borkmann, 1 file, -4/+5)

Small refactor to drag off_reg into sanitize_ptr_alu(), so we can later use off_reg to generalize some of the checks for all pointer types.

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  bpf: Use correct permission flag for mixed signed bounds arithmetic  (Daniel Borkmann, 1 file, -1/+1)

We forbid adding unknown scalars with mixed signed bounds due to the Spectre v1 masking mitigation. Hence this also needs the bypass_spec_v1 flag instead of allow_ptr_leaks.

Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-04-16  cgroup: use tsk->in_iowait instead of delayacct_is_task_waiting_on_io()  (Chunguang Xu, 1 file, -1/+1)

If delayacct is disabled, then delayacct_is_task_waiting_on_io() always returns false, which causes the statistical value to be wrong. Perhaps tsk->in_iowait is better.

Signed-off-by: Chunguang Xu <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2021-04-16  tick/broadcast: Allow late registered device to enter oneshot mode  (Jindong Yue, 1 file, -2/+14)

The broadcast device is switched to oneshot mode when the system switches to oneshot mode. If a broadcast clock event device is registered after the system switched to oneshot mode, it will stay in periodic mode forever.

Ensure that a late registered device which is selected as broadcast device is initialized in oneshot mode when the system already uses oneshot mode.

[ tglx: Massage changelog ]

Signed-off-by: Jindong Yue <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-04-16  tick: Use tick_check_replacement() instead of open coding it  (Wang Wensheng, 1 file, -6/+1)

The function tick_check_replacement() is the combination of tick_check_percpu() and tick_check_preferred(), but tick_check_new_device() has the same logic open coded. Use the helper to simplify the code.

[ tglx: Massage changelog ]

Signed-off-by: Wang Wensheng <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-04-16  time/timecounter: Mark 1st argument of timecounter_cyc2time() as const  (Marc Kleine-Budde, 1 file, -1/+1)

The timecounter is not modified in this function. Mark it as const.

Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-04-16  sched,fair: Alternative sched_slice()  (Peter Zijlstra, 2 files, -1/+14)

The current sched_slice() seems to have issues; there are two things that could be improved (see the sketch after this entry):

 - the 'nr_running' used for __sched_period() is daft when cgroups are
   considered. Using the RQ wide h_nr_running seems like a much more
   consistent number.

 - (esp) cgroups can slice it real fine, which makes for easy
   over-scheduling; ensure min_gran is what the name says.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

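A hedged sketch of how the two tweaks could look in sched_slice(), gated behind scheduler feature flags; the feature names ALT_PERIOD and BASE_SLICE and the exact placement are illustrative, and the surrounding CFS definitions are assumed:

    /* kernel/sched/fair.c (sketch, not the literal patch) */
    static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
        unsigned int nr_running = cfs_rq->nr_running;
        u64 slice;

        /* RQ-wide h_nr_running is consistent across cgroup levels */
        if (sched_feat(ALT_PERIOD))
            nr_running = rq_of(cfs_rq)->cfs.h_nr_running;

        slice = __sched_period(nr_running + !se->on_rq);

        /* ... existing weighting of the slice by se's load ... */

        /* make min_gran an actual lower bound on the slice */
        if (sched_feat(BASE_SLICE))
            slice = max(slice, (u64)sysctl_sched_min_granularity);

        return slice;
    }
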
2021-04-16  sched: Move /proc/sched_debug to debugfs  (Peter Zijlstra, 1 file, -9/+16)

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  sched,debug: Convert sysctl sched_domains to debugfs  (Peter Zijlstra, 3 files, -211/+59)

Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/YHgB/[email protected]

2021-04-16  sched,preempt: Move preempt_dynamic to debug.c  (Peter Zijlstra, 3 files, -77/+78)

Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to collect all the debugfs bits.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  sched: Move SCHED_DEBUG sysctl to debugfs  (Peter Zijlstra, 5 files, -108/+77)

Stop polluting sysctl with undocumented knobs that really are debug only; move them all to /debug/sched/, alongside the /debug/sched_* files that already exist.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG  (Peter Zijlstra, 1 file, -11/+11)

CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, so it is inconsistent to have the sysctl depend on it.

Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG  (Mel Gorman, 1 file, -1/+3)

The ability to enable/disable NUMA balancing is not a debugging feature and should not depend on CONFIG_SCHED_DEBUG. For example, machines within an HPC cluster may disable NUMA balancing temporarily for some jobs and re-enable it for other jobs without needing to reboot.

This patch removes the dependency on CONFIG_SCHED_DEBUG for the kernel.numa_balancing sysctl. The other NUMA balancing related sysctls are left as-is because if they need to be tuned then it is more likely that NUMA balancing needs to be fixed instead.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback  (Peter Zijlstra, 2 files, -12/+15)

Use the new cpu_dying() state to simplify and fix the balance_push() vs CPU hotplug rollback state.

Specifically, we currently rely on the sched_cpu_dying() / sched_cpu_activate() notifiers to terminate balance_push; however, if cpu_down() fails when we're past sched_cpu_deactivate(), it should terminate balance_push at that point and not wait until we hit sched_cpu_activate(). Similarly, when cpu_up() fails and we're going back down, balance_push should be active, where it currently is not.

So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE (when !cpu_active()), and gate its utility with cpu_dying().

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  cpumask: Introduce DYING mask  (Peter Zijlstra, 1 file, -0/+6)

Introduce a cpumask that indicates (for each CPU) what direction the CPU hotplug is currently going. Notably, it tracks rollbacks. E.g. when an up fails and we roll back down, it will accurately reflect the direction.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

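The accessor this enables is tiny; a hedged sketch in the style of the existing cpu_active()/cpu_online() helpers (treat the exact names and bodies as illustrative of the description above):

    /* cpumask tracking the current hotplug direction of each CPU */
    extern struct cpumask __cpu_dying_mask;
    #define cpu_dying_mask ((const struct cpumask *)&__cpu_dying_mask)

    static inline bool cpu_dying(unsigned int cpu)
    {
        return cpumask_test_cpu(cpu, cpu_dying_mask);
    }

    /* flipped by the hotplug core on the way down, and on rollback */
    static inline void set_cpu_dying(unsigned int cpu, bool dying)
    {
        if (dying)
            cpumask_set_cpu(cpu, &__cpu_dying_mask);
        else
            cpumask_clear_cpu(cpu, &__cpu_dying_mask);
    }
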
2021-04-16  perf: Add support for SIGTRAP on perf events  (Marco Elver, 1 file, -1/+48)

Adds bit perf_event_attr::sigtrap, which can be set to cause events to send SIGTRAP (with si_code TRAP_PERF) to the task where the event occurred. The primary motivation is to support synchronous signals on perf events in the task where an event (such as a breakpoint) triggered.

To distinguish perf events based on the event type, the type is set in si_errno. For events that are associated with an address, si_addr is copied from perf_sample_data.

The new field perf_event_attr::sig_data is copied to si_perf, which allows user space to disambiguate which event (of the same type) triggered the signal. For example, user space could encode the relevant information it cares about in sig_data.

We note that the choice of an opaque u64 provides the simplest and most flexible option. Alternatives where a reference to some user space data is passed back suffer from the problem that modification of the referenced data (be it the event fd or the perf_event_attr) can race with the signal being delivered (of course, the same caveat applies if user space decides to store a pointer in sig_data, but the ABI explicitly avoids prescribing such a design).

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Dmitry Vyukov <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/

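A hedged userspace sketch of the intended usage, relying on the TRAP_PERF/si_perf bits from the next entry below; it assumes uapi headers that carry this ABI, and the event configuration is arbitrary:

    #include <linux/perf_event.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void handler(int sig, siginfo_t *info, void *ucontext)
    {
        if (info->si_code == TRAP_PERF) {
            /* info->si_errno carries the event type,
             * info->si_addr the associated address (if any),
             * info->si_perf the attr.sig_data cookie. */
        }
    }

    int main(void)
    {
        struct sigaction sa = { .sa_sigaction = handler, .sa_flags = SA_SIGINFO };
        struct perf_event_attr attr;

        sigaction(SIGTRAP, &sa, NULL);

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_BREAKPOINT;
        /* ... breakpoint address/length config elided ... */
        attr.sigtrap = 1;            /* deliver SIGTRAP on the event */
        attr.sig_data = 0xdeadbeef;  /* opaque cookie, surfaced as si_perf */

        syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        /* ... run the monitored work ... */
        return 0;
    }
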
2021-04-16  signal: Introduce TRAP_PERF si_code and si_perf to siginfo  (Marco Elver, 1 file, -0/+11)

Introduces the TRAP_PERF si_code, and associated siginfo_t field si_perf. These will be used by the perf event subsystem to send signals (if requested) to the task where an event occurred.

Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Geert Uytterhoeven <[email protected]> # m68k
Acked-by: Arnd Bergmann <[email protected]> # asm-generic
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  perf: Add support for event removal on exec  (Marco Elver, 1 file, -8/+62)

Adds bit perf_event_attr::remove_on_exec, to support removing an event from a task on exec. This option supports the case where an event is supposed to be process-wide only, and should not propagate beyond exec, to limit monitoring to the original process image only.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  perf: Support only inheriting events if cloned with CLONE_THREAD  (Marco Elver, 2 files, -8/+15)

Adds bit perf_event_attr::inherit_thread, to restrict inheriting events to the case where the child was cloned with CLONE_THREAD. This option supports the case where an event is supposed to be process-wide only (including subthreads), but should not propagate beyond the current process's shared environment.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/lkml/YBvj6eJR%[email protected]/

2021-04-16  perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children  (Marco Elver, 1 file, -1/+21)

As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children.

Suggested-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dmitry Vyukov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  perf: Rework perf_event_exit_event()  (Peter Zijlstra, 1 file, -63/+79)

Make perf_event_exit_event() more robust, such that we can use it from other contexts, specifically the upcoming remove_on_exec.

For this to work we need to address a few issues. remove_on_exec will not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to disable event_function_call() and we thus have to use perf_remove_from_context().

When using perf_remove_from_context(), there are two races to consider. The first is against close(), where we can have concurrent tear-down of the event. The second is against child_list iteration, which should not find a half-baked event.

To address this, teach perf_remove_from_context() to special-case !ctx->is_active and about DETACH_CHILD.

[ [email protected]: fix racing parent/child exit in sync_child_event(). ]

Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  perf: Cap allocation order at aux_watermark  (Alexander Shishkin, 1 file, -16/+18)

Currently, we start allocating AUX pages at half the size of the total requested AUX buffer size, ignoring the attr.aux_watermark setting. This, in turn, also makes the intel_pt driver disregard the watermark, as it uses page order for its SG (ToPA) configuration.

Now, this could be fixed in the intel_pt PMU driver, but seeing as it's the only one currently making use of high order allocations, there is no reason not to fix the allocator instead. This way, any other driver wishing to add this support would not have to worry about it.

Signed-off-by: Alexander Shishkin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-16  fs: split receive_fd_replace from __receive_fd  (Christoph Hellwig, 1 file, -5/+12)

receive_fd_replace shares almost no code with the general case, so split it out. Also remove the "Bump the sock usage counts" comment from both copies, as that is now what __receive_sock actually does.

[AV: ... and make the only user of receive_fd_replace() choose between it and receive_fd() according to what userland had passed to it in flags]

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Al Viro <[email protected]>

2021-04-15  ftrace: Reuse the output of the function tracer for func_repeats  (Steven Rostedt (VMware), 1 file, -10/+13)

The func_repeats event shows the output of the function tracer, followed by a count of the number of repeats the previous function had made, as well as the timestamp of the last function that was repeated. The function should be printed the same way the function tracer prints it. Reuse the code in trace_fn_trace() by making a helper function, print_fn_trace(), and using it for trace_func_repeats_print().

Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-04-15tracing: Add "func_no_repeats" option for function tracingYordan Karadzhov (VMware)1-3/+159
If the option is activated the function tracing record gets consolidated in the cases when a single function is called number of times consecutively. Instead of having an identical record for each call of the function we will record only the first call following by event showing the number of repeats. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yordan Karadzhov (VMware) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-04-15  tracing: Unify the logic for function tracing options  (Yordan Karadzhov (VMware), 1 file, -27/+38)

Currently the logic for dealing with the options for function tracing has two different implementations. One is used when we set the flags (in "static int func_set_flag()") and another is used when we initialize the tracer (in "static int function_trace_init()"). Those two implementations are meant to do essentially the same thing, and neither is very convenient for adding new options.

In this patch we add a helper function that provides a single implementation of the logic for dealing with the options, and we make it such that new options can be easily added.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yordan Karadzhov (VMware) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-04-15  tracing: Add method for recording "func_repeats" events  (Yordan Karadzhov (VMware), 2 files, -0/+38)

This patch only provides the implementation of the method. Later we will use it in combination with a new option for function tracing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yordan Karadzhov (VMware) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-04-15tracing: Add "last_func_repeats" to struct trace_arrayYordan Karadzhov (VMware)2-0/+13
The field is used to keep track of the consecutive (on the same CPU) calls of a single function. This information is needed in order to consolidate the function tracing record in the cases when a single function is called number of times. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Yordan Karadzhov (VMware) <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-04-15  tracing: Define new ftrace event "func_repeats"  (Yordan Karadzhov (VMware), 3 files, -0/+73)

The event aims to consolidate the function tracing record in the cases when a single function is called a number of times consecutively:

    while (cond)
        do_func();

This may happen in various scenarios (busy waiting, for example). The new ftrace event can be used to show repeated function events as a single event and save space on the ring buffer.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yordan Karadzhov (VMware) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-04-15  tracing: Define static void trace_print_time()  (Yordan Karadzhov (VMware), 1 file, -9/+17)

The part of the code that prints the time of the trace record in "int trace_print_context()" is extracted into a static function. This is done as a preparation for a following patch, in which we will define a new ftrace event called "func_repeats". The new static method, defined here, will be used by this new event to print the time of the last repeat of a function that is consecutively called a number of times.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yordan Karadzhov (VMware) <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

2021-04-14  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()  (Eric Dumazet, 1 file, -0/+9)

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes") added regressions for our servers. Using copy_from_user() and clear_user() for 64bit values is suboptimal. We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

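A hedged sketch of the 64-bit fast path implied by the description; the helper name is illustrative, and the ptr64 union member follows the referenced uapi change:

    #ifdef CONFIG_64BIT
    /* One get_user() instead of a copy_from_user() call for 8 bytes. */
    static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
                               struct rseq __user *rseq)
    {
        u64 ptr;

        if (get_user(ptr, &rseq->rseq_cs.ptr64))
            return -EFAULT;
        *uptrp = (struct rseq_cs __user *)ptr;
        return 0;
    }
    #endif /* 32-bit keeps the copy_from_user() path on the 64-bit field */
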
2021-04-14  rseq: Remove redundant access_ok()  (Eric Dumazet, 1 file, -4/+1)

After commit 8f2817701492 ("rseq: Use get_user/put_user rather than __get_user/__put_user") we no longer need an access_ok() call from __rseq_handle_notify_resume(). Mathieu pointed out that the same cleanup can be done in rseq_syscall().

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-14  rseq: Optimize rseq_update_cpu_id()  (Eric Dumazet, 1 file, -4/+11)

Two put_user() calls in rseq_update_cpu_id() are replaced by a pair of unsafe_put_user() with appropriate surroundings. This removes one stac/clac pair on x86 in the fast path.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

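A hedged sketch of the pattern: a single user_write_access_begin() section covering both stores, so x86 pays for one stac/clac pair instead of two. Treat it as illustrative of the described change, not the literal patch:

    static int rseq_update_cpu_id(struct task_struct *t)
    {
        u32 cpu_id = raw_smp_processor_id();
        struct rseq __user *rseq = t->rseq;

        if (!user_write_access_begin(rseq, sizeof(*rseq)))
            goto efault;
        unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
        unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
        user_write_access_end();
        return 0;

    efault_end:
        user_write_access_end();
    efault:
        return -EFAULT;
    }
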
2021-04-14  signal: Allow tasks to cache one sigqueue struct  (Thomas Gleixner, 3 files, -2/+44)

The idea for this originates from the real time tree, to make signal delivery for realtime applications more efficient. In quite some of these application scenarios a control task signals workers to start their computations. There is usually only one signal per worker in flight. This works nicely as long as the kmem cache allocations do not hit the slow path and cause latencies.

To cure this, an optimistic caching was introduced (limited to RT tasks) which allows a task to cache a single sigqueue in a pointer in task_struct instead of handing it back to the kmem cache after consuming a signal. When the next signal is sent to the task then the cached sigqueue is used instead of allocating a new one. This solved the problem for this set of application scenarios nicely.

The task cache is not preallocated, so the first signal sent to a task always goes to the cache allocator. The cached sigqueue stays around until the task exits and is freed when task::sighand is dropped.

After posting this solution for mainline the discussion came up whether this would be useful in general and should not be limited to realtime tasks:

  https://lore.kernel.org/r/[email protected]

One concern leading to the original limitation was to avoid a large amount of pointlessly cached sigqueues in alive tasks. The other concern was vs. RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for.

The accounting problem is real, but on the other hand slightly academic. After gathering some statistics it turned out that after boot of a regular distro install there are less than 10 sigqueues cached in ~1500 tasks. In case of a 'mass fork and fire signal to child' scenario the extra 80 bytes of memory per task are well in the noise of the overall memory consumption of the fork bomb.

If this should be limited then this would need an extra counter in struct user, more atomic instructions and a separate rlimit. Yet another tunable which is mostly unused.

The caching is actually used. After boot and a full kernel compile on a 64-CPU machine with make -j128, the number of 'allocations' looks like this:

    From slab:       23996
    From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%. A typical pattern there is:

    <...>-58490 __sigqueue_alloc: for 58488 from slab ffff8881132df460
    <...>-58488 __sigqueue_free: cache ffff8881132df460
    <...>-58488 __sigqueue_alloc: for 1149 from cache ffff8881103dc550
    bash-1149 exit_task_sighand: free ffff8881132df460
    bash-1149 __sigqueue_free: cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue from bash's task cache to signal exit and bash sticks it back into its own cache. Lather, rinse and repeat.

The caching is probably not noticeable for the general use case, but the benefit for latency sensitive applications is clear. While kmem caches are usually just serving from the fast path, the slab merging (default) can, depending on the usage pattern of the merged slabs, cause occasional slow path allocations. The time spared per cached entry is a few microseconds per signal, which is not relevant for e.g. a kernel build, but for signal heavy workloads it's measurable.

As there is no real downside to this caching mechanism, making it unconditionally available is preferred over more conditional code or new magic tunables.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

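A hedged sketch of the single-slot cache operations implied by the description; cmpxchg keeps the slot consistent against concurrent senders, and the names are illustrative:

    /* Grab the cached sigqueue, if any; NULL on miss or lost race. */
    static struct sigqueue *sigqueue_cache_get(struct task_struct *t)
    {
        struct sigqueue *q = READ_ONCE(t->sigqueue_cache);

        if (!q || cmpxchg(&t->sigqueue_cache, q, NULL) != q)
            return NULL;
        return q;
    }

    /* Stash a consumed sigqueue; false if the slot is already taken,
     * in which case the caller frees it back to the kmem cache. */
    static bool sigqueue_cache_put(struct task_struct *t, struct sigqueue *q)
    {
        return cmpxchg(&t->sigqueue_cache, NULL, q) == NULL;
    }
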
2021-04-14  signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()  (Thomas Gleixner, 1 file, -10/+7)

There is no point in having the conditional at the callsite. Just hand in the allocation mode flag to __sigqueue_alloc() and use it to initialize sigqueue::flags.

No functional change.

Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

2021-04-14  kdb: Refactor env variables get/set code  (Sumit Garg, 1 file, -74/+68)

Add two new kdb environment access methods, kdb_setenv() and kdb_printenv(), in order to abstract out environment access code from kdb command functions.

Also, replace (char *)0 with NULL as an initializer for the environment variables array.

Signed-off-by: Sumit Garg <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: Replaced (char *)0/NULL initializers with an array size]
Signed-off-by: Daniel Thompson <[email protected]>

2021-04-14  kconfig: do not use allnoconfig_y option  (Masahiro Yamada, 1 file, -0/+1)

allnoconfig_y is an ugly hack that sets a symbol to 'y' by allnoconfig. allnoconfig does not mean a minimal set of CONFIG options, because a bunch of prompts are hidden by 'if EMBEDDED' or 'if EXPERT', but I do not like to hack Kconfig this way.

Use the pre-existing feature, KCONFIG_ALLCONFIG, to provide a one-liner config fragment. CONFIG_EMBEDDED=y is still forced when allnoconfig is invoked as a part of tinyconfig.

No change in the .config file produced by 'make tinyconfig'.

The output of 'make allnoconfig' will be changed; we will get CONFIG_EMBEDDED=n because allnoconfig literally sets all symbols to n.

Signed-off-by: Masahiro Yamada <[email protected]>

2021-04-13  Merge tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds, 1 file, -2/+4)

Pull tracing fix from Steven Rostedt:
 "Fix a memory leak in dyn_event_release().

  An error path exited the function before freeing the allocated
  'argv' variable"

* tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/dynevent: Fix a memory leak in an error handling path

2021-04-13  bpf: Return target info when a tracing bpf_link is queried  (Toke Høiland-Jørgensen, 1 file, -0/+3)

There is currently no way to discover the target of a tracing program attachment after the fact. Add this information to bpf_link_info and return it when querying the bpf_link fd.

Signed-off-by: Toke Høiland-Jørgensen <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

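A hedged userspace sketch of querying the new fields via libbpf's bpf_obj_get_info_by_fd(); it assumes uapi headers with the extended struct bpf_link_info, and the tracing field names (target_obj_id, target_btf_id) follow this patch's description:

    #include <bpf/bpf.h>
    #include <linux/bpf.h>
    #include <stdio.h>
    #include <string.h>

    static int print_tracing_target(int link_fd)
    {
        struct bpf_link_info info;
        __u32 len = sizeof(info);

        memset(&info, 0, sizeof(info));
        if (bpf_obj_get_info_by_fd(link_fd, &info, &len))
            return -1;

        if (info.type == BPF_LINK_TYPE_TRACING)
            printf("attached to obj %u, btf id %u\n",
                   info.tracing.target_obj_id,
                   info.tracing.target_btf_id);
        return 0;
    }
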
2021-04-13  arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)  (Peter Collingbourne, 1 file, -0/+16)

This change introduces a prctl that allows the user program to control which PAC keys are enabled in a particular task. The main reason why this is useful is to enable a userspace ABI that uses PAC to sign and authenticate function pointers and other pointers exposed outside of the function, while still allowing binaries conforming to the ABI to interoperate with legacy binaries that do not sign or authenticate pointers.

The idea is that a dynamic loader or early startup code would issue this prctl very early after establishing that a process may load legacy binaries, but before executing any PAC instructions.

This change adds a small amount of overhead to kernel entry and exit due to additional required instruction sequences.

On a DragonBoard 845c (Cortex-A75) with the powersave governor, the overhead of similar instruction sequences was measured as 4.9ns when simulating the common case where IA is left enabled, or 43.7ns when simulating the uncommon case where IA is disabled. These numbers can be seen as the worst case scenario, since in more realistic scenarios a better performing governor would be used, and a newer chip would be used that supports PAC (unlike Cortex-A75) and would be expected to be faster than Cortex-A75.

On an Apple M1 under a hypervisor, the overhead of the entry/exit instruction sequences introduced by this patch was measured as 0.3ns in the case where IA is left enabled, and 33.0ns in the case where IA is disabled.

Signed-off-by: Peter Collingbourne <[email protected]>
Reviewed-by: Dave Martin <[email protected]>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <[email protected]>

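A hedged sketch of how a dynamic loader might use the new prctl; the constants follow the patch's naming and updated uapi headers are assumed:

    #include <sys/prctl.h>

    /*
     * Disable the IA key for this task before any PAC instructions run,
     * so the process can interoperate with legacy unsigned binaries.
     * Arg2 is the mask of keys affected; arg3 is the subset of those
     * keys to leave enabled (here: none).
     */
    static int disable_ia_key(void)
    {
        return prctl(PR_PAC_SET_ENABLED_KEYS, PR_PAC_APIAKEY, 0, 0, 0);
    }
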
2021-04-13  tracing/dynevent: Fix a memory leak in an error handling path  (Christophe JAILLET, 1 file, -2/+4)

We must free 'argv' before returning, as already done in all the other paths of this function.

Link: https://lkml.kernel.org/r/21e3594ccd7fc88c5c162c98450409190f304327.1618136448.git.christophe.jaillet@wanadoo.fr
Fixes: d262271d0483 ("tracing/dynevent: Delegate parsing to create function")
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Christophe JAILLET <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>