path: root/kernel
Age  Commit message  (Author)  [Files changed, Lines removed/added]
2021-08-17  locking/rtmutex: Use rt_mutex_wake_q_head  (Thomas Gleixner)  [4 files, -23/+20]
Prepare for the required state aware handling of waiter wakeups via wake_q and switch the rtmutex code over to the rtmutex specific wrapper. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Provide rt_wake_q_head and helpers  (Thomas Gleixner)  [2 files, -0/+29]
To handle the difference between wakeups for regular sleeping locks (mutex, rtmutex, rw_semaphore) and the wakeups for 'sleeping' spin/rwlocks on PREEMPT_RT enabled kernels correctly, it is required to provide a wake_q_head construct which allows to keep them separate. Provide a wrapper around wake_q_head and the required helpers, which will be extended with the state handling later. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
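A rough sketch of what such a wrapper looks like (the field and helper names here follow the description above but are illustrative, not the exact kernel definitions):

    struct rt_wake_q_head {
            struct wake_q_head head;
            /* later extended with the state handling for RT 'sleeping' locks */
    };

    #define DEFINE_RT_WAKE_Q(name)                                      \
            struct rt_wake_q_head name = {                              \
                    .head = WAKE_Q_HEAD_INITIALIZER(name.head),         \
            }

    static inline void rt_mutex_wake_q_add(struct rt_wake_q_head *wqh,
                                           struct rt_mutex_waiter *w)
    {
            wake_q_add(&wqh->head, w->task);
    }

    static inline void rt_mutex_wake_up_q(struct rt_wake_q_head *wqh)
    {
            wake_up_q(&wqh->head);
    }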
2021-08-17  locking/rtmutex: Add wake_state to rt_mutex_waiter  (Thomas Gleixner)  [2 files, -1/+10]
Regular sleeping locks like mutexes, rtmutexes and rw_semaphores are always entering and leaving a blocking section with task state == TASK_RUNNING. On a non-RT kernel spinlocks and rwlocks never affect the task state, but on RT kernels these locks are converted to rtmutex based 'sleeping' locks. So in case of contention the task goes to block, which requires to carefully preserve the task state, and restore it after acquiring the lock taking regular wakeups for the task into account, which happened while the task was blocked. This state preserving is achieved by having a separate task state for blocking on a RT spin/rwlock and a saved_state field in task_struct along with careful handling of these wakeup scenarios in try_to_wake_up(). To avoid conditionals in the rtmutex code, store the wake state which has to be used for waking a lock waiter in rt_mutex_waiter which allows to handle the regular and RT spin/rwlocks by handing it to wake_up_state(). Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
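In code terms the idea is roughly the following (a simplified sketch; the real structure and helper differ in detail):

    struct rt_mutex_waiter {
            struct task_struct      *task;
            /* ... */
            unsigned int            wake_state;     /* TASK_NORMAL for sleeping locks,
                                                       TASK_RTLOCK_WAIT for RT spin/rwlocks */
    };

    static inline void rt_mutex_wake_up_waiter(struct rt_mutex_waiter *waiter)
    {
            /* One call covers both cases; no conditionals needed in the rtmutex code. */
            wake_up_state(waiter->task, waiter->wake_state);
    }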
2021-08-17  locking/rwsem: Add rtmutex based R/W semaphore implementation  (Thomas Gleixner)  [1 file, -0/+108]
The RT specific R/W semaphore implementation used to restrict the number of readers to one, because a writer cannot block on multiple readers and inherit its priority or budget. The single-reader restriction was painful in various ways:
 - Performance bottleneck for multi-threaded applications in the page fault path (mmap sem)
 - Progress blocker for drivers which are carefully crafted to avoid the potential reader/writer deadlock in mainline.
The analysis of the writer code paths shows that properly written RT tasks should not take them. Syscalls like mmap() and file accesses which take mmap sem write-locked have unbound latencies, which are completely unrelated to mmap sem. Other R/W sem users like graphics drivers are not suitable for RT tasks either. So there is little risk of hurting RT tasks when the RT rwsem implementation is done in the following way:
 - Allow concurrent readers
 - Make writers block until the last reader has left the critical section. This blocking is not subject to priority/budget inheritance.
 - Readers blocked on a writer inherit their priority/budget in the normal way.
There is a drawback with this scheme: R/W semaphores become writer unfair, though the applications which have triggered writer starvation (mostly on mmap_sem) in the past are not really the typical workloads running on a RT system. So while it's unlikely to hit writer starvation, it's possible. If there are unexpected workloads on RT systems triggering it, the problem has to be revisited. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rt: Add base code for RT rw_semaphore and rwlock  (Thomas Gleixner)  [1 file, -0/+263]
On PREEMPT_RT, rw_semaphores and rwlocks are substituted with an rtmutex and a reader count. The implementation is writer unfair, as it is not feasible to do priority inheritance on multiple readers, but experience has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The inner workings of rw_semaphores and rwlocks on RT are almost identical except for the task state and signal handling. rw_semaphores are not state preserving over a contention, they are expected to enter and leave with state == TASK_RUNNING. rwlocks have a mechanism to preserve the state of the task at entry and restore it after unblocking taking potential non-lock related wakeups into account. rw_semaphores can also be subject to signal handling interrupting a blocked state, while rwlocks ignore signals. To avoid code duplication, provide a shared implementation which takes the small difference vs. state and signals into account. The code is included into the relevant rw_semaphore/rwlock base code and compiled for each use case separately. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Provide rt_mutex_slowlock_locked()  (Thomas Gleixner)  [2 files, -43/+59]
Split the inner workings of rt_mutex_slowlock() out into a separate function, which can be reused by the upcoming RT lock substitutions, e.g. for rw_semaphores. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Split out the inner parts of 'struct rtmutex'  (Peter Zijlstra)  [5 files, -73/+80]
RT builds substitutions for rwsem, mutex, spinlock and rwlock around rtmutexes. Split the inner working out so each lock substitution can use them with the appropriate lockdep annotations. This avoids having an extra unused lockdep map in the wrapped rtmutex. No functional change. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Split API from implementation  (Thomas Gleixner)  [4 files, -498/+514]
Prepare for reusing the inner functions of rtmutex for RT lock substitutions: introduce kernel/locking/rtmutex_api.c and move them there. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Switch from cmpxchg_*() to try_cmpxchg_*()  (Thomas Gleixner)  [1 file, -2/+2]
Allows the compiler to generate better code depending on the architecture. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
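The point is easiest to see with the user-space C11 analogue of the try_cmpxchg_*() semantics: the primitive returns a boolean and refreshes the expected value on failure, so the caller does not have to re-read and re-compare, and the compiler can reuse the comparison result directly. A self-contained sketch (not the kernel code itself):

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
            atomic_long v = 5;
            long old, new;

            /* try_cmpxchg-style loop: on failure, 'old' is refreshed with the
             * current value by the compare-exchange itself. */
            old = atomic_load(&v);
            do {
                    new = old + 1;
            } while (!atomic_compare_exchange_weak(&v, &old, new));

            printf("%ld\n", (long)atomic_load(&v));
            return 0;
    }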
2021-08-17  locking/rtmutex: Convert macros to inlines  (Sebastian Andrzej Siewior)  [1 file, -4/+27]
Inlines are type-safe... Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
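The type-safety argument in miniature (a generic, self-contained example, not the rtmutex code):

    #include <stdio.h>

    #define ADD_M(a, b)  ((a) + (b))          /* macro: no argument type checking */

    static inline long add_l(long a, long b)  /* inline: arguments are type-checked */
    {
            return a + b;
    }

    int main(void)
    {
            printf("%d\n",  ADD_M(2, 3));
            printf("%ld\n", add_l(2, 3));
            /* add_l("2", 3) would be rejected by the compiler outright;
             * ADD_M("2", 3) silently compiles to bogus pointer arithmetic. */
            return 0;
    }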
2021-08-17  sched/core: Provide a scheduling point for RT locks  (Thomas Gleixner)  [1 file, -1/+19]
RT enabled kernels substitute spin/rwlocks with 'sleeping' variants based on rtmutexes. Blocking on such a lock is similar to preemption versus: - I/O scheduling and worker handling, because these functions might block on another substituted lock, or come from a lock contention within these functions. - RCU considers this like a preemption, because the task might be in a read side critical section. Add a separate scheduling point for this, and hand a new scheduling mode argument to __schedule() which allows, along with separate mode masks, to handle this gracefully from within the scheduler, without proliferating that to other subsystems like RCU. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  sched/core: Rework the __schedule() preempt argument  (Thomas Gleixner)  [1 file, -11/+23]
PREEMPT_RT needs to hand a special state into __schedule() when a task blocks on a 'sleeping' spin/rwlock. This is required to handle rcu_note_context_switch() correctly without having special casing in the RCU code. From an RCU point of view the blocking on the sleeping spinlock is equivalent to preemption, because the task might be in a read side critical section. schedule_debug() also has a check which would trigger with the !preempt case, but that could be handled differently. To avoid adding another argument and extra checks which cannot be optimized out by the compiler, the following solution has been chosen: - Replace the boolean 'preempt' argument with an unsigned integer 'sched_mode' argument and define constants to hand in: (0 == no preemption, 1 = preemption). - Add two masks to apply on that mode: one for the debug/rcu invocations, and one for the actual scheduling decision. For a non RT kernel these masks are UINT_MAX, i.e. all bits are set, which allows the compiler to optimize the AND operation out, because it is not masking out anything. IOW, it's not different from the boolean. RT enabled kernels will define these masks separately. No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
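A toy, self-contained illustration of the masking trick (the constant names and values here are invented for the example and are not the kernel's actual definitions):

    #include <limits.h>
    #include <stdio.h>

    #define SM_NONE          0x0u
    #define SM_PREEMPT       0x1u
    /* Non-RT build: all bits set, so the AND below is a no-op the compiler drops. */
    #define SM_MASK_PREEMPT  UINT_MAX

    static void schedule_core(unsigned int sched_mode)
    {
            if (sched_mode & SM_MASK_PREEMPT)
                    printf("handled as preemption\n");
            else
                    printf("handled as a voluntary schedule\n");
    }

    int main(void)
    {
            schedule_core(SM_NONE);     /* voluntary */
            schedule_core(SM_PREEMPT);  /* preemption */
            return 0;
    }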
2021-08-17  sched/wakeup: Prepare for RT sleeping spin/rwlocks  (Thomas Gleixner)  [1 file, -0/+33]
Waiting for spinlocks and rwlocks on non RT enabled kernels is task::state preserving. Any wakeup which matches the state is valid. RT enabled kernels substitutes them with 'sleeping' spinlocks. This creates an issue vs. task::__state. In order to block on the lock, the task has to overwrite task::__state and a consecutive wakeup issued by the unlocker sets the state back to TASK_RUNNING. As a consequence the task loses the state which was set before the lock acquire and also any regular wakeup targeted at the task while it is blocked on the lock. To handle this gracefully, add a 'saved_state' member to task_struct which is used in the following way: 1) When a task blocks on a 'sleeping' spinlock, the current state is saved in task::saved_state before it is set to TASK_RTLOCK_WAIT. 2) When the task unblocks and after acquiring the lock, it restores the saved state. 3) When a regular wakeup happens for a task while it is blocked then the state change of that wakeup is redirected to operate on task::saved_state. This is also required when the task state is running because the task might have been woken up from the lock wait and has not yet restored the saved state. To make it complete, provide the necessary helpers to save and restore the saved state along with the necessary documentation how the RT lock blocking is supposed to work. For non-RT kernels there is no functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
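A tiny user-space model of the three rules above (all names and types are invented for the illustration; this is not the kernel implementation):

    #include <assert.h>

    enum { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_RTLOCK_WAIT };

    struct task { int state; int saved_state; };

    static void rtlock_block(struct task *t)            /* rule 1 */
    {
            t->saved_state = t->state;
            t->state = TASK_RTLOCK_WAIT;
    }

    static void rtlock_acquired(struct task *t)         /* rule 2 */
    {
            t->state = t->saved_state;
            t->saved_state = TASK_RUNNING;
    }

    static void regular_wakeup(struct task *t, int wake_state)   /* rule 3 */
    {
            if (t->state == TASK_RTLOCK_WAIT) {
                    if (t->saved_state == wake_state)
                            t->saved_state = TASK_RUNNING;      /* redirected */
            } else if (t->state == wake_state) {
                    t->state = TASK_RUNNING;
            }
    }

    int main(void)
    {
            struct task t = { .state = TASK_INTERRUPTIBLE };

            rtlock_block(&t);                       /* blocks on a 'sleeping' spinlock */
            regular_wakeup(&t, TASK_INTERRUPTIBLE); /* lands in saved_state */
            rtlock_acquired(&t);                    /* lock acquired: restore */
            assert(t.state == TASK_RUNNING);        /* the wakeup was not lost */
            return 0;
    }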
2021-08-17  sched/wakeup: Split out the wakeup ->__state check  (Thomas Gleixner)  [1 file, -6/+18]
RT kernels have a slightly more complicated handling of wakeups due to 'sleeping' spin/rwlocks. If a task is blocked on such a lock then the original state of the task is preserved over the blocking period, and any regular (non lock related) wakeup has to be targeted at the saved state to ensure that these wakeups are not lost. Once the task acquires the lock it restores the task state from the saved state. To avoid cluttering try_to_wake_up() with that logic, split the wakeup state check out into an inline helper and use it at both places where task::__state is checked against the state argument of try_to_wake_up(). No functional change. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  locking/rtmutex: Set proper wait context for lockdep  (Thomas Gleixner)  [1 file, -1/+1]
RT mutexes belong to the LD_WAIT_SLEEP class. Make them so. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-08-17  Merge tag 'v5.14-rc6' into locking/core, to pick up fixes  (Ingo Molnar)  [175 files, -3077/+10370]
Signed-off-by: Ingo Molnar <[email protected]>
2021-08-17  bpf: Add bpf_get_attach_cookie() BPF helper to access bpf_cookie value  (Andrii Nakryiko)  [1 file, -1/+34]
Add a new BPF helper, bpf_get_attach_cookie(), which can be used by BPF programs to get access to a user-provided bpf_cookie value, specified during BPF program attachment (BPF link creation) time. Naming is hard, though. With the concept being named "BPF cookie", I've considered calling the helper:
 - bpf_get_cookie() -- seems too unspecific and easily mistaken with socket cookie;
 - bpf_get_bpf_cookie() -- too much tautology;
 - bpf_get_link_cookie() -- would be ok, but while we create a BPF link to attach BPF program to BPF hook, it's still an "attachment" and the bpf_cookie is associated with BPF program attachment to a hook, not a BPF link itself. Technically, we could support bpf_cookie with old-style cgroup programs. So I ultimately rejected it in favor of bpf_get_attach_cookie().
Currently all perf_event-backed BPF program types support the bpf_get_attach_cookie() helper. Follow-up patches will add support for fentry/fexit programs as well. While at it, mark bpf_tracing_func_proto() as static to make it obvious that it's only used from within kernel/trace/bpf_trace.c. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
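A minimal sketch of how a BPF program might use the new helper, assuming the usual libbpf skeleton conventions (the attach point and output are made up for the example):

    #include <linux/bpf.h>
    #include <linux/ptrace.h>
    #include <bpf/bpf_helpers.h>

    SEC("kprobe/do_sys_openat2")
    int handle_open(struct pt_regs *ctx)
    {
            /* Returns the u64 bpf_cookie the loader supplied at attach time. */
            __u64 cookie = bpf_get_attach_cookie(ctx);

            bpf_printk("attached with cookie %llu", cookie);
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";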
2021-08-17  bpf: Allow to specify user-provided bpf_cookie for BPF perf links  (Andrii Nakryiko)  [4 files, -22/+38]
Add ability for users to specify custom u64 value (bpf_cookie) when creating BPF link for perf_event-backed BPF programs (kprobe/uprobe, perf_event, tracepoints). This is useful for cases when the same BPF program is used for attaching and processing invocation of different tracepoints/kprobes/uprobes in a generic fashion, but such that each invocation is distinguished from each other (e.g., BPF program can look up additional information associated with a specific kernel function without having to rely on function IP lookups). This enables new use cases to be implemented simply and efficiently that previously were possible only through code generation (and thus multiple instances of almost identical BPF program) or compilation at runtime (BCC-style) on target hosts (even more expensive resource-wise). For uprobes it is not even possible in some cases to know function IP before hand (e.g., when attaching to shared library without PID filtering, in which case base load address is not known for a library). This is done by storing u64 bpf_cookie in struct bpf_prog_array_item, corresponding to each attached and run BPF program. Given cgroup BPF programs already use two 8-byte pointers for their needs and cgroup BPF programs don't have (yet?) support for bpf_cookie, reuse that space through union of cgroup_storage and new bpf_cookie field. Make it available to kprobe/tracepoint BPF programs through bpf_trace_run_ctx. This is set by BPF_PROG_RUN_ARRAY, used by kprobe/uprobe/tracepoint BPF program execution code, which luckily is now also split from BPF_PROG_RUN_ARRAY_CG. This run context will be utilized by a new BPF helper giving access to this user-provided cookie value from inside a BPF program. Generic perf_event BPF programs will access this value from perf_event itself through passed in BPF program context. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-17  bpf: Implement minimal BPF perf link  (Andrii Nakryiko)  [2 files, -13/+102]
Introduce a new type of BPF link - BPF perf link. This brings perf_event-based BPF program attachments (perf_event, tracepoints, kprobes, and uprobes) into the common BPF link infrastructure, allowing to list all active perf_event based attachments, auto-detaching BPF program from perf_event when link's FD is closed, get generic BPF link fdinfo/get_info functionality. BPF_LINK_CREATE command expects perf_event's FD as target_fd. No extra flags are currently supported. Force-detaching and atomic BPF program updates are not yet implemented, but with perf_event-based BPF links we now have common framework for this without the need to extend ioctl()-based perf_event interface. One interesting consideration is a new value for bpf_attach_type, which BPF_LINK_CREATE command expects. Generally, it's either 1-to-1 mapping from bpf_attach_type to bpf_prog_type, or many-to-1 mapping from a subset of bpf_attach_types to one bpf_prog_type (e.g., see BPF_PROG_TYPE_SK_SKB or BPF_PROG_TYPE_CGROUP_SOCK). In this case, though, we have three different program types (KPROBE, TRACEPOINT, PERF_EVENT) using the same perf_event-based mechanism, so it's many bpf_prog_types to one bpf_attach_type. I chose to define a single BPF_PERF_EVENT attach type for all of them and adjust link_create()'s logic for checking correspondence between attach type and program type. The alternative would be to define three new attach types (e.g., BPF_KPROBE, BPF_TRACEPOINT, and BPF_PERF_EVENT), but that seemed like unnecessary overkill and BPF_KPROBE will cause naming conflicts with BPF_KPROBE() macro, defined by libbpf. I chose to not do this to avoid unnecessary proliferation of bpf_attach_type enum values and not have to deal with naming conflicts. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-17  bpf: Refactor perf_event_set_bpf_prog() to use struct bpf_prog input  (Andrii Nakryiko)  [1 file, -33/+28]
Make internal perf_event_set_bpf_prog() use struct bpf_prog pointer as an input argument, which makes it easier to re-use for other internal uses (coming up for BPF link in the next patch). BPF program FD is not as convenient and in some cases it's not available. So switch to struct bpf_prog, move out refcounting outside and let caller do bpf_prog_put() in case of an error. This follows the approach of most of the other BPF internal functions. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-17  bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions  (Andrii Nakryiko)  [2 files, -17/+17]
Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions with all the same readability and maintainability benefits. Making them into functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx functions. Also, explicitly specifying the type of the BPF prog run callback required adjusting __bpf_prog_run_save_cb() to accept const void *, casted internally to const struct sk_buff. Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to the differences in bpf_run_ctx used for those two different use cases. I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that required including include/linux/cgroup-defs.h, which I wasn't sure is ok with everyone. The remaining generic BPF_PROG_RUN_ARRAY function will be extended to pass-through user-provided context value in the next patch. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-17  bpf: Refactor BPF_PROG_RUN into a function  (Andrii Nakryiko)  [7 files, -15/+15]
Turn BPF_PROG_RUN into a proper always-inlined function. No functional or performance changes are intended, but it makes it much easier to understand how BPF programs actually get executed. It's more obvious what types and callbacks are expected. Also, the extra () around input parameters can be dropped, as well as the `__` variable prefixes intended to avoid naming collisions, which makes the code simpler to read and write. This refactoring also highlighted one extra issue: BPF_PROG_RUN is both a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN), so turning BPF_PROG_RUN into a function causes a naming-conflict compilation error. Therefore rename BPF_PROG_RUN to the lower-case bpf_prog_run(), similar to bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
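Conceptually, the macro becomes something like the following (a heavily simplified sketch; the real kernel definition also goes through the dispatcher and stats handling):

    static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx)
    {
            return prog->bpf_func(ctx, prog->insnsi);
    }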
2021-08-16  notifier: Remove atomic_notifier_call_chain_robust()  (Valentin Schneider)  [1 file, -19/+0]
This now has no more users, remove it. Suggested-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-08-16  PM: cpu: Make notifier chain use a raw_spinlock_t  (Valentin Schneider)  [1 file, -12/+38]
Invoking atomic_notifier_chain_notify() requires acquiring a spinlock_t, which can block under CONFIG_PREEMPT_RT. Notifications for members of the cpu_pm notification chain will be issued by the idle task, which can never block. Making *all* atomic_notifiers use a raw_spinlock is too big of a hammer, as only notifications issued by the idle task are problematic. Special-case cpu_pm_notifier_chain by kludging a raw_notifier and raw_spinlock_t together, matching the atomic_notifier behavior with a raw_spinlock_t. Fixes: 70d932985757 ("notifier: Fix broken error handling pattern") Signed-off-by: Valentin Schneider <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
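The resulting construct looks roughly like the following sketch (simplified; error paths and the remaining notifier operations are omitted):

    static struct {
            struct raw_notifier_head chain;
            raw_spinlock_t lock;
    } cpu_pm_notifier = {
            .chain = RAW_NOTIFIER_INIT(cpu_pm_notifier.chain),
            .lock  = __RAW_SPIN_LOCK_UNLOCKED(cpu_pm_notifier.lock),
    };

    int cpu_pm_register_notifier(struct notifier_block *nb)
    {
            unsigned long flags;
            int ret;

            raw_spin_lock_irqsave(&cpu_pm_notifier.lock, flags);
            ret = raw_notifier_chain_register(&cpu_pm_notifier.chain, nb);
            raw_spin_unlock_irqrestore(&cpu_pm_notifier.lock, flags);
            return ret;
    }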
2021-08-16  PM: sleep: unmark 'state' functions as kernel-doc  (Randy Dunlap)  [1 file, -1/+1]
Fix kernel-doc warnings in kernel/power/main.c by unmarking the comment block as kernel-doc notation. This eliminates the following kernel-doc warnings: kernel/power/main.c:593: warning: expecting prototype for state(). Prototype was for state_show() instead kernel/power/main.c:593: warning: Function parameter or member 'kobj' not described in 'state_show' kernel/power/main.c:593: warning: Function parameter or member 'attr' not described in 'state_show' kernel/power/main.c:593: warning: Function parameter or member 'buf' not described in 'state_show' Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2021-08-16  Merge tag 'trace-v5.14-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)  [3 files, -1/+62]
Pull tracing fixes from Steven Rostedt:
 "Fixes and clean ups to tracing:
  - Fix header alignment when PREEMPT_RT is enabled for osnoise tracer
  - Inject "stop" event to see where osnoise stopped the trace
  - Define DYNAMIC_FTRACE_WITH_ARGS as some code had an #ifdef for it
  - Fix erroneous message for bootconfig cmdline parameter
  - Fix crash caused by not found variable in histograms"
* tag 'trace-v5.14-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name
  init: Suppress wrong warning for bootconfig cmdline parameter
  tracing: define needed config DYNAMIC_FTRACE_WITH_ARGS
  trace/osnoise: Print a stop tracing message
  trace/timerlat: Add a header with PREEMPT_RT additional fields
  trace/osnoise: Add a header with PREEMPT_RT additional fields
2021-08-16  tracepoint: Fix kerneldoc comments  (zhaoxiao)  [1 file, -2/+2]
Fix function name in tracepoint.c kernel-doc comment to remove a warning found by clang_w1. kernel/tracepoint.c:589: warning: expecting prototype for register_tracepoint_notifier(). Prototype was for register_tracepoint_module_notifier() instead kernel/tracepoint.c:613: warning: expecting prototype for unregister_tracepoint_notifier(). Prototype was for unregister_tracepoint_module_notifier() instead Link: https://lkml.kernel.org/r/[email protected] Acked-by: Mathieu Desnoyers <[email protected]> Signed-off-by: zhaoxiao <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing/boot: Show correct histogram error command  (Masami Hiramatsu)  [1 file, -4/+10]
Since trigger_process_regex() modifies given trigger actions while parsing, the error message couldn't show what command was passed to the trigger_process_regex() when it returns an error. To fix that, show the backed up trigger action command instead of parsed buffer. Link: https://lkml.kernel.org/r/162856126413.203126.9465564928450701424.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing/boot: Support multiple histograms for each event  (Masami Hiramatsu)  [1 file, -9/+32]
Add multiple histograms support for each event. This allows user to set multiple histograms to an event. ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist[.N] { ... } The 'N' is a digit started string and it can be omitted for the default histogram. For example, multiple hist triggers example in the Documentation/trace/histogram.rst can be written as below; ftrace.event.net.netif_receive_skb.hist { 1 { keys = skbaddr.hex values = len filter = len < 0 } 2 { keys = skbaddr.hex values = len filter = len > 4096 } 3 { keys = skbaddr.hex values = len filter = len == 256 } 4 { keys = skbaddr.hex values = len } 5 { keys = len values = common_preempt_count } } Link: https://lkml.kernel.org/r/162856125628.203126.15846930277378572120.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing/boot: Support multiple handlers for per-event histogram  (Masami Hiramatsu)  [1 file, -8/+35]
Support multiple handlers for per-event histogram in boot-time tracing. Since the histogram can register multiple same handler-actions with different parameters, this expands the syntax to support such cases. With this update, the 'onmax', 'onchange' and 'onmatch' handler subkeys under per-event histogram option will take a number subkeys optionally as below. (see [.N]) ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist { onmax|onchange[.N] { var = <VAR>; <ACTION> [= <PARAM>] } onmatch[.N] { event = <EVENT>; <ACTION> [= <PARAM>] } } The 'N' must be a digit (or digit started word). Thus user can add several handler-actions to the histogram, for example, ftrace.event.SOMEGROUP.SOMEEVENT.hist { keys = SOME_ID; lat = common_timestamp.usecs-$ts0 onmatch.1 { event = GROUP1.STARTEVENT1 trace = latency_event, SOME_ID, $lat } onmatch.2 { event = GROUP2.STARTEVENT2 trace = latency_event, SOME_ID, $lat } } Then, it can trace the elapsed time from GROUP1.STARTEVENT1 to SOMEGROUP.SOMEEVENT, and from GROUP2.STARTEVENT2 to SOMEGROUP.SOMEEVENT with SOME_ID key. Link: https://lkml.kernel.org/r/162856124905.203126.14913731908137885922.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing/boot: Add per-event histogram action options  (Masami Hiramatsu)  [1 file, -0/+231]
Add a hist-trigger action syntax support to boot-time tracing. Currently, boot-time tracing supports per-event actions as option strings. However, for the histogram action, it has a special syntax and usually needs a long action definition. To make it readable and fit to the bootconfig syntax, this introduces a new options for histogram. Here are the histogram action options for boot-time tracing. ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist { keys = <KEY>[,...] values = <VAL>[,...] sort = <SORT-KEY>[,...] size = <ENTRIES> name = <HISTNAME> var { <VAR> = <EXPR> ... } pause|continue|clear onmax|onchange { var = <VAR>; <ACTION> [= <PARAM>] } onmatch { event = <EVENT>; <ACTION> [= <PARAM>] } filter = <FILTER> } Where <ACTION> is one of below; trace = <EVENT>, <ARG1>[, ...] save = <ARG1>[, ...] snapshot Link: https://lkml.kernel.org/r/162856124106.203126.10501871028479029087.stgit@devnote2 Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing: Simplify the Kconfig dependency of FTRACE  (Masahiro Yamada)  [1 file, -5/+1]
The entire FTRACE block is surrounded by 'if TRACING_SUPPORT' ... 'endif'. Using 'depends on' is a simpler way to guard FTRACE. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Masahiro Yamada <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing: Allow execnames to be passed as args for synthetic events  (Steven Rostedt (VMware))  [1 file, -4/+42]
Allow common_pid.execname to be saved in a variable in one histogram to be passed to another histogram that can pass it as a parameter to a synthetic event. ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs:arg2=common_pid.execname' \ > events/sched/sched_waking/trigger ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1,exec=$arg2'\ ':onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,$exec)' \ > events/sched/sched_switch/trigger The above is a wake up latency synthetic event setup that passes the execname of the common_pid that woke the task to the scheduling of that task, which triggers a synthetic event that passes the original execname as a parameter to display it. ># echo 1 > events/synthetic/enable ># cat trace <idle>-0 [006] d..4 186.863801: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3 <idle>-0 [000] d..4 186.863858: wakeup_lat: pid=163 delta=27 wake_comm=<idle> <idle>-0 [001] d..4 186.863903: wakeup_lat: pid=1307 delta=36 wake_comm=kworker/u16:4 <idle>-0 [000] d..4 186.863927: wakeup_lat: pid=163 delta=5 wake_comm=<idle> <idle>-0 [006] d..4 186.863957: wakeup_lat: pid=1306 delta=24 wake_comm=kworker/u16:3 sshd-1306 [006] d..4 186.864051: wakeup_lat: pid=61 delta=62 wake_comm=<idle> <idle>-0 [000] d..4 186.965030: wakeup_lat: pid=609 delta=18 wake_comm=<idle> <idle>-0 [006] d..4 186.987582: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3 <idle>-0 [000] d..4 186.987639: wakeup_lat: pid=163 delta=27 wake_comm=<idle> Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Tom Zanussi <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing: Have histogram types be constant when possible  (Steven Rostedt (VMware))  [1 file, -18/+14]
Instead of kstrdup("const", GFP_KERNEL), have the hist_field type simply assign the constant hist_field->type = "const"; And when the value passed to it is a variable, use "kstrdup_const(var, GFP_KERNEL);" which will just copy the value if the variable is already a constant. This saves on having to allocate when not needed. All frees of the hist_field->type will need to use kfree_const(). Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Masami Hiramatsu <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing/histogram: Update the documentation for the buckets modifier  (Steven Rostedt (VMware))  [1 file, -0/+1]
Update both the tracefs README file as well as the histogram.rst to include an explanation of what the buckets modifier is and how to use it. Include an example with the wakeup_latency example for both log2 and the buckets modifiers as there was no existing log2 example. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Namhyung Kim <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing: Add linear buckets to histogram logic  (Steven Rostedt (VMware))  [1 file, -7/+58]
There have been several times I wished the histogram logic had a "grouping" feature for the buckets. Currently, each bucket has a size of one. That is, if you trace the amount of requested allocations, each allocation is its own bucket, even if you are interested in what allocates 100 bytes or less, 100 to 200, 200 to 300, etc. Also, without grouping, it fills up the allocated histogram buckets quickly. If you are tracking latency and don't care whether something is 200 microseconds off or 201 microseconds off, but want to track them by, say, 10 microseconds each, this cannot currently be done. There is a log2 modifier, but that grouping gets too big too fast for a lot of cases. Introduce a "buckets=SIZE" command to each field where it will record in a rounded number. For example:
 ># echo 'hist:keys=bytes_req.buckets=100:sort=bytes_req' > events/kmem/kmalloc/trigger
 ># cat events/kmem/kmalloc/hist
 # event histogram
 #
 # trigger info: hist:keys=bytes_req.buckets=100:vals=hitcount:sort=bytes_req.buckets=100:size=2048 [active]
 #
 { bytes_req: ~ 0-99 } hitcount: 3149
 { bytes_req: ~ 100-199 } hitcount: 1468
 { bytes_req: ~ 200-299 } hitcount: 39
 { bytes_req: ~ 300-399 } hitcount: 306
 { bytes_req: ~ 400-499 } hitcount: 364
 { bytes_req: ~ 500-599 } hitcount: 32
 { bytes_req: ~ 600-699 } hitcount: 69
 { bytes_req: ~ 700-799 } hitcount: 37
 { bytes_req: ~ 1200-1299 } hitcount: 16
 { bytes_req: ~ 1400-1499 } hitcount: 30
 { bytes_req: ~ 2000-2099 } hitcount: 6
 { bytes_req: ~ 4000-4099 } hitcount: 2168
 { bytes_req: ~ 5000-5099 } hitcount: 6
 Totals: Hits: 7690 Entries: 13 Dropped: 0
Link: https://lkml.kernel.org/r/[email protected] Acked-by: Namhyung Kim <[email protected]> Reviewed-by: Tom Zanussi <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Tested-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
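The bucketing itself boils down to rounding each value down to the start of its bucket; a one-line sketch of the arithmetic (not the kernel implementation):

    static unsigned long bucket_start(unsigned long val, unsigned long bucket_size)
    {
            /* e.g. val = 137, bucket_size = 100  ->  bucket "100-199" */
            return (val / bucket_size) * bucket_size;
    }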
2021-08-16  tracing/boot: Fix a hist trigger dependency for boot time tracing  (Masami Hiramatsu)  [1 file, -6/+9]
Fixes a build error when CONFIG_HIST_TRIGGERS=n with boot-time tracing. Since the trigger_process_regex() is defined only when CONFIG_HIST_TRIGGERS=y, if it is disabled, the 'actions' event option also must be disabled. Link: https://lkml.kernel.org/r/162856123376.203126.582144262622247352.stgit@devnote2 Fixes: 81a59555ff15 ("tracing/boot: Add per-event settings") Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  tracing: Apply trace filters on all output channels  (Pingfan Liu)  [2 files, -35/+15]
The event filters are not applied on all of the output, which results in the flood of printk when using tp_printk. Unfolding event_trigger_unlock_commit_regs() into trace_event_buffer_commit(), so the filters can be applied on every output. Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 0daa2302968c1 ("tracing: Add tp_printk cmdline to have tracepoints go to printk()") Signed-off-by: Pingfan Liu <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-08-16  params: lift param_set_uint_minmax to common code  (Sagi Grimberg)  [1 file, -0/+18]
It is a useful helper hence move it to common code so others can enjoy it. Suggested-by: Christoph Hellwig <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
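A hypothetical use of the now-common helper from a module, assuming the signature the original sunrpc version had (value string, kernel_param, min, max); the parameter name and range here are invented:

    static int queue_depth_set(const char *val, const struct kernel_param *kp)
    {
            /* Accept only values in the range 1..128. */
            return param_set_uint_minmax(val, kp, 1, 128);
    }

    static const struct kernel_param_ops queue_depth_ops = {
            .set = queue_depth_set,
            .get = param_get_uint,
    };

    static unsigned int queue_depth = 32;
    module_param_cb(queue_depth, &queue_depth_ops, &queue_depth, 0644);
    MODULE_PARM_DESC(queue_depth, "Queue depth (1-128)");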
2021-08-15  Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  [3 files, -6/+17]
Pull irq fixes from Thomas Gleixner:
 "A set of fixes for PCI/MSI and x86 interrupt startup:
  - Mask all MSI-X entries when enabling MSI-X, otherwise stale unmasked entries stay around, e.g. when a crashkernel is booted.
  - Enforce masking of a MSI-X table entry when updating it, which is mandatory according to the specification.
  - Ensure that writes to MSI[-X] tables are flushed.
  - Prevent invalid bits being set in the MSI mask register.
  - Properly serialize modifications to the mask cache and the mask register for multi-MSI.
  - Cure the violation of the affinity setting rules on X86 during interrupt startup which can cause lost and stale interrupts. Move the initial affinity setting ahead of actually enabling the interrupt.
  - Ensure that MSI interrupts are completely torn down before freeing them in the error handling case.
  - Prevent an array out of bounds access in the irq timings code"
* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  driver core: Add missing kernel doc for device::msi_lock
  genirq/msi: Ensure deactivation on teardown
  genirq/timings: Prevent potential array overflow in __irq_timings_store()
  x86/msi: Force affinity setup before startup
  x86/ioapic: Force affinity setup before startup
  genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
  PCI/MSI: Protect msi_desc::masked for multi-MSI
  PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
  PCI/MSI: Correct misleading comments
  PCI/MSI: Do not set invalid bits in MSI mask
  PCI/MSI: Enforce MSI[X] entry updates to be visible
  PCI/MSI: Enforce that MSI-X table entry is masked for update
  PCI/MSI: Mask all unused MSI-X entries
  PCI/MSI: Enable and mask MSI-X early
2021-08-15  Merge tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  [1 file, -1/+1]
Pull locking fix from Borislav Petkov:
 - Fix a CONFIG symbol's spelling
* tag 'locking_urgent_for_v5.14_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rtmutex: Use the correct rtmutex debugging config option
2021-08-15  bpf: Support "%c" in bpf_bprintf_prepare().  (Kuniyuki Iwashima)  [1 file, -0/+14]
/proc/net/unix uses "%c" to print a single-byte character to escape '\0' in the name of the abstract UNIX domain socket. The following selftest uses it, so this patch adds support for "%c". Note that it does not support wide character ("%lc" and "%llc") for simplicity. Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-14  dma-mapping: return an unsigned int from dma_map_sg{,_attrs}  (Christoph Hellwig)  [1 file, -1/+1]
These can only return 0 for failure or the number of entries, so turn the return value into an unsigned int. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Logan Gunthorpe <[email protected]>
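Caller-side, failure is now unambiguously a zero return; a hypothetical driver helper sketching the usual pattern (names invented, hardware programming elided):

    #include <linux/dma-mapping.h>

    static int my_driver_map_buffers(struct device *dev, struct scatterlist *sgl,
                                     int sg_count)
    {
            unsigned int nents = dma_map_sg(dev, sgl, sg_count, DMA_TO_DEVICE);

            if (!nents)             /* 0 now simply means the mapping failed */
                    return -EIO;

            /* ... hand 'nents' mapped entries to the hardware ... */

            dma_unmap_sg(dev, sgl, sg_count, DMA_TO_DEVICE);
            return 0;
    }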
2021-08-13  bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_CGROUP_SOCKOPT  (Stanislav Fomichev)  [1 file, -0/+18]
This is similar to existing BPF_PROG_TYPE_CGROUP_SOCK and BPF_PROG_TYPE_CGROUP_SOCK_ADDR. Signed-off-by: Stanislav Fomichev <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-13  bpf: Clear zext_dst of dead insns  (Ilya Leoshkevich)  [1 file, -0/+1]
"access skb fields ok" verifier test fails on s390 with the "verifier bug. zext_dst is set, but no reg is defined" message. The first insns of the test prog are ...
  0: 61 01 00 00 00 00 00 00 ldxw %r0,[%r1+0]
  8: 35 00 00 01 00 00 00 00 jge %r0,0,1
 10: 61 01 00 08 00 00 00 00 ldxw %r0,[%r1+8]
... and the 3rd one is dead (this does not look intentional to me, but this is a separate topic). sanitize_dead_code() converts dead insns into "ja -1", but keeps zext_dst. When opt_subreg_zext_lo32_rnd_hi32() tries to parse such an insn, it sees this discrepancy and bails. This problem can be seen only with JITs whose bpf_jit_needs_zext() returns true. Fix by clearing dead insns' zext_dst. The commits that contributed to this problem are:
 1. 5aa5bd14c5f8 ("bpf: add initial suite for selftests"), which introduced the test with the dead code.
 2. 5327ed3d44b7 ("bpf: verifier: mark verified-insn with sub-register zext flag"), which introduced the zext_dst flag.
 3. 83a2881903f3 ("bpf: Account for BPF_FETCH in insn_has_def32()"), which introduced the sanity check.
 4. 9183671af6db ("bpf: Fix leakage under speculation on mispredicted branches"), which bisect points to.
It's best to fix this on stable branches that contain the second one, since that's the point where the inconsistency was introduced. Fixes: 5327ed3d44b7 ("bpf: verifier: mark verified-insn with sub-register zext flag") Signed-off-by: Ilya Leoshkevich <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-08-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)  [11 files, -108/+246]
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-13  genirq: Fix kernel doc indentation  (Thomas Gleixner)  [1 file, -3/+1]
Fixes: 61377ec14457 ("genirq: Clarify documentation for request_threaded_irq()") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2021-08-12  Merge tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)  [4 files, -13/+19]
Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from netfilter, bpf, can and ieee802154. The size of this is pretty normal, but we got more fixes for 5.14 changes this week than last week. Nothing major but the trend is the opposite of what we like. We'll see how the next week goes..
 Current release - regressions:
  - r8169: fix ASPM-related link-up regressions
  - bridge: fix flags interpretation for extern learn fdb entries
  - phy: micrel: fix link detection on ksz87xx switch
  - Revert "tipc: Return the correct errno code"
  - ptp: fix possible memory leak caused by invalid cast
 Current release - new code bugs:
  - bpf: add missing bpf_read_[un]lock_trace() for syscall program
  - bpf: fix potentially incorrect results with bpf_get_local_storage()
  - page_pool: mask the page->signature before the checking, avoid dma mapping leaks
  - netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
  - bnxt_en: fix firmware interface issues with PTP
  - mlx5: Bridge, fix ageing time
 Previous releases - regressions:
  - linkwatch: fix failure to restore device state across suspend/resume
  - bareudp: fix invalid read beyond skb's linear data
 Previous releases - always broken:
  - bpf: fix integer overflow involving bucket_size
  - ppp: fix issues when desired interface name is specified via netlink
  - wwan: mhi_wwan_ctrl: fix possible deadlock
  - dsa: microchip: ksz8795: fix number of VLAN related bugs
  - dsa: drivers: fix broken backpressure in .port_fdb_dump
  - dsa: qca: ar9331: make proper initial port defaults
 Misc:
  - bpf: add lockdown check for probe_write_user helper
  - netfilter: conntrack: remove offload_pickup sysctl before 5.14 is out
  - netfilter: conntrack: collect all entries in one cycle, heuristically slow down garbage collection scans on idle systems to prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
  vsock/virtio: avoid potential deadlock when vsock device remove
  wwan: core: Avoid returning NULL from wwan_create_dev()
  net: dsa: sja1105: unregister the MDIO buses during teardown
  Revert "tipc: Return the correct errno code"
  net: mscc: Fix non-GPL export of regmap APIs
  net: igmp: increase size of mr_ifc_count
  MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
  tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
  net: pcs: xpcs: fix error handling on failed to allocate memory
  net: linkwatch: fix failure to restore device state across suspend/resume
  net: bridge: fix memleak in br_add_if()
  net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
  net: bridge: fix flags interpretation for extern learn fdb entries
  net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
  net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
  net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
  net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
  bpf, core: Fix kernel-doc notation
  net: igmp: fix data-race in igmp_ifc_timer_expire()
  net: Fix memory leak in ieee802154_raw_deliver
  ...
2021-08-12  cgroup/cpuset: Enable memory migration for cpuset v2  (Waiman Long)  [1 file, -1/+5]
When a user changes cpuset.cpus, each task in a v2 cpuset will be moved to one of the new cpus if it is not there already. For memory, however, tasks won't be migrated to the new nodes when cpuset.mems changes. This is an inconsistency in behavior. In cpuset v1, there is a memory_migrate control file to enable such behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default for cpuset v2 so that we have a consistent set of behavior for both cpus and memory. There is certainly a cost to making memory migration the default, but it is a one-time cost that shouldn't really matter as long as cpuset.mems isn't changed frequently. Update the cgroup-v2.rst file to document the new behavior and recommend against changing cpuset.mems frequently. Since there won't be any concurrent access to the newly allocated cpuset structure in cpuset_css_alloc(), we can use the cheaper non-atomic __set_bit() instead of the more expensive atomic set_bit(). Signed-off-by: Waiman Long <[email protected]> Acked-by: Johannes Weiner <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2021-08-12  hrtimer: Unbreak hrtimer_force_reprogram()  (Thomas Gleixner)  [1 file, -20/+20]
Since the recent consolidation of reprogramming functions, hrtimer_force_reprogram() is affected by a check whether the new expiry time is past the current expiry time. This breaks the NOHZ logic, as that relies on the fact that the tick hrtimer is moved into the future. That means cpu_base->expires_next becomes stale and subsequent reprogramming attempts fail as well, until the situation is cleaned up by an hrtimer interrupt. For some yet unknown reason this leads to a complete stall, so for now partially revert the offending commit to a known working state. The root cause for the stall is still being investigated and will be fixed in a subsequent commit. Fixes: b14bca97c9f5 ("hrtimer: Consolidate reprogramming code") Reported-by: Mike Galbraith <[email protected]> Reported-by: Marek Szyprowski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Mike Galbraith <[email protected]> Link: https://lore.kernel.org/r/8735recskh.ffs@tglx