path: root/kernel
Age | Commit message | Author | Files | Lines
2024-04-12 | watchdog/softlockup: Report the most frequent interrupts | Bitao Hu | 1 | -4/+112
When the watchdog determines, based on CPU utilization, that the current soft lockup is due to an interrupt storm, reporting the most frequent interrupts can be good enough for further troubleshooting. Below is an example of an interrupt storm. The call tree does not provide useful information, but comparing the interrupt counts during the lockup period makes it possible to identify the culprit:

[ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0]
[ 638.870825] CPU#9 Utilization every 4s during lockup:
[ 638.871194] #1: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.871652] #2: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.872107] #3: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.872563] #4: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.873018] #5: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs:
[ 638.873994] #1: 330945 irq#7
[ 638.874236] #2: 31 irq#82
[ 638.874493] #3: 10 irq#10
[ 638.874744] #4: 2 irq#89
[ 638.874992] #5: 1 irq#102
...
[ 638.875313] Call trace:
[ 638.875315] __do_softirq+0xa8/0x364

Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Liu Song <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | watchdog/softlockup: Low-overhead detection of interrupt storm | Bitao Hu | 1 | -1/+98
The following softlockup is caused by an interrupt storm, but that cannot be identified from the call tree, because the call tree is just a snapshot and doesn't fully capture the behavior of the CPU during the soft lockup.

watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
...
Call trace:
__do_softirq+0xa0/0x37c
__irq_exit_rcu+0x108/0x140
irq_exit+0x14/0x20
__handle_domain_irq+0x84/0xe0
gic_handle_irq+0x80/0x108
el0_irq_naked+0x50/0x58

Therefore, it is necessary to report CPU utilization during the softlockup_threshold period (one report every sample_period, five reports in total), like this:

watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
CPU#28 Utilization every 4s during lockup:
#1: 0% system, 0% softirq, 100% hardirq, 0% idle
#2: 0% system, 0% softirq, 100% hardirq, 0% idle
#3: 0% system, 0% softirq, 100% hardirq, 0% idle
#4: 0% system, 0% softirq, 100% hardirq, 0% idle
#5: 0% system, 0% softirq, 100% hardirq, 0% idle
...

This is helpful in determining whether an interrupt storm has occurred or in identifying the cause of the softlockup. The criteria for the determination are as follows:
a. If the hardirq utilization is high, an interrupt storm should be considered and the root cause cannot be determined from the call tree.
b. If the softirq utilization is high, the call tree might not necessarily point at the root cause.
c. If the system utilization is high, analyzing the root cause from the call tree is possible in most cases.

The mechanism requires a considerable amount of global storage space when configured for the maximum number of CPUs. Therefore, add a SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes" if the maximum number of CPUs is <= 128.

Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Liu Song <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | genirq: Avoid summation loops for /proc/interrupts | Bitao Hu | 3 | -9/+15
show_interrupts() unconditionally accumulates the per CPU interrupt statistics to determine whether an interrupt was ever raised. This can be avoided for all interrupts which are not strictly per CPU and not of type NMI, because those interrupts already provide an accumulated counter. The required logic is already implemented in kstat_irqs(). Split the inner access logic out of kstat_irqs() and use it for kstat_irqs() and show_interrupts() to avoid the accumulation loop when possible. Originally-by: Thomas Gleixner <[email protected]> Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Liu Song <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | genirq: Provide a snapshot mechanism for interrupt statistics | Bitao Hu | 2 | -0/+29
The soft lockup detector lacks a mechanism to identify interrupt storms as the root cause of a lockup. To enable this, the detector needs a mechanism to snapshot the interrupt count statistics on a CPU when the detector observes a potential lockup scenario, and to compare that against the interrupt count when it warns about the lockup later on. The number of interrupts in that period gives a hint whether the lockup might have been caused by an interrupt storm. Instead of having extra storage in the lockup detector and accessing the internals of the interrupt descriptor directly, add a snapshot member to the per CPU irq_desc::kstat_irq structure and provide interfaces to take a snapshot of all interrupts on the current CPU and to retrieve the delta of a specific interrupt later on. Originally-by: Thomas Gleixner <[email protected]> Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
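A rough sketch of how a lockup detector could consume such a snapshot interface; the function names below are illustrative assumptions, not necessarily the ones introduced by this patch:

    /* Latch the per-IRQ counts on this CPU when a potential lockup is
     * first observed (illustrative names).
     */
    static void lockup_snapshot_irqs(void)
    {
            kstat_snapshot_irqs();
    }

    /* Later, when the lockup is reported, retrieve the delta for one IRQ. */
    static void lockup_report_irq(unsigned int irq)
    {
            unsigned int delta = kstat_get_irq_since_snapshot(irq);

            pr_warn("irq#%u fired %u times during the lockup window\n",
                    irq, delta);
    }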
2024-04-12 | genirq: Convert kstat_irqs to a struct | Bitao Hu | 3 | -9/+7
The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is only capable of counting. A snapshot mechanism for interrupt statistics will be added soon, which requires an additional variable to store the snapshot. To facilitate expansion, convert kstat_irqs here to a struct containing only the count. Originally-by: Thomas Gleixner <[email protected]> Signed-off-by: Bitao Hu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
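Roughly, the struct-ification described above looks like this (a sketch; treat the exact struct and field names as assumptions):

    /* Was: a bare per-CPU "unsigned int" counter hanging off irq_desc. */
    struct irqstat {
            unsigned int    cnt;    /* interrupt count; a snapshot member
                                     * can be added alongside it later */
    };

    /* Accesses then change from "*per_cpu_ptr(desc->kstat_irqs, cpu)"
     * to "per_cpu_ptr(desc->kstat_irqs, cpu)->cnt".
     */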
2024-04-12 | perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlines | Ingo Molnar | 1 | -7/+7
Otherwise the compiler will be unhappy if they go unused, which they do on allnoconfigs. Signed-off-by: Ingo Molnar <[email protected]> Cc: Kyle Huey <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | perf/bpf: Allow a BPF program to suppress all sample side effects | Kyle Huey | 1 | -2/+4
Returning zero from a BPF program attached to a perf event already suppresses any data output. Return early from __perf_event_overflow() in this case so it will also suppress event_limit accounting, SIGTRAP generation, and F_ASYNC signalling. Signed-off-by: Kyle Huey <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Namhyung Kim <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | perf/bpf: Call BPF handler directly, not through overflow machinery | Kyle Huey | 1 | -16/+11
To ultimately allow BPF programs attached to perf events to completely suppress all of the effects of a perf event overflow (rather than just the sample output, as they do today), call bpf_overflow_handler() from __perf_event_overflow() directly rather than modifying struct perf_event's overflow_handler. Return the BPF program's return value from bpf_overflow_handler() so that __perf_event_overflow() knows how to proceed. Remove the now unnecessary orig_overflow_handler from struct perf_event. This patch is solely a refactoring and results in no behavior change. Suggested-by: Namhyung Kim <[email protected]> Signed-off-by: Kyle Huey <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Song Liu <[email protected]> Acked-by: Jiri Olsa <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALL | Kyle Huey | 1 | -0/+6
This will allow __perf_event_overflow() (which is independent of CONFIG_BPF_SYSCALL) to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow() | Kyle Huey | 1 | -91/+92
This will allow __perf_event_overflow() to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h | Uros Bizjak | 1 | -8/+8
Use try_cmpxchg(*ptr, &old, new) instead of cmpxchg(*ptr, old, new) == old in qspinlock_paravirt.h. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. No functional change intended. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Waiman Long <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
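A generic illustration of the pattern (a sketch, not the exact qspinlock_paravirt.h hunk):

    /* Before: compare the value returned by cmpxchg() against "old". */
    static bool cas_release_old(int *p, int old, int new)
    {
            return cmpxchg(p, old, new) == old;
    }

    /* After: try_cmpxchg() returns a boolean and writes the observed value
     * back into "old" on failure.  On x86 the success test reuses the ZF
     * flag set by CMPXCHG, so no separate CMP is emitted.
     */
    static bool cas_release_new(int *p, int old, int new)
    {
            return try_cmpxchg(p, &old, new);
    }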
2024-04-12 | locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending() | Uros Bizjak | 1 | -18/+13
Replace this pattern in trylock_clear_pending(): cmpxchg_acquire(*ptr, old, new) == old ... with the simpler and faster: try_cmpxchg_acquire(*ptr, &old, new) The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. Also change the return type of the function to bool and streamline the control flow in the _Q_PENDING_BITS == 8 variant a bit. No functional change intended. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Waiman Long <[email protected]> Reviewed-by: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-12 | ftrace: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTION | Paul E. McKenney | 1 | -2/+1
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. [ paulmck: Apply Steven Rostedt feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-12 | bpf: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTION | Paul E. McKenney | 1 | -1/+1
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update bpf_tramp_image_put() to choose call_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. This change might enable further simplifications, but the goal of this effort is to make the code safe, not necessarily optimal. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: John Fastabend <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Yonghong Song <[email protected]> Cc: KP Singh <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-12 | mm: replace set_pte_at_notify() with just set_pte_at() | Paolo Bonzini | 1 | -3/+3
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Philippe Mathieu-Daudé <[email protected]> Message-ID: <[email protected]> Acked-by: Andrew Morton <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2024-04-12 | padata: Disable BH when taking works lock on MT path | Herbert Xu | 1 | -4/+4
As the old padata code can execute in softirq context, disable softirqs for the new padata_do_multithreaded() code too, as otherwise lockdep will get antsy. Reported-by: [email protected] Signed-off-by: Herbert Xu <[email protected]> Acked-by: Daniel Jordan <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2024-04-11 | ring-buffer: Only update pages_touched when a new page is touched | Steven Rostedt (Google) | 1 | -3/+3
The "buffer_percent" logic, used by the ring buffer splice code to wake up waiting tasks only once the buffer is filled to the percentage set in the "buffer_percent" file, depends on three variables that determine the amount of data in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

 (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount.

It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when tracing first starts, but then it would be woken up at under 60%, and the amount of data read after being woken up would slowly decrease until it was much less than the buffer percent.

This was due to incorrect accounting of the pages_touched increment. This value is incremented whenever a writer transfers to a new sub-buffer, but the place where it was incremented was wrong. If a writer overflows the current sub-buffer it goes to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, the interrupt too will end up going to the next sub-buffer. But only one of them should increment the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. This way the counter is incremented exactly once when a writer moves to a new sub-buffer, and not a second time when a writer and its preempting writer race to the same new sub-buffer.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: [email protected] Cc: Mathieu Desnoyers <[email protected]> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
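A sketch of the shape of the fix (illustrative; not the actual ring-buffer code): only the writer that wins the tail-page switch bumps the counter, so a preempting writer racing to the same new sub-buffer does not double-count:

    static void switch_tail_page(void **tail, void *curr, void *next,
                                 unsigned long *pages_touched)
    {
            /* Only the winner of the cmpxchg accounts the new sub-buffer. */
            if (try_cmpxchg(tail, &curr, next))
                    (*pages_touched)++;
    }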
2024-04-11 | tracing: hide unused ftrace_event_id_fops | Arnd Bergmann | 1 | -0/+4
When CONFIG_PERF_EVENTS is disabled, a 'make W=1' build produces a warning about the unused ftrace_event_id_fops variable:

kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
 2155 | static const struct file_operations ftrace_event_id_fops = {

Hide it in the same #ifdef as the reference to it. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Zheng Yejian <[email protected]> Cc: Kees Cook <[email protected]> Cc: Ajay Kaher <[email protected]> Cc: Jinjie Ruan <[email protected]> Cc: Clément Léger <[email protected]> Cc: Dan Carpenter <[email protected]> Cc: "Tzvetomir Stoyanov (VMware)" <[email protected]> Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to event_create_dir()") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
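The shape of such a fix, roughly (a sketch, not the exact hunk):

    #ifdef CONFIG_PERF_EVENTS
    static const struct file_operations ftrace_event_id_fops = {
            /* ... members unchanged ... */
    };
    #endif /* CONFIG_PERF_EVENTS */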
2024-04-11 | tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry | Prasad Pandit | 1 | -1/+1
Fix FTRACE_RECORD_RECURSION_SIZE entry, replace tab with a space character. It helps Kconfig parsers to read file without error. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Fixes: 773c16705058 ("ftrace: Add recording of functions that caused recursion") Signed-off-by: Prasad Pandit <[email protected]> Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-04-11 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 5 | -22/+70
Cross-merge networking fixes after downstream PR.

Conflicts:

net/unix/garbage.c
  47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
  4090fa373f0e ("af_unix: Replace garbage collection algorithm.")

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
  b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")

drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
  194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  958f56e48385 ("net/mlx5e: Un-expose functions in en.h")
  49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")

Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-11 | Merge tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 1 | -0/+6
Pull power management fix from Rafael Wysocki: "Fix the suspend-to-idle core code to guarantee that timers queued on CPUs other than the one that has first left the idle state, which should expire directly after resume, will be handled (Anna-Maria Behnsen)"
* tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: s2idle: Make sure CPUs will wakeup directly on resume
2024-04-11 | locking/mutex: Introduce devm_mutex_init() | George Stark | 1 | -0/+12
Using the devm API leads to a certain order of releasing resources, so all dependent resources which are not devm-wrapped should be deleted with respect to the devm release order. A mutex is one such object: it is often bound to other resources and has no devm wrapping of its own. Since mutex_destroy() actually does nothing in non-debug builds, it is frequently just skipped, which is safe for now but formally wrong and can become a problem if mutex_destroy() is ever extended, so introduce devm_mutex_init(). Suggested-by: Christophe Leroy <[email protected]> Signed-off-by: George Stark <[email protected]> Reviewed-by: Christophe Leroy <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Marek Behún <[email protected]> Acked-by: Waiman Long <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lee Jones <[email protected]>
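A minimal usage sketch, assuming the new helper takes the managing device plus the mutex and returns an errno (treat the exact signature as an assumption):

    struct foo_priv {
            struct mutex lock;
    };

    static int foo_probe(struct platform_device *pdev)
    {
            struct foo_priv *priv;
            int ret;

            priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
            if (!priv)
                    return -ENOMEM;

            /* Torn down automatically, in reverse devm order, on unbind. */
            ret = devm_mutex_init(&pdev->dev, &priv->lock);
            if (ret)
                    return ret;

            return 0;
    }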
2024-04-11 | tracing: Select new NEED_TASKS_RCU Kconfig option | Paul E. McKenney | 1 | -2/+2
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. A new NEED_TASKS_RCU Kconfig option has been created to allow this enablement logic to be in one place in kernel/rcu/Kconfig. Therefore, select the new NEED_TASKS_RCU Kconfig option instead of the old TASKS_RCU option. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-11 | rcu: Add data structures for synchronize_rcu() | Uladzislau Rezki (Sony) | 1 | -0/+14
The synchronize_rcu() call is going to be reworked, thus this patch adds dedicated fields into the rcu_state structure. Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-11 | treewide: Use sysfs_bin_attr_simple_read() helper | Lukas Wunner | 1 | -12/+1
Deduplicate ->read() callbacks of bin_attributes which are backed by a simple buffer in memory: Use the newly introduced sysfs_bin_attr_simple_read() helper instead, either by referencing it directly or by declaring such bin_attributes with BIN_ATTR_SIMPLE_RO() or BIN_ATTR_SIMPLE_ADMIN_RO(). Aside from a reduction of LoC, this shaves off a few bytes from vmlinux (304 bytes on an x86_64 allyesconfig). No functional change intended. Signed-off-by: Lukas Wunner <[email protected]> Acked-by: Zhi Wang <[email protected]> Acked-by: Michael Ellerman <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Acked-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/92ee0a0e83a5a3f3474845db6c8575297698933a.1712410202.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <[email protected]>
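A sketch of what a converted call site can look like, with a bin_attribute backed by a plain in-memory buffer (the attribute name and buffer are illustrative, and the assumption is that the helper reads from the attribute's private/size fields):

    static char build_info[] = "example blob\n";

    static struct bin_attribute build_info_attr = {
            .attr    = { .name = "build_info", .mode = 0444 },
            .read    = sysfs_bin_attr_simple_read,  /* replaces a hand-rolled memcpy ->read() */
            .private = build_info,                  /* backing buffer */
            .size    = sizeof(build_info) - 1,
    };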
2024-04-11 | locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail() | Uros Bizjak | 1 | -8/+5
Use atomic_try_cmpxchg_relaxed(*ptr, &old, new) instead of atomic_cmpxchg_relaxed(*ptr, old, new) == old in xchg_tail(). The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. No functional change intended. Since this code requires NR_CPUS >= 16k, I have tested it by unconditionally setting _Q_PENDING_BITS to 1 in <asm-generic/qspinlock_types.h>. Signed-off-by: Uros Bizjak <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Waiman Long <[email protected]> Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-11 | printk: Add function to replay kernel log on consoles | Sreenath Vijayan | 1 | -24/+53
Add a generic function console_replay_all() for replaying the kernel log on consoles, in any context. It would allow viewing the logs on an unresponsive terminal via sysrq. Reuse the existing code from console_flush_on_panic() for resetting the sequence numbers, by introducing a new helper function __console_rewind_all(). It is safe to be called under console_lock(). Try to acquire lock on the console subsystem without waiting. If successful, reset the sequence number to oldest available record on all consoles and call console_unlock() which will automatically flush the messages to the consoles. Suggested-by: John Ogness <[email protected]> Suggested-by: Petr Mladek <[email protected]> Signed-off-by: Shimoyashiki Taichi <[email protected]> Reviewed-by: John Ogness <[email protected]> Signed-off-by: Sreenath Vijayan <[email protected]> Link: https://lore.kernel.org/r/90ee131c643a5033d117b556c0792de65129d4c3.1710220326.git.sreenath.vijayan@sony.com Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-04-10 | bpf: Add bpf_link support for sk_msg and sk_skb progs | Yonghong Song | 1 | -0/+4
Add bpf_link support for sk_msg and sk_skb programs. We have an internal request to support bpf_link for sk_msg programs so user space can have a uniform handling with bpf_link based libbpf APIs. Using bpf_link based libbpf API also has a benefit which makes system robust by decoupling prog life cycle and attachment life cycle. Reviewed-by: John Fastabend <[email protected]> Signed-off-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-04-10 | kprobes: Fix possible use-after-free issue on kprobe registration | Zheng Yejian | 1 | -6/+12
When unloading a module, its state changes from MODULE_STATE_LIVE -> MODULE_STATE_GOING -> MODULE_STATE_UNFORMED, and each transition takes some time. `is_module_text_address()` and `__module_text_address()` work with MODULE_STATE_LIVE and MODULE_STATE_GOING. If we use `is_module_text_address()` and `__module_text_address()` separately, there is a chance that the first one succeeds but the second one fails because module->state becomes MODULE_STATE_UNFORMED between those operations. In `check_kprobe_address_safe()`, if the second `__module_text_address()` fails, that failure is ignored because a kernel text address is expected. But it may have failed simply because module->state has been changed to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify a non-existent module text address (use-after-free). To fix this problem, do not use `is_module_text_address()` and `__module_text_address()` separately; instead call `__module_text_address()` once and do `try_module_get(module)`, which only succeeds for MODULE_STATE_LIVE. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas") Cc: [email protected] Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
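A sketch of the described pattern (illustrative, not the exact hunk): resolve the owning module once and pin it with try_module_get(), which can only succeed while the module is MODULE_STATE_LIVE:

    static bool kprobe_pin_module(unsigned long addr, struct module **pmod)
    {
            struct module *mod;

            preempt_disable();
            mod = __module_text_address(addr);
            if (mod && !try_module_get(mod))
                    mod = NULL;             /* module is on its way out */
            preempt_enable();

            *pmod = mod;
            return mod != NULL;
    }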
2024-04-10 | x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n | Sean Christopherson | 1 | -1/+2
Initialize cpu_mitigations to CPU_MITIGATIONS_OFF if the kernel is built with CONFIG_SPECULATION_MITIGATIONS=n, as the help text quite clearly states that disabling SPECULATION_MITIGATIONS is supposed to turn off all mitigations by default. │ If you say N, all mitigations will be disabled. You really │ should know what you are doing to say so. As is, the kernel still defaults to CPU_MITIGATIONS_AUTO, which results in some mitigations being enabled in spite of SPECULATION_MITIGATIONS=n. Fixes: f43b9876e857 ("x86/retbleed: Add fine grained Kconfig knobs") Signed-off-by: Sean Christopherson <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Daniel Sneddon <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Link: https://lore.kernel.org/r/[email protected]
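Roughly the shape of such a fix (a sketch): make the build-time default follow the Kconfig choice instead of hard-coding AUTO:

    static enum cpu_mitigations cpu_mitigations __ro_after_init =
            IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO
                                                       : CPU_MITIGATIONS_OFF;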
2024-04-10 | timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu | Thomas Gleixner | 2 | -22/+31
tick_do_timer_cpu is used lockless to check which CPU needs to take care of the per tick timekeeping duty. This is done to avoid a thundering herd problem on jiffies_lock. The read and writes are not annotated so KCSAN complains about data races: BUG: KCSAN: data-race in tick_nohz_idle_stop_tick / tick_nohz_next_event write to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 26: tick_nohz_idle_stop_tick+0x3b1/0x4a0 do_idle+0x1e3/0x250 read to 0xffffffff8a2bda30 of 4 bytes by task 0 on cpu 16: tick_nohz_next_event+0xe7/0x1e0 tick_nohz_get_sleep_length+0xa7/0xe0 menu_select+0x82/0xb90 cpuidle_select+0x44/0x60 do_idle+0x1c2/0x250 value changed: 0x0000001a -> 0xffffffff Annotate them with READ/WRITE_ONCE() to document the intentional data race. Reported-by: Mirsad Todorovac <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Sean Anderson <[email protected]> Link: https://lore.kernel.org/r/87cyqy7rt3.ffs@tglx
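The annotation pattern described above, in a minimal sketch (not the exact hunks):

    static int tick_do_timer_cpu = -1;              /* illustrative stand-in */

    static void take_timekeeping_duty(int cpu)
    {
            WRITE_ONCE(tick_do_timer_cpu, cpu);     /* lockless writer */
    }

    static bool is_timekeeper(int cpu)
    {
            return READ_ONCE(tick_do_timer_cpu) == cpu;     /* lockless reader */
    }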
2024-04-10 | perf/core: Reduce PMU access to adjust sample freq | Namhyung Kim | 1 | -1/+2
In perf_adjust_freq_unthr_context(), it first starts the event and then stops it unnecessarily to adjust the sampling frequency if the event is throttled. A throttled non-frequency event has no freq to adjust, so just starting the event would be ok. A frequency event, whether it's throttled or not, needs to be stopped before adjusting the frequency. That means it should not be started if it was throttled. I tried to skip calling the stop callback, but it didn't work well since the event count might not be up to date; it should call the stop callback with PERF_EF_UPDATE anyway. However, not calling start prevents unnecessary MSR accesses (which can be costly) for already stopped events, as the stop state is saved in the hw config. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-10 | perf/core: Optimize perf_adjust_freq_unthr_context() | Namhyung Kim | 1 | -26/+44
It was unnecessarily disabling and enabling PMUs for each event. It should be done at PMU level. Add pmu_ctx->nr_freq counter to check it at each PMU. As PMU context has separate active lists for pinned group and flexible group, factor out a new function to do the job. Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT even if it needs to unthrottle sampling events. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Mingwei Zhang <[email protected]> Reviewed-by: Ian Rogers <[email protected]> Reviewed-by: Kan Liang <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09 | bpf: Add support for certain atomics in bpf_arena to x86 JIT | Alexei Starovoitov | 2 | -1/+23
Support atomics in bpf_arena that can be JITed as a single x86 instruction. Instructions that are JITed as loops are not supported at the moment, since they require more complex extable and loop logic. JITs can choose to do smarter things with bpf_jit_supports_insn(). Like arm64 may decide to support all bpf atomics instructions when emit_lse_atomic is available and none in ll_sc mode. bpf_jit_supports_percpu_insn(), bpf_jit_supports_ptr_xchg() and other such callbacks can be replaced with bpf_jit_supports_insn() in the future. Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2024-04-09 | printk: Flag register_console() if console is set on command line | Tony Lindgren | 1 | -1/+4
If add_preferred_console() is not called early in console_setup(), we can end up having register_console() call try_enable_default_console() before a console device has called add_preferred_console(). Let's set the console_set_on_cmdline flag in console_setup() to prevent this from happening. Signed-off-by: Tony Lindgren <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-04-09 | printk: Don't try to parse DEVNAME:0.0 console options | Tony Lindgren | 1 | -0/+4
Currently console_setup() tries to make a console index out of any digits passed in the kernel command line for console. In the DEVNAME:0.0 case, the name can contain a device IO address, so bail out on console names with a ':'. Signed-off-by: Tony Lindgren <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
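A sketch of the bail-out described above (illustrative, not the exact hunk): if the console= argument carries DEVNAME:0.0 style addressing, console_setup() should not try to split a numeric index out of it and can leave the name to the driver subsystem instead:

    /* "DEVNAME:0.0" carries a device address, not a trailing index. */
    static bool console_opt_is_devname(const char *str)
    {
            return strchr(str, ':') != NULL;
    }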
2024-04-09 | printk: Save console options for add_preferred_console_match() | Tony Lindgren | 4 | -4/+164
Driver subsystems may need to translate the preferred console name to the character device name used. We already do some of this in console_setup() with a few hardcoded names, but that does not scale well. The console options are parsed early in console_setup(), and the consoles are added with __add_preferred_console(). At this point we don't know much about the character device names and device drivers getting probed. To allow driver subsystems to set up a preferred console, let's save the kernel command line console options. To add a preferred console from a driver subsystem with optional character device name translation, let's add a new function add_preferred_console_match(). This allows the serial core layer to support console=DEVNAME:0.0 style hardware-based addressing in addition to the current console=ttyS0 style naming. And we can start moving console_setup() character device parsing to the driver subsystem specific code. We use a separate array from the console_cmdline array, as the character device name and index may be unknown at console_setup() time. And eventually there's no need to call __add_preferred_console() until the subsystem is ready to handle the console. The console name (in addition to the character device name) and a flag for an added console could instead be added to struct console_cmdline, with the console_cmdline array handling modified accordingly. But that complicates things compared to saving the console options and then adding the consoles when the subsystems handling the consoles are ready. Co-developed-by: Andy Shevchenko <[email protected]> Signed-off-by: Tony Lindgren <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-04-09 | bpf: Select new NEED_TASKS_RCU Kconfig option | Paul E. McKenney | 1 | -1/+1
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. A new NEED_TASKS_RCU Kconfig option has been created to allow this enablement logic to be in one place in kernel/rcu/Kconfig. Therefore, make BPF select the new NEED_TASKS_RCU Kconfig option. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Daniel Borkmann <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Martin KaFai Lau <[email protected]> Cc: Song Liu <[email protected]> Cc: Yonghong Song <[email protected]> Cc: John Fastabend <[email protected]> Cc: KP Singh <[email protected]> Cc: Stanislav Fomichev <[email protected]> Cc: Hao Luo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: <[email protected]> Cc: Ankur Arora <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Steven Rostedt <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | rcu-tasks: Make Tasks RCU wait idly for grace-period delays | Paul E. McKenney | 2 | -3/+7
Currently, all waits for grace periods sleep at TASK_UNINTERRUPTIBLE, regardless of RCU flavor. This has worked well, but there have been cases where a longer-than-average Tasks RCU grace period has triggered softlockup splats, many of them, before the Tasks RCU CPU stall warning appears. These softlockup splats unnecessarily consume console bandwidth and complicate diagnosis of the underlying problem. Plus a long but not pathologically long Tasks RCU grace period might trigger a few softlockup splats before completing normally, which generates noise for no good reason. This commit therefore causes Tasks RCU grace periods to sleep at TASK_IDLE priority. If there really is a persistent problem, the eventual Tasks RCU CPU stall warning will flag it, and without the extra noise. Reported-by: Breno Leitao <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | rcutorture: ASSERT_EXCLUSIVE_WRITER() for ->rtort_pipe_count updates | Paul E. McKenney | 1 | -0/+3
It turns out that only one CPU at a time will ever invoke rcu_torture_pipe_update_one() on a given rcu_torture structure. This commit therefore adds three ASSERT_EXCLUSIVE_WRITER() calls to enlist KCSAN's aid in checking this. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | rcutorture: Dump GP kthread state on insufficient cb-flood laundering | Paul E. McKenney | 1 | -1/+2
If a callback flood prevents the grace period from completing, rcutorture does a WARN_ON(). Avoiding this WARN_ON() currently requires that at least three grace periods elapse during an eight-second callback-flood interval. Unfortunately, the current debug information does not include anything about the grace-period state. This commit therefore adds a call to cur_ops->gp_kthread_dbg(), if this function pointer is non-NULL. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | rcutorture: Dump # online CPUs on insufficient cb-flood laundering | Paul E. McKenney | 1 | -2/+2
This commit adds the number of online CPUs to the state dump following an unsuccessful callback-flood test. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | rcu: Add lockdep checks and kernel-doc header to rcu_softirq_qs() | Paul E. McKenney | 1 | -0/+28
There are some indications that rcu_softirq_qs() might be more generally used than anticipated. This commit therefore adds some lockdep assertions and some cautionary tales in a new kernel-doc header. Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Paul E. McKenney <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Yan Zhai <[email protected]> Cc: <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
2024-04-09 | clockevents: Convert s[n]printf() to sysfs_emit() | Li Zhijian | 1 | -1/+1
Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. coccinelle complains that there are still a couple of functions that use snprintf(). Convert them to sysfs_emit(). Signed-off-by: Li Zhijian <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-09 | clocksource: Convert s[n]printf() to sysfs_emit() | Li Zhijian | 1 | -1/+1
Per filesystems/sysfs.rst, show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. coccinelle complains that there are still a couple of functions that use snprintf(). Convert them to sysfs_emit(). Signed-off-by: Li Zhijian <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
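The conversion pattern for such show() callbacks, in a generic sketch (the attribute and value below are placeholders, not the actual clockevents/clocksource code):

    static ssize_t example_show(struct device *dev,
                                struct device_attribute *attr, char *buf)
    {
            /* Before: return snprintf(buf, PAGE_SIZE, "%s\n", value); */
            return sysfs_emit(buf, "%s\n", "value");
    }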
2024-04-09 | Merge tag 'v6.9-rc3' into locking/core, to pick up fixes | Ingo Molnar | 38 | -792/+1321
Signed-off-by: Ingo Molnar <[email protected]>
2024-04-08 | workqueue: Add destroy_work_on_stack() in workqueue_softirq_dead() | Zqiang | 1 | -0/+1
This commit adds missing destroy_work_on_stack() operations for dead_work.work. Signed-off-by: Zqiang <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-04-08 | cgroup/cpuset: Make cpuset hotplug processing synchronous | Waiman Long | 3 | -135/+56
Since commit 3a5a6d0c2b03 ("cpuset: don't nest cgroup_mutex inside get_online_cpus()"), cpuset hotplug was done asynchronously via a work function. This is to avoid recursive locking of cgroup_mutex. Since then, the cgroup locking scheme has changed quite a bit. A cpuset_mutex was introduced to protect cpuset specific operations. The cpuset_mutex was then replaced by a cpuset_rwsem. With commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), cpu_hotplug_lock is acquired before cpuset_rwsem. Later on, cpuset_rwsem was reverted back to cpuset_mutex. All these locking changes allow the hotplug code to call into cpuset core directly. The following commits were also merged due to the asynchronous nature of cpuset hotplug processing:

 - commit b22afcdf04c9 ("cpu/hotplug: Cure the cpusets trainwreck")
 - commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs")
 - commit 28b89b9e6f7b ("cpuset: handle race between CPU hotplug and cpuset_hotplug_work")

Clean up all these bandages by making cpuset hotplug processing synchronous again, with the exception that the call to cgroup_transfer_tasks() to transfer tasks out of an empty cgroup v1 cpuset, if necessary, will still be done via a work function due to the existing cgroup_mutex -> cpu_hotplug_lock dependency. It is possible to reverse that dependency, but that would require updating a number of different cgroup controllers. This special hotplug code path should be rarely taken anyway. As all the cpuset states will be updated by the end of the hotplug operation, we can revert most of the above commits except commit 50e76632339d ("sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs"), which is partially reverted. Also remove some cpus_read_lock() trylock attempts in the cpuset partition code, as they are no longer necessary since the cpu_hotplug_lock is now held for the whole duration of the cpuset hotplug code path. Signed-off-by: Waiman Long <[email protected]> Tested-by: Valentin Schneider <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-04-08 | PM: EM: Add em_dev_update_chip_binning() | Lukasz Luba | 1 | -0/+48
Add a function which allows the EM to be modified easily after new voltage information becomes available. The device drivers for the chip can adjust the voltage values after setup. The voltage for the same frequency in an OPP can differ due to chip binning. The voltage impacts the power usage, and the EM power values can be updated to reflect that. Reviewed-by: Dietmar Eggemann <[email protected]> Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-04-08 | PM: EM: Refactor em_adjust_new_capacity() | Lukasz Luba | 1 | -19/+39
Extract em_table_dup() and em_recalc_and_update() from em_adjust_new_capacity(). Both functions will be later reused by the 'update EM due to chip binning' functionality. Reviewed-by: Dietmar Eggemann <[email protected]> Signed-off-by: Lukasz Luba <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>