aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-06-16trace,smp: Add tracepoints for scheduling remotelly called functionsLeonardo Bras1-11/+5
Add a tracepoint for when a CSD is queued to a remote CPU's call_single_queue. This allows finding exactly which CPU queued a given CSD when looking at a csd_function_{entry,exit} event, and also enables us to accurately measure IPI delivery time with e.g. a synthetic event: $ echo 'hist:keys=cpu,csd.hex:ts=common_timestamp.usecs' >\ /sys/kernel/tracing/events/smp/csd_queue_cpu/trigger $ echo 'csd_latency unsigned int dst_cpu; unsigned long csd; u64 time' >\ /sys/kernel/tracing/synthetic_events $ echo \ 'hist:keys=common_cpu,csd.hex:'\ 'time=common_timestamp.usecs-$ts:'\ 'onmatch(smp.csd_queue_cpu).trace(csd_latency,common_cpu,csd,$time)' >\ /sys/kernel/tracing/events/smp/csd_function_entry/trigger $ trace-cmd record -e 'synthetic:csd_latency' hackbench $ trace-cmd report <...>-467 [001] 21.824263: csd_queue_cpu: cpu=0 callsite=try_to_wake_up+0x2ea func=sched_ttwu_pending csd=0xffff8880076148b8 <...>-467 [001] 21.824280: ipi_send_cpu: cpu=0 callsite=try_to_wake_up+0x2ea callback=generic_smp_call_function_single_interrupt+0x0 <...>-489 [000] 21.824299: csd_function_entry: func=sched_ttwu_pending csd=0xffff8880076148b8 <...>-489 [000] 21.824320: csd_latency: dst_cpu=0, csd=18446612682193848504, time=36 Suggested-by: Valentin Schneider <[email protected]> Signed-off-by: Leonardo Bras <[email protected]> Tested-and-reviewed-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16trace,smp: Add tracepoints around remotelly called functionsLeonardo Bras1-6/+19
The recently added ipi_send_{cpu,cpumask} tracepoints allow finding sources of IPIs targeting CPUs running latency-sensitive applications. For NOHZ_FULL CPUs, all IPIs are interference, and those tracepoints are sufficient to find them and work on getting rid of them. In some setups however, not *all* IPIs are to be suppressed, but long-running IPI callbacks can still be problematic. Add a pair of tracepoints to mark the start and end of processing a CSD IPI callback, similar to what exists for softirq, workqueue or timer callbacks. Signed-off-by: Leonardo Bras <[email protected]> Tested-and-reviewed-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16tick/common: Align tick period during sched_timer setupThomas Gleixner2-13/+13
The tick period is aligned very early while the first clock_event_device is registered. At that point the system runs in periodic mode and switches later to one-shot mode if possible. The next wake-up event is programmed based on the aligned value (tick_next_period) but the delta value, that is used to program the clock_event_device, is computed based on ktime_get(). With the subtracted offset, the device fires earlier than the exact time frame. With a large enough offset the system programs the timer for the next wake-up and the remaining time left is too small to make any boot progress. The system hangs. Move the alignment later to the setup of tick_sched timer. At this point the system switches to oneshot mode and a high resolution clocksource is available. At this point it is safe to align tick_next_period because ktime_get() will now return accurate (not jiffies based) time. [bigeasy: Patch description + testing]. Fixes: e9523a0d81899 ("tick/common: Align tick period with the HZ tick.") Reported-by: Mathias Krause <[email protected]> Reported-by: "Bhatnagar, Rishabh" <[email protected]> Suggested-by: Mathias Krause <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Richard W.M. Jones <[email protected]> Tested-by: Mathias Krause <[email protected]> Acked-by: SeongJae Park <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/[email protected] Link: https://lore.kernel.org/[email protected] Link: https://lore.kernel.org/r/[email protected]
2023-06-16bpf: Remove in_atomic() from bpf_link_put().Sebastian Andrzej Siewior1-13/+16
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are invoked within softirq context. By setting rcutree.use_softirq=0 boot option the RCU callbacks will be invoked in a per-CPU kthread with bottom halves disabled which implies a RCU read section. On PREEMPT_RT the context remains fully preemptible. The RCU read section however does not allow schedule() invocation. The latter happens in mutex_lock() performed by bpf_trampoline_unlink_prog() originated from bpf_link_put(). It was pointed out that the bpf_link_put() invocation should not be delayed if originated from close(). It was also pointed out that other invocations from within a syscall should also avoid the workqueue. Everyone else should use workqueue by default to remain safe in the future (while auditing the code, every caller was preemptible except for the RCU case). Let bpf_link_put() use the worker unconditionally. Add bpf_link_put_direct() which will directly free the resources and is used by close() and from within __sys_bpf(). Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-16sched/wait: Fix a kthread_park race with wait_woken()Arve Hjønnevåg2-6/+11
kthread_park and wait_woken have a similar race that kthread_stop and wait_woken used to have before it was fixed in commit cb6538e740d7 ("sched/wait: Fix a kthread race with wait_woken()"). Extend that fix to also cover kthread_park. [jstultz: Made changes suggested by Peter to optimize memory loads] Signed-off-by: Arve Hjønnevåg <[email protected]> Signed-off-by: John Stultz <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16sched/topology: Mark set_sched_topology() __initMiaohe Lin1-1/+1
All callers of set_sched_topology() are within __init section. Mark it __init too. Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16sched/fair: Rename variable cpu_util eff_utilTom Rix1-3/+3
cppcheck reports kernel/sched/fair.c:7436:17: style: Local variable 'cpu_util' shadows outer function [shadowFunction] unsigned long cpu_util; ^ Clean this up by renaming the variable to eff_util Signed-off-by: Tom Rix <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16genirq: Allow fasteoi handler to resend interrupts on concurrent handlingJames Gowans2-1/+17
There is a class of interrupt controllers out there that, once they have signalled a given interrupt number, will still signal incoming instances of the *same* interrupt despite the original interrupt not having been EOIed yet. As long as the new interrupt reaches the *same* CPU, nothing bad happens, as that CPU still has its interrupts globally disabled, and we will only take the new interrupt once the interrupt has been EOIed. However, things become more "interesting" if an affinity change comes in while the interrupt is being handled. More specifically, while the per-irq lock is being dropped. This results in the affinity change taking place immediately. At this point, there is nothing that prevents the interrupt from firing on the new target CPU. We end-up with the interrupt running concurrently on two CPUs, which isn't a good thing. And that's where things become worse: the new CPU notices that the interrupt handling is in progress (irq_may_run() return false), and *drops the interrupt on the floor*. The whole race looks like this: CPU 0 | CPU 1 -----------------------------|----------------------------- interrupt start | handle_fasteoi_irq | set_affinity(CPU 1) handler | ... | interrupt start ... | handle_fasteoi_irq -> early out handle_fasteoi_irq return | interrupt end interrupt end | If the interrupt was an edge, too bad. The interrupt is lost, and the system will eventually die one way or another. Not great. A way to avoid this situation is to detect this problem at the point we handle the interrupt on the new target. Instead of dropping the interrupt, use the resend mechanism to force it to be replayed. Also, in order to limit the impact of this workaround to the pathetic architectures that require it, gate it behind a new irq flag aptly named IRQD_RESEND_WHEN_IN_PROGRESS. Suggested-by: Marc Zyngier <[email protected]> Signed-off-by: James Gowans <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: KarimAllah Raslan <[email protected]> Cc: Yipeng Zou <[email protected]> Cc: Zhang Jianhua <[email protected]> [maz: reworded commit mesage] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-16genirq: Expand doc for PENDING and REPLAY flagsJames Gowans1-2/+5
Adding a bit more info about what the flags are used for may help future code readers. Signed-off-by: James Gowans <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Liao Chang <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-19/+32
Cross-merge networking fixes after downstream PR. Conflicts: include/linux/mlx5/driver.h 617f5db1a626 ("RDMA/mlx5: Fix affinity assignment") dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") https://lore.kernel.org/all/[email protected]/ tools/testing/selftests/net/mptcp/mptcp_join.sh 47867f0a7e83 ("selftests: mptcp: join: skip check if MIB counter not supported") 425ba803124b ("selftests: mptcp: join: support RM_ADDR for used endpoints or not") 45b1a1227a7a ("mptcp: introduces more address related mibs") 0639fa230a21 ("selftests: mptcp: add explicit check for new mibs") https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/ No adjacent changes. Signed-off-by: Jakub Kicinski <[email protected]>
2023-06-14kallsyms: Replace all non-returning strlcpy with strscpyAzeem Shaikh2-3/+3
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-06-14tracing/user_events: Add auto cleanup and future persist flagBeau Belgrave1-13/+126
Currently user events need to be manually deleted via the delete IOCTL call or via the dynamic_events file. Most operators and processes wish to have these events auto cleanup when they are no longer used by anything to prevent them piling without manual maintenance. However, some operators may not want this, such as pre-registering events via the dynamic_events tracefs file. Update user_event_put() to attempt an auto delete of the event if it's the last reference. The auto delete must run in a work queue to ensure proper behavior of class->reg() invocations that don't expect the call to go away from underneath them during the unregister. Add work_struct to user_event struct to ensure we can do this reliably. Add a persist flag, that is not yet exposed, to ensure we can toggle between auto-cleanup and leaving the events existing in the future. When a non-zero flag is seen during register, return -EINVAL to ensure ABI is clear for the user processes while we work out the best approach for persistent events. Link: https://lkml.kernel.org/r/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Track refcount consistently via put/getBeau Belgrave1-28/+41
Various parts of the code today track user_event's refcnt field directly via a refcount_add/dec. This makes it hard to modify the behavior of the last reference decrement in all code paths consistently. For example, in the future we will auto-delete events upon the last reference going away. This last reference could happen in many places, but we want it to be consistently handled. Add user_event_get() and user_event_put() for the add/dec. Update all places where direct refcounts are being used to utilize these new functions. In each location pass if event_mutex is locked or not. This allows us to drop events automatically in future patches clearly. Ensure when caller states the lock is held, it really is (or is not) held. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Store register flags on eventsBeau Belgrave1-6/+10
Currently we don't have any available flags for user processes to use to indicate options for user_events. We will soon have a flag to indicate the event should or should not auto-delete once it's not being used by anyone. Add a reg_flags field to user_events and parameters to existing functions to allow for this in future patches. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Remove user_ns walk for groupsBeau Belgrave1-37/+5
During discussions it was suggested that user_ns is not a good place to try to attach a tracing namespace. The current code has stubs to enable that work that are very likely to change and incur a performance cost. Remove the user_ns walk when creating a group and determining the system name to use, since it's unlikely user_ns will be used in the future. Link: https://lore.kernel.org/all/20230601-urenkel-holzofen-cd9403b9cadd@brauner/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Suggested-by: Christian Brauner <[email protected]> Signed-off-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Fix the incorrect trace record for empty arguments eventssunliming1-2/+2
The user_events support events that has empty arguments. But the trace event is discarded and not really committed when the arguments is empty. Fix this by not attempting to copy in zero-length data. Link: https://lkml.kernel.org/r/[email protected] Acked-by: Beau Belgrave <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing: Modify print_fields() for fields output ordersunliming1-1/+1
Now the print_fields() print trace event fields in reverse order. Modify it to the positive sequence. Example outputs for a user event: test0 u32 count1; u32 count2 Output before: example-2547 [000] ..... 325.666387: test0: count2=0x2 (2) count1=0x1 (1) Output after: example-2742 [002] ..... 429.769370: test0: count1=0x1 (1) count2=0x2 (2) Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Fixes: 80a76994b2d88 ("tracing: Add "fields" option to show raw trace event fields") Signed-off-by: sunliming <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Handle matching arguments that is null from dyn_eventssunliming1-0/+2
When A registering user event from dyn_events has no argments, it will pass the matching check, regardless of whether there is a user event with the same name and arguments. Add the matching check when the arguments of registering user event is null. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Signed-off-by: sunliming <[email protected]> Acked-by: Masami Hiramatsu (Google) <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-14tracing/user_events: Prevent same name but different args eventsunliming1-6/+30
User processes register name_args for events. If the same name but different args event are registered. The trace outputs of second event are printed as the first event. This is incorrect. Return EADDRINUSE back to the user process if the same name but different args event has being registered. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Signed-off-by: sunliming <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Acked-by: Beau Belgrave <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-06-13bpf: Verify scalar ids mapping in regsafe() using check_ids()Eduard Zingerman1-23/+68
Make sure that the following unsafe example is rejected by verifier: 1: r9 = ... some pointer with range X ... 2: r6 = ... unbound scalar ID=a ... 3: r7 = ... unbound scalar ID=b ... 4: if (r6 > r7) goto +1 5: r6 = r7 6: if (r6 > X) goto ... --- checkpoint --- 7: r9 += r7 8: *(u64 *)r9 = Y This example is unsafe because not all execution paths verify r7 range. Because of the jump at (4) the verifier would arrive at (6) in two states: I. r6{.id=b}, r7{.id=b} via path 1-6; II. r6{.id=a}, r7{.id=b} via path 1-4, 6. Currently regsafe() does not call check_ids() for scalar registers, thus from POV of regsafe() states (I) and (II) are identical. If the path 1-6 is taken by verifier first, and checkpoint is created at (6) the path [1-4, 6] would be considered safe. Changes in this commit: - check_ids() is modified to disallow mapping multiple old_id to the same cur_id. - check_scalar_ids() is added, unlike check_ids() it treats ID zero as a unique scalar ID. - check_scalar_ids() needs to generate temporary unique IDs, field 'tmp_id_gen' is added to bpf_verifier_env::idmap_scratch to facilitate this. - regsafe() is updated to: - use check_scalar_ids() for precise scalar registers. - compare scalar registers using memcmp only for explore_alu_limits branch. This simplifies control flow for scalar case, and has no measurable performance impact. - check_alu_op() is updated to avoid generating bpf_reg_state::id for constant scalar values when processing BPF_MOV. ID is needed to propagate range information for identical values, but there is nothing to propagate for constants. Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.") Signed-off-by: Eduard Zingerman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-13bpf: Use scalar ids in mark_chain_precision()Eduard Zingerman1-0/+115
Change mark_chain_precision() to track precision in situations like below: r2 = unknown value ... --- state #0 --- ... r1 = r2 // r1 and r2 now share the same ID ... --- state #1 {r1.id = A, r2.id = A} --- ... if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1 ... --- state #2 {r1.id = A, r2.id = A} --- r3 = r10 r3 += r1 // need to mark both r1 and r2 At the beginning of the processing of each state, ensure that if a register with a scalar ID is marked as precise, all registers sharing this ID are also marked as precise. This property would be used by a follow-up change in regsafe(). Signed-off-by: Eduard Zingerman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-13bpf: ensure main program has an extableKrister Johansen1-2/+5
When subprograms are in use, the main program is not jit'd after the subprograms because jit_subprogs sets a value for prog->bpf_func upon success. Subsequent calls to the JIT are bypassed when this value is non-NULL. This leads to a situation where the main program and its func[0] counterpart are both in the bpf kallsyms tree, but only func[0] has an extable. Extables are only created during JIT. Now there are two nearly identical program ksym entries in the tree, but only one has an extable. Depending upon how the entries are placed, there's a chance that a fault will call search_extable on the aux with the NULL entry. Since jit_subprogs already copies state from func[0] to the main program, include the extable pointer in this state duplication. Additionally, ensure that the copy of the main program in func[0] is not added to the bpf_prog_kallsyms table. Instead, let the main program get added later in bpf_prog_load(). This ensures there is only a single copy of the main program in the kallsyms table, and that its tag matches the tag observed by tooling like bpftool. Cc: [email protected] Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Krister Johansen <[email protected]> Acked-by: Yonghong Song <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Tested-by: Ilya Leoshkevich <[email protected]> Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com Signed-off-by: Alexei Starovoitov <[email protected]>
2023-06-12Merge tag 'mm-hotfixes-stable-2023-06-12-12-22' of ↵Linus Torvalds1-1/+13
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. 14 are cc:stable and the remainder address issues which were introduced during this development cycle or which were considered inappropriate for a backport" * tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: zswap: do not shrink if cgroup may not zswap page cache: fix page_cache_next/prev_miss off by one ocfs2: check new file size on fallocate call mailmap: add entry for John Keeping mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp() epoll: ep_autoremove_wake_function should use list_del_init_careful mm/gup_test: fix ioctl fail for compat task nilfs2: reject devices with insufficient block count ocfs2: fix use-after-free when unmounting read-only filesystem lib/test_vmalloc.c: avoid garbage in page array nilfs2: fix possible out-of-bounds segment allocation in resize ioctl riscv/purgatory: remove PGO flags powerpc/purgatory: remove PGO flags x86/purgatory: remove PGO flags kexec: support purgatories with .text.hot sections mm/uffd: allow vma to merge as much as possible mm/uffd: fix vma operation where start addr cuts part of vma radix-tree: move declarations to header nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
2023-06-12bpf: Replace bpf_cpumask_any* with bpf_cpumask_any_distribute*David Vernet1-10/+12
We currently export the bpf_cpumask_any() and bpf_cpumask_any_and() kfuncs. Intuitively, one would expect these to choose any CPU in the cpumask, but what they actually do is alias to cpumask_first() and cpmkas_first_and(). This is useless given that we already export bpf_cpumask_first() and bpf_cpumask_first_and(), so this patch replaces them with kfuncs that call cpumask_any_distribute() and cpumask_any_and_distribute(), which actually choose any CPU from the cpumask (or the AND of two cpumasks for the latter). Signed-off-by: David Vernet <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-06-12bpf: Add bpf_cpumask_first_and() kfuncDavid Vernet1-0/+16
We currently provide bpf_cpumask_first(), bpf_cpumask_any(), and bpf_cpumask_any_and() kfuncs. bpf_cpumask_any() and bpf_cpumask_any_and() are confusing misnomers in that they actually just call cpumask_first() and cpumask_first_and() respectively. We'll replace them with bpf_cpumask_any_distribute() and bpf_cpumask_any_distribute_and() kfuncs in a subsequent patch, so let's ensure feature parity by adding a bpf_cpumask_first_and() kfunc to account for bpf_cpumask_any_and() being removed. Signed-off-by: David Vernet <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-06-12kexec: support purgatories with .text.hot sectionsRicardo Ribalda1-1/+13
Patch series "kexec: Fix kexec_file_load for llvm16 with PGO", v7. When upreving llvm I realised that kexec stopped working on my test platform. The reason seems to be that due to PGO there are multiple .text sections on the purgatory, and kexec does not supports that. This patch (of 4): Clang16 links the purgatory text in two sections when PGO is in use: [ 1] .text PROGBITS 0000000000000000 00000040 00000000000011a1 0000000000000000 AX 0 0 16 [ 2] .rela.text RELA 0000000000000000 00003498 0000000000000648 0000000000000018 I 24 1 8 ... [17] .text.hot. PROGBITS 0000000000000000 00003220 000000000000020b 0000000000000000 AX 0 0 1 [18] .rela.text.hot. RELA 0000000000000000 00004428 0000000000000078 0000000000000018 I 24 17 8 And both of them have their range [sh_addr ... sh_addr+sh_size] on the area pointed by `e_entry`. This causes that image->start is calculated twice, once for .text and another time for .text.hot. The second calculation leaves image->start in a random location. Because of this, the system crashes immediately after: kexec_core: Starting new kernel Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory") Signed-off-by: Ricardo Ribalda <[email protected]> Reviewed-by: Ross Zwisler <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Reviewed-by: Philipp Rudo <[email protected]> Cc: Albert Ou <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Palmer Dabbelt <[email protected]> Cc: Paul Walmsley <[email protected]> Cc: Simon Horman <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tom Rix <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-12include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resumeMario Limonciello1-0/+6
All uses in the kernel are currently already oriented around suspend/resume. As some other parts of the kernel may also use these messages in functions that could also be used outside of suspend/resume, only enable in suspend/resume path. Signed-off-by: Mario Limonciello <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-06-12cgroup: Do not corrupt task iteration when rebinding subsystemXiu Jianfeng1-3/+17
We found a refcount UAF bug as follows: refcount_t: addition on 0; use-after-free. WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148 Workqueue: events cpuset_hotplug_workfn Call trace: refcount_warn_saturate+0xa0/0x148 __refcount_add.constprop.0+0x5c/0x80 css_task_iter_advance_css_set+0xd8/0x210 css_task_iter_advance+0xa8/0x120 css_task_iter_next+0x94/0x158 update_tasks_root_domain+0x58/0x98 rebuild_root_domains+0xa0/0x1b0 rebuild_sched_domains_locked+0x144/0x188 cpuset_hotplug_workfn+0x138/0x5a0 process_one_work+0x1e8/0x448 worker_thread+0x228/0x3e0 kthread+0xe0/0xf0 ret_from_fork+0x10/0x20 then a kernel panic will be triggered as below: Unable to handle kernel paging request at virtual address 00000000c0000010 Call trace: cgroup_apply_control_disable+0xa4/0x16c rebind_subsystems+0x224/0x590 cgroup_destroy_root+0x64/0x2e0 css_free_rwork_fn+0x198/0x2a0 process_one_work+0x1d4/0x4bc worker_thread+0x158/0x410 kthread+0x108/0x13c ret_from_fork+0x10/0x18 The race that cause this bug can be shown as below: (hotplug cpu) | (umount cpuset) mutex_lock(&cpuset_mutex) | mutex_lock(&cgroup_mutex) cpuset_hotplug_workfn | rebuild_root_domains | rebind_subsystems update_tasks_root_domain | spin_lock_irq(&css_set_lock) css_task_iter_start | list_move_tail(&cset->e_cset_node[ss->id] while(css_task_iter_next) | &dcgrp->e_csets[ss->id]); css_task_iter_end | spin_unlock_irq(&css_set_lock) mutex_unlock(&cpuset_mutex) | mutex_unlock(&cgroup_mutex) Inside css_task_iter_start/next/end, css_set_lock is hold and then released, so when iterating task(left side), the css_set may be moved to another list(right side), then it->cset_head points to the old list head and it->cset_pos->next points to the head node of new list, which can't be used as struct css_set. To fix this issue, switch from all css_sets to only scgrp's css_sets to patch in-flight iterators to preserve correct iteration, and then update it->cset_head as well. Reported-by: Gaosheng Cui <[email protected]> Link: https://www.spinics.net/lists/cgroups/msg37935.html Suggested-by: Michal Koutný <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Xiu Jianfeng <[email protected]> Fixes: 2d8f243a5e6e ("cgroup: implement cgroup->e_csets[]") Cc: [email protected] # v3.16+ Signed-off-by: Tejun Heo <[email protected]>
2023-06-12cgroup: remove unused task_cgroup_path()Miaohe Lin1-39/+0
task_cgroup_path() is not used anymore. So remove it. Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-06-12bpf: Hide unused bpf_patch_call_argsArnd Bergmann1-3/+5
This function is only used when CONFIG_BPF_JIT_ALWAYS_ON is disabled, but CONFIG_BPF_SYSCALL is enabled. When both are turned off, the prototype is missing but the unused function is still compiled, as seen from this W=1 warning: [...] kernel/bpf/core.c:2075:6: error: no previous prototype for 'bpf_patch_call_args' [-Werror=missing-prototypes] [...] Add a matching #ifdef for the definition to leave it out. Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-06-12cgroup/cpuset: remove unneeded header filesMiaohe Lin1-2/+0
Remove some unnecessary header files. No functional change intended. Signed-off-by: Miaohe Lin <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-06-12cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in ↵Tetsuo Handa1-2/+6
freezer_css_{online,offline}() syzbot is again reporting circular locking dependency between cpu_hotplug_lock and freezer_mutex. Do like what we did with commit 57dcd64c7e036299 ("cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex"). Reported-by: syzbot <[email protected]> Closes: https://syzkaller.appspot.com/bug?extid=2ab700fe1829880a2ec6 Signed-off-by: Tetsuo Handa <[email protected]> Tested-by: syzbot <[email protected]> Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: [email protected] # v6.1+ Signed-off-by: Tejun Heo <[email protected]>
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig1-3/+3
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Jack Wang <[email protected]> [rnbd] Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig3-21/+14
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Christian Brauner <[email protected]> Acked-by: David Sterba <[email protected]> [btrfs] Acked-by: Jack Wang <[email protected]> [rnbd] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-06-12swsusp: don't pass a stack address to blkdev_get_by_pathChristoph Hellwig1-2/+3
holder is just an on-stack pointer that can easily be reused by other calls, replace it with a static variable that doesn't change. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
2023-06-09syscalls: add sys_ni_posix_timers prototypeArnd Bergmann1-0/+1
The sys_ni_posix_timers() definition causes a warning when the declaration is missing, so this needs to be added along with the normal syscalls, outside of the #ifdef. kernel/time/posix-stubs.c:26:17: error: no previous prototype for 'sys_ni_posix_timers' [-Werror=missing-prototypes] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: enable kexec_crash_size to support two crash kernel regionsZhen Lei1-5/+38
The crashk_low_res should be considered by /sys/kernel/kexec_crash_size to support two crash kernel regions shrinking if existing. While doing it, crashk_low_res will only be shrunk when the entire crashk_res is empty; and if the crashk_res is empty and crahk_low_res is not, change crashk_low_res to be crashk_res. [[email protected]: redo changelog] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: add helper __crash_shrink_memory()Zhen Lei1-22/+29
No functional change, in preparation for the next patch so that it is easier to review. [[email protected]: make __crash_shrink_memory() static] Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: improve the readability of crash_shrink_memory()Zhen Lei1-10/+5
The major adjustments are: 1. end = start + new_size. The 'end' here is not an accurate representation, because it is not the new end of crashk_res, but the start of ram_res, difference 1. So eliminate it and replace it with ram_res->start. 2. Use 'ram_res->start' and 'ram_res->end' as arguments to crash_free_reserved_phys_range() to indicate that the memory covered by 'ram_res' is released from the crashk. And keep it close to insert_resource(). 3. Replace 'if (start == end)' with 'if (!new_size)', clear indication that all crashk memory will be shrunken. No functional change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: clear crashk_res if all its memory has been releasedZhen Lei1-4/+7
If the resource of crashk_res has been released, it is better to clear crashk_res.start and crashk_res.end. Because 'end = start - 1' is not reasonable, and in some places the test is based on crashk_res.end, not resource_size(&crashk_res). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: delete a useless check in crash_shrink_memory()Zhen Lei1-1/+1
The check '(crashk_res.parent != NULL)' is added by commit e05bd3367bd3 ("kexec: fix Oops in crash_shrink_memory()"), but it's stale now. Because if 'crashk_res' is not reserved, it will be zero in size and will be intercepted by the above 'if (new_size >= old_size)'. Ago: if (new_size >= end - start + 1) Now: old_size = (end == 0) ? 0 : end - start + 1; if (new_size >= old_size) Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Zhen Lei <[email protected]> Cc: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: fix a memory leak in crash_shrink_memory()Zhen Lei1-3/+2
Patch series "kexec: enable kexec_crash_size to support two crash kernel regions". When crashkernel=X fails to reserve region under 4G, it will fall back to reserve region above 4G and a region of the default size will also be reserved under 4G. Unfortunately, /sys/kernel/kexec_crash_size only supports one crash kernel region now, the user cannot sense the low memory reserved by reading /sys/kernel/kexec_crash_size. Also, low memory cannot be freed by writing this file. For example: resource_size(crashk_res) = 512M resource_size(crashk_low_res) = 256M The result of 'cat /sys/kernel/kexec_crash_size' is 512M, but it should be 768M. When we execute 'echo 0 > /sys/kernel/kexec_crash_size', the size of crashk_res becomes 0 and resource_size(crashk_low_res) is still 256 MB, which is incorrect. Since crashk_res manages the memory with high address and crashk_low_res manages the memory with low address, crashk_low_res is shrunken only when all crashk_res is shrunken. And because when there is only one crash kernel region, crashk_res is always used. Therefore, if all crashk_res is shrunken and crashk_low_res still exists, swap them. This patch (of 6): If the value of parameter 'new_size' is in the semi-open and semi-closed interval (crashk_res.end - KEXEC_CRASH_MEM_ALIGN + 1, crashk_res.end], the calculation result of ram_res is: ram_res->start = crashk_res.end + 1 ram_res->end = crashk_res.end The operation of insert_resource() fails, and ram_res is not added to iomem_resource. As a result, the memory of the control block ram_res is leaked. In fact, on all architectures, the start address and size of crashk_res are already aligned by KEXEC_CRASH_MEM_ALIGN. Therefore, we do not need to round up crashk_res.start again. Instead, we should round up 'new_size' in advance. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 6480e5a09237 ("kdump: add missing RAM resource in crash_shrink_memory()") Fixes: 06a7f711246b ("kexec: premit reduction of the reserved memory size") Signed-off-by: Zhen Lei <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Cong Wang <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Michael Holzheu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09kexec: avoid calculating array size twiceSimon Horman1-3/+4
Avoid calculating array size twice in kexec_purgatory_setup_sechdrs(). Once using array_size(), and once open-coded. Flagged by Coccinelle: .../kexec_file.c:881:8-25: WARNING: array_size is already used (line 877) to compute the same size No functional change intended. Compile tested only. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Simon Horman <[email protected]> Acked-by: Baoquan He <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Zhen Lei <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/perf: adapt the watchdog_perf interface for async modelLecopzer Chen1-1/+66
When lockup_detector_init()->watchdog_hardlockup_probe(), PMU may be not ready yet. E.g. on arm64, PMU is not ready until device_initcall(armv8_pmu_driver_init). And it is deeply integrated with the driver model and cpuhp. Hence it is hard to push this initialization before smp_init(). But it is easy to take an opposite approach and try to initialize the watchdog once again later. The delayed probe is called using workqueues. It need to allocate memory and must be proceed in a normal context. The delayed probe is able to use if watchdog_hardlockup_probe() returns non-zero which means the return code returned when PMU is not ready yet. Provide an API - lockup_detector_retry_init() for anyone who needs to delayed init lockup detector if they had ever failed at lockup_detector_init(). The original assumption is: nobody should use delayed probe after lockup_detector_check() which has __init attribute. That is, anyone uses this API must call between lockup_detector_init() and lockup_detector_check(), and the caller must have __init attribute Link: https://lkml.kernel.org/r/20230519101840.v5.16.If4ad5dd5d09fb1309cebf8bcead4b6a5a7758ca7@changeid Reviewed-by: Petr Mladek <[email protected]> Co-developed-by: Pingfan Liu <[email protected]> Signed-off-by: Pingfan Liu <[email protected]> Signed-off-by: Lecopzer Chen <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Suggested-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/perf: add a weak function for an arch to detect if perf can use NMIsDouglas Anderson1-1/+11
On arm64, NMI support needs to be detected at runtime. Add a weak function to the perf hardlockup detector so that an architecture can implement it to detect whether NMIs are available. Link: https://lkml.kernel.org/r/20230519101840.v5.15.Ic55cb6f90ef5967d8aaa2b503a4e67c753f64d3a@changeid Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUsDouglas Anderson3-7/+116
Implement a hardlockup detector that doesn't doesn't need any extra arch-specific support code to detect lockups. Instead of using something arch-specific we will use the buddy system, where each CPU watches out for another one. Specifically, each CPU will use its softlockup hrtimer to check that the next CPU is processing hrtimer interrupts by verifying that a counter is increasing. NOTE: unlike the other hard lockup detectors, the buddy one can't easily show what's happening on the CPU that locked up just by doing a simple backtrace. It relies on some other mechanism in the system to get information about the locked up CPUs. This could be support for NMI backtraces like [1], it could be a mechanism for printing the PC of locked CPUs at panic time like [2] / [3], or it could be something else. Even though that means we still rely on arch-specific code, this arch-specific code seems to often be implemented even on architectures that don't have a hardlockup detector. This style of hardlockup detector originated in some downstream Android trees and has been rebased on / carried in ChromeOS trees for quite a long time for use on arm and arm64 boards. Historically on these boards we've leveraged mechanism [2] / [3] to get information about hung CPUs, but we could move to [1]. Although the original motivation for the buddy system was for use on systems without an arch-specific hardlockup detector, it can still be useful to use even on systems that _do_ have an arch-specific hardlockup detector. On x86, for instance, there is a 24-part patch series [4] in progress switching the arch-specific hard lockup detector from a scarce perf counter to a less-scarce hardware resource. Potentially the buddy system could be a simpler alternative to free up the perf counter but still get hard lockup detection. Overall, pros (+) and cons (-) of the buddy system compared to an arch-specific hardlockup detector (which might be implemented using perf): + The buddy system is usable on systems that don't have an arch-specific hardlockup detector, like arm32 and arm64 (though it's being worked on for arm64 [5]). + The buddy system may free up scarce hardware resources. + If a CPU totally goes out to lunch (can't process NMIs) the buddy system could still detect the problem (though it would be unlikely to be able to get a stack trace). + The buddy system uses the same timer function to pet the hardlockup detector on the running CPU as it uses to detect hardlockups on other CPUs. Compared to other hardlockup detectors, this means it generates fewer interrupts and thus is likely better able to let CPUs stay idle longer. - If all CPUs are hard locked up at the same time the buddy system can't detect it. - If we don't have SMP we can't use the buddy system. - The buddy system needs an arch-specific mechanism (possibly NMI backtrace) to get info about the locked up CPU. [1] https://lore.kernel.org/r/[email protected] [2] https://issuetracker.google.com/172213129 [3] https://docs.kernel.org/trace/coresight/coresight-cpu-debug.html [4] https://lore.kernel.org/lkml/[email protected]/ [5] https://lore.kernel.org/linux-arm-kernel/[email protected]/ Link: https://lkml.kernel.org/r/20230519101840.v5.14.I6bf789d21d0c3d75d382e7e51a804a7a51315f2c@changeid Signed-off-by: Colin Cross <[email protected]> Signed-off-by: Matthias Kaehlcke <[email protected]> Signed-off-by: Guenter Roeck <[email protected]> Signed-off-by: Tzung-Bi Shih <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: have the perf hardlockup use __weak functions more cleanlyDouglas Anderson2-18/+32
The fact that there watchdog_hardlockup_enable(), watchdog_hardlockup_disable(), and watchdog_hardlockup_probe() are declared __weak means that the configured hardlockup detector can define non-weak versions of those functions if it needs to. Instead of doing this, the perf hardlockup detector hooked itself into the default __weak implementation, which was a bit awkward. Clean this up. From comments, it looks as if the original design was done because the __weak function were expected to implemented by the architecture and not by the configured hardlockup detector. This got awkward when we tried to add the buddy lockup detector which was not arch-specific but wanted to hook into those same functions. This is not expected to have any functional impact. Link: https://lkml.kernel.org/r/20230519101840.v5.13.I847d9ec852449350997ba00401d2462a9cb4302b@changeid Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: rename some "NMI watchdog" constants/functionDouglas Anderson2-57/+58
Do a search and replace of: - NMI_WATCHDOG_ENABLED => WATCHDOG_HARDLOCKUP_ENABLED - SOFT_WATCHDOG_ENABLED => WATCHDOG_SOFTOCKUP_ENABLED - watchdog_nmi_ => watchdog_hardlockup_ - nmi_watchdog_available => watchdog_hardlockup_available - nmi_watchdog_user_enabled => watchdog_hardlockup_user_enabled - soft_watchdog_user_enabled => watchdog_softlockup_user_enabled - NMI_WATCHDOG_DEFAULT => WATCHDOG_HARDLOCKUP_DEFAULT Then update a few comments near where names were changed. This is specifically to make it less confusing when we want to introduce the buddy hardlockup detector, which isn't using NMIs. As part of this, we sanitized a few names for consistency. [[email protected]: make variables static] Link: https://lkml.kernel.org/r/20230525162822.1.I0fb41d138d158c9230573eaa37dc56afa2fb14ee@changeid Link: https://lkml.kernel.org/r/20230519101840.v5.12.I91f7277bab4bf8c0cb238732ed92e7ce7bbd71a6@changeid Signed-off-by: Douglas Anderson <[email protected]> Signed-off-by: Tom Rix <[email protected]> Reviewed-by: Tom Rix <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: move perf hardlockup watchdog petting to watchdog.cDouglas Anderson2-19/+19
In preparation for the buddy hardlockup detector, which wants the same petting logic as the current perf hardlockup detector, move the code to watchdog.c. While doing this, rename the global variable to match others nearby. As part of this change we have to change the code to account for the fact that the CPU we're running on might be different than the one we're checking. Currently the code in watchdog.c is guarded by CONFIG_HARDLOCKUP_DETECTOR_PERF, which makes this change seem silly. However, a future patch will change this. Link: https://lkml.kernel.org/r/20230519101840.v5.11.I00dfd6386ee00da25bf26d140559a41339b53e57@changeid Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-06-09watchdog/hardlockup: add a "cpu" param to watchdog_hardlockup_check()Douglas Anderson2-21/+33
In preparation for the buddy hardlockup detector where the CPU checking for lockup might not be the currently running CPU, add a "cpu" parameter to watchdog_hardlockup_check(). As part of this change, make hrtimer_interrupts an atomic_t since now the CPU incrementing the value and the CPU reading the value might be different. Technially this could also be done with just READ_ONCE and WRITE_ONCE, but atomic_t feels a little cleaner in this case. While hrtimer_interrupts is made atomic_t, we change hrtimer_interrupts_saved from "unsigned long" to "int". The "int" is needed to match the data type backing atomic_t for hrtimer_interrupts. Even if this changes us from 64-bits to 32-bits (which I don't think is true for most compilers), it doesn't really matter. All we ever do is increment it every few seconds and compare it to an old value so 32-bits is fine (even 16-bits would be). The "signed" vs "unsigned" also doesn't matter for simple equality comparisons. hrtimer_interrupts_saved is _not_ switched to atomic_t nor even accessed with READ_ONCE / WRITE_ONCE. The hrtimer_interrupts_saved is always consistently accessed with the same CPU. NOTE: with the upcoming "buddy" detector there is one special case. When a CPU goes offline/online then we can change which CPU is the one to consistently access a given instance of hrtimer_interrupts_saved. We still can't end up with a partially updated hrtimer_interrupts_saved, however, because we end up petting all affected CPUs to make sure the new and old CPU can't end up somehow read/write hrtimer_interrupts_saved at the same time. Link: https://lkml.kernel.org/r/20230519101840.v5.10.I3a7d4dd8c23ac30ee0b607d77feb6646b64825c0@changeid Signed-off-by: Douglas Anderson <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen-Yu Tsai <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Colin Cross <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Guenter Roeck <[email protected]> Cc: Ian Rogers <[email protected]> Cc: Lecopzer Chen <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Masayoshi Mizuma <[email protected]> Cc: Matthias Kaehlcke <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Pingfan Liu <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: "Ravi V. Shankar" <[email protected]> Cc: Ricardo Neri <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Sumit Garg <[email protected]> Cc: Tzung-Bi Shih <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Andrew Morton <[email protected]>