aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2024-09-05tracing/timerlat: Only clear timer if a kthread existsSteven Rostedt1-6/+13
The timerlat tracer can use user space threads to check for osnoise and timer latency. If the program using this is killed via a SIGTERM, the threads are shutdown one at a time and another tracing instance can start up resetting the threads before they are fully closed. That causes the hrtimer assigned to the kthread to be shutdown and freed twice when the dying thread finally closes the file descriptors, causing a use-after-free bug. Only cancel the hrtimer if the associated thread is still around. Also add the interface_lock around the resetting of the tlat_var->kthread. Note, this is just a quick fix that can be backported to stable. A real fix is to have a better synchronization between the shutdown of old threads and the starting of new ones. Link: https://lore.kernel.org/all/[email protected]/ Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Reported-by: Tomas Glozar <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-09-05tracing/osnoise: Use a cpumask to know what threads are kthreadsSteven Rostedt1-3/+15
The start_kthread() and stop_thread() code was not always called with the interface_lock held. This means that the kthread variable could be unexpectedly changed causing the kthread_stop() to be called on it when it should not have been, leading to: while true; do rtla timerlat top -u -q & PID=$!; sleep 5; kill -INT $PID; sleep 0.001; kill -TERM $PID; wait $PID; done Causing the following OOPS: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] CPU: 5 UID: 0 PID: 885 Comm: timerlatu/5 Not tainted 6.11.0-rc4-test-00002-gbc754cc76d1b-dirty #125 a533010b71dab205ad2f507188ce8c82203b0254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:hrtimer_active+0x58/0x300 Code: 48 c1 ee 03 41 54 48 01 d1 48 01 d6 55 53 48 83 ec 20 80 39 00 0f 85 30 02 00 00 49 8b 6f 30 4c 8d 75 10 4c 89 f0 48 c1 e8 03 <0f> b6 3c 10 4c 89 f0 83 e0 07 83 c0 03 40 38 f8 7c 09 40 84 ff 0f RSP: 0018:ffff88811d97f940 EFLAGS: 00010202 RAX: 0000000000000002 RBX: ffff88823c6b5b28 RCX: ffffed10478d6b6b RDX: dffffc0000000000 RSI: ffffed10478d6b6c RDI: ffff88823c6b5b28 RBP: 0000000000000000 R08: ffff88823c6b5b58 R09: ffff88823c6b5b60 R10: ffff88811d97f957 R11: 0000000000000010 R12: 00000000000a801d R13: ffff88810d8b35d8 R14: 0000000000000010 R15: ffff88823c6b5b28 FS: 0000000000000000(0000) GS:ffff88823c680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561858ad7258 CR3: 000000007729e001 CR4: 0000000000170ef0 Call Trace: <TASK> ? die_addr+0x40/0xa0 ? exc_general_protection+0x154/0x230 ? asm_exc_general_protection+0x26/0x30 ? hrtimer_active+0x58/0x300 ? __pfx_mutex_lock+0x10/0x10 ? __pfx_locks_remove_file+0x10/0x10 hrtimer_cancel+0x15/0x40 timerlat_fd_release+0x8e/0x1f0 ? security_file_release+0x43/0x80 __fput+0x372/0xb10 task_work_run+0x11e/0x1f0 ? _raw_spin_lock+0x85/0xe0 ? __pfx_task_work_run+0x10/0x10 ? poison_slab_object+0x109/0x170 ? do_exit+0x7a0/0x24b0 do_exit+0x7bd/0x24b0 ? __pfx_migrate_enable+0x10/0x10 ? __pfx_do_exit+0x10/0x10 ? __pfx_read_tsc+0x10/0x10 ? ktime_get+0x64/0x140 ? _raw_spin_lock_irq+0x86/0xe0 do_group_exit+0xb0/0x220 get_signal+0x17ba/0x1b50 ? vfs_read+0x179/0xa40 ? timerlat_fd_read+0x30b/0x9d0 ? __pfx_get_signal+0x10/0x10 ? __pfx_timerlat_fd_read+0x10/0x10 arch_do_signal_or_restart+0x8c/0x570 ? __pfx_arch_do_signal_or_restart+0x10/0x10 ? vfs_read+0x179/0xa40 ? ksys_read+0xfe/0x1d0 ? __pfx_ksys_read+0x10/0x10 syscall_exit_to_user_mode+0xbc/0x130 do_syscall_64+0x74/0x110 ? __pfx___rseq_handle_notify_resume+0x10/0x10 ? __pfx_ksys_read+0x10/0x10 ? fpregs_restore_userregs+0xdb/0x1e0 ? fpregs_restore_userregs+0xdb/0x1e0 ? syscall_exit_to_user_mode+0x116/0x130 ? do_syscall_64+0x74/0x110 ? do_syscall_64+0x74/0x110 ? do_syscall_64+0x74/0x110 entry_SYSCALL_64_after_hwframe+0x71/0x79 RIP: 0033:0x7ff0070eca9c Code: Unable to access opcode bytes at 0x7ff0070eca72. RSP: 002b:00007ff006dff8c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007ff0070eca9c RDX: 0000000000000400 RSI: 00007ff006dff9a0 RDI: 0000000000000003 RBP: 00007ff006dffde0 R08: 0000000000000000 R09: 00007ff000000ba0 R10: 00007ff007004b08 R11: 0000000000000246 R12: 0000000000000003 R13: 00007ff006dff9a0 R14: 0000000000000007 R15: 0000000000000008 </TASK> Modules linked in: snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hwdep snd_hda_core ---[ end trace 0000000000000000 ]--- This is because it would mistakenly call kthread_stop() on a user space thread making it "exit" before it actually exits. Since kthreads are created based on global behavior, use a cpumask to know when kthreads are running and that they need to be shutdown before proceeding to do new work. Link: https://lore.kernel.org/all/[email protected]/ This was debugged by using the persistent ring buffer: Link: https://lore.kernel.org/all/[email protected]/ Note, locking was originally used to fix this, but that proved to cause too many deadlocks to work around: https://lore.kernel.org/linux-trace-kernel/[email protected]/ Cc: [email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: "Luis Claudio R. Goncalves" <[email protected]> Link: https://lore.kernel.org/[email protected] Fixes: e88ed227f639e ("tracing/timerlat: Add user-space interface") Reported-by: Tomas Glozar <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-09-05uprobes: perform lockless SRCU-protected uprobes_tree lookupAndrii Nakryiko1-6/+24
Another big bottleneck to scalablity is uprobe_treelock that's taken in a very hot path in handle_swbp(). Now that uprobes are SRCU-protected, take advantage of that and make uprobes_tree RB-tree look up lockless. To make RB-tree RCU-protected lockless lookup correct, we need to take into account that such RB-tree lookup can return false negatives if there are parallel RB-tree modifications (rotations) going on. We use seqcount lock to detect whether RB-tree changed, and if we find nothing while RB-tree got modified inbetween, we just retry. If uprobe was found, then it's guaranteed to be a correct lookup. With all the lock-avoiding changes done, we get a pretty decent improvement in performance and scalability of uprobes with number of CPUs, even though we are still nowhere near linear scalability. This is due to SRCU not really scaling very well with number of CPUs on a particular hardware that was used for testing (80-core Intel Xeon Gold 6138 CPU @ 2.00GHz), but also due to the remaning mmap_lock, which is currently taken to resolve interrupt address to inode+offset and then uprobe instance. And, of course, uretprobes still need similar RCU to avoid refcount in the hot path, which will be addressed in the follow up patches. Nevertheless, the improvement is good. We used BPF selftest-based uprobe-nop and uretprobe-nop benchmarks to get the below numbers, varying number of CPUs on which uprobes and uretprobes are triggered. BASELINE ======== uprobe-nop ( 1 cpus): 3.032 ± 0.023M/s ( 3.032M/s/cpu) uprobe-nop ( 2 cpus): 3.452 ± 0.005M/s ( 1.726M/s/cpu) uprobe-nop ( 4 cpus): 3.663 ± 0.005M/s ( 0.916M/s/cpu) uprobe-nop ( 8 cpus): 3.718 ± 0.038M/s ( 0.465M/s/cpu) uprobe-nop (16 cpus): 3.344 ± 0.008M/s ( 0.209M/s/cpu) uprobe-nop (32 cpus): 2.288 ± 0.021M/s ( 0.071M/s/cpu) uprobe-nop (64 cpus): 3.205 ± 0.004M/s ( 0.050M/s/cpu) uretprobe-nop ( 1 cpus): 1.979 ± 0.005M/s ( 1.979M/s/cpu) uretprobe-nop ( 2 cpus): 2.361 ± 0.005M/s ( 1.180M/s/cpu) uretprobe-nop ( 4 cpus): 2.309 ± 0.002M/s ( 0.577M/s/cpu) uretprobe-nop ( 8 cpus): 2.253 ± 0.001M/s ( 0.282M/s/cpu) uretprobe-nop (16 cpus): 2.007 ± 0.000M/s ( 0.125M/s/cpu) uretprobe-nop (32 cpus): 1.624 ± 0.003M/s ( 0.051M/s/cpu) uretprobe-nop (64 cpus): 2.149 ± 0.001M/s ( 0.034M/s/cpu) SRCU CHANGES ============ uprobe-nop ( 1 cpus): 3.276 ± 0.005M/s ( 3.276M/s/cpu) uprobe-nop ( 2 cpus): 4.125 ± 0.002M/s ( 2.063M/s/cpu) uprobe-nop ( 4 cpus): 7.713 ± 0.002M/s ( 1.928M/s/cpu) uprobe-nop ( 8 cpus): 8.097 ± 0.006M/s ( 1.012M/s/cpu) uprobe-nop (16 cpus): 6.501 ± 0.056M/s ( 0.406M/s/cpu) uprobe-nop (32 cpus): 4.398 ± 0.084M/s ( 0.137M/s/cpu) uprobe-nop (64 cpus): 6.452 ± 0.000M/s ( 0.101M/s/cpu) uretprobe-nop ( 1 cpus): 2.055 ± 0.001M/s ( 2.055M/s/cpu) uretprobe-nop ( 2 cpus): 2.677 ± 0.000M/s ( 1.339M/s/cpu) uretprobe-nop ( 4 cpus): 4.561 ± 0.003M/s ( 1.140M/s/cpu) uretprobe-nop ( 8 cpus): 5.291 ± 0.002M/s ( 0.661M/s/cpu) uretprobe-nop (16 cpus): 5.065 ± 0.019M/s ( 0.317M/s/cpu) uretprobe-nop (32 cpus): 3.622 ± 0.003M/s ( 0.113M/s/cpu) uretprobe-nop (64 cpus): 3.723 ± 0.002M/s ( 0.058M/s/cpu) Peak througput increased from 3.7 mln/s (uprobe triggerings) up to about 8 mln/s. For uretprobes it's a bit more modest with bump from 2.4 mln/s to 5mln/s. Suggested-by: "Peter Zijlstra (Intel)" <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05perf/uprobe: split uprobe_unregister()Peter Zijlstra3-8/+24
With uprobe_unregister() having grown a synchronize_srcu(), it becomes fairly slow to call. Esp. since both users of this API call it in a loop. Peel off the sync_srcu() and do it once, after the loop. We also need to add uprobe_unregister_sync() into uprobe_register()'s error handling path, as we need to be careful about returning to the caller before we have a guarantee that partially attached consumer won't be called anymore. This is an unlikely slow path and this should be totally fine to be slow in the case of a failed attach. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: "Peter Zijlstra (Intel)" <[email protected]> Co-developed-by: Andrii Nakryiko <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05uprobes: travers uprobe's consumer list locklessly under SRCU protectionAndrii Nakryiko1-43/+61
uprobe->register_rwsem is one of a few big bottlenecks to scalability of uprobes, so we need to get rid of it to improve uprobe performance and multi-CPU scalability. First, we turn uprobe's consumer list to a typical doubly-linked list and utilize existing RCU-aware helpers for traversing such lists, as well as adding and removing elements from it. For entry uprobes we already have SRCU protection active since before uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe won't go away from under us, but we add SRCU protection around consumer list traversal. Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple, we remember whether any removal was requested during handler calls, but then we double-check the decision under a proper register_rwsem using consumers' filter callbacks. Handler removal is very rare, so this extra lock won't hurt performance, overall, but we also avoid the need for any extra protection (e.g., seqcount locks). Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacksAndrii Nakryiko3-19/+11
It serves no purpose beyond adding unnecessray argument passed to the filter callback. Just get rid of it, no one is actually using it. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05uprobes: protected uprobe lifetime with SRCUAndrii Nakryiko1-40/+54
To avoid unnecessarily taking a (brief) refcount on uprobe during breakpoint handling in handle_swbp for entry uprobes, make find_uprobe() not take refcount, but protect the lifetime of a uprobe instance with RCU. This improves scalability, as refcount gets quite expensive due to cache line bouncing between multiple CPUs. Specifically, we utilize our own uprobe-specific SRCU instance for this RCU protection. put_uprobe() will delay actual kfree() using call_srcu(). For now, uretprobe and single-stepping handling will still acquire refcount as necessary. We'll address these issues in follow up patches by making them use SRCU with timeout. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05uprobes: revamp uprobe refcounting and lifetime managementAndrii Nakryiko1-78/+101
Revamp how struct uprobe is refcounted, and thus how its lifetime is managed. Right now, there are a few possible "owners" of uprobe refcount: - uprobes_tree RB tree assumes one refcount when uprobe is registered and added to the lookup tree; - while uprobe is triggered and kernel is handling it in the breakpoint handler code, temporary refcount bump is done to keep uprobe from being freed; - if we have uretprobe requested on a given struct uprobe instance, we take another refcount to keep uprobe alive until user space code returns from the function and triggers return handler. The uprobe_tree's extra refcount of 1 is confusing and problematic. No matter how many actual consumers are attached, they all share the same refcount, and we have an extra logic to drop the "last" (which might not really be last) refcount once uprobe's consumer list becomes empty. This is unconventional and has to be kept in mind as a special case all the time. Further, because of this design we have the situations where find_uprobe() will find uprobe, bump refcount, return it to the caller, but that uprobe will still need uprobe_is_active() check, after which the caller is required to drop refcount and try again. This is just too many details leaking to the higher level logic. This patch changes refcounting scheme in such a way as to not have uprobes_tree keeping extra refcount for struct uprobe. Instead, each uprobe_consumer is assuming its own refcount, which will be dropped when consumer is unregistered. Other than that, all the active users of uprobe (entry and return uprobe handling code) keeps exactly the same refcounting approach. With the above setup, once uprobe's refcount drops to zero, we need to make sure that uprobe's "destructor" removes uprobe from uprobes_tree, of course. This, though, races with uprobe entry handling code in handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup, can race with uprobe being destroyed after refcount drops to zero (e.g., due to uprobe_consumer unregistering). So we add try_get_uprobe(), which will attempt to bump refcount, unless it already is zero. Caller needs to guarantee that uprobe instance won't be freed in parallel, which is the case while we keep uprobes_treelock (for read or write, doesn't matter). Note also, we now don't leak the race between registration and unregistration, so we remove the retry logic completely. If find_uprobe() returns valid uprobe, it's guaranteed to remain in uprobes_tree with properly incremented refcount. The race is handled inside __insert_uprobe() and put_uprobe() working together: __insert_uprobe() will remove uprobe from RB-tree, if it can't bump refcount and will retry to insert the new uprobe instance. put_uprobe() won't attempt to remove uprobe from RB-tree, if it's already not there. All that is protected by uprobes_treelock, which keeps things simple. Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-09-05bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()Oleg Nesterov1-3/+6
If bpf_link_prime() fails, bpf_uprobe_multi_link_attach() goes to the error_free label and frees the array of bpf_uprobe's without calling bpf_uprobe_unregister(). This leaks bpf_uprobe->uprobe and worse, this frees bpf_uprobe->consumer without removing it from the uprobe->consumers list. Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Closes: https://lore.kernel.org/all/[email protected]/ Reported-by: [email protected] Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Jiri Olsa <[email protected]> Tested-by: [email protected] Cc: [email protected] Link: https://lore.kernel.org/r/[email protected]
2024-09-05perf/core: Fix small negative period being ignoredLuo Gengkun1-1/+5
In perf_adjust_period, we will first calculate period, and then use this period to calculate delta. However, when delta is less than 0, there will be a deviation compared to when delta is greater than or equal to 0. For example, when delta is in the range of [-14,-1], the range of delta = delta + 7 is between [-7,6], so the final value of delta/8 is 0. Therefore, the impact of -1 and -2 will be ignored. This is unacceptable when the target period is very short, because we will lose a lot of samples. Here are some tests and analyzes: before: # perf record -e cs -F 1000 ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.022 MB perf.data (518 samples) ] # perf script ... a.out 396 257.956048: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.957891: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.959730: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.961545: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.963355: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.965163: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.966973: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.968785: 23 cs: ffffffff81f4eeec schedul> a.out 396 257.970593: 23 cs: ffffffff81f4eeec schedul> ... after: # perf record -e cs -F 1000 ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.058 MB perf.data (1466 samples) ] # perf script ... a.out 395 59.338813: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.339707: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.340682: 13 cs: ffffffff81f4eeec schedul> a.out 395 59.341751: 13 cs: ffffffff81f4eeec schedul> a.out 395 59.342799: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.343765: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.344651: 11 cs: ffffffff81f4eeec schedul> a.out 395 59.345539: 12 cs: ffffffff81f4eeec schedul> a.out 395 59.346502: 13 cs: ffffffff81f4eeec schedul> ... test.c int main() { for (int i = 0; i < 20000; i++) usleep(10); return 0; } # time ./a.out real 0m1.583s user 0m0.040s sys 0m0.298s The above results were tested on x86-64 qemu with KVM enabled using test.c as test program. Ideally, we should have around 1500 samples, but the previous algorithm had only about 500, whereas the modified algorithm now has about 1400. Further more, the new version shows 1 sample per 0.001s, while the previous one is 1 sample per 0.002s.This indicates that the new algorithm is more sensitive to small negative values compared to old algorithm. Fixes: bd2b5b12849a ("perf_counter: More aggressive frequency adjustment") Signed-off-by: Luo Gengkun <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Adrian Hunter <[email protected]> Reviewed-by: Kan Liang <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2024-09-05tracing: Avoid possible softlockup in tracing_iter_reset()Zheng Yejian1-0/+2
In __tracing_open(), when max latency tracers took place on the cpu, the time start of its buffer would be updated, then event entries with timestamps being earlier than start of the buffer would be skipped (see tracing_iter_reset()). Softlockup will occur if the kernel is non-preemptible and too many entries were skipped in the loop that reset every cpu buffer, so add cond_resched() to avoid it. Cc: [email protected] Fixes: 2f26ebd549b9a ("tracing: use timestamp to determine start of latency traces") Link: https://lore.kernel.org/[email protected] Suggested-by: Steven Rostedt <[email protected]> Signed-off-by: Zheng Yejian <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2024-09-05dma-mapping: use IOMMU DMA calls for common alloc/free page callsLeon Romanovsky2-15/+11
Common alloca and free pages routines are called when IOMMU DMA is used, and internally it calls to DMA ops structure which is not available for default IOMMU. This patch adds necessary if checks to call IOMMU DMA. It fixes the following crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040 Mem abort info: ESR = 0x0000000096000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x06: level 2 translation fault Data abort info: ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000000d20bb000 [0000000000000040] pgd=08000000d20c1003 , p4d=08000000d20c1003 , pud=08000000d20c2003, pmd=0000000000000000 Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP Modules linked in: ipv6 hci_uart venus_core btqca v4l2_mem2mem btrtl qcom_spmi_adc5 sbs_battery btbcm qcom_vadc_common cros_ec_typec videobuf2_v4l2 leds_cros_ec cros_kbd_led_backlight cros_ec_chardev videodev elan_i2c videobuf2_common qcom_stats mc bluetooth coresight_stm stm_core ecdh_generic ecc pwrseq_core panel_edp icc_bwmon ath10k_snoc ath10k_core ath mac80211 phy_qcom_qmp_combo aux_bridge libarc4 coresight_replicator coresight_etm4x coresight_tmc coresight_funnel cfg80211 rfkill coresight qcom_wdt cbmem ramoops reed_solomon pwm_bl coreboot_table backlight crct10dif_ce CPU: 7 UID: 0 PID: 70 Comm: kworker/u32:4 Not tainted 6.11.0-rc6-next-20240903-00003-gdfc6015d0711 #660 Hardware name: Google Lazor Limozeen without Touchscreen (rev5 - rev8) (DT) Workqueue: events_unbound deferred_probe_work_func hub 2-1:1.0: 4 ports detected pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : dma_common_alloc_pages+0x54/0x1b4 lr : dma_common_alloc_pages+0x4c/0x1b4 sp : ffff8000807d3730 x29: ffff8000807d3730 x28: ffff02a7d312f880 x27: 0000000000000001 x26: 000000000000c000 x25: 0000000000000000 x24: 0000000000000001 x23: ffff02a7d23b6898 x22: 0000000000006cc0 x21: 000000000000c000 x20: ffff02a7858bf410 x19: fffffe0a60006000 x18: 0000000000000001 x17: 00000000000000d5 x16: 1fffe054f0bcc261 x15: 0000000000000001 x14: ffff02a7844dc680 x13: 0000000000100180 x12: dead000000000100 x11: dead000000000122 x10: 00000000001001ff x9 : ffff02a87f7b7b00 x8 : ffff02a87f7b7b00 x7 : ffff405977d6b000 x6 : ffff8000807d3310 x5 : ffff02a87f6b6398 x4 : 0000000000000001 x3 : ffff405977d6b000 x2 : ffff02a7844dc600 x1 : 0000000100000000 x0 : fffffe0a60006000 Call trace: dma_common_alloc_pages+0x54/0x1b4 __dma_alloc_pages+0x68/0x90 dma_alloc_pages+0x10/0x1c snd_dma_noncoherent_alloc+0x28/0x8c __snd_dma_alloc_pages+0x30/0x50 snd_dma_alloc_dir_pages+0x40/0x80 do_alloc_pages+0xb8/0x13c preallocate_pcm_pages+0x6c/0xf8 preallocate_pages+0x160/0x1a4 snd_pcm_set_managed_buffer_all+0x64/0xb0 lpass_platform_pcm_new+0xc0/0xe8 snd_soc_pcm_component_new+0x3c/0xc8 soc_new_pcm+0x4fc/0x668 snd_soc_bind_card+0xabc/0xbac snd_soc_register_card+0xf0/0x108 devm_snd_soc_register_card+0x4c/0xa4 sc7180_snd_platform_probe+0x180/0x224 platform_probe+0x68/0xc0 really_probe+0xbc/0x298 __driver_probe_device+0x78/0x12c driver_probe_device+0x3c/0x15c __device_attach_driver+0xb8/0x134 bus_for_each_drv+0x84/0xe0 __device_attach+0x9c/0x188 device_initial_probe+0x14/0x20 bus_probe_device+0xac/0xb0 deferred_probe_work_func+0x88/0xc0 process_one_work+0x14c/0x28c worker_thread+0x2cc/0x3d4 kthread+0x114/0x118 ret_from_fork+0x10/0x20 Code: f9411c19 940000c9 aa0003f3 b4000460 (f9402326) ---[ end trace 0000000000000000 ]--- Fixes: b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu") Closes: https://lore.kernel.org/all/10431dfd-ce04-4e0f-973b-c78477303c18@notapiano Reported-by: Nícolas F. R. A. Prado <[email protected]> #KernelCI Signed-off-by: Leon Romanovsky <[email protected]> Tested-by: Nícolas F. R. A. Prado <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2024-09-05Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar43-462/+325
This also refreshes the -rc1 based branch to -rc5. Signed-off-by: Ingo Molnar <[email protected]>
2024-09-04Merge branch 'bpf/master' into for-6.12Tejun Heo54-778/+1365
Pull bpf/master to receive baebe9aaba1e ("bpf: allow passing struct bpf_iter_<type> as kfunc arguments") and related changes in preparation for the DSQ iterator patchset. Signed-off-by: Tejun Heo <[email protected]>
2024-09-04cgroup/cpuset: Move cpu.h include to cpuset-internal.hWaiman Long2-4/+4
The newly created cpuset-v1.c file uses cpus_read_lock/unlock() functions which are defined in cpu.h but not included in cpuset-internal.h yet leading to compilation error under certain kernel configurations. Fix it by moving the cpu.h include from cpuset.c to cpuset-internal.h. While at it, sort the include files in alphabetic order. Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Fixes: 047b83097448 ("cgroup/cpuset: move relax_domain_level to cpuset-v1.c") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2024-09-04sched_ext: Add cgroup supportTejun Heo4-19/+594
Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement or implement only subset of cgroup features. The implemented features can be indicated using %SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features that are not implemented, a warning is triggered. While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers. v7: - cgroup interface file visibility toggling is dropped in favor just warning messages. Dynamically changing interface visiblity caused more confusion than helping. v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE. - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED. v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid locking order conflict w/ cpuset. Better documentation around locking. - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations so that its invocations always match ops.cgroup_move() one-to-one. v4: - Example schedulers moved into their own patches. - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi. v3: - Make scx_example_pair switch all tasks by default. - Convert to BPF inline iterators. - scx_bpf_task_cgroup() is added to determine the current cgroup from CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership. - scx_example_flatcg added. This demonstrates flattened hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups which are nested multiple levels are under competition. v2: - Build fixes for different CONFIG combinations. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: David Vernet <[email protected]> Acked-by: Josh Don <[email protected]> Acked-by: Hao Luo <[email protected]> Acked-by: Barret Rhoden <[email protected]> Reported-by: kernel test robot <[email protected]> Cc: Andrea Righi <[email protected]>
2024-09-04sched: Introduce CONFIG_GROUP_SCHED_WEIGHTTejun Heo2-8/+10
sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or SCX may implement the feature, put the interface code behind the new CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED. This allows either sched class to enable the itnerface code without ading more complex CONFIG tests. When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares() is added to support later CONFIG_CGROUP_SCHED_WEIGHT && !CONFIG_FAIR_GROUP_SCHED builds. No functional changes. Signed-off-by: Tejun Heo <[email protected]>
2024-09-04sched: Make cpu_shares_read_u64() use tg_weight()Tejun Heo1-8/+6
Move tg_weight() upward and make cpu_shares_read_u64() use it too. This makes the weight retrieval shared between cgroup v1 and v2 paths and will be used to implement cgroup support for sched_ext. No functional changes. Signed-off-by: Tejun Heo <[email protected]>
2024-09-04sched: Expose css_tg()Tejun Heo2-5/+5
A new BPF extensible sched_class will use css_tg() in the init and exit paths to visit all task_groups by walking cgroups. v4: __setscheduler_prio() is already exposed. Dropped from this patch. v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup mechanism. v2: Expose SCHED_CHANGE_BLOCK() too and update the description. Signed-off-by: Tejun Heo <[email protected]> Reviewed-by: David Vernet <[email protected]> Acked-by: Josh Don <[email protected]> Acked-by: Hao Luo <[email protected]> Acked-by: Barret Rhoden <[email protected]>
2024-09-04sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enableTejun Heo1-17/+13
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drop the lock and then call ops.init_task(). However, a TASK_DEAD task may already have lost all its usage count and be waiting for RCU grace period to be freed. If get_task_struct() is called on such task, use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again. Unfortunately, a racing sched_setscheduler(2) can grab the task before the task is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning: sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872] WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0 ... RIP: 0010:scx_ops_enable_task+0x18f/0x1f0 ... switching_to_scx+0x13/0xa0 __sched_setscheduler+0x84e/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is ops_enable not being able to tell reliably whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly. - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead task filtering. - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails. Signed-off-by: Tejun Heo <[email protected]> Cc: David Vernet <[email protected]> Cc: Peter Zijlstra <[email protected]>
2024-09-04sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disableTejun Heo1-16/+8
scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while calling scx_ops_exit_task() on all tasks including dead ones. This can leave a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent. If another task was in the process of changing the TASK_DEAD task's scheduling class and grabs the rq lock after scx_ops_disable_workfn() is done with the task, the task ends up calling scx_ops_disable_task() on the dead task which is in an inconsistent state triggering a warning: WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160 ... RIP: 0010:scx_ops_disable_task+0x12c/0x160 ... Call Trace: <TASK> check_class_changed+0x2c/0x70 __sched_setscheduler+0x8a0/0xa50 do_sched_setscheduler+0x104/0x1c0 __x64_sys_sched_setscheduler+0x18/0x30 do_syscall_64+0x7b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f140d70ea5b There is no reason to leave dead tasks on SCX when unloading the BPF scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including the dead ones from SCX. Signed-off-by: Tejun Heo <[email protected]>
2024-09-04bpf: Fix indentation issue in epilogue_idxMartin KaFai Lau1-1/+1
There is a report on new indentation issue in epilogue_idx. This patch fixed it. Fixes: 169c31761c8d ("bpf: Add gen_epilogue to bpf_verifier_ops") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-09-04bpf: Remove the insn_buf array stack usage from the inline_bpf_loop()Martin KaFai Lau1-41/+42
This patch removes the insn_buf array stack usage from the inline_bpf_loop(). Instead, the env->insn_buf is used. The usage in inline_bpf_loop() needs more than 16 insn, so the INSN_BUF_SIZE needs to be increased from 16 to 32. The compiler stack size warning on the verifier is gone after this change. Cc: Eduard Zingerman <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-09-04bpf: add check for invalid name in btf_name_valid_section()Jeongjun Park1-1/+3
If the length of the name string is 1 and the value of name[0] is NULL byte, an OOB vulnerability occurs in btf_name_valid_section() and the return value is true, so the invalid name passes the check. To solve this, you need to check if the first position is NULL byte and if the first character is printable. Suggested-by: Eduard Zingerman <[email protected]> Fixes: bd70a8fb7ca4 ("bpf: Allow all printable characters in BTF DATASEC names") Signed-off-by: Jeongjun Park <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Eduard Zingerman <[email protected]>
2024-09-04perf/aux: Fix AUX buffer serializationPeter Zijlstra3-6/+15
Ole reported that event->mmap_mutex is strictly insufficient to serialize the AUX buffer, add a per RB mutex to fully serialize it. Note that in the lock order comment the perf_event::mmap_mutex order was already wrong, that is, it nesting under mmap_lock is not new with this patch. Fixes: 45bfb2e50471 ("perf: Add AUX area to ring buffer for raw data streams") Reported-by: Ole <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2024-09-04Merge tag 'mm-hotfixes-stable-2024-09-03-20-19' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "17 hotfixes, 15 of which are cc:stable. Mostly MM, no identifiable theme. And a few nilfs2 fixups" * tag 'mm-hotfixes-stable-2024-09-03-20-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: alloc_tag: fix allocation tag reporting when CONFIG_MODULES=n mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area mailmap: update entry for Jan Kuliga codetag: debug: mark codetags for poisoned page as empty mm/memcontrol: respect zswap.writeback setting from parent cg too scripts: fix gfp-translate after ___GFP_*_BITS conversion to an enum Revert "mm: skip CMA pages when they are not available" maple_tree: remove rcu_read_lock() from mt_validate() kexec_file: fix elfcorehdr digest exclusion when CONFIG_CRASH_HOTPLUG=y mm/slub: add check for s->flags in the alloc_tagging_slab_free_hook nilfs2: fix state management in error path of log writing function nilfs2: fix missing cleanup on rollforward recovery error nilfs2: protect references to superblock parameters exposed in sysfs userfaultfd: don't BUG_ON() if khugepaged yanks our page table userfaultfd: fix checks for huge PMDs mm: vmalloc: ensure vmap_block is initialised before adding to queue selftests: mm: fix build errors on armhf
2024-09-04printk: Avoid false positive lockdep report for legacy printingJohn Ogness1-20/+63
Legacy console printing from printk() caller context may invoke the console driver from atomic context. This leads to a lockdep splat because the console driver will acquire a sleeping lock and the caller may already hold a spinning lock. This is noticed by lockdep on !PREEMPT_RT configurations because it will lead to a problem on PREEMPT_RT. However, on PREEMPT_RT the printing path from atomic context is always avoided and the console driver is always invoked from a dedicated thread. Thus the lockdep splat on !PREEMPT_RT is a false positive. For !PREEMPT_RT override the lock-context before invoking the console driver to avoid the false positive. Do not override the lock-context for PREEMPT_RT in order to allow lockdep to catch any real locking context issues related to the write callback usage. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Assign nice -20 for printing threadsJohn Ogness2-0/+12
It is important that console printing threads are scheduled shortly after a printk call and with generous runtime budgets. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: Implement legacy printer kthread for PREEMPT_RTJohn Ogness3-18/+159
The write() callback of legacy consoles usually makes use of spinlocks. This is not permitted with PREEMPT_RT in atomic contexts. For PREEMPT_RT, create a new kthread to handle printing of all the legacy consoles (and nbcon consoles if boot consoles are registered). This allows legacy consoles to work on PREEMPT_RT without requiring modification. (However they will not have the reliability properties guaranteed by nbcon atomic consoles.) Use the existing printk_kthreads_check_locked() to start/stop the legacy kthread as needed. Introduce the macro force_legacy_kthread() to query if the forced threading of legacy consoles is in effect. Although currently only enabled for PREEMPT_RT, this acts as a simple mechanism for the future to allow other preemption models to easily take advantage of the non-interference property provided by the legacy kthread. When force_legacy_kthread() is true, the legacy kthread fulfills the role of the console_flush_type @legacy_offload by waking the legacy kthread instead of printing via the console_lock in the irq_work. If the legacy kthread is not yet available, no legacy printing takes place (unless in panic). If for some reason the legacy kthread fails to create, any legacy consoles are unregistered. With force_legacy_kthread(), the legacy kthread is a critical component for legacy consoles. These changes only affect CONFIG_PREEMPT_RT. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Show replay message on takeoverJohn Ogness3-0/+38
An emergency or panic context can takeover console ownership while the current owner was printing a printk message. The atomic printer will re-print the message that the previous owner was printing. However, this can look confusing to the user and may even seem as though a message was lost. [3430014.1 [3430014.181123] usb 1-2: Product: USB Audio Add a new field @nbcon_prev_seq to struct console to track the sequence number to print that was assigned to the previous console owner. If this matches the sequence number to print that the current owner is assigned, then a takeover must have occurred. In this case, print an additional message to inform the user that the previous message is being printed again. [3430014.1 ** replaying previous printk message ** [3430014.181123] usb 1-2: Product: USB Audio Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: Provide helper for message prependingJohn Ogness1-11/+25
In order to support prepending different texts to printk messages, split out the prepending code into a helper function. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Rely on kthreads for normal operationJohn Ogness3-7/+84
Once the kthread is running and available (i.e. @printk_kthreads_running is set), the kthread becomes responsible for flushing any pending messages which are added in NBCON_PRIO_NORMAL context. Namely the legacy console_flush_all() and device_release() no longer flush the console. And nbcon_atomic_flush_pending() used by nbcon_cpu_emergency_exit() no longer flushes messages added after the emergency messages. The console context is safe when used by the kthread only when one of the following conditions are true: 1. Other caller acquires the console context with NBCON_PRIO_NORMAL with preemption disabled. It will release the context before rescheduling. 2. Other caller acquires the console context with NBCON_PRIO_NORMAL under the device_lock. 3. The kthread is the only context which acquires the console with NBCON_PRIO_NORMAL. This is satisfied for all atomic printing call sites: nbcon_legacy_emit_next_record() (#1) nbcon_atomic_flush_pending_con() (#1) nbcon_device_release() (#2) It is even double guaranteed when @printk_kthreads_running is set because then _only_ the kthread will print for NBCON_PRIO_NORMAL. (#3) Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Use thread callback if in task context for legacyJohn Ogness3-45/+59
When printing via console_lock, the write_atomic() callback is used for nbcon consoles. However, if it is known that the current context is a task context, the write_thread() callback can be used instead. Using write_thread() instead of write_atomic() helps to reduce large disabled preemption regions when the device_lock does not disable preemption. This is mainly a preparatory change to allow avoiding write_atomic() completely during normal operation if boot consoles are registered. As a side-effect, it also allows consolidating the printing code for legacy printing and the kthread printer. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Relocate nbcon_atomic_emit_one()John Ogness1-39/+39
Move nbcon_atomic_emit_one() so that it can be used by nbcon_kthread_func() in a follow-up commit. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Introduce printer kthreadsThomas Gleixner3-0/+380
Provide the main implementation for running a printer kthread per nbcon console that is takeover/handover aware. This includes: - new mandatory write_thread() callback - kthread creation - kthread main printing loop - kthread wakeup mechanism - kthread shutdown kthread creation is a bit tricky because consoles may register before kthreads can be created. In such cases, registration will succeed, even though no kthread exists. Once kthreads can be created, an early_initcall will set @printk_kthreads_ready. If there are no registered boot consoles, the early_initcall creates the kthreads for all registered nbcon consoles. If kthread creation fails, the related console is unregistered. If there are registered boot consoles when @printk_kthreads_ready is set, no kthreads are created until the final boot console unregisters. Once kthread creation finally occurs, @printk_kthreads_running is set so that the system knows kthreads are available for all registered nbcon consoles. If @printk_kthreads_running is already set when the console is registering, the kthread is created during registration. If kthread creation fails, the registration will fail. Until @printk_kthreads_running is set, console printing occurs directly via the console_lock. kthread shutdown on system shutdown/reboot is necessary to ensure the printer kthreads finish their printing so that the system can cleanly transition back to direct printing via the console_lock in order to reliably push out the final shutdown/reboot messages. @printk_kthreads_running is cleared before shutting down the individual kthreads. The kthread uses a new mandatory write_thread() callback that is called with both device_lock() and the console context acquired. The console ownership handling is necessary for synchronization against write_atomic() which is synchronized only via the console context ownership. The device_lock() serializes acquiring the console context with NBCON_PRIO_NORMAL. It is needed in case the device_lock() does not disable preemption. It prevents the following race: CPU0 CPU1 [ task A ] nbcon_context_try_acquire() # success with NORMAL prio # .unsafe == false; // safe for takeover [ schedule: task A -> B ] WARN_ON() nbcon_atomic_flush_pending() nbcon_context_try_acquire() # success with EMERGENCY prio # flushing nbcon_context_release() # HERE: con->nbcon_state is free # to take by anyone !!! nbcon_context_try_acquire() # success with NORMAL prio [ task B ] [ schedule: task B -> A ] nbcon_enter_unsafe() nbcon_context_can_proceed() BUG: nbcon_context_can_proceed() returns "true" because the console is owned by a context on CPU0 with NBCON_PRIO_NORMAL. But it should return "false". The console is owned by a context from task B and we do the check in a context from task A. Note that with these changes, the printer kthreads do not yet take over full responsibility for nbcon printing during normal operation. These changes only focus on the lifecycle of the kthreads. Co-developed-by: John Ogness <[email protected]> Signed-off-by: John Ogness <[email protected]> Signed-off-by: Thomas Gleixner (Intel) <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Init @nbcon_seq to highest possibleJohn Ogness2-1/+9
When initializing an nbcon console, have nbcon_alloc() set @nbcon_seq to the highest possible sequence number. For all practical purposes, this will guarantee that the console will have nothing to print until later when @nbcon_seq is set to the proper initial printing value. This will be particularly important once kthread printing is introduced because nbcon_alloc() can create/start the kthread before the desired initial sequence number is known. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Add context to usable() and emit()John Ogness3-21/+27
The nbcon consoles will have two callbacks to be used for different contexts. In order to determine if an nbcon console is usable, console_is_usable() must know if it is a context that will need to use the optional write_atomic() callback. Also, nbcon_emit_next_record() must know which callback it needs to call. Add an extra parameter @use_atomic to console_is_usable() and nbcon_emit_next_record() to specify this. Since so far only the write_atomic() callback exists, @use_atomic is set to true for all call sites. For legacy consoles, @use_atomic is not used. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: Flush console on unregister_console()John Ogness1-2/+7
Ensure consoles have flushed pending records before unregistering. The console should print up to at least its related "console disabled" record. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: Fail pr_flush() if before SYSTEM_SCHEDULINGJohn Ogness1-0/+4
A follow-up change adds pr_flush() to console unregistration. However, with boot consoles unregistration can happen very early if there are also regular consoles registering as well. In this case the pr_flush() is not important because all consoles are flushed when checking the initial console sequence number. Allow pr_flush() to fail if @system_state has not yet reached SYSTEM_SCHEDULING. This avoids might_sleep() and msleep() explosions that would otherwise occur: [ 0.436739][ T0] printk: legacy console [ttyS0] enabled [ 0.439820][ T0] printk: legacy bootconsole [earlyser0] disabled [ 0.446822][ T0] BUG: scheduling while atomic: swapper/0/0/0x00000002 [ 0.450491][ T0] 1 lock held by swapper/0/0: [ 0.457897][ T0] #0: ffffffff82ae5f88 (console_mutex){+.+.}-{4:4}, at: console_list_lock+0x20/0x70 [ 0.463141][ T0] Modules linked in: [ 0.465307][ T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.10.0-rc1+ #372 [ 0.469394][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 0.474402][ T0] Call Trace: [ 0.476246][ T0] <TASK> [ 0.481473][ T0] dump_stack_lvl+0x93/0xb0 [ 0.483949][ T0] dump_stack+0x10/0x20 [ 0.486256][ T0] __schedule_bug+0x68/0x90 [ 0.488753][ T0] __schedule+0xb9b/0xd80 [ 0.491179][ T0] ? lock_release+0xb5/0x270 [ 0.493732][ T0] schedule+0x43/0x170 [ 0.495998][ T0] schedule_timeout+0xc5/0x1e0 [ 0.498634][ T0] ? __pfx_process_timeout+0x10/0x10 [ 0.501522][ T0] ? msleep+0x13/0x50 [ 0.503728][ T0] msleep+0x3c/0x50 [ 0.505847][ T0] __pr_flush.constprop.0.isra.0+0x56/0x500 [ 0.509050][ T0] ? _printk+0x58/0x80 [ 0.511332][ T0] ? lock_is_held_type+0x9c/0x110 [ 0.514106][ T0] unregister_console_locked+0xe1/0x450 [ 0.517144][ T0] register_console+0x509/0x620 [ 0.519827][ T0] ? __pfx_univ8250_console_init+0x10/0x10 [ 0.523042][ T0] univ8250_console_init+0x24/0x40 [ 0.525845][ T0] console_init+0x43/0x210 [ 0.528280][ T0] start_kernel+0x493/0x980 [ 0.530773][ T0] x86_64_start_reservations+0x18/0x30 [ 0.533755][ T0] x86_64_start_kernel+0xae/0xc0 [ 0.536473][ T0] common_startup_64+0x12c/0x138 [ 0.539210][ T0] </TASK> And then the kernel goes into an infinite loop complaining about: 1. releasing a pinned lock 2. unpinning an unpinned lock 3. bad: scheduling from the idle thread! 4. goto 1 Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Add function for printers to reacquire ownershipJohn Ogness1-7/+67
Since ownership can be lost at any time due to handover or takeover, a printing context _must_ be prepared to back out immediately and carefully. However, there are scenarios where the printing context must reacquire ownership in order to finalize or revert hardware changes. One such example is when interrupts are disabled during printing. No other context will automagically re-enable the interrupts. For this case, the disabling context _must_ reacquire nbcon ownership so that it can re-enable the interrupts. Provide nbcon_reacquire_nobuf() for exactly this purpose. It allows a printing context to reacquire ownership using the same priority as its previous ownership. Note that after a successful reacquire the printing context will have no output buffer because that has been lost. This function cannot be used to resume printing. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04printk: nbcon: Use raw_cpu_ptr() instead of open codingJohn Ogness1-2/+1
There is no need to open code a non-migration-checking this_cpu_ptr(). That is exactly what raw_cpu_ptr() is. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-04cpu: Fix W=1 build kernel-doc warningThorsten Blum1-3/+1
Building the kernel with W=1 generates the following warning: kernel/cpu.c:2693: warning: This comment starts with '/**', but isn't a kernel-doc comment. The function topology_is_core_online() is a simple helper function and doesn't need a kernel-doc comment. Use a normal comment instead. Signed-off-by: Thorsten Blum <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/all/[email protected]
2024-09-04Merge branch 'linus' into smp/coreThomas Gleixner41-459/+294
Pull in upstream changes so further patches don't conflict.
2024-09-04timers: Annotate possible non critical data race of next_expiryAnna-Maria Behnsen1-5/+37
Global timers could be expired remotely when the target CPU is idle. After a remote timer expiry, the remote timer_base->next_expiry value is updated while holding the timer_base->lock. When the formerly idle CPU becomes active at the same time and checks whether timers need to expire, this check is done lockless as it is on the local CPU. This could lead to a data race, which was reported by sysbot: https://lore.kernel.org/r/[email protected] When the value is read lockless but changed by the remote CPU, only two non critical scenarios could happen: 1) The already update value is read -> everything is perfect 2) The old value is read -> a superfluous timer soft interrupt is raised The same situation could happen when enqueueing a new first pinned timer by a remote CPU also with non critical scenarios: 1) The already update value is read -> everything is perfect 2) The old value is read -> when the CPU is idle, an IPI is executed nevertheless and when the CPU isn't idle, the updated value will be visible on the next tick and the timer might be late one jiffie. As this is very unlikely to happen, the overhead of doing the check under the lock is a way more effort, than a superfluous timer soft interrupt or a possible 1 jiffie delay of the timer. Document and annotate this non critical behavior in the code by using READ/WRITE_ONCE() pair when accessing timer_base->next_expiry. Reported-by: [email protected] Signed-off-by: Anna-Maria Behnsen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/all/[email protected] Closes: https://lore.kernel.org/lkml/[email protected]
2024-09-04printk: Use the BITS_PER_LONG macroJinjie Ruan1-1/+2
sizeof(unsigned long) * 8 is the number of bits in an unsigned long variable, replace it with BITS_PER_LONG macro to make it simpler. Signed-off-by: Jinjie Ruan <[email protected]> Reviewed-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Petr Mladek <[email protected]>
2024-09-03sched_ext: Remove sched_class->switch_class()Tejun Heo2-13/+1
With sched_ext converted to use put_prev_task() for class switch detection, there's no user of switch_class() left. Drop it. Signed-off-by: Tejun Heo <[email protected]> Cc: Peter Zijlstra <[email protected]>
2024-09-03sched_ext: Remove switch_class_scx()Tejun Heo1-5/+4
Now that put_prev_task_scx() is called with @next on task switches, there's no reason to use sched_class.switch_class(). Rename switch_class_scx() to switch_class() and call it from put_prev_task_scx(). Signed-off-by: Tejun Heo <[email protected]>
2024-09-03sched_ext: Relocate functions in kernel/sched/ext.cTejun Heo1-78/+78
Relocate functions to ease the removal of switch_class_scx(). No functional changes. Signed-off-by: Tejun Heo <[email protected]>
2024-09-03sched_ext: Unify regular and core-sched pick task pathsTejun Heo1-67/+11
Because the BPF scheduler's dispatch path is invoked from balance(), sched_ext needs to invoke balance_one() on all sibling rq's before picking the next task for core-sched. Before the recent pick_next_task() updates, sched_ext couldn't share pick task between regular and core-sched paths because pick_next_task() depended on put_prev_task() being called on the current task. Tasks currently running on sibling rq's can't be put when one rq is trying to pick the next task, so pick_task_scx() had to have a separate mechanism to pick between a sibling rq's current task and the first task in its local DSQ. However, with the preceding updates, pick_next_task_scx() no longer depends on the current task being put and can compare the current task and the next in line statelessly, and the pick task logic should be shareable between regular and core-sched paths. Unify regular and core-sched pick task paths: - There's no reason to distinguish local and sibling picks anymore. @local is removed from balance_one(). - pick_next_task_scx() is turned into pick_task_scx() by dropping the put_prev_set_next_task() call. - The old pick_task_scx() is dropped. Signed-off-by: Tejun Heo <[email protected]>
2024-09-03sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEPTejun Heo2-11/+10
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to keep running the current task. It's not really a task property. Replace it with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for the usage. Also, the existing clearing rule is unnecessarily strict and makes it difficult to use with core-sched. Just clear it on entry to balance_one(). Signed-off-by: Tejun Heo <[email protected]>