path: root/kernel
Age | Commit message | Author | Files | Lines
2021-07-27 | kdb: Simplify kdb_defcmd macro logic | Sumit Garg | 1 | -49/+58
Switch to use a linked list instead of dynamic array which makes allocation of kdb macro and traversing the kdb macro commands list simpler. Suggested-by: Daniel Thompson <[email protected]> Signed-off-by: Sumit Garg <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Daniel Thompson <[email protected]>
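A rough userspace sketch of why a linked list makes this simpler than a growable array. Every name here (`struct macro`, `macro_add()`, and so on) is hypothetical, and a plain singly linked list stands in for whatever list helpers the real kdb code uses. Appending a statement is one small allocation and two pointer writes, with no resize-and-copy step:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kdb macro structures. */
struct macro_stmt {
	char *text;               /* one command line of the macro */
	struct macro_stmt *next;  /* next statement, or NULL */
};

struct macro {
	const char *name;
	struct macro_stmt *head;
	struct macro_stmt **tail; /* address of the link to fill next */
};

static void macro_init(struct macro *m, const char *name)
{
	m->name = name;
	m->head = NULL;
	m->tail = &m->head;
}

/* Append one statement; nothing already stored is moved or copied. */
static int macro_add(struct macro *m, const char *text)
{
	struct macro_stmt *s;
	size_t len;

	assert(m && text);
	s = malloc(sizeof(*s));
	if (!s)
		return -1;
	len = strlen(text) + 1;
	s->text = malloc(len);
	if (!s->text) {
		free(s);
		return -1;
	}
	memcpy(s->text, text, len);
	s->next = NULL;
	*m->tail = s;
	m->tail = &s->next;
	return 0;
}

/* Walk the list in insertion order and count the statements. */
static size_t macro_count(const struct macro *m)
{
	const struct macro_stmt *s;
	size_t n = 0;

	for (s = m->head; s; s = s->next)
		n++;
	return n;
}

/* Release every statement and reset the macro to empty. */
static void macro_free(struct macro *m)
{
	struct macro_stmt *s = m->head;

	while (s) {
		struct macro_stmt *next = s->next;

		free(s->text);
		free(s);
		s = next;
	}
	m->head = NULL;
	m->tail = &m->head;
}
```

Running a macro then amounts to the same walk as `macro_count()`, executing `s->text` at each node.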
2021-07-27 | kdb: Get rid of redundant kdb_register_flags() | Sumit Garg | 3 | -130/+62
Commit e4f291b3f7bb ("kdb: Simplify kdb commands registration") allowed registration of pre-allocated kdb commands with pointer to struct kdbtab_t. Let's switch other users as well to register pre-allocated kdb commands via:

- Changing prototype for kdb_register() to pass a pointer to struct kdbtab_t instead.
- Embed kdbtab_t structure in kdb_macro_t rather than individual params.

With these changes kdb_register_flags() becomes redundant and hence removed. Also, since we have switched all users to register pre-allocated commands, "is_dynamic" flag in struct kdbtab_t becomes redundant and hence removed as well.

Suggested-by: Daniel Thompson <[email protected]>
Signed-off-by: Sumit Garg <[email protected]>
Acked-by: Steven Rostedt (VMware) <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Daniel Thompson <[email protected]>
2021-07-27 | kdb: Rename struct defcmd_set to struct kdb_macro | Sumit Garg | 1 | -20/+20
Rename struct defcmd_set to struct kdb_macro as that sounds more appropriate given its purpose. Suggested-by: Daniel Thompson <[email protected]> Signed-off-by: Sumit Garg <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Daniel Thompson <[email protected]>
2021-07-27 | kdb: Get rid of custom debug heap allocator | Sumit Garg | 3 | -307/+28
Currently the only user for debug heap is kdbnearsym() which can be modified to rather use statically allocated buffer for symbol name as per its current usage. So do that and hence remove custom debug heap allocator.

Note that this change puts a restriction on kdbnearsym() callers to carefully use shared namebuf such that a caller should consume the symbol returned immediately prior to another call to fetch a different symbol.

Also, this change uses standard KSYM_NAME_LEN macro for namebuf allocation instead of local variable: knt1_size which should avoid any conflicts caused by changes to KSYM_NAME_LEN macro value.

This change has been tested using kgdbtest on arm64 which doesn't show any regressions.

Suggested-by: Daniel Thompson <[email protected]>
Signed-off-by: Sumit Garg <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Daniel Thompson <[email protected]>
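The caller restriction described above is the usual hazard of returning a pointer into a shared static buffer. A minimal userspace illustration follows; `lookup_name()`, `lookup_name_copy()`, and the `NAME_LEN` value are made up for the example and are not the kdb API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 128  /* stand-in for KSYM_NAME_LEN; the value is illustrative */

/*
 * Hypothetical lookup that returns a pointer into one shared static
 * buffer instead of freshly allocated memory.
 */
static const char *lookup_name(unsigned long addr)
{
	static char namebuf[NAME_LEN];

	snprintf(namebuf, sizeof(namebuf), "sym_%lx", addr);
	return namebuf;
}

/*
 * Safe pattern: consume (here: copy out) the result before the next
 * lookup overwrites the shared buffer.
 */
static void lookup_name_copy(unsigned long addr, char *dst, size_t len)
{
	assert(dst && len > 0);
	snprintf(dst, len, "%s", lookup_name(addr));
}
```

Two back-to-back calls to `lookup_name()` return the same pointer, so the first result is only valid until the second call; `lookup_name_copy()` shows one way to honour that rule.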
2021-07-26 | cgroup/cpuset: Fix a partition bug with hotplug | Waiman Long | 1 | -0/+7
In cpuset_hotplug_workfn(), the detection of whether the cpu list has been changed is done by comparing the effective cpus of the top cpuset with the cpu_active_mask. However, in the rare case that just all the CPUs in the subparts_cpus are offlined, the detection fails and the partition states are not updated correctly. Fix it by forcing the cpus_updated flag to true in this particular case. Fixes: 4b842da276a8 ("cpuset: Make CPU hotplug work with partition") Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2021-07-26 | cgroup/cpuset: Miscellaneous code cleanup | Waiman Long | 1 | -21/+19
Use more descriptive variable names for update_prstate(), remove unnecessary code and fix some typos. There is no functional change. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2021-07-26 | printk: syslog: close window between wait and read | John Ogness | 1 | -19/+36
Syslog's SYSLOG_ACTION_READ is supposed to block until the next syslog record can be read, and then it should read that record. However, because @syslog_lock is not held between waking up and reading the record, another reader could read the record first, thus causing SYSLOG_ACTION_READ to return with a value of 0, never having read _anything_. By holding @syslog_lock between waking up and reading, it can be guaranteed that SYSLOG_ACTION_READ blocks until it successfully reads a syslog record (or a real error occurs). Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-07-26 | printk: convert @syslog_lock to mutex | John Ogness | 1 | -29/+20
@syslog_lock was a raw_spin_lock to simplify the transition of removing @logbuf_lock and the safe buffers. With that transition complete, and since all uses of @syslog_lock are within sleepable contexts, @syslog_lock can become a mutex.

Note that until now register_console() would disable interrupts using irqsave, which implies that it may be called with interrupts disabled. And indeed, there is one possible call chain on parisc where this happens:

  handle_interruption(code=1) /* High-priority machine check (HPMC) */
    pdc_console_restart()
      pdc_console_init_force()
        register_console()

However, register_console() calls console_lock(), which might sleep. So it has never been allowed to call register_console() from an atomic context and the above call chain is a bug.

Note that the removal of read_syslog_seq_irq() is slightly changing the behavior of SYSLOG_ACTION_READ by testing against a possibly outdated @seq value. However, the value of @seq could have changed after the test, so it is not a new window. A follow-up commit closes this window.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2021-07-26 | printk: remove NMI tracking | John Ogness | 3 | -46/+1
All NMI contexts are handled the same as the safe context: store the message and defer printing. There is no need to have special NMI context tracking for this. Using in_nmi() is enough.

There are several parts of the kernel that are manually calling into the printk NMI context tracking in order to cause general printk deferred printing:

  arch/arm/kernel/smp.c
  arch/powerpc/kexec/crash.c
  kernel/trace/trace.c

For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new function pair printk_deferred_enter/exit that explicitly achieves the same objective.

For ftrace, remove the printk context manipulation completely. It was added in commit 03fc7f9c99c1 ("printk/nmi: Prevent deadlock when accessing the main log buffer in NMI"). The purpose was to enforce storing messages directly into the ring buffer even in NMI context. It really should have only modified the behavior in NMI context. There is no need for a special behavior any longer. All messages are always stored directly now. The console deferring is handled transparently in vprintk().

Signed-off-by: John Ogness <[email protected]>
[[email protected]: Remove special handling in ftrace.c completely.
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2021-07-26 | printk: remove safe buffers | John Ogness | 5 | -428/+48
With @logbuf_lock removed, the high level printk functions for storing messages are lockless. Messages can be stored from any context, so there is no need for the NMI and safe buffers anymore. Remove the NMI and safe buffers. Although the safe buffers are removed, the NMI and safe context tracking is still in place. In these contexts, store the message immediately but still use irq_work to defer the console printing. Since printk recursion tracking is in place, safe context tracking for most of printk is not needed. Remove it. Only safe context tracking relating to the console and console_owner locks is left in place. This is because the console and console_owner locks are needed for the actual printing. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-07-26 | printk: track/limit recursion | John Ogness | 1 | -3/+83
Currently the printk safe buffers provide a form of recursion protection by redirecting to the safe buffers whenever printk() is recursively called. In preparation for removal of the safe buffers, provide an alternate explicit recursion protection. Recursion is limited to 3 levels per-CPU and per-context. Signed-off-by: John Ogness <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
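As a sketch of what explicit recursion protection can look like, the following keeps one depth counter per context and refuses entry once the limit is exceeded. The names (`log_enter()`, `log_exit()`, `enum ctx`) are hypothetical. Reading "3 levels" as "the outermost call plus three nested calls" is an assumption of this sketch, and a real per-CPU version would keep one such array per CPU:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECURSION 3  /* limit taken from the commit message */

/* Hypothetical context identifiers; each gets its own counter. */
enum ctx { CTX_TASK, CTX_IRQ, CTX_NMI, CTX_COUNT };

/* One depth counter per context (a per-CPU copy is omitted here). */
static unsigned char depth[CTX_COUNT];

/* Returns false when the caller is nested too deeply and must bail out. */
static bool log_enter(enum ctx c)
{
	if (depth[c] > MAX_RECURSION)
		return false;
	depth[c]++;
	return true;
}

/* Must be paired with a successful log_enter() in the same context. */
static void log_exit(enum ctx c)
{
	assert(depth[c] > 0);
	depth[c]--;
}
```

A logging function would call `log_enter()` first, drop the message if it returns false, and call `log_exit()` on the way out.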
2021-07-26 | printk: Move the printk() kerneldoc comment to its new home | Jonathan Corbet | 1 | -24/+0
Commit 337015573718 ("printk: Userspace format indexing support") turned printk() into a macro, but left the kerneldoc comment for it with the (now) _printk() function, resulting in this docs-build warning:

  kernel/printk/printk.c:1: warning: 'printk' not found

Move the kerneldoc comment back next to the (now) macro it's meant to describe and have the docs build find it there.

Fixes: 337015573718b161 ("printk: Userspace format indexing support")
Signed-off-by: Jonathan Corbet <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
2021-07-26 | printk/index: Fix warning about missing prototypes | Petr Mladek | 1 | -2/+2
The commit 337015573718b161 ("printk: Userspace format indexing support") triggered the following build warnings:

  kernel/printk/index.c:140:6: warning: no previous prototype for ‘pi_create_file’ [-Wmissing-prototypes]
   void pi_create_file(struct module *mod)
        ^~~~~~~~~~~~~~
  kernel/printk/index.c:146:6: warning: no previous prototype for ‘pi_remove_file’ [-Wmissing-prototypes]
   void pi_remove_file(struct module *mod)
        ^~~~~~~~~~~~~~

Fixes: 337015573718b161 ("printk: Userspace format indexing support")
Reported-by: kernel test robot <[email protected]>
Suggested-by: Chris Down <[email protected]>
[[email protected]: Let the compiler decide about inlining.]
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/lkml/YPql089IwSpudw%2F1@alley/
2021-07-25 | smpboot: fix duplicate and misplaced inlining directive | Linus Torvalds | 1 | -1/+1
gcc doesn't care, but clang quite reasonably pointed out that the recent commit e9ba16e68cce ("smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining") did some really odd things:

  kernel/smpboot.c:50:20: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
  static inline void __always_inline idle_init(unsigned int cpu)
                     ^

which not only has that duplicate inlining specifier, but the new __always_inline was put in the wrong place of the function definition.

We put the storage class specifiers (ie things like "static" and "extern") first, and the type information after that. And while the compiler may not care, we put the inline specifier before the types.

So it should be just

  static __always_inline void idle_init(unsigned int cpu)

instead.

Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
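For reference, a small compilable sketch of the ordering described above: storage class, then the inline specifier, then the return type. The macro definition is a userspace stand-in for the kernel's `__always_inline` (it assumes a GCC- or Clang-compatible compiler), and the function body, `idle_state[]`, and `idle_ready()` are invented for the example:

```c
#include <assert.h>

/* Userspace stand-in for the kernel macro; skipped if libc already has one. */
#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

#define DEMO_NR_CPUS 4  /* illustrative size only */

static int idle_state[DEMO_NR_CPUS];

/* Order: storage class ("static"), inline specifier, then the type. */
static __always_inline void idle_init(unsigned int cpu)
{
	assert(cpu < DEMO_NR_CPUS);
	idle_state[cpu] = 1;  /* placeholder body, not the kernel's */
}

/* Ordinary caller; idle_init() is expanded inline here. */
static int idle_ready(unsigned int cpu)
{
	idle_init(cpu);
	return idle_state[cpu];
}
```

Because `__always_inline` already expands to an `inline` specifier, writing `static inline void __always_inline` repeats it, which is what the clang warning quoted above complains about.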
2021-07-25 | Merge tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 2 | -8/+10
Pull timer fixes from Thomas Gleixner:
 "A small set of timer related fixes:

  - Plug a race between rearm and process tick in the posix CPU timers code
  - Make the optimization to avoid recalculation of the next timer interrupt work correctly when there are no timers pending"

* tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix get_next_timer_interrupt() with no timers pending
  posix-cpu-timers: Fix rearm racing against process tick
2021-07-25 | Merge tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -1/+1
Pull core fix from Thomas Gleixner:
 "A single update for the boot code to prevent aggressive un-inlining which causes a section mismatch"

* tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining
2021-07-25 | Merge tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping | Linus Torvalds | 1 | -2/+10
Pull dma-mapping fix from Christoph Hellwig:

 - handle vmalloc addresses in dma_common_{mmap,get_sgtable} (Roman Skakun)

* tag 'dma-mapping-5.14-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: handle vmalloc addresses in dma_common_{mmap,get_sgtable}
2021-07-23 | swiotlb: Free tbl memory in swiotlb_exit() | Will Deacon | 1 | -6/+15
Although swiotlb_exit() frees the 'slots' metadata array referenced by 'io_tlb_default_mem', it leaves the underlying buffer pages allocated despite no longer being usable. Extend swiotlb_exit() to free the buffer pages as well as the slots array. Cc: Claire Chang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Robin Murphy <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Tested-by: Claire Chang <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
2021-07-23 | swiotlb: Emit diagnostic in swiotlb_exit() | Will Deacon | 1 | -0/+1
A recent debugging session would have been made a little bit easier if we had noticed sooner that swiotlb_exit() was being called during boot. Add a simple diagnostic message to swiotlb_exit() to complement the one from swiotlb_print_info() during initialisation. Cc: Claire Chang <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Robin Murphy <[email protected]> Link: https://lore.kernel.org/r/20210705190352.GA19461@willie-the-truck Suggested-by: Konrad Rzeszutek Wilk <[email protected]> Tested-by: Nathan Chancellor <[email protected]> Tested-by: Claire Chang <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
2021-07-23 | swiotlb: Convert io_default_tlb_mem to static allocation | Will Deacon | 1 | -30/+36
Since commit 69031f500865 ("swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used"), 'struct device' may hold a copy of the global 'io_default_tlb_mem' pointer if the device is using swiotlb for DMA. A subsequent call to swiotlb_exit() will therefore leave dangling pointers behind in these device structures, resulting in KASAN splats such as:

  | BUG: KASAN: use-after-free in __iommu_dma_unmap_swiotlb+0x64/0xb0
  | Read of size 8 at addr ffff8881d7830000 by task swapper/0/0
  |
  | CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3-debug #1
  | Hardware name: HP HP Desktop M01-F1xxx/87D6, BIOS F.12 12/17/2020
  | Call Trace:
  | <IRQ>
  | dump_stack+0x9c/0xcf
  | print_address_description.constprop.0+0x18/0x130
  | kasan_report.cold+0x7f/0x111
  | __iommu_dma_unmap_swiotlb+0x64/0xb0
  | nvme_pci_complete_rq+0x73/0x130
  | blk_complete_reqs+0x6f/0x80
  | __do_softirq+0xfc/0x3be

Convert 'io_default_tlb_mem' to a static structure, so that the per-device pointers remain valid after swiotlb_exit() has been invoked. All users are updated to reference the static structure directly, using the 'nslabs' field to determine whether swiotlb has been initialised. The 'slots' array is still allocated dynamically and referenced via a pointer rather than a flexible array member.

Cc: Claire Chang <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Fixes: 69031f500865 ("swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used")
Reported-by: Nathan Chancellor <[email protected]>
Tested-by: Nathan Chancellor <[email protected]>
Tested-by: Claire Chang <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
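The pattern described above can be sketched in userspace as follows. The descriptor is a static object, so pointers to it cached elsewhere never dangle, and a zero `nslabs` marks the pool as torn down. `struct pool`, `struct device`, and the three functions are simplified, hypothetical stand-ins, not the swiotlb code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, reduced version of the pool descriptor. */
struct pool {
	unsigned long nslabs;  /* 0 means "not initialised" */
	int *slots;            /* metadata array, still allocated dynamically */
};

/* Static object: its address stays valid for the program's lifetime. */
static struct pool default_pool;

/* A device caches a pointer to the pool it uses. */
struct device {
	struct pool *mem;
};

static int pool_init(unsigned long nslabs)
{
	assert(nslabs > 0);
	default_pool.slots = calloc(nslabs, sizeof(*default_pool.slots));
	if (!default_pool.slots)
		return -1;
	default_pool.nslabs = nslabs;
	return 0;
}

/* Free the dynamic part only; the descriptor itself is never freed. */
static void pool_exit(void)
{
	free(default_pool.slots);
	default_pool.slots = NULL;
	default_pool.nslabs = 0;
}

/* Safe even after pool_exit(): dev->mem still points at valid memory. */
static bool dev_uses_pool(const struct device *dev)
{
	return dev->mem && dev->mem->nslabs != 0;
}
```

Had `default_pool` been heap-allocated and freed in `pool_exit()`, the read of `dev->mem->nslabs` would be exactly the kind of use-after-free that KASAN reported.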
2021-07-23 | bpf: tcp: Support bpf_(get|set)sockopt in bpf tcp iter | Martin KaFai Lau | 2 | -1/+28
This patch allows bpf tcp iter to call bpf_(get|set)sockopt. To allow a specific bpf iter (tcp here) to call a set of helpers, get_func_proto function pointer is added to bpf_iter_reg. The bpf iter is a tracing prog which currently requires CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not impose other capability checks for bpf_(get|set)sockopt. Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Acked-by: Kuniyuki Iwashima <[email protected]> Acked-by: Yonghong Song <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-07-23 | Merge tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds | 7 | -19/+52
Pull tracing fixes from Steven Rostedt:

 - Fix deadloop in ring buffer because of using stale "read" variable
 - Fix synthetic event use of field_pos as boolean and not an index
 - Fixed histogram special var "cpu" overriding event fields called "cpu"
 - Cleaned up error prone logic in alloc_synth_event()
 - Removed call to synchronize_rcu_tasks_rude() when not needed
 - Removed redundant initialization of a local variable "ret"
 - Fixed kernel crash when updating tracepoint callbacks of different priorities.

* tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoints: Update static_call before tp_funcs when adding a tracepoint
  ftrace: Remove redundant initialization of variable ret
  ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
  tracing: Clean up alloc_synth_event()
  tracing/histogram: Rename "cpu" to "common_cpu"
  tracing: Synthetic event field_pos is an index not a boolean
  tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
2021-07-23 | signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency | Eric W. Biederman | 1 | -5/+5
It helps to know which part of the siginfo structure the siginfo_layout value is talking about. v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/87zgumw8cc.fsf_-_@disp2133 Acked-by: Marco Elver <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2021-07-23 | signal: Remove the generic __ARCH_SI_TRAPNO support | Eric W. Biederman | 1 | -14/+0
Now that __ARCH_SI_TRAPNO is no longer set by any architecture remove all of the code it enabled from the kernel.

On alpha and sparc a more explicit approach is used instead: send_sig_fault_trapno or force_sig_fault_trapno is called in the very limited circumstances where si_trapno was set to a non-zero value.

The generic support that is being removed always set si_trapno on all fault signals. With only SIGILL ILL_ILLTRAP on sparc and SIGFPE and SIGTRAP TRAP_UNK on alpha providing si_trapno values, asking all senders of fault signals to provide an si_trapno value does not make sense.

Making si_trapno an ordinary extension of the fault siginfo layout has enabled the architecture generic implementation of SIGTRAP TRAP_PERF, and enables other faulting signals to grow architecture generic senders as well.

v1: https://lkml.kernel.org/r/[email protected]
v2: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/87bl73xx6x.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-07-23 | signal/alpha: si_trapno is only used with SIGFPE and SIGTRAP TRAP_UNK | Eric W. Biederman | 1 | -0/+21
While reviewing the signal handlers on alpha it became clear that si_trapno is only set to a non-zero value when sending SIGFPE and when sending SIGTRAP with si_code TRAP_UNK.

Add send_sig_fault_trapno and send SIGTRAP TRAP_UNK, and SIGFPE with it.

Remove the define of __ARCH_SI_TRAPNO and remove the always zero si_trapno parameter from send_sig_fault and force_sig_fault.

v1: https://lkml.kernel.org/r/[email protected]
v2: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/87h7gvxx7l.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-07-23 | signal/sparc: si_trapno is only used with SIGILL ILL_ILLTRP | Eric W. Biederman | 1 | -0/+19
While reviewing the signal handlers on sparc it became clear that si_trapno is only set to a non-zero value when sending SIGILL with si_code ILL_ILLTRP. Add force_sig_fault_trapno and send SIGILL ILL_ILLTRP with it. Remove the define of __ARCH_SI_TRAPNO and remove the always zero si_trapno parameter from send_sig_fault and force_sig_fault. v1: https://lkml.kernel.org/r/[email protected] v2: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/87mtqnxx89.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-07-23 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | David S. Miller | 7 | -16/+16
Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2021-07-23 | tracepoints: Update static_call before tp_funcs when adding a tracepoint | Steven Rostedt (VMware) | 1 | -1/+1
Because of the significant overhead that retpolines pose on indirect calls, the tracepoint code was updated to use the new "static_calls" that can modify the running code to directly call a function instead of using an indirect caller, and this function can be changed at runtime.

In the tracepoint code that calls all the registered callbacks that are attached to a tracepoint, the following is done:

	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
	if (it_func_ptr) {
		__data = (it_func_ptr)->data;
		static_call(tp_func_##name)(__data, args);
	}

If there's just a single callback, the static_call is updated to just call that callback directly. Once another handler is added, then the static caller is updated to call the iterator, that simply loops over all the funcs in the array and calls each of the callbacks like the old method using indirect calling.

The issue was discovered with a race between updating the funcs array and updating the static_call. The funcs array was updated first and then the static_call was updated. This is not an issue as long as the first element in the old array is the same as the first element in the new array. But that assumption is incorrect, because callbacks also have a priority field, and if there's a callback added that has a higher priority than the callback on the old array, then it will become the first callback in the new array. This means that it is possible to call the old callback with the new callback data element, which can cause a kernel panic.

	static_call = callback1()
	funcs[] = {callback1,data1};
	callback2 has higher priority than callback1

	CPU 1                                   CPU 2
	-----                                   -----

	new_funcs = {callback2,data2},
	            {callback1,data1}

	rcu_assign_pointer(tp->funcs, new_funcs);

	/*
	 * Now tp->funcs has the new array
	 * but the static_call still calls callback1
	 */

	                                        it_func_ptr = tp->funcs [ new_funcs ]
	                                        data = it_func_ptr->data [ data2 ]
	                                        static_call(callback1, data);

	                                        /* Now callback1 is called with
	                                         * callback2's data */

	                                        [ KERNEL PANIC ]

	update_static_call(iterator);

To prevent this from happening, always switch the static_call to the iterator before assigning the tp->funcs to the new array. The iterator will always properly match the callback with its data.

To trigger this bug:

In one terminal:

	while :; do hackbench 50; done

In another terminal

	echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
	while :; do
		echo 1 > /sys/kernel/tracing/set_event_pid;
		sleep 0.5
		echo 0 > /sys/kernel/tracing/set_event_pid;
		sleep 0.5
	done

And it doesn't take long to crash. This is because the set_event_pid adds a callback to the sched_waking tracepoint with a high priority, which will be called before the sched_waking trace event callback is called.

Note, the removal to a single callback updates the array first, before changing the static_call to single callback, which is the proper order as the first element in the array is the same as what the static_call is being changed to.

Link: https://lore.kernel.org/io-uring/[email protected]/
Cc: [email protected]
Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <[email protected]>
Tested-by: Stefan Metzmacher <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
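The ordering argument can be replayed in a single-threaded userspace model, where "another CPU fires the tracepoint" is simulated by calling `tp_fire()` between the two update steps. Everything below (`tp_fire()`, `tp_add_step1()`, `tp_add_step2()`, the demo callbacks and tags) is a hypothetical simplification: a plain function pointer stands in for the static_call, and there is no RCU or real concurrency:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*cb_fn)(void *data);

struct tp_func {
	cb_fn func;   /* NULL terminates the array */
	void *data;
};

static struct tp_func *tp_funcs;  /* current callback array */
static cb_fn tp_static_call;      /* one callback directly, or the iterator */

/* Iterator: ignores its argument and pairs each callback with its own data. */
static void tp_iterator(void *ignored)
{
	struct tp_func *it = tp_funcs;

	(void)ignored;
	for (; it && it->func; it++)
		it->func(it->data);
}

/* What a tracepoint hit does: first element's data goes to the static call. */
static void tp_fire(void)
{
	struct tp_func *it = tp_funcs;

	assert(tp_static_call);
	if (it)
		tp_static_call(it->data);
}

/* Adding a callback, split in two so a hit can be simulated in between. */
static void tp_add_step1(void)
{
	tp_static_call = tp_iterator;  /* fixed order: switch the call first */
}

static void tp_add_step2(struct tp_func *new_funcs)
{
	tp_funcs = new_funcs;          /* then publish the new array */
}

/* Demo callbacks that record being handed somebody else's data. */
static int calls, mismatches;
static int tag1 = 1, tag2 = 2;

static void callback1(void *data)
{
	calls++;
	if (data != &tag1)
		mismatches++;
}

static void callback2(void *data)
{
	calls++;
	if (data != &tag2)
		mismatches++;
}
```

With step 1 before step 2, a hit at any point sees a consistent callback and data pair. Running step 2 first while the direct call still targets `callback1` hands it `callback2`'s data, which is the mismatch the commit fixes.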
2021-07-23 | ftrace: Remove redundant initialization of variable ret | Colin Ian King | 1 | -1/+1
The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Link: https://lkml.kernel.org/r/[email protected] Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-07-23 | ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary | Nicolas Saenz Julienne | 1 | -1/+2
synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on all CPUs. It is a costly operation and, when targeting nohz_full CPUs, very disrupting (hence the name). So avoid calling it when 'old_hash' doesn't need to be freed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nicolas Saenz Julienne <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-07-23 | tracing: Clean up alloc_synth_event() | Steven Rostedt (VMware) | 1 | -5/+3
alloc_synth_event() currently has the following code to initialize the event fields and dynamic_fields:

	for (i = 0, j = 0; i < n_fields; i++) {
		event->fields[i] = fields[i];

		if (fields[i]->is_dynamic) {
			event->dynamic_fields[j] = fields[i];
			event->dynamic_fields[j]->field_pos = i;
			event->dynamic_fields[j++] = fields[i];
			event->n_dynamic_fields++;
		}
	}

1) It would make more sense to have all fields keep track of their field_pos.

2) event->dynamic_fields[j] is assigned twice for no reason.

3) We can move updating event->n_dynamic_fields outside the loop, and just assign it to j.

This combination makes the code much cleaner.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
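Applying the three points above to the quoted loop gives something like the following. This is a sketch built on cut-down, hypothetical structures (`MAX_FIELDS` and `init_fields()` are invented here), not the code that was merged:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FIELDS 8  /* illustrative fixed capacity */

/* Cut-down stand-ins for the synthetic event structures. */
struct synth_field {
	bool is_dynamic;
	unsigned int field_pos;
};

struct synth_event {
	struct synth_field *fields[MAX_FIELDS];
	struct synth_field *dynamic_fields[MAX_FIELDS];
	unsigned int n_fields;
	unsigned int n_dynamic_fields;
};

static void init_fields(struct synth_event *event,
			struct synth_field **fields, unsigned int n_fields)
{
	unsigned int i, j;

	assert(n_fields <= MAX_FIELDS);
	for (i = 0, j = 0; i < n_fields; i++) {
		fields[i]->field_pos = i;         /* 1) every field records its position */
		event->fields[i] = fields[i];

		if (fields[i]->is_dynamic)
			event->dynamic_fields[j++] = fields[i];  /* 2) assigned once */
	}
	event->n_fields = n_fields;
	event->n_dynamic_fields = j;              /* 3) set once, after the loop */
}
```

The loop body shrinks to one unconditional position update and one conditional append, and the dynamic count falls out of `j` for free.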
2021-07-23 | tracing/histogram: Rename "cpu" to "common_cpu" | Steven Rostedt (VMware) | 2 | -6/+20
Currently the histogram logic allows the user to write "cpu" in as an event field, and it will record the CPU that the event happened on.

The problem with this is that there's a lot of events that have "cpu" as a real field, and using "cpu" as the CPU it ran on, makes it impossible to run histograms on the "cpu" field of events.

For example, if I want to have a histogram on the count of the workqueue_queue_work event on its cpu field, running:

 ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger

Gives a misleading and wrong result.

Change the command to "common_cpu" as no event should have "common_*" fields as that's a reserved name for fields used by all events. And this makes sense here as common_cpu would be a field used by all events.

Now we can even do:

 ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
 ># cat events/workqueue/workqueue_queue_work/hist
 # event histogram
 #
 # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
 #

 { common_cpu: 0, cpu: 2 } hitcount: 1
 { common_cpu: 0, cpu: 4 } hitcount: 1
 { common_cpu: 7, cpu: 7 } hitcount: 1
 { common_cpu: 0, cpu: 7 } hitcount: 1
 { common_cpu: 0, cpu: 1 } hitcount: 1
 { common_cpu: 0, cpu: 6 } hitcount: 2
 { common_cpu: 0, cpu: 5 } hitcount: 2
 { common_cpu: 1, cpu: 1 } hitcount: 4
 { common_cpu: 6, cpu: 6 } hitcount: 4
 { common_cpu: 5, cpu: 5 } hitcount: 14
 { common_cpu: 4, cpu: 4 } hitcount: 26
 { common_cpu: 0, cpu: 0 } hitcount: 39
 { common_cpu: 2, cpu: 2 } hitcount: 184

Now for backward compatibility, I added a trick. If "cpu" is used, and the field is not found, it will fall back to "common_cpu" and work as it did before. This way, it will still work for old programs that use "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it will get that event's "cpu" field, which is probably what it wants anyway.

I updated the tracefs/README to include documentation about both the common_timestamp and the common_cpu. This way, if that text is present in the README, then an application can know that common_cpu is supported over just plain "cpu".

Link: https://lkml.kernel.org/r/[email protected]
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Fixes: 8b7622bf94a44 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
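The backward-compatibility rule above amounts to a lookup order: the reserved name first, then the event's own fields, and only then the legacy meaning of "cpu". A small sketch of that order, with a hypothetical `resolve_key()` and made-up field lists:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum field_kind { FIELD_NONE, FIELD_EVENT, FIELD_COMMON_CPU };

/*
 * Hypothetical resolver for a histogram key name.
 * event_fields lists the names of the event's own fields.
 */
static enum field_kind resolve_key(const char *key,
				   const char *const *event_fields, size_t n)
{
	size_t i;

	assert(key);

	/* The reserved "common_" name always means the CPU the event ran on. */
	if (strcmp(key, "common_cpu") == 0)
		return FIELD_COMMON_CPU;

	/* An event's own field wins over the legacy meaning of "cpu". */
	for (i = 0; i < n; i++)
		if (strcmp(key, event_fields[i]) == 0)
			return FIELD_EVENT;

	/* Fallback: plain "cpu" on an event without such a field. */
	if (strcmp(key, "cpu") == 0)
		return FIELD_COMMON_CPU;

	return FIELD_NONE;
}
```

For an event that has its own "cpu" field, `resolve_key("cpu", ...)` yields `FIELD_EVENT`; for one that does not, the same key falls back to `FIELD_COMMON_CPU`, so older scripts keep working.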
2021-07-23 | tracing: Synthetic event field_pos is an index not a boolean | Steven Rostedt (VMware) | 1 | -1/+1
Performing the following:

 ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
 ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
 ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
      > events/sched/sched_switch/trigger
 ># echo 1 > events/synthetic/enable

Crashed the kernel:

 BUG: kernel NULL pointer dereference, address: 000000000000001b
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:strlen+0x0/0x20
 Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
 RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
 RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
 RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
 RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
 R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
 R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
 FS: 0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
 Call Trace:
  trace_event_raw_event_synth+0x90/0x1d0
  action_trace+0x5b/0x70
  event_hist_trigger+0x4bd/0x4e0
  ? cpumask_next_and+0x20/0x30
  ? update_sd_lb_stats.constprop.0+0xf6/0x840
  ? __lock_acquire.constprop.0+0x125/0x550
  ? find_held_lock+0x32/0x90
  ? sched_clock_cpu+0xe/0xd0
  ? lock_release+0x155/0x440
  ? update_load_avg+0x8c/0x6f0
  ? enqueue_entity+0x18a/0x920
  ? __rb_reserve_next+0xe5/0x460
  ? ring_buffer_lock_reserve+0x12a/0x3f0
  event_triggers_call+0x52/0xe0
  trace_event_buffer_commit+0x1ae/0x240
  trace_event_raw_event_sched_switch+0x114/0x170
  __traceiter_sched_switch+0x39/0x50
  __schedule+0x431/0xb00
  schedule_idle+0x28/0x40
  do_idle+0x198/0x2e0
  cpu_startup_entry+0x19/0x20
  secondary_startup_64_no_verify+0xc2/0xcb

The reason is that the dynamic events array keeps track of the field position of the fields array, via the field_pos variable in the synth_field structure. Unfortunately, that field is a boolean for some reason, which means any field_pos greater than 1 will be a bug (in this case it was 2).

Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events")
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
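Why a boolean cannot hold an index: in C, converting any non-zero value to `bool` yields 1, so position 2 silently becomes 1. The two structures below are hypothetical reductions, and using `unsigned int` for the fixed field is an assumption of this sketch rather than something the log states:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reductions of the field structure. */
struct field_buggy {
	bool field_pos;          /* can only ever hold 0 or 1 */
};

struct field_fixed {
	unsigned int field_pos;  /* assumed index type for illustration */
};

/* Store a position in the boolean member and read it back. */
static unsigned int store_pos_buggy(unsigned int pos)
{
	struct field_buggy f;

	f.field_pos = pos;       /* non-zero values collapse to 1 */
	return f.field_pos;
}

/* Same round trip through an integer member. */
static unsigned int store_pos_fixed(unsigned int pos)
{
	struct field_fixed f;

	f.field_pos = pos;
	return f.field_pos;
}

/* Self-check that can be called from a test or from main(). */
static void check_field_pos(void)
{
	assert(store_pos_buggy(2) == 1);  /* index 2 is lost */
	assert(store_pos_fixed(2) == 2);
}
```

Positions 0 and 1 survive the boolean round trip, which matches the observation that only a field_pos greater than 1 triggers the bug.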
2021-07-22bpf: Remove redundant intiialization of variable stypeColin Ian King1-1/+1
The variable stype is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-07-22bpf: Fix pointer cast warningArnd Bergmann1-1/+1
kp->addr is a pointer, so it cannot be cast directly to a 'u64' when it gets interpreted as an integer value: kernel/trace/bpf_trace.c: In function '____bpf_get_func_ip_kprobe': kernel/trace/bpf_trace.c:968:21: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast] 968 | return kp ? (u64) kp->addr : 0; Use the uintptr_t type instead. Fixes: 9ffd9f3ff719 ("bpf: Add bpf_get_func_ip helper for kprobe programs") Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
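The cast pattern named in this commit can be sketched in plain C. This is an illustration under assumptions: u64 is a user-space typedef, and struct kprobe_like and probe_ip() are hypothetical names that only mirror the shape of the quoted expression.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;   /* user-space stand-in for the kernel's u64 */

/* Hypothetical miniature of a kprobe: only the address member is kept. */
struct kprobe_like {
	void *addr;
};

/* Return the probe address as a 64-bit integer, or 0 without a probe. */
u64 probe_ip(const struct kprobe_like *kp)
{
	/*
	 * Converting through uintptr_t first keeps the pointer-to-integer
	 * step at pointer width; the widening to u64 is then an ordinary
	 * integer conversion, so 32-bit builds do not warn.
	 */
	return kp ? (u64)(uintptr_t)kp->addr : 0;
}
```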
2021-07-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds1-0/+2
Pull networking fixes from David Miller:

 1) Fix type of bind option flag in af_xdp, from Baruch Siach.
 2) Fix use after free in bpf_xdp_link_release(), from Xuan Zhao.
 3) PM refcnt imbalance in r8152, from Takashi Iwai.
 4) Sign extension bug in liquidio, from Colin Ian King.
 5) Missing range check in s390 bpf jit, from Colin Ian King.
 6) Uninit value in caif_seqpkt_sendmsg(), from Ziyong Xuan.
 7) Fix skb page recycling race, from Ilias Apalodimas.
 8) Fix memory leak in tcindex_partial_destroy_work, from Pavel Skripkin.
 9) netrom timer sk refcnt issues, from Nguyen Dinh Phi.
 10) Fix data races around tcp's tfo_active_disable_stamp, from Eric Dumazet.
 11) act_skbmod should only operate on ethernet packets, from Peilin Ye.
 12) Fix slab out-of-bounds in fib6_nh_flush_exceptions(), from Paolo Abeni.
 13) Fix sparx5 dependencies, from Yajun Deng.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
  dpaa2-switch: seed the buffer pool after allocating the swp
  net: sched: cls_api: Fix the the wrong parameter
  net: sparx5: fix unmet dependencies warning
  net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
  net: dsa: ensure linearized SKBs in case of tail taggers
  ravb: Remove extra TAB
  ravb: Fix a typo in comment
  net: dsa: sja1105: make VID 4095 a bridge VLAN too
  tcp: disable TFO blackhole logic by default
  sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set
  net: ixp46x: fix ptp build failure
  ibmvnic: Remove the proper scrq flush
  selftests: net: add ESP-in-UDP PMTU test
  udp: check encap socket in __udp_lib_err
  sctp: update active_key for asoc when old key is being replaced
  r8169: Avoid duplicate sysfs entry creation error
  ixgbe: Fix packet corruption due to missing DMA sync
  Revert "qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()"
  ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
  fsl/fman: Add fibre support
  ...
2021-07-22tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.Haoran Luo1-4/+24
"rb_per_cpu_empty()" misinterprets the buffer as not empty when "head_page" and "commit_page" of "struct ring_buffer_per_cpu" point to the same buffer page, whose "buffer_data_page" is empty and whose "read" field is non-zero.

An error scenario can be constructed as follows (kernel perspective):

1. All pages in the buffer have been accessed by reader(s), so that all of them have a non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" returns 0, indicating that there is no more data to read. It is also required that "read_page", "commit_page" and "tail_page" point to the same page, while "head_page" is the page that follows them.
3. Invoke "ring_buffer_lock_reserve()" with a "length" large enough that it shoots past the end of the current tail buffer page. Now "head_page", "commit_page" and "tail_page" point to the same page.
4. Discard the current event with "ring_buffer_discard_commit()", so that "head_page", "commit_page" and "tail_page" point to a page whose buffer data page is now empty.

Once the error scenario has been constructed, "tracing_read_pipe" is trapped inside a deadloop: "trace_empty()" returns 0 because "rb_per_cpu_empty()" returns 0 when it hits the CPU containing the constructed ring buffer. Then "trace_find_next_entry_inc()" always returns NULL because "rb_num_of_entries()" reports that there are no more entries to read. Finally "trace_seq_to_user()" returns "-EBUSY", sending "tracing_read_pipe" back to the start of the "waitagain" loop.

I've also written a proof-of-concept script to construct the scenario and trigger the bug automatically; you can use it to trace and validate my reasoning above: https://github.com/aegistudio/RingBufferDetonator.git

Tests have been carried out on Linux kernel 5.14-rc2 (2734d6c1b1a089fb593ef6a23d4b70903526fe0c), on my fixed version of the kernel (to check whether my update fixes the bug) and on some older kernels (to determine the range of affected kernels). The test results are also attached to the proof-of-concept repository.

Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
Cc: [email protected]
Fixes: bf41a158cacba ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Haoran Luo <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-07-21workqueue: fix UAF in pwq_unbound_release_workfn()Yang Yingliang1-7/+13
I got a UAF report when doing fuzz test: [ 152.880091][ T8030] ================================================================== [ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190 [ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030 [ 152.883578][ T8030] [ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249 [ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 [ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn [ 152.887358][ T8030] Call Trace: [ 152.887837][ T8030] dump_stack_lvl+0x75/0x9b [ 152.888525][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.889371][ T8030] print_address_description.constprop.10+0x48/0x70 [ 152.890326][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891163][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.891999][ T8030] kasan_report.cold.15+0x82/0xdb [ 152.892740][ T8030] ? pwq_unbound_release_workfn+0x50/0x190 [ 152.893594][ T8030] __asan_load4+0x69/0x90 [ 152.894243][ T8030] pwq_unbound_release_workfn+0x50/0x190 [ 152.895057][ T8030] process_one_work+0x47b/0x890 [ 152.895778][ T8030] worker_thread+0x5c/0x790 [ 152.896439][ T8030] ? process_one_work+0x890/0x890 [ 152.897163][ T8030] kthread+0x223/0x250 [ 152.897747][ T8030] ? 
set_kthread_struct+0xb0/0xb0 [ 152.898471][ T8030] ret_from_fork+0x1f/0x30 [ 152.899114][ T8030] [ 152.899446][ T8030] Allocated by task 8884: [ 152.900084][ T8030] kasan_save_stack+0x21/0x50 [ 152.900769][ T8030] __kasan_kmalloc+0x88/0xb0 [ 152.901416][ T8030] __kmalloc+0x29c/0x460 [ 152.902014][ T8030] alloc_workqueue+0x111/0x8e0 [ 152.902690][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.903459][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.904198][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.904929][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.905599][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.906247][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.906916][ T8030] do_syscall_64+0x34/0xb0 [ 152.907535][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.908365][ T8030] [ 152.908688][ T8030] Freed by task 8884: [ 152.909243][ T8030] kasan_save_stack+0x21/0x50 [ 152.909893][ T8030] kasan_set_track+0x20/0x30 [ 152.910541][ T8030] kasan_set_free_info+0x24/0x40 [ 152.911265][ T8030] __kasan_slab_free+0xf7/0x140 [ 152.911964][ T8030] kfree+0x9e/0x3d0 [ 152.912501][ T8030] alloc_workqueue+0x7d7/0x8e0 [ 152.913182][ T8030] __btrfs_alloc_workqueue+0x11e/0x2a0 [ 152.913949][ T8030] btrfs_alloc_workqueue+0x6d/0x1d0 [ 152.914703][ T8030] scrub_workers_get+0x1e8/0x490 [ 152.915402][ T8030] btrfs_scrub_dev+0x1b9/0x9c0 [ 152.916077][ T8030] btrfs_ioctl+0x122c/0x4e50 [ 152.916729][ T8030] __x64_sys_ioctl+0x137/0x190 [ 152.917414][ T8030] do_syscall_64+0x34/0xb0 [ 152.918034][ T8030] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 152.918872][ T8030] [ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00 [ 152.919203][ T8030] which belongs to the cache kmalloc-512 of size 512 [ 152.921155][ T8030] The buggy address is located 256 bytes inside of [ 152.921155][ T8030] 512-byte region [ffff88810d31bc00, ffff88810d31be00) [ 152.922993][ T8030] The buggy address belongs to the page: [ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 
mapping:0000000000000000 index:0x0 pfn:0x10d318 [ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0 [ 152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff) [ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80 [ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 [ 152.929890][ T8030] page dumped because: kasan: bad access detected [ 152.930759][ T8030] [ 152.931076][ T8030] Memory state around the buggy address: [ 152.931851][ T8030] ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.932967][ T8030] ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.935189][ T8030] ^ [ 152.935763][ T8030] ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 152.936847][ T8030] ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 152.937940][ T8030] ==================================================================

If apply_wqattrs_prepare() fails in alloc_workqueue(), it calls put_pwq(), which schedules a work item that runs pwq_unbound_release_workfn() and uses 'wq'. The 'wq' allocated in alloc_workqueue() is freed in the error path when apply_wqattrs_prepare() fails, so this leads to a UAF:

CPU0                                          CPU1
alloc_workqueue()
alloc_and_link_pwqs()
apply_wqattrs_prepare() fails
apply_wqattrs_cleanup()
schedule_work(&pwq->unbound_release_work)
kfree(wq)
                                              worker_thread()
                                              pwq_unbound_release_workfn() <- trigger uaf here

If apply_wqattrs_prepare() fails, the new pwqs are not linked and do not hold any reference to 'wq', so 'wq' is invalid to access in the worker. Fix this by checking whether the pwq is linked before touching 'wq'.
Fixes: 2d5f0764b526 ("workqueue: split apply_workqueue_attrs() into 3 stages") Cc: [email protected] # v4.2+ Reported-by: Hulk Robot <[email protected]> Suggested-by: Lai Jiangshan <[email protected]> Signed-off-by: Yang Yingliang <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]> Tested-by: Pavel Skripkin <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
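The "check whether the pwq is linked" idea from the workqueue fix above can be modelled in user space. This is a rough sketch under stated assumptions, not the kernel implementation: struct node, struct wq_model, struct pwq_model and every function name below are hypothetical, and only the list-membership test mirrors what the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular list node, loosely modelled on the kernel's list_head. */
struct node {
	struct node *next, *prev;
};

static void node_init(struct node *n)
{
	n->next = n;
	n->prev = n;
}

static bool node_is_unlinked(const struct node *n)
{
	return n->next == n;
}

/* Hypothetical, heavily reduced workqueue and pool_workqueue models. */
struct wq_model {
	struct node pwqs;   /* list of linked pwqs */
	int touched;        /* counts accesses made by the release path */
};

struct pwq_model {
	struct node pwqs_node;
	struct wq_model *wq;
};

/* Link a pwq into its workqueue, as happens only on the success path. */
void link_pwq(struct pwq_model *pwq, struct wq_model *wq)
{
	pwq->wq = wq;
	pwq->pwqs_node.next = wq->pwqs.next;
	pwq->pwqs_node.prev = &wq->pwqs;
	wq->pwqs.next->prev = &pwq->pwqs_node;
	wq->pwqs.next = &pwq->pwqs_node;
}

/* Release path: dereference pwq->wq only if the pwq was ever linked. */
bool release_pwq(struct pwq_model *pwq)
{
	if (node_is_unlinked(&pwq->pwqs_node))
		return false;   /* wq may already be freed, so stay away */

	pwq->wq->touched++;
	pwq->pwqs_node.prev->next = pwq->pwqs_node.next;
	pwq->pwqs_node.next->prev = pwq->pwqs_node.prev;
	node_init(&pwq->pwqs_node);
	return true;
}

/* Scenario helper: how often did the release path touch the wq? */
int release_touches(bool linked)
{
	struct wq_model wq = { .touched = 0 };
	struct pwq_model pwq = { .wq = &wq };

	node_init(&wq.pwqs);
	node_init(&pwq.pwqs_node);
	if (linked)
		link_pwq(&pwq, &wq);
	release_pwq(&pwq);
	return wq.touched;
}
```

In the model the workqueue is still alive in both scenarios; the point is only that the unlinked case never reaches the dereference, which is the access KASAN flagged in the report.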
2021-07-21cgroup1: fix leaked context root causing sporadic NULL deref in LTPPaul Gortmaker1-3/+1
Richard reported sporadic (roughly one in 10 or so) null dereferences and other strange behaviour for a set of automated LTP tests. Things like:

 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
 RIP: 0010:kernfs_sop_show_path+0x1b/0x60

...or these others:

 RIP: 0010:do_mkdirat+0x6a/0xf0
 RIP: 0010:d_alloc_parallel+0x98/0x510
 RIP: 0010:do_readlinkat+0x86/0x120

There were other less common instances of some kind of a general scribble but the common theme was mount and cgroup and a dubious dentry triggering the NULL dereference. I was only able to reproduce it under qemu by replicating Richard's setup as closely as possible - I never did get it to happen on bare metal, even while keeping everything else the same.

In commit 71d883c37e8d ("cgroup_do_mount(): massage calling conventions") we see this as a part of the overall change:

--------------
        struct cgroup_subsys *ss;
-       struct dentry *dentry;

[...]

-       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
-                                CGROUP_SUPER_MAGIC, ns);

[...]

-       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
-               struct super_block *sb = dentry->d_sb;
-               dput(dentry);
+       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+               struct super_block *sb = fc->root->d_sb;
+               dput(fc->root);
                deactivate_locked_super(sb);
                msleep(10);
                return restart_syscall();
        }
--------------

In changing from the local "*dentry" variable to using fc->root, we now export/leave that dentry pointer in the file context after doing the dput() in the unlikely "is_dying" case. With LTP doing a crazy amount of back to back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely becomes slightly likely and then bad things happen.

A fix would be to not leave the stale reference in fc->root as follows:

--------------
                dput(fc->root);
+               fc->root = NULL;
                deactivate_locked_super(sb);
--------------

...but then we are just open-coding a duplicate of fc_drop_locked() so we simply use that instead.

Cc: Al Viro <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: [email protected] # v5.1+
Reported-by: Richard Purdie <[email protected]>
Fixes: 71d883c37e8d ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
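The stale fc->root problem in the cgroup1 commit above comes down to a reference being dropped twice through a pointer that was never cleared. Below is a minimal user-space sketch of that pattern; all structure and function names are hypothetical models, and the assumption that a later context teardown drops fc->root again when it is non-NULL is stated here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical models of a dentry and a filesystem context. */
struct dentry_model {
	int refcount;
};

struct fs_context_model {
	struct dentry_model *root;
};

/* Drop one reference, tolerating NULL. */
void dput_model(struct dentry_model *d)
{
	if (d)
		d->refcount--;
}

/* Buggy variant: the reference is dropped but the pointer stays behind. */
void drop_root_buggy(struct fs_context_model *fc)
{
	dput_model(fc->root);
}

/* Fixed variant: drop the reference and clear the pointer together. */
void drop_root_fixed(struct fs_context_model *fc)
{
	dput_model(fc->root);
	fc->root = NULL;
}

/* Assumed teardown step: drops whatever fc->root still points at. */
void put_fs_context_model(struct fs_context_model *fc)
{
	dput_model(fc->root);
	fc->root = NULL;
}

/* Scenario helper: final refcount after the retry path plus teardown. */
int final_refcount(int use_fix)
{
	struct dentry_model d = { .refcount = 1 };
	struct fs_context_model fc = { .root = &d };

	if (use_fix)
		drop_root_fixed(&fc);
	else
		drop_root_buggy(&fc);
	put_fs_context_model(&fc);
	return d.refcount;
}
```

With the buggy variant the count underflows to -1, the model's equivalent of the "dubious dentry" seen in the oopses; with the pointer cleared it ends at 0.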
2021-07-20kcsan: permissive: Ignore data-racy 1-bit value changesMarco Elver2-1/+80
Add rules to ignore data-racy reads with only 1-bit value changes. Details about the rules are captured in comments in kernel/kcsan/permissive.h. More background follows. While investigating a number of data races, we've encountered data-racy accesses on flags variables to be very common. The typical pattern is a reader masking all but one bit, and/or the writer setting/clearing only 1 bit (current->flags being a frequently encountered case; more examples in mm/sl[au]b.c, which disable KCSAN for this reason). Since these types of data-racy accesses are common (with the assumption they are intentional and hard to miscompile) having the option (with CONFIG_KCSAN_PERMISSIVE=y) to filter them will avoid forcing everyone to mark them, and deliberately left to preference at this time. One important motivation for having this option built-in is to move closer to being able to enable KCSAN on CI systems or for testers wishing to test the whole kernel, while more easily filtering less interesting data races with higher probability. For the implementation, we considered several alternatives, but had one major requirement: that the rules be kept together with the Linux-kernel tree. Adding them to the compiler would preclude us from making changes quickly; if the rules require tweaks, having them part of the compiler requires waiting another ~1 year for the next release -- that's not realistic. We are left with the following options: 1. Maintain compiler plugins as part of the kernel-tree that removes instrumentation for some accesses (e.g. plain-& with 1-bit mask). The analysis would be reader-side focused, as no assumption can be made about racing writers. Because it seems unrealistic to maintain 2 plugins, one for LLVM and GCC, we would likely pick LLVM. 
Furthermore, no kernel infrastructure exists to maintain LLVM plugins, and the build-system implications and maintenance overheads do not look great (historically, plugins written against old LLVM APIs are not guaranteed to work with newer LLVM APIs). 2. Find a set of rules that can be expressed in terms of observed value changes, and make it part of the KCSAN runtime. The analysis is writer-side focused, given we rely on observed value changes. The approach taken here is (2). While a complete approach requires both (1) and (2), experiments show that the majority of data races involving trivial bit operations on flags variables can be removed with (2) alone. It goes without saying that the filtering of data races using (1) or (2) does _not_ guarantee they are safe! Therefore, limiting ourselves to (2) for now is the conservative choice for setups that wish to enable CONFIG_KCSAN_PERMISSIVE=y. Signed-off-by: Marco Elver <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
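The core test behind option (2), detecting that an observed value change flipped exactly one bit, can be sketched as follows. This is a simplified illustration only: is_one_bit_change() is a hypothetical name, and the actual rules in kernel/kcsan/permissive.h contain further conditions that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if old_val and new_val differ in exactly one bit position. */
bool is_one_bit_change(uint64_t old_val, uint64_t new_val)
{
	uint64_t diff = old_val ^ new_val;   /* set bits mark changed positions */

	/* A non-zero value with one set bit loses it when ANDed with diff - 1. */
	return diff != 0 && (diff & (diff - 1)) == 0;
}
```

A writer setting a single flag, for example 0x10 becoming 0x30, passes this test, while an unchanged value or a change touching two or more bits does not.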
2021-07-20kcsan: Print if strict or non-strict during initMarco Elver1-0/+9
Show a brief message if KCSAN is strict or non-strict, and if non-strict also say that CONFIG_KCSAN_STRICT=y can be used to see all data races. This is to hint to users of KCSAN who blindly use the default config that their configuration might miss data races of interest. Signed-off-by: Marco Elver <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20kcsan: Rework atomic.h into permissive.hMarco Elver3-32/+71
Rework atomic.h into permissive.h to better reflect its purpose, and introduce kcsan_ignore_address() and kcsan_ignore_data_race(). Introduce CONFIG_KCSAN_PERMISSIVE and update the stub functions in preparation for subsequent changes. As before, developers who choose to use KCSAN in "strict" mode will see all data races and are not affected. Furthermore, by relying on the value-change filter logic for kcsan_ignore_data_race(), even if the permissive rules are enabled, the opt-outs in report.c:skip_report() override them (such as for RCU-related functions by default). The option CONFIG_KCSAN_PERMISSIVE is disabled by default, so that the documented default behaviour of KCSAN does not change. Instead, like CONFIG_KCSAN_IGNORE_ATOMICS, the option needs to be explicitly opted in. Signed-off-by: Marco Elver <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20kcsan: Reduce get_ctx() uses in kcsan_found_watchpoint()Marco Elver1-10/+16
There are a number of get_ctx() calls that are close to each other, which results in poor codegen (repeated preempt_count loads). Specifically in kcsan_found_watchpoint() (even though it's a slow-path) it is beneficial to keep the race-window small until the watchpoint has actually been consumed to avoid missed opportunities to report a race. Let's clean it up a bit before we add more code in kcsan_found_watchpoint(). Signed-off-by: Marco Elver <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20kcsan: Remove CONFIG_KCSAN_DEBUGMarco Elver1-9/+0
By this point CONFIG_KCSAN_DEBUG is pretty useless, as the system just isn't usable with it due to spamming console (I imagine a randconfig test robot will run into this sooner or later). Remove it. Back in 2019 I used it occasionally to record traces of watchpoints and verify the encoding is correct, but these days we have proper tests. If something similar is needed in future, just add it back ad-hoc. Signed-off-by: Marco Elver <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20rcu: Fix macro name CONFIG_TASKS_RCU_TRACEZhouyi Zhou1-4/+4
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from memory-ordering bugs that could result in false-positive quiescent states and too-short grace periods. Signed-off-by: Zhouyi Zhou <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
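A misspelled configuration macro fails silently because the preprocessor treats an undefined name as "not set" rather than as an error. The sketch below shows the effect in isolation; the #define stands in for what the build system would provide, and the two function names are hypothetical.

```c
#include <assert.h>

/* Assume the build enabled the correctly spelled option. */
#define CONFIG_TASKS_TRACE_RCU 1

/* Misspelled test: this macro is never defined, so the branch is dropped. */
int feature_enabled_typo(void)
{
#ifdef CONFIG_TASKS_RCU_TRACE
	return 1;
#else
	return 0;
#endif
}

/* Correctly spelled test: the guarded code is compiled in. */
int feature_enabled_fixed(void)
{
#ifdef CONFIG_TASKS_TRACE_RCU
	return 1;
#else
	return 0;
#endif
}
```

No diagnostic is produced for the typo, which is how code guarded this way can be left out of a kernel that has the option enabled.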
2021-07-20rcu-tasks: Fix synchronize_rcu_rude() typo in commentPaul E. McKenney1-2/+2
This commit replaces the fictitious synchronize_rcu_rude() function with its real-world synchronize_rcu_tasks_rude() counterpart. Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20rcu-tasks: Mark ->trc_reader_special.b.need_qs data racesPaul E. McKenney1-4/+4
There are several ->trc_reader_special.b.need_qs data races that are too low-probability for KCSAN to notice, but which will happen sooner or later. This commit therefore marks these accesses. Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20rcu-tasks: Mark ->trc_reader_nesting data racesPaul E. McKenney1-5/+6
There are several ->trc_reader_nesting data races that are too low-probability for KCSAN to notice, but which will happen sooner or later. This commit therefore marks these accesses, and comments one that cannot race. Signed-off-by: Paul E. McKenney <[email protected]>
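"Marking" an access in the two rcu-tasks commits above means making an intentionally racy load or store explicit, in the kernel typically with READ_ONCE() and WRITE_ONCE(). The sketch below uses user-space stand-ins for those macros (GNU C, relying on __typeof__) and a hypothetical one-field task model; which accesses the commits actually changed is not shown in the log and is not implied here.

```c
#include <assert.h>

/* User-space stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(). */
#define READ_ONCE_MODEL(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_MODEL(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical reduction of task_struct to the one field of interest. */
struct task_model {
	int trc_reader_nesting;
};

/* Reader entry: marked load followed by a marked store. */
void reader_enter(struct task_model *t)
{
	WRITE_ONCE_MODEL(t->trc_reader_nesting,
			 READ_ONCE_MODEL(t->trc_reader_nesting) + 1);
}

/* Reader exit: the mirror image of reader_enter(). */
void reader_exit(struct task_model *t)
{
	WRITE_ONCE_MODEL(t->trc_reader_nesting,
			 READ_ONCE_MODEL(t->trc_reader_nesting) - 1);
}

/* A remote observer samples the field with one marked load. */
int reader_is_active(struct task_model *t)
{
	return READ_ONCE_MODEL(t->trc_reader_nesting) != 0;
}

/* Scenario helper: is the reader still active after the given calls? */
int active_after(int enters, int exits)
{
	struct task_model t = { .trc_reader_nesting = 0 };

	while (enters-- > 0)
		reader_enter(&t);
	while (exits-- > 0)
		reader_exit(&t);
	return reader_is_active(&t);
}
```

The volatile casts stop the compiler from tearing, fusing or re-reading the access; they do not make the read-modify-write in reader_enter() atomic, which is acceptable in this model because only the owning task updates its own counter.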
2021-07-20rcu-tasks: Add comments explaining task_struct strategyPaul E. McKenney1-1/+10
Accesses to task_struct structures must be either protected by RCU or by get_task_struct(). Tasks trace RCU uses these in a non-obvious combination, in conjunction with an IPI handler. This commit therefore adds comments explaining this usage. Signed-off-by: Paul E. McKenney <[email protected]>
2021-07-20rcu/nocb: Remove NOCB deferred wakeup from rcutree_dead_cpu()Frederic Weisbecker1-3/+0
At CPU offline time, we must handle any pending wakeup for the nocb_gp kthread linked to the outgoing CPU. Now we are making sure of that twice: 1) From rcu_report_dead() when the outgoing CPU makes the very last local cleanups by itself before switching offline. 2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly now offline. Another CPU takes care of the post-mortem cleanup and checks whether the offline CPU had a pending wakeup. Both ways are fine but we have to choose one or the other because we don't need to repeat that action. Simply benefit from cache locality and keep only the first solution. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>