Age | Commit message (Collapse) | Author | Files | Lines |
|
clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
to check whether high resolution mode is active or not. It does not matter
whether the task migrates after acquiring the pointer.
Fixes: e71a4153b7c2 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
Reported-by: Mike Galbraith <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/875ywacsmb.ffs@tglx
|
|
The following commands:
# echo 'read_max u64 size;' > synthetic_events
# echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger
Causes:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
RIP: 0010:strcmp+0xc/0x20
Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8
R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210
FS: 00007f1fa6785740(0000) GS:ffff9714da400000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0
Call Trace:
__find_event_file+0x4e/0x80
action_create+0x6b7/0xeb0
? kstrdup+0x44/0x60
event_hist_trigger_func+0x1a07/0x2130
trigger_process_regex+0xbd/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xe9/0x310
ksys_write+0x68/0xe0
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f1fa6879e87
The problem was the "trace(read_max,count)" where the "count" should be
"$count" as "onmax()" only handles variables (although it really should be
able to figure out that "count" is a field of sys_enter_read). But there's
a path that does not find the variable and ends up passing a NULL for the
event, which ends up getting passed to "strcmp()".
Add a check for NULL to return and error on the command with:
# cat error_log
hist:syscalls:sys_enter_read: error: Couldn't create or find variable
Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)
^
Link: https://lkml.kernel.org/r/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: [email protected]
Fixes: 50450603ec9cb tracing: Add 'onmax' hist trigger action support
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Commit 2860cd8a2353 ("livepatch: Use the default ftrace_ops instead of
REGS when ARGS is available") intends to enable config LIVEPATCH when
ftrace with ARGS is available. However, the chain of configs to enable
LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available,
but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE
and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing
in the commit.
Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns:
DYNAMIC_FTRACE_WITH_ARGS
Referencing files: kernel/livepatch/Kconfig
So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already
existing similar configs, DYNAMIC_FTRACE_WITH_REGS and
DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the
chain of configs.
Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/
Link: https://lkml.kernel.org/r/[email protected]
Cc: Josh Poimboeuf <[email protected]>
Cc: Jiri Kosina <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Miroslav Benes <[email protected]>
Cc: [email protected]
Fixes: 2860cd8a2353 ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available")
Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
When using osnoise/timerlat with stop tracing, sometimes it is
not clear in which CPU the stop condition was hit, mainly
when using some extra events.
Print a message informing in which CPU the trace stopped, like
in the example below:
<idle>-0 [006] d.h. 2932.676616: #1672599 context irq timer_latency 34689 ns
<idle>-0 [006] dNh. 2932.676618: irq_noise: local_timer:236 start 2932.676615639 duration 2391 ns
<idle>-0 [006] dNh. 2932.676620: irq_noise: virtio0-output.0:47 start 2932.676620180 duration 86 ns
<idle>-0 [003] d.h. 2932.676621: #1673374 context irq timer_latency 1200 ns
<idle>-0 [006] d... 2932.676623: thread_noise: swapper/6:0 start 2932.676615964 duration 4339 ns
<idle>-0 [003] dNh. 2932.676623: irq_noise: local_timer:236 start 2932.676620597 duration 1881 ns
<idle>-0 [006] d... 2932.676623: sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/6 next_pid=852 next_prio=4
timerlat/6-852 [006] .... 2932.676623: #1672599 context thread timer_latency 41931 ns
<idle>-0 [003] d... 2932.676623: thread_noise: swapper/3:0 start 2932.676620854 duration 880 ns
<idle>-0 [003] d... 2932.676624: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/3 next_pid=849 next_prio=4
timerlat/6-852 [006] .... 2932.676624: timerlat_main: stop tracing hit on cpu 6
timerlat/3-849 [003] .... 2932.676624: #1673374 context thread timer_latency 4310 ns
Link: https://lkml.kernel.org/r/b30a0d7542adba019185f44ee648e60e14923b11.1626598844.git.bristot@kernel.org
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.
Without printing these fields, the timerlat specific fields are
shifted by three positions, for example:
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# || /
# |||| ACTIVATION
# TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
# | | | |||| | | | |
<idle>-0 [000] d..h... 3279.798871: #1 context irq timer_latency 830 ns
<...>-807 [000] ....... 3279.798881: #1 context thread timer_latency 11301 ns
Add a new header for timerlat with the missing fields, to be used
when the PREEMPT_RT is enabled.
Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.
Without printing these fields, the osnoise specific fields are
shifted by three positions, for example:
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE %% OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-741 [000] ....... 1105.690909: 1000000 234 99.97660 36 21 0 1001 22 3
<...>-742 [001] ....... 1105.691923: 1000000 281 99.97190 197 7 0 1012 35 14
<...>-743 [002] ....... 1105.691958: 1000000 1324 99.86760 118 11 0 1016 155 143
<...>-744 [003] ....... 1105.691998: 1000000 109 99.98910 21 4 0 1004 33 7
<...>-745 [004] ....... 1105.692015: 1000000 2023 99.79770 97 37 0 1023 52 18
Add a new header for osnoise with the missing fields, to be used
when the PREEMPT_RT is enabled.
Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org
Cc: Tom Zanussi <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
litte endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
proper.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff. Which unfortunately broken everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
|
|
in copy_process(): non root users but with capability CAP_SYS_RESOURCE
or CAP_SYS_ADMIN will clean PF_NPROC_EXCEEDED flag even
rlimit(RLIMIT_NPROC) exceeds. Add the same capability check logic here.
Align the permission checks in copy_process() and set_user(). In
copy_process() CAP_SYS_RESOURCE or CAP_SYS_ADMIN capable users will be
able to circumvent and clear the PF_NPROC_EXCEEDED flag whereas they
aren't able to the same in set_user(). There's no obvious logic to this
and trying to unearth the reason in the thread didn't go anywhere.
The gist seems to be that this code wants to make sure that a program
can't successfully exec if it has gone through a set*id() transition
while exceeding its RLIMIT_NPROC.
A capable but non-INIT_USER caller getting PF_NPROC_EXCEEDED set during
a set*id() transition wouldn't be able to exec right away if they still
exceed their RLIMIT_NPROC at the time of exec. So their exec would fail
in fs/exec.c:
if ((current->flags & PF_NPROC_EXCEEDED) &&
is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
However, if the caller were to fork() right after the set*id()
transition but before the exec while still exceeding their RLIMIT_NPROC
then they would get PF_NPROC_EXCEEDED cleared (while the child would
inherit it):
retval = -EAGAIN;
if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_free;
}
current->flags &= ~PF_NPROC_EXCEEDED;
which means a subsequent exec by the capable caller would now succeed
even though they could still exceed their RLIMIT_NPROC limit. This seems
inconsistent. Allow a CAP_SYS_ADMIN or CAP_SYS_RESOURCE capable user to
avoid PF_NPROC_EXCEEDED as they already can in copy_process().
Cc: [email protected], [email protected], [email protected], Ran Xiaokai <[email protected]>, , ,
Link: https://lore.kernel.org/r/[email protected]
Cc: Neil Brown <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: James Morris <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Ran Xiaokai <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Steffen Klassert <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Daniel Jordan <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
- Fix typo in user notification documentation (Rodrigo Campos)
- Fix userspace counter report when using TSYNC (Hsuan-Chi Kuo, Wiktor
Garbacz)
* tag 'seccomp-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
seccomp: Fix setting loaded filter count during TSYNC
Documentation: seccomp: Fix typo in user notification
|
|
If rcu_read_lock_sched tracing is enabled, the tracing subsystem can
perform a jump which needs to be checked by CFI. For example, stm_ftrace
source is enabled as a module and hooks into enabled ftrace events. This
can cause an recursive loop where find_shadow_check_fn ->
rcu_read_lock_sched -> (call to stm_ftrace generates cfi slowpath) ->
find_shadow_check_fn -> rcu_read_lock_sched -> ...
To avoid the recursion, either the ftrace codes needs to be marked with
__no_cfi or CFI should not trace. Use the "_notrace" in CFI to avoid
tracing so that CFI can guard ftrace.
Signed-off-by: Elliot Berman <[email protected]>
Reviewed-by: Sami Tolvanen <[email protected]>
Cc: [email protected]
Fixes: cf68fffb66d6 ("add support for Clang CFI")
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The desired behavior is to set the caller's filter count to thread's.
This value is reported via /proc, so this fixes the inaccurate count
exposed to userspace; it is not used for reference counting, etc.
Signed-off-by: Hsuan-Chi Kuo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Co-developed-by: Wiktor Garbacz <[email protected]>
Signed-off-by: Wiktor Garbacz <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]
Signed-off-by: Kees Cook <[email protected]>
Cc: [email protected]
Fixes: c818c03b661c ("seccomp: Report number of loaded filters in /proc/$pid/status")
|
|
Currently, if bpf_get_current_cgroup_id() or
bpf_get_current_ancestor_cgroup_id() helper is
called with sleepable programs e.g., sleepable
fentry/fmod_ret/fexit/lsm programs, a rcu warning
may appear. For example, if I added the following
hack to test_progs/test_lsm sleepable fentry program
test_sys_setdomainname:
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -168,6 +168,10 @@ int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
int buf = 0;
long ret;
+ __u64 cg_id = bpf_get_current_cgroup_id();
+ if (cg_id == 1000)
+ copy_test++;
+
ret = bpf_copy_from_user(&buf, sizeof(buf), ptr);
if (len == -2 && ret == 0 && buf == 1234)
copy_test++;
I will hit the following rcu warning:
include/linux/cgroup.h:481 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by test_progs/260:
#0: ffffffffa5173360 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xa0
stack backtrace:
CPU: 1 PID: 260 Comm: test_progs Tainted: G O 5.14.0-rc2+ #176
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x7b
bpf_get_current_cgroup_id+0x9c/0xb1
bpf_prog_a29888d1c6706e09_test_sys_setdomainname+0x3e/0x89c
bpf_trampoline_6442469132_0+0x2d/0x1000
__x64_sys_setdomainname+0x5/0x110
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
I can get similar warning using bpf_get_current_ancestor_cgroup_id() helper.
syzbot reported a similar issue in [1] for syscall program. Helper
bpf_get_current_cgroup_id() or bpf_get_current_ancestor_cgroup_id()
has the following callchain:
task_dfl_cgroup
task_css_set
task_css_set_check
and we have
#define task_css_set_check(task, __c) \
rcu_dereference_check((task)->cgroups, \
lockdep_is_held(&cgroup_mutex) || \
lockdep_is_held(&css_set_lock) || \
((task)->flags & PF_EXITING) || (__c))
Since cgroup_mutex/css_set_lock is not held and the task
is not existing and rcu read_lock is not held, a warning
will be issued. Note that bpf sleepable program is protected by
rcu_read_lock_trace().
The above sleepable bpf programs are already protected
by migrate_disable(). Adding rcu_read_lock() in these
two helpers will silence the above warning.
I marked the patch fixing 95b861a7935b
("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
which added bpf_get_current_ancestor_cgroup_id() to tracing programs
in 5.14. I think backporting 5.14 is probably good enough as sleepable
progrems are not widely used.
This patch should fix [1] as well since syscall program is a sleepable
program protected with migrate_disable().
[1] https://lore.kernel.org/bpf/[email protected]/
Fixes: 95b861a7935b ("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
Reported-by: [email protected]
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
A valid cpuset partition can become invalid if all its CPUs are offlined
or somehow removed. This can happen through external events without
"cpuset.cpus.partition" being touched at all.
Users that rely on the property of a partition being present do not
currently have a simple way to get such an event notified other than
constant periodic polling which is both inefficient and cumbersome.
To make life easier for those users, event notification is now enabled
for "cpuset.cpus.partition" whenever its state changes.
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
Fix kernel-doc warnings found in cgroup-v1.c:
kernel/cgroup/cgroup-v1.c:55: warning: No description found for return value of 'cgroup_attach_task_all'
kernel/cgroup/cgroup-v1.c:94: warning: expecting prototype for cgroup_trasnsfer_tasks(). Prototype was for cgroup_transfer_tasks() instead
cgroup-v1.c:96: warning: No description found for return value of 'cgroup_transfer_tasks'
kernel/cgroup/cgroup-v1.c:687: warning: No description found for return value of 'cgroupstats_build'
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: [email protected]
Signed-off-by: Tejun Heo <[email protected]>
|
|
Fix the following warnings:
kernel/smp.c:1189: warning: cannot understand function prototype: 'struct smp_call_on_cpu_struct '
kernel/smp.c:788: warning: No description found for return value of 'smp_call_function_single_async'
kernel/smp.c:990: warning: Function parameter or member 'wait' not described in 'smp_call_function_many'
kernel/smp.c:990: warning: Excess function parameter 'flags' description in 'smp_call_function_many'
kernel/smp.c:1198: warning: Function parameter or member 'work' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'done' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'func' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'data' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'ret' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'cpu' not described in 'smp_call_on_cpu_struct'
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fix all kernel-doc warnings in these 3 files and do some simple editing
(capitalize acronyms, capitalize Linux).
kernel/irq/pm.c:235: warning: expecting prototype for irq_pm_syscore_ops(). Prototype was for irq_pm_syscore_resume() instead
kernel/irq/msi.c:530: warning: expecting prototype for __msi_domain_free_irqs(). Prototype was for msi_domain_free_irqs() instead
kernel/irq/msi.c:31: warning: No description found for return value of 'alloc_msi_entry'
kernel/irq/msi.c:103: warning: No description found for return value of 'msi_domain_set_affinity'
kernel/irq/msi.c:288: warning: No description found for return value of 'msi_create_irq_domain'
kernel/irq/msi.c:499: warning: No description found for return value of 'msi_domain_alloc_irqs'
kernel/irq/msi.c:545: warning: No description found for return value of 'msi_get_domain_info'
kernel/irq/ipi.c:264: warning: expecting prototype for ipi_send_mask(). Prototype was for __ipi_send_mask() instead
kernel/irq/ipi.c:25: warning: No description found for return value of 'irq_reserve_ipi'
kernel/irq/ipi.c:116: warning: No description found for return value of 'irq_destroy_ipi'
kernel/irq/ipi.c:163: warning: No description found for return value of 'ipi_get_hwirq'
kernel/irq/ipi.c:222: warning: No description found for return value of '__ipi_send_single'
kernel/irq/ipi.c:308: warning: No description found for return value of 'ipi_send_single'
kernel/irq/ipi.c:329: warning: No description found for return value of 'ipi_send_mask'
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Return a negative error code from the error handling case instead of 0, as
done elsewhere in this function.
Fixes: f52da98d900e ("genirq/timings: Add selftest for irqs circular buffer")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Describe the arguments correctly.
Fixes the following W=1 kernel build warning(s):
kernel/irq/matrix.c:287: warning: Function parameter or
member 'msk' not described in 'irq_matrix_alloc_managed'
kernel/irq/matrix.c:287: warning: Function parameter or
member 'mapped_cpu' not described in 'irq_matrix_alloc_managed'
kernel/irq/matrix.c:287: warning: Excess function
parameter 'cpu' description in 'irq_matrix_alloc_managed'
Signed-off-by: Baokun Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.
Replace the test with a static key to avoid the potential cache miss.
[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Tanner Love <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
'nolibc.2021.07.20c', 'tasks.2021.07.20c', 'torture.2021.07.27a' and 'torturescript.2021.07.27a' into HEAD
doc.2021.07.20c: Documentation updates.
fixes.2021.08.06a: Miscellaneous fixes.
nocb.2021.07.20c: Callback-offloading (NOCB CPU) updates.
nolibc.2021.07.20c: Tiny userspace library updates.
tasks.2021.07.20c: Tasks RCU updates.
torture.2021.07.27a: In-kernel torture-test updates.
torturescript.2021.07.27a: Torture-test scripting updates.
|
|
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Davidlohr Bueso <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: [email protected]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
CPU hotplug callbacks can fail and cause a rollback to the previous
state. These failures are silent and therefore hard to debug.
Add pr_debug() to the up and down paths which provide information about the
error code, the CPU and the failed state. The debug printks can be enabled
via kernel command line or sysfs.
[ tglx: Adopt to current mainline, massage printk and changelog ]
Signed-off-by: Dongli Zhang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Use DEVICE_ATTR_*() helper instead of plain DEVICE_ATTR,
which makes the code a bit shorter and easier to read.
Signed-off-by: YueHaibing <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
kernel/cpu.c:57: warning: cannot understand function prototype: 'struct cpuhp_cpu_state '
kernel/cpu.c:115: warning: cannot understand function prototype: 'struct cpuhp_step '
kernel/cpu.c:146: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* cpuhp_invoke_callback _ Invoke the callbacks for a given state
kernel/cpu.c:75: warning: Function parameter or member 'fail' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'cpu' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'node' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'last' not described in 'cpuhp_cpu_state'
kernel/cpu.c:130: warning: Function parameter or member 'list' not described in 'cpuhp_step'
kernel/cpu.c:130: warning: Function parameter or member 'multi_instance' not described in 'cpuhp_step'
kernel/cpu.c:158: warning: No description found for return value of 'cpuhp_invoke_callback'
kernel/cpu.c:1188: warning: No description found for return value of 'cpu_device_down'
kernel/cpu.c:1400: warning: No description found for return value of 'cpu_device_up'
kernel/cpu.c:1425: warning: No description found for return value of 'bringup_hibernate_cpu'
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Fixes the following W=1 kernel build warning(s):
kernel/cpu.c:1949: warning: Function parameter or member
'name' not described in '__cpuhp_setup_state_cpuslocked'
Signed-off-by: Baokun Li <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
By unconditionally updating the offsets there are more indicators
whether the SMP function calls on clock_was_set() can be avoided:
- When the offset update already happened on the remote CPU then the
remote update attempt will yield the same seqeuence number and no
IPI is required.
- When the remote CPU is currently handling hrtimer_interrupt(). In
that case the remote CPU will reevaluate the timer bases before
reprogramming anyway, so nothing to do.
- After updating it can be checked whether the first expiring timer in
the affected clock bases moves before the first expiring (softirq)
timer of the CPU. If that's not the case then sending the IPI is not
required.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Setting of clocks triggers an unconditional SMP function call on all online
CPUs to reprogram the clock event device.
However, only some clocks have their offsets updated and therefore
potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in
the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME.
Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the affected clock bases which are
indicated by the caller in the @bases argument of clock_was_set().
If that's not the case, skip the IPI and update the offsets remotely which
ensures that any subsequently armed timers on the affected clocks are
evaluated with the correct offsets.
[ tglx: Adopted to the new bases argument, removed the softirq_active
check, added comment, fixed up stale comment ]
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
clock_was_set() unconditionaly invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle which is not longer the case.
The bases arguments allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.
Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.
Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.
Distangle it by invoking hrtimer_resume() on each unfreezing CPU and invoke
the new timerfd_resume() function from timekeeping_resume() which is the
only place where this is needed.
Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.
With this the clock_was_set*() functions are not longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then
clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on
the periodic tick to update the clock offsets, but when NOHZ is enabled and
active then CPUs which are in a deep idle sleep do not have a periodic tick
which means the expiry of timers affected by clock_was_set() can be
arbitrarily delayed up to the point where the CPUs are brought out of idle
again.
Make the clock_was_set() logic unconditionaly available so that idle CPUs
are kicked out of idle to handle the update.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If high resolution timers are disabled the timerfd notification about a
clock was set event is not happening for all cases which use
clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong.
Make clock_was_set_delayed() unconditially available to fix that.
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
This code is mostly duplicated. The redudant store in the force reprogram
case does no harm and the in hrtimer interrupt condition cannot be true for
the force reprogram invocations.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.
If the new expiry time ends up to be the first expiry again then the clock
event device has to reprogrammed again.
Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.
Reported-by: Lorenzo Colitti <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
queueing
There are several scenarios that can result in posix_cpu_timer_set()
not queueing the timer but still leaving the threadgroup cputime counter
running or keeping the tick dependency around for a random amount of time.
1) If timer_settime() is called with a 0 expiration on a timer that is
already disabled, the process wide cputime counter will be started
and won't ever get a chance to be stopped by stop_process_timer()
since no timer is actually armed to be processed.
The following snippet is enough to trigger the issue.
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
timer_settime(id, TIMER_ABSTIME, &val, NULL);
timer_delete(id);
}
2) If timer_settime() is called with a 0 expiration on a timer that is
already armed, the timer is dequeued but not really disarmed. So the
process wide cputime counter and the tick dependency may still remain
a while around.
The following code snippet keeps this overhead around for one week after
the timer deletion:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
val.it_value.tv_sec = 604800;
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
timer_settime(id, 0, &val, NULL);
timer_delete(id);
}
3) If the timer was initially deactivated, this call to timer_settime()
with an early expiration may have started the process wide cputime
counter even though the timer hasn't been queued and armed because it
has fired early and inline within posix_cpu_timer_set() itself. As a
result the process wide cputime counter may never stop until a new
timer is ever armed in the future.
The following code snippet can reproduce this:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
signal(SIGALRM, SIG_IGN);
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
val.it_value.tv_nsec = 1;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
}
4) If the timer was initially armed with a former expiration value
before this call to timer_settime() and the current call sets an
early deadline that has already expired, the timer fires inline
within posix_cpu_timer_set(). In this case it must have been dequeued
before firing inline with its new expiration value, yet it hasn't
been disarmed in this case. So the process wide cputime counter and
the tick dependency may still be around for a while even after the
timer fired.
The following code snippet can reproduce this:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
signal(SIGALRM, SIG_IGN);
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
val.it_value.tv_sec = 100;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
val.it_value.tv_sec = 0;
val.it_value.tv_nsec = 1;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
}
Fix all these issues with triggering the related base next expiration
recalculation on the next tick. This also implies to re-evaluate the need
to keep around the process wide cputime counter and the tick dependency, in
a similar fashion to disarm_timer().
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Remove the ad-hoc timer base accessors and provide a consolidated one.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The end of the function cannot be reached with an error in variable
ret. Unconfuse reviewers about that.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When an itimer deactivates a previously armed expiration, it simply doesn't
do anything. As a result the process wide cputime counter keeps running and
the tick dependency stays set until it reaches the old ghost expiration
value.
This can be reproduced with the following snippet:
void trigger_process_counter(void)
{
struct itimerval n = {};
n.it_value.tv_sec = 100;
setitimer(ITIMER_VIRTUAL, &n, NULL);
n.it_value.tv_sec = 0;
setitimer(ITIMER_VIRTUAL, &n, NULL);
}
Fix this with resetting the relevant base expiration. This is similar to
disarming a timer.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
A timer deletion only dequeues the timer but it doesn't shutdown
the related costly process wide cputimer counter and the tick dependency.
The following code snippet keeps this overhead around for one week after
the timer deletion:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
val.it_value.tv_sec = 604800;
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
timer_settime(id, 0, &val, NULL);
timer_delete(id);
}
Make sure the next target's tick recalculates the nearest expiration and
clears the process wide counter and tick dependency if necessary.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Starting the process wide cputime counter needs to be done in the same
sighand locking sequence than actually arming the related timer otherwise
this races against concurrent timers setting/expiring in the same
threadgroup.
Detecting that the cputime counter is started without holding the sighand
lock is a first step toward debugging such situations.
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The variable ret is being initialized with a value that is never read, it
is being updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
msi_domain_free_irqs() does not enforce deactivation before tearing down
the interrupts.
This happens when PCI/MSI interrupts are set up and never used before being
torn down again, e.g. in error handling pathes. The only place which cleans
that up is the error handling path in msi_domain_alloc_irqs().
Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
to cure that.
Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
Signed-off-by: Bixuan Cui <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
When the interrupt interval is greater than 2 ^ PREDICTION_BUFFER_SIZE *
PREDICTION_FACTOR us and less than 1s, the calculated index will be greater
than the length of irqs->ema_time[]. Check the calculated index before
using it to prevent array overflow.
Fixes: 23aa3b9a6b7d ("genirq/timings: Encapsulate storing function")
Signed-off-by: Ben Dai <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.
This code was detected with the help of Coccinelle and, audited and
fixed manually.
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/20210513212729.GA214145@embeddedor
|
|
PCI MSI interrupt numbers are now mapped in a PCI-MSI domain but the
underlying calls handling the passthrough of the interrupt in the
guest need a number in the XIVE IRQ domain.
Use the IRQ data mapped in the XIVE IRQ domain and not the one in the
PCI-MSI domain.
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Clarify wording and document commonly used IRQF_ONESHOT flag.
Signed-off-by: Joel Savitz <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|