Age | Commit message (Collapse) | Author | Files | Lines |
|
After commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u"
threads with cpu_stop_work"), the percpu soft_lockup_hrtimer_cnt is
not used any more, so remove it and related code.
Signed-off-by: Jisheng Zhang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2020-01-15
The following pull-request contains BPF updates for your *net* tree.
We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).
The main changes are:
1) Fix refcount leak for TCP time wait and request sockets for socket lookup
related BPF helpers, from Lorenz Bauer.
2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.
3) Batch of several sockmap and related TLS fixes found while operating
more complex BPF programs with Cilium and OpenSSL, from John Fastabend.
4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
to avoid purging data upon teardown, from Lingpeng Chen.
5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
dump a BPF map's value with BTF, from Martin KaFai Lau.
====================
Signed-off-by: David S. Miller <[email protected]>
|
|
htab can't use generic batch support due some problematic behaviours
inherent to the data structre, i.e. while iterating the bpf map a
concurrent program might delete the next entry that batch was about to
use, in that case there's no easy solution to retrieve the next entry,
the issue has been discussed multiple times (see [1] and [2]).
The only way hmap can be traversed without the problem previously
exposed is by making sure that the map is traversing entire buckets.
This commit implements those strict requirements for hmap, the
implementation follows the same interaction that generic support with
some exceptions:
- If keys/values buffer are not big enough to traverse a bucket,
ENOSPC will be returned.
- out_batch contains the value of the next bucket in the iteration, not
the next key, but this is transparent for the user since the user
should never use out_batch for other than bpf batch syscalls.
This commits implements BPF_MAP_LOOKUP_BATCH and adds support for new
command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
batch ops it is possible to use the generic implementations.
[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/[email protected]/
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This adds the generic batch ops functionality to bpf arraymap, note that
since deletion is not a valid operation for arraymap, only batch and
lookup are added.
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This commit adds generic support for update and delete batch ops that
can be used for almost all the bpf maps. These commands share the same
UAPI attr that lookup and lookup_and_delete batch ops use and the
syscall commands are:
BPF_MAP_UPDATE_BATCH
BPF_MAP_DELETE_BATCH
The main difference between update/delete and lookup batch ops is that
for update/delete keys/values must be specified for userspace and
because of that, neither in_batch nor out_batch are used.
Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This commit introduces generic support for the bpf_map_lookup_batch.
This implementation can be used by almost all the bpf maps since its core
implementation is relying on the existing map_get_next_key and
map_lookup_elem. The bpf syscall subcommand introduced is:
BPF_MAP_LOOKUP_BATCH
The UAPI attribute is:
struct { /* struct used by BPF_MAP_*_BATCH commands */
__aligned_u64 in_batch; /* start batch,
* NULL to start from beginning
*/
__aligned_u64 out_batch; /* output: next start batch */
__aligned_u64 keys;
__aligned_u64 values;
__u32 count; /* input/output:
* input: # of key/value
* elements
* output: # of filled elements
*/
__u32 map_fd;
__u64 elem_flags;
__u64 flags;
} batch;
in_batch/out_batch are opaque values use to communicate between
user/kernel space, in_batch/out_batch must be of key_size length.
To start iterating from the beginning in_batch must be null,
count is the # of key/value elements to retrieve. Note that the 'keys'
buffer must be a buffer of key_size * count size and the 'values' buffer
must be value_size * count, where value_size must be aligned to 8 bytes
by userspace if it's dealing with percpu maps. 'count' will contain the
number of keys/values successfully retrieved. Note that 'count' is an
input/output variable and it can contain a lower value after a call.
If there's no more entries to retrieve, ENOENT will be returned. If error
is ENOENT, count might be > 0 in case it copied some values but there were
no more entries to retrieve.
Note that if the return code is an error and not -EFAULT,
count indicates the number of elements successfully processed.
Suggested-by: Stanislav Fomichev <[email protected]>
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
This commit moves reusable code from map_lookup_elem and map_update_elem
to avoid code duplication in kernel/bpf/syscall.c.
Signed-off-by: Brian Vazquez <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: John Fastabend <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Anatoly has been fuzzing with kBdysch harness and reported a hang in one
of the outcomes:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (85) call bpf_get_socket_cookie#46
1: R0_w=invP(id=0) R10=fp0
1: (57) r0 &= 808464432
2: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
2: (14) w0 -= 810299440
3: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
3: (c4) w0 s>>= 1
4: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
4: (76) if w0 s>= 0x30303030 goto pc+216
221: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
221: (95) exit
processed 6 insns (limit 1000000) [...]
Taking a closer look, the program was xlated as follows:
# ./bpftool p d x i 12
0: (85) call bpf_get_socket_cookie#7800896
1: (bf) r6 = r0
2: (57) r6 &= 808464432
3: (14) w6 -= 810299440
4: (c4) w6 s>>= 1
5: (76) if w6 s>= 0x30303030 goto pc+216
6: (05) goto pc-1
7: (05) goto pc-1
8: (05) goto pc-1
[...]
220: (05) goto pc-1
221: (05) goto pc-1
222: (95) exit
Meaning, the visible effect is very similar to f54c7898ed1c ("bpf: Fix
precision tracking for unbounded scalars"), that is, the fall-through
branch in the instruction 5 is considered to be never taken given the
conclusion from the min/max bounds tracking in w6, and therefore the
dead-code sanitation rewrites it as goto pc-1. However, real-life input
disagrees with verification analysis since a soft-lockup was observed.
The bug sits in the analysis of the ARSH. The definition is that we shift
the target register value right by K bits through shifting in copies of
its sign bit. In adjust_scalar_min_max_vals(), we do first coerce the
register into 32 bit mode, same happens after simulating the operation.
However, for the case of simulating the actual ARSH, we don't take the
mode into account and act as if it's always 64 bit, but location of sign
bit is different:
dst_reg->smin_value >>= umin_val;
dst_reg->smax_value >>= umin_val;
dst_reg->var_off = tnum_arshift(dst_reg->var_off, umin_val);
Consider an unknown R0 where bpf_get_socket_cookie() (or others) would
for example return 0xffff. With the above ARSH simulation, we'd see the
following results:
[...]
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP65535 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (57) r0 &= 808464432
-> R0_runtime = 0x3030
3: R0_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
3: (14) w0 -= 810299440
-> R0_runtime = 0xcfb40000
4: R0_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
(0xffffffff)
4: (c4) w0 s>>= 1
-> R0_runtime = 0xe7da0000
5: R0_w=invP(id=0,umin_value=1740636160,umax_value=2147221496,var_off=(0x67c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0x7ffbfff8)
[...]
In insn 3, we have a runtime value of 0xcfb40000, which is '1100 1111 1011
0100 0000 0000 0000 0000', the result after the shift has 0xe7da0000 that
is '1110 0111 1101 1010 0000 0000 0000 0000', where the sign bit is correctly
retained in 32 bit mode. In insn4, the umax was 0xffffffff, and changed into
0x7ffbfff8 after the shift, that is, '0111 1111 1111 1011 1111 1111 1111 1000'
and means here that the simulation didn't retain the sign bit. With above
logic, the updates happen on the 64 bit min/max bounds and given we coerced
the register, the sign bits of the bounds are cleared as well, meaning, we
need to force the simulation into s32 space for 32 bit alu mode.
Verification after the fix below. We're first analyzing the fall-through branch
on 32 bit signed >= test eventually leading to rejection of the program in this
specific case:
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
0: (b7) r2 = 808464432
1: R1=ctx(id=0,off=0,imm=0) R2_w=invP808464432 R10=fp0
1: (85) call bpf_get_socket_cookie#46
2: R0_w=invP(id=0) R10=fp0
2: (bf) r6 = r0
3: R0_w=invP(id=0) R6_w=invP(id=0) R10=fp0
3: (57) r6 &= 808464432
4: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=808464432,var_off=(0x0; 0x30303030)) R10=fp0
4: (14) w6 -= 810299440
5: R0_w=invP(id=0) R6_w=invP(id=0,umax_value=4294967295,var_off=(0xcf800000; 0x3077fff0)) R10=fp0
5: (c4) w6 s>>= 1
6: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
(0x67c00000) (0xfffbfff8)
6: (76) if w6 s>= 0x30303030 goto pc+216
7: R0_w=invP(id=0) R6_w=invP(id=0,umin_value=3888119808,umax_value=4294705144,var_off=(0xe7c00000; 0x183bfff8)) R10=fp0
7: (30) r0 = *(u8 *)skb[808464432]
BPF_LD_[ABS|IND] uses reserved fields
processed 8 insns (limit 1000000) [...]
Fixes: 9cbe1f5a32dc ("bpf/verifier: improve register value range tracking with ARSH")
Reported-by: Anatoly Trosinenko <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
Suspend to IDLE invokes tick_unfreeze() on resume. tick_unfreeze() on the
first resuming CPU resumes timekeeping, which also has the side effect of
resetting the softlockup watchdog on this CPU.
But on the secondary CPUs the watchdog is not reset in the resume /
unfreeze() path, which can result in false softlockup warnings on those
CPUs depending on the time spent in suspend.
Prevent this by clearing the softlock watchdog in the unfreeze path also
on the secondary resuming CPUs.
[ tglx: Massaged changelog ]
Signed-off-by: Chunyan Zhang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit 8b401f9ed244 ("bpf: implement bpf_send_signal() helper")
added helper bpf_send_signal() which permits bpf program to
send a signal to the current process. The signal may be
delivered to any threads in the process.
We found a use case where sending the signal to the current
thread is more preferable.
- A bpf program will collect the stack trace and then
send signal to the user application.
- The user application will add some thread specific
information to the just collected stack trace for
later analysis.
If bpf_send_signal() is used, user application will need
to check whether the thread receiving the signal matches
the thread collecting the stack by checking thread id.
If not, it will need to send signal to another thread
through pthread_kill().
This patch proposed a new helper bpf_send_signal_thread(),
which sends the signal to the thread corresponding to
the current kernel task. This way, user space is guaranteed that
bpf_program execution context and user space signal handling
context are the same thread.
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
The test_cgcore_no_internal_process_constraint_on_threads selftest when
running with subsystem controlling noise triggers two warnings:
> [ 597.443115] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3131 cgroup_apply_control_enable+0xe0/0x3f0
> [ 597.443413] WARNING: CPU: 1 PID: 28167 at kernel/cgroup/cgroup.c:3177 cgroup_apply_control_disable+0xa6/0x160
Both stem from a call to cgroup_type_write. The first warning was also
triggered by syzkaller.
When we're switching cgroup to threaded mode shortly after a subsystem
was disabled on it, we can see the respective subsystem css dying there.
The warning in cgroup_apply_control_enable is harmless in this case
since we're not adding new subsys anyway.
The warning in cgroup_apply_control_disable indicates an attempt to kill
css of recently disabled subsystem repeatedly.
The commit prevents these situations by making cgroup_type_write wait
for all dying csses to go away before re-applying subtree controls.
When at it, the locations of WARN_ON_ONCE calls are moved so that
warning is triggered only when we are about to misuse the dying css.
Reported-by: [email protected]
Reported-by: Christian Brauner <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.
You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.
Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.
Signed-off-by: Daniel Jordan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: [email protected]
Signed-off-by: Tejun Heo <[email protected]>
|
|
Function name cgroup_rstat_cpu_pop_upated() in comment should be
cgroup_rstat_cpu_pop_updated().
Signed-off-by: Chen Zhou <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
|
|
It is useful to know which module failed signature verification, so
print the module name along with the error message.
Signed-off-by: Jessica Yu <[email protected]>
|
|
The alarmtimer_rtc_add_device() function creates a wakeup source and then
tries to grab a module reference. If that fails the function returns early
with an error code, but fails to remove the wakeup source.
Cleanup this exit path so there is no dangling wakeup source, which is
named 'alarmtime' left allocated which will conflict with another RTC
device that may be registered later.
Fixes: 51218298a25e ("alarmtimer: Ensure RTC module is not unloaded")
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
|
|
syzbot (KCSAN) reported a data-race in tick_do_update_jiffies64():
BUG: KCSAN: data-race in tick_do_update_jiffies64 / tick_do_update_jiffies64
write to 0xffffffff8603d008 of 8 bytes by interrupt on cpu 1:
tick_do_update_jiffies64+0x100/0x250 kernel/time/tick-sched.c:73
tick_sched_do_timer+0xd4/0xe0 kernel/time/tick-sched.c:138
tick_sched_timer+0x43/0xe0 kernel/time/tick-sched.c:1292
__run_hrtimer kernel/time/hrtimer.c:1514 [inline]
__hrtimer_run_queues+0x274/0x5f0 kernel/time/hrtimer.c:1576
hrtimer_interrupt+0x22a/0x480 kernel/time/hrtimer.c:1638
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1110 [inline]
smp_apic_timer_interrupt+0xdc/0x280 arch/x86/kernel/apic/apic.c:1135
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
arch_local_irq_restore arch/x86/include/asm/paravirt.h:756 [inline]
kcsan_setup_watchpoint+0x1d4/0x460 kernel/kcsan/core.c:436
check_access kernel/kcsan/core.c:466 [inline]
__tsan_read1 kernel/kcsan/core.c:593 [inline]
__tsan_read1+0xc2/0x100 kernel/kcsan/core.c:593
kallsyms_expand_symbol.constprop.0+0x70/0x160 kernel/kallsyms.c:79
kallsyms_lookup_name+0x7f/0x120 kernel/kallsyms.c:170
insert_report_filterlist kernel/kcsan/debugfs.c:155 [inline]
debugfs_write+0x14b/0x2d0 kernel/kcsan/debugfs.c:256
full_proxy_write+0xbd/0x100 fs/debugfs/file.c:225
__vfs_write+0x67/0xc0 fs/read_write.c:494
vfs_write fs/read_write.c:558 [inline]
vfs_write+0x18a/0x390 fs/read_write.c:542
ksys_write+0xd5/0x1b0 fs/read_write.c:611
__do_sys_write fs/read_write.c:623 [inline]
__se_sys_write fs/read_write.c:620 [inline]
__x64_sys_write+0x4c/0x60 fs/read_write.c:620
do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffffffff8603d008 of 8 bytes by task 0 on cpu 0:
tick_do_update_jiffies64+0x2b/0x250 kernel/time/tick-sched.c:62
tick_nohz_update_jiffies kernel/time/tick-sched.c:505 [inline]
tick_nohz_irq_enter kernel/time/tick-sched.c:1257 [inline]
tick_irq_enter+0x139/0x1c0 kernel/time/tick-sched.c:1274
irq_enter+0x4f/0x60 kernel/softirq.c:354
entering_irq arch/x86/include/asm/apic.h:517 [inline]
entering_ack_irq arch/x86/include/asm/apic.h:523 [inline]
smp_apic_timer_interrupt+0x55/0x280 arch/x86/kernel/apic/apic.c:1133
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:60
arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
cpuidle_idle_call kernel/sched/idle.c:154 [inline]
do_idle+0x1af/0x280 kernel/sched/idle.c:263
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
rest_init+0xec/0xf6 init/main.c:452
arch_call_rest_init+0x17/0x37
start_kernel+0x838/0x85e init/main.c:786
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Use READ_ONCE() and WRITE_ONCE() to annotate this expected race.
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Cc: [email protected]
Fixes: 30350d65ac567 ("tracing: Add variable support to hist triggers")
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
rb_update_event has changed without the kernel-doc update.
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Also fixes a couple of typos
Link: http://lkml.kernel.org/r/[email protected]
Cc: Andrew Morton <[email protected]>
Signed-off-by: Fabian Frederick <[email protected]>
[ Found this deep in the abyss of my INBOX ]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Fix double perf_event linking to trace_uprobe_filter on
multiple uprobe event by moving trace_uprobe_filter under
trace_probe_event.
In uprobe perf event, trace_uprobe_filter data structure is
managing target mm filters (in perf_event) related to each
uprobe event.
Since commit 60d53e2c3b75 ("tracing/probe: Split trace_event
related data from trace_probe") left the trace_uprobe_filter
data structure in trace_uprobe, if a trace_probe_event has
multiple trace_uprobe (multi-probe event), a perf_event is
added to different trace_uprobe_filter on each trace_uprobe.
This leads a linked list corruption.
To fix this issue, move trace_uprobe_filter to trace_probe_event
and link it once on each event instead of each probe.
Link: http://lkml.kernel.org/r/157862073931.1800.3800576241181489174.stgit@devnote2
Cc: Jiri Olsa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Naveen N . Rao" <[email protected]>
Cc: Anil S Keshavamurthy <[email protected]>
Cc: "David S . Miller" <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: =?utf-8?q?Toke_H=C3=B8iland-J?= =?utf-8?b?w7hyZ2Vuc2Vu?= <[email protected]>
Cc: Jean-Tsung Hsiao <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Cc: [email protected]
Fixes: 60d53e2c3b75 ("tracing/probe: Split trace_event related data from trace_probe")
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Arnaldo Carvalho de Melo <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Merge misc fixes from David Howells.
Two afs fixes and a key refcounting fix.
* dhowells:
afs: Fix afs_lookup() to not clobber the version on a new dentry
afs: Fix use-after-loss-of-ref
keys: Fix request_key() cache
|
|
Instead of using bpf_struct_ops_map_lookup_elem() which is
not implemented, bpf_struct_ops_map_seq_show_elem() should
also use bpf_struct_ops_map_sys_lookup_elem() which does
an inplace update to the value. The change allocates
a value to pass to bpf_struct_ops_map_sys_lookup_elem().
[root@arch-fb-vm1 bpf]# cat /sys/fs/bpf/dctcp
{{{1}},BPF_STRUCT_OPS_STATE_INUSE,{{00000000df93eebc,00000000df93eebc},0,2, ...
Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS")
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
When the key cached by request_key() and co. is cleaned up on exit(),
the code looks in the wrong task_struct, and so clears the wrong cache.
This leads to anomalies in key refcounting when doing, say, a kernel
build on an afs volume, that then trigger kasan to report a
use-after-free when the key is viewed in /proc/keys.
Fix this by making exit_creds() look in the passed-in task_struct rather
than in current (the task_struct cleanup code is deferred by RCU and
potentially run in another task).
Fixes: 7743c48e54ee ("keys: Cache result of request_key*() temporarily in task_struct")
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.
Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <[email protected]>
|
|
API to set time namespace offsets for children processes, i.e.:
echo "$clockid $offset_sec $offset_nsec" > /proc/self/timens_offsets
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.
Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
VDSO support for Time namespace needs to set up a page with the same
layout as VVAR. That timens page will be placed on position of VVAR page
inside namespace. That page contains time namespace clock offsets and it
has vdso_data->seq set to 1 to enforce the slow path and
vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time namespace
handling path.
Allocate the timens page during namespace creation. Setup the offsets
when the first task enters the ns and freeze them to guarantee the pace
of monotonic/boottime clocks and to avoid breakage of applications.
The design decision is to have a global offset_lock which is used during
namespace offsets setup and to freeze offsets when the first task joins the
new time namespace. That is better in terms of memory usage compared to
having a per namespace mutex that's used only during the setup period.
Suggested-by: Andy Lutomirski <[email protected]>
Based-on-work-by: Thomas Gleixner <[email protected]>
Co-developed-by: Andrei Vagin <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
clock_nanosleep() accepts absolute values of expiration time, if the
TIMER_ABSTIME flag is set. This value is in the tasks time namespace,
which has to be converted to the host time namespace.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
clock_nanosleep() accepts absolute values of expiration time when
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace, and has to be converted to the host's time.
There is timens_ktime_to_host() helper for converting time, but
it accepts ktime argument.
As a preparation, make hrtimer_nanosleep() accept a clock value in ktime
instead of timespec64.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
clock_nanosleep() accepts absolute values of expiration time when the
TIMER_ABSTIME flag is set. This absolute value is inside the task's
time namespace and has to be converted to the host's time.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Wire timer_settime() syscall into time namespace virtualization.
sys_timer_settime() calls the ktime->timer_set() callback. Right now,
common_timer_set() is the only implementation for the callback.
The user-supplied expiry value is converted from timespec64 to ktime and
then timens_ktime_to_host() can be used to convert namespace's time to the
host time.
Inside a time namespace kernel's time differs by a fixed offset from a
user-supplied time, but only absolute values (TIMER_ABSTIME) must be
converted.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The helper subtracts namespace's clock offset from the given time
and ensures that the result is within [0, KTIME_MAX].
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Adjust monotonic and boottime clocks with per-timens offsets. As the
result a process inside time namespace will see timers and clocks corrected
to offsets that were set when the namespace was created
Note that applications usually go through vDSO to get time, which is not
yet adjusted. Further changes will complete time namespace virtualisation
with vDSO support.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Now, when the clock_get_ktime() callback exists, the suboptimal
timespec64-based conversion can be removed from common_timer_get().
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The callsite in common_timer_get() has already a comment:
/*
* The timespec64 based conversion is suboptimal, but it's not
* worth to implement yet another callback.
*/
kc->clock_get(timr->it_clock, &ts64);
now = timespec64_to_ktime(ts64);
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format.
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
Wire up alarm bases with get_timespec().
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The upcoming support for time namespaces requires to have access to:
- The time in a tasks time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
struct alarm_base needs to follow the same naming convention, so rename
.gettime() callback into get_ktime() as a preparation for introducing
get_timespec().
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format in (struct k_clock).
As a preparation ground for introducing clock_get_ktime(), the original
callback clock_get() was renamed into clock_get_timespec().
Reflect the renaming into the callback implementations.
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format, rather than in (struct timespec).
Rename the clock_get() callback to clock_get_timespec() as a preparation
for introducing clock_get_ktime().
Suggested-by: Thomas Gleixner <[email protected]>
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Introduce offsets for time namespace. They will contain an adjustment
needed to convert clocks to/from host's.
A new namespace is created with the same offsets as the time namespace
of the current process.
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME). Providing per namespace notions of
CLOCK_REALTIME would be complex with a massive overhead, but has a dubious
value.
But in the context of checkpoint/restore functionality, monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, it has to be
guaranteed that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that it is required to set CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in the way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of the process
will be born in the new time namespace, or a process can use the setns()
system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the highest
bit of CSIGNAL. It means that it can be used only with the unshare() and
the clone3() system calls.
[ tglx: Adjusted paragraph about clone3() to reality and massaged the
changelog a bit. ]
Co-developed-by: Dmitry Safonov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Dmitry Safonov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Link: https://lore.kernel.org/r/[email protected]
|
|
With CONFIG_PROVE_RCU_LIST, I had many suspicious RCU warnings
when I ran ftracetest trigger testcases.
-----
# dmesg -c > /dev/null
# ./ftracetest test.d/trigger
...
# dmesg | grep "RCU-list traversed" | cut -f 2 -d ] | cut -f 2 -d " "
kernel/trace/trace_events_hist.c:6070
kernel/trace/trace_events_hist.c:1760
kernel/trace/trace_events_hist.c:5911
kernel/trace/trace_events_trigger.c:504
kernel/trace/trace_events_hist.c:1810
kernel/trace/trace_events_hist.c:3158
kernel/trace/trace_events_hist.c:3105
kernel/trace/trace_events_hist.c:5518
kernel/trace/trace_events_hist.c:5998
kernel/trace/trace_events_hist.c:6019
kernel/trace/trace_events_hist.c:6044
kernel/trace/trace_events_trigger.c:1500
kernel/trace/trace_events_trigger.c:1540
kernel/trace/trace_events_trigger.c:539
kernel/trace/trace_events_trigger.c:584
-----
I investigated those warnings and found that the RCU-list
traversals in event trigger and hist didn't need to use
RCU version because those were called only under event_mutex.
I also checked other RCU-list traversals related to event
trigger list, and found that most of them were called from
event_hist_trigger_func() or hist_unregister_trigger() or
register/unregister functions except for a few cases.
Replace these unneeded RCU-list traversals with normal list
traversal macro and lockdep_assert_held() to check the
event_mutex is held.
Link: http://lkml.kernel.org/r/157680910305.11685.15110237954275915782.stgit@devnote2
Reviewed-by: Tom Zanussi <[email protected]>
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is possible using ptrace, and
injection of parasitic code to inject code which leverages SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes with a high cost of requiring the process to be stopped,
and breaks debuggers. This does not require stopping the process under
manipulation.
One reason to use this is to allow sandboxers to take actions on file
descriptors on the behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.
/* prototype */
/* flags is currently reserved and should be set to 0 */
int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
/* testing */
Ran self-test suite on x86_64
Signed-off-by: Sargun Dhillon <[email protected]>
Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>
|
|
Add below function-tracer filter options to boot-time tracing.
- ftrace.[instance.INSTANCE.]ftrace.filters
This will take an array of tracing function filter rules
- ftrace.[instance.INSTANCE.]ftrace.notraces
This will take an array of NON-tracing function filter rules
Link: http://lkml.kernel.org/r/157867244841.17873.10933616628243103561.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add ftrace.cpumask option support to boot-time tracing.
This sets cpumask for each instance.
- ftrace.[instance.INSTANCE.]cpumask = CPUMASK;
Set the trace cpumask. Note that the CPUMASK should be a string
which <tracefs>/tracing_cpumask can accepts.
Link: http://lkml.kernel.org/r/157867243625.17873.13613922641273149372.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add instance node support to boot-time tracing. User can set
some options and event nodes under instance node.
- ftrace.instance.INSTANCE[...]
Add new INSTANCE instance. Some options and event nodes
are acceptable for instance node.
Link: http://lkml.kernel.org/r/157867242413.17873.9814204526141500278.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add synthetic event node support to boot time tracing.
The synthetic event is a kind of event node, but the group
name is "synthetic".
- ftrace.event.synthetic.EVENT.fields = FIELD[, FIELD2...]
Defines new synthetic event with FIELDs. Each field should be
"type varname".
The synthetic node requires "fields" string arraies, which defines
the fields as same as tracing/synth_events interface.
Link: http://lkml.kernel.org/r/157867241236.17873.12411615143321557709.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add kprobe event support on event node to boot-time tracing.
If the group name of event is "kprobes", the boot-time tracing
defines new probe event according to "probes" values.
- ftrace.event.kprobes.EVENT.probes = PROBE[, PROBE2...]
Defines new kprobe event based on PROBEs. It is able to define
multiple probes on one event, but those must have same type of
arguments.
For example,
ftrace.events.kprobes.myevent {
probes = "vfs_read $arg1 $arg2";
enable;
}
This will add kprobes:myevent on vfs_read with the 1st and the 2nd
arguments.
Link: http://lkml.kernel.org/r/157867240104.17873.9712052065426433111.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Add per-event settings for boottime tracing. User can set filter,
actions and enable on each event on boot. The event entries are
under ftrace.event.GROUP.EVENT node (note that the option key
includes event's group name and event name.) This supports below
configs.
- ftrace.event.GROUP.EVENT.enable
Enables GROUP:EVENT tracing.
- ftrace.event.GROUP.EVENT.filter = FILTER
Set FILTER rule to the GROUP:EVENT.
- ftrace.event.GROUP.EVENT.actions = ACTION[, ACTION2...]
Set ACTIONs to the GROUP:EVENT.
For example,
ftrace.event.sched.sched_process_exec {
filter = "pid < 128"
enable
}
this will enable tracing "sched:sched_process_exec" event
with "pid < 128" filter.
Link: http://lkml.kernel.org/r/157867238942.17873.11177628789184546198.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Setup tracing options via extra boot config in addition to kernel
command line.
This adds following commands support. These are applied to
the global trace instance.
- ftrace.options = OPT1[,OPT2...]
Enable given ftrace options.
- ftrace.trace_clock = CLOCK
Set given CLOCK to ftrace's trace_clock.
- ftrace.buffer_size = SIZE
Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
for that SIZE.
- ftrace.events = EVENT[, EVENT2...]
Enable given events on boot. You can use a wild card in EVENT.
- ftrace.tracer = TRACER
Set TRACER to current tracer on boot. (e.g. function)
Note that this is NOT replacing the kernel parameters, because
this boot config based setting is later than that. If you want to
trace earlier boot events, you still need kernel parameters.
Link: http://lkml.kernel.org/r/157867237723.17873.17494943526320587488.stgit@devnote2
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|