Age | Commit message (Collapse) | Author | Files | Lines |
|
With the rework of how the __string() handles dynamic strings where it
saves off the source string in field in the helper structure[1], the
assignment of that value to the trace event field is stored in the helper
value and does not need to be passed in again.
This means that with:
__string(field, mystring)
Which use to be assigned with __assign_str(field, mystring), no longer
needs the second parameter and it is unused. With this, __assign_str()
will now only get a single parameter.
There's over 700 users of __assign_str() and because coccinelle does not
handle the TRACE_EVENT() macro I ended up using the following sed script:
git grep -l __assign_str | while read a ; do
sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
mv /tmp/test-file $a;
done
I then searched for __assign_str() that did not end with ';' as those
were multi line assignments that the sed script above would fail to catch.
Note, the same updates will need to be done for:
__assign_str_len()
__assign_rel_str()
__assign_rel_str_len()
I tested this with both an allmodconfig and an allyesconfig (build only for both).
[1] https://lore.kernel.org/linux-trace-kernel/[email protected]/
Link: https://lore.kernel.org/linux-trace-kernel/[email protected]
Cc: Masami Hiramatsu <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Julia Lawall <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Acked-by: Jani Nikula <[email protected]>
Acked-by: Christian König <[email protected]> for the amdgpu parts.
Acked-by: Thomas Hellström <[email protected]> #for
Acked-by: Rafael J. Wysocki <[email protected]> # for thermal
Acked-by: Takashi Iwai <[email protected]>
Acked-by: Darrick J. Wong <[email protected]> # xfs
Tested-by: Guenter Roeck <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Add cpufreq pressure feedback for the scheduler
- Rework misfit load-balancing wrt affinity restrictions
- Clean up and simplify the code around ::overutilized and
::overload access.
- Simplify sched_balance_newidle()
- Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES
handling that changed the output.
- Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()
- Reorganize, clean up and unify most of the higher level
scheduler balancing function names around the sched_balance_*()
prefix
- Simplify the balancing flag code (sched_balance_running)
- Miscellaneous cleanups & fixes
* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
sched/pelt: Remove shift of thermal clock
sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
thermal/cpufreq: Remove arch_update_thermal_pressure()
sched/cpufreq: Take cpufreq feedback into account
cpufreq: Add a cpufreq pressure feedback for the scheduler
sched/fair: Fix update of rd->sg_overutilized
sched/vtime: Do not include <asm/vtime.h> header
s390/irq,nmi: Include <asm/vtime.h> header directly
s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
sched/vtime: Get rid of generic vtime_task_switch() implementation
sched/vtime: Remove confusing arch_vtime_task_switch() declaration
sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
sched/fair: Rename root_domain::overload to ::overloaded
sched/fair: Use helper functions to access root_domain::overload
sched/fair: Check root_domain::overload value before update
sched/fair: Combine EAS check with root_domain::overutilized access
sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
...
|
|
arch_update_hw_pressure()
Now that cpufreq provides a pressure value to the scheduler, rename
arch_update_thermal_pressure into HW pressure to reflect that it returns
a pressure applied by HW (i.e. with a high frequency change) and not
always related to thermal mitigation but also generated by max current
limitation as an example. Such high frequency signal needs filtering to be
smoothed and provide an value that reflects the average available capacity
into the scheduler time scale.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Qais Yousef <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Add "sched_prepare_exec" tracepoint, which is run right after the point
of no return but before the current task assumes its new exec identity.
Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
tracepoint runs before flushing the old exec, i.e. while the task still
has the original state (such as original MM), but when the new exec
either succeeds or crashes (but never returns to the original exec).
Being able to trace this event can be helpful in a number of use cases:
* allowing tracing eBPF programs access to the original MM on exec,
before current->mm is replaced;
* counting exec in the original task (via perf event);
* profiling flush time ("sched_prepare_exec" to "sched_process_exec").
Example of tracing output:
$ cat /sys/kernel/debug/tracing/trace_pipe
<...>-379 [003] ..... 179.626921: sched_prepare_exec: interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
<...>-381 [002] ..... 180.048580: sched_prepare_exec: interp=/bin/bash filename=/bin/bash pid=381 comm=sshd
<...>-385 [001] ..... 180.068277: sched_prepare_exec: interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
<...>-389 [006] ..... 192.020147: sched_prepare_exec: interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
Signed-off-by: Marco Elver <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
Reviewed-by: Masami Hiramatsu (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Kees Cook <[email protected]>
|
|
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.
Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.
Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that inactive while the workload executes.
The VMA skipping activity frequency with and without the patch:
6.6.0-rc2-sched-numabtrace-v1
=============================
649 reason=scan_delay
9,094 reason=unsuitable
48,915 reason=shared_ro
143,919 reason=inaccessible
193,050 reason=pid_inactive
6.6.0-rc2-sched-numabselective-v1
=============================
146 reason=seq_completed
622 reason=ignore_pid_inactive
624 reason=scan_delay
6,570 reason=unsuitable
16,101 reason=shared_ro
27,608 reason=inaccessible
41,939 reason=pid_inactive
Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.
On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Min elsp-NUMA01_THREADLOCAL 174.22 ( 0.00%) 117.64 ( 32.48%)
Amean elsp-NUMA01_THREADLOCAL 175.68 ( 0.00%) 123.34 * 29.79%*
Stddev elsp-NUMA01_THREADLOCAL 1.20 ( 0.00%) 4.06 (-238.20%)
CoeffVar elsp-NUMA01_THREADLOCAL 0.68 ( 0.00%) 3.29 (-381.70%)
Max elsp-NUMA01_THREADLOCAL 177.18 ( 0.00%) 128.03 ( 27.74%)
The time to complete the workload is reduced by almost 30%:
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1 /
Duration User 91201.80 63506.64
Duration System 2015.53 1819.78
Duration Elapsed 1234.77 868.37
In this specific case, system CPU time was not increased but it's not
universally true.
From vmstat, the NUMA scanning and fault activity is as follows;
6.6.0-rc2 6.6.0-rc2
sched-numabtrace-v1 sched-numabselective-v1
Ops NUMA base-page range updates 64272.00 26374386.00
Ops NUMA PTE updates 36624.00 55538.00
Ops NUMA PMD updates 54.00 51404.00
Ops NUMA hint faults 15504.00 75786.00
Ops NUMA hint local faults % 14860.00 56763.00
Ops NUMA hint local percent 95.85 74.90
Ops NUMA pages migrated 1629.00 6469222.00
Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.
Force tasks to update PTEs for partially scanned PTEs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.
The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Raghavendra K T <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
It was useful to track feec() placement decision and debug the spare
capacity and optimization issues vs uclamp_max.
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting
sched_switch event, 2022-01-20) added a new prev_state argument to the
sched_switch tracepoint, before the prev task_struct pointer.
This reordering of arguments broke BPF programs that use the raw
tracepoint (e.g. tp_btf programs). The type of the second argument has
changed and existing programs that assume a task_struct* argument
(e.g. for bpf_task_storage access) will now fail to verify.
If we instead append the new argument to the end, all existing programs
would continue to work and can conditionally extract the prev_state
argument on supported kernel versions.
Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20)
Signed-off-by: Delyan Kratunov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
As of commit
c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
the following sequence becomes possible:
p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;
TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.
Prevent this by pushing the value read from __schedule() down the trace
event.
Reported-by: Abhijeet Dharmapurikar <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt (Google) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
'success' is left here for a long time and also it is meaningless
for the upper user. Just remove it.
[ There were some tools expecting this, and this may break them. But
hopefully they've been fixed in the mean time. Otherwise this may be
likely reverted - SDR ]
Link: https://lkml.kernel.org/r/[email protected]
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Ed Tsai <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Fix ~59 single-word typos in the tracing code comments, and fix
the grammar in a handful of places.
Link: https://lore.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reviewed-by: Randy Dunlap <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
The old _do_fork() helper has been removed in favor of kernel_clone().
Here correct some comments which still contain _do_fork()
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Christian Brauner <[email protected]>
Signed-off-by: Yanfei Xu <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
|
|
While migrating some code from wq to kthread_worker, I found that I missed
the execute_start/end tracepoints. So add similar tracepoints for
kthread_work. And for completeness, queue_work tracepoint (although this
one differs slightly from the matching workqueue tracepoint).
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Rob Clark <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Thara Gopinath <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vincent Donnefort <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ilias Stamatis <[email protected]>
Cc: Liang Chen <[email protected]>
Cc: Ben Dooks <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "J. Bruce Fields" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
rq->cpu_capacity is a key element in several scheduler parts, such as EAS
task placement and load balancing. Tracking this value enables testing
and/or debugging by a toolkit.
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Change "It it" to "It is".
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map. The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:
https://github.com/auldp/tracepoints-helpers.git
A script to turn the trace-cmd report output into a heatmap plot can be
found at:
https://github.com/jirvoz/plot-nr-running
The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in
the header a wrapper call is used and the trace/events/sched.h include
is moved before sched.h in kernel/sched/core.
Signed-off-by: Phil Auld <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The util_est signals are key elements for EAS task placement and
frequency selection. Having tracepoints to track these signals enables
load-tracking and schedutil testing and/or debugging by a toolkit.
Signed-off-by: Vincent Donnefort <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
Extrapolating on the existing framework to track rt/dl utilization using
pelt signals, add a similar mechanism to track thermal pressure. The
difference here from rt/dl utilization tracking is that, instead of
tracking time spent by a CPU running a RT/DL task through util_avg, the
average thermal pressure is tracked through load_avg. This is because
thermal pressure signal is weighted time "delta" capacity unlike util_avg
which is binary. "delta capacity" here means delta between the actual
capacity of a CPU and the decreased capacity a CPU due to a thermal event.
In order to track average thermal pressure, a new sched_avg variable
avg_thermal is introduced. Function update_thermal_load_avg can be called
to do the periodic bookkeeping (accumulate, decay and average) of the
thermal pressure.
Reviewed-by: Vincent Guittot <[email protected]>
Signed-off-by: Thara Gopinath <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
sched:sched_stick_numa is meant to fire when a task is unable to migrate
to the preferred node but from the trace, it's possibile to tell the
difference between "no CPU found", "migration to idle CPU failed" and
"tasks could not be swapped". Extend the tracepoint accordingly.
Signed-off-by: Mel Gorman <[email protected]>
[ Minor edits. ]
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Valentin Schneider <[email protected]>
Cc: Phil Auld <[email protected]>
Cc: Hillf Danton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The new tracepoint allows us to track the changes in overutilized
status.
Overutilized status is associated with EAS. It indicates that the system
is in high performance state. EAS is disabled when the system is in this
state since there's not much energy savings while high performance tasks
are pushing the system to the limit and it's better to default to the
spreading behavior of the scheduler.
This tracepoint helps understanding and debugging the conditions under
which this happens.
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Pavankumar Kondeti <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Uwe Kleine-Konig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The new tracepoint allows tracking PELT signals at sched_entity level.
Which is supported in CFS tasks and taskgroups only.
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Pavankumar Kondeti <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Uwe Kleine-Konig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The new tracepoints allow tracking PELT signals at rq level for all
scheduling classes + irq.
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Pavankumar Kondeti <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Uwe Kleine-Konig <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
not set
The tracepoints trace_sched_stat_{iowait, blocked, wait, sleep} should
be not exposed to user if CONFIG_SCHEDSTATS is not set.
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
commit 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout")
tried to fix the problem introduced by a previous commit efb40f588b43
("sched/tracing: Fix trace_sched_switch task-state printing"). However
the prev_state output in sched_switch is still broken.
task_state_index() uses fls() which considers the LSB as 1. Left
shifting 1 by this value gives an incorrect mapping to the task state.
Fix this by decrementing the value returned by __get_task_state()
before shifting.
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Fixes: 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout")
Signed-off-by: Pavankumar Kondeti <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
include/trace/events/sched.h includes <linux/sched.h> (via
<linux/sched/numa_balancing.h>) and so knows about the TASK_* constants
used to interpret .prev_state. So instead of duplicating the magic
numbers make use of the defined macros to ease understanding the
mapping from state bits to letters which isn't completely intuitive for
an outsider.
Signed-off-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Since the following commit:
b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()")
the sched_pi_setprio trace point shows the "newprio" during a deboost:
|futex sched_pi_setprio: comm=futex_requeue_p pid"34 oldprio newprio=3D98
|futex sched_switch: prev_comm=futex_requeue_p prev_pid"34 prev_prio=120
This patch open codes __rt_effective_prio() in the tracepoint as the
'newprio' to get the old behaviour back / the correct priority:
|futex sched_pi_setprio: comm=futex_requeue_p pid"20 oldprio newprio=3D120
|futex sched_switch: prev_comm=futex_requeue_p prev_pid"20 prev_prio=120
Peter suggested to open code the new priority so people using tracehook
could get the deadline data out.
Reported-by: Mansky Christian <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The recent conversion of the task state recording to use task_state_index()
broke the sched_switch tracepoint task state output.
task_state_index() returns surprisingly an index (0-7) which is then
printed with __print_flags() applying bitmasks. Not really working and
resulting in weird states like 'prev_state=t' instead of 'prev_state=I'.
Use TASK_REPORT_MAX instead of TASK_STATE_MAX to report preemption. Build a
bitmask from the return value of task_state_index() and store it in
entry->prev_state, which makes __print_flags() work as expected.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: [email protected]
Fixes: efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state printing")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1711221304180.1751@nanos
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Philippe Ombredanne <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
|
|
Steve requested better names for the new task-state helper functions.
So introduce the concept of task-state index for the printing and
rename __get_task_state() to task_state_index() and
__task_state_to_char() to task_index_to_char().
Requested-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it
its own print state because it will not in fact get woken by regular
wakeups and is a long-term state.
This requires moving TASK_PARKED into the TASK_REPORT mask, and since
that latter needs to be a contiguous bitmask, we need to shuffle the
bits around a bit.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Markus reported that kthreads that idle using TASK_IDLE instead of
TASK_INTERRUPTIBLE are reported in as TASK_UNINTERRUPTIBLE and things
like htop mark those red.
This is undesirable, so add an explicit state for TASK_IDLE.
Reported-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Convert trace_sched_switch to use the common task-state helpers and
fix the "X" and "Z" order, possibly they ended up in the wrong order
because TASK_REPORT has them in the wrong order too.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Pass the PI donor task, instead of a numerical priority.
Numerical priorities are not sufficient to describe state ever since
SCHED_DEADLINE.
Annotate all sched tracepoints that are currently broken; fixing them
will bork userspace. *hate*.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
<linux/sched/numa_balancing.h>
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/numa_balancing.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE
user, move trace_sched_switch() from prepare_task_switch() to
__schedule() and propagate the @preempt argument.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Mathieu reported that since 317f394160e9 ("sched: Move the second half
of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of
context of the waker.
This is a problem when you want to analyse wakeup paths because it is
now very hard to correlate the wakeup event to whoever issued the
wakeup.
OTOH trace_sched_wakeup() is issued at the point where we set
p->state = TASK_RUNNING, which is right were we hand the task off to
the scheduler, so this is an important point when looking at
scheduling behaviour, up to here its been the wakeup path everything
hereafter is due to scheduler policy.
To bridge this gap, introduce a second tracepoint: trace_sched_waking.
It is guaranteed to be called in the waker context.
Reported-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Francis Giraldeau <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.
Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Julian Anastasov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
task_preempt_count() has nothing to do with the actual preempt counter,
thread_info->saved_preempt_count is only valid right after switch_to().
__trace_sched_switch_state() can use preempt_count(), prev is still the
current task when trace_sched_switch() is called.
Signed-off-by: Oleg Nesterov <[email protected]>
[ Added BUG_ON(). ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remote wakeups of polling CPUs are a valuable performance
improvement; add a tracepoint to make it much easier to verify that
they're working.
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: David Ahern <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This patch adds three tracepoints
o trace_sched_move_numa when a task is moved to a node
o trace_sched_swap_numa when a task is swapped with another task
o trace_sched_stick_numa when a numa-related migration fails
The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated
o NUMA migrated stuck nr trace_sched_stick_numa
o NUMA migrated idle nr trace_sched_move_numa
o NUMA migrated swapped nr trace_sched_swap_numa
o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen)
o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid
Maybe a small number of these are acceptable
but a high number would be a major surprise.
It would be even worse if bounces are frequent.
o NUMA avg task migs. Average number of migrations for tasks
o NUMA stddev task mig Self-explanatory
o NUMA max task migs. Maximum number of migrations for a single task
In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount
of useless work.
[[email protected]: remove semicolon-after-if, repair coding-style]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Alex Thorlton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
file move
Conflicts:
kernel/Makefile
There are conflicts in kernel/Makefile due to file moving in the
scheduler tree - resolve them.
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Currently check_hung_task() prints a warning if it detects the
problem, but it is not convenient to watch the system logs if
user-space wants to be notified about the hang.
Add the new trace_sched_process_hang() into check_hung_task(),
this way a user-space monitor can easily wait for the hang and
potentially resolve a problem.
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Dave Sullivan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We need a few special preempt_count accessors:
- task_preempt_count() for when we're interested in the preemption
count of another (non-running) task.
- init_task_preempt_count() for properly initializing the preemption
count.
- init_idle_preempt_count() a special case of the above for the idle
threads.
With these no generic code ever touches thread_info::preempt_count
anymore and architectures could choose to remove it.
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The next patch tries to avoid the costly perf_trace_buf_* calls
when possible but there is a problem. We can only do this if
__task == NULL, perf_tp_event(task != NULL) has the additional
code for this case.
Unfortunately, TP_perf_assign/__perf_xxx which changes the default
values of __count/__task variables for perf_trace_buf_submit() is
called "too late", after we already did perf_trace_buf_prepare(),
and the optimization above can't work.
So this patch simply embeds __perf_xxx() into TP_ARGS(), this way
DECLARE_EVENT_CLASS() can use the result of assignments hidden in
"args" right after ftrace_get_offsets_##call() which is mostly
trivial. This allows us to have the fast-path "__task != NULL"
check at the start, see the next patch.
Link: http://lkml.kernel.org/r/[email protected]
Tested-by: David Ahern <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
To simplify the review of the next patches:
1. We are going to reimplent __perf_task/counter and embedd them
into TP_ARGS(). expand TRACE_EVENT(sched_stat_runtime) into
DECLARE_EVENT_CLASS() + DEFINE_EVENT(), this way they can use
different TP_ARGS's.
2. Change perf_trace_##call() macro to do perf_fetch_caller_regs()
right before perf_trace_buf_prepare().
This way it evaluates TP_ARGS() asap, the next patch explores
this fact.
Note: after 87f44bbc perf_trace_buf_prepare() doesn't need
"struct pt_regs *regs", perhaps it makes sense to remove this
argument. And perhaps we can teach perf_trace_buf_submit()
to accept regs == NULL and do fetch_caller_regs(CALLER_ADDR1)
in this case.
3. Cosmetic, but the typecast from "void*" buys nothing. It just
adds the noise, remove it.
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra <[email protected]>
Tested-by: David Ahern <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
The smpboot threads rely on the park/unpark mechanism which binds per
cpu threads on a particular core. Though the functionality is racy:
CPU0 CPU1 CPU2
unpark(T) wake_up_process(T)
clear(SHOULD_PARK) T runs
leave parkme() due to !SHOULD_PARK
bind_to(CPU2) BUG_ON(wrong CPU)
We cannot let the tasks move themself to the target CPU as one of
those tasks is actually the migration thread itself, which requires
that it starts running on the target cpu right away.
The solution to this problem is to prevent wakeups in park mode which
are not from unpark(). That way we can guarantee that the association
of the task to the target cpu is working correctly.
Add a new task state (TASK_PARKED) which prevents other wakeups and
use this state explicitly for the unpark wakeup.
Peter noticed: Also, since the task state is visible to userspace and
all the parked tasks are still in the PID space, its a good hint in ps
and friends that these tasks aren't really there for the moment.
The migration thread has another related issue.
CPU0 CPU1
Bring up CPU2
create_thread(T)
park(T)
wait_for_completion()
parkme()
complete()
sched_set_stop_task()
schedule(TASK_PARKED)
The sched_set_stop_task() call is issued while the task is on the
runqueue of CPU1 and that confuses the hell out of the stop_task class
on that cpu. So we need the same synchronizaion before
sched_set_stop_task().
Reported-by: Dave Jones <[email protected]>
Reported-and-tested-by: Dave Hansen <[email protected]>
Reported-and-tested-by: Borislav Petkov <[email protected]>
Acked-by: Peter Ziljstra <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.
Now a target task can be set by using __perf_task().
The original idea and a draft patch belongs to Peter Zijlstra.
I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arun Sharma <[email protected]>
Signed-off-by: Andrew Vagin <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|