path: root/include/trace/events/sched.h
Age | Commit message | Author | Files | Lines
2024-05-22 | tracing/treewide: Remove second parameter of __assign_str() | Steven Rostedt (Google) | 1 | -4/+4
With the rework of how __string() handles dynamic strings, where it saves off the source string in a field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) which used to be assigned with __assign_str(field, mystring), the second parameter is no longer needed and is unused. With this, __assign_str() will now only get a single parameter. There are over 700 users of __assign_str() and, because coccinelle does not handle the TRACE_EVENT() macro, I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/[email protected]/ Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Julia Lawall <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]> Acked-by: Jani Nikula <[email protected]> Acked-by: Christian König <[email protected]> for the amdgpu parts. Acked-by: Thomas Hellström <[email protected]> #for Acked-by: Rafael J. Wysocki <[email protected]> # for thermal Acked-by: Takashi Iwai <[email protected]> Acked-by: Darrick J. Wong <[email protected]> # xfs Tested-by: Guenter Roeck <[email protected]>
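The sed substitution quoted in this commit message can be tried on individual lines before it is run tree-wide. The following is a minimal Python sketch of the same regular expression, translated from sed's basic syntax. The helper name `drop_second_param` is made up for illustration and is not part of the kernel tree or of the commit.

```python
import re

# Python translation of the sed expression quoted in the commit message:
#   s/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/
# In sed's basic syntax "\(" opens a group and "(" is a literal parenthesis;
# in Python it is the other way round.
PATTERN = re.compile(r"(__assign_str\([^,]*[^ ,]) *,[^;]*")


def drop_second_param(line: str) -> str:
    """Rewrite __assign_str(field, src) to __assign_str(field) on one line.

    Hypothetical helper for illustration only.
    """
    # count=1 mirrors sed without the "g" flag: only the first match is replaced.
    return PATTERN.sub(r"\1)", line, count=1)


if __name__ == "__main__":
    print(drop_second_param("__assign_str(field, mystring);"))
```

A per-line substitution like this only sees one line at a time, which is why the commit message describes a separate search for assignments that span several lines.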
2024-05-13 | Merge tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -1/+1
Pull scheduler updates from Ingo Molnar:
  - Add cpufreq pressure feedback for the scheduler
  - Rework misfit load-balancing wrt affinity restrictions
  - Clean up and simplify the code around ::overutilized and ::overload access.
  - Simplify sched_balance_newidle()
  - Bump SCHEDSTAT_VERSION to 16 due to a cleanup of CPU_MAX_IDLE_TYPES handling that changed the output.
  - Rework & clean up <asm/vtime.h> interactions wrt arch_vtime_task_switch()
  - Reorganize, clean up and unify most of the higher level scheduler balancing function names around the sched_balance_*() prefix
  - Simplify the balancing flag code (sched_balance_running)
  - Miscellaneous cleanups & fixes

* tag 'sched-core-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  sched/pelt: Remove shift of thermal clock
  sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure()
  thermal/cpufreq: Remove arch_update_thermal_pressure()
  sched/cpufreq: Take cpufreq feedback into account
  cpufreq: Add a cpufreq pressure feedback for the scheduler
  sched/fair: Fix update of rd->sg_overutilized
  sched/vtime: Do not include <asm/vtime.h> header
  s390/irq,nmi: Include <asm/vtime.h> header directly
  s390/vtime: Remove unused __ARCH_HAS_VTIME_TASK_SWITCH leftover
  sched/vtime: Get rid of generic vtime_task_switch() implementation
  sched/vtime: Remove confusing arch_vtime_task_switch() declaration
  sched/balancing: Simplify the sg_status bitmask and use separate ->overloaded and ->overutilized flags
  sched/fair: Rename set_rd_overutilized_status() to set_rd_overutilized()
  sched/fair: Rename SG_OVERLOAD to SG_OVERLOADED
  sched/fair: Rename {set|get}_rd_overload() to {set|get}_rd_overloaded()
  sched/fair: Rename root_domain::overload to ::overloaded
  sched/fair: Use helper functions to access root_domain::overload
  sched/fair: Check root_domain::overload value before update
  sched/fair: Combine EAS check with root_domain::overutilized access
  sched/fair: Simplify the continue_balancing logic in sched_balance_newidle()
  ...
2024-04-24 | sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() | Vincent Guittot | 1 | -1/+1
Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such a high frequency signal needs filtering to be smoothed and to provide a value that reflects the average available capacity at the scheduler time scale. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Lukasz Luba <[email protected]> Reviewed-by: Qais Yousef <[email protected]> Reviewed-by: Lukasz Luba <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-04-11 | tracing: Add sched_prepare_exec tracepoint | Marco Elver | 1 | -0/+35
Add "sched_prepare_exec" tracepoint, which is run right after the point of no return but before the current task assumes its new exec identity. Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec" tracepoint runs before flushing the old exec, i.e. while the task still has the original state (such as original MM), but when the new exec either succeeds or crashes (but never returns to the original exec).

Being able to trace this event can be helpful in a number of use cases:
  * allowing tracing eBPF programs access to the original MM on exec, before current->mm is replaced;
  * counting exec in the original task (via perf event);
  * profiling flush time ("sched_prepare_exec" to "sched_process_exec").

Example of tracing output:
  $ cat /sys/kernel/debug/tracing/trace_pipe
  <...>-379 [003] ..... 179.626921: sched_prepare_exec: interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
  <...>-381 [002] ..... 180.048580: sched_prepare_exec: interp=/bin/bash filename=/bin/bash pid=381 comm=sshd
  <...>-385 [001] ..... 180.068277: sched_prepare_exec: interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
  <...>-389 [006] ..... 192.020147: sched_prepare_exec: interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash

Signed-off-by: Marco Elver <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Reviewed-by: Masami Hiramatsu (Google) <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
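For the "counting exec in the original task" use case, the trace output shown in this commit message can be post-processed with a few lines of script. The sketch below assumes the event payload has exactly the `interp=... filename=... pid=... comm=...` layout from the example output and that paths contain no spaces; `parse_prepare_exec` is a hypothetical helper name, not an existing tool.

```python
import re
from typing import Optional

# Matches the event payload as it appears in the example trace output.
# Assumption: interp and filename contain no whitespace.
LINE_RE = re.compile(
    r"sched_prepare_exec: interp=(?P<interp>\S+) filename=(?P<filename>\S+) "
    r"pid=(?P<pid>\d+) comm=(?P<comm>.+)$"
)


def parse_prepare_exec(line: str) -> Optional[dict]:
    """Return the event fields as a dict, or None if the line is another event."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields


if __name__ == "__main__":
    sample = ("<...>-381 [002] ..... 180.048580: sched_prepare_exec: "
              "interp=/bin/bash filename=/bin/bash pid=381 comm=sshd")
    print(parse_prepare_exec(sample))
```

Note that `comm` is still the old name (`sshd`) while `filename` is already the new program (`/bin/bash`), which matches the commit's point that the event fires before the task assumes its new exec identity.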
2023-11-15 | sched: Remove vruntime from trace_sched_stat_runtime() | Peter Zijlstra | 1 | -9/+6
Tracing the runtime delta makes sense: an observer can sum it over time. Tracing the absolute vruntime makes less sense: it is inconsistent (absolute vs. delta), and the vruntime delta can be computed from the runtime delta anyway. Removing the vruntime thing also makes the two tracepoint sites identical, allowing the code to be unified in a later patch. Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
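The argument in this commit message can be illustrated with a small sketch: an observer sums the traced runtime deltas, and a vruntime delta is derived from a runtime delta by scaling with the task's load weight. The constant `NICE_0_WEIGHT = 1024` and the plain integer division are assumptions made for illustration; the kernel's own arithmetic uses precomputed fixed-point inverse weights and may round differently.

```python
# Assumed load weight of a nice-0 task; used here only as a scaling reference.
NICE_0_WEIGHT = 1024


def total_runtime(deltas_ns):
    """Sum the per-event runtime deltas an observer collected from the trace."""
    return sum(deltas_ns)


def vruntime_delta(runtime_delta_ns, weight=NICE_0_WEIGHT):
    """Approximate the vruntime advance for a given runtime delta.

    A heavier task (larger weight) accumulates vruntime more slowly.
    This is a simplified model, not the kernel's exact fixed-point math.
    """
    return runtime_delta_ns * NICE_0_WEIGHT // weight


if __name__ == "__main__":
    deltas = [1000, 2500, 500]
    print(total_runtime(deltas), vruntime_delta(total_runtime(deltas), weight=2048))
```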
2023-10-10 | sched/numa: Complete scanning of inactive VMAs when there is no alternative | Mel Gorman | 1 | -1/+2
VMAs are skipped if there is no recent fault activity but this represents a chicken-and-egg problem as there may be no fault activity if the PTEs are never updated to trap NUMA hints. There is an indirect reliance on scanning to be forced early in the lifetime of a task but this may fail to detect changes in phase behaviour. Force inactive VMAs to be scanned when all other eligible VMAs have been updated within the same scan sequence.

Test results in general look good with some changes in performance, both negative and positive, depending on whether the additional scanning and faulting was beneficial or not to the workload. The autonuma benchmark workload NUMA01_THREADLOCAL was picked for closer examination. The workload creates two processes with numerous threads and thread-local storage that is zero-filled in a loop. It exercises the corner case where unrelated threads may skip VMAs that are thread-local to another thread and still has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

  6.6.0-rc2-sched-numabtrace-v1
  =============================
      649 reason=scan_delay
    9,094 reason=unsuitable
   48,915 reason=shared_ro
  143,919 reason=inaccessible
  193,050 reason=pid_inactive

  6.6.0-rc2-sched-numabselective-v1
  =============================
      146 reason=seq_completed
      622 reason=ignore_pid_inactive
      624 reason=scan_delay
    6,570 reason=unsuitable
   16,101 reason=shared_ro
   27,608 reason=inaccessible
   41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored (ignore_pid_inactive) to ensure a VMA with some activity is completely scanned. In addition, a small number of VMAs are scanned when no other eligible VMA is available during a single scan window (seq_completed). The number of times a VMA is skipped due to no PID activity from the scanning task (pid_inactive) drops dramatically.

It is expected that this will increase the number of PTEs updated for NUMA hinting faults as well as hinting faults but these represent PTEs that would otherwise have been missed. The tradeoff is scan+fault overhead versus improving locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the workload is as follows:

                                              6.6.0-rc2                6.6.0-rc2
                                    sched-numabtrace-v1  sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL   174.22 (   0.00%)        117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL   175.68 (   0.00%)        123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL     1.20 (   0.00%)          4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL     0.68 (   0.00%)          3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL   177.18 (   0.00%)        128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                               6.6.0-rc2                6.6.0-rc2
                     sched-numabtrace-v1  sched-numabselective-v1
  Duration User                 91201.80                 63506.64
  Duration System                2015.53                  1819.78
  Duration Elapsed               1234.77                   868.37

In this specific case, system CPU time was not increased but it's not universally true.

From vmstat, the NUMA scanning and fault activity is as follows:

                                               6.6.0-rc2                6.6.0-rc2
                                     sched-numabtrace-v1  sched-numabselective-v1
  Ops NUMA base-page range updates              64272.00              26374386.00
  Ops NUMA PTE updates                          36624.00                 55538.00
  Ops NUMA PMD updates                             54.00                 51404.00
  Ops NUMA hint faults                          15504.00                 75786.00
  Ops NUMA hint local faults %                  14860.00                 56763.00
  Ops NUMA hint local percent                      95.85                    74.90
  Ops NUMA pages migrated                        1629.00               6469222.00

Both the number of PTE updates and hint faults is dramatically increased. While this is superficially unfortunate, it represents ranges that were simply skipped without the patch. As a result of the scanning and hinting faults, many more pages were also migrated but as the time to completion is reduced, the overhead is offset by the gain.

Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Raghavendra K T <[email protected]> Link: https://lore.kernel.org/r/[email protected]
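The percentages quoted in this commit message can be checked from the raw figures. The sketch below assumes the improvement column is computed as (baseline - patched) / baseline and that "hint local percent" is local faults divided by total hint faults; both formulas are inferred from the numbers, not taken from the benchmark tooling, and the function names are made up for illustration.

```python
def improvement_pct(baseline, patched):
    """Relative reduction in elapsed time, in percent (assumed formula)."""
    return (baseline - patched) / baseline * 100.0


def local_fault_pct(local_faults, total_faults):
    """Share of NUMA hint faults that were node-local, in percent (assumed formula)."""
    return local_faults / total_faults * 100.0


if __name__ == "__main__":
    # Amean elapsed time for NUMA01_THREADLOCAL, taken from the table.
    print(f"Amean improvement: {improvement_pct(175.68, 123.34):.2f}%")
    # Hint faults before and after the patch, taken from the vmstat table.
    print(f"local before: {local_fault_pct(14860, 15504):.2f}%")
    print(f"local after:  {local_fault_pct(56763, 75786):.2f}%")
```

Under these assumptions the Min, Amean and Max rows and both "hint local percent" values agree with the quoted figures to two decimal places.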
2023-10-10 | sched/numa: Complete scanning of partial VMAs regardless of PID activity | Mel Gorman | 1 | -1/+2
NUMA Balancing skips VMAs when the current task has not trapped a NUMA fault within the VMA. If the VMA is skipped then mm->numa_scan_offset advances and a task that is trapping faults within the VMA may never fully update PTEs within the VMA. Force tasks to update PTEs for partially scanned PTEs. The VMA will be tagged for NUMA hints by some task but this removes some of the benefit of tracking PID activity within a VMA. A follow-on patch will mitigate this problem. The test cases and machines evaluated did not trigger the corner case so the performance results are neutral with only small changes within the noise from normal test-to-test variance. However, the next patch makes the corner case easier to trigger. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Raghavendra K T <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-10-10 | sched/numa: Trace decisions related to skipping VMAs | Mel Gorman | 1 | -0/+50
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation for completing scans of VMAs regardless of PID access, trace the reasons why a VMA was skipped. In a later patch, the tracing will be used to track if a VMA was forcibly scanned. Signed-off-by: Mel Gorman <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-09-29 | sched/debug: Add new tracepoint to track compute energy computation | Qais Yousef | 1 | -0/+5
It was useful to track feec() placement decision and debug the spare capacity and optimization issues vs uclamp_max. Signed-off-by: Qais Yousef (Google) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2022-05-12 | sched/tracing: Append prev_state to tp args instead | Delyan Kratunov | 1 | -3/+3
Commit fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) added a new prev_state argument to the sched_switch tracepoint, before the prev task_struct pointer. This reordering of arguments broke BPF programs that use the raw tracepoint (e.g. tp_btf programs). The type of the second argument has changed and existing programs that assume a task_struct* argument (e.g. for bpf_task_storage access) will now fail to verify. If we instead append the new argument to the end, all existing programs would continue to work and can conditionally extract the prev_state argument on supported kernel versions. Fixes: fa2c3254d7cf (sched/tracing: Don't re-read p->state when emitting sched_switch event, 2022-01-20) Signed-off-by: Delyan Kratunov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt (Google) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
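Why appending the argument preserves compatibility can be shown with a toy model of positional argument access. This is only a sketch: the tuples stand in for the raw tracepoint argument list, the dictionaries stand in for task_struct pointers, and `old_consumer` is a hypothetical program written before prev_state existed. It does not model the BPF verifier itself.

```python
def layout_inserted(preempt, prev, nxt, prev_state):
    """Argument order with prev_state inserted before the prev task."""
    return (preempt, prev_state, prev, nxt)


def layout_appended(preempt, prev, nxt, prev_state):
    """Argument order with prev_state appended at the end."""
    return (preempt, prev, nxt, prev_state)


def old_consumer(args):
    """Hypothetical consumer that assumes args[1] and args[2] are tasks."""
    prev, nxt = args[1], args[2]
    return prev["comm"], nxt["comm"]


if __name__ == "__main__":
    prev_task, next_task = {"comm": "bash"}, {"comm": "sshd"}
    print(old_consumer(layout_appended(False, prev_task, next_task, 1)))
    try:
        old_consumer(layout_inserted(False, prev_task, next_task, 1))
    except TypeError as err:
        # args[1] is now an integer state, not a task.
        print("old consumer breaks:", err)
```

With the appended layout, positions 0 to 2 are unchanged, so the old consumer keeps working and a newer one can read the extra value at position 3 when it is present.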
2022-03-01 | sched/tracing: Don't re-read p->state when emitting sched_switch event | Valentin Schneider | 1 | -4/+7
As of commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") the following sequence becomes possible:

  p->__state = TASK_INTERRUPTIBLE;
  __schedule()
    deactivate_task(p);
  ttwu()
    READ !p->on_rq
    p->__state=TASK_WAKING
  trace_sched_switch()
    __trace_sched_switch_state()
      task_state_index()
        return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in the trace event. Prevent this by pushing the value read from __schedule() down the trace event.

Reported-by: Abhijeet Dharmapurikar <[email protected]> Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
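The effect described in this commit message can be reproduced with a simplified model of the state reporting. The numeric values below (`TASK_WAKING = 0x200`, `TASK_REPORT = 0x7f`) are assumptions for illustration and should be checked against `include/linux/sched.h` for the kernel version at hand; `report_index` is a hypothetical stand-in for the index computation, not the kernel function.

```python
# Assumed state bit values, for illustration only.
TASK_RUNNING = 0x0000
TASK_INTERRUPTIBLE = 0x0001
TASK_WAKING = 0x0200
TASK_REPORT = 0x007F  # assumed mask of states that are reported in traces


def report_index(state):
    """Index of the highest reportable state bit, 0 meaning "running".

    int.bit_length() plays the role of fls() here.
    """
    return (state & TASK_REPORT).bit_length()


if __name__ == "__main__":
    # State re-read after ttwu() changed it: indistinguishable from running.
    print(report_index(TASK_WAKING))
    # State captured earlier in __schedule(): reported as interruptible sleep.
    print(report_index(TASK_INTERRUPTIBLE))
```

Passing down the value that was read before the wakeup raced in avoids the masked-out state and keeps the reported index meaningful.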
2021-06-10 | sched/tracing: Remove the redundant 'success' in the sched tracepoint | Ed Tsai | 1 | -2/+0
'success' has been left here for a long time and is meaningless to the upper-level user. Just remove it. [ There were some tools expecting this, and this may break them. But hopefully they've been fixed in the meantime. Otherwise this is likely to be reverted - SDR ] Link: https://lkml.kernel.org/r/[email protected] Cc: Peter Zijlstra <[email protected]> Signed-off-by: Ed Tsai <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-03-23 | tracing: Fix various typos in comments | Ingo Molnar | 1 | -1/+1
Fix ~59 single-word typos in the tracing code comments, and fix the grammar in a handful of places. Link: https://lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2021-01-11 | kthread: remove comments about old _do_fork() helper | Yanfei Xu | 1 | -1/+1
The old _do_fork() helper has been removed in favor of kernel_clone(). Correct the comments that still mention _do_fork(). Link: https://lore.kernel.org/r/[email protected] Cc: [email protected] Cc: [email protected] Acked-by: Christian Brauner <[email protected]> Signed-off-by: Yanfei Xu <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2020-12-15 | kthread: add kthread_work tracepoints | Rob Clark | 1 | -0/+84
While migrating some code from wq to kthread_worker, I found that I missed the execute_start/end tracepoints. So add similar tracepoints for kthread_work. And for completeness, queue_work tracepoint (although this one differs slightly from the matching workqueue tracepoint). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Rob Clark <[email protected]> Cc: Rob Clark <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Phil Auld <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Thara Gopinath <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vincent Donnefort <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ilias Stamatis <[email protected]> Cc: Liang Chen <[email protected]> Cc: Ben Dooks <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-10-03 | sched/debug: Add new tracepoint to track cpu_capacity | Vincent Donnefort | 1 | -0/+4
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-22 | trace/events/sched.h: fix duplicated word | Randy Dunlap | 1 | -1/+1
Change "It it" to "It is". Signed-off-by: Randy Dunlap <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-07-08 | sched: Add a tracepoint to track rq->nr_running | Phil Auld | 1 | -0/+4
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks the ->nr_running of a CPU's rq. This is used to accurately trace this data and provide a visualization of scheduler imbalances in, for example, the form of a heat map. The tracepoint is accessed by loading an external kernel module. An example module (forked from Qais' module and including the pelt related tracepoints) can be found at: https://github.com/auldp/tracepoints-helpers.git A script to turn the trace-cmd report output into a heatmap plot can be found at: https://github.com/jirvoz/plot-nr-running The tracepoints are added to add_nr_running() and sub_nr_running() which are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in the header a wrapper call is used and the trace/events/sched.h include is moved before sched.h in kernel/sched/core. Signed-off-by: Phil Auld <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-15 | sched/debug: Add new tracepoints to track util_est | Vincent Donnefort | 1 | -0/+8
The util_est signals are key elements for EAS task placement and frequency selection. Having tracepoints to track these signals enables load-tracking and schedutil testing and/or debugging by a toolkit. Signed-off-by: Vincent Donnefort <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-03-06 | sched/pelt: Add support to track thermal pressure | Thara Gopinath | 1 | -0/+4
Extrapolating on the existing framework to track rt/dl utilization using pelt signals, add a similar mechanism to track thermal pressure. The difference here from rt/dl utilization tracking is that, instead of tracking time spent by a CPU running a RT/DL task through util_avg, the average thermal pressure is tracked through load_avg. This is because the thermal pressure signal is time weighted by "delta capacity", unlike util_avg, which is binary. "Delta capacity" here means the delta between the actual capacity of a CPU and the decreased capacity of a CPU due to a thermal event. In order to track average thermal pressure, a new sched_avg variable avg_thermal is introduced. Function update_thermal_load_avg can be called to do the periodic bookkeeping (accumulate, decay and average) of the thermal pressure. Reviewed-by: Vincent Guittot <[email protected]> Signed-off-by: Thara Gopinath <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-02-24 | sched/numa: Distinguish between the different task_numa_migrate() failure cases | Mel Gorman | 1 | -22/+27
sched:sched_stick_numa is meant to fire when a task is unable to migrate to the preferred node but from the trace, it's not possible to tell the difference between "no CPU found", "migration to idle CPU failed" and "tasks could not be swapped". Extend the tracepoint accordingly. Signed-off-by: Mel Gorman <[email protected]> [ Minor edits. ] Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Phil Auld <[email protected]> Cc: Hillf Danton <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2019-06-24 | sched/debug: Add sched_overutilized tracepoint | Qais Yousef | 1 | -0/+4
The new tracepoint allows us to track the changes in overutilized status. Overutilized status is associated with EAS. It indicates that the system is in a high performance state. EAS is disabled when the system is in this state since there's not much energy savings while high performance tasks are pushing the system to the limit and it's better to default to the spreading behavior of the scheduler. This tracepoint helps in understanding and debugging the conditions under which this happens. Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Quentin Perret <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Uwe Kleine-Konig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-06-24 | sched/debug: Add new tracepoint to track PELT at se level | Qais Yousef | 1 | -0/+4
The new tracepoint allows tracking PELT signals at sched_entity level, which is supported for CFS tasks and taskgroups only. Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Quentin Perret <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Uwe Kleine-Konig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-06-24 | sched/debug: Add new tracepoints to track PELT at rq level | Qais Yousef | 1 | -0/+23
The new tracepoints allow tracking PELT signals at rq level for all scheduling classes + irq. Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Pavankumar Kondeti <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Quentin Perret <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Uwe Kleine-Konig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2019-04-08 | sched/fair: do not expose some tracepoints to user if CONFIG_SCHEDSTATS is not set | Yafang Shao | 1 | -7/+14
The tracepoints trace_sched_stat_{iowait, blocked, wait, sleep} should not be exposed to user if CONFIG_SCHEDSTATS is not set. Link: http://lkml.kernel.org/r/[email protected] Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2018-11-27 | sched, trace: Fix prev_state output in sched_switch tracepoint | Pavankumar Kondeti | 1 | -1/+11
commit 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout") tried to fix the problem introduced by a previous commit efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state printing"). However the prev_state output in sched_switch is still broken. task_state_index() uses fls() which considers the LSB as 1. Left shifting 1 by this value gives an incorrect mapping to the task state. Fix this by decrementing the value returned by __get_task_state() before shifting. Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout") Signed-off-by: Pavankumar Kondeti <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
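The off-by-one described in this commit message, and the earlier 'prev_state=t' instead of 'prev_state=I' symptom mentioned in the 2017-11-24 entry of this log, can both be reproduced with a small model. The bit values and the letter table below are assumptions chosen for illustration, and `print_flags` is a simplified stand-in for the kernel's `__print_flags()`; none of this is the kernel's actual code.

```python
# Assumed state bits and letter table, for illustration only.
TASK_INTERRUPTIBLE = 0x01
TASK_UNINTERRUPTIBLE = 0x02
TASK_REPORT_IDLE = 0x80
FLAG_LETTERS = [(0x01, "S"), (0x02, "D"), (0x04, "T"), (0x08, "t"),
                (0x10, "X"), (0x20, "Z"), (0x40, "P"), (0x80, "I")]


def fls(x):
    """Find last set bit, counting the least significant bit as 1 (0 for x == 0)."""
    return x.bit_length()


def print_flags(value):
    """Simplified flag printer: letters of all known bits set, or "R" for zero."""
    if value == 0:
        return "R"
    return "|".join(letter for bit, letter in FLAG_LETTERS if value & bit)


def mask_raw_index(state):
    """Store the index itself: the printer then misreads it as a bitmask."""
    return fls(state)


def mask_shift_index(state):
    """Shift by the fls() result: one position too far."""
    return 1 << fls(state)


def mask_fixed(state):
    """Decrement before shifting, keeping 0 for the running state."""
    index = fls(state)
    return 1 << (index - 1) if index else 0


if __name__ == "__main__":
    for name, fn in [("raw", mask_raw_index), ("shift", mask_shift_index),
                     ("fixed", mask_fixed)]:
        print(name, print_flags(fn(TASK_INTERRUPTIBLE)),
              print_flags(fn(TASK_UNINTERRUPTIBLE)))
```

In this model only the decremented shift maps each state bit back to itself, so the letter printed matches the state that was recorded.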
2018-09-10 | sched/debug: Use symbolic names for task state constants | Uwe Kleine-König | 1 | -3/+8
include/trace/events/sched.h includes <linux/sched.h> (via <linux/sched/numa_balancing.h>) and so knows about the TASK_* constants used to interpret .prev_state. So instead of duplicating the magic numbers make use of the defined macros to ease understanding the mapping from state bits to letters which isn't completely intuitive for an outsider. Signed-off-by: Uwe Kleine-König <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-05-25 | sched, tracing: Fix trace_sched_pi_setprio() for deboosting | Sebastian Andrzej Siewior | 1 | -1/+3
Since the following commit: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") the sched_pi_setprio trace point shows the "newprio" during a deboost:

  |futex sched_pi_setprio: comm=futex_requeue_p pid=2234 oldprio=98 newprio=98
  |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2234 prev_prio=120

This patch open codes __rt_effective_prio() in the tracepoint as the 'newprio' to get the old behaviour back / the correct priority:

  |futex sched_pi_setprio: comm=futex_requeue_p pid=2220 oldprio=98 newprio=120
  |futex sched_switch: prev_comm=futex_requeue_p prev_pid=2220 prev_prio=120

Peter suggested to open code the new priority so people using tracehook could get the deadline data out.

Reported-by: Mansky Christian <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: b91473ff6e97 ("sched,tracing: Update trace_sched_pi_setprio()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-11-24 | sched/debug: Fix task state recording/printout | Thomas Gleixner | 1 | -3/+3
The recent conversion of the task state recording to use task_state_index() broke the sched_switch tracepoint task state output. task_state_index() returns surprisingly an index (0-7) which is then printed with __print_flags() applying bitmasks. Not really working and resulting in weird states like 'prev_state=t' instead of 'prev_state=I'. Use TASK_REPORT_MAX instead of TASK_STATE_MAX to report preemption. Build a bitmask from the return value of task_state_index() and store it in entry->prev_state, which makes __print_flags() work as expected. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Fixes: efb40f588b43 ("sched/tracing: Fix trace_sched_switch task-state printing") Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1711221304180.1751@nanos Signed-off-by: Ingo Molnar <[email protected]>
2017-11-08 | Merge branch 'linus' into sched/core, to pick up fixes | Ingo Molnar | 1 | -0/+1
Signed-off-by: Ingo Molnar <[email protected]>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman1-0/+1
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. 
- Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). 
 - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred.
 - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics).
 - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation.
 - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time.

In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there were any new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related.

Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with the SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with:
 - a full scancode scan run, collecting the matched texts, detected license ids and scores
 - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches.

Reviewed-by: Kate Stewart <[email protected]>
Reviewed-by: Philippe Ombredanne <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
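The distinction between header and source .c files mentioned above can be sketched in C. This is only an illustration, not the script used to generate the patches (that script is not shown here). The function name sketch_spdx_line() is hypothetical, and the comment styles it picks (a "//" comment for .c files, a block comment for .h files) are assumed from the kernel's documented SPDX convention.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the SPDX tag line for `path` into `out`,
 * choosing the comment style from the file extension.
 * Returns the snprintf() result, or -1 for file types this sketch
 * does not handle.
 */
static int sketch_spdx_line(const char *path, const char *license,
                            char *out, size_t len)
{
    const char *ext = strrchr(path, '.');

    /* Assumed convention: C source files take a C++-style comment. */
    if (ext && strcmp(ext, ".c") == 0)
        return snprintf(out, len, "// SPDX-License-Identifier: %s", license);

    /* Assumed convention: headers take a classic block comment. */
    if (ext && strcmp(ext, ".h") == 0)
        return snprintf(out, len, "/* SPDX-License-Identifier: %s */", license);

    return -1;
}
```

A real tool would also need to cover Makefiles, assembly, scripts and so on, each with its own comment syntax; the sketch deliberately rejects those.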
2017-10-10 | sched/debug: Rename task-state printing helpers | Peter Zijlstra | 1 | -1/+1
Steve requested better names for the new task-state helper functions. So introduce the concept of task-state index for the printing and rename __get_task_state() to task_state_index() and __task_state_to_char() to task_index_to_char(). Requested-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
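The "task-state index" idea can be sketched in user-space C: a state bitmask is reduced to a small index, and the index selects a display character. This is a minimal sketch under stated assumptions, not the kernel's code. The TS_* bit values, the "RSDTtXZPI" character table and the sketch_* function names are illustrative and may not match any particular kernel version.

```c
/* Illustrative state bits (assumed values, not taken from a kernel tree). */
#define TS_RUNNING          0x0000u
#define TS_INTERRUPTIBLE    0x0001u
#define TS_UNINTERRUPTIBLE  0x0002u
#define TS_STOPPED          0x0004u
#define TS_TRACED           0x0008u
#define TS_EXIT_DEAD        0x0010u
#define TS_EXIT_ZOMBIE      0x0020u
#define TS_PARKED           0x0040u
#define TS_NOLOAD           0x0400u
#define TS_IDLE             (TS_UNINTERRUPTIBLE | TS_NOLOAD)

/* Contiguous mask of reportable states, plus one extra slot for idle. */
#define TS_REPORT           0x007fu
#define TS_REPORT_IDLE      (TS_REPORT + 1)

/* Position of the highest set bit, 1-based; 0 when no bit is set. */
static unsigned int sketch_fls(unsigned int x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Stand-in for task_state_index(): takes a state word, not a task. */
static unsigned int sketch_task_state_index(unsigned int state)
{
    unsigned int report = state & TS_REPORT;

    /* Idle is a combination of bits, so it gets its own report slot. */
    if (state == TS_IDLE)
        report = TS_REPORT_IDLE;

    return sketch_fls(report);
}

/* Stand-in for task_index_to_char(): index into a character table. */
static char sketch_task_index_to_char(unsigned int idx)
{
    static const char chars[] = "RSDTtXZPI";

    return idx < sizeof(chars) - 1 ? chars[idx] : '?';
}
```

Splitting the conversion into two steps lets a tracepoint store the compact index and defer the character lookup to print time.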
2017-09-29 | sched/debug: Add explicit TASK_PARKED printing | Peter Zijlstra | 1 | -1/+1
Currently TASK_PARKED is masqueraded as TASK_INTERRUPTIBLE, give it its own print state because it will not in fact get woken by regular wakeups and is a long-term state. This requires moving TASK_PARKED into the TASK_REPORT mask, and since that latter needs to be a contiguous bitmask, we need to shuffle the bits around a bit. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-09-29 | sched/debug: Add explicit TASK_IDLE printing | Peter Zijlstra | 1 | -3/+4
Markus reported that kthreads that idle using TASK_IDLE instead of TASK_INTERRUPTIBLE are reported as TASK_UNINTERRUPTIBLE and things like htop mark those red.

This is undesirable, so add an explicit state for TASK_IDLE.

Reported-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2017-09-29 | sched/tracing: Fix trace_sched_switch task-state printing | Peter Zijlstra | 1 | -7/+11
Convert trace_sched_switch to use the common task-state helpers and fix the "X" and "Z" order, possibly they ended up in the wrong order because TASK_REPORT has them in the wrong order too. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-04-04 | sched,tracing: Update trace_sched_pi_setprio() | Peter Zijlstra | 1 | -7/+9
Pass the PI donor task, instead of a numerical priority. Numerical priorities are not sufficient to describe state ever since SCHED_DEADLINE. Annotate all sched tracepoints that are currently broken; fixing them will bork userspace. *hate*. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-03-02 | sched/headers: Prepare for new header dependencies before moving code to <linux/sched/numa_balancing.h> | Ingo Molnar | 1 | -1/+1
We are going to split <linux/sched/numa_balancing.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/numa_balancing.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2015-10-06 | sched/core: Fix trace_sched_switch() | Peter Zijlstra | 1 | -13/+9
__trace_sched_switch_state() is the last remaining PREEMPT_ACTIVE user, move trace_sched_switch() from prepare_task_switch() to __schedule() and propagate the @preempt argument. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03 | sched: Introduce the 'trace_sched_waking' tracepoint | Peter Zijlstra | 1 | -9/+21
Mathieu reported that since 317f394160e9 ("sched: Move the second half of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of context of the waker.

This is a problem when you want to analyse wakeup paths because it is now very hard to correlate the wakeup event to whoever issued the wakeup.

OTOH trace_sched_wakeup() is issued at the point where we set p->state = TASK_RUNNING, which is right where we hand the task off to the scheduler, so this is an important point when looking at scheduling behaviour: up to here it's been the wakeup path, everything hereafter is due to scheduler policy.

To bridge this gap, introduce a second tracepoint: trace_sched_waking. It is guaranteed to be called in the waker context.

Reported-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Francis Giraldeau <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2015-05-19 | sched/wait: Introduce TASK_NOLOAD and TASK_IDLE | Peter Zijlstra | 1 | -1/+2
Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having all idle kthreads contribute to the loadavg is somewhat silly.

Now mostly this works OK, because kthreads have all their signals masked. However there are a few sites where this is causing problems and TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.

This patch adds TASK_NOLOAD which, when combined with TASK_UNINTERRUPTIBLE, avoids the loadavg accounting.

As most of the imagined usage sites are loops where a thread wants to idle, waiting for work, a helper TASK_IDLE is introduced.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Julian Anastasov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
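How TASK_NOLOAD combines with TASK_UNINTERRUPTIBLE can be sketched in C. This is a minimal sketch rather than the kernel's implementation: the LD_* bit values and the predicate name sketch_contributes_to_load() are hypothetical, and the real load accounting may take further conditions into account.

```c
#include <stdbool.h>

/* Illustrative state bits (assumed values). */
#define LD_INTERRUPTIBLE    0x0001u
#define LD_UNINTERRUPTIBLE  0x0002u
#define LD_NOLOAD           0x0400u

/* The helper state: uninterruptible sleep that is exempt from loadavg. */
#define LD_IDLE             (LD_UNINTERRUPTIBLE | LD_NOLOAD)

/*
 * Hypothetical predicate: an uninterruptible sleeper counts towards
 * the load average unless the no-load modifier bit is also set.
 */
static bool sketch_contributes_to_load(unsigned int state)
{
    return (state & LD_UNINTERRUPTIBLE) && !(state & LD_NOLOAD);
}
```

With this encoding, a kthread that sleeps in the combined idle state still ignores signals like any uninterruptible sleeper, but no longer inflates the load average.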
2014-10-28 | sched: Fix the PREEMPT_ACTIVE check in __trace_sched_switch_state() | Oleg Nesterov | 1 | -3/+6
task_preempt_count() has nothing to do with the actual preempt counter, thread_info->saved_preempt_count is only valid right after switch_to(). __trace_sched_switch_state() can use preempt_count(), prev is still the current task when trace_sched_switch() is called. Signed-off-by: Oleg Nesterov <[email protected]> [ Added BUG_ON(). ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05 | sched, trace: Add a tracepoint for IPI-less remote wakeups | Andy Lutomirski | 1 | -0/+20
Remote wakeups of polling CPUs are a valuable performance improvement; add a tracepoint to make it much easier to verify that they're working. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: David Ahern <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
2014-01-21 | sched: add tracepoints related to NUMA task migration | Mel Gorman | 1 | -0/+87
This patch adds three tracepoints
 o trace_sched_move_numa   when a task is moved to a node
 o trace_sched_swap_numa   when a task is swapped with another task
 o trace_sched_stick_numa  when a numa-related migration fails

The tracepoints allow the NUMA scheduler activity to be monitored and the following high-level metrics can be calculated

 o NUMA migrated stuck     nr trace_sched_stick_numa
 o NUMA migrated idle      nr trace_sched_move_numa
 o NUMA migrated swapped   nr trace_sched_swap_numa
 o NUMA local swapped      trace_sched_swap_numa src_nid == dst_nid (should never happen)
 o NUMA remote swapped     trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
 o NUMA group swapped      trace_sched_swap_numa src_ngid == dst_ngid
                           Maybe a small number of these are acceptable but a high number would be a major surprise. It would be even worse if bounces are frequent.
 o NUMA avg task migs.     Average number of migrations for tasks
 o NUMA stddev task mig    Self-explanatory
 o NUMA max task migs.     Maximum number of migrations for a single task

In general the intent of the tracepoints is to help diagnose problems where automatic NUMA balancing appears to be doing an excessive amount of useless work.

[[email protected]: remove semicolon-after-if, repair coding-style]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Alex Thorlton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
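The swap-related metrics listed above could be derived from recorded trace_sched_swap_numa events roughly as follows. This is a sketch under stated assumptions: the struct layout and the names sketch_swap_event, sketch_swap_stats and sketch_count_swaps() are hypothetical, and only the four fields named in the commit message (src_nid, dst_nid, src_ngid, dst_ngid) are modelled.

```c
#include <stddef.h>

/* Hypothetical record holding the fields of one swap event. */
struct sketch_swap_event {
    int src_nid;   /* source NUMA node id */
    int dst_nid;   /* destination NUMA node id */
    int src_ngid;  /* NUMA group id of the source task */
    int dst_ngid;  /* NUMA group id of the destination task */
};

/* Counters matching the swap metrics in the commit message. */
struct sketch_swap_stats {
    size_t swapped;  /* NUMA migrated swapped: every swap event */
    size_t local;    /* src_nid == dst_nid, expected to stay 0 */
    size_t remote;   /* src_nid != dst_nid */
    size_t group;    /* src_ngid == dst_ngid */
};

static struct sketch_swap_stats
sketch_count_swaps(const struct sketch_swap_event *ev, size_t n)
{
    struct sketch_swap_stats st = { 0, 0, 0, 0 };

    for (size_t i = 0; i < n; i++) {
        st.swapped++;
        if (ev[i].src_nid == ev[i].dst_nid)
            st.local++;
        else
            st.remote++;
        if (ev[i].src_ngid == ev[i].dst_ngid)
            st.group++;
    }
    return st;
}
```

If the tracepoints behave as described, the local counter should remain zero and the remote counter should equal the total, which makes the two a cheap consistency check on a trace.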
2013-11-06 | Merge branch 'sched/core' into core/locking, to prepare the kernel/locking/ file move | Ingo Molnar | 1 | -1/+1
Conflicts: kernel/Makefile

There are conflicts in kernel/Makefile due to file moving in the scheduler tree - resolve them.

Signed-off-by: Ingo Molnar <[email protected]>
2013-10-31 | hung_task debugging: Add tracepoint to report the hang | Oleg Nesterov | 1 | -0/+19
Currently check_hung_task() prints a warning if it detects the problem, but it is not convenient to watch the system logs if user-space wants to be notified about the hang. Add the new trace_sched_process_hang() into check_hung_task(), this way a user-space monitor can easily wait for the hang and potentially resolve a problem. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Dave Sullivan <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-09-25 | sched: Create more preempt_count accessors | Peter Zijlstra | 1 | -1/+1
We need a few special preempt_count accessors:
 - task_preempt_count() for when we're interested in the preemption count of another (non-running) task.
 - init_task_preempt_count() for properly initializing the preemption count.
 - init_idle_preempt_count() a special case of the above for the idle threads.

With these no generic code ever touches thread_info::preempt_count anymore and architectures could choose to remove it.

Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2013-08-13 | tracing/perf: Reimplement TP_perf_assign() logic | Oleg Nesterov | 1 | -13/+3
The next patch tries to avoid the costly perf_trace_buf_* calls when possible but there is a problem. We can only do this if __task == NULL, perf_tp_event(task != NULL) has the additional code for this case. Unfortunately, TP_perf_assign/__perf_xxx which changes the default values of __count/__task variables for perf_trace_buf_submit() is called "too late", after we already did perf_trace_buf_prepare(), and the optimization above can't work. So this patch simply embeds __perf_xxx() into TP_ARGS(), this way DECLARE_EVENT_CLASS() can use the result of assignments hidden in "args" right after ftrace_get_offsets_##call() which is mostly trivial. This allows us to have the fast-path "__task != NULL" check at the start, see the next patch. Link: http://lkml.kernel.org/r/[email protected] Tested-by: David Ahern <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2013-08-13 | tracing/perf: Expand TRACE_EVENT(sched_stat_runtime) | Oleg Nesterov | 1 | -1/+5
To simplify the review of the next patches:

1. We are going to reimplement __perf_task/counter and embed them into TP_ARGS(). Expand TRACE_EVENT(sched_stat_runtime) into DECLARE_EVENT_CLASS() + DEFINE_EVENT(), this way they can use different TP_ARGS's.

2. Change perf_trace_##call() macro to do perf_fetch_caller_regs() right before perf_trace_buf_prepare(). This way it evaluates TP_ARGS() asap, the next patch explores this fact.

   Note: after 87f44bbc perf_trace_buf_prepare() doesn't need "struct pt_regs *regs", perhaps it makes sense to remove this argument. And perhaps we can teach perf_trace_buf_submit() to accept regs == NULL and do fetch_caller_regs(CALLER_ADDR1) in this case.

3. Cosmetic, but the typecast from "void*" buys nothing. It just adds noise, remove it.

Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra <[email protected]>
Tested-by: David Ahern <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-04-12 | kthread: Prevent unpark race which puts threads on the wrong cpu | Thomas Gleixner | 1 | -1/+1
The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy:

CPU0                  CPU1                                  CPU2
unpark(T)                                                   wake_up_process(T)
  clear(SHOULD_PARK)  T runs
                      leave parkme() due to !SHOULD_PARK
  bind_to(CPU2)       BUG_ON(wrong CPU)

We cannot let the tasks move themselves to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away.

The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly.

Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup.

Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, it's a good hint in ps and friends that these tasks aren't really there for the moment.

The migration thread has another related issue.

CPU0                      CPU1
Bring up CPU2
create_thread(T)
park(T)
  wait_for_completion()
                          parkme()
                          complete()
sched_set_stop_task()
                          schedule(TASK_PARKED)

The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronization before sched_set_stop_task().

Reported-by: Dave Jones <[email protected]>
Reported-and-tested-by: Dave Hansen <[email protected]>
Reported-and-tested-by: Borislav Petkov <[email protected]>
Acked-by: Peter Ziljstra <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
Signed-off-by: Thomas Gleixner <[email protected]>
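The way a dedicated state can "prevent other wakeups" can be sketched as a mask check: a wakeup only succeeds when the sleeper's state is among the states the waker asks for. This is an illustration under stated assumptions, not the kernel's wakeup path. The PK_* values and the function sketch_try_wake() are hypothetical.

```c
#include <stdbool.h>

/* Illustrative state bits (assumed values). */
#define PK_RUNNING          0x00u
#define PK_INTERRUPTIBLE    0x01u
#define PK_UNINTERRUPTIBLE  0x02u
#define PK_PARKED           0x40u

/* States targeted by an ordinary wakeup in this sketch. */
#define PK_NORMAL           (PK_INTERRUPTIBLE | PK_UNINTERRUPTIBLE)

/*
 * Hypothetical wakeup: make the task runnable only if its current
 * state matches the mask requested by the waker.
 */
static bool sketch_try_wake(unsigned int *task_state, unsigned int wake_mask)
{
    if (!(*task_state & wake_mask))
        return false;       /* state not targeted: wakeup is ignored */

    *task_state = PK_RUNNING;
    return true;
}
```

In this model an ordinary wakeup passes PK_NORMAL and leaves a parked task untouched, while the unpark path passes PK_PARKED explicitly, so a parked thread can only be released by unpark.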
2012-07-31 | perf/trace: Add ability to set a target task for events | Andrew Vagin | 1 | -0/+4
A few events are interesting not only for the current task. For example, sched_stat_* events are interesting for the task which wakes up. For this reason, it would be good if such events were delivered to a target task too.

Now a target task can be set by using __perf_task().

The original idea and a draft patch belong to Peter Zijlstra.

I need these events for profiling sleep times. sched_switch is used for getting callchains and sched_stat_* is used for getting time periods. These events are combined in user space, and the result can then be analyzed by perf tools.

Inspired-by: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Arun Sharma <[email protected]>
Signed-off-by: Andrew Vagin <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>