aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2015-12-07Merge branches 'doc.2015.12.05a', 'exp.2015.12.07a', 'fixes.2015.12.07a', ↵Paul E. McKenney9-144/+262
'list.2015.12.04b' and 'torture.2015.12.05a' into HEAD doc.2015.12.05a: Documentation updates exp.2015.12.07a: Expedited grace-period updates fixes.2015.12.07a: Miscellaneous fixes list.2015.12.04b: Linked-list updates torture.2015.12.05a: Torture-test updates
2015-12-07rcu: Make rcu_gp_init() be bool rather than intPaul E. McKenney1-5/+5
The return value from rcu_gp_init() is always used as a bool, so this commit makes it be a bool. Reported-by: Iftekhar Ahmed <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-07rcu: Move wakeup out from under rnp->lockPeter Zijlstra1-1/+1
This patch removes a potential deadlock hazard by moving the wake_up_process() in rcu_spawn_gp_kthread() out from under rnp->lock. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-07rcu: Don't redundantly disable irqs in rcu_irq_{enter,exit}()Paul E. McKenney1-6/+26
This commit replaces a local_irq_save()/local_irq_restore() pair with a lockdep assertion that interrupts are already disabled. This should remove the corresponding overhead from the interrupt entry/exit fastpaths. This change was inspired by the fact that Iftekhar Ahmed's mutation testing showed that removing rcu_irq_enter()'s call to local_ird_restore() had no effect, which might indicate that interrupts were always enabled anyway. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-07rcu: Make cpu_needs_another_gp() be boolPaul E. McKenney1-7/+7
The cpu_needs_another_gp() function is currently of type int, but only returns zero or one. Bow to reality and make it be of type bool. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-07rcu: Eliminate unused rcu_init_one() argumentPaul E. McKenney2-5/+4
Now that the rcu_state structure's ->rda field is compile-time initialized, there is no need to pass the per-CPU rcu_data structure into rcu_init_one(). This commit therefore eliminates this now-unused parameter. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-07rcu: Remove TINY_RCU bloat from pointless boot parametersPaul E. McKenney2-3/+8
The rcu_expedited, rcu_normal, and rcu_normal_after_boot kernel boot parameters are pointless in the case of TINY_RCU because in that case synchronous grace periods, both expedited and normal, are no-ops. However, these three symbols contribute several hundred bytes of bloat. This commit therefore uses CPP directives to avoid compiling this code in TINY_RCU kernels. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2015-12-07Merge branch 'for-4.5-ancestor-test' of ↵Tejun Heo1-27/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-4.5 Signed-off-by: Tejun Heo <[email protected]>
2015-12-07clocksource: Add CPU info to clocksource watchdog reportingSeiichi Ikarashi1-2/+2
The clocksource watchdog reporting was improved by 0b046b217ad4c6. I want to add the info of CPU where the watchdog detects a deviation because it is necessary to identify the trouble spot if the clocksource is TSC. Signed-off-by: Seiichi Ikarashi <[email protected]> [jstultz: Tweaked commit message] Signed-off-by: John Stultz <[email protected]>
2015-12-07time: Avoid signed overflow in timekeeping_get_ns()David Gibson1-2/+1
1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version of the same logic to avoid keeping a semi-redundant struct timespec in struct timekeeper. However, the commit also introduced a subtle semantic change - where clocksource_cyc2ns() uses purely unsigned math, the new version introduces a signed temporary, meaning that if (delta * tk->mult) has a 63-bit overflow the following shift will still give a negative result. The choice of 'maxsec' in __clocksource_updatefreq_scale() means this will generally happen if there's a ~10 minute pause in examining the clocksource. This can be triggered on a powerpc KVM guest by stopping it from qemu for a bit over 10 minutes. After resuming time has jumped backwards several minutes causing numerous problems (jiffies does not advance, msleep()s can be extended by minutes..). It doesn't happen on x86 KVM guests, because the guest TSC is effectively frozen while the guest is stopped, which is not the case for the powerpc timebase. Obviously an unsigned (64 bit) overflow will only take twice as long as a signed, 63-bit overflow. I don't know the time code well enough to know if that will still cause incorrect calculations, or if a 64-bit overflow is avoided elsewhere. Still, an incorrect forwards clock adjustment will cause less trouble than time going backwards. So, this patch removes the potential for intermediate signed overflow. Cc: [email protected] (3.7+) Suggested-by: Laurent Vivier <[email protected]> Tested-by: Laurent Vivier <[email protected]> Signed-off-by: David Gibson <[email protected]> Signed-off-by: John Stultz <[email protected]>
2015-12-07Merge branch 'master' into for-4.4-fixesTejun Heo16-56/+147
The following commit which went into mainline through networking tree 3b13758f51de ("cgroups: Allow dynamically changing net_classid") conflicts in net/core/netclassid_cgroup.c with the following pending fix in cgroup/for-4.4-fixes. 1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling") The former separates out update_classid() from cgrp_attach() and updates it to walk all fds of all tasks in the target css so that it can be used from both migration and config change paths. The latter drops @css from cgrp_attach(). Resolve the conflict by making cgrp_attach() call update_classid() with the css from the first task. We can revive @tset walking in cgrp_attach() but given that net_cls is v1 only where there always is only one target css during migration, this is fine. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Cc: Nina Schiff <[email protected]>
2015-12-06Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds5-15/+45
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "This updates contains the following changes: - Fix a signal handling regression in the bit wait functions. - Avoid false positive warnings in the wakeup path. - Initialize the scheduler root domain properly. - Handle gtime calculations in proc/$PID/stat proper. - Add more documentation for the barriers in try_to_wake_up(). - Fix a subtle race in try_to_wake_up() which might cause a task to be scheduled on two cpus - Compile static helper function only when it is used" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/core: Fix an SMP ordering race in try_to_wake_up() vs. schedule() sched/core: Better document the try_to_wake_up() barriers sched/cputime: Fix invalid gtime in proc sched/core: Clear the root_domain cpumasks in init_rootdomain() sched/core: Remove false-positive warning from wake_up_process() sched/wait: Fix signal handling in bit wait helpers sched/rt: Hide the push_irq_work_func() declaration
2015-12-06perf/core: Collapse common IPI patternPeter Zijlstra1-104/+76
Various functions implement the same pattern to send IPIs to an event's CPU. Collapse the easy ones in a common helper function to reduce duplication. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2015-12-06perf: Do not send exit event twiceJiri Olsa1-11/+31
In case we monitor events system wide, we get EXIT event (when configured) twice for each task that exited. Note doubled lines with same pid/tid in following example: $ sudo ./perf record -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.480 MB perf.data (2518 samples) ] $ sudo ./perf report -D | grep EXIT 0 60290687567581 0x59910 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687568354 0x59948 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687988744 0x59ad8 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 0 60290687989198 0x59b10 [0x38]: PERF_RECORD_EXIT(1250:1250):(1250:1250) 1 60290692567895 0x62af0 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253) 1 60290692568322 0x62b28 [0x38]: PERF_RECORD_EXIT(1253:1253):(1253:1253) 2 60290692739276 0x69a18 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252) 2 60290692739910 0x69a50 [0x38]: PERF_RECORD_EXIT(1252:1252):(1252:1252) The reason is that the cpu contexts are processes each time we call perf_event_task. I'm changing the perf_event_aux logic to serve task_ctx and cpu contexts separately, which ensure we don't get EXIT event generated twice on same cpu context. This does not affect other auxiliary events, as they don't use task_ctx at all. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vince Weaver <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-05rcutorture: Print symbolic name for ->gp_statePaul E. McKenney2-2/+25
Currently, ->gp_state is printed as an integer, which slows debugging. This commit therefore prints a symbolic name in addition to the integer. Signed-off-by: Paul E. McKenney <[email protected]> [ paulmck: Updated to fix relational operator called out by Dan Carpenter. ] [ paulmck: More "const", as suggested by Josh Triplett. ] Reviewed-by: Josh Triplett <[email protected]>
2015-12-05rcutorture: Print symbolic name for rcu_torture_writer_statePaul E. McKenney1-1/+23
Currently, rcu_torture_writer_state is printed as an integer, which slows debugging. This commit therefore prints a symbolic name in addition to the integer. Signed-off-by: Paul E. McKenney <[email protected]> [ paulmck: More "const", as suggested by Josh Triplett. ] Reviewed-by: Josh Triplett <[email protected]>
2015-12-05rcutorture: Dump stack when GP kthread stallsPaul E. McKenney1-1/+4
This commit increases debug information in the case where the grace-period kthread is being prevented from running by dumping that kthread's stack. Signed-off-by: Paul E. McKenney <[email protected]> [ paulmck: Split into prior commit and this commit, as suggested by Josh Triplett. ] Reviewed-by: Josh Triplett <[email protected]>
2015-12-05rcutorture: Flag nonexistent RCU GP kthreadPaul E. McKenney1-1/+1
Currently, if the RCU grace-period kthread has not yet been created, in which case the starvation-check code will print zero for the state, which maps to TASK_RUNNING. This could clearly be quite confusing, so this commit prints ~0, which does not map to any legal ->state value. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2015-12-04rcu: Stop disabling interrupts in scheduler fastpathsPaul E. McKenney3-25/+22
We need the scheduler's fastpaths to be, well, fast, and unnecessarily disabling and re-enabling interrupts is not necessarily consistent with this goal. Especially given that there are regions of the scheduler that already have interrupts disabled. This commit therefore moves the call to rcu_note_context_switch() to one of the interrupts-disabled regions of the scheduler, and removes the now-redundant disabling and re-enabling of interrupts from rcu_note_context_switch() and the functions it calls. Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> [ paulmck: Shift rcu_note_context_switch() to avoid deadlock, as suggested by Peter Zijlstra. ]
2015-12-04rcu: Avoid tick_nohz_active checks on NOCBs CPUsPaul E. McKenney1-5/+2
Currently, rcu_prepare_for_idle() checks for tick_nohz_active, even on individual NOCBs CPUs, unless all CPUs are marked as NOCBs CPUs at build time. This check is pointless on NOCBs CPUs because they never have any callbacks posted, given that all of their callbacks are handed off to the corresponding rcuo kthread. There is a check for individually designated NOCBs CPUs, but it pointelessly follows the check for tick_nohz_active. This commit therefore moves the check for individually designated NOCBs CPUs up with the check for CONFIG_RCU_NOCB_CPU_ALL. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Fix obsolete rcu_bootup_announce_oddness() commentPaul E. McKenney1-2/+1
This function no longer has #ifdefs, so this commit removes the header comment calling them out. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Remove lock-acquisition loop from rcu_read_unlock_special()Paul E. McKenney1-12/+6
Several releases have come and gone without the warning triggering, so remove the lock-acquisition loop. Retain the WARN_ON_ONCE() out of sheer paranoia. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Simplify rcu_sched_qs() control flowPaul E. McKenney1-15/+14
This commit applies an early-exit approach to rcu_sched_qs(), reducing the nesting level and saving a line of code. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04kernel: Make rcu/tree_trace.c explicitly non-modularPaul Gortmaker1-16/+3
The Kconfig currently controlling compilation of this code is: init/Kconfig:config TREE_RCU_TRACE init/Kconfig: def_bool RCU_TRACE && ( TREE_RCU || PREEMPT_RCU ) ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the file there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. We could consider moving this to an earlier initcall if desired. We don't replace module.h with init.h since the file already has that. We also delete the moduleparam.h include that is left over from commit 64db4cfff99c04cd5f550357edcc8780f96b54a2 (""Tree RCU": scalable classic RCU implementation") since it is not needed here either. We morph some tags like MODULE_AUTHOR into the comments at the top of the file for documentation purposes. Cc: "Paul E. McKenney" <[email protected]> Cc: Josh Triplett <[email protected]> Reviewed-by: Josh Triplett <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Paul Gortmaker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Move lock_class_key to local scopePaul E. McKenney1-4/+3
Currently, the rcu_node_class[], rcu_fqs_class[], and rcu_exp_class[] arrays needlessly pollute the global namespace within tree.c. This commit therefore converts them to static local variables within rcu_init_one(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Allow expedited grace periods to be disabled at initPaul E. McKenney1-0/+5
Expedited grace periods can speed up boot, but are undesirable in aggressive real-time systems. This commit therefore introduces a kernel parameter rcupdate.rcu_normal_after_boot that disables expedited grace periods just before init is spawned. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Add rcu_normal kernel parameter to suppress expeditingPaul E. McKenney5-3/+45
Although expedited grace periods can be quite useful, and although their OS jitter has been greatly reduced, they can still pose problems for extreme real-time workloads. This commit therefore adds a rcu_normal kernel boot parameter (which can also be manipulated via sysfs) to suppress expedited grace periods, that is, to treat requests for expedited grace periods as if they were requests for normal grace periods. If both rcu_expedited and rcu_normal are specified, rcu_normal wins. This means that if you are relying on expedited grace periods to speed up boot, you will want to specify rcu_expedited on the kernel command line, and then specify rcu_normal via sysfs once boot completes. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Add more diagnostics to expedited stall warning messages.Paul E. McKenney1-3/+21
This commit adds print statements that check the rcu_node structure to find which ->expmask bits and which ->exp_tasks structures are blocking the current expedited grace period. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Make expedited grace periods resolve stall-warning tiesPaul E. McKenney1-1/+1
Currently, if a grace period ends just as the stall-warning timeout fires, an empty stall warning will be printed. This is not helpful, so this commit avoids these useless warnings by rechecking completion after awakening in synchronize_sched_expedited_wait(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Reduce expedited GP memory contention via per-CPU variablesPaul E. McKenney3-16/+21
Currently, the piggybacked-work checks carried out by sync_exp_work_done() atomically increment a small set of variables (the ->expedited_workdone0, ->expedited_workdone1, ->expedited_workdone2, ->expedited_workdone3 fields in the rcu_state structure), which will form a memory-contention bottleneck given a sufficiently large number of CPUs concurrently invoking either synchronize_rcu_expedited() or synchronize_sched_expedited(). This commit therefore moves these for fields to the per-CPU rcu_data structure, eliminating the memory contention. The show_rcuexp() function also changes to sum up each field in the rcu_data structures. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Invert sync_rcu_exp_select_cpus() "if" statementPaul E. McKenney1-16/+14
This commit saves a couple lines of code and reduces indentation by inverting the sense of an "if" statement in the function sync_rcu_exp_select_cpus(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Move smp_mb() from rcu_seq_snap() to rcu_exp_gp_seq_snap()Paul E. McKenney1-1/+1
The memory barrier in rcu_seq_snap() is needed only for grace periods, so this commit moves it to the grace-period-oriented wrapper rcu_exp_gp_seq_snap(). Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Clarify role of ->expmaskinitnextPaul E. McKenney1-0/+2
Analogy with the ->qsmaskinitnext field might lead one to believe that ->expmaskinitnext tracks online CPUs. This belief is incorrect: Any CPU that has ever been online will have its bit set in the ->expmaskinitnext field. This commit therefore adds a comment to make this clear, at least to people who read comments. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04rcu: Short-circuit synchronize_sched_expedited() if only one CPUPaul E. McKenney1-0/+4
If there is only one CPU, then invoking synchronize_sched_expedited() is by definition a grace period. This commit checks for this condition and does a short-circuit return in that case. Signed-off-by: Paul E. McKenney <[email protected]>
2015-12-04locking/pvqspinlock: Queue node adaptive spinningWaiman Long3-4/+50
In an overcommitted guest where some vCPUs have to be halted to make forward progress in other areas, it is highly likely that a vCPU later in the spinlock queue will be spinning while the ones earlier in the queue would have been halted. The spinning in the later vCPUs is then just a waste of precious CPU cycles because they are not going to get the lock soon as the earlier ones have to be woken up and take their turn to get the lock. This patch implements an adaptive spinning mechanism where the vCPU will call pv_wait() if the previous vCPU is not running. Linux kernel builds were run in KVM guest on an 8-socket, 4 cores/socket Westmere-EX system and a 4-socket, 8 cores/socket Haswell-EX system. Both systems are configured to have 32 physical CPUs. The kernel build times before and after the patch were: Westmere Haswell Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs ----- -------- -------- -------- -------- Before patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s After patch 3m03.0s 4m37.5s 1m43.0s 2m47.2s For 32 vCPUs, this patch doesn't cause any noticeable change in performance. For 48 vCPUs (over-committed), there is about 8% performance improvement. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04locking/pvqspinlock: Allow limited lock stealingWaiman Long3-28/+155
This patch allows one attempt for the lock waiter to steal the lock when entering the PV slowpath. To prevent lock starvation, the pending bit will be set by the queue head vCPU when it is in the active lock spinning loop to disable any lock stealing attempt. This helps to reduce the performance penalty caused by lock waiter preemption while not having much of the downsides of a real unfair lock. The pv_wait_head() function was renamed as pv_wait_head_or_lock() as it was modified to acquire the lock before returning. This is necessary because of possible lock stealing attempts from other tasks. Linux kernel builds were run in KVM guest on an 8-socket, 4 cores/socket Westmere-EX system and a 4-socket, 8 cores/socket Haswell-EX system. Both systems are configured to have 32 physical CPUs. The kernel build times before and after the patch were: Westmere Haswell Patch 32 vCPUs 48 vCPUs 32 vCPUs 48 vCPUs ----- -------- -------- -------- -------- Before patch 3m15.6s 10m56.1s 1m44.1s 5m29.1s After patch 3m02.3s 5m00.2s 1m43.7s 3m03.5s For the overcommited case (48 vCPUs), this patch is able to reduce kernel build time by more than 54% for Westmere and 44% for Haswell. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04locking/pvqspinlock: Collect slowpath lock statisticsWaiman Long2-5/+308
This patch enables the accumulation of kicking and waiting related PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration option is selected. It also enables the collection of data which enable us to calculate the kicking and wakeup latencies which have a heavy dependency on the CPUs being used. The statistical counters are per-cpu variables to minimize the performance overhead in their updates. These counters are exported via the debugfs filesystem under the qlockstat directory. When the corresponding debugfs files are read, summation and computing of the required data are then performed. The measured latencies for different CPUs are: CPU Wakeup Kicking --- ------ ------- Haswell-EX 63.6us 7.4us Westmere-EX 67.6us 9.3us The measured latencies varied a bit from run-to-run. The wakeup latency is much higher than the kicking latency. A sample of statistical counters after system bootup (with vCPU overcommit) was: pv_hash_hops=1.00 pv_kick_unlock=1148 pv_kick_wake=1146 pv_latency_kick=11040 pv_latency_wake=194840 pv_spurious_wakeup=7 pv_wait_again=4 pv_wait_head=23 pv_wait_node=1129 Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/fair: Disable the task group load_avg update for the root_task_groupWaiman Long1-0/+6
Currently, the update_tg_load_avg() function attempts to update the tg's load_avg value whenever the load changes even for root_task_group where the load_avg value will never be used. This patch will disable the load_avg update when the given task group is the root_task_group. Running a Java benchmark with noautogroup and a 4.3 kernel on a 16-socket IvyBridge-EX system, the amount of CPU time (as reported by perf) consumed by task_tick_fair() which includes update_tg_load_avg() decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs results also increased slightly from 983015 to 986449. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ben Segall <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yuyang Du <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/fair: Move the cache-hot 'load_avg' variable into its own cachelineWaiman Long2-4/+13
If a system with large number of sockets was driven to full utilization, it was found that the clock tick handling occupied a rather significant proportion of CPU time when fair group scheduling and autogroup were enabled. Running a java benchmark on a 16-socket IvyBridge-EX system, the perf profile looked like: 10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt 9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt 8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer 8.56% 0.00% java [kernel.vmlinux] [k] update_process_times 8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick 6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair 5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares In particular, the high CPU time consumed by update_cfs_shares() was mostly due to contention on the cacheline that contained the task_group's load_avg statistical counter. This cacheline may also contains variables like shares, cfs_rq & se which are accessed rather frequently during clock tick processing. This patch moves the load_avg variable into another cacheline separated from the other frequently accessed variables. It also creates a cacheline aligned kmemcache for task_group to make sure that all the allocated task_group's are cacheline aligned. By doing so, the perf profile became: 9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt 8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt 7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer 7.74% 0.00% java [kernel.vmlinux] [k] update_process_times 7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick 5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair 4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares The %cpu time is still pretty high, but it is better than before. The benchmark results before and after the patch was as follows: Before patch - Max-jOPs: 907533 Critical-jOps: 134877 After patch - Max-jOPs: 916011 Critical-jOps: 142366 Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Ben Segall <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yuyang Du <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()Waiman Long1-3/+7
Part of the responsibility of the update_sg_lb_stats() function is to update the idle_cpus statistical counter in struct sg_lb_stats. This check is done by calling idle_cpu(). The idle_cpu() function, in turn, checks a number of fields within the run queue structure such as rq->curr and rq->nr_running. With the current layout of the run queue structure, rq->curr and rq->nr_running are in separate cachelines. The rq->curr variable is checked first followed by nr_running. As nr_running is also accessed by update_sg_lb_stats() earlier, it makes no sense to load another cacheline when nr_running is not 0 as idle_cpu() will always return false in this case. This patch eliminates this redundant cacheline load by checking the cached nr_running before calling idle_cpu(). Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/core: Move the sched_to_prio[] arrays out of lineAndi Kleen3-43/+46
When building a kernel with a gcc 6 snapshot the compiler complains about unused const static variables for prio_to_weight and prio_to_mult for multiple scheduler files (all but core.c and autogroup.c) The way the array is currently declared it will be duplicated in every scheduler file that includes sched.h, which seems rather wasteful. Move the array out of line into core.c. I also added a sched_ prefix to avoid any potential name space collisions. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Convert vtime_seqlock to seqcountFrederic Weisbecker2-23/+25
The cputime can only be updated by the current task itself, even in vtime case. So we can safely use seqcount instead of seqlock as there is no writer concurrency involved. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hiroshi Shimamoto <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Introduce vtime accounting check for readersFrederic Weisbecker1-3/+3
Readers need to know if vtime runs at all on some CPU somewhere, this is a fast-path check to determine if we need to check further the need to add up any tickless cputime delta. This fast path check uses context tracking state because vtime is tied to context tracking as of now. This check appears to be confusing though so lets use a vtime function that deals with context tracking details in vtime implementation instead. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hiroshi Shimamoto <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Rename vtime_accounting_enabled() to ↵Frederic Weisbecker2-2/+2
vtime_accounting_cpu_enabled() vtime_accounting_enabled() checks if vtime is running on the current CPU and is as such a misnomer. Lets rename it to a function that reflect its locality. We are going to need the current name for a function that tells if vtime runs at all on some CPU. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hiroshi Shimamoto <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Correctly handle task guest time on housekeepersFrederic Weisbecker1-1/+1
When a task runs on a housekeeper (a CPU running with the periodic tick with neighbours running tickless), it doesn't account cputime using vtime but relies on the tick. Such a task has its vtime_snap_whence value set to VTIME_INACTIVE. Readers won't handle that correctly though. As long as vtime is running on some CPU, readers incorretly assume that vtime runs on all CPUs and always compute the tickless cputime delta, which is only junk on housekeepers. So lets fix this with checking that the target runs on a vtime CPU through the appropriate state check before computing the tickless delta. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hiroshi Shimamoto <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Clarify vtime symbols and document themFrederic Weisbecker2-4/+4
VTIME_SLEEPING state happens either when: 1) The task is sleeping and no tickless delta is to be added on the task cputime stats. 2) The CPU isn't running vtime at all, so the same properties of 1) applies. Lets rename the vtime symbol to reflect both states. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Hiroshi Shimamoto <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/cputime: Remove extra cost in task_cputime()Hiroshi Shimamoto1-0/+16
There is an extra cost in task_cputime() and task_cputime_scaled() when nohz_full is not activated. When vtime accounting is not enabled, we don't need to get deltas of utime and stime under vtime seqlock. This patch removes that cost with adding a shortcut route if vtime accounting is not enabled. Use context_tracking_is_enabled() to check if vtime is accounting on some cpu, in which case only we need to check the tickless cputime delta. Signed-off-by: Hiroshi Shimamoto <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E . McKenney <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/fair: Make it possible to account fair load avg consistentlyByungchul Park3-1/+60
The current code accounts for the time a task was absent from the fair class (per ATTACH_AGE_LOAD). However it does not work correctly when a task got migrated or moved to another cgroup while outside of the fair class. This patch tries to address that by aging on migration. We locklessly read the 'last_update_time' stamp from both the old and new cfs_rq, ages the load upto the old time, and sets it to the new time. These timestamps should in general not be more than 1 tick apart from one another, so there is a definite bound on things. Signed-off-by: Byungchul Park <[email protected]> [ Changelog, a few edits and !SMP build fix ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04sched/core, locking: Document Program-Order guaranteesPeter Zijlstra1-0/+91
These are some notes on the scheduler locking and how it provides program order guarantees on SMP systems. ( This commit is in the locking tree, because the new documentation refers to a newly introduced locking primitive. ) Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Boqun Feng <[email protected]> Cc: David Howells <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2015-12-04locking, sched: Introduce smp_cond_acquire() and use itPeter Zijlstra3-10/+3
Introduce smp_cond_acquire() which combines a control dependency and a read barrier to form acquire semantics. This primitive has two benefits: - it documents control dependencies, - its typically cheaper than using smp_load_acquire() in a loop. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>