path: root/kernel
Age  Commit message  Author  Files  Lines
2011-12-11rcu: Eliminate RCU_FAST_NO_HZ grace-period hangPaul E. McKenney3-81/+2
With the new implementation of RCU_FAST_NO_HZ, it was possible to hang RCU grace periods as follows: o CPU 0 attempts to go idle, cycles several times through the rcu_prepare_for_idle() loop, then goes dyntick-idle when RCU needs nothing more from it, while still having at least one RCU callback pending. o CPU 1 goes idle with no callbacks. Both CPUs can then stay in dyntick-idle mode indefinitely, preventing the RCU grace period from ever completing, possibly hanging the system. This commit therefore prevents CPUs that have RCU callbacks from entering dyntick-idle mode. This approach also eliminates the need for the end-of-grace-period IPIs used previously. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
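The rule this commit adopts can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct cpu_model` and `try_enter_dyntick_idle()` are hypothetical names invented for illustration, not the kernel's data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; not the kernel's actual RCU structures. */
struct cpu_model {
    unsigned int callbacks_pending; /* queued RCU callbacks */
    bool dyntick_idle;              /* true once the tick has been stopped */
};

/*
 * Idle-entry gate: refuse dyntick-idle while callbacks are queued, so the
 * scheduling-clock tick keeps driving grace-period processing on this CPU.
 * Returns true only if the CPU actually entered dyntick-idle mode.
 */
bool try_enter_dyntick_idle(struct cpu_model *cpu)
{
    if (cpu->callbacks_pending > 0)
        return false;
    cpu->dyntick_idle = true;
    return true;
}
```

In this model, the CPU 0 of the scenario above (one callback pending) is kept out of dyntick-idle mode, while CPU 1 (no callbacks) is allowed in, so at least one CPU keeps taking ticks until the callbacks have been handled.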
2011-12-11rcu: Avoid needlessly IPIing CPUs at GP endPaul E. McKenney1-2/+14
If a CPU enters dyntick-idle mode with callbacks pending, it will need an IPI at the end of the grace period. However, if it exits dyntick-idle mode before the grace period ends, it will be needlessly IPIed at the end of the grace period. Therefore, this commit clears the per-CPU rcu_awake_at_gp_end flag when a CPU determines that it does not need it. This in turn requires disabling interrupts across much of rcu_prepare_for_idle() in order to avoid having nested interrupts clearing this state out from under us. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Go dyntick-idle more quickly if CPU has serviced current grace periodPaul E. McKenney1-6/+18
The earlier version would attempt to push callbacks through five times before going into dyntick-idle mode if callbacks remained, but the CPU had done all that it needed to do for the current RCU grace periods. This is wasteful: In most cases, once the CPU has done all that it needs to for the current RCU grace periods, it will make no further progress on the callbacks no matter how many times it loops through the RCU core processing and the idle-entry code. This commit therefore goes to dyntick-idle mode whenever the current CPU has done all it can for the current grace period. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Add tracing for RCU_FAST_NO_HZPaul E. McKenney1-3/+15
This commit adds trace_rcu_prep_idle(), which is invoked from rcu_prepare_for_idle() and rcu_wake_cpu() to trace attempts on the part of RCU to force CPUs into dyntick-idle mode. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11nohz: Remove tick_nohz_idle_enter_norcu() / tick_nohz_idle_exit_norcu()Frederic Weisbecker1-7/+8
Those two APIs were provided to optimize the calls of tick_nohz_idle_enter() and rcu_idle_enter() into a single irq disabled section. This way no interrupt happening in-between would needlessly process any RCU job. Now we are talking about an optimization for which benefits have yet to be measured. Let's start simple and completely decouple idle rcu and dyntick idle logics to simplify. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Add rcutorture CPU-hotplug capabilityPaul E. McKenney1-5/+112
Running CPU-hotplug operations concurrently with rcutorture has historically been a good way to find bugs in both RCU and CPU hotplug. This commit therefore adds an rcutorture module parameter called "onoff_interval" that causes a randomly selected CPU-hotplug operation to be executed at the specified interval, in seconds. The default value of "onoff_interval" is zero, which disables rcutorture-instigated CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11events: Make events use the new is_idle_task() APIPaul E. McKenney1-1/+1
Change from direct comparison of ->pid with zero to is_idle_task(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11kdb: Make KDB use the new is_idle_task() APIPaul E. McKenney1-1/+1
Change from direct comparison of ->pid with zero to is_idle_task(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jason Wessel <jason.wessel@windriver.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Make RCU use the new is_idle_task() APIPaul E. McKenney2-4/+4
Change from direct comparison of ->pid with zero to is_idle_task(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Control rcutorture startup from kernel boot parametersPaul E. McKenney1-0/+2
Currently, if rcutorture is built into the kernel, it must be manually started or started from an init script. This is inconvenient for automated KVM testing, where it is good to be able to fully control rcutorture execution from the kernel parameters. This patch therefore adds a module parameter named "rcutorture_runnable" that defaults to zero ("don't start automatically"), but which can be set to one to cause rcutorture to start up immediately during boot. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Add rcutorture system-shutdown capabilityPaul E. McKenney1-4/+64
Although it is easy to run rcutorture tests under KVM, there is currently no nice way to run such a test for a fixed time period, collect all of the rcutorture data, and then shut the system down cleanly. This commit therefore adds an rcutorture module parameter named "shutdown_secs" that specifies the run duration in seconds, after which rcutorture terminates the test and powers the system down. The default value for "shutdown_secs" is zero, which disables shutdown. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Fix idle-task checksPaul E. McKenney2-4/+4
RCU has traditionally relied on idle_cpu() to determine whether a given CPU is running in the context of an idle task, but commit 908a3283 (Fix idle_cpu()) has invalidated this approach. After commit 908a3283, idle_cpu() will return true if the current CPU is currently running the idle task, and will be doing so for the foreseeable future. RCU instead needs to know whether or not the current CPU is currently running the idle task, regardless of what the near future might bring. This commit therefore switches from idle_cpu() to "current->pid != 0". Reported-by: Wu Fengguang <fengguang.wu@intel.com> Suggested-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
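The difference between the two questions can be sketched in user space. This is a minimal model under stated assumptions: `struct rq_model` and its fields are hypothetical stand-ins for whatever the scheduler consults, and both function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a task and a per-CPU run queue. */
struct task_model {
    int pid;                        /* the idle task has pid 0 */
};

struct rq_model {
    const struct task_model *curr;  /* task currently on the CPU */
    unsigned int nr_running;        /* runnable tasks queued on this CPU */
    bool wakeups_pending;           /* wakeups not yet processed */
};

/* "Idle now and likely to stay idle": the question idle_cpu() answers
 * after the fix described above (modeled, not the real implementation). */
bool cpu_idle_and_staying_idle(const struct rq_model *rq)
{
    return rq->curr->pid == 0 && rq->nr_running == 0 && !rq->wakeups_pending;
}

/* "Running the idle task right now": the question RCU needs answered. */
bool running_idle_task(const struct rq_model *rq)
{
    return rq->curr->pid == 0;
}
```

A CPU that is still running the idle task but already has a wakeup queued answers the two questions differently, which is why RCU could no longer rely on the first one.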
2011-12-11rcu: Allow dyntick-idle mode for CPUs with callbacksPaul E. McKenney3-33/+132
Currently, RCU does not permit a CPU to enter dyntick-idle mode if that CPU has any RCU callbacks queued. This means that workloads for which each CPU wakes up and does some RCU updates every few ticks will never enter dyntick-idle mode. This can result in significant unnecessary power consumption, so this patch permits a given CPU to enter dyntick-idle mode if it has callbacks, but only if that same CPU has completed all current work for the RCU core. We use rcu_pending() to determine whether a given CPU has completed all current work for the RCU core. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Add more information to the wrong-idle-task complaintPaul E. McKenney2-4/+20
The current code just complains if the current task is not the idle task. This commit therefore adds printing of the identity of the idle task. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Deconfuse dynticks entry-exit tracingPaul E. McKenney3-26/+44
The trace_rcu_dyntick() trace event did not print both the old and the new value of the nesting level, and furthermore printed only the low-order 32 bits of it. This could result in some confusion when interpreting trace-event dumps, so this commit prints both the old and the new value, prints the full 64 bits, and also selects the process-entry/exit increment to print nicely in hexadecimal. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Fix early call to rcu_idle_enter()Frederic Weisbecker1-1/+1
On the irq exit path, tick_nohz_irq_exit() may raise a softirq, which action leads to the wake up path and select_task_rq_fair() that makes use of rcu to iterate the domains. This is an illegal use of RCU because we may be in RCU extended quiescent state if we interrupted an RCU-idle window in the idle loop:

[ 132.978883] ===============================
[ 132.978883] [ INFO: suspicious RCU usage. ]
[ 132.978883] -------------------------------
[ 132.978883] kernel/sched_fair.c:1707 suspicious rcu_dereference_check() usage!
[ 132.978883]
[ 132.978883] other info that might help us debug this:
[ 132.978883]
[ 132.978883]
[ 132.978883] rcu_scheduler_active = 1, debug_locks = 0
[ 132.978883] RCU used illegally from extended quiescent state!
[ 132.978883] 2 locks held by swapper/0:
[ 132.978883] #0: (&p->pi_lock){-.-.-.}, at: [<ffffffff8105a729>] try_to_wake_up+0x39/0x2f0
[ 132.978883] #1: (rcu_read_lock){.+.+..}, at: [<ffffffff8105556a>] select_task_rq_fair+0x6a/0xec0
[ 132.978883]
[ 132.978883] stack backtrace:
[ 132.978883] Pid: 0, comm: swapper Tainted: G W 3.0.0+ #178
[ 132.978883] Call Trace:
[ 132.978883] <IRQ> [<ffffffff810a01f6>] lockdep_rcu_suspicious+0xe6/0x100
[ 132.978883] [<ffffffff81055c49>] select_task_rq_fair+0x749/0xec0
[ 132.978883] [<ffffffff8105556a>] ? select_task_rq_fair+0x6a/0xec0
[ 132.978883] [<ffffffff812fe494>] ? do_raw_spin_lock+0x54/0x150
[ 132.978883] [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
[ 132.978883] [<ffffffff8105a7c3>] try_to_wake_up+0xd3/0x2f0
[ 132.978883] [<ffffffff81094f98>] ? ktime_get+0x68/0xf0
[ 132.978883] [<ffffffff8105aa35>] wake_up_process+0x15/0x20
[ 132.978883] [<ffffffff81069dd5>] raise_softirq_irqoff+0x65/0x110
[ 132.978883] [<ffffffff8108eb65>] __hrtimer_start_range_ns+0x415/0x5a0
[ 132.978883] [<ffffffff812fe3ee>] ? do_raw_spin_unlock+0x5e/0xb0
[ 132.978883] [<ffffffff8108ed08>] hrtimer_start+0x18/0x20
[ 132.978883] [<ffffffff8109c9c3>] tick_nohz_stop_sched_tick+0x393/0x450
[ 132.978883] [<ffffffff810694f2>] irq_exit+0xd2/0x100
[ 132.978883] [<ffffffff81829e96>] do_IRQ+0x66/0xe0
[ 132.978883] [<ffffffff81820d53>] common_interrupt+0x13/0x13
[ 132.978883] <EOI> [<ffffffff8103434b>] ? native_safe_halt+0xb/0x10
[ 132.978883] [<ffffffff810a1f2d>] ? trace_hardirqs_on+0xd/0x10
[ 132.978883] [<ffffffff810144ea>] default_idle+0xba/0x370
[ 132.978883] [<ffffffff810147fe>] amd_e400_idle+0x5e/0x130
[ 132.978883] [<ffffffff8100a9f6>] cpu_idle+0xb6/0x120
[ 132.978883] [<ffffffff817f217f>] rest_init+0xef/0x150
[ 132.978883] [<ffffffff817f20e2>] ? rest_init+0x52/0x150
[ 132.978883] [<ffffffff81ed9cf3>] start_kernel+0x3da/0x3e5
[ 132.978883] [<ffffffff81ed9346>] x86_64_start_reservations+0x131/0x135
[ 132.978883] [<ffffffff81ed944d>] x86_64_start_kernel+0x103/0x112

Fix this by calling rcu_idle_enter() after tick_nohz_irq_exit(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11nohz: Allow rcu extended quiescent state handling seperately from tick stopFrederic Weisbecker1-12/+13
It is assumed that rcu won't be used once we switch to tickless mode and until we restart the tick. However this is not always true, as in x86-64 where we dereference the idle notifiers after the tick is stopped. To prepare for fixing this, add two new APIs: tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu(). If no use of RCU is made in the idle loop between tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch must instead call the new *_norcu() version such that the arch doesn't need to call rcu_idle_enter() and rcu_idle_exit(). Otherwise the arch must call tick_nohz_enter_idle() and tick_nohz_exit_idle() and also call explicitly: - rcu_idle_enter() after its last use of RCU before the CPU is put to sleep. - rcu_idle_exit() before the first use of RCU after the CPU is woken up. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11nohz: Separate out irq exit and idle loop dyntick logicFrederic Weisbecker2-37/+58
The tick_nohz_stop_sched_tick() function, which tries to delay the next timer tick as long as possible, can be called from two places: - From the idle loop to start the dyntick idle mode - From interrupt exit if we have interrupted the dyntick idle mode, so that we reprogram the next tick event in case the irq changed some internal state that requires this action. There are only a few minor differences between both that are handled by that function, driven by the ts->inidle cpu variable and the inidle parameter. The whole guarantees that we only update the dyntick mode on irq exit if we actually interrupted the dyntick idle mode, and that we enter in RCU extended quiescent state from idle loop entry only. Split this function into: - tick_nohz_idle_enter(), which sets ts->inidle to 1, enters dynticks idle mode unconditionally if it can, and enters into RCU extended quiescent state. - tick_nohz_irq_exit() which only updates the dynticks idle mode when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called). To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed into tick_nohz_idle_exit(). This simplifies the code and micro-optimizes the irq exit path (no need for local_irq_save there). This also prepares for the split between dynticks and rcu extended quiescent state logics. We'll need this split to further fix illegal uses of RCU in extended quiescent states in the idle loop. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Inform the user about extended quiescent state on PROVE_RCU warningFrederic Weisbecker1-0/+22
Inform the user if an RCU usage error is detected by lockdep while in an extended quiescent state (in this case, the RCU-free window in idle). This is accomplished by adding a line to the RCU lockdep splat indicating whether or not the splat occurred in extended quiescent state. Uses of RCU from within extended quiescent state mode are totally ignored by RCU, hence the importance of this diagnostic. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Detect illegal rcu dereference in extended quiescent stateFrederic Weisbecker3-0/+4
Report that none of the rcu read lock maps are held while in an RCU extended quiescent state (the section between rcu_idle_enter() and rcu_idle_exit()). This helps detect any use of rcu_dereference() and friends from within the section in idle where RCU is not allowed. This way we can guarantee an extended quiescent window where the CPU can be put in dyntick idle mode or can simply avoid being part of any global grace period completion while in the idle loop. Uses of RCU from such mode are totally ignored by RCU, hence the importance of these checks. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Remove redundant return from rcu_report_exp_rnp()Thomas Gleixner1-1/+0
Empty void functions do not need "return", so this commit removes it from rcu_report_exp_rnp(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Omit self-awaken when setting up expedited grace periodThomas Gleixner3-7/+14
When setting up an expedited grace period, if there were no readers, the task will awaken itself. This commit removes this useless self-awakening. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11rcu: Disable preemption in rcu_is_cpu_idle()Paul E. McKenney1-3/+7
Because rcu_is_cpu_idle() is to be used to check for extended quiescent states in RCU-preempt read-side critical sections, it cannot assume that preemption is disabled. And preemption must be disabled when accessing the dyntick-idle state, because otherwise the following sequence of events could occur: 1. Task A on CPU 1 enters rcu_is_cpu_idle() and picks up the pointer to CPU 1's per-CPU variables. 2. Task B preempts Task A and starts running on CPU 1. 3. Task A migrates to CPU 2. 4. Task B blocks, leaving CPU 1 idle. 5. Task A continues execution on CPU 2, accessing CPU 1's dyntick-idle information using the pointer fetched in step 1 above, and finds that CPU 1 is idle. 6. Task A therefore incorrectly concludes that it is executing in an extended quiescent state, possibly issuing a spurious splat. Therefore, this commit disables preemption within the rcu_is_cpu_idle() function. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Add failure tracing to rcutorturePaul E. McKenney2-0/+28
Trace the rcutorture RCU accesses and dump the trace buffer when the first failure is detected. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11trace: Allow ftrace_dump() to be called from modulesPaul E. McKenney1-0/+1
Add an EXPORT_SYMBOL_GPL() so that rcutorture can dump the trace buffer upon detection of an RCU error. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Track idleness independent of idle tasksPaul E. McKenney5-106/+278
Earlier versions of RCU used the scheduling-clock tick to detect idleness by checking for the idle task, but handled idleness differently for CONFIG_NO_HZ=y. But there are now a number of uses of RCU read-side critical sections in the idle task, for example, for tracing. A more fine-grained detection of idleness is therefore required. This commit presses the old dyntick-idle code into full-time service, so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is always invoked at the beginning of an idle loop iteration. Similarly, rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked at the end of an idle-loop iteration. This allows the idle task to use RCU everywhere except between consecutive rcu_idle_enter() and rcu_idle_exit() calls, in turn allowing architecture maintainers to specify exactly where in the idle loop that RCU may be used. Because some of the userspace upcall uses can result in what looks to RCU like half of an interrupt, it is not possible to expect that the irq_enter() and irq_exit() hooks will give exact counts. This patch therefore expands the ->dynticks_nesting counter to 64 bits and uses two separate bitfields to count process/idle transitions and interrupt entry/exit transitions. It is presumed that userspace upcalls do not happen in the idle loop or from usermode execution (though usermode might do a system call that results in an upcall). The counter is hard-reset on each process/idle transition, which avoids the interrupt entry/exit error from accumulating. Overflow is avoided by the 64-bitness of the ->dynticks_nesting counter. This commit also adds warnings if a non-idle task asks RCU to enter idle state (and these checks will need some adjustment before applying Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246)). In addition, validation of ->dynticks and ->dynticks_nesting is added. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
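The counter scheme described in this commit message can be illustrated with a toy user-space model. It is a sketch under stated assumptions, not the kernel's code: the `MODEL_TASK_LEVEL` value, the struct, and every function name below are hypothetical, chosen only to show how a hard reset on process/idle transitions discards a miscounted interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical split: high bits mark "running a task", low bits count
 * interrupt nesting. The real kernel constant may differ. */
#define MODEL_TASK_LEVEL ((int64_t)1 << 56)

struct dynticks_model {
    int64_t nesting;    /* zero means RCU regards this CPU as idle */
};

/* Process/idle transitions hard-reset the counter rather than adjust it,
 * so any interrupt entry/exit miscount cannot accumulate. */
void model_idle_enter(struct dynticks_model *d) { d->nesting = 0; }
void model_idle_exit(struct dynticks_model *d)  { d->nesting = MODEL_TASK_LEVEL; }

/* Interrupt transitions only touch the low-order part. */
void model_irq_enter(struct dynticks_model *d)  { d->nesting++; }
void model_irq_exit(struct dynticks_model *d)   { d->nesting--; }

bool model_rcu_sees_idle(const struct dynticks_model *d)
{
    return d->nesting == 0;
}
```

In this model, an interrupt entry that never gets its matching exit (the "half of an interrupt" case) leaves a stray count behind, but the next idle entry wipes it out, so the CPU is still seen as idle afterwards.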
2011-12-11rcu: Make synchronize_sched_expedited() better at work sharingPaul E. McKenney1-1/+1
When synchronize_sched_expedited() takes its second and subsequent snapshots of sync_sched_expedited_started, it subtracts 1. This means that even if the concurrent caller of synchronize_sched_expedited() that incremented to that value sees our successful completion, it will not be able to take advantage of it. This restriction is pointless, given that our full expedited grace period would have happened after the other guy started, and thus should be able to serve as a proxy for the other guy successfully executing try_stop_cpus(). This commit therefore removes the subtraction of 1. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-11rcu: Avoid RCU-preempt expedited grace-period botchPaul E. McKenney1-2/+5
Because rcu_read_unlock_special() samples rcu_preempted_readers_exp(rnp) after dropping rnp->lock, the following sequence of events is possible: 1. Task A exits its RCU read-side critical section, and removes itself from the ->blkd_tasks list, releases rnp->lock, and is then preempted. Task B remains on the ->blkd_tasks list, and blocks the current expedited grace period. 2. Task B exits from its RCU read-side critical section and removes itself from the ->blkd_tasks list. Because it is the last task blocking the current expedited grace period, it ends that expedited grace period. 3. Task A resumes, and samples rcu_preempted_readers_exp(rnp) which of course indicates that nothing is blocking the nonexistent expedited grace period. Task A is again preempted. 4. Some other CPU starts an expedited grace period. There are several tasks blocking this expedited grace period queued on the same rcu_node structure that Task A was using in step 1 above. 5. Task A examines its state and incorrectly concludes that it was the last task blocking the expedited grace period on the current rcu_node structure. It therefore reports completion up the rcu_node tree. 6. The expedited grace period can then incorrectly complete before the tasks blocked on this same rcu_node structure exit their RCU read-side critical sections. Arbitrarily bad things happen. This commit therefore takes a snapshot of rcu_preempted_readers_exp(rnp) prior to dropping the lock, so that only the last task thinks that it is the last task, thus avoiding the failure scenario laid out above. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
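The general pattern behind this fix (decide "was I the last one?" while the lock is still held) can be shown with a deterministic single-threaded model. Everything here is a hypothetical illustration: `struct node_model`, the fake lock, and the `interleave_fn` callback that stands in for other tasks running in the unlocked window are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an rcu_node: a fake lock plus a count of
 * tasks blocking the current expedited grace period. */
struct node_model {
    bool locked;
    int exp_blockers;
};

/* Callback simulating what other tasks do after we drop the lock. */
typedef void (*interleave_fn)(struct node_model *);

static void node_lock(struct node_model *n)   { assert(!n->locked); n->locked = true; }
static void node_unlock(struct node_model *n) { assert(n->locked);  n->locked = false; }

/* Another task leaves its critical section and stops blocking. */
void other_task_exits(struct node_model *n)
{
    node_lock(n);
    n->exp_blockers--;
    node_unlock(n);
}

/* Racy variant: samples the state only after the lock is dropped. */
bool exit_reader_late_sample(struct node_model *n, interleave_fn others)
{
    node_lock(n);
    n->exp_blockers--;
    node_unlock(n);
    if (others)
        others(n);                  /* window in which other tasks run */
    return n->exp_blockers == 0;    /* may reflect someone else's exit */
}

/* Fixed variant: snapshots the state before dropping the lock. */
bool exit_reader_snapshot(struct node_model *n, interleave_fn others)
{
    bool was_last;

    node_lock(n);
    n->exp_blockers--;
    was_last = (n->exp_blockers == 0);
    node_unlock(n);
    if (others)
        others(n);
    return was_last;
}
```

With two blockers and another task exiting in the window, the late-sampling variant wrongly reports that the caller was the last blocker, while the snapshot variant does not, so only the task that really was last reports completion.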
2011-12-11rcu: ->signaled better named ->fqs_statePaul E. McKenney3-11/+11
The ->signaled field was named before complications in the form of dyntick-idle mode and offlined CPUs. These complications have required that force_quiescent_state() be implemented as a state machine, instead of simply unconditionally sending reschedule IPIs. Therefore, this commit renames ->signaled to ->fqs_state to catch up with the new force_quiescent_state() reality. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-12-09Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds2-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Do no try to schedule task events if there are none lockdep, kmemcheck: Annotate ->lock in lockdep_init_map() perf header: Use event_name() to get an event name perf stat: Failure with "Operation not supported"
2011-12-09sys_getppid: add missing rcu_dereferenceMandeep Singh Baines1-1/+1
In order to safely dereference current->real_parent inside an rcu_read_lock, we need an rcu_dereference. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-09printk: avoid double lock acquirePeter Zijlstra1-1/+2
Commit 4f2a8d3cf5e ("printk: Fix console_sem vs logbuf_lock unlock race") introduced another silly bug where we would want to acquire an already held lock. Avoid this. Reported-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-08Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimers: Fix time comparison ptp: Fix clock_getres() implementation
2011-12-07perf: Do no try to schedule task events if there are noneGleb Natapov1-2/+2
perf_event_sched_in() shouldn't try to schedule task events if there are none otherwise task's ctx->is_active will be set and will not be cleared during sched_out. This will prevent newly added events from being scheduled into the task context. Fixes a boo-boo in commit 1d5f003f5a9 ("perf: Do not set task_ctx pointer in cpuctx if there are no events in the context"). Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111122140821.GF2557@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds4-5/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: ftrace: Fix hash record accounting bug perf: Fix parsing of __print_flags() in TP_printk() jump_label: jump_label_inc may return before the code is patched ftrace: Remove force undef config value left for testing tracing: Restore system filter behavior tracing: fix event_subsystem ref counting
2011-12-06lockdep, kmemcheck: Annotate ->lock in lockdep_init_map()Yong Zhang1-1/+7
Since commit f59de89 ("lockdep: Clear whole lockdep_map on initialization"), lockdep_init_map() will clear all the struct. But it will break lock_set_class()/lock_set_subclass(). A typical race condition is like below:

CPU A                                CPU B
lock_set_subclass(lockA);
 lock_set_class(lockA);
  lockdep_init_map(lockA);
   /* lockA->name is cleared */
   memset(lockA);
                                     __lock_acquire(lockA);
                                      /* lockA->class_cache[] is cleared */
                                      register_lock_class(lockA);
                                       look_up_lock_class(lockA);
                                        WARN_ON_ONCE(class->name != lock->name);
   lock->name = name;

So restore to what we have done before commit f59de89 but annotate ->lock with kmemcheck_mark_initialized() to suppress the kmemcheck warning reported in commit f59de89. Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Borislav Petkov <bp@alien8.de> Suggested-by: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111109080451.GB8124@zhy Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-06alarmtimers: Fix time comparisonThomas Gleixner1-1/+1
The expiry function compares the timer against current time and does not expire the timer when the expiry time is >= now. That's wrong. If the timer is set for now, then it must expire. Make the condition expiry > now for breaking out the loop. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <john.stultz@linaro.org> Cc: stable@kernel.org
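The boundary condition is easy to see in a small stand-alone sketch. This is an illustration only: `expire_due()` and its sorted-array input are hypothetical and much simpler than the real alarmtimer code, which walks a timer queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk expiry times sorted in ascending order and count the timers that
 * are due. The loop stops only when a timer lies strictly in the future
 * (expiry > now), so a timer set for exactly "now" still fires.
 */
size_t expire_due(const int64_t *expiries, size_t n, int64_t now)
{
    size_t fired = 0;

    for (size_t i = 0; i < n; i++) {
        if (expiries[i] > now)
            break;      /* first timer not yet due: stop here */
        fired++;
    }
    return fired;
}
```

With expiry times 10, 20 and 30 and a current time of 20, this fires the first two timers. Breaking out on `expiries[i] >= now` instead, as the message describes the old behavior, would leave the timer set for 20 unexpired.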
2011-12-05Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds4-6/+95
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix loss of notification with multi-event perf, x86: Force IBS LVT offset assignment for family 10h perf, x86: Disable PEBS on SandyBridge chips trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter perf session: Fix crash with invalid CPU list perf python: Fix undefined symbol problem perf/x86: Enable raw event access to Intel offcore events perf: Don't use -ENOSPC for out of PMU resources perf: Do not set task_ctx pointer in cpuctx if there are no events in the context perf/x86: Fix PEBS instruction unwind oprofile, x86: Fix crash when unloading module (nmi timer mode) oprofile: Fix crash when unloading module (hr timer mode)
2011-12-05Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds3-3/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Set noop handler in clockevents_exchange_device() tick-broadcast: Stop active broadcast device when replacing it clocksource: Fix bug with max_deferment margin calculation rtc: Fix some bugs that allowed accumulating time drift in suspend/resume rtc: Disable the alarm in the hardware
2011-12-05Merge branches 'core-urgent-for-linus' and 'irq-urgent-for-linus' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: slab, lockdep: Fix silly bug * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix race condition when stopping the irq thread
2011-12-05Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds4-34/+146
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched, x86: Avoid unnecessary overflow in sched_clock sched: Fix buglet in return_cfs_rq_runtime() sched: Avoid SMT siblings in select_idle_sibling() if possible sched: Set the command name of the idle tasks in SMP kernels sched, rt: Provide means of disabling cross-cpu bandwidth sharing sched: Document wait_for_completion_*() return values sched_fair: Fix a typo in the comment describing update_sd_lb_stats sched: Add a comment to effective_load() since it's a pain
2011-12-05ftrace: Fix hash record accounting bugSteven Rostedt1-1/+3
If the set_ftrace_filter is cleared by writing just whitespace to it, then the filter hash refcounts will be decremented but not updated. This causes two bugs: 1) No functions will be enabled for tracing when they all should be 2) If the user clears the set_ftrace_filter twice, it will crash ftrace: ------------[ cut here ]------------ WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7() Modules linked in: Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32 Call Trace: [<ffffffff81051828>] warn_slowpath_common+0x83/0x9b [<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c [<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7 [<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f [<ffffffff8111bdfe>] ? kfree+0xe5/0x115 [<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151 [<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f [<ffffffff8112e49a>] fput+0xfd/0x1c2 [<ffffffff8112b54c>] filp_close+0x6d/0x78 [<ffffffff8113a92d>] sys_dup3+0x197/0x1c1 [<ffffffff8113a9a6>] sys_dup2+0x4f/0x54 [<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b ---[ end trace 77a3a7ee73794a02 ]--- Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian Reported-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-05jump_label: jump_label_inc may return before the code is patchedGleb Natapov1-1/+2
If cpu A calls jump_label_inc() just after atomic_add_return() is called by cpu B, atomic_inc_not_zero() will return a value greater than zero and jump_label_inc() will return to its caller before jump_label_update() finishes its job on cpu B. Link: http://lkml.kernel.org/r/20111018175551.GH17571@redhat.com Cc: stable@vger.kernel.org Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
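One way to close a window like the one described above is to make the count visible as non-zero only after the update has finished. The following user-space sketch shows that ordering with C11 atomics and a pthread mutex. It is an illustration under assumptions, not the kernel patch: `struct toy_key`, `toy_inc_not_zero()`, `toy_key_inc()` and the `patched` flag (standing in for the code-patching step) are all hypothetical names.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical key; 'patched' stands in for the code-patching step. */
struct toy_key {
	atomic_int enabled;	/* number of users of the key */
	bool patched;		/* only written while toy_lock is held */
};

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;

/* Increment *v unless it is zero; returns true if it was incremented. */
static bool toy_inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

static void toy_key_inc(struct toy_key *key)
{
	/* Fast path: a non-zero count implies the update already finished. */
	if (toy_inc_not_zero(&key->enabled))
		return;

	pthread_mutex_lock(&toy_lock);
	if (atomic_load(&key->enabled) == 0)
		key->patched = true;		/* do the update first ... */
	atomic_fetch_add(&key->enabled, 1);	/* ... then publish the count */
	pthread_mutex_unlock(&toy_lock);
}
```

Because the increment from zero happens after the update and under the lock, a concurrent caller either takes the fast path once the update is complete or blocks on the lock until it is.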
2011-12-05ftrace: Remove force undef config value left for testingSteven Rostedt1-1/+0
A forced undef of a config value was used for testing and was accidentally left in during the final commit. This causes x86 to run slower than needed while running function tracing as well as causes the function graph selftest to fail when DYNAMIC_FTRACE is not set. This is because the code in MCOUNT expects the ftrace code to be processed with the config value set that happened to be forced not set. The forced config option was left in by: commit 6331c28c962561aee59e5a493b7556a4bb585957 ftrace: Fix dynamic selftest failure on some archs Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian Cc: stable@vger.kernel.org Reported-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-05tracing: Restore system filter behaviorLi Zefan1-1/+6
Though not all events have field 'prev_pid', it was allowed to do this: # echo 'prev_pid == 100' > events/sched/filter but commit 75b8e98263fdb0bfbdeba60d4db463259f1fe8a2 (tracing/filter: Swap entire filter of events) broke it without any reason. Link: http://lkml.kernel.org/r/4EAF46CF.8040408@cn.fujitsu.com Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-05tracing: fix event_subsystem ref countingIlya Dryomov1-1/+0
Fix a bug introduced by e9dbfae5, which prevents event_subsystem from ever being released. Ref_count was added to keep track of subsystem users, not for counting events. Subsystem is created with ref_count = 1, so there is no need to increment it for every event, we have nr_events for that. Fix this by touching ref_count only when we actually have a new user - subsystem_open(). Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Link: http://lkml.kernel.org/r/1320052062-7846-1-git-send-email-idryomov@gmail.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
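The separation between the two counters described above can be sketched in a few lines of user-space C. The names below (`struct toy_subsystem` and the `toy_*` helpers) are hypothetical and only mirror the idea in the message: adding an event touches `nr_events`, while only a new user touches `ref_count`.

```c
#include <stdbool.h>

/* Hypothetical model of a subsystem with separate user and event counts. */
struct toy_subsystem {
	int ref_count;	/* users holding the subsystem */
	int nr_events;	/* events registered in the subsystem */
};

/* The subsystem starts with one reference, held by its creator. */
static void toy_subsystem_init(struct toy_subsystem *s)
{
	s->ref_count = 1;
	s->nr_events = 0;
}

/* Adding an event is tracked in nr_events only. */
static void toy_add_event(struct toy_subsystem *s)
{
	s->nr_events++;
}

/* A new user takes a reference. */
static void toy_subsystem_open(struct toy_subsystem *s)
{
	s->ref_count++;
}

/* Drop a reference; returns true when the last one is gone. */
static bool toy_subsystem_put(struct toy_subsystem *s)
{
	return --s->ref_count == 0;
}
```

If every added event also took a reference, the final put would never see the count reach zero, which is the leak the message describes.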
2011-12-05Merge branch 'tip/perf/urgent' of ↵Ingo Molnar1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
2011-12-05perf: Fix loss of notification with multi-eventPeter Zijlstra3-2/+90
When you do: $ perf record -e cycles,cycles,cycles noploop 10 You expect about 10,000 samples for each event, i.e., 10s at 1000samples/sec. However, this is not what's happening. You get far fewer samples, maybe 3700 samples/event: $ perf report -D | tail -15 Aggregated stats: TOTAL events: 10998 MMAP events: 66 COMM events: 2 SAMPLE events: 10930 cycles stats: TOTAL events: 3644 SAMPLE events: 3644 cycles stats: TOTAL events: 3642 SAMPLE events: 3642 cycles stats: TOTAL events: 3644 SAMPLE events: 3644 On an Intel Nehalem or even AMD64, there are 4 counters capable of measuring cycles, so there is plenty of space to measure those events without multiplexing (even with the NMI watchdog active). And even with multiplexing, we'd expect roughly the same number of samples per event. The root of the problem was that when the event that caused the buffer to become full was not the first event passed on the cmdline, the user notification would get lost. The notification was sent to the file descriptor of the overflowed event but the perf tool was not polling on it. The perf tool aggregates all samples into a single buffer, i.e., the buffer of the first event. Consequently, it assumes notifications for any event will come via that descriptor. The seemingly straightforward solution of moving the waitq into the ringbuffer object doesn't work because of life-time issues. One could perf_event_set_output() on a fd that you're also blocking on and cause the old rb object to be freed while its waitq would still be referenced by the blocked thread -> FAIL. Therefore link all events to the ringbuffer and broadcast the wakeup from the ringbuffer object to all possible events that could be waited upon. This is rather ugly, and we're open to better solutions but it works for now.
Reported-by: Stephane Eranian <eranian@google.com> Finished-by: Stephane Eranian <eranian@google.com> Reviewed-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>
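The broadcast approach described in the message above can be pictured with a small user-space model. It is only a sketch under assumptions: `struct toy_event`, `struct toy_ring_buffer`, `toy_attach()` and `toy_wakeup_all()` are hypothetical names, a counter stands in for waking a wait queue, and locking and lifetime handling are left out entirely.

```c
#include <stddef.h>

/* Hypothetical event; 'wakeups' stands in for waking its wait queue. */
struct toy_event {
	int wakeups;
	struct toy_event *next;	/* link in the buffer's event list */
};

/* Hypothetical ring buffer that knows every event attached to it. */
struct toy_ring_buffer {
	struct toy_event *events;
};

/* Link an event to the buffer it writes into. */
static void toy_attach(struct toy_ring_buffer *rb, struct toy_event *ev)
{
	ev->next = rb->events;
	rb->events = ev;
}

/*
 * Broadcast a wakeup to every attached event, so that a waiter polling
 * on any of their descriptors is notified. Returns the number notified.
 */
static int toy_wakeup_all(struct toy_ring_buffer *rb)
{
	int notified = 0;

	for (struct toy_event *ev = rb->events; ev != NULL; ev = ev->next) {
		ev->wakeups++;
		notified++;
	}
	return notified;
}
```

In this model it no longer matters which event filled the buffer: a tool that polls only the first event's descriptor is still woken.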
2011-12-02clockevents: Set noop handler in clockevents_exchange_device()Thomas Gleixner1-0/+1
If a device is shut down, then there might be a pending interrupt, which will be processed after we reenable interrupts, which causes the original handler to be run. If the old handler is the (broadcast) periodic handler the shutdown state might hang the kernel completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
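The idea of neutralising a retired device's handler can be sketched as follows. This is a user-space illustration, not the clockevents code: `struct toy_clock_device`, `toy_retire_device()` and the other `toy_*` names are hypothetical, and a direct function call stands in for a late interrupt.

```c
#include <stddef.h>

/* Hypothetical device with a replaceable event handler. */
struct toy_clock_device {
	void (*event_handler)(struct toy_clock_device *dev);
	int periodic_runs;	/* how often the periodic handler ran */
};

/* Stand-in for a handler that must not run on a shut-down device. */
static void toy_periodic_handler(struct toy_clock_device *dev)
{
	dev->periodic_runs++;
}

/* Handler that deliberately does nothing. */
static void toy_noop_handler(struct toy_clock_device *dev)
{
	(void)dev;
}

/* Retire a device: point its handler at the no-op before it is dropped. */
static void toy_retire_device(struct toy_clock_device *old_dev)
{
	if (old_dev != NULL)
		old_dev->event_handler = toy_noop_handler;
}

/* Stand-in for an interrupt that arrives for 'dev'. */
static void toy_interrupt(struct toy_clock_device *dev)
{
	dev->event_handler(dev);
}
```

An interrupt that is still pending when the device is retired then lands in the no-op handler instead of the original one.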
2011-12-02tick-broadcast: Stop active broadcast device when replacing itThomas Gleixner1-1/+1
When a better rated broadcast device is installed, the currently active device is not disabled, which results in two running broadcast devices. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org