path: root/kernel
Age | Commit message | Author | Files | Lines
2014-05-14 | rcutorture: Check for rcu_torture_fqs creation errors | Paul E. McKenney | 1 | -1/+2
The return value from torture_create_kthread() is currently ignored when creating the rcu_torture_fqs kthread. This commit therefore captures the return value so that it can be tested for errors. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
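The resulting pattern looks roughly like the following sketch (the firsterr/unwind names are assumed from the surrounding rcutorture code, not quoted from the patch):

    firsterr = torture_create_kthread(rcu_torture_fqs, NULL, fqs_task);
    if (firsterr)
            goto unwind;  /* tear down any kthreads created so far */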
2014-05-14 | torture: Notice if an all-zero cpumask is passed inside a critical section | Iulia Manda | 1 | -6/+1
In the torture_shuffle_tasks() function, the check for an all-zero mask being passed to set_cpus_allowed_ptr() is redundant once the shuffle_idle_cpu bit has been cleared. If the mask had more than one bit set, it still has at least one bit set after clearing one. If the mask had only one bit set, the function returns early via the check at the beginning, as there is no point in shuffling a single CPU. Also, this code executes inside a critical section delimited by get_online_cpus() and put_online_cpus(), preventing CPUs from going offline between the num_online_cpus() check and the calls to set_cpus_allowed_ptr(). Signed-off-by: Iulia Manda <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2014-05-14 | rcutorture: Make rcu_torture_reader() use cond_resched() | Paul E. McKenney | 1 | -4/+7
The rcu_torture_reader() function currently uses schedule(). This commit therefore speeds things up a bit by substituting cond_resched(). This change makes rcu_torture_reader() more CPU-bound, so this commit also adjusts the number of readers (the "nreaders" module parameter, which feeds into the "nrealreaders" variable) to allow one CPU to be free of readers on SMP systems. The point of this is to increase the probability that readers will be watching while an updater makes a change. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2014-05-14 | sched,rcu: Make cond_resched() report RCU quiescent states | Paul E. McKenney | 2 | -1/+24
Given a CPU running a loop containing cond_resched(), with no other tasks runnable on that CPU, RCU will eventually report RCU CPU stall warnings due to lack of quiescent states. Fortunately, every call to cond_resched() is a perfectly good quiescent state. Unfortunately, invoking rcu_note_context_switch() is a bit heavyweight for cond_resched(), especially given the need to disable preemption, and, for RCU-preempt, interrupts as well. This commit therefore maintains a per-CPU counter that causes cond_resched(), cond_resched_lock(), and cond_resched_softirq() to call rcu_note_context_switch(), but only about once per 256 invocations. This ratio was chosen in keeping with the relative time constants of RCU grace periods. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
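A minimal userspace sketch of the counter scheme (the constant name and the reset style are assumptions; the kernel keeps the counter in a per-CPU variable rather than a plain static):

    #include <stdio.h>

    #define RCU_COND_RESCHED_LIM 256        /* assumed name for the 1-in-256 ratio */

    static unsigned int cond_resched_count; /* a per-CPU variable in the kernel */

    /* Stand-in for the heavyweight rcu_note_context_switch(). */
    static void report_quiescent_state(void)
    {
            printf("quiescent state noted\n");
    }

    static void cond_resched_sketch(void)
    {
            /* Common case: just bump a cheap counter. */
            if (++cond_resched_count >= RCU_COND_RESCHED_LIM) {
                    cond_resched_count = 0;
                    report_quiescent_state(); /* rare, so the cost amortizes */
            }
    }

    int main(void)
    {
            for (int i = 0; i < 1024; i++)
                    cond_resched_sketch(); /* reports four times */
            return 0;
    }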
2014-05-14 | rcutorture: Export RCU grace-period kthread wait state to rcutorture | Paul E. McKenney | 3 | -1/+25
This commit allows rcutorture to print additional state for the RCU grace-period kthreads in cases where RCU seems reluctant to start a new grace period. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2014-05-14 | torture: Dump ftrace buffer when the RCU grace period stalls | Paul E. McKenney | 1 | -0/+1
This commit adds a call to rcutorture_trace_dump() to dump the ftrace buffer when the RCU grace period stalls in order to help debug the stall. Note that this is different than the RCU CPU stall warning, as it is rcutorture detecting the stall rather than the underlying RCU implementation. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2014-05-14 | torture: Increase stutter-end intensity | Paul E. McKenney | 1 | -2/+10
Currently, all stuttered kthreads block a jiffy at a time, which can result in them starting at different times. (Note: This is not an energy-efficiency problem unless you run torture tests in production, in which case you have other problems!) This commit increases the intensity of the restart event by causing kthreads to spin through the last jiffy, restarting when they see the variable change. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2014-05-14 | torture: Include "Stopping" string to torture_kthread_stopping() | Paul E. McKenney | 1 | -2/+4
Currently, torture_kthread_stopping() prints only the name of the kthread that is stopping, which can be unedifying. This commit therefore adds "Stopping" to make things more evident. Signed-off-by: Paul E. McKenney <[email protected]>
2014-05-14 | rcutorture: Print negatives for SRCU counter wraparound | Paul E. McKenney | 1 | -3/+5
The srcu_torture_stats() function prints SRCU's per-CPU c[] array with an unsigned format, which means that the number one less than zero is a very large number. This commit therefore prints this array with a signed format in order to improve readability of the rcutorture output. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
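The effect is easy to see in a two-line userspace sketch:

    #include <stdio.h>

    int main(void)
    {
            unsigned long c = 0UL - 1; /* a counter that wrapped one below zero */

            printf("%lu\n", c);        /* 18446744073709551615 on 64-bit: opaque */
            printf("%ld\n", (long)c);  /* -1: the readable, signed rendering */
            return 0;
    }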
2014-05-14 | rcutorture: Mark function as static in kernel/rcu/torture.c | Rashika Kheria | 1 | -2/+2
Mark functions as static in kernel/rcu/torture.c because they are not used outside this file. This eliminates the following warnings in kernel/rcu/torture.c:
kernel/rcu/torture.c:902:6: warning: no previous prototype for ‘rcutorture_trace_dump’ [-Wmissing-prototypes]
kernel/rcu/torture.c:1572:6: warning: no previous prototype for ‘rcu_torture_barrier_cbf’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <[email protected]> Reviewed-by: Josh Triplett <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2014-05-13 | torture: Intensify locking test | Paul E. McKenney | 1 | -1/+2
The current lock_torture_writer() spends too much time sleeping and not enough time hammering locks; for example, an eight-CPU test will often utilize only a CPU or two. This commit therefore makes lock_torture_writer() sleep less and hammer more. Signed-off-by: Paul E. McKenney <[email protected]>
2014-05-13 | rcutorture: Add forward-progress checking for writer | Paul E. McKenney | 2 | -0/+70
The rcutorture output currently does not distinguish between stalls in the RCU implementation and stalls in the rcu_torture_writer() kthreads. This commit therefore adds some diagnostics to help distinguish between these two conditions, at least for the non-SRCU implementations. (SRCU does not provide evidence of update-side forward progress by design.) Signed-off-by: Paul E. McKenney <[email protected]>
2014-05-13 | cgroup: fix rcu_read_lock() leak in update_if_frozen() | Tejun Heo | 1 | -1/+3
While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer: replace freezer->lock with freezer_mutex") introduced a bug in update_if_frozen() where it returns with rcu_read_lock() held. Fix it by adding rcu_read_unlock() before returning. Signed-off-by: Tejun Heo <[email protected]> Reported-by: kbuild test robot <[email protected]>
2014-05-13 | cgroup_freezer: replace freezer->lock with freezer_mutex | Tejun Heo | 1 | -66/+46
After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem"), css task iterators require a sleepable context, as they may block on css_set_rwsem. I missed that cgroup_freezer was iterating tasks under the IRQ-safe spinlock freezer->lock. This leads to errors like the following on freezer state reads and transitions.

BUG: sleeping function called from invalid context at /work/os/work/kernel/locking/rwsem.c:20
in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
5 locks held by bash/462:
#0: (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
#1: (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
#2: (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
#3: (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
#4: (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
Preemption disabled at: [<ffffffff81104404>] console_unlock+0x1e4/0x460
CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
Call Trace:
[<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
[<ffffffff810cf4f2>] __might_sleep+0x162/0x260
[<ffffffff81d05974>] down_read+0x24/0x60
[<ffffffff81133e87>] css_task_iter_start+0x27/0x70
[<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
[<ffffffff81135a16>] freezer_write+0xf6/0x1e0
[<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
[<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
[<ffffffff811f0756>] vfs_write+0xb6/0x1c0
[<ffffffff811f121d>] SyS_write+0x4d/0xc0
[<ffffffff81d08292>] system_call_fastpath+0x16/0x1b

freezer->lock used to be used in hot paths but that time is long gone, and there's no reason for the lock to be an IRQ-safe spinlock or even per-cgroup. In fact, given that a cgroup may contain a large number of tasks, it's not a good idea to iterate over them while holding an IRQ-safe spinlock. Let's simplify locking by replacing the per-cgroup freezer->lock with the global freezer_mutex. This also simplifies the comments explaining the intricacies of policy inheritance and the locking around it, as the states are now protected by a common mutex.

The conversion is mostly straightforward. The following points are worth mentioning:
* freezer_css_online() no longer needs double locking.
* freezer_attach() now performs propagation simply while holding freezer_mutex. The update_if_frozen() race no longer exists and the comment is removed.
* freezer_fork() now tests whether the task is in the root cgroup using the new task_css_is_root() without doing rcu_read_lock/unlock(). If not, it grabs freezer_mutex and performs the operation.
* freezer_read() and freezer_change_state() grab freezer_mutex across the whole operation and pin the css while iterating, so that each descendant is processed in sleepable context.

Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem") Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2014-05-13 | cgroup: introduce task_css_is_root() | Tejun Heo | 1 | -1/+1
Determining the css of a task usually requires RCU read lock as that's the only thing which keeps the returned css accessible till its reference is acquired; however, testing whether a task belongs to the root can be performed without dereferencing the returned css by comparing the returned pointer against the root one in init_css_set[] which never changes. Implement task_css_is_root() which can be invoked in any context. This will be used by the scheduled cgroup_freezer change. v2: cgroup no longer supports modular controllers. No need to export init_css_set. Pointed out by Li. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
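A compilable userspace analogue of the idea (these are not the kernel structures; the point is the lock-free pointer comparison):

    #include <stdbool.h>

    struct css { int data; };

    /* The root css never changes or goes away, so comparing a task's css
     * pointer against it needs no RCU protection: the possibly-stale
     * pointer is compared, never dereferenced. */
    static struct css root_css;

    static bool css_is_root(const struct css *css)
    {
            return css == &root_css; /* pure pointer compare, any context */
    }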
2014-05-13 | Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq | Linus Torvalds | 1 | -6/+30
Pull workqueue fixes from Tejun Heo: "Fixes for two bugs in workqueue. One is exiting with internal mutex held in a failure path of wq_update_unbound_numa(). The other is a subtle and unlikely use-after-possible-last-put in the rescuer logic. Both have been around for quite some time now and are unlikely to have triggered noticeably often. All patches are marked for -stable backport"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix a possible race condition between rescuer and pwq-release
  workqueue: make rescuer_thread() empty wq->maydays list before exiting
  workqueue: fix bugs in wq_update_unbound_numa() failure path
2014-05-13 | Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds | 1 | -4/+4
Pull cgroup fixes from Tejun Heo: "During recent restructuring, device_cgroup unified config input check and enforcement logic; unfortunately, it turned out to share too much. Aristeu's patches fix the breakage and are marked for -stable backport. The other two patches are fallouts from the kernfs conversion. The blkcg change is temporary and will go away once kernfs internal locking gets simplified (patches pending)"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
  device_cgroup: check if exception removal is allowed
  device_cgroup: fix the comment format for recently added functions
  device_cgroup: rework device access check and exception checking
  cgroup: fix the retry path of cgroup_mount()
2014-05-12 | ntp: Make is_error_status() use its argument | George Spelvin | 1 | -6/+6
is_error_status() is an inline function always called with the global time_status as an argument, so there's zero functional difference with this change, but the non-CONFIG_NTP_PPS version uses the passed-in argument, while the CONFIG_NTP_PPS one ignores its argument and uses the global. Looks like is_error_status was refactored out, but someone forgot to change the logic to check the local argument value. Thus this patch makes it use the argument always; shorter variable names are good. Signed-off-by: George Spelvin <[email protected]> [jstultz: Tweaked commit message] Signed-off-by: John Stultz <[email protected]>
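A heavily simplified sketch of the bug (the real predicate in kernel/time/ntp.c tests more STA_* bits; this version compiles against the userspace <sys/timex.h> definitions):

    #include <sys/timex.h>

    static int time_status; /* global NTP status word */

    /* non-CONFIG_NTP_PPS variant: uses the parameter, as intended */
    static inline int is_error_status(int status)
    {
            return status & (STA_UNSYNC | STA_CLOCKERR);
    }

    /* CONFIG_NTP_PPS variant before the fix: ignored its parameter and
     * read the global instead -- harmless only because every caller
     * happened to pass time_status anyway */
    static inline int is_error_status_buggy(int status)
    {
            return time_status & (STA_UNSYNC | STA_CLOCKERR);
    }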
2014-05-12 | ntp: Convert simple_strtol to kstrtol | Fabian Frederick | 1 | -1/+4
Replace the obsolete function simple_strtol() with kstrtol(). Inspired-By: Andrew Morton <[email protected]> Cc: John Stultz <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Fabian Frederick <[email protected]> [jstultz: Tweak commit message] Signed-off-by: John Stultz <[email protected]>
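The conversion pattern looks like this (variable names are illustrative, not the actual ntp.c hunk; the gain is error reporting):

    /* before: silently accepts "123abc" and overflows undetected */
    long val = simple_strtol(str, NULL, 0);

    /* after: kstrtol() rejects trailing garbage and overflow */
    int err = kstrtol(str, 0, &val);
    if (err)
            return err; /* -EINVAL or -ERANGE */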
2014-05-12 | hrtimer: Set expiry time before switch_hrtimer_base() | Viresh Kumar | 1 | -4/+4
switch_hrtimer_base() calls hrtimer_check_target() which ensures that we do not migrate a timer to a remote cpu if the timer expires before the current programmed expiry time on that remote cpu. But __hrtimer_start_range_ns() calls switch_hrtimer_base() before the new expiry time is set. So the sanity check in hrtimer_check_target() is operating on stale or even uninitialized data. Update expiry time before calling switch_hrtimer_base(). [ tglx: Rewrote changelog once again ] Signed-off-by: Viresh Kumar <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/81999e148745fc51bbcd0615823fbab9b2e87e23.1399882253.git.viresh.kumar@linaro.org Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
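The ordering change, paraphrased from __hrtimer_start_range_ns() (surrounding timekeeping details elided; treat this as a sketch, not the exact hunk):

    /* before (buggy): hrtimer_check_target() inside switch_hrtimer_base()
     * compared against a stale or uninitialized expiry */
    new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);
    hrtimer_set_expires_range_ns(timer, tim, delta_ns);

    /* after (fixed): store the expiry first, then pick the target base */
    hrtimer_set_expires_range_ns(timer, tim, delta_ns);
    new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);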
2014-05-09 | PM / hibernate: convert simple_strtoul to kstrtoul | Fabian Frederick | 1 | -1/+4
Replace obsolete function. Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2014-05-09 | Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 6 | -12/+12
Pull x86 fixes from Peter Anvin: "A somewhat unpleasantly large collection of small fixes. The big ones are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'. It was using __init functions with predictably suboptimal results. Another key fix is a build fix which would produce output that simply would not decompress correctly in some configuration, due to the existing Makefiles picking up an unfortunate local label and mistaking it for the global symbol _end. Additional fixes include the handling of 64-bit numbers when setting the vdso data page (a latent bug which became manifest when i386 started exporting a vdso with time functions), a fix to the new MSR manipulation accessors which would cause features to not get properly unblocked, a build fix for 32-bit userland, and a few new platform quirks"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall()
  x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro
  x86: Fix typo preventing msr_set/clear_bit from having an effect
  x86/intel: Add quirk to disable HPET for the Baytrail platform
  x86/hpet: Make boot_hpet_disable extern
  x86-64, build: Fix stack protector Makefile breakage with 32-bit userland
  x86/reboot: Add reboot quirk for Certec BPC600
  asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
  asmlinkage, x86: Add explicit __visible to arch/x86/*
  asmlinkage: Revert "lto: Make asmlinkage __visible"
  x86, build: Don't get confused by local symbols
  x86/efi: earlyprintk=efi,keep fix
2014-05-08 | Merge tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace | Linus Torvalds | 1 | -2/+2
Pull tracing fixes from Steven Rostedt: "This contains two fixes. The first is a long standing bug that causes bogus data to show up in the refcnt field of the module_refcnt tracepoint. It was introduced by a merge conflict resolution back in 2.6.35-rc days. The result should be 'refcnt = incs - decs', but instead it did 'refcnt = incs + decs'. The second fix is to a bug that was introduced in this merge window that allowed for a tracepoint funcs pointer to be used after it was freed. Moving the location of where the probes are released solved the problem"
* tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Fix use of tracepoint funcs after rcu free
  trace: module: Maintain a valid user count
2014-05-08 | tracepoint: Fix use of tracepoint funcs after rcu free | Mathieu Desnoyers | 1 | -2/+2
Commit de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints" introduces a use after free by calling release_probes on the old struct tracepoint array before the newly allocated array is published with rcu_assign_pointer. There is a race window where tracepoints (RCU readers) can perform a "use-after-grace-period-after-free", which shows up as a GPF in stress-tests. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/p/[email protected] Reported-by: Sasha Levin <[email protected]> CC: Oleg Nesterov <[email protected]> CC: Dave Jones <[email protected]> Fixes: de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints" Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-05-08 | sched/idle: Make cpuidle_idle_call() void | Rafael J. Wysocki | 1 | -5/+2
The only value ever returned by cpuidle_idle_call() is 0 and its only caller ignores that value anyway, so make it void. Signed-off-by: Rafael J. Wysocki <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-08 | sched/idle: Reflow cpuidle_idle_call() | Peter Zijlstra | 1 | -73/+58
Apply goto to reduce lines and nesting levels. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Nicolas Pitre <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-08 | sched/idle: Delay clearing the polling bit | Peter Zijlstra | 1 | -7/+10
With the generic idle functions assuming !polling we should only clear the polling bit at the very last opportunity in order to avoid spurious IPIs. Ideally we'd flip the default to polling, but that means auditing all arch idle functions. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Nicolas Pitre <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-08 | sched/idle: Avoid spurious wakeup IPIs | Peter Zijlstra | 1 | -5/+36
Because mwait_idle_with_hints() gets called from !idle context it must call current_clr_polling(). This however means that resched_task() is very likely to send an IPI even when we were polling:

CPU0                                    CPU1

if (current_set_polling_and_test())
  goto out;

__monitor(&ti->flags);
if (!need_resched())
  __mwait(eax, ecx);
                                        set_tsk_need_resched(p);
                                        smp_mb();
out:
current_clr_polling();
                                        if (!tsk_is_polling(p))
                                          smp_send_reschedule(cpu);

So while it is correct (extra IPIs aren't a problem, whereas a missed IPI would be) it is a performance problem (for some). Avoid this issue by using fetch_or() to atomically set NEED_RESCHED and test if POLLING_NRFLAG is set. Since a CPU stuck in mwait is unlikely to modify the flags word, contention on the cmpxchg is unlikely and thus we should mostly succeed in a single go. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Nicolas Pitre <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
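A userspace analogue of the fetch_or() trick (the flag values here are made up; the kernel operates on thread_info flags):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NEED_RESCHED   0x1u
    #define POLLING_NRFLAG 0x2u

    /* Atomically set NEED_RESCHED and learn, in the same operation, whether
     * the target was polling -- so the caller can skip the IPI when it was. */
    static bool set_nr_and_not_polling(atomic_uint *flags)
    {
            unsigned int old = atomic_fetch_or(flags, NEED_RESCHED);
            return !(old & POLLING_NRFLAG); /* true => an IPI is required */
    }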
2014-05-07 | perf: Simplify perf_event_exit_task_context() | Peter Zijlstra | 1 | -16/+1
Instead of jumping through hoops to make sure to find (and exit) each event, do it the simple straightforward way. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | perf: Rework free paths | Peter Zijlstra | 1 | -26/+40
Primarily make perf_event_release_kernel() into put_event(), this will allow kernel space to create per-task inherited events, and is safer in general. Also, document the free_event() assumptions. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | perf: Validate locking assumption | Peter Zijlstra | 1 | -0/+2
Document and validate the locking assumption of event_sched_in(). Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | perf: Always destroy groups on exit | Peter Zijlstra | 1 | -1/+1
Commit 38b435b16c36 ("perf: Fix tear-down of inherited group events") states that we need to destroy groups for inherited events, but it doesn't make any sense to not also destroy groups for normal events. And while it usually makes no difference (the normal events won't leak, and it's very likely all the group events will die in quick succession), it does make the code more consistent and closes a potential hole for trouble. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | perf: Ensure consistent inherit state in groups | Peter Zijlstra | 1 | -3/+10
Make sure all events in a group have the same inherit state. It was possible for group leaders to have inherit set while sibling events would not have inherit set. In this case we'd still inherit the siblings, leading to some non-fatal weirdness. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Vince Weaver <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | Merge branch 'perf/urgent' into perf/core, to avoid conflicts | Ingo Molnar | 11 | -62/+82
Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/fair: Stop searching for tasks in newidle balance if there are runnable tasks | Jason Low | 1 | -2/+6
It was found that when running some workloads (such as AIM7) on large systems with many cores, CPUs do not remain idle for long. Thus, tasks can wake/get enqueued while doing idle balancing. In this patch, while traversing the domains in idle balance, in addition to checking for pulled_task, we add an extra check for this_rq->nr_running for determining if we should stop searching for tasks to pull. If there are runnable tasks on this rq, then we will stop traversing the domains. This reduces the chance that idle balance delays a task from running. This patch resulted in approximately a 6% performance improvement when running a Java Server workload on an 8 socket machine. Signed-off-by: Jason Low <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
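A sketch of the added exit condition (idle_balance() loop, heavily simplified):

    for_each_domain(this_cpu, sd) {
            /* ... attempt load_balance(NEWLY_IDLE) in this domain ... */

            /* New: stop not only when we pulled a task, but also when a
             * task got enqueued on this runqueue while we were searching. */
            if (pulled_task || this_rq->nr_running > 0)
                    break;
    }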
2014-05-07 | sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain | Vincent Guittot | 1 | -3/+7
A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can or cannot reach different power states. As an example, the flag should be cleared at the CPU level if groups of cores can be power gated independently. This information can be used in load-balance decisions or to add a load-balancing level between groups of CPUs that can power gate independently. This flag is part of the topology flags that can be set by an arch. Reviewed-by: Dietmar Eggemann <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched, powerpc: Create a dedicated topology table | Vincent Guittot | 1 | -6/+0
Create a dedicated topology table for handling the asymmetric features of powerpc. Signed-off-by: Vincent Guittot <[email protected]> Reviewed-by: Preeti U Murthy <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andy Fleming <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Grant Likely <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Preeti U. Murthy <[email protected]> Cc: Rob Herring <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasant Hegde <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched, s390: Create a dedicated topology table | Vincent Guittot | 1 | -3/+0
BOOK level is only relevant for s390 so we create a dedicated topology table with BOOK level and remove it from default table. Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Philipp Hachtmann <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched: Rework sched_domain topology definition | Vincent Guittot | 1 | -113/+120
We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additional levels (if needed). We still have a default topology table definition that can be used by platforms that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which either wants to add new levels where a load balance makes sense, like BOOK or powergating levels, or wants to change the flags configuration of some levels. For each level, we need a function pointer that returns the cpumask for each cpu, a function pointer that returns the flags for the level, and a name. Only flags that describe topology can be set by an architecture. The current topology flags are:
SD_SHARE_CPUPOWER
SD_SHARE_PKG_RESOURCES
SD_NUMA
SD_ASYM_PACKING
Then, each level must be a subset of the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than their children. Signed-off-by: Vincent Guittot <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Reviewed-by: Preeti U Murthy <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Jason Low <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/numa: Do not set preferred_node on migration to a second choice node | Rik van Riel | 1 | -1/+10
Setting the numa_preferred_node for a task in task_numa_migrate does nothing on a 2-node system. Either we migrate to the node that already was our preferred node, or we stay where we were. On a 4-node system, it can slightly decrease overhead, by not calling the NUMA code as much. Since every node tends to be directly connected to every other node, running on the wrong node for a while does not do much damage. However, on an 8 node system, there are far more bad nodes than there are good ones, and pretending that a second choice is actually the preferred node can greatly delay, or even prevent, a workload from converging. The only time we can safely pretend that a second choice node is the preferred node is when the task is part of a workload that spans multiple NUMA nodes. Signed-off-by: Rik van Riel <[email protected]> Tested-by: Vinod Chegu <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/numa: Retry placement more frequently when misplaced | Rik van Riel | 1 | -1/+4
When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration is retried, when the task's numa_scan_period is small. Signed-off-by: Rik van Riel <[email protected]> Tested-by: Vinod Chegu <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/numa: Count pages on active node as local | Rik van Riel | 1 | -1/+13
The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU that faulted on the memory is counted as non-local, which causes the scan rate to go up. Counting the memory on any node where the task's numa group is actively running as local, allows the scan rate to slow down once the application is settled in. This should reduce the overhead of the automatic NUMA placement code, when a workload spans multiple NUMA nodes. Signed-off-by: Rik van Riel <[email protected]> Tested-by: Vinod Chegu <[email protected]> Acked-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | Merge branch 'sched/urgent' into sched/core, to avoid conflicts | Ingo Molnar | 77 | -3585/+3318
Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/numa: Initialize newidle balance stats in sd_numa_init() | Jason Low | 1 | -0/+2
Also initialize the per-sd variables for newidle load balancing in sd_numa_init(). Signed-off-by: Jason Low <[email protected]> Acked-by: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance() | Jason Low | 1 | -8/+8
The following commit: e5fc66119ec9 ("sched: Fix race in idle_balance()") can potentially cause rq->max_idle_balance_cost to not be updated, even when load_balance(NEWLY_IDLE) is attempted and the per-sd max cost value is updated. Preeti noticed a similar issue with updating rq->next_balance. In this patch, we fix this by making sure we still check/update those values even if a task gets enqueued while browsing the domains. Signed-off-by: Jason Low <[email protected]> Reviewed-by: Preeti U Murthy <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched: Skip double execution of pick_next_task_fair() | Peter Zijlstra | 1 | -2/+8
Tim wrote: "The current code will call pick_next_task_fair a second time in the slow path if we did not pull any task in our first try. This is really unnecessary as we already know no task can be pulled, and it doubles the delay for the cpu to enter idle. We instrumented some network workloads and saw that pick_next_task_fair is frequently called twice before a cpu enters idle. The call to pick_next_task_fair can add non-trivial latency as it calls load_balance, which runs find_busiest_group on a hierarchy of sched domains spanning the cpus for a large system. For some 4-socket systems, we saw almost 0.25 msec spent per call of pick_next_task_fair before a cpu could be idled." Optimize the second call away for the common case and document the dependency. Reported-by: Tim Chen <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Len Brown <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check | Steven Rostedt (Red Hat) | 1 | -2/+1
The check at the beginning of cpupri_find() makes sure that the task_pri variable does not exceed the cp->pri_to_cpu array length. But that length is CPUPRI_NR_PRIORITIES, not MAX_RT_PRIO, so the check misses the last two priorities in that array. As task_pri is computed by convert_prio(), which should never return a value as large as CPUPRI_NR_PRIORITIES, the check should cause a panic if it is ever hit. Reported-by: Mike Galbraith <[email protected]> Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
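One plausible shape of the corrected check (simplified; since overshooting the array can only mean a convert_prio() bug, it is treated as fatal):

    /* was: if (task_pri >= MAX_RT_PRIO) return 0;  -- wrong bound */
    BUG_ON(task_pri >= CPUPRI_NR_PRIORITIES);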
2014-05-07 | sched/deadline: Fix memory leak | Li Zefan | 1 | -3/+1
Free cpudl->free_cpus allocated in cpudl_init(). Signed-off-by: Li Zefan <[email protected]> Acked-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> # 3.14+ Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched/deadline: Fix sched_yield() behavior | Juri Lelli | 2 | -2/+4
yield_task_dl() is broken:
o it forces current to be throttled, setting its runtime to zero;
o it sets current's dl_se->dl_new to one, expecting that dl_task_timer() will queue it back with proper parameters at replenish time.
Unfortunately, dl_task_timer() has this check at the very beginning:
if (!dl_task(p) || dl_se->dl_new) goto unlock;
So, it just bails out and the task is never replenished. It actually yielded forever. To fix this, introduce a new flag indicating that the task properly yielded the CPU before its current runtime expired. While this is a bit of overkill at the moment, the flag would be useful in the future to discriminate between "good" jobs (whose remaining runtime could be reclaimed, i.e. recycled) and "bad" jobs (for which the dl_throttled flag has been set) that need to be stopped. Reported-by: yjay.kim <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-07 | sched: Sanitize irq accounting madness | Thomas Gleixner | 1 | -16/+16
Russell reported that irqtime_account_idle_ticks() takes ages due to:
for (i = 0; i < ticks; i++)
    irqtime_account_process_tick(current, 0, rq);
It's sad that this code was written way _AFTER_ the NOHZ idle functionality was available. I charge myself guilty for not paying attention when that crap got merged with commit abb74cefa ("sched: Export ns irqtimes through /proc/stat"). So instead of looping nr_ticks times, just apply the whole thing at once. As a side note: the whole cputime_t vs. u64 business in that context wants to be cleaned up as well. There is no point in having all these back-and-forth conversions. Let's standardise on u64 nsec for all kernel-internal accounting and be done with it. Everything else does not make sense at all for fine-grained accounting. Frederic, can you please take care of that? Reported-by: Russell King <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Venkatesh Pallipadi <[email protected]> Cc: Shaun Ruffell <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
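The shape of the fix, simplified (the helper gains a tick count and scales the charged cputime once instead of looping):

    /* before: O(ticks), pathological after a long NOHZ idle stretch */
    for (i = 0; i < ticks; i++)
            irqtime_account_process_tick(current, 0, rq);

    /* after: O(1), one call that accounts all ticks at once */
    irqtime_account_process_tick(current, 0, rq, ticks);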