aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched.c
AgeCommit message (Collapse)AuthorFilesLines
2009-01-14[CVE-2009-0029] System call wrappers part 07Heiko Carstens1-2/+2
Signed-off-by: Heiko Carstens <[email protected]>
2009-01-14[CVE-2009-0029] System call wrappers part 06Heiko Carstens1-13/+13
Signed-off-by: Heiko Carstens <[email protected]>
2009-01-12Revert "sched: improve preempt debugging"Ingo Molnar1-1/+1
This reverts commit 7317d7b87edb41a9135e30be1ec3f7ef817c53dd. This has been reported (and bisected) by Alexey Zaytsev and Kamalesh Babulal to produce annoying warnings during bootup on both x86 and powerpc. kernel_locked() is not a valid test in IRQ context (we update the BKL's ->lock_depth and the preempt count separately and non-atomicalyy), so we cannot put it into the generic preempt debugging checks which can run in IRQ contexts too. Reported-and-bisected-by: Alexey Zaytsev <[email protected]> Reported-and-bisected-by: Kamalesh Babulal <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-11kernel/sched.c: add missing forward declaration for 'double_rq_lock'Steven Noonan1-0/+3
Impact: build fix on certain configs Added 'double_rq_lock' forward declaration, allowing double_rq_lock to be used in _double_lock_balance(). Signed-off-by: Steven Noonan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-11Merge branch 'sched/latest' of ↵Ingo Molnar1-11/+78
git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks into sched/rt
2009-01-11Merge commit 'v2.6.29-rc1' into perfcounters/coreIngo Molnar1-478/+632
Conflicts: include/linux/kernel_stat.h
2009-01-11cpumask: fix CONFIG_NUMA=y sched.cRusty Russell1-5/+5
Impact: fix panic on ia64 with NR_CPUS=1024 struct sched_domain is now a dangling structure; where we really want static ones, we need to use static_sched_domain. (As the FIXME in this file says, cpumask_var_t would be better, but this code is hairy enough without trying to add initialization code to the right places). Reported-by: Mike Travis <[email protected]> Signed-off-by: Rusty Russell <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-07sched: fix possible recursive rq->lockPeter Zijlstra1-0/+5
Vaidyanathan Srinivasan reported: > ============================================= > [ INFO: possible recursive locking detected ] > 2.6.28-autotest-tip-sv #1 > --------------------------------------------- > klogd/5062 is trying to acquire lock: > (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e > > but task is already holding lock: > (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31 With sched_mc at 2. (it is default-off) Strictly speaking we'll not deadlock, because ttwu will not be able to place the migration task on our rq, but since the code can deal with both rqs getting unlocked, this seems the easiest way out. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-06sched: fix section mismatchLi Zefan1-1/+1
init_rootdomain() calls alloc_bootmem_cpumask_var() at system boot, so does cpupri_init(). Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-06sched: fix double kfree in failure pathLi Zefan1-3/+2
It's not the responsibility of init_rootdomain() to free root_domain allocated by alloc_rootdomain(). Signed-off-by: Li Zefan <[email protected]> Reviewed-by: Pekka Enberg <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-05sched: clean up arch_reinit_sched_domains()Li Zefan1-6/+3
- Make arch_reinit_sched_domains() static. It was exported to be used in s390, but now rebuild_sched_domains() is used instead. - Make it return void. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-05sched: mark sched_create_sysfs_power_savings_entries() as __initLi Zefan1-1/+1
Impact: cleanup The only caller is cpu_dev_init() which is marked as __init. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-03Merge branch 'cpus4096-for-linus-3' of ↵Linus Torvalds1-38/+15
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (77 commits) x86: setup_per_cpu_areas() cleanup cpumask: fix compile error when CONFIG_NR_CPUS is not defined cpumask: use alloc_cpumask_var_node where appropriate cpumask: convert shared_cpu_map in acpi_processor* structs to cpumask_var_t x86: use cpumask_var_t in acpi/boot.c x86: cleanup some remaining usages of NR_CPUS where s/b nr_cpu_ids sched: put back some stack hog changes that were undone in kernel/sched.c x86: enable cpus display of kernel_max and offlined cpus ia64: cpumask fix for is_affinity_mask_valid() cpumask: convert RCU implementations, fix xtensa: define __fls mn10300: define __fls m32r: define __fls h8300: define __fls frv: define __fls cris: define __fls cpumask: CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS cpumask: zero extra bits in alloc_cpumask_var_node cpumask: replace for_each_cpu_mask_nr with for_each_cpu in kernel/time/ cpumask: convert mm/ ...
2009-01-03Merge branch 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6Linus Torvalds1-39/+76
* 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: [PATCH] fast vdso implementation for CLOCK_THREAD_CPUTIME_ID [PATCH] improve idle cputime accounting [PATCH] improve precision of idle time detection. [PATCH] improve precision of process accounting. [PATCH] idle cputime accounting [PATCH] fix scaled & unscaled cputime accounting
2009-01-03sched: put back some stack hog changes that were undone in kernel/sched.cMike Travis1-38/+15
Impact: prevents panic from stack overflow on numa-capable machines. Some of the "removal of stack hogs" changes in kernel/sched.c by using node_to_cpumask_ptr were undone by the early cpumask API updates, and causes a panic due to stack overflow. This patch undoes those changes by using cpumask_of_node() which returns a 'const struct cpumask *'. In addition, cpu_coregoup_map is replaced with cpu_coregroup_mask further reducing stack usage. (Both of these updates removed 9 FIXME's!) Also: Pick up some remaining changes from the old 'cpumask_t' functions to the new 'struct cpumask *' functions. Optimize memory traffic by allocating each percpu local_cpu_mask on the same node as the referring cpu. Signed-off-by: Mike Travis <[email protected]> Acked-by: Rusty Russell <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2009-01-03Merge branch 'master' of ↵Mike Travis1-28/+87
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask into merge-rr-cpumask Conflicts: arch/x86/kernel/io_apic.c kernel/rcuclassic.c kernel/sched.c kernel/time/tick-sched.c Signed-off-by: Mike Travis <[email protected]> [ [email protected]: backmerged typo fix for io_apic.c ] Signed-off-by: Ingo Molnar <[email protected]>
2009-01-02Merge branch 'cpus4096-for-linus-2' of ↵Linus Torvalds1-414/+556
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits) x86: export vector_used_by_percpu_irq x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and() sched: nominate preferred wakeup cpu, fix x86: fix lguest used_vectors breakage, -v2 x86: fix warning in arch/x86/kernel/io_apic.c sched: fix warning in kernel/sched.c sched: move test_sd_parent() to an SMP section of sched.h sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0 sched: activate active load balancing in new idle cpus sched: bias task wakeups to preferred semi-idle packages sched: nominate preferred wakeup cpu sched: favour lower logical cpu number for sched_mc balance sched: framework for sched_mc/smt_power_savings=N sched: convert BALANCE_FOR_xx_POWER to inline functions x86: use possible_cpus=NUM to extend the possible cpus allowed x86: fix cpu_mask_to_apicid_and to include cpu_online_mask x86: update io_apic.c to the new cpumask code x86: Introduce topology_core_cpumask()/topology_thread_cpumask() x86: xen: use smp_call_function_many() x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c ... Fixed up trivial conflict in kernel/time/tick-sched.c manually
2008-12-31[PATCH] idle cputime accountingMartin Schwidefsky1-17/+63
The cpu time spent by the idle process actually doing something is currently accounted as idle time. This is plain wrong, the architectures that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the time spent doing nothing and the time spent by idle doing work. The first is accounted with account_idle_time and the second with account_system_time. The architectures that use the account_xxx_time interface directly and not the account_xxx_ticks interface now need to do the check for the idle process in their arch code. In particular to improve the system vs true idle time accounting the arch code needs to measure the true idle time instead of just testing for the idle process. To improve the tick based accounting as well we would need an architecture primitive that can tell us if the pt_regs of the interrupted context points to the magic instruction that halts the cpu. In addition idle time is no more added to the stime of the idle process. This field now contains the system time of the idle process as it should be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as every tick that occurs while idle is running will be accounted as idle time. This patch contains the necessary common code changes to be able to distinguish idle system time and true idle time. The architectures with support for VIRT_CPU_ACCOUNTING need some changes to exploit this. Signed-off-by: Martin Schwidefsky <[email protected]>
2008-12-31[PATCH] fix scaled & unscaled cputime accountingMartin Schwidefsky1-25/+16
The utimescaled / stimescaled fields in the task structure and the global cpustat should be set on all architectures. On s390 the calls to account_user_time_scaled and account_system_time_scaled never have been added. In addition system time that is accounted as guest time to the user time of a process is accounted to the scaled system time instead of the scaled user time. To fix the bugs and to prevent future forgetfulness this patch merges account_system_time_scaled into account_system_time and account_user_time_scaled into account_user_time. Cc: Benjamin Herrenschmidt <[email protected]> Cc: Hidetoshi Seto <[email protected]> Cc: Tony Luck <[email protected]> Cc: Jeremy Fitzhardinge <[email protected]> Cc: Chris Wright <[email protected]> Cc: Michael Neuling <[email protected]> Acked-by: Paul Mackerras <[email protected]> Signed-off-by: Martin Schwidefsky <[email protected]>
2008-12-31Merge branch 'master' of ↵Rusty Russell1-4/+1
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: arch/x86/kernel/io_apic.c
2008-12-31Merge branch 'linus' into stackprotectorIngo Molnar1-239/+301
Conflicts: arch/x86/include/asm/pda.h kernel/fork.c
2008-12-30Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimers: fix warning in kernel/hrtimer.c x86: make sure we really have an hpet mapping before using it x86: enable HPET on Fujitsu u9200 linux/timex.h: cleanup for userspace posix-timers: simplify de_thread()->exit_itimers() path posix-timers: check ->it_signal instead of ->it_pid to validate the timer posix-timers: use "struct pid*" instead of "struct task_struct*" nohz: suppress needless timer reprogramming clocksource, acpi_pm.c: put acpi_pm_read_slow() under CONFIG_PCI nohz: no softirq pending warnings for offline cpus hrtimer: removing all ur callback modes, fix hrtimer: removing all ur callback modes, fix hotplug hrtimer: removing all ur callback modes x86: correct link to HPET timer specification rtc-cmos: export second NVRAM bank Fixed up conflicts in sound/drivers/pcsp/pcsp.c and sound/core/hrtimer.c manually.
2008-12-30Merge branch 'core-for-linus' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits) stacktrace: provide save_stack_trace_tsk() weak alias rcu: provide RCU options on non-preempt architectures too printk: fix discarding message when recursion_bug futex: clean up futex_(un)lock_pi fault handling "Tree RCU": scalable classic RCU implementation futex: rename field in futex_q to clarify single waiter semantics x86/swiotlb: add default swiotlb_arch_range_needs_mapping x86/swiotlb: add default phys<->bus conversion x86: unify pci iommu setup and allow swiotlb to compile for 32 bit x86: add swiotlb allocation functions swiotlb: consolidate swiotlb info message printing swiotlb: support bouncing of HighMem pages swiotlb: factor out copy to/from device swiotlb: add arch hook to force mapping swiotlb: allow architectures to override phys<->bus<->phys conversions swiotlb: add comment where we handle the overflow of a dma mask on 32 bit rcu: fix rcutorture behavior during reboot resources: skip sanity check of busy resources swiotlb: move some definitions to header swiotlb: allow architectures to override swiotlb pool allocation ... Fix up trivial conflicts in arch/x86/kernel/Makefile arch/x86/mm/init_32.c include/linux/hardirq.h as per Ingo's suggestions.
2008-12-30Merge branch 'master' of ↵Rusty Russell1-182/+225
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
2008-12-29sched: create "pushable_tasks" list to limit pushing to one attemptGregory Haskins1-0/+4
The RT scheduler employs a "push/pull" design to actively balance tasks within the system (on a per disjoint cpuset basis). When a task is awoken, it is immediately determined if there are any lower priority cpus which should be preempted. This is opposed to the way normal SCHED_OTHER tasks behave, which will wait for a periodic rebalancing operation to occur before spreading out load. When a particular RQ has more than 1 active RT task, it is said to be in an "overloaded" state. Once this occurs, the system enters the active balancing mode, where it will try to push the task away, or persuade a different cpu to pull it over. The system will stay in this state until the system falls back below the <= 1 queued RT task per RQ. However, the current implementation suffers from a limitation in the push logic. Once overloaded, all tasks (other than current) on the RQ are analyzed on every push operation, even if it was previously unpushable (due to affinity, etc). Whats more, the operation stops at the first task that is unpushable and will not look at items lower in the queue. This causes two problems: 1) We can have the same tasks analyzed over and over again during each push, which extends out the fast path in the scheduler for no gain. Consider a RQ that has dozens of tasks that are bound to a core. Each one of those tasks will be encountered and skipped for each push operation while they are queued. 2) There may be lower-priority tasks under the unpushable task that could have been successfully pushed, but will never be considered until either the unpushable task is cleared, or a pull operation succeeds. The net result is a potential latency source for mid priority tasks. This patch aims to rectify these two conditions by introducing a new priority sorted list: "pushable_tasks". A task is added to the list each time a task is activated or preempted. It is removed from the list any time it is deactivated, made current, or fails to push. This works because a task only needs to be attempted to push once. After an initial failure to push, the other cpus will eventually try to pull the task when the conditions are proper. This also solves the problem that we don't completely analyze all tasks due to encountering an unpushable tasks. Now every task will have a push attempted (when appropriate). This reduces latency both by shorting the critical section of the rq->lock for certain workloads, and by making sure the algorithm considers all eligible tasks in the system. [ rostedt: added a couple more BUG_ONs ] Signed-off-by: Gregory Haskins <[email protected]> Acked-by: Steven Rostedt <[email protected]>
2008-12-29sched: add sched_class->needs_post_schedule() memberGregory Haskins1-1/+7
We currently run class->post_schedule() outside of the rq->lock, which means that we need to test for the need to post_schedule outside of the lock to avoid a forced reacquistion. This is currently not a problem as we only look at rq->rt.overloaded. However, we want to enhance this going forward to look at more state to reduce the need to post_schedule to a bare minimum set. Therefore, we introduce a new member-func called needs_post_schedule() which tests for the post_schedule condtion without actually performing the work. Therefore it is safe to call this function before the rq->lock is released, because we are guaranteed not to drop the lock at an intermediate point (such as what post_schedule() may do). We will use this later in the series [ rostedt: removed paranoid BUG_ON ] Signed-off-by: Gregory Haskins <[email protected]>
2008-12-29sched: make double-lock-balance fairGregory Haskins1-7/+44
double_lock balance() currently favors logically lower cpus since they often do not have to release their own lock to acquire a second lock. The result is that logically higher cpus can get starved when there is a lot of pressure on the RQs. This can result in higher latencies on higher cpu-ids. This patch makes the algorithm more fair by forcing all paths to have to release both locks before acquiring them again. Since callsites to double_lock_balance already consider it a potential preemption/reschedule point, they have the proper logic to recheck for atomicity violations. Signed-off-by: Gregory Haskins <[email protected]>
2008-12-29sched: pull only one task during NEWIDLE balancing to limit critical sectionGregory Haskins1-1/+17
git-id c4acb2c0669c5c5c9b28e9d02a34b5c67edf7092 attempted to limit newidle critical section length by stopping after at least one task was moved. Further investigation has shown that there are other paths nested further inside the algorithm which still remain that allow long latencies to occur with newidle balancing. This patch applies the same technique inside balance_tasks() to limit the duration of this optional balancing operation. Signed-off-by: Gregory Haskins <[email protected]> CC: Nick Piggin <[email protected]>
2008-12-29sched: track the next-highest priority on each runqueueGregory Haskins1-2/+6
We will use this later in the series to reduce the amount of rq-lock contention during a pull operation Signed-off-by: Gregory Haskins <[email protected]>
2008-12-29Merge branch 'linus' into perfcounters/coreIngo Molnar1-182/+225
Conflicts: fs/exec.c include/linux/init_task.h Simple context conflicts.
2008-12-28Merge branch 'sched-core-for-linus' of ↵Linus Torvalds1-174/+193
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits) sched: fix warning in fs/proc/base.c schedstat: consolidate per-task cpu runtime stats sched: use RCU variant of list traversal in for_each_leaf_rt_rq() sched, cpuacct: export percpu cpuacct cgroup stats sched, cpuacct: refactoring cpuusage_read / cpuusage_write sched: optimize update_curr() sched: fix wakeup preemption clock sched: add missing arch_update_cpu_topology() call sched: let arch_update_cpu_topology indicate if topology changed sched: idle_balance() does not call load_balance_newidle() sched: fix sd_parent_degenerate on non-numa smp machine sched: add uid information to sched_debug for CONFIG_USER_SCHED sched: move double_unlock_balance() higher sched: update comment for move_task_off_dead_cpu sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares sched/rt: removed unneeded defintion sched: add hierarchical accounting to cpu accounting controller sched: include group statistics in /proc/sched_debug sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER sched: clean up SCHED_CPUMASK_ALLOC ...
2008-12-28Merge branch 'tracing-core-for-linus' of ↵Linus Torvalds1-3/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits) sched, trace: update trace_sched_wakeup() tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3 Revert "x86: disable X86_PTRACE_BTS" ring-buffer: prevent false positive warning ring-buffer: fix dangling commit race ftrace: enable format arguments checking x86, bts: memory accounting x86, bts: add fork and exit handling ftrace: introduce tracing_reset_online_cpus() helper tracing: fix warnings in kernel/trace/trace_sched_switch.c tracing: fix warning in kernel/trace/trace.c tracing/ring-buffer: remove unused ring_buffer size trace: fix task state printout ftrace: add not to regex on filtering functions trace: better use of stack_trace_enabled for boot up code trace: add a way to enable or disable the stack tracer x86: entry_64 - introduce FTRACE_ frame macro v2 tracing/ftrace: add the printk-msg-only option tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp() x86, bts: correctly report invalid bts records ... Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits being already partly merged by the SH merge.
2008-12-26cpumask: Replace cpu_coregroup_map with cpu_coregroup_maskRusty Russell1-3/+3
cpu_coregroup_map returned a cpumask_t: it's going away. (Note, the sched part of this patch won't apply meaningfully to the sched tree, but I'm posting it to show the goal). Signed-off-by: Rusty Russell <[email protected]> Signed-off-by: Mike Travis <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Ingo Molnar <[email protected]>
2008-12-25Merge branches 'timers/clocksource', 'timers/hpet', 'timers/hrtimers', ↵Ingo Molnar1-2/+0
'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/rtc' into timers/core
2008-12-25Merge commit 'v2.6.28' into core/coreIngo Molnar1-2/+5
2008-12-25sched, trace: update trace_sched_wakeup()Peter Zijlstra1-1/+1
Impact: extend the wakeup tracepoint with the info whether the wakeup was real Add the information needed to distinguish 'real' wakeups from 'false' wakeups. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-25Merge branch 'next' into for-linusJames Morris1-5/+21
2008-12-23sched: nominate preferred wakeup cpu, fixVaidyanathan Srinivasan1-1/+1
Andrew Morton reported: > kernel/sched.c: In function 'schedule': > kernel/sched.c:3679: warning: 'active_balance' may be used uninitialized in this function > > This warning is correct - the code is buggy. In sched.c load_balance_newidle, there's real potential use of uninitialised variable - fix it. Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-23perfcounters: fix task clock counterIngo Molnar1-3/+46
Impact: fix per task clock counter precision Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19Merge branches 'tracing/ftrace', 'tracing/ring-buffer' and 'tracing/urgent' ↵Ingo Molnar1-0/+2
into tracing/core Conflicts: include/linux/ftrace.h
2008-12-19sched: fix warning in kernel/sched.cIngo Molnar1-1/+1
Impact: fix cpumask conversion bug this warning: kernel/sched.c: In function ‘find_busiest_group’: kernel/sched.c:3429: warning: passing argument 1 of ‘__first_cpu’ from incompatible pointer type shows that we forgot to convert a new patch to the new cpumask APIs. Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19sched: activate active load balancing in new idle cpusVaidyanathan Srinivasan1-0/+54
Impact: tweak task balancing to save power more agressively Active load balancing is a process by which migration thread is woken up on the target CPU in order to pull current running task on another package into this newly idle package. This method is already in use with normal load_balance(), this patch introduces this method to new idle cpus when sched_mc is set to POWERSAVINGS_BALANCE_WAKEUP. This logic provides effective consolidation of short running daemon jobs in a almost idle system The side effect of this patch may be ping-ponging of tasks if the system is moderately utilised. May need to adjust the iterations before triggering. Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19sched: nominate preferred wakeup cpuVaidyanathan Srinivasan1-0/+12
Impact: extend load-balancing code (no change in behavior yet) When the system utilisation is low and more cpus are idle, then the process waking up from sleep should prefer to wakeup an idle cpu from semi-idle cpu package (multi core package) rather than a completely idle cpu package which would waste power. Use the sched_mc balance logic in find_busiest_group() to nominate a preferred wakeup cpu. This info can be stored in appropriate sched_domain, but updating this info in all copies of sched_domain is not practical. Hence this information is stored in root_domain struct which is one copy per partitioned sched domain. The root_domain can be accessed from each cpu's runqueue and there is one copy per partitioned sched domain. Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19sched: favour lower logical cpu number for sched_mc balanceVaidyanathan Srinivasan1-2/+2
Impact: change load-balancing direction to match that of irqbalanced Just in case two groups have identical load, prefer to move load to lower logical cpu number rather than the present logic of moving to higher logical number. find_busiest_group() tries to look for a group_leader that has spare capacity to take more tasks and freeup an appropriate least loaded group. Just in case there is a tie and the load is equal, then the group with higher logical number is favoured. This conflicts with user space irqbalance daemon that will move interrupts to lower logical number if the system utilisation is very low. Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19sched: framework for sched_mc/smt_power_savings=NGautham R Shenoy1-3/+14
Impact: extend range of /sys/devices/system/cpu/sched_mc_power_savings Currently the sched_mc/smt_power_savings variable is a boolean, which either enables or disables topology based power savings. This patch extends the behaviour of the variable from boolean to multivalued, such that based on the value, we decide how aggressively do we want to perform powersavings balance at appropriate sched domain based on topology. Variable levels of power saving tunable would benefit end user to match the required level of power savings vs performance trade-off depending on the system configuration and workloads. This version makes the sched_mc_power_savings global variable to take more values (0,1,2). Later versions can have a single tunable called sched_power_savings instead of sched_{mc,smt}_power_savings. Signed-off-by: Gautham R Shenoy <[email protected]> Signed-off-by: Vaidyanathan Srinivasan <[email protected]> Acked-by: Balbir Singh <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-19tracing: fix warnings in kernel/trace/trace_sched_switch.cIngo Molnar1-1/+1
these warnings: kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_register’: kernel/trace/trace_sched_switch.c:96: warning: passing argument 1 of ‘register_trace_sched_wakeup_new’ from incompatible pointer type kernel/trace/trace_sched_switch.c:112: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_unregister’: kernel/trace/trace_sched_switch.c:121: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type Trigger because sched_wakeup_new tracepoints need the same trace signature as sched_wakeup - which was changed recently. Fix it. Signed-off-by: Ingo Molnar <[email protected]>
2008-12-18schedstat: consolidate per-task cpu runtime statsKen Chen1-0/+2
Impact: simplify code When we turn on CONFIG_SCHEDSTATS, per-task cpu runtime is accumulated twice. Once in task->se.sum_exec_runtime and once in sched_info.cpu_time. These two stats are exactly the same. Given that task->se.sum_exec_runtime is always accumulated by the core scheduler, sched_info can reuse that data instead of duplicate the accounting. Signed-off-by: Ken Chen <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-16sched, cpuacct: export percpu cpuacct cgroup statsKen Chen1-0/+20
This patch export per-cpu CPU cycle usage for a given cpuacct cgroup. There is a need for a user space monitor daemon to track group CPU usage on per-cpu base. It is also useful for monitoring CFS load balancer behavior by tracking per CPU group usage. Signed-off-by: Ken Chen <[email protected]> Reviewed-by: Li Zefan <[email protected]> Reviewed-by: Andrew Morton <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-16sched, cpuacct: refactoring cpuusage_read / cpuusage_writeKen Chen1-17/+39
Impact: micro-optimize the code on 64-bit architectures In the thread regarding to 'export percpu cpuacct cgroup stats' http://lkml.org/lkml/2008/12/7/13 akpm pointed out that current cpuacct code is inefficient. This patch refactoring the following: * make cpu_rq locking only on 32-bit * change iterator to each_present_cpu instead of each_possible_cpu to make it hotplug friendly. It's a bit of code churn, but I was rewarded with 160 byte code size saving on x86-64 arch and zero code size change on i386. Signed-off-by: Ken Chen <[email protected]> Cc: Paul Menage <[email protected]> Cc: Li Zefan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2008-12-16sched: fix wakeup preemption clockMike Galbraith1-1/+1
Impact: sharpen the wakeup-granularity to always be against current scheduler time It was possible to do the preemption check against an old time stamp. Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>