Age    Commit message    Author    Files    Lines
2014-06-05sched, trace: Add a tracepoint for IPI-less remote wakeupsAndy Lutomirski2-0/+24
Remote wakeups of polling CPUs are a valuable performance improvement; add a tracepoint to make it much easier to verify that they're working. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: David Ahern <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
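For readers unfamiliar with how such a tracepoint is wired up, the sketch below shows the general shape of a one-field scheduler tracepoint and its call site. The event and variable names are illustrative assumptions, not necessarily the exact ones this commit adds.

    /* Event definition, in the include/trace/events/sched.h style;
     * names are assumptions for illustration. */
    TRACE_EVENT(sched_wake_idle_without_ipi,

            TP_PROTO(int cpu),

            TP_ARGS(cpu),

            TP_STRUCT__entry(
                    __field(int, cpu)
            ),

            TP_fast_assign(
                    __entry->cpu = cpu;
            ),

            TP_printk("cpu=%d", __entry->cpu)
    );

    /* Call site: fired instead of sending a reschedule IPI when the
     * target CPU is already polling and will notice the wakeup itself. */
    trace_sched_wake_idle_without_ipi(cpu);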
2014-06-05cpuidle: Set polling in poll_idleAndy Lutomirski1-2/+5
poll_idle is the archetypal polling idle loop; tell the core idle code about it. This avoids pointless IPIs when all of the other cpuidle states are disabled. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Daniel Lezcano <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/c65ce49615d338bae8fb79df5daffab19353c900.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
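The resulting loop looks roughly like the sketch below (simplified; details of the real cpuidle driver may differ): the polling flag is set before spinning on need_resched(), so a remote waker can skip the IPI, and is cleared again on exit.

    static int poll_idle(struct cpuidle_device *dev,
                         struct cpuidle_driver *drv, int index)
    {
            local_irq_enable();
            if (!current_set_polling_and_test()) {
                    while (!need_resched())
                            cpu_relax();
            }
            current_clr_polling();

            return index;
    }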
2014-06-05sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)Giedrius Rekasius1-2/+1
Variable "rt_rq" is used only in block "for_each_sched_rt_entity" so the value assigned to it at the beginning of the update_curr_rt(...) gets overwritten without ever being read. Remove redundant assignment and move variable declaration to the block in which it is being used. Signed-off-by: Giedrius Rekasius <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Rename capacity related flagsNicolas Pitre5-18/+18
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Let's rename the following feature flags since they do relate to capacity: SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY ARCH_POWER -> ARCH_CAPACITY NONTASK_POWER -> NONTASK_CAPACITY Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Andy Fleming <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Grant Likely <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rob Herring <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasant Hegde <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Final power vs. capacity cleanupsNicolas Pitre4-62/+63
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This contains the architecture visible changes. Incidentally, only ARM takes advantage of the available pow^H^H^Hcapacity scaling hooks and therefore those changes outside kernel/sched/ are confined to one ARM specific file. The default arch_scale_smt_power() hook is not overridden by anyone. Replacements are as follows: arch_scale_freq_power --> arch_scale_freq_capacity arch_scale_smt_power --> arch_scale_smt_capacity SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT The local usage of "power" in arch/arm/kernel/topology.c is also changed to "capacity" as appropriate. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Grant Likely <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Brown <[email protected]> Cc: Rob Herring <[email protected]> Cc: Russell King <[email protected]> Cc: Sudeep KarkadaNagesha <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Remove remaining dubious usage of "power"Nicolas Pitre3-55/+55
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This is the remaining "power" -> "capacity" rename for local symbols. Those symbols visible to the rest of the kernel are not included yet. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Let 'struct sched_group_power' care about CPU capacityNicolas Pitre4-115/+115
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Disambiguate existing/remaining "capacity" usageNicolas Pitre1-21/+21
We have "power" (which should actually become "capacity") and "capacity" which is a scaled down "capacity factor" in terms of unitary tasks. Let's use "capacity_factor" to make room for proper usage of "capacity" later. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Change "has_capacity" to "has_free_capacity"Nicolas Pitre1-13/+13
The capacity of a CPU/group should be some intrinsic value that doesn't change with task placement. It is like a container whose capacity is stable regardless of the amount of liquid in it (its "utilization")... unless the container itself is crushed that is, but that's another story. Therefore let's rename "has_capacity" to "has_free_capacity" in order to better convey the intended meaning. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Remove "power" from 'struct numa_stats'Nicolas Pitre1-6/+7
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. To make things explicit and not create more confusion with the existing "capacity" member, let's rename things as follows: power -> compute_capacity capacity -> task_capacity Note: none of those fields are actually used outside update_numa_stats(). Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Fix signedness bug in yield_to()Dan Carpenter4-5/+5
yield_to() is supposed to return -ESRCH if there is no task to yield to, but because the type is bool that is the same as returning true. The only place I see which cares is kvm_vcpu_on_spin(). Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Raghavendra <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda Signed-off-by: Ingo Molnar <[email protected]>
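The underlying C issue is easy to demonstrate in isolation: any nonzero value converted to bool becomes true, so an error code returned through a bool is indistinguishable from success. A standalone illustration (not kernel code; function names are made up):

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    static bool buggy(void) { return -ESRCH; }  /* collapses to true */
    static int  fixed(void) { return -ESRCH; }  /* error code survives */

    int main(void)
    {
            printf("%d %d\n", buggy(), fixed());  /* prints "1 -3" */
            return 0;
    }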
2014-06-05sched/fair: Use time_after() in record_wakee()Manuel Schölling1-1/+1
To be future-proof and for better readability the time comparisons are modified to use time_after() instead of plain, error-prone math. Signed-off-by: Manuel Schölling <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
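A sketch of the kind of change involved (field names per kernel/sched/fair.c, treated here as an assumption): time_after() compares jiffies values with wraparound-safe signed arithmetic, unlike a plain '>'.

    /* Before: breaks when jiffies wraps around. */
    if (jiffies > current->wakee_flip_decay_ts + HZ) {
            ...
    }

    /* After: wraparound-safe comparison from <linux/jiffies.h>. */
    if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
            ...
    }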
2014-06-05sched/balancing: Reduce the rate of needless idle load balancingTim Chen1-6/+11
The current no_hz idle load balancer does load balancing for *all* idle cpus, even though the time due for load balancing on a particular idle cpu could still be a while in the future. This introduces a much higher load balancing rate than necessary. The patch changes the behavior to do idle load balancing on behalf of an idle cpu only when it is due for load balancing. On SGI's systems with over 3000 cores, the cpu responsible for idle balancing got overwhelmed with idle balancing, introducing a lot of OS noise to workloads. This patch fixes the issue. Signed-off-by: Tim Chen <[email protected]> Acked-by: Russ Anderson <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Jason Low <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Len Brown <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Hedi Berriche <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Peter Hurley <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Fix unlocked reads of some cfs_b->quota/periodBen Segall1-19/+21
sched_cfs_period_timer() reads cfs_b->period without locks before calling do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs() would read cfs_b->period without the right lock. Thus a simultaneous change of bandwidth could cause corruption on any platform where ktime_t or u64 writes/reads are not atomic. Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just use 1 rather than the exact quota, much like distribute_cfs_runtime() does. There is also an unlocked read of cfs_b->runtime_expires, but a race there would only delay runtime expiry by a tick. Still, the comparison should just be != anyway, which clarifies even that problem. Signed-off-by: Ben Segall <[email protected]> Tested-by: Roman Gushchin <[email protected]> [peterz: Fix compile warn] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched/numa: Decay ->wakee_flips instead of zeroingRik van Riel1-1/+1
Affine wakeups have the potential to interfere with NUMA placement. If a task wakes up too many other tasks, affine wakeups will get disabled. However, regardless of how many other tasks it wakes up, it gets re-enabled once a second, potentially interfering with NUMA placement of other tasks. By decaying wakee_wakes in half instead of zeroing it, we can avoid that problem for some workloads. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
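The change itself is essentially a one-liner in the wakee tracking code; roughly, with field names assumed from context and the surrounding code abbreviated:

    -       current->wakee_flips = 0;
    +       current->wakee_flips >>= 1;     /* decay by half instead of zeroing */
            current->wakee_flip_decay_ts = jiffies;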
2014-05-22sched/numa: Update migrate_improves/degrades_locality()Rik van Riel1-13/+29
Update the migrate_improves/degrades_locality() functions with knowledge of pseudo-interleaving. Do not consider moving tasks around within the set of group's active nodes as improving or degrading locality. Instead, leave the load balancer free to balance the load between a numa_group's active nodes. Also, switch from the group/task_weight functions to the group/task_fault functions. The "weight" functions involve a division, but both calls use the same divisor, so there's no point in doing that from these functions. On a 4 node (x10 core) system, performance of SPECjbb2005 seems unaffected, though the number of migrations with 2 8-warehouse wide instances seems to have almost halved, due to the scheduler running each instance on a single node. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched/numa: Allow task switch if load imbalance improvesRik van Riel1-10/+36
Currently the NUMA balancing code only allows moving tasks between NUMA nodes when the load on both nodes is in balance. This breaks down when the load was imbalanced to begin with. Allow tasks to be moved between NUMA nodes if the imbalance is small, or if the new imbalance is smaller than the original one. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2014-05-22sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream codexiaofeng.yan2-3/+3
Signed-off-by: xiaofeng.yan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()Dongsheng Yang4-5/+21
Signed-off-by: Dongsheng Yang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/a568a1e3cc8e78648f41b5035fa5e381d36274da.1399532322.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Initialize rq->age_stamp on processor startCorey Minyard1-0/+11
If the sched_clock time starts at a large value, the kernel will spin in sched_avg_update for a long time while rq->age_stamp catches up with rq->clock. The comment in kernel/sched/clock.c says that there is no strict promise that it starts at zero. So initialize rq->age_stamp when a cpu starts up to avoid this. I was seeing long delays on a simulator that didn't start the clock at zero. This might also be an issue on reboots on processors that don't re-initialize the timer to zero on reset, and when using kexec. Signed-off-by: Corey Minyard <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
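Conceptually the fix is a small helper run as a CPU comes up; the sketch below shows the idea (the helper name and call site are assumptions, not necessarily the exact upstream code):

    /* Run on a CPU as it starts, so rq->age_stamp begins at the current
     * per-cpu sched_clock value instead of zero. */
    static void set_cpu_rq_start_time(void)
    {
            int cpu = smp_processor_id();
            struct rq *rq = cpu_rq(cpu);

            rq->age_stamp = sched_clock_cpu(cpu);
    }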
2014-05-22sched, nohz: Change rq->nr_running to always use wrappersKirill Tkhai5-15/+17
Sometimes ->nr_running may cross 2 but no interrupt is sent to the rq's cpu. In this case we don't re-enable the timer. This may be the reason for rare unexpected effects when nohz is enabled. This patch replaces all direct modifications of nr_running with the wrappers and makes add_nr_running() responsible for handling the crossing of that boundary. Signed-off-by: Kirill Tkhai <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20140508225830.2469.97461.stgit@localhost Signed-off-by: Ingo Molnar <[email protected]>
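A sketch of what such a wrapper looks like (simplified; the real helper's nohz handling differs): every increment funnels through one place that can notice the 1 -> 2 transition.

    static inline void add_nr_running(struct rq *rq, unsigned count)
    {
            unsigned prev_nr = rq->nr_running;

            rq->nr_running = prev_nr + count;

            if (prev_nr < 2 && rq->nr_running >= 2) {
                    /* crossing the border: e.g. re-enable the tick on a
                     * nohz_full CPU that now has more than one runnable task */
            }
    }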
2014-05-22sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()Jason Low1-23/+46
Currently, in idle_balance(), we update rq->next_balance when we pull tasks. However, it is also important to update it in the !pulled_task case. When the CPU is "busy" (the CPU isn't idle), rq->next_balance gets computed using sd->busy_factor (so we increase the balance interval when the CPU is busy). However, when the CPU goes idle, rq->next_balance could still be set to a large value that was computed with the sd->busy_factor. Thus, we also need to update rq->next_balance in idle_balance() when !pulled_task, so that rq->next_balance gets updated without taking the busy_factor into account when the CPU is about to go idle. This patch makes rq->next_balance get updated independently of whether or not we pulled a task. Also, we add logic to ensure that we always traverse at least one of the sched domains to get a proper next_balance value for updating rq->next_balance. Additionally, since load_balance() modifies sd->balance_interval, we need to re-obtain the sched domain's interval after the call to load_balance() in rebalance_domains() before we update rq->next_balance. This patch adds and uses 2 new helper functions, update_next_balance() and get_sd_balance_interval(), to update next_balance and obtain the sched domain's balance_interval. Signed-off-by: Jason Low <[email protected]> Reviewed-by: Preeti U Murthy <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/1399596562.2200.7.camel@j-VirtualBox Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Use clamp() and clamp_val() to make sys_nice() more readableDongsheng Yang1-9/+2
Suggested-by: Kees Cook <[email protected]> Signed-off-by: Dongsheng Yang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
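Roughly, the open-coded min/max checks in sys_nice() collapse into clamp()/clamp_val() calls; a sketch of the shape of the change (the exact bounds and macro names used upstream are an assumption here):

    /* Before: four hand-written range checks. */
    if (increment < -40)
            increment = -40;
    if (increment > 40)
            increment = 40;
    nice = task_nice(current) + increment;
    if (nice < MIN_NICE)
            nice = MIN_NICE;
    if (nice > MAX_NICE)
            nice = MAX_NICE;

    /* After: the same logic, stated directly. */
    increment = clamp(increment, -NICE_WIDTH, NICE_WIDTH);
    nice = clamp_val(task_nice(current) + increment, MIN_NICE, MAX_NICE);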
2014-05-22sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()Dietmar Eggemann1-2/+0
There is no need to zero struct sched_group member cpumask and struct sched_group_power member power since both structures are already allocated as zeroed memory in __sdt_alloc(). This patch has been tested with BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power); in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including CPU hotplug scenarios. Signed-off-by: Dietmar Eggemann <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched/numa: Fix initialization of sched_domain_topology for NUMAVincent Guittot1-1/+1
Jet Chen has reported a kernel panic when booting qemu-system-x86_64 with a kvm64 cpu. The panic occurred while building the sched_domain. In sched_init_numa, we create a new topology table in which both default levels and numa levels are copied. The last row of the table must have a null pointer in the mask field. The current implementation doesn't add this last row in the computation of the table size. So we add one row to the allocation size that will be used as the last row of the table. The kzalloc will ensure that the mask field is NULL. Reported-by: Jet Chen <[email protected]> Tested-by: Jet Chen <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
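The fix is essentially one extra row in the kzalloc() size so that a zeroed sentinel row terminates the table; roughly, with variable names assumed from context:

    -       tl = kzalloc((i + level) *
    +       tl = kzalloc((i + level + 1) *
                    sizeof(struct sched_domain_topology_level), GFP_KERNEL);
            if (!tl)
                    return;
            /* kzalloc() leaves the extra row zeroed, so its ->mask is NULL
             * and acts as the end-of-table marker. */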
2014-05-22sched: Call select_idle_sibling() when not affine_sdRik van Riel1-3/+3
On smaller systems, the top level sched domain will be an affine domain, and select_idle_sibling is invoked for every SD_WAKE_AFFINE wakeup. This seems to be working well. On larger systems, with the node distance between far away NUMA nodes being > RECLAIM_DISTANCE, select_idle_sibling is only called if the waker and the wakee are on nodes less than RECLAIM_DISTANCE apart. This patch leaves in place the policy of not pulling the task across nodes on such systems, while fixing the issue that select_idle_sibling is not called at all in certain circumstances. The code will look for an idle CPU in the same CPU package as the CPU where the task ran previously. Signed-off-by: Rik van Riel <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Mel Gorman <[email protected]> Cc: Mike Galbraith <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Simplify return logic in sched_read_attr()Michael Kerrisk1-7/+2
Gotos are chained pointlessly here, and the 'out' label can be dispensed with. Signed-off-by: Michael Kerrisk <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Simplify return logic in sched_copy_attr()Michael Kerrisk1-4/+2
The logic in this function is a little contorted, clean it up: * Rather than having chained gotos for the -EFBIG case, just return -EFBIG directly. * Now, the label 'out' is no longer needed, and 'ret' must be zero by the time we fall through to this point, so just return 0. Signed-off-by: Michael Kerrisk <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Fix exec_start/task_hot on migrated tasksBen Segall1-0/+3
task_hot checks exec_start on any runnable task, but if it has been migrated since it last ran, then exec_start is a clock_task from another cpu. If the old cpu's clock_task was sufficiently far ahead of this cpu's, then the task will not be considered for another migration until it has run. Instead, reset exec_start whenever a task is migrated, since it is presumably no longer hot anyway. Signed-off-by: Ben Segall <[email protected]> [ Made it compile. ] Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/20140515225920.7179.13924.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]>
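The fix boils down to clearing the stamp in the fair-class migration path; a sketch (the exact function and its existing body, elided here, are assumptions):

    static void migrate_task_rq_fair(struct task_struct *p, int next_cpu)
    {
            ...
            /* We have migrated, no longer consider this task hot. */
            p->se.exec_start = 0;
    }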
2014-05-22Merge branch 'sched/urgent' into sched/core to avoid conflicts with upcoming changesIngo Molnar6-31/+78
Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22Merge branch 'pm-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm into sched/coreIngo Molnar5-35/+59
Pull scheduling related CPU idle updates from Rafael J. Wysocki. Conflicts: kernel/sched/idle.c Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixesIngo Molnar918-5767/+10052
Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22arm64: Remove TIF_POLLING_NRFLAGPeter Zijlstra1-2/+0
The only idle method for arm64 is WFI and it therefore unconditionally requires the reschedule interrupt when idle. Suggested-by: Catalin Marinas <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22metag: Remove TIF_POLLING_NRFLAGJames Hogan1-4/+2
The Meta idle function jumps into the interrupt handler which efficiently blocks waiting for the next interrupt when it reads the interrupt status register (TXSTATI). No other (polling) idle functions can be used, therefore TIF_POLLING_NRFLAG is unnecessary, so lets remove it. Peter Zijlstra said: > Most archs have (x86) hlt or (arm) wfi like idle instructions, and if > that is your only possible idle function, you'll require the interrupt > to wake up and there's really no point to having the POLLING bit. Signed-off-by: James Hogan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Fix hotplug vs. set_cpus_allowed_ptr()Lai Jiangshan2-3/+4
Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched/cpupri: Replace NR_CPUS arraysPeter Zijlstra2-1/+8
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
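The pattern, sketched below with field and function names modeled on kernel/sched/cpupri.* (treated as assumptions): replace a statically sized NR_CPUS array with one sized by nr_cpu_ids at init time, so large CONFIG_NR_CPUS values no longer force big contiguous allocations.

    struct cpupri {
            ...
    -       int cpu_to_pri[NR_CPUS];        /* sized at compile time */
    +       int *cpu_to_pri;                /* sized by nr_cpu_ids   */
    };

    int cpupri_init(struct cpupri *cp)
    {
            ...
            cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
            if (!cp->cpu_to_pri)
                    return -ENOMEM;
            ...
    }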
2014-05-22sched/deadline: Replace NR_CPUS arraysPeter Zijlstra2-12/+27
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Juri Lelli <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched/deadline: Restrict user params max value to 2^63 nsJuri Lelli1-7/+30
Michael Kerrisk noticed that creating SCHED_DEADLINE reservations with certain parameters (e.g., a runtime of something near 2^64 ns) can cause a system freeze for some amount of time. The problem is that in the interface we have u64 sched_runtime; while internally we need to have a signed runtime (to cope with budget overruns) s64 runtime; At the time we set up a new dl_entity we copy the first value into the second. The cast produces negative values when sched_runtime is too big, and this causes the scheduler to go crazy right from the start. Moreover, considering how we deal with deadline wraparound, (s64)(a - b) < 0, we also have to restrict acceptable values for sched_{deadline,period}. This patch fixes this by checking that user parameters are always below 2^63 ns (still large enough for everyone). It also rewrites other conditions that we check, since in __checkparam_dl we don't have to deal with deadline wraparounds and what we have now erroneously fails when the difference between values is too big. Reported-by: Michael Kerrisk <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> Cc: Dario Faggioli <[email protected]> Cc: Dave Jones <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
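The essence of the check, as a sketch (an illustrative helper, not the literal __checkparam_dl() code): a u64 user value with the top bit set would become negative when stored in the kernel's s64 fields, so such values are rejected up front.

    static inline bool dl_param_fits_s64(u64 val)
    {
            return !(val & (1ULL << 63));
    }

    /* e.g. reject attr->sched_runtime, sched_deadline and sched_period
     * unless each fits in a signed 64-bit value. */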
2014-05-22sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINEPeter Zijlstra1-6/+3
The way we read POSIX, one should only call sched_getparam() when sched_getscheduler() returns either SCHED_FIFO or SCHED_RR. Given that we currently return sched_param::sched_priority=0 for all others, extend the same behaviour to SCHED_DEADLINE. Requested-by: Michael Kerrisk <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Cc: Dario Faggioli <[email protected]> Cc: linux-man <[email protected]> Cc: "Michael Kerrisk (man-pages)" <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Disallow sched_attr::sched_policy < 0Peter Zijlstra1-0/+3
The scheduler uses policy=-1 to preserve the current policy state to implement sys_sched_setparam(), this got exposed to userspace by accident through sys_sched_setattr(), cure this. Reported-by: Michael Kerrisk <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Michael Kerrisk <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-05-22sched: Make sched_setattr() correctly return -EFBIGMichael Kerrisk1-2/+3
The documented[1] behavior of sched_attr() in the proposed man page text is: sched_attr::size must be set to the size of the structure, as in sizeof(struct sched_attr). If the provided structure is smaller than the kernel structure, any additional fields are assumed '0'. If the provided structure is larger than the kernel structure, the kernel verifies all additional fields are '0'; if not, the syscall will fail with -E2BIG. As currently implemented, sched_copy_attr() returns -EFBIG for this case, but the logic in sys_sched_setattr() converts that error to -EFAULT. This patch fixes the behavior. [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760 Signed-off-by: Michael Kerrisk <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
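The fix amounts to propagating the error from sched_copy_attr() in sys_sched_setattr() instead of flattening every failure to -EFAULT; roughly (a sketch, not the exact upstream diff):

    -       if (sched_copy_attr(uattr, &attr))
    -               return -EFAULT;
    +       retval = sched_copy_attr(uattr, &attr);
    +       if (retval)
    +               return retval;  /* -EFBIG (or -EFAULT) now reaches userspace */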
2014-05-22Linux 3.15-rc6Linus Torvalds1-1/+1
2014-05-22Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpcLinus Torvalds2-4/+2
Pull two powerpc fixes from Ben Herrenschmidt: "Here are a couple of fixes for 3.15. One from Anton fixes a nasty regression I introduced when trying to fix a loss of irq_work, whose consequence is that we can completely lose timer interrupts on a CPU... not pretty. The other one is a change to our PCIe reset hook to use a firmware call instead of direct config space accesses to trigger a fundamental reset on the root port. This is necessary so that the FW gets a chance to disable the link down error monitoring, which would otherwise trip and cause a subsequent fatal EEH error" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc: irq work racing with timer interrupt can result in timer interrupt hang powerpc/powernv: Reset root port in firmware
2014-05-22Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds2-2/+6
Pull two btrfs fixes from Chris Mason: "This has two fixes that we've been testing for 3.16, but since both are safe and fix real bugs, it makes sense to send for 3.15 instead" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: send, fix incorrect ref access when using extrefs Btrfs: fix EIO on reading file after ioctl clone works on it
2014-05-22Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-clientLinus Torvalds2-1/+24
Pull two ceph fixes from Sage Weil: "The first patch fixes a problem when we have a page count of 0 for sendpage which is triggered by zfs. The second fixes a bug in CRUSH that was resolved in the userland code a while back but fell through the cracks on the kernel side" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: crush: decode and initialize chooseleaf_vary_r libceph: fix corruption when using page_count 0 page in rbd
2014-05-22Merge tag 'xfs-for-linus-3.15-rc6' of git://oss.sgi.com/xfs/xfsLinus Torvalds5-25/+27
Pull xfs fixes from Dave Chinner: "Code inspection of the XFS error number sign translations found a bunch of issues, including returning incorrectly signed errors for some data integrity operations. These leak to userspace and result in applications not getting the errors correctly reported. Hence they need fixing sooner rather than later. A couple of the bugs are in data integrity operations, a couple more are in the new COLLAPSE_RANGE code. One of these came in through a recent ext4 merge and so I had to update the base tree to 3.15-rc5 before fixing the issues" * tag 'xfs-for-linus-3.15-rc6' of git://oss.sgi.com/xfs/xfs: xfs: list_lru_init returns a negative error xfs: negate xfs_icsb_init_counters error value xfs: negate mount workqueue init error value xfs: fix wrong err sign on xfs_set_acl() xfs: fix wrong errno from xfs_initxattrs xfs: correct error sign on COLLAPSE_RANGE errors xfs: xfs_commit_metadata returns wrong errno xfs: fix incorrect error sign in xfs_file_aio_read xfs: xfs_dir_fsync() returns positive errno
2014-05-22Merge branch 'renameat2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfsLinus Torvalds10-5/+16
Pull renameat2 arch support from Miklos Szeredi: "I've collected architecture patches for the renameat2 syscall that maintainers acked and/or asked me to queue. This adds architecture support for the renameat2 syscall to m68k, parisc, ia64 and through asm-generic to arc, arm64, c6x, hexagon, metag, openrisc, score, tile, unicore32" * 'renameat2' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: scripts/checksyscalls.sh: Make renameat optional asm-generic: Add renameat2 syscall ia64: add renameat2 syscall parisc: add renameat2 syscall m68k: add renameat2 syscall
2014-05-22Merge tag 'iommu-fixes-v3.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommuLinus Torvalds3-2/+4
Pull iommu fixes from Joerg Roedel: "Three fixes for the AMD IOMMU driver: - fix a locking issue around get_user_pages() - fix two issues with device aliasing and exclusion range handling" * tag 'iommu-fixes-v3.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/amd: fix enabling exclusion range for an exact device iommu/amd: Take mmap_sem when calling get_user_pages iommu/amd: Fix interrupt remapping for aliased devices
2014-05-22Merge tag 'stable/for-linus-3.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibftLinus Torvalds1-0/+1
Pull iscsi_ibft fix from Konrad Rzeszutek Wilk: "Fix iBFT regression on Broadcom NICs introduced in 3.2" * tag 'stable/for-linus-3.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft: iscsi_ibft: Fix finding Broadcom specific ibft sign
2014-05-22Merge tag 'renesas-sh-drivers-for-v3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesasLinus Torvalds3-8/+28
Pull SH driver fix from Simon Horman: "Compile drivers/sh/pm_runtime.c if ARCH_SHMOBILE_MULTI This resolves a regression introduced in v3.14 by commit bf98c1eac1d4 ("ARM: Rename ARCH_SHMOBILE to ARCH_SHMOBILE_LEGACY")" * tag 'renesas-sh-drivers-for-v3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/horms/renesas: drivers: sh: compile drivers/sh/pm_runtime.c if ARCH_SHMOBILE_MULTI