aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2014-06-05uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()Oleg Nesterov1-3/+3
copy_insn() fails with -EIO if ->readpage == NULL, but this error is not propagated unless uprobe_register() path finds ->mm which already mmaps this file. In this case (say) "perf record" does not actually install the probe, but the user can't know about this. Move this check into uprobe_register() so that this problem can be detected earlier and reported to user. Note: this is still not perfect, - copy_insn() and arch_uprobe_analyze_insn() should be called by uprobe_register() but this is not simple, we need vm_file for read_mapping_page() (although perhaps we can pass NULL), and we need ->mm for is_64bit_mm() (although this logic is broken anyway). - uprobe_register() should be called by create_trace_uprobe(), not by probe_event_enable(), so that an error can be detected at "perf probe -x" time. This also needs more changes in the core uprobe code, uprobe register/unregister interface was poorly designed from the very beginning. Reported-by: Denys Vlasenko <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05perf: Disable sampled events if no PMU interruptVince Weaver1-0/+7
Add common code to generate -ENOTSUPP at event creation time if an architecture attempts to create a sampled event and PERF_PMU_NO_INTERRUPT is set. This adds a new pmu->capabilities flag. Initially we only support PERF_PMU_NO_INTERRUPT (to indicate a PMU has no support for generating hardware interrupts) but there are other capabilities that can be added later. Signed-off-by: Vince Weaver <[email protected]> Acked-by: Will Deacon <[email protected]> [peterz: rename to PERF_PMU_CAP_* and moved the pmu::capabilities word into a hole] Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161708060.11099@vincent-weaver-1.umelst.maine.edu Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05perf: Fix use after free in perf_remove_from_context()Peter Zijlstra1-2/+2
While that mutex should guard the elements, it doesn't guard against the use-after-free that's from list_for_each_entry_rcu(). __perf_event_exit_task() can actually free the event. And because list addition/deletion is guarded by both ctx->mutex and ctx->lock, holding ctx->mutex is sufficient for reading the list, so we don't actually need the rcu list iteration. Fixes: 3a497f48637e ("perf: Simplify perf_event_exit_task_context()") Reported-by: Sasha Levin <[email protected]> Tested-by: Sasha Levin <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Dave Jones <[email protected]> Cc: [email protected] Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05Merge branch 'perf/kprobes' into perf/coreIngo Molnar8-254/+343
Conflicts: arch/x86/kernel/traps.c The kprobes enhancements are fully cooked, ship them upstream. Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05Merge branch 'perf/uprobes' into perf/coreIngo Molnar2-34/+47
These bits from Oleg are fully cooked, ship them to Linus. Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/idle: Optimize try-to-wake-up IPIPeter Zijlstra3-9/+61
[ This series reduces the number of IPIs on Andy's workload by something like 99%. It's down from many hundreds per second to very few. The basic idea behind this series is to make TIF_POLLING_NRFLAG be a reliable indication that the idle task is polling. Once that's done, the rest is reasonably straightforward. ] When enqueueing tasks on remote LLC domains, we send an IPI to do the work 'locally' and avoid bouncing all the cachelines over. However, when the remote CPU is idle (and polling, say x86 mwait), we don't need to send an IPI, we can simply kick the TIF word to wake it up and have the 'idle' loop do the work. So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet) set, set _TIF_NEED_RESCHED and avoid sending the IPI. Much-requested-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> [Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.] Signed-off-by: Andy Lutomirski <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Mike Galbraith <[email protected]> Cc: [email protected] Cc: Rafael J. Wysocki <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/idle: Simplify wake_up_idle_cpu()Andy Lutomirski1-20/+1
Now that rq->idle's polling bit is a reliable indication that the cpu is polling, use it. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/922f00761445a830ebb23d058e2ae53956ce2d73.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/idle: Clear polling before descheduling the idle threadAndy Lutomirski1-1/+25
Currently, the only real guarantee provided by the polling bit is that, if you hold rq->lock and the polling bit is set, then you can set need_resched to force a reschedule. The only reason the lock is needed is that the idle thread might not be running at all when setting its need_resched bit, and rq->lock keeps it pinned. This is easy to fix: just clear the polling bit before scheduling. Now the idle thread's polling bit is only ever set when rq->curr == rq->idle. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Rafael J. Wysocki <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/b2059fcb4c613d520cb503b6fad6e47033c7c203.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched, trace: Add a tracepoint for IPI-less remote wakeupsAndy Lutomirski1-0/+4
Remote wakeups of polling CPUs are a valuable performance improvement; add a tracepoint to make it much easier to verify that they're working. Signed-off-by: Andy Lutomirski <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: David Ahern <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/16205aee116772aa686814f9b13bccb562108047.1401902905.git.luto@amacapital.net Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Remove redundant assignment to "rt_rq" in update_curr_rt(...)Giedrius Rekasius1-2/+1
Variable "rt_rq" is used only in block "for_each_sched_rt_entity" so the value assigned to it at the beginning of the update_curr_rt(...) gets overwritten without ever being read. Remove redundant assignment and move variable declaration to the block in which it is being used. Signed-off-by: Giedrius Rekasius <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Rename capacity related flagsNicolas Pitre3-15/+15
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Let's rename the following feature flags since they do relate to capacity: SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY ARCH_POWER -> ARCH_CAPACITY NONTASK_POWER -> NONTASK_CAPACITY Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Andy Fleming <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Grant Likely <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rob Herring <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vasant Hegde <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Final power vs. capacity cleanupsNicolas Pitre2-32/+33
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This contains the architecture visible changes. Incidentally, only ARM takes advantage of the available pow^H^H^Hcapacity scaling hooks and therefore those changes outside kernel/sched/ are confined to one ARM specific file. The default arch_scale_smt_power() hook is not overridden by anyone. Replacements are as follows: arch_scale_freq_power --> arch_scale_freq_capacity arch_scale_smt_power --> arch_scale_smt_capacity SCHED_POWER_SCALE --> SCHED_CAPACITY_SCALE SCHED_POWER_SHIFT --> SCHED_CAPACITY_SHIFT The local usage of "power" in arch/arm/kernel/topology.c is also changed to "capacity" as appropriate. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Dietmar Eggemann <[email protected]> Cc: Grant Likely <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Brown <[email protected]> Cc: Rob Herring <[email protected]> Cc: Russell King <[email protected]> Cc: Sudeep KarkadaNagesha <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Remove remaining dubious usage of "power"Nicolas Pitre3-55/+55
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. This is the remaining "power" -> "capacity" rename for local symbols. Those symbols visible to the rest of the kernel are not included yet. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Let 'struct sched_group_power' care about CPU capacityNicolas Pitre3-114/+114
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. Since struct sched_group_power is really about compute capacity of sched groups, let's rename it to struct sched_group_capacity. Similarly sgp becomes sgc. Related variables and functions dealing with groups are also adjusted accordingly. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Disambiguate existing/remaining "capacity" usageNicolas Pitre1-21/+21
We have "power" (which should actually become "capacity") and "capacity" which is a scaled down "capacity factor" in terms of unitary tasks. Let's use "capacity_factor" to make room for proper usage of "capacity" later. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Change "has_capacity" to "has_free_capacity"Nicolas Pitre1-13/+13
The capacity of a CPU/group should be some intrinsic value that doesn't change with task placement. It is like a container which capacity is stable regardless of the amount of liquid in it (its "utilization")... unless the container itself is crushed that is, but that's another story. Therefore let's rename "has_capacity" to "has_free_capacity" in order to better convey the wanted meaning. Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Remove "power" from 'struct numa_stats'Nicolas Pitre1-6/+7
It is better not to think about compute capacity as being equivalent to "CPU power". The upcoming "power aware" scheduler work may create confusion with the notion of energy consumption if "power" is used too liberally. To make things explicit and not create more confusion with the existing "capacity" member, let's rename things as follows: power -> compute_capacity capacity -> task_capacity Note: none of those fields are actually used outside update_numa_stats(). Signed-off-by: Nicolas Pitre <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Fix signedness bug in yield_to()Dan Carpenter1-1/+1
yield_to() is supposed to return -ESRCH if there is no task to yield to, but because the type is bool that is the same as returning true. The only place I see which cares is kvm_vcpu_on_spin(). Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: Raghavendra <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Gleb Natapov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/20140523102042.GA7267@mwanda Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Use time_after() in record_wakee()Manuel Schölling1-1/+1
To be future-proof and for better readability the time comparisons are modified to use time_after() instead of plain, error-prone math. Signed-off-by: Manuel Schölling <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/balancing: Reduce the rate of needless idle load balancingTim Chen1-6/+11
The current no_hz idle load balancer do load balancing for *all* idle cpus, even though the time due to load balance for a particular idle cpu could be still a while in the future. This introduces a much higher load balancing rate than what is necessary. The patch changes the behavior by only doing idle load balancing on behalf of an idle cpu only when it is due for load balancing. On SGI's systems with over 3000 cores, the cpu responsible for idle balancing got overwhelmed with idle balancing, and introduces a lot of OS noise to workloads. This patch fixes the issue. Signed-off-by: Tim Chen <[email protected]> Acked-by: Russ Anderson <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Jason Low <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Len Brown <[email protected]> Cc: Dimitri Sivanich <[email protected]> Cc: Hedi Berriche <[email protected]> Cc: Andi Kleen <[email protected]> Cc: MichelLespinasse <[email protected]> Cc: Peter Hurley <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/1400621967.2970.280.camel@schen9-DESK Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Fix unlocked reads of some cfs_b->quota/periodBen Segall1-19/+21
sched_cfs_period_timer() reads cfs_b->period without locks before calling do_sched_cfs_period_timer(), and similarly unthrottle_offline_cfs_rqs() would read cfs_b->period without the right lock. Thus a simultaneous change of bandwidth could cause corruption on any platform where ktime_t or u64 writes/reads are not atomic. Extend cfs_b->lock from do_sched_cfs_period_timer() to include the read of cfs_b->period to solve that issue; unthrottle_offline_cfs_rqs() can just use 1 rather than the exact quota, much like distribute_cfs_runtime() does. There is also an unlocked read of cfs_b->runtime_expires, but a race there would only delay runtime expiry by a tick. Still, the comparison should just be != anyway, which clarifies even that problem. Signed-off-by: Ben Segall <[email protected]> Tested-by: Roman Gushchin <[email protected]> [peterz: Fix compile warn] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20140519224945.20303.93530.stgit@sword-of-the-dawn.mtv.corp.google.com Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lockRoman Gushchin3-7/+6
tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to force the period timer restart. It's not safe, because can lead to deadlock, described in commit 927b54fccbf0: "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock, waiting for the hrtimer to finish. However, if sched_cfs_period_timer runs for another loop iteration, the hrtimer can attempt to take rq->lock, resulting in deadlock." Three CPUs must be involved: CPU0 CPU1 CPU2 take rq->lock period timer fired ... take cfs_b lock ... ... tg_set_cfs_bandwidth() throttle_cfs_rq() release cfs_b lock take cfs_b lock ... distribute_cfs_runtime() timer_active = 0 take cfs_b->lock wait for rq->lock ... __start_cfs_bandwidth() {wait for timer callback break if timer_active == 1} So, CPU0 and CPU1 are deadlocked. Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can wait for period timer callbacks (ignoring cfs_b->timer_active) and restart the timer explicitly. Signed-off-by: Roman Gushchin <[email protected]> Reviewed-by: Ben Segall <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%[email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/dl: Fix race in dl_task_timer()Kirill Tkhai1-1/+9
Throttled task is still on rq, and it may be moved to other cpu if user is playing with sched_setaffinity(). Therefore, unlocked task_rq() access makes the race. Juri Lelli reports he got this race when dl_bandwidth_enabled() was not set. Other thing, pointed by Peter Zijlstra: "Now I suppose the problem can still actually happen when you change the root domain and trigger a effective affinity change that way". To fix that we do the same as made in __task_rq_lock(). We do not use __task_rq_lock() itself, because it has a useful lockdep check, which is not correct in case of dl_task_timer(). We do not need pi_lock locked here. This case is an exception (PeterZ): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> # v3.14+ Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched: Fix sched_policy < 0 comparisonRichard Weinberger1-1/+1
attr.sched_policy is u32, therefore a comparison against < 0 is never true. Fix this by casting sched_policy to int. This issue was reported by coverity CID 1219934. Fixes: dbdb22754fde ("sched: Disallow sched_attr::sched_policy < 0") Signed-off-by: Richard Weinberger <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Michael Kerrisk <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabledSteven Rostedt1-3/+4
As Peter Zijlstra told me, we have the following path: do_exit() exit_itimers() itimer_delete() spin_lock_irqsave(&timer->it_lock, &flags); timer_delete_hook(timer); kc->timer_del(timer) := posix_cpu_timer_del() put_task_struct() __put_task_struct() task_numa_free() spin_lock(&grp->lock); Which means that task_numa_free() can be called with interrupts disabled, which means that we should not be using spin_lock_irq() but spin_lock_irqsave() instead. Otherwise we are enabling interrupts while holding an interrupt unsafe lock! Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner<[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05locking/rwsem: Fix checkpatch.pl warningsAndrew Morton1-3/+3
WARNING: line over 80 characters #205: FILE: kernel/locking/rwsem-xadd.c:275: + old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS); WARNING: line over 80 characters #376: FILE: kernel/locking/rwsem-xadd.c:434: + * If there were already threads queued before us and there are no WARNING: line over 80 characters #377: FILE: kernel/locking/rwsem-xadd.c:435: + * active writers, the lock must be read owned; so we try to wake total: 0 errors, 3 warnings, 417 lines checked Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Tim Chen <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-05locking/rwsem: Support optimistic spinningDavidlohr Bueso2-30/+226
We have reached the point where our mutexes are quite fine tuned for a number of situations. This includes the use of heuristics and optimistic spinning, based on MCS locking techniques. Exclusive ownership of read-write semaphores are, conceptually, just about the same as mutexes, making them close cousins. To this end we need to make them both perform similarly, and right now, rwsems are simply not up to it. This was discovered by both reverting commit 4fc3f1d6 (mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable) and similarly, converting some other mutexes (ie: i_mmap_mutex) to rwsems. This creates a situation where users have to choose between a rwsem and mutex taking into account this important performance difference. Specifically, biggest difference between both locks is when we fail to acquire a mutex in the fastpath, optimistic spinning comes in to play and we can avoid a large amount of unnecessary sleeping and overhead of moving tasks in and out of wait queue. Rwsems do not have such logic. This patch, based on the work from Tim Chen and I, adds support for write-side optimistic spinning when the lock is contended. It also includes support for the recently added cancelable MCS locking for adaptive spinning. Note that is is only applicable to the xadd method, and the spinlock rwsem variant remains intact. Allowing optimistic spinning before putting the writer on the wait queue reduces wait queue contention and provided greater chance for the rwsem to get acquired. With these changes, rwsem is on par with mutex. The performance benefits can be seen on a number of workloads. For instance, on a 8 socket, 80 core 64bit Westmere box, aim7 shows the following improvements in throughput: +--------------+---------------------+-----------------+ | Workload | throughput-increase | number of users | +--------------+---------------------+-----------------+ | alltests | 20% | >1000 | | custom | 27%, 60% | 10-100, >1000 | | high_systime | 36%, 30% | >100, >1000 | | shared | 58%, 29% | 10-100, >1000 | +--------------+---------------------+-----------------+ There was also improvement on smaller systems, such as a quad-core x86-64 laptop running a 30Gb PostgreSQL (pgbench) workload for up to +60% in throughput for over 50 clients. Additionally, benefits were also noticed in exim (mail server) workloads. Furthermore, no performance regression have been seen at all. Based-on-work-from: Tim Chen <[email protected]> Signed-off-by: Davidlohr Bueso <[email protected]> [peterz: rej fixup due to comment patches, sched/rt.h header] Signed-off-by: Peter Zijlstra <[email protected]> Cc: Alex Shi <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Hurley <[email protected]> Cc: "Paul E.McKenney" <[email protected]> Cc: Jason Low <[email protected]> Cc: Aswin Chandramouleeswaran <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: "Scott J Norton" <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Chris Mason <[email protected]> Cc: Josef Bacik <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-06-04cgroup: don't destroy the default rootLi Zefan1-1/+4
The default root is allocated and initialized at boot phase, so we shouldn't destroy the default root when it's umounted, otherwise it will lead to disaster. Just try mount and then umount the default root, and the kernel will crash immediately. v2: - No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun) - Better call cgroup_put() for the default root in kill_sb(). (Tejun) - Add a comment. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2014-06-04Merge branch 'akpm' (patchbomb from Andrew) into nextLinus Torvalds28-229/+337
Merge misc updates from Andrew Morton: - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow. - various misc fixes and cleanups - most of the ocfs2 queue. Review is slow... - most of MM. The MM queue is pretty huge this time, but not much in the way of feature work. - some tweaks under kernel/ - printk maintenance work - updates to lib/ - checkpatch updates - tweaks to init/ * emailed patches from Andrew Morton <[email protected]>: (276 commits) fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul init/main.c: remove an ifdef kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND init/main.c: add initcall_blacklist kernel parameter init/main.c: don't use pr_debug() fs/binfmt_flat.c: make old_reloc() static fs/binfmt_elf.c: fix bool assignements fs/efs: convert printk(KERN_DEBUG to pr_debug fs/efs: add pr_fmt / use __func__ fs/efs: convert printk to pr_foo() scripts/checkpatch.pl: device_initcall is not the only __initcall substitute checkpatch: check stable email address checkpatch: warn on unnecessary void function return statements checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar); checkpatch: add warning for kmalloc/kzalloc with multiply checkpatch: warn on #defines ending in semicolon checkpatch: make --strict a default for files in drivers/net and net/ checkpatch: always warn on missing blank line after variable declaration block checkpatch: fix wildcard DT compatible string checking ...
2014-06-04kernel/compat.c: use sizeof() instead of sizeofFabian Frederick1-4/+4
Fix 4 checkpatch warnings WARNING: sizeof *tv should be sizeof(*tv) Signed-off-by: Fabian Frederick <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Arnd Bergmann <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/printk: use symbolic defines for console loglevelsBorislav Petkov4-13/+6
... instead of naked numbers. Stuff in sysrq.c used to set it to 8 which is supposed to mean above default level so set it to DEBUG instead as we're terminating/killing all tasks and we want to be verbose there. Also, correct the check in x86_64_start_kernel which should be >= as we're clearly issuing the string there for all debug levels, not only the magical 10. Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Kees Cook <[email protected]> Acked-by: Randy Dunlap <[email protected]> Cc: Joe Perches <[email protected]> Cc: Valdis Kletnieks <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: report dropping of messages from logbufWill Deacon1-2/+7
If the log ring buffer becomes full, we silently overwrite old messages with new data. console_unlock will detect this case and fast-forward the console_* pointers to skip over the corrupted data, but nothing will be reported to the user. This patch hijacks the first valid log message after detecting that we dropped messages and prefixes it with a note detailing how many messages were dropped. For long (~1000 char) messages, this will result in some truncation of the real message, but given that we're dropping things anyway, that doesn't seem to be the end of the world. Signed-off-by: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Kay Sievers <[email protected]> Cc: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04timekeeping: use printk_deferred when holding timekeeping seqlockJohn Stultz2-9/+13
Jiri Bohac pointed out that there are rare but potential deadlock possibilities when calling printk while holding the timekeeping seqlock. This is due to printk() triggering console sem wakeup, which can cause scheduling code to trigger hrtimers which may try to read the time. Specifically, as Jiri pointed out, that path is: printk vprintk_emit console_unlock up(&console_sem) __up wake_up_process try_to_wake_up ttwu_do_activate ttwu_activate activate_task enqueue_task enqueue_task_fair hrtick_update hrtick_start_fair hrtick_start_fair get_time ktime_get --> endless loop on read_seqcount_retry(&timekeeper_seq, ...) This patch tries to avoid this issue by using printk_deferred (previously named printk_sched) which should defer printing via a irq_work_queue. Signed-off-by: John Stultz <[email protected]> Reported-by: Jiri Bohac <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Cc: Jan Kara <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: Add printk_deferred_onceJohn Stultz2-13/+2
Two of the three prink_deferred uses are really printk_once style uses, so add a printk_deferred_once macro to simplify those call sites. Signed-off-by: John Stultz <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: rename printk_sched to printk_deferredJohn Stultz4-4/+4
After learning we'll need some sort of deferred printk functionality in the timekeeping core, Peter suggested we rename the printk_sched function so it can be reused by needed subsystems. This only changes the function name. No logic changes. Signed-off-by: John Stultz <[email protected]> Reviewed-by: Steven Rostedt <[email protected]> Cc: Jan Kara <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: disable preemption for printk_schedJohn Stultz1-0/+2
An earlier change in -mm (printk: remove separate printk_sched buffers...), removed the printk_sched irqsave/restore lines since it was safe for current users. Since we may be expanding usage of printk_sched(), disable preepmtion for this function to make it more generally safe to call. Signed-off-by: John Stultz <[email protected]> Reviewed-by: Jan Kara <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jiri Bohac <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: remove separate printk_sched buffers and use printk buf insteadSteven Rostedt1-18/+29
To prevent deadlocks with doing a printk inside the scheduler, printk_sched() was created. The issue is that printk has a console_sem that it can grab and release. The release does a wake up if there's a task pending on the sem, and this wake up grabs the rq locks that is held in the scheduler. This leads to a possible deadlock if the wake up uses the same rq as the one with the rq lock held already. What printk_sched() does is to save the printk write in a per cpu buffer and sets the PRINTK_PENDING_SCHED flag. On a timer tick, if this flag is set, the printk() is done against the buffer. There's a couple of issues with this approach. 1) If two printk_sched()s are called before the tick, the second one will overwrite the first one. 2) The temporary buffer is 512 bytes and is per cpu. This is a quite a bit of space wasted for something that is seldom used. In order to remove this, the printk_sched() can use the printk buffer instead, and delay the console_trylock()/console_unlock() to the queued work. Because printk_sched() would then be taking the logbuf_lock, the logbuf_lock must not be held while doing anything that may call into the scheduler functions, which includes wake ups. Unfortunately, printk() also has a console_sem that it uses, and on release, the up(&console_sem) may do a wake up of any pending waiters. This must be avoided while holding the logbuf_lock. Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: enable interrupts before calling console_trylock_for_printk()Jan Kara1-11/+18
We need interrupts disabled when calling console_trylock_for_printk() only so that cpu id we pass to can_use_console() remains valid (for other things console_sem provides all the exclusion we need and deadlocks on console_sem due to interrupts are impossible because we use down_trylock()). However if we are rescheduled, we are guaranteed to run on an online cpu so we can easily just get the cpu id in can_use_console(). We can lose a bit of performance when we enable interrupts in vprintk_emit() and then disable them again in console_unlock() but OTOH it can somewhat reduce interrupt latency caused by console_unlock() especially since later in the patch series we will want to spin on console_sem in console_trylock_for_printk(). Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: fix lockdep instrumentation of console_semJan Kara1-14/+32
Printk calls mutex_acquire() / mutex_release() by hand to instrument lockdep about console_sem. However in some corner cases the instrumentation is missing. Fix the problem by creating helper functions for locking / unlocking console_sem which take care of lockdep instrumentation as well. Signed-off-by: Jan Kara <[email protected]> Reported-by: Fabio Estevam <[email protected]> Reported-by: Andy Shevchenko <[email protected]> Tested-by: Fabio Estevam <[email protected]> Tested-By: Valdis Kletnieks <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: release lockbuf_lock before calling console_trylock_for_printk()Jan Kara1-33/+21
There's no reason to hold lockbuf_lock when entering console_trylock_for_printk(). The first thing this function does is to call down_trylock(console_sem) and if that fails it immediately unlocks lockbuf_lock. So lockbuf_lock isn't needed for that branch. When down_trylock() succeeds, the rest of console_trylock() is OK without lockbuf_lock (it is called without it from other places), and the only remaining thing in console_trylock_for_printk() is can_use_console() call. For that call console_sem is enough (it iterates all consoles and checks CON_ANYTIME flag). So we drop logbuf_lock before entering console_trylock_for_printk() which simplifies the code. [[email protected]: fix have_callable_console() comment] Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: remove outdated commentJan Kara1-2/+1
Comment about interesting interlocking between lockbuf_lock and console_sem is outdated. It was added in 2002 by commit a880f45a48be during conversion of console_lock to console_sem + lockbuf_lock. At that time release_console_sem() (today's equivalent is console_unlock()) was indeed using lockbuf_lock to avoid races between trylock on console_sem in printk() and unlock of console_sem. However these days the interlocking is gone and the races are avoided by rechecking logbuf state after releasing console_sem. Signed-off-by: Jan Kara <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: return really stored message lengthPetr Mladek1-15/+21
I wonder if anyone uses printk return value but it is there and should be counted correctly. This patch modifies log_store() to return the number of really stored bytes from the 'text' part. Also it handles the return value in vprintk_emit(). Note that log_store() is used also in cont_flush() but we could ignore the return value there. The function works with characters that were already counted earlier. In addition, the store could newer fail here because the length of the printed text is limited by the "cont" buffer and "dict" is NULL. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: shrink too long messagesPetr Mladek1-3/+39
We might want to print at least part of too long messages and add some warning for debugging purpose. The question is how long the shrunken message should be. If we use the whole buffer, it might get rotated too soon. Let's try to use only 1/4 of the buffer for now. Also shrink the whole dictionary. We do not want to parse it or break it in the middle of some pair of values. It would not cause any real harm but still. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: split message size computationPetr Mladek1-3/+13
We will want to recompute the message size when shrinking too long messages. Let's put the code into separate function. The side effect of setting "pad_len" is not nice but it is worth removing the code duplication. Note that I will probably have one more usage for this function when handling messages safe way in NMI context. This patch does not change the existing behavior. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: ignore too long messagesPetr Mladek1-7/+23
There was no check for too long messages. The check for free space always passed when first_seq and next_seq were equal. Enough free space was not guaranteed, though. log_store() might be called to store messages up to 64kB + 64kB + 16B. This is sum of maximal text_len, dict_len values, and the size of the structure printk_log. On the other hand, the minimal size for the main log buffer currently is 4kB and it is enforced only by Kconfig. The good news is that the usage looks safe right now. log_store() is called only from vprintk_emit() and cont_flush(). Here the "text" part is always passed via a static buffer and the length is limited to LOG_LINE_MAX which is 1024. The "dict" part is NULL in most cases. The only exceptions is when vprintk_emit() is called from printk_emit() and dev_vprintk_emit(). But printk_emit() is currently used only in devkmsg_writev() and here "dict" is NULL as well. In dev_vprintk_emit(), "dict" is limited by the static buffer "hdr" of the size 128 bytes. It meas that the current maximal printed text is 1024B + 128B + 16B and it always fit the log buffer. But it is only matter of time when someone calls printk_emit() with unsafe parameters, especially the "dict" one. This patch adds a check for the free space when the buffer is empty. It reuses the already existing log_has_space() function but it has to add an extra parameter. It defines whether the buffer is empty. Note that the same values of "first_idx" and "next_idx" might also mean that the buffer is full. If the buffer is empty, we must respect the current position of the indexes. We cannot reset them to the beginning of the buffer. Otherwise, the functions reading the buffer would get crazy. The question is what to do when the message is too long. This patch uses the easiest solution and just ignores the problematic message. Let's do something better in a followup patch. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04printk: split code for making free space in the log bufferPetr Mladek1-15/+29
The check for free space in the log buffer always passes when "first_seq" and "next_seq" are equal. In theory, it might cause writing outside of the log buffer. Fortunately, the current usage looks safe because the used "text" and "dict" buffers are quite limited. See the second patch for more details. Anyway, it is better to be on the safe side and add a check. An easy solution is done in the 2nd patch and it is improved in the 4th patch. 5th patch fixes the computation of the printed message length. 1st and 3rd patches just do some code refactoring to make the other patches easier. This patch (of 5): There will be needed some fixes in the check for free space. They will be easier if the code is moved outside of the quite long log_store() function. This patch does not change the existing behavior. Signed-off-by: Petr Mladek <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Kay Sievers <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/user.c: drop unused field 'files' from user_structKirill A. Shutemov1-1/+0
Nobody seems uses it for a long time. Let's drop it. Signed-off-by: Kirill A. Shutemov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/hung_task.c: convert simple_strtoul to kstrtouintFabian Frederick1-1/+3
sysctl_hung_task_panic has been changed to unsigned int. use kstrtouint instead of obsolete simple_strtoul Signed-off-by: Fabian Frederick <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/utsname_sysctl.c: replace obsolete __initcall by device_initcallFabian Frederick1-2/+2
Also fixes checkpatch warnings on proc_dostring function parameters Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-06-04kernel/reboot.c: convert simple_strtoul to kstrtointFabian Frederick1-7/+14
Replace obsolete function. kstrtoint is used as reboot_cpu is an integer. Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>