path: root/kernel
Age  Commit message  Author  Files  Lines
2014-07-07  rcu: Parallelize and economize NOCB kthread wakeups  Paul E. McKenney  2  -43/+237
An 80-CPU system with a context-switch-heavy workload can require so many NOCB kthread wakeups that the RCU grace-period kthreads spend several tens of percent of a CPU just awakening things. This clearly will not scale well: If you add enough CPUs, the RCU grace-period kthreads would get behind, increasing grace-period latency. To avoid this problem, this commit divides the NOCB kthreads into leaders and followers, where the grace-period kthreads awaken the leaders each of whom in turn awakens its followers. By default, the number of groups of kthreads is the square root of the number of CPUs, but this default may be overridden using the rcutree.rcu_nocb_leader_stride boot parameter. This reduces the number of wakeups done per grace period by the RCU grace-period kthread by the square root of the number of CPUs, but of course by shifting those wakeups to the leaders. In addition, because the leaders do grace periods on behalf of their respective followers, the number of wakeups of the followers decreases by up to a factor of two. Instead of being awakened once when new callbacks arrive and again at the end of the grace period, the followers are awakened only at the end of the grace period. For a numerical example, in a 4096-CPU system, the grace-period kthread would awaken 64 leaders, each of which would awaken its 63 followers at the end of the grace period. This compares favorably with the 79 wakeups for the grace-period kthread on an 80-CPU system. Reported-by: Rik van Riel <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
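The group arithmetic in this message can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper names are hypothetical, and it assumes the default stride is the integer square root of the CPU count, as the message describes, with at least one CPU.

```c
/* Hypothetical helpers; not the kernel's implementation. */

/* Integer square root: largest r with r * r <= n. */
static unsigned long isqrt_ul(unsigned long n)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Default group size (stride): the integer square root of the CPU count. */
unsigned long nocb_default_stride(unsigned long ncpus)
{
	return isqrt_ul(ncpus);
}

/* Number of leaders: one per group of `stride` CPUs, rounding up. */
unsigned long nocb_leader_count(unsigned long ncpus, unsigned long stride)
{
	return (ncpus + stride - 1) / stride;
}
```

With 4096 CPUs the sketch yields a stride of 64 and 64 leaders, so each leader has 63 followers, matching the numerical example in the message.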
2014-07-07  torture: Avoid format string leak to thead name  Kees Cook  1  -1/+1
Since the torture-test thread creation interface does not include format string arguments, this commit makes sure the name can never be accidentally processed as a format string. Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
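The general pattern behind this kind of fix can be sketched in userspace C. The helper name below is hypothetical and snprintf() stands in for the kernel's thread-creation interface; the point is only that a caller-supplied name is passed as an argument to a constant "%s" format rather than used as the format itself.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for a "name a thread" helper.
 * The caller-supplied name is passed as an argument to a constant "%s"
 * format, so any '%' characters in it are copied literally.
 */
int set_thread_name(char *buf, size_t len, const char *name)
{
	return snprintf(buf, len, "%s", name);
}
```

A name such as "torture_%d_%s" is then stored unchanged instead of being interpreted as conversion specifiers.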
2014-07-07  workqueue: zero cpumask of wq_numa_possible_cpumask on init  Yasuaki Ishimatsu  1  -1/+1
When hot-adding and onlining a CPU, a kernel panic occurs, showing the following call trace:

  BUG: unable to handle kernel paging request at 0000000000001d08
  IP: [<ffffffff8114acfd>] __alloc_pages_nodemask+0x9d/0xb10
  PGD 0
  Oops: 0000 [#1] SMP
  ...
  Call Trace:
   [<ffffffff812b8745>] ? cpumask_next_and+0x35/0x50
   [<ffffffff810a3283>] ? find_busiest_group+0x113/0x8f0
   [<ffffffff81193bc9>] ? deactivate_slab+0x349/0x3c0
   [<ffffffff811926f1>] new_slab+0x91/0x300
   [<ffffffff815de95a>] __slab_alloc+0x2bb/0x482
   [<ffffffff8105bc1c>] ? copy_process.part.25+0xfc/0x14c0
   [<ffffffff810a3c78>] ? load_balance+0x218/0x890
   [<ffffffff8101a679>] ? sched_clock+0x9/0x10
   [<ffffffff81105ba9>] ? trace_clock_local+0x9/0x10
   [<ffffffff81193d1c>] kmem_cache_alloc_node+0x8c/0x200
   [<ffffffff8105bc1c>] copy_process.part.25+0xfc/0x14c0
   [<ffffffff81114d0d>] ? trace_buffer_unlock_commit+0x4d/0x60
   [<ffffffff81085a80>] ? kthread_create_on_node+0x140/0x140
   [<ffffffff8105d0ec>] do_fork+0xbc/0x360
   [<ffffffff8105d3b6>] kernel_thread+0x26/0x30
   [<ffffffff81086652>] kthreadd+0x2c2/0x300
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60
   [<ffffffff815f20ec>] ret_from_fork+0x7c/0xb0
   [<ffffffff81086390>] ? kthread_create_on_cpu+0x60/0x60

In my investigation, I found that the root cause is wq_numa_possible_cpumask. All entries of wq_numa_possible_cpumask are allocated by alloc_cpumask_var_node() and are then used without being initialized, so they hold wrong values. When hot-adding and onlining a CPU, wq_update_unbound_numa() is called. wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq() calls get_unbound_pool().
In get_unbound_pool(), worker_pool->node is set as follows:

  /* if cpumask is contained inside a NUMA node, we belong to that node */
  if (wq_numa_enabled) {
          for_each_node(node) {
                  if (cpumask_subset(pool->attrs->cpumask,
                                     wq_numa_possible_cpumask[node])) {
                          pool->node = node;
                          break;
                  }
          }
  }

But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the wrong node is selected and, as a result, the kernel panics. With this patch, all entries of wq_numa_possible_cpumask are allocated by zalloc_cpumask_var_node() so that they are zero-initialized, and the panic disappears. Signed-off-by: Yasuaki Ishimatsu <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Cc: [email protected] Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
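The effect of stale bits on the subset test can be shown with a toy model. This is a sketch under stated assumptions: a single unsigned long stands in for a cpumask, and pick_node() and alloc_node_masks_zeroed() are hypothetical names, not kernel functions.

```c
#include <stdlib.h>

/* Toy stand-in for cpumask_subset(): true if every bit of a is set in b. */
static int mask_subset(unsigned long a, unsigned long b)
{
	return (a & ~b) == 0;
}

/*
 * Hypothetical model of the node lookup: return the first node whose
 * possible-CPU mask contains pool_mask, or -1 if none does.
 */
int pick_node(unsigned long pool_mask, const unsigned long *node_masks,
	      int nr_nodes)
{
	for (int node = 0; node < nr_nodes; node++)
		if (mask_subset(pool_mask, node_masks[node]))
			return node;
	return -1;
}

/* Zero-initialized allocation, playing the role of zalloc_cpumask_var_node(). */
unsigned long *alloc_node_masks_zeroed(int nr_nodes)
{
	return calloc((size_t)nr_nodes, sizeof(unsigned long));
}
```

If node 0's mask still carries stale bits that were never cleared, a pool whose CPUs all belong to node 1 can pass the subset test for node 0 first. Starting from zeroed masks, only the bits that were deliberately set can match.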
2014-07-05  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  Linus Torvalds  1  -2/+2
Pull irq fixes from Thomas Gleixner: "A few minor fixlets in ARM SoC irq drivers and a fix for a memory leak which I introduced in the last round of cleanups :("

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Fix memory leak when calling irq_free_hwirqs()
  irqchip: spear_shirq: Fix interrupt offset
  irqchip: brcmstb-l2: Level-2 interrupts are edge sensitive
  irqchip: armada-370-xp: Mask all interrupts during initialization.
2014-07-05  genirq: Fix memory leak when calling irq_free_hwirqs()  Keith Busch  1  -2/+2
irq_free_hwirqs() always calls irq_free_descs() with cnt == 0, which makes the call a no-op: irq_free_hwirqs() decrements the interrupt count itself before passing it on, so nothing is left to free. Fixes: 7b6ef1262549f6afc5c881aaef80beb8fd15f908 Signed-off-by: Keith Busch <[email protected]> Acked-by: David Rientjes <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
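The bug pattern described here, a loop that consumes the same counter that is later passed on, can be sketched as below. The function names and the stub are hypothetical, and the "fixed" variant is only one plausible shape of a fix (counting down a copy), not necessarily the change the commit made.

```c
/* Hypothetical stub: records how many descriptors it was asked to free. */
static int last_freed_cnt;

static void free_descs_stub(unsigned int from, int cnt)
{
	(void)from;
	last_freed_cnt = cnt;
}

/* Buggy shape: the loop consumes cnt, so the final call always sees 0. */
int free_hwirqs_buggy(unsigned int from, int cnt)
{
	for (unsigned int i = from; cnt > 0; i++, cnt--)
		; /* per-interrupt teardown would go here */
	free_descs_stub(from, cnt);
	return last_freed_cnt;
}

/* Fixed shape: count down a copy (j) and leave cnt intact. */
int free_hwirqs_fixed(unsigned int from, int cnt)
{
	unsigned int i;
	int j;

	for (i = from, j = cnt; j > 0; i++, j--)
		; /* per-interrupt teardown would go here */
	free_descs_stub(from, cnt);
	return last_freed_cnt;
}
```

Both functions return the count that reached the stub, which makes the difference directly observable.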
2014-07-05  locking/mutexes: Optimize mutex trylock slowpath  Jason Low  1  -0/+4
The mutex_trylock() function calls into __mutex_trylock_fastpath() when trying to obtain the mutex. On 32 bit x86, in the !__HAVE_ARCH_CMPXCHG case, __mutex_trylock_fastpath() calls directly into __mutex_trylock_slowpath() regardless of whether or not the mutex is locked. In __mutex_trylock_slowpath(), we then acquire the wait_lock spinlock, xchg() lock->count with -1, then set lock->count back to 0 if there are no waiters, and return true if the prev lock count was 1. However, if the mutex is already locked, then there isn't much point in attempting all of the above expensive operations. In this patch, we only attempt the above trylock operations if the mutex is unlocked. Signed-off-by: Jason Low <[email protected]> Reviewed-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
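The check-before-exchange idea can be sketched with C11 atomics. This is a single-threaded illustration only: struct toy_mutex and its functions are hypothetical, waiters are not modelled, and the wait_lock spinlock that the real slowpath holds is omitted, so the sketch is not a usable concurrent mutex.

```c
#include <stdatomic.h>

/*
 * Toy mutex using the count convention from the commit message:
 * 1 = unlocked, 0 = locked, negative = locked with possible waiters.
 * Hypothetical type; waiters and the wait_lock spinlock are omitted.
 */
struct toy_mutex {
	atomic_int count;
	int xchg_calls;	/* instrumentation: exchanges attempted */
};

void toy_mutex_init(struct toy_mutex *m)
{
	atomic_init(&m->count, 1);
	m->xchg_calls = 0;
}

/* Returns 1 if the lock was acquired, 0 otherwise. */
int toy_mutex_trylock(struct toy_mutex *m)
{
	int prev;

	/* Cheap read first: a locked mutex cannot be acquired, so bail out. */
	if (atomic_load(&m->count) != 1)
		return 0;

	m->xchg_calls++;
	prev = atomic_exchange(&m->count, -1);
	/* No waiters exist in this toy model, so set the count back to 0. */
	atomic_store(&m->count, 0);
	return prev == 1;
}
```

The xchg_calls counter shows the saving: a second trylock on an already held mutex returns 0 without performing another exchange.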
2014-07-05  locking/mutexes: Try to acquire mutex only if it is unlocked  Jason Low  1  -3/+4
Upon entering the slowpath in __mutex_lock_common(), we try once more to acquire the mutex. We only try to acquire if (lock->count >= 0). However, what we actually want here is to try to acquire if the mutex is unlocked (lock->count == 1). This patch changes it so that we only try-acquire the mutex upon entering the slowpath if it is unlocked, rather than if the lock count is non-negative. This helps further reduce unnecessary atomic xchg() operations. Furthermore, this patch uses !mutex_is_locked(lock) to do the initial checks for if the lock is free rather than directly calling atomic_read() on the lock->count, in order to improve readability. Signed-off-by: Jason Low <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  locking/mutexes: Delete the MUTEX_SHOW_NO_WAITER macro  Jason Low  1  -10/+8
MUTEX_SHOW_NO_WAITER() is a macro which checks whether there are "no waiters" on a mutex by checking if the lock count is non-negative. Based on feedback from the discussion in the earlier version of this patchset, the macro is not very readable. Furthermore, checking lock->count isn't always the correct way to determine if there are "no waiters" on a mutex. For example, a negative count on a mutex really only means that there "potentially" are waiters. Likewise, there can be waiters on the mutex even if the count is non-negative. Thus, "MUTEX_SHOW_NO_WAITER" doesn't always do what the name of the macro suggests. So this patch deletes the MUTEX_SHOW_NO_WAITER() macro, directly uses atomic_read() instead of the macro, and adds comments which elaborate on how the extra atomic_read() checks can help reduce unnecessary xchg() operations. Signed-off-by: Jason Low <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  locking/mutexes: Correct documentation on mutex optimistic spinning  Jason Low  1  -6/+4
The mutex optimistic spinning documentation states that we spin for acquisition when we find that there are no pending waiters. However, in actuality, whether or not there are waiters for the mutex doesn't determine if we will spin for it. This patch removes that statement and also adds a comment which mentions that we spin for the mutex while we don't need to reschedule. Signed-off-by: Jason Low <[email protected]> Acked-by: Davidlohr Bueso <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  perf: Make perf_event_init_context() function static  Jiri Olsa  1  -1/+1
Leftover from '8dc85d5 perf: Multiple task contexts'. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched: Rework check_for_tasks()  Kirill Tkhai  1  -13/+20
1) Iterate through all threads in the system, checking every thread rather than only group leaders.
2) Check p->on_rq instead of p->state and cputime. A preempted task in a !TASK_RUNNING state or a just-created task may be queued, and we want those to be reported too.
3) Use read_lock() instead of write_lock(). This function does not change any structures, so read_lock() is enough.
Signed-off-by: Kirill Tkhai <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ben Segall <[email protected]> Cc: Fabian Frederick <[email protected]> Cc: Gautham R. Shenoy <[email protected]> Cc: Konstantin Khorenko <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michael wang <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul Gortmaker <[email protected]> Cc: Paul Turner <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Todd E Brandt <[email protected]> Cc: Toshi Kani <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1403684395.3462.44.camel@tkhai Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()  Kirill Tkhai  1  -0/+3
Make rt_rq available for pick_next_task(). Otherwise, its tasks stay imprisoned for a long time, until the dead cpu becomes alive again. Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Kirill Tkhai <[email protected]> CC: Konstantin Khorenko <[email protected]> CC: Ben Segall <[email protected]> CC: Paul Turner <[email protected]> CC: Mike Galbraith <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1403684388.3462.43.camel@tkhai Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/fair: Disable runtime_enabled on dying rq  Kirill Tkhai  2  -1/+29
We kill rq->rd on the CPU_DOWN_PREPARE stage:

  cpuset_cpu_inactive -> cpuset_update_active_cpus -> partition_sched_domains -> cpu_attach_domain -> rq_attach_root -> set_rq_offline

This unthrottles all throttled cfs_rqs. But the cpu is still able to call schedule() until take_cpu_down->__cpu_disable() is called from stop_machine. In this case the tasks from the just-unthrottled cfs_rqs are pickable in the standard scheduler way, and they are picked by the dying cpu. The cfs_rqs become throttled again, and migrate_tasks() in migration_call skips their tasks (one more unthrottle in migrate_tasks()->CPU_DYING does not happen, because rq->rd is already NULL). This patch sets runtime_enabled to zero. This guarantees that the runtime is not accounted, that the cfs_rqs won't exceed the given cfs_rq->runtime_remaining = 1, and that tasks will be pickable in migrate_tasks(). runtime_enabled is recalculated when the rq becomes online again. Ben Segall also noticed that we always enable runtime in tg_set_cfs_bandwidth(). Actually, we should do that for online cpus only. To prevent races with unthrottle_offline_cfs_rqs() we take the get_online_cpus() lock. Reviewed-by: Ben Segall <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Signed-off-by: Kirill Tkhai <[email protected]> CC: Konstantin Khorenko <[email protected]> CC: Paul Turner <[email protected]> CC: Mike Galbraith <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1403684382.3462.42.camel@tkhai Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/numa: Change scan period code to match intent  Rik van Riel  1  -4/+4
Reading through the scan period code and comment, it appears the intent was to slow down NUMA scanning when a majority of accesses are on the local node, specifically a local:remote ratio of 3:1. However, the code actually tests local / (local + remote), and the actual cut-off point was around 30% local accesses, well before a task has actually converged on a node. Changing the threshold to 7 means scanning slows down when a task has around 70% of its accesses local, which appears to match the intent of the code more closely. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
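The arithmetic behind the two thresholds can be worked through with a small sketch. It assumes, as the 30% and 70% figures in the message suggest, that the local share is expressed in tenths; PERIOD_SLOTS and the function names are hypothetical and not taken from the kernel source.

```c
/* Hypothetical constant: accesses are bucketed into tenths. */
#define PERIOD_SLOTS 10

/* Share of local accesses in tenths: local / (local + remote), rounded down. */
int local_share_slots(unsigned long local, unsigned long remote)
{
	if (local + remote == 0)
		return 0;
	return (int)((local * PERIOD_SLOTS) / (local + remote));
}

/* Scanning slows down once the local share reaches the threshold. */
int scan_slows_down(unsigned long local, unsigned long remote, int threshold)
{
	return local_share_slots(local, remote) >= threshold;
}
```

With 30 local and 70 remote accesses the share is 3 tenths, so a threshold of 3 already slows scanning while a threshold of 7 does not. A 3:1 local:remote ratio (75 local, 25 remote) gives 7 tenths and meets the new threshold.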
2014-07-05  sched/numa: Rework best node setting in task_numa_migrate()  Rik van Riel  1  -6/+13
Fix up the best node setting in task_numa_migrate() to deal with a task in a pseudo-interleaved NUMA group, which is already running in the best location. Set the task's preferred nid to the current nid, so task migration is not retried at a high rate. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/numa: Examine a task move when examining a task swap  Rik van Riel  1  -2/+21
Running "perf bench numa mem -0 -m -P 1000 -p 8 -t 20" on a 4 node system results in 160 runnable threads on a system with 80 CPU threads. Once a process has nearly converged, with 39 threads on one node and 1 thread on another node, the remaining thread will be unable to migrate to its preferred node through a task swap. However, a simple task move would make the workload converge, without causing an imbalance. Test for this unlikely occurrence, and attempt a task move to the preferred nid when it happens.

  # Running main, "perf bench numa mem -p 8 -t 20 -0 -m -P 1000"
  ###
  # 160 tasks will execute (on 4 nodes, 80 CPUs):
  # -1x 0MB global shared mem operations
  # -1x 1000MB process shared mem operations
  # -1x 0MB thread local mem operations
  ###
  ###
  #
  # 0.0% [0.2 mins] 0/0 1/1 36/2 0/0 [36/3 ] l: 0-0 ( 0) {0-2}
  # 0.0% [0.3 mins] 43/3 37/2 39/2 41/3 [ 6/10] l: 0-1 ( 1) {1-2}
  # 0.0% [0.4 mins] 42/3 38/2 40/2 40/2 [ 4/9 ] l: 1-2 ( 1) [50.0%] {1-2}
  # 0.0% [0.6 mins] 41/3 39/2 40/2 40/2 [ 2/9 ] l: 2-4 ( 2) [50.0%] {1-2}
  # 0.0% [0.7 mins] 40/2 40/2 40/2 40/2 [ 0/8 ] l: 3-5 ( 2) [40.0%] ( 41.8s converged)

Without this patch, this same perf bench numa mem run had to rely on the scheduler load balancer to first balance out the load (moving a random task), before a task swap could complete the NUMA convergence. The load balancer does not normally take action unless the load difference exceeds 25%. Convergence times of over half an hour have been observed without this patch. With this patch, the NUMA balancing code will simply migrate the task, if that does not cause an imbalance. Also skip examining a CPU in detail if the improvement on that CPU is no more than the best we already have. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/numa: Simplify task_numa_compare()  Rik van Riel  1  -6/+1
When a task is part of a numa_group, the comparison should always use the group weight, in order to make workloads converge. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/numa: Use effective_load() to balance NUMA loads  Rik van Riel  1  -6/+14
When CONFIG_FAIR_GROUP_SCHED is enabled, the load that a task places on a CPU is determined by the group the task is in. The active groups on the source and destination CPU can be different, resulting in a different load contribution by the same task at its source and at its destination. As a result, the load needs to be calculated separately for each CPU, instead of estimated once with task_h_load(). Getting this calculation right allows some workloads to converge, where previously the last thread could get stuck on another node, without being able to migrate to its final destination. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/numa: Move power adjustment into load_too_imbalanced()  Rik van Riel  1  -15/+24
Currently the NUMA code scales the load on each node with the amount of CPU power available on that node, but it does not apply any adjustment to the load of the task that is being moved over. On systems with SMT/HT, this results in a task being weighed much more heavily than a CPU core, and a task move that would even out the load between nodes being disallowed. The correct thing is to apply the power correction to the numbers after we have first applied the move of the tasks' loads to them. This also allows us to do the power correction with a multiplication, rather than a division. Also drop two function arguments for load_too_imbalanced(), since it takes various factors from env already. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
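The ordering described above (apply the task move first, then the power correction, using multiplication instead of division) can be sketched as follows. The function and parameter names are hypothetical and the comparison is a simplification, not the kernel's imbalance test; it relies on load_a / power_a > load_b / power_b being equivalent to load_a * power_b > load_b * power_a when both powers are positive.

```c
/*
 * Hypothetical helper: would the destination node be more heavily loaded,
 * relative to its CPU power, than the source node after the move?
 */
int dst_heavier_after_move(long long src_load, long long dst_load,
			   long long task_load,
			   long long src_power, long long dst_power)
{
	/* Step 1: apply the task move to the raw, unscaled loads. */
	src_load -= task_load;
	dst_load += task_load;

	/* Step 2: power-correct by cross-multiplying instead of dividing. */
	return dst_load * src_power > src_load * dst_power;
}
```

With equal power on both nodes, moving a load of 1000 from a node at 3000 to a node at 1000 leaves them even, so the function returns 0. Halving the destination's power makes the same move tip the balance.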
2014-07-05  sched/numa: Use group's max nid as task's preferred nid  Rik van Riel  1  -16/+1
From task_numa_placement, always try to consolidate the tasks in a group on the group's top nid. In case this task is part of a group that is interleaved over multiple nodes, task_numa_migrate will set the task's preferred nid to the best node it could find for the task, so this patch will cause at most one run through task_numa_migrate. Signed-off-by: Rik van Riel <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/fair: Implement fast idling of CPUs when the system is partially loaded  Tim Chen  2  -5/+28
When a system is lightly loaded (i.e. no more than 1 job per cpu), attempting to pull a job to a cpu before putting it to idle is unnecessary and can be skipped. This patch adds an indicator so the scheduler can know when there is no more than 1 active job on any CPU in the system, and can skip needless job pulls. On a 4 socket machine with a request/response kind of workload from clients, we saw about 0.13 msec delay when we go through a full load balance to try to pull a job from all the other cpus. While 0.1 msec was spent on processing the request and generating a response, the 0.13 msec load balance overhead was actually more than the actual work being done. This overhead can be skipped much of the time for lightly loaded systems. With this patch, we tested with a netperf request/response workload that has the server busy with half the cpus in a 4 socket system. We found the patch eliminated 75% of the load balance attempts before idling a cpu. The overhead of setting/clearing the indicator is low as we already gather the necessary info while we call add_nr_running() and update_sd_lb_stats(). We switch to full load balance immediately if any cpu gets more than one job on its run queue in add_nr_running(). We clear the indicator, to avoid load balance, when we detect that no cpu has more than one job while scanning the run queues in update_sg_lb_stats(). We are aggressive in turning on the load balance and opportunistic in skipping the load balance.
Signed-off-by: Tim Chen <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Jason Low <[email protected]> Cc: "Paul E.McKenney" <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Alex Shi <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Peter Hurley <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK Signed-off-by: Ingo Molnar <[email protected]>
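The "aggressive on, opportunistic off" indicator described in this message can be sketched with a toy model. All names below are hypothetical; the fixed four-CPU array and the single flag stand in for the scheduler's per-cpu run queues and its system-wide indicator.

```c
#define NR_CPUS_TOY 4	/* hypothetical system size */

struct toy_system {
	int nr_running[NR_CPUS_TOY];	/* jobs queued per cpu */
	int overload;			/* set when some cpu has more than 1 job */
};

/* Queue a job; turn the indicator on eagerly when a cpu exceeds one job. */
void toy_add_running(struct toy_system *s, int cpu)
{
	if (++s->nr_running[cpu] >= 2)
		s->overload = 1;
}

/* Periodic scan: clear the indicator only if no cpu has more than one job. */
void toy_update_stats(struct toy_system *s)
{
	int overloaded = 0;

	for (int cpu = 0; cpu < NR_CPUS_TOY; cpu++)
		if (s->nr_running[cpu] > 1)
			overloaded = 1;
	s->overload = overloaded;
}

/* A cpu about to idle attempts a pull only when the indicator is set. */
int toy_should_pull_before_idle(const struct toy_system *s)
{
	return s->overload;
}
```

Setting the flag costs one comparison on the enqueue path, while clearing it is deferred to a scan that walks the queues anyway, which mirrors the low-overhead argument made in the message.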
2014-07-05  sched/idle: Drop !! while calculating 'broadcast'  Viresh Kumar  1  -2/+2
We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero' and so the extra operation to convert it to 'zero or one' can be skipped. Also change type of 'broadcast' to unsigned int, i.e. type of drv->states[*].flags. Signed-off-by: Viresh Kumar <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <[email protected]>
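The difference between the two forms can be shown in a few lines of C. FLAG_TIMER_STOP and both function names are hypothetical; the value 0x10 is chosen only for illustration.

```c
#define FLAG_TIMER_STOP 0x10u	/* hypothetical flag bit */

/* Normalized form: result is exactly 0 or 1. */
unsigned int broadcast_normalized(unsigned int flags)
{
	return !!(flags & FLAG_TIMER_STOP);
}

/* Raw form: result is 0 or the flag bit itself; still fine in a condition. */
unsigned int broadcast_raw(unsigned int flags)
{
	return flags & FLAG_TIMER_STOP;
}
```

Both forms behave identically when used as a truth value. The raw form only requires that the variable holding it is wide enough for the flag bit, which is presumably why the message also changes the type of 'broadcast' to match the flags field.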
2014-07-05  sched: Fix clock_gettime(CLOCK_[PROCESS/THREAD]_CPUTIME_ID) monotonicity  Mike Galbraith  1  -2/+11
If a task has been dequeued, it has been accounted. Do not project cycles that may or may not ever be accounted to a dequeued task, as that may make clock_gettime() both inaccurate and non-monotonic. Protect update_rq_clock() from slight TSC skew while at it. Signed-off-by: Mike Galbraith <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched: Fix potential near-infinite distribute_cfs_runtime() loop  Ben Segall  1  -21/+20
distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long. Instead allow access to the same runtime that distribute is handing out, accepting that corner cases with very low quota may be able to spend the entire cfs_b->runtime during distribute_cfs_runtime, meaning that the runtime directly handed out by distribute_cfs_runtime was over quota. In addition, if a cfs_rq does manage to throttle like this, make sure the existing distribute_cfs_runtime no longer loops over it again. Signed-off-by: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/20140620222120.13814.21652.stgit@sword-of-the-dawn.mtv.corp.google.com Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  sched/core: Fix formatting issues in sched_can_stop_tick()  Viresh Kumar  1  -7/+3
sched_can_stop_tick() uses 7 spaces instead of 8 spaces or a tab at the beginning of a few lines, which doesn't align well with the Coding Guidelines. Also remove local variable 'rq' as it is used at only one place and we can directly use this_rq() instead. Signed-off-by: Viresh Kumar <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/afb781733e4a9ffbced5eb9fd25cc0aa5c6ffd7a.1403596966.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  irq_work: Remove BUG_ON in irq_work_run()  Peter Zijlstra  1  -42/+4
Remove the BUG_ON in irq_work_run() because of a collision with 8d056c48e486 ("CPU hotplug, smp: flush any pending IPI callbacks before CPU offline"), which ends up calling hotplug_cfd()->flush_smp_call_function_queue()->irq_work_run(), which is not from IRQ context. And since that already calls irq_work_run() from the hotplug path, remove our entire hotplug handling. Reported-by: Stephen Warren <[email protected]> Tested-by: Stephen Warren <[email protected]> Reviewed-by: Srivatsa S. Bhat <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-05  Merge branch 'timers/nohz' into sched/core  Ingo Molnar  5  -45/+84
Merge these two, because upcoming patches will touch both areas. Signed-off-by: Ingo Molnar <[email protected]>
2014-07-03  Merge tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  Linus Torvalds  3  -24/+30
Pull tracing fixes from Steven Rostedt: "Oleg Nesterov found and fixed a bug in the perf/ftrace/uprobes code where running:

  # perf probe -x /lib/libc.so.6 syscall
  # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
  # perf record -e probe_libc:syscall whatever

kills the uprobe. Along the way he found some other minor bugs and clean ups that he fixed up making it a total of 4 patches. Doing unrelated work, I found that the reading of the ftrace trace file disables all function tracer callbacks. This was fine when ftrace was the only user, but now that it's used by perf and kprobes, this is a bug where reading trace can disable kprobes and perf. A very unexpected side effect and should be fixed"

* tag 'trace-fixes-v3.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Remove ftrace_stop/start() from reading the trace file
  tracing/uprobes: Fix the usage of uprobe_buffer_enable() in probe_event_enable()
  tracing/uprobes: Kill the bogus UPROBE_HANDLER_REMOVE code in uprobe_dispatcher()
  uprobes: Change unregister/apply to WARN() if uprobe/consumer is gone
  tracing/uprobes: Revert "Support mix of ftrace and perf"
2014-07-03  kernel/printk/printk.c: revert "printk: enable interrupts before calling console_trylock_for_printk()"  Andrew Morton  1  -26/+18
Revert commit 939f04bec1a4 ("printk: enable interrupts before calling console_trylock_for_printk()"). Andreas reported:

: None of the post 3.15 kernel boot for me. They all hang at the GRUB
: screen telling me it loaded and started the kernel, but the kernel
: itself stops before it prints anything (or even replaces the GRUB
: background graphics).

939f04bec1a4 is a modest latency reduction. Revert it until we understand the reason for these failures. Reported-by: Andreas Bombe <[email protected]> Cc: Jan Kara <[email protected]> Cc: Steven Rostedt <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2014-07-03  crypto: fips - only panic on bad/missing crypto mod signatures  Jarod Wilson  1  -4/+0
Per further discussion with NIST, the requirements for FIPS state that we only need to panic the system on failed kernel module signature checks for crypto subsystem modules. This moves the fips-mode-only module signature check out of the generic module loading code, into the crypto subsystem, at points where we can catch both algorithm module loads and mode module loads. At the same time, make CONFIG_CRYPTO_FIPS dependent on CONFIG_MODULE_SIG, as this is entirely necessary for FIPS mode. v2: remove extraneous blank line, perform checks in static inline function, drop no longer necessary fips.h include. CC: "David S. Miller" <[email protected]> CC: Rusty Russell <[email protected]> CC: Stephan Mueller <[email protected]> Signed-off-by: Jarod Wilson <[email protected]> Acked-by: Neil Horman <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2014-07-02  perf: Do not allow optimized switch for non-cloned events  Jiri Olsa  1  -1/+1
The context check in perf_event_context_sched_out allows non-cloned context to be part of the optimized schedule out switch. This could move non-cloned context into another workload child. Once this child exits, the context is closed and leaves all original (parent) events in closed state. Any other new cloned event will have closed state and not measure anything. And probably causing other odd bugs. Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Corey Ashford <[email protected]> Cc: David Ahern <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Linus Torvalds <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2014-07-01  workqueue: stronger test in process_one_work()  Lai Jiangshan  1  -7/+1
When POOL_DISASSOCIATED is cleared, the running worker's local CPU should be the same as pool->cpu without any exception even during cpu-hotplug. This patch changes "(proposition_A && proposition_B && proposition_C)" to "(proposition_B && proposition_C)", so if the old compound proposition is true, the new one must be true too. so this won't hide any possible bug which can be hit by old test. tj: Minor description update and dropped the obvious comment. CC: Jason J. Herne <[email protected]> CC: Sasha Levin <[email protected]> Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2014-07-01workqueue: clear POOL_DISASSOCIATED in rebind_workers()Lai Jiangshan1-4/+1
a9ab775bcadf ("workqueue: directly restore CPU affinity of workers from CPU_ONLINE") moved pool locking into rebind_workers() but left "pool->flags &= ~POOL_DISASSOCIATED" in workqueue_cpu_up_callback(). There is nothing necessarily wrong with it, but there is no benefit either. Let's move it into rebind_workers() and achieve the following benefits: 1) better readability, POOL_DISASSOCIATED is cleared in rebind_workers() as expected. 2) we can guarantee that, when POOL_DISASSOCIATED is clear, the running workers of the pool are on the local CPU (pool->cpu). tj: Minor description update. Signed-off-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2014-07-01cpuset: break kernfs active protection in cpuset_write_resmask()Tejun Heo1-0/+12
Writing to either the "cpuset.cpus" or "cpuset.mems" file flushes cpuset_hotplug_work so that cpu or memory hotunplug doesn't end up migrating tasks off a cpuset after new resources are added to it.

As cpuset_hotplug_work calls into cgroup core via cgroup_transfer_tasks(), this flushing adds a dependency on cgroup core locking from cpuset_write_resmask(). This used to be okay because cgroup interface files were protected by a different mutex; however, 8353da1f91f1 ("cgroup: remove cgroup_tree_mutex") simplified the cgroup core locking and this dependency became a deadlock hazard - cgroup file removal performed under cgroup core lock tries to drain an on-going file operation which is trying to flush cpuset_hotplug_work blocked on the same cgroup core lock.

The locking simplification was done because kernfs added a much easier way to deal with circular dependencies involving kernfs active protection. Let's use the same strategy in cpuset and break active protection in cpuset_write_resmask(). While it isn't the prettiest, this is a very rare, likely unique, situation which also goes away on the unified hierarchy.

The commands to trigger the deadlock warning without the patch and the lockdep output follow.

 localhost:/ # mount -t cgroup -o cpuset xxx /cpuset
 localhost:/ # mkdir /cpuset/tmp
 localhost:/ # echo 1 > /cpuset/tmp/cpuset.cpus
 localhost:/ # echo 0 > cpuset/tmp/cpuset.mems
 localhost:/ # echo $$ > /cpuset/tmp/tasks
 localhost:/ # echo 0 > /sys/devices/system/cpu/cpu1/online

 ======================================================
 [ INFO: possible circular locking dependency detected ]
 3.16.0-rc1-0.1-default+ #7 Not tainted
 -------------------------------------------------------
 kworker/1:0/32649 is trying to acquire lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150

 but task is already holding lock:
  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (cpuset_hotplug_work){+.+...}:
 ...
 -> #1 (s_active#175){++++.+}:
 ...
 -> #0 (cgroup_mutex){+.+.+.}:
 ...

 other info that might help us debug this:

 Chain exists of:
   cgroup_mutex --> s_active#175 --> cpuset_hotplug_work

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(cpuset_hotplug_work);
                                lock(s_active#175);
                                lock(cpuset_hotplug_work);
   lock(cgroup_mutex);

  *** DEADLOCK ***

 2 locks held by kworker/1:0/32649:
  #0:  ("events"){.+.+.+}, at: [<ffffffff81085412>] process_one_work+0x192/0x520
  #1:  (cpuset_hotplug_work){+.+...}, at: [<ffffffff81085412>] process_one_work+0x192/0x520

 stack backtrace:
 CPU: 1 PID: 32649 Comm: kworker/1:0 Not tainted 3.16.0-rc1-0.1-default+ #7
 ...
 Call Trace:
  [<ffffffff815a5f78>] dump_stack+0x72/0x8a
  [<ffffffff810c263f>] print_circular_bug+0x10f/0x120
  [<ffffffff810c481e>] check_prev_add+0x43e/0x4b0
  [<ffffffff810c4ee6>] validate_chain+0x656/0x7c0
  [<ffffffff810c53d2>] __lock_acquire+0x382/0x660
  [<ffffffff810c57a9>] lock_acquire+0xf9/0x170
  [<ffffffff815aa13f>] mutex_lock_nested+0x6f/0x380
  [<ffffffff8110e3d7>] cgroup_transfer_tasks+0x37/0x150
  [<ffffffff811129c0>] hotplug_update_tasks_insane+0x110/0x1d0
  [<ffffffff81112bbd>] cpuset_hotplug_update_tasks+0x13d/0x180
  [<ffffffff811148ec>] cpuset_hotplug_workfn+0x18c/0x630
  [<ffffffff810854d4>] process_one_work+0x254/0x520
  [<ffffffff810875dd>] worker_thread+0x13d/0x3d0
  [<ffffffff8108e0c8>] kthread+0xf8/0x100
  [<ffffffff815acaec>] ret_from_fork+0x7c/0xb0

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Li Zefan <[email protected]>
Tested-by: Li Zefan <[email protected]>
2014-07-01tracing: Remove ftrace_stop/start() from reading the trace fileSteven Rostedt (Red Hat)1-2/+0
Disabling reading and writing to the trace file should not be able to disable all function tracing callbacks. There are other users today (like kprobes and perf). Reading a trace file should not stop those from happening. Cc: [email protected] # 3.0+ Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Add description of set_graph_notrace to tracing/READMENamhyung Kim1-0/+1
The README was missing the description of the set_graph_notrace file. Add it. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Improve message of empty set_ftrace_notrace fileNamhyung Kim1-3/+8
When there's no entry in set_ftrace_notrace, it prints nothing. It's better to print a message like the one below, as set_graph_notrace does: #### no functions disabled #### Link: http://lkml.kernel.org/p/[email protected] Reported-by: Naoya Horiguchi <[email protected]> Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Improve message of empty set_graph_notrace fileNamhyung Kim1-1/+6
When there's no entry in set_graph_notrace, it prints the message below: #### all functions enabled #### While this is technically correct, it's better to print the following: #### no functions disabled #### Link: http://lkml.kernel.org/p/[email protected] Reported-by: Naoya Horiguchi <[email protected]> Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Add ftrace_graph_notrace boot parameterNamhyung Kim1-4/+20
The ftrace_graph_notrace option specifies a notrace filter for the function graph tracer at boot time. It can be altered after boot using the set_graph_notrace file in debugfs. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Convert pr_warning() to pr_warn() in trace_events.cFabian Frederick1-29/+27
Convert pr_warning() to the standard pr_warn(). Define pr_fmt(fmt) as fmt to avoid any future default fmt definition. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01ftrace: Do not copy hash if O_TRUNC is setNamhyung Kim1-5/+7
When a filter file is open for writing and O_TRUNC is set, there's no need to copy and free the filter entries. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01ftrace: Fix memory leak on failure path in ftrace_allocate_pages()Namhyung Kim1-1/+2
As struct ftrace_page is managed in a singly linked list, it should be freed starting from the start page. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
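The general pattern, unwinding a singly linked list from its head on an allocation failure, can be sketched in userspace C as follows. This is not the kernel code: struct page_node is a hypothetical stand-in for struct ftrace_page, and calloc()/free() stand in for the kernel allocators.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct ftrace_page: a singly linked node. */
struct page_node {
	struct page_node *next;
	void *records;
};

/* Free every node reachable from @start; returns how many were freed. */
static int free_page_list(struct page_node *start)
{
	int freed = 0;

	while (start) {
		struct page_node *next = start->next;

		free(start->records);
		free(start);
		start = next;
		freed++;
	}
	return freed;
}

/*
 * Allocate @count nodes. On failure, unwind from the saved head (start)
 * rather than from the cursor (cur), so that nothing allocated earlier
 * in the loop is leaked.
 */
static struct page_node *alloc_page_list(int count)
{
	struct page_node *start, *cur;

	if (count <= 0)
		return NULL;

	start = cur = calloc(1, sizeof(*start));
	if (!start)
		return NULL;

	for (int i = 1; i < count; i++) {
		cur->next = calloc(1, sizeof(*cur));
		if (!cur->next) {
			free_page_list(start);
			return NULL;
		}
		cur = cur->next;
	}
	return start;
}
```

Because the list has only forward links, starting the cleanup anywhere other than the head leaves the earlier nodes unreachable.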
2014-07-01ftrace: Get rid of obsolete global_start_up variableNamhyung Kim1-3/+1
It seems like it's a leftover from commit 4104d326b670 ("ftrace: Remove global function list and call function directly"). As it isn't updated at all, checking its value is meaningless. Let's get rid of it. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Add trace_seq_buffer_ptr() helper functionSteven Rostedt (Red Hat)1-7/+7
There are several locations in the kernel that open code the calculation of the next location in the trace_seq buffer. This is usually done with p->buffer + p->len. Instead of having this open coded, supply a helper function in the header to do it for them. This function is called trace_seq_buffer_ptr(). Link: http://lkml.kernel.org/p/[email protected] Acked-by: Paolo Bonzini <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
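A helper of this kind might look like the sketch below. The struct is a simplified, hypothetical stand-in for struct trace_seq (only a buffer and a fill level are assumed), and the names and buffer size are chosen for illustration.

```c
#include <stddef.h>

#define SEQ_BUF_SIZE 64	/* arbitrary size for this sketch */

/* Simplified stand-in for struct trace_seq: a buffer and its fill level. */
struct seq_buf_sketch {
	unsigned char buffer[SEQ_BUF_SIZE];
	size_t len;
};

/* Next free byte in the buffer, replacing open-coded "p->buffer + p->len". */
static inline unsigned char *seq_buffer_ptr(struct seq_buf_sketch *s)
{
	return s->buffer + s->len;
}
```

Callers then write seq_buffer_ptr(p) instead of repeating the pointer arithmetic, so the layout of the structure can change in one place.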
2014-07-01tracing: Remove unnecessary null test before debugfs_remove()Fabian Frederick1-4/+2
This fixes checkpatch warning: "WARNING: debugfs_remove(NULL) is safe this check is probably not required" Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Remove trace_seq_reserve()Steven Rostedt (Red Hat)1-30/+0
trace_seq_reserve() has no users in the kernel, it just wastes space. Remove it. Cc: Eduard - Gabriel Munteanu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Make trace_seq_putmem_hex() more robustSteven Rostedt (Red Hat)2-8/+19
Currently trace_seq_putmem_hex() can only take as a parameter a pointer to something that is 8 bytes or less, otherwise it will overflow the buffer. This is protected by a macro that encompasses the call to trace_seq_putmem_hex() that has a BUILD_BUG_ON() for the variable before it is passed in. This is not very robust and if trace_seq_putmem_hex() ever gets used outside that macro it will cause issues. Instead of only being able to produce a hex output of memory that is for a single word, change it to be more robust and allow any size input. Signed-off-by: Steven Rostedt <[email protected]>
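To illustrate the idea of accepting any input size while still guarding the destination, here is a minimal userspace sketch. It is not the kernel implementation: the function name is hypothetical, it emits bytes in memory order, and it ignores any endianness handling or separators the real function may apply.

```c
#include <stddef.h>

/*
 * Write @len bytes from @mem as lowercase hex into @out (capacity
 * @out_size), followed by a NUL. Returns the number of hex characters
 * written, or 0 if the output would not fit.
 */
static size_t putmem_hex_sketch(char *out, size_t out_size,
				const void *mem, size_t len)
{
	static const char digits[] = "0123456789abcdef";
	const unsigned char *p = mem;
	size_t i;

	/* Two hex digits per input byte, plus the terminating NUL. */
	if (out_size == 0 || len > (out_size - 1) / 2)
		return 0;

	for (i = 0; i < len; i++) {
		out[2 * i]     = digits[p[i] >> 4];
		out[2 * i + 1] = digits[p[i] & 0x0f];
	}
	out[2 * len] = '\0';
	return 2 * len;
}
```

The bound is checked at run time against the caller-supplied capacity, so correctness no longer depends on a compile-time check in a wrapping macro.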
2014-07-01tracing: Clean up trace_seq.cSteven Rostedt (Red Hat)1-32/+175
To use the trace_seq_*() functions in NMI context, I posted a patch to move them to the lib/ directory. This caused Andrew Morton to take a look at the code. He went through and gave a lot of comments about missing kernel doc, inconsistent types for the save variable, a mix of EXPORT_SYMBOL_GPL() and EXPORT_SYMBOL(), as well as missing EXPORT_SYMBOL*()s. There were a few comments about the way variables were being compared (int vs uint). All these were good review comments and should be implemented regardless of whether trace_seq.c is moved to lib/ or not. Signed-off-by: Steven Rostedt <[email protected]>
2014-07-01tracing: Move the trace_seq_* functions into its own trace_seq.c fileSteven Rostedt (Red Hat)5-295/+304
The trace_seq_*() functions are a nice utility that allows users to manipulate buffers with printf()-like formats. They have their own trace_seq.h header in include/linux and should be in their own file. Being tied to trace_output.c is rather awkward. Signed-off-by: Steven Rostedt <[email protected]>
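As a rough illustration of what "manipulating buffers with printf()-like formats" means, the sketch below appends formatted text to a fixed-size buffer in userspace C. The struct, the function name, the buffer size, and the return convention are all assumptions made for this example; they are not the kernel's trace_seq API.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_BUF_SIZE 32	/* arbitrary size for this sketch */

/* Hypothetical stand-in for a sequence buffer; len starts at 0. */
struct seq_sketch {
	char buffer[SKETCH_BUF_SIZE];
	size_t len;
};

/*
 * Append formatted text to the buffer. Returns 1 on success, or 0 if
 * the text would not fit, in which case len is left unchanged.
 */
static int seq_sketch_printf(struct seq_sketch *s, const char *fmt, ...)
{
	size_t room = SKETCH_BUF_SIZE - s->len;
	va_list ap;
	int ret;

	va_start(ap, fmt);
	ret = vsnprintf(s->buffer + s->len, room, fmt, ap);
	va_end(ap);

	if (ret < 0 || (size_t)ret >= room) {
		s->buffer[s->len] = '\0';	/* discard the partial write */
		return 0;
	}
	s->len += (size_t)ret;
	return 1;
}
```

Successive calls accumulate output in one buffer, which is the usage pattern the commit message describes.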
2014-07-01ftrace: Simplify ftrace_hash_disable/enable path in ftrace_hash_moveMasami Hiramatsu1-22/+11
Simplify the ftrace_hash_disable/enable path in ftrace_hash_move to harden the process against memory allocation failure. Link: http://lkml.kernel.org/p/[email protected] Signed-off-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>