path: root/kernel
Age  Commit message  Author  Files  Lines
2012-08-20  workqueue: make all workqueues non-reentrant  (Tejun Heo; 1 file, -6/+7)
By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use a non-reentrant/ordered workqueue, but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning.
* flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address the competing-queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and, if a work item is queued and executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and work items don't jump across CPUs all the time. In a micro benchmark designed to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs 10M times with the busy_hash table densely populated - the difference was around 3%. While the overhead is measurable, it is only visible in pathological cases and the difference isn't huge.
This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of the workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <[email protected]>
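For a workqueue user the practical effect can be pictured with a small, purely illustrative sketch (not part of the patch; my_work/my_handler are made-up names):

	#include <linux/workqueue.h>

	static void my_handler(struct work_struct *work)
	{
		/* with non-reentrancy, this runs on at most one CPU at a time */
	}
	static DECLARE_WORK(my_work, my_handler);

	static void example(void)
	{
		/* the requested CPU may be overridden if my_work is already
		 * executing somewhere else */
		queue_work_on(1, system_wq, &my_work);

		/* waits for the single instance */
		flush_work(&my_work);
	}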
2012-08-20  workqueue: fix checkpatch issues  (Valentin Ilie; 1 file, -16/+13)
Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <[email protected]> Signed-off-by: Tejun Heo <[email protected]> LKML-Reference: <[email protected]>
2012-08-20  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds; 5 files, -19/+70)
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix migration thread runtime bogosity
  sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled
  sched,cgroup: Fix up task_groups list
  sched: fix divide by zero at {thread_group,task}_times
  sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
2012-08-20  cputime: Consolidate vtime handling on context switch  (Frederic Weisbecker; 1 file, -0/+1)
The archs that implement virtual cputime accounting all flush the cputime of a task when it gets descheduled and sometimes set up some ground initialization for the next task to account its cputime. These archs all put their own hooks in their context switch callbacks and handle the off-case themselves. Consolidate this by creating a new account_switch_vtime() callback called in generic code right after a context switch and that these archs must implement to flush the prev task cputime and initialize the next task cputime related state. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Martin Schwidefsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]>
2012-08-20  sched: Move cputime code to its own file  (Frederic Weisbecker; 4 files, -556/+570)
Extract cputime code from the giant sched/core.c and put it in its own file. This makes it easier to deal with this particular area and de-bloats core.c a bit more. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Martin Schwidefsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]>
2012-08-19  Merge branch 'alpha' (alpha architecture patches)  (Linus Torvalds; 1 file, -9/+0)
Merge alpha architecture update from Michael Cree: "The Alpha Maintainer, Matt Turner, is currently unavailable, so I have collected up patches that have been posted to the linux-alpha mailing list over the last couple of months, and are forwarding them to you in the hope that you are prepared to accept them via me. The patches by Al Viro and myself I have been running against kernels for two months now so have had quite a bit of testing. All except one patch were intended for the 3.5 kernel but because of Matt's unavailability never got forwarded to you."
* emailed patches from Michael Cree <[email protected]>: (9 commits)
  alpha: Fix fall-out from disintegrating asm/system.h
  Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
  alpha: fix fpu.h usage in userspace
  alpha/mm/fault.c: Port OOM changes to do_page_fault
  alpha: take kernel_execve() out of entry.S
  alpha: take a bunch of syscalls into osf_sys.c
  alpha: Use new generic strncpy_from_user() and strnlen_user()
  alpha: Wire up cross memory attach syscalls
  alpha: Don't export SOCK_NONBLOCK to user space.
2012-08-19  alpha: take a bunch of syscalls into osf_sys.c  (Al Viro; 1 file, -9/+0)
New helper: current_thread_info(). Allows to do a bunch of odd syscalls in C. While we are at it, there had never been a reason to do osf_getpriority() in assembler. We also get "namespace"-aware (read: consistent with getuid(2), etc.) behaviour from getx?id() syscalls now. Signed-off-by: Al Viro <[email protected]> Signed-off-by: Michael Cree <[email protected]> Acked-by: Matt Turner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2012-08-17  tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1  (Will Deacon; 1 file, -0/+4)
syscall_get_nr can return -1 in the case that the task is not executing a system call. This patch fixes perf_syscall_{enter,exit} to check that the syscall number is valid before using it as an index into a bitmap. Link: http://lkml.kernel.org/r/[email protected] Cc: Jason Baron <[email protected]> Cc: Wade Farnsworth <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
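A hedged sketch of the guard described above (function and bitmap names as used in kernel/trace/trace_syscalls.c; treat details as illustrative):

	static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
	{
		int syscall_nr;

		syscall_nr = syscall_get_nr(current, regs);
		if (syscall_nr < 0)	/* task is not in a syscall, nothing to index */
			return;
		if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
			return;

		/* ... assemble and submit the perf sample as before ... */
	}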
2012-08-17  Merge tag 'v3.6-rc2' into next  (James Morris; 79 files, -3071/+4458)
Linux 3.6-rc2 Resync with Linus.
2012-08-16  workqueue: use system_highpri_wq for unbind_work  (Joonsoo Kim; 1 file, -1/+1)
To speed up CPU-down processing, use system_highpri_wq. The scheduling priority of its workers is higher than that of system_wq, and it is not contended by other normal work on this CPU, so work queued on it is processed faster than on system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-08-16  workqueue: use system_highpri_wq for highpri workers in rebind_workers()  (Joonsoo Kim; 1 file, -4/+14)
In rebind_workers(), we queue a rebind work item for each busy worker to rebind it to its CPU. Currently we use only system_wq for this, which creates a possible error situation: cwq->pool and worker->pool can mismatch. To prevent this, use system_highpri_wq for highpri workers so that the two match. This patch implements that. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-08-16  workqueue: introduce system_highpri_wq  (Joonsoo Kim; 1 file, -3/+6)
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduced a separate worker pool for HIGHPRI. When we handle busy workers for a gcwq, they can be normal workers or highpri workers. But we don't consider this difference in rebind_workers(); we use just system_wq for highpri workers, which makes cwq->pool and worker->pool mismatch. This doesn't cause an error in the current implementation, but it could in the future. Now, introduce system_highpri_wq so that the proper cwq can be used for highpri workers in rebind_workers(). A following patch fixes this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-08-16  workqueue: change value of lcpu in __queue_delayed_work_on()  (Joonsoo Kim; 1 file, -2/+8)
We assign a cpu id into the work struct's data field in __queue_delayed_work_on(). In the current implementation, when a work item comes in for the first time, the currently running cpu id is assigned. If we call __queue_delayed_work_on() with CPU A while running on CPU B, the __queue_work() invoked from delayed_work_timer_fn() goes into the following sub-optimal path in the WQ_NON_REENTRANT case:
	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
	    (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {
Set lcpu to @cpu, and fall back to the local cpu only when lcpu is WORK_CPU_UNBOUND. That is sufficient to avoid the sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-08-16  workqueue: correct req_cpu in trace_workqueue_queue_work()  (Joonsoo Kim; 1 file, -1/+2)
When we trace workqueue_queue_work(), it records the requested cpu. But if !(@wq->flags & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, the requested cpu has already been changed to the local cpu by the time the tracepoint fires. In the @wq->flags & WQ_UNBOUND case this change does not occur, so it is reasonable to correct the traced value. Use a temporary local variable to store the requested cpu. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2012-08-16  workqueue: use enum value to set array size of pools in gcwq  (Joonsoo Kim; 1 file, -1/+2)
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduced a separate worker_pool for HIGHPRI. Although the NR_WORKER_POOLS enum value represents the number of pools, the definition of the worker_pool array in gcwq doesn't use it. Using it makes the code more robust and prevents future mistakes, so change the code to use this enum value. Signed-off-by: Joonsoo Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
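The change itself is tiny; a sketch (surrounding structure abbreviated, NR_WORKER_POOLS is the existing enum this message refers to):

	enum {
		NR_WORKER_POOLS = 2,	/* normal and highpri */
	};

	struct global_cwq {
		/* ... */
		struct worker_pool	pools[NR_WORKER_POOLS];	/* was: pools[2] */
		/* ... */
	};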
2012-08-15  time: Improve sanity checking of timekeeping inputs  (John Stultz; 1 file, -2/+24)
Unexpected behavior could occur if the time is set to a value large enough to overflow a 64bit ktime_t (which is something larger than the year 2262). Also, unexpected behavior could occur if large negative offsets are injected via adjtimex. So this patch improves the sanity checking of timekeeping inputs by improving the timespec_valid() check, and then makes better use of timespec_valid() to make sure we don't set the time to an invalid negative value or one that overflows ktime_t. Note: This does not protect from setting the time close to overflowing ktime_t and then letting natural accumulation cause the overflow. Reported-by: CAI Qian <[email protected]> Reported-by: Sasha Levin <[email protected]> Signed-off-by: John Stultz <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Zhouping Liu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
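A hedged sketch of the stricter check, matching the description above (KTIME_MAX and NSEC_PER_SEC are existing kernel constants; the exact shape of the real patch may differ):

	#define KTIME_SEC_MAX	(KTIME_MAX / NSEC_PER_SEC)

	static inline bool timespec_valid(const struct timespec *ts)
	{
		if (ts->tv_sec < 0)				/* dates before 1970 are invalid */
			return false;
		if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)	/* nanoseconds out of range */
			return false;
		if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX) /* would overflow ktime_t (~2262) */
			return false;
		return true;
	}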
2012-08-15  audit: clean up refcounting in audit-tree  (Miklos Szeredi; 1 file, -3/+9)
Drop the initial reference by fsnotify_init_mark early instead of audit_tree_freeing_mark() at destroy time. In the cases we destroy the mark before we drop the initial reference we need to get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark(). Signed-off-by: Miklos Szeredi <[email protected]>
2012-08-15  audit: fix refcounting in audit-tree  (Miklos Szeredi; 1 file, -3/+2)
Refcounting of fsnotify_mark in audit tree is broken. E.g:
	                               refcount
	create_chunk
	  alloc_chunk                  1
	  fsnotify_add_mark            2
	untag_chunk
	  fsnotify_get_mark            3
	  fsnotify_destroy_mark
	    audit_tree_freeing_mark    2
	  fsnotify_put_mark            1
	  fsnotify_put_mark            0
	via destroy_list
	  fsnotify_mark_destroy       -1
This was reported by various people as triggering Oops when stopping auditd. We could just remove the put_mark from audit_tree_freeing_mark() but that would break freeing via inode destruction. So this patch simply omits a put_mark after calling destroy_mark or adds a get_mark before. The additional get_mark is necessary where there's no other put_mark after fsnotify_destroy_mark() since it assumes that the caller is holding a reference (or the inode is keeping the mark pinned, not the case here AFAICS). Signed-off-by: Miklos Szeredi <[email protected]> Reported-by: Valentin Avram <[email protected]> Reported-by: Peter Moody <[email protected]> Acked-by: Eric Paris <[email protected]> CC: [email protected]
2012-08-15  audit: don't free_chunk() after fsnotify_add_mark()  (Miklos Szeredi; 1 file, -3/+3)
Don't do free_chunk() after fsnotify_add_mark(). That one does a delayed unref via the destroy list and this results in use-after-free. Signed-off-by: Miklos Szeredi <[email protected]> Acked-by: Eric Paris <[email protected]> CC: [email protected]
2012-08-14  pidns: Export free_pid_ns  (Eric W. Biederman; 1 file, -0/+2)
There is at least one modular user, so export free_pid_ns so modules can capture and use the pid namespace on the very rare occasion when it makes sense. Acked-by: David S. Miller <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2012-08-14  net ip6 flowlabel: Make owner a union of struct pid * and kuid_t  (Eric W. Biederman; 1 file, -0/+1)
Correct a long standing omission and use struct pid in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_PROCESS. This guarantees we don't have issues when pid wraparound occurs. Use a kuid_t in the owner field of struct ip6_flowlabel when the share type is IPV6_FL_S_USER to add user namespace support. In /proc/net/ip6_flowlabel, capture the current pid namespace when opening the file and release the pid namespace when the file is closed, ensuring we print the pid owner value that is meaningful to the reader of the file. Similarly, use from_kuid_munged to print uid values that are meaningful to the reader of the file. This requires exporting pid_nr_ns so that ipv6 can continue to be built as a module. Yoiks what silliness. Acked-by: David S. Miller <[email protected]> Acked-by: Serge Hallyn <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2012-08-13  workqueue: add missing wmb() in clear_work_data()  (Tejun Heo; 1 file, -7/+12)
Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <[email protected]> Cc: Oleg Nesterov <[email protected]>
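A sketch of the two PENDING-clearing paths after the fix (helper and flag names as in kernel/workqueue.c of the time; treat them as illustrative):

	static void set_work_cpu_and_clear_pending(struct work_struct *work,
						   unsigned int cpu)
	{
		/* publish all prior stores before the next PENDING owner can run */
		smp_wmb();
		set_work_data(work, (unsigned long)cpu << WORK_STRUCT_FLAG_BITS, 0);
	}

	static void clear_work_data(struct work_struct *work)
	{
		smp_wmb();	/* the barrier this patch adds */
		set_work_data(work, WORK_STRUCT_NO_CPU, 0);
	}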
2012-08-13  workqueue: fix CPU binding of flush_delayed_work[_sync]()  (Tejun Heo; 1 file, -3/+4)
delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Andrew Morton <[email protected]>
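A hedged sketch of the approach (field placement and helper names assumed, not copied from the patch):

	struct delayed_work {
		struct work_struct	work;
		struct timer_list	timer;
		int			cpu;	/* new: CPU the work was queued for */
	};

	bool flush_delayed_work(struct delayed_work *dwork)
	{
		if (del_timer_sync(&dwork->timer))
			__queue_work(dwork->cpu,
				     get_work_cwq(&dwork->work)->wq, &dwork->work);
		return flush_work(&dwork->work);
	}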
2012-08-13  sched: recover SD_WAKE_AFFINE in select_task_rq_fair and code clean up  (Alex Shi; 2 files, -32/+3)
Since the power saving code has been removed from sched, its leftover implementation in this function is dead and even pollutes other logic: for example, 'want_sd' never gets a chance to be set to 0, which defeats SD_WAKE_AFFINE here. So, clean up the obsolete code, including SD_PREFER_LOCAL. Signed-off-by: Alex Shi <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched: using dst_rq instead of this_rq during load balance  (Michael Wang; 1 file, -5/+4)
As we already have dst_rq in lb_env, using or changing "this_rq" does not make sense. This patch replaces "this_rq" with dst_rq in load_balance(), so we no longer need to change "this_rq" while processing LBF_SOME_PINNED. Signed-off-by: Michael Wang <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched: Document schedule() entry points  (Pekka Enberg; 1 file, -0/+34)
This patch adds a comment on top of the schedule() function to explain to scheduler newbies how the main scheduler function is entered. Acked-by: Randy Dunlap <[email protected]> Explained-by: Ingo Molnar <[email protected]> Explained-by: Peter Zijlstra <[email protected]> Signed-off-by: Pekka Enberg <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched: Fix __sched_period comment  (Borislav Petkov; 1 file, -1/+1)
It should be sched_nr_latency so fix it before it annoys me more. Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  Merge branch 'sched/urgent' into sched/core  (Thomas Gleixner; 5 files, -19/+70)
2012-08-13  sched: Fix migration thread runtime bogosity  (Mike Galbraith; 1 file, -1/+21)
Make the stop scheduler class do the same accounting as other classes. Migration threads can be caught in the act while doing exec balancing, leading to the below due to use of the unmaintained ->se.exec_start. The load that triggered this particular instance was an apparently out-of-control, heavily threaded application that does system monitoring in what equated to an exec bomb, with one of the VERY frequently migrated tasks being ps.
	%CPU   PID USER CMD
	99.3    45 root [migration/10]
	97.7    53 root [migration/12]
	97.0    57 root [migration/13]
	90.1    49 root [migration/11]
	89.6    65 root [migration/15]
	88.7    17 root [migration/3]
	80.4    37 root [migration/8]
	78.1    41 root [migration/9]
	44.2    13 root [migration/2]
Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled  (Mike Galbraith; 1 file, -0/+13)
Root task group bandwidth replenishment must service all CPUs, regardless of where the timer was last started, and regardless of the isolation mechanism, lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy. Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched,cgroup: Fix up task_groups list  (Mike Galbraith; 2 files, -1/+2)
With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <[email protected]> Cc: [email protected] # v3.3+ Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  sched: fix divide by zero at {thread_group,task}_times  (Stanislaw Gruszka; 1 file, -14/+20)
On architectures where cputime_t is a 64 bit type, it is possible to trigger a divide by zero on the do_div(temp, (__force u32) total) line if total is a non-zero number but has its lower 32 bits zeroed. Removing the cast is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes:
	PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
	#0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
	#1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
	#2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
	#3 [ffff880472a51cd0] die at ffffffff8100f26b
	#4 [ffff880472a51d00] do_trap at ffffffff814f03f4
	#5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
	#6 [ffff880472a51e00] divide_error at ffffffff8100be7b
	   [exception RIP: thread_group_times+0x56]
	   RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
	   RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
	   RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
	   RBP: ffff880472a51ef8  R8: 00000000b10a3a64  R9: ffff880874150800
	   R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
	   R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
	   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
	#7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
	#8 [ffff880472a51f40] sys_times at ffffffff81088524
	#9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
	   RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
	   RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
	   RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
	   RBP: 00007fcba27ab700  R8: 0000000000000020  R9: 000000000000091b
	   R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
	   R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
	   ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
Cc: [email protected] Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
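One way to fix this, sketched here under the assumption that the scaling moves to an explicit 64-bit divide (div_u64/div64_u64 are existing kernel helpers; the function name is illustrative):

	static cputime_t scale_utime(cputime_t utime, cputime_t rtime, cputime_t total)
	{
		u64 temp = (__force u64) rtime;

		temp *= (__force u64) utime;

		if (sizeof(cputime_t) == 4)
			/* 32-bit cputime_t: a 32-bit divisor is safe */
			temp = div_u64(temp, (__force u32) total);
		else
			/* 64-bit cputime_t: divide by the full 64-bit total */
			temp = div64_u64(temp, (__force u64) total);

		return (__force cputime_t) temp;
	}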
2012-08-13  sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies  (Peter Zijlstra; 2 files, -3/+14)
Peter Portante reported that for large cgroup hierarchies (and or on large CPU counts) we get immense lock contention on rq->lock and stuff stops working properly. His workload was a ton of processes, each in their own cgroup, everybody idling except for a sporadic wakeup once every so often. It was found that:
	schedule()
	  idle_balance()
	    load_balance()
	      local_irq_save()
	      double_rq_lock()
	      update_h_load()
	        walk_tg_tree(tg_load_down)
	          tg_load_down()
Results in an entire cgroup hierarchy walk under rq->lock for every new-idle balance and since new-idle balance isn't throttled this results in a lot of work while holding the rq->lock. This patch does two things, it removes the work from under rq->lock based on the good principle of race and pray which is widely employed in the load-balancer as a whole. And secondly it throttles the update_h_load() calculation to max once per jiffy. I considered excluding update_h_load() for new-idle balance all-together, but purely relying on regular balance passes to update this data might not work out under some rare circumstances where the new-idle busiest isn't the regular busiest for a while (unlikely, but a nightmare to debug if someone hits it and suffers). Cc: [email protected] Cc: Larry Woodman <[email protected]> Cc: Mike Galbraith <[email protected]> Reported-by: Peter Portante <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
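A sketch of the once-per-jiffy throttle (the per-rq timestamp field name is illustrative; tg_load_down/tg_nop/walk_tg_tree are the existing fair.c helpers this walk already uses):

	static void update_h_load(long cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long now = jiffies;

		if (rq->h_load_throttle == now)
			return;		/* already walked the tree this jiffy */
		rq->h_load_throttle = now;

		rcu_read_lock();
		walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
		rcu_read_unlock();
	}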
2012-08-13  rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthread  (Paul E. McKenney; 4 files, -177/+41)
Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU kthread code to use the new smp_hotplug_thread facility. [ tglx: Adapted it to use callbacks and to the simplified rcu yield ] Signed-off-by: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  watchdog: Use hotplug thread infrastructure  (Thomas Gleixner; 1 file, -174/+89)
Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  softirq: Use hotplug thread infrastructure  (Thomas Gleixner; 1 file, -84/+27)
[ paulmck: Call rcu_note_context_switch() with interrupts enabled. ] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  hotplug: Fix UP bug in smpboot hotplug code  (Paul E. McKenney; 2 files, -2/+5)
Because kernel subsystems need their per-CPU kthreads on UP systems as well as on SMP systems, the smpboot hotplug kthread functions must be provided in UP builds as well as in SMP builds. This commit therefore adds smpboot.c to UP builds and excludes irrelevant code via #ifdef. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-13  smpboot: Provide infrastructure for percpu hotplug threads  (Thomas Gleixner; 3 files, -1/+242)
Provide a generic interface for setting up and tearing down percpu threads. On registration the threads for already online cpus are created and started. On deregistration (modules) the threads are stopped. During hotplug operations the threads are created, started, parked and unparked. The data structure for registration provides a pointer to percpu storage space and optional setup, cleanup, park, unpark functions. These functions are called when the thread state changes. Each implementation has to provide a function which is queried and returns whether the thread should run, and the thread function itself. The core code handles all state transitions and avoids duplicated code in the call sites. [ paulmck: Preemption leak fix ] Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Reviewed-by: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Cc: Namhyung Kim <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
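A hypothetical user of the new interface, sketched from the description above (smp_hotplug_thread and smpboot_register_percpu_thread come from this patch; the my_* names are made up):

	#include <linux/smpboot.h>
	#include <linux/percpu.h>

	static DEFINE_PER_CPU(struct task_struct *, my_task);
	static DEFINE_PER_CPU(unsigned long, my_pending);

	static int my_should_run(unsigned int cpu)
	{
		return __this_cpu_read(my_pending);
	}

	static void my_thread_fn(unsigned int cpu)
	{
		__this_cpu_write(my_pending, 0);
		/* do one unit of per-cpu work in preemptible context */
	}

	static struct smp_hotplug_thread my_threads = {
		.store			= &my_task,
		.thread_should_run	= my_should_run,
		.thread_fn		= my_thread_fn,
		.thread_comm		= "my/%u",
	};

	static int __init my_init(void)
	{
		return smpboot_register_percpu_thread(&my_threads);
	}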
2012-08-13  kthread: Implement park/unpark facility  (Thomas Gleixner; 1 file, -19/+166)
To avoid the full teardown/setup of per cpu kthreads in the case of cpu hot(un)plug, provide a facility which allows to put the kthread into a park position and unpark it when the cpu comes online again. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Reviewed-by: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
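A sketch of how a per-cpu kthread main loop cooperates with the new facility (my_cpu_thread is a made-up example; kthread_should_park/kthread_parkme are the helpers this patch adds):

	#include <linux/kthread.h>

	static int my_cpu_thread(void *data)
	{
		while (!kthread_should_stop()) {
			if (kthread_should_park()) {
				kthread_parkme();	/* sleep here until unparked */
				continue;
			}
			/* ... normal per-cpu work ... */
		}
		return 0;
	}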
2012-08-13  rcu: Yield simpler  (Thomas Gleixner; 3 files, -184/+41)
The rcu_yield() code is amazing. It's there to avoid starvation of the system when lots of (boosting) work is to be done. Now looking at the code, its functionality is:
	Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
	Arm a timer with 2 ticks
	schedule()
Now if the system goes idle the rcu task returns, regains SCHED_FIFO and plugs on. If the system stays busy the timer fires and wakes a per node kthread which in turn makes the per cpu thread SCHED_FIFO and brings it back on the cpu. For the boosting thread the "make it FIFO" bit is missing and it just runs some magic boost checks. Now this is a lot of code with extra threads and complexity. It's way simpler to let the tasks, when they detect overload, schedule away for 2 ticks and defer the normal wakeup as long as they are in yielded state and the cpu is not idle. That solves the same problem and the only difference is that when the cpu goes idle it's not guaranteed that the thread returns right away, but it won't be out longer than two ticks, so no harm is done. If that's an issue then it is way simpler just to wake the task from idle, as RCU has callbacks there anyway. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Srivatsa S. Bhat <[email protected]> Cc: Rusty Russell <[email protected]> Cc: Namhyung Kim <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2012-08-12  Merge tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds; 2 files, -22/+2)
Pull power management fixes from Rafael J. Wysocki:
- Fix for two recent regressions in the generic PM domains framework.
- Revert of a commit that introduced a resume regression and is conceptually incorrect in my opinion.
- Fix for a return value in pcc-cpufreq.c from Julia Lawall.
- RTC wakeup signaling fix from Neil Brown.
- Suppression of compiler warnings for CONFIG_PM_SLEEP unset in ACPI, platform/x86 and TPM drivers.
* tag 'pm-for-3.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tpm_tis / PM: Fix unused function warning for CONFIG_PM_SLEEP
  platform / x86 / PM: Fix unused function warnings for CONFIG_PM_SLEEP
  ACPI / PM: Fix unused function warnings for CONFIG_PM_SLEEP
  Revert "NMI watchdog: fix for lockup detector breakage on resume"
  PM: Make dev_pm_get_subsys_data() always return 0 on success
  drivers/cpufreq/pcc-cpufreq.c: fix error return code
  RTC: Avoid races between RTC alarm wakeup and suspend.
2012-08-12  printk: Fix calculation of length used to discard records  (Jeff Mahoney; 1 file, -0/+2)
While tracking down a weird buffer overflow issue in a program that looked to be sane, I started double checking the length returned by syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing the buffer. Sure enough, it was. I saw this in strace: 11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279 It turns out that the loops that calculate how much space the entries will take when they're copied don't include the newlines and prefixes that will be included in the final output since prev flags is passed as zero. This patch properly accounts for it and fixes the overflow. CC: [email protected] Signed-off-by: Jeff Mahoney <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
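Roughly, the sizing loop has to carry the previous record's flags the same way the copy loop does; a loose sketch with names modeled on kernel/printk.c of that era (details are assumptions, not the literal patch):

	static size_t count_all_records(void)
	{
		enum log_flags prev = 0;
		size_t len = 0;
		u64 seq = log_first_seq;
		u32 idx = log_first_idx;

		while (seq < log_next_seq) {
			struct log *msg = log_from_idx(idx);

			/* NULL buffer: compute the length only, now including the
			 * prefix/newline decisions that depend on prev */
			len += msg_print_text(msg, prev, true, NULL, 0);
			prev = msg->flags;
			idx = log_next(idx);
			seq++;
		}
		return len;
	}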
2012-08-10  perf: Add attribute to filter out callchains  (Frederic Weisbecker; 1 file, -14/+24)
Introducing the following bits to the perf_event_attr struct:
- exclude_callchain_kernel to filter out the kernel callchain from the sample dump
- exclude_callchain_user to filter out the user callchain from the sample dump
We need to be able to disable the standard user callchain dump when we use the dwarf cfi callchain mode, because frame pointer based user callchains are useless in this mode. Implementing also exclude_callchain_kernel to have a complete set of options. Signed-off-by: Jiri Olsa <[email protected]> [ Added kernel callchains filtering ] Cc: "Frank Ch. Eigler" <[email protected]> Cc: Arun Sharma <[email protected]> Cc: Benjamin Redelings <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Frank Ch. Eigler <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Robert Richter <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Ulrich Drepper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
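From user space the new bits are used like any other perf_event_attr flag; a minimal sketch (the event choice and everything besides the new bit are arbitrary):

	#include <linux/perf_event.h>
	#include <string.h>

	static void setup_attr(struct perf_event_attr *attr)
	{
		memset(attr, 0, sizeof(*attr));
		attr->size		= sizeof(*attr);
		attr->type		= PERF_TYPE_HARDWARE;
		attr->config		= PERF_COUNT_HW_CPU_CYCLES;
		attr->sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
		/* dwarf mode: the tool unwinds user space itself, so drop the
		 * frame-pointer based user part of the callchain */
		attr->exclude_callchain_user = 1;
	}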
2012-08-10  perf: Add ability to attach user stack dump to sample  (Jiri Olsa; 2 files, -1/+165)
Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump of the user level stack on sample. The size of the dump is specified by sample_stack_user value. Being able to dump parts of the user stack, starting from the stack pointer, will be useful to make a post mortem dwarf CFI based stack unwinding. Added HAVE_PERF_USER_STACK_DUMP config option to determine if the architecture provides user stack dump on perf event samples. This needs access to the user stack pointer which is not unified across architectures. Enabling this for x86 architecture. Signed-off-by: Jiri Olsa <[email protected]> Original-patch-by: Frederic Weisbecker <[email protected]> Cc: "Frank Ch. Eigler" <[email protected]> Cc: Arun Sharma <[email protected]> Cc: Benjamin Redelings <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Frank Ch. Eigler <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Robert Richter <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Ulrich Drepper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
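From the user side, enabling the dump is one sample bit plus a size; a small sketch (the 8 KiB suggestion is an arbitrary example value):

	static void request_user_stack(struct perf_event_attr *attr, __u32 bytes)
	{
		attr->sample_type	|= PERF_SAMPLE_STACK_USER;
		attr->sample_stack_user	 = bytes;	/* e.g. 8192 bytes per sample */
	}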
2012-08-10  perf: Add perf_output_skip function to skip bytes in sample  (Jiri Olsa; 2 files, -0/+10)
Introducing perf_output_skip function to be able to skip data within the perf ring buffer. When writing data into perf ring buffer we first reserve needed place in ring buffer and then copy the actual data. There's a possibility we won't be able to fill all the reserved size with data, so we need a way to skip the remaining bytes. This is going to be useful when storing the user stack dump, where we might end up with less data than we originally requested. Signed-off-by: Jiri Olsa <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Cc: "Frank Ch. Eigler" <[email protected]> Cc: Arun Sharma <[email protected]> Cc: Benjamin Redelings <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Frank Ch. Eigler <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Robert Richter <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Ulrich Drepper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2012-08-10  perf: Factor __output_copy to be usable with specific copy function  (Frederic Weisbecker; 2 files, -23/+43)
Adding a generic way to use __output_copy function with specific copy function via DEFINE_PERF_OUTPUT_COPY macro. Using this to add new __output_copy_user function, that provides output copy from user pointers. For x86 the copy_from_user_nmi function is used and __copy_from_user_inatomic for the rest of the architectures. This new function will be used in user stack dump on sample, coming in next patches. Signed-off-by: Jiri Olsa <[email protected]> Cc: "Frank Ch. Eigler" <[email protected]> Cc: Arun Sharma <[email protected]> Cc: Benjamin Redelings <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Frank Ch. Eigler <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Robert Richter <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Ulrich Drepper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2012-08-10  perf: Add ability to attach user level registers dump to sample  (Jiri Olsa; 1 file, -0/+66)
Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of user level registers on sample. Registers we want to dump are specified by sample_regs_user bitmask. Only user level registers are dumped at the moment. Meaning the register values of the user space context as it was before the user entered the kernel for whatever reason (syscall, irq, exception, or a PMI happening in userspace). The layout of the sample_regs_user bitmap is described in asm/perf_regs.h for archs that support register dump. This is going to be useful to bring Dwarf CFI based stack unwinding on top of samples. Original-patch-by: Frederic Weisbecker <[email protected]> [ Dump registers ABI specification. ] Signed-off-by: Jiri Olsa <[email protected]> Suggested-by: Stephane Eranian <[email protected]> Cc: "Frank Ch. Eigler" <[email protected]> Cc: Arun Sharma <[email protected]> Cc: Benjamin Redelings <[email protected]> Cc: Corey Ashford <[email protected]> Cc: Cyrill Gorcunov <[email protected]> Cc: Frank Ch. Eigler <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Robert Richter <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Ulrich Drepper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
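A user-space sketch of requesting the register dump (the PERF_REG_X86_* mask values come from the matching x86 arch patches; other architectures define their own masks in asm/perf_regs.h):

	#include <linux/perf_event.h>
	#include <asm/perf_regs.h>

	static void request_user_regs(struct perf_event_attr *attr)
	{
		attr->sample_type	|= PERF_SAMPLE_REGS_USER;
		attr->sample_regs_user	 = (1ULL << PERF_REG_X86_IP) |
					   (1ULL << PERF_REG_X86_SP) |
					   (1ULL << PERF_REG_X86_BP);
	}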
2012-08-08  Revert "NMI watchdog: fix for lockup detector breakage on resume"  (Rafael J. Wysocki; 2 files, -22/+2)
Revert commit 45226e9 (NMI watchdog: fix for lockup detector breakage on resume) which breaks resume from system suspend on my SH7372 Mackerel board (by causing a NULL pointer dereference to happen) and is generally wrong, because it abuses the CPU hotplug functionality in a shamelessly blatant way. The original issue should be addressed through appropriate syscore resume callback instead. Signed-off-by: Rafael J. Wysocki <[email protected]>
2012-08-07  tracing/trivial: Fix some typos in kernel/trace  (Wang Tianhong; 2 files, -5/+5)
Fix some typos in kernel/trace. Link: http://lkml.kernel.org/r/1343887320.2228.9.camel@louis-ThinkPad-T410 Signed-off-by: Wang Tianhong <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2012-08-07  tracing/filter: Add missing initialization  (Jiri Olsa; 1 file, -1/+1)
Add missing initialization for ret variable. Its initialization is based on the re_cnt variable, which is being set deep down in the ftrace_function_filter_re function. I'm not sure compilers would be smart enough to see this in near future, so killing the warning this way. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Jiri Olsa <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>