aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2013-08-19kernel: fix new kernel-doc warning in wait.cRandy Dunlap1-2/+1
Fix new kernel-doc warnings in kernel/wait.c: Warning(kernel/wait.c:374): No description found for parameter 'p' Warning(kernel/wait.c:374): Excess function parameter 'word' description in 'wake_up_atomic_t' Warning(kernel/wait.c:374): Excess function parameter 'bit' description in 'wake_up_atomic_t' Signed-off-by: Randy Dunlap <[email protected]> Cc: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-19cgroup: fix cgroup_write_event_control()Tejun Heo1-4/+21
81eeaf0411 ("cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state inst ead of cgroup") updated the cftype event methods to take @css (cgroup_subsys_state) instead of @cgroup; however, it incorrectly used @css passed to cgroup_write_event_control(), which the dummy_css for the cgroup as the file is a cgroup core file. This leads to oops on event registration. Fix it by using the css matching the event target file. Note that cgroup_write_event_control() now disallows cgroup core files from being event sources. This is for simplicity and doesn't matter as cgroup_event will be moved and made specific to memcg. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-19cgroup: fix subsystem file accesses on the root cgroupTejun Heo1-14/+10
105347ba5 ("cgroup: make cgroup_file_open() rcu_read_lock() around cgroup_css() and add cfent->css") added cfent->css to cache the associted cgroup_subsys_state across file operations. A cfent is associated with single css throughout its lifetime and the origimal commit initialized the cache pointer during cgroup_add_file() and verified that it matches the actual one in cgroup_file_open(). While this works fine for !root cgroups, it's broken for root cgroups as files in a root cgroup are created before the css's are associated with the cgroup and thus cgroup_css() call in cgroup_add_file() returns NULL associating all cfents in the root cgroup with NULL css. This makes cgroup_file_open() trigger WARN and fail with -ENODEV for all !core subsystem files in the root cgroups. There's no reason to initialize cfent->css separately from cgroup_add_file(). As the association never changes, cgroup_file_open() can set it unconditionally every time and containing the logic in cgroup_file_open() makes more sense anyway as the only reason it's necessary is file->private_data being already occupied. Fix it by setting cfent->css unconditionally from cgroup_file_open(). Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-19cgroup: change cgroup_from_id() to css_from_id()Li Zefan1-0/+22
Now we want cgroup core to always provide the css to use to the subsystems, so change this API to css_from_id(). Uninline css_from_id(), because it's getting bigger and cgroup_css() has been unexported. While at it, remove the #ifdef, and shuffle the order of the args. Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2013-08-19generic-ipi/locking: Fix misleading smp_call_function_any() descriptionXie XiuQi1-2/+0
Fix locking description: after commit 8969a5ede0f9e17da4b9437 ("generic-ipi: remove kmalloc()"), wait = 0 can be guaranteed because we don't kmalloc() anymore. Signed-off-by: Xie XiuQi <[email protected]> Cc: Sheng Yang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Rusty Russell <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-18nohz_full: Add full-system-idle arguments to APIPaul E. McKenney1-7/+18
This commit adds an isidle and jiffies argument to force_qs_rnp(), dyntick_save_progress_counter(), and rcu_implicit_dynticks_qs() to enable RCU's force-quiescent-state process to check for full-system idle. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Lai Jiangshan <[email protected]> [ paulmck: Use true and false for boolean constants per Lai Jiangshan. ] Reviewed-by: Josh Triplett <[email protected]>
2013-08-18nohz_full: Add full-system idle states and variablesPaul E. McKenney1-0/+17
This commit adds control variables and states for full-system idle. The system will progress through the states in numerical order when the system is fully idle (other than the timekeeping CPU), and reset down to the initial state if any non-timekeeping CPU goes non-idle. The current state is kept in full_sysidle_state. One flavor of RCU will be in charge of driving the state machine, defined by rcu_sysidle_state. This should be the busiest flavor of RCU. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18nohz_full: Add per-CPU idle-state trackingPaul E. McKenney3-0/+85
This commit adds the code that updates the rcu_dyntick structure's new fields to track the per-CPU idle state based on interrupts and transitions into and out of the idle loop (NMIs are ignored because NMI handlers cannot cleanly read out the time anyway). This code is similar to the code that maintains RCU's idea of per-CPU idleness, but differs in that RCU treats CPUs running in user mode as idle, where this new code does not. Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18nohz_full: Add rcu_dyntick data for scalable detection of all-idle statePaul E. McKenney3-0/+33
This commit adds fields to the rcu_dyntick structure that are used to detect idle CPUs. These new fields differ from the existing ones in that the existing ones consider a CPU executing in user mode to be idle, where the new ones consider CPUs executing in user mode to be busy. The handling of these new fields is otherwise quite similar to that for the exiting fields. This commit also adds the initialization required for these fields. So, why is usermode execution treated differently, with RCU considering it a quiescent state equivalent to idle, while in contrast the new full-system idle state detection considers usermode execution to be non-idle? It turns out that although one of RCU's quiescent states is usermode execution, it is not a full-system idle state. This is because the purpose of the full-system idle state is not RCU, but rather determining when accurate timekeeping can safely be disabled. Whenever accurate timekeeping is required in a CONFIG_NO_HZ_FULL kernel, at least one CPU must keep the scheduling-clock tick going. If even one CPU is executing in user mode, accurate timekeeping is requires, particularly for architectures where gettimeofday() and friends do not enter the kernel. Only when all CPUs are really and truly idle can accurate timekeeping be disabled, allowing all CPUs to turn off the scheduling clock interrupt, thus greatly improving energy efficiency. This naturally raises the question "Why is this code in RCU rather than in timekeeping?", and the answer is that RCU has the data and infrastructure to efficiently make this determination. Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18nohz_full: Add Kconfig parameter for scalable detection of all-idle statePaul E. McKenney1-0/+23
At least one CPU must keep the scheduling-clock tick running for timekeeping purposes whenever there is a non-idle CPU. However, with the new nohz_full adaptive-idle machinery, it is difficult to distinguish between all CPUs really being idle as opposed to all non-idle CPUs being in adaptive-ticks mode. This commit therefore adds a Kconfig parameter as a first step towards enabling a scalable detection of full-system idle state. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> [ paulmck: Update help text per Frederic Weisbecker. ] Reviewed-by: Josh Triplett <[email protected]>
2013-08-18rcu: Eliminate unused APIs intended for adaptive ticksPaul E. McKenney1-43/+0
The rcu_user_enter_after_irq() and rcu_user_exit_after_irq() functions were intended for use by adaptive ticks, but changes in implementation have rendered them unnecessary. This commit therefore removes them. Reported-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18rcu: Avoid redundant grace-period kthread wakeupsPaul E. McKenney1-3/+5
When setting up an in-the-future "advanced" grace period, the code needs to wake up the relevant grace-period kthread, which it currently does unconditionally. However, this results in needless wakeups in the case where the advanced grace period is being set up by the grace-period kthread itself, which is a non-uncommon situation. This commit therefore checks to see if the running thread is the grace-period kthread, and avoids doing the irq_work_queue()-mediated wakeup in that case. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18rcu: Make call_rcu() leak callbacks for debug-object errorsPaul E. McKenney2-4/+20
If someone does a duplicate call_rcu(), the worst thing the second call_rcu() could do would be to actually queue the callback the second time because doing so corrupts whatever list the callback was already queued on. This commit therefore makes __call_rcu() check the new return value from debug-objects and leak the callback upon error. This commit also substitutes rcu_leak_callback() for whatever callback function was previously in place in order to avoid freeing the callback out from under any readers that might still be referencing it. These changes increase the probability that the debug-objects error messages will actually make it somewhere visible. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Sedat Dilek <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Tested-by: Sedat Dilek <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18rcu: Simplify debug-objects fixupsPaul E. McKenney1-100/+0
The current debug-objects fixups are complex and heavyweight, and the fixups are not complete: Even with the fixups, RCU's callback lists can still be corrupted. This commit therefore strips the fixups down to their minimal form, eliminating two of the three. It would be even better if (for example) call_rcu() simply leaked any problematic callbacks, but for that to happen, the debug-objects system would need to inform its caller of suspicious situations. This is the subject of a later commit in this series. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Sedat Dilek <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Tested-by: Sedat Dilek <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18rcu: Expedite grace periods during suspend/resumeBorislav Petkov1-0/+21
CONFIG_RCU_FAST_NO_HZ can increase grace-period durations by up to a factor of four, which can result in long suspend and resume times. Thus, this commit temporarily switches to expedited grace periods when suspending the box and return to normal settings when resuming. Similar logic is applied to hibernation. Because expedited grace periods are of dubious benefit on very large systems, so this commit restricts their automated use during suspend and resume to systems of 256 or fewer CPUs. (Some day a number of Linux-kernel facilities, including RCU's expedited grace periods, will be more scalable, but I need to see bug reports first.) [ paulmck: This also papers over an audio/irq bug, but hopefully that will be fixed soon. ] Signed-off-by: Borislav Petkov <[email protected]> Signed-off-by: Bjørn Mork <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Josh Triplett <[email protected]>
2013-08-18Merge branch 'for-3.11-fixes' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "This contains one patch to fix the return value of cpuset's cgroups interface function, which used to always return -ENODEV for the writes on the 'memory_pressure_enabled' file" * 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix the return value of cpuset_write_u64()
2013-08-16Merge tag 'pm-3.11-rc6' of ↵Linus Torvalds1-7/+13
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "The removal of delayed_work_pending() checks from kernel/power/qos.c done in 3.9 introduced a deadlock in pm_qos_work_fn(). Fix from Stephen Boyd" * tag 'pm-3.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / QoS: Fix workqueue deadlock when using pm_qos_update_request_timeout()
2013-08-16perf: Do not compute time values unnecessarilyPeter Zijlstra1-4/+4
We should not be calling calc_timer_values() for events that do not actually have an mmap()'ed userpage. Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16perf: Account freq events globallyFrederic Weisbecker1-11/+8
Freq events may not always be affine to a particular CPU. As such, account_event_cpu() may crash if we account per cpu a freq event that has event->cpu == -1. To solve this, lets account freq events globally. In practice this doesn't change much the picture because perf tools create per-task perf events with one event per CPU by default. Profiling a single CPU is usually a corner case so there is no much point in optimizing things that way. Reported-by: Jiri Olsa <[email protected]> Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Tested-by: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Stephane Eranian <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16perf: Roll back callchain buffer refcount under the callchain mutexFrederic Weisbecker1-1/+2
When we fail to allocate the callchain buffers, we roll back the refcount we did and return from get_callchain_buffers(). However we take the refcount and allocate under the callchain lock but the rollback is done outside the lock. As a result, while we roll back, some concurrent callchain user may call get_callchain_buffers(), see the non-zero refcount and give up because the buffers are NULL without itself retrying the allocation. The consequences aren't that bad but that behaviour looks weird enough and it's better to give their chances to the following callchain users where we failed. Reported-by: Jiri Olsa <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Jiri Olsa <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Stephane Eranian <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16nohz: Include local CPU in full dynticks global kickFrederic Weisbecker1-0/+1
tick_nohz_full_kick_all() is useful to notify all full dynticks CPUs that there is a system state change to checkout before re-evaluating the need for the tick. Unfortunately this is implemented using smp_call_function_many() that ignores the local CPU. This CPU also needs to re-evaluate the tick. on_each_cpu_mask() is not useful either because we don't want to re-evaluate the tick state in place but asynchronously from an IPI to avoid messing up with any random locking scenario. So lets call tick_nohz_full_kick() from tick_nohz_full_kick_all() so that the usual irq work takes care of it. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16sched/cputime: Use this_cpu_add() in task_group_account_field()Christoph Lameter1-1/+1
Use of a this_cpu() operation reduces the number of instructions used for accounting (account_user_time()) and frees up some registers. This is in the scheduler tick hotpath. Signed-off-by: Christoph Lameter <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/00000140596dd165-338ff7f5-893b-4fec-b251-aaac5557239e-000000@email.amazonses.com Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16cpumask: Fix cpumask leak in partition_sched_domains()Xiaotian Feng1-2/+3
If doms_new is NULL, partition_sched_domains() will reset ndoms_cur to 0, and free old sched domains with free_sched_domains(doms_cur, ndoms_cur). As ndoms_cur is 0, the cpumask will not be freed. Signed-off-by: Xiaotian Feng <[email protected]> Cc: Rusty Russell <[email protected]> Cc: [email protected] Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16Merge tag 'v3.11-rc5' into sched/coreIngo Molnar30-548/+816
Merge Linux 3.11-rc5, to pick up the latest fixes. Signed-off-by: Ingo Molnar <[email protected]>
2013-08-16cgroup: use css_get() in cgroup_create() to check CSS_ROOTLi Zhong1-1/+1
It seems that the root css doesn't have refcnt allocated(not needed?), and would cause the booting error attached. This patch tries to use css_get() to not increase the refcnt if parent is root. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740 PGD 0 Oops: 0002 [#1] Modules linked in: CPU: 0 PID: 1 Comm: systemd Not tainted 3.11.0-rc5-next-20130815+ #1 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007 task: ffff88007f868000 ti: ffff88007f864000 task.ti: ffff88007f864000 RIP: 0010:[<ffffffff810b37cc>] [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740 RSP: 0018:ffff88007f865df8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffff81a46ee0 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81a415c0 RBP: ffff88007f865ec8 R08: 0000000000000001 R09: 0000000000000000 R10: ffff88007ce6d060 R11: 0000000000000000 R12: ffff88007ce6d000 R13: ffff88007ce6d060 R14: ffffffff81a46d80 R15: ffff88007c6e8018 FS: 00007f13dbf6f840(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000007b7e5000 CR4: 00000000000006b0 Stack: ffffffff810b380d 0000000000000002 ffff88007f865e18 ffffffff81167069 ffff88007f865ed8 ffffffff8116a3f5 ffff880037454400 ffff88007c6e8018 ffff88007c6e8028 ffff88007c6e8328 ffff88007c6e8000 ffff88007ce6d000 Call Trace: [<ffffffff810b380d>] ? cgroup_mkdir+0x3bd/0x740 [<ffffffff81167069>] ? lookup_hash+0x19/0x20 [<ffffffff8116a3f5>] ? kern_path_create+0x95/0x170 [<ffffffff8116ce3e>] vfs_mkdir+0x9e/0xf0 [<ffffffff8116d7a0>] SyS_mkdirat+0x60/0xe0 [<ffffffff8116d839>] SyS_mkdir+0x19/0x20 [<ffffffff814c960d>] tracesys+0xcf/0xd4 Code: ad 70 ff ff ff 48 89 9d 60 ff ff ff 4d 89 d5 4c 8b bd 68 ff ff ff 4c 8b 65 88 eb 50 0f 1f 00 48 8b 43 18 a8 03 0f 85 6c 03 00 00 <ff> 00 e8 1d 0a fb ff 85 c0 74 0d 80 3d f0 45 a1 00 00 0f 84 4c RIP [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740 RSP <ffff88007f865df8> CR2: 0000000000000000 ---[ end trace a4b14b49bc46fd60 ]--- Signed-off-by: Li Zhong <[email protected]> Acked-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2013-08-15xfs: ioctl check for capabilities in the current user namespaceDwight Engen1-0/+1
Use inode_capable() to check if SUID|SGID bits should be cleared to match similar check in inode_change_ok(). The check for CAP_LINUX_IMMUTABLE was not modified since all other file systems also check against init_user_ns rather than current_user_ns. Only allow changing of projid from init_user_ns. Reviewed-by: Dave Chinner <[email protected]> Reviewed-by: Gao feng <[email protected]> Signed-off-by: Dwight Engen <[email protected]> Signed-off-by: Ben Myers <[email protected]>
2013-08-15Merge tag 'v3.11-rc5' into perf/coreIngo Molnar31-549/+818
Merge Linux 3.11-rc5, to sync up with the latest upstream fixes since -rc1. Signed-off-by: Ingo Molnar <[email protected]>
2013-08-14Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds1-0/+6
Merge a bunch of fixes from Andrew Morton. * emailed patches from Andrew Morton <[email protected]>: fs/proc/task_mmu.c: fix buffer overflow in add_page_map() arch: *: Kconfig: add "kernel/Kconfig.freezer" to "arch/*/Kconfig" ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id() x86 get_unmapped_area(): use proper mmap base for bottom-up direction ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_page ocfs2: Revert 40bd62e to avoid regression in extended allocation drivers/rtc/rtc-stmp3xxx.c: provide timeout for potentially endless loop polling a HW bit hugetlb: fix lockdep splat caused by pmd sharing aoe: adjust ref of head for compound page tails microblaze: fix clone syscall mm: save soft-dirty bits on file pages mm: save soft-dirty bits on swapped pages memcg: don't initialize kmem-cache destroying work for root caches
2013-08-14Merge branch 'timers/nohz-v3' of ↵Ingo Molnar6-128/+116
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz Pull nohz improvements from Frederic Weisbecker: " It mostly contains fixes and full dynticks off-case optimizations. I believe that distros want to enable this feature so it seems important to optimize the case where the "nohz_full=" parameter is empty. ie: I'm trying to remove any performance regression that comes with NO_HZ_FULL=y when the feature is not used. This patchset improves the current situation a lot (off-case appears to be around 11% faster with hackbench, although I guess it may vary depending on the configuration but it should be significantly faster in any case) now there is still some work to do: I can still observe a remaining loss of 1.6% throughput seen with hackbench compared to CONFIG_NO_HZ_FULL=n. " Signed-off-by: Ingo Molnar <[email protected]>
2013-08-14nohz: Optimize full dynticks's sched hooks with static keysFrederic Weisbecker1-4/+4
Scheduler IPIs and task context switches are serious fast path. Let's try to hide as much as we can the impact of full dynticks APIs' off case that are called on these sites through the use of static keys. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14nohz: Optimize full dynticks state checks with static keysFrederic Weisbecker1-12/+2
These APIs are frequenctly accessed and priority is given to optimize the full dynticks off-case in order to let distros enable this feature without suffering from significant performance regressions. Let's inline these APIs and optimize them with static keys. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14nohz: Rename a few state variablesFrederic Weisbecker1-22/+22
Rename the full dynticks's cpumask and cpumask state variables to some more exportable names. These will be used later from global headers to optimize the main full dynticks APIs in conjunction with static keys. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14vtime: Always debug check snapshot source _before_ updating itFrederic Weisbecker1-2/+2
The vtime delta update performed by get_vtime_delta() always check that the source of the snapshot is valid. Meanhile the snapshot updaters that rely on get_vtime_delta() also set the new snapshot origin. But some of them do this right before the call to get_vtime_delta(), making its debug check useless. This is easily fixable by moving the snapshot origin update after the call to get_vtime_delta(). The order doesn't matter there. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14vtime: Always scale generic vtime accounting resultsFrederic Weisbecker1-6/+0
The cputime accounting in full dynticks can be a subtle mixup of CPUs using tick based accounting and others using generic vtime. As long as the tick can have a share on producing these stats, we want to scale the result against CFS precise accounting as the tick can miss some task hiding between the periodic interrupt. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14vtime: Optimize full dynticks accounting off case with static keysFrederic Weisbecker1-18/+4
If no CPU is in the full dynticks range, we can avoid the full dynticks cputime accounting through generic vtime along with its overhead and use the traditional tick based accounting instead. Let's do this and nope the off case with static keys. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14vtime: Fix racy cputime delta updateFrederic Weisbecker1-1/+2
get_vtime_delta() must be called under the task vtime_seqlock with the code that does the cputime accounting flush. Otherwise the cputime reader can be fooled and run into a race where it sees the snapshot update but misses the cputime flush. As a result it can report a cputime that is way too short. Fix vtime_account_user() that wasn't complying to that rule. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14vtime: Remove a few unneeded generic vtime state checksFrederic Weisbecker1-12/+1
Some generic vtime APIs check if the vtime accounting is enabled on the local CPU before doing their work. Some of these are not needed because all their callers already take care of that. Let's remove the checks on these. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14context_tracking: User/kernel broundary cross trace eventsFrederic Weisbecker1-0/+5
This can be useful to track all kernel/user round trips. And it's also helpful to debug the context tracking subsystem. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14context_tracking: Optimize context switch off case with static keysFrederic Weisbecker1-3/+3
No need for syscall slowpath if no CPU is full dynticks, rather nop this in this case. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14context_tracking: Optimize guest APIs off case with static keyFrederic Weisbecker2-21/+4
Optimize guest entry/exit APIs with static keys. This minimize the overhead for those who enable CONFIG_NO_HZ_FULL without always using it. Having no range passed to nohz_full= should result in the probes overhead to be minimized. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14context_tracking: Optimize main APIs off case with static keyFrederic Weisbecker1-6/+6
Optimize user and exception entry/exit APIs with static keys. This minimize the overhead for those who enable CONFIG_NO_HZ_FULL without always using it. Having no range passed to nohz_full= should result in the probes to be nopped (at least we hope so...). If this proves not be enough in the long term, we'll need to bring an exception slow path by re-routing the exception handlers. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-14context_tracking: Ground setup for static key useFrederic Weisbecker1-6/+17
Prepare for using a static key in the context tracking subsystem. This will help optimizing the off case on its many users: * user_enter, user_exit, exception_enter, exception_exit, guest_enter, guest_exit, vtime_*() Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Li Zhong <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Kevin Hilman <[email protected]>
2013-08-13microblaze: fix clone syscallMichal Simek1-0/+6
Fix inadvertent breakage in the clone syscall ABI for Microblaze that was introduced in commit f3268edbe6fe ("microblaze: switch to generic fork/vfork/clone"). The Microblaze syscall ABI for clone takes the parent tid address in the 4th argument; the third argument slot is used for the stack size. The incorrectly-used CLONE_BACKWARDS type assigned parent tid to the 3rd slot. This commit restores the original ABI so that existing userspace libc code will work correctly. All kernel versions from v3.8-rc1 were affected. Signed-off-by: Michal Simek <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2013-08-13cpuset: remove an unncessary forward declarationLi Zefan1-3/+0
Signed-off-by: Li Zefan <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2013-08-13cgroup: RCU protect each cgroup_subsys_state releaseTejun Heo1-16/+37
With the planned unified hierarchy, individual css's will be created and destroyed dynamically across the lifetime of a cgroup. To enable such usages, css destruction is being decoupled from cgroup destruction. Most of the destruction path has been decoupled but the actual free of css still depends on cgroup free path. When all css refs are drained, css_release() kicks off css_free_work_fn() which puts the cgroup. When the cgroup refcnt reaches zero, cgroup_diput() is invoked which in turn schedules RCU free of the cgroup. After a grace period, all css's are freed along with the cgroup itself. This patch moves the RCU grace period and css freeing from cgroup release path to css release path. css_release(), instead of kicking off css_free_work_fn() directly, schedules RCU callback css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a RCU grace period. css_free_work_fn() is updated to free the css directly. The five-way punting - percpu ref kill confirmation, a work item, percpu ref release, RCU grace period, and again a work item - is quite hairy but the work items are there only to provide process context and the actual sequence is kill confirm -> release -> RCU free, which isn't simple but not too crazy. This removes cgroup_css() usage after offline_css() allowing clearing cgroup->subsys[] from offline_css(), which makes it consistent with online_css() and brings it closer to proper lifetime management for individual css's. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-13cgroup: move subsys file removal to kill_css()Tejun Heo1-7/+9
With the planned unified hierarchy, individual css's will be created and destroyed dynamically across the lifetime of a cgroup. To enable such usages, css destruction is being decoupled from cgroup destruction. This patch moves subsys file removal from cgroup_destroy_locked() to kill_css(). While this changes the order of destruction operations, the changes shouldn't be noticeable to cgroup subsystems or userland. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-13cgroup: factor out kill_css()Tejun Heo1-23/+35
Factor out css ref killing from cgroup_destroy_locked() into kill_css(). We're gonna add more to the path and the factored out function will eventually be called from other places too. While at it, replace open coded percpu_ref_get() with css_get() for consistency. This shouldn't cause any functional difference as the function is not used for root cgroups. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-13cgroup: decouple cgroup_subsys_state destruction from cgroup destructionTejun Heo1-28/+24
Currently, css (cgroup_subsys_state) lifetime is tied to that of the associated cgroup. css's are created when the associated cgroup is created and destroyed when it gets destroyed. Also, individual css's aren't RCU protected but the whole cgroup is. With the planned unified hierarchy, css's will need to be dynamically created and destroyed within the lifetime of a cgroup. To enable such usages, this patch decouples css destruction from cgroup destruction - offline_css() invocation and the final css_put() are moved from cgroup_destroy_css_killed() to css_killed_work_fn(). Now each css is individually offlined and put as its reference count is killed instead of waiting for all css's attached to the cgroup to finish refcnt killing and then proceeding to offlining and putting them together. While this changes the order of destruction operations, the changes shouldn't be noticeable to cgroup subsystems or userland. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-13cgroup: replace cgroup->css_kill_cnt with ->nr_cssTejun Heo1-24/+28
Currently, css (cgroup_subsys_state) lifetime is tied to that of the associated cgroup. With the planned unified hierarchy, css's will be dynamically created and destroyed within the lifetime of a cgroup. To enable such usages, css's will be individually RCU protected instead of being tied to the cgroup. cgroup->css_kill_cnt is used during cgroup destruction to wait for css reference count disable; however, this model doesn't work once css's lifetimes are managed separately from cgroup's. This patch replaces it with cgroup->nr_css which is an cgroup_mutex protected integer counting the number of attached css's. The count is incremented from online_css() and decremented after refcnt kill is confirmed. If the count reaches zero and the cgroup is marked dead, the second stage of cgroup destruction is kicked off. If a cgroup doesn't have any css attached at the time of rmdir, cgroup_destroy_locked() now invokes the second stage directly as no css kill confirmation would happen. cgroup_offline_fn() - the second step of cgroup destruction - is renamed to cgroup_destroy_css_killed() and now expects to be called with cgroup_mutex held. While this patch changes how css destruction is punted to work items, it shouldn't change any visible behavior. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>
2013-08-13cgroup: bounce cgroup_subsys_state ref kill confirmation to a work itemTejun Heo1-3/+18
css (cgroup_subsys_state) offlining, which requires process context, will be moved to ref kill confirmation. In preparation, bounce css_killed handling through css->destroy_work. css_ref_killed_fn() is renamed to css_killed_ref_fn() so that it's consistent with the new css_killed_work_fn(). This patch adds an additional work item bouncing but doesn't change the actual logic. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Li Zefan <[email protected]>