path: root/kernel
2013-06-20  watchdog: Rename confusing state variable  (Frederic Weisbecker; 2 files, -17/+17)

We have two very conflicting state variable names in the watchdog:

* watchdog_enabled: reflects the user interface. It's set to 1 by default and can be overridden with boot options or the sysctl/procfs interface.

* watchdog_disabled: the internal toggle state that tells whether watchdog threads, timers and NMI events are currently running. This state mostly depends on the user settings and is a convenient state latch.

We really need to find clearer names because the current ones are just too confusing to encourage deep review.

watchdog_enabled now becomes watchdog_user_enabled to reflect its purpose as an interface. watchdog_disabled becomes watchdog_running to suggest its role as a pure internal state.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Anish Singh <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Don Zickus <[email protected]>

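After the rename, the two toggles read roughly like this (a sketch of the kernel/watchdog.c declarations of that era; exact types and placement are assumptions):

    int watchdog_user_enabled = 1;  /* user-facing: boot param / sysctl / procfs */
    static int watchdog_running;    /* internal: threads, timers, NMI events active */
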
2013-06-19  ftrace: Fix stddev calculation in function profiler  (Juri Lelli; 1 file, -2/+8)

When FUNCTION_GRAPH_TRACER is enabled, ftrace can profile kernel functions and print basic statistics about them. Unfortunately, the running stddev calculation is wrong. This patch corrects it, implementing Welford's method:

    s^2 = 1 / (n * (n-1)) * (n * \Sum (x_i)^2 - (\Sum x_i)^2)

Link: http://lkml.kernel.org/r/[email protected]
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

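For illustration, a minimal user-space rendering of that formula (a hypothetical floating-point sketch; the in-kernel profiler keeps the same running sums but does the arithmetic in fixed point):

    #include <math.h>

    /* Keep n, sum(x) and sum(x^2); the variance follows from
     * s^2 = (n * sum(x^2) - (sum(x))^2) / (n * (n - 1)). */
    struct prof_stat {
        unsigned long long n;
        double sum;     /* \Sum x_i   */
        double sum_sq;  /* \Sum x_i^2 */
    };

    static void prof_stat_add(struct prof_stat *s, double x)
    {
        s->n++;
        s->sum += x;
        s->sum_sq += x * x;
    }

    static double prof_stat_stddev(const struct prof_stat *s)
    {
        if (s->n < 2)
            return 0.0;
        return sqrt((s->n * s->sum_sq - s->sum * s->sum) /
                    ((double)s->n * (s->n - 1)));
    }
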
2013-06-19  tracing/kprobes: Remove unnecessary checking of trace_probe_is_enabled  (zhangwei(Jovi); 1 file, -2/+1)

Since the tp->flags assignment was moved into enable_trace_probe(), there is no need to use trace_probe_is_enabled() to check the flags in the same function. Remove the unnecessary check.

Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Masami Hiramatsu <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Srikar Dronamraju <[email protected]>
Signed-off-by: zhangwei(Jovi) <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2013-06-19tracing: Disable tracing on warningSteven Rostedt (Red Hat)3-0/+27
Add a traceoff_on_warning option in both the kernel command line as well as a sysctl option. When set, any WARN*() function that is hit will cause the tracing_on variable to be cleared, which disables writing to the ring buffer. This is useful especially when tracing a bug with function tracing. When a warning is hit, the print caused by the warning can flood the trace with the functions that producing the output for the warning. This can make the resulting trace useless by either hiding where the bug happened, or worse, by overflowing the buffer and losing the trace of the bug totally. Acked-by: Peter Zijlstra <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
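Usage, per the two interfaces this patch adds (same name in both places):

    # at boot, append to the kernel command line:
    #     traceoff_on_warning
    # or at runtime:
    echo 1 > /proc/sys/kernel/traceoff_on_warning
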
2013-06-19  tracing: Add binary '&' filter for events  (Steven Rostedt; 1 file, -0/+6)

There are cases where filtering on a set flag of a tracepoint field is useful. But currently the only filtering operators for numeric fields are ==, !=, <, <=, > and >=. These do not help when you just want to trace whether a specific flag is set. For example:

    # sudo trace-cmd record -e brcmfmac:brcmf_dbg -f 'level & 0x40000'
    disable all
    enable brcmfmac:brcmf_dbg
    path = /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/enable
    (level & 0x40000)
    ^
    parse_error: Invalid operator

When trying to trace brcmf_dbg when level has its 1 << 18 bit set, the filter fails to parse. Allowing a binary '&' operation gives the user the ability to test a bit.

Note, a binary '|' is not added, as it doesn't make sense: fields must be compared to constants (for now), and ORing a constant will always return true.

Link: http://lkml.kernel.org/r/[email protected]
Suggested-by: Arend van Spriel <[email protected]>
Tested-by: Arend van Spriel <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

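With the patch applied, the same bit test can be written directly into the event's filter file (path assumes debugfs is mounted at /sys/kernel/debug):

    echo 'level & 0x40000' > /sys/kernel/debug/tracing/events/brcmfmac/brcmf_dbg/filter
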
2013-06-20  watchdog: Register / unregister watchdog kthreads on sysctl control  (Frederic Weisbecker; 1 file, -40/+47)

The user activation/deactivation of the watchdog through boot parameters or sysctl is currently implemented with a dance involving the kthread parking and unparking methods: the threads are unconditionally registered on boot and park as soon as the user wants the watchdog disabled.

This method involves a few noisy details to handle, though: the watchdog kthreads may be unparked at any time due to hotplug operations, after which the watchdog internals have to decide to park again if the watchdog is user-disabled. As a result, the setup() and unpark() methods need to be able to request a reparking. This is not currently supported in the kthread infrastructure, so this piece of the watchdog code only works halfway.

Besides, unparking/reparking the watchdog kthreads consumes unnecessary cputime on hotplug operations, when those could simply be ignored in the first place.

As suggested by Srivatsa, let's instead only register the watchdog threads when they are needed. This way we don't need to think about hotplug operations and we don't burden CPU onlining when the watchdog is simply disabled.

Suggested-by: Srivatsa S. Bhat <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Srivatsa S. Bhat <[email protected]>
Cc: Anish Singh <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Li Zhong <[email protected]>
Cc: Don Zickus <[email protected]>

2013-06-20  nohz: Warn if the machine can not perform nohz_full  (Steven Rostedt; 1 file, -0/+5)

If the user configures NO_HZ_FULL and defines nohz_full=XXX on the kernel command line, or enables NO_HZ_FULL_ALL, but nohz fails because the machine has an unstable clock, warn about it. We do not want users thinking that they are getting the benefit of nohz when their machine can not support it.

Signed-off-by: Steven Rostedt <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Li Zhong <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>

2013-06-19  Merge tag 'please-pull-einj' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras  (Ingo Molnar; 1 file, -0/+1)

Pull miscellaneous fixes for the ACPI EINJ (error injection) code, from Tony Luck.

Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu  (Ingo Molnar; 7 files, -1208/+154)

Pull RCU changes from Paul E. McKenney:

"The major changes for this series are:

 1. Simplify RCU's grace-period and callback processing based on the new numbering for callbacks. These were posted to LKML at https://lkml.org/lkml/2013/5/20/330.

 2. Documentation updates. These were posted to LKML at https://lkml.org/lkml/2013/5/20/348.

 3. Miscellaneous fixes, including converting a few remaining printk() calls to pr_*(). These were posted to LKML at https://lkml.org/lkml/2013/5/20/324.

 4. SRCU-related changes and fixes. These were posted to LKML at https://lkml.org/lkml/2013/5/20/425.

 5. Removal of TINY_PREEMPT_RCU in favor of TREE_PREEMPT_RCU for single-CPU low-latency systems. These were posted to LKML at https://lkml.org/lkml/2013/5/20/427."

Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Don't mix use of typedef ctl_table and struct ctl_table  (Joe Perches; 1 file, -1/+1)

Just use struct ctl_table.

Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1371063336.2069.22.camel@joe-AO722
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Remove WARN_ON(!sd) from init_sched_groups_power()  (Viresh Kumar; 1 file, -1/+1)

sd can't be NULL in init_sched_groups_power(), so checking it for NULL isn't useful. And if such a check were ever required, the code would also need rearranging: by that point we have already dereferenced the (possibly invalid) pointer sd to obtain sg, via sg = sd->groups.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/2bbe633cd74b431c05253a8ce61fdfd5066a531b.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Fix memory leakage in build_sched_groups()  (Viresh Kumar; 1 file, -2/+2)

In build_sched_groups() we don't need to call get_group() for CPUs that are already covered in previous iterations. Calling get_group() would mark the group as used and eventually leak it, since we would never connect it and never find it again to free it.

This can only happen in cases where sg->cpumask contains more than one CPU (for any topology level). With this patch, sg's memory is freed for all CPUs other than the group leader, as the group is no longer marked as used.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/7a61e955abdcbb1dfa9fe493f11a5ec53a11ddd3.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

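The shape of the fix, roughly (a sketch of the iteration in build_sched_groups(); the point is that the covered check now runs before get_group() can mark a group as used):

    for_each_cpu(i, span) {
        struct sched_group *sg;
        int group;

        /* Skip already-covered CPUs *before* calling get_group(),
         * so their groups are never marked used and then leaked. */
        if (cpumask_test_cpu(i, covered))
            continue;

        group = get_group(i, sdd, &sg);
        /* ... connect sg and mark its CPUs covered ... */
    }
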
2013-06-19  sched: Use cached value of span instead of calling sched_domain_span()  (Viresh Kumar; 1 file, -1/+1)

At the beginning of build_sched_groups() we call sched_domain_span() and cache its return value in span. A few statements later we call it again to get the same pointer. Let's use the cached value instead, as it hasn't changed in between.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/834ecd507071ad88aff039352dbc7e063dd996a7.1370948150.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Create for_each_sd_topology()  (Viresh Kumar; 1 file, -3/+6)

The for loop for traversing sched_domain_topology was used in multiple places in core.c. This patch removes that redundancy by creating for_each_sd_topology().

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/e0e04542f54e9464bd9da54f5ccfe62ec6c4c0bc.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

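The helper is a thin wrapper over the open-coded loop (a sketch; the termination condition assumes the sched_domain_topology table of that era, whose last entry has a NULL init hook):

    #define for_each_sd_topology(tl)                        \
            for (tl = sched_domain_topology; tl->init; tl++)
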
2013-06-19  sched: Don't set sd->child to NULL when it is already NULL  (Viresh Kumar; 1 file, -1/+1)

Memory for sd is allocated with kzalloc_node(), which initializes its fields to zero. In build_sched_domain() we set sd->child to child even when child is NULL, which isn't required. Let's do it only when child isn't NULL.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/f4753a1730051341003ad2ad29a3229c7356678e.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Don't initialize alloc_state in build_sched_domains()  (Viresh Kumar; 1 file, -1/+1)

alloc_state will be overwritten by __visit_domain_allocation_hell(), so we don't actually need to initialize it.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/df57734a075cc5ad130e1ae498702e24f2529ab8.1370861520.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Optimize build_sched_domains() for saving first SD node for a cpu  (Viresh Kumar; 1 file, -5/+2)

We save the first scheduling domain for a CPU in build_sched_domains() by iterating over the nested sd->child list. We don't actually need to do it this way: tl is equal to sched_domain_topology on the first iteration, so we can set *per_cpu_ptr(d.sd, i) based on that. So, save the pointer to the first SD while running the iteration loop over the tl's.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/fc473527cbc4dfa0b8eeef2a59db74684eb59a83.1370436120.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Remove unused params of build_sched_domain()  (Viresh Kumar; 1 file, -4/+3)

build_sched_domain() never uses its struct s_data *d parameter, so passing it is useless. Remove it.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/545e0b4536166a15b4475abcafe5ed0db4ad4a2c.1370436120.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Rename sched.c as sched/core.c in comments and Documentation  (Viresh Kumar; 3 files, -4/+4)

Most of the code from kernel/sched.c was moved to kernel/sched/core.c a long time back, and the comments/Documentation were never updated. I noticed this while going through sched-domains.txt, and so thought of fixing it globally.

I haven't cross-checked whether the material these files reference in sched/core.c is still present and unchanged, as that wasn't the motive behind this patch.

Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Remove the useless declaration in kernel/sched/fair.c  (Michael Wang; 1 file, -4/+0)

default_cfs_period(), do_sched_cfs_period_timer() and do_sched_cfs_slack_timer() are already defined earlier in the file; there is no need to declare them again.

Signed-off-by: Michael Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched: Refine the code in unthrottle_cfs_rq()  (Michael Wang; 1 file, -1/+1)

Directly use rq to save some code.

Signed-off-by: Michael Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list  (Kirill Tkhai; 2 files, -67/+16)

[ Peter, this is based off of some of my work. I ran it through a few tests and it passed. I also reviewed it, and added my SOB as I am somewhat a co-author of it. ]

Based on the patch by Steven Rostedt from the previous year: https://lkml.org/lkml/2012/4/18/517

1) Simplify the pull_rt_task() logic: search the pushable tasks of the destination runqueue. The only pullable tasks are the tasks that are pushable in their local rq, and no others.

2) Remove the .leaf_rt_rq_list member of struct rt_rq and the functions connected with it: nobody uses it anymore.

Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  Merge branch 'sched/urgent' into sched/core  (Ingo Molnar; 34 files, -1096/+1297)

Merge in fixes before applying ongoing new work.

Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  tracing/context-tracking: Add preempt_schedule_context() for tracing  (Steven Rostedt; 1 file, -0/+40)

Dave Jones hit the following bug report:

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.10.0-rc2+ #1 Not tainted
    -------------------------------
    include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!

    other info that might help us debug this:

    RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
    RCU used illegally from extended quiescent state!
    2 locks held by cc1/63645:
     #0: (&rq->lock){-.-.-.}, at: [<ffffffff816b39fd>] __schedule+0xed/0x9b0
     #1: (rcu_read_lock){.+.+..}, at: [<ffffffff8109d645>] cpuacct_charge+0x5/0x1f0

    CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
    Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
     0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
     ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
     0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
    Call Trace:
     [<ffffffff816ae383>] dump_stack+0x19/0x1b
     [<ffffffff810b698d>] lockdep_rcu_suspicious+0xfd/0x130
     [<ffffffff8109d7c5>] cpuacct_charge+0x185/0x1f0
     [<ffffffff8109d645>] ? cpuacct_charge+0x5/0x1f0
     [<ffffffff8108dffc>] update_curr+0xec/0x240
     [<ffffffff8108f528>] put_prev_task_fair+0x228/0x480
     [<ffffffff816b3a71>] __schedule+0x161/0x9b0
     [<ffffffff816b4721>] preempt_schedule+0x51/0x80
     [<ffffffff816b4800>] ? __cond_resched_softirq+0x60/0x60
     [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
     [<ffffffff810ff3cc>] ftrace_ops_control_func+0x1dc/0x210
     [<ffffffff816be280>] ftrace_call+0x5/0x2f
     [<ffffffff816b681d>] ? retint_careful+0xb/0x2e
     [<ffffffff816b4805>] ? schedule_user+0x5/0x70
     [<ffffffff816b4805>] ? schedule_user+0x5/0x70
     [<ffffffff816b6824>] ? retint_careful+0x12/0x2e
    ------------[ cut here ]------------

What happened was that the function tracer traced the schedule_user() code, which tells RCU that the system is coming back from userspace and that the CPU should be added back to RCU monitoring.

Because the function tracer does preempt_disable/enable_notrace() calls, preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set, preempt_schedule() is called. But this happens before the user_exit() function can inform the kernel that the CPU is no longer in user mode and needs to be accounted for by RCU.

The fix is to create a new preempt_schedule_context() that checks whether the kernel is still in user mode and, if so, switches it to kernel mode before calling schedule. It also switches back to user mode when coming back from schedule, if need be.

The only user of this currently is preempt_enable_notrace(), which is only used by the tracing subsystem.

Signed-off-by: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

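The new helper looks roughly like this (a paraphrased sketch of what the patch adds; exception_enter()/exception_exit() are the context-tracking transitions the text describes, and the exact preempt-count handling may differ):

    asmlinkage void __sched notrace preempt_schedule_context(void)
    {
        enum ctx_state prev_ctx;

        if (likely(!preemptible()))
            return;

        /*
         * Disable preemption around user_exit() in case it is traced
         * and the tracer's preempt_enable_notrace() would recurse.
         */
        preempt_disable_notrace();
        prev_ctx = exception_enter();   /* leave user mode, tell RCU */
        preempt_enable_no_resched_notrace();

        preempt_schedule();

        preempt_disable_notrace();
        exception_exit(prev_ctx);       /* restore the previous state */
        preempt_enable_no_resched_notrace();
    }
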
2013-06-19  sched: Fix clear NOHZ_BALANCE_KICK  (Vincent Guittot; 1 file, -4/+17)

I have faced a sequence where the Idle Load Balance was sometimes not triggered for a while on my platform, in the following scenario:

    CPU 0 and CPU 1 are running tasks and CPU 2 is idle

    CPU 1 kicks the Idle Load Balance
    CPU 1 selects CPU 2 as the new Idle Load Balancer
    CPU 2 sets NOHZ_BALANCE_KICK for CPU 2
    CPU 2 sends a reschedule IPI to CPU 2

    While CPU 3 wakes up, CPU 0 or CPU 1 migrates a waking up task A on CPU 2

    CPU 2 finally wakes up, runs task A and discards the Idle Load Balance;
    task A quickly goes back to sleep (before a tick occurs on CPU 2)
    CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

Whenever CPU 2 is then selected as the ILB, no reschedule IPI will be sent because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance will be performed. We must wait for the sched softirq to be raised on CPU 2 by another part of the kernel to come back and clear NOHZ_BALANCE_KICK.

The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if we can't raise the sched softirq for the Idle Load Balance.

Change since V1:
- move the clear of NOHZ_BALANCE_KICK into got_nohz_idle_kick() if the ILB can't run on this CPU (as suggested by Peter)

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

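The check ends up in got_nohz_idle_kick(), roughly as below (a sketch following the commit's description; nohz_flags() and NOHZ_BALANCE_KICK are the existing per-CPU nohz state):

    static inline bool got_nohz_idle_kick(void)
    {
        int cpu = smp_processor_id();

        if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
            return false;

        if (idle_cpu(cpu) && !need_resched())
            return true;

        /*
         * We can't run the idle load balance on this CPU this time,
         * so clear NOHZ_BALANCE_KICK and let a later kick succeed.
         */
        clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
        return false;
    }
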
2013-06-19  perf: Add const qualifier to perf_pmu_register's 'name' arg  (Mischa Jonker; 1 file, -1/+1)

This allows us to use pdev->name for registering a PMU device. IMO the name is not supposed to be changed anyway.

Signed-off-by: Mischa Jonker <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

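The declaration after this change (include/linux/perf_event.h):

    int perf_pmu_register(struct pmu *pmu, const char *name, int type);
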
2013-06-19  perf: Fix hypervisor branch sampling permission check  (Stephane Eranian; 1 file, -2/+2)

Commit 2b923c8 ("perf/x86: Check branch sampling priv level in generic code") was missing the check for the hypervisor (HV) priv level, so add it back.

With this patch, we get the following correct behavior:

    # echo 2 >/proc/sys/kernel/perf_event_paranoid

    $ perf record -j any,k noploop 1
    Error:
    You may not have permission to collect stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid:
     -1 - Not paranoid at all
      0 - Disallow raw tracepoint access for unpriv
      1 - Disallow cpu events for unpriv
      2 - Disallow kernel profiling for unpriv

    $ perf record -j any,hv noploop 1
    Error:
    You may not have permission to collect stats.
    Consider tweaking /proc/sys/kernel/perf_event_paranoid:
     -1 - Not paranoid at all
      0 - Disallow raw tracepoint access for unpriv
      1 - Disallow cpu events for unpriv
      2 - Disallow kernel profiling for unpriv

Signed-off-by: Stephane Eranian <[email protected]>
Acked-by: Petr Matousek <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/20130606090204.GA3725@quad
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  Merge branch 'perf/urgent' into perf/core  (Ingo Molnar; 3 files, -82/+185)

Merge in the latest fixes, to avoid conflicts with ongoing work.

Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  perf: Fix mmap() accounting hole  (Peter Zijlstra; 2 files, -72/+159)

Vince's fuzzer once again found holes. This time it spotted a leak in the locked-page accounting. When an event had redirected output and its close() was the last reference to the buffer, we didn't have a vm context in which to undo the accounting.

Change the code to destroy the buffer on the last munmap() and detach all redirected events at that time. This provides us the right context to undo the vm accounting.

Reported-and-tested-by: Vince Weaver <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

2013-06-19  cgroup: rename cont to cgrp  (Li Zefan; 1 file, -11/+11)

Cont is short for container. control group was named process container at first, but then people found that container already has a meaning in the linux kernel. Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2013-06-18  cgroup: clean up cgroup_serial_nr_cursor  (Tejun Heo; 1 file, -7/+8)

cgroup_serial_nr_cursor was made an atomic64_t because I thought it was never going to be used for anything other than assigning unique numbers to cgroups, and I didn't want to worry about synchronization. However, we're now using it as an event stamp to distinguish cgroups created before and after a certain point, which assumes that it's protected by cgroup_mutex.

Let's make this clear by making it a u64. Also, rename it to cgroup_serial_nr_next and make it point to the next nr to allocate, so that what it points to is clear and more conventional.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>

2013-06-18  cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()  (Li Zefan; 1 file, -36/+44)

We used root->allcg_list to iterate the cgroup hierarchy because, at that time, cgroup_for_each_descendant_pre() hadn't been invented.

tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the assignment right above releasing cgroup_mutex and explain what's going on there.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2013-06-18  cgroup: make serial_nr_cursor available throughout cgroup.c  (Li Zefan; 1 file, -8/+10)

The next patch will use it to determine whether a cgroup was newly created while we're iterating the cgroup hierarchy.

tj: Rephrased the comment on top of cgroup_serial_nr_cursor.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2013-06-18  range: Do not add new blank slot with add_range_with_merge  (Yinghai Lu; 1 file, -10/+11)

Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge during add range) broke mtrr cleanup on his setup in 3.9.5. The corresponding commit in upstream is fbe06b7bae7c.

The reason is that add_range_with_merge could generate blank spots. We can avoid that by searching for the new expanded start/end, such that the new range includes all connected ranges in the range array. At the end, add the new expanded start/end to the range array, and move the remaining entries up so no new blank slot is added to the range array.

-v2: move the leftover entries up to avoid enhancing add_range()
-v3: include fix from Joshua about memmove declaring when DYN_DEBUG is used.

Reported-by: Joshua Covington <[email protected]>
Tested-by: Joshua Covington <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: <[email protected]> v3.9
Signed-off-by: H. Peter Anvin <[email protected]>

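The resulting merge logic looks roughly like this (a sketch of kernel/range.c after the fix, reconstructed from the description above: first grow [start, end] to swallow every connected range, compacting the array as slots are merged away, then add the expanded range back):

    int add_range_with_merge(struct range *range, int az, int nr_range,
                             u64 start, u64 end)
    {
        int i;

        if (start >= end)
            return nr_range;

        /* Grow start/end to cover every range we touch: */
        for (i = 0; i < nr_range; i++) {
            u64 common_start, common_end;

            if (!range[i].end)
                continue;

            common_start = max(range[i].start, start);
            common_end = min(range[i].end, end);
            if (common_start > common_end)
                continue;       /* not connected */

            start = min(range[i].start, start);
            end = max(range[i].end, end);

            /* Compact: shift the tail left so no blank slot remains. */
            memmove(&range[i], &range[i + 1],
                    (nr_range - (i + 1)) * sizeof(range[i]));
            range[nr_range - 1].start = 0;
            range[nr_range - 1].end = 0;
            nr_range--;
            i--;
        }

        /* Add the fully expanded range back in one slot. */
        return add_range(range, az, nr_range, start, end);
    }
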
2013-06-18  cgroup: fix memory leak in cgroup_rm_cftypes()  (Li Zefan; 1 file, -1/+2)

The memory allocated in cgroup_add_cftypes() should be freed. The effect of this bug is that we leak a bit of memory every time we unload the cfq-iosched module with the blkio cgroup enabled.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2013-06-18  cgroup: fix umount vs cgroup_event_remove() race  (Li Zefan; 1 file, -6/+19)

    commit 5db9a4d99b0157a513944e9a44d29c9cec2e91dc
    Author: Tejun Heo <[email protected]>
    Date:   Sat Jul 7 16:08:18 2012 -0700

        cgroup: fix cgroup hierarchy umount race

This commit fixed a race caused by the dput() in css_dput_fn(), but the dput() in cgroup_event_remove() can also lead to the same BUG().

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]

2013-06-18  cgroup: fix umount vs cgroup_cfts_commit() race  (Li Zefan; 1 file, -1/+8)

cgroup_cfts_commit() uses dget() to keep the cgroup alive after cgroup_mutex is dropped, but dget() won't prevent cgroupfs from being umounted. When the race happens, vfs will see some dentries with a non-zero refcnt while umount is in progress.

Keep running this:

    mount -t cgroup -o blkio xxx /cgroup
    umount /cgroup

And this:

    modprobe cfq-iosched
    rmmod cfq-iosched

After a while, the BUG() in shrink_dcache_for_umount_subtree() may be triggered:

    BUG: Dentry xxx{i=0,n=blkio.yyy} still in use (1) [umount of cgroup cgroup]

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: [email protected]

2013-06-18  cgroup: disallow rename(2) if sane_behavior  (Tejun Heo; 1 file, -0/+7)

cgroup's rename(2) isn't a proper migration implementation: it can't move the cgroup to a different parent in the hierarchy. All it can do is swap the name string for that cgroup. This isn't useful and can mislead users into thinking that cgroup supports proper cgroup-level migration. Disallow rename(2) if sane_behavior.

v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs return value when ->rename is not implemented, as suggested by Li.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>

2013-06-18  irq/generic-chip: fix a few kernel-doc entries  (Uwe Kleine-König; 1 file, -3/+3)

Signed-off-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

2013-06-17  Merge 3.10-rc6 into driver-core-next  (Greg Kroah-Hartman; 20 files, -127/+206)

We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <[email protected]>

2013-06-17  ARM: sched_clock: Load cycle count after epoch stabilizes  (Stephen Boyd; 1 file, -11/+8)

There is a small race between when the cycle count is read from the hardware and when the epoch stabilizes. Consider this scenario:

    CPU0                           CPU1
    ----                           ----
    cyc = read_sched_clock()
    cyc_to_sched_clock()
                                   update_sched_clock()
                                    ...
                                    cd.epoch_cyc = cyc;
     epoch_cyc = cd.epoch_cyc;
     ...
     epoch_ns + cyc_to_ns((cyc - epoch_cyc)

The cyc on CPU0 was read before the epoch changed, but we calculate the nanoseconds based on the new epoch by subtracting the new epoch from the old cycle count. Since the epoch is most likely larger than the old cycle count, we calculate a large number that will be converted to nanoseconds and added to epoch_ns, causing time to jump forward too much.

Fix this problem by reading the hardware after the epoch has stabilized.

Cc: Russell King <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: John Stultz <[email protected]>

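The fix moves the counter read after the epoch-stabilization loop (a paraphrased sketch; names follow the arch/arm/kernel/sched_clock.c of that era, which detected torn epoch reads with an epoch_cyc_copy field and barriers rather than a seqlock):

    static unsigned long long notrace sched_clock(void)
    {
        u64 epoch_ns;
        u32 epoch_cyc, cyc;

        /* Loop until epoch_cyc and epoch_ns are seen consistently. */
        do {
            epoch_cyc = cd.epoch_cyc;
            smp_rmb();
            epoch_ns = cd.epoch_ns;
            smp_rmb();
        } while (epoch_cyc != cd.epoch_cyc_copy);

        /* The fix: sample the hardware only after the epoch is stable. */
        cyc = read_sched_clock();

        return epoch_ns +
               cyc_to_ns((cyc - epoch_cyc) & sched_clock_mask,
                         cd.mult, cd.shift);
    }
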
2013-06-18  irqdomain: Remove temporary MIPS workaround code  (Grant Likely; 1 file, -12/+0)

The MIPS interrupt controllers all register their own irq_domains now. Drop the MIPS-specific code because it is no longer needed.

Signed-off-by: Grant Likely <[email protected]>
Cc: [email protected]
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/5458/
Signed-off-by: Ralf Baechle <[email protected]>

2013-06-14  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds; 1 file, -1/+1)

Pull VFS fixes from Al Viro: "Several fixes + obvious cleanup (you've missed a couple of open-coded can_lookup() back then)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  snd_pcm_link(): fix a leak...
  use can_lookup() instead of direct checks of ->i_op->lookup
  move exit_task_namespaces() outside of exit_notify()
  fput: task_work_add() can fail if the caller has passed exit_task_work()
  ncpfs: fix rmdir returns Device or resource busy

2013-06-15  move exit_task_namespaces() outside of exit_notify()  (Oleg Nesterov; 1 file, -1/+1)

exit_notify() does exit_task_namespaces() after forget_original_parent(). This was needed to ensure that ->nsproxy can't be cleared prematurely: an exiting child we are going to reparent can do do_notify_parent() and use the parent's (our) pid_ns.

However, after commit 32084504 ("pidns: use task_active_pid_ns in do_notify_parent"), ->nsproxy != NULL is no longer needed; we rely on task_active_pid_ns().

Move exit_task_namespaces() from exit_notify() to do_exit(), after exit_fs() and before exit_task_work(). This solves the problem reported by Andrey: free_ipc_ns()->shm_destroy() does fput(), which needs task_work_add().

Note: this particular problem can be fixed if we change fput(), and that change makes sense anyway. But there is another reason to move the callsite: the original reason for calling exit_task_namespaces() from the middle of exit_notify() was subtle and has already gone away, so now this only looks confusing. And this allows us to simplify exit_notify(): we can avoid unlock/lock(tasklist) and we can use ->exit_state instead of PF_EXITING in forget_original_parent().

Reported-by: Andrey Vagin <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Acked-by: Andrey Vagin <[email protected]>
Signed-off-by: Al Viro <[email protected]>

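The new call order in do_exit(), schematically (a sketch showing only the calls relevant to this change):

    exit_fs(tsk);
    exit_task_namespaces(tsk);  /* moved here from exit_notify() */
    exit_task_work(tsk);        /* fput()'s task_work_add() still usable above */
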
2013-06-14  idle: Enable interrupts in the weak arch_cpu_idle() implementation  (James Bottomley; 1 file, -0/+1)

PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's caused by the weak arch_cpu_idle() implementation, which is provided to avoid architectures implementing idle_poll over and over.

The switchover to polling mode happens in the first call of the weak arch_cpu_idle() implementation, but that code fails to re-enable interrupts and therefore triggers the warning. Fix this by enabling interrupts in the weak arch_cpu_idle() code.

[ tglx: Made the changelog match the patch ]

Signed-off-by: James Bottomley <[email protected]>
Reviewed-by: Srivatsa S. Bhat <[email protected]>
Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike
Signed-off-by: Thomas Gleixner <[email protected]>

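Consistent with the -0/+1 diffstat, the weak implementation after the fix is essentially one added line (kernel/cpu/idle.c):

    void __weak arch_cpu_idle(void)
    {
        cpu_idle_force_poll = 1;
        local_irq_enable();     /* the fix: leave with interrupts enabled */
    }
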
2013-06-13  cpuset: rename @cont to @cgrp  (Li Zefan; 1 file, -16/+16)

Cont is short for container. control group was named process container at first, but then people found that container already has a meaning in the linux kernel. Clean up the leftover variable name @cont.

Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2013-06-13  cgroup: use percpu refcnt for cgroup_subsys_states  (Tejun Heo; 1 file, -61/+104)

A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have a significant adverse impact on scalability. For example, the css refcnt can be gotten and put multiple times by blkcg for each IO request. For high-iops configurations, which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive.

In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() map directly to the matching percpu_ref operations, and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing.

The only complication is that, as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved by collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnts are confirmed to be seen as killed on all CPUs. The previous patches already split destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily.

This patch removes css_refcnt(), which is used for the rcu dereference sanity check in css_id(). While we could add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead.

v2:
- init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems: the obvious lack of error handling, and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create().
- The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Kent Overstreet <[email protected]>
Acked-by: Li Zefan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: "Alasdair G. Kergon" <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Glauber Costa <[email protected]>

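The conversion maps the css helpers straight onto the percpu_ref primitives (a sketch, ignoring any root-css special casing of that era):

    static inline void css_get(struct cgroup_subsys_state *css)
    {
        percpu_ref_get(&css->refcnt);
    }

    static inline bool css_tryget(struct cgroup_subsys_state *css)
    {
        return percpu_ref_tryget(&css->refcnt);
    }

    static inline void css_put(struct cgroup_subsys_state *css)
    {
        percpu_ref_put(&css->refcnt);
    }
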
2013-06-13  Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu into for-3.11  (Tejun Heo; 18 files, -213/+195)

This is to receive percpu_refcount, which will replace the atomic_t reference count in cgroup_subsys_state.

Signed-off-by: Tejun Heo <[email protected]>

2013-06-13  cgroup: split cgroup destruction into two steps  (Tejun Heo; 1 file, -11/+27)

Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn(), which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating the release notification to the parent. The separation is to allow using a percpu refcnt for the css.

Note that this allows other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch, which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>

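The tail of the first half, schematically (a sketch; cgroup_offline_fn is the new second half described above, and INIT_WORK() now sits right before the queueing as the text says):

    /* at the end of cgroup_destroy_locked(): hand the rest to a work item */
    INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
    schedule_work(&cgrp->destroy_work);
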
2013-06-13  cgroup: reorder the operations in cgroup_destroy_locked()  (Tejun Heo; 1 file, -26/+35)

This patch reorders the operations in cgroup_destroy_locked() such that the userland-visible parts happen before css offlining and removal from the ->sibling list. This will be used to make the css use a percpu refcnt.

While at it, split out the CGRP_DEAD-related comment from the refcnt deactivation one, and correct / clarify how the different guarantees are met.

While this patch changes the specific order of operations, it shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>