path: root/kernel
2010-03-12  cgroups: subsystem module unloading  (Ben Blum, 1 file, -25/+142)
Provides support for unloading modular subsystems. This patch adds a new function cgroup_unload_subsys which is to be used for removing a loaded subsystem during module deletion. Reference counting of the subsystems' modules is moved from once (at load time) to once per attached hierarchy (in parse_cgroupfs_options and rebind_subsystems) (i.e., 0 or 1). Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
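A minimal sketch of the intended use (hypothetical module; of the names below, only cgroup_unload_subsys() comes from this commit):

    static struct cgroup_subsys example_subsys;     /* hypothetical subsystem */

    static void __exit example_cgroup_exit(void)
    {
            /* drops the subsys[] slot and detaches the subsystem */
            cgroup_unload_subsys(&example_subsys);
    }
    module_exit(example_cgroup_exit);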
2010-03-12  cgroups: subsystem module loading interface  (Ben Blum, 1 file, -5/+145)
Add an interface between cgroups subsystem management and module loading. This patch implements rudimentary module-loading support for cgroups - namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module initcall, and a struct module pointer in struct cgroup_subsys. Several functions that might be wanted by modules have had EXPORT_SYMBOL added to them, but it's unclear exactly which functions want it and which won't. Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
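A sketch of the module side (the "example" subsystem and its callbacks are placeholders; .module and cgroup_load_subsys() are the interface this patch adds):

    static struct cgroup_subsys example_subsys = {
            .name           = "example",
            .create         = example_create,       /* placeholder callbacks */
            .destroy        = example_destroy,
            .subsys_id      = example_subsys_id,
            .module         = THIS_MODULE,          /* new field */
    };

    static int __init example_cgroup_init(void)
    {
            return cgroup_load_subsys(&example_subsys);
    }
    module_init(example_cgroup_init);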
2010-03-12  cgroups: revamp subsys array  (Ben Blum, 1 file, -16/+80)
This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature. It provides an interface, cgroup_load_subsys() and cgroup_unload_subsys(), which modular subsystems can use to register and depart at runtime. The net_cls classifier subsystem serves as the example of a subsystem which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch #2.

Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof of concept.

Part of implementing patches #2 and #4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy). The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsystem was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code.

This patch: make subsys[] able to be dynamically populated to support modular subsystems. It reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to handle subsystems that are not present or may appear/disappear.

Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12  cgroup: introduce coalesce css_get() and css_put()  (Daisuke Nishimura, 1 file, -2/+3)
Current css_get() and css_put() increment/decrement css->refcnt one by one. This patch adds a new function, __css_get(), which takes "count" as an argument and increments css->refcnt by "count". It also adds a "count" argument to __css_put() and changes that function to decrement css->refcnt by "count". These coalesced versions of __css_get()/__css_put() will later be used to improve performance of memcg's moving charge feature, where they replace repeated calls to css_get()/css_put(). No change is needed for current users of css_get()/css_put(). Signed-off-by: Daisuke Nishimura <[email protected]> Acked-by: Paul Menage <[email protected]> Cc: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Li Zefan <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
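As a sketch (hypothetical caller; signatures as described above), the coalesced variants turn a per-page loop into a single adjustment each way:

    /* before: one atomic op per page */
    for (i = 0; i < nr_pages; i++)
            css_get(css);

    /* after: one coalesced adjustment */
    __css_get(css, nr_pages);       /* css->refcnt += nr_pages */
    /* ... use the references ... */
    __css_put(css, nr_pages);       /* css->refcnt -= nr_pages */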
2010-03-12  cgroup: introduce cancel_attach()  (Daisuke Nishimura, 1 file, -7/+33)
Add a cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when the can_attach() operation prepares something for the subsystem, but we need to roll back what can_attach() prepared if the task attach fails after can_attach() has already succeeded. Signed-off-by: Daisuke Nishimura <[email protected]> Acked-by: Li Zefan <[email protected]> Reviewed-by: Paul Menage <[email protected]> Cc: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
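Sketched against the 2.6.34-era callback signatures (the example handlers are hypothetical):

    static int example_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                  struct task_struct *tsk, bool threadgroup)
    {
            return example_reserve(tsk);    /* prepare; may fail */
    }

    static void example_cancel_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
                                      struct task_struct *tsk, bool threadgroup)
    {
            example_unreserve(tsk);         /* undo what can_attach() prepared */
    }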
2010-03-12  Add generic sys_olduname()  (Christoph Hellwig, 1 file, -0/+54)
Add generic implementations of the old and really old uname system calls. Note that sh only implements sys_olduname but not sys_oldolduname; I'm not going to bother with another ifdef for that special case. m32r implemented an old uname but never wired it up, so kill it, too. Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12  improve sys_newuname() for compat architectures  (Christoph Hellwig, 1 file, -0/+13)
On an architecture that supports 32-bit compat we need to override the reported machine in uname with the 32-bit value. Instead of doing this separately in every architecture introduce a COMPAT_UTS_MACHINE define in <asm/compat.h> and apply it directly in sys_newuname(). Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
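Roughly, the override then lives in one place (a sketch of the shape; COMPAT_UTS_MACHINE comes from <asm/compat.h> on compat-capable architectures):

    #ifdef COMPAT_UTS_MACHINE
    #define override_architecture(name) \
            (personality(current->personality) == PER_LINUX32 && \
             copy_to_user(name->machine, COMPAT_UTS_MACHINE, \
                          sizeof(COMPAT_UTS_MACHINE)))
    #else
    #define override_architecture(name) 0
    #endif

sys_newuname() can then apply it once after the normal copy-out instead of every architecture overriding the result itself.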
2010-03-12  Add generic sys_ipc wrapper  (Christoph Hellwig, 1 file, -0/+1)
Add a generic implementation of the ipc demultiplexer syscall. Except for s390 and sparc64, all implementations of sys_ipc are nearly identical. There are slight differences in the types of the parameters: mips and powerpc, as the only 64-bit architectures with sys_ipc, use unsigned long for the "third" argument as it gets cast to a pointer later, while it traditionally is an "int" like most other parameters. frv goes even further and uses unsigned long for all parameters except for "ptr", which is a pointer type everywhere. The change from int to unsigned long for "third" and back to "int" for the others on frv should be fine due to the in-register calling conventions for syscalls (we already had a similar issue with the generic sys_ptrace), but I'd prefer to have the arch maintainers look over this in detail. Except for that, h8300, m68k and m68knommu lack an implementation of the semtimedop sub call, which this patch adds, and various architectures have gets used - at least on i386 it seems superfluous as the compat code on x86-64 and ia64 doesn't even bother to implement it. [[email protected]: add sys_ipc to sys_ni.c] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Reviewed-by: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Acked-by: Jesper Nilsson <[email protected]> Acked-by: Russell King <[email protected]> Acked-by: David Howells <[email protected]> Acked-by: Kyle McMartin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
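An abbreviated sketch of the demultiplexer's shape (the real switch covers SEMOP through SHMCTL, and the high 16 bits of "call" select old/new variants of some sub calls):

    SYSCALL_DEFINE6(ipc, unsigned int, call, int, first, unsigned long, second,
                    unsigned long, third, void __user *, ptr, long, fifth)
    {
            switch (call & 0xffff) {
            case SEMOP:
                    return sys_semtimedop(first, (struct sembuf __user *)ptr,
                                          second, NULL);
            case SEMGET:
                    return sys_semget(first, second, third);
            /* ... MSGSND, MSGRCV, SHMAT and friends follow the same pattern ... */
            default:
                    return -ENOSYS;
            }
    }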
2010-03-12  posix-cpu-timers: Reset expire cache when no timer is running  (Stanislaw Gruszka, 1 file, -3/+7)
When a process deletes a cpu timer or a timer expires, we do not clear the expiration cache sig->cputimer_expires. As a result fastpath_timer_check(), which is supposed to prevent us from looping over all threads when no timer is active, does not work, and we run the slow path needlessly on every tick. Zero sig->cputimer_expires in stop_process_timers(). Signed-off-by: Stanislaw Gruszka <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Hidetoshi Seto <[email protected]> Cc: Spencer Candland <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
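A sketch of the fix, assuming the expiry cache is the task_cputime member of signal_struct (simplified; the existing locking and cputimer handling are omitted):

    static void stop_process_timers(struct task_struct *tsk)
    {
            struct signal_struct *sig = tsk->signal;

            /* ... existing code marking the group cputimer as not running ... */

            /* reset the expire cache so fastpath_timer_check() works again */
            sig->cputime_expires.prof_exp = cputime_zero;
            sig->cputime_expires.virt_exp = cputime_zero;
            sig->cputime_expires.sched_exp = 0;
    }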
2010-03-12  timer stats: Fix del_timer_sync() and try_to_del_timer_sync()  (Andrew Morton, 1 file, -0/+1)
These functions forgot to run timer_stats_timer_clear_start_info(). It's unobvious what effect this has and whether it matters much - we won't be printing it out anyway if the timer's detached. Untested, just an Ingo trollpatch. [ Nevertheless correct - tglx ] Signed-off-by: Andrew Morton <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2010-03-12  clockevents: Sanitize min_delta_ns adjustment and prevent overflows  (Thomas Gleixner, 2 files, -13/+42)
The current logic which handles clock events programming failures can increase min_delta_ns without limit and can even cause overflows. Sanitize it by:

 - preventing a zero increase when min_delta_ns == 1
 - limiting min_delta_ns to one jiffie
 - bailing out if the jiffie limit is hit
 - adding retry stats to /proc/timer_list so we can gather data

Reported-by: Uwe Kleine-Koenig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2010-03-11  sched: Fix pick_next_highest_task_rt() for cgroups  (Peter Zijlstra, 1 file, -1/+6)
Since pick_next_highest_task_rt() already iterates all the cgroups and is really only interested in tasks, skip over the !task entries. Reported-by: Dhaval Giani <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Tested-by: Dhaval Giani <[email protected]> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-11  perf: export perf_trace_regs and perf_arch_fetch_caller_regs  (Xiao Guangrong, 1 file, -0/+1)
Export perf_trace_regs and perf_arch_fetch_caller_regs, since modules will use these. Signed-off-by: Xiao Guangrong <[email protected]> [ use EXPORT_PER_CPU_SYMBOL_GPL() ] Signed-off-by: Peter Zijlstra <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-11  kprobes: Calculate the index correctly when freeing the out-of-line execution slot  (Masami Hiramatsu, 1 file, -1/+2)
From: Ananth N Mavinakayanahalli <[email protected]> When freeing the instruction slot, the arithmetic to calculate the index of the slot in the page needs to account for the total size of the instruction on the various architectures. Calculate the index correctly when freeing the out-of-line execution slot. Reported-by: Sachin Sant <[email protected]> Reported-by: Heiko Carstens <[email protected]> Signed-off-by: Ananth N Mavinakayanahalli <[email protected]> Signed-off-by: Masami Hiramatsu <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-11  sched: Cleanup: remove unused variable in try_to_wake_up()  (Dan Carpenter, 1 file, -2/+2)
We haven't used the "orig_rq" variable since 055a00865d "Fix/add missing update_rq_clock() calls" Signed-off-by: Dan Carpenter <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Andreas Herrmann <[email protected]> Cc: Gautham R Shenoy <[email protected]> Cc: [email protected] LKML-Reference: <20100306111752.GL4958@bicker> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-11  Merge branch 'tip/tracing/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into tracing/urgent  (Ingo Molnar, 4 files, -12/+48)
2010-03-11  rcu: Increase RCU CPU stall timeouts if PROVE_RCU  (Paul E. McKenney, 1 file, -6/+15)
CONFIG_PROVE_RCU imposes additional overhead on the kernel, so increase the RCU CPU stall timeouts in an attempt to allow for this effect. Signed-off-by: Paul E. McKenney <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-11  ftrace: Replace read_barrier_depends() with rcu_dereference_raw()  (Paul E. McKenney, 1 file, -9/+13)
Replace the calls to read_barrier_depends() in ftrace_list_func() with rcu_dereference_raw() to improve readability. The reason that we use rcu_dereference_raw() here is that removed entries are never freed, instead they are simply leaked. This is one of a very few cases where use of rcu_dereference_raw() is the long-term right answer. And I don't yet know of any others. ;-) Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
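A sketch of the change in ftrace_list_func(), simplified from the description above:

    /* before: an explicit barrier paired with every dependent load */
    op = ftrace_list;
    while (op != &ftrace_list_end) {
            read_barrier_depends();
            op->func(ip, parent_ip);
            op = op->next;
    }

    /* after: the dependency barrier is folded into the accessor */
    op = rcu_dereference_raw(ftrace_list);
    while (op != &ftrace_list_end) {
            op->func(ip, parent_ip);
            op = rcu_dereference_raw(op->next);
    }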
2010-03-11  perf_event: Fix oops triggered by cpu offline/online  (Paul Mackerras, 1 file, -1/+12)
Anton Blanchard found that he could reliably make the kernel hit a BUG_ON in the slab allocator by taking a cpu offline and then online while a system-wide perf record session was running. The reason is that when the cpu comes up, we completely reinitialize the ctx field of the struct perf_cpu_context for the cpu. If there is a system-wide perf record session running, then there will be a struct perf_event that has a reference to the context, so its refcount will be 2. (The perf_event has been removed from the context's group_entry and event_entry lists by perf_event_exit_cpu(), but that doesn't remove the perf_event's reference to the context and doesn't decrement the context's refcount.) When the cpu comes up, perf_event_init_cpu() gets called, and it calls __perf_event_init_context() on the cpu's context. That resets the refcount to 1. Then when the perf record session finishes and the perf_event is closed, the refcount gets decremented to 0 and the context gets kfreed after an RCU grace period. Since the context wasn't kmalloced -- it's part of a per-cpu variable -- bad things happen. In fact we don't need to completely reinitialize the context when the cpu comes up. It's sufficient to initialize the context once at boot, but we need to do it for all possible cpus. This moves the context initialization to happen at boot time. With this, we don't trash the refcount and the context never gets kfreed, and we don't hit the BUG_ON. Reported-by: Anton Blanchard <[email protected]> Signed-off-by: Paul Mackerras <[email protected]> Tested-by: Anton Blanchard <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-10  genirq: Prevent oneshot irq thread race  (Thomas Gleixner, 2 files, -9/+40)
Lars-Peter pointed out that the oneshot threaded interrupt handler code has the following race:

    CPU0                          CPU1
    handle_level_irq(irq X)
    mask_ack_irq(irq X)
    handle_IRQ_event(irq X)
    wake_up(thread_handler)
                                  thread handler(irq X) runs
                                  finalize_oneshot(irq X)
                                  does not unmask due to
                                  !(desc->status & IRQ_MASKED)
    return from irq
    does not unmask due to
    (desc->status & IRQ_ONESHOT)

This leaves the interrupt line masked forever. The reason for this is the inconsistent handling of the IRQ_MASKED flag: instead of setting it in the mask function, the oneshot support sets the flag after waking up the irq thread. The solution for this is to set/clear the IRQ_MASKED status whenever we mask/unmask an interrupt line. That's the easy part, but that cleanup opens another race:

    CPU0                          CPU1
    handle_level_irq(irq)
    mask_ack_irq(irq)
    handle_IRQ_event(irq)
    wake_up(thread_handler)
                                  thread handler(irq) runs
                                  finalize_oneshot_irq(irq)
                                  unmask(irq)
    irq triggers again
    handle_level_irq(irq)
    mask_ack_irq(irq)
    return from irq due to IRQ_INPROGRESS
                                  return from irq
                                  does not unmask due to
                                  (desc->status & IRQ_ONESHOT)

This requires that we synchronize finalize_oneshot_irq() with the primary handler: if IRQ_INPROGRESS is set, we wait until the primary handler on the other CPU has returned before unmasking the interrupt line again. We probably have never seen that problem because it does not happen on UP, and on SMP the irqbalancer protects us by pinning the primary handler and the thread to the same CPU. Reported-by: Lars-Peter Clausen <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected]
2010-03-10  perf: Drop the obsolete profile naming for trace events  (Frederic Weisbecker, 6 files, -76/+76)
Drop the obsolete "profile" naming used by perf for trace events. Perf can now do more than simple events counting, so generalize the API naming. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Jason Baron <[email protected]>
2010-03-10  perf: Take a hot regs snapshot for trace events  (Frederic Weisbecker, 4 files, -11/+9)
We are taking a wrong regs snapshot when a trace event triggers. Either we use get_irq_regs(), which gives us the interrupted registers if we are in an interrupt, or we use task_pt_regs(), which gives us the state before we entered the kernel, assuming we are lucky enough not to be a kernel thread, in which case task_pt_regs() returns the initial set of regs when the kernel thread was started.

What we want is different. We need a hot snapshot of the regs, so that we can get the instruction pointer to record in the sample, the frame pointer for the callchain, and some other things. Let's use the new perf_fetch_caller_regs() for that.

Comparison with perf record -e lock: -R -a -f -g

Before:

    perf  [kernel]  [k]  __do_softirq
       |
       --- __do_softirq
          |
          |--55.16%-- __open
          |
           --44.84%-- __write_nocancel

After:

    perf  [kernel]  [k]  perf_tp_event
       |
       --- perf_tp_event
          |
          |--41.07%-- lock_acquire
          |          |
          |          |--39.36%-- _raw_spin_lock
          |          |          |
          |          |          |--7.81%-- hrtimer_interrupt
          |          |          |          smp_apic_timer_interrupt
          |          |          |          apic_timer_interrupt

The old case was producing unreliable callchains. Now having the right frame and instruction pointers, we have the trace we want. Also syscalls and kprobe events already have the right regs, let's use them instead of wasting a retrieval. v2: Follow the rename perf_save_regs() -> perf_fetch_caller_regs() Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Jason Baron <[email protected]> Cc: Archs <[email protected]>
2010-03-10  perf: Introduce new perf_fetch_caller_regs() for hot regs snapshot  (Frederic Weisbecker, 1 file, -0/+5)
Events that trigger overflows by interrupting a context can use get_irq_regs() or task_pt_regs() to retrieve the state when the event triggered. But this is not the case for some other classes of events like trace events, as tracepoints are executed in the same context as the code that triggered the event. It means we need a different API to capture the regs there, namely we need a hot snapshot to get the most important information for perf: the instruction pointer to get the event origin, the frame pointer for the callchain, the code segment for user_mode() tests (we always use __KERNEL_CS as trace events always occur from the kernel) and the eflags for further purposes. v2: rename perf_save_regs to perf_fetch_caller_regs as per Masami's suggestion. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Jason Baron <[email protected]> Cc: Archs <[email protected]>
2010-03-10  lockdep: Move lock events under lockdep recursion protection  (Frederic Weisbecker, 1 file, -6/+3)
There are rcu locked read side areas in the path where we submit a trace event. And these rcu_read_(un)lock() trigger lock events, which create recursive events.

One pair in do_perf_sw_event:

    __lock_acquire
       |
       |--96.11%-- lock_acquire
       |    |
       |    |--27.21%-- do_perf_sw_event
       |    |          perf_tp_event
       |    |    |
       |    |    |--49.62%-- ftrace_profile_lock_release
       |    |    |          lock_release
       |    |    |    |
       |    |    |    |--33.85%-- _raw_spin_unlock

Another pair in perf_output_begin/end:

    __lock_acquire
       |--23.40%-- perf_output_begin
       |    |      __perf_event_overflow
       |    |      perf_swevent_overflow
       |    |      perf_swevent_add
       |    |      perf_swevent_ctx_event
       |    |      do_perf_sw_event
       |    |      perf_tp_event
       |    |    |
       |    |    |--55.37%-- ftrace_profile_lock_acquire
       |    |    |          lock_acquire
       |    |    |    |
       |    |    |    |--37.31%-- _raw_spin_lock

The problem is not so much the trace recursion itself, as we have a recursion protection already (though it's always wasteful to recurse). But the trace events are outside the lockdep recursion protection, so each lockdep event triggers a lock trace, which will trigger two other lockdep events. Here the recursive lock trace event won't be taken because of the trace recursion, so the recursion stops there but lockdep will still analyse these new events. To sum up, for each lockdep event we have:

    lock_*()
    |
    trace lock_acquire
        |
        ----- rcu_read_lock()
        |          |
        |          lock_acquire()
        |          |
        |          trace_lock_acquire() (stopped)
        |          |
        |          lockdep analyze
        |
        ----- rcu_read_unlock()
                   |
                   lock_release
                   |
                   trace_lock_release() (stopped)
                   |
                   lockdep analyze

And you can repeat the above two times, as we have two rcu read side sections when we submit an event. This is fixed in this patch by moving the lock trace event under the lockdep recursion protection. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Hitoshi Mitake <[email protected]> Cc: Li Zefan <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Jens Axboe <[email protected]>
2010-03-10  perf: Provide better condition for event rotation  (Peter Zijlstra, 1 file, -6/+15)
Try to avoid useless rotation and PMU disables. [ Could be improved by keeping a nr_runnable count to better account for the < PERF_STAT_INACTIVE counters ] Signed-off-by: Peter Zijlstra <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-10  perf: Optimize perf_disable  (Peter Zijlstra, 1 file, -13/+3)
Currently we always call hw_perf_disable(), even if it's already disabled; this seems superfluous, esp. since it cannot be made NMI safe (see further patches). Signed-off-by: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Arnaldo Carvalho de Melo <[email protected]> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-10  perf: Rework and fix the arch CPU-hotplug hooks  (Peter Zijlstra, 1 file, -15/+0)
Remove the hw_perf_event_*() hotplug hooks in favour of per PMU hotplug notifiers. This has the advantage of reducing the static weak interface as well as exposing all hotplug actions to the PMU. Use this to fix x86 hotplug usage where we did things in ONLINE which should have been done in UP_PREPARE or STARTING. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Paul Mundt <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Arnaldo Carvalho de Melo <[email protected]> LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2010-03-10  perf: Provide generic perf_sample_data initialization  (Peter Zijlstra, 1 file, -13/+8)
This makes it easier to extend perf_sample_data and fixes a bug on arm and sparc, which failed to set ->raw to NULL, which can cause crashes when combined with PERF_SAMPLE_RAW. It also optimizes PowerPC and tracepoint, because the struct initialization is forced to zero out the whole structure. Signed-off-by: Peter Zijlstra <[email protected]> Acked-by: Jean Pihet <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Acked-by: David S. Miller <[email protected]> Cc: Jamie Iles <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Stephane Eranian <[email protected]> Cc: [email protected] LKML-Reference: <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
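The helper is small; a sketch consistent with the description (set ->addr and clear the ->raw field that arm and sparc forgot):

    static inline void perf_sample_data_init(struct perf_sample_data *data,
                                             u64 addr)
    {
            data->addr = addr;
            data->raw  = NULL;      /* stale ->raw + PERF_SAMPLE_RAW == crash */
    }

Callers then write perf_sample_data_init(&data, 0) instead of a struct initializer that forces the compiler to zero out the whole structure.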
2010-03-09  Merge commit 'v2.6.34-rc1' into perf/urgent  (Ingo Molnar, 79 files, -3457/+5379)
Conflicts:
    tools/perf/util/probe-event.c

Merge reason: Pick up -rc1 and resolve the conflict as well. Signed-off-by: Ingo Molnar <[email protected]>
2010-03-08  Merge branch 'for-next' into for-linus  (Jiri Kosina, 8 files, -12/+12)
Conflicts:
    Documentation/filesystems/proc.txt
    arch/arm/mach-u300/include/mach/debug-macro.S
    drivers/net/qlge/qlge_ethtool.c
    drivers/net/qlge/qlge_main.c
    drivers/net/typhoon.c
2010-03-07  sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes  (Eric W. Biederman, 1 file, -0/+3)
A little more whack-a-mole annotating the dynamic sysfs attributes. I had everything built into my earlier test kernel, and so I missed these. Signed-off-by: Eric W. Biederman <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07  sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes  (Eric W. Biederman, 1 file, -0/+1)
These are the non-static sysfs attributes that exist on my test machine. Fix them to use sysfs_attr_init or sysfs_bin_attr_init as appropriate. It simply requires making a sysfs attribute present to see this. So this is a little bit tedious but otherwise not too bad. Signed-off-by: Eric W. Biederman <[email protected]> Acked-by: WANG Cong <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07  Driver core: Constify struct sysfs_ops in struct kobj_type  (Emese Revfy, 1 file, -1/+1)
Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification:

 * prevents modification of data that is shared (referenced) by many other structure instances at runtime
 * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime
 * potentially better optimized code, as the compiler can assume that the const data cannot be changed
 * the compiler/linker move const data into .rodata and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <[email protected]> Acked-by: David Teigland <[email protected]> Acked-by: Matt Domsch <[email protected]> Acked-by: Maciej Sosnowski <[email protected]> Acked-by: Hans J. Koch <[email protected]> Acked-by: Pekka Enberg <[email protected]> Acked-by: Jens Axboe <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07  kobject: Constify struct kset_uevent_ops  (Emese Revfy, 1 file, -1/+1)
Constify struct kset_uevent_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification:

 * prevents modification of data that is shared (referenced) by many other structure instances at runtime
 * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime
 * potentially better optimized code, as the compiler can assume that the const data cannot be changed
 * the compiler/linker move const data into .rodata and therefore exclude them from false sharing

Signed-off-by: Emese Revfy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07  sysdev: Pass attribute in sysdev_class attributes show/store  (Andi Kleen, 2 files, -3/+14)
Passing the attribute to the low level IO functions allows all kinds of cleanups, by sharing low level IO code without requiring an own function for every piece of data. Also drivers can extend the attributes with own data fields and use that in the low level function. Similar to sysdev_attributes and normal attributes. This is a tree-wide sweep, converting everything in one go. No functional changes in this patch other than passing the new argument everywhere. Tested on x86, the non x86 parts are uncompiled. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-06  elf coredump: add extended numbering support  (Daisuke HATAYAMA, 1 file, -0/+5)
The current ELF dumper implementation can produce broken corefiles if program headers exceed 65535. This number is determined by the number of vmas which the process has. In particular, some extreme programs may use more than 65535 vmas. (If you google max_map_count, you can find some users facing this problem.) This kind of program can never generate correct coredumps. This patch implements ``extended numbering'' that uses the sh_info field of the first section header instead of the e_phnum field in order to represent up to 4294967295 vmas. This is supported by the AMD64-ABI (http://www.x86-64.org/documentation.html) and Solaris (http://docs.sun.com/app/docs/doc/817-1984/). Of course, we are preparing patches for gdb and binutils. Signed-off-by: Daisuke HATAYAMA <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: David Howells <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Alan Cox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
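A simplified sketch of the convention (PN_XNUM is the 0xffff escape value from the ELF spec; e_shoff/e_shnum bookkeeping is omitted and the helper name is illustrative):

    static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr0,
                                 int segs)
    {
            elf->e_phnum = segs > PN_XNUM ? PN_XNUM : segs;
            if (segs > PN_XNUM) {
                    /* section header 0 carries the real program header count */
                    memset(shdr0, 0, sizeof(*shdr0));
                    shdr0->sh_type = SHT_NULL;
                    shdr0->sh_info = segs;
            }
    }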
2010-03-06  elf coredump: replace ELF_CORE_EXTRA_* macros by functions  (Daisuke HATAYAMA, 2 files, -0/+26)
elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding macros for hiding _multiline_ logic in functions. This patch removes the #ifdefs and replaces ELF_CORE_EXTRA_* by corresponding functions. For architectures not implementing ELF_CORE_EXTRA_*, we use weak functions in order to reduce the range of modification. This cleanup is for my next patches, but I think this cleanup itself is worth doing regardless of my final purpose. Signed-off-by: Daisuke HATAYAMA <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: David Howells <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Alan Cox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  printk: avoid warning when CONFIG_PRINTK is disabled  (Gustavo F. Padovan, 1 file, -2/+1)
kernel/printk.c:72: warning: `saved_console_loglevel' defined but not used

Signed-off-by: Gustavo F. Padovan <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  kernel/pid.c: update comment on find_task_by_pid_ns  (Tetsuo Handa, 1 file, -1/+1)
tasklist_lock does protect the task and its pid, it can't go away. The problem is that find_pid_ns() itself is unsafe without rcu lock, it can race with copy_process()->free_pid(any_pid). Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make it possible to call find_task_by_pid_ns() under tasklist safely, but we don't do so because we are trying to get rid of the read_lock sites of tasklist_lock. Signed-off-by: Tetsuo Handa <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  panic: fix panic_timeout accuracy when running on a hypervisor  (Anton Blanchard, 1 file, -16/+30)
I've had some complaints about panic_timeout being wildly inaccurate on shared processor PowerPC partitions (a 3 minute panic_timeout taking 30 minutes). The problem is we loop on mdelay(1), and with a 1ms-in-10ms hypervisor timeslice each of these will take 10ms (i.e. 10x longer). I expect other platforms with shared processor hypervisors will see the same issue. This patch keeps the old behaviour if we have a panic_blink (only keyboard LEDs right now) and does 1 second mdelays if we don't. Signed-off-by: Anton Blanchard <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
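A sketch of the described behaviour (simplified; panic_blink here is assumed to follow the old convention of returning the milliseconds it consumed):

    int i;

    if (panic_timeout > 0) {
            if (panic_blink) {
                    /* old behaviour: 1ms steps so the LEDs keep blinking */
                    for (i = 0; i < panic_timeout * 1000; i++) {
                            touch_nmi_watchdog();
                            i += panic_blink(i);
                            mdelay(1);
                    }
            } else {
                    /* one long delay per second: far fewer hypervisor
                     * timeslice round-trips on shared processors */
                    for (i = 0; i < panic_timeout; i++) {
                            touch_nmi_watchdog();
                            mdelay(1000);
                    }
            }
            emergency_restart();
    }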
2010-03-06  kernel core: use helpers for rlimits  (Jiri Slaby, 7 files, -16/+20)
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: john stultz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
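A sketch of the pattern (RLIMIT_STACK chosen only for illustration):

    unsigned long limit;

    /* before: each use re-reads current->signal->rlim[...].rlim_cur,
     * so two reads may observe two different values */
    if (size > current->signal->rlim[RLIMIT_STACK].rlim_cur)
            return -ENOMEM;

    /* after: fetch once via the helper, then work on the local copy */
    limit = rlimit(RLIMIT_STACK);   /* soft limit; rlimit_max() for hard */
    if (size > limit)
            return -ENOMEM;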
2010-03-06  posix-cpu-timers: cleanup rlimits usage  (Jiri Slaby, 1 file, -15/+17)
Fetch rlimit values (both hard and soft) only once and work on them. It removes many accesses through the sig structure and makes the code cleaner. Mostly a preparation for writable resource limits support. Signed-off-by: Jiri Slaby <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: john stultz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  kernel/exit.c: fix shadows sparse warning  (Thiago Farina, 1 file, -1/+1)
kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
kernel/exit.c:1173:21: originally declared here

Signed-off-by: Thiago Farina <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  includecheck fix for kernel/params.c  (Jaswinder Singh Rajput, 1 file, -1/+0)
Fix the following 'make includecheck' warning:

    kernel/params.c: linux/string.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <[email protected]> Cc: André Goddard Rosa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  splice: comparing unsigned int < 0  (Dan Carpenter, 1 file, -2/+3)
"ret" needs to be signed or the error handling for splice_to_pipe() won't work correctly. Signed-off-by: Dan Carpenter <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  kernel/cpu.c: delete deprecated definition in cpu_up()  (Chen Gong, 1 file, -1/+1)
Additional_cpus is only supported for IA64 now. X86_64 should not be included. Signed-off-by: Chen Gong <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm/pm: force GFP_NOIO during suspend/hibernation and resume  (Rafael J. Wysocki, 2 files, -0/+12)
There are quite a few GFP_KERNEL memory allocations made during suspend/hibernation and resume that may cause the system to hang, because the I/O operations they depend on cannot be completed due to the underlying devices being suspended. Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in gfp_allowed_mask before suspend/hibernation and restoring the original values of these bits in gfp_allowed_mask during the subsequent resume. [[email protected]: fix CONFIG_PM=n linkage] Signed-off-by: Rafael J. Wysocki <[email protected]> Reported-by: Maxim Levitsky <[email protected]> Cc: Sebastian Ott <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
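A sketch of the masking (the helper names are illustrative, not necessarily the ones the patch introduces):

    static gfp_t saved_gfp_mask;

    void pm_restrict_gfp_mask(void)         /* before suspend/hibernation */
    {
            saved_gfp_mask = gfp_allowed_mask;
            gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS); /* GFP_KERNEL acts as GFP_NOIO */
    }

    void pm_restore_gfp_mask(void)          /* after resume */
    {
            gfp_allowed_mask = saved_gfp_mask;
    }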
2010-03-06  mm: change anon_vma linking to fix multi-process server scalability issue  (Rik van Riel, 1 file, -1/+5)
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [[email protected]: cleanups] Signed-off-by: Rik van Riel <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
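The linking object implied by the description, as a sketch: one chain entry per (vma, anon_vma) pair, threaded onto two lists at once:

    struct anon_vma_chain {
            struct vm_area_struct *vma;     /* the VMA this pair belongs to */
            struct anon_vma *anon_vma;      /* one of the VMA's anon_vmas */
            struct list_head same_vma;      /* all pairs of one VMA */
            struct list_head same_anon_vma; /* all pairs of one anon_vma */
    };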
2010-03-06  mm: avoid false sharing of mm_counter  (KAMEZAWA Hiroyuki, 1 file, -1/+2)
Considering the nature of per-mm stats, it's a shared object among threads and can be a cache-miss point in the page fault path. This patch adds a per-thread cache for mm_counter. The RSS value will be counted into a struct in task_struct and synchronized with the mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault; the per-thread value is added to the mm's every 64 calls. A rough estimation with a small benchmark on parallel threads (2 threads) shows:

    [before] 4.5 cache-miss/faults
    [after]  4.0 cache-miss/faults

Anyway, the most contended object is mmap_sem if the number of threads grows. [[email protected]: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06  mm: clean up mm_counter  (KAMEZAWA Hiroyuki, 2 files, -2/+2)
Presently, the per-mm statistics counter is defined by macros in sched.h. This patch modifies it to be:

 - defined in mm.h as inline functions
 - backed by an array instead of macro name generation

This patch is for reducing patch size in a future patch that modifies the implementation of the per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>