path: root/kernel
Age | Commit message | Author | Files | Lines
2010-03-12 | nsproxy: remove INIT_NSPROXY() | Alexey Dobriyan | 1 | -1/+12
Remove INIT_NSPROXY(), use C99 initializer. Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it. Note: headers trim will be done later, now it's quite pointless because results will be invalidated by merge window. Signed-off-by: Alexey Dobriyan <[email protected]> Acked-by: Serge Hallyn <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
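The change from an initializer macro to a C99 designated initializer can be sketched in userspace as follows. This is a minimal illustration only: `struct proxy`, `INIT_PROXY` and the two namespace structs are hypothetical stand-ins, not the kernel's `struct nsproxy` or its former `INIT_NSPROXY()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real struct nsproxy has different members. */
struct uts_ns { int id; };
struct ipc_ns { int id; };

struct proxy {
	int count;
	struct uts_ns *uts;
	struct ipc_ns *ipc;
	void *net;	/* deliberately not named in the initializers below */
};

struct uts_ns init_uts = { .id = 1 };
struct ipc_ns init_ipc = { .id = 2 };

/* Old style: the initializer is hidden behind a macro kept in a header. */
#define INIT_PROXY { .count = 1, .uts = &init_uts, .ipc = &init_ipc }
struct proxy via_macro = INIT_PROXY;

/* C99 style: designated initializer written at the definition site. */
struct proxy via_c99 = {
	.count = 1,
	.uts   = &init_uts,
	.ipc   = &init_ipc,
};

/* Both forms yield the same object; unnamed members start out zeroed. */
void check_proxies(void)
{
	assert(via_c99.count == via_macro.count);
	assert(via_c99.uts == via_macro.uts);
	assert(via_c99.ipc == via_macro.ipc);
	assert(via_c99.net == NULL);
}
```

Calling `check_proxies()` from a `main()` exercises both objects. The practical gain of the second form is that the initial values sit next to the definition instead of in a header.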
2010-03-12 | pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig() | Oleg Nesterov | 1 | -4/+3
zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be delivered to sub-namespace inits as well. This is correct, but we are going to change force_sig_info() semantics. See http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31 We can use send_sig_info(SEND_SIG_NOINFO) instead, since 614c517d7c00af1b26ded20646b329397d6f51a1 ("signals: SEND_SIG_NOINFO should be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and therefore send_signal() will get the correct from_ancestor_ns = T flag. Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Serge Hallyn <[email protected]> Acked-by: Linus Torvalds <[email protected]> Acked-by: Roland McGrath <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | copy_signal() cleanup: clean thread_group_cputime_init() | Veaceslav Falico | 1 | -11/+0
Remove unneeded initializations in thread_group_cputime_init() and in posix_cpu_timers_init_group(). They are useless after kmem_cache_zalloc() was used in copy_signal(). Signed-off-by: Veaceslav Falico <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Roland McGrath <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct() | Veaceslav Falico | 1 | -10/+0
Kill unused functions taskstats_tgid_init() and acct_init_pacct() because we don't use them anywhere after using kmem_cache_zalloc() in copy_signal(). Signed-off-by: Veaceslav Falico <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Oleg Nesterov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | copy_signal() cleanup: use zalloc and remove initializations | Veaceslav Falico | 1 | -26/+1
Use kmem_cache_zalloc() on signal creation and remove unneeded initialization lines in copy_signal(). Signed-off-by: Veaceslav Falico <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Cc: Roland McGrath <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
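Why a zeroing allocator makes the explicit initializations redundant can be shown with a userspace analogue. The sketch below assumes `calloc()` as a stand-in for `kmem_cache_zalloc()` and a hypothetical `struct sig_state` that is far smaller than the kernel's `struct signal_struct`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, much smaller than the kernel's struct signal_struct. */
struct sig_state {
	int count;
	int group_exit_code;
	unsigned long utime;
	void *tty;
};

/* Before: plain allocation, every zero-valued field is set by hand. */
struct sig_state *sig_alloc_manual(void)
{
	struct sig_state *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->count = 1;
	s->group_exit_code = 0;
	s->utime = 0;
	s->tty = NULL;
	return s;
}

/*
 * After: zeroing allocation, only the non-zero fields need a line.
 * This relies on all-bits-zero being a null pointer, which holds on
 * the platforms Linux runs on.
 */
struct sig_state *sig_alloc_zeroed(void)
{
	struct sig_state *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->count = 1;
	return s;
}
```

Both functions return objects with identical contents; the second simply has fewer lines to keep in sync when fields are added.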
2010-03-12 | cgroups: remove events before destroying subsystem state objects | Kirill A. Shutemov | 1 | -0/+8
Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: Kirill A. Shutemov <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Paul Menage <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Dan Malek <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroups: fix race between userspace and kernelspace | Kirill A. Shutemov | 1 | -15/+17
Notify userspace about cgroup removal only after rmdir of the cgroup directory, to avoid a race between userspace and kernelspace. Eventfds are used to notify about two types of event: - control file-specific, like crossing a memory threshold; - cgroup removal. To understand what really happened, userspace can check whether the cgroup still exists. To avoid a race between userspace and kernelspace we have to notify userspace about cgroup removal only after rmdir of the cgroup directory. Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Paul Menage <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Dan Malek <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroup: implement eventfd-based generic API for notifications | Kirill A. Shutemov | 1 | -1/+227
This patchset introduces an eventfd-based API for notifications in cgroups and implements memory notifications on top of it. It uses statistics in the memory controller to track memory usage.
Output of time(1) on building kernel on tmpfs:
Root cgroup before changes: make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes: make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds): make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds): make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed): make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed): make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total
This patch: Introduce the write-only file "cgroup.event_control" in every cgroup. To register a new notification handler you need to:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control. Interpretation of args is defined by the control file implementation.
The eventfd will be woken up by the control file implementation or when the cgroup is removed. To unregister a notification handler just close the eventfd. If you need notification functionality for a control file you have to implement the callbacks register_event() and unregister_event() in the struct cftype.
[[email protected]: Kconfig fix] Signed-off-by: Kirill A. Shutemov <[email protected]> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]> Paul Menage <[email protected]> Cc: Li Zefan <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Pavel Emelyanov <[email protected]> Cc: Dan Malek <[email protected]> Cc: Vladislav Buzov <[email protected]> Cc: Daisuke Nishimura <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Davide Libenzi <[email protected]> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
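The registration line written to cgroup.event_control has the form "<event_fd> <control_fd> <args>". A small helper that builds it might look like the sketch below; `format_event_control` is a hypothetical name, and the helper only formats the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build the "<event_fd> <control_fd> <args>" registration line.
 * Returns the string length, or -1 if buf is too small.
 */
int format_event_control(char *buf, size_t len, int event_fd,
			 int control_fd, const char *args)
{
	int n;

	assert(buf != NULL && args != NULL);
	n = snprintf(buf, len, "%d %d %s", event_fd, control_fd, args);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

In a real program `event_fd` would come from eventfd(2), `control_fd` from opening the control file to be monitored, and the resulting line would be written to the cgroup's cgroup.event_control file. Those steps need a mounted cgroup hierarchy and are left out here.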
2010-03-12 | cgroups: clean up cgroup_pidlist_find() a bit | Li Zefan | 1 | -5/+3
Don't call get_pid_ns() before we locate/alloc the ns. Signed-off-by: Li Zefan <[email protected]> Cc: Serge Hallyn <[email protected]> Acked-by: Paul Menage <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroups: blkio subsystem as module | Ben Blum | 1 | -0/+9
Modify the Block I/O cgroup subsystem to be able to be built as a module. As the CFQ disk scheduler optionally depends on blk-cgroup, config options in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are enhanced to support the new module dependency. Signed-off-by: Ben Blum <[email protected]> Cc: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Jens Axboe <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroups: subsystem module unloading | Ben Blum | 1 | -25/+142
Provides support for unloading modular subsystems. This patch adds a new function cgroup_unload_subsys which is to be used for removing a loaded subsystem during module deletion. Reference counting of the subsystems' modules is moved from once (at load time) to once per attached hierarchy (in parse_cgroupfs_options and rebind_subsystems) (i.e., 0 or 1). Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroups: subsystem module loading interface | Ben Blum | 1 | -5/+145
Add interface between cgroups subsystem management and module loading This patch implements rudimentary module-loading support for cgroups - namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module initcall, and a struct module pointer in struct cgroup_subsys. Several functions that might be wanted by modules have had EXPORT_SYMBOL added to them, but it's unclear exactly which functions want it and which won't. Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroups: revamp subsys array | Ben Blum | 1 | -16/+80
This patch series provides the ability for cgroup subsystems to be compiled as modules both within and outside the kernel tree. This is mainly useful for classifiers and subsystems that hook into components that are already modules. cls_cgroup and blkio-cgroup serve as the example use cases for this feature. It provides an interface cgroup_load_subsys() and cgroup_unload_subsys() which modular subsystems can use to register and depart during runtime. The net_cls classifier subsystem serves as the example for a subsystem which can be converted into a module using these changes. Patch #1 sets up the subsys[] array so its contents can be dynamic as modules appear and (eventually) disappear. Iterations over the array are modified to handle when subsystems are absent, and the dynamic section of the array is protected by cgroup_mutex. Patch #2 implements an interface for modules to load subsystems, called cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module pointer in struct cgroup_subsys. Patch #3 adds a mechanism for unloading modular subsystems, which includes a more advanced rework of the rudimentary reference counting introduced in patch 2. Patch #4 modifies the net_cls subsystem, which already had some module declarations, to be configurable as a module, which also serves as a simple proof-of-concept. Part of implementing patches 2 and 4 involved updating css pointers in each css_set when the module appears or leaves. In doing this, it was discovered that css_sets always remain linked to the dummy cgroup, regardless of whether or not any subsystems are actually bound to it (i.e., not mounted on an actual hierarchy). 
The subsystem loading and unloading code therefore should keep in mind the special cases where the added subsystem is the only one in the dummy cgroup (and therefore all css_sets need to be linked back into it) and where the removed subsys was the only one in the dummy cgroup (and therefore all css_sets should be unlinked from it) - however, as all css_sets always stay attached to the dummy cgroup anyway, these cases are ignored. Any fix that addresses this issue should also make sure these cases are addressed in the subsystem loading and unloading code. This patch: Make subsys[] able to be dynamically populated to support modular subsystems This patch reworks the way the subsys[] array is used so that subsystems can register themselves after boot time, and enables the internals of cgroups to be able to handle when subsystems are not present or may appear/disappear. Signed-off-by: Ben Blum <[email protected]> Acked-by: Li Zefan <[email protected]> Cc: Paul Menage <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | cgroup: introduce coalesce css_get() and css_put() | Daisuke Nishimura | 1 | -2/+3
Current css_get() and css_put() increment/decrement css->refcnt one by one. This patch adds a new function __css_get(), which takes "count" as an arg and increments css->refcnt by "count". This patch also adds a new arg ("count") to __css_put() and changes the function to decrement css->refcnt by "count". These coalesced versions of __css_get()/__css_put() will be used later to improve the performance of memcg's moving charge feature, where these new functions will be used instead of calling css_get()/css_put() repeatedly. No change is needed for current users of css_get()/css_put(). Signed-off-by: Daisuke Nishimura <[email protected]> Acked-by: Paul Menage <[email protected]> Cc: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Li Zefan <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
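The idea of taking or dropping several references in one atomic operation can be sketched with C11 atomics. The names `struct obj`, `obj_get_many` and `obj_put_many` are hypothetical; this is not the kernel's `__css_get()`/`__css_put()` code, only the same pattern in userspace.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical object; stands in for a cgroup_subsys_state. */
struct obj {
	atomic_int refcnt;
};

/* Take `count` references with a single atomic operation. */
void obj_get_many(struct obj *o, int count)
{
	assert(count > 0);
	atomic_fetch_add(&o->refcnt, count);
}

/* Drop `count` references; returns 1 when the last one is gone. */
int obj_put_many(struct obj *o, int count)
{
	int old;

	assert(count > 0);
	old = atomic_fetch_sub(&o->refcnt, count);
	assert(old >= count);
	return old == count;
}

/* Single-reference wrappers keep existing callers unchanged. */
void obj_get(struct obj *o) { obj_get_many(o, 1); }
int obj_put(struct obj *o) { return obj_put_many(o, 1); }
```

A caller that would otherwise loop over `obj_get()` N times pays for one atomic read-modify-write instead of N, which is the saving the commit message is after.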
2010-03-12 | cgroup: introduce cancel_attach() | Daisuke Nishimura | 1 | -7/+33
Add cancel_attach() operation to struct cgroup_subsys. cancel_attach() can be used when can_attach() operation prepares something for the subsys, but we should rollback what can_attach() operation has prepared if attach task fails after we've succeeded in can_attach(). Signed-off-by: Daisuke Nishimura <[email protected]> Acked-by: Li Zefan <[email protected]> Reviewed-by: Paul Menage <[email protected]> Cc: Balbir Singh <[email protected]> Acked-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Daisuke Nishimura <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
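The prepare/rollback relationship between can_attach() and cancel_attach() can be modelled as below. This is a simplified sketch under stated assumptions: `struct subsys` and `attach_all` are hypothetical, the callbacks take no task or cgroup arguments, and the failure is modelled as a later subsystem refusing in its own can_attach().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subsystem with the three callbacks discussed above. */
struct subsys {
	int (*can_attach)(struct subsys *ss);	/* returns 0 on success */
	void (*cancel_attach)(struct subsys *ss);
	void (*attach)(struct subsys *ss);
	int prepared;	/* state set up by can_attach() */
	int attached;
};

/*
 * Try to attach to every subsystem. If one can_attach() fails, undo the
 * preparation done by the subsystems that had already succeeded.
 */
int attach_all(struct subsys **ss, size_t n)
{
	size_t i, failed = n;
	int ret = 0;

	assert(ss != NULL);
	for (i = 0; i < n; i++) {
		ret = ss[i]->can_attach ? ss[i]->can_attach(ss[i]) : 0;
		if (ret) {
			failed = i;
			break;
		}
	}
	if (ret) {
		for (i = 0; i < failed; i++)
			if (ss[i]->cancel_attach)
				ss[i]->cancel_attach(ss[i]);
		return ret;
	}
	for (i = 0; i < n; i++)
		if (ss[i]->attach)
			ss[i]->attach(ss[i]);
	return 0;
}

/* Example callbacks used to exercise attach_all(). */
int prepare_ok(struct subsys *ss) { ss->prepared = 1; return 0; }
int prepare_fail(struct subsys *ss) { (void)ss; return -1; }
void undo_prepare(struct subsys *ss) { ss->prepared = 0; }
void do_attach(struct subsys *ss) { ss->attached = 1; }
```

With one subsystem using `prepare_ok` and a second using `prepare_fail`, `attach_all()` returns the error and leaves the first subsystem's `prepared` flag cleared again.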
2010-03-12 | Add generic sys_olduname() | Christoph Hellwig | 1 | -0/+54
Add generic implementations of the old and really old uname system calls. Note that sh only implements sys_olduname but not sys_oldolduname, but I'm not going to bother with another ifdef for that special case. m32r implemented an old uname but never wired it up, so kill it, too. Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | improve sys_newuname() for compat architectures | Christoph Hellwig | 1 | -0/+13
On an architecture that supports 32-bit compat we need to override the reported machine in uname with the 32-bit value. Instead of doing this separately in every architecture introduce a COMPAT_UTS_MACHINE define in <asm/compat.h> and apply it directly in sys_newuname(). Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-12 | Add generic sys_ipc wrapper | Christoph Hellwig | 1 | -0/+1
Add a generic implementation of the ipc demultiplexer syscall. Except for s390 and sparc64 all implementations of the sys_ipc are nearly identical. There are slight differences in the types of the parameters, where mips and powerpc as the only 64-bit architectures with sys_ipc use unsigned long for the "third" argument as it gets cast to a pointer later, while it traditionally is an "int" like most other parameters. frv goes even further and uses unsigned long for all parameters except for "ptr" which is a pointer type everywhere. The change from int to unsigned long for "third" and back to "int" for the others on frv should be fine due to the in-register calling conventions for syscalls (we already had a similar issue with the generic sys_ptrace), but I'd prefer to have the arch maintainers look over this in detail. Except for that h8300, m68k and m68knommu lack an implementation of the semtimedop sub call which this patch adds, and various architectures have gets used - at least on i386 it seems superfluous as the compat code on x86-64 and ia64 doesn't even bother to implement it. [[email protected]: add sys_ipc to sys_ni.c] Signed-off-by: Christoph Hellwig <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mundt <[email protected]> Cc: Jeff Dike <[email protected]> Cc: Hirokazu Takata <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Reviewed-by: H. Peter Anvin <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: James Morris <[email protected]> Cc: Andreas Schwab <[email protected]> Acked-by: Jesper Nilsson <[email protected]> Acked-by: Russell King <[email protected]> Acked-by: David Howells <[email protected]> Acked-by: Kyle McMartin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-07 | sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes | Eric W. Biederman | 1 | -0/+3
A little more whack-a-mole annotating the dynamic sysfs attributes. I had everything built into my earlier test kernel, and so I missed these. Signed-off-by: Eric W. Biederman <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07 | sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on dynamic attributes | Eric W. Biederman | 1 | -0/+1
These are the non-static sysfs attributes that exist on my test machine. Fix them to use sysfs_attr_init or sysfs_bin_attr_init as appropriate. It simply requires making a sysfs attribute present to see this. So this is a little bit tedious but otherwise not too bad. Signed-off-by: Eric W. Biederman <[email protected]> Acked-by: WANG Cong <[email protected]> Cc: Tejun Heo <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07 | Driver core: Constify struct sysfs_ops in struct kobj_type | Emese Revfy | 1 | -1/+1
Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <[email protected]> Acked-by: David Teigland <[email protected]> Acked-by: Matt Domsch <[email protected]> Acked-by: Maciej Sosnowski <[email protected]> Acked-by: Hans J. Koch <[email protected]> Acked-by: Pekka Enberg <[email protected]> Acked-by: Jens Axboe <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07 | kobject: Constify struct kset_uevent_ops | Emese Revfy | 1 | -1/+1
Constify struct kset_uevent_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-07 | sysdev: Pass attribute in sysdev_class attributes show/store | Andi Kleen | 2 | -3/+14
Passing the attribute to the low level IO functions allows all kinds of cleanups, by sharing low level IO code without requiring an own function for every piece of data. Also drivers can extend the attributes with own data fields and use that in the low level function. Similar to sysdev_attributes and normal attributes. This is a tree-wide sweep, converting everything in one go. No functional changes in this patch other than passing the new argument everywhere. Tested on x86, the non x86 parts are uncompiled. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2010-03-06 | elf coredump: add extended numbering support | Daisuke HATAYAMA | 1 | -0/+5
The current ELF dumper implementation can produce broken corefiles if program headers exceed 65535. This number is determined by the number of vmas which the process has. In particular, some extreme programs may use more than 65535 vmas. (If you google max_map_count, you can find some users facing this problem.) This kind of program will never be able to generate correct coredumps. This patch implements ``extended numbering'' that uses the sh_info field of the first section header instead of the e_phnum field in order to represent up to 4294967295 vmas. This is supported by AMD64-ABI (http://www.x86-64.org/documentation.html) and Solaris (http://docs.sun.com/app/docs/doc/817-1984/). Of course, we are preparing patches for gdb and binutils. Signed-off-by: Daisuke HATAYAMA <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: David Howells <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Alan Cox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
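Extended numbering can be illustrated with a writer and a reader for the two fields involved. The sketch assumes hypothetical cut-down structs (`mini_ehdr`, `mini_shdr0`) holding only `e_phnum` and `sh_info`; the marker value 0xffff is the one `<elf.h>` names `PN_XNUM`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XNUM_MARKER 0xffffu	/* stored in e_phnum; <elf.h> calls it PN_XNUM */

/* Only the two fields that matter here; real ELF headers have many more. */
struct mini_ehdr { uint16_t e_phnum; };
struct mini_shdr0 { uint32_t sh_info; };

/* Writer side: decide what goes into e_phnum and sh_info of section 0. */
void encode_phnum(uint32_t segs, struct mini_ehdr *eh, struct mini_shdr0 *sh0)
{
	assert(eh != NULL && sh0 != NULL);
	if (segs >= XNUM_MARKER) {
		/* Too many for 16 bits: store the marker, real count elsewhere. */
		eh->e_phnum = XNUM_MARKER;
		sh0->sh_info = segs;
	} else {
		eh->e_phnum = (uint16_t)segs;
		sh0->sh_info = 0;
	}
}

/* Reader side: recover the real program header count. */
uint32_t decode_phnum(const struct mini_ehdr *eh, const struct mini_shdr0 *sh0)
{
	assert(eh != NULL && sh0 != NULL);
	return eh->e_phnum == XNUM_MARKER ? sh0->sh_info : eh->e_phnum;
}
```

A reader that does not know about the marker would see 65535 program headers, which is why the message mentions matching patches for gdb and binutils.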
2010-03-06 | elf coredump: replace ELF_CORE_EXTRA_* macros by functions | Daisuke HATAYAMA | 2 | -0/+26
elf_core_dump() and elf_fdpic_core_dump() use #ifdef and the corresponding macro for hiding _multiline_ logic in functions. This patch removes #ifdef and replaces ELF_CORE_EXTRA_* by corresponding functions. For architectures not implementing ELF_CORE_EXTRA_*, we use weak functions in order to reduce the range of modification. This cleanup is for my next patches, but I think this cleanup itself is worth doing regardless of my final purpose. Signed-off-by: Daisuke HATAYAMA <[email protected]> Cc: "Luck, Tony" <[email protected]> Cc: Jeff Dike <[email protected]> Cc: David Howells <[email protected]> Cc: Greg Ungerer <[email protected]> Cc: Roland McGrath <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Alan Cox <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | printk: avoid warning when CONFIG_PRINTK is disabled | Gustavo F. Padovan | 1 | -2/+1
kernel/printk.c:72: warning: `saved_console_loglevel' defined but not used Signed-off-by: Gustavo F. Padovan <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | kernel/pid.c: update comment on find_task_by_pid_ns | Tetsuo Handa | 1 | -1/+1
tasklist_lock does protect the task and its pid, it can't go away. The problem is that find_pid_ns() itself is unsafe without rcu lock, it can race with copy_process()->free_pid(any_pid). Protecting copy_process()->free_pid(any_pid) with tasklist_lock would make it possible to call find_task_by_pid_ns() under tasklist safely, but we don't do so because we are trying to get rid of the read_lock sites of tasklist_lock. Signed-off-by: Tetsuo Handa <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | panic: fix panic_timeout accuracy when running on a hypervisor | Anton Blanchard | 1 | -16/+30
I've had some complaints about panic_timeout being wildly inaccurate on shared processor PowerPC partitions (a 3 minute panic_timeout taking 30 minutes). The problem is we loop on mdelay(1) and with a 1ms in 10ms hypervisor timeslice each of these will take 10ms (ie 10x) longer. I expect other platforms with shared processor hypervisors will see the same issue. This patch keeps the old behaviour if we have a panic_blink (only keyboard LEDs right now) and does 1 second mdelays if we don't. Signed-off-by: Anton Blanchard <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
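The 3 minutes versus 30 minutes figure follows from simple arithmetic, which the sketch below reproduces. It is a deliberately crude model and an assumption of this note, not kernel code: every delay call is taken to cost at least one full hypervisor timeslice, and `wall_ms` is a hypothetical helper.

```c
#include <assert.h>

/*
 * Wall-clock milliseconds spent when a timeout of `timeout_s` seconds is
 * implemented as repeated delays of `step_ms`, and each delay call really
 * takes at least `slice_ms` because the partition only runs once per
 * hypervisor timeslice.
 */
long long wall_ms(long long timeout_s, long long step_ms, long long slice_ms)
{
	long long iterations, per_call;

	assert(step_ms > 0);
	iterations = timeout_s * 1000 / step_ms;
	per_call = step_ms > slice_ms ? step_ms : slice_ms;
	return iterations * per_call;
}
```

With a 180 second timeout, 1 ms steps and a 10 ms slice the model gives 1,800,000 ms, i.e. 30 minutes. With 1 second steps the same timeout comes out at 180,000 ms, i.e. the intended 3 minutes, because the slice overhead no longer dominates each call.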
2010-03-06 | kernel core: use helpers for rlimits | Jiri Slaby | 7 | -16/+20
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: john stultz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
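The "fetch it once" concern can be sketched in userspace. The macro below imitates the shape of the kernel's ACCESS_ONCE using the GNU C `__typeof__` extension; `struct limit` and `clamp_to_limit` are hypothetical names chosen for this illustration and are not the rlimit helpers the commit refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace imitation of the kernel macro; relies on GNU C __typeof__. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical limit that another thread may rewrite at any time. */
struct limit {
	unsigned long cur;
	unsigned long max;
};

/*
 * Clamp `want` to the soft limit. The limit is read exactly once into a
 * local, so the comparison and the returned value use the same number
 * even if l->cur is changed concurrently.
 */
unsigned long clamp_to_limit(struct limit *l, unsigned long want)
{
	unsigned long cur;

	assert(l != NULL);
	cur = ACCESS_ONCE(l->cur);
	return want > cur ? cur : want;
}
```

Without the single read, a compiler is free to load `l->cur` once for the comparison and again for the return value, and the two loads could see different limits once limits become writable.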
2010-03-06 | posix-cpu-timers: cleanup rlimits usage | Jiri Slaby | 1 | -15/+17
Fetch rlimit (both hard and soft) values only once and work on them. It removes many accesses through sig structure and makes the code cleaner. Mostly a preparation for writable resource limits support. Signed-off-by: Jiri Slaby <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: john stultz <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | kernel/exit.c: fix shadows sparse warning | Thiago Farina | 1 | -1/+1
kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one kernel/exit.c:1173:21: originally declared here Signed-off-by: Thiago Farina <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | includecheck fix for kernel/params.c | Jaswinder Singh Rajput | 1 | -1/+0
Fix the following 'make includecheck' warning: kernel/params.c: linux/string.h is included more than once. Signed-off-by: Jaswinder Singh Rajput <[email protected]> Cc: André Goddard Rosa <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | splice: comparing unsigned int < 0 | Dan Carpenter | 1 | -2/+3
"ret" needs to be signed or the error handling for splice_to_pipe() won't work correctly. Signed-off-by: Dan Carpenter <[email protected]> Cc: Tom Zanussi <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Lai Jiangshan <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | kernel/cpu.c: delete deprecated definition in cpu_up() | Chen Gong | 1 | -1/+1
Additional_cpus is only supported for IA64 now. X86_64 should not be included. Signed-off-by: Chen Gong <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06 | mm/pm: force GFP_NOIO during suspend/hibernation and resume | Rafael J. Wysocki | 2 | -0/+12
There are quite a few GFP_KERNEL memory allocations made during suspend/hibernation and resume that may cause the system to hang, because the I/O operations they depend on cannot be completed due to the underlying devices being suspended. Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in gfp_allowed_mask before suspend/hibernation and restoring the original values of these bits in gfp_allowed_mask during the subsequent resume. [[email protected]: fix CONFIG_PM=n linkage] Signed-off-by: Rafael J. Wysocki <[email protected]> Reported-by: Maxim Levitsky <[email protected]> Cc: Sebastian Ott <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
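The save, clear and restore sequence on a global allowed-flags mask can be sketched as below. The bit values and names (`F_IO`, `F_FS`, `allowed_mask`, and the three functions) are hypothetical; the kernel's `__GFP_*` constants and `gfp_allowed_mask` handling differ in detail.

```c
#include <assert.h>

/* Hypothetical bit values; the kernel's real __GFP_* constants differ. */
#define F_WAIT 0x1u
#define F_IO   0x2u
#define F_FS   0x4u
#define F_ALL  (F_WAIT | F_IO | F_FS)

static unsigned int allowed_mask = F_ALL;
static unsigned int saved_mask;

/* Before suspend: remember the mask, then forbid I/O and filesystem use. */
void restrict_for_suspend(void)
{
	saved_mask = allowed_mask;
	allowed_mask &= ~(F_IO | F_FS);
}

/* After resume: put the remembered mask back. */
void restore_after_resume(void)
{
	allowed_mask = saved_mask;
}

/* Every allocation request is filtered through the current mask. */
unsigned int effective_flags(unsigned int requested)
{
	assert((requested & ~F_ALL) == 0);
	return requested & allowed_mask;
}
```

Filtering at a single choke point means callers that ask for the full set of flags do not have to change; they are quietly downgraded while devices are suspended.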
2010-03-06 | mm: change anon_vma linking to fix multi-process server scalability issue | Rik van Riel | 1 | -1/+5
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. 
To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [[email protected]: cleanups] Signed-off-by: Rik van Riel <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Larry Woodman <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Hugh Dickins <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-03-06mm: avoid false sharing of mm_counterKAMEZAWA Hiroyuki1-1/+2
Per-mm statistics are shared among threads, so they can be a cache-miss point in the page fault path. This patch adds a per-thread cache for mm_counter. The RSS value is counted into a struct in task_struct and synchronized with the mm's value at events. In this patch, the event is the number of calls to handle_mm_fault: the per-thread value is added to the mm every 64 calls. A rough estimate with a small benchmark on parallel threads (2 threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults. Anyway, the most contended object is mmap_sem if the number of threads grows. [[email protected]: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
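The per-thread caching idea described in the mm_counter commit above can be sketched in userspace C. This is a minimal illustration, not the kernel code: every type and function name below is hypothetical, the shared counters are plain longs (a real multi-threaded version would need atomic updates), and only the flush interval of 64 comes from the commit message.

```c
/* All names here are hypothetical stand-ins, not kernel identifiers. */
enum { COUNTER_FILE, COUNTER_ANON, NR_COUNTERS };

#define SYNC_THRESHOLD 64       /* flush interval quoted in the commit text */

struct shared_stats {           /* stands in for the per-mm counters */
    long count[NR_COUNTERS];
};

struct thread_cache {           /* stands in for the per-task cache */
    int events;                 /* fault-handler calls since the last flush */
    long delta[NR_COUNTERS];    /* pending updates not yet published */
};

/* Record a change locally, without touching the shared cache line. */
static inline void cache_add(struct thread_cache *tc, int member, long value)
{
    tc->delta[member] += value;
}

/* Fold all pending deltas into the shared counters and reset the cache. */
static void cache_flush(struct thread_cache *tc, struct shared_stats *ss)
{
    for (int i = 0; i < NR_COUNTERS; i++) {
        ss->count[i] += tc->delta[i];   /* would be an atomic add with real threads */
        tc->delta[i] = 0;
    }
    tc->events = 0;
}

/* Called once per simulated page fault; flushes on every 64th call. */
static void on_fault(struct thread_cache *tc, struct shared_stats *ss)
{
    if (++tc->events >= SYNC_THRESHOLD)
        cache_flush(tc, ss);
}
```

The trade-off in this sketch is that the shared value can lag behind by up to 63 faults per thread, in exchange for touching the shared counters far less often.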
2010-03-06mm: clean up mm_counterKAMEZAWA Hiroyuki2-2/+2
Presently, the per-mm statistics counter is defined by macros in sched.h. This patch changes it to be - defined in mm.h as inline functions - based on an array instead of macro name creation. This patch reduces the size of a future patch that modifies the implementation of the per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Reviewed-by: Minchan Kim <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Lee Schermerhorn <[email protected]> Cc: David Rientjes <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
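The difference between macro name creation and array-indexed inline functions, as described in the mm_counter cleanup above, might look roughly like this. The struct layouts, field names, and function names are illustrative assumptions and are not copied from sched.h or mm.h.

```c
/* Old style (illustrative): one field per statistic, reached through a
 * macro that pastes the member name into the field name. */
struct stats_old {
    long _file_rss;
    long _anon_rss;
};
#define old_get_counter(s, member)     ((s)->_##member)
#define old_add_counter(s, member, v)  ((s)->_##member += (v))

/* New style (illustrative): counters live in an array indexed by an enum,
 * and the accessors are ordinary type-checked inline functions. */
enum demo_counter { DEMO_FILEPAGES, DEMO_ANONPAGES, DEMO_NR_COUNTERS };

struct stats_new {
    long count[DEMO_NR_COUNTERS];
};

static inline long demo_get_counter(const struct stats_new *s, int member)
{
    return s->count[member];
}

static inline void demo_add_counter(struct stats_new *s, int member, long v)
{
    s->count[member] += v;
}
```

With the array form, a later change to how counters are stored only has to touch the accessor bodies, which is consistent with the commit's stated goal of keeping a follow-up patch small.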
2010-03-06bitops: rename for_each_bit() to for_each_set_bit()Akinobu Mita1-1/+1
Rename for_each_bit to for_each_set_bit in the kernel source tree, to permit for_each_clear_bit(), should that ever be added. The patch includes a macro to map the old for_each_bit() onto the new for_each_set_bit(). This is a (very) temporary thing to ease the migration. [[email protected]: add temporary for_each_bit()] Suggested-by: Alexey Dobriyan <[email protected]> Suggested-by: Andrew Morton <[email protected]> Signed-off-by: Akinobu Mita <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Russell King <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Artem Bityutskiy <[email protected]> Cc: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
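To make the for_each_set_bit() rename and its temporary compatibility alias concrete, here is a simplified userspace sketch. It works on a single unsigned long rather than a bitmap array, and all `demo_` names are hypothetical; the kernel macro itself is not reproduced here.

```c
#include <limits.h>

#define DEMO_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Return the index of the first set bit at or after 'start',
 * or 'size' if there is none. 'size' must not exceed DEMO_BITS_PER_LONG. */
static unsigned int demo_find_next_bit(unsigned long word, unsigned int size,
                                       unsigned int start)
{
    for (unsigned int i = start; i < size; i++)
        if (word & (1UL << i))
            return i;
    return size;
}

/* Visit every set bit in 'word', lowest index first. */
#define demo_for_each_set_bit(bit, word, size)                      \
    for ((bit) = demo_find_next_bit((word), (size), 0);             \
         (bit) < (size);                                            \
         (bit) = demo_find_next_bit((word), (size), (bit) + 1))

/* Temporary alias so callers using the old name keep compiling. */
#define demo_for_each_bit(bit, word, size) \
    demo_for_each_set_bit(bit, word, size)

/* Example caller that still uses the old name: sum of set-bit indices. */
static unsigned int demo_sum_set_bits(unsigned long word)
{
    unsigned int bit, sum = 0;

    demo_for_each_bit(bit, word, DEMO_BITS_PER_LONG)
        sum += bit;
    return sum;
}
```

Naming the macro after set bits leaves room for a symmetric clear-bit iterator, which is the motivation the commit gives for the rename.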
2010-03-05Merge branch 'perf-probes-for-linus-2' of ↵Linus Torvalds2-90/+569
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-probes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Issue at least one memory barrier in stop_machine_text_poke()
  perf probe: Correct probe syntax on command line help
  perf probe: Add lazy line matching support
  perf probe: Show more lines after last line
  perf probe: Check function address range strictly in line finder
  perf probe: Use libdw callback routines
  perf probe: Use elfutils-libdw for analyzing debuginfo
  perf probe: Rename probe finder functions
  perf probe: Fix bugs in line range finder
  perf probe: Update perf probe document
  perf probe: Do not show --line option without dwarf support
  kprobes: Add documents of jump optimization
  kprobes/x86: Support kprobes jump optimization on x86
  x86: Add text_poke_smp for SMP cross modifying code
  kprobes/x86: Cleanup save/restore registers
  kprobes/x86: Boost probes when reentering
  kprobes: Jump optimization sysctl interface
  kprobes: Introduce kprobes jump optimization
  kprobes: Introduce generic insn_slot framework
  kprobes/x86: Cleanup RELATIVEJUMP_INSTRUCTION to RELATIVEJUMP_OPCODE
2010-03-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-1/+7
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  padata: Allocate the cpumask for the padata instance
  crypto: authenc - Move saved IV in front of the ablkcipher request
  crypto: hash - Fix handling of unaligned buffers
  crypto: authenc - Use correct ahash complete functions
  crypto: md5 - Set statesize
2010-03-04Merge branch 'for-linus' of ↵Linus Torvalds3-82/+32
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
  init: Open /dev/console from rootfs
  mqueue: fix typo "failues" -> "failures"
  mqueue: only set error codes if they are really necessary
  mqueue: simplify do_open() error handling
  mqueue: apply mathematics distributivity on mq_bytes calculation
  mqueue: remove unneeded info->messages initialization
  mqueue: fix mq_open() file descriptor leak on user-space processes
  fix race in d_splice_alias()
  set S_DEAD on unlink() and non-directory rename() victims
  vfs: add NOFOLLOW flag to umount(2)
  get rid of ->mnt_parent in tomoyo/realpath
  hppfs can use existing proc_mnt, no need for do_kern_mount() in there
  Mirror MS_KERNMOUNT in ->mnt_flags
  get rid of useless vfsmount_lock use in put_mnt_ns()
  Take vfsmount_lock to fs/internal.h
  get rid of insanity with namespace roots in tomoyo
  take check for new events in namespace (guts of mounts_poll()) to namespace.c
  Don't mess with generic_permission() under ->d_lock in hpfs
  sanitize const/signedness for udf
  nilfs: sanitize const/signedness in dealing with ->d_name.name
  ...
Fix up fairly trivial (famous last words...) conflicts in drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04padata: Allocate the cpumask for the padata instanceSteffen Klassert1-1/+7
The cpumask of the padata instance was used without being allocated. This caused boot crashes if CONFIG_CPUMASK_OFFSTACK is enabled. This patch fixes this by properly allocating the cpumask. Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Herbert Xu <[email protected]>
2010-03-03Prioritize synchronous signals over 'normal' signalsLinus Torvalds1-13/+30
This makes sure that we pick the synchronous signals caused by a processor fault over any pending regular asynchronous signals sent to us by [t]kill(). This is not strictly required semantics, but it makes it _much_ easier for programs like Wine that expect to find the fault information in the signal stack. Without this, if a non-synchronous signal gets picked first, the delayed synchronous signal will have its signal context pointing to the new signal invocation, rather than the instruction that caused the SIGSEGV or SIGBUS in the first place. This is not all that pretty, and we're discussing making the synchronous signals more explicit rather than have these kinds of implicit preferences of SIGSEGV and friends. See for example http://bugzilla.kernel.org/show_bug.cgi?id=15395 for some of the discussion. But in the meantime this is a simple and fairly straightforward work-around, and the whole
	if (x & Y)
		x &= Y;
thing can be compiled into (and gcc does do it) just three instructions:
	movq %rdx, %rax
	andl $Y, %eax
	cmovne %rax, %rdx
so it is at least a simple solution to a subtle issue. Reported-and-tested-by: Pavel Vilim <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
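The `if (x & Y) x &= Y;` trick from the commit above can be shown in a small userspace sketch. All `demo_`/`DEMO_` names are hypothetical, the masks are single unsigned longs, and the exact set of signals treated as synchronous below (SIGSEGV, SIGBUS, SIGILL, SIGFPE) is an assumption made for illustration; the commit message itself only names SIGSEGV and SIGBUS "and friends".

```c
#include <signal.h>

/* Bit for signal number 'sig' (signals are numbered from 1). */
#define DEMO_SIGMASK(sig) (1UL << ((sig) - 1))

/* Assumed set of fault-generated signals; illustrative only. */
#define DEMO_SYNCHRONOUS_MASK                      \
    (DEMO_SIGMASK(SIGSEGV) | DEMO_SIGMASK(SIGBUS) | \
     DEMO_SIGMASK(SIGILL)  | DEMO_SIGMASK(SIGFPE))

/* Return the next signal to deliver, or 0 if nothing is deliverable.
 * Synchronous signals win over any other pending signal; within the
 * chosen group the lowest-numbered signal is picked. */
static int demo_next_signal(unsigned long pending, unsigned long blocked)
{
    unsigned long x = pending & ~blocked;
    int sig = 1;

    if (!x)
        return 0;

    /* The pattern from the commit: narrow x only if the result is non-empty. */
    if (x & DEMO_SYNCHRONOUS_MASK)
        x &= DEMO_SYNCHRONOUS_MASK;

    while (!(x & 1UL)) {    /* find the lowest set bit */
        x >>= 1;
        sig++;
    }
    return sig;
}
```

In this sketch, a pending SIGINT together with a pending SIGSEGV yields SIGSEGV first, even though SIGINT has the lower number; if SIGSEGV is blocked, the function falls back to SIGINT.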
2010-03-03Merge branch 'for-fsnotify' into for-linusAl Viro1-5/+2
2010-03-03new helper: iterate_mounts()Al Viro1-33/+16
apply function to vfsmounts in set returned by collect_mounts(), stop if it returns non-zero. Signed-off-by: Al Viro <[email protected]>
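The contract described for iterate_mounts() above (apply a callback to each element, stop as soon as it returns non-zero) can be sketched with a plain linked list. The `demo_mount` type, the list layout, and the function names are hypothetical; the kernel's actual data structures and signature are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mount in a collected set. */
struct demo_mount {
    int id;
    struct demo_mount *next;
};

/* Call f(m, arg) for each element; return the first non-zero result,
 * or 0 if every call returned 0. */
static int demo_iterate_mounts(int (*f)(struct demo_mount *, void *),
                               void *arg, struct demo_mount *head)
{
    for (struct demo_mount *m = head; m != NULL; m = m->next) {
        int res = f(m, arg);

        if (res)
            return res;     /* early exit requested by the callback */
    }
    return 0;
}

/* Example callback state: look for one id and count the visits. */
struct demo_search {
    int wanted;
    int visited;
};

static int demo_find_id(struct demo_mount *m, void *arg)
{
    struct demo_search *s = arg;

    s->visited++;
    return m->id == s->wanted;  /* non-zero stops the iteration */
}
```

Passing state through a `void *` argument keeps the iterator generic, and the non-zero return doubles as both a stop signal and a result code for the caller.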
2010-03-03New helper: path_is_under(path1, path2)Al Viro1-39/+12
Analog of is_subdir for vfsmount,dentry pairs, moved from audit_tree.c Signed-off-by: Al Viro <[email protected]>
2010-03-03Switch may_open() and break_lease() to passing O_...Al Viro1-5/+2
... instead of mixing FMODE_ and O_ Signed-off-by: Al Viro <[email protected]>
2010-03-03Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  resource: Fix broken indentation
  resource: Fix generic page_is_ram() for partial RAM pages
  x86, paravirt: Remove kmap_atomic_pte paravirt op.
  x86, vmi: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y
  x86, xen: Disable highmem PTE allocation even when CONFIG_HIGHPTE=y
2010-03-03Merge branch 'x86-apic-for-linus' of ↵Linus Torvalds4-46/+74
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
  x86: Fix out of order of gsi
  x86: apic: Fix mismerge, add arch_probe_nr_irqs() again
  x86, irq: Keep chip_data in create_irq_nr and destroy_irq
  xen: Remove unnecessary arch specific xen irq functions.
  smp: Use nr_cpus= to set nr_cpu_ids early
  x86, irq: Remove arch_probe_nr_irqs
  sparseirq: Use radix_tree instead of ptrs array
  sparseirq: Change irq_desc_ptrs to static
  init: Move radix_tree_init() early
  irq: Remove unnecessary bootmem code
  x86: Add iMac9,1 to pci_reboot_dmi_table
  x86: Convert i8259_lock to raw_spinlock
  x86: Convert nmi_lock to raw_spinlock
  x86: Convert ioapic_lock and vector_lock to raw_spinlock
  x86: Avoid race condition in pci_enable_msix()
  x86: Fix SCI on IOAPIC != 0
  x86, ia32_aout: do not kill argument mapping
  x86, irq: Move __setup_vector_irq() before the first irq enable in cpu online path
  x86, irq: Update the vector domain for legacy irqs handled by io-apic
  x86, irq: Don't block IRQ0_VECTOR..IRQ15_VECTOR's on all cpu's
  ...