path: root/kernel
Age | Commit message | Author | Files | Lines
2010-08-18 | fs: fs_struct rwlock to spinlock | Nick Piggin | 1 | -5/+5
fs: fs_struct rwlock to spinlock struct fs_struct.lock is an rwlock with the read-side used to protect root and pwd members while taking references to them. Taking a reference to a path typically requires just 2 atomic ops, so the critical section is very small. Parallel read-side operations would have cacheline contention on the lock, the dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a real parallelism increase. Replace it with a spinlock to avoid one or two atomic operations in typical path lookup fastpath. Signed-off-by: Nick Piggin <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-17 | Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb | Linus Torvalds | 2 | -2/+9
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb: vt,console,kdb: preserve console_blanked while in kdb vt: fix regression warnings from KMS merge arm,kgdb: fix GDB_MAX_REGS no longer used kgdb: add missing __percpu markup in arch/x86/kernel/kgdb.c kdb: fix compile error without CONFIG_KALLSYMS
2010-08-17 | Fix unprotected access to task credentials in waitid() | Daniel J Blueman | 1 | -3/+2
Using a program like the following:

	#include <signal.h>	/* kill(), SIGSTOP */
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	int main()
	{
		id_t id;
		siginfo_t infop;
		pid_t res;

		id = fork();
		if (id == 0) { sleep(1); exit(0); }	/* child */
		kill(id, SIGSTOP);			/* stop the child */
		alarm(1);
		waitid(P_PID, id, &infop, WCONTINUED);
		return 0;
	}

to call waitid() on a stopped process results in access to the child task's credentials, which may be replaced in the meantime, without the RCU read lock being held, eliciting the following warning:

	===================================================
	[ INFO: suspicious rcu_dereference_check() usage. ]
	---------------------------------------------------
	kernel/exit.c:1460 invoked rcu_dereference_check() without protection!

	other info that might help us debug this:

	rcu_scheduler_active = 1, debug_locks = 1
	2 locks held by waitid02/22252:
	 #0: (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310
	 #1: (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>] wait_consider_task+0x19a/0xbe0

	stack backtrace:
	Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
	Call Trace:
	 [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0
	 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0
	 [<ffffffff81061d15>] do_wait+0xf5/0x310
	 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0
	 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70
	 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b

This is fixed by holding the RCU read lock in wait_task_continued() to ensure that the task's current credentials aren't destroyed between us reading the cred pointer and us reading the UID from those credentials.

Furthermore, protect wait_task_stopped() in the same way.

We don't need to keep holding the RCU read lock once we've read the UID from the credentials as holding the RCU read lock doesn't stop the target task from changing its creds under us - so the credentials may be outdated immediately after we've read the pointer, lock or no lock.
Signed-off-by: Daniel J Blueman <[email protected]> Signed-off-by: David Howells <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-17 | Make do_execve() take a const filename pointer | David Howells | 1 | -1/+3
Make do_execve() take a const filename pointer so that kernel_execve() compiles correctly on ARM:

	arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type

This also requires the argv and envp arguments to be consted twice, once for the pointer array and once for the strings the array points to. This is because do_execve() passes a pointer to the filename (now const) to copy_strings_kernel(). A simpler alternative would be to cast the filename pointer in do_execve() when it's passed to copy_strings_kernel().

do_execve() may not change any of the strings it is passed as part of the argv or envp lists, as some of them are in .rodata, so marking these strings as const should be fine.

Further, kernel_execve() and sys_execve() need to be changed to match.

This has been test built on x86_64, frv, arm and mips.

Signed-off-by: David Howells <[email protected]>
Tested-by: Ralf Baechle <[email protected]>
Acked-by: Russell King <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-16 | kdb: fix compile error without CONFIG_KALLSYMS | Jason Wessel | 2 | -2/+9
If CONFIG_KGDB_KDB is set and CONFIG_KALLSYMS is not set, the kernel will fail to build with the error:

	kernel/built-in.o: In function `kallsyms_symbol_next':
	kernel/debug/kdb/kdb_support.c:237: undefined reference to `kdb_walk_kallsyms'
	kernel/built-in.o: In function `kallsyms_symbol_complete':
	kernel/debug/kdb/kdb_support.c:193: undefined reference to `kdb_walk_kallsyms'

The kdb_walk_kallsyms() declaration needs a proper #ifdef in the header to match the C implementation. This patch also fixes the compiler warnings in kdb_support.c when compiling without CONFIG_KALLSYMS set. The compiler warnings are a result of the kallsyms_lookup() macro not initializing two of the pass-by-reference variables.

Signed-off-by: Jason Wessel <[email protected]>
Reported-by: Michal Simek <[email protected]>
2010-08-16 | Merge branch 'tip/perf/urgent-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into trace/tip/perf/urgent-4 | Steven Rostedt | 4 | -68/+163
Conflicts: kernel/trace/trace_events.c Signed-off-by: Steven Rostedt <[email protected]>
2010-08-16 | workqueue: free rescuer on destroy_workqueue | Xiaotian Feng | 1 | -1/+1
wq->rescuer is not freed when wq is destroyed, which leads to a memory leak. This patch also removes a redundant line.

Signed-off-by: Xiaotian Feng <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
2010-08-13 | tracing: Sanitize value returned from write(trace_marker, "...", len) | Marcin Slusarz | 1 | -3/+8
When userspace code writes a string that is not newline-terminated to the trace_marker file, the write handler appends a newline and returns the number of bytes written to the trace buffer, so

	write(fd, "abc", 3)

will return 4. That's unexpected, and unfortunately it confuses glibc's fprintf function.

Example:

	#include <stdio.h>

	int main() { fprintf(stderr, "abc"); return 0; }

	$ gcc test.c -o test
	$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
	$ ./test 2>/sys/kernel/debug/tracing/trace_marker

results in an infinite loop:

	write(fd, "abc", 3) = 4
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	write(fd, "", 1) = 0
	(...)

...and a kernel trace buffer full of empty markers. Fix it by sanitizing the write return value.

Signed-off-by: Marcin Slusarz <[email protected]>
LKML-Reference: <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
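The return-value rule behind this fix can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel's trace_marker write handler: the helper name `sanitized_write_result` and both parameters are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not kernel code.
 * 'requested' is the count the caller passed to write().
 * 'stored' is how many bytes the handler put in its buffer, which may
 * include a newline the handler appended on its own, or a negative
 * error code.
 * A write handler should never report more bytes than it was asked to
 * write, otherwise callers such as stdio lose track of their position.
 */
static ssize_t sanitized_write_result(size_t requested, ssize_t stored)
{
    if (stored < 0)
        return stored;              /* pass error codes through unchanged */
    if ((size_t)stored > requested)
        return (ssize_t)requested;  /* hide the appended newline */
    return stored;
}
```

With this rule, the `write(fd, "abc", 3)` case from the message would report 3 instead of 4.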
2010-08-13 | time: Workaround gcc loop optimization that causes 64bit div errors | John Stultz | 1 | -3/+4
Early 4.3 versions of gcc apparently aggressively optimize the raw time accumulation loop, replacing it with a divide. On 32bit systems, this causes the following link errors:

	undefined reference to `__umoddi3'
	undefined reference to `__udivdi3'

The gcc issue has been fixed in 4.4 and greater. This patch replaces the accumulation loop with a do_div, as suggested by Linus.

Signed-off-by: John Stultz <[email protected]>
CC: Jason Wessel <[email protected]>
CC: Larry Finger <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Linus Torvalds <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12 | Revert "fsnotify: store struct file not struct path" | Linus Torvalds | 1 | -2/+2
This reverts commit 3bcf3860a4ff9bbc522820b4b765e65e4deceb3e (and the accompanying commit c1e5c954020e "vfs/fsnotify: fsnotify_close can delay the final work in fput" that was a horribly ugly hack to make it work at all). The 'struct file' approach not only causes that disgusting hack, it somehow breaks pulseaudio, probably due to some other subtlety with f_count handling. Fix up various conflicts due to later fsnotify work. Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12 | tracing/events: Convert format output to seq_file | Steven Rostedt | 1 | -67/+141
Two new events were added that broke the current format output. Both from the SCSI system: scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout The reason is that their print_fmt exceeded a page size. Since the output of the format used simple_read_from_buffer and trace_seq, it was limited to a page size in output. This patch converts the printing of the format of an event into seq_file, which allows greater than a page size to be shown. I diffed all event formats comparing the output with and without this patch. All matched except for the above two, which showed just: FORMAT TOO BIG without this patch, but now properly displays the output with this patch. v2: Remove updating *pos in seq start function. [ Thanks to Li Zefan for pointing that out ] Reviewed-by: Li Zefan <[email protected]> Cc: Martin K. Petersen <[email protected]> Cc: Kei Tokunaga <[email protected]> Cc: James Bottomley <[email protected]> Cc: Tomohiro Kusumi <[email protected]> Cc: Xiao Guangrong <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2010-08-12 | Merge branch 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus | Linus Torvalds | 1 | -72/+161
* 'params' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: (22 commits) param: don't deref arg in __same_type() checks param: update drivers/acpi/debug.c to new scheme param: use module_param in drivers/message/fusion/mptbase.c ide: use module_param_named rather than module_param_call param: update drivers/char/ipmi/ipmi_watchdog.c to new scheme param: lock if_sdio's lbs_helper_name and lbs_fw_name against sysfs changes. param: lock myri10ge_fw_name against sysfs changes. param: simple locking for sysfs-writable charp parameters param: remove unnecessary writable charp param: add kerneldoc to moduleparam.h param: locking for kernel parameters param: make param sections const. param: use free hook for charp (fix leak of charp parameters) param: add a free hook to kernel_param_ops. param: silence .init.text references from param ops Add param ops struct for hvc_iucv driver. nfs: update for module_param_named API change AppArmor: update for module_param_named API change param: use ops in struct kernel_param, rather than get and set fns directly param: move the EXPORT_SYMBOL to after the definitions. ...
2010-08-12 | timekeeping: Fix overflow in rawtime tv_nsec on 32 bit archs | Jason Wessel | 1 | -4/+7
The tv_nsec is a long and when added to the shifted interval it can wrap and become negative, which later causes looping problems in getrawmonotonic(). The edge case occurs when the system has slept for a short period of time of ~2 seconds.

A trace printk of the values in this patch illustrates the problem:

	ftrace time stamp: log
	43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa
	43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd
	43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0
	46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec e1f9ae3
	46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3

The kernel starts looping at 46.349925 in getrawmonotonic() due to the negative value from adding the raw value to tv_nsec.

A simple solution is to accumulate into a u64, and then normalize it to a timespec_t.

Signed-off-by: Jason Wessel <[email protected]>
[ Reworked variable names and simplified some of the code. - John ]
Signed-off-by: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
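The "accumulate into a u64, then normalize" idea can be shown with the numbers from the trace. The sum 0xe1f9ae3 + 0x7a122600 is 0x8831c0e3, which is larger than 0x7fffffff and therefore negative when stored in a signed 32-bit tv_nsec. The following is a user-space sketch, not the kernel patch; `struct raw_time` and `accumulate_raw` are hypothetical names, and tv_nsec is assumed to be non-negative on entry.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for a timespec whose tv_nsec is only 32 bits wide. */
struct raw_time {
    int64_t tv_sec;
    int32_t tv_nsec;
};

/*
 * Add raw_ns nanoseconds to *t.
 * The sum is formed in a 64-bit unsigned value so it cannot wrap into a
 * negative number, and only the normalized remainder (always below one
 * second) is stored back into the narrow tv_nsec field.
 */
static void accumulate_raw(struct raw_time *t, uint64_t raw_ns)
{
    uint64_t total = (uint64_t)t->tv_nsec + raw_ns;

    t->tv_sec += (int64_t)(total / NSEC_PER_SEC);
    t->tv_nsec = (int32_t)(total % NSEC_PER_SEC);
}
```

In 32-bit kernel code the plain `/` and `%` on a 64-bit value would normally be replaced by a helper such as do_div(); the sketch uses ordinary C division because it targets user space.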
2010-08-12 | Add a dummy printk function for the maintenance of unused printks | David Howells | 1 | -4/+0
Add a dummy printk function for the maintenance of unused printks through gcc format checking, and also so that side-effect checking is maintained too. Signed-off-by: David Howells <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12 | block: add secure discard | Adrian Hunter | 1 | -0/+8
Secure discard is the same as discard except that all copies of the discarded sectors (perhaps created by garbage collection) must also be erased. Signed-off-by: Adrian Hunter <[email protected]> Acked-by: Jens Axboe <[email protected]> Cc: Kyungmin Park <[email protected]> Cc: Madhusudhan Chikkature <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Ben Gardiner <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-12 | kernel/kfifo.c: add handling of chained scatterlists | Stefani Seibold | 1 | -7/+6
The current kfifo scatterlist implementation will not work with chained scatterlists. It assumes that struct scatterlist arrays are allocated contiguously, which is not the case when chained scatterlists (struct sg_table) are in use. Signed-off-by: Stefani Seibold <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 | Linus Torvalds | 1 | -7/+2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: isofs: Fix lseek() to position beyond 4 GB vfs: remove unused MNT_STRICTATIME vfs: show unreachable paths in getcwd and proc vfs: only add " (deleted)" where necessary vfs: add prepend_path() helper vfs: __d_path: dont prepend the name of the root dentry ia64: perfmon: add d_dname method vfs: add helpers to get root and pwd cachefiles: use path_get instead of lone dget fs/sysv/super.c: add support for non-PDP11 v7 filesystems V7: Adjust sanity checks for some volumes Add v7 alias v9fs: fixup for inode_setattr being removed Manual merge to take Al's version of the fs/sysv/super.c file: it merged cleanly, but Al had removed an unnecessary header include, so his side was better.
2010-08-11 | kfifo: replace the old non generic API | Stefani Seibold | 2 | -898/+453
Simply replace the whole kfifo.c and kfifo.h files with the new generic version and fix the kerneldoc API template file. Signed-off-by: Stefani Seibold <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | kfifo: add the new generic kfifo API | Stefani Seibold | 1 | -0/+602
Add the new version of the kfifo API files kfifo.c and kfifo.h. Signed-off-by: Stefani Seibold <[email protected]> Cc: Greg KH <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | kexec: return -EFAULT on copy_to_user() failures | Dan Carpenter | 1 | -3/+5
copy_to/from_user() returns the number of bytes remaining to be copied. It never returns a negative value. The correct return code is -EFAULT and not -EIO. All the callers check for non-zero returns so that's Ok, but the return code is passed to the user so we should fix this. Signed-off-by: Dan Carpenter <[email protected]> Cc: Hidetoshi Seto <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Simon Kagstrom <[email protected]> Acked-by: WANG Cong <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
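The convention described here, where the copy routine reports the number of bytes left uncopied and the caller turns any non-zero result into -EFAULT, can be sketched in user space. Everything named below is hypothetical: `fake_copy_to_user` only imitates the return convention (the `writable` parameter simulates how much of the destination is accessible), and `export_value` stands in for a caller.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): copies what it can and
 * returns the number of bytes that could NOT be copied, never a
 * negative value.
 */
static size_t fake_copy_to_user(void *dst, const void *src, size_t n,
                                size_t writable)
{
    size_t ok = n < writable ? n : writable;

    memcpy(dst, src, ok);
    return n - ok;
}

/*
 * Hypothetical caller: the count of uncopied bytes is not an errno, so
 * any non-zero result is translated into -EFAULT (not -EIO) before it
 * is handed back to the user.
 */
static int export_value(void *dst, size_t writable, const void *src, size_t n)
{
    if (fake_copy_to_user(dst, src, n, writable) != 0)
        return -EFAULT;
    return 0;
}
```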
2010-08-11 | lib/bug.c: add oops end marker to WARN implementation | Anton Blanchard | 1 | -1/+1
We are missing the oops end marker for the exception based WARN implementation in lib/bug.c. This is useful for logfile analysis tools. Signed-off-by: Anton Blanchard <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | panic: keep blinking in spite of long spin timer mode | TAMUKI Shoichi | 1 | -32/+26
To keep panic_timeout accurate when running under a hypervisor, the current implementation only spins on long (1 second) calls to mdelay(). That works well, but the keyboard LEDs don't blink at all in that situation.

This patch calls panic_blink_enter() between every mdelay() and keeps blinking in spite of long spin timer mode.

Each mdelay() call is now 100ms. Even with this change, panic_timeout stays accurate enough when running under a hypervisor.

Signed-off-by: TAMUKI Shoichi <[email protected]>
Cc: Ben Dooks <[email protected]>
Cc: Russell King <[email protected]>
Acked-by: Dmitry Torokhov <[email protected]>
Cc: Anton Blanchard <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | pids: alloc_pidmap: remove the unnecessary boundary checks | Oleg Nesterov | 1 | -10/+7
alloc_pidmap() calculates max_scan so that if the initial offset != 0 we inspect the first map->page twice. This is correct, we want to find the unused bits < offset in this bitmap block. Add the comment.

But it doesn't make any sense to stop the find_next_offset() loop when we are looking into this map->page for the second time. We have already checked the bits >= offset during the first attempt, it is fine to do this again, no matter if we succeed this time or not.

Remove this hard-to-understand code. It optimizes the very unlikely case when we are going to fail, but slows down the more likely case.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Salman Qazi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | pids: fix a race in pid generation that causes pids to be reused immediately | Salman | 1 | -1/+38
A program that repeatedly forks and waits is susceptible to having the same pid repeated, especially when it competes with another instance of the same program. This is really bad for the bash implementation. Furthermore, many shell scripts assume that pid numbers will not be used for some length of time.

Race Description:

	A                                    B

	// pid == offset == n                // pid == offset == n + 1
	test_and_set_bit(offset, map->page)
	                                     test_and_set_bit(offset, map->page);
	                                     pid_ns->last_pid = pid;
	pid_ns->last_pid = pid;
	                                     // pid == n + 1 is freed (wait())

	                                     // Next fork()...
	                                     last = pid_ns->last_pid; // == n
	                                     pid = last + 1;

Code to reproduce it (Running multiple instances is more effective):

	#include <errno.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>
	#include <stdio.h>
	#include <stdlib.h>

	// The distance mod 32768 between two pids, where the first pid is expected
	// to be smaller than the second.
	int PidDistance(pid_t first, pid_t second) {
	  return (second + 32768 - first) % 32768;
	}

	int main(int argc, char* argv[]) {
	  int failed = 0;
	  pid_t last_pid = 0;
	  int i;
	  printf("%zu\n", sizeof(pid_t));
	  for (i = 0; i < 10000000; ++i) {
	    if (i % 32768 == 0)
	      printf("Iter: %d\n", i/32768);
	    int child_exit_code = i % 256;
	    pid_t pid = fork();
	    if (pid == -1) {
	      fprintf(stderr, "fork failed, iteration %d, errno=%d", i, errno);
	      exit(1);
	    }
	    if (pid == 0) {
	      // Child
	      exit(child_exit_code);
	    } else {
	      // Parent
	      if (i > 0) {
	        int distance = PidDistance(last_pid, pid);
	        if (distance == 0 || distance > 30000) {
	          fprintf(stderr,
	                  "Unexpected pid sequence: previous fork: pid=%d, "
	                  "current fork: pid=%d for iteration=%d.\n",
	                  last_pid, pid, i);
	          failed = 1;
	        }
	      }
	      last_pid = pid;
	      int status;
	      int reaped = wait(&status);
	      if (reaped != pid) {
	        fprintf(stderr,
	                "Wait return value: expected pid=%d, "
	                "got %d, iteration %d\n",
	                pid, reaped, i);
	        failed = 1;
	      } else if (WEXITSTATUS(status) != child_exit_code) {
	        fprintf(stderr,
	                "Unexpected exit status %x, iteration %d\n",
	                WEXITSTATUS(status), i);
	        failed = 1;
	      }
	    }
	  }
	  exit(failed);
	}

Thanks to Ted Tso for the key ideas of this implementation.

Signed-off-by: Salman Qazi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sukadev Bhattiprolu <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
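The message describes the race but not the shape of the fix. One way to close such a window, sketched here with C11 atomics purely as an assumption and not as the kernel's actual code, is to publish last_pid with a compare-and-swap that refuses to move the value backwards relative to the point where the allocation started. The names `pid_before` and `publish_last_pid` are hypothetical.

```c
#include <stdatomic.h>

/*
 * Returns non-zero if 'a' comes before 'b' when both are measured as a
 * distance forward from 'base'. Unsigned arithmetic keeps the
 * comparison meaningful when pid numbers wrap around.
 */
static int pid_before(int base, int a, int b)
{
    return ((unsigned)a - (unsigned)base) < ((unsigned)b - (unsigned)base);
}

/*
 * Hypothetical publisher: 'base' is the last_pid value this task
 * started from and 'pid' is the pid it allocated. The store only
 * happens while the current value is still behind 'pid', so a slow
 * task (A in the diagram) cannot overwrite a larger value already
 * published by a faster one (B).
 */
static void publish_last_pid(atomic_int *last_pid, int base, int pid)
{
    int expected = base;

    /* On failure, 'expected' is reloaded with the current value. */
    while (!atomic_compare_exchange_strong(last_pid, &expected, pid)) {
        if (!pid_before(base, expected, pid))
            break;  /* someone already advanced to or past 'pid' */
    }
}
```

Replaying the diagram with this sketch: both tasks start from the same base, B publishes n + 1 first, and A's later attempt to publish n leaves last_pid at n + 1.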
2010-08-11 | ptrace: optimize exit_ptrace() for the likely case | Oleg Nesterov | 2 | -5/+14
exit_ptrace() takes tasklist_lock unconditionally. We need this lock to avoid the race with ptrace_traceme(), it acts as a barrier. Change its caller, forget_original_parent(), to call exit_ptrace() under tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if needed. This allows us to add the fastpath list_empty(ptraced) check. In the likely no-tracees case exit_ptrace() just returns and we avoid the lock() + unlock() sequence. "Zhang, Yanmin" <[email protected]> suggested to add this check, and he reports that this change adds about 11% improvement in some tests. Suggested-and-tested-by: "Zhang, Yanmin" <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Roland McGrath <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-11 | cgroups: save space for the terminator | Dan Carpenter | 1 | -2/+2
The original code didn't leave enough space for a NUL terminator. These strings are copied with strcpy() into fixed length buffers in cgroup_root_from_opts().

Signed-off-by: Dan Carpenter <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Ben Blum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
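The off-by-one behind this kind of bug is easy to show in isolation: a string that is later copied with strcpy() into a buffer of N bytes may hold at most N - 1 characters, because the terminator needs the last byte. A minimal user-space sketch, with a hypothetical buffer size `NAME_BUF_LEN` and helper `copy_name` that are not taken from the cgroup code:

```c
#include <stddef.h>
#include <string.h>

#define NAME_BUF_LEN 8  /* hypothetical fixed buffer size */

/*
 * Copy src into a fixed-size buffer, rejecting anything that would not
 * leave room for the terminating '\0'. Returns 0 on success, -1 if the
 * string is too long.
 */
static int copy_name(char dst[NAME_BUF_LEN], const char *src)
{
    size_t len = strlen(src);

    if (len > NAME_BUF_LEN - 1)   /* checking against NAME_BUF_LEN is the bug */
        return -1;
    memcpy(dst, src, len + 1);    /* len characters plus the terminator */
    return 0;
}
```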
2010-08-11 | param: locking for kernel parameters | Rusty Russell | 1 | -7/+26
There may be cases (most obviously, sysfs-writable charp parameters) where a module needs to prevent sysfs access to parameters. Rather than express this in terms of a big lock, the functions are expressed in terms of what they protect against. This is clearer, esp. if the implementation changes to a module-level or even param-level lock. Signed-off-by: Rusty Russell <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Tested-by: Phil Carmody <[email protected]>
2010-08-11 | param: make param sections const. | Rusty Russell | 1 | -2/+2
Since this section can be read-only (they're in .rodata), they should always have been const. Minor flow-through various functions. Signed-off-by: Rusty Russell <[email protected]> Tested-by: Phil Carmody <[email protected]>
2010-08-11 | param: use free hook for charp (fix leak of charp parameters) | Rusty Russell | 1 | -2/+50
Instead of using a "I kmalloced this" flag, we keep track of the kmalloced strings and use that list to check if we need to kfree (in practice, the list is very short). This means that kparams can be const again, and plugs a leak. This is important for drivers/usb/gadget/nokia.c which gets modprobe/rmmod'ed frequently on the N9000. Signed-off-by: Rusty Russell <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Cc: Artem Bityutskiy <[email protected]> Tested-by: Phil Carmody <[email protected]>
2010-08-11 | param: add a free hook to kernel_param_ops. | Rusty Russell | 1 | -1/+16
This allows us to generalize the KPARAM_KMALLOCED flag, by calling a function on every parameter when a module is unloaded. Signed-off-by: Rusty Russell <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Tested-by: Phil Carmody <[email protected]>
2010-08-11 | param: use ops in struct kernel_param, rather than get and set fns directly | Rusty Russell | 1 | -28/+62
This is more kernel-ish, saves some space, and also allows us to expand the ops without breaking all the callers who are happy for the new members to be NULL. The few places which defined their own param types are changed to the new scheme (more which crept in recently fixed in following patches). Since we're touching them anyway, we change get() and set() to take a const struct kernel_param (which they really are). This causes some harmless warnings until we fix them (in following patches). To reduce churn, module_param_call creates the ops struct so the callers don't have to change (and casts the functions to reduce warnings). The modern version which takes an ops struct is called module_param_cb. Signed-off-by: Rusty Russell <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Tested-by: Phil Carmody <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Ville Syrjala <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Alessandro Rubini <[email protected]> Cc: Michal Januszewski <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: "J. Bruce Fields" <[email protected]> Cc: Neil Brown <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]
2010-08-11 | param: move the EXPORT_SYMBOL to after the definitions. | Rusty Russell | 1 | -26/+13
This is modern style, and good to do before we start changing things. Signed-off-by: Rusty Russell <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Tested-by: Phil Carmody <[email protected]>
2010-08-11 | params: don't hand NULL values to param.set callbacks. | Rusty Russell | 1 | -17/+3
An audit by Dongdong Deng revealed that most driver-author-written param calls don't handle val == NULL (which happens when parameters are specified with no =, eg "foo" instead of "foo=1"). The only real case to use this is boolean, so handle it specially for that case and remove a source of bugs for everyone else. Signed-off-by: Rusty Russell <[email protected]> Cc: Dongdong Deng <[email protected]> Cc: Américo Wang <[email protected]>
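The special case for booleans can be illustrated with a small user-space parser. This is a sketch under assumptions, not the kernel's parameter code: `parse_bool_param` is a hypothetical name, and the accepted spellings are chosen only for the example. The point is that a bare "foo" (val == NULL) is treated as "foo=1" for a boolean, so no other setter ever has to cope with a NULL value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical boolean setter. A NULL val means the parameter was given
 * without "=value", which for a boolean simply means "enable it".
 * Returns 0 on success, -1 for an unrecognized value.
 */
static int parse_bool_param(const char *val, bool *out)
{
    if (val == NULL)
        val = "1";

    if (strcmp(val, "1") == 0 || strcmp(val, "y") == 0 ||
        strcmp(val, "Y") == 0) {
        *out = true;
        return 0;
    }
    if (strcmp(val, "0") == 0 || strcmp(val, "n") == 0 ||
        strcmp(val, "N") == 0) {
        *out = false;
        return 0;
    }
    return -1;
}
```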
2010-08-11 | vfs: add helpers to get root and pwd | Miklos Szeredi | 1 | -7/+2
Add three helpers that retrieve a refcounted copy of the root and cwd from the supplied fs_struct. get_fs_root() get_fs_pwd() get_fs_root_and_pwd() Signed-off-by: Miklos Szeredi <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-10 | kernel/timer.c: fix kernel-doc function parameter warning | Randy Dunlap | 1 | -0/+1
Fix kernel-doc warning, add @timer description: Warning(kernel/timer.c:335): No description found for parameter 'timer' Signed-off-by: Randy Dunlap <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-10 | Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block | Linus Torvalds | 2 | -18/+64
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits) block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n xen-blkfront: fix missing out label blkdev: fix blkdev_issue_zeroout return value block: update request stacking methods to support discards block: fix missing export of blk_types.h writeback: fix bad _bh spinlock nesting drbd: revert "delay probes", feature is being re-implemented differently drbd: Initialize all members of sync_conf to their defaults [Bugz 315] drbd: Disable delay probes for the upcomming release writeback: cleanup bdi_register writeback: add new tracepoints writeback: remove unnecessary init_timer call writeback: optimize periodic bdi thread wakeups writeback: prevent unnecessary bdi threads wakeups writeback: move bdi threads exiting logic to the forker thread writeback: restructure bdi forker loop a little writeback: move last_active to bdi writeback: do not remove bdi from bdi_list writeback: simplify bdi code a little writeback: do not lose wake-ups in bdi threads ... Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and drivers/scsi/scsi_error.c as per Jens.
2010-08-10 | Merge branch 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux | Linus Torvalds | 3 | -62/+165
* 'writable_limits' of git://decibel.fi.muni.cz/~xslaby/linux: unistd: add __NR_prlimit64 syscall numbers rlimits: implement prlimit64 syscall rlimits: switch more rlimit syscalls to do_prlimit rlimits: redo do_setrlimit to more generic do_prlimit rlimits: add rlimit64 structure rlimits: do security check under task_lock rlimits: allow setrlimit to non-current tasks rlimits: split sys_setrlimit rlimits: selinux, do rlimits changes under task_lock rlimits: make sure ->rlim_max never grows in sys_setrlimit rlimits: add task_struct to update_rlimit_cpu rlimits: security, add task_struct to setrlimit Fix up various system call number conflicts. We not only added fanotify system calls in the meantime, but asm-generic/unistd.h added a wait4 along with a range of reserved per-architecture system calls.
2010-08-10 | Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify | Linus Torvalds | 9 | -278/+325
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits) fanotify: use both marks when possible fsnotify: pass both the vfsmount mark and inode mark fsnotify: walk the inode and vfsmount lists simultaneously fsnotify: rework ignored mark flushing fsnotify: remove global fsnotify groups lists fsnotify: remove group->mask fsnotify: remove the global masks fsnotify: cleanup should_send_event fanotify: use the mark in handler functions audit: use the mark in handler functions dnotify: use the mark in handler functions inotify: use the mark in handler functions fsnotify: send fsnotify_mark to groups in event handling functions fsnotify: Exchange list heads instead of moving elements fsnotify: srcu to protect read side of inode and vfsmount locks fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called fsnotify: use _rcu functions for mark list traversal fsnotify: place marks on object in order of group memory address vfs/fsnotify: fsnotify_close can delay the final work in fput fsnotify: store struct file not struct path ... Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
2010-08-10 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 | Linus Torvalds | 1 | -1/+1
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-09 | gcc-4.6: printk: use stable variable to dump kmsg buffer | Andi Kleen | 1 | -5/+5
kmsg_dump takes care to sample the global variables inside a spinlock, but then goes on to use the same variables outside the spinlock region too. Use the correct variable. This will make the race window smaller. Found by gcc 4.6's new warnings. Signed-off-by: Andi Kleen <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09 | stop_machine: struct cpu_stopper, remove alignment padding on 64 bits | Richard Kennedy | 1 | -1/+1
Reorder elements in structure cpu_stopper to remove alignment padding on 64 bit builds, this shrinks its size from 40 to 32 bytes saving 8 bytes per cpu. Signed-off-by: Richard Kennedy <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Rusty Russell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
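The effect of member order on padding can be reproduced with a stand-alone example. The member names and types below are assumptions made for illustration (the message does not list the fields of cpu_stopper); what matters is the pattern of a 4-byte member, a 1-byte member and pointer-aligned members. On a typical LP64 target the first layout comes out at 40 bytes and the second at 32, matching the sizes quoted above, though the exact numbers depend on the ABI.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-pointer list head, 16 bytes on LP64. */
struct list_node {
    struct list_node *next, *prev;
};

/*
 * Before: the 4-byte lock is followed by 4 bytes of padding so that
 * 'works' is pointer-aligned, and the trailing bool forces 7 more
 * bytes of tail padding.
 */
struct stopper_before {
    int lock;
    struct list_node works;
    void *thread;
    bool enabled;
};

/*
 * After: the bool is moved into the hole right after the lock, so only
 * 3 bytes of padding remain and there is no tail padding.
 */
struct stopper_after {
    int lock;
    bool enabled;
    struct list_node works;
    void *thread;
};
```

Comparing `sizeof(struct stopper_before)` with `sizeof(struct stopper_after)` on the target of interest shows the saving per instance.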
2010-08-09kernel/range: remove unused definition of ARRAY_SIZE()Geert Uytterhoeven1-4/+0
Remove duplicate definition of ARRAY_SIZE(), which was never used anyway. Signed-off-by: Geert Uytterhoeven <[email protected]> Cc: Yinghai Lu <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09sys_personality: remove the bogus checks in ↵Oleg Nesterov1-17/+5
sys_personality()->__set_personality() path Cleanup, no functional changes. - __set_personality() always changes ->exec_domain/personality, the special case when ->exec_domain remains the same buys nothing but complicates the code. Unify both cases to simplify the code. - The -EINVAL check in sys_personality() was never right. If we assume that set_personality() can fail we should check the value it returns instead of verifying that task->personality was actually changed. Remove it. Before the previous patch it was possible to hit this case due to overflow problems, but this -EINVAL just indicated a kernel bug. OTOH, probably it makes sense to change lookup_exec_domain() to return ERR_PTR() instead of default_exec_domain if the search in exec_domains list fails, and report this error to the user-space. But this means another user-space change, and we have in-kernel users which need fixes. For example, PER_OSF4 falls into PER_MASK for unknown reason and nobody cares to register this domain. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Wenming Zhang <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09hibernation: freeze swap at hibernationKAMEZAWA Hiroyuki3-3/+5
When taking a memory snapshot in hibernate_snapshot(), all (directly called) memory allocations use GFP_ATOMIC. Hence swap misusage during hibernation never occurs. But from a pessimistic point of view, there is no guarantee that no page allocation has __GFP_WAIT. It is better to have a global indication "we enter hibernation, don't use swap!". This patch tries to freeze new-swap-allocation during hibernation. (All user processes are frozen, so swapin is not a concern). This way, no updates will happen to swap_map[] between hibernate_snapshot() and save_image(). Swap is thawed when swsusp_free() is called. We can be assured that swap corruption will not occur. Signed-off-by: KAMEZAWA Hiroyuki <[email protected]> Cc: "Rafael J. Wysocki" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Ondrej Zary <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Andrea Arcangeli <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09oom: badness heuristic rewriteDavid Rientjes1-0/+1
This is a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space are used. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. 
It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatibility. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <[email protected]> Cc: Nick Piggin <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: KOSAKI Motohiro <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Balbir Singh <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
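The scoring arithmetic described in this commit message can be sketched roughly as below. This is a simplified illustration, not the code in mm/oom_kill.c: the function name badness_sketch, its parameters, the order of the adjustments and the final clamping to the 0..1000 range are assumptions made for the example. Only the proportional baseline, the 3% allowance for root and the direct addition of oom_score_adj are taken from the text above.

```c
#include <stdbool.h>

/*
 * Hypothetical sketch of the rewritten heuristic.
 *   rss_pages, swap_pages : memory the task holds in RAM and in swap
 *   allowable_pages       : memory the task is allowed to use in this context
 *   is_root               : whether the task runs as root
 *   oom_score_adj         : per-task tunable, -1000 .. +1000
 */
long badness_sketch(unsigned long rss_pages, unsigned long swap_pages,
                    unsigned long allowable_pages, bool is_root,
                    int oom_score_adj)
{
    unsigned long long used = (unsigned long long)rss_pages + swap_pages;
    long points;

    if (allowable_pages == 0)
        allowable_pages = 1;    /* avoid dividing by zero in the sketch */

    /* Baseline: share of allowable memory, on a 0..1000 scale. */
    points = (long)(used * 1000ULL / allowable_pages);

    /* Root tasks get 3% extra memory, i.e. 30 points on this scale. */
    if (is_root)
        points -= 30;

    /* The userspace tunable is added directly into the score. */
    points += oom_score_adj;

    /* Assumed clamping to the documented 0..1000 range. */
    if (points < 0)
        return 0;
    if (points > 1000)
        return 1000;
    return points;
}
```

Under these assumptions a task using half of the allowable memory scores 500, and the same task with an oom_score_adj of -500 scores 0, which matches the "discount 50% of its memory consumption" reading given in the message.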
2010-08-09oom: move sysctl declarations to oom.hDavid Rientjes1-3/+1
The three oom killer sysctl variables (sysctl_oom_dump_tasks, sysctl_oom_kill_allocating_task, and sysctl_panic_on_oom) are better declared in include/linux/oom.h rather than kernel/sysctl.c. Signed-off-by: David Rientjes <[email protected]> Acked-by: KOSAKI Motohiro <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2010-08-09pass a struct path to vfs_statfsChristoph Hellwig1-1/+1
We'll need the path to implement the flags field for statvfs support. We do have it available in all callers except: - ecryptfs_statfs. This one doesn't actually need vfs_statfs but just needs to make a call to the lower filesystem statfs method. - sys_ustat. Add a non-exported statfs_by_dentry helper for it which won't be able to fill out the flags field later on. In addition rename the helpers for statfs vs fstatfs to do_*statfs instead of the misleading vfs prefix. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2010-08-09workqueue: workqueue_cpu_callback() should be cpu_notifier instead of ↵Tejun Heo1-1/+1
hotcpu_notifier Commit 6ee0578b (workqueue: mark init_workqueues as early_initcall) made workqueue SMP initialization depend on workqueue_cpu_callback(), which however was registered as hotcpu_notifier() and didn't get called if CONFIG_HOTPLUG_CPU is not set. This made gcwqs on non-boot CPUs not create their initial workers leading to boot failures. Fix it by making it a cpu_notifier. Signed-off-by: Tejun Heo <[email protected]> Reported-and-bisected-by: walt <[email protected]> Tested-by: Markus Trippelsdorf <[email protected]>
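The failure mode described here can be illustrated with a small user-space model. Everything in it is hypothetical: the macros cpu_notifier_sketch() and hotcpu_notifier_sketch(), the SKETCH_HOTPLUG_CPU switch and the helper functions only mimic the idea that a hotplug-only registration compiles to nothing when hotplug support is disabled. They are not the kernel's definitions.

```c
#include <stddef.h>

typedef int (*cpu_cb_t)(int cpu);

/* Tiny callback registry standing in for the CPU notifier chain. */
static cpu_cb_t registered[8];
static size_t nr_registered;
static int workers_created;

static void register_cb(cpu_cb_t fn)
{
    if (nr_registered < sizeof(registered) / sizeof(registered[0]))
        registered[nr_registered++] = fn;
}

/* Always registers the callback. */
#define cpu_notifier_sketch(fn) register_cb(fn)

/* Registers only when the (hypothetical) hotplug option is enabled. */
#ifdef SKETCH_HOTPLUG_CPU
#define hotcpu_notifier_sketch(fn) register_cb(fn)
#else
#define hotcpu_notifier_sketch(fn) do { (void)(fn); } while (0)
#endif

/* Stand-in for the callback that creates a CPU's initial worker. */
static int create_initial_worker(int cpu)
{
    (void)cpu;
    workers_created++;
    return 0;
}

void register_via_hotcpu(void) { hotcpu_notifier_sketch(create_initial_worker); }
void register_via_cpu(void)    { cpu_notifier_sketch(create_initial_worker); }

/* Models bringing a non-boot CPU online: run every registered callback. */
void bring_up_cpu(int cpu)
{
    size_t i;

    for (i = 0; i < nr_registered; i++)
        registered[i](cpu);
}

int workers(void) { return workers_created; }
```

Built without SKETCH_HOTPLUG_CPU, calling register_via_hotcpu() and then bring_up_cpu() creates no worker, while register_via_cpu() followed by bring_up_cpu() does, which is the difference the fix depends on.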
2010-08-08workqueue: add missing __percpu markup in kernel/workqueue.cNamhyung Kim1-1/+1
works in schedule_on_each_cpu() is a percpu pointer but was missing __percpu markup. Add it. Signed-off-by: Namhyung Kim <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2010-08-07Merge branch 'bkl/core' of ↵Linus Torvalds1-8/+0
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: do_coredump: Do not take BKL init: Remove the BKL from startup code