path: root/kernel
2014-06-10  gcov: add support for GCC 4.9  [Yuan Pengfei, 2 files, -0/+11]

This patch handles the gcov-related changes in GCC 4.9: A new counter
(time profile) is added. The total number is 9 now. A new profile merge
function __gcov_merge_time_profile is added. See gcc/gcov-io.h and
libgcc/libgcov-merge.c.

For the first change, the layout of struct gcov_info is affected. For
the second one, a dummy function is added to kernel/gcov/base.c
similarly.

Signed-off-by: Yuan Pengfei <[email protected]>
Acked-by: Peter Oberparleiter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-10  fs,userns: Change inode_capable to capable_wrt_inode_uidgid  [Andy Lutomirski, 1 file, -12/+8]

The kernel has no concept of capabilities with respect to inodes;
inodes exist independently of namespaces. For example,
inode_capable(inode, CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does. Fixes CVE-2014-4014.

Cc: Theodore Ts'o <[email protected]>
Cc: Serge Hallyn <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

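The shape of the renamed check, per the description above, is a
capability test in the caller's user namespace plus uid/gid mapping
checks; a minimal sketch, assuming a simplified form of the change
rather than the verbatim patch:

    /* Sketch: capable in the caller's userns AND the inode's uid/gid
     * must map into that namespace. (Assumption: simplified.) */
    bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
    {
            struct user_namespace *ns = current_user_ns();

            return ns_capable(ns, cap) &&
                   kuid_has_mapping(ns, inode->i_uid) &&
                   kgid_has_mapping(ns, inode->i_gid);
    }
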
2014-06-10  tracing: Fix check of ftrace_trace_arrays list_empty() check  [Steven Rostedt (Red Hat), 1 file, -1/+1]

The check that tests if ftrace_trace_arrays is empty in
top_trace_array() uses the .prev pointer:

    if (list_empty(ftrace_trace_arrays.prev))

instead of testing the variable itself:

    if (list_empty(&ftrace_trace_arrays))

Although it is technically correct, it is awkward and confusing. Use
the proper method.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Namhyung Kim <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-10  tracing: Fix leak of per cpu max data in instances  [Steven Rostedt (Red Hat), 1 file, -9/+12]

When an instance has max data configured, per cpu data structures are
created for it. But these are not freed when the instance is deleted,
which causes a memory leak.

A new helper function is added that frees the individual buffers
within a trace array, instead of duplicating the code. This way
changes made for one are applied to the other (normal buffer vs max
buffer).

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Namhyung Kim <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-10  auditsc: audit_krule mask accesses need bounds checking  [Andy Lutomirski, 1 file, -9/+18]

Fixes an easy DoS and possible information disclosure. This does
nothing about the broken state of x32 auditing.

eparis: this only matters if the admin has enabled auditd and has
specifically loaded audit rules. This bug has been around since before
git. Wow...

Cc: [email protected]
Signed-off-by: Andy Lutomirski <[email protected]>
Signed-off-by: Eric Paris <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

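The class of fix here is a bounds check before indexing the rule's
syscall bitmask; a hedged sketch (assumption: simplified from
kernel/auditsc.c, not the verbatim patch):

    /* Sketch: refuse to index past the end of rule->mask[]. */
    static inline bool audit_in_mask(const struct audit_krule *rule,
                                     unsigned long val)
    {
            int word, bit;

            if (val > 0xffffffff)
                    return false;
            word = AUDIT_WORD(val);
            if (word >= AUDIT_BITMASK_SIZE)
                    return false;   /* the missing bounds check */
            bit = AUDIT_BIT(val);
            return rule->mask[word] & bit;
    }
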
2014-06-10  tracing: Cleanup saved_cmdlines_size changes  [Namhyung Kim, 1 file, -3/+3]

The recent addition of the saved_cmdlines_size file left some
remaining (minor, mostly coding style) issues. Fix them by passing the
pointer name to sizeof() and using scnprintf().

Link: http://lkml.kernel.org/p/[email protected]
Cc: Namhyung Kim <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Yoshihiro YUNOMAE <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-10  ring-buffer: Check if buffer exists before polling  [Steven Rostedt (Red Hat), 2 files, -7/+20]

The per_cpu buffers are created one per possible CPU. But this does
not mean that those CPUs are online, nor that they even exist.

With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if the
user reads trace_pipe from a CPU that does not exist, and this causes
the kernel to crash.

The simple fix is to check the cpu against the buffer bitmask to see
if the buffer was allocated or not, and return -ENODEV if it is not.
More updates were done to pass the -ENODEV back up to userspace.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Sasha Levin <[email protected]>
Cc: [email protected] # 3.10+
Signed-off-by: Steven Rostedt <[email protected]>

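A minimal sketch of the guard described above (assumption: simplified;
the surrounding poll/wait code is elided):

    /* Sketch: fail early if no buffer was ever allocated for cpu. */
    if (cpu != RING_BUFFER_ALL_CPUS &&
        !cpumask_test_cpu(cpu, buffer->cpumask))
            return -ENODEV;
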
2014-06-09  Merge tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  [Linus Torvalds, 17 files, -460/+951]

Pull tracing updates from Steven Rostedt:
 "Lots of tweaks, small fixes, optimizations, and some helper functions
  to help out the rest of the kernel to ease their use of trace events.

  The big change for this release is the allowing of other tracers,
  such as the latency tracers, to be used in the trace instances and
  allow for function or function graph tracing to be in the top level
  simultaneously"

* tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  tracing: Fix memory leak on instance deletion
  tracing: Fix leak of ring buffer data when new instances creation fails
  tracing/kprobes: Avoid self tests if tracing is disabled on boot up
  tracing: Return error if ftrace_trace_arrays list is empty
  tracing: Only calculate stats of tracepoint benchmarks for 2^32 times
  tracing: Convert stddev into u64 in tracepoint benchmark
  tracing: Introduce saved_cmdlines_size file
  tracing: Add __get_dynamic_array_len() macro for trace events
  tracing: Remove unused variable in trace_benchmark
  tracing: Eliminate double free on failure of allocation on boot up
  ftrace/x86: Call text_ip_addr() instead of the duplicated code
  tracing: Print max callstack on stacktrace bug
  tracing: Move locking of trace_cmdline_lock into start/stop seq calls
  tracing: Try again for saved cmdline if failed due to locking
  tracing: Have saved_cmdlines use the seq_read infrastructure
  tracing: Add tracepoint benchmark tracepoint
  tracing: Print nasty banner when trace_printk() is in use
  tracing: Add funcgraph_tail option to print function name after closing braces
  tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
  tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
  ...

2014-06-09  Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  [Linus Torvalds, 6 files, -684/+1220]

Pull cgroup updates from Tejun Heo:
 "A lot of activities on cgroup side. Heavy restructuring including
  locking simplification took place to improve the code base and enable
  implementation of the unified hierarchy, which currently exists
  behind a __DEVEL__ mount option. The core support is mostly complete
  but individual controllers need further work. To explain the design
  and rationales of the unified hierarchy,
  Documentation/cgroups/unified-hierarchy.txt is added.

  Another notable change is the css (cgroup_subsys_state - what each
  controller uses to identify and interact with a cgroup) iteration
  update. This is part of continuing updates on css object lifetime and
  visibility. cgroup started with reference count draining on removal
  way back and is now reaching a point where csses behave and are
  iterated like normal refcnted objects albeit with some complexities
  to allow distinguishing the state where they're being deleted. The
  css iteration update isn't taken advantage of yet but is planned to
  be used to simplify memcg significantly"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
  cgroup: disallow disabled controllers on the default hierarchy
  cgroup: don't destroy the default root
  cgroup: disallow debug controller on the default hierarchy
  cgroup: clean up MAINTAINERS entries
  cgroup: implement css_tryget()
  device_cgroup: use css_has_online_children() instead of has_children()
  cgroup: convert cgroup_has_live_children() into css_has_online_children()
  cgroup: use CSS_ONLINE instead of CGRP_DEAD
  cgroup: iterate cgroup_subsys_states directly
  cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
  cgroup: move cgroup->serial_nr into cgroup_subsys_state
  cgroup: link all cgroup_subsys_states in their sibling lists
  cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
  cgroup: remove cgroup->parent
  device_cgroup: remove direct access to cgroup->children
  memcg: update memcg_has_children() to use css_next_child()
  memcg: remove tasks/children test from mem_cgroup_force_empty()
  cgroup: remove css_parent()
  cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
  cgroup: use cgroup->self.refcnt for cgroup refcnting
  ...

2014-06-09  Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  [Linus Torvalds, 2 files, -301/+149]

Pull workqueue updates from Tejun Heo:
 "Lai simplified worker destruction path and internal workqueue locking
  and there are some other minor changes. Except for the removal of
  some long-deprecated interfaces which haven't had any in-kernel user
  for quite a while, there shouldn't be any difference to workqueue
  users"

* 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: remove the confusing POOL_FREEZING
  workqueue: rename first_worker() to first_idle_worker()
  workqueue: remove unused work_clear_pending()
  workqueue: remove unused WORK_CPU_END
  workqueue: declare system_highpri_wq
  workqueue: use generic attach/detach routine for rescuers
  workqueue: separate pool-attaching code out from create_worker()
  workqueue: rename manager_mutex to attach_mutex
  workqueue: narrow the protection range of manager_mutex
  workqueue: convert worker_idr to worker_ida
  workqueue: separate iteration role from worker_idr
  workqueue: destroy worker directly in the idle timeout handler
  workqueue: async worker destruction
  workqueue: destroy_worker() should destroy idle workers only
  workqueue: use manager lock only to protect worker_idr
  workqueue: Remove deprecated system_nrt[_freezable]_wq
  workqueue: Remove deprecated flush[_delayed]_work_sync()
  kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info
  workqueue: simplify wq_update_unbound_numa() by jumping to use_dfl_pwq if the target cpumask equals wq's

2014-06-09  Revert "perf: Disable PERF_RECORD_MMAP2 support"  [Don Zickus, 1 file, -4/+0]

This reverts commit 3090ffb5a2515990182f3f55b0688a7817325488.

Re-enable the mmap2 interface as we will have a user soon.

Since things have changed since perf disabled mmap2, small tweaks to
the revert had to be done:

  o commit 9d4ecc88 forced (n!=8) to become (n<7)
  o a new libunwind test needed updating to use mmap2 interface

Signed-off-by: Don Zickus <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiri Olsa <[email protected]>

2014-06-09  perf: Pass protection and flags bits through mmap2 interface  [Peter Zijlstra, 1 file, -0/+33]

The mmap2 interface was missing the protection and flags bits needed
to accurately determine whether an mmap'ed memory area was shared or
private, and whether it was readable or not.

Signed-off-by: Peter Zijlstra <[email protected]>
[tweaked patch to compile and wrote changelog]
Signed-off-by: Don Zickus <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Jiri Olsa <[email protected]>

2014-06-08  numa,sched: fix load_too_imbalanced() logic inversion  [Rik van Riel, 1 file, -1/+1]

This function is supposed to return true if the new load imbalance is
worse than the old one. It didn't. I can only hope brown paper bags
are in style.

Now things converge much better on both the 4 node and 8 node systems.

I am not sure why this did not seem to impact specjbb performance on
the 4 node system, which is the system I have full-time access to.

This bug was introduced recently, with commit e63da03639cc
("sched/numa: Allow task switch if load imbalance improves")

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

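What the inversion amounts to, as an illustrative sketch (assumption:
the signature and variable names are simplified, not the verbatim
kernel/sched/fair.c code):

    /* Return true if the imbalance after the proposed task move would
     * be worse than before; the bug had this comparison reversed. */
    static bool load_too_imbalanced(long orig_imb, long new_imb)
    {
            return new_imb > orig_imb;
    }
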
2014-06-08  Merge branch 'next' (accumulated 3.16 merge window patches) into master  [Linus Torvalds, 79 files, -1274/+2130]

Now that 3.15 is released, this merges the 'next' branch into
'master', bringing us to the normal situation where my 'master' branch
is the merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...

2014-06-07  rtmutex: Detect changes in the pi lock chain  [Thomas Gleixner, 1 file, -24/+71]

When we walk the lock chain, we drop all locks after each step. So the
lock chain can change under us before we reacquire the locks. That's
harmless in principle as we just follow the wrong lock path. But it
can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

Now we walk the chain:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:
    lock T2 -> lock L0 -> deadlock detected,

but it's not a deadlock at all.

Brad tried to work around that in the deadlock detection logic itself,
but the more I looked at it the less I liked it, because it's crystal
ball magic after the fact.

We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> next_lock = T2->pi_blocked_on->lock ->
    drop locks

    T2 times out and blocks on L0

    Now we continue:
    lock T2 -> if (next_lock != T2->pi_blocked_on->lock) return;

So if we detect that T2 is now blocked on a different lock we stop the
chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> next_lock = T2->pi_blocked_on->lock ->
    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:
    lock T2 -> if (next_lock != T2->pi_blocked_on->lock) return;

We don't have to follow up the chain at that point, because T2
propagated our priority up to T4 already.

[ Folded a cleanup patch from peterz ]

Signed-off-by: Thomas Gleixner <[email protected]>
Reported-by: Brad Mouring <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]

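A minimal sketch of the chain-change detection described above
(assumption: simplified from kernel/locking/rtmutex.c; the label and
variable names are illustrative):

    /* Remember where the task is blocked before dropping the locks. */
    next_lock = task->pi_blocked_on->lock;
    raw_spin_unlock_irqrestore(&task->pi_lock, flags);

    /* ...all locks dropped; the chain may change under us here... */

    raw_spin_lock_irqsave(&task->pi_lock, flags);
    if (next_lock != task->pi_blocked_on->lock)
            goto out_unlock_pi;  /* blocked elsewhere: stop the walk */
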
2014-06-07  rtmutex: Handle deadlock detection smarter  [Thomas Gleixner, 3 files, -5/+38]

Even in the case when deadlock detection is not requested by the
caller, we can detect deadlocks. Right now the code stops the lock
chain walk and keeps the waiter enqueued, even on itself. Silly not to
yell when such a scenario is detected and to keep the waiter enqueued.

Return -EDEADLK unconditionally and handle it at the call sites.

The futex calls return -EDEADLK. The non futex ones dequeue the
waiter, throw a warning and put the task into a schedule loop.

Tagged for stable as it makes the code more robust.

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Brad Mouring <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

2014-06-06  tracing: Fix memory leak on instance deletion  [Steven Rostedt (Red Hat), 1 file, -2/+1]

When an instance is created, it also gets a snapshot ring buffer
allocated (with a minimum of pages). But when the instance is deleted,
the snapshot buffer is not freed. There was a helper function added to
match the allocation of these ring buffers with a way to free them,
but it wasn't used by the deletion of an instance. Using that helper
function solves this memory leak.

Signed-off-by: Steven Rostedt <[email protected]>

2014-06-06  sysctl: convert use of typedef ctl_table to struct ctl_table  [Joe Perches, 2 files, -4/+4]

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/seccomp.c: kernel-doc warning fix  [Fabian Frederick, 1 file, -2/+2]

Also fixes a small typo.

Signed-off-by: Fabian Frederick <[email protected]>
Cc: "David S. Miller" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  ipc, kernel: clear whitespace  [Paul McQuade, 1 file, -2/+2]

Remove trailing whitespace.

Signed-off-by: Paul McQuade <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  ipc, kernel: use Linux headers  [Paul McQuade, 2 files, -2/+2]

Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Use #include <linux/types.h> instead of <asm/types.h>

Signed-off-by: Paul McQuade <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/profile.c: use static const char instead of static char  [Fabian Frederick, 1 file, -3/+3]

schedstr, sleepstr and kvmstr are only used in strcmp & strlen.

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/profile.c: convert printk to pr_foo()  [Fabian Frederick, 1 file, -9/+5]

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Paul Gortmaker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/user_namespace.c: kernel-doc/checkpatch fixes  [Fabian Frederick, 1 file, -13/+20]

- uid->gid
- split some function declarations
- if/then/else warning

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  sysctl: allow for strict write position handling  [Kees Cook, 1 file, -2/+67]

When writing to a sysctl string, each write, regardless of VFS
position, begins writing the string from the start. This means the
contents of the last write to the sysctl controls the string contents
instead of the first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.

The old behavior is unlike regular POSIX files enough that doing
audits of software that interact with sysctls can end up in unexpected
or dangerous situations. For example, "as long as the input starts
with a trusted path" turns out to be an insufficient filter, as what
must also happen is for the input to be entirely contained in a single
write syscall -- not a common consideration, especially for high level
tools.

This provides kernel.sysctl_writes_strict as a way to make this
behavior act in a less surprising manner for strings, and disallows
non-zero file position when writing numeric sysctls (similar to what
is already done when reading from non-zero file positions). For now,
the default (0) is to warn about non-zero file position use, but
retain the legacy behavior. Setting this to -1 disables the warning,
and setting this to 1 enables the file position respecting behavior.

[[email protected]: fix build]
[[email protected]: move misplaced hunk, per Randy]
Signed-off-by: Kees Cook <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

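A sketch of the three modes described above (assumption: simplified;
the is_string_table flag and the warning text are illustrative names,
not the kernel's actual ones):

    switch (sysctl_writes_strict) {
    case -1:        /* legacy behavior, no warning */
            break;
    case 0:         /* default: legacy behavior, but warn once */
            if (write && *ppos != 0)
                    pr_warn_once("%s: write at non-zero offset\n",
                                 table->procname);
            break;
    case 1:         /* strict: respect the file position */
            if (write && *ppos != 0 && !is_string_table)
                    return -EINVAL; /* numeric sysctls reject it */
            break;
    }
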
2014-06-06  sysctl: refactor sysctl string writing logic  [Kees Cook, 1 file, -7/+4]

Consolidate buffer length checking with new-line/end-of-line checking.
Additionally, instead of reading user memory twice, just do the
assignment during the loop.

This change doesn't affect the potential races here. It was already
possible to read a sysctl that was in the middle of a write. In both
cases, the string will always be NULL terminated. The pre-existing
race remains a problem to be solved.

Signed-off-by: Kees Cook <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  sysctl: clean up char buffer arguments  [Kees Cook, 1 file, -7/+7]

When writing to a sysctl string, each write, regardless of VFS
position, began writing the string from the start. This meant the
contents of the last write to the sysctl controlled the string
contents instead of the first.

This misbehavior was featured in an exploit against Chrome OS. While
it's not in itself a vulnerability, it's a weirdness that isn't on the
mind of most auditors: "This filter looks correct, the first line
written would not be meaningful to sysctl" doesn't apply here, since
the size of the write and the contents of the final write are what
matter when writing to sysctls.

This adds the sysctl kernel.sysctl_writes_strict to control the write
behavior. The default (0) reports when VFS position is non-0 on a
write, but retains legacy behavior; -1 disables the warning, and 1
enables the position-respecting behavior.

The long-term plan here is to wait for userspace to be fixed in
response to the new warning and to then switch the default kernel
behavior to the new position-respecting behavior.

This patch (of 4): The char buffer arguments are needlessly cast in
weird places. Clean it up so things are easier to read.

Signed-off-by: Kees Cook <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/kexec.c: convert printk to pr_foo()  [Fabian Frederick, 1 file, -37/+32]

Also includes some pr_warning -> pr_warn conversions and checkpatch
warning fixes.

Signed-off-by: Fabian Frederick <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Vivek Goyal <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifiers  [Masami Hiramatsu, 1 file, -2/+21]

Add a "crash_kexec_post_notifiers" boot option to run kdump after
running the panic_notifiers and dumping kmsg. This can help rare
situations where kdump fails because of an unstable crashed kernel or
hardware failure (memory corruption on critical data/code), or where
the 2nd kernel is already broken by the 1st kernel (it's a broken
behavior, but who can guarantee that the "crashed" kernel works
correctly?).

Usage: add "crash_kexec_post_notifiers" to the kernel boot options.

Note that this actually increases the risk that kdump itself fails.
This option should be set only if you worry about the rare case of
kdump failure rather than increasing the chance of success.

Signed-off-by: Masami Hiramatsu <[email protected]>
Acked-by: Motohiro Kosaki <[email protected]>
Acked-by: Vivek Goyal <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Yoshihiro YUNOMAE <[email protected]>
Cc: Satoru MORIYA <[email protected]>
Cc: Tomoki Sekiyama <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  smp: print more useful debug info upon receiving IPI on an offline CPU  [Srivatsa S. Bhat, 1 file, -3/+15]

There is a longstanding problem related to CPU hotplug which causes
IPIs to be delivered to offline CPUs, and the smp-call-function IPI
handler code prints out a warning whenever this is detected. Every
once in a while this (usually harmless) warning gets reported on LKML,
but so far it has not been completely fixed. Usually the solution
involves finding out the IPI sender and fixing it by adding
appropriate synchronization with CPU hotplug.

However, while going through one such internal bug report, I found
that there is a significant bug in the receiver side itself (more
specifically, in stop-machine) that can lead to this problem even when
the sender code is perfectly fine. This patchset fixes that
synchronization problem in the CPU hotplug stop-machine code.

Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily. Patch 2 modifies the
stop-machine code to ensure that any IPIs that were sent while the
target CPU was online would be noticed and handled by that CPU without
fail before it goes offline. Thus, this avoids scenarios where IPIs
are received on offline CPUs (as long as the sender uses proper
hotplug synchronization).

In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function. But I was not able to find anything wrong with the block
layer code. That's when I started looking at the stop-machine code and
realized that there is a race window which makes the IPI _receiver_
the culprit, not the sender. Patch 2 fixes that race and hence this
should put an end to most of the hard-to-debug IPI-to-offline-CPU
issues.

This patch (of 2): Today the smp-call-function code just prints a
warning if we get an IPI on an offline CPU. This info is sufficient to
let us know that something went wrong, but often it is very hard to
debug exactly who sent the IPI and why, from this info alone.

In most cases, we get the warning about the IPI to an offline CPU
immediately after the CPU going offline comes out of the stop-machine
phase and reenables interrupts. Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is
already lost by the time we exit the stop-machine loop. So even if we
dump the stack on each CPU at this point, we won't find anything
useful since all of them will show the stack-trace of the stopper
thread. So we need a better way to figure out who sent the IPI and
why.

To achieve this, when we detect an IPI targeted to an offline CPU,
loop through the call-single-data linked list and print out the
payload (i.e., the name of the function which was supposed to be
executed by the target CPU). This would give us an insight as to who
might have sent the IPI and help us debug this further.

[[email protected]: correctly suppress warning output on second and
later occurrences]
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Gautham R Shenoy <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

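A sketch of the debug loop patch 1 adds (assumption: simplified from
kernel/smp.c; the message text and the entry list name are
illustrative):

    /* On an offline CPU, name every pending IPI payload instead of
     * only warning that one arrived. */
    if (unlikely(cpu_is_offline(smp_processor_id()))) {
            struct call_single_data *csd;

            llist_for_each_entry(csd, entry, llist)
                    pr_warn("IPI callback %pS sent to offline CPU\n",
                            csd->func);
    }
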
2014-06-06  signals: change wait_for_helper() to use kernel_sigaction()  [Oleg Nesterov, 1 file, -4/+1]

Now that we have kernel_sigaction() we can change wait_for_helper() to
use it and clean up the code a bit.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: introduce kernel_sigaction()  [Oleg Nesterov, 1 file, -24/+12]

Now that allow_signal() is really trivial we can unify it with
disallow_signal(). Add the new helper, kernel_sigaction(), and
reimplement allow_signal/disallow_signal as trivial wrappers. This
saves one EXPORT_SYMBOL() and the new helper can have more users.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

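A minimal sketch of the new helper and the resulting wrapper
(assumption: simplified from the patch, not the verbatim code):

    void kernel_sigaction(int sig, __sighandler_t action)
    {
            spin_lock_irq(&current->sighand->siglock);
            current->sighand->action[sig - 1].sa.sa_handler = action;

            if (action == SIG_IGN) {
                    sigset_t mask;

                    sigemptyset(&mask);
                    sigaddset(&mask, sig);

                    /* flush already-queued instances of the signal */
                    flush_sigqueue_mask(&mask, &current->signal->shared_pending);
                    flush_sigqueue_mask(&mask, &current->pending);
                    recalc_sigpending();
            }
            spin_unlock_irq(&current->sighand->siglock);
    }

    /* disallow_signal() then reduces to a trivial wrapper: */
    void disallow_signal(int sig)
    {
            kernel_sigaction(sig, SIG_IGN);
    }
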
2014-06-06  signals: disallow_signal() should flush the potentially pending signal  [Oleg Nesterov, 1 file, -0/+7]

disallow_signal() simply sets SIG_IGN; this is not enough, and
recalc_sigpending() is simply pointless because it can never change
the state of TIF_SIGPENDING.

If we ignore a signal, we also need to do flush_sigqueue_mask() for
the case when this signal is pending. This way recalc_sigpending() can
actually clear TIF_SIGPENDING and we do not "leak" the allocated
siginfo's.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: kill the obsolete sigdelset() and recalc_sigpending() in allow_signal()  [Oleg Nesterov, 1 file, -4/+1]

allow_signal() does sigdelset(current->blocked) for a historic reason:
previously it could be called by a daemonize()'ed kthread, and
daemonize() played with current->blocked.

Now that daemonize() has gone away we can remove the sigdelset() and
recalc_sigpending(). If a user really wants to unblock a signal, it
must use sigprocmask() or set_current_blocked() explicitly.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]  [Oleg Nesterov, 2 files, -39/+29]

Move the declaration/definition of allow_signal/disallow_signal to
signal.h/signal.c. The new place is more logical and allows us to use
the static helpers in signal.c (see the next changes).

While at it, make them return void and remove the valid_signal()
check. Nobody checks the returned value, and in-kernel users must not
pass the wrong signal number.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: cleanup the usage of t/current in do_sigaction()  [Oleg Nesterov, 1 file, -8/+7]

The usage of "task_struct *t" and "current" in do_sigaction() looks
really annoying and chaotic. Initially "t" is used as a cached value
of current but not consistently, then it is reused as a loop variable
and we have to use "current" again.

Clean up this mess and also convert the code to use for_each_thread().

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: rename rm_from_queue_full() to flush_sigqueue_mask()  [Oleg Nesterov, 1 file, -11/+8]

"rm_from_queue_full" looks ugly and misleading, especially now that
rm_from_queue() has gone away. Rename it to flush_sigqueue_mask();
this matches flush_sigqueue() we already have.

Also remove the obsolete comment which explains the difference with
rm_from_queue() we already killed.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: kill rm_from_queue(), change prepare_signal() to use for_each_thread()  [Oleg Nesterov, 1 file, -33/+10]

rm_from_queue() doesn't make sense. The only caller, prepare_signal(),
can use rm_from_queue_full() with the same effect.

While at it, change prepare_signal() to use for_each_thread() instead
of do/while_each_thread.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  signals: s/siginitset/sigemptyset/ in do_sigtimedwait()  [Oleg Nesterov, 1 file, -1/+1]

Cosmetic, but siginitset(0) looks a bit strange; sigemptyset() is what
do_sigtimedwait() needs.

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Al Viro <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2014-06-06  ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()  [Oleg Nesterov, 1 file, -0/+1]

__wake_up_bit() checks waitqueue_active() and thus the caller needs
mb(), as wake_up_bit() documents; fix task_clear_jobctl_trapping().

Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

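The fix is a single barrier; a sketch of its placement (assumption:
simplified from task_clear_jobctl_trapping() in kernel/signal.c):

    task->jobctl &= ~JOBCTL_TRAPPING;
    smp_mb();       /* order the clear before the waitqueue_active()
                       check inside __wake_up_bit() */
    wake_up_bit(&task->jobctl, JOBCTL_TRAPPING_BIT);
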
2014-06-06  ptrace: fix fork event messages across pid namespaces  [Matthew Dempsky, 1 file, -3/+7]

When tracing a process in another pid namespace, it's important for
fork event messages to contain the child's pid as seen from the
tracer's pid namespace, not the parent's. Otherwise, the tracer won't
be able to correlate the fork event with later SIGTRAP signals it
receives from the child.

We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value. However, sending
a bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.

Signed-off-by: Matthew Dempsky <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Julien Tinnes <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Jan Kratochvil <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

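A sketch of the namespace-aware message computation the fix implies
(assumption: the helper name and details are illustrative, simplified
from the patch):

    static void ptrace_event_pid(int event, struct pid *pid)
    {
            pid_t message = 0;
            struct pid_namespace *ns;

            rcu_read_lock();
            ns = task_active_pid_ns(rcu_dereference(current->parent));
            if (ns)
                    message = pid_nr_ns(pid, ns); /* as the tracer sees it */
            rcu_read_unlock();

            ptrace_event(event, message);
    }
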
2014-06-07  PM / sleep: trace events for suspend/resume  [Todd E Brandt, 4 files, -2/+23]

Adds trace events that give finer resolution into suspend/resume.
These events are graphed in the timelines generated by the
analyze_suspend.py script. They represent large areas of time consumed
that are typical of suspend and resume.

The event is triggered by calling the function "trace_suspend_resume"
with three arguments: a string (the name of the event to be displayed
in the timeline), an integer (a case-specific number, such as the
power state or cpu number), and a boolean (where true denotes the
start of the timeline event, and false denotes the end).

The suspend_resume trace event reproduces the data that the
machine_suspend trace event did, so the latter has been removed.

Signed-off-by: Todd Brandt <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

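A usage sketch matching the description above (assumption: the event
string and the call site are illustrative):

    trace_suspend_resume(TPS("machine_suspend"), state, true);  /* start */
    /* ... the region being timed ... */
    trace_suspend_resume(TPS("machine_suspend"), state, false); /* end */
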
2014-06-07  Merge branch 'acpi-pm' into pm-sleep  [Rafael J. Wysocki, 15 files, -44/+74]

2014-06-06  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 4 files, -12/+20]

Pull scheduler fixes from Ingo Molnar:
 "Four misc fixes: each was deemed serious enough to warrant v3.15
  inclusion"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock
  sched/dl: Fix race in dl_task_timer()
  sched: Fix sched_policy < 0 comparison
  sched/numa: Fix use of spin_{un}lock_irq() when interrupts are disabled

2014-06-06  tracing: Fix leak of ring buffer data when new instances creation fails  [Steven Rostedt (Red Hat), 1 file, -2/+20]

Yoshihiro Yunomae reported that the ring buffer data for a trace
instance does not get properly cleaned up when its creation fails. He
proposed a patch that manually cleaned the data up and added a bunch
of labels. The labels are not needed because the trace array is
allocated with kzalloc(), which initializes it to 0, and kfree() can
take a NULL pointer and will ignore it.

Add a new helper function, free_trace_buffers(), that can also take
NULL buffers, to free the buffers that were allocated by
allocate_trace_buffers().

Link: http://lkml.kernel.org/r/20140605223522.32311.31664.stgit@yunodevel
Reported-by: Yoshihiro YUNOMAE <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-06  tracing/kprobes: Avoid self tests if tracing is disabled on boot up  [Yoshihiro YUNOMAE, 1 file, -0/+3]

If tracing is disabled on boot up, the kernel should not execute the
tracing self tests. The kernel should check whether tracing is
disabled or not before executing any of the tracing self tests.

Link: http://lkml.kernel.org/p/20140605223520.32311.56097.stgit@yunodevel
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-06  tracing: Return error if ftrace_trace_arrays list is empty  [Yoshihiro YUNOMAE, 2 files, -0/+16]

ftrace_trace_arrays links global_trace.list. However, global_trace is
not added to ftrace_trace_arrays if trace_alloc_buffers() failed. As a
result, ftrace_trace_arrays becomes an empty list. If
ftrace_trace_arrays is an empty list, the current top_trace_array()
returns an invalid pointer, and the kernel can run into memory
corruption or a panic.

The current implementation does not check whether ftrace_trace_arrays
is an empty list. So, in this patch, if ftrace_trace_arrays is an
empty list, top_trace_array() returns NULL. Moreover, this patch makes
all functions calling top_trace_array() handle it appropriately.

Link: http://lkml.kernel.org/p/20140605223517.32311.99233.stgit@yunodevel
Signed-off-by: Yoshihiro YUNOMAE <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

2014-06-06  locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks  [Waiman Long, 3 files, -0/+141]

This rwlock uses the arch_spin_lock_t as a waitqueue, and assuming the
arch_spin_lock_t is a fair lock (ticket, MCS, etc.) the resulting
rwlock is a fair lock.

It fits in the same 8 bytes as the regular rwlock_t by folding the
reader and writer count into a single integer, using the remaining 4
bytes for the arch_spinlock_t.

Architectures that can single-copy address bytes can optimize
queue_write_unlock() with a 0 write to the LSB (the write count).

Performance as measured by Davidlohr Bueso (rwlock_t -> qrwlock_t):

    +--------------+-------------+---------------+
    | Workload     | #users      | delta         |
    +--------------+-------------+---------------+
    | alltests     | > 1400      | -4.83%        |
    | custom       | 0-100,> 100 | +1.43%,-1.57% |
    | high_systime | > 1000      | -2.61         |
    | shared       | all         | +0.32         |
    +--------------+-------------+---------------+

    http://www.stgolabs.net/qrwlock-stuff/aim7-results-vs-rwsem_optsin/

Signed-off-by: Waiman Long <[email protected]>
[peterz: near complete rewrite]
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: "Paul E.McKenney" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>

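A sketch of the 8-byte layout and the single-copy-atomic unlock
described above (assumption: field names are illustrative, not the
verbatim asm-generic definitions):

    typedef struct qrwlock {
            atomic_t        cnts;   /* reader count + writer byte, folded */
            arch_spinlock_t lock;   /* fair waitqueue for contenders */
    } arch_rwlock_t;

    static inline void queue_write_unlock(struct qrwlock *lock)
    {
            /* 0 write to the LSB (the write count) */
            smp_store_release((u8 *)&lock->cnts, 0);
    }
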
2014-06-06  perf: Differentiate exec() and non-exec() comm events  [Adrian Hunter, 1 file, -2/+2]

perf tools like 'perf report' can aggregate samples by comm strings,
which generally works. However, there are other potential use-cases.
For example, to pair up 'calls' with 'returns' accurately (from branch
events like Intel BTS) it is necessary to identify whether the process
has exec'd. Although a comm event is generated when an 'exec' happens,
it is also generated whenever the comm string is changed on a whim
(e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
to differentiate one case from the other.

In order to determine whether the kernel supports the new flag, a
selection bit named 'exec' is added to struct perf_event_attr. The bit
does nothing but will cause perf_event_open() to fail if the bit is
set on kernels that do not have it defined.

Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Paul Mackerras <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>

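A sketch of requesting the new distinction from userspace (assumption:
usage illustrative; the changelog calls the selection bit 'exec', so
the exact field name here is an assumption; per the changelog, setting
it makes perf_event_open() fail on kernels that predate it):

    struct perf_event_attr attr = {
            .type      = PERF_TYPE_SOFTWARE,
            .config    = PERF_COUNT_SW_DUMMY,
            .size      = sizeof(attr),
            .comm      = 1, /* record comm events */
            .comm_exec = 1, /* flag comm events caused by exec() */
    };
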
2014-06-06  Merge branch 'perf/urgent' into perf/core, to resolve conflict and to prepare for new patches  [Ingo Molnar, 23 files, -250/+377]

Conflicts:
    arch/x86/kernel/traps.c

Signed-off-by: Ingo Molnar <[email protected]>