path: root/kernel
Age · Commit message · Author · Files · Lines
2013-07-24 · tracing: Kill trace_cpu struct/members · Oleg Nesterov · 2 · -29/+0
After the previous changes, trace_array_cpu->trace_cpu and trace_array->trace_cpu become write-only. Remove these members and kill "struct trace_cpu" as well.

As a side effect this also removes the memset(per_cpu_memory, 0). It was not needed; alloc_percpu() returns zero-filled memory.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Change tracing_fops/snapshot_fops to rely on tracing_get_cpu() · Oleg Nesterov · 1 · -28/+22
tracing_open() and tracing_snapshot_open() are racy: the memory inode->i_private points to may already be freed.

Convert these last users of "inode->i_private == trace_cpu" to use "i_private = trace_array" and rely on tracing_get_cpu().

v2: incorporate the fix from Steven; tracing_release() must not blindly dereference file->private_data unless we know that the file was opened for reading.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Change tracing_entries_fops to rely on tracing_get_cpu() · Oleg Nesterov · 1 · -37/+12
tracing_open_generic_tc() is racy: the memory inode->i_private points to may already be freed.

1. Change its last user, tracing_entries_fops, to use tracing_*_generic_tr() instead.
2. Change debugfs_create_file("buffer_size_kb", data) callers to pass "data = tr".
3. Change tracing_entries_read() and tracing_entries_write() to use tracing_get_cpu().
4. Kill the no longer used tracing_open_generic_tc() and tracing_release_generic_tc().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Change tracing_stats_fops to rely on tracing_get_cpu() · Oleg Nesterov · 1 · -7/+6
tracing_open_generic_tc() is racy: the memory inode->i_private points to may already be freed.

1. Change one of its users, tracing_stats_fops, to use tracing_*_generic_tr() instead.
2. Change trace_create_cpu_file("stats", data) to pass "data = tr".
3. Change tracing_stats_read() to use tracing_get_cpu().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Change tracing_buffers_fops to rely on tracing_get_cpu() · Oleg Nesterov · 1 · -5/+4
tracing_buffers_open() is racy: the memory inode->i_private points to may already be freed.

Change the debugfs_create_file("trace_pipe_raw", data) caller to pass "data = tr"; tracing_buffers_open() can use tracing_get_cpu().

Change the debugfs_create_file("snapshot_raw_fops", data) caller too; this file uses tracing_buffers_open/release.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Change tracing_pipe_fops() to rely on tracing_get_cpu() · Oleg Nesterov · 1 · -9/+7
tracing_open_pipe() is racy: the memory inode->i_private points to may already be freed.

Change debugfs_create_file("trace_pipe", data) callers to pass "data = tr"; tracing_open_pipe() can use tracing_get_cpu().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-24 · tracing: Introduce trace_create_cpu_file() and tracing_get_cpu() · Oleg Nesterov · 1 · -14/+36
Every "file_operations" used by tracing_init_debugfs_percpu() is buggy. f_op->open/etc does:

1. struct trace_cpu *tc = inode->i_private; struct trace_array *tr = tc->tr;
2. trace_array_get(tr) or fail;
3. do_something(tc);

But tc (and tr) can be already freed before trace_array_get() is called. And it doesn't matter whether this file is per-cpu or it was created by init_tracer_debugfs(); free_percpu() or kfree() are equally bad.

Note that even step 1 is not safe: the freed memory can be unmapped. But even if it were safe, trace_array_get() can wrongly succeed if we also race with the next new_instance_create(), which can re-allocate the same tr, or if tc was overwritten and ->tr points to a valid tr. In this case step 3 uses the freed/reused memory.

Add a new trivial helper, trace_create_cpu_file(), which simply calls trace_create_file() and encodes "cpu" in "struct inode". Another helper, tracing_get_cpu(), will be used to read cpu_nr-or-RING_BUFFER_ALL_CPUS.

The patch abuses ->i_cdev to encode the number; it is never used unless the file is S_ISCHR(). But we could use something else, say, i_bytes or even ->d_fsdata. In any case this hack is hidden inside these 2 helpers, and it would be trivial to change them if needed.

This patch only changes tracing_init_debugfs_percpu() to use the new trace_create_cpu_file(); the next patches will change the file_operations.

Note: tracing_get_cpu(inode) is always safe, but you can't trust the result unless trace_array_get() was called; without trace_types_lock, which acts as a barrier, it can wrongly return RING_BUFFER_ALL_CPUS.

Link: http://lkml.kernel.org/r/[email protected]
Cc: Al Viro <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
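A minimal sketch of the two helpers as described above (assumed shapes; the exact code in kernel/trace/trace.c may differ in detail):

    static int tracing_get_cpu(struct inode *inode)
    {
        if (inode->i_cdev) /* see trace_create_cpu_file() */
            return (long)inode->i_cdev - 1;
        return RING_BUFFER_ALL_CPUS;
    }

    static struct dentry *
    trace_create_cpu_file(const char *name, umode_t mode, struct dentry *parent,
                          void *data, long cpu, const struct file_operations *fops)
    {
        struct dentry *ret = trace_create_file(name, mode, parent, data, fops);

        /* Store cpu + 1 so that cpu 0 is distinguishable from "not set". */
        if (ret)
            ret->d_inode->i_cdev = (void *)(cpu + 1);
        return ret;
    }

Encoding cpu + 1 also keeps ->i_cdev NULL for non-per-cpu files, so tracing_get_cpu() naturally falls back to RING_BUFFER_ALL_CPUS for them.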
2013-07-23 · Merge branch 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · Linus Torvalds · 1 · -12/+19
Pull cgroup changes from Tejun Heo:
 "This contains two patches, both of which aren't fixes per se but I think it'd be better to fast-track them.

  One removes bcache_subsys_id which was added without proper review through the block tree. Fortunately, bcache cgroup code is unconditionally disabled, so this was never exposed to userland. The cgroup subsys_id is removed. Kent will remove the affected (disabled) code through the bcache branch.

  The other simplifies task_group_path_from_hierarchy(). The function doesn't currently have in-kernel users, but there are external code and development going on dependent on the function, and making the function available for 3.11 would make things go smoother"

* 'for-3.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path()
  cgroup: remove bcache_subsys_id which got added stealthily
2013-07-23 · Fix __wait_on_atomic_t() to call the action func if the counter != 0 · David Howells · 1 · -1/+2
Fix __wait_on_atomic_t() so that it calls the action func if the counter != 0 rather than if the counter is 0, so as to be analogous to __wait_on_bit().

Thanks to Yacine, who found this by visual inspection.

This will affect FS-Cache in that it could fail to sleep correctly when trying to clean up after a netfs cookie is withdrawn.

Reported-by: Yacine Belkadi <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: Milosz Tanski <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
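The change amounts to inverting the test around the action call; a sketch of the fixed loop, assuming the wait.c structures of that era:

    static int __sched
    __wait_on_atomic_t(wait_queue_head_t *wq, struct wait_bit_queue *q,
                       int (*action)(atomic_t *), unsigned mode)
    {
        atomic_t *val;
        int ret = 0;

        do {
            prepare_to_wait(wq, &q->wait, mode);
            val = q->key.flags;
            if (atomic_read(val) == 0)
                break;              /* done: counter reached zero */
            ret = (*action)(val);   /* now called while counter != 0 */
        } while (!ret && atomic_read(val) != 0);
        finish_wait(wq, &q->wait);
        return ret;
    }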
2013-07-23 · sched: Micro-optimize the smart wake-affine logic · Peter Zijlstra · 3 · -2/+8
Smart wake-affine currently uses the node size as the factor, but the overhead of the mask operation is high.

Thus, this patch introduces the 'sd_llc_size' percpu variable, which records the size of the highest cache-sharing domain, and makes that the new factor, in order to reduce the overhead and make it more reasonable.

Tested-by: Davidlohr Bueso <[email protected]>
Tested-by: Michael Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Michael Wang <[email protected]>
Cc: Mike Galbraith <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
2013-07-23 · sched: Implement smarter wake-affine logic · Michael Wang · 1 · -0/+47
The wake-affine scheduler feature currently always tries to pull the wakee close to the waker. In theory this should be beneficial if the waker's CPU caches hot data for the wakee, and it's also beneficial in the extreme ping-pong, high context-switch-rate case. Testing shows it can benefit hackbench up to 15%.

However, the feature is somewhat blind, and some workloads, such as pgbench, suffer from it. It's also algorithmically time-consuming. Testing shows it can damage pgbench by up to 50% - far more than the benefit it brings in the best case.

So wake-affine should be smarter, and it should realize when to stop its thankless effort at trying to find a suitable CPU to wake on.

This patch introduces 'wakee_flips', which is increased each time the task flips (switches) its wakee target. A high 'wakee_flips' value means the task has more than one wakee, and the bigger the number, the higher the wakeup frequency.

Now, when making the decision on whether to pull or not, pay attention to a wakee with a high 'wakee_flips': pulling such a task may benefit the wakee, but it also implies that the waker will face competition later; how cruel or brief that competition is depends on the story behind 'wakee_flips', and the waker suffers either way. Furthermore, if the waker also has a high 'wakee_flips', that implies that multiple tasks rely on it; the waker's higher latency will then damage all of them, so pulling the wakee seems to be a bad deal.

Thus, as 'waker->wakee_flips / wakee->wakee_flips' becomes higher and higher, the cost of pulling looks worse and worse. The patch therefore helps the wake-affine feature stop its pulling work when:

    wakee->wakee_flips > factor &&
    waker->wakee_flips > (factor * wakee->wakee_flips)

The 'factor' here is the number of CPUs in the current CPU's NUMA node, so a bigger node will lead to more pulling, since the trial becomes more severe.

After applying the patch, pgbench shows up to 40% improvement and no regressions. Tested with a 12-cpu x86 server and tip 3.10.0-rc7. The percentages in the final column highlight the areas with the biggest wins; all other areas improved as well:

    pgbench             base        smart

    | db_size | clients |  tps  |       |  tps  |
    +---------+---------+-------+       +-------+
    | 22 MB   |       1 | 10598 |       | 10796 |
    | 22 MB   |       2 | 21257 |       | 21336 |
    | 22 MB   |       4 | 41386 |       | 41622 |
    | 22 MB   |       8 | 51253 |       | 57932 |
    | 22 MB   |      12 | 48570 |       | 54000 |
    | 22 MB   |      16 | 46748 |       | 55982 | +19.75%
    | 22 MB   |      24 | 44346 |       | 55847 | +25.93%
    | 22 MB   |      32 | 43460 |       | 54614 | +25.66%
    | 7484 MB |       1 |  8951 |       |  9193 |
    | 7484 MB |       2 | 19233 |       | 19240 |
    | 7484 MB |       4 | 37239 |       | 37302 |
    | 7484 MB |       8 | 46087 |       | 50018 |
    | 7484 MB |      12 | 42054 |       | 48763 |
    | 7484 MB |      16 | 40765 |       | 51633 | +26.66%
    | 7484 MB |      24 | 37651 |       | 52377 | +39.11%
    | 7484 MB |      32 | 37056 |       | 51108 | +37.92%
    | 15 GB   |       1 |  8845 |       |  9104 |
    | 15 GB   |       2 | 19094 |       | 19162 |
    | 15 GB   |       4 | 36979 |       | 36983 |
    | 15 GB   |       8 | 46087 |       | 49977 |
    | 15 GB   |      12 | 41901 |       | 48591 |
    | 15 GB   |      16 | 40147 |       | 50651 | +26.16%
    | 15 GB   |      24 | 37250 |       | 52365 | +40.58%
    | 15 GB   |      32 | 36470 |       | 50015 | +37.14%

Signed-off-by: Michael Wang <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
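The veto described above boils down to a small predicate; a sketch, assuming a nr_cpus_node()-style factor (the follow-up micro-optimization patch replaces it with sd_llc_size):

    static int wake_wide(struct task_struct *p)
    {
        /* factor: CPUs in the current NUMA node */
        int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));

        /*
         * A waker flipping between many wakees, each with far fewer
         * flips of its own, is better left where it is: don't pull.
         */
        if (p->wakee_flips > factor &&
            current->wakee_flips > (factor * p->wakee_flips))
            return 1;
        return 0;
    }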
2013-07-23 · sched: Move h_load calculation to task_h_load() · Vladimir Davydov · 2 · -35/+30
The bad thing about update_h_load(), which computes the hierarchical load factor for task groups, is that it is called for each task group in the system before every load-balancer run, and since rebalance can be triggered very often, this function can eat a lot of CPU time if there are many cpu cgroups in the system.

Although the situation was improved significantly by commit a35b646 ('sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies'), the problem can still arise under some kinds of loads, e.g. when CPUs are switching from idle to busy and back very frequently.

For instance, when I start 1000 processes that wake up every millisecond on my 8-cpu host, 'top' and 'perf top' show:

    Cpu(s): 17.8%us, 24.3%sy, 0.0%ni, 57.9%id, 0.0%wa, 0.0%hi, 0.0%si
    Events: 243K cycles
      7.57%  [kernel]      [k] __schedule
      7.08%  [kernel]      [k] timerqueue_add
      6.13%  libc-2.12.so  [.] usleep

Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu usage increases significantly although the 'wakers' are still executing in the root cpu cgroup:

    Cpu(s): 19.1%us, 48.7%sy, 0.0%ni, 31.6%id, 0.0%wa, 0.0%hi, 0.7%si
    Events: 230K cycles
     24.56%  [kernel]      [k] tg_load_down
      5.76%  [kernel]      [k] __schedule

This happens because this particular kind of load triggers 'new idle' rebalance very frequently, which requires calling update_h_load(), which, in turn, calls tg_load_down() for every *idle* cpu cgroup even though it is absolutely useless, because idle cpu cgroups have no tasks to pull.

This patch tries to improve the situation by making h_load calculation proceed only when h_load is really necessary. To achieve this, it replaces update_h_load() with update_cfs_rq_h_load(), which computes h_load only for a given cfs_rq and all its ancestors, and makes the load balancer call this function whenever it considers whether a task should be pulled, i.e. it moves h_load calculations directly to task_h_load(). For the h_load of the same cfs_rq not to be updated multiple times (in case several tasks in the same cgroup are considered during the same balance run), the patch keeps the time of the last h_load update for each cfs_rq and stops the calculation when it finds h_load to be up to date.

The benefit is that h_load is computed only for those cfs_rq's which really need it; in particular, all idle task groups are skipped. Although this, in fact, moves the h_load calculation under the rq lock, it should not affect latency much, because the amount of work done under the rq lock while trying to pull tasks is limited by sched_nr_migrate.

After the patch is applied with the setup described above (1000 wakers in the root cgroup and 10000 idle cgroups), I get:

    Cpu(s): 16.9%us, 24.8%sy, 0.0%ni, 58.4%id, 0.0%wa, 0.0%hi, 0.0%si
    Events: 242K cycles
      7.57%  [kernel]      [k] __schedule
      6.70%  [kernel]      [k] timerqueue_add
      5.93%  libc-2.12.so  [.] usleep

Signed-off-by: Vladimir Davydov <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2013-07-23 · perf: Update perf_event_type documentation · Peter Zijlstra · 1 · -15/+16
Due to a discussion with Adrian I had a good look at the perf_event_type record layout and found the documentation to be somewhat unclear.

Cc: Adrian Hunter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2013-07-23 · mutex: Do not unnecessarily deal with waiters · Davidlohr Bueso · 1 · -23/+18
Upon entering the slowpath, we immediately attempt to acquire the lock by checking if it is already unlocked. If we are lucky enough that this is the case, then we don't need to deal with any waiter-related logic.

Furthermore, any checks for an empty wait_list are unnecessary, as we already know that count is non-negative and hence no one is waiting for the lock.

Move the count check and xchg calls to be done before any waiters are set up - including waiter debugging. Upon failure to acquire the lock, the xchg sets the counter to 0 instead of -1 as it did originally. This can be done here since we set it back to -1 right at the beginning of the loop, so other waiters are woken up when the lock is released.

When tested on an 8-socket (80 core) system against a vanilla 3.10-rc1 kernel, this patch provides some small performance benefits (+2-6%). While that could be considered in the noise level, the average percentages were stable across multiple runs and no performance regressions were seen. The two big winners, for small numbers of users (10-100), were the short and compute workloads, with +19.36% and +15.76% in jobs per minute.

Also change some break statements to 'goto slowpath', which IMO makes it a little more intuitive to read.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Maarten Lankhorst <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
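The early trylock reads roughly like this (a sketch of the new hot path in __mutex_lock_common(); 'skip_wait' is assumed to be the label that bypasses the waiter setup):

    /*
     * Once more, try to acquire the lock.  Only try-lock the mutex
     * if it is unlocked, to avoid unnecessary xchg operations.
     */
    if (!mutex_is_locked(lock) && (atomic_xchg(&lock->count, 0) == 1))
        goto skip_wait;    /* got it: no waiter bookkeeping needed */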
2013-07-23 · Merge tag 'v3.11-rc2' into core/locking · Ingo Molnar · 41 · -452/+856
Merge in Linux 3.11-rc2 before moving on with new work.

Signed-off-by: Ingo Molnar <[email protected]>
2013-07-23 · kprobes/x86: Call out into INT3 handler directly instead of using notifier · Jiri Kosina · 1 · -1/+1
In fd4363fff3d96 ("x86: Introduce int3 (breakpoint)-based instruction patching"), the mechanism that was introduced for notifying the alternatives code from the int3 exception handler that an exception occurred was die_notifier.

This is however problematic, as early code might be using jump labels even before the notifier registration has been performed, which will then lead to an oops due to an unhandled exception. One such occurrence has been encountered by Fengguang:

  int3: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
  Modules linked in:
  CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.11.0-rc1-01429-g04bf576 #8
  task: ffff88000da1b040 ti: ffff88000da1c000 task.ti: ffff88000da1c000
  RIP: 0010:[<ffffffff811098cc>]  [<ffffffff811098cc>] ttwu_do_wakeup+0x28/0x225
  RSP: 0000:ffff88000dd03f10  EFLAGS: 00000006
  RAX: 0000000000000000 RBX: ffff88000dd12940 RCX: ffffffff81769c40
  RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000001
  RBP: ffff88000dd03f28 R08: ffffffff8176a8c0 R09: 0000000000000002
  R10: ffffffff810ff484 R11: ffff88000dd129e8 R12: ffff88000dbc90c0
  R13: ffff88000dbc90c0 R14: ffff88000da1dfd8 R15: ffff88000da1dfd8
  FS:  0000000000000000(0000) GS:ffff88000dd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  CR2: 00000000ffffffff CR3: 0000000001c88000 CR4: 00000000000006e0
  Stack:
   ffff88000dd12940 ffff88000dbc90c0 ffff88000da1dfd8 ffff88000dd03f48
   ffffffff81109e2b ffff88000dd12940 0000000000000000 ffff88000dd03f68
   ffffffff81109e9e 0000000000000000 0000000000012940 ffff88000dd03f98
  Call Trace:
   <IRQ>
   [<ffffffff81109e2b>] ttwu_do_activate.constprop.56+0x6d/0x79
   [<ffffffff81109e9e>] sched_ttwu_pending+0x67/0x84
   [<ffffffff8110c845>] scheduler_ipi+0x15a/0x2b0
   [<ffffffff8104dfb4>] smp_reschedule_interrupt+0x38/0x41
   [<ffffffff8173bf5d>] reschedule_interrupt+0x6d/0x80
   <EOI>
   [<ffffffff810ff484>] ? __atomic_notifier_call_chain+0x5/0xc1
   [<ffffffff8105cc30>] ? native_safe_halt+0xd/0x16
   [<ffffffff81015f10>] default_idle+0x147/0x282
   [<ffffffff81017026>] arch_cpu_idle+0x3d/0x5d
   [<ffffffff81127d6a>] cpu_idle_loop+0x46d/0x5db
   [<ffffffff81127f5c>] cpu_startup_entry+0x84/0x84
   [<ffffffff8104f4f8>] start_secondary+0x3c8/0x3d5
  [...]

Fix this by directly calling poke_int3_handler() from the int3 exception handler (analogously to what ftrace has been doing already), instead of relying on the notifier, whose registration might not have been finalized by the time of the first trap.

Reported-and-tested-by: Fengguang Wu <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Acked-by: Masami Hiramatsu <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Fengguang Wu <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2013-07-22 · Merge tag 'trace-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · Linus Torvalds · 13 · -151/+166
Pull tracing fixes and cleanups from Steven Rostedt:
 "This contains fixes, optimizations and some clean ups.

  Some of the fixes need to go back to 3.10. They are minor, and deal mostly with incorrect ref counting in accessing event files.

  There were a couple of optimizations that should have perf perform a bit better when accessing trace events.

  And some various clean ups. Some of the clean ups are necessary to help in a fix to a theoretical race between opening an event file and deleting that event"

* tag 'trace-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Kill the unbalanced tr->ref++ in tracing_buffers_open()
  tracing: Kill trace_array->waiter
  tracing: Do not (ab)use trace_seq in event_id_read()
  tracing: Simplify the iteration logic in f_start/f_next
  tracing: Add ref_data to function and fgraph tracer structs
  tracing: Miscellaneous fixes for trace_array ref counting
  tracing: Fix error handling to ensure instances can always be removed
  tracing/kprobe: Wait for disabling all running kprobe handlers
  tracing/perf: Move the PERF_MAX_TRACE_SIZE check into perf_trace_buf_prepare()
  tracing/syscall: Avoid perf_trace_buf_*() if sys_data->perf_events is empty
  tracing/function: Avoid perf_trace_buf_*() if event_function.perf_events is empty
  tracing: Typo fix on ring buffer comments
  tracing: Use trace_seq_puts()/trace_seq_putc() where possible
  tracing: Use correct config guard CONFIG_STACK_TRACER
2013-07-22 · sched_clock: Fix integer overflow · Baruch Siach · 1 · -1/+1
The expression '(1 << 32)' happens to evaluate as 0 on ARM, but it evaluates as 1 on xtensa and x86_64. This zeros sched_clock_mask and breaks sched_clock(). Set the type of the 1 to 'unsigned long long' to get the value we need.

Reported-by: Max Filippov <[email protected]>
Tested-by: Max Filippov <[email protected]>
Acked-by: Russell King <[email protected]>
Signed-off-by: Baruch Siach <[email protected]>
Signed-off-by: John Stultz <[email protected]>
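Illustratively, the one-character nature of the fix (the surrounding assignment is a sketch, not the literal kernel line):

    /* before: shifting a 32-bit int by 32 is undefined; where it
     * evaluates to 1 the mask becomes 0 and sched_clock() breaks */
    sched_clock_mask = (1 << 32) - 1;

    /* after: 64-bit arithmetic; the mask is 0xffffffff as intended */
    sched_clock_mask = (1ULL << 32) - 1;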
2013-07-22 · mutex: Fix/document access-once assumption in mutex_can_spin_on_owner() · Peter Zijlstra · 1 · -2/+4
mutex_can_spin_on_owner() is technically broken in that it would in theory allow the compiler to load lock->owner twice, seeing a pointer the first time and a NULL pointer the second time.

Linus pointed out that a compiler has to be seriously broken to not compile this correctly - but nevertheless this change is correct, as it will better document the implementation.

Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Acked-by: Waiman Long <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Thomas Gleixner <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: David Howells <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
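A sketch of the documented form, assuming the 3.11-era mutex code:

    static inline int mutex_can_spin_on_owner(struct mutex *lock)
    {
        struct task_struct *owner;
        int retval = 1;

        rcu_read_lock();
        owner = ACCESS_ONCE(lock->owner);  /* load ->owner exactly once */
        if (owner)
            retval = owner->on_cpu;
        rcu_read_unlock();
        return retval;
    }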
2013-07-22 · sched/fair: Cleanup: remove duplicate variable declaration · Kirill Tkhai · 1 · -5/+3
cfs_rq is declared twice; fix it. Also use 'se' instead of '&p->se'.

Signed-off-by: Kirill Tkhai <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
2013-07-22 · Merge tag 'v3.11-rc2' into sched/core · Ingo Molnar · 27 · -135/+149
Merge in Linux 3.11-rc2, to provide a post-merge-window development base.

Signed-off-by: Ingo Molnar <[email protected]>
2013-07-19 · tracing: Kill the unbalanced tr->ref++ in tracing_buffers_open() · Oleg Nesterov · 1 · -2/+0
tracing_buffers_open() does trace_array_get() and then wrongly increments tr->ref again under trace_types_lock. This means that every caller leaks the trace_array:

  # cd /sys/kernel/debug/tracing/
  # mkdir instances/X
  # true < instances/X/per_cpu/cpu0/trace_pipe_raw
  # rmdir instances/X
  rmdir: failed to remove `instances/X': Device or resource busy

Link: http://lkml.kernel.org/r/[email protected]
Cc: Ingo Molnar <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: [email protected] # 3.10
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-19 · Merge tag 'pm+acpi-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · Linus Torvalds · 1 · -1/+2
Pull power management and ACPI fixes from Rafael Wysocki:
 "These are fixes collected over the last week, most importantly two cpufreq reverts fixing regressions introduced in 3.10, an autosleep fix preventing systems using it from crashing during shutdown, and two ACPI scan fixes related to hotplug.

  Specifics:

   - Two cpufreq commits from the 3.10 cycle introduced regressions. The first of them was buggy (it did way more than it needed to do) and the second one attempted to fix an issue introduced by the first one. Fixes from Srivatsa S Bhat revert both.

   - If autosleep triggers during system shutdown and the shutdown callbacks of some device drivers have been called already, it may crash the system. Fix from Liu Shuo prevents that from happening by making try_to_suspend() check system_state.

   - The ACPI memory hotplug driver doesn't clear its driver_data on errors, which may cause a NULL pointer dereference to happen later. Fix from Toshi Kani.

   - The ACPI namespace scanning code should not try to attach scan handlers to device objects that have them already, which may confuse things quite a bit, and it should rescan the whole namespace branch starting at the given node after receiving a bus check notify event even if the device at that particular node has been discovered already. Fixes from Rafael J Wysocki.

   - New ACPI video blacklist entry for a system whose initial backlight setting from the BIOS doesn't make sense. From Lan Tianyu.

   - Garbage string output avoidance for ACPI PNP from Liu Shuo.

   - Two Kconfig fixes for issues introduced recently in the s3c24xx cpufreq driver (when moving the driver to drivers/cpufreq) from Paul Bolle.

   - Trivial comment fix in pm_wakeup.h from Chanwoo Choi"

* tag 'pm+acpi-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / video: ignore BIOS initial backlight value for Fujitsu E753
  PNP / ACPI: avoid garbage in resource name
  cpufreq: Revert commit 2f7021a8 to fix CPU hotplug regression
  cpufreq: s3c24xx: fix "depends on ARM_S3C24XX" in Kconfig
  cpufreq: s3c24xx: rename CONFIG_CPU_FREQ_S3C24XX_DEBUGFS
  PM / Sleep: Fix comment typo in pm_wakeup.h
  PM / Sleep: avoid 'autosleep' in shutdown progress
  cpufreq: Revert commit a66b2e to fix suspend/resume regression
  ACPI / memhotplug: Fix a stale pointer in error path
  ACPI / scan: Always call acpi_bus_scan() for bus check notifications
  ACPI / scan: Do not try to attach scan handlers to devices having them
2013-07-19 · tracing: Kill trace_array->waiter · Oleg Nesterov · 1 · -1/+0
Trivial. trace_array->waiter has no users since 6eaaa5d5 "tracing/core: use appropriate waiting on trace_pipe".

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-19 · Merge branch 'x86/jumplabel' into perf/core · Ingo Molnar · 1 · -1/+1
Upcoming kprobes patches rely on the int3 code-patching machinery introduced by:

  fd4363fff3d9 x86: Introduce int3 (breakpoint)-based instruction patching

Signed-off-by: Ingo Molnar <[email protected]>
2013-07-19 · Merge branch 'linus' into perf/core · Ingo Molnar · 76 · -2689/+4056
Merge in a v3.11-rc1-ish branch to go from v3.10 based development to a v3.11 based one.

Signed-off-by: Ingo Molnar <[email protected]>
2013-07-18 · tracing: Do not (ab)use trace_seq in event_id_read() · Oleg Nesterov · 1 · -13/+4
event_id_read() has no reason to kmalloc a "struct trace_seq" (more than PAGE_SIZE!); it can use a small buffer instead.

Note: "if (*ppos) return 0" looks strange and even wrong; simple_read_from_buffer() handles the ppos != 0 case correctly.

And it seems that almost every user of trace_seq in this file should be converted too. Unless you use seq_open(), trace_seq buys nothing compared to a raw buffer, but it needs a bit more memory and code.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
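The converted read handler, as a sketch (field names assumed per the 3.11-era trace_events.c):

    static ssize_t
    event_id_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos)
    {
        struct ftrace_event_call *call = filp->private_data;
        char buf[32];  /* plenty for an int id plus "\n" */
        int len = sprintf(buf, "%d\n", call->event.type);

        return simple_read_from_buffer(ubuf, cnt, ppos, buf, len);
    }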
2013-07-18 · tracing: Simplify the iteration logic in f_start/f_next · Oleg Nesterov · 1 · -38/+22
f_next() looks overcomplicated, and it is not strictly correct even if this doesn't matter. Say, FORMAT_FIELD_SEPERATOR should not return NULL (meaning EOF) if trace_get_fields() returns an empty list; we should simply advance to FORMAT_PRINTFMT as we do when we find the end of the list.

1. Change f_next() to return "struct list_head *" rather than "ftrace_event_field *", and change f_show() to do list_entry(). This simplifies the code a bit; only f_show() needs to know about ftrace_event_field, and f_next() can play with ->prev directly.

2. Change f_next() to not play with ->prev / return inside the switch() statement. It can simply set node = head/common_head; the prev-or-advance-to-the-next magic below does all the work.

While at it: f_start() looks overcomplicated too. I don't think *pos == 0 makes sense as a separate case; just change this code to do "while" instead of "do/while".

The patch also moves f_start() down, close to f_stop(). This is purely cosmetic, just to make the locking added by the next patch more clear/visible.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing: Add ref_data to function and fgraph tracer structs · Steven Rostedt (Red Hat) · 3 · -2/+11
The selftests for the function and function graph tracers are defined as __init, as they are only executed at boot up. The "tracer" structs that are associated with those tracers are not set up as __init, as they are used after boot. To stop mismatch warnings, those structures need to be annotated with __ref_data.

Currently, the tracer structures are defined as __read_mostly, as they do not really change. But in the future they should be converted to consts; that will take a little work because they have a "next" pointer that gets updated when they are registered. That will have to wait till the next major release.

Link: http://lkml.kernel.org/r/[email protected]
Reported-by: kbuild test robot <[email protected]>
Reported-by: Chen Gang <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing: Miscellaneous fixes for trace_array ref counting · Alexander Z Lam · 2 · -8/+37
Some error paths did not handle ref counting properly, and some trace files need ref counting.

Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected] # 3.10
Cc: Vaibhav Nagarnaik <[email protected]>
Cc: David Sharp <[email protected]>
Cc: Alexander Z Lam <[email protected]>
Signed-off-by: Alexander Z Lam <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing: Fix error handling to ensure instances can always be removed · Alexander Z Lam · 1 · -1/+3
Remove debugfs directories for tracing instances during creation if an error occurs that causes the trace_array for that instance not to be added to ftrace_trace_arrays. If the directory continued to exist after the error, it could not be removed, because the respective trace_array would not be in ftrace_trace_arrays.

Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected] # 3.10
Cc: Vaibhav Nagarnaik <[email protected]>
Cc: David Sharp <[email protected]>
Cc: Alexander Z Lam <[email protected]>
Signed-off-by: Alexander Z Lam <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing/kprobe: Wait for disabling all running kprobe handlers · Masami Hiramatsu · 1 · -6/+17
Wait for all running kprobe handlers to finish when a kprobe event is disabled, since the caller, trace_remove_event_call(), assumes that the event being removed is completely disabled by disabling the event. With this change, ftrace can ensure that there are no running event handlers after disabling it.

Link: http://lkml.kernel.org/r/20130709093526.20138.93100.stgit@mhiramat-M0-7522
Signed-off-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing/perf: Move the PERF_MAX_TRACE_SIZE check into perf_trace_buf_prepare() · Oleg Nesterov · 4 · -20/+4
Every perf_trace_buf_prepare() caller does WARN_ONCE(size > PERF_MAX_TRACE_SIZE, message), and the "message" is almost the same everywhere. Shift this WARN_ONCE() into perf_trace_buf_prepare(). This changes the meaning of _ONCE, but I think this is fine.

  - 4947014 2932448 10104832 17984294 1126b26 vmlinux
  + 4948422 2932448 10104832 17985702 11270a6 vmlinux

on my build.

Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
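The hoisted check at the top of perf_trace_buf_prepare(), sketched (the exact message wording is an assumption):

    if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
                  "perf buffer not large enough"))
        return NULL;    /* callers no longer need their own WARN_ONCE() */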
2013-07-18 · tracing/syscall: Avoid perf_trace_buf_*() if sys_data->perf_events is empty · Oleg Nesterov · 1 · -4/+8
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL) make no sense if hlist_empty(head). Change perf_syscall_enter/exit() to check sys_data->{enter,exit}_event->perf_events beforehand.

Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing/function: Avoid perf_trace_buf_*() if event_function.perf_events is empty · Oleg Nesterov · 1 · -2/+4
perf_trace_buf_prepare() + perf_trace_buf_submit(head, task => NULL) make no sense if hlist_empty(head). Change perf_ftrace_function_call() to check event_function.perf_events beforehand.

Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing: Typo fix on ring buffer comments · zhangwei(Jovi) · 1 · -7/+9
There are some mismatches between the comments and the real function names; update them. This patch also adds some missing function-argument descriptions.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhangwei(Jovi) <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-18 · tracing: Use trace_seq_puts()/trace_seq_putc() where possible · zhangwei(Jovi) · 6 · -45/+45
For strings without format specifiers, use trace_seq_puts() or trace_seq_putc().

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhangwei(Jovi) <[email protected]>
[ fixed a trace_seq_putc(s, " ") to trace_seq_putc(s, ' ') ]
Signed-off-by: Steven Rostedt <[email protected]>
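Typical conversions of the kind this patch makes (illustrative examples, not literal hunks):

    - trace_seq_printf(s, " -> ");
    + trace_seq_puts(s, " -> ");    /* no format parsing needed */

    - trace_seq_printf(s, "\n");
    + trace_seq_putc(s, '\n');      /* single character: putc() is cheaper still */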
2013-07-18 · Merge tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core · Linus Torvalds · 1 · -2/+0
Pull driver core patches from Greg KH:
 "Here are some driver core patches for 3.11-rc2. They aren't really bugfixes, but a bunch of new helper macros for drivers to properly create attribute groups, which drivers and subsystems need to fix up a ton of race issues with incorrectly creating sysfs files (binary and normal) after userspace has been told that the device is present.

  Also here is the ability to create binary files as attribute groups, to solve that race condition, which was impossible to do before this, so that's my fault the drivers were broken.

  The majority of the .c changes is indenting and moving code around a bit. It affects no existing code, but allows the large backlog of 70+ patches that I already have created to start flowing into the different subtrees, instead of having to live in my driver-core tree, causing merge nightmares in linux-next for the next few months.

  These were finalized too late for the -rc1 merge window, which is why they didn't make that pull request; testing and review from others didn't happen until a few weeks ago, and then there was the whole distraction of the past few days, which prevented these from getting to you sooner, sorry about that.

  Oh, and there's a bugfix for the documentation build warning in here as well. All of these have been in linux-next this week, with no reported problems"

* tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver-core: fix new kernel-doc warning in base/platform.c
  sysfs: use file mode defines from stat.h
  sysfs: add more helper macro's for (bin_)attribute(_groups)
  driver core: add default groups to struct class
  driver core: Introduce device_create_groups
  sysfs: prevent warning when only using binary attributes
  sysfs: add support for binary attributes in groups
  driver core: device.h: add RW and RO attribute macros
  sysfs.h: add BIN_ATTR macro
  sysfs.h: add ATTRIBUTE_GROUPS() macro
  sysfs.h: add __ATTR_RW() macro
2013-07-18 · remove sched notifier for cross-cpu migrations · Marcelo Tosatti · 1 · -15/+0
Linux as a guest on the KVM hypervisor, the only user of the pvclock vsyscall interface, does not require notification on task migration because:

1. The cpu ID number maps 1:1 to per-CPU pvclock time info.
2. The per-CPU pvclock time info is updated if the underlying CPU changes.
3. That version is increased whenever the underlying CPU changes.

This is sufficient to guarantee that the nanoseconds counter is calculated properly.

Signed-off-by: Marcelo Tosatti <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Gleb Natapov <[email protected]>
2013-07-18 · sched: Fix some kernel-doc warnings · Yacine Belkadi · 3 · -25/+70
When building the htmldocs (in verbose mode), scripts/kernel-doc reports the following type of warnings:

  Warning(kernel/sched/core.c:936): No description found for return value of 'task_curr'
  ...

Fix those by:
- adding the missing descriptions
- using "Return" sections for the descriptions

Signed-off-by: Yacine Belkadi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ While at it, fix the cpupri_set() explanation. ]
Signed-off-by: Ingo Molnar <[email protected]>
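The resulting style, sketched for task_curr() to match the warning quoted above (the exact wording in the patch may differ):

    /**
     * task_curr - is this task currently executing on a CPU?
     * @p: the task in question.
     *
     * Return: 1 if the task is currently executing. 0 otherwise.
     */
    inline int task_curr(const struct task_struct *p)
    {
        return cpu_curr(task_cpu(p)) == p;
    }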
2013-07-16 · x86: Introduce int3 (breakpoint)-based instruction patching · Jiri Kosina · 1 · -1/+1
Introduce a method for run-time instruction patching on a live SMP kernel, based on an int3 breakpoint, completely avoiding the need for stop_machine().

The way this is achieved:

  - add an int3 trap to the address that will be patched
  - sync cores
  - update all but the first byte of the patched range
  - sync cores
  - replace the first byte (int3) with the first byte of the replacing opcode
  - sync cores

According to http://lkml.indiana.edu/hypermail/linux/kernel/1001.1/01530.html, synchronization after replacing "all but first" instructions should not be necessary (on Intel hardware), as the syncing after the subsequent patching of the first byte provides enough safety. But there's not only Intel HW out there, and we'd rather be on the safe side.

If any CPU's instruction execution were to collide with the patching, it would be trapped by the int3 breakpoint and redirected to the provided "handler" (which would typically mean just skipping over the patched region, acting as if a "nop" had been there, in case we are doing nop -> jump and jump -> nop transitions).

Ftrace has been using this very technique since 08d636b ("ftrace/x86: Have arch x86_64 use breakpoints instead of stop machine") for ages already, and jump labels are another obvious potential user of this.

Based on activities of Masami Hiramatsu <[email protected]> a few years ago.

Reviewed-by: Steven Rostedt <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
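The sequence above, as a hedged outline in code (the helper names text_poke() and sync_cores() stand in for the real machinery; this is not the literal implementation):

    static void patch_insn_int3(void *addr, const void *opcode, size_t len)
    {
        unsigned char int3 = 0xcc;          /* x86 breakpoint opcode */

        text_poke(addr, &int3, 1);          /* arm the breakpoint */
        sync_cores();
        if (len > 1)                        /* patch all but the first byte */
            text_poke(addr + 1, opcode + 1, len - 1);
        sync_cores();
        text_poke(addr, opcode, 1);         /* replace int3 with the real byte */
        sync_cores();
    }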
2013-07-16 · sysfs.h: add __ATTR_RW() macro · Greg Kroah-Hartman · 1 · -2/+0
A number of parts of the kernel created their own version of this; might as well have the sysfs core provide it instead.

Reviewed-by: Guenter Roeck <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
2013-07-16 · cgroup: remove gratuituous BUG_ON()s from rebind_subsystems() · Tejun Heo · 1 · -9/+0
rebind_subsystems() performs sanity checks even on subsystems which aren't specified to be added or removed, and the checks aren't all that useful given that they are in a very cold path while the violations they check for would trip up much hotter paths. Let's remove these from rebind_subsystems().

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
2013-07-16 · cgroup: move module ref handling into rebind_subsystems() · Tejun Heo · 1 · -65/+28
Module ref handling in cgroup is rather weird. parse_cgroupfs_options() grabs all the modules for the specified subsystems. A module ref is kept if the specified subsystem is newly bound to the hierarchy. If not, or if the operation fails, the refs are dropped.

This scatters module ref handling across multiple functions, making it difficult to track. It also makes the function nasty to use for dynamic subsystem binding, which is necessary for the planned unified hierarchy.

There's nothing which requires the subsystem modules to be pinned between parse_cgroupfs_options() and rebind_subsystems() in both the mount and remount paths. parse_cgroupfs_options() can just parse, and rebind_subsystems() can handle pinning the subsystems that it wants to bind, which is a natural part of its task - binding - anyway.

Move module ref handling into rebind_subsystems(), which makes the code a lot simpler - modules are gotten iff they're going to be bound and put iff unbound or binding fails.

v2: Li pointed out that if a controller module is unloaded between parsing and binding, rebind_subsystems() won't notice the missing controller, as it only iterates through existing controllers. Fix it by updating rebind_subsystems() to compare @added_mask to @pinned and fail with -ENOENT if they don't match.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
2013-07-15 · tracing: Use correct config guard CONFIG_STACK_TRACER · zhangwei(Jovi) · 1 · -2/+2
We should use CONFIG_STACK_TRACER to guard the readme text of the stack-tracer-related file, not CONFIG_STACKTRACE.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: zhangwei(Jovi) <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
2013-07-14 · kernel: delete __cpuinit usage from all core kernel files · Paul Gortmaker · 15 · -33/+33
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h.

This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer.

[1] https://lkml.org/lkml/2013/5/20/589

Signed-off-by: Paul Gortmaker <[email protected]>
2013-07-14 · rcu: delete __cpuinit usage from all rcu files · Paul Gortmaker · 4 · -11/+11
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h.

This removes all the drivers/rcu uses of the __cpuinit macros from all C files.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: "Paul E. McKenney" <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Dipankar Sarma <[email protected]>
Reviewed-by: Josh Triplett <[email protected]>
Signed-off-by: Paul Gortmaker <[email protected]>
2013-07-15 · PM / Sleep: avoid 'autosleep' in shutdown progress · Liu ShuoX · 1 · -1/+2
Prevent automatic system suspend from happening during system shutdown by making try_to_suspend() check system_state and return immediately if it is not SYSTEM_RUNNING.

This prevents the following breakage from happening (scenario from Zhang Yanmin):

The kernel starts shutdown and calls all device drivers' shutdown callbacks. When a driver's shutdown is called, the last wakelock is released and suspend-to-ram starts. However, as some drivers' shutdown callbacks have already shut down devices and disabled runtime PM, the suspend-to-ram calls a driver's suspend callback without noticing that the device is already off, and causes a crash.

[rjw: Changelog]
Signed-off-by: Liu ShuoX <[email protected]>
Cc: 3.5+ <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
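The guard itself is a two-liner; a sketch of its placement in try_to_suspend() (the surrounding structure is assumed; see kernel/power/autosleep.c for the exact context):

    static void try_to_suspend(struct work_struct *work)
    {
        /* ... wakeup-count checks as before ... */

        /*
         * Drivers' shutdown callbacks may already have run; suspending
         * now would touch devices that are powered off.
         */
        if (system_state != SYSTEM_RUNNING)
            goto out;

        /* ... proceed with the suspend attempt as before ... */
    out:
        queue_up_suspend_work();    /* re-arm for the next attempt */
    }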
2013-07-14 · Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · Linus Torvalds · 1 · -10/+1
Pull more vfs stuff from Al Viro:
 "O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including making simple_lookup() usable for filesystems with non-NULL s_d_op, which allows us to get rid of quite a bit of ugliness"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sunrpc: now we can just set ->s_d_op
  cgroup: we can use simple_lookup() now
  efivarfs: we can use simple_lookup() now
  make simple_lookup() usable for filesystems that set ->s_d_op
  configfs: don't open-code d_alloc_name()
  __rpc_lookup_create_exclusive: pass string instead of qstr
  rpc_create_*_dir: don't bother with qstr
  llist: llist_add() can use llist_add_batch()
  llist: fix/simplify llist_add() and llist_add_batch()
  fput: turn "list_head delayed_fput_list" into llist_head
  fs/file_table.c:fput(): add comment
  Safer ABI for O_TMPFILE
2013-07-14 · cgroup: we can use simple_lookup() now · Al Viro · 1 · -10/+1
Signed-off-by: Al Viro <[email protected]>