path: root/kernel
Age | Commit message | Author | Files | Lines
2015-08-03stop_machine: Move 'cpu_stopper_task' and 'stop_cpus_work' into 'struct cpu_stopper'Oleg Nesterov1-8/+9
Multiple DEFINE_PER_CPU's do not make sense, move all the per-cpu variables into 'struct cpu_stopper'. Signed-off-by: Oleg Nesterov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched/preempt: Fix cond_resched_lock() and cond_resched_softirq()Konstantin Khlebnikov1-3/+3
These functions check should_resched() before unlocking the spinlock / re-enabling bottom halves: preempt_count is always non-zero there, so should_resched() always returns false, and cond_resched_lock() only worked when spin_needbreak was set. This patch adds a "preempt_offset" argument to should_resched(), with preempt_count offset constants for each case:
  PREEMPT_DISABLE_OFFSET - offset after preempt_disable()
  PREEMPT_LOCK_OFFSET    - offset after spin_lock()
  SOFTIRQ_DISABLE_OFFSET - offset after local_bh_disable()
  SOFTIRQ_LOCK_OFFSET    - offset after spin_lock_bh()
Signed-off-by: Konstantin Khlebnikov <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Alexander Graf <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: David Vrabel <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: bdb438065890 ("sched: Extract the basic add/sub preempt_count modifiers") Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz Signed-off-by: Ingo Molnar <[email protected]>
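A minimal sketch of how the new offset is meant to be used, loosely based on __cond_resched_lock(); helper names and the exact control flow are taken from memory and may differ from the real kernel code:
	/* Sketch only: with the lock held, preempt_count sits at
	 * PREEMPT_LOCK_OFFSET, so compare against that instead of 0. */
	int __cond_resched_lock(spinlock_t *lock)
	{
		int resched = should_resched(PREEMPT_LOCK_OFFSET);
		int ret = 0;

		if (spin_needbreak(lock) || resched) {
			spin_unlock(lock);
			if (resched)
				preempt_schedule_common();
			else
				cpu_relax();
			ret = 1;
			spin_lock(lock);
		}
		return ret;
	}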
2015-08-03sched/fair: Beef up wake_wide()Mike Galbraith1-34/+33
Josef Bacik reported that Facebook sees better performance with their 1:N load (1 dispatch/node, N workers/node) when carrying an old patch to try very hard to wake to an idle CPU. While looking at wake_wide(), I noticed that it doesn't pay attention to the wakeup of a waker with many partners, returning 1 only when waking one of its many partners. Correct that, letting explicit domain flags override the heuristic. While at it, adjust the task_struct bits; we don't need a 64-bit counter. Tested-by: Josef Bacik <[email protected]> Signed-off-by: Mike Galbraith <[email protected]> [ Tidy things up. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: kernel-team <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched: Introduce the 'trace_sched_waking' tracepointPeter Zijlstra3-5/+9
Mathieu reported that since 317f394160e9 ("sched: Move the second half of ttwu() to the remote cpu") trace_sched_wakeup() can happen out of context of the waker. This is a problem when you want to analyse wakeup paths because it is now very hard to correlate the wakeup event to whoever issued the wakeup. OTOH trace_sched_wakeup() is issued at the point where we set p->state = TASK_RUNNING, which is right where we hand the task off to the scheduler, so this is an important point when looking at scheduling behaviour: up to here it's been the wakeup path, everything hereafter is due to scheduler policy. To bridge this gap, introduce a second tracepoint: trace_sched_waking. It is guaranteed to be called in the waker context. Reported-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Francis Giraldeau <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched/cputime: Guarantee stime + utime == rtimePeter Zijlstra2-44/+64
While the current code guarantees monotonicity for stime and utime independently of one another, it does not guarantee that the sum of both is equal to the total time we started out with. This confuses things (and people) who look at this sum, like top, and will report >100% usage followed by a matching period of 0%. Rework the code to provide both individual monotonicity and a coherent sum. Suggested-by: Fredrik Markstrom <[email protected]> Reported-by: Fredrik Markstrom <[email protected]> Tested-by: Fredrik Markstrom <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
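A hedged sketch of the adjustment idea described above (simplified, ignores overflow handling, and the struct/field names are illustrative assumptions rather than the actual kernel code):
	static void adjust_cputime(struct prev_cputime *prev,
				   u64 stime, u64 utime, u64 rtime,
				   u64 *out_st, u64 *out_ut)
	{
		if (prev->stime + prev->utime >= rtime) {
			/* rtime hasn't advanced past what we already reported */
			stime = prev->stime;
			utime = prev->utime;
		} else {
			if (stime + utime)	/* scale so stime + utime == rtime */
				stime = div64_u64(rtime * stime, stime + utime);
			if (stime < prev->stime)	/* keep stime monotonic */
				stime = prev->stime;
			utime = rtime - stime;
			if (utime < prev->utime) {	/* keep utime monotonic */
				utime = prev->utime;
				stime = rtime - utime;
			}
			prev->stime = stime;
			prev->utime = utime;
		}
		*out_st = stime;
		*out_ut = utime;
	}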
2015-08-03sched, sysctl: Delete an unnecessary check before unregister_sysctl_table()Markus Elfring1-2/+1
The unregister_sysctl_table() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched/deadline: Remove a redundant condition from task_woken_dl()Xunlei Pang1-1/+0
'p' has already been queued at this point, so "!task_running(rq, p)" and "p->nr_cpus_allowed > 1" imply that "has_pushable_dl_tasks(rq)" is true, so it can be removed. Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched/rt: Remove a redundant condition from task_woken_rt()Xunlei Pang1-1/+0
'p' has already been queued at this point, so "!task_running(rq, p)" and "p->nr_cpus_allowed > 1" imply that "has_pushable_tasks(rq)" is true, so it can be removed. Signed-off-by: Xunlei Pang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03sched/fair: Avoid pulling all tasks in idle balancingYuyang Du1-0/+7
In idle balancing where a CPU going idle pulls tasks from another CPU, a livelock may happen if the CPU pulls all tasks from another, makes it idle, and this iterates. So just avoid this. Reported-by: Rabin Vincent <[email protected]> Signed-off-by: Yuyang Du <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Ben Segall <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Morten Rasmussen <[email protected]> Cc: Paul Turner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/static_keys: Add selftestPeter Zijlstra1-1/+38
Add a little selftest that validates all combinations. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/static_keys: Add a new static_key interfacePeter Zijlstra1-7/+30
There are various problems and short-comings with the current static_key interface:
 - static_key_{true,false}() read like a branch depending on the key value, instead of the actual likely/unlikely branch depending on init value.
 - static_key_{true,false}() are, as stated above, tied to the static_key init values STATIC_KEY_INIT_{TRUE,FALSE}.
 - we're limited to the 2 (out of 4) possible options that compile to a default NOP because that's what our arch_static_branch() assembly emits.
So provide a new static_key interface:
  DEFINE_STATIC_KEY_TRUE(name);
  DEFINE_STATIC_KEY_FALSE(name);
Which define a key of different types with an initial true/false value. Then allow:
  static_branch_likely()
  static_branch_unlikely()
to take a key of either type and emit the right instruction for the case. This means adding a second arch_static_branch_jump() assembly helper which emits a JMP per default. In order to determine the right instruction for the right state, encode the branch type in the LSB of jump_entry::key. This is the final step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") ... but it also allows new static key combinations that will give us performance enhancements in the subsequent patches. Tested-by: Rabin Vincent <[email protected]> # arm Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Michael Ellerman <[email protected]> # ppc Acked-by: Heiko Carstens <[email protected]> # s390 Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
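A usage sketch of the interface described above; the key name and the functions it guards are illustrative, and the enable call is assumed from the static_key_{en,dis}able() patch in this series:
	DEFINE_STATIC_KEY_FALSE(my_feature_key);	/* hypothetical key */

	void hot_path(void)
	{
		if (static_branch_unlikely(&my_feature_key))
			do_feature_work();	/* rarely taken, patched branch */
		else
			do_normal_work();
	}

	void enable_feature(void)
	{
		/* assumed helper from the companion patch in this series */
		static_key_enable(&my_feature_key.key);
	}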
2015-08-03locking/static_keys: Rework update logicPeter Zijlstra1-50/+38
Instead of spreading the branch_default logic all over the place, concentrate it into the one jump_label_type() function. This does mean we need to actually increment/decrement the enabled count _before_ calling the update path, otherwise jump_label_type() will not see the right state. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/static_keys: Add static_key_{en,dis}able() helpersPeter Zijlstra1-4/+2
Add two helpers to make it easier to treat the refcount as boolean. Suggested-by: Jason Baron <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
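A short sketch of the intent (treating the refcount as a boolean); the key name is illustrative:
	struct static_key some_key = STATIC_KEY_INIT_FALSE;

	static_key_enable(&some_key);	/* branches now take the true path */
	static_key_disable(&some_key);	/* back to the false path */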
2015-08-03jump_label: Add jump_entry_key() helperPeter Zijlstra1-4/+9
Avoid some casting with a helper, also prepares the way for overloading the LSB of jump_entry::key. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03jump_label, locking/static_keys: Rename JUMP_LABEL_TYPE_* and related helpers to the static_key* patternPeter Zijlstra1-9/+16
Rename the JUMP_LABEL_TYPE_* macros to be JUMP_TYPE_* and move the inline helpers into kernel/jump_label.c, since that's the only place they're ever used. Also rename the helpers where it's all about static keys. This is the second step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03jump_label: Rename JUMP_LABEL_{EN,DIS}ABLE to JUMP_LABEL_{JMP,NOP}Peter Zijlstra1-9/+9
Since we've already stepped away from "ENABLE is a JMP and DISABLE is a NOP" with the branch_default bits, and are going to make it even worse, rename it to make it all clearer. This way we don't mix multiple levels of logic attributes, but have a plain 'physical' name for what the current instruction patching status of a jump label is. This is a first step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] [ Beefed up the changelog. ] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03Merge branch 'x86/asm' into locking/coreIngo Molnar2-0/+3
Upcoming changes to static keys are interacting/conflicting with the following pending TSC commits in tip:x86/asm: 4ea1636b04db x86/asm/tsc: Rename native_read_tsc() to rdtsc() ... So merge it into the locking tree to have a smoother resolution. Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/pvqspinlock: Only kick CPU at unlock timeWaiman Long2-21/+51
For an over-committed guest with more vCPUs than physical CPUs available, it is possible that a vCPU may be kicked twice before getting the lock - once before it becomes queue head and once again before it gets the lock. All these CPU kicking and halting (VMEXIT) can be expensive and slow down system performance. This patch adds a new vCPU state (vcpu_hashed) which enables the code to delay CPU kicking until unlock time. Once this state is set, the new lock holder will set _Q_SLOW_VAL and fill in the hash table on behalf of the halted queue head vCPU. The original vcpu_halted state will be used by pv_wait_node() only to differentiate other queue nodes from the queue head. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
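A sketch of the vCPU states as described above; the exact enum layout and comments are assumptions:
	enum vcpu_state {
		vcpu_running = 0,
		vcpu_halted,	/* queue-node vCPU halted; used by pv_wait_node() */
		vcpu_hashed,	/* queue-head vCPU halted and hashed; kick deferred to unlock */
	};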
2015-08-03locking/qrwlock: Reduce reader/writer to reader lock transfer latencyWaiman Long1-8/+4
Currently, a reader will check first to make sure that the writer mode byte is cleared before incrementing the reader count. That waiting is not really necessary. It increases the latency in the reader/writer to reader transition and reduces reader performance. This patch eliminates that waiting. It also has the side effect of reducing the chance of writer lock stealing and improving the fairness of the lock. Using a locking microbenchmark, a 10-threads 5M locking loop of mostly readers (RW ratio = 10,000:1) has the following performance numbers in a Haswell-EX box:
  Kernel        Locking Rate (Kops/s)
  ------        ---------------------
  4.1.1         15,063,081
  4.1.1+patch   17,241,552 (+14.4%)
Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/pvqspinlock: Order pv_unhash() after cmpxchg() on unlock slowpathWill Deacon1-5/+18
When we unlock in __pv_queued_spin_unlock(), a failed cmpxchg() on the lock value indicates that we need to take the slow-path and unhash the corresponding node blocked on the lock. Since a failed cmpxchg() does not provide any memory-ordering guarantees, it is possible that the node data could be read before the cmpxchg() on weakly-ordered architectures and therefore return a stale value, leading to hash corruption and/or a BUG(). This patch adds an smp_rmb() following the failed cmpxchg operation, so that the unhashing is ordered after the lock has been checked. Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Will Deacon <[email protected]> [ Added more comments. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Steve Capper <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
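A simplified sketch of the resulting ordering in the unlock slow path, loosely based on __pv_queued_spin_unlock(); not the exact code:
	struct __qspinlock *l = (void *)lock;
	struct pv_node *node;

	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;		/* fast path: nobody queued/hashed */

	/*
	 * The failed cmpxchg gives no ordering guarantee, so make sure the
	 * hash-table reads below cannot be speculated before the lock value
	 * was actually observed.
	 */
	smp_rmb();

	node = pv_unhash(lock);
	smp_store_release(&l->locked, 0);
	pv_kick(node->cpu);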
2015-08-03locking: Clean up pvqspinlock warningPeter Zijlstra1-6/+7
- Rename the on-stack variable to match the datastructure variable, - place the cmpxchg back under the comment that explains it, - clean up the WARN() statement to avoid superfluous conditionals and line-breaks. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03Merge branch 'locking/urgent', tag 'v4.2-rc5' into locking/core, to pick up fixes before applying new changesIngo Molnar19-124/+276
Signed-off-by: Ingo Molnar <[email protected]>
2015-08-01clockevents: Drop redundant cpumask check in tick_check_new_device()Luiz Capitulino1-3/+0
The same check is performed by tick_check_percpu(). Signed-off-by: Luiz Capitulino <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-21/+37
Conflicts:
  arch/s390/net/bpf_jit_comp.c
  drivers/net/ethernet/ti/netcp_ethss.c
  net/bridge/br_multicast.c
  net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2015-07-31PM / suspend: make sync() on suspend-to-RAM build-time optionalLen Brown2-0/+12
The Linux kernel suspend path has traditionally invoked sys_sync() before freezing user threads. But sys_sync() can be expensive, and some user-space OS's do not want the kernel to pay the cost of sys_sync() on every suspend -- preferring to invoke sync() from user-space if/when they want it. So make sys_sync() on suspend build-time optional. The default is unchanged. Signed-off-by: Len Brown <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
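A sketch of the kind of build-time gate this describes; the Kconfig symbol name used below is an assumption, not confirmed by the log entry:
	/* in the suspend path, before freezing user space */
	if (!IS_ENABLED(CONFIG_SUSPEND_SKIP_SYNC)) {	/* symbol name assumed */
		pr_info("PM: Syncing filesystems ...\n");
		sys_sync();
		pr_info("PM: Sync done.\n");
	}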
2015-07-31x86/ldt: Make modify_ldt() optionalAndy Lutomirski1-0/+1
The modify_ldt syscall exposes a large attack surface and is unnecessary for modern userspace. Make it optional. Signed-off-by: Andy Lutomirski <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Andrew Cooper <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jan Beulich <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] <[email protected]> Cc: xen-devel <[email protected]> Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Fix the waitqueue_active() check in xol_free_insn_slot()Oleg Nesterov1-0/+1
The xol_free_insn_slot()->waitqueue_active() check is buggy. We need mb() after we set the condition for wait_event(), or xol_take_insn_slot() can miss the wakeup. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
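The general pattern the fix follows, sketched against the xol_area fields mentioned elsewhere in this series; the exact statements in the real function may differ:
	clear_bit(slot_nr, area->bitmap);	/* publish the free slot */
	atomic_dec(&area->slot_count);
	smp_mb__after_atomic();			/* pairs with the barrier in wait_event() */
	if (waitqueue_active(&area->wq))
		wake_up(&area->wq);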
2015-07-31uprobes: Use vm_special_mapping to name the XOL vmaOleg Nesterov1-10/+20
Change xol_add_vma() to use _install_special_mapping(), this way we can name the vma installed by uprobes. Currently it looks like a private anonymous mapping, this is confusing and complicates the debugging. With this change /proc/$pid/maps reports "[uprobes]". As a side effect this will cause core dumps to include the XOL vma and I think this is good; this can help to debug the problem if the app crashed because it was probed. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
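A hedged sketch of how the naming works with _install_special_mapping(); the field initialization and flags are assumptions about the real patch:
	static struct vm_special_mapping xol_mapping = {
		.name = "[uprobes]",
		/* .pages assumed to be set up to point at the XOL page */
	};

	/* inside xol_add_vma(), replacing install_special_mapping(): */
	struct vm_area_struct *vma;

	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
				       VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
				       &xol_mapping);
	if (IS_ERR(vma))
		return PTR_ERR(vma);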
2015-07-31uprobes: Fix the usage of install_special_mapping()Oleg Nesterov1-8/+9
install_special_mapping(pages) expects that "pages" is a zero-terminated array while xol_add_vma() passes &area->page, this means that special_mapping_fault() can wrongly use the next member in xol_area (vaddr) as "struct page *". Fortunately, this area is not expandable so pgoff != 0 isn't possible (modulo bugs in special_mapping_vmops), but still this does not look good. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes/x86: Make arch_uretprobe_is_alive(RP_CHECK_CALL) more cleverOleg Nesterov1-7/+7
The previous change documents that cleanup_return_instances() can't always detect the dead frames; the stack can grow. But there is one special case which is imho worth fixing: arch_uretprobe_is_alive() can return true when the stack didn't actually grow, but the next "call" insn uses the already invalidated frame. Test-case:
	#include <stdio.h>
	#include <setjmp.h>

	jmp_buf jmp;
	int nr = 1024;

	void func_2(void)
	{
		if (--nr == 0)
			return;
		longjmp(jmp, 1);
	}

	void func_1(void)
	{
		setjmp(jmp);
		func_2();
	}

	int main(void)
	{
		func_1();
		return 0;
	}
If you ret-probe func_1() and func_2(), prepare_uretprobe() hits the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not reported. When we know that the new call is not chained, we can do the more strict check. In this case "sp" points to the new ret-addr, so every frame which uses the same "sp" must be dead. The only complication is that arch_uretprobe_is_alive() needs to know whether it was chained or not, so we add the new RP_CHECK_CHAIN_CALL enum and change prepare_uretprobe() to pass RP_CHECK_CALL only if !chained. Note: arch_uretprobe_is_alive() could also re-read *sp and check if this word is still trampoline_vaddr. This could obviously improve the logic, but I would like to avoid another copy_from_user() especially in the case when we can't avoid the false "alive == T" positives. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Add the "enum rp_check ctx" arg to arch_uretprobe_is_alive()Oleg Nesterov1-3/+6
arch/x86 doesn't care (so far), but as Pratyush Anand pointed out other architectures might want to know why arch_uretprobe_is_alive() was called and use different checks depending on the context. Add the new argument to distinguish 2 callers. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change prepare_uretprobe() to (try to) flush the dead framesOleg Nesterov1-0/+13
Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive() return_instance's. This is not needed correctness-wise, but can help to avoid the failure caused by MAX_URETPROBE_DEPTH. Note: in this case arch_uretprobe_is_alive() can be false positive, the stack can grow after longjmp(). Unfortunately, the kernel can't 100% solve this problem, but see the next patch. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change handle_trampoline() to flush the frames invalidated by longjmp()Oleg Nesterov1-11/+18
Test-case:
	#include <stdio.h>
	#include <setjmp.h>

	jmp_buf jmp;

	void func_2(void)
	{
		longjmp(jmp, 1);
	}

	void func_1(void)
	{
		if (setjmp(jmp))
			return;
		func_2();
		printf("ERR!! I am running on the caller's stack\n");
	}

	int main(void)
	{
		func_1();
		return 0;
	}
fails if you probe func_1() and func_2() because handle_trampoline() assumes that the probed function must return and hit the bp installed by prepare_uretprobe(). But in this case func_2() does not return, so when func_1() returns the kernel uses the no longer valid return_instance of func_2(). Change handle_trampoline() to unwind ->return_instances until we know that the next chain is alive or NULL; this ensures that the current chain is the last we need to report and free. Alternatively, every return_instance could use a unique trampoline_vaddr; in this case we could use it as a key. And this could solve the problem with sigaltstack() automatically. But this approach needs more changes, and it puts the "hard" limit on MAX_URETPROBE_DEPTH. Plus it cannot solve another problem partially fixed by the next patch. Note: this change has no effect on !x86, the arch-agnostic version of arch_uretprobe_is_alive() just returns "true". TODO: as documented by the previous change, arch_uretprobe_is_alive() can be fooled by sigaltstack/etc. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes/x86: Reimplement arch_uretprobe_is_alive()Oleg Nesterov1-0/+1
Add the x86 specific version of arch_uretprobe_is_alive() helper. It returns true if the stack frame mangled by prepare_uretprobe() is still on stack. So if it returns false, we know that the probed function has already returned. We add the new return_instance->stack member and change the generic code to initialize it in prepare_uretprobe, but it should be equally useful for other architectures. TODO: this assumes that the probed application can't use multiple stacks (say sigaltstack). We will try to improve this logic later. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
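A hedged sketch of the x86 check this describes; the exact signature and comparison in the real code may differ:
	bool arch_uretprobe_is_alive(struct return_instance *ret,
				     struct pt_regs *regs)
	{
		/* the frame is dead once the stack pointer has moved above it */
		return regs->sp <= ret->stack;
	}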
2015-07-31uprobes: Export 'struct return_instance', introduce arch_uretprobe_is_alive()Oleg Nesterov1-9/+5
Add the new "weak" helper, arch_uretprobe_is_alive(), used by the next patches. It should return true if this return_instance is still valid. The arch agnostic version just always returns true. The patch exports "struct return_instance" for the architectures which want to override this hook. We can also cleanup prepare_uretprobe() if we pass the new return_instance to arch_uretprobe_hijack_return_addr(). Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change handle_trampoline() to find the next chain beforehandOleg Nesterov1-11/+16
No functional changes, preparation. Add the new helper, find_next_ret_chain(), which finds the first !chained entry and returns its ->next. Yes, it is suboptimal. We probably want to turn ->chained into ->start_of_this_chain pointer and avoid another loop. But this needs the boring changes in dup_utask(), so let's do this later. Change the main loop in handle_trampoline() to unwind the stack until ri is equal to the pointer returned by this new helper. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change prepare_uretprobe() to use uprobe_warn()Oleg Nesterov1-7/+3
Turn the last pr_warn() in uprobes.c into uprobe_warn(). While at it: - s/kzalloc/kmalloc, we initialize every member of 'ri' - remove the pointless comment above the obvious code Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Send SIGILL if handle_trampoline() failsOleg Nesterov1-11/+10
1. It doesn't make sense to continue if handle_trampoline() fails, change handle_swbp() to always return after this call. 2. Turn pr_warn() into uprobe_warn(), and change handle_trampoline() to send SIGILL on failure. It is pointless to return to user mode with the corrupted instruction_pointer() which we can't restore. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Introduce free_ret_instance()Oleg Nesterov1-14/+13
We can simplify uprobe_free_utask() and handle_uretprobe_chain() if we add a simple helper which does put_uprobe/kfree and returns the ->next return_instance. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
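A sketch of the helper as described (put_uprobe/kfree and return ->next), written from the description rather than the actual diff:
	static struct return_instance *free_ret_instance(struct return_instance *ri)
	{
		struct return_instance *next = ri->next;

		put_uprobe(ri->uprobe);
		kfree(ri);
		return next;
	}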
2015-07-31uprobes: Introduce get_uprobe()Oleg Nesterov1-19/+20
Cosmetic. Add the new trivial helper, get_uprobe(). It matches put_uprobe() we already have and we can simplify a couple of its users. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31Merge branch 'x86/urgent' into x86/asm, before applying dependent patchesIngo Molnar18-123/+266
Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31Merge branch 'perf/urgent' into perf/core, to merge fixes before pulling more changesIngo Molnar3-23/+39
Signed-off-by: Ingo Molnar <[email protected]>
2015-07-30genirq/irqdomain: Allow irq domain aliasingMarc Zyngier1-5/+13
It is not uncommon (at least with the ARM stuff) to have a piece of hardware that implements different flavours of "interrupts". A typical example of this is the GICv3 ITS, which implements standard PCI/MSI support, but also some form of "generic MSI". So far, the PCI/MSI domain is registered using the ITS device_node, so that irq_find_host can return it. On the contrary, the raw MSI domain is not registered with a device_node, making it impossible to be looked up by another subsystem (obviously, using the same device_node twice would only result in confusion, as it is not defined which one irq_find_host would return). A solution to this is to "type" domains that may be aliasing, and to be able to look up a device_node that matches a given type. For this, we introduce irq_find_matching_host() as a superset of irq_find_host:
  struct irq_domain *irq_find_matching_host(struct device_node *node,
					     enum irq_domain_bus_token bus_token);
where bus_token is the "type" we want to match the domain against (so far, only DOMAIN_BUS_ANY is defined). This results in some moderately invasive changes on the PPC side (which is the only user of the .match method). This has otherwise no functional change. Reviewed-by: Hanjun Guo <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Cc: <[email protected]> Cc: Yijing Wang <[email protected]> Cc: Ma Jun <[email protected]> Cc: Lorenzo Pieralisi <[email protected]> Cc: Duc Dang <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
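A brief usage sketch of the new lookup; the device_node variable is illustrative:
	struct irq_domain *domain;

	domain = irq_find_matching_host(msi_np, DOMAIN_BUS_ANY);
	if (!domain)
		return -ENODEV;		/* no domain registered for this node/type */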
2015-07-30genirq: Use the proper parameter name in kernel docMasanari Iida1-1/+1
The following warning is emitted for make xmldocs: Warning(.//kernel/irq/chip.c:1009): No description found for parameter 'vcpu_info' Warning(.//kernel/irq/chip.c:1009): Excess function parameter 'dest' description in 'irq_chip_set_vcpu_affinity_parent' Signed-off-by: Masanari Iida <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-30Merge branch 'linus' into irq/coreThomas Gleixner15-113/+231
Pull in upstream fixes before applying conflicting changes
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig2-15/+7
Currently we have two different ways to signal an I/O error on a BIO:
 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not being persistent when bios are queued up, and of not being passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
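A hedged sketch of what completion code looks like with the new field, assuming this series also drops the error argument from bi_end_io; the function names are illustrative:
	static void my_end_io(struct bio *bio)
	{
		if (bio->bi_error)
			pr_err("request failed: %d\n", bio->bi_error);
		bio_put(bio);
	}

	static void fail_bio(struct bio *bio)
	{
		bio->bi_error = -EIO;	/* one persistent error value */
		bio_endio(bio);		/* no separate error argument */
	}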
2015-07-29nohz: Remove useless argument on tick_nohz_task_switch()Frederic Weisbecker2-2/+2
Leftover from early code. Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Move tick_nohz_restart_sched_tick() above its usersFrederic Weisbecker1-18/+16
Fix the function declaration/definition dance. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Restart nohz full tick from irq exitFrederic Weisbecker1-24/+10
Restart the tick when necessary from the irq exit path. It makes nohz full more flexible, simplifies the related IPIs and doesn't bring significant overhead on irq exit. In a longer term view, it will allow us to piggyback the nohz kick on the scheduler IPI in the future instead of sending a dedicated IPI that often doubles the scheduler IPI on task wakeup. This will require more changes though including careful review of resched_curr() callers to include nohz full needs. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Remove idle task special caseFrederic Weisbecker1-5/+3
On nohz full early days, idle dynticks and full dynticks weren't well integrated and we couldn't risk full dynticks calls on idle without risking messing up tick idle statistics. This is why we prevented such a thing from happening. Nowadays full dynticks and idle dynticks are better integrated and interact without known issues. So let's remove that. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>