aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2015-08-03locking/pvqspinlock: Only kick CPU at unlock timeWaiman Long2-21/+51
For an over-committed guest with more vCPUs than physical CPUs available, it is possible that a vCPU may be kicked twice before getting the lock - once before it becomes queue head and once again before it gets the lock. All these CPU kicking and halting (VMEXIT) can be expensive and slow down system performance. This patch adds a new vCPU state (vcpu_hashed) which enables the code to delay CPU kicking until at unlock time. Once this state is set, the new lock holder will set _Q_SLOW_VAL and fill in the hash table on behalf of the halted queue head vCPU. The original vcpu_halted state will be used by pv_wait_node() only to differentiate other queue nodes from the qeue head. Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/qrwlock: Reduce reader/writer to reader lock transfer latencyWaiman Long1-8/+4
Currently, a reader will check first to make sure that the writer mode byte is cleared before incrementing the reader count. That waiting is not really necessary. It increases the latency in the reader/writer to reader transition and reduces readers performance. This patch eliminates that waiting. It also has the side effect of reducing the chance of writer lock stealing and improving the fairness of the lock. Using a locking microbenchmark, a 10-threads 5M locking loop of mostly readers (RW ratio = 10,000:1) has the following performance numbers in a Haswell-EX box: Kernel Locking Rate (Kops/s) ------ --------------------- 4.1.1 15,063,081 4.1.1+patch 17,241,552 (+14.4%) Signed-off-by: Waiman Long <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Douglas Hatch <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Scott J Norton <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking/pvqspinlock: Order pv_unhash() after cmpxchg() on unlock slowpathWill Deacon1-5/+18
When we unlock in __pv_queued_spin_unlock(), a failed cmpxchg() on the lock value indicates that we need to take the slow-path and unhash the corresponding node blocked on the lock. Since a failed cmpxchg() does not provide any memory-ordering guarantees, it is possible that the node data could be read before the cmpxchg() on weakly-ordered architectures and therefore return a stale value, leading to hash corruption and/or a BUG(). This patch adds an smb_rmb() following the failed cmpxchg operation, so that the unhashing is ordered after the lock has been checked. Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Will Deacon <[email protected]> [ Added more comments] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Waiman Long <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Paul McKenney <[email protected]> Cc: Steve Capper <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03locking: Clean up pvqspinlock warningPeter Zijlstra1-6/+7
- Rename the on-stack variable to match the datastructure variable, - place the cmpxchg back under the comment that explains it, - clean up the WARN() statement to avoid superfluous conditionals and line-breaks. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Waiman Long <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-08-03Merge branch 'locking/urgent', tag 'v4.2-rc5' into locking/core, to pick up ↵Ingo Molnar19-124/+276
fixes before applying new changes Signed-off-by: Ingo Molnar <[email protected]>
2015-08-01clockevents: Drop redundant cpumask check in tick_check_new_device()Luiz Capitulino1-3/+0
The same check is performed by tick_check_percpu(). Signed-off-by: Luiz Capitulino <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-21/+37
Conflicts: arch/s390/net/bpf_jit_comp.c drivers/net/ethernet/ti/netcp_ethss.c net/bridge/br_multicast.c net/ipv4/ip_fragment.c All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2015-07-31PM / suspend: make sync() on suspend-to-RAM build-time optionalLen Brown2-0/+12
The Linux kernel suspend path has traditionally invoked sys_sync() before freezing user threads. But sys_sync() can be expensive, and some user-space OS's do not want the kernel to pay the cost of sys_sync() on every suspend -- preferring invoke sync() from user-space if/when they want it. So make sys_sync on suspend build-time optional. The default is unchanged. Signed-off-by: Len Brown <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2015-07-31x86/ldt: Make modify_ldt() optionalAndy Lutomirski1-0/+1
The modify_ldt syscall exposes a large attack surface and is unnecessary for modern userspace. Make it optional. Signed-off-by: Andy Lutomirski <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: Andrew Cooper <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jan Beulich <[email protected]> Cc: Konrad Rzeszutek Wilk <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] <[email protected]> Cc: xen-devel <[email protected]> Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Fix the waitqueue_active() check in xol_free_insn_slot()Oleg Nesterov1-0/+1
The xol_free_insn_slot()->waitqueue_active() check is buggy. We need mb() after we set the conditon for wait_event(), or xol_take_insn_slot() can miss the wakeup. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Use vm_special_mapping to name the XOL vmaOleg Nesterov1-10/+20
Change xol_add_vma() to use _install_special_mapping(), this way we can name the vma installed by uprobes. Currently it looks like private anonymous mapping, this is confusing and complicates the debugging. With this change /proc/$pid/maps reports "[uprobes]". As a side effect this will cause core dumps to include the XOL vma and I think this is good; this can help to debug the problem if the app crashed because it was probed. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Fix the usage of install_special_mapping()Oleg Nesterov1-8/+9
install_special_mapping(pages) expects that "pages" is the zero- terminated array while xol_add_vma() passes &area->page, this means that special_mapping_fault() can wrongly use the next member in xol_area (vaddr) as "struct page *". Fortunately, this area is not expandable so pgoff != 0 isn't possible (modulo bugs in special_mapping_vmops), but still this does not look good. Signed-off-by: Oleg Nesterov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pratyush Anand <[email protected]> Cc: Srikar Dronamraju <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes/x86: Make arch_uretprobe_is_alive(RP_CHECK_CALL) more cleverOleg Nesterov1-7/+7
The previous change documents that cleanup_return_instances() can't always detect the dead frames, the stack can grow. But there is one special case which imho worth fixing: arch_uretprobe_is_alive() can return true when the stack didn't actually grow, but the next "call" insn uses the already invalidated frame. Test-case: #include <stdio.h> #include <setjmp.h> jmp_buf jmp; int nr = 1024; void func_2(void) { if (--nr == 0) return; longjmp(jmp, 1); } void func_1(void) { setjmp(jmp); func_2(); } int main(void) { func_1(); return 0; } If you ret-probe func_1() and func_2() prepare_uretprobe() hits the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not reported. When we know that the new call is not chained, we can do the more strict check. In this case "sp" points to the new ret-addr, so every frame which uses the same "sp" must be dead. The only complication is that arch_uretprobe_is_alive() needs to know was it chained or not, so we add the new RP_CHECK_CHAIN_CALL enum and change prepare_uretprobe() to pass RP_CHECK_CALL only if !chained. Note: arch_uretprobe_is_alive() could also re-read *sp and check if this word is still trampoline_vaddr. This could obviously improve the logic, but I would like to avoid another copy_from_user() especially in the case when we can't avoid the false "alive == T" positives. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Add the "enum rp_check ctx" arg to arch_uretprobe_is_alive()Oleg Nesterov1-3/+6
arch/x86 doesn't care (so far), but as Pratyush Anand pointed out other architectures might want why arch_uretprobe_is_alive() was called and use different checks depending on the context. Add the new argument to distinguish 2 callers. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change prepare_uretprobe() to (try to) flush the dead framesOleg Nesterov1-0/+13
Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive() return_instance's. This is not needed correctness-wise, but can help to avoid the failure caused by MAX_URETPROBE_DEPTH. Note: in this case arch_uretprobe_is_alive() can be false positive, the stack can grow after longjmp(). Unfortunately, the kernel can't 100% solve this problem, but see the next patch. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change handle_trampoline() to flush the frames invalidated by longjmp()Oleg Nesterov1-11/+18
Test-case: #include <stdio.h> #include <setjmp.h> jmp_buf jmp; void func_2(void) { longjmp(jmp, 1); } void func_1(void) { if (setjmp(jmp)) return; func_2(); printf("ERR!! I am running on the caller's stack\n"); } int main(void) { func_1(); return 0; } fails if you probe func_1() and func_2() because handle_trampoline() assumes that the probed function should must return and hit the bp installed be prepare_uretprobe(). But in this case func_2() does not return, so when func_1() returns the kernel uses the no longer valid return_instance of func_2(). Change handle_trampoline() to unwind ->return_instances until we know that the next chain is alive or NULL, this ensures that the current chain is the last we need to report and free. Alternatively, every return_instance could use unique trampoline_vaddr, in this case we could use it as a key. And this could solve the problem with sigaltstack() automatically. But this approach needs more changes, and it puts the "hard" limit on MAX_URETPROBE_DEPTH. Plus it can not solve another problem partially fixed by the next patch. Note: this change has no effect on !x86, the arch-agnostic version of arch_uretprobe_is_alive() just returns "true". TODO: as documented by the previous change, arch_uretprobe_is_alive() can be fooled by sigaltstack/etc. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes/x86: Reimplement arch_uretprobe_is_alive()Oleg Nesterov1-0/+1
Add the x86 specific version of arch_uretprobe_is_alive() helper. It returns true if the stack frame mangled by prepare_uretprobe() is still on stack. So if it returns false, we know that the probed function has already returned. We add the new return_instance->stack member and change the generic code to initialize it in prepare_uretprobe, but it should be equally useful for other architectures. TODO: this assumes that the probed application can't use multiple stacks (say sigaltstack). We will try to improve this logic later. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Export 'struct return_instance', introduce arch_uretprobe_is_alive()Oleg Nesterov1-9/+5
Add the new "weak" helper, arch_uretprobe_is_alive(), used by the next patches. It should return true if this return_instance is still valid. The arch agnostic version just always returns true. The patch exports "struct return_instance" for the architectures which want to override this hook. We can also cleanup prepare_uretprobe() if we pass the new return_instance to arch_uretprobe_hijack_return_addr(). Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change handle_trampoline() to find the next chain beforehandOleg Nesterov1-11/+16
No functional changes, preparation. Add the new helper, find_next_ret_chain(), which finds the first !chained entry and returns its ->next. Yes, it is suboptimal. We probably want to turn ->chained into ->start_of_this_chain pointer and avoid another loop. But this needs the boring changes in dup_utask(), so lets do this later. Change the main loop in handle_trampoline() to unwind the stack until ri is equal to the pointer returned by this new helper. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Change prepare_uretprobe() to use uprobe_warn()Oleg Nesterov1-7/+3
Turn the last pr_warn() in uprobes.c into uprobe_warn(). While at it: - s/kzalloc/kmalloc, we initialize every member of 'ri' - remove the pointless comment above the obvious code Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Send SIGILL if handle_trampoline() failsOleg Nesterov1-11/+10
1. It doesn't make sense to continue if handle_trampoline() fails, change handle_swbp() to always return after this call. 2. Turn pr_warn() into uprobe_warn(), and change handle_trampoline() to send SIGILL on failure. It is pointless to return to user mode with the corrupted instruction_pointer() which we can't restore. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Introduce free_ret_instance()Oleg Nesterov1-14/+13
We can simplify uprobe_free_utask() and handle_uretprobe_chain() if we add a simple helper which does put_uprobe/kfree and returns the ->next return_instance. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31uprobes: Introduce get_uprobe()Oleg Nesterov1-19/+20
Cosmetic. Add the new trivial helper, get_uprobe(). It matches put_uprobe() we already have and we can simplify a couple of its users. Tested-by: Pratyush Anand <[email protected]> Signed-off-by: Oleg Nesterov <[email protected]> Acked-by: Srikar Dronamraju <[email protected]> Acked-by: Anton Arapov <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31Merge branch 'x86/urgent' into x86/asm, before applying dependent patchesIngo Molnar18-123/+266
Signed-off-by: Ingo Molnar <[email protected]>
2015-07-31Merge branch 'perf/urgent' into perf/core, to merge fixes before pulling ↵Ingo Molnar3-23/+39
more changes Signed-off-by: Ingo Molnar <[email protected]>
2015-07-30genirq/irqdomain: Allow irq domain aliasingMarc Zyngier1-5/+13
It is not uncommon (at least with the ARM stuff) to have a piece of hardware that implements different flavours of "interrupts". A typical example of this is the GICv3 ITS, which implements standard PCI/MSI support, but also some form of "generic MSI". So far, the PCI/MSI domain is registered using the ITS device_node, so that irq_find_host can return it. On the contrary, the raw MSI domain is not registered with an device_node, making it impossible to be looked up by another subsystem (obviously, using the same device_node twice would only result in confusion, as it is not defined which one irq_find_host would return). A solution to this is to "type" domains that may be aliasing, and to be able to lookup an device_node that matches a given type. For this, we introduce irq_find_matching_host() as a superset of irq_find_host: struct irq_domain *irq_find_matching_host(struct device_node *node, enum irq_domain_bus_token bus_token); where bus_token is the "type" we want to match the domain against (so far, only DOMAIN_BUS_ANY is defined). This result in some moderately invasive changes on the PPC side (which is the only user of the .match method). This has otherwise no functionnal change. Reviewed-by: Hanjun Guo <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Cc: <[email protected]> Cc: Yijing Wang <[email protected]> Cc: Ma Jun <[email protected]> Cc: Lorenzo Pieralisi <[email protected]> Cc: Duc Dang <[email protected]> Cc: Bjorn Helgaas <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Jason Cooper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-30genirq: Use the proper parameter name in kernel docMasanari Iida1-1/+1
The following warning is emitted for make xmldocs: Warning(.//kernel/irq/chip.c:1009): No description found for parameter 'vcpu_info' Warning(.//kernel/irq/chip.c:1009): Excess function parameter 'dest' description in 'irq_chip_set_vcpu_affinity_parent' Signed-off-by: Masanari Iida <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-30Merge branch 'linus' into irq/coreThomas Gleixner15-113/+231
Pull in upstream fixes before applying conflicting changes
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig2-15/+7
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2015-07-29nohz: Remove useless argument on tick_nohz_task_switch()Frederic Weisbecker2-2/+2
Leftover from early code. Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Move tick_nohz_restart_sched_tick() above its usersFrederic Weisbecker1-18/+16
Fix the function declaration/definition dance. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Restart nohz full tick from irq exitFrederic Weisbecker1-24/+10
Restart the tick when necessary from the irq exit path. It makes nohz full more flexible, simplify the related IPIs and doesn't bring significant overhead on irq exit. In a longer term view, it will allow us to piggyback the nohz kick on the scheduler IPI in the future instead of sending a dedicated IPI that often doubles the scheduler IPI on task wakeup. This will require more changes though including careful review of resched_curr() callers to include nohz full needs. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29nohz: Remove idle task special caseFrederic Weisbecker1-5/+3
On nohz full early days, idle dynticks and full dynticks weren't well integrated and we couldn't risk full dynticks calls on idle without risking messing up tick idle statistics. This is why we prevented such thing to happen. Nowadays full dynticks and idle dynticks are better integrated and interact without known issue. So lets remove that. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29jiffies: Remove HZ > USEC_PER_SEC special caseFrederic Weisbecker1-3/+7
HZ never goes much further 1000 and a bit. And if we ever reach one tick per microsecond, we might be having a problem. Lets stop maintaining this special case, just leave a paranoid check. Reviewed-by: Rik van Riel <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Ingo Molnar <[email protected]> Cc; John Stultz <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Preeti U Murthy <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Viresh Kumar <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2015-07-29module: weaken locking assertion for oops path.Rusty Russell1-2/+6
We don't actually hold the module_mutex when calling find_module_all from module_kallsyms_lookup_name: that's because it's used by the oops code and we don't want to deadlock. However, access to the list read-only is safe if preempt is disabled, so we can weaken the assertion. Keep a strong version for external callers though. Fixes: 0be964be0d45 ("module: Sanitize RCU usage and locking") Reported-by: He Kuang <[email protected]> Cc: [email protected] Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Rusty Russell <[email protected]>
2015-07-27perf: Fix running time accountingPeter Zijlstra1-2/+2
A recent fix to the shadow timestamp inadvertly broke the running time accounting. We must not update the running timestamp if we fail to schedule the event, the event will not have ran. This can (and did) result in negative total runtime because the stopped timestamp was before the running timestamp (we 'started' but never stopped the event -- because it never really started we didn't have to stop it either). Reported-and-Tested-by: Vince Weaver <[email protected]> Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: [email protected] # 4.1 Cc: Shaohua Li <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-27ebpf: Allow dereferences of PTR_TO_STACK registersAlex Gartrell1-1/+5
mov %rsp, %r1 ; r1 = rsp add $-8, %r1 ; r1 = rsp - 8 store_q $123, -8(%rsp) ; *(u64*)r1 = 123 <- valid store_q $123, (%r1) ; *(u64*)r1 = 123 <- previously invalid mov $0, %r0 exit ; Always need to exit And we'd get the following error: 0: (bf) r1 = r10 1: (07) r1 += -8 2: (7a) *(u64 *)(r10 -8) = 999 3: (7a) *(u64 *)(r1 +0) = 999 R1 invalid mem access 'fp' Unable to load program We already know that a register is a stack address and the appropriate offset, so we should be able to validate those references as well. Signed-off-by: Alex Gartrell <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-07-27genirq: Add chip_[suspend|resume] PM support to irq_chipBrian Norris1-0/+6
Some (admittedly odd) irqchips perform functions that are not directly related to any of their child IRQ lines, and therefore need to perform some tasks during suspend/resume regardless of whether there are any "installed" interrupts for the irqchip. However, the current generic-chip framework does not call the chip's irq_{suspend,resume} when there are no interrupts installed (this makes sense, because there are no irq_data objects for such a call to be made). More specifically, irq-bcm7120-l2 configures both a forwarding mask (which affects other top-level GIC IRQs) and a second-level interrupt mask (for managing its own child interrupts). The former must be saved/restored on suspend/resume, even when there's nothing to do for the latter. This patch adds a new set of suspend/resume hooks to irq_chip_generic, to help represent *chip* suspend/resume, rather than IRQ suspend/resume. These callbacks will always be called for an IRQ chip (regardless of the installed interrupts) and are based on the per-chip irq_chip_generic struct, rather than the per-IRQ irq_data struct. The original problem report is described in extra detail here: http://lkml.kernel.org/g/20150619224123.GL4917@ld-irv-0074 Signed-off-by: Brian Norris <[email protected]> Tested-by: Florian Fainelli <[email protected]> Cc: Gregory Fong <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Kevin Cernekee <[email protected]> Cc: Jason Cooper <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-27genirq: Export irq_[get|set]_irqchip_state()Bjorn Andersson1-0/+2
Export these functions to be able to build the Qualcomm family A PMIC gpio and mpp drivers as modules. [ tglx: Made them GPL exports ] Signed-off-by: Bjorn Andersson <[email protected]> Reviewed-by: Mark Brown <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Srinivas Kandagatla <[email protected]> Cc: Linus Walleij <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2015-07-26Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "This update contains: - the manual revert of the SYSCALL32 changes which caused a regression - a fix for the MPX vma handling - three fixes for the ioremap 'is ram' checks. - PAT warning fixes - a trivial fix for the size calculation of TLB tracepoints - handle old EFI structures gracefully This also contains a PAT fix from Jan plus a revert thereof. Toshi explained why the code is correct" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/pat: Revert 'Adjust default caching mode translation tables' x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit x86/mm: Fix newly introduced printk format warnings mm: Fix bugs in region_is_ram() x86/mm: Remove region_is_ram() call from ioremap x86/mm: Move warning from __ioremap_check_ram() to the call site x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn() x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn() x86/mm/pat: Adjust default caching mode translation tables x86/fpu: Disable dependent CPU features on "noxsave" x86/mpx: Do not set ->vm_ops on MPX VMAs x86/mm: Add parenthesis for TLB tracepoint size calculation efi: Handle memory error structures produced based on old versions of standard
2015-07-25Merge tag 'trace-v4.2-rc2-fix3' of ↵Linus Torvalds1-18/+34
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull ftrace fix from Steven Rostedt: "Back in 3.16 the ftrace code was redesigned and cleaned up to remove the double iteration list (one for registered ftrace ops, and one for registered "global" ops), to just use one list. That simplified the code but also broke the function tracing filtering on pid. This updates the code to handle the filtering again with the new logic" * tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ftrace: Fix breakage of set_ftrace_pid
2015-07-24ftrace: Fix breakage of set_ftrace_pidSteven Rostedt (Red Hat)1-18/+34
Commit 4104d326b670 ("ftrace: Remove global function list and call function directly") simplified the ftrace code by removing the global_ops list with a new design. But this cleanup also broke the filtering of PIDs that are added to the set_ftrace_pid file. Add back the proper hooks to have pid filtering working once again. Cc: [email protected] # 3.16+ Reported-by: Matt Fleming <[email protected]> Reported-by: Richard Weinberger <[email protected]> Tested-by: Matt Fleming <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>
2015-07-23perf: Add PERF_RECORD_SWITCH to indicate context switchesAdrian Hunter1-0/+103
There are already two events for context switches, namely the tracepoint sched:sched_switch and the software event context_switches. Unfortunately neither are suitable for use by non-privileged users for the purpose of synchronizing hardware trace data (e.g. Intel PT) to the context switch. Tracepoints are no good at all for non-privileged users because they need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1. On the other hand, kernel software events need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= 1. Now many distributions do default perf_event_paranoid to 1 making context_switches a contender, except it has another problem (which is also shared with sched:sched_switch) which is that it happens before perf schedules events out instead of after perf schedules events in. Whereas a privileged user can see all the events anyway, a non-privileged user only sees events for their own processes, in other words they see when their process was scheduled out not when it was scheduled in. That presents two problems to use the event: 1. the information comes too late, so tools have to look ahead in the event stream to find out what the current state is 2. if they are unlucky tracing might have stopped before the context-switches event is recorded. This new PERF_RECORD_SWITCH event does not have those problems and it also has a couple of other small advantages. It is easier to use because it is an auxiliary event (like mmap, comm and task events) which can be enabled by setting a single bit. It is smaller than sched:sched_switch and easier to parse. To make the event useful for privileged users also, if the context is cpu-wide then the event record will be PERF_RECORD_SWITCH_CPU_WIDE which is the same as PERF_RECORD_SWITCH except it also provides the next or previous pid/tid. Signed-off-by: Adrian Hunter <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Jiri Olsa <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Mathieu Poirier <[email protected]> Cc: Pawel Moll <[email protected]> Cc: Stephane Eranian <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
2015-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller8-24/+32
Conflicts: net/bridge/br_mdb.c br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <[email protected]>
2015-07-22rcu: Don't disable CPU hotplug during OOM notifiersPaul E. McKenney1-2/+0
RCU's rcu_oom_notify() disables CPU hotplug in order to stabilize the list of online CPUs, which it traverses. However, this is completely pointless because smp_call_function_single() will quietly fail if invoked on an offline CPU. Because the count of requests is incremented in the rcu_oom_notify_cpu() function that is remotely invoked, everything works nicely even in the face of concurrent CPU-hotplug operations. Furthermore, in recent kernels, invoking get_online_cpus() from an OOM notifier can result in deadlock. This commit therefore removes the call to get_online_cpus() and put_online_cpus() from rcu_oom_notify(). Reported-by: Marcin Ślusarz <[email protected]> Reported-by: David Rientjes <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Marcin Ślusarz <[email protected]>
2015-07-22rcu: Fix backwards RCU_LOCKDEP_WARN() in synchronize_rcu_tasks()Paul E. McKenney1-1/+1
The RCU_LOCKDEP_WARN() in synchronize_rcu_tasks() triggers if the scheduler is active, which is backwards. This commit therefore negates the test. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-22rcu: Rename rcu_lockdep_assert() to RCU_LOCKDEP_WARN()Paul E. McKenney9-48/+47
This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for consistency with the WARN() series of macros. This also requires inverting the sense of the conditional, which this commit also does. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Ingo Molnar <[email protected]>
2015-07-22rcu: Make rcu_is_watching() really notraceAlexei Starovoitov1-2/+2
Although rcu_is_watching() is marked notrace, it invokes preempt_disable() and preempt_enable(), both of which can be traced. This defeats the purpose of the notrace on rcu_is_watching(), so this commit substitutes preempt_disable_notrace() and preempt_enable_notrace(). Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Steven Rostedt <[email protected]>
2015-07-22cpu: Wait for RCU grace periods concurrentlyPaul E. McKenney1-5/+5
In kernels built with CONFIG_PREEMPT, _cpu_down() waits for RCU and RCU-sched grace periods back-to-back, incurring quite a bit more latency than required. This commit therefore uses the new synchronize_rcu_mult() to allow waiting for both grace periods concurrently. Signed-off-by: Paul E. McKenney <[email protected]>
2015-07-22rcu: Create a synchronize_rcu_mult()Paul E. McKenney1-10/+27
There have been several requests for a primitive that waits for grace periods for several RCU flavors concurrently, so this commit creates it. This is a variadic macro, and you pass in the call_rcu() functions of the flavors of RCU that you wish to wait for. Note that you cannot pass in call_srcu() for two reasons: (1) This would result in a type mismatch and (2) You need to specify which srcu_struct you want to use. Handle this by creating a wrapper function for your SRCU domain, for example: void call_srcu_mine(struct rcu_head *head, rcu_callback_t func) { call_srcu(&ss_mine, head, func); } You can then do something like this: synchronize_rcu_mult(call_srcu_mine, call_rcu, call_rcu_sched); Signed-off-by: Paul E. McKenney <[email protected]>