path: root/kernel
Age | Commit message | Author | Files | Lines
2017-08-18kernel/watchdog: Prevent false positives with turbo modesThomas Gleixner2-0/+60
The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumes a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme turbo modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period, which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low-pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
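A minimal sketch of the low-pass filter idea in C, assuming a per-CPU timestamp compared against a threshold of roughly 4/5 of the watchdog period; the identifiers are illustrative and not necessarily those of the merged patch:

    static DEFINE_PER_CPU(ktime_t, last_nmi_timestamp);
    static ktime_t watchdog_hrtimer_sample_threshold;  /* ~4/5 of the watchdog period */

    /* Called from the perf/NMI handler before the hard-lockup check. */
    static bool watchdog_check_timestamp(void)
    {
            ktime_t delta, now = ktime_get_mono_fast_ns();

            delta = now - __this_cpu_read(last_nmi_timestamp);
            if (delta < watchdog_hrtimer_sample_threshold) {
                    /* Perf/NMI fired early (e.g. turbo mode): ignore this event */
                    return false;
            }
            __this_cpu_write(last_nmi_timestamp, now);
            return true;
    }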
2017-08-18genirq: Restore trigger settings in irq_modify_status()Marc Zyngier1-2/+8
irq_modify_status() starts by clearing the trigger settings from irq_data before applying the new settings, but doesn't restore them, leaving them set to IRQ_TYPE_NONE. That's pretty confusing to the potential request_irq() that could follow. Instead, snapshot the settings before clearing them, and restore them if the irq_modify_status() invocation was not changing the trigger. Fixes: 1e2a7d78499e ("irqdomain: Don't set type when mapping an IRQ") Reported-and-tested-by: jeffy <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jon Hunter <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
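A hedged sketch of the snapshot-and-restore idea inside irq_modify_status(), using the usual genirq accessors; the exact restore condition in the merged fix may differ:

    u32 trigger;

    /* Snapshot the current trigger before the status bits are wiped */
    trigger = irqd_get_trigger_type(&desc->irq_data);

    irqd_clear(&desc->irq_data, IRQD_TRIGGER_MASK);
    /* ... apply the caller's clr/set flags as before ... */

    /* Restore the old trigger if the caller did not specify a new one */
    if (!(set & IRQ_TYPE_SENSE_MASK))
            irqd_set_trigger_type(&desc->irq_data, trigger);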
2017-08-18Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner3-30/+313
Merge the flow handlers and irq domain extensions which are in a separate branch so they can be consumed by the gpio folks.
2017-08-18irqdomain: Add irq_domain_{push,pop}_irq() functionsDavid Daney1-0/+169
For an already existing irqdomain hierarchy, as might be obtained via a call to pci_enable_msix_range(), a PCI driver wishing to add an additional irqdomain to the hierarchy needs to be able to insert the irqdomain into that already initialized hierarchy. Calling irq_domain_create_hierarchy() allows the new irqdomain to be created, but no existing code allows for initializing the associated irq_data. Add a couple of helper functions (irq_domain_push_irq() and irq_domain_pop_irq()) to initialize the irq_data for the new irqdomain added to an existing hierarchy. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
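A hedged usage sketch from a driver's point of view; the exact prototypes (in particular the third argument carrying chip data for the new level) are assumptions drawn from the description above:

    static int demo_attach_domain(struct irq_domain *new_domain, int virq, void *priv)
    {
            int err;

            /* Insert new_domain as the outermost level of virq's existing
             * hierarchy and initialize its irq_data with priv as chip data. */
            err = irq_domain_push_irq(new_domain, virq, priv);
            if (err)
                    return err;

            /* ... the driver uses virq as usual; on teardown: ... */
            irq_domain_pop_irq(new_domain, virq);
            return 0;
    }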
2017-08-18irqdomain: Check for NULL function pointer in irq_domain_free_irqs_hierarchy()David Daney1-1/+2
A follow-on patch will call irq_domain_free_irqs_hierarchy() when the free() function pointer may be NULL. Add a NULL pointer check to handle this new use case. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18irqdomain: Factor out code to add and remove items to and from the revmapDavid Daney1-29/+29
The code to add and remove items to and from the revmap occurs several times. In preparation for the follow-on patches that add more uses of this code, factor it out into separate static functions. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq: Add handle_fasteoi_{level,edge}_irq flow handlersDavid Daney2-0/+110
A follow-on patch for gpio-thunderx uses an irqdomain hierarchy which requires slightly different flow handlers; add them to chip.c, which contains most of the other flow handlers. Make these conditionally compiled based on CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq: Export more irq_chip_*_parent() functionsDavid Daney1-0/+3
Many of the family of functions including irq_chip_mask_parent(), irq_chip_unmask_parent() are exported, but not all. Add EXPORT_SYMBOL_GPL to irq_chip_enable_parent, irq_chip_disable_parent and irq_chip_set_affinity_parent, so they likewise are usable from modules. Signed-off-by: David Daney <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Alexandre Courbot <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq/proc: Use the accessor to report the effective affinityMarc Zyngier1-1/+1
If CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK is defined but the interrupt is not single target, the effective affinity reported in /proc/irq/x/effective_affinity will be empty, which is not the truth. Instead, use the accessor to report the affinity, which will pick the right mask. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Andrew Lunn <[email protected]> Cc: James Hogan <[email protected]> Cc: Jason Cooper <[email protected]> Cc: Paul Burton <[email protected]> Cc: Chris Zankel <[email protected]> Cc: Kevin Cernekee <[email protected]> Cc: Wei Xu <[email protected]> Cc: Max Filippov <[email protected]> Cc: Florian Fainelli <[email protected]> Cc: Gregory Clement <[email protected]> Cc: Matt Redfearn <[email protected]> Cc: Sebastian Hesselbarth <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-18genirq/debugfs: Triggering of interrupts from userspaceMarc Zyngier1-1/+49
When developing new (and therefore buggy) interrupt-related code, it can sometimes be useful to inject interrupts without having to rely on a device to actually generate them. This functionality relies either on the irqchip driver to expose an irq_set_irqchip_state(IRQCHIP_STATE_PENDING) callback, or on the core code to be able to retrigger an (edge-only) interrupt. To use this feature: echo -n trigger > /sys/kernel/debug/irq/irqs/IRQNUM WARNING: This is DANGEROUS, and strictly a debug feature. Do not use it on a production system. Your HW is likely to catch fire, your data to be corrupted, and reporting this will make you look an even bigger fool than the idiot who wrote this patch. Signed-off-by: Marc Zyngier <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-08-18ACPI / PM: Check low power idle constraints for debug onlySrinivas Pandruvada1-1/+1
For SoC to achieve its lowest power platform idle state a set of hardware preconditions must be met. These preconditions or constraints can be obtained by issuing a device specific method (_DSM) with function "1". Refer to the document provided in the link below. Here during initialization (from attach() callback of LPS0 device), invoke function 1 to get the device constraints. Each enabled constraint is stored in a table. The devices in this table are used to check whether they were in required minimum state, while entering suspend. This check is done from platform freeze wake() callback, only when /sys/power/pm_debug_messages attribute is non zero. If any constraint is not met and device is ACPI power managed then it prints the device information to kernel logs. Also if debug is enabled in acpi/sleep.c, the constraint table and state of each device on wake is dumped in kernel logs. Since pm_debug_messages_on setting is used as condition to check constraints outside kernel/power/main.c, pm_debug_messages_on is changed to a global variable. Link: http://www.uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf Signed-off-by: Srinivas Pandruvada <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-18cpufreq: schedutil: Always process remote callback with slow switchingViresh Kumar1-2/+7
The frequency update from the utilization update handlers can be divided into two parts: (A) Finding the next frequency (B) Updating the frequency While any CPU can do (A), (B) can be restricted to a group of CPUs only, depending on the current platform. For platforms where fast cpufreq switching is possible, both (A) and (B) are always done from the same CPU and that CPU should be capable of changing the frequency of the target CPU. But for platforms where fast cpufreq switching isn't possible, after doing (A) we wake up a kthread which will eventually do (B). This kthread is already bound to the right set of CPUs, i.e. only those which can change the frequency of CPUs of a cpufreq policy. And so any CPU can actually do (A) in this case, as the frequency is updated from the right set of CPUs only. Check cpufreq_can_do_remote_dvfs() only for the fast switching case. Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-18cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarilyViresh Kumar1-1/+5
Utilization update callbacks are now processed remotely, even on the CPUs that don't share cpufreq policy with the target CPU (if dvfs_possible_from_any_cpu flag is set). But in non-fast switch paths, the frequency is changed only from one of policy->related_cpus. This happens because the kthread which does the actual update is bound to a subset of CPUs (i.e. related_cpus). Allow frequency to be remotely updated as well (i.e. call __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set. Reported-by: Pavan Kondeti <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2017-08-17Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"Kees Cook1-2/+1
This reverts commit 68c4a4f8abc60c9440ede9cd123d48b78325f7a3, with various conflict clean-ups. The capability check required too much privilege compared to simple DAC controls. A system builder was forced to have crash handler processes run with CAP_SYSLOG which would give it the ability to read (and wipe) the _current_ dmesg, which is much more access than being given access only to the historical log stored in pstorefs. With the prior commit to make the root directory 0750, the files are protected by default but a system builder can now opt to give access to a specific group (via chgrp on the pstorefs root directory) without being forced to also give away CAP_SYSLOG. Suggested-by: Nick Kralevich <[email protected]> Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]>
2017-08-17alarmtimer: Fix unavailable wake-up source in sysfsGeert Uytterhoeven1-2/+9
Currently the alarmtimer registers a wake-up source unconditionally, regardless of the system having a (wake-up capable) RTC or not. Hence the alarmtimer will always show up in /sys/kernel/debug/wakeup_sources, even if it is not available, and thus cannot be a wake-up source. To fix this, postpone registration until a wake-up capable RTC device is added. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-08-17timekeeping: Use proper timekeeper for debug codeStafford Horne1-1/+1
When CONFIG_DEBUG_TIMEKEEPING is enabled the timekeeping_check_update() function will update status like last_warning and underflow_seen on the timekeeper. If there are issues found, this state is used to rate limit the warnings that get printed. This rate limiting doesn't really work if stored in real_tk, as the shadow timekeeper is overwritten onto real_tk at the end of every update_wall_time() call, resetting last_warning and other statuses. Fix rate limiting by using the shadow_timekeeper for timekeeping_check_update(). Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Fixes: commit 57d05a93ada7 ("time: Rework debugging variables so they aren't global") Signed-off-by: Stafford Horne <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-08-17bpf: don't enable preemption twice in smap_do_verdictDaniel Borkmann1-1/+2
In smap_do_verdict(), the fall-through branch leads to calling preempt_enable() twice for the SK_REDIRECT case, which creates an imbalance. Enable it again only for the remaining cases. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-17bpf: fix liveness propagation to parent in spilled stack slotsDaniel Borkmann1-1/+1
Using parent->regs[] when propagating REG_LIVE_READ for spilled regs doesn't work since parent->regs[] denote the set of normal registers but not spilled ones. Propagate to the correct regs. Fixes: dc503a8ad984 ("bpf/verifier: track liveness for pruning") Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-17Merge branches 'doc.2017.08.17a', 'fixes.2017.08.17a', 'hotplug.2017.07.25b', 'misc.2017.08.17a', 'spin_unlock_wait_no.2017.08.17a', 'srcu.2017.07.27c' and 'torture.2017.07.24c' into HEADPaul E. McKenney26-798/+573
doc.2017.08.17a: Documentation updates. fixes.2017.08.17a: RCU fixes. hotplug.2017.07.25b: CPU-hotplug updates. misc.2017.08.17a: Miscellaneous fixes outside of RCU (give or take conflicts). spin_unlock_wait_no.2017.08.17a: Remove spin_unlock_wait(). srcu.2017.07.27c: SRCU updates. torture.2017.07.24c: Torture-test updates.
2017-08-17locking: Remove spin_unlock_wait() generic definitionsPaul E. McKenney1-117/+0
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore removes spin_unlock_wait() and related definitions from core code. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]>
2017-08-17exit: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-1/+2
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in do_exit() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock is a per-task lock, and this is happening only at task-exit time. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]>
2017-08-17completion: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-7/+4
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in completion_done() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because the lock will be held only briefly, as the wakeup happens really quickly. Signed-off-by: Paul E. McKenney <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Cc: Alan Stern <[email protected]> Cc: Andrea Parri <[email protected]> Cc: Linus Torvalds <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]>
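A minimal sketch of what completion_done() looks like with the lock/unlock pair, assuming the completion layout of that era (done counter plus a wait_queue_head_t); illustrative, not the verbatim patch:

    bool completion_done(struct completion *x)
    {
            unsigned long flags;

            if (!READ_ONCE(x->done))
                    return false;

            /* The lock/unlock pair orders this check against a concurrent
             * complete()/complete_all() still holding the wait-queue lock. */
            spin_lock_irqsave(&x->wait.lock, flags);
            spin_unlock_irqrestore(&x->wait.lock, flags);
            return true;
    }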
2017-08-17membarrier: Provide expedited private commandMathieu Desnoyers5-71/+178
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built from all runqueues for which current thread's mm is the same as the thread calling sys_membarrier. It executes faster than the non-expedited variant (no blocking). It also works on NOHZ_FULL configurations. Scheduler-wise, it requires a memory barrier before and after context switching between processes (which have different mm). The memory barrier before context switch is already present. For the barrier after context switch: * Our TSO archs can do RELEASE without being a full barrier. Look at x86 spin_unlock() being a regular STORE for example. But for those archs, all atomics imply smp_mb and all of them have atomic ops in switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full barrier. * From all weakly ordered machines, only ARM64 and PPC can do RELEASE, the rest does indeed do smp_mb(), so there the spin_unlock() is a full barrier and we're good. * ARM64 has a very heavy barrier in switch_to(), which suffices. * PPC just removed its barrier from switch_to(), but appears to be talking about adding something to switch_mm(). So add a smp_mb__after_unlock_lock() for now, until this is settled on the PPC side. Changes since v3: - Properly document the memory barriers provided by each architecture. Changes since v2: - Address comments from Peter Zijlstra, - Add smp_mb__after_unlock_lock() after finish_lock_switch() in finish_task_switch() to add the memory barrier we need after storing to rq->curr. This is much simpler than the previous approach relying on atomic_dec_and_test() in mmdrop(), which actually added a memory barrier in the common case of switching between userspace processes. - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full kernel, rather than having the whole membarrier system call returning -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full. Adapt the CMD_QUERY mask accordingly. Changes since v1: - move membarrier code under kernel/sched/ because it uses the scheduler runqueue, - only add the barrier when we switch from a kernel thread. The case where we switch from a user-space thread is already handled by the atomic_dec_and_test() in mmdrop(). - add a comment to mmdrop() documenting the requirement on the implicit memory barrier. CC: Peter Zijlstra <[email protected]> CC: Paul E. McKenney <[email protected]> CC: Boqun Feng <[email protected]> CC: Andrew Hunter <[email protected]> CC: Maged Michael <[email protected]> CC: [email protected] CC: Avi Kivity <[email protected]> CC: Benjamin Herrenschmidt <[email protected]> CC: Paul Mackerras <[email protected]> CC: Michael Ellerman <[email protected]> Signed-off-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Tested-by: Dave Watson <[email protected]>
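A hedged user-space usage sketch of the new command, assuming the uapi header exports MEMBARRIER_CMD_PRIVATE_EXPEDITED and that MEMBARRIER_CMD_QUERY reports it in its bitmask:

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int membarrier(int cmd, int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

            if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
                    return 1;       /* kernel does not support the new command */

            /* Issues a memory barrier on every CPU running a thread of this
             * process, via IPI rather than a slow synchronize_sched()-style wait. */
            return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
    }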
2017-08-17rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()Paul E. McKenney1-2/+0
The rcu_idle_exit() and rcu_idle_enter() functions are exported because they were originally used by RCU_NONIDLE(), which was intended to be usable from modules. However, RCU_NONIDLE() now instead uses rcu_irq_enter_irqson() and rcu_irq_exit_irqson(), which are not exported, and there have been no complaints. This commit therefore removes the exports from rcu_idle_exit() and rcu_idle_enter(). Reported-by: Peter Zijlstra <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Add warning to rcu_idle_enter() for irqs enabledPaul E. McKenney1-1/+2
All current callers of rcu_idle_enter() have irqs disabled, and rcu_idle_enter() relies on this, but doesn't check. This commit therefore adds a RCU_LOCKDEP_WARN() to add some verification to the trust. While we are there, pass "true" rather than "1" to rcu_eqs_enter(). Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Make rcu_idle_enter() rely on callers disabling irqsPeter Zijlstra (Intel)1-4/+1
All callers to rcu_idle_enter() have irqs disabled, so there is no point in rcu_idle_enter disabling them again. This commit therefore replaces the irq disabling with a RCU_LOCKDEP_WARN(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Add assertions verifying blocked-tasks listPaul E. McKenney2-2/+11
This commit adds assertions verifying the consistency of the rcu_node structure's ->blkd_tasks list and its ->gp_tasks, ->exp_tasks, and ->boost_tasks pointers. In particular, the ->blkd_tasks lists must be empty except for leaf rcu_node structures. Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()Masami Hiramatsu1-0/+2
Set disable_rcu_irq_enter not only in rcu_eqs_enter_common() but also in rcu_eqs_exit(), since rcu_eqs_exit() suffers from the same issue as was fixed for rcu_eqs_enter_common() by commit 03ecd3f48e57 ("rcu/tracing: Add rcu_disabled to denote when rcu_irq_enter() will not work"). Signed-off-by: Masami Hiramatsu <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Add TPS() protection for _rcu_barrier_trace stringsPaul E. McKenney1-12/+15
The _rcu_barrier_trace() function is a wrapper for trace_rcu_barrier(), which needs TPS() protection for strings passed through the second argument. However, it has escaped prior TPS()-ification efforts because _rcu_barrier_trace() does not start with "trace_". This commit therefore adds the needed TPS() protection. Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]>
2017-08-17rcu: Use idle versions of swait to make idle-hack clearLuis R. Rodriguez1-6/+5
These RCU waits were set to use interruptible waits to avoid the kthreads contributing to system load average, even though they are not interruptible as they are spawned from a kthread. Use the new TASK_IDLE swaits which makes our goal clear, and removes confusion about these paths possibly being interruptible -- they are not. When the system is idle the RCU grace-period kthread will spend all its time blocked inside the swait_event_interruptible(). If the interruptible() was not used, then this kthread would contribute to the load average. This means that an idle system would have a load average of 2 (or 3 if PREEMPT=y), rather than the load average of 0 that almost fifty years of UNIX has conditioned sysadmins to expect. The same argument applies to swait_event_interruptible_timeout() use. The RCU grace-period kthread spends its time blocked inside this call while waiting for grace periods to complete. In particular, if there was only one busy CPU, but that CPU was frequently invoking call_rcu(), then the RCU grace-period kthread would spend almost all its time blocked inside the swait_event_interruptible_timeout(). This would mean that the load average would be 2 rather than the expected 1 for the single busy CPU. Acked-by: "Eric W. Biederman" <[email protected]> Tested-by: Paul E. McKenney <[email protected]> Signed-off-by: Luis R. Rodriguez <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
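A hedged before/after sketch of the change in the grace-period kthread, assuming the new idle swait variants keep the existing argument order:

    /* Before: blocks in TASK_INTERRUPTIBLE even though nothing ever
     * interrupts the grace-period kthread, inflating the load average. */
    swait_event_interruptible(rsp->gp_wq,
                              READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT);

    /* After: TASK_IDLE wait, so the blocked kthread no longer counts
     * toward the load average. */
    swait_event_idle(rsp->gp_wq,
                     READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_INIT);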
2017-08-17rcu: Add event tracing to ->gp_tasks update at GP startPaul E. McKenney1-1/+8
There is currently event tracing to track when a task is preempted within a preemptible RCU read-side critical section, and also when that task subsequently reaches its outermost rcu_read_unlock(), but none indicating when a new grace period starts when that grace period must wait on pre-existing readers that have been preempted at least once since the beginning of their current RCU read-side critical sections. This commit therefore adds an event trace at grace-period start in the case where there are such readers. Note that only the first reader in the list is traced. Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]>
2017-08-17rcu: Move rcu.h to new trivial-function stylePaul E. McKenney1-108/+20
This commit saves a few lines in kernel/rcu/rcu.h by moving to single-line definitions for trivial functions, instead of the old style where the two curly braces each get their own line. Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Add TPS() to event-traced stringsPaul E. McKenney1-6/+6
Strings used in event tracing need to be specially handled, for example, using the TPS() macro. Without the TPS() macro, although output looks fine from within a running kernel, extracting traces from a crash dump produces garbage instead of strings. This commit therefore adds the TPS() macro to some unadorned strings that were passed to event-tracing macros. Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Steven Rostedt (VMware) <[email protected]>
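A short illustration of the difference, assuming the rcu_grace_period tracepoint that takes a string as its last argument:

    /* Unadorned string: a crash-dump extractor only sees a kernel address. */
    trace_rcu_grace_period(rsp->name, rsp->gpnum, "newreq");

    /* TPS() (tracepoint_string) registers the string so it can be recovered
     * from a dump. */
    trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("newreq"));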
2017-08-17rcu: Create reasonable API for do_exit() TASKS_RCU processingPaul E. McKenney2-6/+19
Currently, the exit-time support for TASKS_RCU is open-coded in do_exit(). This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish() APIs for do_exit() use. This has the benefit of confining the use of the tasks_rcu_exit_srcu variable to one file, allowing it to become static. Signed-off-by: Paul E. McKenney <[email protected]>
2017-08-17rcu: Drive TASKS_RCU directly off of PREEMPTPaul E. McKenney2-18/+2
The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched is used instead. This commit therefore makes synchronize_rcu_tasks() and call_rcu_tasks() available always, but mapped to synchronize_sched() and call_rcu_sched(), respectively, when !PREEMPT. This approach also allows some #ifdefs to be removed from rcutorture. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Acked-by: Ingo Molnar <[email protected]>
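A hedged sketch of the header-level mapping this describes; the real guards are somewhat more involved (CONFIG_TASKS_RCU being selected when PREEMPT):

    #ifdef CONFIG_TASKS_RCU         /* selected when PREEMPT */
    void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
    void synchronize_rcu_tasks(void);
    #else  /* !PREEMPT: RCU-sched read-side sections already cover the cases */
    #define call_rcu_tasks          call_rcu_sched
    #define synchronize_rcu_tasks   synchronize_sched
    #endif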
2017-08-17locking/lockdep: Explicitly initialize wq_barrier::done::mapBoqun Feng1-1/+10
With the new lockdep crossrelease feature, which checks completions usage, a false positive is reported in the workqueue code: > Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released > Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released > Task C : acquired of lock#3 -> wait for completion of barr->done > (Task C is in lru_add_drain_all_cpuslocked()) > Worker D : wait for wfc.work to be released -> will complete barr->done Such a deadlock cannot happen because Task C's barr->done and Worker D's barr->done cannot be the same instance. The reason for this false positive is that we initialize all wq_barrier::done at insert_wq_barrier() via init_completion(), which makes them belong to the same lock class; therefore, impossible circles are reported. To fix this, explicitly initialize the lockdep map for wq_barrier::done in insert_wq_barrier(), so that the lock class key of wq_barrier::done is a subkey of the corresponding work_struct. As a result we won't build a dependency between a wq_barrier and an unrelated work, and we can distinguish wq barriers based on their related works, so the false positive above is avoided. Also define an empty lockdep_init_map_crosslock() for !CROSSRELEASE to keep the code simple and free of unnecessary #ifdefs. Reported-by: Ingo Molnar <[email protected]> Signed-off-by: Boqun Feng <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-08-17locking/refcounts, x86/asm: Implement fast refcount overflow protectionKees Cook1-0/+12
This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely. Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable: http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/ http://cyseclabs.com/page?n=02012016 This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent. On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated). On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition. As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations. The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine. 
Example assembly comparison: refcount_inc() before: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) refcount_inc() after: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3 ... .text.unlikely: ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx ffffffff816c36d7: 0f ff (bad) These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast): 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s: cycles protections atomic_t 82249267387 none refcount_t-fast 82211446892 overflow, untested dec-to-zero refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection. Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Josh Poimboeuf <[email protected]> Cc: Alexey Dobriyan <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David S. Miller <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Elena Reshetova <[email protected]> Cc: Eric Biggers <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Greg KH <[email protected]> Cc: Hans Liljestrand <[email protected]> Cc: James Bottomley <[email protected]> Cc: Jann Horn <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Manfred Spraul <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Serge E. Hallyn <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: linux-arch <[email protected]> Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast Signed-off-by: Ingo Molnar <[email protected]>
2017-08-17Merge branch 'linus' into perf/core, to pick up fixesIngo Molnar4-12/+40
Signed-off-by: Ingo Molnar <[email protected]>
2017-08-16Merge tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/auditLinus Torvalds1-6/+8
Pull audit fixes from Paul Moore: "Two small fixes to the audit code, both explained well in the respective patch descriptions, but the quick summary is one use-after-free fix, and one silly fanotify notification flag fix" * tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Receive unmount event audit: Fix use after free in audit_remove_watch_rule()
2017-08-16bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSERJohn Fastabend2-1/+5
Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER net/core/filter.c: In function ‘do_sk_redirect_map’: net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); ^ net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default] sk = __sock_map_lookup_elem(ri->map, ri->ifindex); Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: Eric Dumazet <[email protected]> Signed-off-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-16bpf: sockmap state change warning fixJohn Fastabend1-0/+3
psock will be uninitialized in the default case; we need to do the same psock lookup and check as in the other branch. This fixes the compile warning below. kernel/bpf/sockmap.c: In function ‘smap_state_change’: kernel/bpf/sockmap.c:156:21: warning: ‘psock’ may be used uninitialized in this function [-Wmaybe-uninitialized] struct smap_psock *psock; Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Reported-by: David Miller <[email protected]> Signed-off-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-16bpf: devmap: remove unnecessary value size checkJohn Fastabend1-6/+0
In the devmap alloc map logic we check to ensure that the size of the values is not greater than KMALLOC_MAX_SIZE. But in the dev map case we ensure the value size is 4 bytes earlier in the function, because all values should be netdev ifindex values. The second check is harmless but not needed, so remove it. Signed-off-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-16bpf: add access to sock fields and pkt data from sk_skb programsJohn Fastabend1-0/+1
Signed-off-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-16bpf: sockmap with sk redirect supportJohn Fastabend4-2/+857
Recently we added a new map type called dev map used to forward XDP packets between ports (6093ec2dc313). This patch introduces a similar notion for sockets. A sockmap allows users to add participating sockets to a map. When sockets are added to the map enough context is stored with the map entry to use the entry with a new helper bpf_sk_redirect_map(map, key, flags). This helper (analogous to bpf_redirect_map in XDP) is given the map and an entry in the map. When called from a sockmap program, discussed below, the skb will be sent on the socket using skb_send_sock(). With the above we need a bpf program to call the helper from that will then implement the send logic. The initial site implemented in this series is the recv_sock hook. For this to work we implemented a map attach command to add attributes to a map. In sockmap we add two programs: a parse program and a verdict program. The parse program uses strparser to build messages and pass them to the verdict program. The parse programs use the normal strparser semantics. The verdict program is of type SK_SKB. The verdict program returns a verdict SK_DROP, or SK_REDIRECT for now. Additional actions may be added later. When SK_REDIRECT is returned, as expected when the bpf program uses bpf_sk_redirect_map(), the sockmap logic will consult per cpu variables set by the helper routine and pull the sock entry out of the sock map. This pattern follows the existing redirect logic in cls and xdp programs. This gives the flow, recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock \ -> kfree_skb As an example use case a message based load balancer may use specific logic in the verdict program to select the sock to send on. Sample programs are provided in future patches that hopefully illustrate the user interfaces. Also selftests are in follow-on patches. Signed-off-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
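A hedged sketch of a minimal SK_SKB verdict program using the new helper as described above; the map-definition macros, section names and helper declaration follow the usual samples/bpf conventions and are assumptions here:

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") sock_map = {
            .type        = BPF_MAP_TYPE_SOCKMAP,
            .key_size    = sizeof(int),
            .value_size  = sizeof(int),
            .max_entries = 20,
    };

    SEC("sk_skb_verdict")
    int prog_verdict(struct __sk_buff *skb)
    {
            /* Redirect every parsed message to the socket stored at key 0;
             * a real load balancer would compute the key from the skb. */
            return bpf_sk_redirect_map(&sock_map, 0, 0);
    }

    char _license[] SEC("license") = "GPL";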
2017-08-16bpf: export bpf_prog_inc_not_zeroJohn Fastabend1-1/+2
bpf_prog_inc_not_zero() will be used by upcoming sockmap patches; this patch simply exports it so we can pull it in. Signed-off-by: John Fastabend <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-08-16sched/completion: Document that reinit_completion() must be called after complete_all()Steven Rostedt1-0/+8
The complete_all() function modifies the completion's "done" variable to UINT_MAX, and no other caller (wait_for_completion(), etc) will modify it back to zero. That means that any call to complete_all() must have a reinit_completion() before that completion can be used again. Document this fact by the complete_all() function. Also document that completion_done() will always return true if complete_all() is called. Signed-off-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Linus Torvalds <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
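A short usage sketch of the documented rule (illustrative only):

    static void demo(void)
    {
            DECLARE_COMPLETION_ONSTACK(done);

            complete_all(&done);            /* done is now "forever complete" */
            wait_for_completion(&done);     /* returns immediately */

            reinit_completion(&done);       /* required before the completion is reused */
            /* wait_for_completion(&done) would now block again until complete() */
    }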
2017-08-16Merge branch 'irq/for-gpio' into irq/coreThomas Gleixner3-0/+170
Merge the irq simulator which is in a separate branch so it can be consumed by the gpio folks.
2017-08-16genirq/irq_sim: Add a devres variant of irq_sim_init()Bartosz Golaszewski1-0/+43
Add a resource managed version of irq_sim_init(). This can be conveniently used in device drivers. Signed-off-by: Bartosz Golaszewski <[email protected]> Acked-by: Jonathan Cameron <[email protected]> Cc: Lars-Peter Clausen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Bamvor Jian Zhang <[email protected]> Cc: Jonathan Cameron <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-08-16genirq/irq_sim: Add a simple interrupt simulator frameworkBartosz Golaszewski3-0/+127
Implement a simple, irq_work-based framework for simulating interrupts. Currently the API exposes routines for initializing and deinitializing the simulator object, enqueueing the interrupts and retrieving the allocated interrupt numbers based on the offset of the dummy interrupt in the simulator struct. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Jonathan Cameron <[email protected]> Cc: Lars-Peter Clausen <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Linus Walleij <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Bamvor Jian Zhang <[email protected]> Cc: Jonathan Cameron <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-6/+32