Age | Commit message | Author | Files | Lines
2017-06-08 | sched/deadline: Improve the tracking of active utilization | Luca Abeni | 4 | -15/+276
This patch implements a more theoretically sound algorithm for tracking active utilization: instead of decreasing it when a task blocks, use a timer (the "inactive timer", named after the "Inactive" task state of the GRUB algorithm) to decrease the active utilization at the so called "0-lag time". Tested-by: Claudio Scordino <[email protected]> Tested-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Luca Abeni <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Juri Lelli <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mathieu Poirier <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tommaso Cucinotta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
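For context, the "0-lag time" the inactive timer is armed for can be sketched like this (illustrative helper; the field names mirror sched_dl_entity, but this is not the exact mainline code):

    /*
     * Sketch: the 0-lag time is the instant at which the remaining
     * runtime, consumed at the reserved bandwidth dl_runtime/dl_period,
     * would reach exactly zero.
     */
    static inline s64 dl_zerolag_time(const struct sched_dl_entity *dl_se)
    {
            return dl_se->deadline -
                   div64_long(dl_se->runtime * dl_se->dl_period,
                              dl_se->dl_runtime);
    }

If this time has already passed when the task blocks, the active utilization can be decreased immediately; otherwise the inactive timer (an hrtimer) is armed to fire at that instant and do the decrement there.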
2017-06-08 | sched/deadline: Track the active utilization | Luca Abeni | 2 | -4/+67
Active utilization is defined as the total utilization of active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased when a task wakes up and is decreased when a task blocks. When a task is migrated from CPUi to CPUj, immediately subtract the task's utilization from CPUi and add it to CPUj. This mechanism is implemented by modifying the pull and push functions. Note: this is not fully correct from the theoretical point of view (the utilization should be removed from CPUi only at the 0 lag time), a more theoretically sound solution is presented in the next patches. Tested-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Luca Abeni <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juri Lelli <[email protected]> Cc: Claudio Scordino <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mathieu Poirier <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tommaso Cucinotta <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
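A minimal sketch of the bookkeeping described above, assuming a running_bw accumulator in struct dl_rq (names follow the changelog; the real helpers also guard against overflow/underflow):

    static inline void add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
    {
            dl_rq->running_bw += dl_bw;     /* task wakes up or migrates in */
    }

    static inline void sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
    {
            dl_rq->running_bw -= dl_bw;     /* task blocks or migrates out */
    }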
2017-06-08 | sched/core: Implement new approach to scale select_idle_cpu() | Peter Zijlstra | 2 | -5/+17
Hackbench recently suffered a bunch of pain, first by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

and then by commit:

  c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")

which fixed a bug in the initial for_each_cpu_wrap() implementation that made select_idle_cpu() even more expensive. The bug was that it would skip over CPUs when bits were consecutive in the bitmask. This however gave me an idea to fix select_idle_cpu(); where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. Tested-by: Matt Fleming <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Chris Mason <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: kitsunyan <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
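The gradual approach can be sketched as below (illustrative; excerpted from the middle of select_idle_cpu(), so it assumes the surrounding context: sd, this_sd, target, and the SIS_PROP feature flag the patch introduces):

    /*
     * Bound how many CPUs the scan may inspect by the ratio of the
     * expected idle time to the average cost of scanning one CPU,
     * instead of skipping the scan entirely.
     */
    u64 avg_cost = this_sd->avg_scan_cost;
    u64 avg_idle = this_rq()->avg_idle;
    int cpu, nr = INT_MAX;

    if (sched_feat(SIS_PROP)) {
            u64 span_avg = sd->span_weight * avg_idle;

            if (span_avg > 4 * avg_cost)
                    nr = div_u64(span_avg, avg_cost);
            else
                    nr = 4;
    }

    for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
            if (!--nr)
                    return -1;      /* scan budget exhausted, give up */
            if (idle_cpu(cpu))
                    break;          /* found one */
    }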
2017-06-05 | sched/header: Remove leftover, obsolete comment | Perr Zhang | 1 | -2/+0
There is no more set_task_vxid() helper, remove its description. Signed-off-by: Perr Zhang <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-24 | sched/clock: Fix early boot preempt assumption in __set_sched_clock_stable() | Peter Zijlstra | 1 | -1/+8
The more strict early boot preemption warnings found that __set_sched_clock_stable() was incorrectly assuming we'd still be running on a single CPU:

  BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
  caller is debug_smp_processor_id+0x1c/0x1e
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5ea #1
  Call Trace:
   dump_stack+0x110/0x192
   check_preemption_disabled+0x10c/0x128
   ? set_debug_rodata+0x25/0x25
   debug_smp_processor_id+0x1c/0x1e
   sched_clock_init_late+0x27/0x87
  [...]

Fix it by disabling IRQs. Reported-by: kernel test robot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
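The shape of the fix, sketched (this_scd() returns this CPU's sched_clock_data; the body of the function is elided and illustrative):

    static void __set_sched_clock_stable(void)
    {
            struct sched_clock_data *scd;

            /*
             * The early-boot caller can be preemptible; disabling IRQs
             * pins us to a CPU so smp_processor_id()/this_scd() are safe.
             */
            local_irq_disable();
            scd = this_scd();
            /* ... snapshot and publish the stable clock state ... */
            local_irq_enable();
    }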
2017-05-23 | x86/tsc: Fold set_cyc2ns_scale() into caller | Arnd Bergmann | 1 | -10/+4
The newly introduced wrapper function only has one caller, and this one is conditional, causing a harmless warning when CONFIG_CPU_FREQ is disabled:

  arch/x86/kernel/tsc.c:189:13: error: 'set_cyc2ns_scale' defined but not used [-Werror=unused-function]

My first idea was to move the wrapper inside of that #ifdef, but on second thought it seemed nicer to remove it completely again and rename __set_cyc2ns_scale back to set_cyc2ns_scale, while leaving the extra argument. Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 615cd03373a0 ("x86/tsc: Fix sched_clock() sync") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | sched/core: Enable might_sleep() and smp_processor_id() checks early | Thomas Gleixner | 3 | -2/+14
might_sleep() and smp_processor_id() checks are enabled after the boot process is done. That hides bugs in the SMP bringup and driver initialization code. Enable it right when the scheduler starts working, i.e. when init task and kthreadd have been created and right before the idle task enables preemption. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | init: Introduce SYSTEM_SCHEDULING state | Thomas Gleixner | 2 | -1/+6
might_sleep() debugging and smp_processor_id() debugging should be active right after the scheduler starts working. The init task can invoke smp_processor_id() from preemptible context as it is pinned on the boot CPU until sched_init_smp() removes the pinning and lets it schedule on all non-isolated CPUs. Add a new state which allows those checks to be enabled earlier, and add it to the xen do_poweroff() function. No functional change. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Boris Ostrovsky <[email protected]> Acked-by: Mark Rutland <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
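After this change the system_state progression looks like this (the new state sits between booting and fully running):

    enum system_states {
            SYSTEM_BOOTING,
            SYSTEM_SCHEDULING,      /* new: scheduler works, SMP bringup
                                     * and init are still in progress */
            SYSTEM_RUNNING,
            SYSTEM_HALT,
            SYSTEM_POWER_OFF,
            SYSTEM_RESTART,
    };

This is why the "Adjust system_state check(s)" entries below convert strict equality tests into range checks: code that used to key off SYSTEM_BOOTING alone must now also account for SYSTEM_SCHEDULING.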
2017-05-23 | mm/vmscan: Adjust system_state checks | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in kswapd_run() to handle the extra states. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
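This entry and the other "Adjust system_state check" entries below all follow the same pattern, roughly (illustrative diff; the exact predicate varies per call site):

    -       if (system_state == SYSTEM_BOOTING)
    +       if (system_state < SYSTEM_RUNNING)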
2017-05-23 | printk: Adjust system_state checks | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in boot_delay_msec() to handle the extra states. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | extable: Adjust system_state checks | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in core_kernel_text() to handle the extra states, i.e. to cover init text up to the point where the system switches to state RUNNING. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | async: Adjust system_state checks | Thomas Gleixner | 1 | -4/+4
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in async_run_entry_fn() and async_synchronize_cookie_domain() to handle the extra states. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Arjan van de Ven <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | iommu/of: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in of_iommu_driver_present() to handle the extra states. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Joerg Roedel <[email protected]> Acked-by: Robin Murphy <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | iommu/vt-d: Adjust system_state checks | Thomas Gleixner | 1 | -2/+2
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state checks in dmar_parse_one_atsr() and dmar_iommu_notify_scope_dev() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Joerg Roedel <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | cpufreq/pasemi: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in pas_cpufreq_cpu_exit() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Viresh Kumar <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | mm: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. get_nid_for_pfn() checks for system_state == BOOTING to decide whether to use early_pfn_to_nid() when CONFIG_DEFERRED_STRUCT_PAGE_INIT=y. That check is dubious, because the switch to state RUNNING happens way after page_alloc_init_late() has been invoked. Change the check to less than RUNNING state so it covers the new intermediate states as well. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | ACPI: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Make the decision whether a PCI root is hotplugged depend on SYSTEM_RUNNING instead of !SYSTEM_BOOTING. It makes no sense to cover states greater than SYSTEM_RUNNING as there are no hotplug events on reboot and poweroff. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Len Brown <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | powerpc: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in smp_generic_cpu_bootable() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Michael Ellerman <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | metag: Adjust system_state check | Thomas Gleixner | 1 | -2/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in stop_this_cpu() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: James Hogan <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | x86/smp: Adjust system_state check | Thomas Gleixner | 1 | -1/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in announce_cpu() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | arm64: Adjust system_state check | Thomas Gleixner | 1 | -2/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in smp_send_stop() to handle the extra states. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Mark Rutland <[email protected]> Acked-by: Catalin Marinas <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Will Deacon <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | arm: Adjust system_state check | Thomas Gleixner | 1 | -2/+1
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in ipi_cpu_stop() to handle the extra states. Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Russell King <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | init: Pin init task to the boot CPU, initially | Thomas Gleixner | 1 | -5/+12
Some of the boot code in kernel_init_freeable() which runs before SMP bringup assumes (rightfully) that it runs on the boot CPU and therefore can use smp_processor_id() in preemptible context. That works so far because the smp_processor_id() check starts to be effective after SMP bringup. That's just wrong. Starting with SMP bringup and the ability to move threads around, smp_processor_id() in preemptible context is broken. Aside from that, it does not make sense to allow init to run on all CPUs before sched_init_smp() has been run. Pin the init task to the boot CPU so the existing code can continue to use smp_processor_id() without triggering the checks when the enabling of those checks starts earlier. Tested-by: Mark Rutland <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
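A sketch of the pinning, assuming it sits at the top of kernel_init_freeable() and is undone once SMP scheduling works:

    /*
     * Pin PID 1 to the boot CPU so smp_processor_id() from preemptible
     * context stays correct until sched_init_smp() has run.
     */
    set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));

    /* ... SMP bringup and driver initialization run here ... */

    /* Once migration is fully set up, let init run anywhere again: */
    set_cpus_allowed_ptr(current, cpu_possible_mask);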
2017-05-23 | sched/numa: Use down_read_trylock() for the mmap_sem | Vlastimil Babka | 1 | -1/+2
A customer has reported a soft-lockup when running an intensive memory stress test, where the trace on multiple CPUs looks like this:

  RIP: 0010:[<ffffffff810c53fe>]  [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190
  ...
  Call Trace:
   [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa
   [<ffffffff811bc331>] change_protection_range+0x3b1/0x930
   [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
   [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
   [<ffffffff81098322>] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock(). The task_numa_work() function makes sure that only one thread at a time is allowed to perform the work in a single scan period (via cmpxchg), but if there's a thread with mmap_sem locked for writing for several periods, multiple threads in task_numa_work() can build up a convoy waiting for mmap_sem for read and then all get unblocked at once. This patch changes the down_read() to the trylock version, which prevents the build up. For a workload experiencing mmap_sem contention, it's probably better to postpone the NUMA balancing work anyway. This seems to have fixed the soft lockups involving pmd_lock(), which is in line with the convoy theory. Signed-off-by: Vlastimil Babka <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Rik van Riel <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
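The change itself is small; a sketch of task_numa_work() after the patch:

    /*
     * Don't pile up behind a writer: if mmap_sem is write-locked, skip
     * this NUMA scan period instead of joining the convoy.
     */
    if (!down_read_trylock(&mm->mmap_sem))
            return;
    vma = find_vma(mm, start);
    /* ... walk VMAs and call change_prot_numa() as before ... */
    up_read(&mm->mmap_sem);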
2017-05-23 | sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer() | Dave Kleikamp | 1 | -0/+11
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially takes each CPU's rq->lock. On a large, busy system, the cumulative time it takes to acquire each lock can be excessive, even triggering a watchdog timeout. If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does nothing while holding the lock, so don't bother taking it at all. Signed-off-by: Dave Kleikamp <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
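Sketched below (illustrative; assumes do_sched_rt_period_timer()'s locals rt_b and span, and the real patch samples the two fields under rt_rq->rt_runtime_lock before deciding to skip):

    for_each_cpu(i, span) {
            struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
            struct rq *rq = rq_of_rt_rq(rt_rq);

            /*
             * Nothing accrued and nothing runnable: taking rq->lock
             * would be pure overhead, so skip this CPU entirely.
             */
            if (!rt_rq->rt_time && !rt_rq->rt_nr_running)
                    continue;

            raw_spin_lock(&rq->lock);
            /* ... existing throttle/unthrottle logic ... */
            raw_spin_unlock(&rq->lock);
    }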
2017-05-23 | sched/core: Allow __sched_setscheduler() in interrupts when PI is not used | Steven Rostedt (VMware) | 1 | -2/+2
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI is not a common occurrence, lockdep will likely never trigger if sched_setscheduler() was called from interrupt context. A BUG_ON() was added to trigger if __sched_setscheduler() was ever called from interrupt context because there was a possibility to take the wait_lock. Today the wait_lock is IRQ safe, but the path to taking it in sched_setscheduler() is the same as the path to taking it from normal context. The wait_lock is taken with raw_spin_lock_irq() and released with raw_spin_unlock_irq(), which will indiscriminately enable interrupts, which would be bad in interrupt context. The problem is that normalize_rt_tasks(), which is called by triggering the sysrq nice-all-RT-tasks, was changed to call __sched_setscheduler(), and this is done from interrupt context! Now __sched_setscheduler() takes a "pi" parameter that is used to know if the priority inheritance should be called or not. As the BUG_ON() only cares about calling the PI code, it should only bug if called from interrupt context with the "pi" parameter set to true. Reported-by: Laurent Dufour <[email protected]> Tested-by: Laurent Dufour <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: dbc7f069b93a ("sched: Use replace normalize_task() with __sched_setscheduler()") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
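The fix reduces to making the assertion conditional on the PI path (sketch):

    /*
     * Only the PI path takes the rt-mutex wait_lock, so only that path
     * must never be entered from interrupt context.
     */
    BUG_ON(pi && in_interrupt());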
2017-05-23 | sched/deadline: Remove unnecessary condition in push_dl_task() | Byungchul Park | 1 | -1/+1
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Juri Lelli <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | sched/rt: Remove unnecessary condition in push_rt_task() | Byungchul Park | 1 | -1/+1
pick_next_pushable_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (VMware) <[email protected]> Reviewed-by: Juri Lelli <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
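Both of the two entries above make the same simplification; for the push_rt_task() case it is roughly (illustrative diff):

    -       if (task_cpu(next_task) == rq->cpu && task == next_task) {
    +       if (task == next_task) {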
2017-05-23 | sched/core: Use the new llist_for_each_entry_safe() primitive | Byungchul Park | 1 | -12/+3
Now that we've added llist_for_each_entry_safe(), use it to simplify an open coded version of it in sched_ttwu_pending(). Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-05-23 | llist: Provide a safe version for llist_for_each() | Byungchul Park | 1 | -0/+19
Sometimes we have to dereference the next field of an llist node before entering the loop, because the node might be deleted or the next field might be modified within the loop. So this adds the safe version of llist_for_each(), that is, llist_for_each_safe(). Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Huang, Ying <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
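A sketch of the safe variant (the mainline macro is equivalent): 'n' caches pos->next before the loop body runs, so the body may delete or re-link the current node:

    #define llist_for_each_safe(pos, n, node)               \
            for ((pos) = (node);                            \
                 (pos) && ((n) = (pos)->next, 1);           \
                 (pos) = (n))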
2017-05-23 | smp, cpumask: Use non-atomic cpumask_{set,clear}_cpu() | Peter Zijlstra | 2 | -2/+13
The cpumasks in smp_call_function_many() are private and not subject to concurrency, atomic bitops are pointless and expensive. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
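The non-atomic variants are plain bitops without the LOCK prefix; a sketch of the setter (the clear side is symmetric), safe here exactly because the cpumask is private to the caller:

    static inline void __cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
    {
            __set_bit(cpumask_check(cpu), cpumask_bits(dstp));
    }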
2017-05-23 | smp: Avoid sending needless IPI in smp_call_function_many() | Aaron Lu | 1 | -2/+12
An Inter-Processor Interrupt (IPI) is needed when a page is unmapped and the process' mm_cpumask() shows the process has ever run on other CPUs. Page migration and page reclaim both need IPIs. The number of IPIs needed to send to different CPUs is especially large for multi-threaded workloads since mm_cpumask() is per process. For smp_call_function_many(), whenever a CPU queues a CSD to a target CPU, it will send an IPI to let the target CPU handle the work. This isn't necessary - we only need to send an IPI when queueing a CSD to an empty call_single_queue. The reason: flush_smp_call_function_queue(), which is called upon a CPU receiving an IPI, will empty the queue and then handle all of the CSDs there. So if the target CPU's call_single_queue is not empty, we know that:
 i. an IPI for the target CPU has already been sent by 'previous queuers';
 ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet.
Thus, it's safe for us to just queue our CSD there without sending an additional IPI. And for the 'previous queuers', we can limit it to the first queuer. To demonstrate the effect of this patch, a multi-threaded workload that spawns 80 threads to equally consume 100G of memory is used. This is tested on a 2-node Broadwell-EP with 44 cores/88 threads and 32G of memory, so after the 32G of memory is used up, page reclaiming starts to happen a lot. With this patch, the IPI count dropped 88% and throughput increased about 15% for the above workload. Signed-off-by: Aaron Lu <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Huang Ying <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tim Chen <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
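The key observation maps onto llist_add()'s return value, which is true only when the new node landed on a previously empty list (sketch; cpumask_ipi as the per-caller mask of CPUs that still need a kick is an assumption based on the changelog):

    /*
     * Queue our CSD; mark the CPU for an IPI only if its queue was
     * empty, i.e. no previous queuer has already sent one.
     */
    if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
            __cpumask_set_cpu(cpu, cfd->cpumask_ipi);
    /* ... afterwards, send a single batched IPI to cpumask_ipi ... */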
2017-05-23 | Merge branch 'linus' into sched/core, to pick up fixes | Ingo Molnar | 449 | -1975/+4611
Signed-off-by: Ingo Molnar <[email protected]>
2017-05-22 | Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 | Linus Torvalds | 1 | -1/+39
Pull crypto fix from Herbert Xu: "This fixes a regression in the skcipher interface that allows bogus key parameters to hit underlying implementations which can cause crashes"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: skcipher - Add missing API setkey checks
2017-05-22 | Merge tag 'pstore-v4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux | Linus Torvalds | 1 | -6/+11
Pull pstore fix from Kees Cook: "Marta noticed another misbehavior in EFI pstore, which this fixes. Hopefully this is the last of the v4.12 fixes for pstore!"

* tag 'pstore-v4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  efi-pstore: Fix write/erase id tracking
2017-05-22 | Merge tag 'acpi-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 3 | -4/+25
Pull ACPI fixes from Rafael Wysocki: "These revert a 4.11 change that turned out to be problematic and add a .gitignore file. Specifics:
 - Revert a 4.11 commit related to the ACPI-based handling of laptop lids that made changes incompatible with existing user space stacks and broke things there (Lv Zheng).
 - Add .gitignore to the ACPI tools directory (Prarit Bhargava)"

* tag 'acpi-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  Revert "ACPI / button: Remove lid_init_state=method mode"
  tools/power/acpi: Add .gitignore file
2017-05-22 | Merge tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 11 | -303/+787
Pull power management fixes from Rafael Wysocki: "These fix RTC wakeup from suspend-to-idle broken recently, fix CPU idleness detection condition in the schedutil cpufreq governor, fix a cpufreq driver build failure, fix an error code path in the power capping framework, clean up the hibernate core and update the intel_pstate documentation. Specifics:
 - Fix RTC wakeup from suspend-to-idle broken by the recent rework of ACPI wakeup handling (Rafael Wysocki).
 - Update intel_pstate driver documentation to reflect the current code and explain how it works in more detail (Rafael Wysocki).
 - Fix an issue related to CPU idleness detection on systems with shared cpufreq policies in the schedutil governor (Juri Lelli).
 - Fix a possible build issue in the dbx500 cpufreq driver (Arnd Bergmann).
 - Fix a function in the power capping framework core to return an error code instead of 0 when there's an error (Dan Carpenter).
 - Clean up variable definition in the hibernation core (Pushkar Jambhlekar)"

* tag 'pm-4.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: dbx500: add a Kconfig symbol
  PM / hibernate: Declare variables as static
  PowerCap: Fix an error code in powercap_register_zone()
  RTC: rtc-cmos: Fix wakeup from suspend-to-idle
  PM / wakeup: Fix up wakeup_source_report_event()
  cpufreq: intel_pstate: Document the current behavior and user interface
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-22 | i2c: designware: Fix bogus sda_hold_time due to uninitialized vars | Jan Kiszka | 1 | -1/+1
We need to initialize those variables to 0 for platforms that do not provide ACPI parameters. Otherwise, we set sda_hold_time to random values, breaking e.g. Galileo and IOT2000 boards. Reported-and-tested-by: Linus Torvalds <[email protected]> Reported-by: Tobias Klausmann <[email protected]> Fixes: 9d6408433019 ("i2c: designware: don't infer timings described by ACPI from clock rate") Signed-off-by: Jan Kiszka <[email protected]> Reviewed-by: Ard Biesheuvel <[email protected]> Acked-by: Jarkko Nikula <[email protected]> Signed-off-by: Wolfram Sang <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2017-05-22 | efi-pstore: Fix write/erase id tracking | Kees Cook | 1 | -6/+11
Prior to the pstore interface refactoring, the "id" generated during a backend pstore_write() was only retained by the internal pstore inode tracking list. Additionally the "part" was ignored, so EFI would encode this in the id. This corrects the misunderstandings and correctly sets "id" during pstore_write(), and uses "part" directly during pstore_erase(). Reported-by: Marta Lofstedt <[email protected]> Fixes: 76cc9580e3fb ("pstore: Replace arguments for write() API") Fixes: a61072aae693 ("pstore: Replace arguments for erase() API") Signed-off-by: Kees Cook <[email protected]> Tested-by: Marta Lofstedt <[email protected]>
2017-05-22 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | Linus Torvalds | 33 | -133/+335
Pull networking fixes from David Miller: "Mostly netfilter bug fixes in here, but we have some bits elsewhere as well.

 1) Don't do SNAT replies for non-NATed connections in IPVS, from Julian Anastasov.
 2) Don't delete conntrack helpers while they are still in use, from Liping Zhang.
 3) Fix zero padding in xtables's xt_data_to_user(), from Willem de Bruijn.
 4) Add proper RCU protection to nf_tables_dump_set() because we cannot guarantee that we hold the NFNL_SUBSYS_NFTABLES lock. From Liping Zhang.
 5) Initialize rcv_mss in tcp_disconnect(), from Wei Wang.
 6) smsc95xx devices can't handle IPV6 checksums fully, so don't advertise support for offloading them. From Nisar Sayed.
 7) Fix out-of-bounds access in __ip6_append_data(), from Eric Dumazet.
 8) Make atl2_probe() propagate the error code properly on failures, from Alexey Khoroshilov.
 9) arp_target[] in bond_check_params() is used uninitialized. This got changed from a global static to a local variable, which is how this mistake happened. Fix from Jarod Wilson.
10) Fix fallout from unnecessary NULL check removal in cls_matchall, from Jiri Pirko. This is definitely brown paper bag territory..."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  net: sched: cls_matchall: fix null pointer dereference
  vsock: use new wait API for vsock_stream_sendmsg()
  bonding: fix randomly populated arp target array
  net: Make IP alignment calculations clearer.
  bonding: fix accounting of active ports in 3ad
  net: atheros: atl2: don't return zero on failure path in atl2_probe()
  ipv6: fix out of bound writes in __ip6_append_data()
  bridge: start hello_timer when enabling KERNEL_STP in br_stp_start
  smsc95xx: Support only IPv4 TCP/UDP csum offload
  arp: always override existing neigh entries with gratuitous ARP
  arp: postpone addr_type calculation to as late as possible
  arp: decompose is_garp logic into a separate function
  arp: fixed error in a comment
  tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
  netfilter: xtables: fix build failure from COMPAT_XT_ALIGN outside CONFIG_COMPAT
  ebtables: arpreply: Add the standard target sanity check
  netfilter: nf_tables: revisit chain/object refcounting from elements
  netfilter: nf_tables: missing sanitization in data from userspace
  netfilter: nf_tables: can't assume lock is acquired when dumping set elems
  netfilter: synproxy: fix conntrackd interaction
  ...
2017-05-22 | net: sched: cls_matchall: fix null pointer dereference | Jiri Pirko | 1 | -1/+0
Since the head is guaranteed by the check above to be null, the call_rcu would explode. Remove the previously logically dead code that was made logically very much alive and kicking. Fixes: 985538eee06f ("net/sched: remove redundant null check on head") Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | vsock: use new wait API for vsock_stream_sendmsg() | WANG Cong | 1 | -13/+8
As reported by Michal, vsock_stream_sendmsg() could still sleep at vsock_stream_has_space() after prepare_to_wait():

  vsock_stream_has_space
    vmci_transport_stream_has_space
      vmci_qpair_produce_free_space
        qp_lock
          qp_acquire_queue_mutex
            mutex_lock

Just switch to the new wait API like we did for commit d9dc8b0f8b4e ("net: fix sleeping for sk_wait_event()"). Reported-by: Michal Kubecek <[email protected]> Cc: Stefan Hajnoczi <[email protected]> Cc: Jorgen Hansen <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Claudio Imbrenda <[email protected]> Signed-off-by: Cong Wang <[email protected]> Reviewed-by: Stefan Hajnoczi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | bonding: fix randomly populated arp target array | Jarod Wilson | 1 | -3/+2
In commit dc9c4d0fe023, the arp_target array moved from a static global to a local variable. By the nature of static globals, the array used to be initialized to all 0. At present, it's full of random data, which gets interpreted as arp_target values when none have actually been specified. Systems end up booting with spew along these lines:

  [   32.161783] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
  [   32.168475] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
  [   32.175089] 8021q: adding VLAN 0 to HW filter on device lacp0
  [   32.193091] IPv6: ADDRCONF(NETDEV_UP): lacp0: link is not ready
  [   32.204892] lacp0: Setting MII monitoring interval to 100
  [   32.211071] lacp0: Removing ARP target 216.124.228.17
  [   32.216824] lacp0: Removing ARP target 218.160.255.255
  [   32.222646] lacp0: Removing ARP target 185.170.136.184
  [   32.228496] lacp0: invalid ARP target 255.255.255.255 specified for removal
  [   32.236294] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
  [   32.243987] lacp0: Removing ARP target 56.125.228.17
  [   32.249625] lacp0: Removing ARP target 218.160.255.255
  [   32.255432] lacp0: Removing ARP target 15.157.233.184
  [   32.261165] lacp0: invalid ARP target 255.255.255.255 specified for removal
  [   32.268939] lacp0: option arp_ip_target: invalid value (-255.255.255.255)
  [   32.276632] lacp0: Removing ARP target 16.0.0.0
  [   32.281755] lacp0: Removing ARP target 218.160.255.255
  [   32.287567] lacp0: Removing ARP target 72.125.228.17
  [   32.293165] lacp0: Removing ARP target 218.160.255.255
  [   32.298970] lacp0: Removing ARP target 8.125.228.17
  [   32.304458] lacp0: Removing ARP target 218.160.255.255

None of these were actually specified as ARP targets, and the driver does seem to clean up the mess okay, but it's rather noisy and confusing, leaks values to userspace, and the 255.255.255.255 spew shows up even when debug prints are disabled. The fix: just zero out arp_target at init time. While we're in here, init arp_all_targets_value in the right place. Fixes: dc9c4d0fe023 ("bonding: reduce scope of some global variables") CC: Mahesh Bandewar <[email protected]> CC: Jay Vosburgh <[email protected]> CC: Veaceslav Falico <[email protected]> CC: Andy Gospodarek <[email protected]> CC: [email protected] CC: [email protected] Signed-off-by: Jarod Wilson <[email protected]> Acked-by: Andy Gospodarek <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | Merge branches 'pm-sleep' and 'powercap' | Rafael J. Wysocki | 4 | -8/+8
* pm-sleep:
  PM / hibernate: Declare variables as static
  RTC: rtc-cmos: Fix wakeup from suspend-to-idle
  PM / wakeup: Fix up wakeup_source_report_event()

* powercap:
  PowerCap: Fix an error code in powercap_register_zone()
2017-05-22 | Merge branches 'acpi-button' and 'acpi-tools' | Rafael J. Wysocki | 3 | -4/+25
* acpi-button:
  Revert "ACPI / button: Remove lid_init_state=method mode"

* acpi-tools:
  tools/power/acpi: Add .gitignore file
2017-05-22 | Merge branches 'intel_pstate', 'pm-cpufreq' and 'pm-cpufreq-sched' | Rafael J. Wysocki | 7 | -295/+779
* intel_pstate:
  cpufreq: intel_pstate: Document the current behavior and user interface

* pm-cpufreq:
  cpufreq: dbx500: add a Kconfig symbol

* pm-cpufreq-sched:
  cpufreq: schedutil: use now as reference when aggregating shared policy requests
2017-05-22 | net: Make IP alignment calculations clearer. | David S. Miller | 1 | -4/+8
The assignment:

  ip_align = strict ? 2 : NET_IP_ALIGN;

in compare_pkt_ptr_alignment() trips up Coverity because we can only get to this code when strict is true, therefore ip_align will always be 2 regardless of NET_IP_ALIGN's value. So just assign directly to '2' and explain the situation in the comment above. Reported-by: "Gustavo A. R. Silva" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | bonding: fix accounting of active ports in 3ad | Jarod Wilson | 1 | -1/+1
As of 7bb11dc9f59d and 0622cab0341c, bond slaves in a 3ad bond are not removed from the aggregator when they are down, and the active slave count is NOT equal to number of ports in the aggregator, but rather the number of ports in the aggregator that are still enabled. The sysfs spew for bonding_show_ad_num_ports() has a comment that says "Show number of active 802.3ad ports.", but it's currently showing total number of ports, both active and inactive. Remedy it by using the same logic introduced in 0622cab0341c in __bond_3ad_get_active_agg_info(), so sysfs, procfs and netlink all report the number of active ports. Note that this means that IFLA_BOND_AD_INFO_NUM_PORTS really means NUM_ACTIVE_PORTS instead of NUM_PORTS, and thus perhaps should be renamed for clarity. Lightly tested on a dual i40e lacp bond, simulating link downs with an "ip link set dev <slave2> down"; was able to produce the state where I could see both in the same aggregator, but a number of ports count of 1.

  MII Status: up
  Active Aggregator Info:
          Aggregator ID: 1
          Number of ports: 2   <---
  Slave Interface: ens10
  MII Status: up               <---
  Aggregator ID: 1
  Slave Interface: ens11
  MII Status: up
  Aggregator ID: 1

  MII Status: up
  Active Aggregator Info:
          Aggregator ID: 1
          Number of ports: 1   <---
  Slave Interface: ens10
  MII Status: down             <---
  Aggregator ID: 1
  Slave Interface: ens11
  MII Status: up
  Aggregator ID: 1

CC: Jay Vosburgh <[email protected]> CC: Veaceslav Falico <[email protected]> CC: Andy Gospodarek <[email protected]> CC: [email protected] Signed-off-by: Jarod Wilson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | net: atheros: atl2: don't return zero on failure path in atl2_probe() | Alexey Khoroshilov | 1 | -4/+4
If the DMA mask checks fail in atl2_probe(), it breaks off initialization, deallocates all resources, but returns zero. The patch adds a proper error code return value and makes the error code setup unified. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-05-22 | ipv6: fix out of bound writes in __ip6_append_data() | Eric Dumazet | 1 | -7/+8
Andrey Konovalov and [email protected] reported crashes caused by one skb shared_info being overwritten from __ip6_append_data(). Andrey's program led to the following state:

  copy -4200 datalen 2000 fraglen 2040 maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The call:

  skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen, fraggap, 0);

is overwriting skb->head and skb_shared_info. Since we apparently detect this rare condition too late, move the code earlier to even avoid allocating the skb and risking crashes. Once again, many thanks to Andrey and the syzkaller team. Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Andrey Konovalov <[email protected]> Tested-by: Andrey Konovalov <[email protected]> Reported-by: <[email protected]> Signed-off-by: David S. Miller <[email protected]>