path: root/kernel/time
Age | Commit message | Author | Files | Lines
2017-02-10 | tick/nohz: Fix possible missing clock reprog after tick soft restart | Frederic Weisbecker | 1 | -0/+5
ts->next_tick keeps track of the next tick deadline in order to optimize clock programmation on irq exit and avoid redundant clock device writes. Now if ts->next_tick missed an update, we may spuriously miss a clock reprog later as the nohz code is fooled by an obsolete next_tick value. This is what happens here on a specific path: when we observe an expired timer from the nohz update code on irq exit, we perform a soft tick restart which simply fires the closest possible tick without actually exiting the nohz mode and restoring a periodic state. But we forget to update ts->next_tick accordingly. As a result, after the next tick resulting from such soft tick restart, the nohz code sees a stale value on ts->next_tick which doesn't match the clock deadline that just expired. If that obsolete ts->next_tick value happens to collide with the actual next tick deadline to be scheduled, we may spuriously bypass the clock reprogramming. In the worst case, the tick may never fire again. Fix this with a ts->next_tick reset on soft tick restart. Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed: Wanpeng Li <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-02-04 | tick/broadcast: Reduce lock cacheline contention | Waiman Long | 1 | -7/+8
It was observed that on an Intel x86 system without the ARAT (Always Running APIC Timer) feature, with a fairly large number of CPUs and with CPUs entering and leaving intel_idle frequently, lock contention on tick_broadcast_lock can become significant. To reduce contention, the lock is put into its own cacheline and all the cpumask_var_t variables are put into the __read_mostly section. Running the SP benchmark of the NAS Parallel Benchmarks on a 4-socket 16-core 32-thread Nehalem system, performance improved from 3353.94 Mop/s to 3469.31 Mop/s when this patch was applied on a 4.9.6 kernel. This is a 3.4% improvement. Signed-off-by: Waiman Long <[email protected]> Cc: "Peter Zijlstra (Intel)" <[email protected]> Cc: Andrew Morton <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
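The idea of giving a contended lock its own cacheline can be sketched in portable C11. This is only an illustration, not the patch itself: the commit describes the kernel placing the lock in its own cacheline and the cpumask variables in the `__read_mostly` section, whereas the sketch below uses `alignas` inside a hypothetical struct. The names `demo_lock`, `demo_broadcast_state` and the 64-byte line size are assumptions.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; real hardware may differ. */
#define DEMO_CACHELINE_BYTES 64

/* Hypothetical stand-in for a spinlock. */
struct demo_lock {
    volatile int locked;
};

/*
 * The frequently written lock starts on its own cache line, and the
 * read-mostly data starts on the next one, so writers bouncing the
 * lock's line between CPUs do not invalidate the line readers use.
 */
struct demo_broadcast_state {
    alignas(DEMO_CACHELINE_BYTES) struct demo_lock lock;
    alignas(DEMO_CACHELINE_BYTES) unsigned long read_mostly_mask;
};
```

The cost of this layout is padding: each aligned member occupies at least one full line, which is usually acceptable for a single global lock.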
2017-02-01 | timers/itimer: Convert internal cputime_t units to nsec | Frederic Weisbecker | 2 | -68/+39
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Also convert itimers to use nsec based internal counters. This simplifies it and removes the whole game with error/inc_error which served to deal with cputime_t random granularity. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-01 | timers/posix-timers: Convert internals to use nsecs | Frederic Weisbecker | 2 | -126/+90
Use the new nsec based cputime accessors as part of the whole cputime conversion from cputime_t to nsecs. Also convert posix-cpu-timers to use nsec based internal counters to simplify it. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-01 | timers/posix-timers: Use TICK_NSEC instead of a dynamically ad-hoc calculated version | Frederic Weisbecker | 1 | -9/+2
Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-01 | sched/cputime: Introduce special task_cputime_t() API to return old-typed cputime | Frederic Weisbecker | 2 | -24/+24
This API returns a task's cputime in cputime_t in order to ease the conversion of cputime internals to use nsecs units instead. Blindly converting all cputime readers to use this API now will later let us convert more smoothly and step by step all these places to use the new nsec based cputime. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-01 | time: Introduce jiffies64_to_nsecs() | Frederic Weisbecker | 2 | -0/+16
This will be needed for the cputime_t to nsec conversion. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
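As a rough illustration of what a jiffies-to-nanoseconds helper computes, here is a minimal userspace sketch. It assumes a tick rate that divides one second evenly; the value `DEMO_HZ = 250` and all `demo_`/`DEMO_` names are hypothetical and not taken from the commit, and the kernel's real helper also has to cope with tick rates that do not divide evenly.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL
/* Assumption: 250 ticks per second, chosen only for the example. */
#define DEMO_HZ           250ULL
/* Nanoseconds per tick; exact because DEMO_HZ divides DEMO_NSEC_PER_SEC. */
#define DEMO_TICK_NSEC    (DEMO_NSEC_PER_SEC / DEMO_HZ)

/* Convert a 64-bit jiffies count to nanoseconds. */
static uint64_t demo_jiffies64_to_nsecs(uint64_t j)
{
    return j * DEMO_TICK_NSEC;
}
```

With these assumed values one jiffy is 4,000,000 ns, so 250 jiffies correspond to exactly one second.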
2017-02-01 | jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY | Frederic Weisbecker | 1 | -16/+16
NSEC_PER_JIFFY is an ad-hoc redefinition of TICK_NSEC. Let's rather use a unique and well maintained version. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Tony Luck <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stanislaw Gruszka <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-02-01 | Merge branch 'linus' into sched/core, to pick up fixes and refresh the branch | Ingo Molnar | 2 | -2/+9
Signed-off-by: Ingo Molnar <[email protected]>
2017-01-14 | sched/clock, clocksource: Add optional cs::mark_unstable() method | Thomas Gleixner | 1 | -0/+4
PeterZ reported that we'd fail to mark the TSC unstable when the clocksource watchdog finds it unsuitable. Allow a clocksource to run a custom action when it's being marked unstable and hook up the TSC unstable code. Reported-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-01-11 | nohz: Fix collision between tick and other hrtimers | Frederic Weisbecker | 2 | -2/+9
When the tick is stopped and an interrupt occurs afterward, we check on that interrupt exit if the next tick needs to be rescheduled. If it doesn't need any update, we don't want to do anything.
In order to check if the tick needs an update, we compare it against the clockevent device deadline. Now that's a problem because the clockevent device is at a lower level than the tick itself if it is implemented on top of hrtimer. Every hrtimer shares this clockevent device. So comparing the next tick deadline against the clockevent device deadline is wrong because the device may be programmed for another hrtimer whose deadline collides with the tick. As a result we may accidentally end up not reprogramming the tick.
In a worst case scenario under full dynticks mode, the tick stops firing at 1 Hz as it is supposed to, leaving /proc/stat stalled:
  Task in a full dynticks CPU
  ----------------------------
  * hrtimer A is queued 2 seconds ahead
  * the tick is stopped, scheduled 1 second ahead
  * tick fires 1 second later
  * on tick exit, nohz schedules the tick 1 second ahead but sees the clockevent device is already programmed to that deadline, fooled by hrtimer A, the tick isn't rescheduled.
  * hrtimer A is cancelled before its deadline
  * tick never fires again until an interrupt happens...
In order to fix this, store the next tick deadline to the tick_sched local structure and reuse that value later to check whether we need to reprogram the clock after an interrupt. On the other hand, ts->sleep_length still wants to know about the next clock event and not just the tick, so we want to improve the related comment to avoid confusion.
Reported-by: James Hartsock <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Reviewed-by: Wanpeng Li <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Acked-by: Rik van Riel <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
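The difference between the two checks described in this commit can be modelled in a few lines of plain C. This is a simplified sketch under stated assumptions, not the kernel code: `demo_tick_sched` and both helper functions are hypothetical, and only the `next_tick` field name comes from the commit messages on this page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced model of the per-CPU tick state. */
struct demo_tick_sched {
    bool     tick_stopped;
    uint64_t next_tick;   /* deadline last programmed for the tick itself */
};

/*
 * Old check: compares the wanted tick deadline with the deadline of the
 * shared clockevent device. Another hrtimer with the same deadline makes
 * this return true although the tick was never programmed.
 */
static bool demo_skip_reprog_by_device(uint64_t dev_next_event, uint64_t expires)
{
    return dev_next_event == expires;
}

/*
 * Fixed check: compares against the deadline remembered for the tick,
 * so deadlines of unrelated hrtimers cannot fool it.
 */
static bool demo_skip_reprog_by_tick(const struct demo_tick_sched *ts,
                                     uint64_t expires)
{
    return ts->tick_stopped && ts->next_tick == expires;
}
```

In the scenario from the commit message, the device is programmed for hrtimer A at 2 s while the tick last fired at 1 s and now wants 2 s: the device-based check skips the reprogramming, the tick-based check does not.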
2017-01-06 | timekeeping: Remove unused timekeeping_{get,set}_tai_offset() | Stephen Boyd | 2 | -40/+1
The last caller to timekeeping_set_tai_offset() was in commit 0b5154fb9040 (timekeeping: Simplify tai updating from do_adjtimex, 2013-03-22) and the last caller to timekeeping_get_tai_offset() was in commit 76f4108892d9 (hrtimer: Cleanup hrtimer accessors to the timekepeing state, 2014-07-16). Remove these unused functions now that we handle TAI offsets differently. Cc: John Stultz <[email protected]> Signed-off-by: Stephen Boyd <[email protected]> Signed-off-by: John Stultz <[email protected]>
2016-12-25 | ktime: Cleanup ktime_set() usage | Thomas Gleixner | 4 | -6/+6
ktime_set(S,N) was required for the timespec storage type and is still useful for situations where a Seconds and Nanoseconds part of a time value needs to be converted. For anything where the Seconds argument is 0, this is pointless and can be replaced with a simple assignment. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]>
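To see why a call with a zero seconds argument is redundant, consider this minimal sketch of a seconds-plus-nanoseconds constructor over a scalar nanosecond type. The names `demo_ktime_t` and `demo_ktime_set` are hypothetical, and the sketch assumes a plain signed 64-bit nanosecond representation without any range saturation.

```c
#include <stdint.h>

/* Assumption: time is stored as scalar signed 64-bit nanoseconds. */
typedef int64_t demo_ktime_t;

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Combine a seconds part and a nanoseconds part into one scalar value. */
static demo_ktime_t demo_ktime_set(int64_t secs, unsigned long nsecs)
{
    return secs * DEMO_NSEC_PER_SEC + (int64_t)nsecs;
}
```

When `secs` is 0 the function returns `nsecs` unchanged, so `t = demo_ktime_set(0, ns)` says nothing more than `t = ns`.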
2016-12-25 | ktime: Get rid of the union | Thomas Gleixner | 11 | -83/+83
ktime is a union because the initial implementation stored the time in scalar nanoseconds on 64-bit machines and in an endianness-optimized timespec variant on 32-bit machines. The Y2038 cleanup removed the timespec variant and switched everything to scalar nanoseconds. The union remained, but became completely pointless. Get rid of the union and just keep ktime_t as a simple typedef of type s64. The conversion was done with coccinelle and some manual mopping up. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]>
2016-12-25 | clocksource: Use a plain u64 instead of cycle_t | Thomas Gleixner | 5 | -39/+36
There is no point in having an extra type for extra confusion. u64 is unambiguous. Conversion was done with the following coccinelle script:
  @rem@
  @@
  -typedef u64 cycle_t;
  @fix@
  typedef cycle_t;
  @@
  -cycle_t
  +u64
Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: John Stultz <[email protected]>
2016-12-24 | Replace <asm/uaccess.h> with <linux/uaccess.h> globally | Linus Torvalds | 8 | -8/+8
This was entirely automated, using the script by Al:
  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
      $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window. Requested-by: Al Viro <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-18 | Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -0/+3
Pull timer fix from Thomas Gleixner:
"Prevent NULL pointer dereferencing in the tick broadcast code. Old bug, which got unearthed by the hotplug ordering problem"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/broadcast: Prevent NULL pointer dereference
2016-12-15 | tick/broadcast: Prevent NULL pointer dereference | Thomas Gleixner | 1 | -0/+3
When a dysfunctional timer, e.g. dummy timer, is installed, the tick core tries to set up the broadcast timer. If no broadcast device is installed, the kernel crashes with a NULL pointer dereference in tick_broadcast_setup_oneshot() because the function has no sanity check. Reported-by: Mason <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Anna-Maria Gleixner <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Sebastian Andrzej Siewior <[email protected]> Cc: Daniel Lezcano <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Frias <[email protected]> Cc: Thibaud Cornic <[email protected]> Cc: Robin Murphy <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2016-12-14 | posix-timers: give lazy compilers some help optimizing code away | Nicolas Pitre | 1 | -1/+2
The OpenRISC compiler (so far) fails to optimize away a large portion of code containing a reference to posix_timer_event in alarmtimer.c when CONFIG_POSIX_TIMERS is unset. Let's give it a direct clue to let the build succeed. This fixes [linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event' reported by kbuild test robot. Signed-off-by: Nicolas Pitre <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Josh Triplett <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-12-12 | Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 8 | -94/+275
Pull timer updates from Thomas Gleixner:
"The time/timekeeping/timer folks deliver with this update:
- Fix a reintroduced signed/unsigned issue and cleanup the whole signed/unsigned mess in the timekeeping core so this won't happen accidentally again.
- Add a new trace clock based on boot time
- Prevent injection of random sleep times when PM tracing abuses the RTC for storage
- Make posix timers configurable for real tiny systems
- Add tracepoints for the alarm timer subsystem so timer based suspend wakeups can be instrumented
- The usual pile of fixes and updates to core and drivers"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  timekeeping: Use mul_u64_u32_shr() instead of open coding it
  timekeeping: Get rid of pointless typecasts
  timekeeping: Make the conversion call chain consistently unsigned
  timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
  alarmtimer: Add tracepoints for alarm timers
  trace: Update documentation for mono, mono_raw and boot clock
  trace: Add an option for boot clock as trace clock
  timekeeping: Add a fast and NMI safe boot clock
  timekeeping/clocksource_cyc2ns: Document intended range limitation
  timekeeping: Ignore the bogus sleep time if pm_trace is enabled
  selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
  clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
  clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
  arm64: dts: rockchip: Arch counter doesn't tick in system suspend
  clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
  posix-timers: Make them configurable
  posix_cpu_timers: Move the add_device_randomness() call to a proper place
  timer: Move sys_alarm from timer.c to itimer.c
  ptp_clock: Allow for it to be optional
  Kconfig: Regenerate *.c_shipped files after previous changes
  ...
2016-12-12 | Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -19/+14
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state machine. The removal of the notifiers and the related infrastructure will happen around rc1, as there are conversions outstanding in other trees. The whole exercise removed about 2000 lines of code in total and in course of the conversion several dozen bugs got fixed. The new mechanism allows to test almost every hotplug step standalone, so usage sites can exercise all transitions extensively. There is more room for improvement, like integrating all the pointlessly different architecture mechanisms of synchronizing, setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
  tracing/rb: Init the CPU mask on allocation
  soc/fsl/qbman: Convert to hotplug state machine
  soc/fsl/qbman: Convert to hotplug state machine
  zram: Convert to hotplug state machine
  KVM/PPC/Book3S HV: Convert to hotplug state machine
  arm64/cpuinfo: Convert to hotplug state machine
  arm64/cpuinfo: Make hotplug notifier symmetric
  mm/compaction: Convert to hotplug state machine
  iommu/vt-d: Convert to hotplug state machine
  mm/zswap: Convert pool to hotplug state machine
  mm/zswap: Convert dst-mem to hotplug state machine
  mm/zsmalloc: Convert to hotplug state machine
  mm/vmstat: Convert to hotplug state machine
  mm/vmstat: Avoid on each online CPU loops
  mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
  tracing/rb: Convert to hotplug state machine
  oprofile/nmi timer: Convert to hotplug state machine
  net/iucv: Use explicit clean up labels in iucv_init()
  x86/pci/amd-bus: Convert to hotplug state machine
  x86/oprofile/nmi: Convert to hotplug state machine
  ...
2016-12-12 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull scheduler updates from Ingo Molnar:
"The main scheduler changes in this cycle were:
- support Intel Turbo Boost Max Technology 3.0 (TBM3) by introducing a notion of 'better cores', which the scheduler will prefer to schedule single threaded workloads on. (Tim Chen, Srinivas Pandruvada)
- enhance the handling of asymmetric capacity CPUs further (Morten Rasmussen)
- improve/fix load handling when moving tasks between task groups (Vincent Guittot)
- simplify and clean up the cputime code (Stanislaw Gruszka)
- improve mass fork()ed task spread a.k.a. hackbench speedup (Vincent Guittot)
- make struct kthread kmalloc()ed and related fixes (Oleg Nesterov)
- add uaccess atomicity debugging (when using access_ok() in the wrong context), under CONFIG_DEBUG_ATOMIC_SLEEP=y (Peter Zijlstra)
- implement various fixes, cleanups and other enhancements (Daniel Bristot de Oliveira, Martin Schwidefsky, Rafael J. Wysocki)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  sched/core: Use load_avg for selecting idlest group
  sched/core: Fix find_idlest_group() for fork
  kthread: Don't abuse kthread_create_on_cpu() in __kthread_create_worker()
  kthread: Don't use to_live_kthread() in kthread_[un]park()
  kthread: Don't use to_live_kthread() in kthread_stop()
  Revert "kthread: Pin the stack via try_get_task_stack()/put_task_stack() in to_live_kthread() function"
  kthread: Make struct kthread kmalloc'ed
  x86/uaccess, sched/preempt: Verify access_ok() context
  sched/x86: Make CONFIG_SCHED_MC_PRIO=y easier to enable
  sched/x86: Change CONFIG_SCHED_ITMT to CONFIG_SCHED_MC_PRIO
  x86/sched: Use #include <linux/mutex.h> instead of #include <asm/mutex.h>
  cpufreq/intel_pstate: Use CPPC to get max performance
  acpi/bus: Set _OSC for diverse core support
  acpi/bus: Enable HWP CPPC objects
  x86/sched: Add SD_ASYM_PACKING flags to x86 ITMT CPU
  x86/sysctl: Add sysctl for ITMT scheduling feature
  x86: Enable Intel Turbo Boost Max Technology 3.0
  x86/topology: Define x86's arch_update_cpu_topology
  sched: Extend scheduler's asym packing
  sched/fair: Clean up the tunable parameter definitions
  ...
2016-12-09 | timekeeping: Use mul_u64_u32_shr() instead of open coding it | Thomas Gleixner | 1 | -21/+5
The resume code must deal with a clocksource delta which is potentially big enough to overflow the 64bit mult. Replace the open coded handling with the proper function. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: David Gibson <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Parit Bhargava <[email protected]> Cc: Laurent Vivier <[email protected]> Cc: "Christopher S. Hall" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Liav Rehana <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
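The overflow problem mentioned here can be illustrated with a small userspace sketch that splits the 64-bit operand into two 32-bit halves, so that no intermediate product exceeds 64 bits. This is an assumption-laden illustration rather than the kernel's implementation: the `demo_` function names are hypothetical, and the sketch is only valid for `shift` values from 0 to 32 and for results that fit in 64 bits.

```c
#include <stdint.h>

/*
 * Compute (a * mul) >> shift without losing the upper bits of the
 * 96-bit product. Valid for 0 <= shift <= 32.
 */
static uint64_t demo_mul_u64_u32_shr(uint64_t a, uint32_t mul,
                                     unsigned int shift)
{
    uint64_t lo = (a & 0xffffffffULL) * mul;  /* 32x32 -> fits in 64 bits */
    uint64_t hi = (a >> 32) * mul;            /* 32x32 -> fits in 64 bits */

    return (lo >> shift) + (hi << (32 - shift));
}

/* Naive version: the 64-bit multiplication silently wraps on overflow. */
static uint64_t demo_naive_mul_shr(uint64_t a, uint32_t mul,
                                   unsigned int shift)
{
    return (a * mul) >> shift;
}
```

For example, with `a = 1 << 40`, `mul = 1 << 24` and `shift = 24`, the naive product is 2^64 and wraps to zero, while the split version returns `1 << 40` as expected.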
2016-12-09 | timekeeping: Get rid of pointless typecasts | Thomas Gleixner | 1 | -3/+2
cycle_t is defined as u64, so casting it to u64 is a pointless and confusing exercise. cycle_t should simply go away and be replaced with a plain u64 to avoid further confusion. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: David Gibson <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Parit Bhargava <[email protected]> Cc: Laurent Vivier <[email protected]> Cc: "Christopher S. Hall" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Liav Rehana <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-12-09 | timekeeping: Make the conversion call chain consistently unsigned | Thomas Gleixner | 1 | -13/+13
Propagating a unsigned value through signed variables and functions makes absolutely no sense and is just prone to (re)introduce subtle signed vs. unsigned issues as happened recently. Clean it up. Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: David Gibson <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Parit Bhargava <[email protected]> Cc: Laurent Vivier <[email protected]> Cc: "Christopher S. Hall" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Liav Rehana <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-12-09 | timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion | Thomas Gleixner | 1 | -2/+2
The clocksource delta to nanoseconds conversion is using signed math, but the delta is unsigned. This makes the conversion space smaller than necessary and in case of a multiplication overflow the conversion can become negative. The conversion is done with scaled math:
    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;
Shifting a signed integer right obviously preserves the sign, which has interesting consequences:
- Time jumps backwards
- __iter_div_u64_rem() which is used in one of the calling code paths will take forever to piecewise calculate the seconds/nanoseconds part.
This has been reported by several people with different scenarios:
David observed that when stopping a VM with a debugger:
"It was essentially the stopped by debugger case. I forget exactly why, but the guest was being explicitly stopped from outside, it wasn't just scheduling lag. I think it was something in the vicinity of 10 minutes stopped."
When lifting the stop the machine went dead. The stopped by debugger case is not really interesting, but nevertheless it would be a good thing not to die completely.
But this was also observed on a live system by Liav:
"When the OS is too overloaded, delta will get a high enough value for the msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so after the shift the nsec variable will gain a value similar to 0xffffffffff000000."
Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation"). It had been fixed a year ago already in commit 35a4933a8959 ("time: Avoid signed overflow in timekeeping_get_ns()").
Though it's not surprising that the issue has been reintroduced because the function itself and the whole call chain uses s64 for the result and the propagation of it. The change in this recent commit is subtle:
      s64 nsec;
    - nsec = (d * m + n) >> s;
    + nsec = d * m + n;
    + nsec >>= s;
d being type of cycle_t adds another level of obfuscation.
This wouldn't have happened if the previous change to unsigned computation had made the 'nsec' variable u64 right away and a follow up patch had cleaned up the whole call chain. There have been patches submitted which basically did a revert of the above patch leaving everything else unchanged as signed. Back to square one. This spawned an admittedly pointless discussion about potential users which rely on the unsigned behaviour until someone pointed out that it had been fixed before. The changelogs of said patches added further confusion as they made finally false claims about the consequences for eventual users which expect signed results. Despite delta being cycle_t, aka. u64, it's very well possible to hand in a signed negative value and the signed computation will happily return the correct result. But nobody actually sat down and analyzed the code which was added as user after the probably unintended signed conversion. Though in sensitive code like this it's better to analyze it properly and make sure that nothing relies on this than hunting the subtle wreckage half a year later. After analyzing all call chains it stands that no caller can hand in a negative value (which actually would work due to the s64 cast) and rely on the signed math to do the right thing. Change the conversion function to unsigned math. The conversion of all call chains is done in a follow up patch. This solves the starvation issue, which was caused by the negative result, but it does not solve the underlying problem. It merely procrastinates it. When the timekeeper update is deferred long enough that the unsigned multiplication overflows, then time going backwards is observable again. Nor does it solve the issue of clocksources with a small counter width which will wrap around possibly several times and cause random time stamps to be generated. But those are usually not found on systems used for virtualization, so this is likely a non issue.
I took the liberty to claim authorship for this simply because analyzing all callsites and writing the changelog took substantially more time than just making the simple s/s64/u64/ change and ignore the rest. Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation") Reported-by: David Gibson <[email protected]> Reported-by: Liav Rehana <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: David Gibson <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Parit Bhargava <[email protected]> Cc: Laurent Vivier <[email protected]> Cc: "Christopher S. Hall" <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Richard Cochran <[email protected]> Cc: John Stultz <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
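The sign problem described in this changelog can be reproduced with a small userspace sketch. It is an illustration under assumptions, not the kernel code: the `demo_` names are hypothetical, the multiplication is done in unsigned arithmetic to avoid undefined behaviour, and the sketch relies on the implementation-defined but common behaviour that converting to `int64_t` reinterprets the bits and that right-shifting a negative value is an arithmetic shift (as on GCC and Clang).

```c
#include <stdint.h>

/* Signed variant: a product with the top bit set becomes negative. */
static int64_t demo_cyc2ns_signed(uint64_t delta, uint32_t mult,
                                  uint32_t shift)
{
    /* Assumption: two's complement reinterpretation on conversion. */
    int64_t nsec = (int64_t)(delta * mult);

    /* Assumption: arithmetic shift, so the sign bit is preserved. */
    return nsec >> shift;
}

/* Unsigned variant: the same bit pattern is shifted as a plain number. */
static uint64_t demo_cyc2ns_unsigned(uint64_t delta, uint32_t mult,
                                     uint32_t shift)
{
    return (delta * mult) >> shift;
}
```

With `delta = 1 << 39`, `mult = 1 << 24` and `shift = 24`, the product is exactly 2^63: the signed variant yields a negative result, while the unsigned variant returns `1 << 39`. As the changelog notes, the unsigned form only postpones the problem, because a larger delta still overflows the 64-bit multiplication.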
2016-12-07 | clocksource: export the clocks_calc_mult_shift to use by timestamp code | Murali Karicheri | 1 | -0/+1
The CPSW CPTS driver is capable of doing timestamping on tx/rx packets and needs to know the mult and shift factors for timestamp conversion from raw value to nanoseconds (ptp clock). Currently these mult and shift factors are calculated manually and provided through DT, which makes it very hard to support a large number of platforms, especially if the CPTS refclk is not the same for some kinds of boards and depends on efuse settings (Keystone 2 platforms). Hence, export clocks_calc_mult_shift() to allow drivers like CPSW CPTS (and other ptp drivers) to benefit from automatic calculation of mult and shift factors. Cc: John Stultz <[email protected]> Signed-off-by: Murali Karicheri <[email protected]> Signed-off-by: Grygorii Strashko <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-12-01 | alarmtimer: Add tracepoints for alarm timers | Baolin Wang | 1 | -10/+43
Alarm timers are one of the mechanisms to wake up a system from suspend, but there exist no tracepoints to analyse which process/thread armed an alarmtimer. Add tracepoints for start/cancel/expire of individual alarm timers and one for tracing the suspend time decision when to resume the system. The following trace excerpt illustrates the new mechanism:
  Binder:3292_2-3304 [000] d..2 149.981123: alarmtimer_cancel: alarmtimer:ffffffc1319a7800 type:REALTIME expires:1325463120000000000 now:1325376810370370245
  Binder:3292_2-3304 [000] d..2 149.981136: alarmtimer_start: alarmtimer:ffffffc1319a7800 type:REALTIME expires:1325376840000000000 now:1325376810370384591
  Binder:3292_9-3953 [000] d..2 150.212991: alarmtimer_cancel: alarmtimer:ffffffc1319a5a00 type:BOOTTIME expires:179552000000 now:150154008122
  Binder:3292_9-3953 [000] d..2 150.213006: alarmtimer_start: alarmtimer:ffffffc1319a5a00 type:BOOTTIME expires:179551000000 now:150154025622
  system_server-3000 [002] ...1 162.701940: alarmtimer_suspend: alarmtimer type:REALTIME expires:1325376840000000000
The wakeup time which is selected at suspend time allows to map it back to the task arming the timer: Binder:3292_2.
[ tglx: Store alarm timer expiry time instead of some useless RTC relative information, add proper type information for wakeups which are handled via the clock_nanosleep/freezer and massage the changelog. ]
Signed-off-by: Baolin Wang <[email protected]> Signed-off-by: John Stultz <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Richard Cochran <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-29 | timekeeping: Add a fast and NMI safe boot clock | Joel Fernandes | 1 | -0/+29
This boot clock can be used as a tracing clock and will account for suspend time. To keep it NMI safe since we're accessing from tracing, we're not using a separate timekeeper with updates to monotonic clock and boot offset protected with seqlocks. This has the following minor side effects:
(1) It's possible that a timestamp is taken after the boot offset is updated but before the timekeeper is updated. If this happens, the new boot offset is added to the old timekeeping, making the clock appear to update slightly earlier:
  CPU 0                                        CPU 1
  timekeeping_inject_sleeptime64()
  __timekeeping_inject_sleeptime(tk, delta);
                                               timestamp();
  timekeeping_update(tk, TK_CLEAR_NTP...);
(2) On 32-bit systems, the 64-bit boot offset (tk->offs_boot) may be partially updated. Since the tk->offs_boot update is a rare event, this should be a rare occurrence which postprocessing should be able to handle.
Signed-off-by: Joel Fernandes <[email protected]> Signed-off-by: John Stultz <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-22sched/nohz: Convert to hotplug state machineSebastian Andrzej Siewior1-19/+14
Install the callbacks via the state machine. Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-16posix-timers: Make them configurableNicolas Pitre4-5/+137
Some embedded systems have no use for them. This removes about 25KB from the kernel binary size when configured out. Corresponding syscalls are routed to a stub logging the attempt to use those syscalls which should be enough of a clue if they were disabled without proper consideration. They are: timer_create, timer_gettime, timer_getoverrun, timer_settime, timer_delete, clock_adjtime, setitimer, getitimer, alarm. The clock_settime, clock_gettime, clock_getres and clock_nanosleep syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast majority of use cases with very little code. Signed-off-by: Nicolas Pitre <[email protected]> Acked-by: Richard Cochran <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: John Stultz <[email protected]> Reviewed-by: Josh Triplett <[email protected]> Cc: Paul Bolle <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Michal Marek <[email protected]> Cc: Edward Cree <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-16posix_cpu_timers: Move the add_device_randomness() call to a proper placeNicolas Pitre1-4/+0
There is no logical relation between add_device_randomness() and posix_cpu_timers_exit(). Let's move the former to where the latter is called. This way, when posix-cpu-timers.c is compiled out, there is no need to worry about losing a call to add_device_randomness(). Signed-off-by: Nicolas Pitre <[email protected]> Acked-by: John Stultz <[email protected]> Cc: Paul Bolle <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Richard Cochran <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Michal Marek <[email protected]> Cc: Edward Cree <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-16timer: Move sys_alarm from timer.c to itimer.cNicolas Pitre2-14/+14
Move the only user of alarm_setitimer to itimer.c where it is defined. This allows for making alarm_setitimer static, and dropping it from the build when __ARCH_WANT_SYS_ALARM is not defined. Signed-off-by: Nicolas Pitre <[email protected]> Acked-by: John Stultz <[email protected]> Cc: Paul Bolle <[email protected]> Cc: [email protected] Cc: [email protected] Cc: Richard Cochran <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Michal Marek <[email protected]> Cc: Edward Cree <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-11-15sched/cputime: Simplify task_cputime()Stanislaw Gruszka1-2/+2
Now since fetch_task_cputime() has no other users than task_cputime(), its code could be used directly in task_cputime(). Moreover since only 2 task_cputime() calls of 17 use a NULL argument, we can add dummy variables to those calls and remove NULL checks from task_cputime(). Also remove NULL checks from task_cputime_scaled(). Signed-off-by: Stanislaw Gruszka <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Martin Schwidefsky <[email protected]> Cc: Michael Neuling <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-10-26timers: Fix documentation for schedule_timeout() and similarDouglas Anderson2-10/+21
The documentation for schedule_timeout(), schedule_hrtimeout(), and schedule_hrtimeout_range() all claim that the routines couldn't possibly return early if the task state was TASK_UNINTERRUPTIBLE. This is simply not true since wake_up_process() will cause those routines to exit early. We cannot make schedule_[hr]timeout() loop until the timeout expires if the task state is uninterruptible because we have users which rely on the existing and designed behaviour. Make the documentation match the (correct) implementation. schedule_hrtimeout() returns -EINTR even when an uninterruptible task was woken up. This might look strange, but making the return code depend on the state is too much of an effort as it would affect all the call sites. There is no value in doing so, but we spell it out clearly in the documentation. Suggested-by: Daniel Kurtz <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Andreas Mohr <[email protected]> Cc: [email protected] Cc: [email protected] Cc: John Stultz <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-10-26timers: Fix usleep_range() in the context of wake_up_process()Douglas Anderson1-12/+9
Users of usleep_range() expect that it will _never_ return in less time than the minimum passed parameter. However, nothing in the code ensures this, when the sleeping task is woken by wake_up_process() or any other mechanism which can wake a task from uninterruptible state. Neither usleep_range() nor schedule_hrtimeout_range*() have any protection against wakeups. schedule_hrtimeout_range*() is designed this way despite the fact that the API documentation does not mention it. msleep() already has code to handle this case since it will loop as long as there was still time left. usleep_range() has no such loop, add it. Presumably this problem was not detected before because usleep_range() is only used in a few places and the function is mostly used in contexts which are not exposed to wakeups of any form. An effort was made to look for users relying on the old behavior by looking for usleep_range() in the same file as wake_up_process(). No problems were found by this search, though it is conceivable that someone could have put the sleep and wakeup in two different files. An effort was made to ask several upstream maintainers if they were aware of people relying on wake_up_process() to wake up usleep_range(). No maintainers were aware of that but they were aware of many people relying on usleep_range() never returning before the minimum. Reported-by: Tao Huang <[email protected]> Signed-off-by: Douglas Anderson <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: Andreas Mohr <[email protected]> Cc: [email protected] Cc: [email protected] Cc: John Stultz <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
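The "loop as long as there is time left" idea can be illustrated outside the kernel. The following is a user-space analogy only, not the kernel change itself: it assumes POSIX clock_nanosleep() with an absolute deadline, and the helper name sleep_at_least_us is hypothetical. An early wakeup (reported as EINTR here, standing in for wake_up_process()) simply re-enters the sleep against the same deadline.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/*
 * Sleep until at least min_us microseconds have elapsed, even if the
 * sleep is interrupted early. Returns 0 on success or an errno value.
 */
static int sleep_at_least_us(long min_us)
{
	struct timespec deadline;
	int ret;

	assert(min_us >= 0);

	if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
		return errno;

	/* Compute the absolute deadline once, up front. */
	deadline.tv_sec += min_us / 1000000L;
	deadline.tv_nsec += (min_us % 1000000L) * 1000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	/*
	 * Because the deadline is absolute, an early wakeup does not
	 * shorten the total sleep: the loop goes back to sleep until
	 * the same point in time.
	 */
	do {
		ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
				      &deadline, NULL);
	} while (ret == EINTR);

	return ret;
}
```

Using an absolute deadline avoids recomputing the remaining time on every iteration, which is where a relative-sleep loop would accumulate drift.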
2016-10-25timers: Prevent base clock corruption when forwardingThomas Gleixner1-13/+10
When a timer is enqueued we try to forward the timer base clock. This mechanism has two issues:

1) Forwarding a remote base unlocked

   The forwarding function is called from get_target_base() with the current timer base lock held. But if the new target base is a different base than the current base (can happen with NOHZ, sigh!) then the forwarding is done on an unlocked base. This can lead to corruption of base->clk.

   Solution is simple: Invoke the forwarding after the target base is locked.

2) Possible corruption due to jiffies advancing

   This is similar to the issue in get_next_timer_interrupt() which was fixed in the previous patch. jiffies can advance between check and assignment and therefore advance base->clk beyond the next expiry value.

   So we need to read jiffies into a local variable once and do the checks and assignment with the local copy.

Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <[email protected]>
Reported-by: Michael Thayer <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Michal Necasek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
2016-10-25timers: Prevent base clock rewind when forwarding clockThomas Gleixner1-5/+9
Ashton and Michael reported, that kernel versions 4.8 and later suffer from USB timeouts which are caused by the timer wheel rework.

This is caused by a bug in the base clock forwarding mechanism, which leads to timers expiring early. The scenario which leads to this is:

    run_timers()
      while (jiffies >= base->clk) {
        collect_expired_timers();
        base->clk++;
        expire_timers();
      }

So base->clk = jiffies + 1. Now the cpu goes idle:

    idle()
      get_next_timer_interrupt()
        nextevt = __next_timer_interrupt();
        if (time_after(nextevt, base->clk))
          base->clk = jiffies;

jiffies has not advanced since run_timers(), so this assignment effectively decrements base->clk by one.

base->clk is the index into the timer wheel arrays. So let's assume the following state after the base->clk increment in run_timers():

    jiffies = 0
    base->clk = 1

A timer gets enqueued with an expiry delta of 63 ticks (which is the case with the USB timeout and HZ=250) so the resulting bucket index is:

    base->clk + delta = 1 + 63 = 64

The timer goes into the first wheel level. The array size is 64 so it ends up in bucket 0, which is correct as it takes 63 ticks to advance base->clk to index into bucket 0 again.

If the cpu goes idle before jiffies advance, then the bug in the forwarding mechanism sets base->clk back to 0, so the next invocation of run_timers() at the next tick will index into bucket 0 and therefore expire the timer 62 ticks too early.

Instead of blindly setting base->clk to jiffies we must make the forwarding conditional on jiffies > base->clk, but we cannot use jiffies for this as we might run into the following issue:

    if (time_after(jiffies, base->clk)) {
      if (time_after(nextevt, base->clk))
        base->clk = jiffies;

jiffies can increment between the check and the assignment far enough to advance beyond nextevt.

So we need to use a stable value for checking. get_next_timer_interrupt() has the basej argument which is the jiffies value snapshot taken in the calling code. So we can just use that.

Thanks to Ashton for bisecting and providing trace data!

Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Reported-by: Ashton Holmes <[email protected]>
Reported-by: Michael Thayer <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Michal Necasek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
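The rewind scenario from this changelog can be replayed with a toy model. This is a sketch under stated assumptions, not the kernel's code: struct toy_base, bucket_of(), forward_buggy() and forward_guarded() are hypothetical names, only the first wheel level (64 buckets) is modeled, and the guarded variant is one plausible way to apply the rule "forward only when the jiffies snapshot is ahead of base->clk, and never beyond nextevt".

```c
#include <assert.h>

/* Wraparound-safe "a is after b", modeled on the kernel's time_after(). */
#define time_after(a, b) ((long)((b) - (a)) < 0)

#define LVL0_SIZE 64UL /* first wheel level size, per the changelog */

/* Toy stand-in for the timer base; only the clock is modeled. */
struct toy_base {
	unsigned long clk;
};

/* First-level bucket for an absolute expiry value. */
static unsigned long bucket_of(unsigned long expires)
{
	assert((LVL0_SIZE & (LVL0_SIZE - 1)) == 0); /* power of two */
	return expires & (LVL0_SIZE - 1);
}

/* Forwarding as described in the bug report: may move clk backwards. */
static void forward_buggy(struct toy_base *base, unsigned long jiffies_now,
			  unsigned long nextevt)
{
	if (time_after(nextevt, base->clk))
		base->clk = jiffies_now;
}

/*
 * Guarded forwarding: uses one stable snapshot (basej), only ever moves
 * clk forward, and does not advance past the next expiry.
 */
static void forward_guarded(struct toy_base *base, unsigned long basej,
			    unsigned long nextevt)
{
	if (!time_after(basej, base->clk))
		return; /* snapshot is not ahead of clk: nothing to do */

	if (time_after(nextevt, basej))
		base->clk = basej;
	else if (time_after(nextevt, base->clk))
		base->clk = nextevt;
}
```

With base->clk = 1, a jiffies snapshot of 0 and a timer expiring at 64 (bucket 0), forward_buggy() rewinds clk to 0 so bucket 0 is processed on the very next tick, while forward_guarded() leaves clk at 1.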
2016-10-25timers: Lock base for same bucket optimizationThomas Gleixner1-11/+17
Linus stumbled over the unlocked modification of the timer expiry value in mod_timer() which is an optimization for timers which stay in the same bucket - due to the bucket granularity - despite their expiry time getting updated. The optimization itself still makes sense even if we take the lock, because in case that the bucket stays the same, we avoid the pointless queue/enqueue dance. Make the check and the modification of timer->expires protected by the base lock and shuffle the remaining code around so we can keep the lock held when we actually have to requeue the timer to a different bucket. Fixes: f00c0afdfa62 ("timers: Implement optimization for same expiry time in mod_timer()") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: [email protected] Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]>
2016-10-25timers: Plug locking race vs. timer migrationThomas Gleixner1-1/+8
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the timer flags. As a consequence the compiler is allowed to reload the flags between the initial check for TIMER_MIGRATION and the following timer base computation and the spin lock of the base. While this has not been observed (yet), we need to make sure that it never happens. Fixes: 0eeda71bc30d ("timer: Replace timer base by a cpu index") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos Cc: [email protected] Cc: Andrew Morton <[email protected]> Cc: Peter Zijlstra <[email protected]>
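The hazard described here is a value being read twice where the code assumes one consistent read. A minimal user-space sketch of the "load once into a local" pattern follows; the macro READ_ONCE_UINT, the flag layout (TOY_MIGRATING, TOY_CPU_MASK) and the function toy_base_index() are hypothetical illustrations and do not reproduce the kernel's READ_ONCE() or lock_timer_base().

```c
#include <assert.h>

/*
 * Force exactly one load of an unsigned int through a volatile access,
 * so the compiler may not re-read the variable later.
 */
#define READ_ONCE_UINT(x) (*(const volatile unsigned int *)&(x))

/* Hypothetical flag layout for this sketch only. */
#define TOY_CPU_MASK  0x00ffu
#define TOY_MIGRATING 0x0100u

struct toy_timer {
	unsigned int flags;
};

/*
 * Returns the cpu index encoded in the flags, or -1 while the timer is
 * marked as migrating. Both the check and the index computation use the
 * same snapshot 'tf', so a concurrent update cannot make them disagree.
 */
static int toy_base_index(struct toy_timer *timer)
{
	unsigned int tf;

	assert(timer);

	tf = READ_ONCE_UINT(timer->flags);
	if (tf & TOY_MIGRATING)
		return -1;

	return (int)(tf & TOY_CPU_MASK);
}
```

Without the single snapshot, a compiler would be free to test the migration bit on one load and compute the index from a second load that already reflects a different state.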
2016-10-17alarmtimer: Remove unused but set variableTobias Klauser1-2/+0
Remove the set but unused variable base in alarm_timer_create to fix the following warning when building with 'W=1': kernel/time/alarmtimer.c: In function ‘alarm_timer_create’: kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable] Signed-off-by: Tobias Klauser <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-10-15Merge tag 'gcc-plugins-v4.9-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull gcc plugins update from Kees Cook: "This adds a new gcc plugin named "latent_entropy". It is designed to extract as much possible uncertainty from a running system at boot time as possible, hoping to capitalize on any possible variation in CPU operation (due to runtime data differences, hardware differences, SMP ordering, thermal timing variation, cache behavior, etc). At the very least, this plugin is a much more comprehensive example for how to manipulate kernel code using the gcc plugin internals" * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: latent_entropy: Mark functions with __latent_entropy gcc-plugins: Add latent_entropy plugin
2016-10-10latent_entropy: Mark functions with __latent_entropyEmese Revfy1-1/+1
The __latent_entropy gcc attribute can be used only on functions and variables. If it is on a function then the plugin will instrument it for gathering control-flow entropy. If the attribute is on a variable then the plugin will initialize it with random contents. The variable must be an integer, an integer array type or a structure with integer fields. These specific functions have been selected because they are init functions (to help gather boot-time entropy), are called at unpredictable times, or they have variable loops, each of which provide some level of latent entropy. Signed-off-by: Emese Revfy <[email protected]> [kees: expanded commit message] Signed-off-by: Kees Cook <[email protected]>
2016-10-05timekeeping: Fix __ktime_get_fast_ns() regressionJohn Stultz1-2/+5
In commit 27727df240c7 ("Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code the timekeeping_get_ns() function, but I forgot to include the unit conversion from cycles to nanoseconds, breaking the function's output, which impacts users like perf. This results in bogus perf timestamps like: swapper 0 [000] 253.427536: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.426573: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.426687: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.426800: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.426905: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.427022: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.427127: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.427239: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.427346: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 254.427463: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 255.426572: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) Instead of more reasonable expected timestamps like: swapper 0 [000] 39.953768: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.064839: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.175956: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.287103: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 
[000] 40.398217: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.509324: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.620437: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.731546: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.842654: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 40.953772: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) swapper 0 [000] 41.064881: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms]) Add the proper use of timekeeping_delta_to_ns() to convert the cycle delta to nanoseconds as needed. Thanks to Brendan and Alexei for finding this quickly after the v4.8 release. Unfortunately the problematic commit has landed in some -stable trees so they'll need this fix as well. Many apologies for this mistake. I'll be looking to add a perf-clock sanity test to the kselftest timers tests soon. Fixes: 27727df240c7 "timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING" Reported-by: Brendan Gregg <[email protected]> Reported-by: Alexei Starovoitov <[email protected]> Tested-and-reviewed-by: Mathieu Desnoyers <[email protected]> Signed-off-by: John Stultz <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: stable <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-13tick/nohz: Prevent stopping the tick on an offline CPUWanpeng Li1-2/+5
can_stop_full_tick() has no check for offline cpus. So it allows the tick to be stopped on an offline cpu from the interrupt return path, which is wrong and subsequently makes irq_work_needs_cpu() warn about being called for an offline cpu. Commit f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline") added prevention for can_stop_idle_tick(), but forgot to do the same in can_stop_full_tick(). Add it. [ tglx: Massaged changelog ] Signed-off-by: Wanpeng Li <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2016-09-08Merge branch 'linus' into timers/core, to refresh the branchIngo Molnar1-1/+2
Signed-off-by: Ingo Molnar <[email protected]>
2016-09-02tick/nohz: Fix softlockup on scheduler stalls in kvm guestWanpeng Li1-1/+2
Since commit 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter"), tick_nohz_start_idle() is no longer called if the idle tick can't be stopped. As a result, after a suspend/resume of the host machine, a full dynticks kvm guest will softlockup:

  NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
  Call Trace:
   default_idle+0x31/0x1a0
   arch_cpu_idle+0xf/0x20
   default_idle_call+0x2a/0x50
   cpu_startup_entry+0x39b/0x4d0
   rest_init+0x138/0x140
   ? rest_init+0x5/0x140
   start_kernel+0x4c1/0x4ce
   ? set_init_arg+0x55/0x55
   ? early_idt_handler_array+0x120/0x120
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0x142/0x14f

In addition, cat /proc/stat | grep cpu in guest or host shows:

  cpu  398 16 5049 15754 5490 0 1 46 0 0
  cpu0 206 5 450 0 0 0 1 14 0 0
  cpu1 81 0 3937 3149 1514 0 0 9 0 0
  cpu2 45 6 332 6052 2243 0 0 11 0 0
  cpu3 65 2 328 6552 1732 0 0 11 0 0

The idle and iowait values are unexpectedly 0 for cpu0 (housekeeping).

The bug is present in both guest and host kernels, and both show the cpu0 idle and iowait accounting issue. However, the host kernel's suspend/resume path etc. touches the watchdog, which avoids the softlockup there.

- The watchdog is not touched in the tick_nohz_stop_idle() path (it needs to be touched since the scheduler stall is expected) if the idle_active flag is not set.

- The idle and iowait times are not accounted when exiting the idle loop (resched or interrupt) if the idle start time and the idle_active flag are not set.

This patch fixes it by reverting commit 1f3b0f8243cb934, since being unable to stop the idle tick doesn't mean the CPU can't be idle.

Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
Signed-off-by: Wanpeng Li <[email protected]>
Cc: Sanjeev Yadav <[email protected]>
Cc: Gaurav Jindal <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Radim Krčmář <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
2016-08-31time: Avoid undefined behaviour in ktime_add_safe()Vegard Nossum1-1/+1
I ran into this: ================================================================================ UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16 signed integer overflow: 9223372036854775807 + 50000 cannot be represented in type 'long long int' CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3 ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60 000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320 Call Trace: [<ffffffff82344740>] dump_stack+0xac/0xfc [<ffffffff82344694>] ? _atomic_dec_and_lock+0xc4/0xc4 [<ffffffff8242df78>] ubsan_epilogue+0xd/0x8a [<ffffffff8242e6b4>] handle_overflow+0x202/0x23d [<ffffffff8242e4b2>] ? val_to_string.constprop.6+0x11e/0x11e [<ffffffff8236df71>] ? timerqueue_add+0x151/0x410 [<ffffffff81485c48>] ? hrtimer_start_range_ns+0x3b8/0x1380 [<ffffffff81795631>] ? memset+0x31/0x40 [<ffffffff8242e6fd>] __ubsan_handle_add_overflow+0xe/0x10 [<ffffffff81488ac9>] hrtimer_nanosleep+0x5d9/0x790 [<ffffffff814884f0>] ? hrtimer_init_sleeper+0x80/0x80 [<ffffffff813a9ffb>] ? __might_sleep+0x5b/0x260 [<ffffffff8148be10>] common_nsleep+0x20/0x30 [<ffffffff814906c7>] SyS_clock_nanosleep+0x197/0x210 [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150 [<ffffffff823c7113>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8162ef60>] ? __context_tracking_exit.part.3+0x30/0x1b0 [<ffffffff81490530>] ? SyS_clock_getres+0x150/0x150 [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0 [<ffffffff845f85aa>] entry_SYSCALL64_slow_path+0x25/0x25 ================================================================================ Add a new ktime_add_unsafe() helper which doesn't check for overflow, but doesn't throw a UBSAN warning when it does overflow either. 
Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Signed-off-by: Vegard Nossum <[email protected]> Signed-off-by: John Stultz <[email protected]>
2016-08-31time: Avoid undefined behaviour in timespec64_add_safe()Vegard Nossum1-1/+1
I ran into this: ================================================================================ UBSAN: Undefined behaviour in kernel/time/time.c:783:2 signed integer overflow: 5273 + 9223372036854771711 cannot be represented in type 'long int' CPU: 0 PID: 17363 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #88 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014 0000000000000000 ffff88011457f8f0 ffffffff82344f50 0000000041b58ab3 ffffffff84f98080 ffffffff82344ea4 ffff88011457f918 ffff88011457f8c8 ffff88011457f8e0 7fffffffffffefff ffff88011457f6d8 dffffc0000000000 Call Trace: [<ffffffff82344f50>] dump_stack+0xac/0xfc [<ffffffff82344ea4>] ? _atomic_dec_and_lock+0xc4/0xc4 [<ffffffff8242f4c8>] ubsan_epilogue+0xd/0x8a [<ffffffff8242fc04>] handle_overflow+0x202/0x23d [<ffffffff8242fa02>] ? val_to_string.constprop.6+0x11e/0x11e [<ffffffff823c7837>] ? debug_smp_processor_id+0x17/0x20 [<ffffffff8131b581>] ? __sigqueue_free.part.13+0x51/0x70 [<ffffffff8146d4e0>] ? rcu_is_watching+0x110/0x110 [<ffffffff8242fc4d>] __ubsan_handle_add_overflow+0xe/0x10 [<ffffffff81476ef8>] timespec64_add_safe+0x298/0x340 [<ffffffff81476c60>] ? timespec_add_safe+0x330/0x330 [<ffffffff812f7990>] ? wait_noreap_copyout+0x1d0/0x1d0 [<ffffffff8184bf18>] poll_select_set_timeout+0xf8/0x170 [<ffffffff8184be20>] ? poll_schedule_timeout+0x2b0/0x2b0 [<ffffffff813aa9bb>] ? __might_sleep+0x5b/0x260 [<ffffffff833c8a87>] __sys_recvmmsg+0x107/0x790 [<ffffffff833c8980>] ? SyS_recvmsg+0x20/0x20 [<ffffffff81486378>] ? hrtimer_start_range_ns+0x3b8/0x1380 [<ffffffff845f8bfb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60 [<ffffffff8148bcea>] ? do_setitimer+0x39a/0x8e0 [<ffffffff813aa9bb>] ? __might_sleep+0x5b/0x260 [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790 [<ffffffff833c91e9>] SyS_recvmmsg+0xd9/0x160 [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790 [<ffffffff823c7853>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8162f680>] ? 
__context_tracking_exit.part.3+0x30/0x1b0 [<ffffffff833c9110>] ? __sys_recvmmsg+0x790/0x790 [<ffffffff81007bd3>] do_syscall_64+0x1b3/0x4b0 [<ffffffff845f936a>] entry_SYSCALL64_slow_path+0x25/0x25 ================================================================================ Line 783 is this: 783 set_normalized_timespec64(&res, lhs.tv_sec + rhs.tv_sec, 784 lhs.tv_nsec + rhs.tv_nsec); In other words, since lhs.tv_sec and rhs.tv_sec are both time64_t, this is a signed addition which will cause undefined behaviour on overflow. Note that this is not currently a huge concern since the kernel should be built with -fno-strict-overflow by default, but could be a problem in the future, a problem with older compilers, or other compilers than gcc. The easiest way to avoid the overflow is to cast one of the arguments to unsigned (so the addition will be done using unsigned arithmetic). Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Signed-off-by: Vegard Nossum <[email protected]> Signed-off-by: John Stultz <[email protected]>
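The "do the addition in unsigned arithmetic" approach can be shown in isolation. The following is a minimal sketch, assuming both operands are non-negative second counts; the helper name add_seconds_saturating and the choice to saturate at INT64_MAX are assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add two non-negative signed 64-bit second counts without ever
 * performing a signed addition. Unsigned arithmetic is well defined
 * on overflow, so the result can be range-checked afterwards.
 * Saturates at INT64_MAX when the true sum does not fit.
 */
static int64_t add_seconds_saturating(int64_t lhs, int64_t rhs)
{
	uint64_t sum;

	assert(lhs >= 0 && rhs >= 0);

	/* Two non-negative int64 values cannot wrap a uint64 sum. */
	sum = (uint64_t)lhs + (uint64_t)rhs;

	if (sum > (uint64_t)INT64_MAX)
		return INT64_MAX;

	return (int64_t)sum;
}
```

Fed with the operands from the UBSAN report above, 5273 and 9223372036854771711, the helper returns INT64_MAX instead of triggering signed overflow.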
2016-08-31timekeeping: Prints the amounts of time spent during suspendRuchi Kandoi1-0/+2
In addition to keeping a histogram of suspend times, also print out the time spent in suspend to dmesg. This helps to keep track of suspend time while debugging using kernel logs. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Signed-off-by: Ruchi Kandoi <[email protected]> [jstultz: Tweaked commit message] Signed-off-by: John Stultz <[email protected]>