Age | Commit message (Collapse) | Author | Files | Lines |
|
Clocksources don't get the VALID_FOR_HRES flag until they have been
checked by a watchdog. However, when using an override, the
clocksource_select logic will clear the override value if the
clocksource is not marked VALID_FOR_HRES during that inititial check.
When using the boot arguments clocksource=<foo>, this selection can
run before the watchdog, and can cause the override to be incorrectly
cleared.
To address this condition, the override_name is only invalidated for
unstable clocksources. Otherwise, the override is left intact until after
the watchdog has validated the clocksource as stable/unstable.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Signed-off-by: Kyle Walker <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
Fix a minor spelling error.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Signed-off-by: Pratyush Patel <[email protected]>
[jstultz: Added commit message]
Signed-off-by: John Stultz <[email protected]>
|
|
It was reported that hibernation could fail on the 2nd attempt, where the
system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
claim_dma_lock(), because the lock has already been taken.
However there is actually no other process would like to grab this lock on
that problematic platform.
Further investigation showed that the problem is triggered by setting
/sys/power/pm_trace to 1 before the 1st hibernation.
Since once pm_trace is enabled, the rtc becomes unmeaningful after suspend,
and meanwhile some BIOSes would like to adjust the 'invalid' RTC (e.g, smaller
than 1970) to the release date of that motherboard during POST stage, thus
after resumed, it may seem that the system had a significant long sleep time
which is a completely meaningless value.
Then in timekeeping_resume -> tk_debug_account_sleep_time, if the bit31 of the
sleep time happened to be set to 1, fls() returns 32 and we add 1 to
sleep_time_bin[32], which causes an out of bounds array access and therefor
memory being overwritten.
As depicted by System.map:
0xffffffff81c9d080 b sleep_time_bin
0xffffffff81c9d100 B dma_spin_lock
the dma_spin_lock.val is set to 1, which caused this problem.
This patch adds a sanity check in tk_debug_account_sleep_time()
to ensure we don't index past the sleep_time_bin array.
[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the
issue slightly differently, but borrowed his excelent explanation of the
issue here.]
Fixes: 5c83545f24ab "power: Add option to log time spent in suspend"
Reported-by: Janek Kozicki <[email protected]>
Reported-by: Chen Yu <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Xunlei Pang <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: stable <[email protected]>
Cc: Zhang Rui <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
When I added some extra sanity checking in timekeeping_get_ns() under
CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
method was using timekeeping_get_ns().
Thus the locking added to the debug checks broke the NMI-safety of
__ktime_get_fast_ns().
This patch open-codes the timekeeping_get_ns() logic for
__ktime_get_fast_ns(), so can avoid any deadlocks in NMI.
Fixes: 4ca22c2648f9 "timekeeping: Add warnings when overflows or underflows are observed"
Reported-by: Steven Rostedt <[email protected]>
Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: John Stultz <[email protected]>
Cc: stable <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The tick_nohz_stop_sched_tick() routine is not properly
canceling the sched timer when nothing is pending, because
get_next_timer_interrupt() is no longer returning KTIME_MAX in
that case. This causes periodic interrupts when none are needed.
When determining the next interrupt time, we first use
__next_timer_interrupt() to get the first expiring timer in the
timer wheel. If no timer is found, we return the base clock value
plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
timer wheel.
Back in get_next_timer_interrupt(), we set the "expires" value
by converting the timer wheel expiry (in ticks) to a nsec value.
But we don't want to do this if the timer wheel expiry value
indicates no timer; we want to return KTIME_MAX.
Prior to commit 500462a9de65 ("timers: Switch to a non-cascading
wheel") we checked base->active_timers to see if any timers
were active, and if not, we didn't touch the expiry value and so
properly returned KTIME_MAX. Now we don't have active_timers.
To fix this, we now just check the timer wheel expiry value to
see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
try to compute a new value based on it, but instead simply let the
KTIME_MAX value in expires remain.
Fixes: 500462a9de65 "timers: Switch to a non-cascading wheel"
Signed-off-by: Chris Metcalf <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
"This is the next part of the hotplug rework.
- Convert all notifiers with a priority assigned
- Convert all CPU_STARTING/DYING notifiers
The final removal of the STARTING/DYING infrastructure will happen
when the merge window closes.
Another 700 hundred line of unpenetrable maze gone :)"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
timers/core: Correct callback order during CPU hot plug
leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
powerpc/numa: Convert to hotplug state machine
arm/perf: Fix hotplug state machine conversion
irqchip/armada: Avoid unused function warnings
ARC/time: Convert to hotplug state machine
clocksource/atlas7: Convert to hotplug state machine
clocksource/armada-370-xp: Convert to hotplug state machine
clocksource/exynos_mct: Convert to hotplug state machine
clocksource/arm_global_timer: Convert to hotplug state machine
rcu: Convert rcutree to hotplug state machine
KVM/arm/arm64/vgic-new: Convert to hotplug state machine
smp/cfd: Convert core to hotplug state machine
x86/x2apic: Convert to CPU hotplug state machine
profile: Convert to hotplug state machine
timers/core: Convert to hotplug state machine
hrtimer: Convert to hotplug state machine
x86/tboot: Convert to hotplug state machine
arm64/armv8 deprecated: Convert to hotplug state machine
hwtracing/coresight-etm4x: Convert to hotplug state machine
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"This update provides the following changes:
- The rework of the timer wheel which addresses the shortcomings of
the current wheel (cascading, slow search for next expiring timer,
etc). That's the first major change of the wheel in almost 20
years since Finn implemted it.
- A large overhaul of the clocksource drivers init functions to
consolidate the Device Tree initialization
- Some more Y2038 updates
- A capability fix for timerfd
- Yet another clock chip driver
- The usual pile of updates, comment improvements all over the place"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
tick/nohz: Optimize nohz idle enter
clockevents: Make clockevents_subsys static
clocksource/drivers/time-armada-370-xp: Fix return value check
timers: Implement optimization for same expiry time in mod_timer()
timers: Split out index calculation
timers: Only wake softirq if necessary
timers: Forward the wheel clock whenever possible
timers/nohz: Remove pointless tick_nohz_kick_tick() function
timers: Optimize collect_expired_timers() for NOHZ
timers: Move __run_timers() function
timers: Remove set_timer_slack() leftovers
timers: Switch to a non-cascading wheel
timers: Reduce the CPU index space to 256k
timers: Give a few structs and members proper names
hlist: Add hlist_is_singular_node() helper
signals: Use hrtimer for sigtimedwait()
timers: Remove the deprecated mod_timer_pinned() API
timers, net/ipv4/inet: Initialize connection request timers as pinned
timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
timers, drivers/tty/metag_da: Initialize the poll timer as pinned
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver updates from Greg KH:
"Here is the big Staging and IIO driver update for 4.8-rc1.
We ended up adding more code than removing, again, but it's not all
that bad. Lots of cleanups all over the staging tree, and new IIO
drivers, full details in the shortlog.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (417 commits)
drivers:iio:accel:mma8452: removed unwanted return statements
drivers:iio:accel:mma8452: added cleanup provision in case of failure.
iio: Add iio.git tree to MAINTAINERS
iio:st_pressure: clean useless static channel initializers
iio:st_pressure:lps22hb: temperature support
iio:st_pressure:lps22hb: open drain support
iio:st_pressure: temperature triggered buffering
iio:st_pressure: document sampling gains
iio:st_pressure: align storagebits on power of 2
iio:st_sensors: align on storagebits boundaries
staging:iio:lis3l02dq drop separate driver
iio: accel: st_accel: Add lis3l02dq support
iio: adc: add missing of_node references to iio_dev
iio: adc: ti-ads1015: add indio_dev->dev.of_node reference
iio: potentiometer: Fix typo in Kconfig
iio: potentiometer: mcp4531: Add device tree binding
iio: potentiometer: mcp4531: Add device tree binding documentation
iio: potentiometer: mcp4531: Add support for MCP454x, MCP456x, MCP464x and MCP466x
iio:imu:mpu6050: icm20608 initial support
iio: adc: max1363: Add device tree binding
...
|
|
tick_nohz_start_idle is called before checking whether the idle tick can be
stopped. If the tick cannot be stopped, calling tick_nohz_start_idle() is
pointless and just wasting CPU cycles.
Only invoke tick_nohz_start_idle() when can_stop_idle_tick() returns true. A
short one minute observation of the effect on ARM64 shows a reduction of calls
by 1.5% thus optimizing the idle entry sequence.
[tglx: Massaged changelog ]
Co-developed-by: Sanjeev Yadav<[email protected]>
Signed-off-by: Gaurav Jindal<[email protected]>
Link: http://lkml.kernel.org/r/[email protected]@spreadtrum.com
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The clockevents_subsys struct is used for sysfs support and
is not declared or used outside the file it is defined in.
Fix the following warning by making it static:
kernel/time/clockevents.c:648:17: warning: symbol 'clockevents_subsys' was not declared. Should it be static?
Signed-off-by: Ben Dooks <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
When tearing down, call timers_dead_cpu() before notify_dead().
There is a hidden dependency between:
- timers
- block multiqueue
- rcutree
If timers_dead_cpu() comes later than blk_mq_queue_reinit_notify()
that latter function causes a RCU stall.
Signed-off-by: Richard Cochran <[email protected]>
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Split out the clockevents callbacks instead of piggybacking them on
hrtimers.
This gets rid of a POST_DEAD user. See commit:
54e88fad223c ("sched: Make sure timers have migrated before killing the migration_thread")
We just move the callback state to the proper place in the state machine.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Variable "now" seems to be genuinely used unintialized
if branch
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
is not taken and branch
if (unlikely(sighand == NULL)) {
is taken. In this case the process has been reaped and the timer is marked as
disarmed anyway. So none of the postprocessing of the sample is
required. Return right away.
Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
|
|
The existing optimization for same expiry time in mod_timer() checks whether
the timer expiry time is the same as the new requested expiry time. In the old
timer wheel implementation this does not take the slack batching into account,
neither does the new implementation evaluate whether the new expiry time will
requeue the timer to the same bucket.
To optimize that, we can calculate the resulting bucket and check if the new
expiry time is different from the current expiry time. This calculation
happens outside the base lock held region. If the resulting bucket is the same
we can avoid taking the base lock and requeueing the timer.
If the timer needs to be requeued then we have to check under the base lock
whether the base time has changed between the lockless calculation and taking
the lock. If it has changed we need to recalculate under the lock.
This optimization takes effect for timers which are enqueued into the less
granular wheel levels (1 and above). With a simple test case the functionality
has been verified:
Before After
Match: 5.5% 86.6%
Requeue: 94.5% 13.4%
Recalc: <0.01%
In the non optimized case the timer is requeued in 94.5% of the cases. With
the index optimization in place the requeue rate drops to 13.4%. The case
where the lockless index calculation has to be redone is less than 0.01%.
With a real world test case (networking) we observed the following changes:
Before After
Match: 97.8% 99.7%
Requeue: 2.2% 0.3%
Recalc: <0.001%
That means two percent fewer lock/requeue/unlock operations done in one of
the hot path use cases of timers.
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
For further optimizations we need to seperate index calculation
from queueing. No functional change.
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
With the wheel forwading in place and with the HZ=1000 4ms folding we can
avoid running the softirq at all.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The wheel clock is stale when a CPU goes into a long idle sleep. This has the
side effect that timers which are queued end up in the outer wheel levels.
That results in coarser granularity.
To solve this, we keep track of the idle state and forward the wheel clock
whenever possible.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This was a failed attempt to optimize the timer expiry in idle, which was
disabled and never revisited. Remove the cruft.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
There might be expired timers so the current code loops and checks the expired
buckets for timers. This can take quite some time for long NOHZ idle periods.
The pending bitmask in the timer base allows us to do a quick search for the
next expiring timer and therefore a fast forward of the base time which
prevents pointless long lasting loops.
For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
in preparation for __run_timers() NOHZ optimization.
No functional change.
Signed-off-by: Anna-Maria Gleixner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We now have implicit batching in the timer wheel. The slack API is no longer
used, so remove it.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Alan Stern <[email protected]>
Cc: Andrew F. Davis <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: Dmitry Eremin-Solenikov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jaehoon Chung <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mathias Nyman <[email protected]>
Cc: Pali Rohár <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Cc: Ulf Hansson <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The current timer wheel has some drawbacks:
1) Cascading:
Cascading can be an unbound operation and is completely pointless in most
cases because the vast majority of the timer wheel timers are canceled or
rearmed before expiration. (They are used as timeout safeguards, not as
real timers to measure time.)
2) No fast lookup of the next expiring timer:
In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
must fast forward the base time to the current value of jiffies. As we
have no way to find the next expiring timer fast, the code loops linearly
and increments the base time one by one and checks for expired timers
in each step. This causes unbound overhead spikes exactly in the moment
when we should wake up as fast as possible.
After a thorough analysis of real world data gathered on laptops,
workstations, webservers and other machines (thanks Chris!) I came to the
conclusion that the current 'classic' timer wheel implementation can be
modified to address the above issues.
The vast majority of timer wheel timers is canceled or rearmed before
expiry. Most of them are timeouts for networking and other I/O tasks. The
nature of timeouts is to catch the exception from normal operation (TCP ack
timed out, disk does not respond, etc.). For these kinds of timeouts the
accuracy of the timeout is not really a concern. Timeouts are very often
approximate worst-case values and in case the timeout fires, we already
waited for a long time and performance is down the drain already.
The few timers which actually expire can be split into two categories:
1) Short expiry times which expect halfways accurate expiry
2) Long term expiry times are inaccurate today already due to the
batching which is done for NOHZ automatically and also via the
set_timer_slack() API.
So for long term expiry timers we can avoid the cascading property and just
leave them in the less granular outer wheels until expiry or
cancelation. Timers which are armed with a timeout larger than the wheel
capacity are no longer cascaded. We expire them with the longest possible
timeout (6+ days). We have not observed such timeouts in our data collection,
but at least we handle them, applying the rule of the least surprise.
To avoid extending the wheel levels for HZ=1000 so we can accomodate the
longest observed timeouts (5 days in the network conntrack code) we reduce the
first level granularity on HZ=1000 to 4ms, which effectively is the same as
the HZ=250 behaviour. From our data analysis there is nothing which relies on
that 1ms granularity and as a side effect we get better batching and timer
locality for the networking code as well.
Contrary to the classic wheel the granularity of the next wheel is not the
capacity of the first wheel. The granularities of the wheels are in the
currently chosen setting 8 times the granularity of the previous wheel.
So for HZ=250 we end up with the following granularity levels:
Level Offset Granularity Range
0 0 4 ms 0 ms - 252 ms
1 64 32 ms 256 ms - 2044 ms (256ms - ~2s)
2 128 256 ms 2048 ms - 16380 ms (~2s - ~16s)
3 192 2048 ms (~2s) 16384 ms - 131068 ms (~16s - ~2m)
4 256 16384 ms (~16s) 131072 ms - 1048572 ms (~2m - ~17m)
5 320 131072 ms (~2m) 1048576 ms - 8388604 ms (~17m - ~2h)
6 384 1048576 ms (~17m) 8388608 ms - 67108863 ms (~2h - ~18h)
7 448 8388608 ms (~2h) 67108864 ms - 536870911 ms (~18h - ~6d)
That's a worst case inaccuracy of 12.5% for the timers which are queued at the
beginning of a level.
So the new wheel concept addresses the old issues:
1) Cascading is avoided completely
2) By keeping the timers in the bucket until expiry/cancelation we can track
the buckets which have timers enqueued in a bucket bitmap and therefore can
look up the next expiring timer very fast and O(1).
A further benefit of the concept is that the slack calculation which is done
on every timer start is no longer necessary because the granularity levels
provide natural batching already.
Our extensive testing with various loads did not show any performance
degradation vs. the current wheel implementation.
This patch does not address the 'fast lookup' issue as we wanted to make sure
that there is no regression introduced by the wheel redesign. The
optimizations are in follow up patches.
This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Some of the names in the internal implementation of the timer code
are not longer correct and others are simply too long to type.
Clean it up before we switch the wheel implementation over to
the new scheme.
No functional change.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We switched all users to initialize the timers as pinned and call
mod_timer(). Remove the now unused timer API function.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We want to move the timer migration logic from a 'push' to a 'pull' model.
Under the current 'push' model pinned timers are handled via
a runtime API variant: mod_timer_pinned().
The 'pull' model requires us to store the pinned attribute of a timer
in the timer_list structure itself, as a new TIMER_PINNED bit in
timer->flags.
This flag must be set at initialization time and the timer APIs
recognize the flag.
This patch:
- Implements the new flag and associated new-style initialization
methods
- makes mod_timer() recognize new-style pinned timers,
- and adds some migration helper facility to allow
step by step conversion of old-style to new-style
pinned timers.
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: George Spelvin <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This is to avoid the "null" name when we either
~ # cat /sys/devices/system/clockevents/broadcast/current_device
(null)
or
~ # cat /proc/timer_list
...
Tick Device: mode: 1
Broadcast device
Clock Event Device: (null)
...
Signed-off-by: Jisheng Zhang <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
While reviewing another patch I noticed that kernel/time/tick-sched.c
had a charmingly (confusingly, annoyingly) rich set of variants for
spelling 'CPU':
cpu
cpus
CPU
CPUs
per CPU
per-CPU
per cpu
... sometimes these were mixed even within the same comment block!
Compress these variants down to a single consistent set of:
CPU
CPUs
per-CPU
Cc: Frederic Weisbecker <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Wei Jiangang <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
EXPORT_SYMBOL() get_monotonic_coarse64 for new IIO timestamping clock
selection usage. This provides user apps the ability to request a
particular IIO device to timestamp samples using a monotonic coarse clock
granularity.
Signed-off-by: Gregor Boirie <[email protected]>
Signed-off-by: Jonathan Cameron <[email protected]>
|
|
https://git.linaro.org/people/john.stultz/linux into timers/core
Pull time(keeping) updates from John Stultz:
- Handle the 1ns issue with the old refusing to die vsyscall machinery
- More y2038 updates
- Documentation fixes
- Simplify clocksource handling
|
|
The tstats_show() function prints a ktime_t variable by converting
it to struct timespec first. The algorithm is ok, but we want to
stop using timespec in general because of the 32-bit time_t
overflow problem.
This changes the code to use struct timespec64, without any
functional change.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
udelay_test_single() uses ktime_get_ts() to get two timespec values
and calculate the difference between them, while udelay_test_show()
uses the same to printk() the current monotonic time.
Both of these are y2038 safe on all machines, but we want to
get rid of struct timespec anyway, so this converts the code to
use ktime_get_ns() and ktime_get_ts64() respectively.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
time_to_tm() takes time_t as an argument.
time_t is not y2038 safe.
Add time64_to_tm() that takes time64_t as an argument
which is y2038 safe.
The plan is to eventually replace all calls to time_to_tm()
by time64_to_tm().
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Deepa Dinamani <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
Updated struct alarm and struct alarm_timer descriptions.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Pratyush Patel <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
The user notices the problem in a raw and real time drift, calling
clock_gettime with CLOCK_REALTIME / CLOCK_MONOTONIC_RAW on a system
with no ntp correction taking place (no ntpd or ptp stuff running).
The problem is, that old_vsyscall_fixup adds an extra 1ns even though
xtime_nsec is already held in full nsecs and the remainder in this
case is 0. Do the rounding up buisness only if needed.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Graziadei <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
In clocksource_enqueue(), it is unnecessary to continue looping
the list, if we find there is an entry that the value of rating
is smaller than the new one. It is safe to be out the loop,
because all of entry are inserted in descending order.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Minfei Huang <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
Only need CONFIG_NO_HZ_COMMON as this block is already in a
CONFIG_SMP block.
Signed-off-by: Pratyush Patel <[email protected]>
Link: http://lkml.kernel.org/r/20160301172849.GA18152@cyborg
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Update the usleep_range() function comment to make it clear that it can
only be used in non-atomic context.
Previously we claimed usleep_range() was a drop-in replacement for udelay()
where wakeup is flexible. But that's only true in non-atomic contexts,
where it's possible to sleep instead of delay.
Signed-off-by: Bjorn Helgaas <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/20160531212302.28502.44995.stgit@bhelgaas-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
hrtimer_init_on_stack() needs a matching call to
destroy_hrtimer_on_stack(), so both need to be exported.
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
callbacks
When activating a static object we need make sure that the object is
tracked in the object tracker. If it is a non-static object then the
activation is illegal.
In previous implementation, each subsystem need take care of this in
their fixup callbacks. Actually we can put it into debugobjects core.
Thus we can save duplicated code, and have *pure* fixup callbacks.
To achieve this, a new callback "is_static_object" is introduced to let
the type specific code decide whether a object is static or not. If
yes, we take it into object tracker, otherwise give warning and invoke
fixup callback.
This change has paassed debugobjects selftest, and I also do some test
with all debugobjects supports enabled.
At last, I have a concern about the fixups that can it change the object
which is in incorrect state on fixup? Because the 'addr' may not point
to any valid object if a non-static object is not tracked. Then Change
such object can overwrite someone's memory and cause unexpected
behaviour. For example, the timer_fixup_activate bind timer to function
stub_timer.
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: improve code comments where invoke the new is_static_object callback]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Du, Changbin <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Update the return type to use bool instead of int, corresponding to
cheange (debugobjects: make fixup functions return bool instead of int).
Signed-off-by: Du, Changbin <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
All references to timespec_add_safe() now use timespec64_add_safe().
The plan is to replace struct timespec references with struct timespec64
throughout the kernel as timespec is not y2038 safe.
Drop timespec_add_safe() and use timespec64_add_safe() for all
architectures.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Deepa Dinamani <[email protected]>
Acked-by: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
timespec64_add_safe() has been defined in time64.h for 64 bit systems.
But, 32 bit systems only have an extern function prototype defined.
Provide a definition for the above function.
The function will be necessary as part of y2038 changes. struct
timespec is not y2038 safe. All references to timespec will be replaced
by struct timespec64. The function is meant to be a replacement for
timespec_add_safe().
The implementation is similar to timespec_add_safe().
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Deepa Dinamani <[email protected]>
Acked-by: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"A rather small set of patches from the timer departement:
- Some more y2038 work
- Yet another new clocksource driver
- The usual set of small fixes, cleanups and enhancements"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/tegra: Remove unused suspend/resume code
clockevents/driversi/mps2: add MPS2 Timer driver
dt-bindings: document the MPS2 timer bindings
clocksource/drivers/mtk_timer: Add __init attribute
clockevents/drivers/dw_apb_timer: Implement ->set_state_oneshot_stopped()
time: Introduce do_sys_settimeofday64()
security: Introduce security_settime64()
clocksource: Add missing include of of.h.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- massive CPU hotplug rework (Thomas Gleixner)
- improve migration fairness (Peter Zijlstra)
- CPU load calculation updates/cleanups (Yuyang Du)
- cpufreq updates (Steve Muckle)
- nohz optimizations (Frederic Weisbecker)
- switch_mm() micro-optimization on x86 (Andy Lutomirski)
- ... lots of other enhancements, fixes and cleanups.
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
ARM: Hide finish_arch_post_lock_switch() from modules
sched/core: Provide a tsk_nr_cpus_allowed() helper
sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed
sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems
sched/fair: Correct unit of load_above_capacity
sched/fair: Clean up scale confusion
sched/nohz: Fix affine unpinned timers mess
sched/fair: Fix fairness issue on migration
sched/core: Kill sched_class::task_waking to clean up the migration logic
sched/fair: Prepare to fix fairness problems on migration
sched/fair: Move record_wakee()
sched/core: Fix comment typo in wake_q_add()
sched/core: Remove unused variable
sched: Make hrtick_notifier an explicit call
sched/fair: Make ilb_notifier an explicit call
sched/hotplug: Make activate() the last hotplug step
sched/hotplug: Move migration CPU_DYING to sched_cpu_dying()
sched/migration: Move CPU_ONLINE into scheduler state
sched/migration: Move calc_load_migrate() into CPU_DYING
sched/migration: Move prepare transition to SCHED_STARTING state
...
|
|
All the atomic operations have their arguments the wrong way around;
make atomic_fetch_or() consistent and flip them.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
mode and we try to minimize the ticks as much as possible. The nohz
subsystem uses a confusing terminology with the internal state
"ts->tick_stopped" which is also available through its public interface
with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
reduced with the best effort rather than stopped. In the best case the
tick can indeed be actually stopped but there is no guarantee about that.
If a timer needs to fire one second later, a tick will fire while the
CPU is in nohz mode and this is a very common scenario.
Now this confusion happens to be a problem with CPU load updates:
cpu_load_update_active() doesn't handle nohz ticks correctly because it
assumes that ticks are completely stopped in nohz mode and that
cpu_load_update_active() can't be called in dynticks mode. When that
happens, the whole previous tickless load is ignored and the function
just records the load for the current tick, ignoring potentially long
idle periods behind.
In order to solve this, we could account the current load for the
previous nohz time but there is a risk that we account the load of a
task that got freshly enqueued for the whole nohz period.
So instead, lets record the dynticks load on nohz frame entry so we know
what to record in case of nohz ticks, then use this record to account
the tickless load on nohz ticks and nohz frame end.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E . McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The CPU load update related functions have a weak naming convention
currently, starting with update_cpu_load_*() which isn't ideal as
"update" is a very generic concept.
Since two of these functions are public already (and a third is to come)
that's enough to introduce a more conventional naming scheme. So let's
do the following rename instead:
update_cpu_load_*() -> cpu_load_update_*()
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Luiz Capitulino <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul E . McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The do_sys_settimeofday() function uses a timespec, which is not year
2038 safe on 32bit systems.
Thus this patch introduces do_sys_settimeofday64(), which allows us to
transition users of do_sys_settimeofday() to using 64bit time types.
Cc: Prarit Bhargava <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Baolin Wang <[email protected]>
[jstultz: Include errno-base.h to avoid build issue on some arches]
Signed-off-by: John Stultz <[email protected]>
|