|
The recent implementation of a generic dummy timer resulted in a
different registration order of per cpu local timers which made the
broadcast control logic go belly up.
If the dummy timer is the first clock event device which is registered
for a CPU, then it is installed, the broadcast timer is initialized
and the CPU is marked as broadcast target.
If a real clock event device is installed after that, we can fail to
take the CPU out of the broadcast mask. In the worst case we end up
with two periodic timer events firing for the same CPU. One from the
per cpu hardware device and one from the broadcast.
Now the problem is that we have no way to distinguish whether the
system is in a state which makes broadcasting necessary or the
broadcast bit was set due to the nonfunctional dummy timer
installment.
To solve this we need to keep track of the system state separately and
provide more detailed decision logic for whether we keep the CPU in
broadcast mode or not.
The old decision logic only clears the broadcast mode if the newly
installed clock event device is not affected by power states.
The new logic clears the broadcast mode if one of the following is
true:
- The new device is not affected by power states.
- The system is not in a power state affected mode
- The system has switched to oneshot mode. The oneshot broadcast is
controlled from the deep idle state. The CPU is not in idle at
this point, so it's safe to remove it from the mask.
If we clear the broadcast bit for the CPU when a new device is
installed, we also shut down the broadcast device when this was the
last CPU in the broadcast mask.
If the broadcast bit is kept, then we leave the new device in shutdown
state and rely on the broadcast to deliver the timer interrupts via
broadcast IPIs.
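Roughly, the new decision (a condensed sketch, not the literal kernel
code; the bc_ps_mode flag stands in for the new system state tracking):

	static bool keep_cpu_in_broadcast(struct clock_event_device *newdev,
					  bool bc_ps_mode, bool oneshot)
	{
		/* A device unaffected by power states never needs broadcast. */
		if (!(newdev->features & CLOCK_EVT_FEAT_C3STOP))
			return false;
		/* Broadcast only matters in a power state affected mode. */
		if (!bc_ps_mode)
			return false;
		/* Oneshot: the CPU is not idle here, safe to take it out. */
		if (oneshot)
			return false;
		return true;
	}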
Reported-and-tested-by: Stehle Vincent-B46079 <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mark Rutland <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
When the system switches from periodic to oneshot mode, the broadcast
logic can cause a CPU which has not yet switched to oneshot mode to put
its own clock event device into oneshot mode without updating the state
and the timer handler.
CPU0                                    CPU1
                                        per cpu tickdev is in periodic mode
                                        and switched to broadcast

Switch to oneshot mode
 tick_broadcast_switch_to_oneshot()
   cpumask_copy(tick_broadcast_oneshot_mask,
                tick_broadcast_mask);

   broadcast device mode = oneshot

                                        Timer interrupt
                                        irq_enter()
                                          tick_check_oneshot_broadcast()
                                            dev->set_mode(ONESHOT);

                                        tick_handle_periodic()
                                          if (dev->mode == ONESHOT)
                                            dev->next_event += period;
                                            FAIL.
We fail, because dev->next_event contains KTIME_MAX, if the device was
in periodic mode before the uncontrolled switch to oneshot happened.
We must copy the broadcast bits over to the oneshot mask, because
otherwise a CPU which relies on the broadcast would not be woken up
anymore after the broadcast device switched to oneshot mode.
So we need to verify in tick_check_oneshot_broadcast() whether the CPU
has already switched to oneshot mode. If not, leave the device
untouched and let the CPU switch controlled into oneshot mode.
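A sketch of the intended check (assuming the usual per cpu
tick_cpu_device and the oneshot mask; not the literal patch):

	void tick_check_oneshot_broadcast(int cpu)
	{
		if (cpumask_test_cpu(cpu, tick_broadcast_oneshot_mask)) {
			struct tick_device *td = &per_cpu(tick_cpu_device, cpu);

			/*
			 * Only touch the device if this CPU has already
			 * switched over; otherwise leave it to the
			 * controlled switch.
			 */
			if (td->mode == TICKDEV_MODE_ONESHOT)
				clockevents_set_mode(td->evtdev,
						     CLOCK_EVT_MODE_ONESHOT);
		}
	}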
This is a long standing bug, which was never noticed, because the main
user of the broadcast, x86, cannot run into that scenario, AFAICT. The
nonarchitected timer mess of ARM creates a gazillion of differently
broken abominations which trigger the shortcomings of that broadcast
code, which better had never been necessary in the first place.
Reported-and-tested-by: Stehle Vincent-B46079 <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Mark Rutland <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In periodic mode we remove offline cpus from the broadcast propagation
mask. In oneshot mode we fail to do so. This was not a problem so far,
but the recent changes to the broadcast propagation introduced a
constellation which can result in a NULL pointer dereference.
What happens is:
CPU0                            CPU1
                                idle()
                                  arch_idle()
                                    tick_broadcast_oneshot_control(OFF);
                                      set cpu1 in tick_broadcast_force_mask
                                  if (cpu_offline())
                                    arch_cpu_dead()

cpu_dead_cleanup(cpu1)
  cpu1 tickdevice pointer = NULL

broadcast interrupt
  dereference cpu1 tickdevice pointer -> OOPS
We dereference the pointer because cpu1 is still set in
tick_broadcast_force_mask and tick_do_broadcast() expects a valid
cpumask and therefore lacks any further checks.
Remove the cpu from the tick_broadcast_force_mask before we set the
tick device pointer to NULL. Also add a sanity check to the oneshot
broadcast function, so we can detect such issues w/o crashing the
machine.
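A sketch of both changes (illustrative, not the exact hunks):

	/* On cpu death, before the tick device pointer goes away: */
	cpumask_clear_cpu(cpu, tick_broadcast_force_mask);

	/* And in the oneshot broadcast handler, survive a stale bit: */
	for_each_cpu(cpu, tick_broadcast_force_mask) {
		struct tick_device *td = &per_cpu(tick_cpu_device, cpu);

		if (WARN_ON_ONCE(!td->evtdev))
			continue;	/* detect the issue w/o crashing */
	}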
Reported-by: Prarit Bhargava <[email protected]>
Cc: [email protected]
Cc: CAI Qian <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Although parameters are supposed to be part of the kernel API, experimental
parameters are often removed. In addition, downgrading a kernel might cause
previously-working modules to fail to load.
On balance, it's probably better to warn, and load the module anyway.
This may let through a typo, but at least the logs will show it.
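A sketch of the warn-and-continue callback (the name and exact message
are illustrative):

	static int unknown_module_param_cb(char *param, char *val,
					   const char *modname)
	{
		/* Warn instead of failing the module load. */
		pr_warn("%s: unknown parameter '%s' ignored\n",
			modname, param);
		return 0;
	}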
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
|
|
There is no such path as /sys/parameters, module parameters live in
/sys/module/*/parameters.
Signed-off-by: Jean Delvare <[email protected]>
Cc: Rusty Russell <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
|
|
If we pass a pointer to a const string in the form "module:symbol"
module_kallsyms_lookup_name() will try to split the string at the colon,
i.e., will try to modify r/o data. That will, in fact, fail on a kernel
with CONFIG_DEBUG_RODATA enabled.
Avoid modifying the passed string in module_kallsyms_lookup_name(),
modify find_module_all() instead to pass it the module name length.
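A sketch of the length-based lookup (assumed signature):

	static struct module *find_module_all(const char *name, size_t len,
					      bool even_unformed)
	{
		struct module *mod;

		list_for_each_entry(mod, &modules, list) {
			if (!even_unformed && mod->state == MODULE_STATE_UNFORMED)
				continue;
			/* Compare by length, never writing into 'name'. */
			if (strlen(mod->name) == len &&
			    !memcmp(mod->name, name, len))
				return mod;
		}
		return NULL;
	}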
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
|
|
There are multiple places where the ftrace_trace_arrays list is accessed in
trace_events.c without the trace_types_lock held.
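The fix is mechanical; each access takes the mutex around the list
walk, e.g. (illustrative):

	struct trace_array *tr;

	mutex_lock(&trace_types_lock);
	list_for_each_entry(tr, &ftrace_trace_arrays, list) {
		/* ... look at or modify tr ... */
	}
	mutex_unlock(&trace_types_lock);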
Link: http://lkml.kernel.org/r/[email protected]
Cc: Vaibhav Nagarnaik <[email protected]>
Cc: David Sharp <[email protected]>
Cc: Alexander Z Lam <[email protected]>
Cc: [email protected] # 3.10
Signed-off-by: Alexander Z Lam <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
The trace_marker file was present for each new instance created, but it
added the trace mark to the global trace buffer instead of to
the instance's buffer.
Link: http://lkml.kernel.org/r/[email protected]
Cc: David Sharp <[email protected]>
Cc: Vaibhav Nagarnaik <[email protected]>
Cc: Alexander Z Lam <[email protected]>
Cc: [email protected] # 3.10
Signed-off-by: Alexander Z Lam <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
If the kernel command line ftrace filter parameters are set
(ftrace_filter or ftrace_notrace), force the function self test to
pass, with a warning explaining why it was forced.
If the user adds a filter to the kernel command line, it is assumed
that they know what they are doing, and the self test should just not
run instead of failing (which disables function tracing) or clearing
the filter, as that will probably annoy the user.
If the user wants the selftest to run, the message will tell them why
it did not.
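A sketch of the forced pass (the flag name is assumed from the boot
parameter handling):

	if (ftrace_filter_param) {
		printk(KERN_WARNING
		       "ftrace boot filter set, forcing self test to pass\n");
		return 0;
	}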
Signed-off-by: Steven Rostedt <[email protected]>
|
|
kprobe_perf_func() and kretprobe_perf_func() pass addr=ip to
perf_trace_buf_submit() for no reason.
This sets perf_sample_data->addr for PERF_SAMPLE_ADDR; we already
have perf_sample_data->ip initialized if PERF_SAMPLE_IP.
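Roughly the change in kprobe_perf_func() (the kretprobe side is
analogous; argument layout assumed from the 3.10-era helper):

	-	perf_trace_buf_submit(entry, size, rctx,
	-			      entry->ip, 1, regs, head, NULL);
	+	perf_trace_buf_submit(entry, size, rctx,
	+			      0, 1, regs, head, NULL);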
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
If the ring buffer is disabled and the irqsoff tracer records a trace,
it will clear out its buffer and lose the data it had previously recorded.
Currently there's a callback when writing to the tracing_on file, but if
tracing is disabled via the function tracer trigger, it will not inform
the irqsoff tracer to stop recording.
By using the "mirror" flag (buffer_disabled) in the trace_array, that keeps
track of the status of the trace_array's buffer, it gives the irqsoff
tracer a fast way to know if it should record a new trace or not.
The flag may be a little behind the real state of the buffer, but it
should not affect the trace too much. It's more important for the irqsoff
tracer to be fast.
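The fast path then reduces to a flag test (sketch):

	if (tr->buffer_disabled)
		return;		/* buffer is off, don't record a new trace */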
Reported-by: Dave Jones <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
I think that "ftrace_event_file *trace_probe[]" complicates the
code for no reason, turn it into list_head to simplify the code.
enable_trace_probe() no longer needs synchronize_sched().
This needs the extra sizeof(list_head) memory for every attached
ftrace_event_file, hopefully not a problem in this case.
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
The comment on the soft disable 'disable' case of
__ftrace_event_enable_disable() states that the soft disable bit
should be cleared in that case, but currently only the soft mode bit
is actually cleared.
This essentially leaves the standard non-soft-enable enable/disable
paths as the only way to clear the soft disable flag, but the soft
disable bit should also be cleared when removing a trigger with '!'.
Also, the SOFT_DISABLED bit should never be set if SOFT_MODE is
cleared.
This fixes the above discrepancies.
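A sketch of the added clearing (flag names as in 3.10):

	/* SOFT_DISABLED must never survive SOFT_MODE being cleared. */
	if (!(file->flags & FTRACE_EVENT_FL_SOFT_MODE))
		clear_bit(FTRACE_EVENT_FL_SOFT_DISABLED_BIT, &file->flags);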
Link: http://lkml.kernel.org/r/b9c68dd50bc07019e6c67d3f9b29be4ef1b2badb.1372479499.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Rather than enumerating each permutation, build the enable state
string up from the combination of states. This also allows for the
simpler addition of more states.
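A sketch of the combinatorial build (strings as shown in the event
'enable' file; the exact flag tests are illustrative):

	static const char *const enable_strs[] = { "0", "1", "0*", "1*" };
	int idx = 0;

	if (flags & FTRACE_EVENT_FL_ENABLED)
		idx |= 1;	/* enabled */
	if (flags & (FTRACE_EVENT_FL_SOFT_MODE | FTRACE_EVENT_FL_SOFT_DISABLED))
		idx |= 2;	/* soft mode marker '*' */
	strcpy(buf, enable_strs[idx]);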
Link: http://lkml.kernel.org/r/9aff5af6dee2f5a40ca30df41c39d5f33e998d7a.1372479499.git.tom.zanussi@linux.intel.com
Signed-off-by: Tom Zanussi <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
enable_trace_probe() and disable_trace_probe() should not worry about
serialization, the caller (perf_trace_init or __ftrace_set_clr_event)
holds event_mutex.
They are also called by kprobe_trace_self_tests_init(), but this __init
function can't race with itself or with trace_events.c.
And note that this code depended on event_mutex even before 41a7dd420c,
which introduced probe_enable_lock. In fact it assumes that the caller
kprobe_register() can never race with itself. Otherwise, say, tp->flags
manipulations are racy.
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
perf_trace_buf_prepare() + perf_trace_buf_submit() make no sense
if this task/CPU has no active counters. Change kprobe_perf_func()
and kretprobe_perf_func() to check call->perf_events beforehand
and return if this list is empty.
For example, "perf record -e some_probe -p1". Only /sbin/init will
report, all other threads which hit the same probe will do
perf_trace_buf_prepare/perf_trace_buf_submit just to realize that
nobody wants perf_swevent_event().
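The added check is a one-liner at the top of both functions (sketch):

	head = this_cpu_ptr(call->perf_events);
	if (hlist_empty(head))
		return;		/* no active counters on this cpu */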
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Running the following:
# cd /sys/kernel/debug/tracing
# echo p:i do_sys_open > kprobe_events
# echo p:j schedule >> kprobe_events
# cat kprobe_events
p:kprobes/i do_sys_open
p:kprobes/j schedule
# echo p:i do_sys_open >> kprobe_events
# cat kprobe_events
p:kprobes/j schedule
p:kprobes/i do_sys_open
# ls /sys/kernel/debug/tracing/events/kprobes/
enable filter j
Notice that the 'i' is missing from the kprobes directory.
The console produces:
"Failed to create system directory kprobes"
This is because kprobes passes in an allocated name for the system,
and the ftrace event subsystem saves off that name instead of creating
a duplicate of it. But kprobes may free the system name, making
the pointer to it invalid.
This bug was introduced by 92edca073c37 "tracing: Use direct field, type
and system names" which switched from using kstrdup() on the system name
in favor of just keeping a pointer to it, as the internal ftrace event
system names are static and exist for the life of the computer being booted.
Instead of reverting back to duplicating system names again, we can use
core_kernel_data() to determine if the passed in name was allocated or
static. Then use the MSB of the ref_count as a flag to keep track of
whether the name was allocated or not. That way we still avoid duplicating
strings that will always exist, but still copy the ones that may be freed.
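A sketch of the allocation test and the MSB flag (SYSTEM_FL_FREE_NAME
is the assumed flag name):

	#define SYSTEM_FL_FREE_NAME	(1 << 31)

	if (!core_kernel_data((unsigned long)name)) {
		/* Allocated name: duplicate it and remember to free it. */
		system->name = kstrdup(name, GFP_KERNEL);
		system->ref_count |= SYSTEM_FL_FREE_NAME;
	} else {
		/* Static name: keeping the pointer is safe. */
		system->name = name;
	}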
Cc: [email protected] # 3.10
Reported-by: "zhangwei(Jovi)" <[email protected]>
Reported-by: Masami Hiramatsu <[email protected]>
Tested-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Merge in a recent upstream commit:
c2853c8df57f include/linux/math64.h: add div64_ul()
because:
72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long
relies on it.
[ We don't rebase sched/core for this, because the handful of
followup commits after the broken commit are not behavioral
changes so are unlikely to be needed during bisection. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
0ce6cba357 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
comparing mount options") only updated the remount path but
CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
while mounting an existing hierarchy. As an option mismatch triggers a
warning but doesn't fail the mount without sane_behavior, this only
results in a spurious warning message.
Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
and existing root options.
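The comparison then masks out the internal state bits (sketch):

	/* Only user visible options take part in the comparison. */
	if ((opts.flags ^ root->flags) & CGRP_ROOT_OPTION_MASK)
		pr_warning("cgroup: mount options do not match the existing superblock\n");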
Signed-off-by: Tejun Heo <[email protected]>
|
|
This __put_user() could be used by unprivileged processes to write into
kernel memory. The issue here is that even if copy_siginfo_to_user()
fails, the error code is not checked before __put_user() is executed.
Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
so it has not hit a stable release yet.
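Roughly the shape of the fix (the field access is illustrative):

	ret = copy_siginfo_to_user(uinfo, &info);
	if (ret)
		break;		/* don't reach the __put_user() on failure */
	ret = __put_user(info.si_code, &uinfo->si_code);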
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: Roland McGrath <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Pedro Alves <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Merge branch 'timers-urgent-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"Correct an ordering issue in the tick broadcast code. I really wish
we'd get compensation for pain and suffering for each line of code we
write to work around dysfunctional timer hardware."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Fix tick_broadcast_pending_mask not cleared
|
|
* pm-assorted:
PM / Sleep: Warn about system time after resume with pm_trace
|
|
If the clock was set (stepped), set the action parameter passed to the
functions in the pvclock gtod notifier chain to a non-zero value. This
allows the callee to only do work if the clock was stepped.
This will be used on Xen as the synchronization of the Xen wallclock
to the control domain's (dom0) system time will be done with this
notifier and updating on every timer tick is unnecessary and too
expensive.
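A sketch of a callee using the new semantics (the Xen consumer is the
intended user; names are illustrative):

	static int pvclock_gtod_notify(struct notifier_block *nb,
				       unsigned long was_set, void *priv)
	{
		if (!was_set)
			return NOTIFY_OK;	/* clock merely ticked */

		/* The clock was stepped: resync the Xen wallclock here. */
		return NOTIFY_OK;
	}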
Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Instead of passing multiple bools to timekeeping_update(), define
flags and use a single 'action' parameter. It is then more obvious
what each timekeeping_update() call does.
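A sketch of the flags replacing the bools (names as assumed from this
series):

	#define TK_CLEAR_NTP	(1 << 0)	/* was: bool clearntp */
	#define TK_MIRROR	(1 << 1)	/* was: bool mirror   */

	/* A caller that previously passed (tk, true, true) becomes: */
	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR);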
Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
hrtimers_resume() only reprograms the timers for the current CPU as it
assumes that all other CPUs are offline at this point in the resume
process. If other CPUs are online then their timers will not be
corrected and they may fire at the wrong time.
When running as a Xen guest, this assumption is not true. Non-boot
CPUs are only stopped with IRQs disabled instead of offlining them.
This is a performance optimization as disabling the CPUs would add an
unacceptable amount of additional downtime during a live migration (>
200 ms for a 4 VCPU guest).
hrtimers_resume() cannot call on_each_cpu(retrigger_next_event,...)
as the other CPUs will be stopped with IRQs disabled. Instead, defer
the call to the next softirq.
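The resume path then reduces to (sketch):

	void hrtimers_resume(void)
	{
		/* Retrigger the local CPU directly ... */
		retrigger_next_event(NULL);
		/* ... and let the softirq handle all other CPUs. */
		clock_was_set_delayed();
	}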
[ tglx: Separated the xen change out ]
Signed-off-by: David Vrabel <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: John Stultz <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Direct compare of jiffies related values does not work in the wrap
around case. Replace it with time_is_after_jiffies().
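For illustration, time_is_after_jiffies() expands to
time_before(jiffies, x), whose subtraction trick is wrap safe:

	unsigned long timeout = jiffies + 10 * HZ;

	if (time_is_after_jiffies(timeout)) {
		/* timeout has not expired yet, even across a wrap */
	}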
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Use the already defined macro to pass the function return address.
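The macro in question is presumably _RET_IP_ from linux/kernel.h, so an
open-coded caller (trace_foo() is hypothetical) becomes:

	-	trace_foo((unsigned long)__builtin_return_address(0));
	+	trace_foo(_RET_IP_);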
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Now that we are using runnable load avg in sched balance, we don't
need to keep it under CONFIG_FAIR_GROUP_SCHED.
Also align the code style to #ifdef instead of #if defined() and
reorder the tg output info.
Signed-off-by: Alex Shi <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
* pm-assorted:
PM / QoS: Add pm_qos and dev_pm_qos to events-power.txt
PM / QoS: Add dev_pm_qos_request tracepoints
PM / QoS: Add pm_qos_request tracepoints
PM / QoS: Add pm_qos_update_target/flags tracepoints
PM / QoS: Update Documentation/power/pm_qos_interface.txt
PM / Sleep: Print last wakeup source on failed wakeup_count write
PM / QoS: correct the valid range of pm_qos_class
PM / wakeup: Adjust messaging for wake events during suspend
PM / Runtime: Update .runtime_idle() callback documentation
PM / Runtime: Rework the "runtime idle" helper routine
PM / Hibernate: print physical addresses consistently with other parts of kernel
|
|
* freezer:
af_unix: use freezable blocking calls in read
sigtimedwait: use freezable blocking call
nanosleep: use freezable blocking call
futex: use freezable blocking call
select: use freezable blocking call
epoll: use freezable blocking call
binder: use freezable blocking calls
freezer: add new freezable helpers using freezer_do_not_count()
freezer: convert freezable helpers to static inline where possible
freezer: convert freezable helpers to freezer_do_not_count()
freezer: skip waking up tasks with PF_FREEZER_SKIP set
freezer: shorten freezer sleep time using exponential backoff
lockdep: check that no locks held at freeze time
lockdep: remove task argument from debug_check_no_locks_held
freezer: add unsafe versions of freezable helpers for CIFS
freezer: add unsafe versions of freezable helpers for NFS
|
|
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Randy Dunlap <[email protected]>
|
|
When building imx_v6_v7_defconfig with imx-drm drivers selected as
modules, we get the following build errors:
ERROR: "irq_gc_mask_clr_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_mask_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_ack_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
Export the required functions to avoid this problem.
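Presumably via EXPORT_SYMBOL_GPL, matching the other generic irq chip
helpers:

	EXPORT_SYMBOL_GPL(irq_gc_ack_set_bit);
	EXPORT_SYMBOL_GPL(irq_gc_mask_clr_bit);
	EXPORT_SYMBOL_GPL(irq_gc_mask_set_bit);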
Signed-off-by: Fabio Estevam <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Commit 02725e7471b8 ('genirq: Use irq_get/put functions') inadvertently
changed can_request_irq() to return 0 for IRQs that have
no action. This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none. Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).
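A sketch of the corrected logic (close to the actual hunk):

	int canrequest = 0;

	if (irq_settings_can_request(desc)) {
		if (!desc->action ||
		    irqflags & desc->action->flags & IRQF_SHARED)
			canrequest = 1;
	}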
Reported-by: Bjarni Ingi Gislason <[email protected]>
Tested-by: Bjarni Ingi Gislason <[email protected]> (against 3.2)
Signed-off-by: Ben Hutchings <[email protected]>
Cc: [email protected]
Cc: [email protected] # 2.6.39+
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
This patch alters the format string's width to align all statistics
with the longest struct sched_statistics member name under
/proc/<PID>/sched.
Signed-off-by: Kamalesh Babulal <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
1672d04070 ("cgroup: fix cgroupfs_root early destruction path")
introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
subsys binding on a new root; however, this broke remounts.
cgroup_remount() doesn't allow changing root options via remount and
CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
makes the function reject all remounts.
Fix it by putting the options part in the lower 16 bits of root->flags
and masking the comparisons. While at it, make cgroup_remount() emit
an error message explaining why it's rejecting a remount request, so
that it's less of a mystery.
Signed-off-by: Tejun Heo <[email protected]>
|
|
eb178d06332 ("cgroup: grab cgroup_mutex in
drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
grab cgroup_mutex to make lockdep assertion in for_each_subsys()
happy. Unfortunately, cgroup_remount() calls the function while
holding cgroup_mutex in its failure path leading to the following
deadlock.
# mount -t cgroup -o remount,memory,blkio cgroup blkio
cgroup: option changes via remount are deprecated (pid=525 comm=mount)
=============================================
[ INFO: possible recursive locking detected ]
3.10.0-rc4-work+ #1 Not tainted
---------------------------------------------
mount/525 is trying to acquire lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
but task is already holding lock:
(cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cgroup_mutex);
  lock(cgroup_mutex);

 *** DEADLOCK ***
May be due to missing lock nesting notation
4 locks held by mount/525:
#0: (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
#1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
#2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
#3: (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200
stack backtrace:
CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
Call Trace:
[<ffffffff81c24bb1>] dump_stack+0x19/0x1b
[<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
[<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
[<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
[<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
[<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
[<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
[<ffffffff811e9d41>] do_mount+0x5f1/0xa30
[<ffffffff811ea203>] SyS_mount+0x83/0xc0
[<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b
Fix it by moving the drop_parsed_module_refcounts() invocation outside
cgroup_mutex.
Signed-off-by: Tejun Heo <[email protected]>
|
|
pm_trace uses the system's Real Time Clock (RTC) to save the magic
number. The reason for this is that the RTC is the only reliably
available piece of hardware during resume operations where a value
can be set that will survive a reboot.
The consequence is that after a resume (even if it is successful) your
system clock will have a value corresponding to the magic number
instead of the correct date/time! It is therefore advisable to use
a program like ntp-date or rdate to reset the correct date/time from
an external time source when using this trace option.
There is no run-time message to warn users of the consequences of
enabling pm_trace. Adding a warning message to pm_trace_store()
will serve as a reminder to users to set the system date and time
after resume.
Signed-off-by: Shuah Khan <[email protected]>
Acked-by: Pavel Machek <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Fix spelling of 'calling' in description of se flags in
enqueue_entity().
Signed-off-by: Kamalesh Babulal <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
At present we print per-entity load-tracking statistics for the
cfs_rq of cgroups/runqueues. Given that per-task statistics are
maintained, they can be used to know the contribution made
by the task to its parenting cfs_rq level.
This patch adds per-task load-tracking statistics to /proc/<PID>/sched.
Signed-off-by: Kamalesh Babulal <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Based-on-patch-by: Fengguang Wu <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Remove it, since no one uses it.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Similar to the runnable_load_avg and blocked_load_avg variables, the
long type is enough for removed_load on both 64-bit and 32-bit machines.
This way we avoid the expensive atomic64 operations on 32-bit machines.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Since tg->load_avg is smaller than tg->load_weight, we don't need an
atomic64_t variable for load_avg on 32-bit machines.
The same reasoning applies to cfs_rq->tg_load_contrib.
The atomic_long_t/unsigned long variable types are more efficient and
convenient for them.
Signed-off-by: Alex Shi <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Since the 'u64 runnable_load_avg, blocked_load_avg' members of the cfs_rq
struct are smaller than the 'unsigned long' cfs_rq->load.weight, we don't
need u64 variables to describe them. unsigned long is more efficient and
convenient.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Tested-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Aside from using the runnable load average in the background, move_tasks
is also the key function in load balance. We need to consider the runnable
load average in it in order to make an apples-to-apples load
comparison.
Morten had caught a div u64 bug on ARM, thanks!
Thanks-to: Morten Rasmussen <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
They are the base values in load balance; update them with the rq runnable
load average, then the load balance will consider the runnable load avg
naturally.
We also tried to include blocked_load_avg as cpu load in balancing,
but that caused a 6% kbuild performance drop on every Intel machine, and
aim7/oltp drops on some 4-CPU-socket machines.
Even adding blocked_load_avg only into get_rq_runable_load, hackbench
still drops a little on NHM EX.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
To get the latest runnable info, we need to do this cpuload update after
task_tick.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The woken migrated task will call __synchronize_entity_decay(se) in
migrate_task_rq_fair; then it needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before
update_entity_load_avg, in order to avoid the sleep time being counted
twice for se.avg.load_avg_contrib, in both __synchronize_entity_decay and
update_entity_load_avg.
However, if the sleeping task is woken up on the same cpu, it misses
the last_runnable_update adjustment before update_entity_load_avg(se, 0, 1),
so the sleep time is used twice in both functions. We need to remove
this double sleep time accounting.
Paul also contributed some code comments in this commit.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
We need to initialize the se.avg.{decay_count, load_avg_contrib} of a
newly forked task. Otherwise random values of the above variables cause a
mess when a new task is enqueued:
    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg
and make fork balancing imbalanced due to an incorrect load_avg_contrib.
Furthermore, Morten Rasmussen noticed that some tasks were not launched at
once after being created. So Paul and Peter suggested giving the new task's
runnable avg time an initial value equal to sched_slice().
PeterZ said:
> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.
>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.
Paul also contributed most of the code comments in this commit.
Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Gu Zheng <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
[peterz; added explanation of sched_slice() usage]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The following two variables are only used under CONFIG_SMP, so it's
better to move their definition under CONFIG_SMP too.
    atomic64_t load_avg;
    atomic_t runnable_avg;
Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|