Age | Commit message | Author | Files | Lines |
|
wake_up_new_task()
If a newly created task is selected to go to a different CPU in fork
balance when it wakes up the first time, its load averages should
not be removed from the source CPU since they are never added to
it before. The same also applies to a never-used group entity.
Fix it in remove_entity_load_avg(): when entity's last_update_time
is 0, simply return. This should precisely identify the case in
question, because in other migrations, the last_update_time is set
to 0 after remove_entity_load_avg().
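A slightly simplified sketch of the resulting check (names follow the fair.c code of that era; the tail of the function is elided):
  void remove_entity_load_avg(struct sched_entity *se)
  {
      struct cfs_rq *cfs_rq = cfs_rq_of(se);
      u64 last_update_time;

      /*
       * Newly forked tasks (and never-used group entities) were never
       * attached to this cfs_rq, so there is nothing to remove.
       */
      if (!se->avg.last_update_time)
          return;

      last_update_time = cfs_rq_last_update_time(cfs_rq);

      __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
      atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
      atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
  }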
Reported-by: Steve Muckle <[email protected]>
Signed-off-by: Yuyang Du <[email protected]>
[peterz: cfs_rq_last_update_time]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Patrick Bellasi <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
earliest_dl.next should cache deadline of the earliest ready task that
is also enqueued in the pushable rbtree, as pull algorithm uses this
information to find candidates for migration: if the earliest_dl.next
deadline of source rq is earlier than the earliest_dl.curr deadline of
destination rq, the task from the source rq can be pulled.
However, the current implementation only guarantees that earliest_dl.next is
the deadline of the next ready task instead of the next pushable task;
this can result in taking both rqs' locks and finding nothing to
migrate because of affinity constraints. In addition, the current logic
doesn't update the next candidate for pushing in pick_next_task_dl(),
even if the running task is never eligible.
This patch fixes both problems by updating earliest_dl.next when
pushable dl task is enqueued/dequeued, similar to what we already do for
RT.
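A sketch of the pull-side test this enables (worth_pulling_from() is a made-up name for illustration; the real check lives in the dl pull path):
  static bool worth_pulling_from(struct rq *this_rq, struct rq *src_rq)
  {
      /*
       * Only take the source rq's lock if its earliest *pushable* task
       * beats everything currently queued on this rq.
       */
      return src_rq->dl.earliest_dl.next &&
             dl_time_before(src_rq->dl.earliest_dl.next,
                            this_rq->dl.earliest_dl.curr);
  }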
Tested-by: Luca Abeni <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
new patches
Signed-off-by: Ingo Molnar <[email protected]>
|
|
In the following commit:
7675104990ed ("sched: Implement lockless wake-queues")
we gained lockless wake-queues.
The -RT kernel managed to lock itself up with those. There could be multiple
attempts to enqueue task X for a wakeup _even_ if task X is already
running.
The reason is that task X could be runnable but not yet on a CPU. If the
task performing the wakeup did not leave the CPU, it could perform
multiple wakeups.
With the proper timing, task X could be running while still enqueued for a
wakeup. If this happens while X is performing a fork(), its
child will have a !NULL `wake_q` member copied.
This is not a problem as long as the child task does not participate in
lockless wakeups :)
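A sketch of the kind of fix this implies; the helper name and the hook point (child setup in kernel/fork.c) are assumptions for illustration:
  static void clear_child_wake_q(struct task_struct *tsk)
  {
      /*
       * The parent may currently be linked on somebody's wake queue;
       * the child must not inherit that linkage.
       */
      tsk->wake_q.next = NULL;
  }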
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 7675104990ed ("sched: Implement lockless wake-queues")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
on 32-bit systems:
UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
signed integer overflow:
87950 * 47742 cannot be represented in type 'int'
The most likely effect of this bug is bad load average numbers
resulting in weird scheduling. It's also likely that this can
persist for a longer time - until the system goes idle for
a long time so that all load avg numbers get reset.
[ This is the CFS load average metric, not the procfs output, which
is separate. ]
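A sketch of the widened type in the load-avg update path (fold_removed_load() is a made-up wrapper; field names follow the fair.c code of that era):
  static void fold_removed_load(struct cfs_rq *cfs_rq, struct sched_avg *sa)
  {
      if (atomic_long_read(&cfs_rq->removed_load_avg)) {
          /* s64 instead of long: r * LOAD_AVG_MAX must not overflow on 32-bit. */
          s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);

          sa->load_avg = max_t(long, sa->load_avg - r, 0);
          sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      }
  }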
Signed-off-by: Andrey Ryabinin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/[email protected]
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
There's a race on CPU unplug where we free the swevent hash array
while it can still have events on. This will result in a
use-after-free which is BAD.
Simply do not free the hash array on unplug. This leaves the thing
around and no use-after-free takes place.
When the last swevent dies, we do a for_each_possible_cpu() iteration
anyway to clean these up, at which time we'll free it, so no leakage
will occur.
Reported-by: Sasha Levin <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
|
|
I managed to tickle this warning:
[ 2338.884942] ------------[ cut here ]------------
[ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
[ 2338.900504] Modules linked in:
[ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
[ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 2338.923071] ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
[ 2338.931382] ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
[ 2338.939678] 0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
[ 2338.947987] Call Trace:
[ 2338.950726] [<ffffffff815c680c>] dump_stack+0x4e/0x82
[ 2338.956474] [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
[ 2338.963195] [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
[ 2338.969720] [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
[ 2338.976244] [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
[ 2338.982575] [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
[ 2338.988810] [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
[ 2338.995339] [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
[ 2339.001669] [<ffffffff8121e297>] search_binary_handler+0x97/0x200
[ 2339.008581] [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
[ 2339.016072] [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
[ 2339.021819] [<ffffffff819fc165>] stub_execve+0x5/0x5
[ 2339.027469] [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
[ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
what we expected it to be.
This is because context switches can swap the task_struct::perf_event_ctxp[]
pointer around. Therefore you have to either disable preemption when looking
at current, or hold ctx->lock.
Fix perf_event_enable_on_exec(): it loads current->perf_event_ctxp[]
before disabling interrupts, so a preemption in the right place
can swap contexts around and leave us using the wrong one.
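A heavily abridged sketch of the ordering fix; the point is only that the context pointer is read after interrupts are off:
  static void perf_event_enable_on_exec(int ctxn)
  {
      struct perf_event_context *ctx;
      unsigned long flags;

      local_irq_save(flags);
      /* Only now is it safe to dereference current's context pointer. */
      ctx = current->perf_event_ctxp[ctxn];
      if (!ctx || !ctx->nr_events)
          goto out;

      /* ... walk the events, enable them, reschedule the context ... */
  out:
      local_irq_restore(flags);
  }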
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Potapenko <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kostya Serebryany <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: syzkaller <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two more fixes:
1. The recordmcount change had an output that used sprintf()
(incorrectly) when it should have been an fprintf() to stderr.
2. The printk_formats file could crash if someone added a
trace_printk() in the core kernel, and also added one in a module.
This does not affect production kernels. Only kernels where
developers add trace_printk() for debugging can crash"
* tag 'trace-v4.4-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix setting of start_index in find_next()
ftrace/scripts: Fix incorrect use of sprintf in recordmcount
|
|
Some sysfs attributes in /sys/power/ should really be read-only,
so add support for that, convert those attributes to read-only
and drop the stub .show() routines from them.
Original-by: Sergey Senozhatsky <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
When we do cat /sys/kernel/debug/tracing/printk_formats, we hit kernel
panic at t_show.
general protection fault: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 2957 Comm: sh Tainted: G W O 3.14.55-x86_64-01062-gd4acdc7 #2
RIP: 0010:[<ffffffff811375b2>]
[<ffffffff811375b2>] t_show+0x22/0xe0
RSP: 0000:ffff88002b4ebe80 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
FS: 0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
Call Trace:
[<ffffffff811dc076>] seq_read+0x2f6/0x3e0
[<ffffffff811b749b>] vfs_read+0x9b/0x160
[<ffffffff811b7f69>] SyS_read+0x49/0xb0
[<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
---[ end trace 5bd9eb630614861e ]---
Kernel panic - not syncing: Fatal exception
The first time find_next() calls find_next_mod_format(), it should
iterate trace_bprintk_fmt_list to find the first print format of
the module. However, in the current code, start_index is smaller than *pos
at first, so the code does not iterate the list. The subsequent container_of()
then computes the wrong address from the stale v, which makes mod_fmt a
meaningless object, and the returned mod_fmt->fmt along with it.
This patch fixes it by correcting start_index. After the fix,
the first time find_next_mod_format() is called, start_index is
equal to *pos, and the code iterates trace_bprintk_fmt_list to
get the right module printk format, and thus the right mod_fmt->fmt.
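An abridged sketch of the corrected lookup (section symbol names follow kernel/trace/trace_printk.c):
  static const char **find_next(void *v, loff_t *pos)
  {
      const char **fmt = v;
      int start_index;

      /* Number of built-in (core kernel) formats; module formats follow. */
      start_index = __stop___trace_bprintk_fmt - __start___trace_bprintk_fmt;

      if (*pos < start_index)
          return __start___trace_bprintk_fmt + *pos;

      /* *pos >= start_index: walk the per-module trace_bprintk_fmt_list. */
      return find_next_mod_format(start_index, v, fmt, pos);
  }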
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected] # 3.12+
Fixes: 102c9323c35a8 "tracing: Add __tracepoint_string() to export string pointers"
Signed-off-by: Qiu Peiyang <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Signed-off-by: Al Viro <[email protected]>
|
|
A _lot_ of ->write() instances were open-coding it; some are
converted to memdup_user_nul(), a lot more remain...
Signed-off-by: Al Viro <[email protected]>
|
|
These are noisy during boot and not all that interesting.
Signed-off-by: Tejun Heo <[email protected]>
|
|
Both htab_map_update_elem() and htab_map_delete_elem() can be
called from an eBPF program, and they may be in a kernel hot path,
so it isn't efficient to use a per-hashtable lock in these two
helpers.
The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.
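A sketch of the data-structure change (struct and field names are illustrative, not necessarily those used in the patch):
  struct bucket {
      struct hlist_head head;
      raw_spinlock_t lock;        /* protects only this bucket's hlist */
  };

  struct bpf_htab {
      struct bpf_map map;
      struct bucket *buckets;     /* was: one hlist array plus one global lock */
      atomic_t count;
      u32 n_buckets;
  };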
Signed-off-by: Ming Lei <[email protected]>
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The spinlock is just used for protecting the per-bucket
hlist, so it isn't needed for selecting a bucket.
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In preparation for removing the global per-hashtable lock, the
counter needs to be defined as atomic_t first.
Acked-by: Daniel Borkmann <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The posix_clock_poll function is supposed to return a bit mask of
POLLxxx values. However, in case the hardware has disappeared (due to
hot plugging for example) this code returns -ENODEV in a futile
attempt to throw an error at the file descriptor level. The kernel's
file_operations interface does not accept such error codes from the
poll method. Instead, this function ought to return POLLERR.
The value -ENODEV does, in fact, contain the POLLERR bit (and almost
all the other POLLxxx bits as well), but only by chance. This patch
fixes the code to return a proper bit mask.
Credit goes to Markus Elfring for pointing out the suspicious
signed/unsigned mismatch.
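A sketch of the fixed function (abridged; the get_posix_clock()/put_posix_clock() helpers are assumed from the posix-clock code):
  static unsigned int posix_clock_poll(struct file *fp, poll_table *wait)
  {
      struct posix_clock *clk = get_posix_clock(fp);
      unsigned int result = 0;

      if (!clk)
          return POLLERR; /* was -ENODEV; poll() must return POLLxxx bits */

      if (clk->ops.poll)
          result = clk->ops.poll(clk, fp, wait);

      put_posix_clock(fp);

      return result;
  }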
Reported-by: Markus Elfring <[email protected]>
Signed-off-by: Richard Cochran <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Julia Lawall <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull another round of GIC changes from Marc:
ACPI support for GICv2m
|
|
Make the inode argument of the inode_getsecid hook non-const so that we
can use it to revalidate invalid security labels.
Signed-off-by: Andreas Gruenbacher <[email protected]>
Acked-by: Stephen Smalley <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
|
|
The file tracing_enable is obsolete and does not exist anymore. Replace
the comment that references it with the proper tracing_on file.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Chuyu Hu <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
The start and end variables were only used when ftrace_module_init() was
split up into multiple functions. No need to keep them around after the
merger.
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Simple cleanup. No need for two functions here.
The whole work can simply be done inside 'ftrace_module_init'.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Abel Vesa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
This bpf_verifier_ops structure is never modified, like the other
bpf_verifier_ops structures, so declare it as const.
Done with the help of Coccinelle.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Julia Lawall <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Jiri Olsa noted that the change to replace the control_ops did not update
the trampoline for when running perf on a single CPU and with CONFIG_PREEMPT
disabled (where dynamic ops, like perf, can use trampolines directly). The
result was that the perf function could be called when RCU is not watching,
as well as not handling ftrace_local_disable().
Modify the ftrace_ops_get_func() to also check the RCU and PER_CPU ops flags
and use the recursive function if they are set. The recursive function is
modified to check those flags and execute the appropriate checks if they are
set.
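A sketch of the dispatch decision described above (ftrace_ops_assist_func stands in for the recursion-handling list function; the exact name is assumed):
  ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)
  {
      /*
       * Ops that are not recursion safe, need RCU protection, or do
       * per-cpu disabling must go through the list/assist handler
       * instead of being called directly from a trampoline.
       */
      if (!(ops->flags & FTRACE_OPS_FL_RECURSION_SAFE) ||
          ops->flags & (FTRACE_OPS_FL_RCU | FTRACE_OPS_FL_PER_CPU))
          return ftrace_ops_assist_func;

      return ops->func;
  }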
Link: http://lkml.kernel.org/r/[email protected]
Reported-by: Jiri Olsa <[email protected]>
Patch-fixed-up-by: Jiri Olsa <[email protected]>
Tested-by: Jiri Olsa <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Currently perf has its own list function within the ftrace infrastructure
that seems to be used only to allow for it to have per-cpu disabling as well
as a check to make sure that it's not called while RCU is not watching. It
uses something called the "control_ops" which is used to iterate over ops
under it with the control_list_func().
The problem is that this control_ops and control_list_func unnecessarily
complicates the code. By replacing FTRACE_OPS_FL_CONTROL with two new flags
(FTRACE_OPS_FL_RCU and FTRACE_OPS_FL_PER_CPU) we can remove all the code
that is special with the control ops and add the needed checks within the
generic ftrace_list_func().
Signed-off-by: Steven Rostedt <[email protected]>
|
|
When showing all tramps registered to an ftrace record in the file
enabled_functions, it exits the loop with ops == NULL. But then it is
supposed to show the function on the ops->trampoline and
add_trampoline_func() is called with the given ops. But because ops is now
NULL (to exit the loop), it always shows the static trampoline instead of
the one that is really registered to the record.
The call to add_trampoline_func() that shows the trampoline for the given
ops needs to be called at every iteration.
Fixes: 39daa7b9e895 "ftrace: Show all tramps registered to a record on ftrace_bug()"
Signed-off-by: Steven Rostedt <[email protected]>
|
|
s/ARCH_SUPPORT_FTARCE_OPS/ARCH_SUPPORTS_FTRACE_OPS/
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Li Bin <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
|
|
Conflicts:
drivers/gpu/drm/i915/intel_pm.c
|
|
Add include asm/paravirt.h to cputime.c, as steal_account_process_tick
calls paravirt_steal_clock, which is defined in asm/paravirt.h.
The ifdef CONFIG_PARAVIRT is necessary because not all archs have an
asm/paravirt.h to include.
The reason why cputime.c currently compiles, even though the include of
<asm/paravirt.h> is missing, is that on x86 asm/paravirt.h is pulled in
by one of the other headers included in kernel/sched/cputime.c.
On arm and arm64, where I am about to introduce asm/paravirt.h and
stolen time support, I would get an error without #include <asm/paravirt.h>
in cputime.c.
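In code, the change amounts to the guarded include in kernel/sched/cputime.c:
  #ifdef CONFIG_PARAVIRT
  #include <asm/paravirt.h>
  #endif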
Signed-off-by: Stefano Stabellini <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
|
|
Since there will be several places checking if fwnode.type
is equal FWNODE_IRQCHIP, this patch adds a convenient function
for this purpose.
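A sketch of such a helper (the name and location are assumptions; the check itself is the one described above):
  static inline bool is_fwnode_irqchip(struct fwnode_handle *fwnode)
  {
      return fwnode && fwnode->type == FWNODE_IRQCHIP;
  }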
Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Suravee Suthikulpanit <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
|
|
While reviewing Michael Kerrisk's recent futex manpage update, I noticed
that we allow the FUTEX_CLOCK_REALTIME flag for FUTEX_WAIT_BITSET but
not for FUTEX_WAIT.
FUTEX_WAIT is treated as a simple version of FUTEX_WAIT_BITSET
internally (with a bitmask of FUTEX_BITSET_MATCH_ANY). As such, I cannot
come up with a reason for this exclusion for FUTEX_WAIT.
This change does modify the behavior of the futex syscall, changing a
call with FUTEX_WAIT | FUTEX_CLOCK_REALTIME from returning -ENOSYS to being
equivalent to FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of
FUTEX_BITSET_MATCH_ANY.
Reported-by: Michael Kerrisk <[email protected]>
Signed-off-by: Darren Hart <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Link: http://lkml.kernel.org/r/9f3bdc116d79d23f5ee72ceb9a2a857f5ff8fa29.1450474525.git.dvhart@linux.intel.com
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The out_unlock: label not only drops the locks, it also drops the refcount
on the pi_state. Really intuitive.
Move the label after the put_pi_state() call and use 'break' in the
error handling path of the requeue loop.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
In the error handling cases we neither have pi_state nor a reference
to it. Remove the pointless code.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Documentation of the pi_state refcounting in the requeue code is
nonexistent. Add it.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
free_pi_state() is confusing as it is in fact only freeing/caching the
pi state when the last reference is gone. Rename it to put_pi_state()
which reflects better what it is doing.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it also acquires a refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Darren Hart <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: [email protected]
Cc: Andy Lowe <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
|
|
The Linux kernel already has the concept of IRQ domain, wherein a
component can expose a set of IRQs which are managed by a particular
interrupt controller chip or other subsystem. The PCI driver exposes
the notion of an IRQ domain for Message-Signaled Interrupts (MSI) from
PCI Express devices. This patch exposes the functions which are
necessary for creating an MSI IRQ domain within a module.
[ tglx: Split it into x86 and core irq parts ]
Signed-off-by: Jake Oshins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
The clocksource validation which makes sure that the newly read value
is not smaller than the last value only works if the clocksource mask
is 64bit, i.e. the counter is 64bit wide. But we want to use that
mechanism also for clocksources which are less than 64bit wide.
So instead of checking whether bit 63 is set, we check whether the
most significant bit of the clocksource mask is set in the delta
result. If it is set, we return 0.
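A sketch of the check (clocksource_delta() is the usual helper for this; the debug plumbing around it is elided):
  static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
  {
      u64 ret = (now - last) & mask;

      /*
       * If the most significant bit of the mask shows up in the result,
       * the "delta" is really a negative value: report it as 0.
       */
      return ret & ~(mask >> 1) ? 0 : ret;
  }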
[ tglx: Simplified the implementation, added a comment and massaged
the commit message ]
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Yang Yingliang <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull the MSI wire bridge implementation from Marc Zyngier along with
the first user of it. This is infrastructure to support a wired
interrupt to MSI interrupt bridge. The first user is mbigen, found in
Hisilicon ARM SoCs.
|
|
https://git.linaro.org/people/john.stultz/linux into timers/core
Get the core time(keeping) updates from John Stultz
- NTP robustness tweaks
- Another signed overflow nailed down
- More y2038 changes
- Stop alarmtimer after resume
- MAINTAINERS update
- Selftest fixes
|
|
Currently, panic() and crash_kexec() can be called at the same time.
For example (x86 case):
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
nmi_shootdown_cpus() // stop other CPUs
CPU 1:
panic()
crash_kexec()
mutex_trylock() // failed to acquire
smp_send_stop() // stop other CPUs
infinite loop
If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
fails.
In another case:
CPU 0:
oops_end()
crash_kexec()
mutex_trylock() // acquired
<NMI>
io_check_error()
panic()
crash_kexec()
mutex_trylock() // failed to acquire
infinite loop
Clearly, this is an undesirable result.
To fix this problem, this patch changes crash_kexec() to exclude others
by using the panic_cpu atomic.
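A sketch of the exclusion (abridged; PANIC_CPU_INVALID and the __crash_kexec() split are assumptions about how the series is structured):
  void crash_kexec(struct pt_regs *regs)
  {
      int old_cpu, this_cpu;

      /* Only one CPU may run the panic()/crash_kexec() machinery at a time. */
      this_cpu = raw_smp_processor_id();
      old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
      if (old_cpu != PANIC_CPU_INVALID)
          return;

      __crash_kexec(regs);

      /* Allow a later panic()/crash_kexec() if kexec did not take over. */
      atomic_set(&panic_cpu, PANIC_CPU_INVALID);
  }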
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Martin Schwidefsky <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Minfei Huang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: x86-ml <[email protected]>
Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
sends an NMI IPI to CPUs which haven't called panic() to stop them,
save their register information and do some cleanups for crash dumping.
However, if such a CPU is infinitely looping in NMI context, we fail to
save its register information into the crash dump.
For example, this can happen when unknown NMIs are broadcast to all
CPUs as follows:
CPU 0 CPU 1
=========================== ==========================
receive an unknown NMI
unknown_nmi_error()
panic() receive an unknown NMI
spin_trylock(&panic_lock) unknown_nmi_error()
crash_kexec() panic()
spin_trylock(&panic_lock)
panic_smp_self_stop()
infinite loop
kdump_nmi_shootdown_cpus()
issue NMI IPI -----------> blocked until IRET
infinite loop...
Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
so the NMI is not handled and the callback function to save registers is
never called.
In practice, this can happen on some servers which broadcast NMIs to all
CPUs when the NMI button is pushed.
To save registers in this case, we need to:
a) Return from NMI handler instead of looping infinitely
or
b) Call the callback function directly from the infinite loop
Inherently, a) is risky because NMI is also used to prevent corrupted
data from being propagated to devices. So, we chose b).
This patch does the following:
1. Move the infinite looping of CPUs which haven't called panic() in NMI
context (actually done by panic_smp_self_stop()) outside of panic() to
enable us to refer to pt_regs. Please note that panic_smp_self_stop() is
still used for normal context.
2. Call a callback of kdump_nmi_shootdown_cpus() directly to save
registers and do some cleanups after setting waiting_for_crash_ipi which
is used for counting down the number of CPUs which handled the callback.
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: Dave Young <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: Eric Biederman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gobinda Charan Maji <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: Hidehiro Kawai <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Javi Merino <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: lkml <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Stefan Lippers-Hollmann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Yasuaki Ishimatsu <[email protected]>
Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
If panic on NMI happens just after panic() on the same CPU, panic() is
recursively called. As a result, the kernel stalls after failing to acquire
panic_lock.
To avoid this problem, don't call panic() in NMI context if we've
already entered panic().
For that, introduce nmi_panic() macro to reduce code duplication. In
the case of panic on NMI, don't return from NMI handlers if another CPU
already panicked.
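A sketch of the macro's core logic (PANIC_CPU_INVALID and the exact shape are assumptions; the point is the cmpxchg-based exclusion on panic_cpu):
  #define nmi_panic(fmt, ...)                                             \
  do {                                                                    \
      int __cpu = raw_smp_processor_id();                                 \
                                                                          \
      /* Only the first CPU to claim panic_cpu runs the real panic(). */  \
      if (atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, __cpu) ==         \
          PANIC_CPU_INVALID)                                              \
          panic(fmt, ##__VA_ARGS__);                                      \
  } while (0)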
Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Aaron Tomlin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Chris Metcalf <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Don Zickus <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Gobinda Charan Maji <[email protected]>
Cc: HATAYAMA Daisuke <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Javi Merino <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: lkml <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michal Nazarewicz <[email protected]>
Cc: Nicolas Iooss <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Rasmus Villemoes <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Cc: Vivek Goyal <[email protected]>
Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
[ Cleanup comments, fixup formatting. ]
Signed-off-by: Borislav Petkov <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Back in the days where eBPF (or back then "internal BPF" ;->) was not
exposed to user space, and only the classic BPF programs internally
translated into eBPF programs, we missed the fact that for classic BPF
A and X needed to be cleared. It was fixed back then via 83d5b7ef99c9
("net: filter: initialize A and X registers"), and thus classic BPF
specifics were added to the eBPF interpreter core to work around it.
This added some confusion for JIT developers later on that take the
eBPF interpreter code as an example for deriving their JIT. F.e. in
f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at
least X could leak stack memory. Furthermore, since this is only needed
for classic BPF translations and not for eBPF (verifier takes care
that read access to regs cannot be done uninitialized), more complexity
is added to JITs as they need to determine whether they deal with
migrations or native eBPF where they can just omit clearing A/X in
their prologue and thus reduce image size a bit, see f.e. cde66c2d88da
("s390/bpf: Only clear A and X for converted BPF programs"). In other
cases (x86, arm64), A and X is being cleared in the prologue also for
eBPF case, which is unnecessary.
Let's move this into the BPF migration in bpf_convert_filter() where it
actually belongs as long as the number of eBPF JITs are still few. It
can thus be done generically; allowing us to remove the quirk from
__bpf_prog_run() and to slightly reduce JIT image size in case of eBPF,
while reducing code duplication on this matter in current(/future) eBPF
JITs.
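A sketch of where the clearing ends up: the first two instructions emitted by bpf_convert_filter() (macro names follow the eBPF insn helpers):
  /*
   * Classic BPF expects A and X to start out as zero; emit that once at
   * the head of the converted program instead of in every JIT prologue.
   */
  *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_A, BPF_REG_A);
  *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_X, BPF_REG_X);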
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Reviewed-by: Michael Holzheu <[email protected]>
Tested-by: Michael Holzheu <[email protected]>
Cc: Zi Shen Lim <[email protected]>
Cc: Yang Shi <[email protected]>
Acked-by: Yang Shi <[email protected]>
Acked-by: Zi Shen Lim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Conflicts:
drivers/net/geneve.c
Here we had an overlapping change, where in 'net' the extraneous stats
bump was being removed whilst in 'net-next' the final argument to
udp_tunnel6_xmit_skb() was being changed.
Signed-off-by: David S. Miller <[email protected]>
|
|
The Cavium guys reported a soft lockup on their arm64 machine, caused by
commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"):
mutex_optimistic_spin+0x9c/0x1d0
__mutex_lock_slowpath+0x44/0x158
mutex_lock+0x54/0x58
kernfs_iop_permission+0x38/0x70
__inode_permission+0x88/0xd8
inode_permission+0x30/0x6c
link_path_walk+0x68/0x4d4
path_openat+0xb4/0x2bc
do_filp_open+0x74/0xd0
do_sys_open+0x14c/0x228
SyS_openat+0x3c/0x48
el0_svc_naked+0x24/0x28
This is because in osq_lock we initialise the node for the current CPU:
node->locked = 0;
node->next = NULL;
node->cpu = curr;
and then publish the current CPU in the lock tail:
old = atomic_xchg_acquire(&lock->tail, curr);
Once the update to lock->tail is visible to another CPU, the node is
then live and can be both read and updated by concurrent lockers.
Unfortunately, the ACQUIRE semantics of the xchg operation mean that
there is no guarantee the contents of the node will be visible before
lock tail is updated. This can lead to lock corruption when, for
example, a concurrent locker races to set the next field.
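A sketch of the fix as understood here: publish the node with an operation that also has RELEASE semantics, so the initialisation above cannot be reordered past the tail update:
  node->locked = 0;
  node->next = NULL;
  node->cpu = curr;

  /*
   * Fully ordered xchg: ACQUIRE to pair with the unlock path, RELEASE to
   * publish the node initialisation before other CPUs can observe curr
   * in lock->tail.
   */
  old = atomic_xchg(&lock->tail, curr);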
Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics")
Reported-by: David Daney <[email protected]>
Reported-by: Andrew Pinski <[email protected]>
Tested-by: Andrew Pinski <[email protected]>
Acked-by: Davidlohr Bueso <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>
|
|
It has occasionally been noted that users see
confusing warnings like:
Adjusting tsc more than 11% (5941981 vs 7759439)
We try to limit the maximum total adjustment to 11% (10% tick
adjustment + 0.5% frequency adjustment). But this is done by
bounding the requested adjustment values, and the internal
steering that is done by tracking the error from what was
requested and what was applied, does not have any such limits.
This is usually not problematic, but in some cases there is a risk
that an adjustment could cause the clocksource mult value to
overflow, so it's an indication things are outside of what is
expected.
It turns out most of the reports of this 11% warning are on systems
using chrony, which utilizes the adjtimex() ADJ_TICK interface
(which allows a +-10% adjustment). The original rationale for
ADJ_TICK is unclear to me, but my assumption is that it was originally
added to allow broken systems to get a big constant correction at boot
(see the adjtimex userspace package for an example), which would allow
the system to work with ntpd's 0.5% adjustment limit.
Chrony uses ADJ_TICK to make very aggressive short term corrections
(usually right at startup). These push us close enough to the max
bound that a few late ticks can cause the internal steering to push
past the max adjust value (tripping the warning).
Thus this patch adds some extra logic to enforce the max adjustment
cap in the internal steering.
Note: This has the potential to slow corrections when the ADJ_TICK
value is furthest away from the default value. So it would be good to
get some testing from folks using chrony, to make sure we don't
cause any troubles there.
Cc: Miroslav Lichvar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Tested-by: Miroslav Lichvar <[email protected]>
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
The function second_overflow() uses "unsigned long"
as its input parameter type, which will overflow after
year 2106 on 32-bit systems.
Thus this patch replaces it with the time64_t type.
While the 64-bit division is expensive, "next_ntp_leap_sec"
has been calculated already, so we can just re-use it in the
TIME_INS/DEL cases, allowing one expensive division per
leapsecond instead of re-doing the division once a second after
the leap flag has been set.
Signed-off-by: DengChao <[email protected]>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <[email protected]>
|
|
The type of the static variable "time_reftime" and the call of
get_seconds() in ntp are both not y2038 safe.
So change the type of time_reftime to time64_t and replace
get_seconds with __ktime_get_real_seconds.
The local variable "secs" in ntp_update_offset represents
seconds between now and the last ntp adjustment; it seems impossible
that this time will span more than 68 years, so keep its type as
"long".
Reviewed-by: John Stultz <[email protected]>
Signed-off-by: DengChao <[email protected]>
[jstultz: Tweaked commit message]
Signed-off-by: John Stultz <[email protected]>
|
|
In order to fix Y2038 issues in the ntp code we will need to replace
get_seconds() with ktime_get_real_seconds(), but as the ntp code uses
the timekeeping lock which is also used by ktime_get_real_seconds(),
we need a version without locking.
Add a new function __ktime_get_real_seconds() in timekeeping to
do this.
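A sketch of the lockless variant (tk_core and xtime_sec follow the timekeeping internals of that era):
  /*
   * Same as ktime_get_real_seconds(), but without the seqcount protection;
   * only safe while the timekeeping lock is already held (as in the ntp code).
   */
  time64_t __ktime_get_real_seconds(void)
  {
      struct timekeeper *tk = &tk_core.timekeeper;

      return tk->xtime_sec;
  }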
Reviewed-by: John Stultz <[email protected]>
Signed-off-by: DengChao <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|