|
Set the force migration flag when migrating interrupts away from an
outgoing CPU.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Interrupts which cannot be migrated in process context need to be masked
before the affinity is changed forcefully.
Add support for that. It will be compiled out for architectures which do not
have this x86-specific issue.
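A minimal sketch of what the conditional masking around the forced move could
look like (helper and variable names are approximations, not the exact
upstream code):
/* Mask interrupts which cannot be moved in process context */
bool maskchip = chip->irq_mask && !irq_can_move_pcntxt(d) && !irqd_irq_masked(d);

if (maskchip)
        chip->irq_mask(d);

err = irq_do_set_affinity(d, affinity, false);

if (maskchip && chip->irq_unmask)
        chip->irq_unmask(d);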
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
In order to move x86 to the generic hotplug migration code, add support for
cleaning up the move-in-progress bits.
On architectures which do not have this x86-specific (mis)feature enabled,
this is optimized out by the compiler.
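As a rough illustration of the compile-out, the cleanup hook can be an empty
inline stub when the pending-move machinery is not configured (the config
symbol and placement are assumptions):
#ifdef CONFIG_GENERIC_PENDING_IRQ
/* The architecture (x86) provides the real move-in-progress cleanup */
void irq_force_complete_move(struct irq_desc *desc);
#else
/* Empty stub: calls to it are optimized out by the compiler */
static inline void irq_force_complete_move(struct irq_desc *desc) { }
#endif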
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Interrupts which are shut down are also subject to a migration attempt. That's
pointless because the interrupt cannot fire and the next startup will move
it to the proper place anyway.
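A minimal sketch of the early bail-out in the migration path (the exact
predicate is an approximation):
/* A shut down interrupt cannot fire; the next startup sets the affinity */
if (!irqd_is_started(d))
        return false;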
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Move the checks for a valid irq chip and the irq_set_affinity() callback
right in front of the whole migration logic. There is no point in doing a
gazillion other things when the interrupt cannot be migrated at all.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
In case the affinity of an interrupt was broken, a printk is emitted.
But if the affinity cannot be set at all due to a missing
irq_set_affinity() callback or due to a failing callback, the message is
still printed, preceded by a warning/error.
That makes no sense whatsoever.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
This is called from stop_machine() with interrupts disabled. No point in
disabling them some more.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
So that the affinity code can reuse them.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
The startup vs. setaffinity ordering of interrupts depends on the
IRQF_NOAUTOEN flag. Chained interrupts are not getting any affinity
assignment at all.
A regular interrupt is started up and then the affinity is set. An
IRQF_NOAUTOEN marked interrupt is not started up, but the affinity is set
nevertheless.
Move the affinity setup to irq_startup() so the ordering is always the same
and chained interrupts get the proper default affinity assigned as well.
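A sketch of the resulting ordering; the __irq_startup() helper split is
assumed here for illustration:
int irq_startup(struct irq_desc *desc, bool resend)
{
        int ret = __irq_startup(desc);          /* chip startup / enable */

        /* One place for the default affinity, regardless of IRQF_NOAUTOEN */
        irq_setup_affinity(desc);

        if (resend)
                check_irq_resend(desc);
        return ret;
}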
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Rename it with a proper irq_ prefix and make it available for other files
in the core code. Preparatory patch for moving the irq affinity setup
around.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
There is no point in having this alloc/free dance of cpumasks. Provide a static
mask for setup_affinity() and protect it properly.
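A minimal sketch of the static mask plus lock approach (simplified; the real
setup_affinity() also handles the NUMA node preference):
static int setup_affinity(struct irq_desc *desc)
{
        static DEFINE_RAW_SPINLOCK(mask_lock);
        static struct cpumask mask;             /* protected by mask_lock */
        int ret;

        raw_spin_lock(&mask_lock);
        cpumask_and(&mask, cpu_online_mask, irq_default_affinity);
        ret = irq_do_set_affinity(&desc->irq_data, &mask, false);
        raw_spin_unlock(&mask_lock);
        return ret;
}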
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
If a CPU goes offline, the interrupts are migrated away, but an eventually
pending interrupt move, which has not yet been made effective, is kept
pending even if the outgoing CPU is the sole target of the pending affinity
mask. What's worse is that the pending affinity mask is discarded even if
it would contain a valid subset of the online CPUs.
Implement a helper function which makes it possible to avoid these issues.
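Roughly, such a helper could look like this (a sketch; helper and field names
are approximations of the generic pending-move machinery):
bool irq_fixup_move_pending(struct irq_desc *desc, bool force_clear)
{
        struct irq_data *data = irq_desc_get_irq_data(desc);

        if (!irqd_is_setaffinity_pending(data))
                return false;

        /* A pending mask without any online CPU left is useless: drop it */
        if (cpumask_any_and(desc->pending_mask, cpu_online_mask) >= nr_cpu_ids) {
                irqd_clr_move_pending(data);
                return false;
        }
        if (force_clear)
                irqd_clr_move_pending(data);
        return true;
}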
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Debugging (hierarchical) interrupt domains is tedious as there is no
information about the hierarchy and no information about states of
interrupts in the various domain levels.
Add a debugfs directory 'irq' and subdirectories 'domains' and 'irqs'.
The domains directory contains the domain files. The content is information
about the domain. If the domain is part of a hierarchy then the parent
domains are printed as well.
# ls /sys/kernel/debug/irq/domains/
default INTEL-IR-2 INTEL-IR-MSI-2 IO-APIC-IR-2 PCI-MSI
DMAR-MSI INTEL-IR-3 INTEL-IR-MSI-3 IO-APIC-IR-3 unknown-1
INTEL-IR-0 INTEL-IR-MSI-0 IO-APIC-IR-0 IO-APIC-IR-4 VECTOR
INTEL-IR-1 INTEL-IR-MSI-1 IO-APIC-IR-1 PCI-HT
# cat /sys/kernel/debug/irq/domains/VECTOR
name: VECTOR
size: 0
mapped: 216
flags: 0x00000041
# cat /sys/kernel/debug/irq/domains/IO-APIC-IR-0
name: IO-APIC-IR-0
size: 24
mapped: 19
flags: 0x00000041
parent: INTEL-IR-3
name: INTEL-IR-3
size: 65536
mapped: 167
flags: 0x00000041
parent: VECTOR
name: VECTOR
size: 0
mapped: 216
flags: 0x00000041
Unfortunately there is no per cpu information about the VECTOR domain (yet).
The irqs directory contains detailed information about mapped interrupts.
# cat /sys/kernel/debug/irq/irqs/3
handler: handle_edge_irq
status: 0x00004000
istate: 0x00000000
ddepth: 1
wdepth: 0
dstate: 0x01018000
IRQD_IRQ_DISABLED
IRQD_SINGLE_TARGET
IRQD_MOVE_PCNTXT
node: 0
affinity: 0-143
effectiv: 0
pending:
domain: IO-APIC-IR-0
hwirq: 0x3
chip: IR-IO-APIC
flags: 0x10
IRQCHIP_SKIP_SET_WAKE
parent:
domain: INTEL-IR-3
hwirq: 0x20000
chip: INTEL-IR
flags: 0x0
parent:
domain: VECTOR
hwirq: 0x3
chip: APIC
flags: 0x0
This was developed to simplify the debugging of the managed affinity
changes.
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
Add a map counter instead of counting radix tree entries for
diagnosis. That also gives correct information for linear domains.
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
In order to provide proper debug interface it's required to have domain
names available when the domain is added. Non fwnode based architectures
like x86 have no way to do so.
It's not possible to use domain ops or host data for this as domain ops
might be the same for several instances, but the names have to be unique.
Extend the irqchip fwnode to allow transporting the domain name. If no node
is supplied, create an 'unknown-N' placeholder.
Warn if an invalid node is supplied and treat it like no node. This happens
e.g. with i2c devices on x86 which hand in an ACPI type node which has no
interface for retrieving the name.
[ Folded a fix from Marc to make DT name parsing work ]
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Prevent overwriting an already assigned domain name. Remove the extra check
for chip->name, because if domain->name is NULL overwriting it with NULL is
not a problem.
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Keith Busch <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Merge branch 'uuid-types' from git://git.infradead.org/users/hch/uuid.git
to satisfy dependencies.
|
|
Although idle load balancing obviously only concerns idle CPUs, it can
be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid
of an idle load balancing duty once a tick fires while it runs a task
and this can take a while on a nohz_full CPU.
We could fix that and escape the idle load balancing duty from the very
idle exit path but that would bring unnecessary overhead. Let's just not
bother and leave that job to the housekeeping CPUs (those outside the nohz_full
range). The nohz_full CPUs simply don't want any disturbance.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The idle load balancing registration path assumes that we only stop the
tick when the CPU is idle, ignoring the nohz full case. As a result, a
nohz full CPU that is running a task may be chosen to perform idle load
balancing.
Let's make sure that only CPUs in dynticks idle mode can be picked as
idle load balancers.
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The loadavg naming code still assumes that nohz == idle whereas the code
actually handles both nohz idle and nohz full well.
So let's fix the naming according to what the code actually does, to
avoid confusing the reader.
Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Two entries were added at the same time to the IFLA
policy table, whilst parallel bug fixes to decnet
routing dst handling overlapped with the dst gc removal
in net-next.
Signed-off-by: David S. Miller <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching fix from Jiri Kosina:
"Fix the way how livepatches are being stacked with respect to RCU,
from Petr Mladek"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Fix stacking of patches with respect to RCU
|
|
Provide a resource managed variant of irq_setup_generic_chip().
Signed-off-by: Bartosz Golaszewski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Jonathan Corbet <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Provide a resource managed variant of irq_alloc_generic_chip().
Signed-off-by: Bartosz Golaszewski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Jonathan Corbet <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
This function will be used in the devres variant of
irq_alloc_generic_chip().
Signed-off-by: Bartosz Golaszewski <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: Marc Zyngier <[email protected]>
Cc: [email protected]
Cc: Jonathan Corbet <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
If the event for which an AUX area is about to be allocated does
not support setting up an AUX area, rb_alloc_aux() returns -ENOTSUPP.
This error condition is returned unfiltered to user space,
and, for example, the perf tool fails with:
failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22)
This error can be easily seen with "perf record -m 128,256 -e cpu-clock".
The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP
error code shall be only used within the kernel. So the correct error
code would then be -EOPNOTSUPP.
With this commit, the perf tool then reports:
failed to mmap with 95 (Operation not supported)
which is clearer.
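The fix boils down to returning the user-visible error code from the AUX
allocation path; a minimal sketch of the changed check in rb_alloc_aux():
if (!has_aux(event))
        return -EOPNOTSUPP;     /* was: return -ENOTSUPP; */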
Signed-off-by: Hendrik Brueckner <[email protected]>
Acked-by: Alexander Shishkin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Pu Hou <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Thomas-Mich Richter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
https://git.linaro.org/people/john.stultz/linux into timers/core
Merge time(keeping) updates from John Stultz:
"Just a small set of changes, the biggest changes being the MONOTONIC_RAW
handling cleanup, and a new kselftest from Miroslav. Also a clear
warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc
and ia64."
|
|
Pick up dependent changes.
|
|
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago
to allow a transition from the old vsyscall implementations to
the new method (which simplified internal accounting and made
timekeeping more precise).
However, PPC and IA64 have yet to make the transition, despite
me sending test patches in some cases to try to help it along.
http://patches.linaro.org/patch/30501/
http://patches.linaro.org/patch/35412/
If it's helpful, my last pass at the patches can be found here:
https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup
So I think it's time to set a deadline and make it clear this
is going away. This patch therefore adds warnings about this
functionality being dropped, likely in v4.15.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Miroslav Lichvar <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Anton Blanchard <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Fenghua Yu <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
remove the duplicative tk->raw_time.tv_nsec, which can be
stored in tk->tkr_raw.xtime_nsec (similarly to how it's handled
for monotonic time).
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Miroslav Lichvar <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Cc: Stephen Boyd <[email protected]>
Cc: Kevin Brodsky <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Daniel Mentz <[email protected]>
Tested-by: Daniel Mentz <[email protected]>
Signed-off-by: John Stultz <[email protected]>
|
|
Get upstream changes so pending patches won't conflict.
|
|
The expiry time of a posix cpu timer is supplied through sys_timer_set()
via a struct timespec. The timespec is validated for correctness.
In the actual set timer implementation the timespec is converted to a
scalar nanoseconds value. If the tv_sec part of the timespec is large
enough, the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64 bits.
Mitigate that by using the timespec_to_ktime() conversion function, which
checks the tv_sec part for a potential mult overflow and clamps the result
to KTIME_MAX, which is about 292 years.
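For illustration, the difference between the naive conversion and the clamped
one (a sketch; variable names are only for the example):
/* Naive conversion: (sec * NSEC_PER_SEC) overflows s64 past ~292 years */
s64 ns = (s64)ts.tv_sec * NSEC_PER_SEC;

/* Clamped conversion: ktime_set() saturates at KTIME_MAX instead */
s64 expires = ktime_to_ns(timespec_to_ktime(ts));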
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Xishi Qiu <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
The expiry time of a itimer is supplied through sys_setitimer() via a
struct timeval. The timeval is validated for correctness.
In the actual set timer implementation the timeval is converted to a
scalar nanoseconds value. If the tv_sec part of the timeval is large
enough, the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64 bits.
Mitigate that by using the timeval_to_ktime() conversion function, which
checks the tv_sec part for a potential mult overflow and clamps the result
to KTIME_MAX, which is about 292 years.
Reported-by: Xishi Qiu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
|
|
Signed-off-by: Peter Meerwald-Stadler <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: John Stultz <[email protected]>
Cc: [email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
|
|
This function was introduced by:
150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
... to allow easier usage of task_rcu_dereference(), however no users
were ever added. Drop the helper.
Signed-off-by: Davidlohr Bueso <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Conflicts:
kernel/sched/Makefile
Pick up the waitqueue related renames - it didn't get much feedback,
so it appears to be uncontroversial. Famous last words? ;-)
Signed-off-by: Ingo Molnar <[email protected]>
|
|
If we set a next or last buddy for a se that is not on_rq, we will
end up taking a NULL pointer dereference in wakeup_preempt_entity
via pick_next_task_fair.
Detect when we would be about to do that, throw a warning and
then refuse to actually set it.
This has been suggested at least twice:
https://marc.info/?l=linux-kernel&m=146651668921468&w=2
https://lkml.org/lkml/2016/6/16/663
I recently had to debug a problem with these (we hadn't backported
Konstantin's patches in this area) and this would have saved a lot
of time/pain.
Just do it.
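The guard is essentially this (a sketch of the buddy setter with the new
check; the last-buddy path would get the same treatment):
static void set_next_buddy(struct sched_entity *se)
{
        if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                return;

        for_each_sched_entity(se) {
                /* Warn and refuse to queue a buddy that is not on_rq */
                if (SCHED_WARN_ON(!se->on_rq))
                        return;
                cfs_rq_of(se)->next = se;
        }
}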
Signed-off-by: Daniel Axtens <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
This definition of SCHED_WARN_ON():
#define SCHED_WARN_ON(x) ((void)(x))
is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
has no return value, so it cannot be used in conditionals.
Fix it.
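A sketch of the fixed definition, which yields a usable boolean in both
configurations:
#ifdef CONFIG_SCHED_DEBUG
# define SCHED_WARN_ON(x)       WARN_ONCE(x, #x)
#else
# define SCHED_WARN_ON(x)       ({ (void)(x), false; })
#endif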
Cc: Daniel Axtens <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.
Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.
To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:
struct wait_queue_head::task_list => ::head
struct wait_queue_entry::task_list => ::entry
For example, this code:
rqw->wait.task_list.next != &wait->task_list
... was pretty unclear (to me) what it's doing, while now it's written this way:
rqw->wait.head.next != &wait->entry
... which makes it pretty clear that we are iterating a list until we see the head.
Other examples are:
list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
list_for_each_entry(wq, &fence->wait.task_list, task_list) {
... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:
list_for_each_entry_safe(pos, next, &x->head, entry) {
list_for_each_entry(wq, &fence->wait.head, entry) {
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The key hashed waitqueue data structures and their initialization
were done in the main scheduler file for no good reason; move them
to sched/wait_bit.c instead.
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The wait_bit*() types and APIs are mixed into wait.h, but they
are a pretty orthogonal extension of wait-queues.
Furthermore, only about 50 kernel files use these APIs, while
over 1000 use the regular wait-queue functionality.
So clean up the main wait.h by moving the wait-bit functionality
out of it, into a separate .h and .c file:
include/linux/wait_bit.h for types and APIs
kernel/sched/wait_bit.c for the implementation
Update all header dependencies.
This reduces the size of wait.h rather significantly, by about 30%.
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
So wait-bit-queue head variables are often named:
struct wait_bit_queue *q
... which is a bit ambiguous and super confusing, because
they clearly suggest wait-queue head semantics and behavior
(they rhyme with the old wait_queue_t *q naming), while they
are extended wait-queue _entries_, not heads!
They are misnomers in two ways:
- the 'wait_bit_queue' leaves open the question of whether
it's an entry or a head
- the 'q' parameter and local variable naming falsely implies
that it's a 'queue' - while it's an entry.
This resulted in sometimes confusing cases such as:
finish_wait(wq, &q->wait);
where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
So improve this all by standardizing wait-bit-queue nomenclature
similar to wait-queue head naming:
struct wait_bit_queue => struct wait_bit_queue_entry
q => wbq_entry
Which makes it all much clearer:
struct wait_bit_queue_entry *wbq_entry
... and turns the former confusing piece of code into:
finish_wait(wq_head, &wbq_entry->wq_entry);
which IMHO makes it apparently clear what we are doing,
without having to analyze the context of the code: we are
adding a wait-queue entry to a regular wait-queue head,
which entry is embedded in a wait-bit-queue entry.
I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
in field and local variable names is too long, so hopefully it's
clear enough that 'wq_' prefixes stand for wait-queues, while
'wbq_' prefixes stand for wait-bit-queues.
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
name it as a wait-queue entry.
Propagate it to a couple of usage sites where the wait-bit-queue internals
are exposed.
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
The wait-queue head parameters and variables are named in a
couple of ways, we have the following variants currently:
wait_queue_head_t *q
wait_queue_head_t *wq
wait_queue_head_t *head
In particular the 'wq' naming is ambiguous as to whether it's
a wait-queue head or an entry name - as entries were often named 'wait'.
( Not to mention the confusion of any readers coming over from
workqueue-land. )
Standardize all this around a single, unambiguous parameter and
variable name:
struct wait_queue_head *wq_head
which is easy to grep for and also rhymes nicely with the wait-queue
entry naming:
struct wait_queue_entry *wq_entry
Also rename:
struct __wait_queue_head => struct wait_queue_head
... and use this struct type to migrate from typedefs usage to 'struct'
usage, which is more in line with existing kernel practices.
Don't touch any external users and preserve the main wait_queue_head_t
typedef.
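After this change the head type reads roughly as follows (the ::task_list
member is renamed to ::head by a separate patch in this series):
struct wait_queue_head {
        spinlock_t              lock;
        struct list_head        task_list;
};
typedef struct wait_queue_head wait_queue_head_t;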
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
So the various wait-queue entry variables in include/linux/wait.h
and kernel/sched/wait.c are named in a colorfully inconsistent
way:
wait_queue_entry_t *wait
wait_queue_entry_t *__wait (even in plain C code!)
wait_queue_entry_t *q (!)
wait_queue_entry_t *new (making anyone who knows C++ cringe)
wait_queue_entry_t *old
I think part of the reason for the inconsistency is the constant
apparent confusion about what a wait queue 'head' versus 'entry' is.
( Some of the documentation talks about a 'wait descriptor', which is
the wait-queue entry itself - further adding to the confusion. )
The most common name is 'wait', but that in itself is somewhat
ambiguous as well, as it does not really make it clear whether
it's a wait-queue entry or head.
To improve all this name the wait-queue entry structure parameters
and variables consistently and push through this naming into all
the wait.h and wait.c code:
struct wait_queue_entry *wq_entry
The 'wq_' prefix makes it easy to grep for, and we also use the
opportunity to move away from the typedef to a plain 'struct' naming:
in the kernel we typically reserve typedefs for cases where a
C structure is really small and somewhat opaque - such as pte_t.
wait-queue entries are neither small nor opaque, so use the more
standard 'struct xxx_entry' list management code nomenclature instead.
( We don't touch external users, and we preserve the typedef as well
for actual wait-queue users, to reduce unnecessary churn. )
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
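At this point the renamed type looks roughly like this (the ::task_list member
keeps its old name here and is renamed later in the series):
struct wait_queue_entry {
        unsigned int            flags;
        void                    *private;
        wait_queue_func_t       func;
        struct list_head        task_list;
};
typedef struct wait_queue_entry wait_queue_entry_t;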
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
pi_mutex isn't supposed to be tracked by lockdep, but just
passing NULLs for name and key will cause lockdep to spew a
warning and die, which is not what we want it to do.
Skip lockdep initialization if the caller passed NULLs for
name and key, suggesting such initialization isn't desired.
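A sketch of the adjusted initializer (simplified; only the lockdep
registration becomes conditional):
void __rt_mutex_init(struct rt_mutex *lock, const char *name,
                     struct lock_class_key *key)
{
        lock->owner = NULL;
        raw_spin_lock_init(&lock->wait_lock);
        /* ... waiter tree initialization ... */

        /* Only register with lockdep when the caller actually asked for it */
        if (name && key)
                debug_rt_mutex_init(lock, name, key);
}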
Signed-off-by: Sasha Levin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: f5694788ad8d ("rt_mutex: Add lockdep annotations")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
|
|
Signed-off-by: Ingo Molnar <[email protected]>
|
|
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for safe
access to and manipulation of the list of patches that modify the same function.
In particular, it is the variable func_stack that is accessible from the ftrace
handler via struct ftrace_ops and klp_ops.
Of course, it also synchronizes some states of the patch on the top of the
stack, e.g. func->transition in klp_ftrace_handler().
At the same time, this mechanism guards also the manipulation of
task->patch_state. It is modified according to the state of the transition and
the state of the process.
Now, all this works well as long as RCU works well. Sadly livepatching might
get into some corner cases where this is not true. For example, RCU is not
watching when rcu_read_lock() is taken in idle threads. This is because they
might sleep and prevent reaching the grace period for too long.
There are ways to make RCU watch even in idle threads, see
rcu_irq_enter(). But there is a small problematic location inside the RCU
infrastructure where even this does not work.
This small problematic location can be detected either before calling
rcu_irq_enter() by rcu_irq_enter_disabled() or later by rcu_is_watching().
Sadly, there is no safe way to handle it. Once we detect that RCU was not
watching, we might see an inconsistent state of the function stack and the
related variables in klp_ftrace_handler(). Then we could make a wrong decision,
use an incompatible implementation of the function and break the consistency
of the system. We could warn but we could not avoid the damage.
Fortunately, ftrace has similar problems and they seem to be solved well there.
It uses a heavyweight implementation of some RCU operations. In particular, it
replaces:
+ rcu_read_lock() with preempt_disable_notrace()
+ rcu_read_unlock() with preempt_enable_notrace()
+ synchronize_rcu() with schedule_on_each_cpu(sync_work)
My understanding is that this is an RCU implementation from the stone age. It
meets the core RCU requirements but it is rather inefficient. In particular,
it does not allow batching or speeding up the synchronize calls.
On the other hand, it is trivial. It allows safely tracing and/or
livepatching even the RCU core infrastructure. And the efficiency is not a
big issue because using ftrace or livepatches on production systems is a rare
operation. The safety is much more important than a negligible extra load.
Note that the alternative implementation follows the RCU principles. Therefore,
we could and actually must use list_*_rcu() variants when manipulating the
func_stack. These functions allow accessing the pointers in the right
order and with the right barriers. But they do not use any other
information that would be set only by rcu_read_lock().
Also note that there are actually two problems solved in ftrace:
First, it cares about the consistency of RCU read sections. It is solved
in the way described and used in this patch.
Second, ftrace needs to make sure that nobody is inside the dynamic trampoline
when it is being freed. For this, it also calls synchronize_rcu_tasks() in
preemptive kernel in ftrace_shutdown().
Livepatch has a similar problem but it is solved by ftrace for free.
klp_ftrace_handler() is a good guy and never sleeps. In addition, it is
registered with FTRACE_OPS_FL_DYNAMIC, which causes
unregister_ftrace_function() to call:
* schedule_on_each_cpu(ftrace_sync) - always
* synchronize_rcu_tasks() - in preemptive kernel
The effect is that nobody is inside either the dynamic trampoline or
the ftrace handler after unregister_ftrace_function() returns.
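The resulting handler-side pattern looks roughly like this (a sketch, omitting
the transition handling):
static void notrace klp_ftrace_handler(unsigned long ip,
                                       unsigned long parent_ip,
                                       struct ftrace_ops *fops,
                                       struct pt_regs *regs)
{
        struct klp_ops *ops = container_of(fops, struct klp_ops, fops);
        struct klp_func *func;

        /* Heavyweight RCU read side: safe even when RCU is not watching */
        preempt_disable_notrace();
        func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
                                      stack_node);
        if (func)
                klp_arch_set_pc(regs, (unsigned long)func->new_func);
        preempt_enable_notrace();
}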
[[email protected]: reformat changelog, fix comment]
Signed-off-by: Petr Mladek <[email protected]>
Acked-by: Josh Poimboeuf <[email protected]>
Acked-by: Miroslav Benes <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
|