path: root/kernel
2017-06-22  genirq/cpuhotplug: Set force affinity flag on hotplug migration  (Thomas Gleixner; 1 file, -1/+1)
Set the force migration flag when migrating interrupts away from an outgoing CPU. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/cpuhotplug: Add support for conditional masking  (Thomas Gleixner; 1 file, -1/+10)
Interrupts which cannot be migrated in process context need to be masked before the affinity is changed forcefully. Add support for that. It will be compiled out for architectures which do not have this x86-specific issue. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
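A minimal sketch of the idea; the helper names are modelled on the genirq internals rather than copied from the patch itself:

    /* Sketch: mask the chip around a forced affinity change when the
     * interrupt cannot be moved safely in process context. */
    static void migrate_one_irq(struct irq_desc *desc)
    {
    	struct irq_data *d = irq_desc_get_irq_data(desc);
    	bool maskchip = !irq_can_move_pcntxt(d) && !irqd_irq_masked(d);

    	if (maskchip && d->chip->irq_mask)
    		d->chip->irq_mask(d);

    	irq_do_set_affinity(d, cpu_online_mask, true);

    	/* Restore the mask state after the forced move */
    	if (maskchip && d->chip->irq_unmask)
    		d->chip->irq_unmask(d);
    }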
2017-06-22  genirq/cpuhotplug: Add support for cleaning up move in progress  (Thomas Gleixner; 2 files, -3/+35)
In order to move x86 to the generic hotplug migration code, add support for cleaning up the move-in-progress bits. On architectures which do not enable this x86-specific (mis)feature, this is optimized out by the compiler. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/cpuhotplug: Do not migrate shutdown irqs  (Thomas Gleixner; 1 file, -3/+8)
Interrupts which are shut down are currently migrated as well. That's pointless because the interrupt cannot fire and the next startup will move it to the proper place anyway. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/cpuhotplug: Reorder check logic  (Thomas Gleixner; 1 file, -16/+20)
Move the checks for a valid irq chip and the irq_set_affinity() callback right in front of the whole migration logic. There is no point in doing a gazillion other things when the interrupt cannot be migrated at all. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/cpuhotplug: Don't claim success on error  (Thomas Gleixner; 1 file, -1/+4)
In case the affinity of an interrupt was broken, a printk is emitted. But if the affinity cannot be set at all, due to a missing irq_set_affinity() callback or due to a failing callback, the message is still printed, preceded by a warning/error. That makes no sense whatsoever. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/cpuhotplug: Remove irq disabling logic  (Thomas Gleixner; 1 file, -8/+4)
This is called from stop_machine() with interrupts disabled. No point in disabling them some more. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq: Move pending helpers to internal.h  (Christoph Hellwig; 2 files, -28/+38)
So that the affinity code can reuse them. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq: Move initial affinity setup to irq_startup()  (Thomas Gleixner; 2 files, -9/+8)
The startup vs. setaffinity ordering of interrupts depends on the IRQF_NOAUTOEN flag. Chained interrupts are not getting any affinity assignment at all. A regular interrupt is started up and then the affinity is set. An IRQF_NOAUTOEN marked interrupt is not started up, but the affinity is set nevertheless. Move the affinity setup to irq_startup() so the ordering is always the same and chained interrupts get the proper default affinity assigned as well. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
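A simplified sketch of the resulting flow, assuming irq_setup_affinity() is the renamed helper from the preparatory patches below (the real function does more bookkeeping):

    int irq_startup(struct irq_desc *desc, bool resend)
    {
    	int ret = 0;

    	irq_state_clr_disabled(desc);
    	desc->depth = 0;

    	irq_domain_activate_irq(&desc->irq_data);
    	if (desc->irq_data.chip->irq_startup) {
    		ret = desc->irq_data.chip->irq_startup(&desc->irq_data);
    		irq_state_clr_masked(desc);
    	} else {
    		irq_enable(desc);
    	}
    	/* Affinity is now set up here, uniformly for every startup path */
    	irq_setup_affinity(desc);
    	if (resend)
    		check_irq_resend(desc);
    	return ret;
    }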
2017-06-22  genirq: Rename setup_affinity() to irq_setup_affinity()  (Thomas Gleixner; 2 files, -6/+7)
Rename it with a proper irq_ prefix and make it available for other files in the core code. Preparatory patch for moving the irq affinity setup around. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq: Remove mask argument from setup_affinity()  (Thomas Gleixner; 3 files, -34/+29)
There is no point in this alloc/free dance of cpumasks. Provide a static mask for setup_affinity() and protect it properly. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq: Provide irq_fixup_move_pending()  (Thomas Gleixner; 1 file, -0/+30)
If a CPU goes offline, the interrupts are migrated away, but an eventually pending interrupt move, which has not yet been made effective, is kept pending even if the outgoing CPU is the sole target of the pending affinity mask. What's worse: the pending affinity mask is discarded even if it contains a valid subset of the online CPUs. Implement a helper function which avoids these issues. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Marc Zyngier <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
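A sketch of what such a helper can look like; the names follow the kernel's genirq conventions, but treat the body as an illustration rather than the literal patch:

    bool irq_fixup_move_pending(struct irq_desc *desc, bool force_clear)
    {
    	struct irq_data *data = irq_desc_get_irq_data(desc);

    	if (!irqd_is_setaffinity_pending(data))
    		return false;

    	/*
    	 * The outgoing CPU might be the last online target of a pending
    	 * move. If so, the pending mask is useless: clear the pending
    	 * state and let the caller fall back to a default mask.
    	 */
    	if (cpumask_any_and(desc->pending_mask, cpu_online_mask) >= nr_cpu_ids) {
    		irqd_clr_move_pending(data);
    		return false;
    	}
    	if (force_clear)
    		irqd_clr_move_pending(data);
    	return true;
    }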
2017-06-22  genirq/debugfs: Add proper debugfs interface  (Thomas Gleixner; 7 files, -1/+337)
Debugging (hierarchical) interrupt domains is tedious as there is no information about the hierarchy and no information about states of interrupts in the various domain levels. Add a debugfs directory 'irq' and subdirectories 'domains' and 'irqs'. The domains directory contains the domain files. The content is information about the domain. If the domain is part of a hierarchy then the parent domains are printed as well.

  # ls /sys/kernel/debug/irq/domains/
  default     INTEL-IR-2      INTEL-IR-MSI-2  IO-APIC-IR-2  PCI-MSI
  DMAR-MSI    INTEL-IR-3      INTEL-IR-MSI-3  IO-APIC-IR-3  unknown-1
  INTEL-IR-0  INTEL-IR-MSI-0  IO-APIC-IR-0    IO-APIC-IR-4  VECTOR
  INTEL-IR-1  INTEL-IR-MSI-1  IO-APIC-IR-1    PCI-HT

  # cat /sys/kernel/debug/irq/domains/VECTOR
  name:   VECTOR
   size:   0
   mapped: 216
   flags:  0x00000041

  # cat /sys/kernel/debug/irq/domains/IO-APIC-IR-0
  name:   IO-APIC-IR-0
   size:   24
   mapped: 19
   flags:  0x00000041
  parent: INTEL-IR-3
     name:   INTEL-IR-3
      size:   65536
      mapped: 167
      flags:  0x00000041
     parent: VECTOR
        name:   VECTOR
         size:   0
         mapped: 216
         flags:  0x00000041

Unfortunately there is no per cpu information about the VECTOR domain (yet). The irqs directory contains detailed information about mapped interrupts.

  # cat /sys/kernel/debug/irq/irqs/3
  handler:  handle_edge_irq
  status:   0x00004000
  istate:   0x00000000
  ddepth:   1
  wdepth:   0
  dstate:   0x01018000
             IRQD_IRQ_DISABLED
             IRQD_SINGLE_TARGET
             IRQD_MOVE_PCNTXT
  node:     0
  affinity: 0-143
  effectiv: 0
  pending:
  domain:  IO-APIC-IR-0
   hwirq:   0x3
   chip:    IR-IO-APIC
    flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
   parent:
      domain:  INTEL-IR-3
       hwirq:   0x20000
       chip:    INTEL-IR
        flags:   0x0
       parent:
          domain:  VECTOR
           hwirq:   0x3
           chip:    APIC
            flags:   0x0

This was developed to simplify the debugging of the managed affinity changes. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/irqdomain: Add map counter  (Thomas Gleixner; 1 file, -0/+4)
Add a map counter instead of counting radix tree entries for diagnosis. That also gives correct information for linear domains. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq: Allow fwnode to carry name information only  (Thomas Gleixner; 1 file, -13/+92)
In order to provide a proper debug interface it's required to have domain names available when the domain is added. Non-fwnode based architectures like x86 have no way to do so. It's not possible to use domain ops or host data for this as domain ops might be the same for several instances, but the names have to be unique. Extend the irqchip fwnode to allow transporting the domain name. If no node is supplied, create an 'unknown-N' placeholder. Warn if an invalid node is supplied and treat it like no node. This happens e.g. with i2c devices on x86 which hand in an ACPI type node which has no interface for retrieving the name. [ Folded a fix from Marc to make DT name parsing work ] Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  genirq/msi: Prevent overwriting domain name  (Thomas Gleixner; 1 file, -1/+2)
Prevent overwriting an already assigned domain name. Remove the extra check for chip->name, because if domain->name is NULL overwriting it with NULL is not a problem. Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Keith Busch <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Christoph Hellwig <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-22  Merge branch 'uuid-types'  (Rafael J. Wysocki; 1 file, -2/+2)
Merge branch 'uuid-types' from git://git.infradead.org/users/hch/uuid.git to satisfy dependencies.
2017-06-22  sched/fair: Spare idle load balancing on nohz_full CPUs  (Frederic Weisbecker; 1 file, -0/+4)
Although idle load balancing obviously only concerns idle CPUs, it can be a disturbance on a busy nohz_full CPU. Indeed a CPU can only get rid of an idle load balancing duty once a tick fires while it runs a task, and this can take a while on a nohz_full CPU. We could fix that and escape the idle load balancing duty from the very idle exit path, but that would bring unnecessary overhead. Let's just not bother and leave that job to the housekeeping CPUs (those outside the nohz_full range). The nohz_full CPUs simply don't want any disturbance. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-22  nohz: Move idle balancer registration to the idle path  (Frederic Weisbecker; 1 file, -2/+3)
The idle load balancing registration path assumes that we only stop the tick when the CPU is idle, ignoring the nohz full case. As a result, a nohz full CPU that is running a task may be chosen to perform idle load balancing. Let's make sure that only CPUs in dynticks idle mode can be picked as idle load balancers. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-22  sched/loadavg: Generalize "_idle" naming to "_nohz"  (Frederic Weisbecker; 2 files, -27/+28)
The loadavg naming code still assumes that nohz == idle, whereas the code actually handles both nohz idle and nohz full well. So let's fix the naming according to what the code actually does, to unconfuse the reader. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Rik van Riel <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-22  Merge branch 'linus' into locking/core, to pick up fixes  (Ingo Molnar; 19 files, -94/+109)
Signed-off-by: Ingo Molnar <[email protected]>
2017-06-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller; 8 files, -16/+56)
Conflicts: two entries were being added at the same time to the IFLA policy table, whilst parallel bug fixes to decnet routing dst handling overlapped with the dst gc removal in net-next. Signed-off-by: David S. Miller <[email protected]>
2017-06-21  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching  (Linus Torvalds; 2 files, -7/+37)
Pull livepatching fix from Jiri Kosina: "Fix the way livepatches are stacked with respect to RCU, from Petr Mladek" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching: livepatch: Fix stacking of patches with respect to RCU
2017-06-21  irq/generic-chip: Provide devm_irq_setup_generic_chip()  (Bartosz Golaszewski; 1 file, -0/+52)
Provide a resource managed variant of irq_setup_generic_chip(). Signed-off-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Jonathan Corbet <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-21  irq/generic-chip: Provide devm_irq_alloc_generic_chip()  (Bartosz Golaszewski; 1 file, -0/+34)
Provide a resource managed variant of irq_alloc_generic_chip(). Signed-off-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Jonathan Corbet <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-21  irq/generic-chip: Export irq_init_generic_chip() locally  (Bartosz Golaszewski; 2 files, -4/+14)
This function will be used in the devres variant of irq_alloc_generic_chip(). Signed-off-by: Bartosz Golaszewski <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Jonathan Corbet <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
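Taken together with the two commits above, the devm variant plausibly reduces to a devm_kzalloc() plus this newly exported initializer; a sketch under that assumption:

    struct irq_chip_generic *
    devm_irq_alloc_generic_chip(struct device *dev, const char *name, int num_ct,
    			unsigned int irq_base, void __iomem *reg_base,
    			irq_flow_handler_t handler)
    {
    	struct irq_chip_generic *gc;
    	unsigned long sz = sizeof(*gc) + num_ct * sizeof(struct irq_chip_type);

    	/* devm_kzalloc() ties the chip's lifetime to the device */
    	gc = devm_kzalloc(dev, sz, GFP_KERNEL);
    	if (gc)
    		irq_init_generic_chip(gc, name, num_ct, irq_base,
    				      reg_base, handler);
    	return gc;
    }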
2017-06-21  perf/aux: Correct return code of rb_alloc_aux() if !has_aux(ev)  (Hendrik Brueckner; 1 file, -1/+1)
If the event for which an AUX area is about to be allocated does not support setting up an AUX area, rb_alloc_aux() returns -ENOTSUPP. This error condition is returned unfiltered to user space and, for example, the perf tool fails with:

  failed to mmap with 524 (INTERNAL ERROR: strerror_r(524, 0x3fff497a1c8, 512)=22)

This error can be easily seen with "perf record -m 128,256 -e cpu-clock". The 524 error code maps to -ENOTSUPP (in rb_alloc_aux()). The -ENOTSUPP error code shall be used only within the kernel, so the correct error code is -EOPNOTSUPP. With this commit, the perf tool reports:

  failed to mmap with 95 (Operation not supported)

which is clearer. Signed-off-by: Hendrik Brueckner <[email protected]> Acked-by: Alexander Shishkin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Pu Hou <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Thomas-Mich Richter <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
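The shape of the fix, sketched (the surrounding rb_alloc_aux() body is omitted):

    /* in rb_alloc_aux(): do not leak kernel-internal error codes */
    if (!has_aux(event))
    	return -EOPNOTSUPP;	/* was -ENOTSUPP, which user space cannot decode */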
2017-06-21  Merge branch 'fortglx/4.13/time' of https://git.linaro.org/people/john.stultz/linux into timers/core  (Thomas Gleixner; 1 file, -20/+26)
Merge time(keeping) updates from John Stultz: "Just a small set of changes, the biggest changes being the MONOTONIC_RAW handling cleanup, and a new kselftest from Miroslav. Also a clear warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64."
2017-06-21  Merge branch 'timers/urgent' into timers/core  (Thomas Gleixner; 17 files, -109/+107)
Pick up dependent changes.
2017-06-20  time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD  (John Stultz; 1 file, -0/+1)
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago to allow a transition from the old vsyscall implementations to the new method (which simplified internal accounting and made timekeeping more precise). However, PPC and IA64 have yet to make the transition, despite my sending test patches in some cases to try to help it along. http://patches.linaro.org/patch/30501/ http://patches.linaro.org/patch/35412/ If it's helpful, my last pass at the patches can be found here: https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup So I think it's time to set a deadline and make it clear this is going away. So this patch adds warnings about this functionality being dropped. Likely to be in v4.15. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Marcelo Tosatti <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Anton Blanchard <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Tony Luck <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Fenghua Yu <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-06-20  time: Clean up CLOCK_MONOTONIC_RAW time handling  (John Stultz; 1 file, -20/+25)
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW, remove the duplicative tk->raw_time.tv_nsec, which can be stored in tk->tkr_raw.xtime_nsec (similarly to how it's handled for monotonic time). Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Kevin Brodsky <[email protected]> Cc: Will Deacon <[email protected]> Cc: Daniel Mentz <[email protected]> Tested-by: Daniel Mentz <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-06-20  Merge branch 'linus' into irq/core  (Thomas Gleixner; 29 files, -154/+175)
Get upstream changes so pending patches won't conflict.
2017-06-20  posix-cpu-timers: Make timespec to nsec conversion safe  (Thomas Gleixner; 1 file, -1/+5)
The expiry time of a posix cpu timer is supplied through sys_timer_set() via a struct timespec. The timespec is validated for correctness. In the actual set timer implementation the timespec is converted to a scalar nanoseconds value. If the tv_sec part of the timespec is large enough the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64bit. Mitigate that by using the timespec_to_ktime() conversion function, which checks the tv_sec part for a potential mult overflow and clamps the result to KTIME_MAX, which is about 292 years. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Xishi Qiu <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
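The clamping being relied on, sketched: timespec_to_ktime() is essentially a wrapper around ktime_set(), which in this era of the kernel reads roughly as follows.

    /* include/linux/ktime.h (sketch): clamp instead of overflowing */
    static inline ktime_t ktime_set(const s64 secs, const unsigned long nsecs)
    {
    	if (unlikely(secs >= KTIME_SEC_MAX))
    		return KTIME_MAX;	/* ~292 years: effectively "never" */
    	return secs * NSEC_PER_SEC + (s64)nsecs;
    }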
2017-06-20  itimer: Make timeval to nsec conversion range limited  (Thomas Gleixner; 1 file, -2/+6)
The expiry time of an itimer is supplied through sys_setitimer() via a struct timeval. The timeval is validated for correctness. In the actual set timer implementation the timeval is converted to a scalar nanoseconds value. If the tv_sec part of the timeval is large enough the conversion to nanoseconds (sec * NSEC_PER_SEC) overflows 64bit. Mitigate that by using the timeval_to_ktime() conversion function, which checks the tv_sec part for a potential mult overflow and clamps the result to KTIME_MAX, which is about 292 years. Reported-by: Xishi Qiu <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: John Stultz <[email protected]> Link: http://lkml.kernel.org/r/[email protected]
2017-06-20  timers: Fix parameter description of try_to_del_timer_sync()  (Peter Meerwald-Stadler; 1 file, -1/+1)
Signed-off-by: Peter Meerwald-Stadler <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Cc: John Stultz <[email protected]> Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>
2017-06-20  sched/core: Drop the unused try_get_task_struct() helper function  (Davidlohr Bueso; 1 file, -13/+0)
This function was introduced by: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()") ... to allow easier usage of task_rcu_dereference(), however no users were ever added. Drop the helper. Signed-off-by: Davidlohr Bueso <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  Merge branch 'WIP.sched/core' into sched/core  (Ingo Molnar; 35 files, -519/+550)
Conflicts: kernel/sched/Makefile Pick up the waitqueue related renames - it didn't get much feedback, so it appears to be uncontroversial. Famous last words? ;-) Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/fair: WARN() and refuse to set buddy when !se->on_rq  (Daniel Axtens; 1 file, -2/+8)
If we set a next or last buddy for a se that is not on_rq, we will end up taking a NULL pointer dereference in wakeup_preempt_entity via pick_next_task_fair. Detect when we would be about to do that, throw a warning and then refuse to actually set it. This has been suggested at least twice: https://marc.info/?l=linux-kernel&m=146651668921468&w=2 https://lkml.org/lkml/2016/6/16/663 I recently had to debug a problem with these (we hadn't backported Konstantin's patches in this area) and this would have saved a lot of time/pain. Just do it. Signed-off-by: Daniel Axtens <[email protected]> Cc: Ben Segall <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
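A sketch of the guard, using the next-buddy setter as the example (the real patch touches both buddy setters in kernel/sched/fair.c; treat the body as an illustration):

    static void set_next_buddy(struct sched_entity *se)
    {
    	for_each_sched_entity(se) {
    		/* A dequeued se must never become a buddy: warn and bail */
    		if (SCHED_WARN_ON(!se->on_rq))
    			return;
    		cfs_rq_of(se)->next = se;
    	}
    }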
2017-06-20  sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well  (Ingo Molnar; 1 file, -2/+2)
This definition of SCHED_WARN_ON(): #define SCHED_WARN_ON(x) ((void)(x)) is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it has no return value, so it cannot be used in conditionals. Fix it. Cc: Daniel Axtens <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
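A plausible shape of the fix, using a GNU C statement expression that always evaluates to false on non-debug builds (a sketch, not the verbatim patch):

    #ifdef CONFIG_SCHED_DEBUG
    # define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
    #else
    /* Evaluate x for side effects, then yield false, so
     * `if (SCHED_WARN_ON(...))` still compiles and behaves sanely. */
    # define SCHED_WARN_ON(x)	({ (void)(x), false; })
    #endif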
2017-06-20  sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming  (Ingo Molnar; 2 files, -14/+14)
So I've noticed a number of instances where it was not obvious from the code whether ->task_list was for a wait-queue head or a wait-queue entry. Furthermore, there's a number of wait-queue users where the lists are not for 'tasks' but other entities (poll tables, etc.), in which case the 'task_list' name is actively confusing. To clear this all up, name the wait-queue head and entry list structure fields unambiguously:

  struct wait_queue_head::task_list   => ::head
  struct wait_queue_entry::task_list  => ::entry

For example, this code:

  rqw->wait.task_list.next != &wait->task_list

... was pretty unclear (to me) about what it's doing, while now it's written this way:

  rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head. Other examples are:

  list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
  list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's hard to tell whether it's trying to walk a wait-queue entry (which would be a bug), while now it's written as:

  list_for_each_entry_safe(pos, next, &x->head, entry) {
  list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c  (Ingo Molnar; 2 files, -16/+25)
The key hashed waitqueue data structures and their initialization were done in the main scheduler file for no good reason; move them to sched/wait_bit.c instead. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>  (Ingo Molnar; 3 files, -258/+264)
The wait_bit*() types and APIs are mixed into wait.h, but they are a pretty orthogonal extension of wait-queues. Furthermore, only about 50 kernel files use these APIs, while over 1000 use the regular wait-queue functionality. So clean up the main wait.h by moving the wait-bit functionality out of it, into a separate .h and .c file:

  include/linux/wait_bit.h  for types and APIs
  kernel/sched/wait_bit.c   for the implementation

Update all header dependencies. This reduces the size of wait.h rather significantly, by about 30%. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Standardize wait_bit_queue naming  (Ingo Molnar; 1 file, -21/+20)
So wait-bit-queue head variables are often named:

  struct wait_bit_queue *q

... which is a bit ambiguous and super confusing, because they clearly suggest wait-queue head semantics and behavior (they rhyme with the old wait_queue_t *q naming), while they are extended wait-queue _entries_, not heads! They are misnomers in two ways:

 - the 'wait_bit_queue' name leaves open the question of whether it's an entry or a head
 - the 'q' parameter and local variable naming falsely implies that it's a 'queue' - while it's an entry

This resulted in sometimes confusing cases such as:

  finish_wait(wq, &q->wait);

where the 'q' is not a wait-queue head, but a wait-bit-queue entry. So improve this all by standardizing wait-bit-queue nomenclature similar to wait-queue head naming:

  struct wait_bit_queue  =>  struct wait_bit_queue_entry
  q                      =>  wbq_entry

Which makes it all much clearer:

  struct wait_bit_queue_entry *wbq_entry

... and turns the former confusing piece of code into:

  finish_wait(wq_head, &wbq_entry->wq_entry);

which IMHO makes it apparent what we are doing, without having to analyze the context of the code: we are adding a wait-queue entry to a regular wait-queue head, which entry is embedded in a wait-bit-queue entry. I'm not a big fan of acronyms, but repeating wait_bit_queue_entry in field and local variable names is too long, so hopefully it's clear enough that 'wq_' prefixes stand for wait-queues, while 'wbq_' prefixes stand for wait-bit-queues. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name  (Ingo Molnar; 1 file, -21/+20)
Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly name it as a wait-queue entry. Propagate it to a couple of usage sites where the wait-bit-queue internals are exposed. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Standardize internal naming of wait-queue heads  (Ingo Molnar; 1 file, -77/+77)
The wait-queue head parameters and variables are named in a couple of ways; we have the following variants currently:

  wait_queue_head_t *q
  wait_queue_head_t *wq
  wait_queue_head_t *head

In particular the 'wq' naming is ambiguous in the sense whether it's a wait-queue head or entry name - as entries were often named 'wait'. ( Not to mention the confusion of any readers coming over from workqueue-land. ) Standardize all this around a single, unambiguous parameter and variable name:

  struct wait_queue_head *wq_head

which is easy to grep for and also rhymes nicely with the wait-queue entry naming:

  struct wait_queue_entry *wq_entry

Also rename:

  struct __wait_queue_head => struct wait_queue_head

... and use this struct type to migrate from typedefs usage to 'struct' usage, which is more in line with existing kernel practices. Don't touch any external users and preserve the main wait_queue_head_t typedef. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Standardize internal naming of wait-queue entries  (Ingo Molnar; 1 file, -49/+49)
So the various wait-queue entry variables in include/linux/wait.h and kernel/sched/wait.c are named in a colorfully inconsistent way:

  wait_queue_entry_t *wait
  wait_queue_entry_t *__wait   (even in plain C code!)
  wait_queue_entry_t *q        (!)
  wait_queue_entry_t *new      (making anyone who knows C++ cringe)
  wait_queue_entry_t *old

I think part of the reason for the inconsistency is the constant apparent confusion about what a wait queue 'head' versus 'entry' is. ( Some of the documentation talks about a 'wait descriptor', which is the wait-queue entry itself - further adding to the confusion. ) The most common name is 'wait', but that in itself is somewhat ambiguous as well, as it does not really make it clear whether it's a wait-queue entry or head. To improve all this, name the wait-queue entry structure parameters and variables consistently and push through this naming into all the wait.h and wait.c code:

  struct wait_queue_entry *wq_entry

The 'wq_' prefix makes it easy to grep for, and we also use the opportunity to move away from the typedef to a plain 'struct' naming: in the kernel we typically reserve typedefs for cases where a C structure is really small and somewhat opaque - such as pte_t. wait-queue entries are neither small nor opaque, so use the more standard 'struct xxx_entry' list management code nomenclature instead. ( We don't touch external users, and we preserve the typedef as well for actual wait-queue users, to reduce unnecessary churn. ) Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  sched/wait: Rename wait_queue_t => wait_queue_entry_t  (Ingo Molnar; 6 files, -28/+28)
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  locking/rtmutex: Don't initialize lockdep when not required  (Levin, Alexander (Sasha Levin); 1 file, -1/+2)
pi_mutex isn't supposed to be tracked by lockdep, but just passing NULLs for name and key will cause lockdep to spew a warning and die, which is not what we want it to do. Skip lockdep initialization if the caller passed NULLs for name and key, suggesting such initialization isn't desired. Signed-off-by: Sasha Levin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: f5694788ad8d ("rt_mutex: Add lockdep annotations") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
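The shape of the fix, sketched against __rt_mutex_init() (assuming the name/key parameters added by the lockdep-annotations commit it fixes; field names follow the 4.12-era rtmutex):

    void __rt_mutex_init(struct rt_mutex *lock, const char *name,
    		     struct lock_class_key *key)
    {
    	lock->owner = NULL;
    	raw_spin_lock_init(&lock->wait_lock);
    	lock->waiters = RB_ROOT;
    	lock->waiters_leftmost = NULL;

    	/* NULL name/key signals "do not track this lock with lockdep" */
    	if (name && key)
    		debug_rt_mutex_init(lock, name, key);
    }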
2017-06-20  Merge tag 'v4.12-rc6' into perf/core, to pick up fixes  (Ingo Molnar; 14 files, -85/+44)
Signed-off-by: Ingo Molnar <[email protected]>
2017-06-20  livepatch: Fix stacking of patches with respect to RCU  (Petr Mladek; 2 files, -7/+37)
rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for secure access and manipulation of the list of patches that modify the same function. In particular, it is the variable func_stack that is accessible from the ftrace handler via struct ftrace_ops and klp_ops. Of course, it synchronizes also some states of the patch on top of the stack, e.g. func->transition in klp_ftrace_handler. At the same time, this mechanism guards the manipulation of task->patch_state, which is modified according to the state of the transition and the state of the process.

Now, all this works well as long as RCU works well. Sadly, livepatching might get into some corner cases when this is not true. For example, RCU is not watching when rcu_read_lock() is taken in idle threads, because they might sleep and prevent reaching the grace period for too long. There are ways to make RCU watch even in idle threads, see rcu_irq_enter(). But there is a small location inside the RCU infrastructure where even this does not work. This small problematic location can be detected either before calling rcu_irq_enter(), by rcu_irq_enter_disabled(), or later by rcu_is_watching(). Sadly, there is no safe way to handle it. Once we detect that RCU was not watching, we might see an inconsistent state of the function stack and the related variables in klp_ftrace_handler(). Then we could make a wrong decision, use an incompatible implementation of the function, and break the consistency of the system. We could warn, but we could not avoid the damage.

Fortunately, ftrace has similar problems and they seem to be solved well there. It uses a heavyweight implementation of some RCU operations. In particular, it replaces:

  + rcu_read_lock()    with preempt_disable_notrace()
  + rcu_read_unlock()  with preempt_enable_notrace()
  + synchronize_rcu()  with schedule_on_each_cpu(sync_work)

My understanding is that this is an RCU implementation from the stone age. It meets the core RCU requirements, but it is rather ineffective. In particular, it does not allow batching or speeding up the synchronize calls. On the other hand, it is very trivial. It allows safely tracing and/or livepatching even the RCU core infrastructure. And the ineffectiveness is not a big issue, because using ftrace or livepatches on production systems is a rare operation. The safety is much more important than a negligible extra load.

Note that the alternative implementation follows the RCU principles. Therefore, we can and actually must use the list_*_rcu() variants when manipulating the func_stack. These functions allow accessing the pointers in the right order and with the right barriers. But they do not use any other information that would be set only by rcu_read_lock().

Also note that there are actually two problems solved in ftrace: First, it cares about the consistency of RCU read sections. That is being solved the way described and used in this patch. Second, ftrace needs to make sure that nobody is inside the dynamic trampoline when it is being freed. For this, it also calls synchronize_rcu_tasks() in a preemptive kernel in ftrace_shutdown(). Livepatch has a similar problem, but it is solved by ftrace for free. klp_ftrace_handler() is a good guy and never sleeps. In addition, it is registered with FTRACE_OPS_FL_DYNAMIC. As a result, unregister_ftrace_function() calls:

  * schedule_on_each_cpu(ftrace_sync) - always
  * synchronize_rcu_tasks()           - in a preemptive kernel

The effect is that nobody is inside the dynamic trampoline or inside the ftrace handler after unregister_ftrace_function() returns.

[[email protected]: reformat changelog, fix comment] Signed-off-by: Petr Mladek <[email protected]> Acked-by: Josh Poimboeuf <[email protected]> Acked-by: Miroslav Benes <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
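A sketch of how the replacement maps onto the ftrace handler (abbreviated; the real handler also deals with the transition state and task->patch_state):

    static void notrace klp_ftrace_handler(unsigned long ip,
    				       unsigned long parent_ip,
    				       struct ftrace_ops *fops,
    				       struct pt_regs *regs)
    {
    	struct klp_ops *ops = container_of(fops, struct klp_ops, fops);
    	struct klp_func *func;

    	/* Was rcu_read_lock(); safe even where RCU is not watching */
    	preempt_disable_notrace();

    	func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
    				      stack_node);
    	if (func)
    		klp_arch_set_pc(regs, (unsigned long)func->new_func);

    	/* Was rcu_read_unlock() */
    	preempt_enable_notrace();
    }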