aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2019-08-13rcu/nocb: Remove deferred wakeup checks for extended quiescent statesPaul E. McKenney1-10/+0
The idea behind the checks for extended quiescent states at the end of __call_rcu_nocb() is to handle cases where call_rcu() is invoked directly from within an extended quiescent state, for example, from the idle loop. However, this will result in a timer-mediated deferred wakeup, which will cause the needed wakeup to happen within a jiffy or thereabouts. There should be no forward-progress concerns, and if there are, the proper response is to exit the extended quiescent state while executing the endless blast of call_rcu() invocations, for example, using RCU_NONIDLE(). Given the more realistic case of an isolated call_rcu() invocation, there should be no problem. This commit therefore removes the checks for invoking call_rcu() within an extended quiescent state for on no-CBs CPUs. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Check for deferred nocb wakeups before nohz_full early exitPaul E. McKenney1-4/+4
In theory, a timer is used to defer wakeups of no-CBs grace-period kthreads when the wakeup cannot be done safely directly from the call_rcu(). In practice, the one-jiffy delay is not always consistent with timely callback invocation under heavy call_rcu() loads. Therefore, there are a number of checks for a pending deferred wakeup, including from the scheduling-clock interrupt. Unfortunately, this check follows the rcu_nohz_full_cpu() early exit, which renders it useless on such CPUs. This commit therefore moves the check for the pending deferred no-CB wakeup to precede the rcu_nohz_full_cpu() early exit. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Make rcutree_migrate_callbacks() start at leaf rcu_node structurePaul E. McKenney1-5/+6
Because rcutree_migrate_callbacks() is invoked infrequently and because an exact snapshot of the grace-period state might save some callbacks a second trip through a grace period, this function has used the root rcu_node structure. However, this safe-second-trip optimization happens only if rcutree_migrate_callbacks() races with grace-period initialization, so it is not worth the added mental load. This commit therefore makes rcutree_migrate_callbacks() start with the leaf rcu_node structures, as is done elsewhere. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Add checks for offloaded callback processingPaul E. McKenney1-3/+8
This commit is a preparatory patch for offloaded callbacks using the same ->cblist structure used by non-offloaded callbacks. It therefore adds rcu_segcblist_is_offloaded() calls where they will be needed when !rcu_segcblist_is_enabled() no longer flags the offloaded case. It also adds checks in rcu_do_batch() to ensure that there are no missed checks: Currently, it should not be possible for offloaded execution to reach rcu_do_batch(), though this will change later in this series. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Use separate flag to indicate offloaded ->cblistPaul E. McKenney4-8/+32
RCU callback processing currently uses rcu_is_nocb_cpu() to determine whether or not the current CPU's callbacks are to be offloaded. This works, but it is not so good for cache locality. Plus use of ->cblist for offloaded callbacks will greatly increase the frequency of these checks. This commit therefore adds a ->offloaded flag to the rcu_segcblist structure to provide a more flexible and cache-friendly means of checking for callback offloading. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Use separate flag to indicate disabled ->cblistPaul E. McKenney3-3/+4
NULLing the RCU_NEXT_TAIL pointer was a clever way to save a byte, but forward-progress considerations would require that this pointer be both NULL and non-NULL, which, absent a quantum-computer port of the Linux kernel, simply won't happen. This commit therefore creates as separate ->enabled flag to replace the current NULL checks. [ paulmck: Add include files per 0day test robot and -next. ] Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Print gp/cb kthread hierarchy if dump_treePaul E. McKenney1-0/+6
This commit causes the no-CBs grace-period/callback hierarchy to be printed to the console when the dump_tree kernel boot parameter is set. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameterPaul E. McKenney1-4/+4
This commit changes the name of the rcu_nocb_leader_stride kernel boot parameter to rcu_nocb_gp_stride in order to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename and document no-CB CB kthread sleep trace eventPaul E. McKenney1-1/+1
The nocb_cb_wait() function traces a "FollowerSleep" trace_rcu_nocb_wake() event, which never was documented and is now misleading. This commit therefore changes "FollowerSleep" to "CBSleep", documents this, and updates the documentation for "Sleep" as well. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename rcu_organize_nocb_kthreads() local variablePaul E. McKenney1-3/+3
This commit renames rdp_leader to rdp_gp in order to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename wake_nocb_leader_defer() to wake_nocb_gp_defer()Paul E. McKenney1-6/+6
This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename __wake_nocb_leader() to __wake_nocb_gp()Paul E. McKenney1-9/+9
This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. While in the area, it also updates local variables. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename wake_nocb_leader() to wake_nocb_gp()Paul E. McKenney1-3/+3
This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename nocb_follower_wait() to nocb_cb_wait()Paul E. McKenney1-2/+2
This commit adjusts naming to account for the new distinction between callback and grace-period no-CBs kthreads. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Provide separate no-CBs grace-period kthreadsPaul E. McKenney2-60/+61
Currently, there is one no-CBs rcuo kthread per CPU, and these kthreads are divided into groups. The first rcuo kthread to come online in a given group is that group's leader, and the leader both waits for grace periods and invokes its CPU's callbacks. The non-leader rcuo kthreads only invoke callbacks. This works well in the real-time/embedded environments for which it was intended because such environments tend not to generate all that many callbacks. However, given huge floods of callbacks, it is possible for the leader kthread to be stuck invoking callbacks while its followers wait helplessly while their callbacks pile up. This is a good recipe for an OOM, and rcutorture's new callback-flood capability does generate such OOMs. One strategy would be to wait until such OOMs start happening in production, but similar OOMs have in fact happened starting in 2018. It would therefore be wise to take a more proactive approach. This commit therefore features per-CPU rcuo kthreads that do nothing but invoke callbacks. Instead of having one of these kthreads act as leader, each group has a separate rcog kthread that handles grace periods for its group. Because these rcuog kthreads do not invoke callbacks, callback floods on one CPU no longer block callbacks from reaching the rcuc callback-invocation kthreads on other CPUs. This change does introduce additional kthreads, however: 1. The number of additional kthreads is about the square root of the number of CPUs, so that a 4096-CPU system would have only about 64 additional kthreads. Note that recent changes decreased the number of rcuo kthreads by a factor of two (CONFIG_PREEMPT=n) or even three (CONFIG_PREEMPT=y), so this still represents a significant improvement on most systems. 2. The leading "rcuo" of the rcuog kthreads should allow existing scripting to affinity these additional kthreads as needed, the same as for the rcuop and rcuos kthreads. (There are no longer any rcuob kthreads.) 3. A state-machine approach was considered and rejected. Although this would allow the rcuo kthreads to continue their dual leader/follower roles, it complicates callback invocation and makes it more difficult to consolidate rcuo callback invocation with existing softirq callback invocation. The introduction of rcuog kthreads should thus be acceptable. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Update comments to prepare for forward-progress workPaul E. McKenney2-32/+33
This commit simply rewords comments to prepare for leader nocb kthreads doing only grace-period work and callback shuffling. This will mean the addition of replacement kthreads to invoke callbacks. The "leader" and "follower" thus become less meaningful, so the commit changes no-CB comments with these strings to "GP" and "CB", respectively. (Give or take the usual grammatical transformations.) Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13rcu/nocb: Rename rcu_data fields to prepare for forward-progress workPaul E. McKenney2-46/+46
This commit simply renames rcu_data fields to prepare for leader nocb kthreads doing only grace-period work and callback shuffling. This will mean the addition of replacement kthreads to invoke callbacks. The "leader" and "follower" thus become less meaningful, so the commit changes no-CB fields with these strings to "gp" and "cb", respectively. Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-13Merge branches 'consolidate.2019.08.01b', 'fixes.2019.08.12a', ↵Paul E. McKenney16-110/+193
'lists.2019.08.13a' and 'torture.2019.08.01b' into HEAD consolidate.2019.08.01b: Further consolidation cleanups fixes.2019.08.12a: Miscellaneous fixes lists.2019.08.13a: Optional lockdep arguments for RCU list macros torture.2019.08.01b: Torture-test updates
2019-08-13btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinuxAndrii Nakryiko1-15/+15
Expose kernel's BTF under the name vmlinux to be more uniform with using kernel module names as file names in the future. Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs") Suggested-by: Daniel Borkmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-13btf: expose BTF info through sysfsAndrii Nakryiko2-0/+54
Make .BTF section allocated and expose its contents through sysfs. /sys/kernel/btf directory is created to contain all the BTFs present inside kernel. Currently there is only kernel's main BTF, represented as /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported, each module will expose its BTF as /sys/kernel/btf/<module-name> file. Current approach relies on a few pieces coming together: 1. pahole is used to take almost final vmlinux image (modulo .BTF and kallsyms) and generate .BTF section by converting DWARF info into BTF. This section is not allocated and not mapped to any segment, though, so is not yet accessible from inside kernel at runtime. 2. objcopy dumps .BTF contents into binary file and subsequently convert binary file into linkable object file with automatically generated symbols _binary__btf_kernel_bin_start and _binary__btf_kernel_bin_end, pointing to start and end, respectively, of BTF raw data. 3. final vmlinux image is generated by linking this object file (and kallsyms, if necessary). sysfs_btf.c then creates /sys/kernel/btf/kernel file and exposes embedded BTF contents through it. This allows, e.g., libbpf and bpftool access BTF info at well-known location, without resorting to searching for vmlinux image on disk (location of which is not standardized and vmlinux image might not be even available in some scenarios, e.g., inside qemu during testing). Alternative approach using .incbin assembler directive to embed BTF contents directly was attempted but didn't work, because sysfs_proc.o is not re-compiled during link-vmlinux.sh stage. This is required, though, to update embedded BTF data (initially empty data is embedded, then pahole generates BTF info and we need to regenerate sysfs_btf.o with updated contents, but it's too late at that point). If BTF couldn't be generated due to missing or too old pahole, sysfs_btf.c handles that gracefully by detecting that _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating /sys/kernel/btf at all. v2->v3: - added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H); - created proper kobject (btf_kobj) for btf directory (Greg K-H); - undo v2 change of reusing vmlinux, as it causes extra kallsyms pass due to initially missing __binary__btf_kernel_bin_{start/end} symbols; v1->v2: - allow kallsyms stage to re-use vmlinux generated by gen_btf(); Reviewed-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-08-12rcu: Fix spelling mistake "greate"->"great"Mukesh Ojha1-1/+1
This commit fixes a spelling mistake in file tree_exp.h. Signed-off-by: Mukesh Ojha <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-12rcu: Remove redundant "if" condition from rcu_gp_is_expedited()Paul E. McKenney1-2/+1
Because rcu_expedited_nesting is initialized to 1 and not decremented until just before init is spawned, rcu_expedited_nesting is guaranteed to be non-zero whenever rcu_scheduler_active == RCU_SCHEDULER_INIT. This commit therefore removes this redundant "if" equality test. Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]>
2019-08-12idle: Prevent late-arriving interrupts from disrupting offlinePeter Zijlstra1-2/+3
Scheduling-clock interrupts can arrive late in the CPU-offline process, after idle entry and the subsequent call to cpuhp_report_idle_dead(). Once execution passes the call to rcu_report_dead(), RCU is ignoring the CPU, which results in lockdep complaints when the interrupt handler uses RCU: ------------------------------------------------------------------------ ============================= WARNING: suspicious RCU usage 5.2.0-rc1+ #681 Not tainted ----------------------------- kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage! other info that might help us debug this: RCU used illegally from offline CPU! rcu_scheduler_active = 2, debug_locks = 1 no locks held by swapper/5/0. stack backtrace: CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011 Call Trace: <IRQ> dump_stack+0x5e/0x8b trigger_load_balance+0xa8/0x390 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x3b/0x50 tick_sched_handle+0x2f/0x40 tick_sched_timer+0x32/0x70 __hrtimer_run_queues+0xd3/0x3b0 hrtimer_interrupt+0x11d/0x270 ? sched_clock_local+0xc/0x74 smp_apic_timer_interrupt+0x79/0x200 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:delay_tsc+0x22/0x50 Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29 RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64 RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15 RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000 cpuhp_report_idle_dead+0x31/0x60 do_idle+0x1d5/0x200 ? _raw_spin_unlock_irqrestore+0x2d/0x40 cpu_startup_entry+0x14/0x20 start_secondary+0x151/0x170 secondary_startup_64+0xa4/0xb0 ------------------------------------------------------------------------ This happens rarely, but can be forced by happen more often by placing delays in cpuhp_report_idle_dead() following the call to rcu_report_dead(). With this in place, the following rcutorture scenario reproduces the problem within a few minutes: tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04" This commit uses the crude but effective expedient of moving the disabling of interrupts within the idle loop to precede the cpu_is_offline() check. It also invokes tick_nohz_idle_stop_tick() instead of tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock interrupt. Signed-off-by: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ] Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-12kernel: only define task_struct_whitelist conditionallyChristoph Hellwig1-0/+2
If CONFIG_ARCH_TASK_STRUCT_ALLOCATOR is set task_struct_whitelist is never called, and thus generates a compiler warning. Signed-off-by: Christoph Hellwig <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Tony Luck <[email protected]>
2019-08-12sched/fair: Use rq_lock/unlock in online_fair_sched_groupPhil Auld1-3/+3
Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes warning to fire in update_rq_clock. This seems to be caused by onlining a new fair sched group not using the rq lock wrappers. [] rq->clock_update_flags & RQCF_UPDATED [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150 [] Call Trace: [] online_fair_sched_group+0x53/0x100 [] cpu_cgroup_css_online+0x16/0x20 [] online_css+0x1c/0x60 [] cgroup_apply_control_enable+0x231/0x3b0 [] cgroup_mkdir+0x41b/0x530 [] kernfs_iop_mkdir+0x61/0xa0 [] vfs_mkdir+0x108/0x1a0 [] do_mkdirat+0x77/0xe0 [] do_syscall_64+0x55/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Using the wrappers in online_fair_sched_group instead of the raw locking removes this warning. [ tglx: Use rq_*lock_irq() ] Signed-off-by: Phil Auld <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Vincent Guittot <[email protected]> Cc: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-08-10Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds2-10/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Thomas Gleixner: "Three fixlets for the scheduler: - Avoid double bandwidth accounting in the push & pull code - Use a sane FIFO priority for the Pressure Stall Information (PSI) thread. - Avoid permission checks when setting the scheduler params for the PSI thread" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/psi: Do not require setsched permission from the trigger creator sched/psi: Reduce psimon FIFO priority sched/deadline: Fix double accounting of rq/running bw in push & pull
2019-08-10dma-mapping: fix page attributes for dma_mmap_*Christoph Hellwig2-2/+19
All the way back to introducing dma_common_mmap we've defaulted to mark the pages as uncached. But this is wrong for DMA coherent devices. Later on DMA_ATTR_WRITE_COMBINE also got incorrect treatment as that flag is only treated special on the alloc side for non-coherent devices. Introduce a new dma_pgprot helper that deals with the check for coherent devices so that only the remapping cases ever reach arch_dma_mmap_pgprot and we thus ensure no aliasing of page attributes happens, which makes the powerpc version of arch_dma_mmap_pgprot obsolete and simplifies the remaining ones. Note that this means arch_dma_mmap_pgprot is a bit misnamed now, but we'll phase it out soon. Fixes: 64ccc9c033c6 ("common: dma-mapping: add support for generic dma_mmap_* calls") Reported-by: Shawn Anastasio <[email protected]> Reported-by: Gavin Li <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Acked-by: Catalin Marinas <[email protected]> # arm64
2019-08-10dma-direct: don't truncate dma_required_mask to bus addressing capabilitiesLucas Stach1-3/+0
The dma required_mask needs to reflect the actual addressing capabilities needed to handle the whole system RAM. When truncated down to the bus addressing capabilities dma_addressing_limited() will incorrectly signal no limitations for devices which are restricted by the bus_dma_mask. Fixes: b4ebe6063204 (dma-direct: implement complete bus_dma_mask handling) Signed-off-by: Lucas Stach <[email protected]> Tested-by: Atish Patra <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2019-08-10dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPINGChristoph Hellwig1-2/+5
The new DMA_ATTR_NO_KERNEL_MAPPING needs to actually assign a dma_addr to work. Also skip it if the architecture needs forced decryption handling, as that needs a kernel virtual address. Fixes: d98849aff879 (dma-direct: handle DMA_ATTR_NO_KERNEL_MAPPING in common code) Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Lucas Stach <[email protected]>
2019-08-10cpufreq: schedutil: Don't skip freq update when limits changeViresh Kumar1-4/+10
To avoid reducing the frequency of a CPU prematurely, we skip reducing the frequency if the CPU had been busy recently. This should not be done when the limits of the policy are changed, for example due to thermal throttling. We should always get the frequency within the new limits as soon as possible. Trying to fix this by using only one flag, i.e. need_freq_update, can lead to a race condition where the flag gets cleared without forcing us to change the frequency at least once. And so this patch introduces another flag to avoid that race condition. Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") Cc: v4.18+ <[email protected]> # v4.18+ Reported-by: Doug Smythies <[email protected]> Tested-by: Doug Smythies <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2019-08-10PM: suspend: Fix platform_suspend_prepare_noirq()Rafael J. Wysocki1-5/+6
After commit ac9eafbe930a ("ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices"), a NULL pointer may be dereferenced if suspend-to-idle is attempted on a platform without "traditional" suspend support due to invalid fall-through in platform_suspend_prepare_noirq(). Fix that and while at it add missing braces in platform_resume_noirq(). Fixes: ac9eafbe930a ("ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devices") Reported-by: Marek Szyprowski <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2019-08-09rcu: Add support for consolidated-RCU reader checkingJoel Fernandes (Google)2-33/+74
This commit adds RCU-reader checks to list_for_each_entry_rcu() and hlist_for_each_entry_rcu(). These checks are optional, and are indicated by a lockdep expression passed to a new optional argument to these two macros. If this optional lockdep expression is omitted, these two macros act as before, checking for an RCU read-side critical section. Signed-off-by: Joel Fernandes (Google) <[email protected]> [ paulmck: Update to eliminate return within macro and update comment. ] Signed-off-by: Paul E. McKenney <[email protected]>
2019-08-09dma-mapping: Remove dma_check_mask()Thiago Jung Bauermann1-8/+0
sme_active() is an x86-specific function so it's better not to call it from generic code. Christoph Hellwig mentioned that "There is no reason why we should have a special debug printk just for one specific reason why there is a requirement for a large DMA mask.", so just remove dma_check_mask(). Signed-off-by: Thiago Jung Bauermann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2019-08-09swiotlb: Remove call to sme_active()Thiago Jung Bauermann1-2/+1
sme_active() is an x86-specific function so it's better not to call it from generic code. There's no need to mention which memory encryption feature is active, so just use a more generic message. Besides, other architectures will have different names for similar technology. Signed-off-by: Thiago Jung Bauermann <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Tom Lendacky <[email protected]> Signed-off-by: Michael Ellerman <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2019-08-09padata: initialize pd->cpu with effective cpumaskDaniel Jordan1-1/+1
Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from cryptodev-2.6.git in an 8-CPU kvm guest... # modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3 # echo 0 > /sys/devices/system/cpu/cpu1/online # echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask # modprobe tcrypt mode=215 ...caused the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-<snip> Workqueue: pencrypt padata_parallel_worker RIP: 0010:padata_reorder+0xcb/0x180 ... Call Trace: padata_do_serial+0x57/0x60 pcrypt_aead_enc+0x3a/0x50 [pcrypt] padata_parallel_worker+0x9b/0xe0 process_one_work+0x1b5/0x3f0 worker_thread+0x4a/0x3c0 ... In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask instead of the effective cpumask, and in this case cpumask_first picked an offline CPU. The offline CPU's reorder->list.next is NULL in padata_reorder because the list wasn't initialized in padata_init_pqueues, which only operates on CPUs in the effective mask. Fix by using the effective mask in padata_alloc_pd. Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in padata_reorder") Signed-off-by: Daniel Jordan <[email protected]> Cc: Herbert Xu <[email protected]> Cc: Steffen Klassert <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Herbert Xu <[email protected]>
2019-08-08ACPI: PM: s2idle: Execute LPS0 _DSM functions with suspended devicesRafael J. Wysocki1-3/+9
According to Section 3.5 of the "Intel Low Power S0 Idle" document [1], Function 5 of the LPS0 _DSM is expected to be invoked when the system configuration matches the criteria for entering the target low-power state of the platform. In particular, this means that all devices should be suspended and in low-power states already when that function is invoked. This is not the case currently, however, because Function 5 of the LPS0 _DSM is invoked by it before the "noirq" phase of device suspend, which means that some devices may not have been put into low-power states yet at that point. That is a consequence of the previous design of the suspend-to-idle flow that allowed the "noirq" phase of device suspend and the "noirq" phase of device resume to be carried out for multiple times while "suspended" (if any spurious wakeup events were detected) and the point of the LPS0 _DSM Function 5 invocation was chosen so as to call it (and LPS0 _DSM Function 6 analogously) once per suspend-resume cycle (regardless of how many times the "noirq" phases of device suspend and resume were carried out while "suspended"). Now that the suspend-to-idle flow has been redesigned to carry out the "noirq" phases of device suspend and resume once in each cycle, the code can be reordered to follow the specification that it is based on more closely. For this purpose, add ->prepare_late and ->restore_early platform callbacks for suspend-to-idle, to be executed, respectively, after the "noirq" phase of suspending devices and before the "noirq" phase of resuming them and make ACPI use them for the invocation of LPS0 _DSM functions as appropriate. While at it, move the LPS0 entry requirements check to be made before invoking Functions 3 and 5 of the LPS0 _DSM (also once per cycle) as follows from the specification [1]. Link: https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf # [1] Signed-off-by: Rafael J. Wysocki <[email protected]> Tested-by: Kai-Heng Feng <[email protected]>
2019-08-08cpufreq: schedutil: fix equation in commentQais Yousef1-3/+3
scale_irq_capacity() call in schedutil_cpu_util() does util *= (max - irq) util /= max But the comment says util *= (1 - irq) util /= max Fix the comment to match what the scaling function does. Signed-off-by: Qais Yousef <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Viresh Kumar <[email protected]> Acked-by: Vincent Guittot <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-08-08sched: Rework pick_next_task() slow-pathPeter Zijlstra7-73/+34
Avoid the RETRY_TASK case in the pick_next_task() slow path. By doing the put_prev_task() early, we get the rt/deadline pull done, and by testing rq->nr_running we know if we need newidle_balance(). This then gives a stable state to pick a task from. Since the fast-path is fair only; it means the other classes will always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Allow put_prev_task() to drop rq->lockPeter Zijlstra7-8/+32
Currently the pick_next_task() loop is convoluted and ugly because of how it can drop the rq->lock and needs to restart the picking. For the RT/Deadline classes, it is put_prev_task() where we do balancing, and we could do this before the picking loop. Make this possible. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: Aaron Lu <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com
2019-08-08sched/fair: Expose newidle_balance()Peter Zijlstra2-10/+12
For pick_next_task_fair() it is the newidle balance that requires dropping the rq->lock; provided we do put_prev_task() early, we can also detect the condition for doing newidle early. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Add task_struct pointer to sched_class::set_curr_taskPeter Zijlstra7-46/+48
In preparation of further separating pick_next_task() and set_curr_task() we have to pass the actual task into it, while there, rename the thing to better pair with put_prev_task(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Rework CPU hotplug task selectionPeter Zijlstra2-18/+15
The CPU hotplug task selection is the only place where we used put_prev_task() on a task that is not current. While looking at that, it occured to me that we can simplify all that by by using a custom pick loop. Since we don't need to put current, we can do away with the fake task too. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]>
2019-08-08sched/{rt,deadline}: Fix set_next_task vs pick_next_taskPeter Zijlstra2-24/+24
Because pick_next_task() implies set_curr_task() and some of the details haven't mattered too much, some of what _should_ be in set_curr_task() ended up in pick_next_task, correct this. This prepares the way for a pick_next_task() variant that does not affect the current state; allowing remote picking. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/38c61d5240553e043c27c5e00b9dd0d184dd6081.1559129225.git.vpillai@digitalocean.com
2019-08-08sched: Fix kerneldoc comment for ia64_set_curr_taskPeter Zijlstra1-1/+1
Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/fde3a65ea3091ec6b84dac3c19639f85f452c5d1.1559129225.git.vpillai@digitalocean.com
2019-08-08stop_machine: Fix stop_cpus_in_progress orderingPeter Zijlstra1-0/+2
Make sure the entire for loop has stop_cpus_in_progress set. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Aaron Lu <[email protected]> Cc: Valentin Schneider <[email protected]> Cc: [email protected] Cc: Phil Auld <[email protected]> Cc: Julien Desfossez <[email protected]> Cc: Nishanth Aravamudan <[email protected]> Link: https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpillai@digitalocean.com
2019-08-08sched/fair: Fix low cpu usage with high throttling by removing expiration of ↵Dave Chiluk2-69/+7
cpu-local slices It has been observed, that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to cpu-local run queue being allocated per cpu bandwidth slices, and then not fully using that slice within the period. At which point the slice and quota expires. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired. if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueueu. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: Dave Chiluk <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Phil Auld <[email protected]> Reviewed-by: Ben Segall <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: John Hammond <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Kyle Anderson <[email protected]> Cc: Gabriel Munos <[email protected]> Cc: Peter Oskolkov <[email protected]> Cc: Cong Wang <[email protected]> Cc: Brendan Gregg <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2019-08-08sched: Clean up active_mm reference countingPeter Zijlstra1-19/+30
The current active_mm reference counting is confusing and sub-optimal. Rewrite the code to explicitly consider the 4 separate cases: user -> user When switching between two user tasks, all we need to consider is switch_mm(). user -> kernel When switching from a user task to a kernel task (which doesn't have an associated mm) we retain the last mm in our active_mm. Increment a reference count on active_mm. kernel -> kernel When switching between kernel threads, all we need to do is pass along the active_mm reference. kernel -> user When switching between a kernel and user task, we must switch from the last active_mm to the next mm, hoping of course that these are the same. Decrement a reference on the active_mm. The code keeps a different order, because as you'll note, both 'to user' cases require switch_mm(). And where the old code would increment/decrement for the 'kernel -> kernel' case, the new code observes this is a neutral operation and avoids touching the reference count. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Rik van Riel <[email protected]> Reviewed-by: Mathieu Desnoyers <[email protected]> Cc: [email protected]
2019-08-08rcu/tree: Fix SCHED_FIFO paramsPeter Zijlstra1-3/+3
A rather embarrasing mistake had us call sched_setscheduler() before initializing the parameters passed to it. Fixes: 1a763fd7c633 ("rcu/tree: Call setschedule() gp ktread to SCHED_FIFO outside of atomic region") Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Paul E. McKenney <[email protected]> Cc: Juri Lelli <[email protected]>
2019-08-08mutex: Fix up mutex_waiter usagePeter Zijlstra2-15/+0
The patch moving bits into mutex.c was a little too much; by also moving struct mutex_waiter a few less common CONFIGs would no longer build. Fixes: 5f35d5a66b3e ("locking/mutex: Make __mutex_owner static to mutex.c") Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2019-08-08genirq/affinity: Create affinity mask for single vectorMing Lei1-4/+2
Since commit c66d4bd110a1f8 ("genirq/affinity: Add new callback for (re)calculating interrupt sets"), irq_create_affinity_masks() returns NULL in case of single vector. This change has caused regression on some drivers, such as lpfc. The problem is that single vector requests can happen in some generic cases: 1) kdump kernel 2) irq vectors resource is close to exhaustion. If in that situation the affinity mask for a single vector is not created, every caller has to handle the special case. There is no reason why the mask cannot be created, so remove the check for a single vector and create the mask. Fixes: c66d4bd110a1f8 ("genirq/affinity: Add new callback for (re)calculating interrupt sets") Signed-off-by: Ming Lei <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]