path: root/kernel
Age | Commit message | Author | Files | Lines
2020-06-29rcu: Support reclaim for head-less objectUladzislau Rezki (Sony)1-2/+43
Update the kvfree_call_rcu() function with head-less support. This allows RCU to reclaim objects without an embedded rcu_head. tree-RCU: We introduce two chains of arrays, one to store SLAB-backed pointers and one to store vmalloc pointers. Storage in either of these arrays does not require embedding an rcu_head within the object. Maintaining the arrays may become impossible due to high memory pressure. For such cases there is an emergency path. Objects with an rcu_head inside are just queued on a backup rcu_head list. Later on that list is drained. As for the head-less variant, because the current context can sleep, the following emergency measures are applied: a) Synchronously wait until a grace period has elapsed. b) Call kvfree(). tiny-RCU: For double argument calls, there are no new changes in behavior. For a single argument call, kvfree() is directly inlined on the current stack after a synchronize_rcu() call. Note that for tiny-RCU, any call to synchronize_rcu() is actually a quiescent state, therefore it does nothing. Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Co-developed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
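The head-less emergency path described above can be illustrated with a small userspace sketch. This is not the kernel code: reclaim_headless_sketch(), struct ptr_array_sketch and wait_for_grace_period_sketch() are hypothetical names, free() stands in for kvfree(), and the grace-period wait is only a counter so that the fallback can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how often the stand-in below ran; illustration only. */
static int grace_periods_waited;

/* Hypothetical stand-in for synchronize_rcu(); a real one blocks the caller. */
static void wait_for_grace_period_sketch(void)
{
	grace_periods_waited++;
}

/* Hypothetical pointer array: objects stored here need no embedded rcu_head. */
struct ptr_array_sketch {
	size_t nr;	/* slots in use */
	size_t cap;	/* total slots; 0 models "array could not be allocated" */
	void **slots;
};

/*
 * Returns 1 if the object was deferred in the array, 0 if the emergency
 * path ran: (a) wait for a grace period, (b) free the object right away.
 */
static int reclaim_headless_sketch(struct ptr_array_sketch *arr, void *obj)
{
	if (arr && arr->slots && arr->nr < arr->cap) {
		arr->slots[arr->nr++] = obj;
		return 1;
	}

	wait_for_grace_period_sketch();
	free(obj);	/* the kernel would call kvfree() here */
	return 0;
}
```

The fallback is only legal because the head-less caller may sleep; a caller in atomic context could not take this path, which is why objects with an rcu_head go to the backup list instead.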
2020-06-29rcu: Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_*Uladzislau Rezki (Sony)2-10/+10
The following changes are introduced: 1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(), as well as the associated trace events, so that rcu_kfree_callback() becomes rcu_kvfree_callback(). The reason is to align with the kvfree() notation. 2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU paths now use kvfree() instead of kfree(), hence the rename. 3. Rename kfree_call_rcu() to kvfree_call_rcu(), because it is now capable of freeing vmalloc() memory. For the same reason, the __kfree_rcu() macro becomes __kvfree_rcu(). Reviewed-by: Joel Fernandes (Google) <[email protected]> Co-developed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tiny: support vmalloc in tiny-RCUUladzislau Rezki (Sony)1-1/+2
Replace kfree() with kvfree() in rcu_reclaim_tiny(). This makes it possible to release either SLAB or vmalloc objects after a GP. Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tree: Maintain separate array for vmalloc ptrsUladzislau Rezki (Sony)1-73/+100
To do so, we use an array of kvfree_rcu_bulk_data structures. It consists of two elements: - index number 0 corresponds to slab pointers. - index number 1 corresponds to vmalloc pointers. Keeping vmalloc pointers separated from slab pointers makes it possible to invoke the right freeing API for the right kind of pointer. It also prepares us for future headless support for vmalloc and SLAB objects. Such objects cannot be queued on a linked list and are instead stored directly in an array. Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Co-developed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
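A minimal sketch of the two-channel layout, assuming a fixed-size array per channel. The names krc_sketch, chan_sketch and krc_queue() are hypothetical, and the is_vmalloc flag is passed in by the caller here because userspace has no way to classify a pointer the way the kernel can.

```c
#include <assert.h>
#include <stddef.h>

/* Index 0 holds slab pointers, index 1 holds vmalloc pointers. */
enum { CHAN_SLAB = 0, CHAN_VMALLOC = 1, NR_CHANS = 2 };

#define CHAN_CAPACITY 8	/* arbitrary size for the sketch */

struct chan_sketch {
	size_t nr;			/* pointers currently queued */
	void *records[CHAN_CAPACITY];
};

struct krc_sketch {
	struct chan_sketch chans[NR_CHANS];
};

/* Queue ptr on the channel matching its allocator; returns 0 if it is full. */
static int krc_queue(struct krc_sketch *krc, void *ptr, int is_vmalloc)
{
	struct chan_sketch *c = &krc->chans[is_vmalloc ? CHAN_VMALLOC : CHAN_SLAB];

	if (c->nr == CHAN_CAPACITY)
		return 0;

	c->records[c->nr++] = ptr;
	return 1;
}
```

With the pointers already sorted by kind, the drain side can hand each array to the matching free routine without inspecting individual entries.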
2020-06-29rcu/tree: cache specified number of objectsUladzislau Rezki (Sony)1-4/+62
In order to reduce the dynamic need for pages in kfree_rcu(), pre-allocate a configurable number of pages per CPU and link them in a list. When kfree_rcu() reclaims objects, the object's container page is cached into a list instead of being released to the low-level page allocator. Such an approach provides O(1) access to free pages while also reducing the number of requests to the page allocator. It also ensures that the kfree_rcu() code has free pages available during a low memory condition. A read-only sysfs parameter (rcu_min_cached_objs) reflects the minimum number of allowed cached pages per CPU. Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
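The caching scheme can be sketched as a singly linked free list with O(1) push and pop. Everything here is an illustrative assumption: page_cache_sketch, cache_get_page() and cache_put_page() are hypothetical names, malloc()/free() stand in for the page allocator, and the min_cached field only mirrors the role of rcu_min_cached_objs.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size */

/* A cached page reuses its own first bytes as the list link. */
struct cached_page {
	struct cached_page *next;
};

struct page_cache_sketch {
	struct cached_page *head;
	int nr_cached;		/* pages currently on the list */
	int min_cached;		/* cache limit, in the spirit of rcu_min_cached_objs */
};

/* O(1): pop a cached page if there is one, else ask the allocator. */
static void *cache_get_page(struct page_cache_sketch *pc)
{
	struct cached_page *p = pc->head;

	if (p) {
		pc->head = p->next;
		pc->nr_cached--;
		return p;
	}
	return malloc(SKETCH_PAGE_SIZE);
}

/* O(1): keep the page for reuse while below the limit, else release it. */
static void cache_put_page(struct page_cache_sketch *pc, void *page)
{
	struct cached_page *p = page;

	if (pc->nr_cached < pc->min_cached) {
		p->next = pc->head;
		pc->head = p;
		pc->nr_cached++;
		return;
	}
	free(page);
}
```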
2020-06-29rcu/tree: Use static initializer for krc.lockSebastian Andrzej Siewior1-7/+6
The per-CPU variable is initialized at runtime in kfree_rcu_batch_init(). This function is invoked before 'rcu_scheduler_active' is set to 'RCU_SCHEDULER_RUNNING'. After the initialisation, '->initialized' is set to true. The raw_spin_lock is only acquired if '->initialized' is set to true. The workqueue item is only used if 'rcu_scheduler_active' is set to RCU_SCHEDULER_RUNNING, which happens after initialisation. Use a static initializer for krc.lock and remove the runtime initialisation of the lock. Since the lock can now be always acquired, remove the '->initialized' check. Cc: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tree: Move kfree_rcu_cpu locking/unlocking to separate functionsUladzislau Rezki (Sony)1-8/+23
Introduce helpers to lock and unlock per-cpu "kfree_rcu_cpu" structures. That will make kfree_call_rcu() more readable and prevent programming errors. Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tree: Simplify KFREE_BULK_MAX_ENTR macroUladzislau Rezki (Sony)1-8/+9
We can simplify the KFREE_BULK_MAX_ENTR macro and get rid of the magic numbers which were used to make the structure exactly one page in size. Suggested-by: Boqun Feng <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
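One way to avoid magic numbers of this kind is to derive the slot count from the header size, with the pointer slots in a flexible array member. The sketch below assumes a 4096-byte page and a two-field header; bulk_data_sketch, BULK_MAX_ENTR_SKETCH and bulk_block_bytes() are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed; the kernel would use PAGE_SIZE */

/* Hypothetical block layout: a small header followed by pointer slots. */
struct bulk_data_sketch {
	unsigned long nr_records;
	struct bulk_data_sketch *next;
	void *records[];	/* flexible array: fills the rest of the page */
};

/* Slot count computed from the header size rather than hard-coded. */
#define BULK_MAX_ENTR_SKETCH \
	((SKETCH_PAGE_SIZE - sizeof(struct bulk_data_sketch)) / sizeof(void *))

/* Footprint of one block with every slot in use. */
static inline size_t bulk_block_bytes(void)
{
	return sizeof(struct bulk_data_sketch) +
	       BULK_MAX_ENTR_SKETCH * sizeof(void *);
}
```

If a field is later added to the header, the slot count shrinks automatically and the block still fits in one page.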
2020-06-29rcu/tree: Make debug_objects logic independent of rcu_headJoel Fernandes (Google)1-16/+13
kfree_rcu()'s debug_objects logic uses the address of the object's embedded rcu_head to queue/unqueue. Instead of this, make use of the object's address itself as preparation for future headless kfree_rcu() support. Reviewed-by: Uladzislau Rezki <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tree: Repeat the monitor if any free channel is busyUladzislau Rezki (Sony)1-3/+6
It is possible that one of the channels cannot be detached because its free channel is busy and previously queued data has not been processed yet. On the other hand, another channel can be successfully detached causing the monitor work to stop. Prevent that by rescheduling the monitor work if there are any channels in the pending state after a detach attempt. Fixes: 34c881745549e ("rcu: Support kfree_bulk() interface in kfree_rcu()") Acked-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
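The rescheduling rule can be sketched as a single pass over the channels that reports whether anything is still pending. The state fields and monitor_pass_sketch() are hypothetical simplifications; the real code tracks considerably more than two booleans per channel.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CHANS 2

/* Hypothetical per-channel state for the sketch. */
struct chan_state {
	bool pending;		/* has queued data waiting to be detached */
	bool free_channel_busy;	/* earlier batch not processed yet */
};

/*
 * Try to detach every pending channel. Returns true when at least one
 * channel is still pending afterwards, meaning the monitor must run again.
 */
static bool monitor_pass_sketch(struct chan_state chans[NR_CHANS])
{
	bool repeat = false;

	for (int i = 0; i < NR_CHANS; i++) {
		if (!chans[i].pending)
			continue;

		if (chans[i].free_channel_busy) {
			repeat = true;	/* could not detach: retry later */
			continue;
		}

		chans[i].pending = false;		/* detached */
		chans[i].free_channel_busy = true;	/* batch now in flight */
	}

	return repeat;
}
```

The point of the return value is that one successful detach must not hide another channel that is still waiting.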
2020-06-29rcu/tree: Skip entry into the page allocator for PREEMPT_RTJoel Fernandes (Google)1-0/+12
To keep the kfree_rcu() code working in purely atomic sections on RT, such as non-threaded IRQ handlers and raw spinlock sections, avoid calling into the page allocator which uses sleeping locks on RT. In fact, even if the caller is preemptible, the kfree_rcu() code is not, as the krcp->lock is a raw spinlock. Calling into the page allocator is optional and avoiding it should be OK, especially with the page pre-allocation support in future patches. Such pre-allocation would further avoid the need for a dynamically allocated page in the first place. Cc: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Uladzislau Rezki <[email protected]> Co-developed-by: Uladzislau Rezki <[email protected]> Signed-off-by: Uladzislau Rezki <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu/tree: Keep kfree_rcu() awake during lock contentionJoel Fernandes (Google)1-15/+15
On PREEMPT_RT kernels, the krcp spinlock gets converted to an rt-mutex and causes kfree_rcu() callers to sleep. This makes it unusable for callers in purely atomic sections such as non-threaded IRQ handlers and raw spinlock sections. Fix it by converting the spinlock to a raw spinlock. Vetting all code paths, there is no reason to believe that the raw spinlock will hurt RT latencies as it is not held for a long time. Cc: [email protected] Cc: Uladzislau Rezki <[email protected]> Reviewed-by: Uladzislau Rezki <[email protected]> Signed-off-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Fix a kernel-doc warnings for "count"Mauro Carvalho Chehab1-1/+1
There are some kernel-doc warnings: ./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu' This commit therefore moves the comment for "count" to the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29kernel/rcu/tree.c: Fix kernel-doc warningsRandy Dunlap1-1/+0
Fix kernel-doc warning: ../kernel/rcu/tree.c:959: warning: Excess function parameter 'irq' description in 'rcu_nmi_enter' Fixes: cf7614e13c8f ("rcu: Refactor rcu_{nmi,irq}_{enter,exit}()") Signed-off-by: Randy Dunlap <[email protected]> Cc: Byungchul Park <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: grpnum just records group numberWei Yang1-1/+1
The ->grpnum field in the rcu_node structure contains the bit position in this structure's parent's bitmasks, which is not the CPU number. This commit therefore adjusts this field's comment accordingly. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: grplo/grphi just records CPU numberWei Yang1-2/+2
The ->grplo and ->grphi fields store the lowest and highest CPU number covered by an rcu_node structure, which is not the group number. This commit therefore adjusts these fields' comments to match reality. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: gp_max is protected by root rcu_node's lockWei Yang1-2/+2
Because gp_max is protected by root rcu_node's lock, this commit moves the gp_max definition to the region of the rcu_node structure containing fields protected by this lock. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Stop shrinker loopPeter Enderborg1-1/+1
The count and scan can be separated in time, and there is a fair chance that all work is already done when the scan starts, which might in turn result in a needless retry. This commit therefore avoids this retry by returning SHRINK_STOP. Reviewed-by: Uladzislau Rezki (Sony) <[email protected]> Signed-off-by: Peter Enderborg <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Replace 1 with trueJules Irenge1-1/+1
Coccinelle reports a warning WARNING: Assignment of 0/1 to bool variable The root cause is that the variable lastphase is a bool, but is initialised with integer 1. This commit therefore replaces the 1 with a true. Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29lockdep: Complain only once about RCU in extended quiescent statePaul E. McKenney1-3/+1
Currently, lockdep_rcu_suspicious() complains twice about RCU read-side critical sections being invoked from within extended quiescent states, for example: RCU used illegally from idle CPU! rcu_scheduler_active = 2, debug_locks = 1 RCU used illegally from extended quiescent state! This commit therefore saves a couple lines of code and one line of console-log output by eliminating the first of these two complaints. Link: https://lore.kernel.org/lkml/[email protected] Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstrPaul E. McKenney1-1/+4
The objtool complains about the call to rcu_cleanup_after_idle() from rcu_nmi_enter(), so this commit adds instrumentation_begin() before that call and instrumentation_end() after it. Acked-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Remove initialized but unused rnp from check_slow_task()Paul E. McKenney1-2/+0
This commit removes the variable rnp from check_slow_task(), which is defined, assigned to, but not otherwise used. Reported-by: kbuild test robot <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29tick/nohz: Narrow down noise while setting current task's tick dependencyFrederic Weisbecker1-7/+15
Setting a tick dependency on any task, including the case where a task sets that dependency on itself, triggers an IPI to all CPUs. That is of course suboptimal but it had previously not been an issue because it was only used by POSIX CPU timers on nohz_full, which apparently never occurs in latency-sensitive workloads in production. (Or users of such systems are suffering in silence on the one hand or venting their ire on the wrong people on the other.) But RCU now sets a task tick dependency on the current task in order to fix stall issues that can occur during RCU callback processing. Thus, RCU callback processing triggers frequent system-wide IPIs from nohz_full CPUs. This is quite counter-productive, after all, avoiding IPIs is what nohz_full is supposed to be all about. This commit therefore optimizes tasks' self-setting of a task tick dependency by using tick_nohz_full_kick() to avoid the system-wide IPI. Instead, only the execution of the one task is disturbed, which is acceptable given that this disturbance is well down into the noise compared to the degree to which the RCU callback processing itself disturbs execution. Fixes: 6a949b7af82d (rcu: Force on tick when invoking lots of callbacks) Reported-by: Matt Fleming <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Cc: [email protected] Cc: Paul E. McKenney <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Update comment from rsp->rcu_gp_seq to rsp->gp_seqLihao Liang1-2/+2
Signed-off-by: Lihao Liang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Expedited grace-period sleeps to idle priorityPaul E. McKenney1-1/+1
This commit converts the schedule_timeout_uninterruptible() call used by RCU's expedited grace-period processing to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: No-CBs-related sleeps to idle priorityPaul E. McKenney1-1/+1
This commit converts the schedule_timeout_interruptible() call used by RCU's no-CBs grace-period kthreads to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Priority-boost-related sleeps to idle priorityPaul E. McKenney1-1/+1
This commit converts the long-standing schedule_timeout_interruptible() call used by RCU's priority-boosting kthreads to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Grace-period-kthread related sleeps to idle priorityPaul E. McKenney1-3/+3
This commit converts the long-standing schedule_timeout_interruptible() and schedule_timeout_uninterruptible() calls used by RCU's grace-period kthread to schedule_timeout_idle(). This conversion avoids polluting the load-average with RCU-related sleeping. Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Add comment documenting rcu_callback_map's purposePaul E. McKenney1-0/+1
The rcu_callback_map lockdep_map structure was added back in 2013, but its purpose has become obscure. This commit therefore documents that the purpose of rcu_callback_map is, in the words of commit 24ef659a857 ("rcu: Provide better diagnostics for blocking in RCU callback functions"), to help lockdep to tie an "inappropriate voluntary context switch back to the fact that the function is being invoked from within a callback." Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Add callbacks-invoked countersPaul E. McKenney3-0/+5
This commit adds a count of the callbacks invoked to the per-CPU rcu_data structure. This count is printed by the show_rcu_gp_kthreads() that is invoked by rcutorture and the RCU CPU stall-warning code. It is also intended for use by drgn. Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29rcu: Simplify the calculation of rcu_state.ncpusWei Yang1-6/+3
There is only 1 bit set in mask, which means that the only difference between oldmask and the new one will be at the position where the bit is set in mask. This commit therefore updates rcu_state.ncpus by checking whether the bit in mask is already set in rnp->expmaskinitnext. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
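The reasoning can be checked with a small sketch that compares two ways of computing the increment. The XOR-and-count variant is an assumption about what a pre-simplification calculation could look like, based only on the description above; ncpus_delta_old(), ncpus_delta_new() and popcount_ul() are hypothetical names.

```c
#include <assert.h>

/* Portable population count for the sketch. */
static int popcount_ul(unsigned long v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Assumed older style: count the bits that differ between old and new mask. */
static int ncpus_delta_old(unsigned long *expmask, unsigned long mask)
{
	unsigned long oldmask = *expmask;

	*expmask |= mask;
	return popcount_ul(oldmask ^ *expmask);
}

/* Simplified style: mask has one bit, so just test whether it was set. */
static int ncpus_delta_new(unsigned long *expmask, unsigned long mask)
{
	int newcpu = !(*expmask & mask);

	*expmask |= mask;
	return newcpu;
}
```

Both functions agree only under the stated precondition that mask has exactly one bit set; with a multi-bit mask the simplified form would undercount.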
2020-06-29rcu: Initialize and destroy rcu_synchronize only when necessaryWei Yang1-5/+7
The __wait_rcu_gp() function unconditionally initializes and cleans up each element of rs_array[], whether used or not. This is slightly wasteful and rather confusing, so this commit skips both initialization and cleanup for duplicate callback functions. Signed-off-by: Wei Yang <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29docs: RCU: Convert stallwarn.txt to ReSTMauro Carvalho Chehab1-2/+2
- Add a SPDX header; - Adjust document and section titles; - Fix list markups; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29docs: RCU: Convert torture.txt to ReSTMauro Carvalho Chehab1-1/+1
- Add a SPDX header; - Adjust document and section titles; - Some whitespace fixes and new line breaks; - Mark literal blocks as such; - Add it to RCU/index.rst. Signed-off-by: Mauro Carvalho Chehab <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2020-06-29Merge branch 'linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes two race conditions, one in padata and one in af_alg" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: padata: upgrade smp_mb__after_atomic to smp_mb in padata_do_serial crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()
2020-06-29x86/ftrace: Only have the builtin ftrace_regs_caller call direct hooksSteven Rostedt (VMware)1-0/+8
If a direct hook is attached to a function that ftrace also has a function attached to it, then it is required that the ftrace_ops_list_func() is used to iterate over the registered ftrace callbacks. This will also include the direct ftrace_ops helper, that tells ftrace_regs_caller where to return to (the direct callback and not the function that called it). As this direct helper is only to handle the case of ftrace callbacks attached to the same function as the direct callback, the ftrace callback allocated trampolines (used to only call them), should never be used to return back to a direct callback. Only copy the portion of the ftrace_regs_caller that will return back to what called it, and not the portion that returns back to the direct caller. The direct ftrace_ops must then pick the ftrace_regs_caller builtin function as its own trampoline to ensure that it will never have one allocated for it (which would not include the handling of direct callbacks). Link: http://lkml.kernel.org/r/[email protected] Cc: Peter Zijlstra <[email protected]> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-06-29cgroup: unexport cgroup_rstat_updatedChristoph Hellwig1-1/+0
cgroup_rstat_updated is only used by core block code, no need to export it. Acked-by: Tejun Heo <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Jens Axboe <[email protected]>
2020-06-29tracing: Only allow trace_array_printk() to be used by instancesSteven Rostedt (VMware)1-3/+7
To prevent default "trace_printks()" from spamming the top level tracing ring buffer, only allow trace instances to use trace_array_printk() (which can be used without the trace_printk() start up warning). Signed-off-by: Steven Rostedt (VMware) <[email protected]>
2020-06-29dma-mapping: warn when coherent pool is depletedDavid Rientjes1-1/+5
When a DMA coherent pool is depleted, allocation failures may or may not get reported in the kernel log depending on the allocator. The admin does have a workaround, however, by using coherent_pool= on the kernel command line. Provide some guidance on the failure and a recommended minimum size for the pools (double the size). Signed-off-by: David Rientjes <[email protected]> Tested-by: Guenter Roeck <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2020-06-28Merge tag 'sched_urgent_for_5.8_rc3' of ↵Linus Torvalds5-30/+37
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: "The most anticipated fix in this pull request is probably the horrible build fix for the RANDSTRUCT fail that didn't make -rc2. Also included is the cleanup that removes those BUILD_BUG_ON()s and replaces them with ugly unions. Also included is the try_to_wake_up() race fix that was first triggered by Paul's RCU-torture runs, but was independently hit by Dave Chinner's fstest runs as well" * tag 'sched_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cfs: change initial value of runnable_avg smp, irq_work: Continue smp_call_function*() and irq_work*() integration sched/core: s/WF_ON_RQ/WQ_ON_CPU/ sched/core: Fix ttwu() race sched/core: Fix PI boosting between RT and DEADLINE tasks sched/deadline: Initialize ->dl_boosted sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to fix mask corruption sched/core: Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build fail
2020-06-28Merge tag 'rcu_urgent_for_5.8_rc3' of ↵Linus Torvalds1-7/+25
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU-vs-KCSAN fixes from Borislav Petkov: "A single commit that uses "arch_" atomic operations to avoid the instrumentation that comes with the non-"arch_" versions. In preparation for that commit, it also has another commit that makes these "arch_" atomic operations available to generic code. Without these commits, KCSAN users can see pointless errors" * tag 'rcu_urgent_for_5.8_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Fixup noinstr warnings locking/atomics: Provide the arch_atomic_ interface to generic code
2020-06-28sched/cfs: change initial value of runnable_avgVincent Guittot1-1/+1
A performance regression on the reaim benchmark has been reported with commit 070f5e860ee2 ("sched/fair: Take into account runnable_avg to classify group"). The problem comes from the initial value of runnable_avg, which is initialized with the max value. This can be a problem if the newly forked task turns out to be a short task, because the group of CPUs is wrongly classified as overloaded and tasks are pulled less aggressively. Set the initial value of runnable_avg equal to util_avg to reflect that there is no waiting time so far. Fixes: 070f5e860ee2 ("sched/fair: Take into account runnable_avg to classify group") Reported-by: kernel test robot <[email protected]> Signed-off-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-28smp, irq_work: Continue smp_call_function*() and irq_work*() integrationPeter Zijlstra2-21/+3
Instead of relying on BUG_ON() to ensure the various data structures line up, use a bunch of horrible unions to make it all automatic. Much of the union magic is to ensure irq_work and smp_call_function do not (yet) see the members of their respective data structures change name. Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Frederic Weisbecker <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-28sched/core: s/WF_ON_RQ/WQ_ON_CPU/Peter Zijlstra2-3/+3
Use a better name for this poorly named flag, to avoid confusion... Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Mel Gorman <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-28sched/core: Fix ttwu() racePeter Zijlstra1-5/+28
Paul reported rcutorture occasionally hitting a NULL deref: sched_ttwu_pending() ttwu_do_wakeup() check_preempt_curr() := check_preempt_wakeup() find_matching_se() is_same_group() if (se->cfs_rq == pse->cfs_rq) <-- *BOOM* Debugging showed that this only appears to happen when we take the new code-path from commit: 2ebb17717550 ("sched/core: Offload wakee task activation if it the wakee is descheduling") and only when @cpu == smp_processor_id(). Something which should not be possible, because p->on_cpu can only be true for remote tasks. Similarly, without the new code-path from commit: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") this would've unconditionally hit: smp_cond_load_acquire(&p->on_cpu, !VAL); and if: 'cpu == smp_processor_id() && p->on_cpu' is possible, this would result in an instant live-lock (with IRQs disabled), something that hasn't been reported. The NULL deref can be explained however if the task_cpu(p) load at the beginning of try_to_wake_up() returns an old value, and this old value happens to be smp_processor_id(). Further assume that the p->on_cpu load accurately returns 1, it really is still running, just not here. Then, when we enqueue the task locally, we can crash in exactly the observed manner because p->se.cfs_rq != rq->cfs_rq, because p's cfs_rq is from the wrong CPU, therefore we'll iterate into the non-existent parents and NULL deref.
The closest semi-plausible scenario I've managed to contrive is somewhat elaborate (then again, actual reproduction takes many CPU hours of rcutorture, so it can't be anything obvious):

    X->cpu = 1
    rq(1)->curr = X

    CPU0                              CPU1                              CPU2

                                      // switch away from X
                                      LOCK rq(1)->lock
                                      smp_mb__after_spinlock
                                      dequeue_task(X)
                                        X->on_rq = 0
                                      switch_to(Z)
                                        X->on_cpu = 0
                                      UNLOCK rq(1)->lock

                                                                        // migrate X to cpu 0
                                                                        LOCK rq(1)->lock
                                                                        dequeue_task(X)
                                                                        set_task_cpu(X, 0)
                                                                          X->cpu = 0
                                                                        UNLOCK rq(1)->lock

                                                                        LOCK rq(0)->lock
                                                                        enqueue_task(X)
                                                                          X->on_rq = 1
                                                                        UNLOCK rq(0)->lock

    // switch to X
    LOCK rq(0)->lock
    smp_mb__after_spinlock
    switch_to(X)
      X->on_cpu = 1
    UNLOCK rq(0)->lock

    // X goes sleep
    X->state = TASK_UNINTERRUPTIBLE
    smp_mb();                         // wake X
                                      ttwu()
                                        LOCK X->pi_lock
                                        smp_mb__after_spinlock

                                        if (p->state)

                                        cpu = X->cpu; // =? 1

                                        smp_rmb()

    // X calls schedule()
    LOCK rq(0)->lock
    smp_mb__after_spinlock
    dequeue_task(X)
      X->on_rq = 0

                                        if (p->on_rq)

                                        smp_rmb();

                                        if (p->on_cpu && ttwu_queue_wakelist(..)) [*]

                                        smp_cond_load_acquire(&p->on_cpu, !VAL)

                                        cpu = select_task_rq(X, X->wake_cpu, ...)
                                        if (X->cpu != cpu)
    switch_to(Y)
      X->on_cpu = 0
    UNLOCK rq(0)->lock

However I'm having trouble convincing myself that's actually possible on x86_64 -- after all, every LOCK implies an smp_mb() there, so if ttwu observes ->state != RUNNING, it must also observe ->cpu != 1. (Most of the previous ttwu() races were found on very large PowerPC) Nevertheless, this fully explains the observed failure case. Fix it by ordering the task_cpu(p) load after the p->on_cpu load, which is easy since nothing actually uses @cpu before this. Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: Paul E. McKenney <[email protected]> Tested-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
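The ordering idea behind the fix can be shown in userspace with C11 atomics: read the CPU field only after an acquire load of on_cpu. This is an analogy under stated assumptions, not the scheduler code: task_sketch and wake_cpu_sketch() are hypothetical, and the sketch assumes the writer publishes cpu before clearing on_cpu with a release store.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical task with just the two fields involved in the race. */
struct task_sketch {
	atomic_int on_cpu;	/* non-zero while the task runs somewhere */
	atomic_int cpu;		/* CPU the task was last placed on */
};

/*
 * Wait until the task has stopped running, then read its CPU. The acquire
 * load orders the later cpu load after it, so a stale cpu value from before
 * the matching release store cannot be returned.
 */
static int wake_cpu_sketch(struct task_sketch *p)
{
	while (atomic_load_explicit(&p->on_cpu, memory_order_acquire))
		;	/* spin: the task is still running elsewhere */

	return atomic_load_explicit(&p->cpu, memory_order_relaxed);
}
```

Loading cpu first and checking on_cpu afterwards would permit exactly the stale-value combination described in the scenario above.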
2020-06-28sched/core: Fix PI boosting between RT and DEADLINE tasksJuri Lelli1-1/+2
syzbot reported the following warning: WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628 enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504 At deadline.c:628 we have:

    623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
    624 {
    625         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
    626         struct rq *rq = rq_of_dl_rq(dl_rq);
    627
    628         WARN_ON(dl_se->dl_boosted);
    629         WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
                [...]
        }

Which means that setup_new_dl_entity() has been called on a task that is currently boosted. This shouldn't happen though, as setup_new_dl_entity() is only called when the 'dynamic' deadline of the new entity is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this condition. Digging through the PI code I noticed that the above might in fact happen if an RT task blocks on an rt_mutex held by a DEADLINE task. In the first branch of boosting conditions we check only if a pi_task 'dynamic' deadline is earlier than the mutex holder's and in this case we set the mutex holder to be dl_boosted. However, since RT 'dynamic' deadlines are only initialized if such tasks get boosted at some point (or if they become DEADLINE of course), in general RT 'dynamic' deadlines are usually equal to 0 and this verifies the aforementioned condition. Fix it by checking that the potential donor task is actually (even if temporarily because in turn boosted) running at DEADLINE priority before using its 'dynamic' deadline value. Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic") Reported-by: [email protected] Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Reviewed-by: Daniel Bristot de Oliveira <[email protected]> Tested-by: Daniel Wagner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
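The effect of the extra check can be reproduced with a small sketch. All names here (pi_task_sketch, should_boost_old(), should_boost_fixed()) are hypothetical, and the sketch assumes that a negative prio value marks a DEADLINE task and that deadlines are compared with wrap-safe signed subtraction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task: only the fields needed for the boosting decision. */
struct pi_task_sketch {
	int prio;		/* assumed: negative means DEADLINE priority */
	uint64_t deadline;	/* 'dynamic' deadline; 0 if never initialized */
};

static bool is_dl_prio_sketch(int prio)
{
	return prio < 0;
}

/* Wrap-safe "a is earlier than b". */
static bool deadline_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Problematic check: an RT donor with deadline 0 always looks earlier. */
static bool should_boost_old(const struct pi_task_sketch *donor,
			     const struct pi_task_sketch *holder)
{
	return donor && deadline_before(donor->deadline, holder->deadline);
}

/* Fixed check: trust the donor's deadline only if it runs at DEADLINE prio. */
static bool should_boost_fixed(const struct pi_task_sketch *donor,
			       const struct pi_task_sketch *holder)
{
	return donor && is_dl_prio_sketch(donor->prio) &&
	       deadline_before(donor->deadline, holder->deadline);
}
```

An RT donor with an uninitialized deadline of 0 passes the old check against any holder with a later deadline, while the fixed check rejects it.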
2020-06-28sched/deadline: Initialize ->dl_boostedJuri Lelli1-0/+1
syzbot reported the following warning triggered via SYSC_sched_setattr(): WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline] WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline] WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441 This happens because the ->dl_boosted flag is currently not initialized by __dl_clear_params() (unlike the other flags) and setup_new_dl_entity() rightfully complains about it. Initialize dl_boosted to 0. Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic") Reported-by: [email protected] Signed-off-by: Juri Lelli <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Daniel Wagner <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-28sched/core: Check cpus_mask, not cpus_ptr in __set_cpus_allowed_ptr(), to ↵Scott Wood1-1/+1
fix mask corruption This function is concerned with the long-term CPU mask, not the transitory mask the task might have while migrate disabled. Before this patch, if a task was migrate-disabled at the time __set_cpus_allowed_ptr() was called, and the new mask happened to be equal to the CPU that the task was running on, then the mask update would be lost. Signed-off-by: Scott Wood <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2020-06-27Merge tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds3-28/+39
Pull dma-mapping fixes from Christoph Hellwig: - fix dma coherent mmap in nommu (me) - more AMD SEV fallout (David Rientjes, me) - fix alignment in dma_common_*_remap (Eric Auger) * tag 'dma-mapping-5.8-4' of git://git.infradead.org/users/hch/dma-mapping: dma-remap: align the size in dma_common_*_remap() dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR dma-direct: add missing set_memory_decrypted() for coherent mapping dma-direct: check return value when encrypting or decrypting memory dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails dma-direct: always align allocation size in dma_direct_alloc_pages() dma-direct: mark __dma_direct_alloc_pages static dma-direct: re-enable mmap for !CONFIG_MMU
2020-06-27Merge tag 'kgdb-5.8-rc3' of ↵Linus Torvalds2-29/+47
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb fixes from Daniel Thompson: "The main change here is a fix for a number of unsafe interactions between kdb and the console system. The fixes are specific to kdb (pure kgdb debugging does not use the console system at all). On systems with an NMI then kdb, if it is enabled, must get messages to the user despite potentially running from some "difficult" calling contexts. These fixes avoid using the console system where we have been provided an alternative (safer) way to interact with the user and, if using the console system is unavoidable, use oops_in_progress for deadlock avoidance. These fixes also ensure kdb honours the console enable flag. Also included is a fix that wraps kgdb trap handling in an RCU read lock to avoid triggering diagnostic warnings. This is a wide lock scope but this is OK because kgdb is a stop-the-world debugger. When we stop the world we put all the CPUs into holding pens and this inhibits RCU update anyway" * tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kgdb: Avoid suspicious RCU usage warning kdb: Switch to use safer dbg_io_ops over console APIs kdb: Make kdb_printf() console handling more robust kdb: Check status of console prior to invoking handlers kdb: Re-factor kdb_printf() message write code