path: root/kernel
Age | Commit message | Author | Files | Lines
2021-05-18kcsan: Remove reporting indirectionMark Rutland1-66/+49
Now that we have separate kcsan_report_*() functions, we can factor the distinct logic for each of the report cases out of kcsan_report(). While this means each case has to handle mutual exclusion independently, this minimizes the conditionality of code and makes it easier to read, and will permit passing distinct bits of information to print_report() in future. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: retain comment about lockdep_off() ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Refactor access_info initializationMark Rutland1-17/+25
In subsequent patches we'll want to split kcsan_report() into distinct handlers for each report type. The largest bit of common work is initializing the `access_info`, so let's factor this out into a helper, and have the kcsan_report_*() functions pass the `access_info` as a parameter to kcsan_report(). There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Fold panic() call into print_report()Mark Rutland1-13/+8
So that we can add more callers of print_report(), let's fold the panic() call into print_report() so the caller doesn't have to handle this explicitly. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Refactor passing watchpoint/other_infoMark Rutland1-9/+4
The `watchpoint_idx` argument to kcsan_report() isn't meaningful for races which were not detected by a watchpoint, and it would be clearer if callers passed the other_info directly so that a NULL value can be passed in this case. Given that callers manipulate their watchpoints before passing the index into kcsan_report_*(), and given we index the `other_infos` array using this before we sanity-check it, the subsequent sanity check isn't all that useful. Let's remove the `watchpoint_idx` sanity check, and move the job of finding the `other_info` out of kcsan_report(). Other than the removal of the check, there should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Distinguish kcsan_report() callsMark Rutland3-15/+33
Currently kcsan_report() is used to handle three distinct cases: * The caller hit a watchpoint when attempting an access. Some information regarding the caller and access is recorded, but no output is produced. * A caller which previously set up a watchpoint detected that the watchpoint has been hit, and possibly detected a change to the location in memory being watched. This may result in output reporting the interaction between this caller and the caller which hit the watchpoint. * A caller detected a modification to a memory location that wasn't detected by a watchpoint, for which there is no information on the other thread. This may result in output reporting the unexpected change. ... depending on the specific case the caller has distinct pieces of information available, but the prototype of kcsan_report() has to handle all three cases. This means that in some cases we pass redundant information, and in others we don't pass all the information we could pass. This also means that the report code has to demux these three cases. So that we can pass some additional information while also simplifying the callers and report code, add separate kcsan_report_*() functions for the distinct cases, updating callers accordingly. As the watchpoint_idx is unused in the case of kcsan_report_unknown_origin(), this passes a dummy value into kcsan_report(). Subsequent patches will refactor the report code to avoid this. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> [ [email protected]: try to make kcsan_report_*() names more descriptive ] Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18kcsan: Simplify value change detectionMark Rutland1-24/+16
In kcsan_setup_watchpoint() we store snapshots of a watched value into a union of u8/u16/u32/u64 sized fields, modify this in place using a consistent field, then later check for any changes via the u64 field. We can achieve the same effect more simply by always treating the field as a u64, as smaller values will be zero-extended. As the values are zero-extended, we don't need to truncate the access_mask when we apply it, and can always apply the full 64-bit access_mask to the 64-bit value. Finally, we can store the two snapshots and calculated difference separately, which makes the code a little easier to read, and will permit reporting the old/new values in subsequent patches. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Marco Elver <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
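For illustration, a minimal sketch of the scheme described above (not the actual KCSAN code; the helper names are invented):

  /* Read up to 8 bytes at addr, zero-extended into a u64. */
  static u64 snapshot_value(const volatile void *addr, size_t size)
  {
      switch (size) {
      case 1:  return *(const volatile u8 *)addr;
      case 2:  return *(const volatile u16 *)addr;
      case 4:  return *(const volatile u32 *)addr;
      case 8:  return *(const volatile u64 *)addr;
      default: return 0;
      }
  }

  /* Because both snapshots are zero-extended, the full 64-bit access_mask
   * can be applied without any truncation. */
  static bool value_changed(u64 old_val, u64 new_val, u64 access_mask)
  {
      u64 diff = old_val ^ new_val;

      if (access_mask)
          diff &= access_mask;
      return diff != 0;
  }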
2021-05-18kcsan: Fix debugfs initcall return typeArnd Bergmann1-1/+2
clang with CONFIG_LTO_CLANG points out that an initcall function should return an 'int' due to the changes made to the initcall macros in commit 3578ad11f3fb ("init: lto: fix PREL32 relocations"): kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int' late_initcall(kcsan_debugfs_init); ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~ include/linux/init.h:292:46: note: expanded from macro 'late_initcall' #define late_initcall(fn) __define_initcall(fn, 7) Fixes: e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early init") Cc: stable <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Reviewed-by: Marco Elver <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Reviewed-by: Miguel Ojeda <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-18Merge branches 'bitmaprange.2021.05.10c', 'doc.2021.05.10c', ↵Paul E. McKenney15-469/+731
'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD bitmaprange.2021.05.10c: Allow "all" for bitmap ranges. doc.2021.05.10c: Documentation updates. fixes.2021.05.13a: Miscellaneous fixes. kvfree_rcu.2021.05.10c: kvfree_rcu() updates. mmdumpobj.2021.05.10c: mem_dump_obj() updates. nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading. srcu.2021.05.12a: SRCU updates. tasks.2021.05.18a: Tasks-RCU updates. torture.2021.05.10c: Torture-test updates.
2021-05-18tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inlinePaul E. McKenney2-1/+4
On some architectures, the no-op variant of show_rcu_tasks_gp_kthreads() gets "no previous prototype" compiler warnings. These are false positives given that kernel/rcu/tasks.h is included only once. But why put up with the compiler noise? This commit therefore adds "static inline" to this definition to force the compiler to accept this situation, while also moving it to its proper place in kernel/rcu/rcu.h. Reported-by: kernel test robot <[email protected]> [ paulmck: Update per Stephen Rothwell feedback. ] Signed-off-by: Paul E. McKenney <[email protected]>
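The resulting pattern, sketched below (the exact Kconfig guard and header placement are assumptions based on the commit message):

  /* kernel/rcu/rcu.h (sketch) */
  #ifdef CONFIG_TASKS_RCU_GENERIC
  void show_rcu_tasks_gp_kthreads(void);
  #else
  static inline void show_rcu_tasks_gp_kthreads(void) { }
  #endif

The static inline stub generates no code and, unlike a plain out-of-line definition, does not trip the missing-prototype warning.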
2021-05-18rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent statesPaul E. McKenney1-0/+1
Heavy networking load can cause a CPU to execute continuously and indefinitely within ksoftirqd, in which case there will be no voluntary task switches and thus no RCU-tasks quiescent states. This commit therefore causes the existing rcu_softirq_qs() to provide an RCU-tasks quiescent state. This of course means that __do_softirq() and its callers cannot be invoked from within a tracing trampoline. Reported-by: Toke Høiland-Jørgensen <[email protected]> Tested-by: Toke Høiland-Jørgensen <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: Masami Hiramatsu <[email protected]>
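A sketch of the resulting function based on the description above (the exact body may differ):

  /* kernel/rcu/tree.c (sketch) */
  void rcu_softirq_qs(void)
  {
      rcu_qs();
      rcu_preempt_deferred_qs(current);
      rcu_tasks_qs(current, false);   /* new: also report an RCU Tasks QS */
  }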
2021-05-18sched: Make the idle task quack like a per-CPU kthreadValentin Schneider2-18/+33
For all intents and purposes, the idle task is a per-CPU kthread. It isn't created via the same route as other pcpu kthreads however, and as a result it is missing a few bells and whistles: it fails kthread_is_per_cpu() and it doesn't have PF_NO_SETAFFINITY set. Fix the former by giving the idle task a kthread struct along with the KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle() can be called more than once on the same idle task. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-18sched,stats: Further simplify sched_infoPeter Zijlstra1-9/+12
There's no point doing delta==0 updates. Suggested-by: Mel Gorman <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2021-05-18locking/mutex: clear MUTEX_FLAGS if wait_list is empty due to signalZqiang4-11/+17
When an interruptible mutex locker is interrupted by a signal without acquiring the lock, it is removed from the wait queue. If the mutex then isn't contended enough to have another waiter put into the wait queue, the stale WAITER bit forces every subsequent mutex locker into the slowpath to acquire the lock. So if the wait queue is empty, the WAITER bit needs to be cleared. Fixes: 040a0a371005 ("mutex: Add support for wound/wait style locks") Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Zqiang <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
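A rough sketch of the intent (the helper below is hypothetical; MUTEX_FLAGS and __mutex_clear_flag() are named after the existing mutex internals, not the exact patch):

  /* Signal-interrupted error path, called with lock->wait_lock held. */
  static void remove_waiter_on_signal(struct mutex *lock,
                                      struct mutex_waiter *waiter)
  {
      list_del(&waiter->list);
      if (likely(list_empty(&lock->wait_list)))
          /* No waiters left: clear the flag bits so later lockers
           * can take the fastpath again. */
          __mutex_clear_flag(lock, MUTEX_FLAGS);
  }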
2021-05-18locking/lockdep: Correct calling tracepointsLeo Yan1-2/+2
The commit eb1f00237aca ("lockdep,trace: Expose tracepoints") reverses tracepoints for lock_contended() and lock_acquired(), thus the ftrace log shows the wrong locking sequence that "acquired" event is prior to "contended" event: <idle>-0 [001] d.s3 20803.501685: lock_acquire: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501686: lock_acquired: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501689: lock_contended: 0000000008b91ab4 &sg_policy->update_lock <idle>-0 [001] d.s3 20803.501690: lock_release: 0000000008b91ab4 &sg_policy->update_lock This patch fixes calling tracepoints for lock_contended() and lock_acquired(). Fixes: eb1f00237aca ("lockdep,trace: Expose tracepoints") Signed-off-by: Leo Yan <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2021-05-17genirq: Add a IRQF_NO_DEBUG flagThomas Gleixner4-2/+19
The whole call to note_interrupt() can be avoided or return early when the interrupt is marked accordingly. For IPI handlers which always return HANDLED the whole procedure is pretty pointless to begin with. Add an IRQF_NO_DEBUG flag and mark the interrupt accordingly if the flag is supplied when the interrupt is requested. When noirqdebug is set on the kernel commandline, then the interrupt is marked unconditionally so that there is only one condition in the hotpath to evaluate. [ clg: Add changelog ] Signed-off-by: Thomas Gleixner <[email protected]> Signed-off-by: Cédric Le Goater <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected]
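For example, a handler that always makes progress could opt out of the spurious-interrupt bookkeeping (the driver name and handler below are hypothetical):

  #include <linux/interrupt.h>

  static irqreturn_t my_ipi_handler(int irq, void *dev_id)
  {
      /* ... always does useful work ... */
      return IRQ_HANDLED;
  }

  static int my_request(unsigned int irq, void *dev_id)
  {
      /* IRQF_NO_DEBUG: skip note_interrupt() accounting for this line. */
      return request_irq(irq, my_ipi_handler, IRQF_NO_DEBUG,
                         "my-ipi", dev_id);
  }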
2021-05-17kdb: Switch to use %ptTsAndy Shevchenko1-8/+1
Use %ptTs instead of open-coded variant to print contents of time64_t type in human readable form. Cc: Jason Wessel <[email protected]> Cc: Daniel Thompson <[email protected]> Cc: [email protected] Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Reviewed-by: Douglas Anderson <[email protected]> Reviewed-by: Daniel Thompson <[email protected]> Acked-by: Daniel Thompson <[email protected]> Signed-off-by: Petr Mladek <[email protected]> Link: https://lore.kernel.org/r/[email protected]
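The conversion pattern, roughly (the surrounding function is hypothetical):

  #include <linux/printk.h>
  #include <linux/time64.h>

  static void kdb_print_time(time64_t t)
  {
      /* %ptTs prints a time64_t as a human-readable, space-separated
       * date and time, instead of open-coding the conversion. */
      pr_info("system time: %ptTs\n", &t);
  }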
2021-05-17module: check for exit sections in layout_sections() instead of ↵Jessica Yu1-6/+11
module_init_section() Previously, when CONFIG_MODULE_UNLOAD=n, the module loader just does not attempt to load exit sections since it never expects that any code in those sections will ever execute. However, dynamic code patching (alternatives, jump_label and static_call) can have sites in __exit code, even if __exit is never executed. Therefore __exit must be present at runtime, at least for as long as __init code is. Commit 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD") solves the requirements of jump_labels and static_calls by putting the exit sections in the init region of the module so that they are at least present at init, and discarded afterwards. It does this by including a check for exit sections in module_init_section(), so that it also returns true for exit sections, and the module loader will automatically sort them in the init region of the module. However, the solution there was not completely arch-independent. ARM is a special case where it supplies its own module_{init, exit}_section() functions. Instead of pushing the exit section checks into module_init_section(), just implement the exit section check in layout_sections(), so that we don't have to touch arch-dependent code. Fixes: 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD") Reviewed-by: Russell King (Oracle) <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
2021-05-16Merge tag 'timers-urgent-2021-05-16' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Two fixes for timers: - Use the ALARM feature check in the alarmtimer core code instead of the old method of checking for the set_alarm() callback. Drivers can have that callback set but the feature bit cleared. If such an RTC device is selected then alarms won't work. - Use a proper define to let the preprocessor check whether Hyper-V VDSO clocksource should be active. The code used a constant in an enum with #ifdef, which evaluates to always false and disabled the clocksource for VDSO" * tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86 alarmtimer: Check RTC features instead of ops
2021-05-15Merge tag 'sched-urgent-2021-05-15' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Fix an idle CPU selection bug, and an AMD Ryzen maximum frequency enumeration bug" * tag 'sched-urgent-2021-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()
2021-05-15Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: resource, squashfs, hfsplus, modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan, pagemap, and ioremap)" * emailed patches from Andrew Morton <[email protected]>: mm/ioremap: fix iomap_max_page_shift docs: admin-guide: update description for kernel.modprobe sysctl hfsplus: prevent corruption in shrinking truncate mm/filemap: fix readahead return types kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled mm: fix struct page layout on 32-bit systems ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()" userfaultfd: release page in error path to avoid BUG_ON squashfs: fix divide error in calculate_skip() kernel/resource: fix return code check in __request_free_mem_region mm, slub: move slub_debug static key enabling outside slab_mutex mm/hugetlb: fix cow where page writtable in child mm/hugetlb: fix F_SEAL_FUTURE_WRITE
2021-05-14kernel/resource: fix return code check in __request_free_mem_regionAlistair Popple1-1/+1
Splitting an earlier version of a patch that allowed calling __request_region() while holding the resource lock into a series of patches required changing the return code for the newly introduced __request_region_locked(). Unfortunately this change was not carried through to a subsequent commit 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region") in the series. This resulted in a use-after-free due to freeing the struct resource without properly releasing it. Fix this by correcting the return code check so that the struct is not freed if the request to add it was successful. Link: https://lkml.kernel.org/r/[email protected] Fixes: 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region") Signed-off-by: Alistair Popple <[email protected]> Reported-by: kernel test robot <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Balbir Singh <[email protected]> Cc: Dan Williams <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Jerome Glisse <[email protected]> Cc: John Hubbard <[email protected]> Cc: Muchun Song <[email protected]> Cc: Oliver Sang <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
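The general shape of the bug and fix, as a hedged sketch (the helper name and signature below are illustrative, not the actual kernel/resource.c code):

  /* Sketch: after the refactor the helper returns 0 on success and
   * -errno on failure, so only free the resource when registration
   * actually failed. */
  static struct resource *request_free_region_sketch(struct resource *res)
  {
      int err;

      err = request_region_locked(&iomem_resource, res); /* hypothetical */
      if (err) {
          free_resource(res);   /* never registered: freeing is safe */
          return ERR_PTR(err);
      }
      return res;               /* registered: must not be freed here */
  }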
2021-05-14Merge tag 'trace-v5.13-rc1' of ↵Linus Torvalds1-4/+27
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fix from Steven Rostedt: "Fix trace_check_vprintf() for %.*s The sanity check of all strings being read from the ring buffer to make sure they are in safe memory space did not account for the %.*s notation having another parameter to process (the length). Add that to the check" * tag 'trace-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Handle %.*s in trace_check_vprintf()
2021-05-14kernel/module: Use BUG_ON instead of if condition followed by BUGzhouchuangao1-2/+1
Fix the following coccinelle report: kernel/module.c:1018:2-5: WARNING: Use BUG_ON instead of if condition followed by BUG. BUG_ON uses unlikely in if(). Through disassembly, we can see that brk #0x800 is compiled to the end of the function. As you can see below: ...... ffffff8008660bec: d65f03c0 ret ffffff8008660bf0: d4210000 brk #0x800 Usually, the condition in if () is not satisfied. For the multi-stage pipeline, we do not need to perform fetch, decode, and execute operations on the brk instruction. In my opinion, this can improve the efficiency of the multi-stage pipeline. Signed-off-by: zhouchuangao <[email protected]> Signed-off-by: Jessica Yu <[email protected]>
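The shape of the conversion (illustrative only; the actual condition in kernel/module.c is not shown here):

  #include <linux/types.h>
  #include <linux/bug.h>

  static void check_module_state(bool bad_state)   /* hypothetical wrapper */
  {
      /* Before:
       *     if (bad_state)
       *         BUG();
       *
       * After: BUG_ON() wraps the condition in unlikely(), letting the
       * compiler move the trap (e.g. brk #0x800 on arm64) out of the
       * hot path.
       */
      BUG_ON(bad_state);
  }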
2021-05-13tracing: Handle %.*s in trace_check_vprintf()Steven Rostedt (VMware)1-4/+27
If a trace event uses the %.*s notation, trace_check_vprintf() will fail and will warn about bad processing of strings, because it does not take into account the length field when processing the star (*) part. Have it handle this case as well. Link: https://lore.kernel.org/linux-nfs/[email protected]/ Reported-by: Chuck Lever III <[email protected]> Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events") Signed-off-by: Steven Rostedt (VMware) <[email protected]>
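The gist of the argument handling, as a plain-C sketch (not the tracing code itself): for a "%.*s" conversion the precision is passed as an extra int argument before the char pointer, and a verifier that forgets this reads the wrong va_arg slot.

  #include <stdarg.h>
  #include <stdio.h>

  static void check_star_string(const char *fmt, ...)
  {
      va_list ap;
      int len;
      const char *str;

      va_start(ap, fmt);
      len = va_arg(ap, int);            /* the '*' length comes first */
      str = va_arg(ap, const char *);   /* then the string pointer    */
      printf("string argument (%d bytes): %.*s\n", len, len, str);
      va_end(ap);
  }

  /* Usage: check_star_string("%.*s", 5, "hello world"); prints "hello". */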
2021-05-13rcu: Add missing __releases() annotationJules Irenge1-0/+1
Sparse reports a warning at rcu_print_task_stall(): "warning: context imbalance in rcu_print_task_stall - unexpected unlock" The root cause is a missing annotation on rcu_print_task_stall(). This commit therefore adds the missing __releases(rnp->lock) annotation. Signed-off-by: Jules Irenge <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
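The general shape of such an annotation (a sketch; the real function releases rnp->lock in kernel/rcu/tree_stall.h):

  #include <linux/spinlock.h>

  /* __releases() tells sparse this function returns with the lock
   * dropped, avoiding the "context imbalance" warning. */
  static void print_stall_and_unlock(raw_spinlock_t *lock)
      __releases(lock)
  {
      /* ... snapshot the stalled tasks while holding the lock ... */
      raw_spin_unlock(lock);
      /* ... print the report without the lock held ... */
  }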
2021-05-13rcu: Improve comments describing RCU read-side critical sectionsPaul E. McKenney1-10/+14
There are a number of places that call out the fact that preempt-disable regions of code now act as RCU read-side critical sections, where preempt-disable regions of code include irq-disable regions of code, bh-disable regions of code, hardirq handlers, and NMI handlers. However, someone relying solely on (for example) the call_rcu() header comment might well have no idea that preempt-disable regions of code have RCU semantics. This commit therefore updates the header comments for call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and rcu_dereference_sched_check() to call out these new(ish) forms of RCU readers. Reported-by: Michel Lespinasse <[email protected]> [ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ] Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-13tick/nohz: Call tick_nohz_task_switch() with interrupts disabledPeter Zijlstra2-7/+2
Call tick_nohz_task_switch() slightly earlier after the context switch to benefit from disabled IRQs. This way the function doesn't need to disable them once more. Signed-off-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Kick only _queued_ task whose tick dependency is updatedMarcelo Tosatti2-2/+22
When the tick dependency of a task is updated, we want it to acknowledge the new state and restart the tick if needed. If the task is not running, we don't need to kick it because it will observe the new dependency upon scheduling in. But if the task is running, we may need to send an IPI to it so that it gets notified. Unfortunately we don't have the means to check if a task is running in a race-free way. Checking p->on_cpu in a synchronized way against p->tick_dep_mask would imply adding a full barrier between prepare_task_switch() and tick_nohz_task_switch(), which we want to avoid in this fast-path. Therefore we blindly fire an IPI to the task's CPU. Meanwhile we can check if the task is queued on the CPU rq because p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its full barrier that precedes tick_nohz_task_switch(). And if the task is queued on a nohz_full CPU, it also has fair chances to be running as the isolation constraints prescribe running single tasks on full dynticks CPUs. So use this as a trick to check if we can spare an IPI toward a non-running task. NOTE: For the ordering to be correct, it is assumed that we never deactivate a task while it is running, the only exception being the task deactivating itself while scheduling out. Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Change signal tick dependency to wake up CPUs of member tasksMarcelo Tosatti2-4/+15
Rather than waking up all nohz_full CPUs on the system, only wake up the target CPUs of member threads of the signal. Reduces interruptions to nohz_full CPUs. Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Only wake up a single target cpu when kicking a taskFrederic Weisbecker1-13/+27
When adding a tick dependency to a task, it's necessary to wake up the CPU where the task resides to reevaluate tick dependencies on that CPU. However the current code wakes up all nohz_full CPUs, which is unnecessary. Switch to waking up a single CPU, by using ordering of writes to task->cpu and task->tick_dep_mask. [ mingo: Minor readability edit. ] Suggested-by: Peter Zijlstra <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Update nohz_full Kconfig helpFrederic Weisbecker1-5/+6
CONFIG_NO_HZ_FULL behaves just like CONFIG_NO_HZ_IDLE by default. Reassure distros about it. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Update idle_exittime on actual idle exitYunfeng Ye1-6/+8
The idle_exittime field of tick_sched is used to record the time when the idle state was left, but currently the idle_exittime is updated in the function tick_nohz_restart_sched_tick(), which is not always called from the idle state when nohz_full is configured: tick_irq_exit tick_nohz_irq_exit tick_nohz_full_update_tick tick_nohz_restart_sched_tick ts->idle_exittime = now; It's thus overwritten by mistake on nohz_full tick restart. Move the update to the appropriate idle exit path instead. Signed-off-by: Yunfeng Ye <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Remove superflous check for CONFIG_VIRT_CPU_ACCOUNTING_NATIVEFrederic Weisbecker1-2/+0
The vtime_accounting_enabled_this_cpu() early check already makes what follows dead code in the case of CONFIG_VIRT_CPU_ACCOUNTING_NATIVE. No need to keep the ifdeffery around. Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13tick/nohz: Conditionally restart tick on idle exitYunfeng Ye1-15/+27
In nohz_full mode, switching from idle to a task will unconditionally issue a tick restart. If the task is alone in the runqueue or is the highest priority, the tick will fire once then eventually stop. But that alone is still undesired noise. Therefore, only restart the tick on idle exit when it's strictly necessary. Signed-off-by: Yunfeng Ye <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-13sched/isolation: Reconcile rcu_nocbs= and nohz_full=Paul Gortmaker1-3/+1
We have a mismatch between RCU and isolation -- in relation to what is considered the maximum valid CPU number. This matters because nohz_full= and rcu_nocbs= are joined at the hip; in fact the former will enforce the latter. So we don't want a CPU mask to be valid for one and denied for the other. The difference first appeared in v4.15; further details are below. As it is confusing to anyone who isn't looking at the code regularly, a reminder is in order; three values exist here: CONFIG_NR_CPUS - compiled in maximum cap on number of CPUs supported. nr_cpu_ids - possible # of CPUs (typically reflects what ACPI says) cpus_present - actual number of present/detected/installed CPUs. For this example, I'll refer to NR_CPUS=64 from "make defconfig" and nr_cpu_ids=6 for ACPI reporting on a board that could run a six-core, and present=4 for a quad that is physically in the socket. From dmesg: smpboot: Allowing 6 CPUs, 2 hotplug CPUs setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:6 nr_node_ids:1 rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6. smp: Brought up 1 node, 4 CPUs And from userspace, see: paul@trash:/sys/devices/system/cpu$ cat present 0-3 paul@trash:/sys/devices/system/cpu$ cat possible 0-5 paul@trash:/sys/devices/system/cpu$ cat kernel_max 63 Everything is fine if we boot 5x5 for rcu/nohz: Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-5 rcu_nocbs=2-5 root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. ..even though there is no CPU 4 or 5. Both RCU and nohz_full are OK. Now we push that > 6 but less than NR_CPUS and with 15x15 we get: Command line: BOOT_IMAGE=/boot/bzImage rcu_nocbs=2-15 nohz_full=2-15 root=/dev/sda1 ro rcu: Note: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs. rcu: Offload RCU callbacks from CPUs: 2-5. These are both functionally equivalent, as we are only changing flags on phantom CPUs that don't exist, but note the kernel interpretation changes. And worse, it only changes for one of the two - which is the problem. RCU doesn't care if you want to restrict the flags on phantom CPUs but clearly nohz_full does after this change from v4.15. edb9382175c3: ("sched/isolation: Move isolcpus= handling to the housekeeping code") - if (cpulist_parse(str, non_housekeeping_mask) < 0) { - pr_warn("Housekeeping: Incorrect nohz_full cpumask\n"); + err = cpulist_parse(str, non_housekeeping_mask); + if (err < 0 || cpumask_last(non_housekeeping_mask) >= nr_cpu_ids) { + pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n"); To be clear, the sanity check on "possible" (nr_cpu_ids) is new here. The goal was reasonable; not wanting housekeeping to land on a not-possible CPU, but note two things: 1) this is an exclusion list, not an inclusion list; we are tracking non_housekeeping CPUs, not ones who are explicitly assigned housekeeping 2) we went one further in 9219565aa890 ("sched/isolation: Require a present CPU in housekeeping mask") - ensuring that housekeeping was sanity checking against present and not just possible CPUs. To be clear, this means the check added in v4.15 is doubly redundant. And more importantly, overly strict/restrictive. We care now, because the bitmap boot arg parsing now knows that a value of "N" is NR_CPUS; the size of the bitmap, but the bitmap code doesn't know anything about the subtleties of our max/possible/present CPU specifics as outlined above.
So drop the check added in v4.15 (edb9382175c3) and make RCU and nohz_full both in alignment again on NR_CPUS so "N" works for both, and then they can fall back to nr_cpu_ids internally just as before. Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-N rcu_nocbs=2-N root=/dev/sda1 ro NO_HZ: Full dynticks CPUs: 2-5. rcu: Offload RCU callbacks from CPUs: 2-5. As shown above, with this change, RCU and nohz_full are in sync, even with the use of the "N" placeholder. Same result is achieved with "15". Signed-off-by: Paul Gortmaker <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make multiple runqueue task counters 32-bitAlexey Dobriyan2-7/+7
Make: struct dl_rq::dl_nr_migratory struct dl_rq::dl_nr_running struct rt_rq::rt_nr_boosted struct rt_rq::rt_nr_migratory struct rt_rq::rt_nr_total struct rq::nr_uninterruptible 32-bit. If total number of tasks can't exceed 2**32 (and less due to futex pid limits), then per-runqueue counters can't as well. This patchset has been sponsored by REX Prefix Eradication Society. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_iowait_cpu() return 32-bit valueAlexey Dobriyan1-1/+1
Runqueue ->nr_iowait counters are 32-bit anyway. Propagate 32-bitness into other code, but don't try too hard. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_iowait() return 32-bit valueAlexey Dobriyan1-2/+2
Creating 2**32 tasks to wait in D-state is impossible and wasteful. Return "unsigned int" and save on REX prefixes. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12sched: Make nr_running() return 32-bit valueAlexey Dobriyan1-2/+2
Creating 2**32 tasks is impossible due to futex pid limits and wasteful anyway. Nobody has done it. Bring nr_running() into 32-bit world to save on REX prefixes. Signed-off-by: Alexey Dobriyan <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-05-12srcu: Early test SRCU polling startFrederic Weisbecker1-1/+5
Place an early call to start_poll_synchronize_srcu() before the invocation of call_srcu() on the same srcu_struct structure. After the later call to srcu_barrier(), the completion of the first grace period should be visible to a subsequent invocation of poll_state_synchronize_srcu(), and if not, warn. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Boqun Feng <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Uladzislau Rezki <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
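The APIs being exercised, roughly as described above (a sketch; the real early-boot self-test differs in detail):

  #include <linux/srcu.h>

  DEFINE_STATIC_SRCU(test_srcu);
  static struct rcu_head test_rhp;

  static void test_cb(struct rcu_head *rhp) { }

  static void early_srcu_poll_test(void)
  {
      unsigned long cookie;

      cookie = start_poll_synchronize_srcu(&test_srcu);
      call_srcu(&test_srcu, &test_rhp, test_cb);
      srcu_barrier(&test_srcu);
      /* The grace period begun above must have completed by now. */
      WARN_ON_ONCE(!poll_state_synchronize_srcu(&test_srcu, cookie));
  }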
2021-05-12rcu: Fix various typos in commentsIngo Molnar6-13/+13
Fix ~12 single-word typos in RCU code comments. [ paulmck: Apply feedback from Randy Dunlap. ] Reviewed-by: Randy Dunlap <[email protected]> Signed-off-by: Ingo Molnar <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Unify timersFrederic Weisbecker2-56/+42
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar, this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level is introduced. As a result, timers perform all kinds of deferred wake ups but other deferred wakeup callsites only handle non-bypass wakeups in order not to wake up rcuo too early. The timer also unconditionally executes a full barrier so as to order timer_pending() and callback enqueue although the path performing RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also test against the rdp leader instead of the current rdp. This unconditional full barrier shouldn't bring visible overhead since these timers almost never fire. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Prepare for fine-grained deferred wakeupFrederic Weisbecker3-10/+11
Tuning the deferred wakeup level must be done from a safe wakeup point. Currently those sites are: * ->nocb_timer * user/idle/guest entry * CPU down * softirq/rcuc All of these sites perform the wake up for both RCU_NOCB_WAKE and RCU_NOCB_WAKE_FORCE. In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until a timer fires so that we don't wake up the NOCB-gp kthread too early. To prepare for that, this commit specifies the per-callsite wakeup level/limit. Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> [ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ] Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Only cancel nocb timer if not pollingFrederic Weisbecker1-7/+7
This commit refrains from deleting the ->nocb_timer if rcu_nocb is polling because it should not ever have been queued in the polling case. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Delete bypass_timer upon nocb_gp wakeupFrederic Weisbecker1-0/+2
A NOCB-gp wakeup can safely delete the ->nocb_bypass_timer because nocb_gp_wait() will recheck the bypass state and rearm the bypass timer if necessary. This commit therefore deletes this timer. Reviewed-by: Boqun Feng <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Cancel nocb_timer upon nocb_gp wakeupFrederic Weisbecker1-0/+4
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer around because this function will traverse the whole rdp list. Any update performed before the timer was armed will now be visible after the ->nocb_gp_lock acquire. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Allow de-offloading rdp leaderFrederic Weisbecker1-4/+0
The only thing that prevented an rdp leader from being de-offloaded was the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader. If an rdp gets de-offloaded, it will subtly ignore rcu_nocb_lock() calls and do its job in the timer unsafely. Worse yet: If it gets re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to unlock, leaving it imbalanced. Now that the nocb_bypass_timer doesn't use the nocb_lock anymore, de-offloading the rdp leader is now safe. This commit therefore allows the rdp leader to be de-offloaded. Reported-by: Paul E. McKenney <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12rcu/nocb: Directly call __wake_nocb_gp() from bypass timerFrederic Weisbecker1-2/+3
The bypass timer calls __call_rcu_nocb_wake() instead of directly calling __wake_nocb_gp(). The only difference here is that rdp->qlen_last_fqs_check gets overridden. But resetting the deferred force quiescent state base shouldn't be relevant for that timer. In fact the bypass queue in question can be for any rdp from the group and not necessarily the rdp leader on which the bypass timer is attached. This commit therefore calls __wake_nocb_gp() directly. This way we don't even need to lock the ->nocb_lock. Signed-off-by: Frederic Weisbecker <[email protected]> Cc: Josh Triplett <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Neeraj Upadhyay <[email protected]> Cc: Boqun Feng <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]>
2021-05-12locking: Fix comment typosIngo Molnar1-6/+6
A few snuck through. Signed-off-by: Ingo Molnar <[email protected]>
2021-05-12sched: Fix leftover comment typosIngo Molnar2-5/+5
A few more snuck in. Also capitalize 'CPU' while at it. Signed-off-by: Ingo Molnar <[email protected]>