Age | Commit message (Collapse) | Author | Files | Lines |
|
Now that we have separate kcsan_report_*() functions, we can factor the
distinct logic for each of the report cases out of kcsan_report(). While
this means each case has to handle mutual exclusion independently, this
minimizes the conditionality of code and makes it easier to read, and
will permit passing distinct bits of information to print_report() in
future.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
[ [email protected]: retain comment about lockdep_off() ]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
In subsequent patches we'll want to split kcsan_report() into distinct
handlers for each report type. The largest bit of common work is
initializing the `access_info`, so let's factor this out into a helper,
and have the kcsan_report_*() functions pass the `aaccess_info` as a
parameter to kcsan_report().
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
So that we can add more callers of print_report(), lets fold the panic()
call into print_report() so the caller doesn't have to handle this
explicitly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The `watchpoint_idx` argument to kcsan_report() isn't meaningful for
races which were not detected by a watchpoint, and it would be clearer
if callers passed the other_info directly so that a NULL value can be
passed in this case.
Given that callers manipulate their watchpoints before passing the index
into kcsan_report_*(), and given we index the `other_infos` array using
this before we sanity-check it, the subsequent sanity check isn't all
that useful.
Let's remove the `watchpoint_idx` sanity check, and move the job of
finding the `other_info` out of kcsan_report().
Other than the removal of the check, there should be no functional
change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Currently kcsan_report() is used to handle three distinct cases:
* The caller hit a watchpoint when attempting an access. Some
information regarding the caller and access are recorded, but no
output is produced.
* A caller which previously setup a watchpoint detected that the
watchpoint has been hit, and possibly detected a change to the
location in memory being watched. This may result in output reporting
the interaction between this caller and the caller which hit the
watchpoint.
* A caller detected a change to a modification to a memory location
which wasn't detected by a watchpoint, for which there is no
information on the other thread. This may result in output reporting
the unexpected change.
... depending on the specific case the caller has distinct pieces of
information available, but the prototype of kcsan_report() has to handle
all three cases. This means that in some cases we pass redundant
information, and in others we don't pass all the information we could
pass. This also means that the report code has to demux these three
cases.
So that we can pass some additional information while also simplifying
the callers and report code, add separate kcsan_report_*() functions for
the distinct cases, updating callers accordingly. As the watchpoint_idx
is unused in the case of kcsan_report_unknown_origin(), this passes a
dummy value into kcsan_report(). Subsequent patches will refactor the
report code to avoid this.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
[ [email protected]: try to make kcsan_report_*() names more descriptive ]
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
In kcsan_setup_watchpoint() we store snapshots of a watched value into a
union of u8/u16/u32/u64 sized fields, modify this in place using a
consistent field, then later check for any changes via the u64 field.
We can achieve the safe effect more simply by always treating the field
as a u64, as smaller values will be zero-extended. As the values are
zero-extended, we don't need to truncate the access_mask when we apply
it, and can always apply the full 64-bit access_mask to the 64-bit
value.
Finally, we can store the two snapshots and calculated difference
separately, which makes the code a little easier to read, and will
permit reporting the old/new values in subsequent patches.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <[email protected]>
Signed-off-by: Marco Elver <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
clang with CONFIG_LTO_CLANG points out that an initcall function should
return an 'int' due to the changes made to the initcall macros in commit
3578ad11f3fb ("init: lto: fix PREL32 relocations"):
kernel/kcsan/debugfs.c:274:15: error: returning 'void' from a function with incompatible result type 'int'
late_initcall(kcsan_debugfs_init);
~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
include/linux/init.h:292:46: note: expanded from macro 'late_initcall'
#define late_initcall(fn) __define_initcall(fn, 7)
Fixes: e36299efe7d7 ("kcsan, debugfs: Move debugfs file creation out of early init")
Cc: stable <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Marco Elver <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Reviewed-by: Miguel Ojeda <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
'fixes.2021.05.13a', 'kvfree_rcu.2021.05.10c', 'mmdumpobj.2021.05.10c', 'nocb.2021.05.12a', 'srcu.2021.05.12a', 'tasks.2021.05.18a' and 'torture.2021.05.10c' into HEAD
bitmaprange.2021.05.10c: Allow "all" for bitmap ranges.
doc.2021.05.10c: Documentation updates.
fixes.2021.05.13a: Miscellaneous fixes.
kvfree_rcu.2021.05.10c: kvfree_rcu() updates.
mmdumpobj.2021.05.10c: mem_dump_obj() updates.
nocb.2021.05.12a: RCU NOCB CPU updates, including limited deoffloading.
srcu.2021.05.12a: SRCU updates.
tasks.2021.05.18a: Tasks-RCU updates.
torture.2021.05.10c: Torture-test updates.
|
|
In some architectures, the no-op variant of show_rcu_tasks_gp_kthreads()
get "no previous prototype" compiler warnings. These are false positives
given that kernel/rcu/tasks.h is included only once. But why put up
with the compiler noise?
This commit therefore adds "static inline" to this definition to force
the compiler to accept this situation, while also moving it to its proper
place in kernel/rcu/rcu.h.
Reported-by: kernel test robot <[email protected]>
[ paulmck: Update per Stephen Rothwell feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Heavy networking load can cause a CPU to execute continuously and
indefinitely within ksoftirqd, in which case there will be no voluntary
task switches and thus no RCU-tasks quiescent states. This commit
therefore causes the exiting rcu_softirq_qs() to provide an RCU-tasks
quiescent state.
This of course means that __do_softirq() and its callers cannot be
invoked from within a tracing trampoline.
Reported-by: Toke Høiland-Jørgensen <[email protected]>
Tested-by: Toke Høiland-Jørgensen <[email protected]>
Reviewed-by: Masami Hiramatsu <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
|
|
For all intents and purposes, the idle task is a per-CPU kthread. It isn't
created via the same route as other pcpu kthreads however, and as a result
it is missing a few bells and whistles: it fails kthread_is_per_cpu() and
it doesn't have PF_NO_SETAFFINITY set.
Fix the former by giving the idle task a kthread struct along with the
KTHREAD_IS_PER_CPU flag. This requires some extra iffery as init_idle()
call be called more than once on the same idle task.
Signed-off-by: Valentin Schneider <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
There's no point doing delta==0 updates.
Suggested-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
|
|
When a interruptible mutex locker is interrupted by a signal
without acquiring this lock and removed from the wait queue.
if the mutex isn't contended enough to have a waiter
put into the wait queue again, the setting of the WAITER
bit will force mutex locker to go into the slowpath to
acquire the lock every time, so if the wait queue is empty,
the WAITER bit need to be clear.
Fixes: 040a0a371005 ("mutex: Add support for wound/wait style locks")
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The commit eb1f00237aca ("lockdep,trace: Expose tracepoints") reverses
tracepoints for lock_contended() and lock_acquired(), thus the ftrace
log shows the wrong locking sequence that "acquired" event is prior to
"contended" event:
<idle>-0 [001] d.s3 20803.501685: lock_acquire: 0000000008b91ab4 &sg_policy->update_lock
<idle>-0 [001] d.s3 20803.501686: lock_acquired: 0000000008b91ab4 &sg_policy->update_lock
<idle>-0 [001] d.s3 20803.501689: lock_contended: 0000000008b91ab4 &sg_policy->update_lock
<idle>-0 [001] d.s3 20803.501690: lock_release: 0000000008b91ab4 &sg_policy->update_lock
This patch fixes calling tracepoints for lock_contended() and
lock_acquired().
Fixes: eb1f00237aca ("lockdep,trace: Expose tracepoints")
Signed-off-by: Leo Yan <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
|
|
The whole call to note_interrupt() can be avoided or return early when
interrupts would be marked accordingly. For IPI handlers which always
return HANDLED the whole procedure is pretty pointless to begin with.
Add a IRQF_NO_DEBUG flag and mark the interrupt accordingly if supplied
when the interrupt is requested.
When noirqdebug is set on the kernel commandline, then the interrupt is
marked unconditionally so that there is only one condition in the hotpath
to evaluate.
[ clg: Add changelog ]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Cédric Le Goater <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Use %ptTs instead of open-coded variant to print contents
of time64_t type in human readable form.
Cc: Jason Wessel <[email protected]>
Cc: Daniel Thompson <[email protected]>
Cc: [email protected]
Signed-off-by: Andy Shevchenko <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Reviewed-by: Douglas Anderson <[email protected]>
Reviewed-by: Daniel Thompson <[email protected]>
Acked-by: Daniel Thompson <[email protected]>
Signed-off-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
module_init_section()
Previously, when CONFIG_MODULE_UNLOAD=n, the module loader just does not
attempt to load exit sections since it never expects that any code in those
sections will ever execute. However, dynamic code patching (alternatives,
jump_label and static_call) can have sites in __exit code, even if __exit is
never executed. Therefore __exit must be present at runtime, at least for as
long as __init code is.
Commit 33121347fb1c ("module: treat exit sections the same as init
sections when !CONFIG_MODULE_UNLOAD") solves the requirements of
jump_labels and static_calls by putting the exit sections in the init
region of the module so that they are at least present at init, and
discarded afterwards. It does this by including a check for exit
sections in module_init_section(), so that it also returns true for exit
sections, and the module loader will automatically sort them in the init
region of the module.
However, the solution there was not completely arch-independent. ARM is
a special case where it supplies its own module_{init, exit}_section()
functions. Instead of pushing the exit section checks into
module_init_section(), just implement the exit section check in
layout_sections(), so that we don't have to touch arch-dependent code.
Fixes: 33121347fb1c ("module: treat exit sections the same as init sections when !CONFIG_MODULE_UNLOAD")
Reviewed-by: Russell King (Oracle) <[email protected]>
Signed-off-by: Jessica Yu <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for timers:
- Use the ALARM feature check in the alarmtimer core code insted of
the old method of checking for the set_alarm() callback.
Drivers can have that callback set but the feature bit cleared. If
such a RTC device is selected then alarms wont work.
- Use a proper define to let the preprocessor check whether Hyper-V
VDSO clocksource should be active.
The code used a constant in an enum with #ifdef, which evaluates to
always false and disabled the clocksource for VDSO"
* tag 'timers-urgent-2021-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/hyper-v: Re-enable VDSO_CLOCKMODE_HVCLOCK on X86
alarmtimer: Check RTC features instead of ops
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Fix an idle CPU selection bug, and an AMD Ryzen maximum frequency
enumeration bug"
* tag 'sched-urgent-2021-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, sched: Fix the AMD CPPC maximum performance value on certain AMD Ryzen generations
sched/fair: Fix clearing of has_idle_cores flag in select_idle_cpu()
|
|
Merge misc fixes from Andrew Morton:
"13 patches.
Subsystems affected by this patch series: resource, squashfs, hfsplus,
modprobe, and mm (hugetlb, slub, userfaultfd, ksm, pagealloc, kasan,
pagemap, and ioremap)"
* emailed patches from Andrew Morton <[email protected]>:
mm/ioremap: fix iomap_max_page_shift
docs: admin-guide: update description for kernel.modprobe sysctl
hfsplus: prevent corruption in shrinking truncate
mm/filemap: fix readahead return types
kasan: fix unit tests with CONFIG_UBSAN_LOCAL_BOUNDS enabled
mm: fix struct page layout on 32-bit systems
ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"
userfaultfd: release page in error path to avoid BUG_ON
squashfs: fix divide error in calculate_skip()
kernel/resource: fix return code check in __request_free_mem_region
mm, slub: move slub_debug static key enabling outside slab_mutex
mm/hugetlb: fix cow where page writtable in child
mm/hugetlb: fix F_SEAL_FUTURE_WRITE
|
|
Splitting an earlier version of a patch that allowed calling
__request_region() while holding the resource lock into a series of
patches required changing the return code for the newly introduced
__request_region_locked().
Unfortunately this change was not carried through to a subsequent commit
56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region")
in the series. This resulted in a use-after-free due to freeing the
struct resource without properly releasing it. Fix this by correcting the
return code check so that the struct is not freed if the request to add it
was successful.
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 56fd94919b8b ("kernel/resource: fix locking in request_free_mem_region")
Signed-off-by: Alistair Popple <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jerome Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Oliver Sang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix trace_check_vprintf() for %.*s
The sanity check of all strings being read from the ring buffer to
make sure they are in safe memory space did not account for the %.*s
notation having another parameter to process (the length).
Add that to the check"
* tag 'trace-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Handle %.*s in trace_check_vprintf()
|
|
Fix the following coccinelle report:
kernel/module.c:1018:2-5:
WARNING: Use BUG_ON instead of if condition followed by BUG.
BUG_ON uses unlikely in if(). Through disassembly, we can see that
brk #0x800 is compiled to the end of the function.
As you can see below:
......
ffffff8008660bec: d65f03c0 ret
ffffff8008660bf0: d4210000 brk #0x800
Usually, the condition in if () is not satisfied. For the
multi-stage pipeline, we do not need to perform fetch decode
and excute operation on brk instruction.
In my opinion, this can improve the efficiency of the
multi-stage pipeline.
Signed-off-by: zhouchuangao <[email protected]>
Signed-off-by: Jessica Yu <[email protected]>
|
|
If a trace event uses the %*.s notation, the trace_check_vprintf() will
fail and will warn about a bad processing of strings, because it does not
take into account the length field when processing the star (*) part.
Have it handle this case as well.
Link: https://lore.kernel.org/linux-nfs/[email protected]/
Reported-by: Chuck Lever III <[email protected]>
Fixes: 9a6944fee68e2 ("tracing: Add a verifier to check string pointers for trace events")
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
|
|
Sparse reports a warning at rcu_print_task_stall():
"warning: context imbalance in rcu_print_task_stall - unexpected unlock"
The root cause is a missing annotation on rcu_print_task_stall().
This commit therefore adds the missing __releases(rnp->lock) annotation.
Signed-off-by: Jules Irenge <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
There are a number of places that call out the fact that preempt-disable
regions of code now act as RCU read-side critical sections, where
preempt-disable regions of code include irq-disable regions of code,
bh-disable regions of code, hardirq handlers, and NMI handlers. However,
someone relying solely on (for example) the call_rcu() header comment
might well have no idea that preempt-disable regions of code have RCU
semantics.
This commit therefore updates the header comments for
call_rcu(), synchronize_rcu(), rcu_dereference_bh_check(), and
rcu_dereference_sched_check() to call out these new(ish) forms of RCU
readers.
Reported-by: Michel Lespinasse <[email protected]>
[ paulmck: Apply Matthew Wilcox and Michel Lespinasse feedback. ]
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Call tick_nohz_task_switch() slightly earlier after the context switch
to benefit from disabled IRQs. This way the function doesn't need to
disable them once more.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When the tick dependency of a task is updated, we want it to aknowledge
the new state and restart the tick if needed. If the task is not
running, we don't need to kick it because it will observe the new
dependency upon scheduling in. But if the task is running, we may need
to send an IPI to it so that it gets notified.
Unfortunately we don't have the means to check if a task is running
in a race free way. Checking p->on_cpu in a synchronized way against
p->tick_dep_mask would imply adding a full barrier between
prepare_task_switch() and tick_nohz_task_switch(), which we want to
avoid in this fast-path.
Therefore we blindly fire an IPI to the task's CPU.
Meanwhile we can check if the task is queued on the CPU rq because
p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its
full barrier that precedes tick_nohz_task_switch(). And if the task is
queued on a nohz_full CPU, it also has fair chances to be running as the
isolation constraints prescribe running single tasks on full dynticks
CPUs.
So use this as a trick to check if we can spare an IPI toward a
non-running task.
NOTE: For the ordering to be correct, it is assumed that we never
deactivate a task while it is running, the only exception being the task
deactivating itself while scheduling out.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Rather than waking up all nohz_full CPUs on the system, only wake up
the target CPUs of member threads of the signal.
Reduces interruptions to nohz_full CPUs.
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
When adding a tick dependency to a task, its necessary to
wake up the CPU where the task resides to reevaluate tick
dependencies on that CPU.
However the current code wakes up all nohz_full CPUs, which
is unnecessary.
Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.
[ mingo: Minor readability edit. ]
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
CONFIG_NO_HZ_FULL behaves just like CONFIG_NO_HZ_IDLE by default.
Reassure distros about it.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The idle_exittime field of tick_sched is used to record the time when
the idle state was left. but currently the idle_exittime is updated in
the function tick_nohz_restart_sched_tick(), which is not always in idle
state when nohz_full is configured:
tick_irq_exit
tick_nohz_irq_exit
tick_nohz_full_update_tick
tick_nohz_restart_sched_tick
ts->idle_exittime = now;
It's thus overwritten by mistake on nohz_full tick restart. Move the
update to the appropriate idle exit path instead.
Signed-off-by: Yunfeng Ye <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
The vtime_accounting_enabled_this_cpu() early check already makes what
follows as dead code in the case of CONFIG_VIRT_CPU_ACCOUNTING_NATIVE.
No need to keep the ifdeferry around.
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
In nohz_full mode, switching from idle to a task will unconditionally
issue a tick restart. If the task is alone in the runqueue or is the
highest priority, the tick will fire once then eventually stop. But that
alone is still undesired noise.
Therefore, only restart the tick on idle exit when it's strictly
necessary.
Signed-off-by: Yunfeng Ye <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
We have a mismatch between RCU and isolation -- in relation to what is
considered the maximum valid CPU number.
This matters because nohz_full= and rcu_nocbs= are joined at the hip; in
fact the former will enforce the latter. So we don't want a CPU mask to
be valid for one and denied for the other.
The difference 1st appeared as of v4.15; further details are below.
As it is confusing to anyone who isn't looking at the code regularly, a
reminder is in order; three values exist here:
CONFIG_NR_CPUS - compiled in maximum cap on number of CPUs supported.
nr_cpu_ids - possible # of CPUs (typically reflects what ACPI says)
cpus_present - actual number of present/detected/installed CPUs.
For this example, I'll refer to NR_CPUS=64 from "make defconfig" and
nr_cpu_ids=6 for ACPI reporting on a board that could run a six core,
and present=4 for a quad that is physically in the socket. From dmesg:
smpboot: Allowing 6 CPUs, 2 hotplug CPUs
setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:6 nr_node_ids:1
rcu: RCU restricting CPUs from NR_CPUS=64 to nr_cpu_ids=6.
smp: Brought up 1 node, 4 CPUs
And from userspace, see:
paul@trash:/sys/devices/system/cpu$ cat present
0-3
paul@trash:/sys/devices/system/cpu$ cat possible
0-5
paul@trash:/sys/devices/system/cpu$ cat kernel_max
63
Everything is fine if we boot 5x5 for rcu/nohz:
Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-5 rcu_nocbs=2-5 root=/dev/sda1 ro
NO_HZ: Full dynticks CPUs: 2-5.
rcu: Offload RCU callbacks from CPUs: 2-5.
..even though there is no CPU 4 or 5. Both RCU and nohz_full are OK.
Now we push that > 6 but less than NR_CPU and with 15x15 we get:
Command line: BOOT_IMAGE=/boot/bzImage rcu_nocbs=2-15 nohz_full=2-15 root=/dev/sda1 ro
rcu: Note: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs.
rcu: Offload RCU callbacks from CPUs: 2-5.
These are both functionally equivalent, as we are only changing flags on
phantom CPUs that don't exist, but note the kernel interpretation changes.
And worse, it only changes for one of the two - which is the problem.
RCU doesn't care if you want to restrict the flags on phantom CPUs but
clearly nohz_full does after this change from v4.15.
edb9382175c3: ("sched/isolation: Move isolcpus= handling to the housekeeping code")
- if (cpulist_parse(str, non_housekeeping_mask) < 0) {
- pr_warn("Housekeeping: Incorrect nohz_full cpumask\n");
+ err = cpulist_parse(str, non_housekeeping_mask);
+ if (err < 0 || cpumask_last(non_housekeeping_mask) >= nr_cpu_ids) {
+ pr_warn("Housekeeping: nohz_full= or isolcpus= incorrect CPU range\n");
To be clear, the sanity check on "possible" (nr_cpu_ids) is new here.
The goal was reasonable ; not wanting housekeeping to land on a
not-possible CPU, but note two things:
1) this is an exclusion list, not an inclusion list; we are tracking
non_housekeeping CPUs; not ones who are explicitly assigned housekeeping
2) we went one further in 9219565aa890 ("sched/isolation: Require a present CPU in housekeeping mask")
- ensuring that housekeeping was sanity checking against present and not just possible CPUs.
To be clear, this means the check added in v4.15 is doubly redundant.
And more importantly, overly strict/restrictive.
We care now, because the bitmap boot arg parsing now knows that a value
of "N" is NR_CPUS; the size of the bitmap, but the bitmap code doesn't
know anything about the subtleties of our max/possible/present CPU
specifics as outlined above.
So drop the check added in v4.15 (edb9382175c3) and make RCU and
nohz_full both in alignment again on NR_CPUS so "N" works for both,
and then they can fall back to nr_cpu_ids internally just as before.
Command line: BOOT_IMAGE=/boot/bzImage nohz_full=2-N rcu_nocbs=2-N root=/dev/sda1 ro
NO_HZ: Full dynticks CPUs: 2-5.
rcu: Offload RCU callbacks from CPUs: 2-5.
As shown above, with this change, RCU and nohz_full are in sync, even
with the use of the "N" placeholder. Same result is achieved with "15".
Signed-off-by: Paul Gortmaker <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Make:
struct dl_rq::dl_nr_migratory
struct dl_rq::dl_nr_running
struct rt_rq::rt_nr_boosted
struct rt_rq::rt_nr_migratory
struct rt_rq::rt_nr_total
struct rq::nr_uninterruptible
32-bit.
If total number of tasks can't exceed 2**32 (and less due to futex pid
limits), then per-runqueue counters can't as well.
This patchset has been sponsored by REX Prefix Eradication Society.
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Runqueue ->nr_iowait counters are 32-bit anyway.
Propagate 32-bitness into other code, but don't try too hard.
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Creating 2**32 tasks to wait in D-state is impossible and wasteful.
Return "unsigned int" and save on REX prefixes.
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Creating 2**32 tasks is impossible due to futex pid limits and wasteful
anyway. Nobody has done it.
Bring nr_running() into 32-bit world to save on REX prefixes.
Signed-off-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
|
|
Place an early call to start_poll_synchronize_srcu() before the invocation
of call_srcu() on the same srcu_struct structure.
After the later call to srcu_barrier(), the completion of the
first grace period should be visible to a subsequent invocation of
poll_state_synchronize_srcu(), and if not, warn.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Fix ~12 single-word typos in RCU code comments.
[ paulmck: Apply feedback from Randy Dunlap. ]
Reviewed-by: Randy Dunlap <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Now that ->nocb_timer and ->nocb_bypass_timer have become quite similar,
this commit merges them together. A new RCU_NOCB_WAKE_BYPASS wake level
is introduced. As a result, timers perform all kinds of deferred wake
ups but other deferred wakeup callsites only handle non-bypass wakeups
in order not to wake up rcuo too early.
The timer also unconditionally executes a full barrier so as to order
timer_pending() and callback enqueue although the path performing
RCU_NOCB_WAKE_FORCE that makes use of it is debatable. It should also
test against the rdp leader instead of the current rdp.
This unconditional full barrier shouldn't bring visible overhead since
these timers almost never fire.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
Tuning the deferred wakeup level must be done from a safe wakeup
point. Currently those sites are:
* ->nocb_timer
* user/idle/guest entry
* CPU down
* softirq/rcuc
All of these sites perform the wake up for both RCU_NOCB_WAKE and
RCU_NOCB_WAKE_FORCE.
In order to merge ->nocb_timer and ->nocb_bypass_timer together, we plan
to add a new RCU_NOCB_WAKE_BYPASS that really should be deferred until
a timer fires so that we don't wake up the NOCB-gp kthread too early.
To prepare for that, this commit specifies the per-callsite wakeup
level/limit.
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
[ paulmck: Fix non-NOCB rcu_nocb_need_deferred_wakeup() definition. ]
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
This commit refrains deleting the ->nocb_timer if rcu_nocb is polling
because it should not ever have been queued in the polling case.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
A NOCB-gp wake p can safely delete the ->nocb_bypass_timer because
nocb_gp_wait() will recheck again the bypass state and rearm the bypass
timer if necessary. This commit therefore deletes this timer.
Reviewed-by: Boqun Feng <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
When waking up in nocb_gp_wait(), there is no need to keep the nocb_timer
around because this function will traverse the whole rdp list. Any
update performed before the timer was armed will now be visible after
the ->nocb_gp_lock acquire.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The only thing that prevented an rdp leader from being de-offloaded was
the nocb_bypass_timer that used to lock the nocb_lock of the rdp leader.
If an rdp gets de-offloaded, it will subtlely ignore rcu_nocb_lock()
calls and do its job in the timer unsafely. Worse yet: If it gets
re-offloaded in the middle of the timer, rcu_nocb_unlock() would try to
unlock, leaving it imbalanced.
Now that the nocb_bypass_timer doesn't use the nocb_lock anymore,
de-offloading the rdp leader is now safe. This commit therefore allows
the rdp leader to be de-offloaded.
Reported-by: Paul E. McKenney <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
The bypass timer calls __call_rcu_nocb_wake() instead of directly
calling __wake_nocb_gp(). The only difference here is that
rdp->qlen_last_fqs_check gets overridden. But resetting the deferred
force quiescent state base shouldn't be relevant for that timer. In fact
the bypass queue in question can be for any rdp from the group and not
necessarily the rdp leader on which the bypass timer is attached.
This commit therefore calls __wake_nocb_gp() directly. This way we
don't even need to lock the ->nocb_lock.
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Josh Triplett <[email protected]>
Cc: Lai Jiangshan <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Neeraj Upadhyay <[email protected]>
Cc: Boqun Feng <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
|
|
A few snuck through.
Signed-off-by: Ingo Molnar <[email protected]>
|
|
A few more snuck in. Also capitalize 'CPU' while at it.
Signed-off-by: Ingo Molnar <[email protected]>
|