aboutsummaryrefslogtreecommitdiff
path: root/kernel/rcu/tree.c
AgeCommit message (Collapse)AuthorFilesLines
2023-02-02Merge branch 'stall.2023.01.09a' into HEADPaul E. McKenney1-0/+18
stall.2023.01.09a: RCU CPU stall-warning updates.
2023-02-02Merge branches 'doc.2023.01.05a', 'fixes.2023.01.23a', 'kvfree.2023.01.03a', ↵Paul E. McKenney1-275/+364
'srcu.2023.01.03a', 'srcu-always.2023.02.02a', 'tasks.2023.01.03a', 'torture.2023.01.05a' and 'torturescript.2023.01.03a' into HEAD doc.2023.01.05a: Documentation update. fixes.2023.01.23a: Miscellaneous fixes. kvfree.2023.01.03a: kvfree_rcu() updates. srcu.2023.01.03a: SRCU updates. srcu-always.2023.02.02a: Finish making SRCU be unconditionally available. tasks.2023.01.03a: Tasks-RCU updates. torture.2023.01.05a: Torture-test updates. torturescript.2023.01.03a: Torture-test scripting updates.
2023-01-23rcu: Disable laziness if lazy-tracking says soJoel Fernandes (Google)1-1/+3
During suspend, we see failures to suspend 1 in 300-500 suspends. Looking closer, it appears that asynchronous RCU callbacks are being queued as lazy even though synchronous callbacks are expedited. These delays appear to not be very welcome by the suspend/resume code as evidenced by these occasional suspend failures. This commit modifies call_rcu() to check if rcu_async_should_hurry(), which will return true if we are in suspend or in-kernel boot. [ paulmck: Alphabetize local variables. ] Ignoring the lazy hint makes the 3000 suspend/resume cycles pass reliably on a 12th gen 12-core Intel CPU, and there is some evidence that it also slightly speeds up boot performance. Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-17rcu: Track laziness during boot and suspendJoel Fernandes (Google)1-0/+2
Boot and suspend/resume should not be slowed down in kernels built with CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such kernels. This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and rcu_async_should_hurry() functions that track whether or not either a boot or a suspend/resume operation is in progress. This will enable a later commit to refrain from laziness during those times. Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax() for later use by rcutorture. [ paulmck: Apply feedback from Steve Rostedt. ] Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-12rcu: Remove redundant call to rcu_boost_kthread_setaffinity()Zqiang1-5/+0
The rcu_boost_kthread_setaffinity() function is invoked at rcutree_online_cpu() and rcutree_offline_cpu() time, early in the online timeline and late in the offline timeline, respectively. It is also invoked from rcutree_dead_cpu(), however, in the absence of userspace manipulations (for which userspace must take responsibility), this call is redundant with that from rcutree_offline_cpu(). This redundancy can be demonstrated by printing out the relevant cpumasks This commit therefore removes the call to rcu_boost_kthread_setaffinity() from rcutree_dead_cpu(). Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2023-01-05rcu: Add RCU stall diagnosis informationZhen Lei1-0/+18
Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupts handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change: rcu: INFO: rcu_preempt self-detected stall on CPU rcu: 0-....: (1250 ticks this GP) <omitted> rcu: hardirqs softirqs csw/system rcu: number: 624 45 0 rcu: cputime: 69 1 2425 ==> 2500(ms) This output shows that the number of hard and soft interrupts is small, there are no context switches, and the system takes up a lot of time. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Split ready for reclaim objects from a batchUladzislau Rezki (Sony)1-33/+54
This patch splits the lists of objects so as to avoid sending any through RCU that have already been queued for more than one grace period. These long-term-resident objects are immediately freed. The remaining short-term-resident objects are queued for later freeing using queue_rcu_work(). This change avoids delaying workqueue handlers with synchronize_rcu() invocations. Yes, workqueue handlers are designed to handle blocking, but avoiding blocking when unnecessary improves performance during low-memory situations. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Carefully reset number of objects in krcpUladzislau Rezki (Sony)1-10/+30
The schedule_delayed_monitor_work() function relies on the count of objects queued into any given kfree_rcu_cpu structure. This count is used to determine how quickly to schedule passing these objects to RCU. There are three pipes where pointers can be placed. When any pipe is offloaded, the kfree_rcu_cpu structure's ->count counter is set to zero, which is wrong because the other pipes might still be non-empty. This commit therefore maintains per-pipe counters, and introduces a krc_count() helper to access the aggregate value of those counters. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Use READ_ONCE() when access to krcp->headUladzislau Rezki (Sony)1-2/+2
The need_offload_krc() function is now lock-free, which gives the compiler freedom to load old values from plain C-language loads from the kfree_rcu_cpu struture's ->head pointer. This commit therefore applied READ_ONCE() to these loads. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Use a polled API to speedup a reclaim processUladzislau Rezki (Sony)1-8/+39
Currently all objects placed into a batch wait for a full grace period to elapse after that batch is ready to send to RCU. However, this can unnecessarily delay freeing of the first objects that were added to the batch. After all, several RCU grace periods might have elapsed since those objects were added, and if so, there is no point in further deferring their freeing. This commit therefore adds per-page grace-period snapshots which are obtained from get_state_synchronize_rcu(). When the batch is ready to be passed to call_rcu(), each page's snapshot is checked by passing it to poll_state_synchronize_rcu(). If a given page's RCU grace period has already elapsed, its objects are freed immediately by kvfree_rcu_bulk(). Otherwise, these objects are freed after a call to synchronize_rcu(). This approach requires that the pages be traversed in reverse order, that is, the oldest ones first. Test example: kvm.sh --memory 10G --torture rcuscale --allcpus --duration 1 \ --kconfig CONFIG_NR_CPUS=64 \ --kconfig CONFIG_RCU_NOCB_CPU=y \ --kconfig CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y \ --kconfig CONFIG_RCU_LAZY=n \ --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 \ rcuscale.holdoff=20 rcuscale.kfree_loops=10000 \ torture.disable_onoff_at_boot" --trust-make Before this commit: Total time taken by all kfree'ers: 8535693700 ns, loops: 10000, batches: 1188, memory footprint: 2248MB Total time taken by all kfree'ers: 8466933582 ns, loops: 10000, batches: 1157, memory footprint: 2820MB Total time taken by all kfree'ers: 5375602446 ns, loops: 10000, batches: 1130, memory footprint: 6502MB Total time taken by all kfree'ers: 7523283832 ns, loops: 10000, batches: 1006, memory footprint: 3343MB Total time taken by all kfree'ers: 6459171956 ns, loops: 10000, batches: 1150, memory footprint: 6549MB After this commit: Total time taken by all kfree'ers: 8560060176 ns, loops: 10000, batches: 1787, memory footprint: 61MB Total time taken by all kfree'ers: 8573885501 ns, loops: 10000, batches: 1777, memory footprint: 93MB Total time taken by all kfree'ers: 8320000202 ns, loops: 10000, batches: 1727, memory footprint: 66MB Total time taken by all kfree'ers: 8552718794 ns, loops: 10000, batches: 1790, memory footprint: 75MB Total time taken by all kfree'ers: 8601368792 ns, loops: 10000, batches: 1724, memory footprint: 62MB The reduction in memory footprint is well in excess of an order of magnitude. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Move need_offload_krc() out of krcp->lockUladzislau Rezki (Sony)1-7/+4
The need_offload_krc() function currently holds the krcp->lock in order to safely check krcp->head. This commit removes the need for this lock in that function by updating the krcp->head pointer using WRITE_ONCE() macro so that readers can carry out lockless loads of that pointer. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Move bulk/list reclaim to separate functionsUladzislau Rezki (Sony)1-49/+65
The kvfree_rcu() code maintains lists of pages of pointers, but also a singly linked list, with the latter being used when memory allocation fails. Traversal of these two types of lists is currently open coded. This commit simplifies the code by providing kvfree_rcu_bulk() and kvfree_rcu_list() functions, respectively, to traverse these two types of lists. This patch does not introduce any functional change. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu/kvfree: Switch to a generic linked list APIUladzislau Rezki (Sony)1-46/+43
This commit improves the readability and maintainability of the kvfree_rcu() code by switching from an open-coded linked list to the standard Linux-kernel circular doubly linked list. This patch does not introduce any functional change. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Refactor kvfree_call_rcu() and high-level helpersUladzislau Rezki (Sony)1-17/+12
Currently a kvfree_call_rcu() takes an offset within a structure as a second parameter, so a helper such as a kvfree_rcu_arg_2() has to convert rcu_head and a freed ptr to an offset in order to pass it. That leads to an extra conversion on macro entry. Instead of converting, refactor the code in way that a pointer that has to be freed is passed directly to the kvfree_call_rcu(). This patch does not make any functional change and is transparent to all kvfree_rcu() users. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Test synchronous RCU grace periods at the end of rcu_init()Paul E. McKenney1-0/+2
This commit tests synchronize_rcu() and synchronize_rcu_expedited() at the end of rcu_init(), in addition to the test already at the beginning of that function. These tests are run only in kernels built with CONFIG_PROVE_RCU=y. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Make rcu_blocking_is_gp() stop early-boot might_sleep()Zqiang1-2/+3
Currently, rcu_blocking_is_gp() invokes might_sleep() even during early boot when interrupts are disabled and before the scheduler is scheduling. This is at best an accident waiting to happen. Therefore, this commit moves that might_sleep() under an rcu_scheduler_active check in order to ensure that might_sleep() is not invoked unless sleeping might actually happen. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Upgrade header comment for poll_state_synchronize_rcu()Paul E. McKenney1-1/+9
This commit emphasizes the possibility of concurrent calls to synchronize_rcu() and synchronize_rcu_expedited() causing one or the other of the two grace periods being lost from the viewpoint of poll_state_synchronize_rcu(). If you cannot afford to lose grace periods this way, you should instead use the _full() variants of the polled RCU API, for example, poll_state_synchronize_rcu_full(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Throttle callback invocation based on number of ready callbacksPaul E. McKenney1-1/+1
Currently, rcu_do_batch() sizes its batches based on the total number of callbacks in the callback list. This can result in some strange choices, for example, if there was 12,800 callbacks in the list, but only 200 were ready to invoke, RCU would invoke 100 at a time (12,800 shifted down by seven bits). A more measured approach would use the number that were actually ready to invoke, an approach that has become feasible only recently given the per-segment ->seglen counts in ->cblist. This commit therefore bases the batch limit on the number of callbacks ready to invoke instead of on the total number of callbacks. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Consolidate initialization and CPU-hotplug codePaul E. McKenney1-156/+158
This commit consolidates the initialization and CPU-hotplug code at the end of kernel/rcu/tree.c. This is strictly a code-motion commit. No functionality has changed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-12-21Merge tag 'rcu-urgent.2022.12.17a' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU fix from Paul McKenney: "This fixes a lockdep false positive in synchronize_rcu() that can otherwise occur during early boot. The fix simply avoids invoking lockdep if the scheduler has not yet been initialized, that is, during that portion of boot when interrupts are disabled" * tag 'rcu-urgent.2022.12.17a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: rcu: Don't assert interrupts enabled too early in boot
2022-12-17rcu: Don't assert interrupts enabled too early in bootPaul E. McKenney1-4/+6
The rcu_poll_gp_seq_end() and rcu_poll_gp_seq_end_unlocked() both check that interrupts are enabled, as they normally should be when waiting for an RCU grace period. Except that it is legal to wait for grace periods during early boot, before interrupts have been enabled for the first time, and polling for grace periods is required to work during this time. This can result in false-positive lockdep splats in the presence of boot-time-initiated tracing. This commit therefore conditions those interrupts-enabled checks on rcu_scheduler_active having advanced past RCU_SCHEDULER_INACTIVE, by which time interrupts have been enabled. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-12-12Merge tag 'rcu.2022.12.02a' of ↵Linus Torvalds1-56/+96
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Documentation updates. This is the second in a series from an ongoing review of the RCU documentation. - Miscellaneous fixes. - Introduce a default-off Kconfig option that depends on RCU_NOCB_CPU that, on CPUs mentioned in the nohz_full or rcu_nocbs boot-argument CPU lists, causes call_rcu() to introduce delays. These delays result in significant power savings on nearly idle Android and ChromeOS systems. These savings range from a few percent to more than ten percent. This series also includes several commits that change call_rcu() to a new call_rcu_hurry() function that avoids these delays in a few cases, for example, where timely wakeups are required. Several of these are outside of RCU and thus have acks and reviews from the relevant maintainers. - Create an srcu_read_lock_nmisafe() and an srcu_read_unlock_nmisafe() for architectures that support NMIs, but which do not provide NMI-safe this_cpu_inc(). These NMI-safe SRCU functions are required by the upcoming lockless printk() work by John Ogness et al. - Changes providing minor but important increases in torture test coverage for the new RCU polled-grace-period APIs. - Changes to torturescript that avoid redundant kernel builds, thus providing about a 30% speedup for the torture.sh acceptance test. * tag 'rcu.2022.12.02a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (49 commits) net: devinet: Reduce refcount before grace period net: Use call_rcu_hurry() for dst_release() workqueue: Make queue_rcu_work() use call_rcu_hurry() percpu-refcount: Use call_rcu_hurry() for atomic switch scsi/scsi_error: Use call_rcu_hurry() instead of call_rcu() rcu/rcutorture: Use call_rcu_hurry() where needed rcu/rcuscale: Use call_rcu_hurry() for async reader test rcu/sync: Use call_rcu_hurry() instead of call_rcu rcuscale: Add laziness and kfree tests rcu: Shrinker for lazy rcu rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() rcu: Make call_rcu() lazy to save power rcu: Implement lockdep_rcu_enabled for !CONFIG_DEBUG_LOCK_ALLOC srcu: Debug NMI safety even on archs that don't require it srcu: Explain the reason behind the read side critical section on GP start srcu: Warn when NMI-unsafe API is used in NMI arch/s390: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option arch/loongarch: Add ARCH_HAS_NMI_SAFE_THIS_CPU_OPS Kconfig option rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state() rcu-tasks: Make grace-period-age message human-readable ...
2022-11-30Merge branches 'doc.2022.10.20a', 'fixes.2022.10.21a', 'lazy.2022.11.30a', ↵Paul E. McKenney1-56/+96
'srcunmisafe.2022.11.09a', 'torture.2022.10.18c' and 'torturescript.2022.10.20a' into HEAD doc.2022.10.20a: Documentation updates. fixes.2022.10.21a: Miscellaneous fixes. lazy.2022.11.30a: Lazy call_rcu() and NOCB updates. srcunmisafe.2022.11.09a: NMI-safe SRCU readers. torture.2022.10.18c: Torture-test updates. torturescript.2022.10.20a: Torture-test scripting updates.
2022-11-29rcu: Make call_rcu() lazy to save powerJoel Fernandes (Google)1-46/+83
Implement timer-based RCU callback batching (also known as lazy callbacks). With this we save about 5-10% of power consumed due to RCU requests that happen when system is lightly loaded or idle. By default, all async callbacks (queued via call_rcu) are marked lazy. An alternate API call_rcu_hurry() is provided for the few users, for example synchronize_rcu(), that need the old behavior. The batch is flushed whenever a certain amount of time has passed, or the batch on a particular CPU grows too big. Also memory pressure will flush it in a future patch. To handle several corner cases automagically (such as rcu_barrier() and hotplug), we re-use bypass lists which were originally introduced to address lock contention, to handle lazy CBs as well. The bypass list length has the lazy CB length included in it. A separate lazy CB length counter is also introduced to keep track of the number of lazy CBs. [ paulmck: Fix formatting of inline call_rcu_lazy() definition. ] [ paulmck: Apply Zqiang feedback. ] [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Suggested-by: Paul McKenney <paulmck@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()Zqiang1-1/+1
Running rcutorture with non-zero fqs_duration module parameter in a kernel built with CONFIG_PREEMPTION=y results in the following splat: BUG: using __this_cpu_read() in preemptible [00000000] code: rcu_torture_fqs/398 caller is __this_cpu_preempt_check+0x13/0x20 CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+ Call Trace: <TASK> dump_stack_lvl+0x5b/0x86 dump_stack+0x10/0x16 check_preemption_disabled+0xe5/0xf0 __this_cpu_preempt_check+0x13/0x20 rcu_force_quiescent_state.part.0+0x1c/0x170 rcu_force_quiescent_state+0x1e/0x30 rcu_torture_fqs+0xca/0x160 ? rcu_torture_boost+0x430/0x430 kthread+0x192/0x1d0 ? kthread_complete_and_exit+0x30/0x30 ret_from_fork+0x22/0x30 </TASK> The problem is that rcu_force_quiescent_state() uses __this_cpu_read() in preemptible code instead of the proper raw_cpu_read(). This commit therefore changes __this_cpu_read() to raw_cpu_read(). Signed-off-by: Zqiang <qiang1.zhang@intel.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21rcu: Remove rcu_is_idle_cpu()Yipeng Zou1-6/+0
The commit 3fcd6a230fa7 ("x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs") introduced rcu_is_idle_cpu() in order to identify the current CPU idle state. But commit f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE, so rcu_is_idle_cpu() is no longer used. This commit therefore removes it. Fixes: f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-20rcu: Keep synchronize_rcu() from enabling irqs in early bootPaul E. McKenney1-4/+6
Making polled RCU grace periods account for expedited grace periods required acquiring the leaf rcu_node structure's lock during early boot, but after rcu_init() was called. This lock is irq-disabled, but the code incorrectly assumes that irqs are always disabled when invoking synchronize_rcu(). The exception is early boot before the scheduler has started, which means that upon return from synchronize_rcu(), irqs will be incorrectly enabled. This commit fixes this bug by using irqsave/irqrestore locking primitives. Fixes: bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Fix missing nocb gp wake on rcu_barrier()Frederic Weisbecker1-0/+11
In preparation for RCU lazy changes, wake up the RCU nocb gp thread if needed after an entrain. This change prevents the RCU barrier callback from waiting in the queue for several seconds before the lazy callbacks in front of it are serviced. Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Use READ_ONCE() for lockless read of rnp->qsmaskJoel Fernandes (Google)1-1/+1
The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This may help avoid load tearing due to concurrent access, KCSAN issues, and preserve sanity of people reading the mask in tracing. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18rcu: Remove duplicate RCU exp QS report from rcu_report_dead()Zqiang1-2/+0
The rcu_report_dead() function invokes rcu_report_exp_rdp() in order to force an immediate expedited quiescent state on the outgoing CPU, and then it invokes rcu_preempt_deferred_qs() to provide any required deferred quiescent state of either sort. Because the call to rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if requested, the call to rcu_report_exp_rdp() is potentially redundant. One possible issue is a concurrent start of a new expedited RCU grace period, but this situation is already handled correctly by __sync_rcu_exp_select_node_cpus(). This function will detect that the CPU is going offline via the error return from its call to smp_call_function_single(). In that case, it will retry, and eventually stop retrying due to rcu_report_exp_rdp() clearing the ->qsmaskinitnext bit corresponding to the target CPU. As a result, __sync_rcu_exp_select_node_cpus() will report the necessary quiescent state after dealing with any remaining CPU. This change assumes that control does not enter rcu_report_dead() within an RCU read-side critical section, but then again, the surviving call to rcu_preempt_deferred_qs() has always made this assumption. This commit therefore removes the call to rcu_report_exp_rdp(), thus relying on rcu_preempt_deferred_qs() to handle both normal and expedited quiescent states. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-09-01Merge branches 'doc.2022.08.31b', 'fixes.2022.08.31b', 'kvfree.2022.08.31b', ↵Paul E. McKenney1-78/+252
'nocb.2022.09.01a', 'poll.2022.08.31b', 'poll-srcu.2022.08.31b' and 'tasks.2022.08.31b' into HEAD doc.2022.08.31b: Documentation updates fixes.2022.08.31b: Miscellaneous fixes kvfree.2022.08.31b: kvfree_rcu() updates nocb.2022.09.01a: NOCB CPU updates poll.2022.08.31b: Full-oldstate RCU polling grace-period API poll-srcu.2022.08.31b: Polled SRCU grace-period updates tasks.2022.08.31b: Tasks RCU updates
2022-08-31rcu-tasks: Make RCU Tasks Trace check for userspace executionZqiang1-2/+2
Userspace execution is a valid quiescent state for RCU Tasks Trace, but the scheduling-clock interrupt does not currently report such quiescent states. Of course, the scheduling-clock interrupt is not strictly speaking userspace execution. However, the only way that this code is not in a quiescent state is if something invoked rcu_read_lock_trace(), and that would be reflected in the ->trc_reader_nesting field in the task_struct structure. Furthermore, this field is checked by rcu_tasks_trace_qs(), which is invoked by rcu_tasks_qs() which is in turn invoked by rcu_note_voluntary_context_switch() in kernels building at least one of the RCU Tasks flavors. It is therefore safe to invoke rcu_tasks_trace_qs() from the rcu_sched_clock_irq(). But rcu_tasks_qs() also invokes rcu_tasks_classic_qs() for RCU Tasks, which lacks the read-side markers provided by RCU Tasks Trace. This raises the possibility that an RCU Tasks grace period could start after the interrupt from userspace execution, but before the call to rcu_sched_clock_irq(). However, it turns out that this is safe because the RCU Tasks grace period waits for an RCU grace period, which will wait for the entire scheduling-clock interrupt handler, including any RCU Tasks read-side critical section that this handler might contain. This commit therefore updates the rcu_sched_clock_irq() function's check for usermode execution and its call to rcu_tasks_classic_qs() to instead check for both usermode execution and interrupt from idle, and to instead call rcu_note_voluntary_context_switch(). This consolidates code and provides more faster RCU Tasks Trace reporting of quiescent states in kernels that do scheduling-clock interrupts for userspace execution. [ paulmck: Consolidate checks into rcu_sched_clock_irq(). ] Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Make synchronize_rcu() fastpath update only boot-CPU countersPaul E. McKenney1-2/+16
Large systems can have hundreds of rcu_node structures, and updating counters in each of them might slow down booting. This commit therefore updates only the counters in those rcu_node structures corresponding to the boot CPU, up to and including the root rcu_node structure. The counters for the remaining rcu_node structures are updated by the rcu_scheduler_starting() function, which executes just before the first non-boot kthread is spawned. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Remove ->rgos_polled field from rcu_gp_oldstate structurePaul E. McKenney1-5/+1
Because both normal and expedited grace periods increment their respective counters on their pre-scheduler early boot fastpaths, the rcu_gp_oldstate structure no longer needs its ->rgos_polled field. This commit therefore removes this field, shrinking this structure so that it is the same size as an rcu_head structure. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Make synchronize_rcu() fast path update ->gp_seq countersPaul E. McKenney1-13/+26
This commit causes the early boot single-CPU synchronize_rcu() fastpath to update the rcu_state and rcu_node structures' ->gp_seq and ->gp_seq_needed counters. This will allow the full-state polled grace-period APIs to detect all normal grace periods without the need to track the special combined polling-only counter, which is a step towards removing the ->rgos_polled field from the rcu_gp_oldstate, thereby reducing its size by one third. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu-tasks: Remove grace-period fast-path rcu-tasks helperPaul E. McKenney1-2/+0
Now that the grace-period fast path can only happen during the pre-scheduler portion of early boot, this fast path can no longer block run-time RCU Tasks and RCU Tasks Trace grace periods. This commit therefore removes the conditional cond_resched_tasks_rcu_qs() invocation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Set rcu_data structures' initial ->gpwrap value to truePaul E. McKenney1-0/+1
It would be good do reduce the size of the rcu_gp_oldstate structure from three unsigned long instances to two, but this requires that the boot-time optimized grace periods update the various ->gp_seq fields. Updating these fields in the rcu_state structure and in all of the rcu_node structures is at least semi-reasonable, but updating them in all of the rcu_data structures is a bridge too far. This means that if there are too many early boot-time grace periods, the ->gp_seq field in the rcu_data structure cannot be trusted. This commit therefore sets each rcu_data structure's ->gpwrap field to provide the necessary impetus for a suitable level of distrust. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Disable run-time single-CPU grace-period optimizationPaul E. McKenney1-31/+9
The run-time single-CPU grace-period optimization applies only to kernels built with CONFIG_SMP=y && CONFIG_PREEMPTION=y that are running on a single-CPU system. But a kernel intended for a single-CPU system should instead be built with CONFIG_SMP=n, and in any case, single-CPU systems running Linux no longer appear to be the common case. Plus this optimization results in the rcu_gp_oldstate structure being half again larger than it needs to be. This commit therefore disables the run-time single-CPU grace-period optimization, so that this optimization applies only during the pre-scheduler portion of the boot sequence. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Add full-sized polling for cond_sync_full()Paul E. McKenney1-1/+27
The cond_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds yet another member of the full-state RCU grace-period polling API, which is the cond_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. [ paulmck: Apply feedback from kernel test robot and Julia Lawall. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Remove blank line from poll_state_synchronize_rcu() docbook headerPaul E. McKenney1-3/+2
This commit removes the blank line preceding the oldstate parameter to the docbook header for the poll_state_synchronize_rcu() function and marks uses of this parameter later in that header. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Add full-sized polling for start_poll()Paul E. McKenney1-13/+45
The start_poll_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the start_poll_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Add full-sized polling for get_state()Paul E. McKenney1-0/+33
The get_state_synchronize_rcu() API compresses the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the next member of the full-state RCU grace-period polling API, namely the get_state_synchronize_rcu_full() function. This uses up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but is guaranteed not to miss grace periods. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Add full-sized polling for get_completed*() and poll_state*()Paul E. McKenney1-4/+72
The get_completed_synchronize_rcu() and poll_state_synchronize_rcu() APIs compress the combined expedited and normal grace-period states into a single unsigned long, which conserves storage, but can miss grace periods in certain cases involving overlapping normal and expedited grace periods. Missing the occasional grace period is usually not a problem, but there are use cases that care about each and every grace period. This commit therefore adds the first members of the full-state RCU grace-period polling API, namely the get_completed_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions. These use up to three times the storage (rcu_gp_oldstate structure instead of unsigned long), but which are guaranteed not to miss grace periods, at least in situations where the single-CPU grace-period optimization does not apply. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu/kvfree: Update KFREE_DRAIN_JIFFIES intervalUladzislau Rezki (Sony)1-4/+19
Currently the monitor work is scheduled with a fixed interval of HZ/20, which is roughly 50 milliseconds. The drawback of this approach is low utilization of the 512 page slots in scenarios with infrequence kvfree_rcu() calls. For example on an Android system: <snip> kworker/3:3-507 [003] .... 470.286305: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=6 kworker/6:1-76 [006] .... 470.416613: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000ea0d6556 nr_records=1 kworker/6:1-76 [006] .... 470.416625: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000003e025849 nr_records=9 kworker/3:3-507 [003] .... 471.390000: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000815a8713 nr_records=48 kworker/1:1-73 [001] .... 471.725785: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000fda9bf20 nr_records=3 kworker/1:1-73 [001] .... 471.725833: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000a425b67b nr_records=76 kworker/0:4-1411 [000] .... 472.085673: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007996be9d nr_records=1 kworker/0:4-1411 [000] .... 472.085728: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d0f0dde5 nr_records=5 kworker/6:1-76 [006] .... 472.260340: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000065630ee4 nr_records=102 <snip> In many cases, out of 512 slots, fewer than 10 were actually used. In order to improve batching and make utilization more efficient this commit sets a drain interval to a fixed 5-seconds interval. Floods are detected when a page fills quickly, and in that case, the reclaim work is re-scheduled for the next scheduling-clock tick (jiffy). After this change: <snip> kworker/7:1-371 [007] .... 5630.725708: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000005ab0ffb3 nr_records=121 kworker/7:1-371 [007] .... 5630.989702: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000060c84761 nr_records=47 kworker/7:1-371 [007] .... 5630.989714: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000000babf308 nr_records=510 kworker/7:1-371 [007] .... 5631.553790: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000bb7bd0ef nr_records=169 kworker/7:1-371 [007] .... 5631.553808: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x0000000044c78753 nr_records=510 kworker/5:6-9428 [005] .... 5631.746102: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000d98519aa nr_records=123 kworker/4:7-9434 [004] .... 5632.001758: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x00000000526c9d44 nr_records=322 kworker/4:7-9434 [004] .... 5632.002073: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000002c6a8afa nr_records=185 kworker/7:1-371 [007] .... 5632.277515: rcu_invoke_kfree_bulk_callback: rcu_preempt bulk=0x000000007f4a962f nr_records=510 <snip> Here, all but one of the cases, more than one hundreds slots were used, representing an order-of-magnitude improvement. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu/kfree: Fix kfree_rcu_shrink_count() return valueJoel Fernandes (Google)1-1/+1
As per the comments in include/linux/shrinker.h, .count_objects callback should return the number of freeable items, but if there are no objects to free, SHRINK_EMPTY should be returned. The only time 0 is returned should be when we are unable to determine the number of objects, or the cache should be skipped for another reason. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-31rcu: Back off upon fill_page_cache_func() allocation failureMichal Hocko1-8/+9
The fill_page_cache_func() function allocates couple of pages to store kvfree_rcu_bulk_data structures. This is a lightweight (GFP_NORETRY) allocation which can fail under memory pressure. The function will, however keep retrying even when the previous attempt has failed. This retrying is in theory correct, but in practice the allocation is invoked from workqueue context, which means that if the memory reclaim gets stuck, these retries can hog the worker for quite some time. Although the workqueues subsystem automatically adjusts concurrency, such adjustment is not guaranteed to happen until the worker context sleeps. And the fill_page_cache_func() function's retry loop is not guaranteed to sleep (see the should_reclaim_retry() function). And we have seen this function cause workqueue lockups: kernel: BUG: workqueue lockup - pool cpus=93 node=1 flags=0x1 nice=0 stuck for 32s! [...] kernel: pool 74: cpus=37 node=0 flags=0x1 nice=0 hung=32s workers=2 manager: 2146 kernel: pwq 498: cpus=249 node=1 flags=0x1 nice=0 active=4/256 refcnt=5 kernel: in-flight: 1917:fill_page_cache_func kernel: pending: dbs_work_handler, free_work, kfree_rcu_monitor Originally, we thought that the root cause of this lockup was several retries with direct reclaim, but this is not yet confirmed. Furthermore, we have seen similar lockups without any heavy memory pressure. This suggests that there are other factors contributing to these lockups. However, it is not really clear that endless retries are desireable. So let's make the fill_page_cache_func() function back off after allocation failure. Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-08-05Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-07-21Merge branch 'ctxt.2022.07.05a' into HEADPaul E. McKenney1-449/+27
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21Merge branches 'doc.2022.06.21a', 'fixes.2022.07.19a', 'nocb.2022.07.19a', ↵Paul E. McKenney1-34/+150
'poll.2022.07.21a', 'rcu-tasks.2022.06.21a' and 'torture.2022.06.21a' into HEAD doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates. poll.2022.07.21a: Polled grace-period updates. rcu-tasks.2022.06.21a: Tasks RCU updates. torture.2022.06.21a: Torture-test updates.
2022-07-21rcu: Add polled expedited grace-period primitivesPaul E. McKenney1-5/+12
This commit adds expedited grace-period functionality to RCU's polled grace-period API, adding start_poll_synchronize_rcu_expedited() and cond_synchronize_rcu_expedited(), which are similar to the existing start_poll_synchronize_rcu() and cond_synchronize_rcu() functions, respectively. Note that although start_poll_synchronize_rcu_expedited() can be invoked very early, the resulting expedited grace periods are not guaranteed to start until after workqueues are fully initialized. On the other hand, both synchronize_rcu() and synchronize_rcu_expedited() can also be invoked very early, and the resulting grace periods will be taken into account as they occur. [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>