diff options
30 files changed, 327 insertions, 160 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst index 1ae79a10a8de..e8c84fcc0507 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.rst +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -1929,16 +1929,46 @@ The Linux-kernel CPU-hotplug implementation has notifiers that are used to allow the various kernel subsystems (including RCU) to respond appropriately to a given CPU-hotplug operation. Most RCU operations may be invoked from CPU-hotplug notifiers, including even synchronous -grace-period operations such as ``synchronize_rcu()`` and -``synchronize_rcu_expedited()``. - -However, all-callback-wait operations such as ``rcu_barrier()`` are also -not supported, due to the fact that there are phases of CPU-hotplug -operations where the outgoing CPU's callbacks will not be invoked until -after the CPU-hotplug operation ends, which could also result in -deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations -during its execution, which results in another type of deadlock when -invoked from a CPU-hotplug notifier. +grace-period operations such as (``synchronize_rcu()`` and +``synchronize_rcu_expedited()``). However, these synchronous operations +do block and therefore cannot be invoked from notifiers that execute via +``stop_machine()``, specifically those between the ``CPUHP_AP_OFFLINE`` +and ``CPUHP_AP_ONLINE`` states. + +In addition, all-callback-wait operations such as ``rcu_barrier()`` may +not be invoked from any CPU-hotplug notifier. This restriction is due +to the fact that there are phases of CPU-hotplug operations where the +outgoing CPU's callbacks will not be invoked until after the CPU-hotplug +operation ends, which could also result in deadlock. Furthermore, +``rcu_barrier()`` blocks CPU-hotplug operations during its execution, +which results in another type of deadlock when invoked from a CPU-hotplug +notifier. + +Finally, RCU must avoid deadlocks due to interaction between hotplug, +timers and grace period processing. It does so by maintaining its own set +of books that duplicate the centrally maintained ``cpu_online_mask``, +and also by reporting quiescent states explicitly when a CPU goes +offline. This explicit reporting of quiescent states avoids any need +for the force-quiescent-state loop (FQS) to report quiescent states for +offline CPUs. However, as a debugging measure, the FQS loop does splat +if offline CPUs block an RCU grace period for too long. + +An offline CPU's quiescent state will be reported either: + +1. As the CPU goes offline using RCU's hotplug notifier (``rcu_report_dead()``). +2. When grace period initialization (``rcu_gp_init()``) detects a + race either with CPU offlining or with a task unblocking on a leaf + ``rcu_node`` structure whose CPUs are all offline. + +The CPU-online path (``rcu_cpu_starting()``) should never need to report +a quiescent state for an offline CPU. However, as a debugging measure, +it does emit a warning if a quiescent state was not already reported +for that CPU. + +During the checking/modification of RCU's hotplug bookkeeping, the +corresponding CPU's leaf node lock is held. This avoids race conditions +between RCU's hotplug notifier hooks, the grace period initialization +code, and the FQS loop, all of which refer to or modify this bookkeeping. Scheduler and RCU ~~~~~~~~~~~~~~~~~ diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst index 2efed9926c3f..bb7128eb322e 100644 --- a/Documentation/RCU/checklist.rst +++ b/Documentation/RCU/checklist.rst @@ -314,6 +314,13 @@ over a rather long period of time, but improvements are always welcome! shared between readers and updaters. Additional primitives are provided for this case, as discussed in lockdep.txt. + One exception to this rule is when data is only ever added to + the linked data structure, and is never removed during any + time that readers might be accessing that structure. In such + cases, READ_ONCE() may be used in place of rcu_dereference() + and the read-side markers (rcu_read_lock() and rcu_read_unlock(), + for example) may be omitted. + 10. Conversely, if you are in an RCU read-side critical section, and you don't hold the appropriate update-side lock, you -must- use the "_rcu()" variants of the list macros. Failing to do so diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst index c9667eb0d444..f3e587acb4de 100644 --- a/Documentation/RCU/rcu_dereference.rst +++ b/Documentation/RCU/rcu_dereference.rst @@ -28,6 +28,12 @@ Follow these rules to keep your RCU code working properly: for an example where the compiler can in fact deduce the exact value of the pointer, and thus cause misordering. +- In the special case where data is added but is never removed + while readers are accessing the structure, READ_ONCE() may be used + instead of rcu_dereference(). In this case, use of READ_ONCE() + takes on the role of the lockless_dereference() primitive that + was removed in v4.15. + - You are only permitted to use rcu_dereference on pointer values. The compiler simply knows too much about integral values to trust it to carry dependencies through integer operations. diff --git a/Documentation/RCU/whatisRCU.rst b/Documentation/RCU/whatisRCU.rst index fb3ff76c3e73..1a4723f48bd9 100644 --- a/Documentation/RCU/whatisRCU.rst +++ b/Documentation/RCU/whatisRCU.rst @@ -497,8 +497,7 @@ long -- there might be other high-priority work to be done. In such cases, one uses call_rcu() rather than synchronize_rcu(). The call_rcu() API is as follows:: - void call_rcu(struct rcu_head * head, - void (*func)(struct rcu_head *head)); + void call_rcu(struct rcu_head *head, rcu_callback_t func); This function invokes func(head) after a grace period has elapsed. This invocation might happen from either softirq or process context, diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c index e2f319dc992d..22911deacb6e 100644 --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -14,11 +14,13 @@ #include <linux/cpufreq.h> #include <linux/smp.h> #include <linux/sched/isolation.h> +#include <linux/rcupdate.h> #include "cpu.h" struct aperfmperf_sample { unsigned int khz; + atomic_t scfpending; ktime_t time; u64 aperf; u64 mperf; @@ -62,17 +64,20 @@ static void aperfmperf_snapshot_khz(void *dummy) s->aperf = aperf; s->mperf = mperf; s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta); + atomic_set_release(&s->scfpending, 0); } static bool aperfmperf_snapshot_cpu(int cpu, ktime_t now, bool wait) { s64 time_delta = ktime_ms_delta(now, per_cpu(samples.time, cpu)); + struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu); /* Don't bother re-computing within the cache threshold time. */ if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS) return true; - smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait); + if (!atomic_xchg(&s->scfpending, 1) || wait) + smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, wait); /* Return false if the previous iteration was too long ago. */ return time_delta <= APERFMPERF_STALE_THRESHOLD_MS; @@ -89,6 +94,9 @@ unsigned int aperfmperf_get_khz(int cpu) if (!housekeeping_cpu(cpu, HK_FLAG_MISC)) return 0; + if (rcu_is_idle_cpu(cpu)) + return 0; /* Idle CPUs are completely uninteresting. */ + aperfmperf_snapshot_cpu(cpu, ktime_get(), true); return per_cpu(samples.khz, cpu); } @@ -108,6 +116,8 @@ void arch_freq_prepare_all(void) for_each_online_cpu(cpu) { if (!housekeeping_cpu(cpu, HK_FLAG_MISC)) continue; + if (rcu_is_idle_cpu(cpu)) + continue; /* Idle CPUs are completely uninteresting. */ if (!aperfmperf_snapshot_cpu(cpu, now, false)) wait = true; } @@ -118,6 +128,8 @@ void arch_freq_prepare_all(void) unsigned int arch_freq_get_on_cpu(int cpu) { + struct aperfmperf_sample *s = per_cpu_ptr(&samples, cpu); + if (!cpu_khz) return 0; @@ -131,6 +143,8 @@ unsigned int arch_freq_get_on_cpu(int cpu) return per_cpu(samples.khz, cpu); msleep(APERFMPERF_REFRESH_DELAY_MS); + atomic_set(&s->scfpending, 1); + smp_mb(); /* ->scfpending before smp_call_function_single(). */ smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1); return per_cpu(samples.khz, cpu); diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c index 6a80f36b5d59..5f436cb4f7c4 100644 --- a/arch/x86/kernel/cpu/mtrr/mtrr.c +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c @@ -794,8 +794,6 @@ void mtrr_ap_init(void) if (!use_intel() || mtrr_aps_delayed_init) return; - rcu_cpu_starting(smp_processor_id()); - /* * Ideally we should hold mtrr_mutex here to avoid mtrr entries * changed, but this routine will be called in cpu boot time, diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index de776b2e6046..99bdcebaedfc 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -229,6 +229,7 @@ static void notrace start_secondary(void *unused) #endif cpu_init_exception_handling(); cpu_init(); + rcu_cpu_starting(raw_smp_processor_id()); x86_cpuinit.early_percpu_clock_init(); preempt_disable(); smp_callin(); diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 2f05e9128201..4b5fd3da5fe8 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -536,6 +536,7 @@ extern int panic_on_warn; extern unsigned long panic_on_taint; extern bool panic_on_taint_nousertaint; extern int sysctl_panic_on_rcu_stall; +extern int sysctl_max_rcu_stall_to_panic; extern int sysctl_panic_on_stackoverflow; extern bool crash_kexec_post_notifiers; diff --git a/include/linux/list.h b/include/linux/list.h index a18c87b63376..89bdc92e75c3 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -9,7 +9,7 @@ #include <linux/kernel.h> /* - * Simple doubly linked list implementation. + * Circular doubly linked list implementation. * * Some of the internal functions ("__xxx") are useful when * manipulating whole lists rather than single entries, as diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index f5594879175a..ccc3ce66c7e0 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -375,6 +375,12 @@ static inline void lockdep_unregister_key(struct lock_class_key *key) #define lockdep_depth(tsk) (0) +/* + * Dummy forward declarations, allow users to write less ifdef-y code + * and depend on dead code elimination. + */ +extern int lock_is_held(const void *); +extern int lockdep_is_held(const void *); #define lockdep_is_held_type(l, r) (1) #define lockdep_assert_held(l) do { (void)(l); } while (0) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 6cdd0152c253..de0826411311 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -241,6 +241,11 @@ bool rcu_lockdep_current_cpu_online(void); static inline bool rcu_lockdep_current_cpu_online(void) { return true; } #endif /* #else #if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_PROVE_RCU) */ +extern struct lockdep_map rcu_lock_map; +extern struct lockdep_map rcu_bh_lock_map; +extern struct lockdep_map rcu_sched_lock_map; +extern struct lockdep_map rcu_callback_map; + #ifdef CONFIG_DEBUG_LOCK_ALLOC static inline void rcu_lock_acquire(struct lockdep_map *map) @@ -253,10 +258,6 @@ static inline void rcu_lock_release(struct lockdep_map *map) lock_release(map, _THIS_IP_); } -extern struct lockdep_map rcu_lock_map; -extern struct lockdep_map rcu_bh_lock_map; -extern struct lockdep_map rcu_sched_lock_map; -extern struct lockdep_map rcu_callback_map; int debug_lockdep_rcu_enabled(void); int rcu_read_lock_held(void); int rcu_read_lock_bh_held(void); @@ -327,7 +328,7 @@ static inline void rcu_preempt_sleep_check(void) { } #else /* #ifdef CONFIG_PROVE_RCU */ -#define RCU_LOCKDEP_WARN(c, s) do { } while (0) +#define RCU_LOCKDEP_WARN(c, s) do { } while (0 && (c)) #define rcu_sleep_check() do { } while (0) #endif /* #else #ifdef CONFIG_PROVE_RCU */ diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h index 3e7919fc5f34..86c8f6c98412 100644 --- a/include/linux/rcupdate_trace.h +++ b/include/linux/rcupdate_trace.h @@ -11,10 +11,10 @@ #include <linux/sched.h> #include <linux/rcupdate.h> -#ifdef CONFIG_DEBUG_LOCK_ALLOC - extern struct lockdep_map rcu_trace_lock_map; +#ifdef CONFIG_DEBUG_LOCK_ALLOC + static inline int rcu_read_lock_trace_held(void) { return lock_is_held(&rcu_trace_lock_map); diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 7c1ecdb356d8..2a97334eb786 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -89,6 +89,8 @@ static inline void rcu_irq_enter_irqson(void) { } static inline void rcu_irq_exit(void) { } static inline void rcu_irq_exit_preempt(void) { } static inline void rcu_irq_exit_check_preempt(void) { } +#define rcu_is_idle_cpu(cpu) \ + (is_idle_task(current) && !in_nmi() && !in_irq() && !in_serving_softirq()) static inline void exit_rcu(void) { } static inline bool rcu_preempt_need_deferred_qs(struct task_struct *t) { diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 59eb5cd567d7..df578b73960f 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -50,6 +50,7 @@ void rcu_irq_exit(void); void rcu_irq_exit_preempt(void); void rcu_irq_enter_irqson(void); void rcu_irq_exit_irqson(void); +bool rcu_is_idle_cpu(int cpu); #ifdef CONFIG_PROVE_RCU void rcu_irq_exit_check_preempt(void); diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 85fb2f34c59b..c0f71f2e7160 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -47,9 +47,7 @@ extern spinlock_t mmlist_lock; extern union thread_union init_thread_union; extern struct task_struct init_task; -#ifdef CONFIG_PROVE_RCU extern int lockdep_tasklist_lock_is_held(void); -#endif /* #ifdef CONFIG_PROVE_RCU */ extern asmlinkage void schedule_tail(struct task_struct *prev); extern void init_idle(struct task_struct *idle, int cpu); diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index d8fd8676fc72..749db62f6215 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -435,7 +435,6 @@ struct tcf_block { struct mutex proto_destroy_lock; /* Lock for proto_destroy hashtable. */ }; -#ifdef CONFIG_PROVE_LOCKING static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain) { return lockdep_is_held(&chain->filter_chain_lock); @@ -445,17 +444,6 @@ static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp) { return lockdep_is_held(&tp->lock); } -#else -static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain) -{ - return true; -} - -static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp) -{ - return true; -} -#endif /* #ifdef CONFIG_PROVE_LOCKING */ #define tcf_chain_dereference(p, chain) \ rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain)) diff --git a/include/net/sock.h b/include/net/sock.h index a5c6ae78df77..198d5486fb09 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1566,13 +1566,11 @@ do { \ lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \ } while (0) -#ifdef CONFIG_LOCKDEP static inline bool lockdep_sock_is_held(const struct sock *sk) { return lockdep_is_held(&sk->sk_lock) || lockdep_is_held(&sk->sk_lock.slock); } -#endif void lock_sock_nested(struct sock *sk, int subclass); diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index b71e21f73c40..cdc57b4f6d48 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -221,19 +221,23 @@ config RCU_NOCB_CPU Use this option to reduce OS jitter for aggressive HPC or real-time workloads. It can also be used to offload RCU callback invocation to energy-efficient CPUs in battery-powered - asymmetric multiprocessors. + asymmetric multiprocessors. The price of this reduced jitter + is that the overhead of call_rcu() increases and that some + workloads will incur significant increases in context-switch + rates. This option offloads callback invocation from the set of CPUs specified at boot time by the rcu_nocbs parameter. For each such CPU, a kthread ("rcuox/N") will be created to invoke callbacks, where the "N" is the CPU being offloaded, and where - the "p" for RCU-preempt (PREEMPTION kernels) and "s" for RCU-sched - (!PREEMPTION kernels). Nothing prevents this kthread from running - on the specified CPUs, but (1) the kthreads may be preempted - between each callback, and (2) affinity or cgroups can be used - to force the kthreads to run on whatever set of CPUs is desired. - - Say Y here if you want to help to debug reduced OS jitter. + the "x" is "p" for RCU-preempt (PREEMPTION kernels) and "s" for + RCU-sched (!PREEMPTION kernels). Nothing prevents this kthread + from running on the specified CPUs, but (1) the kthreads may be + preempted between each callback, and (2) affinity or cgroups can + be used to force the kthreads to run on whatever set of CPUs is + desired. + + Say Y here if you need reduced OS jitter, despite added overhead. Say N here if you are unsure. config TASKS_TRACE_RCU_READ_MB diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index e01cba5e4b52..59ef1ae6dc37 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -533,4 +533,20 @@ static inline bool rcu_is_nocb_cpu(int cpu) { return false; } static inline void rcu_bind_current_to_nocb(void) { } #endif +#if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_RCU) +void show_rcu_tasks_classic_gp_kthread(void); +#else +static inline void show_rcu_tasks_classic_gp_kthread(void) {} +#endif +#if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_RUDE_RCU) +void show_rcu_tasks_rude_gp_kthread(void); +#else +static inline void show_rcu_tasks_rude_gp_kthread(void) {} +#endif +#if !defined(CONFIG_TINY_RCU) && defined(CONFIG_TASKS_TRACE_RCU) +void show_rcu_tasks_trace_gp_kthread(void); +#else +static inline void show_rcu_tasks_trace_gp_kthread(void) {} +#endif + #endif /* __LINUX_RCU_H */ diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h index 5c293afc07b8..492262bcb591 100644 --- a/kernel/rcu/rcu_segcblist.h +++ b/kernel/rcu/rcu_segcblist.h @@ -62,7 +62,7 @@ static inline bool rcu_segcblist_is_enabled(struct rcu_segcblist *rsclp) /* Is the specified rcu_segcblist offloaded? */ static inline bool rcu_segcblist_is_offloaded(struct rcu_segcblist *rsclp) { - return rsclp->offloaded; + return IS_ENABLED(CONFIG_RCU_NOCB_CPU) && rsclp->offloaded; } /* diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 4dfd113882aa..528ed10b78fd 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -317,6 +317,7 @@ struct rcu_torture_ops { void (*cb_barrier)(void); void (*fqs)(void); void (*stats)(void); + void (*gp_kthread_dbg)(void); int (*stall_dur)(void); int irq_capable; int can_boost; @@ -466,6 +467,7 @@ static struct rcu_torture_ops rcu_ops = { .cb_barrier = rcu_barrier, .fqs = rcu_force_quiescent_state, .stats = NULL, + .gp_kthread_dbg = show_rcu_gp_kthreads, .stall_dur = rcu_jiffies_till_stall_check, .irq_capable = 1, .can_boost = rcu_can_boost(), @@ -693,6 +695,7 @@ static struct rcu_torture_ops tasks_ops = { .exp_sync = synchronize_rcu_mult_test, .call = call_rcu_tasks, .cb_barrier = rcu_barrier_tasks, + .gp_kthread_dbg = show_rcu_tasks_classic_gp_kthread, .fqs = NULL, .stats = NULL, .irq_capable = 1, @@ -762,6 +765,7 @@ static struct rcu_torture_ops tasks_rude_ops = { .exp_sync = synchronize_rcu_tasks_rude, .call = call_rcu_tasks_rude, .cb_barrier = rcu_barrier_tasks_rude, + .gp_kthread_dbg = show_rcu_tasks_rude_gp_kthread, .fqs = NULL, .stats = NULL, .irq_capable = 1, @@ -800,6 +804,7 @@ static struct rcu_torture_ops tasks_tracing_ops = { .exp_sync = synchronize_rcu_tasks_trace, .call = call_rcu_tasks_trace, .cb_barrier = rcu_barrier_tasks_trace, + .gp_kthread_dbg = show_rcu_tasks_trace_gp_kthread, .fqs = NULL, .stats = NULL, .irq_capable = 1, @@ -1604,7 +1609,8 @@ rcu_torture_stats_print(void) sched_show_task(wtp); splatted = true; } - show_rcu_gp_kthreads(); + if (cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); rcu_ftrace_dump(DUMP_ALL); } rtcv_snap = rcu_torture_current_version; @@ -2486,7 +2492,8 @@ rcu_torture_cleanup(void) return; } - show_rcu_gp_kthreads(); + if (cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); rcu_torture_read_exit_cleanup(); rcu_torture_barrier_cleanup(); rcu_torture_fwd_prog_cleanup(); diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index c13348ee80a5..0f23d20d485a 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -177,11 +177,13 @@ static int init_srcu_struct_fields(struct srcu_struct *ssp, bool is_static) INIT_DELAYED_WORK(&ssp->work, process_srcu); if (!is_static) ssp->sda = alloc_percpu(struct srcu_data); + if (!ssp->sda) + return -ENOMEM; init_srcu_struct_nodes(ssp, is_static); ssp->srcu_gp_seq_needed_exp = 0; ssp->srcu_last_gp_end = ktime_get_mono_fast_ns(); smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */ - return ssp->sda ? 0 : -ENOMEM; + return 0; } #ifdef CONFIG_DEBUG_LOCK_ALLOC @@ -906,7 +908,7 @@ static void __synchronize_srcu(struct srcu_struct *ssp, bool do_norm) { struct rcu_synchronize rcu; - RCU_LOCKDEP_WARN(lock_is_held(&ssp->dep_map) || + RCU_LOCKDEP_WARN(lockdep_is_held(ssp) || lock_is_held(&rcu_bh_lock_map) || lock_is_held(&rcu_lock_map) || lock_is_held(&rcu_sched_lock_map), diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index d5d9f2d03e8a..35bdcfd84d42 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -290,7 +290,7 @@ static void show_rcu_tasks_generic_gp_kthread(struct rcu_tasks *rtp, char *s) ".C"[!!data_race(rtp->cbs_head)], s); } -#endif /* #ifndef CONFIG_TINY_RCU */ +#endif // #ifndef CONFIG_TINY_RCU static void exit_tasks_rcu_finish_trace(struct task_struct *t); @@ -335,23 +335,18 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp) // Start off with initial wait and slowly back off to 1 HZ wait. fract = rtp->init_fract; - if (fract > HZ) - fract = HZ; - for (;;) { + while (!list_empty(&holdouts)) { bool firstreport; bool needreport; int rtst; - if (list_empty(&holdouts)) - break; - /* Slowly back off waiting for holdouts */ set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS); - schedule_timeout_idle(HZ/fract); + schedule_timeout_idle(fract); - if (fract > 1) - fract--; + if (fract < HZ) + fract++; rtst = READ_ONCE(rcu_task_stall_timeout); needreport = rtst > 0 && time_after(jiffies, lastreport + rtst); @@ -560,7 +555,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks); static int __init rcu_spawn_tasks_kthread(void) { rcu_tasks.gp_sleep = HZ / 10; - rcu_tasks.init_fract = 10; + rcu_tasks.init_fract = HZ / 10; rcu_tasks.pregp_func = rcu_tasks_pregp_step; rcu_tasks.pertask_func = rcu_tasks_pertask; rcu_tasks.postscan_func = rcu_tasks_postscan; @@ -571,12 +566,13 @@ static int __init rcu_spawn_tasks_kthread(void) } core_initcall(rcu_spawn_tasks_kthread); -#ifndef CONFIG_TINY_RCU -static void show_rcu_tasks_classic_gp_kthread(void) +#if !defined(CONFIG_TINY_RCU) +void show_rcu_tasks_classic_gp_kthread(void) { show_rcu_tasks_generic_gp_kthread(&rcu_tasks, ""); } -#endif /* #ifndef CONFIG_TINY_RCU */ +EXPORT_SYMBOL_GPL(show_rcu_tasks_classic_gp_kthread); +#endif // !defined(CONFIG_TINY_RCU) /* Do the srcu_read_lock() for the above synchronize_srcu(). */ void exit_tasks_rcu_start(void) __acquires(&tasks_rcu_exit_srcu) @@ -598,7 +594,6 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu) } #else /* #ifdef CONFIG_TASKS_RCU */ -static inline void show_rcu_tasks_classic_gp_kthread(void) { } void exit_tasks_rcu_start(void) { } void exit_tasks_rcu_finish(void) { exit_tasks_rcu_finish_trace(current); } #endif /* #else #ifdef CONFIG_TASKS_RCU */ @@ -699,16 +694,14 @@ static int __init rcu_spawn_tasks_rude_kthread(void) } core_initcall(rcu_spawn_tasks_rude_kthread); -#ifndef CONFIG_TINY_RCU -static void show_rcu_tasks_rude_gp_kthread(void) +#if !defined(CONFIG_TINY_RCU) +void show_rcu_tasks_rude_gp_kthread(void) { show_rcu_tasks_generic_gp_kthread(&rcu_tasks_rude, ""); } -#endif /* #ifndef CONFIG_TINY_RCU */ - -#else /* #ifdef CONFIG_TASKS_RUDE_RCU */ -static void show_rcu_tasks_rude_gp_kthread(void) {} -#endif /* #else #ifdef CONFIG_TASKS_RUDE_RCU */ +EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); +#endif // !defined(CONFIG_TINY_RCU) +#endif /* #ifdef CONFIG_TASKS_RUDE_RCU */ //////////////////////////////////////////////////////////////////////// // @@ -1183,12 +1176,12 @@ static int __init rcu_spawn_tasks_trace_kthread(void) { if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) { rcu_tasks_trace.gp_sleep = HZ / 10; - rcu_tasks_trace.init_fract = 10; + rcu_tasks_trace.init_fract = HZ / 10; } else { rcu_tasks_trace.gp_sleep = HZ / 200; if (rcu_tasks_trace.gp_sleep <= 0) rcu_tasks_trace.gp_sleep = 1; - rcu_tasks_trace.init_fract = HZ / 5; + rcu_tasks_trace.init_fract = HZ / 200; if (rcu_tasks_trace.init_fract <= 0) rcu_tasks_trace.init_fract = 1; } @@ -1202,8 +1195,8 @@ static int __init rcu_spawn_tasks_trace_kthread(void) } core_initcall(rcu_spawn_tasks_trace_kthread); -#ifndef CONFIG_TINY_RCU -static void show_rcu_tasks_trace_gp_kthread(void) +#if !defined(CONFIG_TINY_RCU) +void show_rcu_tasks_trace_gp_kthread(void) { char buf[64]; @@ -1213,11 +1206,11 @@ static void show_rcu_tasks_trace_gp_kthread(void) data_race(n_heavy_reader_attempts)); show_rcu_tasks_generic_gp_kthread(&rcu_tasks_trace, buf); } -#endif /* #ifndef CONFIG_TINY_RCU */ +EXPORT_SYMBOL_GPL(show_rcu_tasks_trace_gp_kthread); +#endif // !defined(CONFIG_TINY_RCU) #else /* #ifdef CONFIG_TASKS_TRACE_RCU */ static void exit_tasks_rcu_finish_trace(struct task_struct *t) { } -static inline void show_rcu_tasks_trace_gp_kthread(void) {} #endif /* #else #ifdef CONFIG_TASKS_TRACE_RCU */ #ifndef CONFIG_TINY_RCU diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 06895ef85d69..516c6893e359 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444); * per-CPU. Object size is equal to one page. This value * can be changed at boot time. */ -static int rcu_min_cached_objs = 2; +static int rcu_min_cached_objs = 5; module_param(rcu_min_cached_objs, int, 0444); /* Retrieve RCU kthreads priority for rcutorture */ @@ -341,6 +341,14 @@ static bool rcu_dynticks_in_eqs(int snap) return !(snap & RCU_DYNTICK_CTRL_CTR); } +/* Return true if the specified CPU is currently idle from an RCU viewpoint. */ +bool rcu_is_idle_cpu(int cpu) +{ + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); + + return rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp)); +} + /* * Return true if the CPU corresponding to the specified rcu_data * structure has spent some time in an extended quiescent state since @@ -546,12 +554,12 @@ static int param_set_next_fqs_jiffies(const char *val, const struct kernel_param return ret; } -static struct kernel_param_ops first_fqs_jiffies_ops = { +static const struct kernel_param_ops first_fqs_jiffies_ops = { .set = param_set_first_fqs_jiffies, .get = param_get_ulong, }; -static struct kernel_param_ops next_fqs_jiffies_ops = { +static const struct kernel_param_ops next_fqs_jiffies_ops = { .set = param_set_next_fqs_jiffies, .get = param_get_ulong, }; @@ -928,8 +936,8 @@ void __rcu_irq_enter_check_tick(void) { struct rcu_data *rdp = this_cpu_ptr(&rcu_data); - // Enabling the tick is unsafe in NMI handlers. - if (WARN_ON_ONCE(in_nmi())) + // If we're here from NMI there's nothing to do. + if (in_nmi()) return; RCU_LOCKDEP_WARN(rcu_dynticks_curr_cpu_in_eqs(), @@ -1093,8 +1101,11 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp) * CPU can safely enter RCU read-side critical sections. In other words, * if the current CPU is not in its idle loop or is in an interrupt or * NMI handler, return true. + * + * Make notrace because it can be called by the internal functions of + * ftrace, and making this notrace removes unnecessary recursion calls. */ -bool rcu_is_watching(void) +notrace bool rcu_is_watching(void) { bool ret; @@ -1149,7 +1160,7 @@ bool rcu_lockdep_current_cpu_online(void) preempt_disable_notrace(); rdp = this_cpu_ptr(&rcu_data); rnp = rdp->mynode; - if (rdp->grpmask & rcu_rnp_online_cpus(rnp)) + if (rdp->grpmask & rcu_rnp_online_cpus(rnp) || READ_ONCE(rnp->ofl_seq) & 0x1) ret = true; preempt_enable_notrace(); return ret; @@ -1603,8 +1614,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) { bool ret = false; bool need_qs; - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); raw_lockdep_assert_held_rcu_node(rnp); @@ -1715,6 +1725,7 @@ static void rcu_strict_gp_boundary(void *unused) */ static bool rcu_gp_init(void) { + unsigned long firstseq; unsigned long flags; unsigned long oldmask; unsigned long mask; @@ -1758,6 +1769,12 @@ static bool rcu_gp_init(void) */ rcu_state.gp_state = RCU_GP_ONOFF; rcu_for_each_leaf_node(rnp) { + smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. + firstseq = READ_ONCE(rnp->ofl_seq); + if (firstseq & 0x1) + while (firstseq == READ_ONCE(rnp->ofl_seq)) + schedule_timeout_idle(1); // Can't wake unless RCU is watching. + smp_mb(); // Pair with barriers used when updating ->ofl_seq to even values. raw_spin_lock(&rcu_state.ofl_lock); raw_spin_lock_irq_rcu_node(rnp); if (rnp->qsmaskinit == rnp->qsmaskinitnext && @@ -2048,8 +2065,7 @@ static void rcu_gp_cleanup(void) needgp = true; } /* Advance CBs to reduce false positives below. */ - offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - rcu_segcblist_is_offloaded(&rdp->cblist); + offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); @@ -2248,8 +2264,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp) unsigned long flags; unsigned long mask; bool needwake = false; - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); struct rcu_node *rnp; WARN_ON_ONCE(rdp->cpu != smp_processor_id()); @@ -2399,6 +2414,7 @@ int rcutree_dead_cpu(unsigned int cpu) if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) return 0; + WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus - 1); /* Adjust any no-longer-needed kthreads. */ rcu_boost_kthread_setaffinity(rnp, -1); /* Do any needed no-CB deferred wakeups from this CPU. */ @@ -2417,8 +2433,7 @@ static void rcu_do_batch(struct rcu_data *rdp) { int div; unsigned long flags; - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count; @@ -2675,8 +2690,7 @@ static __latent_entropy void rcu_core(void) unsigned long flags; struct rcu_data *rdp = raw_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; - const bool offloaded = IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); if (cpu_is_offline(smp_processor_id())) return; @@ -2978,8 +2992,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) rcu_segcblist_n_cbs(&rdp->cblist)); /* Go handle any RCU core processing required. */ - if (IS_ENABLED(CONFIG_RCU_NOCB_CPU) && - unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { + if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */ } else { __call_rcu_core(rdp, head, flags); @@ -3084,6 +3097,9 @@ struct kfree_rcu_cpu_work { * In order to save some per-cpu space the list is singular. * Even though it is lockless an access has to be protected by the * per-cpu lock. + * @page_cache_work: A work to refill the cache when it is empty + * @work_in_progress: Indicates that page_cache_work is running + * @hrtimer: A hrtimer for scheduling a page_cache_work * @nr_bkv_objs: number of allocated objects at @bkvcache. * * This is a per-CPU structure. The reason that it is not included in @@ -3100,6 +3116,11 @@ struct kfree_rcu_cpu { bool monitor_todo; bool initialized; int count; + + struct work_struct page_cache_work; + atomic_t work_in_progress; + struct hrtimer hrtimer; + struct llist_head bkvcache; int nr_bkv_objs; }; @@ -3217,10 +3238,10 @@ static void kfree_rcu_work(struct work_struct *work) } rcu_lock_release(&rcu_callback_map); - krcp = krc_this_cpu_lock(&flags); + raw_spin_lock_irqsave(&krcp->lock, flags); if (put_cached_bnode(krcp, bkvhead[i])) bkvhead[i] = NULL; - krc_this_cpu_unlock(krcp, flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); if (bkvhead[i]) free_page((unsigned long) bkvhead[i]); @@ -3347,6 +3368,57 @@ static void kfree_rcu_monitor(struct work_struct *work) raw_spin_unlock_irqrestore(&krcp->lock, flags); } +static enum hrtimer_restart +schedule_page_work_fn(struct hrtimer *t) +{ + struct kfree_rcu_cpu *krcp = + container_of(t, struct kfree_rcu_cpu, hrtimer); + + queue_work(system_highpri_wq, &krcp->page_cache_work); + return HRTIMER_NORESTART; +} + +static void fill_page_cache_func(struct work_struct *work) +{ + struct kvfree_rcu_bulk_data *bnode; + struct kfree_rcu_cpu *krcp = + container_of(work, struct kfree_rcu_cpu, + page_cache_work); + unsigned long flags; + bool pushed; + int i; + + for (i = 0; i < rcu_min_cached_objs; i++) { + bnode = (struct kvfree_rcu_bulk_data *) + __get_free_page(GFP_KERNEL | __GFP_NOWARN); + + if (bnode) { + raw_spin_lock_irqsave(&krcp->lock, flags); + pushed = put_cached_bnode(krcp, bnode); + raw_spin_unlock_irqrestore(&krcp->lock, flags); + + if (!pushed) { + free_page((unsigned long) bnode); + break; + } + } + } + + atomic_set(&krcp->work_in_progress, 0); +} + +static void +run_page_cache_worker(struct kfree_rcu_cpu *krcp) +{ + if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING && + !atomic_xchg(&krcp->work_in_progress, 1)) { + hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + krcp->hrtimer.function = schedule_page_work_fn; + hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL); + } +} + static inline bool kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) { @@ -3363,32 +3435,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) if (!krcp->bkvhead[idx] || krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { bnode = get_cached_bnode(krcp); - if (!bnode) { - /* - * To keep this path working on raw non-preemptible - * sections, prevent the optional entry into the - * allocator as it uses sleeping locks. In fact, even - * if the caller of kfree_rcu() is preemptible, this - * path still is not, as krcp->lock is a raw spinlock. - * With additional page pre-allocation in the works, - * hitting this return is going to be much less likely. - */ - if (IS_ENABLED(CONFIG_PREEMPT_RT)) - return false; - - /* - * NOTE: For one argument of kvfree_rcu() we can - * drop the lock and get the page in sleepable - * context. That would allow to maintain an array - * for the CONFIG_PREEMPT_RT as well if no cached - * pages are available. - */ - bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_NOWAIT | __GFP_NOWARN); - } - /* Switch to emergency path. */ - if (unlikely(!bnode)) + if (!bnode) return false; /* Initialize the new block. */ @@ -3452,12 +3500,10 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) goto unlock_return; } - /* - * Under high memory pressure GFP_NOWAIT can fail, - * in that case the emergency path is maintained. - */ success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); if (!success) { + run_page_cache_worker(krcp); + if (head == NULL) // Inline if kvfree_rcu(one_arg) call. goto unlock_return; @@ -3567,7 +3613,7 @@ void __init kfree_rcu_scheduler_running(void) * During early boot, any blocking grace-period wait automatically * implies a grace period. Later on, this is never the case for PREEMPTION. * - * Howevr, because a context switch is a grace period for !PREEMPTION, any + * However, because a context switch is a grace period for !PREEMPTION, any * blocking grace-period wait automatically implies a grace period if * there is only one CPU online at any point time during execution of * either synchronize_rcu() or synchronize_rcu_expedited(). It is OK to @@ -3583,7 +3629,20 @@ static int rcu_blocking_is_gp(void) return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE; might_sleep(); /* Check for RCU read-side critical section. */ preempt_disable(); - ret = num_online_cpus() <= 1; + /* + * If the rcu_state.n_online_cpus counter is equal to one, + * there is only one CPU, and that CPU sees all prior accesses + * made by any CPU that was online at the time of its access. + * Furthermore, if this counter is equal to one, its value cannot + * change until after the preempt_enable() below. + * + * Furthermore, if rcu_state.n_online_cpus is equal to one here, + * all later CPUs (both this one and any that come online later + * on) are guaranteed to see all accesses prior to this point + * in the code, without the need for additional memory barriers. + * Those memory barriers are provided by CPU-hotplug code. + */ + ret = READ_ONCE(rcu_state.n_online_cpus) <= 1; preempt_enable(); return ret; } @@ -3628,7 +3687,7 @@ void synchronize_rcu(void) lock_is_held(&rcu_sched_lock_map), "Illegal synchronize_rcu() in RCU read-side critical section"); if (rcu_blocking_is_gp()) - return; + return; // Context allows vacuous grace periods. if (rcu_gp_is_expedited()) synchronize_rcu_expedited(); else @@ -3707,13 +3766,13 @@ static int rcu_pending(int user) return 1; /* Does this CPU have callbacks ready to invoke? */ - if (rcu_segcblist_ready_cbs(&rdp->cblist)) + if (!rcu_segcblist_is_offloaded(&rdp->cblist) && + rcu_segcblist_ready_cbs(&rdp->cblist)) return 1; /* Has RCU gone idle with this CPU needing another grace period? */ if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) && - (!IS_ENABLED(CONFIG_RCU_NOCB_CPU) || - !rcu_segcblist_is_offloaded(&rdp->cblist)) && + !rcu_segcblist_is_offloaded(&rdp->cblist) && !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) return 1; @@ -3969,6 +4028,7 @@ int rcutree_prepare_cpu(unsigned int cpu) raw_spin_unlock_irqrestore_rcu_node(rnp, flags); rcu_prepare_kthreads(cpu); rcu_spawn_cpu_nocb_kthread(cpu); + WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1); return 0; } @@ -4057,6 +4117,9 @@ void rcu_cpu_starting(unsigned int cpu) rnp = rdp->mynode; mask = rdp->grpmask; + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). raw_spin_lock_irqsave_rcu_node(rnp, flags); WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); newcpu = !(rnp->expmaskinitnext & mask); @@ -4067,13 +4130,18 @@ void rcu_cpu_starting(unsigned int cpu) rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */ rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq); rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags); - if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */ + + /* An incoming CPU should never be blocking a grace period. */ + if (WARN_ON_ONCE(rnp->qsmask & mask)) { /* RCU waiting on incoming CPU? */ rcu_disable_urgency_upon_qs(rdp); /* Report QS -after- changing ->qsmaskinitnext! */ rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); } else { raw_spin_unlock_irqrestore_rcu_node(rnp, flags); } + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(rnp->ofl_seq & 0x1); smp_mb(); /* Ensure RCU read-side usage follows above initialization. */ } @@ -4101,6 +4169,9 @@ void rcu_report_dead(unsigned int cpu) /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ mask = rdp->grpmask; + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). raw_spin_lock(&rcu_state.ofl_lock); raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */ rdp->rcu_ofl_gp_seq = READ_ONCE(rcu_state.gp_seq); @@ -4113,6 +4184,9 @@ void rcu_report_dead(unsigned int cpu) WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); raw_spin_unlock(&rcu_state.ofl_lock); + smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). + WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); + WARN_ON_ONCE(rnp->ofl_seq & 0x1); rdp->cpu_started = false; } @@ -4449,24 +4523,14 @@ static void __init kfree_rcu_batch_init(void) for_each_possible_cpu(cpu) { struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu); - struct kvfree_rcu_bulk_data *bnode; for (i = 0; i < KFREE_N_BATCHES; i++) { INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work); krcp->krw_arr[i].krcp = krcp; } - for (i = 0; i < rcu_min_cached_objs; i++) { - bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_NOWAIT | __GFP_NOWARN); - - if (bnode) - put_cached_bnode(krcp, bnode); - else - pr_err("Failed to preallocate for %d CPU!\n", cpu); - } - INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor); + INIT_WORK(&krcp->page_cache_work, fill_page_cache_func); krcp->initialized = true; } if (register_shrinker(&kfree_rcu_shrinker)) diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index e4f66b8f7c47..7708ed161f4a 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -56,6 +56,7 @@ struct rcu_node { /* Initialized from ->qsmaskinitnext at the */ /* beginning of each grace period. */ unsigned long qsmaskinitnext; + unsigned long ofl_seq; /* CPU-hotplug operation sequence count. */ /* Online CPUs for next grace period. */ unsigned long expmask; /* CPUs or groups that need to check in */ /* to allow the current expedited GP */ @@ -298,6 +299,7 @@ struct rcu_state { /* Hierarchy levels (+1 to */ /* shut bogus gcc warning) */ int ncpus; /* # CPUs seen so far. */ + int n_online_cpus; /* # CPUs online for RCU. */ /* The following fields are guarded by the root rcu_node's lock. */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index fd8a52e9a887..7e291ce0a1d6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -628,7 +628,7 @@ static void rcu_read_unlock_special(struct task_struct *t) set_tsk_need_resched(current); set_preempt_need_resched(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - !rdp->defer_qs_iw_pending && exp) { + !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. init_irq_work(&rdp->defer_qs_iw, diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 0fde39b8daab..70d48c52fabc 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -13,6 +13,7 @@ /* panic() on RCU Stall sysctl. */ int sysctl_panic_on_rcu_stall __read_mostly; +int sysctl_max_rcu_stall_to_panic __read_mostly; #ifdef CONFIG_PROVE_RCU #define RCU_STALL_DELAY_DELTA (5 * HZ) @@ -106,6 +107,11 @@ early_initcall(check_cpu_stall_init); /* If so specified via sysctl, panic, yielding cleaner stall-warning output. */ static void panic_on_rcu_stall(void) { + static int cpu_stall; + + if (++cpu_stall < sysctl_max_rcu_stall_to_panic) + return; + if (sysctl_panic_on_rcu_stall) panic("RCU Stall\n"); } @@ -249,13 +255,16 @@ static bool check_slow_task(struct task_struct *t, void *arg) /* * Scan the current list of tasks blocked within RCU read-side critical - * sections, printing out the tid of each. + * sections, printing out the tid of each of the first few of them. */ -static int rcu_print_task_stall(struct rcu_node *rnp) +static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags) + __releases(rnp->lock) { + int i = 0; int ndetected = 0; struct rcu_stall_chk_rdr rscr; struct task_struct *t; + struct task_struct *ts[8]; if (!rcu_preempt_blocked_readers_cgp(rnp)) return 0; @@ -264,6 +273,14 @@ static int rcu_print_task_stall(struct rcu_node *rnp) t = list_entry(rnp->gp_tasks->prev, struct task_struct, rcu_node_entry); list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) { + get_task_struct(t); + ts[i++] = t; + if (i >= ARRAY_SIZE(ts)) + break; + } + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + for (i--; i; i--) { + t = ts[i]; if (!try_invoke_on_locked_down_task(t, check_slow_task, &rscr)) pr_cont(" P%d", t->pid); else @@ -273,6 +290,7 @@ static int rcu_print_task_stall(struct rcu_node *rnp) ".q"[rscr.rs.b.need_qs], ".e"[rscr.rs.b.exp_hint], ".l"[rscr.on_blkd_list]); + put_task_struct(t); ndetected++; } pr_cont("\n"); @@ -293,8 +311,9 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp) * Because preemptible RCU does not exist, we never have to check for * tasks blocked within RCU read-side critical sections. */ -static int rcu_print_task_stall(struct rcu_node *rnp) +static int rcu_print_task_stall(struct rcu_node *rnp, unsigned long flags) { + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); return 0; } #endif /* #else #ifdef CONFIG_PREEMPT_RCU */ @@ -472,7 +491,6 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); rcu_for_each_leaf_node(rnp) { raw_spin_lock_irqsave_rcu_node(rnp, flags); - ndetected += rcu_print_task_stall(rnp); if (rnp->qsmask != 0) { for_each_leaf_node_possible_cpu(rnp, cpu) if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) { @@ -480,7 +498,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) ndetected++; } } - raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + ndetected += rcu_print_task_stall(rnp, flags); // Releases rnp->lock. } for_each_possible_cpu(cpu) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index afad085960b8..c9fbdd848138 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2650,6 +2650,17 @@ static struct ctl_table kern_table[] = { .extra2 = SYSCTL_ONE, }, #endif +#if defined(CONFIG_TREE_RCU) + { + .procname = "max_rcu_stall_to_panic", + .data = &sysctl_max_rcu_stall_to_panic, + .maxlen = sizeof(sysctl_max_rcu_stall_to_panic), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ONE, + .extra2 = SYSCTL_INT_MAX, + }, +#endif #ifdef CONFIG_STACKLEAK_RUNTIME_DISABLE { .procname = "stack_erasing", diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 index 12e7661b86f5..34c8ff5a12f2 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE01 @@ -4,8 +4,8 @@ CONFIG_HOTPLUG_CPU=y CONFIG_PREEMPT_NONE=y CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=n -CONFIG_DEBUG_LOCK_ALLOC=y -CONFIG_PROVE_LOCKING=y -#CHECK#CONFIG_PROVE_RCU=y +CONFIG_DEBUG_LOCK_ALLOC=n +CONFIG_PROVE_LOCKING=n +#CHECK#CONFIG_PROVE_RCU=n CONFIG_TASKS_TRACE_RCU_READ_MB=y CONFIG_RCU_EXPERT=y diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TRACE02 b/tools/testing/selftests/rcutorture/configs/rcu/TRACE02 index b69ed6673c41..77541eeb4e9f 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TRACE02 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TRACE02 @@ -4,8 +4,8 @@ CONFIG_HOTPLUG_CPU=y CONFIG_PREEMPT_NONE=n CONFIG_PREEMPT_VOLUNTARY=n CONFIG_PREEMPT=y -CONFIG_DEBUG_LOCK_ALLOC=n -CONFIG_PROVE_LOCKING=n -#CHECK#CONFIG_PROVE_RCU=n +CONFIG_DEBUG_LOCK_ALLOC=y +CONFIG_PROVE_LOCKING=y +#CHECK#CONFIG_PROVE_RCU=y CONFIG_TASKS_TRACE_RCU_READ_MB=n CONFIG_RCU_EXPERT=y |