Diffstat (limited to 'Documentation/RCU/Design/Data-Structures')
| -rw-r--r-- | Documentation/RCU/Design/Data-Structures/Data-Structures.html | 1391 | ||||
| -rw-r--r-- | Documentation/RCU/Design/Data-Structures/Data-Structures.rst | 1163 | 
2 files changed, 1163 insertions, 1391 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html deleted file mode 100644 index c30c1957c7e6..000000000000 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ /dev/null @@ -1,1391 +0,0 @@ -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" -        "http://www.w3.org/TR/html4/loose.dtd"> -        <html> -        <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> -        <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> - -           <p>December 18, 2016</p> -           <p>This article was contributed by Paul E. McKenney</p> - -<h3>Introduction</h3> - -This document describes RCU's major data structures and their relationship -to each other. - -<ol> -<li>	<a href="#Data-Structure Relationships"> -	Data-Structure Relationships</a> -<li>	<a href="#The rcu_state Structure"> -	The <tt>rcu_state</tt> Structure</a> -<li>	<a href="#The rcu_node Structure"> -	The <tt>rcu_node</tt> Structure</a> -<li>	<a href="#The rcu_segcblist Structure"> -	The <tt>rcu_segcblist</tt> Structure</a> -<li>	<a href="#The rcu_data Structure"> -	The <tt>rcu_data</tt> Structure</a> -<li>	<a href="#The rcu_head Structure"> -	The <tt>rcu_head</tt> Structure</a> -<li>	<a href="#RCU-Specific Fields in the task_struct Structure"> -	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> -<li>	<a href="#Accessor Functions"> -	Accessor Functions</a> -</ol> - -<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> - -<p>RCU is for all intents and purposes a large state machine, and its -data structures maintain the state in such a way as to allow RCU readers -to execute extremely quickly, while also processing the RCU grace periods -requested by updaters in an efficient and extremely scalable fashion. 
-The efficiency and scalability of RCU updaters is provided primarily
-by a combining tree, as shown below:
-
-</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
-
-</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
-containing a tree of <tt>rcu_node</tt> structures.
-Each leaf node of the <tt>rcu_node</tt> tree has up to 16
-<tt>rcu_data</tt> structures associated with it, so that there
-are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
-one for each possible CPU.
-This structure is adjusted at boot time, if needed, to handle the
-common case where <tt>nr_cpu_ids</tt> is much less than
-<tt>NR_CPUS</tt>.
-For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
-which results in a three-level <tt>rcu_node</tt> tree.
-If the actual hardware has only 16 CPUs, RCU will adjust itself
-at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
-
-</p><p>The purpose of this combining tree is to allow per-CPU events
-such as quiescent states, dyntick-idle transitions,
-and CPU hotplug operations to be processed efficiently
-and scalably.
-Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
-and other events are recorded by the leaf-level <tt>rcu_node</tt>
-structures.
-All of these events are combined at each level of the tree until finally
-grace periods are completed at the tree's root <tt>rcu_node</tt>
-structure.
-A grace period can be completed at the root once every CPU
-(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
-has passed through a quiescent state.
-Once a grace period has completed, record of that fact is propagated
-back down the tree.
-
-</p><p>As can be seen from the diagram, on a 64-bit system
-a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
-of 64 at the root and a fanout of 16 at the leaves.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
-	Why isn't the fanout at the leaves also 64?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
-	Because there are more types of events that affect the leaf-level
-	<tt>rcu_node</tt> structures than further up the tree.
-	Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
-	64, the contention on these structures' <tt>->lock</tt> fields
-	becomes excessive.
-	Experimentation on a wide variety of systems has shown that a fanout
-	of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
-	</font>
-
-	<p><font color="ffffff">Of course, further experience with
-	systems having hundreds or thousands of CPUs may demonstrate
-	that the fanout for the non-leaf <tt>rcu_node</tt> structures
-	must also be reduced.
-	Such reduction can be easily carried out when and if it proves
-	necessary.
-	In the meantime, if you are using such a system and running into
-	contention problems on the non-leaf <tt>rcu_node</tt> structures,
-	you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
-	parameter to reduce the non-leaf fanout as needed.
-	</font>
-
-	<p><font color="ffffff">Kernels built for systems with
-	strong NUMA characteristics might also need to adjust
-	<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
-	<tt>rcu_node</tt> structures align with hardware boundaries.
-	However, there has thus far been no need for this.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
-a 32-bit system), then RCU will automatically add more levels to the
-tree.
-For example, if you are crazy enough to build a 64-bit system with 65,536 -CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: - -</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> - -</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system -accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for -32-bit systems. -On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and -<tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result -in a 16-CPU test using a 4-level tree. -This can be useful for testing large-system capabilities on small test -machines. - -</p><p>This multi-level combining tree allows us to get most of the -performance and scalability -benefits of partitioning, even though RCU grace-period detection is -inherently a global operation. -The trick here is that only the last CPU to report a quiescent state -into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> -structure at the next level up the tree. -This means that at the leaf-level <tt>rcu_node</tt> structure, only -one access out of sixteen will progress up the tree. -For the internal <tt>rcu_node</tt> structures, the situation is even -more extreme:  Only one access out of sixty-four will progress up -the tree. -Because the vast majority of the CPUs do not progress up the tree, -the lock contention remains roughly constant up the tree. -No matter how many CPUs there are in the system, at most 64 quiescent-state -reports per grace period will progress all the way to the root -<tt>rcu_node</tt> structure, thus ensuring that the lock contention -on that root <tt>rcu_node</tt> structure remains acceptably low. - -</p><p>In effect, the combining tree acts like a big shock absorber, -keeping lock contention under control at all tree levels regardless -of the level of loading on the system. 
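The funneling behavior described above can be sketched in a few lines of C. This is a toy model, not kernel code: the structure and field names (`toy_node`, `qsmask`, `grpmask`) are invented for illustration, and the real implementation holds the appropriate `rcu_node` lock at each step. The key idea is that only the child clearing the last bit in a node's mask continues upward.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the combining tree: each node tracks one bit per
 * child that has yet to report a quiescent state. */
struct toy_node {
	struct toy_node *parent;	/* NULL at the root */
	unsigned long qsmask;		/* children yet to report */
	unsigned long grpmask;		/* this node's bit in parent->qsmask */
};

/* Report a quiescent state; returns the number of levels touched.
 * Only the last child to report at a given node proceeds upward. */
int toy_report_qs(struct toy_node *np, unsigned long mask)
{
	int levels = 0;

	while (np != NULL) {
		levels++;
		np->qsmask &= ~mask;
		if (np->qsmask != 0)	/* siblings still pending: stop */
			break;
		mask = np->grpmask;	/* propagate this node's own bit */
		np = np->parent;
	}
	return levels;
}
```

With a two-level toy tree, the first reporter touches only the leaf, while the final reporter walks all the way to the root, which is why contention stays roughly constant at every level.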
-
-</p><p>RCU updaters wait for normal grace periods by registering
-RCU callbacks, either directly via <tt>call_rcu()</tt>
-or indirectly via <tt>synchronize_rcu()</tt> and friends.
-RCU callbacks are represented by <tt>rcu_head</tt> structures,
-which are queued on <tt>rcu_data</tt> structures while they are
-waiting for a grace period to elapse, as shown in the following figure:
-
-</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
-
-</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
-<tt>PREEMPT_RCU</tt>'s major data structures are related.
-Lesser data structures will be introduced with the algorithms that
-make use of them.
-
-</p><p>Note that each of the data structures in the above figure has
-its own synchronization:
-
-<p><ol>
-<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
-	and some fields are protected by the corresponding root
-	<tt>rcu_node</tt> structure's lock.
-<li>	Each <tt>rcu_node</tt> structure has a spinlock.
-<li>	The fields in <tt>rcu_data</tt> are private to the corresponding
-	CPU, although a few can be read and written by other CPUs.
-</ol>
-
-<p>It is important to note that different data structures can have
-very different ideas about the state of RCU at any given time.
-For but one example, awareness of the start or end of a given RCU
-grace period propagates slowly through the data structures.
-This slow propagation is absolutely necessary for RCU to have good
-read-side performance.
-If this balkanized implementation seems foreign to you, one useful
-trick is to consider each instance of these data structures to be
-a different person, each having the usual slightly different
-view of reality.
-
-</p><p>The general role of each of these data structures is as
-follows:
-
-</p><ol>
-<li>	<tt>rcu_state</tt>:
-	This structure forms the interconnection between the
-	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
-	tracks grace periods, serves as short-term repository
-	for callbacks orphaned by CPU-hotplug events,
-	maintains <tt>rcu_barrier()</tt> state,
-	tracks expedited grace-period state,
-	and maintains state used to force quiescent states when
-	grace periods extend too long.
-<li>	<tt>rcu_node</tt>: This structure forms the combining
-	tree that propagates quiescent-state
-	information from the leaves to the root, and also propagates
-	grace-period information from the root to the leaves.
-	It provides local copies of the grace-period state in order
-	to allow this information to be accessed in a synchronized
-	manner without suffering the scalability limitations that
-	would otherwise be imposed by global locking.
-	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
-	of tasks that have blocked while in their current
-	RCU read-side critical section.
-	In <tt>CONFIG_PREEMPT_RCU</tt> with
-	<tt>CONFIG_RCU_BOOST</tt>, it manages the
-	per-<tt>rcu_node</tt> priority-boosting
-	kernel threads (kthreads) and state.
-	Finally, it records CPU-hotplug state in order to determine
-	which CPUs should be ignored during a given grace period.
-<li>	<tt>rcu_data</tt>: This per-CPU structure is the
-	focus of quiescent-state detection and RCU callback queuing.
-	It also tracks its relationship to the corresponding leaf
-	<tt>rcu_node</tt> structure to allow more-efficient
-	propagation of quiescent states up the <tt>rcu_node</tt>
-	combining tree.
-	Like the <tt>rcu_node</tt> structure, it provides a local
-	copy of the grace-period information to allow for-free
-	synchronized
-	access to this information from the corresponding CPU.
-	Finally, this structure records past dyntick-idle state
-	for the corresponding CPU and also tracks statistics.
-<li>	<tt>rcu_head</tt>:
-	This structure represents RCU callbacks, and is the
-	only structure allocated and managed by RCU users.
-	The <tt>rcu_head</tt> structure is normally embedded
-	within the RCU-protected data structure.
-</ol>
-
-<p>If all you wanted from this article was a general notion of how
-RCU's data structures are related, you are done.
-Otherwise, each of the following sections gives more details on
-the <tt>rcu_state</tt>, <tt>rcu_node</tt> and <tt>rcu_data</tt> data
-structures.
-
-<h3><a name="The rcu_state Structure">
-The <tt>rcu_state</tt> Structure</a></h3>
-
-<p>The <tt>rcu_state</tt> structure is the base structure that
-represents the state of RCU in the system.
-This structure forms the interconnection between the
-<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
-tracks grace periods, contains the lock used to
-synchronize with CPU-hotplug events,
-and maintains state used to force quiescent states when
-grace periods extend too long.
-
-</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
-singly and in groups, in the following sections.
-The more specialized fields are covered in the discussion of their
-use.
-
-<h5>Relationship to rcu_node and rcu_data Structures</h5>
-
-This portion of the <tt>rcu_state</tt> structure is declared
-as follows:
-
-<pre>
-  1   struct rcu_node node[NUM_RCU_NODES];
-  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
-  3   struct rcu_data __percpu *rda;
-</pre>
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
-	Wait a minute!
-	You said that the <tt>rcu_node</tt> structures formed a tree,
-	but they are declared as a flat array!
-	What gives?
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
-	The tree is laid out in the array.
-	The first node in the array is the head, the next set of nodes in the
-	array are children of the head node, and so on until the last set of
-	nodes in the array are the leaves.
-	</font>
-
-	<p><font color="ffffff">See the following diagrams to see how
-	this works.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>The <tt>rcu_node</tt> tree is embedded into the
-<tt>->node[]</tt> array as shown in the following figure:
-
-</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
-
-</p><p>One interesting consequence of this mapping is that a
-breadth-first traversal of the tree is implemented as a simple
-linear scan of the array, which is in fact what the
-<tt>rcu_for_each_node_breadth_first()</tt> macro does.
-This macro is used at the beginnings and ends of grace periods.
-
-</p><p>Each entry of the <tt>->level</tt> array references
-the first <tt>rcu_node</tt> structure on the corresponding level
-of the tree, for example, as shown below:
-
-</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
-
-</p><p>The zero<sup>th</sup> element of the array references the root
-<tt>rcu_node</tt> structure, the first element references the
-first child of the root <tt>rcu_node</tt>, and finally the second
-element references the first leaf <tt>rcu_node</tt> structure.
-
-</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
-rather than array-shaped, it is easy to draw a planar representation:
-
-</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
-
-</p><p>Finally, the <tt>->rda</tt> field references a per-CPU
-pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
-
-</p><p>All of these fields are constant once initialization is complete,
-and therefore need no protection.
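The breadth-first layout above can be made concrete with a small sketch. The per-level node counts below are invented example values (one root, two interior nodes, eight leaves), not kernel constants; the point is only that each level begins in the flat array right where the previous level ends, which is what makes a linear scan equivalent to a breadth-first traversal.

```c
#include <assert.h>

/* Invented example layout: level 0 has 1 node, level 1 has 2,
 * level 2 (the leaves) has 8, all packed into one flat array. */
static const int toy_lvl_cnt[] = { 1, 2, 8 };

/* Index into the flat node array of the first node on a level,
 * mirroring what the ->level[] array caches. */
int toy_level_start(int level)
{
	int i, start = 0;

	for (i = 0; i < level; i++)
		start += toy_lvl_cnt[i];
	return start;
}
```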
- -<h5>Grace-Period Tracking</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> -  1   unsigned long gp_seq; -</pre> - -<p>RCU grace periods are numbered, and -the <tt>->gp_seq</tt> field contains the current grace-period -sequence number. -The bottom two bits are the state of the current grace period, -which can be zero for not yet started or one for in progress. -In other words, if the bottom two bits of <tt>->gp_seq</tt> are -zero, then RCU is idle. -Any other value in the bottom two bits indicates that something is broken. -This field is protected by the root <tt>rcu_node</tt> structure's -<tt>->lock</tt> field. - -</p><p>There are <tt>->gp_seq</tt> fields -in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures -as well. -The fields in the <tt>rcu_state</tt> structure represent the -most current value, and those of the other structures are compared -in order to detect the beginnings and ends of grace periods in a distributed -fashion. -The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> -(down the tree from the root to the leaves) to <tt>rcu_data</tt>. - -<h5>Miscellaneous</h5> - -<p>This portion of the <tt>rcu_state</tt> structure is declared -as follows: - -<pre> -  1   unsigned long gp_max; -  2   char abbr; -  3   char *name; -</pre> - -<p>The <tt>->gp_max</tt> field tracks the duration of the longest -grace period in jiffies. -It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. - -<p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish -between preemptible RCU (“rcu_preempt” and “p”) -and non-preemptible RCU (“rcu_sched” and “s”). -These fields are used for diagnostic and tracing purposes. 
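The `->gp_seq` encoding described above can be illustrated with a few helpers. These are simplified sketches in the spirit of the kernel's `rcu_seq_*()` functions, not the kernel implementation itself: the bottom two bits hold the grace-period state (zero for idle, one for in progress) and the remaining bits count grace periods.

```c
#include <assert.h>

/* Low two bits of a grace-period sequence number hold the state. */
#define TOY_SEQ_STATE_MASK 0x3UL

unsigned long toy_seq_state(unsigned long s)
{
	return s & TOY_SEQ_STATE_MASK;	/* 0: idle, 1: in progress */
}

unsigned long toy_seq_start(unsigned long s)
{
	return s + 1;			/* idle -> in progress */
}

unsigned long toy_seq_end(unsigned long s)
{
	/* Round up to the next multiple of 4: state returns to idle
	 * and the sequence count advances by one. */
	return (s | TOY_SEQ_STATE_MASK) + 1;
}
```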
-
-<h3><a name="The rcu_node Structure">
-The <tt>rcu_node</tt> Structure</a></h3>
-
-<p>The <tt>rcu_node</tt> structures form the combining
-tree that propagates quiescent-state
-information from the leaves to the root and also that propagates
-grace-period information from the root down to the leaves.
-They provide local copies of the grace-period state in order
-to allow this information to be accessed in a synchronized
-manner without suffering the scalability limitations that
-would otherwise be imposed by global locking.
-In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
-of tasks that have blocked while in their current
-RCU read-side critical section.
-In <tt>CONFIG_PREEMPT_RCU</tt> with
-<tt>CONFIG_RCU_BOOST</tt>, they manage the
-per-<tt>rcu_node</tt> priority-boosting
-kernel threads (kthreads) and state.
-Finally, they record CPU-hotplug state in order to determine
-which CPUs should be ignored during a given grace period.
-
-</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
-singly and in groups, in the following sections.
-
-<h5>Connection to Combining Tree</h5>
-
-<p>This portion of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
-  1   struct rcu_node *parent;
-  2   u8 level;
-  3   u8 grpnum;
-  4   unsigned long grpmask;
-  5   int grplo;
-  6   int grphi;
-</pre>
-
-<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
-one level up in the tree, and is <tt>NULL</tt> for the root
-<tt>rcu_node</tt>.
-The RCU implementation makes heavy use of this field to push quiescent
-states up the tree.
-The <tt>->level</tt> field gives the level in the tree, with
-the root being at level zero, its children at level one, and so on.
-The <tt>->grpnum</tt> field gives this node's position within
-the children of its parent, so this number can range between 0 and 31
-on 32-bit systems and between 0 and 63 on 64-bit systems.
-The <tt>->level</tt> and <tt>->grpnum</tt> fields are
-used only during initialization and for tracing.
-The <tt>->grpmask</tt> field is the bitmask counterpart of
-<tt>->grpnum</tt>, and therefore always has exactly one bit set.
-This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
-structure in its parent's bitmasks, which are described later.
-Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
-contain the lowest and highest numbered CPU served by this
-<tt>rcu_node</tt> structure, respectively.
-
-</p><p>All of these fields are constant, and thus do not require any
-synchronization.
-
-<h5>Synchronization</h5>
-
-<p>This field of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
-  1   raw_spinlock_t lock;
-</pre>
-
-<p>This field is used to protect the remaining fields in this structure,
-unless otherwise stated.
-That said, all of the fields in this structure can be accessed without
-locking for tracing purposes.
-Yes, this can result in confusing traces, but better some tracing confusion
-than to be heisenbugged out of existence.
-
-<h5>Grace-Period Tracking</h5>
-
-<p>This portion of the <tt>rcu_node</tt> structure is declared
-as follows:
-
-<pre>
-  1   unsigned long gp_seq;
-  2   unsigned long gp_seq_needed;
-</pre>
-
-<p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are
-the counterparts of the field of the same name in the <tt>rcu_state</tt>
-structure.
-They each may lag up to one step behind their <tt>rcu_state</tt>
-counterpart.
-If the bottom two bits of a given <tt>rcu_node</tt> structure's
-<tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt>
-structure believes that RCU is idle.
-</p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt>
-structure is updated at the beginning and the end
-of each grace period.
-
-<p>The <tt>->gp_seq_needed</tt> fields record the
-furthest-in-the-future grace period request seen by the corresponding
-<tt>rcu_node</tt> structure.
The request is considered fulfilled when -the value of the <tt>->gp_seq</tt> field equals or exceeds that of -the <tt>->gp_seq_needed</tt> field. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> -	Suppose that this <tt>rcu_node</tt> structure doesn't see -	a request for a very long time. -	Won't wrapping of the <tt>->gp_seq</tt> field cause -	problems? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> -	No, because if the <tt>->gp_seq_needed</tt> field lags behind the -	<tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field -	will be updated at the end of the grace period. -	Modulo-arithmetic comparisons therefore will always get the -	correct answer, even with wrapping. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Quiescent-State Tracking</h5> - -<p>These fields manage the propagation of quiescent states up the -combining tree. - -</p><p>This portion of the <tt>rcu_node</tt> structure has fields -as follows: - -<pre> -  1   unsigned long qsmask; -  2   unsigned long expmask; -  3   unsigned long qsmaskinit; -  4   unsigned long expmaskinit; -</pre> - -<p>The <tt>->qsmask</tt> field tracks which of this -<tt>rcu_node</tt> structure's children still need to report -quiescent states for the current normal grace period. -Such children will have a value of 1 in their corresponding bit. -Note that the leaf <tt>rcu_node</tt> structures should be -thought of as having <tt>rcu_data</tt> structures as their -children. -Similarly, the <tt>->expmask</tt> field tracks which -of this <tt>rcu_node</tt> structure's children still need to report -quiescent states for the current expedited grace period. 
-An expedited grace period has
-the same conceptual properties as a normal grace period, but the
-expedited implementation accepts extreme CPU overhead to obtain
-much lower grace-period latency, for example, consuming a few
-tens of microseconds worth of CPU time to reduce grace-period
-duration from milliseconds to tens of microseconds.
-The <tt>->qsmaskinit</tt> field tracks which of this
-<tt>rcu_node</tt> structure's children cover at least
-one online CPU.
-This mask is used to initialize <tt>->qsmask</tt>,
-and <tt>->expmaskinit</tt> is used to initialize
-<tt>->expmask</tt> at the beginning of the
-normal and expedited grace periods, respectively.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
-	Why are these bitmasks protected by locking?
-	Come on, haven't you heard of atomic instructions???
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
-	Lockless grace-period computation!  Such a tantalizing possibility!
-	</font>
-
-	<p><font color="ffffff">But consider the following sequence of events:
-	</font>
-
-	<ol>
-	<li>	<font color="ffffff">CPU 0 has been in dyntick-idle
-		mode for quite some time.
-		When it wakes up, it notices that the current RCU
-		grace period needs it to report in, so it sets a
-		flag where the scheduling clock interrupt will find it.
-		</font><p>
-	<li>	<font color="ffffff">Meanwhile, CPU 1 is running
-		<tt>force_quiescent_state()</tt>,
-		and notices that CPU 0 has been in dyntick idle mode,
-		which qualifies as an extended quiescent state.
-		</font><p>
-	<li>	<font color="ffffff">CPU 0's scheduling clock
-		interrupt fires in the
-		middle of an RCU read-side critical section, and notices
-		that the RCU core needs something, so commences RCU softirq
-		processing.
-		</font> -		<p> -	<li>	<font color="ffffff">CPU 0's softirq handler -		executes and is just about ready -		to report its quiescent state up the <tt>rcu_node</tt> -		tree. -		</font><p> -	<li>	<font color="ffffff">But CPU 1 beats it to the punch, -		completing the current -		grace period and starting a new one. -		</font><p> -	<li>	<font color="ffffff">CPU 0 now reports its quiescent -		state for the wrong -		grace period. -		That grace period might now end before the RCU read-side -		critical section. -		If that happens, disaster will ensue. -		</font> -	</ol> - -	<p><font color="ffffff">So the locking is absolutely required in -	order to coordinate clearing of the bits with updating of the -	grace-period sequence number in <tt>->gp_seq</tt>. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h5>Blocked-Task Management</h5> - -<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the -midst of their RCU read-side critical sections, and these tasks -must be tracked explicitly. -The details of exactly why and how they are tracked will be covered -in a separate article on RCU read-side processing. -For now, it is enough to know that the <tt>rcu_node</tt> -structure tracks them. - -<pre> -  1   struct list_head blkd_tasks; -  2   struct list_head *gp_tasks; -  3   struct list_head *exp_tasks; -  4   bool wait_blkd_tasks; -</pre> - -<p>The <tt>->blkd_tasks</tt> field is a list header for -the list of blocked and preempted tasks. -As tasks undergo context switches within RCU read-side critical -sections, their <tt>task_struct</tt> structures are enqueued -(via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> -field) onto the head of the <tt>->blkd_tasks</tt> list for the -leaf <tt>rcu_node</tt> structure corresponding to the CPU -on which the outgoing context switch executed. -As these tasks later exit their RCU read-side critical sections, -they remove themselves from the list. 
-This list is therefore in reverse time order, so that if one of the tasks
-is blocking the current grace period, all subsequent tasks must
-also be blocking that same grace period.
-Therefore, a single pointer into this list suffices to track
-all tasks blocking a given grace period.
-That pointer is stored in <tt>->gp_tasks</tt> for normal
-grace periods and in <tt>->exp_tasks</tt> for expedited
-grace periods.
-These last two fields are <tt>NULL</tt> if either there is
-no grace period in flight or if there are no blocked tasks
-preventing that grace period from completing.
-If either of these two pointers is referencing a task that
-removes itself from the <tt>->blkd_tasks</tt> list,
-then that task must advance the pointer to the next task on
-the list, or set the pointer to <tt>NULL</tt> if there
-are no subsequent tasks on the list.
-
-</p><p>For example, suppose that tasks T1, T2, and T3 are
-all hard-affinitied to the largest-numbered CPU in the system.
-Then if task T1 blocked in an RCU read-side
-critical section, then an expedited grace period started,
-then task T2 blocked in an RCU read-side critical section,
-then a normal grace period started, and finally task T3 blocked
-in an RCU read-side critical section, then the state of the
-last leaf <tt>rcu_node</tt> structure's blocked-task list
-would be as shown below:
-
-</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
-
-</p><p>Task T1 is blocking both grace periods, task T2 is
-blocking only the normal grace period, and task T3 is blocking
-neither grace period.
-Note that these tasks will not remove themselves from this list
-immediately upon resuming execution.
-They will instead remain on the list until they execute the outermost
-<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
-section.
-
-<p>
-The <tt>->wait_blkd_tasks</tt> field indicates whether or not
-the current grace period is waiting on a blocked task.
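The pointer-advance rule described above can be sketched as follows. This is an illustrative toy, with plain `next` pointers standing in for the kernel's `list_head` machinery and an invented `toy_unblock()` helper: when the task that `->gp_tasks` (or `->exp_tasks`) references removes itself from `->blkd_tasks`, the pointer moves to the next (older) task, or becomes `NULL` when no blocker remains.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a task on the ->blkd_tasks list. */
struct toy_task {
	struct toy_task *next;	/* toward older entries on the list */
};

/* Called when `leaving` removes itself from the blocked-task list:
 * a tracking pointer aimed at it must advance past it. */
void toy_unblock(struct toy_task **gp_tasks, struct toy_task *leaving)
{
	if (*gp_tasks == leaving)
		*gp_tasks = leaving->next;	/* may become NULL */
}
```

Because the list is in reverse time order, advancing the pointer never skips a task that still blocks the grace period.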
- -<h5>Sizing the <tt>rcu_node</tt> Array</h5> - -<p>The <tt>rcu_node</tt> array is sized via a series of -C-preprocessor expressions as follows: - -<pre> - 1 #ifdef CONFIG_RCU_FANOUT - 2 #define RCU_FANOUT CONFIG_RCU_FANOUT - 3 #else - 4 # ifdef CONFIG_64BIT - 5 # define RCU_FANOUT 64 - 6 # else - 7 # define RCU_FANOUT 32 - 8 # endif - 9 #endif -10 -11 #ifdef CONFIG_RCU_FANOUT_LEAF -12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF -13 #else -14 # ifdef CONFIG_64BIT -15 # define RCU_FANOUT_LEAF 64 -16 # else -17 # define RCU_FANOUT_LEAF 32 -18 # endif -19 #endif -20 -21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF) -22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT) -23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT) -24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT) -25 -26 #if NR_CPUS <= RCU_FANOUT_1 -27 #  define RCU_NUM_LVLS        1 -28 #  define NUM_RCU_LVL_0        1 -29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0 -30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 } -31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" } -32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" } -33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" } -34 #elif NR_CPUS <= RCU_FANOUT_2 -35 #  define RCU_NUM_LVLS        2 -36 #  define NUM_RCU_LVL_0        1 -37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) -39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } -40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" } -41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" } -42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" } -43 #elif NR_CPUS <= RCU_FANOUT_3 -44 #  define RCU_NUM_LVLS        3 -45 #  define NUM_RCU_LVL_0        1 -46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + 
NUM_RCU_LVL_1 + NUM_RCU_LVL_2) -49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } -50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" } -51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } -52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } -53 #elif NR_CPUS <= RCU_FANOUT_4 -54 #  define RCU_NUM_LVLS        4 -55 #  define NUM_RCU_LVL_0        1 -56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) -57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) -58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) -59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) -60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } -61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } -62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } -63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } -64 #else -65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" -66 #endif -</pre> - -<p>The maximum number of levels in the <tt>rcu_node</tt> structure -is currently limited to four, as specified by lines 21-24 -and the structure of the subsequent “if” statement. -For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which -should be sufficient for the next few years at least. -For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which -should see us through the next decade or so. 
-This four-level tree also allows kernels built with -<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, -which might be useful in very large systems having eight CPUs per -socket (but please note that no one has yet shown any measurable -performance degradation due to misaligned socket and <tt>rcu_node</tt> -boundaries). -In addition, building kernels with a full four levels of <tt>rcu_node</tt> -tree permits better testing of RCU's combining-tree code. - -</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children -are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. -If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, -it is set based on the word size of the system, which is also -the Kconfig default. - -</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are -handled by each leaf <tt>rcu_node</tt> structure. -Experience has shown that allowing a given leaf <tt>rcu_node</tt> -structure to handle 64 CPUs, as permitted by the number of bits in -the <tt>->qsmask</tt> field on a 64-bit system, results in -excessive contention for the leaf <tt>rcu_node</tt> structures' -<tt>->lock</tt> fields. -The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore -limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. -If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value -selected is based on the word size of the system, just as for -<tt>CONFIG_RCU_FANOUT</tt>. -Lines 11-19 perform this computation. - -</p><p>Lines 21-24 compute the maximum number of CPUs supported by -a single-level (which contains a single <tt>rcu_node</tt> structure), -two-level, three-level, and four-level <tt>rcu_node</tt> tree, -respectively, given the fanout specified by <tt>RCU_FANOUT</tt> -and <tt>RCU_FANOUT_LEAF</tt>. -These numbers of CPUs are retained in the -<tt>RCU_FANOUT_1</tt>, -<tt>RCU_FANOUT_2</tt>, -<tt>RCU_FANOUT_3</tt>, and -<tt>RCU_FANOUT_4</tt> -C-preprocessor variables, respectively. 
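The capacity arithmetic above reduces to a one-line formula: with leaf fanout F_L and interior fanout F, an n-level tree covers F_L × F^(n-1) CPUs. A short worked computation (the function name is invented for this sketch; the values checked are the defaults quoted in the text) makes the numbers easy to verify:

```c
#include <assert.h>

/* CPUs covered by a tree with the given number of levels: the leaf
 * level contributes leaf_fanout, each additional level multiplies
 * by the interior fanout. */
long toy_tree_capacity(int levels, long leaf_fanout, long fanout)
{
	long cap = leaf_fanout;

	while (--levels > 0)
		cap *= fanout;
	return cap;
}
```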
- -</p><p>These variables are used to control the C-preprocessor <tt>#if</tt> -statement spanning lines 26-66 that computes the number of -<tt>rcu_node</tt> structures required for each level of the tree, -as well as the number of levels required. -The number of levels is placed in the <tt>NUM_RCU_LVLS</tt> -C-preprocessor variable by lines 27, 35, 44, and 54. -The number of <tt>rcu_node</tt> structures for the topmost level -of the tree is always exactly one, and this value is unconditionally -placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. -The rest of the levels (if any) of the <tt>rcu_node</tt> tree -are computed by dividing the maximum number of CPUs by the -fanout supported by the number of levels from the current level down, -rounding up.  This computation is performed by lines 37, -46-47, and 56-58. -Lines 31-33, 40-42, 50-52, and 62-63 create initializers -for lockdep lock-class names. -Finally, lines 64-66 produce an error if the maximum number of -CPUs is too large for the specified fanout. - -<h3><a name="The rcu_segcblist Structure"> -The <tt>rcu_segcblist</tt> Structure</a></h3> - -The <tt>rcu_segcblist</tt> structure maintains a segmented list of -callbacks as follows: - -<pre> - 1 #define RCU_DONE_TAIL        0 - 2 #define RCU_WAIT_TAIL        1 - 3 #define RCU_NEXT_READY_TAIL  2 - 4 #define RCU_NEXT_TAIL        3 - 5 #define RCU_CBLIST_NSEGS     4 - 6 - 7 struct rcu_segcblist { - 8   struct rcu_head *head; - 9   struct rcu_head **tails[RCU_CBLIST_NSEGS]; -10   unsigned long gp_seq[RCU_CBLIST_NSEGS]; -11   long len; -12   long len_lazy; -13 }; -</pre> - -<p> -The segments are as follows: - -<ol> -<li>	<tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. -	These callbacks are ready to be invoked. -<li>	<tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the -	current grace period. -	Note that different CPUs can have different ideas about which -	grace period is current, hence the <tt>->gp_seq</tt> field. 
-<li>	<tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next -	grace period to start. -<li>	<tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been -	associated with a grace period. -</ol> - -<p> -The <tt>->head</tt> pointer references the first callback or -is <tt>NULL</tt> if the list contains no callbacks (which is -<i>not</i> the same as being empty). -Each element of the <tt>->tails[]</tt> array references the -<tt>->next</tt> pointer of the last callback in the corresponding -segment of the list, or the list's <tt>->head</tt> pointer if -that segment and all previous segments are empty. -If the corresponding segment is empty but some previous segment is -not empty, then the array element is identical to its predecessor. -Older callbacks are closer to the head of the list, and new callbacks -are added at the tail. -This relationship between the <tt>->head</tt> pointer, the -<tt>->tails[]</tt> array, and the callbacks is shown in this -diagram: - -</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> - -</p><p>In this figure, the <tt>->head</tt> pointer references the -first -RCU callback in the list. -The <tt>->tails[RCU_DONE_TAIL]</tt> array element references -the <tt>->head</tt> pointer itself, indicating that none -of the callbacks is ready to invoke. -The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback -CB 2's <tt>->next</tt> pointer, which indicates that -CB 1 and CB 2 are both waiting on the current grace period, -give or take possible disagreements about exactly which grace period -is the current one. -The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element -references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> -does, which indicates that there are no callbacks waiting on the next -RCU grace period. -The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references -CB 4's <tt>->next</tt> pointer, indicating that all the -remaining RCU callbacks have not yet been assigned to an RCU grace -period. 
-Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element -always references the last RCU callback's <tt>->next</tt> pointer -unless the callback list is empty, in which case it references -the <tt>->head</tt> pointer. - -<p> -There is one additional important special case for the -<tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> -when this list is <i>disabled</i>. -Lists are disabled when the corresponding CPU is offline or when -the corresponding CPU's callbacks are offloaded to a kthread, -both of which are described elsewhere. - -</p><p>CPUs advance their callbacks from the -<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the -<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments -as grace periods advance. - -</p><p>The <tt>->gp_seq[]</tt> array records grace-period -numbers corresponding to the list segments. -This is what allows different CPUs to have different ideas as to -which is the current grace period while still avoiding premature -invocation of their callbacks. -In particular, this allows CPUs that go idle for extended periods -to determine which of their callbacks are ready to be invoked after -reawakening. - -</p><p>The <tt>->len</tt> counter contains the number of -callbacks in <tt>->head</tt>, and the -<tt>->len_lazy</tt> contains the number of those callbacks that -are known to only free memory, and whose invocation can therefore -be safely deferred. - -<p><b>Important note</b>: It is the <tt>->len</tt> field that -determines whether or not there are callbacks associated with -this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> -pointer. -The reason for this is that all the ready-to-invoke callbacks -(that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted -all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due -to which <tt>->head</tt> may be set to NULL if there are no not-done -callbacks remaining in the <tt>rcu_segcblist</tt>. 
-If callback invocation must be postponed, for example, because a -high-priority process just woke up on this CPU, then the remaining -callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and -<tt>->head</tt> once again points to the start of the segment. -In short, the head field can briefly be <tt>NULL</tt> even though the -CPU has callbacks present the entire time. -Therefore, it is not appropriate to test the <tt>->head</tt> pointer -for <tt>NULL</tt>. - -<p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts -are adjusted only after the corresponding callbacks have been invoked. -This means that the <tt>->len</tt> count is zero only if -the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. -Of course, off-CPU sampling of the <tt>->len</tt> count requires -careful use of appropriate synchronization, for example, memory barriers. -This synchronization can be a bit subtle, particularly in the case -of <tt>rcu_barrier()</tt>. - -<h3><a name="The rcu_data Structure"> -The <tt>rcu_data</tt> Structure</a></h3> - -<p>The <tt>rcu_data</tt> maintains the per-CPU state for the RCU subsystem. -The fields in this structure may be accessed only from the corresponding -CPU (and from tracing) unless otherwise stated. -This structure is the -focus of quiescent-state detection and RCU callback queuing. -It also tracks its relationship to the corresponding leaf -<tt>rcu_node</tt> structure to allow more-efficient -propagation of quiescent states up the <tt>rcu_node</tt> -combining tree. -Like the <tt>rcu_node</tt> structure, it provides a local -copy of the grace-period information to allow for-free -synchronized -access to this information from the corresponding CPU. -Finally, this structure records past dyntick-idle state -for the corresponding CPU and also tracks statistics. - -</p><p>The <tt>rcu_data</tt> structure's fields are discussed, -singly and in groups, in the following sections. 
-
-<h5>Connection to Other Data Structures</h5>
-
-<p>This portion of the <tt>rcu_data</tt> structure is declared
-as follows:
-
-<pre>
-  1   int cpu;
-  2   struct rcu_node *mynode;
-  3   unsigned long grpmask;
-  4   bool beenonline;
-</pre>
-
-<p>The <tt>->cpu</tt> field contains the number of the
-corresponding CPU and the <tt>->mynode</tt> field references the
-corresponding <tt>rcu_node</tt> structure.
-The <tt>->mynode</tt> is used to propagate quiescent states
-up the combining tree.
-These two fields are constant and therefore do not require synchronization.
-
-<p>The <tt>->grpmask</tt> field indicates the bit in
-the <tt>->mynode->qsmask</tt> corresponding to this
-<tt>rcu_data</tt> structure, and is also used when propagating
-quiescent states.
-The <tt>->beenonline</tt> flag is set whenever the corresponding
-CPU comes online, which means that the debugfs tracing need not dump
-out any <tt>rcu_data</tt> structure for which this flag is not set.
-
-<h5>Quiescent-State and Grace-Period Tracking</h5>
-
-<p>This portion of the <tt>rcu_data</tt> structure is declared
-as follows:
-
-<pre>
-  1   unsigned long gp_seq;
-  2   unsigned long gp_seq_needed;
-  3   bool cpu_no_qs;
-  4   bool core_needs_qs;
-  5   bool gpwrap;
-</pre>
-
-<p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same
-name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.  The
-<tt>->gp_seq_needed</tt> field is the counterpart of the field of the same
-name in the <tt>rcu_node</tt> structure.
-They may each lag up to one behind their <tt>rcu_node</tt>
-counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
-<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
-arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
-will catch up upon exit from dyntick-idle mode).
-If the lower two bits of a given <tt>rcu_data</tt> structure's
-<tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt>
-structure believes that RCU is idle.
-
-<table>
-<tr><th> </th></tr>
-<tr><th align="left">Quick Quiz:</th></tr>
-<tr><td>
-	All this replication of the grace period numbers can only cause
-	massive confusion.
-	Why not just keep a global sequence number and be done with it???
-</td></tr>
-<tr><th align="left">Answer:</th></tr>
-<tr><td bgcolor="#ffffff"><font color="ffffff">
-	Because if there were only a single global sequence
-	number, there would need to be a single global lock to allow
-	safely accessing and updating it.
-	And if we are not going to have a single global lock, we need
-	to carefully manage the numbers on a per-node basis.
-	Recall from the answer to a previous Quick Quiz that the consequences
-	of applying a previously sampled quiescent state to the wrong
-	grace period are quite severe.
-</font></td></tr>
-<tr><td> </td></tr>
-</table>
-
-<p>The <tt>->cpu_no_qs</tt> flag indicates that the
-CPU has not yet passed through a quiescent state,
-while the <tt>->core_needs_qs</tt> flag indicates that the
-RCU core needs a quiescent state from the corresponding CPU.
-The <tt>->gpwrap</tt> field indicates that the corresponding
-CPU has remained idle for so long that the
-<tt>gp_seq</tt> counter is in danger of overflow, which
-will cause the CPU to disregard the values of its counters on
-its next exit from idle.
-
-<h5>RCU Callback Handling</h5>
-
-<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
-the same CPU that registered them.
-This is strictly a cache-locality optimization: callbacks can and
-do get invoked on CPUs other than the one that registered them.
-After all, if the CPU that registered a given callback has gone
-offline before the callback can be invoked, there really is no other
-choice.
- -</p><p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> - 1 struct rcu_segcblist cblist; - 2 long qlen_last_fqs_check; - 3 unsigned long n_cbs_invoked; - 4 unsigned long n_nocbs_invoked; - 5 unsigned long n_cbs_orphaned; - 6 unsigned long n_cbs_adopted; - 7 unsigned long n_force_qs_snap; - 8 long blimit; -</pre> - -<p>The <tt>->cblist</tt> structure is the segmented callback list -described earlier. -The CPU advances the callbacks in its <tt>rcu_data</tt> structure -whenever it notices that another RCU grace period has completed. -The CPU detects the completion of an RCU grace period by noticing -that the value of its <tt>rcu_data</tt> structure's -<tt>->gp_seq</tt> field differs from that of its leaf -<tt>rcu_node</tt> structure. -Recall that each <tt>rcu_node</tt> structure's -<tt>->gp_seq</tt> field is updated at the beginnings and ends of each -grace period. - -<p> -The <tt>->qlen_last_fqs_check</tt> and -<tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent -states from <tt>call_rcu()</tt> and friends when callback -lists grow excessively long. - -</p><p>The <tt>->n_cbs_invoked</tt>, -<tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> -fields count the number of callbacks invoked, -sent to other CPUs when this CPU goes offline, -and received from other CPUs when those other CPUs go offline. -The <tt>->n_nocbs_invoked</tt> is used when the CPU's callbacks -are offloaded to a kthread. - -<p> -Finally, the <tt>->blimit</tt> counter is the maximum number of -RCU callbacks that may be invoked at a given time. - -<h5>Dyntick-Idle Handling</h5> - -<p>This portion of the <tt>rcu_data</tt> structure is declared -as follows: - -<pre> -  1   int dynticks_snap; -  2   unsigned long dynticks_fqs; -</pre> - -The <tt>->dynticks_snap</tt> field is used to take a snapshot -of the corresponding CPU's dyntick-idle state when forcing -quiescent states, and is therefore accessed from other CPUs. 
-Finally, the <tt>->dynticks_fqs</tt> field is used to
-count the number of times this CPU is determined to be in
-dyntick-idle state, and is used for tracing and debugging purposes.
-
-<p>
-This portion of the rcu_data structure is declared as follows:
-
-<pre>
-  1   long dynticks_nesting;
-  2   long dynticks_nmi_nesting;
-  3   atomic_t dynticks;
-  4   bool rcu_need_heavy_qs;
-  5   bool rcu_urgent_qs;
-</pre>
-
-<p>These fields in the rcu_data structure maintain the per-CPU dyntick-idle
-state for the corresponding CPU.
-The fields may be accessed only from the corresponding CPU (and from tracing)
-unless otherwise stated.
-
-<p>The <tt>->dynticks_nesting</tt> field counts the
-nesting depth of process execution, so that in normal circumstances
-this counter has value zero or one.
-NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt>
-field.
-Because NMIs cannot be masked, changes to this variable have to be
-undertaken carefully using an algorithm provided by Andy Lutomirski.
-The initial transition from idle adds one, and nested transitions
-add two, so that a nesting level of five is represented by a
-<tt>->dynticks_nmi_nesting</tt> value of nine.
-This counter can therefore be thought of as counting the number
-of reasons why this CPU cannot be permitted to enter dyntick-idle
-mode, aside from process-level transitions.
-
-<p>However, it turns out that when running in non-idle kernel context,
-the Linux kernel is fully capable of entering interrupt handlers that
-never exit and perhaps also vice versa.
-Therefore, whenever the <tt>->dynticks_nesting</tt> field is
-incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field
-is set to a large positive number, and whenever the
-<tt>->dynticks_nesting</tt> field is decremented down to zero,
-the <tt>->dynticks_nmi_nesting</tt> field is set to zero.
-Assuming that the number of misnested interrupts is not sufficient -to overflow the counter, this approach corrects the -<tt>->dynticks_nmi_nesting</tt> field every time the corresponding -CPU enters the idle loop from process context. - -</p><p>The <tt>->dynticks</tt> field counts the corresponding -CPU's transitions to and from either dyntick-idle or user mode, so -that this counter has an even value when the CPU is in dyntick-idle -mode or user mode and an odd value otherwise. The transitions to/from -user mode need to be counted for user mode adaptive-ticks support -(see timers/NO_HZ.txt). - -</p><p>The <tt>->rcu_need_heavy_qs</tt> field is used -to record the fact that the RCU core code would really like to -see a quiescent state from the corresponding CPU, so much so that -it is willing to call for heavy-weight dyntick-counter operations. -This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> -code, which provide a momentary idle sojourn in response. - -</p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record -the fact that the RCU core code would really like to see a quiescent state from -the corresponding CPU, with the various other fields indicating just how badly -RCU wants this quiescent state. -This flag is checked by RCU's context-switch path -(<tt>rcu_note_context_switch</tt>) and the cond_resched code. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> -	Why not simply combine the <tt>->dynticks_nesting</tt> -	and <tt>->dynticks_nmi_nesting</tt> counters into a -	single counter that just counts the number of reasons that -	the corresponding CPU is non-idle? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> -	Because this would fail in the presence of interrupts whose -	handlers never return and of handlers that manage to return -	from a made-up interrupt. 
-</font></td></tr> -<tr><td> </td></tr> -</table> - -<p>Additional fields are present for some special-purpose -builds, and are discussed separately. - -<h3><a name="The rcu_head Structure"> -The <tt>rcu_head</tt> Structure</a></h3> - -<p>Each <tt>rcu_head</tt> structure represents an RCU callback. -These structures are normally embedded within RCU-protected data -structures whose algorithms use asynchronous grace periods. -In contrast, when using algorithms that block waiting for RCU grace periods, -RCU users need not provide <tt>rcu_head</tt> structures. - -</p><p>The <tt>rcu_head</tt> structure has fields as follows: - -<pre> -  1   struct rcu_head *next; -  2   void (*func)(struct rcu_head *head); -</pre> - -<p>The <tt>->next</tt> field is used -to link the <tt>rcu_head</tt> structures together in the -lists within the <tt>rcu_data</tt> structures. -The <tt>->func</tt> field is a pointer to the function -to be called when the callback is ready to be invoked, and -this function is passed a pointer to the <tt>rcu_head</tt> -structure. -However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> -field to record the offset of the <tt>rcu_head</tt> -structure within the enclosing RCU-protected data structure. - -</p><p>Both of these fields are used internally by RCU. -From the viewpoint of RCU users, this structure is an -opaque “cookie”. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> -	Given that the callback function <tt>->func</tt> -	is passed a pointer to the <tt>rcu_head</tt> structure, -	how is that function supposed to find the beginning of the -	enclosing RCU-protected data structure? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> -	In actual practice, there is a separate callback function per -	type of RCU-protected data structure. 
-	The callback function can therefore use the <tt>container_of()</tt> -	macro in the Linux kernel (or other pointer-manipulation facilities -	in other software environments) to find the beginning of the -	enclosing structure. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="RCU-Specific Fields in the task_struct Structure"> -RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> - -<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some -additional fields in the <tt>task_struct</tt> structure: - -<pre> - 1 #ifdef CONFIG_PREEMPT_RCU - 2   int rcu_read_lock_nesting; - 3   union rcu_special rcu_read_unlock_special; - 4   struct list_head rcu_node_entry; - 5   struct rcu_node *rcu_blocked_node; - 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ - 7 #ifdef CONFIG_TASKS_RCU - 8   unsigned long rcu_tasks_nvcsw; - 9   bool rcu_tasks_holdout; -10   struct list_head rcu_tasks_holdout_list; -11   int rcu_tasks_idle_cpu; -12 #endif /* #ifdef CONFIG_TASKS_RCU */ -</pre> - -<p>The <tt>->rcu_read_lock_nesting</tt> field records the -nesting level for RCU read-side critical sections, and -the <tt>->rcu_read_unlock_special</tt> field is a bitmask -that records special conditions that require <tt>rcu_read_unlock()</tt> -to do additional work. -The <tt>->rcu_node_entry</tt> field is used to form lists of -tasks that have blocked within preemptible-RCU read-side critical -sections and the <tt>->rcu_blocked_node</tt> field references -the <tt>rcu_node</tt> structure whose list this task is a member of, -or <tt>NULL</tt> if it is not blocked within a preemptible-RCU -read-side critical section. 
-
-<p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
-voluntary context switches that this task had undergone at the
-beginning of the current tasks-RCU grace period,
-<tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
-grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
-is a list element enqueuing this task on the holdout list,
-and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
-idle task is running on, but only if the task is currently running,
-that is, if the CPU is currently idle.
-
-<h3><a name="Accessor Functions">
-Accessor Functions</a></h3>
-
-<p>The following listing shows the
-<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first()</tt> and
-<tt>rcu_for_each_leaf_node()</tt> function and macros:
-
-<pre>
-  1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-  2 {
-  3   return &rsp->node[0];
-  4 }
-  5
-  6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
-  7   for ((rnp) = &(rsp)->node[0]; \
-  8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
-  9
- 10 #define rcu_for_each_leaf_node(rsp, rnp) \
- 11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
- 12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
-</pre>
-
-<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
-first element of the specified <tt>rcu_state</tt> structure's
-<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
-structure.
-
-</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
-macro takes advantage of the layout of the <tt>rcu_node</tt>
-structures in the <tt>rcu_state</tt> structure's
-<tt>->node[]</tt> array, performing a breadth-first traversal by
-simply traversing the array in order.
-Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
-the last part of the array, thus traversing only the leaf
-<tt>rcu_node</tt> structures.
- -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> -	What does -	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree -	contains only a single node? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> -	In the single-node case, -	<tt>rcu_for_each_leaf_node()</tt> traverses the single node. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<h3><a name="Summary"> -Summary</a></h3> - -So the state of RCU is represented by an <tt>rcu_state</tt> structure, -which contains a combining tree of <tt>rcu_node</tt> and -<tt>rcu_data</tt> structures. -Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle -state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure. - -If you made it this far, you are well prepared to read the code -walkthroughs in the other articles in this series. - -<h3><a name="Acknowledgments"> -Acknowledgments</a></h3> - -I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul -Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn -for helping me get this document into a more human-readable state. - -<h3><a name="Legal Statement"> -Legal Statement</a></h3> - -<p>This work represents the view of the author and does not necessarily -represent the view of IBM. - -</p><p>Linux is a registered trademark of Linus Torvalds. - -</p><p>Other company, product, and service names may be trademarks or -service marks of others. 
-
-</body></html>
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
new file mode 100644
index 000000000000..4a48e20a46f2
--- /dev/null
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
@@ -0,0 +1,1163 @@
+===================================================
+A Tour Through TREE_RCU's Data Structures [LWN.net]
+===================================================
+
+December 18, 2016
+
+This article was contributed by Paul E. McKenney
+
+Introduction
+============
+
+This document describes RCU's major data structures and their relationship
+to each other.
+
+Data-Structure Relationships
+============================
+
+RCU is for all intents and purposes a large state machine, and its
+data structures maintain the state in such a way as to allow RCU readers
+to execute extremely quickly, while also processing the RCU grace periods
+requested by updaters in an efficient and extremely scalable fashion.
+The efficiency and scalability of RCU updaters is provided primarily
+by a combining tree, as shown below:
+
+.. kernel-figure:: BigTreeClassicRCU.svg
+
+This diagram shows an enclosing ``rcu_state`` structure containing a tree
+of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up
+to 16 ``rcu_data`` structures associated with it, so that there are
+``NR_CPUS`` ``rcu_data`` structures, one for each possible CPU.
+This structure is adjusted at boot time, if needed, to handle the common
+case where ``nr_cpu_ids`` is much less than ``NR_CPUS``.
+For example, a number of Linux distributions set ``NR_CPUS=4096``,
+which results in a three-level ``rcu_node`` tree.
+If the actual hardware has only 16 CPUs, RCU will adjust itself
+at boot time, resulting in an ``rcu_node`` tree with only a single node.
+
+The purpose of this combining tree is to allow per-CPU events
+such as quiescent states, dyntick-idle transitions,
+and CPU hotplug operations to be processed efficiently
+and scalably.
+Quiescent states are recorded by the per-CPU ``rcu_data`` structures,
+and other events are recorded by the leaf-level ``rcu_node``
+structures.
+All of these events are combined at each level of the tree until finally
+grace periods are completed at the tree's root ``rcu_node``
+structure.
+A grace period can be completed at the root once every CPU
+(or, in the case of ``CONFIG_PREEMPT_RCU``, task)
+has passed through a quiescent state.
+Once a grace period has completed, record of that fact is propagated
+back down the tree.
+
+As can be seen from the diagram, on a 64-bit system
+a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
+of 64 at the root and a fanout of 16 at the leaves.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Why isn't the fanout at the leaves also 64?                           |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Because there are more types of events that affect the leaf-level     |
+| ``rcu_node`` structures than further up the tree. Therefore, if the   |
+| leaf ``rcu_node`` structures have fanout of 64, the contention on     |
+| these structures' ``->lock`` field becomes excessive. Experimentation |
+| on a wide variety of systems has shown that a fanout of 16 works well |
+| for the leaves of the ``rcu_node`` tree.                              
| +|                                                                       | +| Of course, further experience with systems having hundreds or         | +| thousands of CPUs may demonstrate that the fanout for the non-leaf    | +| ``rcu_node`` structures must also be reduced. Such reduction can be   | +| easily carried out when and if it proves necessary. In the meantime,  | +| if you are using such a system and running into contention problems   | +| on the non-leaf ``rcu_node`` structures, you may use the              | +| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the    | +| non-leaf fanout as needed.                                            | +|                                                                       | +| Kernels built for systems with strong NUMA characteristics might      | +| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of      | +| the ``rcu_node`` structures align with hardware boundaries.           | +| However, there has thus far been no need for this.                    | ++-----------------------------------------------------------------------+ + +If your system has more than 1,024 CPUs (or more than 512 CPUs on a +32-bit system), then RCU will automatically add more levels to the tree. +For example, if you are crazy enough to build a 64-bit system with +65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: + +.. kernel-figure:: HugeTreeClassicRCU.svg + +RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. On the other hand, you can set both +``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as +2, which would result in a 16-CPU test using a 4-level tree. This can be +useful for testing large-system capabilities on small test machines. 
+ +This multi-level combining tree allows us to get most of the performance +and scalability benefits of partitioning, even though RCU grace-period +detection is inherently a global operation. The trick here is that only +the last CPU to report a quiescent state into a given ``rcu_node`` +structure need advance to the ``rcu_node`` structure at the next level +up the tree. This means that at the leaf-level ``rcu_node`` structure, +only one access out of sixteen will progress up the tree. For the +internal ``rcu_node`` structures, the situation is even more extreme: +Only one access out of sixty-four will progress up the tree. Because the +vast majority of the CPUs do not progress up the tree, the lock +contention remains roughly constant up the tree. No matter how many CPUs +there are in the system, at most 64 quiescent-state reports per grace +period will progress all the way to the root ``rcu_node`` structure, +thus ensuring that the lock contention on that root ``rcu_node`` +structure remains acceptably low. + +In effect, the combining tree acts like a big shock absorber, keeping +lock contention under control at all tree levels regardless of the level +of loading on the system. + +RCU updaters wait for normal grace periods by registering RCU callbacks, +either directly via ``call_rcu()`` or indirectly via +``synchronize_rcu()`` and friends. RCU callbacks are represented by +``rcu_head`` structures, which are queued on ``rcu_data`` structures +while they are waiting for a grace period to elapse, as shown in the +following figure: + +.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg + +This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data +structures are related. Lesser data structures will be introduced with +the algorithms that make use of them. + +Note that each of the data structures in the above figure has its own +synchronization: + +#. 
Each ``rcu_state`` structure has a lock and a mutex, and some fields
+   are protected by the corresponding root ``rcu_node`` structure's lock.
+#. Each ``rcu_node`` structure has a spinlock.
+#. The fields in ``rcu_data`` are private to the corresponding CPU,
+   although a few can be read and written by other CPUs.
+
+It is important to note that different data structures can have very
+different ideas about the state of RCU at any given time. For but one
+example, awareness of the start or end of a given RCU grace period
+propagates slowly through the data structures. This slow propagation is
+absolutely necessary for RCU to have good read-side performance. If this
+balkanized implementation seems foreign to you, one useful trick is to
+consider each instance of these data structures to be a different
+person, each having the usual slightly different view of reality.
+
+The general role of each of these data structures is as follows:
+
+#. ``rcu_state``: This structure forms the interconnection between the
+   ``rcu_node`` and ``rcu_data`` structures, tracks grace periods,
+   serves as short-term repository for callbacks orphaned by CPU-hotplug
+   events, maintains ``rcu_barrier()`` state, tracks expedited
+   grace-period state, and maintains state used to force quiescent
+   states when grace periods extend too long.
+#. ``rcu_node``: This structure forms the combining tree that propagates
+   quiescent-state information from the leaves to the root, and also
+   propagates grace-period information from the root to the leaves. It
+   provides local copies of the grace-period state in order to allow
+   this information to be accessed in a synchronized manner without
+   suffering the scalability limitations that would otherwise be imposed
+   by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the
+   lists of tasks that have blocked while in their current RCU read-side
+   critical section. 
In ``CONFIG_PREEMPT_RCU`` with
+   ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node``
+   priority-boosting kernel threads (kthreads) and state. Finally, it
+   records CPU-hotplug state in order to determine which CPUs should be
+   ignored during a given grace period.
+#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state
+   detection and RCU callback queuing. It also tracks its relationship
+   to the corresponding leaf ``rcu_node`` structure to allow
+   more-efficient propagation of quiescent states up the ``rcu_node``
+   combining tree. Like the ``rcu_node`` structure, it provides a local
+   copy of the grace-period information to allow for-free synchronized
+   access to this information from the corresponding CPU. Finally, this
+   structure records past dyntick-idle state for the corresponding CPU
+   and also tracks statistics.
+#. ``rcu_head``: This structure represents RCU callbacks, and is the
+   only structure allocated and managed by RCU users. The ``rcu_head``
+   structure is normally embedded within the RCU-protected data
+   structure.
+
+If all you wanted from this article was a general notion of how RCU's
+data structures are related, you are done. Otherwise, each of the
+following sections gives more details on the ``rcu_state``, ``rcu_node``
+and ``rcu_data`` data structures.
+
+The ``rcu_state`` Structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``rcu_state`` structure is the base structure that represents the
+state of RCU in the system. This structure forms the interconnection
+between the ``rcu_node`` and ``rcu_data`` structures, tracks grace
+periods, contains the lock used to synchronize with CPU-hotplug events,
+and maintains state used to force quiescent states when grace periods
+extend too long.
+
+A few of the ``rcu_state`` structure's fields are discussed, singly and
+in groups, in the following sections. The more specialized fields are
+covered in the discussion of their use. 
+
+Relationship to rcu_node and rcu_data Structures
+''''''''''''''''''''''''''''''''''''''''''''''''
+
+This portion of the ``rcu_state`` structure is declared as follows:
+
+::
+
+     1   struct rcu_node node[NUM_RCU_NODES];
+     2   struct rcu_node *level[NUM_RCU_LVLS + 1];
+     3   struct rcu_data __percpu *rda;
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Wait a minute! You said that the ``rcu_node`` structures formed a     |
+| tree, but they are declared as a flat array! What gives?              |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| The tree is laid out in the array. The first node in the array is the |
+| head, the next set of nodes in the array are children of the head     |
+| node, and so on until the last set of nodes in the array are the      |
+| leaves.                                                               |
+| See the following diagrams to see how this works.                     |
++-----------------------------------------------------------------------+
+
+The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown
+in the following figure:
+
+.. kernel-figure:: TreeMapping.svg
+
+One interesting consequence of this mapping is that a breadth-first
+traversal of the tree is implemented as a simple linear scan of the
+array, which is in fact what the ``rcu_for_each_node_breadth_first()``
+macro does. This macro is used at the beginnings and ends of grace
+periods.
+
+Each entry of the ``->level`` array references the first ``rcu_node``
+structure on the corresponding level of the tree, for example, as shown
+below:
+
+.. 
kernel-figure:: TreeMappingLevel.svg + +The zero\ :sup:`th` element of the array references the root +``rcu_node`` structure, the first element references the first child of +the root ``rcu_node``, and finally the second element references the +first leaf ``rcu_node`` structure. + +For whatever it is worth, if you draw the tree to be tree-shaped rather +than array-shaped, it is easy to draw a planar representation: + +.. kernel-figure:: TreeLevel.svg + +Finally, the ``->rda`` field references a per-CPU pointer to the +corresponding CPU's ``rcu_data`` structure. + +All of these fields are constant once initialization is complete, and +therefore need no protection. + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + +     1   unsigned long gp_seq; + +RCU grace periods are numbered, and the ``->gp_seq`` field contains the +current grace-period sequence number. The bottom two bits are the state +of the current grace period, which can be zero for not yet started or +one for in progress. In other words, if the bottom two bits of +``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom +two bits indicates that something is broken. This field is protected by +the root ``rcu_node`` structure's ``->lock`` field. + +There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` +structures as well. The fields in the ``rcu_state`` structure represent +the most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a +distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` +(down the tree from the root to the leaves) to ``rcu_data``. 
+
+Miscellaneous
+'''''''''''''
+
+This portion of the ``rcu_state`` structure is declared as follows:
+
+::
+
+     1   unsigned long gp_max;
+     2   char abbr;
+     3   char *name;
+
+The ``->gp_max`` field tracks the duration of the longest grace period
+in jiffies. It is protected by the root ``rcu_node``'s ``->lock``.
+
+The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU
+(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”).
+These fields are used for diagnostic and tracing purposes.
+
+The ``rcu_node`` Structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``rcu_node`` structures form the combining tree that propagates
+quiescent-state information from the leaves to the root and also
+propagates grace-period information from the root down to the leaves.
+They provide local copies of the grace-period state in order to allow
+this information to be accessed in a synchronized manner without
+suffering the scalability limitations that would otherwise be imposed by
+global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists
+of tasks that have blocked while in their current RCU read-side critical
+section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they
+manage the per-\ ``rcu_node`` priority-boosting kernel threads
+(kthreads) and state. Finally, they record CPU-hotplug state in order to
+determine which CPUs should be ignored during a given grace period.
+
+The ``rcu_node`` structure's fields are discussed, singly and in groups,
+in the following sections.
+
+Connection to Combining Tree
+''''''''''''''''''''''''''''
+
+This portion of the ``rcu_node`` structure is declared as follows:
+
+::
+
+     1   struct rcu_node *parent;
+     2   u8 level;
+     3   u8 grpnum;
+     4   unsigned long grpmask;
+     5   int grplo;
+     6   int grphi;
+
+The ``->parent`` pointer references the ``rcu_node`` one level up in the
+tree, and is ``NULL`` for the root ``rcu_node``. 
The RCU implementation +makes heavy use of this field to push quiescent states up the tree. The +``->level`` field gives the level in the tree, with the root being at +level zero, its children at level one, and so on. The ``->grpnum`` field +gives this node's position within the children of its parent, so this +number can range between 0 and 31 on 32-bit systems and between 0 and 63 +on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only +during initialization and for tracing. The ``->grpmask`` field is the +bitmask counterpart of ``->grpnum``, and therefore always has exactly +one bit set. This mask is used to clear the bit corresponding to this +``rcu_node`` structure in its parent's bitmasks, which are described +later. Finally, the ``->grplo`` and ``->grphi`` fields contain the +lowest and highest numbered CPU served by this ``rcu_node`` structure, +respectively. + +All of these fields are constant, and thus do not require any +synchronization. + +Synchronization +''''''''''''''' + +This field of the ``rcu_node`` structure is declared as follows: + +:: + +     1   raw_spinlock_t lock; + +This field is used to protect the remaining fields in this structure, +unless otherwise stated. That said, all of the fields in this structure +can be accessed without locking for tracing purposes. Yes, this can +result in confusing traces, but better some tracing confusion than to be +heisenbugged out of existence. + +.. _grace-period-tracking-1: + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + +     1   unsigned long gp_seq; +     2   unsigned long gp_seq_needed; + +The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of +the field of the same name in the ``rcu_state`` structure. They each may +lag up to one step behind their ``rcu_state`` counterpart. 
If the bottom
+two bits of a given ``rcu_node`` structure's ``->gp_seq`` field are zero,
+then this ``rcu_node`` structure believes that RCU is idle.
+
+The ``->gp_seq`` field of each ``rcu_node`` structure is updated at the
+beginning and the end of each grace period.
+
+The ``->gp_seq_needed`` fields record the furthest-in-the-future grace
+period request seen by the corresponding ``rcu_node`` structure. The
+request is considered fulfilled when the value of the ``->gp_seq`` field
+equals or exceeds that of the ``->gp_seq_needed`` field.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Suppose that this ``rcu_node`` structure doesn't see a request for a  | 
+| very long time. Won't wrapping of the ``->gp_seq`` field cause        |
+| problems?                                                             |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| No, because if the ``->gp_seq_needed`` field lags behind the          |
+| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at  |
+| the end of the grace period. Modulo-arithmetic comparisons therefore  |
+| will always get the correct answer, even with wrapping.               |
++-----------------------------------------------------------------------+
+
+Quiescent-State Tracking
+''''''''''''''''''''''''
+
+These fields manage the propagation of quiescent states up the combining
+tree. 
+
+This portion of the ``rcu_node`` structure has fields as follows:
+
+::
+
+     1   unsigned long qsmask;
+     2   unsigned long expmask;
+     3   unsigned long qsmaskinit;
+     4   unsigned long expmaskinit;
+
+The ``->qsmask`` field tracks which of this ``rcu_node`` structure's
+children still need to report quiescent states for the current normal
+grace period. Such children will have a value of 1 in their
+corresponding bit. Note that the leaf ``rcu_node`` structures should be
+thought of as having ``rcu_data`` structures as their children.
+Similarly, the ``->expmask`` field tracks which of this ``rcu_node``
+structure's children still need to report quiescent states for the
+current expedited grace period. An expedited grace period has the same
+conceptual properties as a normal grace period, but the expedited
+implementation accepts extreme CPU overhead to obtain much lower
+grace-period latency, for example, consuming a few tens of microseconds
+worth of CPU time to reduce grace-period duration from milliseconds to
+tens of microseconds. The ``->qsmaskinit`` field tracks which of this
+``rcu_node`` structure's children cover for at least one online CPU.
+This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is
+used to initialize ``->expmask`` at the beginning of the normal and
+expedited grace periods, respectively.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| Why are these bitmasks protected by locking? Come on, haven't you     |
+| heard of atomic instructions???                                       
| ++-----------------------------------------------------------------------+ +| **Answer**:                                                           | ++-----------------------------------------------------------------------+ +| Lockless grace-period computation! Such a tantalizing possibility!    | +| But consider the following sequence of events:                        | +|                                                                       | +| #. CPU 0 has been in dyntick-idle mode for quite some time. When it   | +|    wakes up, it notices that the current RCU grace period needs it to | +|    report in, so it sets a flag where the scheduling clock interrupt  | +|    will find it.                                                      | +| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and       | +|    notices that CPU 0 has been in dyntick idle mode, which qualifies  | +|    as an extended quiescent state.                                    | +| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU   | +|    read-side critical section, and notices that the RCU core needs    | +|    something, so commences RCU softirq processing.                    | +| #. CPU 0's softirq handler executes and is just about ready to report | +|    its quiescent state up the ``rcu_node`` tree.                      | +| #. But CPU 1 beats it to the punch, completing the current grace      | +|    period and starting a new one.                                     | +| #. CPU 0 now reports its quiescent state for the wrong grace period.  | +|    That grace period might now end before the RCU read-side critical  | +|    section. If that happens, disaster will ensue.                     | +|                                                                       | +| So the locking is absolutely required in order to coordinate clearing | +| of the bits with updating of the grace-period sequence number in      | +| ``->gp_seq``.                                
                         | ++-----------------------------------------------------------------------+ + +Blocked-Task Management +''''''''''''''''''''''' + +``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU +read-side critical sections, and these tasks must be tracked explicitly. +The details of exactly why and how they are tracked will be covered in a +separate article on RCU read-side processing. For now, it is enough to +know that the ``rcu_node`` structure tracks them. + +:: + +     1   struct list_head blkd_tasks; +     2   struct list_head *gp_tasks; +     3   struct list_head *exp_tasks; +     4   bool wait_blkd_tasks; + +The ``->blkd_tasks`` field is a list header for the list of blocked and +preempted tasks. As tasks undergo context switches within RCU read-side +critical sections, their ``task_struct`` structures are enqueued (via +the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the +``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding +to the CPU on which the outgoing context switch executed. As these tasks +later exit their RCU read-side critical sections, they remove themselves +from the list. This list is therefore in reverse time order, so that if +one of the tasks is blocking the current grace period, all subsequent +tasks must also be blocking that same grace period. Therefore, a single +pointer into this list suffices to track all tasks blocking a given +grace period. That pointer is stored in ``->gp_tasks`` for normal grace +periods and in ``->exp_tasks`` for expedited grace periods. These last +two fields are ``NULL`` if either there is no grace period in flight or +if there are no blocked tasks preventing that grace period from +completing. 
If either of these two pointers is referencing a task that
+removes itself from the ``->blkd_tasks`` list, then that task must
+advance the pointer to the next task on the list, or set the pointer to
+``NULL`` if there are no subsequent tasks on the list.
+
+For example, suppose that tasks T1, T2, and T3 are all hard-affinitied
+to the largest-numbered CPU in the system. Then if task T1 blocked in an
+RCU read-side critical section, then an expedited grace period started,
+then task T2 blocked in an RCU read-side critical section, then a normal
+grace period started, and finally task T3 blocked in an RCU read-side
+critical section, then the state of the last leaf ``rcu_node``
+structure's blocked-task list would be as shown below:
+
+.. kernel-figure:: blkd_task.svg
+
+Task T1 is blocking both grace periods, task T2 is blocking only the
+normal grace period, and task T3 is blocking neither grace period. Note
+that these tasks will not remove themselves from this list immediately
+upon resuming execution. They will instead remain on the list until they
+execute the outermost ``rcu_read_unlock()`` that ends their RCU
+read-side critical section.
+
+The ``->wait_blkd_tasks`` field indicates whether or not the current
+grace period is waiting on a blocked task. 
+ +Sizing the ``rcu_node`` Array +''''''''''''''''''''''''''''' + +The ``rcu_node`` array is sized via a series of C-preprocessor +expressions as follows: + +:: + +    1 #ifdef CONFIG_RCU_FANOUT +    2 #define RCU_FANOUT CONFIG_RCU_FANOUT +    3 #else +    4 # ifdef CONFIG_64BIT +    5 # define RCU_FANOUT 64 +    6 # else +    7 # define RCU_FANOUT 32 +    8 # endif +    9 #endif +   10 +   11 #ifdef CONFIG_RCU_FANOUT_LEAF +   12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF +   13 #else +   14 # ifdef CONFIG_64BIT +   15 # define RCU_FANOUT_LEAF 64 +   16 # else +   17 # define RCU_FANOUT_LEAF 32 +   18 # endif +   19 #endif +   20 +   21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF) +   22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT) +   23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT) +   24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT) +   25 +   26 #if NR_CPUS <= RCU_FANOUT_1 +   27 #  define RCU_NUM_LVLS        1 +   28 #  define NUM_RCU_LVL_0        1 +   29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0 +   30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 } +   31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" } +   32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" } +   33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" } +   34 #elif NR_CPUS <= RCU_FANOUT_2 +   35 #  define RCU_NUM_LVLS        2 +   36 #  define NUM_RCU_LVL_0        1 +   37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) +   38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) +   39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } +   40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" } +   41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" } +   42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" } +   43 #elif NR_CPUS <= RCU_FANOUT_3 +   44 #  define RCU_NUM_LVLS        3 +   45 #  define NUM_RCU_LVL_0        1 +   46 #  define NUM_RCU_LVL_1        
DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) +   47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) +   48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) +   49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } +   50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" } +   51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } +   52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } +   53 #elif NR_CPUS <= RCU_FANOUT_4 +   54 #  define RCU_NUM_LVLS        4 +   55 #  define NUM_RCU_LVL_0        1 +   56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) +   57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) +   58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) +   59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) +   60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } +   61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } +   62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } +   63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } +   64 #else +   65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" +   66 #endif + +The maximum number of levels in the ``rcu_node`` structure is currently +limited to four, as specified by lines 21-24 and the structure of the +subsequent “if” statement. For 32-bit systems, this allows +16*32*32*32=524,288 CPUs, which should be sufficient for the next few +years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is +allowed, which should see us through the next decade or so. 
This +four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` +to support up to 4096 CPUs, which might be useful in very large systems +having eight CPUs per socket (but please note that no one has yet shown +any measurable performance degradation due to misaligned socket and +``rcu_node`` boundaries). In addition, building kernels with a full four +levels of ``rcu_node`` tree permits better testing of RCU's +combining-tree code. + +The ``RCU_FANOUT`` symbol controls how many children are permitted at +each non-leaf level of the ``rcu_node`` tree. If the +``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based +on the word size of the system, which is also the Kconfig default. + +The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by +each leaf ``rcu_node`` structure. Experience has shown that allowing a +given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the +number of bits in the ``->qsmask`` field on a 64-bit system, results in +excessive contention for the leaf ``rcu_node`` structures' ``->lock`` +fields. The number of CPUs per leaf ``rcu_node`` structure is therefore +limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If +``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based +on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. +Lines 11-19 perform this computation. + +Lines 21-24 compute the maximum number of CPUs supported by a +single-level (which contains a single ``rcu_node`` structure), +two-level, three-level, and four-level ``rcu_node`` tree, respectively, +given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. +These numbers of CPUs are retained in the ``RCU_FANOUT_1``, +``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor +variables, respectively. 
+ +These variables are used to control the C-preprocessor ``#if`` statement +spanning lines 26-66 that computes the number of ``rcu_node`` structures +required for each level of the tree, as well as the number of levels +required. The number of levels is placed in the ``NUM_RCU_LVLS`` +C-preprocessor variable by lines 27, 35, 44, and 54. The number of +``rcu_node`` structures for the topmost level of the tree is always +exactly one, and this value is unconditionally placed into +``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels +(if any) of the ``rcu_node`` tree are computed by dividing the maximum +number of CPUs by the fanout supported by the number of levels from the +current level down, rounding up. This computation is performed by +lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create +initializers for lockdep lock-class names. Finally, lines 64-66 produce +an error if the maximum number of CPUs is too large for the specified +fanout. + +The ``rcu_segcblist`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_segcblist`` structure maintains a segmented list of callbacks +as follows: + +:: + +    1 #define RCU_DONE_TAIL        0 +    2 #define RCU_WAIT_TAIL        1 +    3 #define RCU_NEXT_READY_TAIL  2 +    4 #define RCU_NEXT_TAIL        3 +    5 #define RCU_CBLIST_NSEGS     4 +    6 +    7 struct rcu_segcblist { +    8   struct rcu_head *head; +    9   struct rcu_head **tails[RCU_CBLIST_NSEGS]; +   10   unsigned long gp_seq[RCU_CBLIST_NSEGS]; +   11   long len; +   12   long len_lazy; +   13 }; + +The segments are as follows: + +#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These +   callbacks are ready to be invoked. +#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace +   period. Note that different CPUs can have different ideas about which +   grace period is current, hence the ``->gp_seq`` field. +#. 
``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period +   to start. +#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a +   grace period. + +The ``->head`` pointer references the first callback or is ``NULL`` if +the list contains no callbacks (which is *not* the same as being empty). +Each element of the ``->tails[]`` array references the ``->next`` +pointer of the last callback in the corresponding segment of the list, +or the list's ``->head`` pointer if that segment and all previous +segments are empty. If the corresponding segment is empty but some +previous segment is not empty, then the array element is identical to +its predecessor. Older callbacks are closer to the head of the list, and +new callbacks are added at the tail. This relationship between the +``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown +in this diagram: + +.. kernel-figure:: nxtlist.svg + +In this figure, the ``->head`` pointer references the first RCU callback +in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the +``->head`` pointer itself, indicating that none of the callbacks is +ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references +callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 +are both waiting on the current grace period, give or take possible +disagreements about exactly which grace period is the current one. The +``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU +callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that +there are no callbacks waiting on the next RCU grace period. The +``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` +pointer, indicating that all the remaining RCU callbacks have not yet +been assigned to an RCU grace period. 
Note that the +``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU +callback's ``->next`` pointer unless the callback list is empty, in +which case it references the ``->head`` pointer. + +There is one additional important special case for the +``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this +list is *disabled*. Lists are disabled when the corresponding CPU is +offline or when the corresponding CPU's callbacks are offloaded to a +kthread, both of which are described elsewhere. + +CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the +``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the +``RCU_DONE_TAIL`` list segments as grace periods advance. + +The ``->gp_seq[]`` array records grace-period numbers corresponding to +the list segments. This is what allows different CPUs to have different +ideas as to which is the current grace period while still avoiding +premature invocation of their callbacks. In particular, this allows CPUs +that go idle for extended periods to determine which of their callbacks +are ready to be invoked after reawakening. + +The ``->len`` counter contains the number of callbacks in ``->head``, +and the ``->len_lazy`` contains the number of those callbacks that are +known to only free memory, and whose invocation can therefore be safely +deferred. + +.. important:: + +   It is the ``->len`` field that determines whether or +   not there are callbacks associated with this ``rcu_segcblist`` +   structure, *not* the ``->head`` pointer. The reason for this is that all +   the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` +   segment) are extracted all at once at callback-invocation time +   (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there +   are no not-done callbacks remaining in the ``rcu_segcblist``. 
If +   callback invocation must be postponed, for example, because a +   high-priority process just woke up on this CPU, then the remaining +   callbacks are placed back on the ``RCU_DONE_TAIL`` segment and +   ``->head`` once again points to the start of the segment. In short, the +   head field can briefly be ``NULL`` even though the CPU has callbacks +   present the entire time. Therefore, it is not appropriate to test the +   ``->head`` pointer for ``NULL``. + +In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only +after the corresponding callbacks have been invoked. This means that the +``->len`` count is zero only if the ``rcu_segcblist`` structure really +is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` +count requires careful use of appropriate synchronization, for example, +memory barriers. This synchronization can be a bit subtle, particularly +in the case of ``rcu_barrier()``. + +The ``rcu_data`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The +fields in this structure may be accessed only from the corresponding CPU +(and from tracing) unless otherwise stated. This structure is the focus +of quiescent-state detection and RCU callback queuing. It also tracks +its relationship to the corresponding leaf ``rcu_node`` structure to +allow more-efficient propagation of quiescent states up the ``rcu_node`` +combining tree. Like the ``rcu_node`` structure, it provides a local +copy of the grace-period information to allow for-free synchronized +access to this information from the corresponding CPU. Finally, this +structure records past dyntick-idle state for the corresponding CPU and +also tracks statistics. + +The ``rcu_data`` structure's fields are discussed, singly and in groups, +in the following sections. 
+
+Connection to Other Data Structures
+'''''''''''''''''''''''''''''''''''
+
+This portion of the ``rcu_data`` structure is declared as follows:
+
+::
+
+     1   int cpu;
+     2   struct rcu_node *mynode;
+     3   unsigned long grpmask;
+     4   bool beenonline;
+
+The ``->cpu`` field contains the number of the corresponding CPU and the
+``->mynode`` field references the corresponding ``rcu_node`` structure.
+The ``->mynode`` is used to propagate quiescent states up the combining
+tree. These two fields are constant and therefore do not require
+synchronization.
+
+The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask``
+corresponding to this ``rcu_data`` structure, and is also used when
+propagating quiescent states. The ``->beenonline`` flag is set whenever
+the corresponding CPU comes online, which means that the debugfs tracing
+need not dump out any ``rcu_data`` structure for which this flag is not
+set.
+
+Quiescent-State and Grace-Period Tracking
+'''''''''''''''''''''''''''''''''''''''''
+
+This portion of the ``rcu_data`` structure is declared as follows:
+
+::
+
+     1   unsigned long gp_seq;
+     2   unsigned long gp_seq_needed;
+     3   bool cpu_no_qs;
+     4   bool core_needs_qs;
+     5   bool gpwrap;
+
+The ``->gp_seq`` field is the counterpart of the field of the same name
+in the ``rcu_state`` and ``rcu_node`` structures. The
+``->gp_seq_needed`` field is the counterpart of the field of the same
+name in the ``rcu_node`` structure. They may each lag up to one behind
+their ``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and
+``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in
+dyntick-idle mode (but these counters will catch up upon exit from
+dyntick-idle mode). If the lower two bits of a given ``rcu_data``
+structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure
+believes that RCU is idle. 
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| All this replication of the grace period numbers can only cause       |
+| massive confusion. Why not just keep a global sequence number and be  |
+| done with it???                                                       |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| Because if there were only a single global sequence number, there     |
+| would need to be a single global lock to allow safely accessing and   |
+| updating it. And if we are not going to have a single global lock, we |
+| need to carefully manage the numbers on a per-node basis. Recall from |
+| the answer to a previous Quick Quiz that the consequences of applying |
+| a previously sampled quiescent state to the wrong grace period are    |
+| quite severe.                                                         |
++-----------------------------------------------------------------------+
+
+The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed
+through a quiescent state, while the ``->core_needs_qs`` flag indicates
+that the RCU core needs a quiescent state from the corresponding CPU.
+The ``->gpwrap`` field indicates that the corresponding CPU has remained
+idle for so long that the ``gp_seq`` counter is in danger of overflow,
+which will cause the CPU to disregard the values of its counters on its
+next exit from idle.
+
+RCU Callback Handling
+'''''''''''''''''''''
+
+In the absence of CPU-hotplug events, RCU callbacks are invoked by the
+same CPU that registered them.
This is strictly a cache-locality
+optimization: callbacks can and do get invoked on CPUs other than the
+one that registered them. After all, if the CPU that registered a given
+callback has gone offline before the callback can be invoked, there
+really is no other choice.
+
+This portion of the ``rcu_data`` structure is declared as follows:
+
+::
+
+    1 struct rcu_segcblist cblist;
+    2 long qlen_last_fqs_check;
+    3 unsigned long n_cbs_invoked;
+    4 unsigned long n_nocbs_invoked;
+    5 unsigned long n_cbs_orphaned;
+    6 unsigned long n_cbs_adopted;
+    7 unsigned long n_force_qs_snap;
+    8 long blimit;
+
+The ``->cblist`` structure is the segmented callback list described
+earlier. The CPU advances the callbacks in its ``rcu_data`` structure
+whenever it notices that another RCU grace period has completed. The CPU
+detects the completion of an RCU grace period by noticing that the value
+of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of
+its leaf ``rcu_node`` structure. Recall that each ``rcu_node``
+structure's ``->gp_seq`` field is updated at the beginnings and ends of
+each grace period.
+
+The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` fields
+coordinate the forcing of quiescent states from ``call_rcu()`` and
+friends when callback lists grow excessively long.
+
+The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted``
+fields count the number of callbacks invoked, sent to other CPUs when
+this CPU goes offline, and received from other CPUs when those other
+CPUs go offline. The ``->n_nocbs_invoked`` field is used when the CPU's
+callbacks are offloaded to a kthread.
+
+Finally, the ``->blimit`` counter is the maximum number of RCU callbacks
+that may be invoked at a given time.
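The completion check described above, in which a CPU notices a newly completed grace period by comparing its local ``->gp_seq`` with that of its leaf ``rcu_node`` structure, can be sketched as follows. The structure and function names here are hypothetical, greatly simplified stand-ins, not the kernel's:

```c
#include <assert.h>

/* Greatly simplified stand-ins for the structures discussed above. */
struct rnp_sketch {                  /* stands in for rcu_node */
	unsigned long gp_seq;
};

struct rdp_sketch {                  /* stands in for rcu_data */
	unsigned long gp_seq;        /* local grace-period snapshot */
	struct rnp_sketch *mynode;   /* corresponding leaf rcu_node */
};

/* A differing value means at least one grace period has completed. */
static int gp_changed(const struct rdp_sketch *rdp)
{
	return rdp->gp_seq != rdp->mynode->gp_seq;
}

/* On noticing a change, advance callbacks and catch up the local copy. */
static void note_gp_changes(struct rdp_sketch *rdp)
{
	if (gp_changed(rdp)) {
		/* ... advance callbacks in ->cblist here ... */
		rdp->gp_seq = rdp->mynode->gp_seq;
	}
}
```

Because the comparison involves only the CPU's own structure and its leaf node, no global state need be consulted on this fast path.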
+
+Dyntick-Idle Handling
+'''''''''''''''''''''
+
+This portion of the ``rcu_data`` structure is declared as follows:
+
+::
+
+     1   int dynticks_snap;
+     2   unsigned long dynticks_fqs;
+
+The ``->dynticks_snap`` field is used to take a snapshot of the
+corresponding CPU's dyntick-idle state when forcing quiescent states,
+and is therefore accessed from other CPUs. Finally, the
+``->dynticks_fqs`` field is used to count the number of times this CPU
+is determined to be in dyntick-idle state, and is used for tracing and
+debugging purposes.
+
+This portion of the ``rcu_data`` structure is declared as follows:
+
+::
+
+     1   long dynticks_nesting;
+     2   long dynticks_nmi_nesting;
+     3   atomic_t dynticks;
+     4   bool rcu_need_heavy_qs;
+     5   bool rcu_urgent_qs;
+
+These fields in the ``rcu_data`` structure maintain the per-CPU
+dyntick-idle state for the corresponding CPU. The fields may be accessed
+only from the corresponding CPU (and from tracing) unless otherwise
+stated.
+
+The ``->dynticks_nesting`` field counts the nesting depth of process
+execution, so that in normal circumstances this counter has value zero
+or one. NMIs, irqs, and tracers are counted by the
+``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes
+to this variable have to be undertaken carefully using an algorithm
+provided by Andy Lutomirski. The initial transition from idle adds one,
+and nested transitions add two, so that a nesting level of five is
+represented by a ``->dynticks_nmi_nesting`` value of nine. This counter
+can therefore be thought of as counting the number of reasons why this
+CPU cannot be permitted to enter dyntick-idle mode, aside from
+process-level transitions.
+
+However, it turns out that when running in non-idle kernel context, the
+Linux kernel is fully capable of entering interrupt handlers that never
+exit and perhaps also vice versa.
Therefore, whenever the
+``->dynticks_nesting`` field is incremented up from zero, the
+``->dynticks_nmi_nesting`` field is set to a large positive number, and
+whenever the ``->dynticks_nesting`` field is decremented down to zero,
+the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that
+the number of misnested interrupts is not sufficient to overflow the
+counter, this approach corrects the ``->dynticks_nmi_nesting`` field
+every time the corresponding CPU enters the idle loop from process
+context.
+
+The ``->dynticks`` field counts the corresponding CPU's transitions to
+and from either dyntick-idle or user mode, so that this counter has an
+even value when the CPU is in dyntick-idle mode or user mode and an odd
+value otherwise. The transitions to/from user mode need to be counted
+for user mode adaptive-ticks support (see timers/NO_HZ.txt).
+
+The ``->rcu_need_heavy_qs`` field is used to record the fact that the
+RCU core code would really like to see a quiescent state from the
+corresponding CPU, so much so that it is willing to call for
+heavy-weight dyntick-counter operations. This flag is checked by RCU's
+context-switch and ``cond_resched()`` code, which provide a momentary
+idle sojourn in response.
+
+Finally, the ``->rcu_urgent_qs`` field is used to record the fact that
+the RCU core code would really like to see a quiescent state from the
+corresponding CPU, with the various other fields indicating just how
+badly RCU wants this quiescent state. This flag is checked by RCU's
+context-switch path (``rcu_note_context_switch()``) and the
+``cond_resched()`` code.
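The add-one-then-add-two counting rule for ``->dynticks_nmi_nesting`` described above can be modeled as follows. This sketch illustrates only the arithmetic (nesting level *n* is represented by the value 2*n* - 1); it is not the kernel's NMI-safe implementation:

```c
#include <assert.h>

/*
 * Illustrative model of the ->dynticks_nmi_nesting counting rule: the
 * initial transition from idle adds one, and each nested transition
 * adds two, so nesting level n is represented by the value 2*n - 1.
 * Not the kernel's actual NMI-safe implementation.
 */
static long nmi_nesting;

/* Entering an NMI/irq context: +1 from idle, +2 when already nested. */
static void model_nmi_enter(void)
{
	nmi_nesting += nmi_nesting ? 2 : 1;
}

/* Exiting: -1 back to idle from the outermost level, -2 otherwise. */
static void model_nmi_exit(void)
{
	nmi_nesting -= nmi_nesting == 1 ? 1 : 2;
}
```

With this rule, five nested entries yield a counter value of nine, matching the example in the text.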
+ ++-----------------------------------------------------------------------+ +| **Quick Quiz**:                                                       | ++-----------------------------------------------------------------------+ +| Why not simply combine the ``->dynticks_nesting`` and                 | +| ``->dynticks_nmi_nesting`` counters into a single counter that just   | +| counts the number of reasons that the corresponding CPU is non-idle?  | ++-----------------------------------------------------------------------+ +| **Answer**:                                                           | ++-----------------------------------------------------------------------+ +| Because this would fail in the presence of interrupts whose handlers  | +| never return and of handlers that manage to return from a made-up     | +| interrupt.                                                            | ++-----------------------------------------------------------------------+ + +Additional fields are present for some special-purpose builds, and are +discussed separately. + +The ``rcu_head`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each ``rcu_head`` structure represents an RCU callback. These structures +are normally embedded within RCU-protected data structures whose +algorithms use asynchronous grace periods. In contrast, when using +algorithms that block waiting for RCU grace periods, RCU users need not +provide ``rcu_head`` structures. + +The ``rcu_head`` structure has fields as follows: + +:: + +     1   struct rcu_head *next; +     2   void (*func)(struct rcu_head *head); + +The ``->next`` field is used to link the ``rcu_head`` structures +together in the lists within the ``rcu_data`` structures. The ``->func`` +field is a pointer to the function to be called when the callback is +ready to be invoked, and this function is passed a pointer to the +``rcu_head`` structure. 
However, ``kfree_rcu()`` uses the ``->func`` +field to record the offset of the ``rcu_head`` structure within the +enclosing RCU-protected data structure. + +Both of these fields are used internally by RCU. From the viewpoint of +RCU users, this structure is an opaque “cookie”. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**:                                                       | ++-----------------------------------------------------------------------+ +| Given that the callback function ``->func`` is passed a pointer to    | +| the ``rcu_head`` structure, how is that function supposed to find the | +| beginning of the enclosing RCU-protected data structure?              | ++-----------------------------------------------------------------------+ +| **Answer**:                                                           | ++-----------------------------------------------------------------------+ +| In actual practice, there is a separate callback function per type of | +| RCU-protected data structure. The callback function can therefore use | +| the ``container_of()`` macro in the Linux kernel (or other            | +| pointer-manipulation facilities in other software environments) to    | +| find the beginning of the enclosing structure.                        
|
++-----------------------------------------------------------------------+
+
+RCU-Specific Fields in the ``task_struct`` Structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in
+the ``task_struct`` structure:
+
+::
+
+    1 #ifdef CONFIG_PREEMPT_RCU
+    2   int rcu_read_lock_nesting;
+    3   union rcu_special rcu_read_unlock_special;
+    4   struct list_head rcu_node_entry;
+    5   struct rcu_node *rcu_blocked_node;
+    6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
+    7 #ifdef CONFIG_TASKS_RCU
+    8   unsigned long rcu_tasks_nvcsw;
+    9   bool rcu_tasks_holdout;
+   10   struct list_head rcu_tasks_holdout_list;
+   11   int rcu_tasks_idle_cpu;
+   12 #endif /* #ifdef CONFIG_TASKS_RCU */
+
+The ``->rcu_read_lock_nesting`` field records the nesting level for RCU
+read-side critical sections, and the ``->rcu_read_unlock_special`` field
+is a bitmask that records special conditions that require
+``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry``
+field is used to form lists of tasks that have blocked within
+preemptible-RCU read-side critical sections and the
+``->rcu_blocked_node`` field references the ``rcu_node`` structure whose
+list this task is a member of, or ``NULL`` if it is not blocked within a
+preemptible-RCU read-side critical section.
+
+The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context
+switches that this task had undergone at the beginning of the current
+tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current
+tasks-RCU grace period is waiting on this task,
+``->rcu_tasks_holdout_list`` is a list element enqueuing this task on
+the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this
+idle task is running on, but only if the task is currently running,
+that is, if the CPU is currently idle.
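The ``container_of()`` recovery described in the Quick Quiz for the ``rcu_head`` structure can be sketched in user space as follows. ``struct foo``, ``foo_callback()``, and the ``_sketch`` names are made up for illustration; only the ``offsetof``-based pointer arithmetic mirrors the kernel's ``container_of()``:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's rcu_head structure. */
struct rcu_head_sketch {
	struct rcu_head_sketch *next;
	void (*func)(struct rcu_head_sketch *head);
};

/* User-space equivalent of the kernel's container_of() macro. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical RCU-protected structure with an embedded callback head. */
struct foo {
	int data;
	struct rcu_head_sketch rh;
};

static int last_seen_data;

/* The callback receives &foo->rh and maps it back to the whole foo. */
static void foo_callback(struct rcu_head_sketch *head)
{
	struct foo *fp = container_of_sketch(head, struct foo, rh);

	last_seen_data = fp->data;   /* a real callback might kfree(fp) */
}
```

This is why one callback function per type suffices: each callback knows the offset of the ``rcu_head`` within its own type, so the generic RCU machinery never needs to.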
+
+Accessor Functions
+~~~~~~~~~~~~~~~~~~
+
+The following listing shows the ``rcu_get_root()``,
+``rcu_for_each_node_breadth_first()``, and ``rcu_for_each_leaf_node()``
+function and macros:
+
+::
+
+     1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
+     2 {
+     3   return &rsp->node[0];
+     4 }
+     5
+     6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
+     7   for ((rnp) = &(rsp)->node[0]; \
+     8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+     9
+    10 #define rcu_for_each_leaf_node(rsp, rnp) \
+    11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
+    12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+
+The ``rcu_get_root()`` function simply returns a pointer to the first
+element of the specified ``rcu_state`` structure's ``->node[]`` array,
+which is the root ``rcu_node`` structure.
+
+As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes
+advantage of the layout of the ``rcu_node`` structures in the
+``rcu_state`` structure's ``->node[]`` array, performing a breadth-first
+traversal by simply traversing the array in order. Similarly, the
+``rcu_for_each_leaf_node()`` macro traverses only the last part of the
+array, thus traversing only the leaf ``rcu_node`` structures.
+
++-----------------------------------------------------------------------+
+| **Quick Quiz**:                                                       |
++-----------------------------------------------------------------------+
+| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree    |
+| contains only a single node?                                          |
++-----------------------------------------------------------------------+
+| **Answer**:                                                           |
++-----------------------------------------------------------------------+
+| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the   |
+| single node.                                                          |
++-----------------------------------------------------------------------+
+
+Summary
+~~~~~~~
+
+So the state of RCU is represented by an ``rcu_state`` structure, which
+contains a combining tree of ``rcu_node`` and ``rcu_data`` structures.
+Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state
+is tracked by dynticks-related fields in the ``rcu_data`` structure. If
+you made it this far, you are well prepared to read the code
+walkthroughs in the other articles in this series.
+
+Acknowledgments
+~~~~~~~~~~~~~~~
+
+I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
+Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for
+helping me get this document into a more human-readable state.
+
+Legal Statement
+~~~~~~~~~~~~~~~
+
+This work represents the view of the author and does not necessarily
+represent the view of IBM.
+
+Linux is a registered trademark of Linus Torvalds.
+
+Other company, product, and service names may be trademarks or service
+marks of others.