diff options
Diffstat (limited to 'Documentation/core-api')
-rw-r--r-- | Documentation/core-api/errseq.rst | 159 | ||||
-rw-r--r-- | Documentation/core-api/idr.rst | 79 | ||||
-rw-r--r-- | Documentation/core-api/index.rst | 4 | ||||
-rw-r--r-- | Documentation/core-api/kernel-api.rst | 27 | ||||
-rw-r--r-- | Documentation/core-api/printk-formats.rst | 482 | ||||
-rw-r--r-- | Documentation/core-api/refcount-vs-atomic.rst | 150 |
6 files changed, 889 insertions, 12 deletions
diff --git a/Documentation/core-api/errseq.rst b/Documentation/core-api/errseq.rst new file mode 100644 index 000000000000..ff332e272405 --- /dev/null +++ b/Documentation/core-api/errseq.rst @@ -0,0 +1,159 @@ +===================== +The errseq_t datatype +===================== + +An errseq_t is a way of recording errors in one place, and allowing any +number of "subscribers" to tell whether it has changed since a previous +point where it was sampled. + +The initial use case for this is tracking errors for file +synchronization syscalls (fsync, fdatasync, msync and sync_file_range), +but it may be usable in other situations. + +It's implemented as an unsigned 32-bit value. The low order bits are +designated to hold an error code (between 1 and MAX_ERRNO). The upper bits +are used as a counter. This is done with atomics instead of locking so that +these functions can be called from any context. + +Note that there is a risk of collisions if new errors are being recorded +frequently, since we have so few bits to use as a counter. + +To mitigate this, the bit between the error value and counter is used as +a flag to tell whether the value has been sampled since a new value was +recorded. That allows us to avoid bumping the counter if no one has +sampled it since the last time an error was recorded. + +Thus we end up with a value that looks something like this: + ++--------------------------------------+----+------------------------+ +| 31..13 | 12 | 11..0 | ++--------------------------------------+----+------------------------+ +| counter | SF | errno | ++--------------------------------------+----+------------------------+ + +The general idea is for "watchers" to sample an errseq_t value and keep +it as a running cursor. That value can later be used to tell whether +any new errors have occurred since that sampling was done, and atomically +record the state at the time that it was checked. This allows us to +record errors in one place, and then have a number of "watchers" that +can tell whether the value has changed since they last checked it. + +A new errseq_t should always be zeroed out. An errseq_t value of all zeroes +is the special (but common) case where there has never been an error. An all +zero value thus serves as the "epoch" if one wishes to know whether there +has ever been an error set since it was first initialized. + +API usage +========= + +Let me tell you a story about a worker drone. Now, he's a good worker +overall, but the company is a little...management heavy. He has to +report to 77 supervisors today, and tomorrow the "big boss" is coming in +from out of town and he's sure to test the poor fellow too. + +They're all handing him work to do -- so much he can't keep track of who +handed him what, but that's not really a big problem. The supervisors +just want to know when he's finished all of the work they've handed him so +far and whether he made any mistakes since they last asked. + +He might have made the mistake on work they didn't actually hand him, +but he can't keep track of things at that level of detail, all he can +remember is the most recent mistake that he made. + +Here's our worker_drone representation:: + + struct worker_drone { + errseq_t wd_err; /* for recording errors */ + }; + +Every day, the worker_drone starts out with a blank slate:: + + struct worker_drone wd; + + wd.wd_err = (errseq_t)0; + +The supervisors come in and get an initial read for the day. They +don't care about anything that happened before their watch begins:: + + struct supervisor { + errseq_t s_wd_err; /* private "cursor" for wd_err */ + spinlock_t s_wd_err_lock; /* protects s_wd_err */ + } + + struct supervisor su; + + su.s_wd_err = errseq_sample(&wd.wd_err); + spin_lock_init(&su.s_wd_err_lock); + +Now they start handing him tasks to do. Every few minutes they ask him to +finish up all of the work they've handed him so far. Then they ask him +whether he made any mistakes on any of it:: + + spin_lock(&su.su_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.su_wd_err_lock); + +Up to this point, that just keeps returning 0. + +Now, the owners of this company are quite miserly and have given him +substandard equipment with which to do his job. Occasionally it +glitches and he makes a mistake. He sighs a heavy sigh, and marks it +down:: + + errseq_set(&wd.wd_err, -EIO); + +...and then gets back to work. The supervisors eventually poll again +and they each get the error when they next check. Subsequent calls will +return 0, until another error is recorded, at which point it's reported +to each of them once. + +Note that the supervisors can't tell how many mistakes he made, only +whether one was made since they last checked, and the latest value +recorded. + +Occasionally the big boss comes in for a spot check and asks the worker +to do a one-off job for him. He's not really watching the worker +full-time like the supervisors, but he does need to know whether a +mistake occurred while his job was processing. + +He can just sample the current errseq_t in the worker, and then use that +to tell whether an error has occurred later:: + + errseq_t since = errseq_sample(&wd.wd_err); + /* submit some work and wait for it to complete */ + err = errseq_check(&wd.wd_err, since); + +Since he's just going to discard "since" after that point, he doesn't +need to advance it here. He also doesn't need any locking since it's +not usable by anyone else. + +Serializing errseq_t cursor updates +=================================== + +Note that the errseq_t API does not protect the errseq_t cursor during a +check_and_advance_operation. Only the canonical error code is handled +atomically. In a situation where more than one task might be using the +same errseq_t cursor at the same time, it's important to serialize +updates to that cursor. + +If that's not done, then it's possible for the cursor to go backward +in which case the same error could be reported more than once. + +Because of this, it's often advantageous to first do an errseq_check to +see if anything has changed, and only later do an +errseq_check_and_advance after taking the lock. e.g.:: + + if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) { + /* su.s_wd_err is protected by s_wd_err_lock */ + spin_lock(&su.s_wd_err_lock); + err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err); + spin_unlock(&su.s_wd_err_lock); + } + +That avoids the spinlock in the common case where nothing has changed +since the last time it was checked. + +Functions +========= + +.. kernel-doc:: lib/errseq.c diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst new file mode 100644 index 000000000000..9078a5c3ac95 --- /dev/null +++ b/Documentation/core-api/idr.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: CC-BY-SA-4.0 + +============= +ID Allocation +============= + +:Author: Matthew Wilcox + +Overview +======== + +A common problem to solve is allocating identifiers (IDs); generally +small numbers which identify a thing. Examples include file descriptors, +process IDs, packet identifiers in networking protocols, SCSI tags +and device instance numbers. The IDR and the IDA provide a reasonable +solution to the problem to avoid everybody inventing their own. The IDR +provides the ability to map an ID to a pointer, while the IDA provides +only ID allocation, and as a result is much more memory-efficient. + +IDR usage +========= + +Start by initialising an IDR, either with :c:func:`DEFINE_IDR` +for statically allocated IDRs or :c:func:`idr_init` for dynamically +allocated IDRs. + +You can call :c:func:`idr_alloc` to allocate an unused ID. Look up +the pointer you associated with the ID by calling :c:func:`idr_find` +and free the ID by calling :c:func:`idr_remove`. + +If you need to change the pointer associated with an ID, you can call +:c:func:`idr_replace`. One common reason to do this is to reserve an +ID by passing a ``NULL`` pointer to the allocation function; initialise the +object with the reserved ID and finally insert the initialised object +into the IDR. + +Some users need to allocate IDs larger than ``INT_MAX``. So far all of +these users have been content with a ``UINT_MAX`` limit, and they use +:c:func:`idr_alloc_u32`. If you need IDs that will not fit in a u32, +we will work with you to address your needs. + +If you need to allocate IDs sequentially, you can use +:c:func:`idr_alloc_cyclic`. The IDR becomes less efficient when dealing +with larger IDs, so using this function comes at a slight cost. + +To perform an action on all pointers used by the IDR, you can +either use the callback-based :c:func:`idr_for_each` or the +iterator-style :c:func:`idr_for_each_entry`. You may need to use +:c:func:`idr_for_each_entry_continue` to continue an iteration. You can +also use :c:func:`idr_get_next` if the iterator doesn't fit your needs. + +When you have finished using an IDR, you can call :c:func:`idr_destroy` +to release the memory used by the IDR. This will not free the objects +pointed to from the IDR; if you want to do that, use one of the iterators +to do it. + +You can use :c:func:`idr_is_empty` to find out whether there are any +IDs currently allocated. + +If you need to take a lock while allocating a new ID from the IDR, +you may need to pass a restrictive set of GFP flags, which can lead +to the IDR being unable to allocate memory. To work around this, +you can call :c:func:`idr_preload` before taking the lock, and then +:c:func:`idr_preload_end` after the allocation. + +.. kernel-doc:: include/linux/idr.h + :doc: idr sync + +IDA usage +========= + +.. kernel-doc:: lib/idr.c + :doc: IDA description + +Functions and structures +======================== + +.. kernel-doc:: include/linux/idr.h +.. kernel-doc:: lib/idr.c diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index d5bbe035316d..c670a8031786 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -14,13 +14,17 @@ Core utilities kernel-api assoc_array atomic_ops + refcount-vs-atomic cpu_hotplug + idr local_ops workqueue genericirq flexible-arrays librs genalloc + errseq + printk-formats Interfaces for kernel debugging =============================== diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 2d9da6c40a4d..ff335f8aeb39 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -103,18 +103,6 @@ CRC Functions .. kernel-doc:: lib/crc-itu-t.c :export: -idr/ida Functions ------------------ - -.. kernel-doc:: include/linux/idr.h - :doc: idr sync - -.. kernel-doc:: lib/idr.c - :doc: IDA description - -.. kernel-doc:: lib/idr.c - :export: - Math Functions in Linux ======================= @@ -139,6 +127,21 @@ Division Functions .. kernel-doc:: lib/gcd.c :export: +Sorting +------- + +.. kernel-doc:: lib/sort.c + :export: + +.. kernel-doc:: lib/list_sort.c + :export: + +UUID/GUID +--------- + +.. kernel-doc:: lib/uuid.c + :export: + Memory Management in Linux ========================== diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst new file mode 100644 index 000000000000..934559b3c130 --- /dev/null +++ b/Documentation/core-api/printk-formats.rst @@ -0,0 +1,482 @@ +========================================= +How to get printk format specifiers right +========================================= + +:Author: Randy Dunlap <rdunlap@infradead.org> +:Author: Andrew Murray <amurray@mpc-data.co.uk> + + +Integer types +============= + +:: + + If variable is of Type, use printk format specifier: + ------------------------------------------------------------ + int %d or %x + unsigned int %u or %x + long %ld or %lx + unsigned long %lu or %lx + long long %lld or %llx + unsigned long long %llu or %llx + size_t %zu or %zx + ssize_t %zd or %zx + s32 %d or %x + u32 %u or %x + s64 %lld or %llx + u64 %llu or %llx + + +If <type> is dependent on a config option for its size (e.g., sector_t, +blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a +format specifier of its largest possible type and explicitly cast to it. + +Example:: + + printk("test: sector number/total blocks: %llu/%llu\n", + (unsigned long long)sector, (unsigned long long)blockcount); + +Reminder: sizeof() returns type size_t. + +The kernel's printf does not support %n. Floating point formats (%e, %f, +%g, %a) are also not recognized, for obvious reasons. Use of any +unsupported specifier or length qualifier results in a WARN and early +return from vsnprintf(). + +Pointer types +============= + +A raw pointer value may be printed with %p which will hash the address +before printing. The kernel also supports extended specifiers for printing +pointers of different types. + +Plain Pointers +-------------- + +:: + + %p abcdef12 or 00000000abcdef12 + +Pointers printed without a specifier extension (i.e unadorned %p) are +hashed to prevent leaking information about the kernel memory layout. This +has the added benefit of providing a unique identifier. On 64-bit machines +the first 32 bits are zeroed. If you *really* want the address see %px +below. + +Symbols/Function Pointers +------------------------- + +:: + + %pS versatile_init+0x0/0x110 + %ps versatile_init + %pF versatile_init+0x0/0x110 + %pf versatile_init + %pSR versatile_init+0x9/0x110 + (with __builtin_extract_return_addr() translation) + %pB prev_fn_of_versatile_init+0x88/0x88 + + +The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic +format. They result in the symbol name with (S) or without (s) +offsets. If KALLSYMS are disabled then the symbol address is printed instead. + +Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``) +and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and +parisc64 function pointers are indirect and, in fact, are function +descriptors, which require additional dereferencing before we can lookup +the symbol. As of now, ``S`` and ``s`` perform dereferencing on those +platforms (when needed), so ``F`` and ``f`` exist for compatibility +reasons only. + +The ``B`` specifier results in the symbol name with offsets and should be +used when printing stack backtraces. The specifier takes into +consideration the effect of compiler optimisations which may occur +when tail-calls are used and marked with the noreturn GCC attribute. + +Kernel Pointers +--------------- + +:: + + %pK 01234567 or 0123456789abcdef + +For printing kernel pointers which should be hidden from unprivileged +users. The behaviour of %pK depends on the kptr_restrict sysctl - see +Documentation/sysctl/kernel.txt for more details. + +Unmodified Addresses +-------------------- + +:: + + %px 01234567 or 0123456789abcdef + +For printing pointers when you *really* want to print the address. Please +consider whether or not you are leaking sensitive information about the +kernel memory layout before printing pointers with %px. %px is functionally +equivalent to %lx (or %lu). %px is preferred because it is more uniquely +grep'able. If in the future we need to modify the way the kernel handles +printing pointers we will be better equipped to find the call sites. + +Struct Resources +---------------- + +:: + + %pr [mem 0x60000000-0x6fffffff flags 0x2200] or + [mem 0x0000000060000000-0x000000006fffffff flags 0x2200] + %pR [mem 0x60000000-0x6fffffff pref] or + [mem 0x0000000060000000-0x000000006fffffff pref] + +For printing struct resources. The ``R`` and ``r`` specifiers result in a +printed resource with (R) or without (r) a decoded flags member. + +Passed by reference. + +Physical address types phys_addr_t +---------------------------------- + +:: + + %pa[p] 0x01234567 or 0x0123456789abcdef + +For printing a phys_addr_t type (and its derivatives, such as +resource_size_t) which can vary based on build options, regardless of the +width of the CPU data path. + +Passed by reference. + +DMA address types dma_addr_t +---------------------------- + +:: + + %pad 0x01234567 or 0x0123456789abcdef + +For printing a dma_addr_t type which can vary based on build options, +regardless of the width of the CPU data path. + +Passed by reference. + +Raw buffer as an escaped string +------------------------------- + +:: + + %*pE[achnops] + +For printing raw buffer as an escaped string. For the following buffer:: + + 1b 62 20 5c 43 07 22 90 0d 5d + +A few examples show how the conversion would be done (excluding surrounding +quotes):: + + %*pE "\eb \C\a"\220\r]" + %*pEhp "\x1bb \C\x07"\x90\x0d]" + %*pEa "\e\142\040\\\103\a\042\220\r\135" + +The conversion rules are applied according to an optional combination +of flags (see :c:func:`string_escape_mem` kernel documentation for the +details): + + - a - ESCAPE_ANY + - c - ESCAPE_SPECIAL + - h - ESCAPE_HEX + - n - ESCAPE_NULL + - o - ESCAPE_OCTAL + - p - ESCAPE_NP + - s - ESCAPE_SPACE + +By default ESCAPE_ANY_NP is used. + +ESCAPE_ANY_NP is the sane choice for many cases, in particularly for +printing SSIDs. + +If field width is omitted then 1 byte only will be escaped. + +Raw buffer as a hex string +-------------------------- + +:: + + %*ph 00 01 02 ... 3f + %*phC 00:01:02: ... :3f + %*phD 00-01-02- ... -3f + %*phN 000102 ... 3f + +For printing small buffers (up to 64 bytes long) as a hex string with a +certain separator. For larger buffers consider using +:c:func:`print_hex_dump`. + +MAC/FDDI addresses +------------------ + +:: + + %pM 00:01:02:03:04:05 + %pMR 05:04:03:02:01:00 + %pMF 00-01-02-03-04-05 + %pm 000102030405 + %pmR 050403020100 + +For printing 6-byte MAC/FDDI addresses in hex notation. The ``M`` and ``m`` +specifiers result in a printed address with (M) or without (m) byte +separators. The default byte separator is the colon (:). + +Where FDDI addresses are concerned the ``F`` specifier can be used after +the ``M`` specifier to use dash (-) separators instead of the default +separator. + +For Bluetooth addresses the ``R`` specifier shall be used after the ``M`` +specifier to use reversed byte order suitable for visual interpretation +of Bluetooth addresses which are in the little endian order. + +Passed by reference. + +IPv4 addresses +-------------- + +:: + + %pI4 1.2.3.4 + %pi4 001.002.003.004 + %p[Ii]4[hnbl] + +For printing IPv4 dot-separated decimal addresses. The ``I4`` and ``i4`` +specifiers result in a printed address with (i4) or without (I4) leading +zeros. + +The additional ``h``, ``n``, ``b``, and ``l`` specifiers are used to specify +host, network, big or little endian order addresses respectively. Where +no specifier is provided the default network/big endian order is used. + +Passed by reference. + +IPv6 addresses +-------------- + +:: + + %pI6 0001:0002:0003:0004:0005:0006:0007:0008 + %pi6 00010002000300040005000600070008 + %pI6c 1:2:3:4:5:6:7:8 + +For printing IPv6 network-order 16-bit hex addresses. The ``I6`` and ``i6`` +specifiers result in a printed address with (I6) or without (i6) +colon-separators. Leading zeros are always used. + +The additional ``c`` specifier can be used with the ``I`` specifier to +print a compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 + +Passed by reference. + +IPv4/IPv6 addresses (generic, with port, flowinfo, scope) +--------------------------------------------------------- + +:: + + %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 + %piS 001.002.003.004 or 00010002000300040005000600070008 + %pISc 1.2.3.4 or 1:2:3:4:5:6:7:8 + %pISpc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345 + %p[Ii]S[pfschnbl] + +For printing an IP address without the need to distinguish whether it's of +type AF_INET or AF_INET6. A pointer to a valid struct sockaddr, +specified through ``IS`` or ``iS``, can be passed to this format specifier. + +The additional ``p``, ``f``, and ``s`` specifiers are used to specify port +(IPv4, IPv6), flowinfo (IPv6) and scope (IPv6). Ports have a ``:`` prefix, +flowinfo a ``/`` and scope a ``%``, each followed by the actual value. + +In case of an IPv6 address the compressed IPv6 address as described by +http://tools.ietf.org/html/rfc5952 is being used if the additional +specifier ``c`` is given. The IPv6 address is surrounded by ``[``, ``]`` in +case of additional specifiers ``p``, ``f`` or ``s`` as suggested by +https://tools.ietf.org/html/draft-ietf-6man-text-addr-representation-07 + +In case of IPv4 addresses, the additional ``h``, ``n``, ``b``, and ``l`` +specifiers can be used as well and are ignored in case of an IPv6 +address. + +Passed by reference. + +Further examples:: + + %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 + %pISsc 1.2.3.4 or [1:2:3:4:5:6:7:8]%1234567890 + %pISpfc 1.2.3.4:12345 or [1:2:3:4:5:6:7:8]:12345/123456789 + +UUID/GUID addresses +------------------- + +:: + + %pUb 00010203-0405-0607-0809-0a0b0c0d0e0f + %pUB 00010203-0405-0607-0809-0A0B0C0D0E0F + %pUl 03020100-0504-0706-0809-0a0b0c0e0e0f + %pUL 03020100-0504-0706-0809-0A0B0C0E0E0F + +For printing 16-byte UUID/GUIDs addresses. The additional ``l``, ``L``, +``b`` and ``B`` specifiers are used to specify a little endian order in +lower (l) or upper case (L) hex notation - and big endian order in lower (b) +or upper case (B) hex notation. + +Where no additional specifiers are used the default big endian +order with lower case hex notation will be printed. + +Passed by reference. + +dentry names +------------ + +:: + + %pd{,2,3,4} + %pD{,2,3,4} + +For printing dentry name; if we race with :c:func:`d_move`, the name might +be a mix of old and new ones, but it won't oops. %pd dentry is a safer +equivalent of %s dentry->d_name.name we used to use, %pd<n> prints ``n`` +last components. %pD does the same thing for struct file. + +Passed by reference. + +block_device names +------------------ + +:: + + %pg sda, sda1 or loop0p1 + +For printing name of block_device pointers. + +struct va_format +---------------- + +:: + + %pV + +For printing struct va_format structures. These contain a format string +and va_list as follows:: + + struct va_format { + const char *fmt; + va_list *va; + }; + +Implements a "recursive vsnprintf". + +Do not use this feature without some mechanism to verify the +correctness of the format string and va_list arguments. + +Passed by reference. + +kobjects +-------- + +:: + + %pOF[fnpPcCF] + + +For printing kobject based structs (device nodes). Default behaviour is +equivalent to %pOFf. + + - f - device node full_name + - n - device node name + - p - device node phandle + - P - device node path spec (name + @unit) + - F - device node flags + - c - major compatible string + - C - full compatible string + +The separator when using multiple arguments is ':' + +Examples:: + + %pOF /foo/bar@0 - Node full name + %pOFf /foo/bar@0 - Same as above + %pOFfp /foo/bar@0:10 - Node full name + phandle + %pOFfcF /foo/bar@0:foo,device:--P- - Node full name + + major compatible string + + node flags + D - dynamic + d - detached + P - Populated + B - Populated bus + +Passed by reference. + +struct clk +---------- + +:: + + %pC pll1 + %pCn pll1 + %pCr 1560000000 + +For printing struct clk structures. %pC and %pCn print the name +(Common Clock Framework) or address (legacy clock framework) of the +structure; %pCr prints the current clock rate. + +Passed by reference. + +bitmap and its derivatives such as cpumask and nodemask +------------------------------------------------------- + +:: + + %*pb 0779 + %*pbl 0,3-6,8-10 + +For printing bitmap and its derivatives such as cpumask and nodemask, +%*pb outputs the bitmap with field width as the number of bits and %*pbl +output the bitmap as range list with field width as the number of bits. + +Passed by reference. + +Flags bitfields such as page flags, gfp_flags +--------------------------------------------- + +:: + + %pGp referenced|uptodate|lru|active|private + %pGg GFP_USER|GFP_DMA32|GFP_NOWARN + %pGv read|exec|mayread|maywrite|mayexec|denywrite + +For printing flags bitfields as a collection of symbolic constants that +would construct the value. The type of flags is given by the third +character. Currently supported are [p]age flags, [v]ma_flags (both +expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag +names and print order depends on the particular type. + +Note that this format should not be used directly in the +:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags() +functions from <trace/events/mmflags.h>. + +Passed by reference. + +Network device features +----------------------- + +:: + + %pNF 0x000000000000c000 + +For printing netdev_features_t. + +Passed by reference. + +Thanks +====== + +If you add other %p extensions, please extend <lib/test_printf.c> with +one or more test cases, if at all feasible. + +Thank you for your cooperation and attention. diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst new file mode 100644 index 000000000000..83351c258cdb --- /dev/null +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -0,0 +1,150 @@ +=================================== +refcount_t API compared to atomic_t +=================================== + +.. contents:: :local: + +Introduction +============ + +The goal of refcount_t API is to provide a minimal API for implementing +an object's reference counters. While a generic architecture-independent +implementation from lib/refcount.c uses atomic operations underneath, +there are a number of differences between some of the ``refcount_*()`` and +``atomic_*()`` functions with regards to the memory ordering guarantees. +This document outlines the differences and provides respective examples +in order to help maintainers validate their code against the change in +these memory ordering guarantees. + +The terms used through this document try to follow the formal LKMM defined in +github.com/aparri/memory-model/blob/master/Documentation/explanation.txt + +memory-barriers.txt and atomic_t.txt provide more background to the +memory ordering in general and for atomic operations specifically. + +Relevant types of memory ordering +================================= + +.. note:: The following section only covers some of the memory + ordering types that are relevant for the atomics and reference + counters and used through this document. For a much broader picture + please consult memory-barriers.txt document. + +In the absence of any memory ordering guarantees (i.e. fully unordered) +atomics & refcounters only provide atomicity and +program order (po) relation (on the same CPU). It guarantees that +each ``atomic_*()`` and ``refcount_*()`` operation is atomic and instructions +are executed in program order on a single CPU. +This is implemented using :c:func:`READ_ONCE`/:c:func:`WRITE_ONCE` and +compare-and-swap primitives. + +A strong (full) memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before any po-later instruction is executed on the same CPU. +It also guarantees that all po-earlier stores on the same CPU +and all propagated stores from other CPUs must propagate to all +other CPUs before any po-later instruction is executed on the original +CPU (A-cumulative property). This is implemented using :c:func:`smp_mb`. + +A RELEASE memory ordering guarantees that all prior loads and +stores (all po-earlier instructions) on the same CPU are completed +before the operation. It also guarantees that all po-earlier +stores on the same CPU and all propagated stores from other CPUs +must propagate to all other CPUs before the release operation +(A-cumulative property). This is implemented using +:c:func:`smp_store_release`. + +A control dependency (on success) for refcounters guarantees that +if a reference for an object was successfully obtained (reference +counter increment or addition happened, function returned true), +then further stores are ordered against this operation. +Control dependency on stores are not implemented using any explicit +barriers, but rely on CPU not to speculate on stores. This is only +a single CPU relation and provides no guarantees for other CPUs. + + +Comparison of functions +======================= + +case 1) - non-"Read/Modify/Write" (RMW) ops +------------------------------------------- + +Function changes: + + * :c:func:`atomic_set` --> :c:func:`refcount_set` + * :c:func:`atomic_read` --> :c:func:`refcount_read` + +Memory ordering guarantee changes: + + * none (both fully unordered) + + +case 2) - increment-based ops that return no value +-------------------------------------------------- + +Function changes: + + * :c:func:`atomic_inc` --> :c:func:`refcount_inc` + * :c:func:`atomic_add` --> :c:func:`refcount_add` + +Memory ordering guarantee changes: + + * none (both fully unordered) + +case 3) - decrement-based RMW ops that return no value +------------------------------------------------------ + +Function changes: + + * :c:func:`atomic_dec` --> :c:func:`refcount_dec` + +Memory ordering guarantee changes: + + * fully unordered --> RELEASE ordering + + +case 4) - increment-based RMW ops that return a value +----------------------------------------------------- + +Function changes: + + * :c:func:`atomic_inc_not_zero` --> :c:func:`refcount_inc_not_zero` + * no atomic counterpart --> :c:func:`refcount_add_not_zero` + +Memory ordering guarantees changes: + + * fully ordered --> control dependency on success for stores + +.. note:: We really assume here that necessary ordering is provided as a + result of obtaining pointer to the object! + + +case 5) - decrement-based RMW ops that return a value +----------------------------------------------------- + +Function changes: + + * :c:func:`atomic_dec_and_test` --> :c:func:`refcount_dec_and_test` + * :c:func:`atomic_sub_and_test` --> :c:func:`refcount_sub_and_test` + * no atomic counterpart --> :c:func:`refcount_dec_if_one` + * ``atomic_add_unless(&var, -1, 1)`` --> ``refcount_dec_not_one(&var)`` + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + +.. note:: :c:func:`atomic_add_unless` only provides full order on success. + + +case 6) - lock-based RMW +------------------------ + +Function changes: + + * :c:func:`atomic_dec_and_lock` --> :c:func:`refcount_dec_and_lock` + * :c:func:`atomic_dec_and_mutex_lock` --> :c:func:`refcount_dec_and_mutex_lock` + +Memory ordering guarantees changes: + + * fully ordered --> RELEASE ordering + control dependency + hold + :c:func:`spin_lock` on success |