diff options
Diffstat (limited to 'Documentation/admin-guide')
18 files changed, 713 insertions, 227 deletions
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 2cc502a75ef6..5b86245450bd 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. -2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) +2.7 Kernel Memory Extension ----------------------------------------------- With the Kernel memory extension, the Memory Controller is able to limit @@ -386,8 +386,6 @@ U != 0, K >= U: a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG -c. Enable CONFIG_MEMCG_SWAP (to use swap extension) -d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) ------------------------------------------------------------------- diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index be4a77baf784..7bcfb38498c6 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1355,6 +1355,11 @@ PAGE_SIZE multiple when read back. pagetables Amount of memory allocated for page tables. + sec_pagetables + Amount of memory allocated for secondary page tables, + this currently includes KVM mmu allocations on x86 + and arm64. + percpu (npn) Amount of memory used for storing per-cpu kernel data structures. @@ -2185,75 +2190,93 @@ Cpuset Interface Files It accepts only the following input values when written to. - ======== ================================ - "root" a partition root - "member" a non-root member of a partition - ======== ================================ - - When set to be a partition root, the current cgroup is the - root of a new partition or scheduling domain that comprises - itself and all its descendants except those that are separate - partition roots themselves and their descendants. The root - cgroup is always a partition root. - - There are constraints on where a partition root can be set. - It can only be set in a cgroup if all the following conditions - are true. - - 1) The "cpuset.cpus" is not empty and the list of CPUs are - exclusive, i.e. they are not shared by any of its siblings. - 2) The parent cgroup is a partition root. - 3) The "cpuset.cpus" is also a proper subset of the parent's - "cpuset.cpus.effective". - 4) There is no child cgroups with cpuset enabled. This is for - eliminating corner cases that have to be handled if such a - condition is allowed. - - Setting it to partition root will take the CPUs away from the - effective CPUs of the parent cgroup. Once it is set, this - file cannot be reverted back to "member" if there are any child - cgroups with cpuset enabled. - - A parent partition cannot distribute all its CPUs to its - child partitions. There must be at least one cpu left in the - parent partition. - - Once becoming a partition root, changes to "cpuset.cpus" is - generally allowed as long as the first condition above is true, - the change will not take away all the CPUs from the parent - partition and the new "cpuset.cpus" value is a superset of its - children's "cpuset.cpus" values. - - Sometimes, external factors like changes to ancestors' - "cpuset.cpus" or cpu hotplug can cause the state of the partition - root to change. On read, the "cpuset.sched.partition" file - can show the following values. - - ============== ============================== - "member" Non-root member of a partition - "root" Partition root - "root invalid" Invalid partition root - ============== ============================== - - It is a partition root if the first 2 partition root conditions - above are true and at least one CPU from "cpuset.cpus" is - granted by the parent cgroup. - - A partition root can become invalid if none of CPUs requested - in "cpuset.cpus" can be granted by the parent cgroup or the - parent cgroup is no longer a partition root itself. In this - case, it is not a real partition even though the restriction - of the first partition root condition above will still apply. - The cpu affinity of all the tasks in the cgroup will then be - associated with CPUs in the nearest ancestor partition. - - An invalid partition root can be transitioned back to a - real partition root if at least one of the requested CPUs - can now be granted by its parent. In this case, the cpu - affinity of all the tasks in the formerly invalid partition - will be associated to the CPUs of the newly formed partition. - Changing the partition state of an invalid partition root to - "member" is always allowed even if child cpusets are present. + ========== ===================================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + ========== ===================================== + + The root cgroup is always a partition root and its state + cannot be changed. All other non-root cgroups start out as + "member". + + When set to "root", the current cgroup is the root of a new + partition or scheduling domain that comprises itself and all + its descendants except those that are separate partition roots + themselves and their descendants. + + When set to "isolated", the CPUs in that partition root will + be in an isolated state without any load balancing from the + scheduler. Tasks placed in such a partition with multiple + CPUs should be carefully distributed and bound to each of the + individual CPUs for optimal performance. + + The value shown in "cpuset.cpus.effective" of a partition root + is the CPUs that the partition root can dedicate to a potential + new child partition root. The new child subtracts available + CPUs from its parent "cpuset.cpus.effective". + + A partition root ("root" or "isolated") can be in one of the + two possible states - valid or invalid. An invalid partition + root is in a degraded state where some state information may + be retained, but behaves more like a "member". + + All possible state transitions among "member", "root" and + "isolated" are allowed. + + On read, the "cpuset.cpus.partition" file can show the following + values. + + ============================= ===================================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + "root invalid (<reason>)" Invalid partition root + "isolated invalid (<reason>)" Invalid isolated partition root + ============================= ===================================== + + In the case of an invalid partition root, a descriptive string on + why the partition is invalid is included within parentheses. + + For a partition root to become valid, the following conditions + must be met. + + 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they + are not shared by any of its siblings (exclusivity rule). + 2) The parent cgroup is a valid partition root. + 3) The "cpuset.cpus" is not empty and must contain at least + one of the CPUs from parent's "cpuset.cpus", i.e. they overlap. + 4) The "cpuset.cpus.effective" cannot be empty unless there is + no task associated with this partition. + + External events like hotplug or changes to "cpuset.cpus" can + cause a valid partition root to become invalid and vice versa. + Note that a task cannot be moved to a cgroup with empty + "cpuset.cpus.effective". + + For a valid partition root with the sibling cpu exclusivity + rule enabled, changes made to "cpuset.cpus" that violate the + exclusivity rule will invalidate the partition as well as its + sibiling partitions with conflicting cpuset.cpus values. So + care must be taking in changing "cpuset.cpus". + + A valid non-root parent partition may distribute out all its CPUs + to its child partitions when there is no task associated with it. + + Care must be taken to change a valid partition root to + "member" as all its child partitions, if present, will become + invalid causing disruption to tasks running in those child + partitions. These inactivated partitions could be recovered if + their parent is switched back to a partition root with a proper + set of "cpuset.cpus". + + Poll and inotify events are triggered whenever the state of + "cpuset.cpus.partition" changes. That includes changes caused + by write to "cpuset.cpus.partition", cpu hotplug or other + changes that modify the validity status of the partition. + This will allow user space agents to monitor unexpected changes + to "cpuset.cpus.partition" without the need to do continuous + polling. Device controller diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index a89cfa083155..faa22f77847a 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -5,143 +5,115 @@ Dynamic debug Introduction ============ -This document describes how to use the dynamic debug (dyndbg) feature. +Dynamic debug allows you to dynamically enable/disable kernel +debug-print code to obtain additional kernel information. -Dynamic debug is designed to allow you to dynamically enable/disable -kernel code to obtain additional kernel information. Currently, if -``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and -``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically -enabled per-callsite. +If ``/proc/dynamic_debug/control`` exists, your kernel has dynamic +debug. You'll need root access (sudo su) to use this. -If you do not want to enable dynamic debug globally (i.e. in some embedded -system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic -debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any -modules which you'd like to dynamically debug later. - -If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just -shortcut for ``print_hex_dump(KERN_DEBUG)``. - -For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is -its ``prefix_str`` argument, if it is constant string; or ``hexdump`` -in case ``prefix_str`` is built dynamically. +Dynamic debug provides: -Dynamic debug has even more useful features: + * a Catalog of all *prdbgs* in your kernel. + ``cat /proc/dynamic_debug/control`` to see them. - * Simple query language allows turning on and off debugging - statements by matching any combination of 0 or 1 of: + * a Simple query/command language to alter *prdbgs* by selecting on + any combination of 0 or 1 of: - source filename - function name - line number (including ranges of line numbers) - module name - format string - - * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control`` - which can be read to display the complete list of known debug - statements, to help guide you - -Controlling dynamic debug Behaviour -=================================== - -The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a -control file in the 'debugfs' filesystem. Thus, you must first mount -the debugfs filesystem, in order to make use of this feature. -Subsequently, we refer to the control file as: -``<debugfs>/dynamic_debug/control``. For example, if you want to enable -printing from source file ``svcsock.c``, line 1603 you simply do:: - - nullarbor:~ # echo 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control - -If you make a mistake with the syntax, the write will fail thus:: - - nullarbor:~ # echo 'file svcsock.c wtf 1 +p' > - <debugfs>/dynamic_debug/control - -bash: echo: write error: Invalid argument - -Note, for systems without 'debugfs' enabled, the control file can be -found in ``/proc/dynamic_debug/control``. + - class name (as known/declared by each module) Viewing Dynamic Debug Behaviour =============================== -You can view the currently configured behaviour of all the debug -statements via:: +You can view the currently configured behaviour in the *prdbg* catalog:: - nullarbor:~ # cat <debugfs>/dynamic_debug/control + :#> head -n7 /proc/dynamic_debug/control # filename:lineno [module]function flags format - net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012" - net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012" - net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012" - net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012" - ... + init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012 + init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012" + init/main.c:1424 [main]run_init_process =_ " with arguments:\012" + init/main.c:1426 [main]run_init_process =_ " %s\012" + init/main.c:1427 [main]run_init_process =_ " with environment:\012" + init/main.c:1429 [main]run_init_process =_ " %s\012" +The 3rd space-delimited column shows the current flags, preceded by +a ``=`` for easy use with grep/cut. ``=p`` shows enabled callsites. -You can also apply standard Unix text manipulation filters to this -data, e.g.:: +Controlling dynamic debug Behaviour +=================================== - nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l - 62 +The behaviour of *prdbg* sites are controlled by writing +query/commands to the control file. Example:: - nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l - 42 + # grease the interface + :#> alias ddcmd='echo $* > /proc/dynamic_debug/control' -The third column shows the currently enabled flags for each debug -statement callsite (see below for definitions of the flags). The -default value, with no flags enabled, is ``=_``. So you can view all -the debug statement callsites with any non-default flags:: + :#> ddcmd '-p; module main func run* +p' + :#> grep =p /proc/dynamic_debug/control + init/main.c:1424 [main]run_init_process =p " with arguments:\012" + init/main.c:1426 [main]run_init_process =p " %s\012" + init/main.c:1427 [main]run_init_process =p " with environment:\012" + init/main.c:1429 [main]run_init_process =p " %s\012" - nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control - # filename:lineno [module]function flags format - net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012" +Error messages go to console/syslog:: + + :#> ddcmd mode foo +p + dyndbg: unknown keyword "mode" + dyndbg: query parse failed + bash: echo: write error: Invalid argument + +If debugfs is also enabled and mounted, ``dynamic_debug/control`` is +also under the mount-dir, typically ``/sys/kernel/debug/``. Command Language Reference ========================== -At the lexical level, a command comprises a sequence of words separated +At the basic lexical level, a command is a sequence of words separated by spaces or tabs. So these are all equivalent:: - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control - nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' > - <debugfs>/dynamic_debug/control - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd file svcsock.c line 1603 +p + :#> ddcmd "file svcsock.c line 1603 +p" + :#> ddcmd ' file svcsock.c line 1603 +p ' Command submissions are bounded by a write() system call. Multiple commands can be written together, separated by ``;`` or ``\n``:: - ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \ - > <debugfs>/dynamic_debug/control + :#> ddcmd "func pnpacpi_get_resources +p; func pnp_assign_mem +p" + :#> ddcmd <<"EOC" + func pnpacpi_get_resources +p + func pnp_assign_mem +p + EOC + :#> cat query-batch-file > /proc/dynamic_debug/control -If your query set is big, you can batch them too:: +You can also use wildcards in each query term. The match rule supports +``*`` (matches zero or more characters) and ``?`` (matches exactly one +character). For example, you can match all usb drivers:: - ~# cat query-batch-file > <debugfs>/dynamic_debug/control + :#> ddcmd file "drivers/usb/*" +p # "" to suppress shell expansion -Another way is to use wildcards. The match rule supports ``*`` (matches -zero or more characters) and ``?`` (matches exactly one character). For -example, you can match all usb drivers:: - - ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control - -At the syntactical level, a command comprises a sequence of match -specifications, followed by a flags change specification:: +Syntactically, a command is pairs of keyword values, followed by a +flags change or setting:: command ::= match-spec* flags-spec -The match-spec's are used to choose a subset of the known pr_debug() -callsites to which to apply the flags-spec. Think of them as a query -with implicit ANDs between each pair. Note that an empty list of -match-specs will select all debug statement callsites. +The match-spec's select *prdbgs* from the catalog, upon which to apply +the flags-spec, all constraints are ANDed together. An absent keyword +is the same as keyword "*". + -A match specification comprises a keyword, which controls the -attribute of the callsite to be compared, and a value to compare -against. Possible keywords are::: +A match specification is a keyword, which selects the attribute of +the callsite to be compared, and a value to compare against. Possible +keywords are::: match-spec ::= 'func' string | 'file' string | 'module' string | 'format' string | + 'class' string | 'line' line-range line-range ::= lineno | @@ -203,6 +175,16 @@ format format "nfsd: SETATTR" // a neater way to match a format with whitespace format 'nfsd: SETATTR' // yet another way to match a format with whitespace +class + The given class_name is validated against each module, which may + have declared a list of known class_names. If the class_name is + found for a module, callsite & class matching and adjustment + proceeds. Examples:: + + class DRM_UT_KMS # a DRM.debug category + class JUNK # silent non-match + // class TLD_* # NOTICE: no wildcard in class names + line The given line number or range of line numbers is compared against the line number of each ``pr_debug()`` callsite. A single @@ -228,17 +210,16 @@ of the characters:: The flags are:: p enables the pr_debug() callsite. - f Include the function name in the printed message - l Include line number in the printed message - m Include module name in the printed message - t Include thread ID in messages not generated from interrupt context - _ No flags are set. (Or'd with others on input) + _ enables no flags. -For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag -have meaning, other flags ignored. + Decorator flags add to the message-prefix, in order: + t Include thread ID, or <intr> + m Include module name + f Include the function name + l Include line number -For display, the flags are preceded by ``=`` -(mnemonic: what the flags are currently equal to). +For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only +the ``p`` flag has meaning, other flags are ignored. Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification. To clear all flags at once, use ``=_`` or ``-flmpt``. @@ -313,7 +294,7 @@ For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or enabled by ``-DDEBUG`` flag during compilation) can be disabled later via the debugfs interface if the debug messages are no longer needed:: - echo "module module_name -p" > <debugfs>/dynamic_debug/control + echo "module module_name -p" > /proc/dynamic_debug/control Examples ======== @@ -321,37 +302,31 @@ Examples :: // enable the message at line 1603 of file svcsock.c - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'file svcsock.c line 1603 +p' // enable all the messages in file svcsock.c - nullarbor:~ # echo -n 'file svcsock.c +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'file svcsock.c +p' // enable all the messages in the NFS server module - nullarbor:~ # echo -n 'module nfsd +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'module nfsd +p' // enable all 12 messages in the function svc_process() - nullarbor:~ # echo -n 'func svc_process +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'func svc_process +p' // disable all 12 messages in the function svc_process() - nullarbor:~ # echo -n 'func svc_process -p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'func svc_process -p' // enable messages for NFS calls READ, READLINK, READDIR and READDIR+. - nullarbor:~ # echo -n 'format "nfsd: READ" +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'format "nfsd: READ" +p' // enable messages in files of which the paths include string "usb" - nullarbor:~ # echo -n 'file *usb* +p' > <debugfs>/dynamic_debug/control + :#> ddcmd 'file *usb* +p' > /proc/dynamic_debug/control // enable all messages - nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control + :#> ddcmd '+p' > /proc/dynamic_debug/control // add module, function to all enabled messages - nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control + :#> ddcmd '+mf' > /proc/dynamic_debug/control // boot-args example, with newlines and comments for readability Kernel command line: ... @@ -364,3 +339,38 @@ Examples dyndbg="file init/* +p #cmt ; func parse_one +p" // enable pr_debugs in 2 functions in a module loaded later pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p" + +Kernel Configuration +==================== + +Dynamic Debug is enabled via kernel config items:: + + CONFIG_DYNAMIC_DEBUG=y # build catalog, enables CORE + CONFIG_DYNAMIC_DEBUG_CORE=y # enable mechanics only, skip catalog + +If you do not want to enable dynamic debug globally (i.e. in some embedded +system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic +debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any +modules which you'd like to dynamically debug later. + + +Kernel *prdbg* API +================== + +The following functions are cataloged and controllable when dynamic +debug is enabled:: + + pr_debug() + dev_dbg() + print_hex_dump_debug() + print_hex_dump_bytes() + +Otherwise, they are off by default; ``ccflags += -DDEBUG`` or +``#define DEBUG`` in a source file will enable them appropriately. + +If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is +just a shortcut for ``print_hex_dump(KERN_DEBUG)``. + +For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is +its ``prefix_str`` argument, if it is constant string; or ``hexdump`` +in case ``prefix_str`` is built dynamically. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e92d63d3e878..a465d5242774 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -321,6 +321,8 @@ force_enable - Force enable the IOMMU on platforms known to be buggy with IOMMU enabled. Use this option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -1467,6 +1469,14 @@ Permit 'security.evm' to be updated regardless of current integrity status. + early_page_ext [KNL] Enforces page_ext initialization to earlier + stages so cover more early boot allocations. + Please note that as side effect some optimizations + might be disabled to achieve that (e.g. parallelized + memory initialization is disabled) so the boot process + might take longer, especially on systems with a lot of + memory. Available with CONFIG_PAGE_EXTENSION=y. + failslab= fail_usercopy= fail_page_alloc= @@ -2432,6 +2442,12 @@ 0: force disabled 1: force enabled + kunit.enable= [KUNIT] Enable executing KUnit tests. Requires + CONFIG_KUNIT to be set to be fully enabled. The + default value can be overridden via + KUNIT_DEFAULT_ENABLED. + Default is 1 (enabled) + kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs. Default is 0 (don't ignore, but inject #GP) @@ -3203,6 +3219,7 @@ spectre_v2_user=off [X86] spec_store_bypass_disable=off [X86,PPC] ssbd=force-off [ARM64] + nospectre_bhb [ARM64] l1tf=off [X86] mds=off [X86] tsx_async_abort=off [X86] @@ -3609,7 +3626,7 @@ nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings. - nohugevmalloc [PPC] Disable kernel huge vmalloc mappings. + nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings. nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. @@ -3622,11 +3639,15 @@ (bounds check bypass). With this option data leaks are possible in the system. - nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for + nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for the Spectre variant 2 (indirect branch prediction) vulnerability. System may allow data leaks with this option. + nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch + history injection) vulnerability. System may allow data leaks + with this option. + nospec_store_bypass_disable [HW] Disable all mitigations for the Speculative Store Bypass vulnerability @@ -3737,9 +3758,9 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time - accounting. steal time is computed, but won't - influence scheduler behaviour + no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized + steal time accounting. steal time is computed, but + won't influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. @@ -6028,12 +6049,6 @@ This parameter controls use of the Protected Execution Facility on pSeries. - swapaccount= [KNL] - Format: [0|1] - Enable accounting of swap in memory resource - controller if no parameter or 1 is given or disable - it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst) - swiotlb= [ARM,IA-64,PPC,MIPS,X86] Format: { <int> [,<int>] | force | noforce } <int> -- Number of I/O TLB slabs @@ -6836,6 +6851,12 @@ Crash from Xen panic notifier, without executing late panic() code such as dumping handler. + xen_msr_safe= [X86,XEN] + Format: <bool> + Select whether to always use non-faulting (safe) MSR + access functions when running as Xen PV guest. The + default value is controlled by CONFIG_XEN_PV_MSR_SAFE. + xen_nopvspin [X86,XEN] Disables the qspinlock slowpath using Xen PV optimizations. This parameter is obsoleted by "nopvspin" parameter, which diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 4e06ffabd78a..7367e6294ef6 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -5,10 +5,10 @@ CMA Debugfs Interface The CMA debugfs interface is useful to retrieve basic information out of the different CMA areas and to test allocation/release in each of the areas. -Each CMA zone represents a directory under <debugfs>/cma/, indexed by the -kernel's CMA index. So the first CMA zone would be: +Each CMA area represents a directory under <debugfs>/cma/, represented by +its CMA name like below: - <debugfs>/cma/cma-0 + <debugfs>/cma/<cma_name> The structure of the files created under that directory is as follows: @@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows: - [RO] bitmap: The bitmap of page states in the zone. - [WO] alloc: Allocate N pages from that CMA area. For example:: - echo 5 > <debugfs>/cma/cma-2/alloc + echo 5 > <debugfs>/cma/<cma_name>/alloc -would try to allocate 5 pages from the cma-2 area. +would try to allocate 5 pages from the 'cma_name' area. - [WO] free: Free N pages from that CMA area, similar to the above. diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 05500042f777..33d37bb2fb4e 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -======================== -Monitoring Data Accesses -======================== +========================== +DAMON: Data Access MONitor +========================== :doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. Using DAMON, users can analyze the memory access patterns of their systems and diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 4d5ca2c46288..9f88afc734da 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at https://github.com/awslabs/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. -Because DAMO is using the debugfs interface (refer to :doc:`usage` for the -detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as -below:: - - # mount -t debugfs none /sys/kernel/debug/ - -or append the following line to your ``/etc/fstab`` file so that your system -can automatically mount debugfs upon booting:: - - debugfs /sys/kernel/debug debugfs defaults 0 0 +Because DAMO is using the sysfs interface (refer to :doc:`usage` for the +detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is +mounted. Recording Data Access Patterns diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ca91ecc29078..b47b0cbbd491 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -393,6 +393,11 @@ the files as above. Above is only for an example. debugfs Interface ================= +.. note:: + + DAMON debugfs interface will be removed after next LTS kernel is released, so + users should move to the :ref:`sysfs interface <sysfs_interface>`. + DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1bd11118dfb1..d1064e0ba34a 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + multigen_lru nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index b244f0202a03..fb6ba2002a4b 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must be increased accordingly. +Monitoring KSM profit +===================== + +KSM can save memory by merging identical pages, but also can consume +additional memory, because it needs to generate a number of rmap_items to +save each scanned page's brief rmap information. Some of these pages may +be merged, but some may not be abled to be merged after being checked +several times, which are unprofitable memory consumed. + +1) How to determine whether KSM save memory or consume memory in system-wide + range? Here is a simple approximate calculation for reference:: + + general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * + sizeof(rmap_item); + + where all_rmap_items can be easily obtained by summing ``pages_sharing``, + ``pages_shared``, ``pages_unshared`` and ``pages_volatile``. + +2) The KSM profit inner a single process can be similarly obtained by the + following approximate calculation:: + + process_profit =~ ksm_merging_pages * sizeof(page) - + ksm_rmap_items * sizeof(rmap_item). + + where ksm_merging_pages is shown under the directory ``/proc/<pid>/``, + and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. + +From the perspective of application, a high ratio of ``ksm_rmap_items`` to +``ksm_merging_pages`` means a bad madvise-applied policy, so developers or +administrators have to rethink how to change madvise policy. Giving an example +for reference, a page's size is usually 4K, and the rmap_item's size is +separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture. +so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU +or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped, +because the ksm profit is approximately zero or negative. + Monitoring KSM events ===================== diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst new file mode 100644 index 000000000000..33e068830497 --- /dev/null +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -0,0 +1,162 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Quick start +=========== +Build the kernel with the following configurations. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` + +All set! + +Runtime options +=============== +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the +following subsections. + +Kill switch +----------- +``enabled`` accepts different values to enable or disable the +following components. Its default value depends on +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled +unless some of them have unforeseen side effects. Writing to +``enabled`` has no effect when a component is not supported by the +hardware, and valid values will be accepted even when the main switch +is off. + +====== =============================================================== +Values Components +====== =============================================================== +0x0001 The main switch for the multi-gen LRU. +0x0002 Clearing the accessed bit in leaf page table entries in large + batches, when MMU sets it (e.g., on x86). This behavior can + theoretically worsen lock contention (mmap_lock). If it is + disabled, the multi-gen LRU will suffer a minor performance + degradation for workloads that contiguously map hot pages, + whose accessed bits can be otherwise cleared by fewer larger + batches. +0x0004 Clearing the accessed bit in non-leaf page table entries as + well, when MMU sets it (e.g., on x86). This behavior was not + verified on x86 varieties other than Intel and AMD. If it is + disabled, the multi-gen LRU will suffer a negligible + performance degradation. +[yYnN] Apply to all the components above. +====== =============================================================== + +E.g., +:: + + echo y >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0007 + echo 5 >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0005 + +Thrashing prevention +-------------------- +Personal computers are more sensitive to thrashing because it can +cause janks (lags when rendering UI) and negatively impact user +experience. The multi-gen LRU offers thrashing prevention to the +majority of laptop and desktop users who do not have ``oomd``. + +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of +``N`` milliseconds from getting evicted. The OOM killer is triggered +if this working set cannot be kept in memory. In other words, this +option works as an adjustable pressure relief valve, and when open, it +terminates applications that are hopefully not being used. + +Based on the average human detectable lag (~100ms), ``N=1000`` usually +eliminates intolerable janks due to thrashing. Larger values like +``N=3000`` make janks less noticeable at the risk of premature OOM +kills. + +The default value ``0`` means disabled. + +Experimental features +===================== +``/sys/kernel/debug/lru_gen`` accepts commands described in the +following subsections. Multiple command lines are supported, so does +concatenation with delimiters ``,`` and ``;``. + +``/sys/kernel/debug/lru_gen_full`` provides additional stats for +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from +evicted generations in this file. + +Working set estimation +---------------------- +Working set estimation measures how much memory an application needs +in a given time interval, and it is usually done with little impact on +the performance of the application. E.g., data centers want to +optimize job scheduling (bin packing) to improve memory utilizations. +When a new job comes in, the job scheduler needs to find out whether +each server it manages can allocate a certain amount of memory for +this new job before it can pick a candidate. To do so, the job +scheduler needs to estimate the working sets of the existing jobs. + +When it is read, ``lru_gen`` returns a histogram of numbers of pages +accessed over different time intervals for each memcg and node. +``MAX_NR_GENS`` decides the number of bins for each histogram. The +histograms are noncumulative. +:: + + memcg memcg_id memcg_path + node node_id + min_gen_nr age_in_ms nr_anon_pages nr_file_pages + ... + max_gen_nr age_in_ms nr_anon_pages nr_file_pages + +Each bin contains an estimated number of pages that have been accessed +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of +the former is the largest and that of the latter is the smallest. + +Users can write the following command to ``lru_gen`` to create a new +generation ``max_gen_nr+1``: + + ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` + +``can_swap`` defaults to the swap setting and, if it is set to ``1``, +it forces the scan of anon pages when swap is off, and vice versa. +``force_scan`` defaults to ``1`` and, if it is set to ``0``, it +employs heuristics to reduce the overhead, which is likely to reduce +the coverage as well. + +A typical use case is that a job scheduler runs this command at a +certain time interval to create new generations, and it ranks the +servers it manages based on the sizes of their cold pages defined by +this time interval. + +Proactive reclaim +----------------- +Proactive reclaim induces page reclaim when there is no memory +pressure. It usually targets cold pages only. E.g., when a new job +comes in, the job scheduler wants to proactively reclaim cold pages on +the server it selected, to improve the chance of successfully landing +this new job. + +Users can write the following command to ``lru_gen`` to evict +generations less than or equal to ``min_gen_nr``. + + ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` + +``min_gen_nr`` should be less than ``max_gen_nr-1``, since +``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to +the active list) and therefore cannot be evicted. ``swappiness`` +overrides the default value in ``/proc/sys/vm/swappiness``. +``nr_to_reclaim`` limits the number of pages to evict. + +A typical use case is that a job scheduler runs this command before it +tries to land a new job on a server. If it fails to materialize enough +cold pages because of the overestimation, it retries on the next +server according to the ranking result obtained from the working set +estimation step. This less forceful approach limits the impacts on the +existing jobs. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index c9c37f16eef8..8ee78ec232eb 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt:: /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs -The khugepaged progress can be seen in the number of pages collapsed:: +The khugepaged progress can be seen in the number of pages collapsed (note +that this counter may not be an exact count of the number of pages +collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping +being replaced by a PMD mapping, or (2) All 4K physical pages replaced by +one 2M hugepage. Each may happen independently, or together, depending on +the type of memory and the failures that occur. As such, this value should +be interpreted roughly as a sign of progress, and counters in /proc/vmstat +consulted for more accurate accounting):: /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed @@ -366,10 +373,9 @@ thp_split_pmd page table entry. thp_zero_page_alloc - is incremented every time a huge zero page is - successfully allocated. It includes allocations which where - dropped due race with other allocation. Note, it doesn't count - every map of the huge zero page, only its allocation. + is incremented every time a huge zero page used for thp is + successfully allocated. Note, it doesn't count every map of + the huge zero page, only its allocation. thp_zero_page_alloc_failed is incremented if kernel fails to allocate diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 6528036093e1..83f31919ebb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick. Design ====== -Userfaults are delivered and resolved through the ``userfaultfd`` syscall. +Userspace creates a new userfaultfd, initializes it, and registers one or more +regions of virtual memory with it. Then, any page faults which occur within the +region(s) result in a message being delivered to the userfaultfd, notifying +userspace of the fault. The ``userfaultfd`` (aside from registering and unregistering virtual memory ranges) provides two primary functionalities: @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory management of mremap/mprotect is that the userfaults in all their operations never involve heavyweight structures like vmas (in fact the ``userfaultfd`` runtime load never takes the mmap_lock for writing). - Vmas are not suitable for page- (or hugepage) granular fault tracking when dealing with virtual address spaces that could span Terabytes. Too many vmas would be needed for that. -The ``userfaultfd`` once opened by invoking the syscall, can also be +The ``userfaultfd``, once created, can also be passed using unix domain sockets to a manager process, so the same manager process could handle the userfaults of a multitude of different processes without them being aware about what is going on @@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``). API === +Creating a userfaultfd +---------------------- + +There are two ways to create a new userfaultfd, each of which provide ways to +restrict access to this functionality (since historically userfaultfds which +handle kernel page faults have been a useful tool for exploiting the kernel). + +The first way, supported since userfaultfd was introduced, is the +userfaultfd(2) syscall. Access to this is controlled in several ways: + +- Any user can always create a userfaultfd which traps userspace page faults + only. Such a userfaultfd can be created using the userfaultfd(2) syscall + with the flag UFFD_USER_MODE_ONLY. + +- In order to also trap kernel page faults for the address space, either the + process needs the CAP_SYS_PTRACE capability, or the system must have + vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd + is set to 0. + +The second way, added to the kernel more recently, is by opening +/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method +yields equivalent userfaultfds to the userfaultfd(2) syscall. + +Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal +filesystem permissions (user/group/mode), which gives fine grained access to +userfaultfd specifically, without also granting other unrelated privileges at +the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access +to /dev/userfaultfd can always create userfaultfds that trap kernel page faults; +vm.unprivileged_userfaultfd is not considered. + +Initializing a userfaultfd +-------------------------- + When first opened the ``userfaultfd`` must be enabled invoking the ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or a later API version) which will specify the ``read/POLLIN`` protocol diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst new file mode 100644 index 000000000000..11de998bb480 --- /dev/null +++ b/Documentation/admin-guide/perf/alibaba_pmu.rst @@ -0,0 +1,100 @@ +============================================================= +Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU) +============================================================= + +The Yitian 710, custom-built by Alibaba Group's chip development business, +T-Head, implements uncore PMU for performance and functional debugging to +facilitate system maintenance. + +DDR Sub-System Driveway (DRW) PMU Driver +========================================= + +Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel +is independent of others to service system memory requests. And one DDR5 +channel is split into two independent sub-channels. The DDR Sub-System Driveway +implements separate PMUs for each sub-channel to monitor various performance +metrics. + +The Driveway PMU devices are named as ali_drw_<sys_base_addr> with perf. +For example, ali_drw_21000 and ali_drw_21080 are two PMU devices for two +sub-channels of the same channel in die 0. And the PMU device of die 1 is +prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000. + +Each sub-channel has 36 PMU counters in total, which is classified into +four groups: + +- Group 0: PMU Cycle Counter. This group has one pair of counters + pmu_cycle_cnt_low and pmu_cycle_cnt_high, that is used as the cycle count + based on DDRC core clock. + +- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used + to count the total access number of either the eight bank groups in a + selected rank, or four ranks separately in the first 4 counters. The base + transfer unit is 64B. + +- Group 2: PMU Retry Counters. This group has 10 counters, that intend to + count the total retry number of each type of uncorrectable error. + +- Group 3: PMU Common Counters. This group has 16 counters, that are used + to count the common events. + +For now, the Driveway PMU driver only uses counters in group 0 and group 3. + +The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution +for connecting an SoC application bus to DDR memory devices. The DDRCTL +receives transactions Host Interface (HIF) which is custom-defined by Synopsys. +These transactions are queued internally and scheduled for access while +satisfying the SDRAM protocol timing requirements, transaction priorities, and +dependencies between the transactions. The DDRCTL in turn issues commands on +the DDR PHY Interface (DFI) to the PHY module, which launches and captures data +to and from the SDRAM. The driveway PMUs have hardware logic to gather +statistics and performance logging signals on HIF, DFI, etc. + +By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF +interface, we could calculate the bandwidth. Example usage of counting memory +data bandwidth:: + + perf stat \ + -e ali_drw_21000/hif_wr/ \ + -e ali_drw_21000/hif_rd/ \ + -e ali_drw_21000/hif_rmw/ \ + -e ali_drw_21000/cycle/ \ + -e ali_drw_21080/hif_wr/ \ + -e ali_drw_21080/hif_rd/ \ + -e ali_drw_21080/hif_rmw/ \ + -e ali_drw_21080/cycle/ \ + -e ali_drw_23000/hif_wr/ \ + -e ali_drw_23000/hif_rd/ \ + -e ali_drw_23000/hif_rmw/ \ + -e ali_drw_23000/cycle/ \ + -e ali_drw_23080/hif_wr/ \ + -e ali_drw_23080/hif_rd/ \ + -e ali_drw_23080/hif_rmw/ \ + -e ali_drw_23080/cycle/ \ + -e ali_drw_25000/hif_wr/ \ + -e ali_drw_25000/hif_rd/ \ + -e ali_drw_25000/hif_rmw/ \ + -e ali_drw_25000/cycle/ \ + -e ali_drw_25080/hif_wr/ \ + -e ali_drw_25080/hif_rd/ \ + -e ali_drw_25080/hif_rmw/ \ + -e ali_drw_25080/cycle/ \ + -e ali_drw_27000/hif_wr/ \ + -e ali_drw_27000/hif_rd/ \ + -e ali_drw_27000/hif_rmw/ \ + -e ali_drw_27000/cycle/ \ + -e ali_drw_27080/hif_wr/ \ + -e ali_drw_27080/hif_rd/ \ + -e ali_drw_27080/hif_rmw/ \ + -e ali_drw_27080/cycle/ -- sleep 10 + +The average DRAM bandwidth can be calculated as follows: + +- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle +- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle + +Here, DDRC_WIDTH = 64 bytes. + +The current driver does not support sampling. So "perf record" is +unsupported. Also attach to a task is unsupported as the events are all +uncore. diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 9c9ece88ce53..793e1970bc05 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -18,3 +18,4 @@ Performance monitor support xgene-pmu arm_dsu_pmu thunderx2-pmu + alibaba_pmu diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 83b58eb4ab4d..8f3d30c5a0d8 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -182,6 +182,7 @@ to the ``struct sugov_cpu`` that the utilization update belongs to. Then, ``amd-pstate`` updates the desired performance according to the CPU scheduler assigned. +.. _processor_support: Processor Support ======================= @@ -282,6 +283,8 @@ efficiency frequency management method on AMD processors. Kernel Module Options for ``amd-pstate`` ========================================= +.. _shared_mem: + ``shared_mem`` Use a module param (shared_mem) to enable related processors manually with **amd_pstate.shared_mem=1**. @@ -393,6 +396,76 @@ about part of the output. :: CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40 CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264 +Unit Tests for amd-pstate +------------------------- + +``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver. + + * It can help all users to verify their processor support (SBIOS/Firmware or Hardware). + + * Kernel can have a basic function test to avoid the kernel regression during the update. + + * We can introduce more functional or performance tests to align the result together, it will benefit power and performance scale optimization. + +1. Test case decriptions + + +---------+--------------------------------+------------------------------------------------------------------------------------+ + | Index | Functions | Description | + +=========+================================+====================================================================================+ + | 0 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. | + | | || | + | | || The detail refer to `Processor Support <processor_support_>`_. | + +---------+--------------------------------+------------------------------------------------------------------------------------+ + | 1 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. | + | | || | + | | || AMD P-States and ACPI hardware P-States always can be supported in one processor. | + | | | But AMD P-States has the higher priority and if it is enabled with | + | | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the | + | | | request from AMD P-States. | + +---------+--------------------------------+------------------------------------------------------------------------------------+ + | 2 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. | + | | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. | + +---------+--------------------------------+------------------------------------------------------------------------------------+ + | 3 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode | + | | | are reasonable. | + | | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 | + | | || If boost is not active but supported, this maximum frequency will be larger than | + | | | the one in ``cpuinfo``. | + +---------+--------------------------------+------------------------------------------------------------------------------------+ + +#. How to execute the tests + + We use test module in the kselftest frameworks to implement it. + We create amd-pstate-ut module and tie it into kselftest.(for + details refer to Linux Kernel Selftests [4]_). + + 1. Build + + + open the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option. + + set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M. + + make project + + make selftest :: + + $ cd linux + $ make -C tools/testing/selftests + + #. Installation & Steps :: + + $ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest + $ sudo ./kselftest/run_kselftest.sh -c amd-pstate + TAP version 13 + 1..1 + # selftests: amd-pstate: amd-pstate-ut.sh + # amd-pstate-ut: ok + ok 1 selftests: amd-pstate: amd-pstate-ut.sh + + #. Results :: + + $ dmesg | grep "amd_pstate_ut" | tee log.txt + [12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success! + [12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success! + [12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success! + [12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success! Reference =========== @@ -405,3 +478,6 @@ Reference .. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip + +.. [4] Linux Kernel Selftests, + https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index ee6572b1edad..98d1b198b2b4 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -65,6 +65,11 @@ combining the following values: 4 s3_beep = ======= +arch +==== + +The machine hardware name, the same output as ``uname -m`` +(e.g. ``x86_64`` or ``aarch64``). auto_msgmni =========== @@ -635,6 +640,17 @@ different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too. +numa_balancing_promote_rate_limit_MBps +====================================== + +Too high promotion/demotion throughput between different memory types +may hurt application latency. This can be used to rate limit the +promotion throughput. The per-node max promotion throughput in MB/s +will be limited to be no more than the set value. + +A rule of thumb is to set this to less than 1/10 of the PMEM node +write bandwidth. + oops_all_cpu_backtrace ====================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 9b833e439f09..988f6a4c8084 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -926,6 +926,9 @@ calls without any restrictions. The default value is 0. +Another way to control permissions for userfaultfd is to use +/dev/userfaultfd instead of userfaultfd(2). See +Documentation/admin-guide/mm/userfaultfd.rst. user_reserve_kbytes =================== |