diff options
Diffstat (limited to 'Documentation/admin-guide')
27 files changed, 1560 insertions, 187 deletions
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index a6ba95fbaa9f..ce63be6d64ad 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -47,3 +47,4 @@ subdirectories. tomoyo Yama SafeSetID + ipe diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst new file mode 100644 index 000000000000..f38e641df0e9 --- /dev/null +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -0,0 +1,790 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Integrity Policy Enforcement (IPE) +================================== + +.. NOTE:: + + This is the documentation for admins, system builders, or individuals + attempting to use IPE. If you're looking for more developer-focused + documentation about IPE please see :doc:`the design docs </security/ipe>`. + +Overview +-------- + +Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a +complementary approach to access control. Unlike traditional access control +mechanisms that rely on labels and paths for decision-making, IPE focuses +on the immutable security properties inherent to system components. These +properties are fundamental attributes or features of a system component +that cannot be altered, ensuring a consistent and reliable basis for +security decisions. + +To elaborate, in the context of IPE, system components primarily refer to +files or the devices these files reside on. However, this is just a +starting point. The concept of system components is flexible and can be +extended to include new elements as the system evolves. The immutable +properties include the origin of a file, which remains constant and +unchangeable over time. For example, IPE policies can be crafted to trust +files originating from the initramfs. Since initramfs is typically verified +by the bootloader, its files are deemed trustworthy; "file is from +initramfs" becomes an immutable property under IPE's consideration. + +The immutable property concept extends to the security features enabled on +a file's origin, such as dm-verity or fs-verity, which provide a layer of +integrity and trust. For example, IPE allows the definition of policies +that trust files from a dm-verity protected device. dm-verity ensures the +integrity of an entire device by providing a verifiable and immutable state +of its contents. Similarly, fs-verity offers filesystem-level integrity +checks, allowing IPE to enforce policies that trust files protected by +fs-verity. These two features cannot be turned off once established, so +they are considered immutable properties. These examples demonstrate how +IPE leverages immutable properties, such as a file's origin and its +integrity protection mechanisms, to make access control decisions. + +For the IPE policy, specifically, it grants the ability to enforce +stringent access controls by assessing security properties against +reference values defined within the policy. This assessment can be based on +the existence of a security property (e.g., verifying if a file originates +from initramfs) or evaluating the internal state of an immutable security +property. The latter includes checking the roothash of a dm-verity +protected device, determining whether dm-verity possesses a valid +signature, assessing the digest of a fs-verity protected file, or +determining whether fs-verity possesses a valid built-in signature. This +nuanced approach to policy enforcement enables a highly secure and +customizable system defense mechanism, tailored to specific security +requirements and trust models. + +To enable IPE, ensure that ``CONFIG_SECURITY_IPE`` (under +:menuselection:`Security -> Integrity Policy Enforcement (IPE)`) config +option is enabled. + +Use Cases +--------- + +IPE works best in fixed-function devices: devices in which their purpose +is clearly defined and not supposed to be changed (e.g. network firewall +device in a data center, an IoT device, etcetera), where all software and +configuration is built and provisioned by the system owner. + +IPE is a long-way off for use in general-purpose computing: the Linux +community as a whole tends to follow a decentralized trust model (known as +the web of trust), which IPE has no support for it yet. Instead, IPE +supports PKI (public key infrastructure), which generally designates a +set of trusted entities that provide a measure of absolute trust. + +Additionally, while most packages are signed today, the files inside +the packages (for instance, the executables), tend to be unsigned. This +makes it difficult to utilize IPE in systems where a package manager is +expected to be functional, without major changes to the package manager +and ecosystem behind it. + +The digest_cache LSM [#digest_cache_lsm]_ is a system that when combined with IPE, +could be used to enable and support general-purpose computing use cases. + +Known Limitations +----------------- + +IPE cannot verify the integrity of anonymous executable memory, such as +the trampolines created by gcc closures and libffi (<3.4.2), or JIT'd code. +Unfortunately, as this is dynamically generated code, there is no way +for IPE to ensure the integrity of this code to form a trust basis. + +IPE cannot verify the integrity of programs written in interpreted +languages when these scripts are invoked by passing these program files +to the interpreter. This is because the way interpreters execute these +files; the scripts themselves are not evaluated as executable code +through one of IPE's hooks, but they are merely text files that are read +(as opposed to compiled executables) [#interpreters]_. + +Threat Model +------------ + +IPE specifically targets the risk of tampering with user-space executable +code after the kernel has initially booted, including the kernel modules +loaded from userspace via ``modprobe`` or ``insmod``. + +To illustrate, consider a scenario where an untrusted binary, possibly +malicious, is downloaded along with all necessary dependencies, including a +loader and libc. The primary function of IPE in this context is to prevent +the execution of such binaries and their dependencies. + +IPE achieves this by verifying the integrity and authenticity of all +executable code before allowing them to run. It conducts a thorough +check to ensure that the code's integrity is intact and that they match an +authorized reference value (digest, signature, etc) as per the defined +policy. If a binary does not pass this verification process, either +because its integrity has been compromised or it does not meet the +authorization criteria, IPE will deny its execution. Additionally, IPE +generates audit logs which may be utilized to detect and analyze failures +resulting from policy violation. + +Tampering threat scenarios include modification or replacement of +executable code by a range of actors including: + +- Actors with physical access to the hardware +- Actors with local network access to the system +- Actors with access to the deployment system +- Compromised internal systems under external control +- Malicious end users of the system +- Compromised end users of the system +- Remote (external) compromise of the system + +IPE does not mitigate threats arising from malicious but authorized +developers (with access to a signing certificate), or compromised +developer tools used by them (i.e. return-oriented programming attacks). +Additionally, IPE draws hard security boundary between userspace and +kernelspace. As a result, kernel-level exploits are considered outside +the scope of IPE and mitigation is left to other mechanisms. + +Policy +------ + +IPE policy is a plain-text [#devdoc]_ policy composed of multiple statements +over several lines. There is one required line, at the top of the +policy, indicating the policy name, and the policy version, for +instance:: + + policy_name=Ex_Policy policy_version=0.0.0 + +The policy name is a unique key identifying this policy in a human +readable name. This is used to create nodes under securityfs as well as +uniquely identify policies to deploy new policies vs update existing +policies. + +The policy version indicates the current version of the policy (NOT the +policy syntax version). This is used to prevent rollback of policy to +potentially insecure previous versions of the policy. + +The next portion of IPE policy are rules. Rules are formed by key=value +pairs, known as properties. IPE rules require two properties: ``action``, +which determines what IPE does when it encounters a match against the +rule, and ``op``, which determines when the rule should be evaluated. +The ordering is significant, a rule must start with ``op``, and end with +``action``. Thus, a minimal rule is:: + + op=EXECUTE action=ALLOW + +This example will allow any execution. Additional properties are used to +assess immutable security properties about the files being evaluated. +These properties are intended to be descriptions of systems within the +kernel that can provide a measure of integrity verification, such that IPE +can determine the trust of the resource based on the value of the property. + +Rules are evaluated top-to-bottom. As a result, any revocation rules, +or denies should be placed early in the file to ensure that these rules +are evaluated before a rule with ``action=ALLOW``. + +IPE policy supports comments. The character '#' will function as a +comment, ignoring all characters to the right of '#' until the newline. + +The default behavior of IPE evaluations can also be expressed in policy, +through the ``DEFAULT`` statement. This can be done at a global level, +or a per-operation level:: + + # Global + DEFAULT action=ALLOW + + # Operation Specific + DEFAULT op=EXECUTE action=ALLOW + +A default must be set for all known operations in IPE. If you want to +preserve older policies being compatible with newer kernels that can introduce +new operations, set a global default of ``ALLOW``, then override the +defaults on a per-operation basis (as above). + +With configurable policy-based LSMs, there's several issues with +enforcing the configurable policies at startup, around reading and +parsing the policy: + +1. The kernel *should* not read files from userspace, so directly reading + the policy file is prohibited. +2. The kernel command line has a character limit, and one kernel module + should not reserve the entire character limit for its own + configuration. +3. There are various boot loaders in the kernel ecosystem, so handing + off a memory block would be costly to maintain. + +As a result, IPE has addressed this problem through a concept of a "boot +policy". A boot policy is a minimal policy which is compiled into the +kernel. This policy is intended to get the system to a state where +userspace is set up and ready to receive commands, at which point a more +complex policy can be deployed via securityfs. The boot policy can be +specified via ``SECURITY_IPE_BOOT_POLICY`` config option, which accepts +a path to a plain-text version of the IPE policy to apply. This policy +will be compiled into the kernel. If not specified, IPE will be disabled +until a policy is deployed and activated through securityfs. + +Deploying Policies +~~~~~~~~~~~~~~~~~~ + +Policies can be deployed from userspace through securityfs. These policies +are signed through the PKCS#7 message format to enforce some level of +authorization of the policies (prohibiting an attacker from gaining +unconstrained root, and deploying an "allow all" policy). These +policies must be signed by a certificate that chains to the +``SYSTEM_TRUSTED_KEYRING``. With openssl, the policy can be signed by:: + + openssl smime -sign \ + -in "$MY_POLICY" \ + -signer "$MY_CERTIFICATE" \ + -inkey "$MY_PRIVATE_KEY" \ + -noattr \ + -nodetach \ + -nosmimecap \ + -outform der \ + -out "$MY_POLICY.p7b" + +Deploying the policies is done through securityfs, through the +``new_policy`` node. To deploy a policy, simply cat the file into the +securityfs node:: + + cat "$MY_POLICY.p7b" > /sys/kernel/security/ipe/new_policy + +Upon success, this will create one subdirectory under +``/sys/kernel/security/ipe/policies/``. The subdirectory will be the +``policy_name`` field of the policy deployed, so for the example above, +the directory will be ``/sys/kernel/security/ipe/policies/Ex_Policy``. +Within this directory, there will be seven files: ``pkcs7``, ``policy``, +``name``, ``version``, ``active``, ``update``, and ``delete``. + +The ``pkcs7`` file is read-only. Reading it returns the raw PKCS#7 data +that was provided to the kernel, representing the policy. If the policy being +read is the boot policy, this will return ``ENOENT``, as it is not signed. + +The ``policy`` file is read only. Reading it returns the PKCS#7 inner +content of the policy, which will be the plain text policy. + +The ``active`` file is used to set a policy as the currently active policy. +This file is rw, and accepts a value of ``"1"`` to set the policy as active. +Since only a single policy can be active at one time, all other policies +will be marked inactive. The policy being marked active must have a policy +version greater or equal to the currently-running version. + +The ``update`` file is used to update a policy that is already present +in the kernel. This file is write-only and accepts a PKCS#7 signed +policy. Two checks will always be performed on this policy: First, the +``policy_names`` must match with the updated version and the existing +version. Second the updated policy must have a policy version greater than +or equal to the currently-running version. This is to prevent rollback attacks. + +The ``delete`` file is used to remove a policy that is no longer needed. +This file is write-only and accepts a value of ``1`` to delete the policy. +On deletion, the securityfs node representing the policy will be removed. +However, delete the current active policy is not allowed and will return +an operation not permitted error. + +Similarly, writing to both ``update`` and ``new_policy`` could result in +bad message(policy syntax error) or file exists error. The latter error happens +when trying to deploy a policy with a ``policy_name`` while the kernel already +has a deployed policy with the same ``policy_name``. + +Deploying a policy will *not* cause IPE to start enforcing the policy. IPE will +only enforce the policy marked active. Note that only one policy can be active +at a time. + +Once deployment is successful, the policy can be activated, by writing file +``/sys/kernel/security/ipe/policies/$policy_name/active``. +For example, the ``Ex_Policy`` can be activated by:: + + echo 1 > "/sys/kernel/security/ipe/policies/Ex_Policy/active" + +From above point on, ``Ex_Policy`` is now the enforced policy on the +system. + +IPE also provides a way to delete policies. This can be done via the +``delete`` securityfs node, +``/sys/kernel/security/ipe/policies/$policy_name/delete``. +Writing ``1`` to that file deletes the policy:: + + echo 1 > "/sys/kernel/security/ipe/policies/$policy_name/delete" + +There is only one requirement to delete a policy: the policy being deleted +must be inactive. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack), all + writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Modes +~~~~~ + +IPE supports two modes of operation: permissive (similar to SELinux's +permissive mode) and enforced. In permissive mode, all events are +checked and policy violations are logged, but the policy is not really +enforced. This allows users to test policies before enforcing them. + +The default mode is enforce, and can be changed via the kernel command +line parameter ``ipe.enforce=(0|1)``, or the securityfs node +``/sys/kernel/security/ipe/enforce``. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Audit Events +~~~~~~~~~~~~ + +1420 AUDIT_IPE_ACCESS +^^^^^^^^^^^^^^^^^^^^^ +Event Examples:: + + type=1420 audit(1653364370.067:61): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2241 comm="ld-linux.so" path="/deny/lib/libc.so.6" dev="sda2" ino=14549020 rule="DEFAULT action=DENY" + type=1300 audit(1653364370.067:61): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=7f1105a28000 a1=195000 a2=5 a3=812 items=0 ppid=2219 pid=2241 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="ld-linux.so" exe="/tmp/ipe-test/lib/ld-linux.so" subj=unconfined key=(null) + type=1327 audit(1653364370.067:61): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + + type=1420 audit(1653364735.161:64): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2472 comm="mmap_test" path=? dev=? ino=? rule="DEFAULT action=DENY" + type=1300 audit(1653364735.161:64): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=0 a1=1000 a2=4 a3=21 items=0 ppid=2219 pid=2472 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="mmap_test" exe="/root/overlake_test/upstream_test/vol_fsverity/bin/mmap_test" subj=unconfined key=(null) + type=1327 audit(1653364735.161:64): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + +This event indicates that IPE made an access control decision; the IPE +specific record (1420) is always emitted in conjunction with a +``AUDITSYSCALL`` record. + +Determining whether IPE is in permissive or enforced mode can be derived +from ``success`` property and exit code of the ``AUDITSYSCALL`` record. + + +Field descriptions: + ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===========+============+===========+=================================================================================+ +| ipe_op | string | No | The IPE operation name associated with the log | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ipe_hook | string | No | The name of the LSM hook that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| enforcing | integer | No | The current IPE enforcing state 1 is in enforcing mode, 0 is in permissive mode | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| pid | integer | No | The pid of the process that triggered the IPE event. | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| comm | string | No | The command line program name of the process that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| path | string | Yes | The absolute path to the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ino | integer | Yes | The inode number of the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| dev | string | Yes | The device name of the evaluated file, e.g. vda | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| rule | string | No | The matched policy rule | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ + +1421 AUDIT_IPE_CONFIG_CHANGE +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1421 audit(1653425583.136:54): old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425583.136:54): SYSCALL arch=c000003e syscall=1 success=yes exit=2 a0=3 a1=5596fcae1fb0 a2=2 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425583.136:54): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2 + +This event indicates that IPE switched the active poliy from one to another +along with the version and the hash digest of the two policies. +Note IPE can only have one policy active at a time, all access decision +evaluation is based on the current active policy. +The normal procedure to deploy a new policy is loading the policy to deploy +into the kernel first, then switch the active policy to it. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++------------------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++========================+============+===========+===================================================+ +| old_active_pol_name | string | Yes | The name of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_active_pol_version | string | Yes | The version of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_policy_digest | string | Yes | The hash of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_name | string | No | The name of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_version | string | No | The version of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_policy_digest | string | No | The hash of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++------------------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++------------------------+------------+-----------+---------------------------------------------------+ + +1422 AUDIT_IPE_POLICY_LOAD +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E + +This record indicates a new policy has been loaded into the kernel with the policy name, policy version and policy hash. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++----------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+===================================================+ +| policy_name | string | No | The policy_name | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_version | string | No | The policy_version | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_digest | string | No | The policy hash | ++----------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+---------------------------------------------------+ + + +1404 AUDIT_MAC_STATUS +^^^^^^^^^^^^^^^^^^^^^ + +Event Examples:: + + type=1404 audit(1653425689.008:55): enforcing=0 old_enforcing=1 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + + type=1404 audit(1653425689.008:55): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===============+============+===========+=================================================================================================+ +| enforcing | integer | No | The enforcing state IPE is being switched to, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old_enforcing | integer | No | The enforcing state IPE is being switched from, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| enabled | integer | No | The new TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old-enabled | integer | No | The old TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ + + +Success Auditing +^^^^^^^^^^^^^^^^ + +IPE supports success auditing. When enabled, all events that pass IPE +policy and are not blocked will emit an audit event. This is disabled by +default, and can be enabled via the kernel command line +``ipe.success_audit=(0|1)`` or +``/sys/kernel/security/ipe/success_audit`` securityfs file. + +This is *very* noisy, as IPE will check every userspace binary on the +system, but is useful for debugging policies. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Properties +---------- + +As explained above, IPE properties are ``key=value`` pairs expressed in IPE +policy. Two properties are built-into the policy parser: 'op' and 'action'. +The other properties are used to restrict immutable security properties +about the files being evaluated. Currently those properties are: +'``boot_verified``', '``dmverity_signature``', '``dmverity_roothash``', +'``fsverity_signature``', '``fsverity_digest``'. A description of all +properties supported by IPE are listed below: + +op +~~ + +Indicates the operation for a rule to apply to. Must be in every rule, +as the first token. IPE supports the following operations: + + ``EXECUTE`` + + Pertains to any file attempting to be executed, or loaded as an + executable. + + ``FIRMWARE``: + + Pertains to firmware being loaded via the firmware_class interface. + This covers both the preallocated buffer and the firmware file + itself. + + ``KMODULE``: + + Pertains to loading kernel modules via ``modprobe`` or ``insmod``. + + ``KEXEC_IMAGE``: + + Pertains to kernel images loading via ``kexec``. + + ``KEXEC_INITRAMFS`` + + Pertains to initrd images loading via ``kexec --initrd``. + + ``POLICY``: + + Controls loading policies via reading a kernel-space initiated read. + + An example of such is loading IMA policies by writing the path + to the policy file to ``$securityfs/ima/policy`` + + ``X509_CERT``: + + Controls loading IMA certificates through the Kconfigs, + ``CONFIG_IMA_X509_PATH`` and ``CONFIG_EVM_X509_PATH``. + +action +~~~~~~ + + Determines what IPE should do when a rule matches. Must be in every + rule, as the final clause. Can be one of: + + ``ALLOW``: + + If the rule matches, explicitly allow access to the resource to proceed + without executing any more rules. + + ``DENY``: + + If the rule matches, explicitly prohibit access to the resource to + proceed without executing any more rules. + +boot_verified +~~~~~~~~~~~~~ + + This property can be utilized for authorization of files from initramfs. + The format of this property is:: + + boot_verified=(TRUE|FALSE) + + + .. WARNING:: + + This property will trust files from initramfs(rootfs). It should + only be used during early booting stage. Before mounting the real + rootfs on top of the initramfs, initramfs script will recursively + remove all files and directories on the initramfs. This is typically + implemented by using switch_root(8) [#switch_root]_. Therefore the + initramfs will be empty and not accessible after the real + rootfs takes over. It is advised to switch to a different policy + that doesn't rely on the property after this point. + This ensures that the trust policies remain relevant and effective + throughout the system's operation. + +dmverity_roothash +~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization or revocation of + specific dm-verity volumes, identified via their root hashes. It has a + dependency on the DM_VERITY module. This property is controlled by + the ``IPE_PROP_DM_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``DM_VERITY`` are all enabled. + The format of this property is:: + + dmverity_roothash=DigestName:HexadecimalString + + The supported DigestNames for dmverity_roothash are [#dmveritydigests]_ + + + blake2b-512 + + blake2s-256 + + sha256 + + sha384 + + sha512 + + sha3-224 + + sha3-256 + + sha3-384 + + sha3-512 + + sm3 + + rmd160 + +dmverity_signature +~~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of all dm-verity + volumes that have a signed roothash that validated by a keyring + specified by dm-verity's configuration, either the system trusted + keyring, or the secondary keyring. It depends on + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` config option and is controlled by + the ``IPE_PROP_DM_VERITY_SIGNATURE`` config option, it will be automatically + selected when ``SECURITY_IPE``, ``DM_VERITY`` and + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` are all enabled. + The format of this property is:: + + dmverity_signature=(TRUE|FALSE) + +fsverity_digest +~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of specific fsverity + enabled files, identified via their fsverity digests. + It depends on ``FS_VERITY`` config option and is controlled by + the ``IPE_PROP_FS_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``FS_VERITY`` are all enabled. + The format of this property is:: + + fsverity_digest=DigestName:HexadecimalString + + The supported DigestNames for fsverity_digest are [#fsveritydigest]_ + + + sha256 + + sha512 + +fsverity_signature +~~~~~~~~~~~~~~~~~~ + + This property is used to authorize all fs-verity enabled files that have + been verified by fs-verity's built-in signature mechanism. The signature + verification relies on a key stored within the ".fs-verity" keyring. It + depends on ``FS_VERITY_BUILTIN_SIGNATURES`` config option and + it is controlled by the ``IPE_PROP_FS_VERITY`` config option, + it will be automatically selected when ``SECURITY_IPE``, ``FS_VERITY`` + and ``FS_VERITY_BUILTIN_SIGNATURES`` are all enabled. + The format of this property is:: + + fsverity_signature=(TRUE|FALSE) + +Policy Examples +--------------- + +Allow all +~~~~~~~~~ + +:: + + policy_name=Allow_All policy_version=0.0.0 + DEFAULT action=ALLOW + +Allow only initramfs +~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + +Allow any signed and validated dm-verity volume and the initramfs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_DMV_And_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Prohibit execution from a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Deny_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:cd2c5bae7c6c579edaae4353049d58eb5f2e8be0244bf05345bc8e5ed257baff action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Allow only a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:401fcec5944823ae12f62726e8184407a5fa9599783f030dec146938 action=ALLOW + +Allow any fs-verity file with a valid built-in signature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_And_Validated_FSVerity policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_signature=TRUE action=ALLOW + +Allow execution of a specific fs-verity file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=ALLOW_FSV_By_Digest policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_digest=sha256:fd88f2b8824e197f850bf4c5109bea5cf0ee38104f710843bb72da796ba5af9e action=ALLOW + +Additional Information +---------------------- + +- `Github Repository <https://github.com/microsoft/ipe>`_ +- :doc:`Developer and design docs for IPE </security/ipe>` + +FAQ +--- + +Q: + What's the difference between other LSMs which provide a measure of + trust-based access control? + +A: + + In general, there's two other LSMs that can provide similar functionality: + IMA, and Loadpin. + + IMA and IPE are functionally very similar. The significant difference between + the two is the policy. [#devdoc]_ + + Loadpin and IPE differ fairly dramatically, as Loadpin only covers the IPE's + kernel read operations, whereas IPE is capable of controlling execution + on top of kernel read. The trust model is also different; Loadpin roots its + trust in the initial super-block, whereas trust in IPE is stemmed from kernel + itself (via ``SYSTEM_TRUSTED_KEYS``). + +----------- + +.. [#digest_cache_lsm] https://lore.kernel.org/lkml/[email protected]/ + +.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/[email protected]/>`_. + +.. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on + this topic. + +.. [#switch_root] https://man7.org/linux/man-pages/man8/switch_root.8.html + +.. [#dmveritydigests] These hash algorithms are based on values accepted by + the Linux crypto API; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. + +.. [#fsveritydigest] These hash algorithms are based on values accepted by the + kernel's fsverity support; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 091e8bb38887..678d70d6e1c3 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -102,17 +102,41 @@ Examples:: #select lzo compression algorithm echo lzo > /sys/block/zram0/comp_algorithm -For the time being, the `comp_algorithm` content does not necessarily -show every compression algorithm supported by the kernel. We keep this -list primarily to simplify device configuration and one can configure -a new device with a compression algorithm that is not listed in -`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API -and, if some of the algorithms were built as modules, it's impossible -to list all of them using, for instance, /proc/crypto or any other -method. This, however, has an advantage of permitting the usage of -custom crypto compression modules (implementing S/W or H/W compression). - -4) Set Disksize +For the time being, the `comp_algorithm` content shows only compression +algorithms that are supported by zram. + +4) Set compression algorithm parameters: Optional +================================================= + +Compression algorithms may support specific parameters which can be +tweaked for particular dataset. ZRAM has an `algorithm_params` device +attribute which provides a per-algorithm params configuration. + +For example, several compression algorithms support `level` parameter. +In addition, certain compression algorithms support pre-trained dictionaries, +which significantly change algorithms' characteristics. In order to configure +compression algorithm to use external pre-trained dictionary, pass full +path to the `dict` along with other parameters:: + + #pass path to pre-trained zstd dictionary + echo "algo=zstd dict=/etc/dictioary" > /sys/block/zram0/algorithm_params + + #same, but using algorithm priority + echo "priority=1 dict=/etc/dictioary" > \ + /sys/block/zram0/algorithm_params + + #pass path to pre-trained zstd dictionary and compression level + echo "algo=zstd level=8 dict=/etc/dictioary" > \ + /sys/block/zram0/algorithm_params + +Parameters are algorithm specific: not all algorithms support pre-trained +dictionaries, not all algorithms support `level`. Furthermore, for certain +algorithms `level` controls the compression level (the higher the value the +better the compression ratio, it even can take negatives values for some +algorithms), for other algorithms `level` is acceleration level (the higher +the value the lower the compression ratio). + +5) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -132,7 +156,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -5) Set memory limit: Optional +6) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. @@ -151,7 +175,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -6) Activate +7) Activate =========== :: @@ -162,7 +186,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -7) Add/remove zram devices +8) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -182,7 +206,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -8) Stats +9) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram<id>/ @@ -205,6 +229,7 @@ writeback_limit_enable RW show and set writeback_limit feature max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm +algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out @@ -283,15 +308,15 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -9) Deactivate -============= +10) Deactivate +============== :: swapoff /dev/zram0 umount /dev/zram1 -10) Reset +11) Reset ========= Write any positive value to 'reset' sysfs node:: @@ -487,11 +512,14 @@ registered compression algorithms, increases our chances of finding the algorithm that successfully compresses a particular page. Sometimes, however, it is convenient (and sometimes even necessary) to limit recompression to only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a algo=NAME parameter::: +This can be achieved by providing a `algo` or `priority` parameter::: #use zstd algorithm only (if registered) echo "type=huge algo=zstd" > /sys/block/zramX/recompress + #use zstd algorithm only (if zstd was registered under priority 1) + echo "type=huge priority=1" > /sys/block/zramX/recompress + memory tracking =============== diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst index 325c5d0ed34a..585630d14581 100644 --- a/Documentation/admin-guide/bug-bisect.rst +++ b/Documentation/admin-guide/bug-bisect.rst @@ -1,76 +1,144 @@ -Bisecting a bug -+++++++++++++++ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) +.. [see the bottom of this file for redistribution information] -Last updated: 28 October 2016 +====================== +Bisecting a regression +====================== -Introduction -============ +This document describes how to use a ``git bisect`` to find the source code +change that broke something -- for example when some functionality stopped +working after upgrading from Linux 6.0 to 6.1. -Always try the latest kernel from kernel.org and build from source. If you are -not confident in doing that please report the bug to your distribution vendor -instead of to a kernel developer. +The text focuses on the gist of the process. If you are new to bisecting the +kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +instead: it depicts everything from start to finish while covering multiple +aspects even kernel developers occasionally forget. This includes detecting +situations early where a bisection would be a waste of time, as nobody would +care about the result -- for example, because the problem happens after the +kernel marked itself as 'tainted', occurs in an abandoned version, was already +fixed, or is caused by a .config change you or your Linux distributor performed. -Finding bugs is not always easy. Have a go though. If you can't find it don't -give up. Report as much as you have found to the relevant maintainer. See -MAINTAINERS for who that is for the subsystem you have worked on. +Finding the change causing a kernel issue using a bisection +=========================================================== -Before you submit a bug report read -'Documentation/admin-guide/reporting-issues.rst'. +*Note: the following process assumes you prepared everything for a bisection. +This includes having a Git clone with the appropriate sources, installing the +software required to build and install kernels, as well as a .config file stored +in a safe place (the following example assumes '~/prepared_kernel_.config') to +use as pristine base at each bisection step; ideally, you have also worked out +a fully reliable and straight-forward way to reproduce the regression, too.* -Devices not appearing -===================== - -Often this is caused by udev/systemd. Check that first before blaming it -on the kernel. - -Finding patch that caused a bug -=============================== - -Using the provided tools with ``git`` makes finding bugs easy provided the bug -is reproducible. - -Steps to do it: - -- build the Kernel from its git source -- start bisect with [#f1]_:: - - $ git bisect start - -- mark the broken changeset with:: - - $ git bisect bad [commit] - -- mark a changeset where the code is known to work with:: - - $ git bisect good [commit] - -- rebuild the Kernel and test -- interact with git bisect by using either:: - - $ git bisect good - - or:: - - $ git bisect bad - - depending if the bug happened on the changeset you're testing -- After some interactions, git bisect will give you the changeset that - likely caused the bug. - -- For example, if you know that the current version is bad, and version - 4.8 is good, you could do:: - - $ git bisect start - $ git bisect bad # Current version is bad - $ git bisect good v4.8 - - -.. [#f1] You can, optionally, provide both good and bad arguments at git - start with ``git bisect start [BAD] [GOOD]`` - -For further references, please read: - -- The man page for ``git-bisect`` -- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_ -- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_ -- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_ +* Preparation: start the bisection and tell Git about the points in the history + you consider to be working and broken, which Git calls 'good' and 'bad':: + + git bisect start + git bisect good v6.0 + git bisect bad v6.1 + + Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too. + +1. Copy your prepared .config into the build directory and adjust it to the + needs of the codebase Git checked out for testing:: + + cp ~/prepared_kernel_.config .config + make olddefconfig + +2. Now build, install, and boot a kernel. This might fail for unrelated reasons, + for example, when a compile error happens at the current stage of the + bisection a later change resolves. In such cases run ``git bisect skip`` and + go back to step 1. + +3. Check if the functionality that regressed works in the kernel you just built. + + If it works, execute:: + + git bisect good + + If it is broken, run:: + + git bisect bad + + Note, getting this wrong just once will send the rest of the bisection + totally off course. To prevent having to start anew later you thus want to + ensure what you tell Git is correct; it is thus often wise to spend a few + minutes more on testing in case your reproducer is unreliable. + + After issuing one of these two commands, Git will usually check out another + bisection point and print something like 'Bisecting: 675 revisions left to + test after this (roughly 10 steps)'. In that case go back to step 1. + + If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da + is the first bad commit', then you have finished the bisection. In that case + move to the next point below. Note, right after displaying that line Git will + show some details about the culprit including its patch description; this can + easily fill your terminal, so you might need to scroll up to see the message + mentioning the culprit's commit-id. + + In case you missed Git's output, you can always run ``git bisect log`` to + print the status: it will show how many steps remain or mention the result of + the bisection. + +* Recommended complementary task: put the bisection log and the current .config + file aside for the bug report; furthermore tell Git to reset the sources to + the state before the bisection:: + + git bisect log > ~/bisection-log + cp .config ~/bisection-config-culprit + git bisect reset + +* Recommended optional task: try reverting the culprit on top of the latest + codebase and check if that fixes your bug; if that is the case, it validates + the bisection and enables developers to resolve the regression through a + revert. + + To try this, update your clone and check out latest mainline. Then tell Git + to revert the change by specifying its commit-id:: + + git revert --no-edit cafec0cacaca0 + + Git might reject this, for example when the bisection landed on a merge + commit. In that case, abandon the attempt. Do the same, if Git fails to revert + the culprit on its own because later changes depend on it -- at least unless + you bisected a stable or longterm kernel series, in which case you want to + check out its latest codebase and try a revert there. + + If a revert succeeds, build and test another kernel to check if reverting + resolved your regression. + +With that the process is complete. Now report the regression as described by +Documentation/admin-guide/reporting-issues.rst. + + +Additional reading material +--------------------------- + +* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and + `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_ + in the Git documentation. +* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_ + from kernel developer Nathan Chancellor. +* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_. +* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_. + +.. + end-of-content +.. + This document is maintained by Thorsten Leemhuis <[email protected]>. If + you spot a typo or small mistake, feel free to let him know directly and + he'll fix it. You are free to do the same in a mostly informal way if you + want to contribute changes to the text -- but for copyright reasons please CC + [email protected] and 'sign-off' your contribution as + Documentation/process/submitting-patches.rst explains in the section 'Sign + your work - the Developer's Certificate of Origin'. +.. + This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top + of the file. If you want to distribute this text under CC-BY-4.0 only, + please use 'The Linux kernel development community' for author attribution + and link this as source: + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst + +.. + Note: Only the content of this RST file as found in the Linux kernel sources + is available under CC-BY-4.0, as versions of this text that were processed + (for example by the kernel's build system) might contain content taken from + files which use a more restrictive license. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 95299b08c405..1d0f8ceb3075 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -244,14 +244,14 @@ Reporting the bug Once you find where the bug happened, by inspecting its location, you could either try to fix it yourself or report it upstream. -In order to report it upstream, you should identify the mailing list -used for the development of the affected code. This can be done by using -the ``get_maintainer.pl`` script. +In order to report it upstream, you should identify the bug tracker, if any, or +mailing list used for the development of the affected code. This can be done by +using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: - $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c Hans Verkuil <[email protected]> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <[email protected]> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <[email protected]> (commit_signer:1/1=100%) @@ -267,11 +267,12 @@ Please notice that it will point to: - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list ([email protected]); -- the Linux Kernel mailing list ([email protected]). +- The Linux Kernel mailing list ([email protected]); +- The bug reporting URIs for the driver/subsystem (none in the above example). -Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the -driver maintainer (Hans). +If the listing contains bug reporting URIs at the end, please prefer them over +email. Otherwise, please report bugs to the mailing list used for the +development of the code (linux-media ML) copying the driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 9cde26d33843..270501db9f4e 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -78,18 +78,24 @@ Brief summary of control files. memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded memory.soft_limit_in_bytes set/show soft limit of memory usage This knob is not available on CONFIG_PREEMPT_RT systems. + This knob is deprecated and shouldn't be + used. memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled This knob is deprecated and shouldn't be used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications + This knob is deprecated and shouldn't be + used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) memory.move_charge_at_immigrate set/show controls of moving charges This knob is deprecated and shouldn't be used. memory.oom_control set/show oom controls. + This knob is deprecated and shouldn't be + used. memory.numa_stat show the number of memory usage per numa node memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel @@ -105,10 +111,18 @@ Brief summary of control files. memory.kmem.max_usage_in_bytes show max kernel memory usage recorded memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.failcnt show the number of tcp buf memory usage hits limits + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded + This knob is deprecated and shouldn't be + used. ==================================== ========================================== 1. History @@ -693,8 +707,10 @@ For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: # echo 1 > memory.use_hierarchy -7. Soft limits -============== +7. Soft limits (DEPRECATED) +=========================== + +THIS IS DEPRECATED! Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided @@ -834,8 +850,10 @@ It's applicable for root and non-root cgroup. .. _cgroup-v1-memory-oom-control: -10. OOM Control -=============== +10. OOM Control (DEPRECATED) +============================ + +THIS IS DEPRECATED! memory.oom_control file is for OOM notification and other controls. @@ -882,8 +900,10 @@ At reading, current status of OOM is shown. The number of processes belonging to this cgroup killed by any kind of OOM killer. -11. Memory Pressure -=================== +11. Memory Pressure (DEPRECATED) +================================ + +THIS IS DEPRECATED! The pressure level notifications can be used to monitor the memory allocation cost; based on the pressure, applications can implement diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 86311c2907cd..69af2173555f 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -533,10 +533,12 @@ cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is -achieved by not granting access to these files. For the second, the -kernel rejects writes to all files other than "cgroup.procs" and -"cgroup.subtree_control" on a namespace root from inside the -namespace. +achieved by not granting access to these files. For the second, files +outside the namespace should be hidden from the delegatee by the means +of at least mount namespacing, and the kernel rejects writes to all +files on a namespace root from inside the cgroup namespace, except for +those files listed in "/sys/kernel/cgroup/delegate" (including +"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, @@ -981,6 +983,14 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + nr_subsys_<cgroup_subsys> + Total number of live cgroup subsystems (e.g memory + cgroup) at and beneath the current cgroup. + + nr_dying_subsys_<cgroup_subsys> + Total number of dying cgroup subsystems (e.g. memory + cgroup) at and beneath the current cgroup. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1333,11 +1343,14 @@ The following nested keys are defined. all the existing limitations and potential future extensions. memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD. - The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.oom.group A read-write single value file which exists on non-root @@ -1614,6 +1627,25 @@ The following nested keys are defined. Usually because failed to allocate some continuous swap space for the huge page. + numa_pages_migrated (npn) + Number of pages migrated by NUMA balancing. + + numa_pte_updates (npn) + Number of pages whose page table entries are modified by + NUMA balancing to produce NUMA hinting faults on access. + + numa_hint_faults (npn) + Number of NUMA hinting faults. + + pgdemote_kswapd + Number of pages demoted by kswapd. + + pgdemote_direct + Number of pages demoted directly. + + pgdemote_khugepaged + Number of pages demoted by khugepaged. + memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups. @@ -1663,11 +1695,14 @@ The following nested keys are defined. Healthy workloads are not expected to reach this limit. memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD. - The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.swap.max A read-write single value file which exists on non-root @@ -1717,9 +1752,10 @@ The following nested keys are defined. entries fault back in or are written out to disk. memory.zswap.writeback - A read-write single value file. The default value is "1". The - initial value of the root cgroup is 1, and when a new cgroup is - created, it inherits the current value of its parent. + A read-write single value file. The default value is "1". + Note that this setting is hierarchical, i.e. the writeback would be + implicitly disabled for child cgroups if the upper hierarchy + does so. When this is set to 0, all swapping attempts to swapping devices are disabled. This included both zswap writebacks, and swapping due @@ -1730,6 +1766,8 @@ The following nested keys are defined. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be written to the zswap pool. + This setting has no effect if zswap is disabled, and swapping + is allowed unless memory.swap.max is set to 0. memory.pressure A read-only nested-keyed file. @@ -2939,8 +2977,8 @@ Deprecated v1 Core Features - "cgroup.clone_children" is removed. -- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file - at the root instead. +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or + "cgroup.stat" files at the root instead. Issues with v1 and Rationales for v2 diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index cde1d5594b0d..9f8139ff97d6 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -166,13 +166,18 @@ integrity_key_size:<bytes> Module parameters:: - max_read_size + Maximum size of read requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + max_write_size - Maximum size of read or write requests. When a request larger than this size + Maximum size of write requests. When a request larger than this size is received, dm-crypt will split the request. The splitting improves concurrency (the split requests could be encrypted in parallel by multiple - cores), but it also causes overhead. The user should tune these parameters to + cores), but it also causes overhead. The user should tune this parameters to fit the actual workload. diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 5740d85439ff..2418b0c2d3df 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -212,16 +212,6 @@ When mounting an ext4 filesystem, the following option are accepted: that ext4's inode table readahead algorithm will pre-read into the buffer cache. The default value is 32 blocks. - nouser_xattr - Disables Extended User Attributes. See the attr(5) manual page for - more information about extended attributes. - - noacl - This option disables POSIX Access Control List support. If ACL support - is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL - is enabled by default on mount. See the acl(5) manual page for more - information about acl. - bsddf (*) Make 'df' act like BSD. diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index 4bd3ce3ba171..2ad1c05b8c88 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -158,3 +158,72 @@ poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). + +Checking the safe RET mitigation actually works +----------------------------------------------- + +In case one wants to validate whether the SRSO safe RET mitigation works +on a kernel, one could use two performance counters + +* PMC_0xc8 - Count of RET/RET lw retired +* PMC_0xc9 - Count of RET/RET lw retired mispredicted + +and compare the number of RETs retired properly vs those retired +mispredicted, in kernel mode. Another way of specifying those events +is:: + + # perf list ex_ret_near_ret + + List of pre-defined events (to be used in -e or -M): + + core: + ex_ret_near_ret + [Retired Near Returns] + ex_ret_near_ret_mispred + [Retired Near Returns Mispredicted] + +Either the command using the event mnemonics:: + + # perf stat -e ex_ret_near_ret:k -e ex_ret_near_ret_mispred:k sleep 10s + +or using the raw PMC numbers:: + + # perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + +should give the same amount. I.e., every RET retired should be +mispredicted:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 137,167 cpu/event=0xc8,umask=0/k + 137,173 cpu/event=0xc9,umask=0/k + + 10.004110303 seconds time elapsed + + 0.000000000 seconds user + 0.004462000 seconds sys + +vs the case when the mitigation is disabled (spec_rstack_overflow=off) +or not functioning properly, showing usually a lot smaller number of +mispredicted retired RETs vs the overall count of retired RETs during +a workload:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 201,627 cpu/event=0xc8,umask=0/k + 4,074 cpu/event=0xc9,umask=0/k + + 10.003267252 seconds time elapsed + + 0.002729000 seconds user + 0.000000000 seconds sys + +Also, there is a selftest which performs the above, go to +tools/testing/selftests/x86/ and do:: + + make srso + ./srso diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 09126bb8cc9f..bb48ae24ae69 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -333,12 +333,17 @@ allowed anymore to lift isolation requirements as needed. This option does not override iommu=pt - force_enable - Force enable the IOMMU on platforms known - to be buggy with IOMMU enabled. Use this - option with care. - pgtbl_v1 - Use v1 page table for DMA-API (Default). - pgtbl_v2 - Use v2 page table for DMA-API. - irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + force_enable - Force enable the IOMMU on platforms known + to be buggy with IOMMU enabled. Use this + option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. + irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + nohugepages - Limit page-sizes used for v1 page-tables + to 4 KiB. + v2_pgsizes_only - Limit page-sizes used for v1 page-tables + to 4KiB/2Mib/1GiB. + amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -517,6 +522,18 @@ Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. + bdev_allow_write_mounted= + Format: <bool> + Control the ability to open a mounted block device + for writing, i.e., allow / disallow writes that bypass + the FS. This was implemented as a means to prevent + fuzzers from crashing the kernel by overwriting the + metadata underneath a mounted FS without its awareness. + This also prevents destructive formatting of mounted + filesystems by naive storage tooling that don't use + O_EXCL. Default is Y and can be changed through the + Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED. + bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. @@ -2350,6 +2367,18 @@ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. + ipe.enforce= [IPE] + Format: <bool> + Determine whether IPE starts in permissive (0) or + enforce (1) mode. The default is enforce. + + ipe.success_audit= + [IPE] + Format: <bool> + Start IPE with success auditing enabled, emitting + an audit event when a binary is allowed. The default + is 0. + irqaffinity= [SMP] Set the default irq affinity mask The argument is a cpu list, as described above. @@ -4123,6 +4152,21 @@ Disable NUMA, Only set up a single NUMA node spanning all memory. + numa=fake=<size>[MG] + [KNL, ARM64, RISCV, X86, EARLY] + If given as a memory unit, fills all system RAM with + nodes of size interleaved over physical nodes. + + numa=fake=<N> + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer, fills all system RAM with N + fake nodes interleaved over physical nodes. + + numa=fake=<N>U + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer followed by 'U', it will + divide each physical node into N emulated nodes. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -4788,6 +4832,16 @@ printk.time= Show timing data prefixed to each printk message line Format: <bool> (1/Y/y=enable, 0/N/n=disable) + proc_mem.force_override= [KNL] + Format: {always | ptrace | never} + Traditionally /proc/pid/mem allows memory permissions to be + overridden without restrictions. This option may be set to + restrict that. Can be one of: + - 'always': traditional behavior always allows mem overrides. + - 'ptrace': only allow mem overrides for active ptracers. + - 'never': never allow mem overrides. + If not specified, default is the CONFIG_PROC_MEM_* choice. + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -4935,6 +4989,10 @@ Set maximum number of finished RCU callbacks to process in one batch. + rcutree.csd_lock_suppress_rcu_stall= [KNL] + Do only a one-line RCU CPU stall warning when + there is an ongoing too-long CSD-lock wait. + rcutree.do_rcu_barrier= [KNL] Request a call to rcu_barrier(). This is throttled so that userspace tests can safely @@ -5382,7 +5440,13 @@ Time to wait (s) after boot before inducing stall. rcutorture.stall_cpu_irqsoff= [KNL] - Disable interrupts while stalling if set. + Disable interrupts while stalling if set, but only + on the first stall in the set. + + rcutorture.stall_cpu_repeat= [KNL] + Number of times to repeat the stall sequence, + so that rcutorture.stall_cpu_repeat=3 will result + in four stall sequences. rcutorture.stall_gp_kthread= [KNL] Duration (s) of forced sleep within RCU @@ -5570,14 +5634,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_rude_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Rude asynchronous callback batching for - call_rcu_tasks_rude(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_rude(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] Set timeout in milliseconds RCU Tasks Trace asynchronous callback batching for @@ -6614,6 +6670,15 @@ <deci-seconds>: poll all this frequency 0: no polling (default) + thp_anon= [KNL] + Format: <size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state> + state is one of "always", "madvise", "never" or "inherit". + Control the default behavior of the system with respect + to anonymous transparent hugepages. + Can be used multiple times for multiple anon THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + threadirqs [KNL,EARLY] Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD. @@ -6743,6 +6808,51 @@ the same thing would happen if it was left off). The irq_handler_entry event, and all events under the "initcall" system. + Flags can be added to the instance to modify its behavior when it is + created. The flags are separated by '^'. + + The available flags are: + + traceoff - Have the tracing instance tracing disabled after it is created. + traceprintk - Have trace_printk() write into this trace instance + (note, "printk" and "trace_printk" can also be used) + + trace_instance=foo^traceoff^traceprintk,sched,irq + + The flags must come before the defined events. + + If memory has been reserved (see memmap for x86), the instance + can use that memory: + + memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M + + The above will create a "boot_map" instance that uses the physical + memory at 0x284500000 that is 12Megs. The per CPU buffers of that + instance will be split up accordingly. + + Alternatively, the memory can be reserved by the reserve_mem option: + + reserve_mem=12M:4096:trace trace_instance=boot_map@trace + + This will reserve 12 megabytes at boot up with a 4096 byte alignment + and place the ring buffer in this memory. Note that due to KASLR, the + memory may not be the same location each time, which will not preserve + the buffer content. + + Also note that the layout of the ring buffer data may change between + kernel versions where the validator will fail and reset the ring buffer + if the layout is not the same as the previous kernel. + + If the ring buffer is used for persistent bootups and has events enabled, + it is recommend to disable tracing so that events from a previous boot do not + mix with events of the current boot (unless you are debugging a random crash + at boot up). + + reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq + + See also Documentation/trace/debugging.rst + + trace_options=[option-list] [FTRACE] Enable or disable tracer options at boot. The option-list is a comma delimited list of options @@ -7352,6 +7462,13 @@ it can be updated at runtime by writing to the corresponding sysfs file. + workqueue.panic_on_stall=<uint> + Panic when workqueue stall is detected by + CONFIG_WQ_WATCHDOG. It sets the number times of the + stall to trigger panic. + + The default is 0, which disables the panic on stall. + workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this threshold are automatically considered CPU intensive diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst index 6b30e355cf23..92690e1f2183 100644 --- a/Documentation/admin-guide/media/cec.rst +++ b/Documentation/admin-guide/media/cec.rst @@ -42,10 +42,14 @@ dongles): ``persistent_config``: by default this is off, but when set to 1 the driver will store the current settings to the device's internal eeprom and restore it the next time the device is connected to the USB port. + - RainShadow Tech. Note: this driver does not support the persistent_config module option of the Pulse-Eight driver. The hardware supports it, but I have no plans to add this feature. But I accept patches :-) +- Extron DA HD 4K PLUS HDMI Distribution Amplifier. See + :ref:`extron_da_hd_4k_plus` for more information. + Miscellaneous: - vivid: emulates a CEC receiver and CEC transmitter. @@ -378,3 +382,86 @@ it later using ``--analyze-pin``. You can also use this as a full-fledged CEC device by configuring it using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``. + +.. _extron_da_hd_4k_plus: + +Extron DA HD 4K PLUS CEC Adapter driver +======================================= + +This driver is for the Extron DA HD 4K PLUS series of HDMI Distribution +Amplifiers: https://www.extron.com/product/dahd4kplusseries + +The 2, 4 and 6 port models are supported. + +Firmware version 1.02.0001 or higher is required. + +Note that older Extron hardware revisions have a problem with the CEC voltage, +which may mean that CEC will not work. This is fixed in hardware revisions +E34814 and up. + +The CEC support has two modes: the first is a manual mode where userspace has +to manually control CEC for the HDMI Input and all HDMI Outputs. While this gives +full control, it is also complicated. + +The second mode is an automatic mode, which is selected if the module option +``vendor_id`` is set. In that case the driver controls CEC and CEC messages +received in the input will be distributed to the outputs. It is still possible +to use the /dev/cecX devices to talk to the connected devices directly, but it is +the driver that configures everything and deals with things like Hotplug Detect +changes. + +The driver also takes care of the EDIDs: /dev/videoX devices are created to +read the EDIDs and (for the HDMI Input port) to set the EDID. + +By default userspace is responsible to set the EDID for the HDMI Input +according to the EDIDs of the connected displays. But if the ``manufacturer_name`` +module option is set, then the driver will take care of setting the EDID +of the HDMI Input based on the supported resolutions of the connected displays. +Currently the driver only supports resolutions 1080p60 and 4kp60: if all connected +displays support 4kp60, then it will advertise 4kp60 on the HDMI input, otherwise +it will fall back to an EDID that just reports 1080p60. + +The status of the Extron is reported in ``/sys/kernel/debug/cec/cecX/status``. + +The extron-da-hd-4k-plus driver implements the following module options: + +``debug`` +--------- + +If set to 1, then all serial port traffic is shown. + +``vendor_id`` +------------- + +The CEC Vendor ID to report to connected displays. + +If set, then the driver will take care of distributing CEC messages received +on the input to the HDMI outputs. This is done for the following CEC messages: + +- <Standby> +- <Image View On> and <Text View On> +- <Give Device Power Status> +- <Set System Audio Mode> +- <Request Current Latency> + +If not set, then userspace is responsible for this, and it will have to +configure the CEC devices for HDMI Input and the HDMI Outputs manually. + +``manufacturer_name`` +--------------------- + +A three character manufacturer name that is used in the EDID for the HDMI +Input. If not set, then userspace is reponsible for configuring an EDID. +If set, then the driver will update the EDID automatically based on the +resolutions supported by the connected displays, and it will not be possible +anymore to manually set the EDID for the HDMI Input. + +``hpd_never_low`` +----------------- + +If set, then the Hotplug Detect pin of the HDMI Input will always be high, +even if nothing is connected to the HDMI Outputs. If not set (the default) +then the Hotplug Detect pin of the HDMI input will go low if all the detected +Hotplug Detect pins of the HDMI Outputs are also low. + +This option may be changed dynamically. diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index e434d4a9eeb3..b9da127c074d 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -227,8 +227,13 @@ Common FPDL3/GMSL output parameters open.* **frame_rate** (RW): - Output video frame rate in frames per second. The default frame rate is - 60Hz. + Output video signal frame rate limit in frames per second. Due to + the limited output pixel clock steps, the card can not always generate + a frame rate perfectly matching the value required by the connected display. + Using this parameter one can limit the frame rate by "crippling" the signal + so that the lines are not equal (the porches of the last line differ) but + the signal appears like having the exact frame rate to the connected display. + The default frame rate limit is 60Hz. **hsync_polarity** (RW): HSYNC signal polarity. @@ -253,33 +258,33 @@ Common FPDL3/GMSL output parameters and there is a non-linear stepping between two consecutive allowed frequencies. The driver finds the nearest allowed frequency to the given value and sets it. When reading this property, you get the exact - frequency set by the driver. The default frequency is 70000kHz. + frequency set by the driver. The default frequency is 61150kHz. *Note: This parameter can not be changed while the output v4l2 device is open.* **hsync_width** (RW): - Width of the HSYNC signal in pixels. The default value is 16. + Width of the HSYNC signal in pixels. The default value is 40. **vsync_width** (RW): - Width of the VSYNC signal in video lines. The default value is 2. + Width of the VSYNC signal in video lines. The default value is 20. **hback_porch** (RW): Number of PCLK pulses between deassertion of the HSYNC signal and the first - valid pixel in the video line (marked by DE=1). The default value is 32. + valid pixel in the video line (marked by DE=1). The default value is 50. **hfront_porch** (RW): Number of PCLK pulses between the end of the last valid pixel in the video line (marked by DE=1) and assertion of the HSYNC signal. The default value - is 32. + is 50. **vback_porch** (RW): Number of video lines between deassertion of the VSYNC signal and the video - line with the first valid pixel (marked by DE=1). The default value is 2. + line with the first valid pixel (marked by DE=1). The default value is 31. **vfront_porch** (RW): Number of video lines between the end of the last valid pixel line (marked - by DE=1) and assertion of the VSYNC signal. The default value is 2. + by DE=1) and assertion of the VSYNC signal. The default value is 30. FPDL3 specific input parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst index 6f14d9561fa5..6c878c71442f 100644 --- a/Documentation/admin-guide/media/rkisp1.rst +++ b/Documentation/admin-guide/media/rkisp1.rst @@ -114,11 +114,18 @@ to be applied to the hardware during a video stream, allowing userspace to dynamically modify values such as black level, cross talk corrections and others. -The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and -userspace should set +The ISP driver supports two different parameters configuration methods, the +`fixed parameters format` or the `extensible parameters format`. + +When using the `fixed parameters` method the buffer format is defined by struct +:c:type:`rkisp1_params_cfg`, and userspace should set :ref:`V4L2_META_FMT_RK_ISP1_PARAMS <v4l2-meta-fmt-rk-isp1-params>` as the dataformat. +When using the `extensible parameters` method the buffer format is defined by +struct :c:type:`rkisp1_ext_params_cfg`, and userspace should set +:ref:`V4L2_META_FMT_RK_ISP1_EXT_PARAMS <v4l2-meta-fmt-rk-isp1-ext-params>` as +the dataformat. Capturing Video Frames Example ============================== diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index 1306f19ecb5a..034ca7c77fb9 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -328,7 +328,7 @@ and an HDMI input, one input for each input type. Those are described in more detail below. Special attention has been given to the rate at which new frames become -available. The jitter will be around 1 jiffie (that depends on the HZ +available. The jitter will be around 1 jiffy (that depends on the HZ configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second), but the long-term behavior is exactly following the framerate. So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate @@ -1343,7 +1343,7 @@ Some Future Improvements Just as a reminder and in no particular order: - Add a virtual alsa driver to test audio -- Add virtual sub-devices and media controller support +- Add virtual sub-devices - Some support for testing compressed video - Add support to loop raw VBI output to raw VBI input - Add support to loop teletext sliced VBI output to VBI input @@ -1358,4 +1358,4 @@ Just as a reminder and in no particular order: - Make a thread for the RDS generation, that would help in particular for the "Controls" RDS Rx I/O Mode as the read-only RDS controls could be updated in real-time. -- Changing the EDID should cause hotplug detect emulation to happen. +- Changing the EDID doesn't wait 100 ms before setting the HPD signal. diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 054010a7f3d8..c4dddf6733cd 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -7,7 +7,7 @@ Getting Started This document briefly describes how you can use DAMON by demonstrating its default user space tool. Please note that this document describes only a part of its features for brevity. Please refer to the usage `doc -<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more +<https://github.com/damonitor/damo/blob/next/USAGE.md>`_ of the tool for more details. @@ -26,7 +26,7 @@ User Space Tool For the demonstration, we will use the default user space tool for DAMON, called DAMON Operator (DAMO). It is available at -https://github.com/awslabs/damo. The examples below assume that ``damo`` is on +https://github.com/damonitor/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. Because DAMO is using the sysfs interface (refer to :doc:`usage` for the diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 26df6cfa4441..d9be9f7caa7d 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -7,19 +7,19 @@ Detailed Usages DAMON provides below interfaces for different users. - *DAMON user space tool.* - `This <https://github.com/awslabs/damo>`_ is for privileged people such as + `This <https://github.com/damonitor/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for special cases, though. For more detail, please refer to its `usage document - <https://github.com/awslabs/damo/blob/next/USAGE.md>`_. + <https://github.com/damonitor/damo/blob/next/USAGE.md>`_. - *sysfs interface.* :ref:`This <sysfs_interface>` is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special sysfs files. Therefore, you can write and use your personalized DAMON sysfs wrapper programs that reads/writes the sysfs files instead of you. The `DAMON user space tool - <https://github.com/awslabs/damo>`_ is one example of such programs. + <https://github.com/damonitor/damo>`_ is one example of such programs. - *Kernel Space Programming Interface.* :doc:`This </mm/damon/api>` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by @@ -543,7 +543,7 @@ memory rate becomes larger than 60%, or lower than 30%". :: # echo 300 > watermarks/low Please note that it's highly recommended to use user space tools like `damo -<https://github.com/awslabs/damo>`_ rather than manually reading and writing +<https://github.com/damonitor/damo>`_ rather than manually reading and writing the files as above. Above is only for an example. .. _tracepoint: diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 098f14d83e99..cb2c080f400c 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -294,8 +294,9 @@ The following files are currently defined: ``crash_hotplug`` read-only: when changes to the system memory map occur due to hot un/plug of memory, this file contains '1' if the kernel updates the kdump capture kernel memory - map itself (via elfcorehdr), or '0' if userspace must update - the kdump capture kernel memory map. + map itself (via elfcorehdr and other relevant kexec + segments), or '0' if userspace must update the kdump + capture kernel memory map. Availability depends on the CONFIG_MEMORY_HOTPLUG kernel configuration option. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 058485daf186..cfdd16a52e39 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,6 +202,16 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +All THPs at fault and collapse time will be added to _deferred_list, +and will therefore be split under memory presure if they are considered +"underused". A THP is underused if the number of zero-filled pages in +the THP is above max_ptes_none (see below). It is possible to disable +this behaviour by writing 0 to shrink_underused, and enable it by writing +1 to it:: + + echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused + echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused + khugepaged will be automatically started when PMD-sized THP is enabled (either of the per-size anon control or the top-level control are set to "always" or "madvise"), and it'll be automatically shutdown when @@ -284,13 +294,37 @@ that THP is shared. Exceeding the number would block the collapse:: A higher value may increase memory footprint for some workloads. -Boot parameter -============== +Boot parameters +=============== + +You can change the sysfs boot time default for the top-level "enabled" +control by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the +kernel command line. + +Alternatively, each supported anonymous THP size can be controlled by +passing ``thp_anon=<size>,<size>[KMG]:<state>;<size>-<size>[KMG]:<state>``, +where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and +supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, +``never`` or ``inherit``. + +For example, the following will set 16K, 32K, 64K THP to ``always``, +set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M +to ``never``:: -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` -to the kernel command line. + thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never + +``thp_anon=`` may be specified multiple times to configure all THP sizes as +required. If ``thp_anon=`` is specified at least once, any anon THP sizes +not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage`` setting only affects the global toggle. If +``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``. +However, if a valid ``thp_anon`` setting is provided by the user, the +PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER +is not defined within a valid ``thp_anon``, its policy will default to +``never``. Hugepages in tmpfs/shmem ======================== @@ -447,6 +481,12 @@ thp_deferred_split_page splitting it would free up some memory. Pages on split queue are going to be split under memory pressure. +thp_underused_split_page + is incremented when a huge page on the split queue was split + because it was underused. A THP is underused if the number of + zero pages in the THP is above a certain threshold + (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none). + thp_split_pmd is incremented every time a PMD split into table of PTEs. This can happen, for instance, when application calls mprotect() or @@ -527,6 +567,18 @@ split_deferred it would free up some memory. Pages on split queue are going to be split under memory pressure, if splitting is possible. +nr_anon + the number of anonymous THP we have in the whole system. These THPs + might be currently entirely mapped or have partially unmapped/unused + subpages. + +nr_anon_partially_mapped + the number of anonymous THP which are likely partially mapped, possibly + wasting memory, and have been queued for deferred memory reclamation. + Note that in corner some cases (e.g., failed migration), we might detect + an anonymous THP as "partially mapped" and count it here, even though it + is not actually partially mapped anymore. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/Documentation/admin-guide/perf/arm-ni.rst b/Documentation/admin-guide/perf/arm-ni.rst new file mode 100644 index 000000000000..d26a8f697c36 --- /dev/null +++ b/Documentation/admin-guide/perf/arm-ni.rst @@ -0,0 +1,17 @@ +==================================== +Arm Network-on Chip Interconnect PMU +==================================== + +NI-700 and friends implement a distinct PMU for each clock domain within the +interconnect. Correspondingly, the driver exposes multiple PMU devices named +arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is +the clock domain ID within that particular instance. If multiple NI instances +exist within a system, the PMU devices can be correlated with the underlying +hardware instance via sysfs parentage. + +Each PMU exposes base event aliases for the interface types present in its clock +domain. These require qualifying with the "eventid" and "nodeid" parameters +to specify the event code to count and the interface at which to count it +(per the configured hardware ID as reflected in the xxNI_NODE_INFO register). +The exception is the "cycles" alias for the PMU cycle counter, which is encoded +with the PMU node type and needs no further qualification. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index d47cd229d710..39b8e1fdd0cd 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -46,16 +46,16 @@ Some of the events only exist for specific configurations. DesignWare Cores (DWC) PCIe PMU Driver ======================================= -This driver adds PMU devices for each PCIe Root Port named based on the BDF of +This driver adds PMU devices for each PCIe Root Port named based on the SBDF of the Root Port. For example, - 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) + 0001:30:03.0 PCI bridge: Device 1ded:8000 (rev 01) -the PMU device name for this Root Port is dwc_rootport_3018. +the PMU device name for this Root Port is dwc_rootport_13018. The DWC PCIe PMU driver registers a perf PMU driver, which provides description of available events and configuration options in sysfs, see -/sys/bus/event_source/devices/dwc_rootport_{bdf}. +/sys/bus/event_source/devices/dwc_rootport_{sbdf}. The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration @@ -66,16 +66,16 @@ The "perf list" command shall list the available events from sysfs, e.g.:: $# perf list | grep dwc_rootport <...> - dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] + dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] <...> - dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event] + dwc_rootport_13018/rx_memory_read,lane=?/ [Kernel PMU event] Time Based Analysis Event Usage ------------------------------- Example usage of counting PCIe RX TLP data payload (Units of bytes):: - $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ + $# perf stat -a -e dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ The average RX/TX bandwidth can be calculated using the following formula: @@ -88,7 +88,7 @@ Lane Event Usage Each lane has the same event set and to avoid generating a list of hundreds of events, the user need to specify the lane ID explicitly, e.g.:: - $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/ + $# perf stat -a -e dwc_rootport_13018/rx_memory_read,lane=4/ The driver does not support sampling, therefore "perf record" will not work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst index 5541ff40e06a..083ca50de896 100644 --- a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst @@ -28,7 +28,9 @@ The "identifier" sysfs file allows users to identify the version of the PMU hardware device. The "bus" sysfs file allows users to get the bus number of Root Ports -monitored by PMU. +monitored by PMU. Furthermore users can get the Root Ports range in +[bdf_min, bdf_max] from "bdf_min" and "bdf_max" sysfs attributes +respectively. Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 7eb3dcd6f4da..8502bc174640 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -16,6 +16,7 @@ Performance monitor support starfive_starlink_pmu arm-ccn arm-cmn + arm-ni xgene-pmu arm_dsu_pmu thunderx2-pmu diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index d0324d44f548..210a808b74ec 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -251,7 +251,9 @@ performance supported in `AMD CPPC Performance Capability <perf_cap_>`_). In some ASICs, the highest CPPC performance is not the one in the ``_CPC`` table, so we need to expose it to sysfs. If boost is not active, but still supported, this maximum frequency will be larger than the one in -``cpuinfo``. +``cpuinfo``. On systems that support preferred core, the driver will have +different values for some cores than others and this will reflect the values +advertised by the platform at bootup. This attribute is read-only. ``amd_pstate_lowest_nonlinear_freq`` @@ -262,6 +264,17 @@ lowest non-linear performance in `AMD CPPC Performance Capability <perf_cap_>`_.) This attribute is read-only. +``amd_pstate_hw_prefcore`` + +Whether the platform supports the preferred core feature and it has been +enabled. This attribute is read-only. + +``amd_pstate_prefcore_ranking`` + +The performance ranking of the core. This number doesn't have any unit, but +larger numbers are preferred at the time of reading. This can change at +runtime based on platform conditions. This attribute is read-only. + ``energy_performance_available_preferences`` A list of all the supported EPP preferences that could be used for diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5ab3440e6cee..5151ec312dc0 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -113,3 +113,62 @@ to apply at each uncore* level. Support for "current_freq_khz" is available only at each fabric cluster level (i.e., in uncore* directory). + +Efficiency vs. Latency Tradeoff +------------------------------- + +The Efficiency Latency Control (ELC) feature improves performance +per watt. With this feature hardware power management algorithms +optimize trade-off between latency and power consumption. For some +latency sensitive workloads further tuning can be done by SW to +get desired performance. + +The hardware monitors the average CPU utilization across all cores +in a power domain at regular intervals and decides an uncore frequency. +While this may result in the best performance per watt, workload may be +expecting higher performance at the expense of power. Consider an +application that intermittently wakes up to perform memory reads on an +otherwise idle system. In such cases, if hardware lowers uncore +frequency, then there may be delay in ramp up of frequency to meet +target performance. + +The ELC control defines some parameters which can be changed from SW. +If the average CPU utilization is below a user-defined threshold +(elc_low_threshold_percent attribute below), the user-defined uncore +floor frequency will be used (elc_floor_freq_khz attribute below) +instead of hardware calculated minimum. + +Similarly in high load scenario where the CPU utilization goes above +the high threshold value (elc_high_threshold_percent attribute below) +instead of jumping to maximum uncore frequency, frequency is increased +in 100MHz steps. This avoids consuming unnecessarily high power +immediately with CPU utilization spikes. + +Attributes for efficiency latency control: + +``elc_floor_freq_khz`` + This attribute is used to get/set the efficiency latency floor frequency. + If this variable is lower than the 'min_freq_khz', it is ignored by + the firmware. + +``elc_low_threshold_percent`` + This attribute is used to get/set the efficiency latency control low + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_percent`` + This attribute is used to get/set the efficiency latency control high + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_enable`` + This attribute is used to enable/disable the efficiency latency control + high threshold. Write '1' to enable, '0' to disable. + +Example system configuration below, which does following: + * when CPU utilization is less than 10%: sets uncore frequency to 800MHz + * when CPU utilization is higher than 95%: increases uncore frequency in + 100MHz steps, until power limit is reached + + elc_floor_freq_khz:800000 + elc_high_threshold_percent:95 + elc_high_threshold_enable:1 + elc_low_threshold_percent:10 diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index 6f534a707b2a..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -129,7 +129,7 @@ Setting the ramoops parameters can be done in several different manners: takes a size, alignment and name as arguments. The name is used to map the memory to a label that can be retrieved by ramoops. - reserver_mem=2M:4096:oops ramoops.mem_name=oops + reserve_mem=2M:4096:oops ramoops.mem_name=oops You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index f92551539e8a..700aa72eecb1 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -182,3 +182,5 @@ More detailed explanation for tainting produce extremely unusual kernel structure layouts (even performance pathological ones), which is important to know when debugging. Set at build time. + + 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. |