diff options
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- | Documentation/admin-guide/LSM/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/LSM/ipe.rst | 790 | ||||
-rw-r--r-- | Documentation/admin-guide/bug-bisect.rst | 208 | ||||
-rw-r--r-- | Documentation/admin-guide/bug-hunting.rst | 17 | ||||
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 29 | ||||
-rw-r--r-- | Documentation/admin-guide/device-mapper/dm-crypt.rst | 11 | ||||
-rw-r--r-- | Documentation/admin-guide/hw-vuln/srso.rst | 69 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 78 | ||||
-rw-r--r-- | Documentation/admin-guide/media/vivid.rst | 2 | ||||
-rw-r--r-- | Documentation/admin-guide/perf/arm-ni.rst | 17 | ||||
-rw-r--r-- | Documentation/admin-guide/perf/dwc_pcie_pmu.rst | 16 | ||||
-rw-r--r-- | Documentation/admin-guide/perf/hisi-pcie-pmu.rst | 4 | ||||
-rw-r--r-- | Documentation/admin-guide/perf/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/pm/amd-pstate.rst | 15 | ||||
-rw-r--r-- | Documentation/admin-guide/ramoops.rst | 2 | ||||
-rw-r--r-- | Documentation/admin-guide/tainted-kernels.rst | 2 |
16 files changed, 1145 insertions, 117 deletions
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index a6ba95fbaa9f..ce63be6d64ad 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -47,3 +47,4 @@ subdirectories. tomoyo Yama SafeSetID + ipe diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst new file mode 100644 index 000000000000..f38e641df0e9 --- /dev/null +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -0,0 +1,790 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Integrity Policy Enforcement (IPE) +================================== + +.. NOTE:: + + This is the documentation for admins, system builders, or individuals + attempting to use IPE. If you're looking for more developer-focused + documentation about IPE please see :doc:`the design docs </security/ipe>`. + +Overview +-------- + +Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a +complementary approach to access control. Unlike traditional access control +mechanisms that rely on labels and paths for decision-making, IPE focuses +on the immutable security properties inherent to system components. These +properties are fundamental attributes or features of a system component +that cannot be altered, ensuring a consistent and reliable basis for +security decisions. + +To elaborate, in the context of IPE, system components primarily refer to +files or the devices these files reside on. However, this is just a +starting point. The concept of system components is flexible and can be +extended to include new elements as the system evolves. The immutable +properties include the origin of a file, which remains constant and +unchangeable over time. For example, IPE policies can be crafted to trust +files originating from the initramfs. Since initramfs is typically verified +by the bootloader, its files are deemed trustworthy; "file is from +initramfs" becomes an immutable property under IPE's consideration. + +The immutable property concept extends to the security features enabled on +a file's origin, such as dm-verity or fs-verity, which provide a layer of +integrity and trust. For example, IPE allows the definition of policies +that trust files from a dm-verity protected device. dm-verity ensures the +integrity of an entire device by providing a verifiable and immutable state +of its contents. Similarly, fs-verity offers filesystem-level integrity +checks, allowing IPE to enforce policies that trust files protected by +fs-verity. These two features cannot be turned off once established, so +they are considered immutable properties. These examples demonstrate how +IPE leverages immutable properties, such as a file's origin and its +integrity protection mechanisms, to make access control decisions. + +For the IPE policy, specifically, it grants the ability to enforce +stringent access controls by assessing security properties against +reference values defined within the policy. This assessment can be based on +the existence of a security property (e.g., verifying if a file originates +from initramfs) or evaluating the internal state of an immutable security +property. The latter includes checking the roothash of a dm-verity +protected device, determining whether dm-verity possesses a valid +signature, assessing the digest of a fs-verity protected file, or +determining whether fs-verity possesses a valid built-in signature. This +nuanced approach to policy enforcement enables a highly secure and +customizable system defense mechanism, tailored to specific security +requirements and trust models. + +To enable IPE, ensure that ``CONFIG_SECURITY_IPE`` (under +:menuselection:`Security -> Integrity Policy Enforcement (IPE)`) config +option is enabled. + +Use Cases +--------- + +IPE works best in fixed-function devices: devices in which their purpose +is clearly defined and not supposed to be changed (e.g. network firewall +device in a data center, an IoT device, etcetera), where all software and +configuration is built and provisioned by the system owner. + +IPE is a long-way off for use in general-purpose computing: the Linux +community as a whole tends to follow a decentralized trust model (known as +the web of trust), which IPE has no support for it yet. Instead, IPE +supports PKI (public key infrastructure), which generally designates a +set of trusted entities that provide a measure of absolute trust. + +Additionally, while most packages are signed today, the files inside +the packages (for instance, the executables), tend to be unsigned. This +makes it difficult to utilize IPE in systems where a package manager is +expected to be functional, without major changes to the package manager +and ecosystem behind it. + +The digest_cache LSM [#digest_cache_lsm]_ is a system that when combined with IPE, +could be used to enable and support general-purpose computing use cases. + +Known Limitations +----------------- + +IPE cannot verify the integrity of anonymous executable memory, such as +the trampolines created by gcc closures and libffi (<3.4.2), or JIT'd code. +Unfortunately, as this is dynamically generated code, there is no way +for IPE to ensure the integrity of this code to form a trust basis. + +IPE cannot verify the integrity of programs written in interpreted +languages when these scripts are invoked by passing these program files +to the interpreter. This is because the way interpreters execute these +files; the scripts themselves are not evaluated as executable code +through one of IPE's hooks, but they are merely text files that are read +(as opposed to compiled executables) [#interpreters]_. + +Threat Model +------------ + +IPE specifically targets the risk of tampering with user-space executable +code after the kernel has initially booted, including the kernel modules +loaded from userspace via ``modprobe`` or ``insmod``. + +To illustrate, consider a scenario where an untrusted binary, possibly +malicious, is downloaded along with all necessary dependencies, including a +loader and libc. The primary function of IPE in this context is to prevent +the execution of such binaries and their dependencies. + +IPE achieves this by verifying the integrity and authenticity of all +executable code before allowing them to run. It conducts a thorough +check to ensure that the code's integrity is intact and that they match an +authorized reference value (digest, signature, etc) as per the defined +policy. If a binary does not pass this verification process, either +because its integrity has been compromised or it does not meet the +authorization criteria, IPE will deny its execution. Additionally, IPE +generates audit logs which may be utilized to detect and analyze failures +resulting from policy violation. + +Tampering threat scenarios include modification or replacement of +executable code by a range of actors including: + +- Actors with physical access to the hardware +- Actors with local network access to the system +- Actors with access to the deployment system +- Compromised internal systems under external control +- Malicious end users of the system +- Compromised end users of the system +- Remote (external) compromise of the system + +IPE does not mitigate threats arising from malicious but authorized +developers (with access to a signing certificate), or compromised +developer tools used by them (i.e. return-oriented programming attacks). +Additionally, IPE draws hard security boundary between userspace and +kernelspace. As a result, kernel-level exploits are considered outside +the scope of IPE and mitigation is left to other mechanisms. + +Policy +------ + +IPE policy is a plain-text [#devdoc]_ policy composed of multiple statements +over several lines. There is one required line, at the top of the +policy, indicating the policy name, and the policy version, for +instance:: + + policy_name=Ex_Policy policy_version=0.0.0 + +The policy name is a unique key identifying this policy in a human +readable name. This is used to create nodes under securityfs as well as +uniquely identify policies to deploy new policies vs update existing +policies. + +The policy version indicates the current version of the policy (NOT the +policy syntax version). This is used to prevent rollback of policy to +potentially insecure previous versions of the policy. + +The next portion of IPE policy are rules. Rules are formed by key=value +pairs, known as properties. IPE rules require two properties: ``action``, +which determines what IPE does when it encounters a match against the +rule, and ``op``, which determines when the rule should be evaluated. +The ordering is significant, a rule must start with ``op``, and end with +``action``. Thus, a minimal rule is:: + + op=EXECUTE action=ALLOW + +This example will allow any execution. Additional properties are used to +assess immutable security properties about the files being evaluated. +These properties are intended to be descriptions of systems within the +kernel that can provide a measure of integrity verification, such that IPE +can determine the trust of the resource based on the value of the property. + +Rules are evaluated top-to-bottom. As a result, any revocation rules, +or denies should be placed early in the file to ensure that these rules +are evaluated before a rule with ``action=ALLOW``. + +IPE policy supports comments. The character '#' will function as a +comment, ignoring all characters to the right of '#' until the newline. + +The default behavior of IPE evaluations can also be expressed in policy, +through the ``DEFAULT`` statement. This can be done at a global level, +or a per-operation level:: + + # Global + DEFAULT action=ALLOW + + # Operation Specific + DEFAULT op=EXECUTE action=ALLOW + +A default must be set for all known operations in IPE. If you want to +preserve older policies being compatible with newer kernels that can introduce +new operations, set a global default of ``ALLOW``, then override the +defaults on a per-operation basis (as above). + +With configurable policy-based LSMs, there's several issues with +enforcing the configurable policies at startup, around reading and +parsing the policy: + +1. The kernel *should* not read files from userspace, so directly reading + the policy file is prohibited. +2. The kernel command line has a character limit, and one kernel module + should not reserve the entire character limit for its own + configuration. +3. There are various boot loaders in the kernel ecosystem, so handing + off a memory block would be costly to maintain. + +As a result, IPE has addressed this problem through a concept of a "boot +policy". A boot policy is a minimal policy which is compiled into the +kernel. This policy is intended to get the system to a state where +userspace is set up and ready to receive commands, at which point a more +complex policy can be deployed via securityfs. The boot policy can be +specified via ``SECURITY_IPE_BOOT_POLICY`` config option, which accepts +a path to a plain-text version of the IPE policy to apply. This policy +will be compiled into the kernel. If not specified, IPE will be disabled +until a policy is deployed and activated through securityfs. + +Deploying Policies +~~~~~~~~~~~~~~~~~~ + +Policies can be deployed from userspace through securityfs. These policies +are signed through the PKCS#7 message format to enforce some level of +authorization of the policies (prohibiting an attacker from gaining +unconstrained root, and deploying an "allow all" policy). These +policies must be signed by a certificate that chains to the +``SYSTEM_TRUSTED_KEYRING``. With openssl, the policy can be signed by:: + + openssl smime -sign \ + -in "$MY_POLICY" \ + -signer "$MY_CERTIFICATE" \ + -inkey "$MY_PRIVATE_KEY" \ + -noattr \ + -nodetach \ + -nosmimecap \ + -outform der \ + -out "$MY_POLICY.p7b" + +Deploying the policies is done through securityfs, through the +``new_policy`` node. To deploy a policy, simply cat the file into the +securityfs node:: + + cat "$MY_POLICY.p7b" > /sys/kernel/security/ipe/new_policy + +Upon success, this will create one subdirectory under +``/sys/kernel/security/ipe/policies/``. The subdirectory will be the +``policy_name`` field of the policy deployed, so for the example above, +the directory will be ``/sys/kernel/security/ipe/policies/Ex_Policy``. +Within this directory, there will be seven files: ``pkcs7``, ``policy``, +``name``, ``version``, ``active``, ``update``, and ``delete``. + +The ``pkcs7`` file is read-only. Reading it returns the raw PKCS#7 data +that was provided to the kernel, representing the policy. If the policy being +read is the boot policy, this will return ``ENOENT``, as it is not signed. + +The ``policy`` file is read only. Reading it returns the PKCS#7 inner +content of the policy, which will be the plain text policy. + +The ``active`` file is used to set a policy as the currently active policy. +This file is rw, and accepts a value of ``"1"`` to set the policy as active. +Since only a single policy can be active at one time, all other policies +will be marked inactive. The policy being marked active must have a policy +version greater or equal to the currently-running version. + +The ``update`` file is used to update a policy that is already present +in the kernel. This file is write-only and accepts a PKCS#7 signed +policy. Two checks will always be performed on this policy: First, the +``policy_names`` must match with the updated version and the existing +version. Second the updated policy must have a policy version greater than +or equal to the currently-running version. This is to prevent rollback attacks. + +The ``delete`` file is used to remove a policy that is no longer needed. +This file is write-only and accepts a value of ``1`` to delete the policy. +On deletion, the securityfs node representing the policy will be removed. +However, delete the current active policy is not allowed and will return +an operation not permitted error. + +Similarly, writing to both ``update`` and ``new_policy`` could result in +bad message(policy syntax error) or file exists error. The latter error happens +when trying to deploy a policy with a ``policy_name`` while the kernel already +has a deployed policy with the same ``policy_name``. + +Deploying a policy will *not* cause IPE to start enforcing the policy. IPE will +only enforce the policy marked active. Note that only one policy can be active +at a time. + +Once deployment is successful, the policy can be activated, by writing file +``/sys/kernel/security/ipe/policies/$policy_name/active``. +For example, the ``Ex_Policy`` can be activated by:: + + echo 1 > "/sys/kernel/security/ipe/policies/Ex_Policy/active" + +From above point on, ``Ex_Policy`` is now the enforced policy on the +system. + +IPE also provides a way to delete policies. This can be done via the +``delete`` securityfs node, +``/sys/kernel/security/ipe/policies/$policy_name/delete``. +Writing ``1`` to that file deletes the policy:: + + echo 1 > "/sys/kernel/security/ipe/policies/$policy_name/delete" + +There is only one requirement to delete a policy: the policy being deleted +must be inactive. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack), all + writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Modes +~~~~~ + +IPE supports two modes of operation: permissive (similar to SELinux's +permissive mode) and enforced. In permissive mode, all events are +checked and policy violations are logged, but the policy is not really +enforced. This allows users to test policies before enforcing them. + +The default mode is enforce, and can be changed via the kernel command +line parameter ``ipe.enforce=(0|1)``, or the securityfs node +``/sys/kernel/security/ipe/enforce``. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Audit Events +~~~~~~~~~~~~ + +1420 AUDIT_IPE_ACCESS +^^^^^^^^^^^^^^^^^^^^^ +Event Examples:: + + type=1420 audit(1653364370.067:61): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2241 comm="ld-linux.so" path="/deny/lib/libc.so.6" dev="sda2" ino=14549020 rule="DEFAULT action=DENY" + type=1300 audit(1653364370.067:61): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=7f1105a28000 a1=195000 a2=5 a3=812 items=0 ppid=2219 pid=2241 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="ld-linux.so" exe="/tmp/ipe-test/lib/ld-linux.so" subj=unconfined key=(null) + type=1327 audit(1653364370.067:61): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + + type=1420 audit(1653364735.161:64): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2472 comm="mmap_test" path=? dev=? ino=? rule="DEFAULT action=DENY" + type=1300 audit(1653364735.161:64): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=0 a1=1000 a2=4 a3=21 items=0 ppid=2219 pid=2472 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="mmap_test" exe="/root/overlake_test/upstream_test/vol_fsverity/bin/mmap_test" subj=unconfined key=(null) + type=1327 audit(1653364735.161:64): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + +This event indicates that IPE made an access control decision; the IPE +specific record (1420) is always emitted in conjunction with a +``AUDITSYSCALL`` record. + +Determining whether IPE is in permissive or enforced mode can be derived +from ``success`` property and exit code of the ``AUDITSYSCALL`` record. + + +Field descriptions: + ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===========+============+===========+=================================================================================+ +| ipe_op | string | No | The IPE operation name associated with the log | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ipe_hook | string | No | The name of the LSM hook that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| enforcing | integer | No | The current IPE enforcing state 1 is in enforcing mode, 0 is in permissive mode | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| pid | integer | No | The pid of the process that triggered the IPE event. | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| comm | string | No | The command line program name of the process that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| path | string | Yes | The absolute path to the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ino | integer | Yes | The inode number of the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| dev | string | Yes | The device name of the evaluated file, e.g. vda | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| rule | string | No | The matched policy rule | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ + +1421 AUDIT_IPE_CONFIG_CHANGE +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1421 audit(1653425583.136:54): old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425583.136:54): SYSCALL arch=c000003e syscall=1 success=yes exit=2 a0=3 a1=5596fcae1fb0 a2=2 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425583.136:54): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2 + +This event indicates that IPE switched the active poliy from one to another +along with the version and the hash digest of the two policies. +Note IPE can only have one policy active at a time, all access decision +evaluation is based on the current active policy. +The normal procedure to deploy a new policy is loading the policy to deploy +into the kernel first, then switch the active policy to it. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++------------------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++========================+============+===========+===================================================+ +| old_active_pol_name | string | Yes | The name of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_active_pol_version | string | Yes | The version of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_policy_digest | string | Yes | The hash of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_name | string | No | The name of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_version | string | No | The version of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_policy_digest | string | No | The hash of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++------------------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++------------------------+------------+-----------+---------------------------------------------------+ + +1422 AUDIT_IPE_POLICY_LOAD +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E + +This record indicates a new policy has been loaded into the kernel with the policy name, policy version and policy hash. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++----------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+===================================================+ +| policy_name | string | No | The policy_name | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_version | string | No | The policy_version | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_digest | string | No | The policy hash | ++----------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+---------------------------------------------------+ + + +1404 AUDIT_MAC_STATUS +^^^^^^^^^^^^^^^^^^^^^ + +Event Examples:: + + type=1404 audit(1653425689.008:55): enforcing=0 old_enforcing=1 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + + type=1404 audit(1653425689.008:55): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===============+============+===========+=================================================================================================+ +| enforcing | integer | No | The enforcing state IPE is being switched to, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old_enforcing | integer | No | The enforcing state IPE is being switched from, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| enabled | integer | No | The new TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old-enabled | integer | No | The old TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ + + +Success Auditing +^^^^^^^^^^^^^^^^ + +IPE supports success auditing. When enabled, all events that pass IPE +policy and are not blocked will emit an audit event. This is disabled by +default, and can be enabled via the kernel command line +``ipe.success_audit=(0|1)`` or +``/sys/kernel/security/ipe/success_audit`` securityfs file. + +This is *very* noisy, as IPE will check every userspace binary on the +system, but is useful for debugging policies. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Properties +---------- + +As explained above, IPE properties are ``key=value`` pairs expressed in IPE +policy. Two properties are built-into the policy parser: 'op' and 'action'. +The other properties are used to restrict immutable security properties +about the files being evaluated. Currently those properties are: +'``boot_verified``', '``dmverity_signature``', '``dmverity_roothash``', +'``fsverity_signature``', '``fsverity_digest``'. A description of all +properties supported by IPE are listed below: + +op +~~ + +Indicates the operation for a rule to apply to. Must be in every rule, +as the first token. IPE supports the following operations: + + ``EXECUTE`` + + Pertains to any file attempting to be executed, or loaded as an + executable. + + ``FIRMWARE``: + + Pertains to firmware being loaded via the firmware_class interface. + This covers both the preallocated buffer and the firmware file + itself. + + ``KMODULE``: + + Pertains to loading kernel modules via ``modprobe`` or ``insmod``. + + ``KEXEC_IMAGE``: + + Pertains to kernel images loading via ``kexec``. + + ``KEXEC_INITRAMFS`` + + Pertains to initrd images loading via ``kexec --initrd``. + + ``POLICY``: + + Controls loading policies via reading a kernel-space initiated read. + + An example of such is loading IMA policies by writing the path + to the policy file to ``$securityfs/ima/policy`` + + ``X509_CERT``: + + Controls loading IMA certificates through the Kconfigs, + ``CONFIG_IMA_X509_PATH`` and ``CONFIG_EVM_X509_PATH``. + +action +~~~~~~ + + Determines what IPE should do when a rule matches. Must be in every + rule, as the final clause. Can be one of: + + ``ALLOW``: + + If the rule matches, explicitly allow access to the resource to proceed + without executing any more rules. + + ``DENY``: + + If the rule matches, explicitly prohibit access to the resource to + proceed without executing any more rules. + +boot_verified +~~~~~~~~~~~~~ + + This property can be utilized for authorization of files from initramfs. + The format of this property is:: + + boot_verified=(TRUE|FALSE) + + + .. WARNING:: + + This property will trust files from initramfs(rootfs). It should + only be used during early booting stage. Before mounting the real + rootfs on top of the initramfs, initramfs script will recursively + remove all files and directories on the initramfs. This is typically + implemented by using switch_root(8) [#switch_root]_. Therefore the + initramfs will be empty and not accessible after the real + rootfs takes over. It is advised to switch to a different policy + that doesn't rely on the property after this point. + This ensures that the trust policies remain relevant and effective + throughout the system's operation. + +dmverity_roothash +~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization or revocation of + specific dm-verity volumes, identified via their root hashes. It has a + dependency on the DM_VERITY module. This property is controlled by + the ``IPE_PROP_DM_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``DM_VERITY`` are all enabled. + The format of this property is:: + + dmverity_roothash=DigestName:HexadecimalString + + The supported DigestNames for dmverity_roothash are [#dmveritydigests]_ + + + blake2b-512 + + blake2s-256 + + sha256 + + sha384 + + sha512 + + sha3-224 + + sha3-256 + + sha3-384 + + sha3-512 + + sm3 + + rmd160 + +dmverity_signature +~~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of all dm-verity + volumes that have a signed roothash that validated by a keyring + specified by dm-verity's configuration, either the system trusted + keyring, or the secondary keyring. It depends on + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` config option and is controlled by + the ``IPE_PROP_DM_VERITY_SIGNATURE`` config option, it will be automatically + selected when ``SECURITY_IPE``, ``DM_VERITY`` and + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` are all enabled. + The format of this property is:: + + dmverity_signature=(TRUE|FALSE) + +fsverity_digest +~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of specific fsverity + enabled files, identified via their fsverity digests. + It depends on ``FS_VERITY`` config option and is controlled by + the ``IPE_PROP_FS_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``FS_VERITY`` are all enabled. + The format of this property is:: + + fsverity_digest=DigestName:HexadecimalString + + The supported DigestNames for fsverity_digest are [#fsveritydigest]_ + + + sha256 + + sha512 + +fsverity_signature +~~~~~~~~~~~~~~~~~~ + + This property is used to authorize all fs-verity enabled files that have + been verified by fs-verity's built-in signature mechanism. The signature + verification relies on a key stored within the ".fs-verity" keyring. It + depends on ``FS_VERITY_BUILTIN_SIGNATURES`` config option and + it is controlled by the ``IPE_PROP_FS_VERITY`` config option, + it will be automatically selected when ``SECURITY_IPE``, ``FS_VERITY`` + and ``FS_VERITY_BUILTIN_SIGNATURES`` are all enabled. + The format of this property is:: + + fsverity_signature=(TRUE|FALSE) + +Policy Examples +--------------- + +Allow all +~~~~~~~~~ + +:: + + policy_name=Allow_All policy_version=0.0.0 + DEFAULT action=ALLOW + +Allow only initramfs +~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + +Allow any signed and validated dm-verity volume and the initramfs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_DMV_And_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Prohibit execution from a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Deny_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:cd2c5bae7c6c579edaae4353049d58eb5f2e8be0244bf05345bc8e5ed257baff action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Allow only a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:401fcec5944823ae12f62726e8184407a5fa9599783f030dec146938 action=ALLOW + +Allow any fs-verity file with a valid built-in signature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_And_Validated_FSVerity policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_signature=TRUE action=ALLOW + +Allow execution of a specific fs-verity file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=ALLOW_FSV_By_Digest policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_digest=sha256:fd88f2b8824e197f850bf4c5109bea5cf0ee38104f710843bb72da796ba5af9e action=ALLOW + +Additional Information +---------------------- + +- `Github Repository <https://github.com/microsoft/ipe>`_ +- :doc:`Developer and design docs for IPE </security/ipe>` + +FAQ +--- + +Q: + What's the difference between other LSMs which provide a measure of + trust-based access control? + +A: + + In general, there's two other LSMs that can provide similar functionality: + IMA, and Loadpin. + + IMA and IPE are functionally very similar. The significant difference between + the two is the policy. [#devdoc]_ + + Loadpin and IPE differ fairly dramatically, as Loadpin only covers the IPE's + kernel read operations, whereas IPE is capable of controlling execution + on top of kernel read. The trust model is also different; Loadpin roots its + trust in the initial super-block, whereas trust in IPE is stemmed from kernel + itself (via ``SYSTEM_TRUSTED_KEYS``). + +----------- + +.. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ + +.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. + +.. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on + this topic. + +.. [#switch_root] https://man7.org/linux/man-pages/man8/switch_root.8.html + +.. [#dmveritydigests] These hash algorithms are based on values accepted by + the Linux crypto API; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. + +.. [#fsveritydigest] These hash algorithms are based on values accepted by the + kernel's fsverity support; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst index 325c5d0ed34a..585630d14581 100644 --- a/Documentation/admin-guide/bug-bisect.rst +++ b/Documentation/admin-guide/bug-bisect.rst @@ -1,76 +1,144 @@ -Bisecting a bug -+++++++++++++++ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) +.. [see the bottom of this file for redistribution information] -Last updated: 28 October 2016 +====================== +Bisecting a regression +====================== -Introduction -============ +This document describes how to use a ``git bisect`` to find the source code +change that broke something -- for example when some functionality stopped +working after upgrading from Linux 6.0 to 6.1. -Always try the latest kernel from kernel.org and build from source. If you are -not confident in doing that please report the bug to your distribution vendor -instead of to a kernel developer. +The text focuses on the gist of the process. If you are new to bisecting the +kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +instead: it depicts everything from start to finish while covering multiple +aspects even kernel developers occasionally forget. This includes detecting +situations early where a bisection would be a waste of time, as nobody would +care about the result -- for example, because the problem happens after the +kernel marked itself as 'tainted', occurs in an abandoned version, was already +fixed, or is caused by a .config change you or your Linux distributor performed. -Finding bugs is not always easy. Have a go though. If you can't find it don't -give up. Report as much as you have found to the relevant maintainer. See -MAINTAINERS for who that is for the subsystem you have worked on. +Finding the change causing a kernel issue using a bisection +=========================================================== -Before you submit a bug report read -'Documentation/admin-guide/reporting-issues.rst'. +*Note: the following process assumes you prepared everything for a bisection. +This includes having a Git clone with the appropriate sources, installing the +software required to build and install kernels, as well as a .config file stored +in a safe place (the following example assumes '~/prepared_kernel_.config') to +use as pristine base at each bisection step; ideally, you have also worked out +a fully reliable and straight-forward way to reproduce the regression, too.* -Devices not appearing -===================== - -Often this is caused by udev/systemd. Check that first before blaming it -on the kernel. - -Finding patch that caused a bug -=============================== - -Using the provided tools with ``git`` makes finding bugs easy provided the bug -is reproducible. - -Steps to do it: - -- build the Kernel from its git source -- start bisect with [#f1]_:: - - $ git bisect start - -- mark the broken changeset with:: - - $ git bisect bad [commit] - -- mark a changeset where the code is known to work with:: - - $ git bisect good [commit] - -- rebuild the Kernel and test -- interact with git bisect by using either:: - - $ git bisect good - - or:: - - $ git bisect bad - - depending if the bug happened on the changeset you're testing -- After some interactions, git bisect will give you the changeset that - likely caused the bug. - -- For example, if you know that the current version is bad, and version - 4.8 is good, you could do:: - - $ git bisect start - $ git bisect bad # Current version is bad - $ git bisect good v4.8 - - -.. [#f1] You can, optionally, provide both good and bad arguments at git - start with ``git bisect start [BAD] [GOOD]`` - -For further references, please read: - -- The man page for ``git-bisect`` -- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_ -- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_ -- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_ +* Preparation: start the bisection and tell Git about the points in the history + you consider to be working and broken, which Git calls 'good' and 'bad':: + + git bisect start + git bisect good v6.0 + git bisect bad v6.1 + + Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too. + +1. Copy your prepared .config into the build directory and adjust it to the + needs of the codebase Git checked out for testing:: + + cp ~/prepared_kernel_.config .config + make olddefconfig + +2. Now build, install, and boot a kernel. This might fail for unrelated reasons, + for example, when a compile error happens at the current stage of the + bisection a later change resolves. In such cases run ``git bisect skip`` and + go back to step 1. + +3. Check if the functionality that regressed works in the kernel you just built. + + If it works, execute:: + + git bisect good + + If it is broken, run:: + + git bisect bad + + Note, getting this wrong just once will send the rest of the bisection + totally off course. To prevent having to start anew later you thus want to + ensure what you tell Git is correct; it is thus often wise to spend a few + minutes more on testing in case your reproducer is unreliable. + + After issuing one of these two commands, Git will usually check out another + bisection point and print something like 'Bisecting: 675 revisions left to + test after this (roughly 10 steps)'. In that case go back to step 1. + + If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da + is the first bad commit', then you have finished the bisection. In that case + move to the next point below. Note, right after displaying that line Git will + show some details about the culprit including its patch description; this can + easily fill your terminal, so you might need to scroll up to see the message + mentioning the culprit's commit-id. + + In case you missed Git's output, you can always run ``git bisect log`` to + print the status: it will show how many steps remain or mention the result of + the bisection. + +* Recommended complementary task: put the bisection log and the current .config + file aside for the bug report; furthermore tell Git to reset the sources to + the state before the bisection:: + + git bisect log > ~/bisection-log + cp .config ~/bisection-config-culprit + git bisect reset + +* Recommended optional task: try reverting the culprit on top of the latest + codebase and check if that fixes your bug; if that is the case, it validates + the bisection and enables developers to resolve the regression through a + revert. + + To try this, update your clone and check out latest mainline. Then tell Git + to revert the change by specifying its commit-id:: + + git revert --no-edit cafec0cacaca0 + + Git might reject this, for example when the bisection landed on a merge + commit. In that case, abandon the attempt. Do the same, if Git fails to revert + the culprit on its own because later changes depend on it -- at least unless + you bisected a stable or longterm kernel series, in which case you want to + check out its latest codebase and try a revert there. + + If a revert succeeds, build and test another kernel to check if reverting + resolved your regression. + +With that the process is complete. Now report the regression as described by +Documentation/admin-guide/reporting-issues.rst. + + +Additional reading material +--------------------------- + +* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and + `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_ + in the Git documentation. +* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_ + from kernel developer Nathan Chancellor. +* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_. +* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_. + +.. + end-of-content +.. + This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If + you spot a typo or small mistake, feel free to let him know directly and + he'll fix it. You are free to do the same in a mostly informal way if you + want to contribute changes to the text -- but for copyright reasons please CC + linux-doc@vger.kernel.org and 'sign-off' your contribution as + Documentation/process/submitting-patches.rst explains in the section 'Sign + your work - the Developer's Certificate of Origin'. +.. + This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top + of the file. If you want to distribute this text under CC-BY-4.0 only, + please use 'The Linux kernel development community' for author attribution + and link this as source: + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst + +.. + Note: Only the content of this RST file as found in the Linux kernel sources + is available under CC-BY-4.0, as versions of this text that were processed + (for example by the kernel's build system) might contain content taken from + files which use a more restrictive license. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 95299b08c405..1d0f8ceb3075 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -244,14 +244,14 @@ Reporting the bug Once you find where the bug happened, by inspecting its location, you could either try to fix it yourself or report it upstream. -In order to report it upstream, you should identify the mailing list -used for the development of the affected code. This can be done by using -the ``get_maintainer.pl`` script. +In order to report it upstream, you should identify the bug tracker, if any, or +mailing list used for the development of the affected code. This can be done by +using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: - $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) @@ -267,11 +267,12 @@ Please notice that it will point to: - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org); -- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). +- The Linux Kernel mailing list (linux-kernel@vger.kernel.org); +- The bug reporting URIs for the driver/subsystem (none in the above example). -Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the -driver maintainer (Hans). +If the listing contains bug reporting URIs at the end, please prefer them over +email. Otherwise, please report bugs to the mailing list used for the +development of the code (linux-media ML) copying the driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 86311c2907cd..c65756afd372 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -533,10 +533,12 @@ cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is -achieved by not granting access to these files. For the second, the -kernel rejects writes to all files other than "cgroup.procs" and -"cgroup.subtree_control" on a namespace root from inside the -namespace. +achieved by not granting access to these files. For the second, files +outside the namespace should be hidden from the delegatee by the means +of at least mount namespacing, and the kernel rejects writes to all +files on a namespace root from inside the cgroup namespace, except for +those files listed in "/sys/kernel/cgroup/delegate" (including +"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, @@ -981,6 +983,14 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + nr_subsys_<cgroup_subsys> + Total number of live cgroup subsystems (e.g memory + cgroup) at and beneath the current cgroup. + + nr_dying_subsys_<cgroup_subsys> + Total number of dying cgroup subsystems (e.g. memory + cgroup) at and beneath the current cgroup. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1717,9 +1727,10 @@ The following nested keys are defined. entries fault back in or are written out to disk. memory.zswap.writeback - A read-write single value file. The default value is "1". The - initial value of the root cgroup is 1, and when a new cgroup is - created, it inherits the current value of its parent. + A read-write single value file. The default value is "1". + Note that this setting is hierarchical, i.e. the writeback would be + implicitly disabled for child cgroups if the upper hierarchy + does so. When this is set to 0, all swapping attempts to swapping devices are disabled. This included both zswap writebacks, and swapping due @@ -2939,8 +2950,8 @@ Deprecated v1 Core Features - "cgroup.clone_children" is removed. -- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file - at the root instead. +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or + "cgroup.stat" files at the root instead. Issues with v1 and Rationales for v2 diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index 552c9155165d..48a48bd09372 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -162,13 +162,18 @@ iv_large_sectors Module parameters:: - max_read_size + Maximum size of read requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + max_write_size - Maximum size of read or write requests. When a request larger than this size + Maximum size of write requests. When a request larger than this size is received, dm-crypt will split the request. The splitting improves concurrency (the split requests could be encrypted in parallel by multiple - cores), but it also causes overhead. The user should tune these parameters to + cores), but it also causes overhead. The user should tune this parameters to fit the actual workload. diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index 4bd3ce3ba171..2ad1c05b8c88 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -158,3 +158,72 @@ poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). + +Checking the safe RET mitigation actually works +----------------------------------------------- + +In case one wants to validate whether the SRSO safe RET mitigation works +on a kernel, one could use two performance counters + +* PMC_0xc8 - Count of RET/RET lw retired +* PMC_0xc9 - Count of RET/RET lw retired mispredicted + +and compare the number of RETs retired properly vs those retired +mispredicted, in kernel mode. Another way of specifying those events +is:: + + # perf list ex_ret_near_ret + + List of pre-defined events (to be used in -e or -M): + + core: + ex_ret_near_ret + [Retired Near Returns] + ex_ret_near_ret_mispred + [Retired Near Returns Mispredicted] + +Either the command using the event mnemonics:: + + # perf stat -e ex_ret_near_ret:k -e ex_ret_near_ret_mispred:k sleep 10s + +or using the raw PMC numbers:: + + # perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + +should give the same amount. I.e., every RET retired should be +mispredicted:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 137,167 cpu/event=0xc8,umask=0/k + 137,173 cpu/event=0xc9,umask=0/k + + 10.004110303 seconds time elapsed + + 0.000000000 seconds user + 0.004462000 seconds sys + +vs the case when the mitigation is disabled (spec_rstack_overflow=off) +or not functioning properly, showing usually a lot smaller number of +mispredicted retired RETs vs the overall count of retired RETs during +a workload:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 201,627 cpu/event=0xc8,umask=0/k + 4,074 cpu/event=0xc9,umask=0/k + + 10.003267252 seconds time elapsed + + 0.002729000 seconds user + 0.000000000 seconds sys + +Also, there is a selftest which performs the above, go to +tools/testing/selftests/x86/ and do:: + + make srso + ./srso diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 09126bb8cc9f..8337d0fed311 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -333,12 +333,17 @@ allowed anymore to lift isolation requirements as needed. This option does not override iommu=pt - force_enable - Force enable the IOMMU on platforms known - to be buggy with IOMMU enabled. Use this - option with care. - pgtbl_v1 - Use v1 page table for DMA-API (Default). - pgtbl_v2 - Use v2 page table for DMA-API. - irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + force_enable - Force enable the IOMMU on platforms known + to be buggy with IOMMU enabled. Use this + option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. + irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + nohugepages - Limit page-sizes used for v1 page-tables + to 4 KiB. + v2_pgsizes_only - Limit page-sizes used for v1 page-tables + to 4KiB/2Mib/1GiB. + amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -517,6 +522,18 @@ Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. + bdev_allow_write_mounted= + Format: <bool> + Control the ability to open a mounted block device + for writing, i.e., allow / disallow writes that bypass + the FS. This was implemented as a means to prevent + fuzzers from crashing the kernel by overwriting the + metadata underneath a mounted FS without its awareness. + This also prevents destructive formatting of mounted + filesystems by naive storage tooling that don't use + O_EXCL. Default is Y and can be changed through the + Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED. + bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. @@ -2350,6 +2367,18 @@ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. + ipe.enforce= [IPE] + Format: <bool> + Determine whether IPE starts in permissive (0) or + enforce (1) mode. The default is enforce. + + ipe.success_audit= + [IPE] + Format: <bool> + Start IPE with success auditing enabled, emitting + an audit event when a binary is allowed. The default + is 0. + irqaffinity= [SMP] Set the default irq affinity mask The argument is a cpu list, as described above. @@ -4788,6 +4817,16 @@ printk.time= Show timing data prefixed to each printk message line Format: <bool> (1/Y/y=enable, 0/N/n=disable) + proc_mem.force_override= [KNL] + Format: {always | ptrace | never} + Traditionally /proc/pid/mem allows memory permissions to be + overridden without restrictions. This option may be set to + restrict that. Can be one of: + - 'always': traditional behavior always allows mem overrides. + - 'ptrace': only allow mem overrides for active ptracers. + - 'never': never allow mem overrides. + If not specified, default is the CONFIG_PROC_MEM_* choice. + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -4935,6 +4974,10 @@ Set maximum number of finished RCU callbacks to process in one batch. + rcutree.csd_lock_suppress_rcu_stall= [KNL] + Do only a one-line RCU CPU stall warning when + there is an ongoing too-long CSD-lock wait. + rcutree.do_rcu_barrier= [KNL] Request a call to rcu_barrier(). This is throttled so that userspace tests can safely @@ -5382,7 +5425,13 @@ Time to wait (s) after boot before inducing stall. rcutorture.stall_cpu_irqsoff= [KNL] - Disable interrupts while stalling if set. + Disable interrupts while stalling if set, but only + on the first stall in the set. + + rcutorture.stall_cpu_repeat= [KNL] + Number of times to repeat the stall sequence, + so that rcutorture.stall_cpu_repeat=3 will result + in four stall sequences. rcutorture.stall_gp_kthread= [KNL] Duration (s) of forced sleep within RCU @@ -5570,14 +5619,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_rude_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Rude asynchronous callback batching for - call_rcu_tasks_rude(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_rude(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] Set timeout in milliseconds RCU Tasks Trace asynchronous callback batching for @@ -7352,6 +7393,13 @@ it can be updated at runtime by writing to the corresponding sysfs file. + workqueue.panic_on_stall=<uint> + Panic when workqueue stall is detected by + CONFIG_WQ_WATCHDOG. It sets the number times of the + stall to trigger panic. + + The default is 0, which disables the panic on stall. + workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this threshold are automatically considered CPU intensive diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index 1306f19ecb5a..c9d301ab46a3 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -328,7 +328,7 @@ and an HDMI input, one input for each input type. Those are described in more detail below. Special attention has been given to the rate at which new frames become -available. The jitter will be around 1 jiffie (that depends on the HZ +available. The jitter will be around 1 jiffy (that depends on the HZ configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second), but the long-term behavior is exactly following the framerate. So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate diff --git a/Documentation/admin-guide/perf/arm-ni.rst b/Documentation/admin-guide/perf/arm-ni.rst new file mode 100644 index 000000000000..d26a8f697c36 --- /dev/null +++ b/Documentation/admin-guide/perf/arm-ni.rst @@ -0,0 +1,17 @@ +==================================== +Arm Network-on Chip Interconnect PMU +==================================== + +NI-700 and friends implement a distinct PMU for each clock domain within the +interconnect. Correspondingly, the driver exposes multiple PMU devices named +arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is +the clock domain ID within that particular instance. If multiple NI instances +exist within a system, the PMU devices can be correlated with the underlying +hardware instance via sysfs parentage. + +Each PMU exposes base event aliases for the interface types present in its clock +domain. These require qualifying with the "eventid" and "nodeid" parameters +to specify the event code to count and the interface at which to count it +(per the configured hardware ID as reflected in the xxNI_NODE_INFO register). +The exception is the "cycles" alias for the PMU cycle counter, which is encoded +with the PMU node type and needs no further qualification. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index d47cd229d710..39b8e1fdd0cd 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -46,16 +46,16 @@ Some of the events only exist for specific configurations. DesignWare Cores (DWC) PCIe PMU Driver ======================================= -This driver adds PMU devices for each PCIe Root Port named based on the BDF of +This driver adds PMU devices for each PCIe Root Port named based on the SBDF of the Root Port. For example, - 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) + 0001:30:03.0 PCI bridge: Device 1ded:8000 (rev 01) -the PMU device name for this Root Port is dwc_rootport_3018. +the PMU device name for this Root Port is dwc_rootport_13018. The DWC PCIe PMU driver registers a perf PMU driver, which provides description of available events and configuration options in sysfs, see -/sys/bus/event_source/devices/dwc_rootport_{bdf}. +/sys/bus/event_source/devices/dwc_rootport_{sbdf}. The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration @@ -66,16 +66,16 @@ The "perf list" command shall list the available events from sysfs, e.g.:: $# perf list | grep dwc_rootport <...> - dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] + dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] <...> - dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event] + dwc_rootport_13018/rx_memory_read,lane=?/ [Kernel PMU event] Time Based Analysis Event Usage ------------------------------- Example usage of counting PCIe RX TLP data payload (Units of bytes):: - $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ + $# perf stat -a -e dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ The average RX/TX bandwidth can be calculated using the following formula: @@ -88,7 +88,7 @@ Lane Event Usage Each lane has the same event set and to avoid generating a list of hundreds of events, the user need to specify the lane ID explicitly, e.g.:: - $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/ + $# perf stat -a -e dwc_rootport_13018/rx_memory_read,lane=4/ The driver does not support sampling, therefore "perf record" will not work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst index 5541ff40e06a..083ca50de896 100644 --- a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst @@ -28,7 +28,9 @@ The "identifier" sysfs file allows users to identify the version of the PMU hardware device. The "bus" sysfs file allows users to get the bus number of Root Ports -monitored by PMU. +monitored by PMU. Furthermore users can get the Root Ports range in +[bdf_min, bdf_max] from "bdf_min" and "bdf_max" sysfs attributes +respectively. Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 7eb3dcd6f4da..8502bc174640 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -16,6 +16,7 @@ Performance monitor support starfive_starlink_pmu arm-ccn arm-cmn + arm-ni xgene-pmu arm_dsu_pmu thunderx2-pmu diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index d0324d44f548..210a808b74ec 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -251,7 +251,9 @@ performance supported in `AMD CPPC Performance Capability <perf_cap_>`_). In some ASICs, the highest CPPC performance is not the one in the ``_CPC`` table, so we need to expose it to sysfs. If boost is not active, but still supported, this maximum frequency will be larger than the one in -``cpuinfo``. +``cpuinfo``. On systems that support preferred core, the driver will have +different values for some cores than others and this will reflect the values +advertised by the platform at bootup. This attribute is read-only. ``amd_pstate_lowest_nonlinear_freq`` @@ -262,6 +264,17 @@ lowest non-linear performance in `AMD CPPC Performance Capability <perf_cap_>`_.) This attribute is read-only. +``amd_pstate_hw_prefcore`` + +Whether the platform supports the preferred core feature and it has been +enabled. This attribute is read-only. + +``amd_pstate_prefcore_ranking`` + +The performance ranking of the core. This number doesn't have any unit, but +larger numbers are preferred at the time of reading. This can change at +runtime based on platform conditions. This attribute is read-only. + ``energy_performance_available_preferences`` A list of all the supported EPP preferences that could be used for diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index 6f534a707b2a..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -129,7 +129,7 @@ Setting the ramoops parameters can be done in several different manners: takes a size, alignment and name as arguments. The name is used to map the memory to a label that can be retrieved by ramoops. - reserver_mem=2M:4096:oops ramoops.mem_name=oops + reserve_mem=2M:4096:oops ramoops.mem_name=oops You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index f92551539e8a..700aa72eecb1 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -182,3 +182,5 @@ More detailed explanation for tainting produce extremely unusual kernel structure layouts (even performance pathological ones), which is important to know when debugging. Set at build time. + + 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. |