Age  Commit message  Author  Files  Lines
2023-08-27tls: allocate the fallback aead after checking that the cipher is validSabrina Dubroca1-10/+10
No need to allocate the aead if we're going to fail afterwards. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/335e32511ed55a0b30f3f81a78fa8f323b3bdf8f.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tls: expand use of tls_cipher_desc in tls_set_device_offloadSabrina Dubroca1-18/+4
tls_set_device_offload is already getting iv and rec_seq sizes from tls_cipher_desc. We can now also check if the cipher_type coming from userspace is valid and can be offloaded. We can also remove the runtime check on rec_seq, since we validate it at compile time. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/8ab71b8eca856c7aaf981a45fe91ac649eb0e2e9.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tls: validate cipher descriptions at compile timeSabrina Dubroca1-0/+18
Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/b38fb8cf60e099e82ae9979c3c9c92421042417c.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tls: extend tls_cipher_desc to fully describe the ciphersSabrina Dubroca2-9/+64
- add nonce, usually equal to iv_size but not for chacha
- add offsets into the crypto_info for each field
- add algorithm name
- add offloadable flag

Also add helpers to access each field of a crypto_info struct described by a tls_cipher_desc.

Signed-off-by: Sabrina Dubroca <[email protected]>
Link: https://lore.kernel.org/r/39d5f476d63c171097764e8d38f6f158b7c109ae.1692977948.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <[email protected]>
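As a rough illustration of what a fully described cipher might look like, the sketch below models the descriptor in Python rather than kernel C. The CipherDesc class, the crypto_info_field helper, and the concrete sizes and offsets (a 4-byte tls_crypto_info header followed by iv, key, salt and rec_seq) are assumptions made for this example, not the kernel's definitions.

```python
from typing import NamedTuple


class CipherDesc(NamedTuple):
    """Hypothetical Python stand-in for the kernel's struct tls_cipher_desc."""
    nonce: int
    iv: int
    key: int
    salt: int
    rec_seq: int
    iv_offset: int
    key_offset: int
    salt_offset: int
    rec_seq_offset: int
    cipher_name: str
    offloadable: bool


# Assumed layout: 4-byte header, then iv[8], key[16], salt[4], rec_seq[8].
# The values are illustrative only.
AES_GCM_128 = CipherDesc(
    nonce=8, iv=8, key=16, salt=4, rec_seq=8,
    iv_offset=4, key_offset=12, salt_offset=28, rec_seq_offset=32,
    cipher_name="gcm(aes)", offloadable=True,
)


def crypto_info_field(buf: bytes, desc: CipherDesc, field: str) -> bytes:
    """Slice one field (iv, key, salt or rec_seq) out of a raw crypto_info."""
    offset = getattr(desc, field + "_offset")
    size = getattr(desc, field)
    return buf[offset:offset + size]
```

With offsets stored next to the sizes, one generic accessor can serve every cipher instead of a per-cipher switch statement.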
2023-08-27tls: rename tls_cipher_size_desc to tls_cipher_descSabrina Dubroca4-52/+52
We're going to add other fields to it to fully describe a cipher, so the "_size" name won't match the contents. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/76ca6c7686bd6d1534dfa188fb0f1f6fabebc791.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tls: reduce size of tls_cipher_size_descSabrina Dubroca4-9/+20
tls_cipher_size_desc indexes ciphers by their type, but we're not using indices 0..50 of the array. Each struct tls_cipher_size_desc is 20B, so that's a lot of unused memory. We can reindex the array starting at the lowest used cipher_type. Introduce the get_cipher_size_desc helper to find the right item and avoid out-of-bounds accesses, and make tls_cipher_size_desc's size explicit so that gcc reminds us to update TLS_CIPHER_MIN/MAX when we add a new cipher. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/5e054e370e240247a5d37881a1cd93a67c15f4ca.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
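The reindexing idea can be sketched as follows. This is a Python model, not the kernel code: TLS_CIPHER_MIN = 51 follows from the unused indices 0..50 mentioned above, while TLS_CIPHER_MAX, the two table entries and their sizes are assumptions chosen to keep the example small.

```python
from typing import NamedTuple, Optional


class CipherSizeDesc(NamedTuple):
    iv: int
    key: int
    salt: int
    tag: int
    rec_seq: int


TLS_CIPHER_MIN = 51  # lowest used cipher_type (indices 0..50 were unused)
TLS_CIPHER_MAX = 52  # assumption: this sketch models only two ciphers

DESC_SIZE = 20  # bytes per descriptor, as stated in the commit message
UNUSED_BYTES_BEFORE = TLS_CIPHER_MIN * DESC_SIZE  # 51 unused slots

# The table holds exactly MAX - MIN + 1 entries; slot 0 is cipher_type 51.
CIPHER_SIZE_DESC = [
    CipherSizeDesc(iv=8, key=16, salt=4, tag=16, rec_seq=8),  # illustrative
    CipherSizeDesc(iv=8, key=32, salt=4, tag=16, rec_seq=8),  # illustrative
]
# Mirrors the explicit array size: adding a cipher without updating
# MIN/MAX trips this check.
assert len(CIPHER_SIZE_DESC) == TLS_CIPHER_MAX - TLS_CIPHER_MIN + 1


def get_cipher_size_desc(cipher_type: int) -> Optional[CipherSizeDesc]:
    """Return the descriptor for cipher_type, or None if out of range."""
    if cipher_type < TLS_CIPHER_MIN or cipher_type > TLS_CIPHER_MAX:
        return None
    return CIPHER_SIZE_DESC[cipher_type - TLS_CIPHER_MIN]
```

Routing every lookup through the helper keeps the bounds check in one place, so callers never index the table with a raw userspace value.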
2023-08-27tls: add TLS_CIPHER_ARIA_GCM_* to tls_cipher_size_descSabrina Dubroca1-0/+2
Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/b2e0fb79e6d0a4478be9bf33781dc9c9281c9d56.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tls: move tls_cipher_size_desc to net/tls/tls.hSabrina Dubroca2-10/+10
It's only used in net/tls/*, no need to bloat include/net/tls.h. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/dd9fad80415e5b3575b41f56b331871038362eab.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27selftests: tls: test some invalid inputs for setsockoptSabrina Dubroca1-0/+25
This test will need to be updated if new ciphers are added. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/bfcfa9cffda56d2064296ab7c99a05775dd4c28e.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27selftests: tls: add getsockopt testSabrina Dubroca1-0/+35
The kernel accepts fetching either just the version and cipher type, or exactly the per-cipher struct. Also check that getsockopt returns what we just passed to the kernel. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/81a007ca13de9a74f4af45635d06682cdb385a54.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
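The length rule described above can be modelled as a small predicate. This is a sketch only: the function name and the 40-byte per-cipher size used in the usage note are assumptions for illustration, and the 4-byte header assumes a struct of two 16-bit fields (version, cipher type).

```python
import struct

# Assumed header layout: 16-bit version followed by 16-bit cipher type.
TLS_CRYPTO_INFO = struct.Struct("=HH")


def getsockopt_len_ok(optlen: int, per_cipher_len: int) -> bool:
    """Accept either the bare header or exactly the per-cipher struct."""
    return optlen == TLS_CRYPTO_INFO.size or optlen == per_cipher_len
```

For a hypothetical 40-byte per-cipher struct, lengths 4 and 40 pass, while 3, 5 and 41 are rejected.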
2023-08-27selftests: tls: add test variants for aria-gcmSabrina Dubroca2-0/+25
Only supported for TLS1.2. Signed-off-by: Sabrina Dubroca <[email protected]> Link: https://lore.kernel.org/r/ccf4a4d3f3820f8ff30431b7629f5210cb33fa89.1692977948.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27Merge branch 'tools-net-ynl-add-support-for-netlink-raw-families'Jakub Kicinski15-72/+2632
Donald Hunter says:
====================
tools/net/ynl: Add support for netlink-raw families

This patchset adds support for netlink-raw families such as rtnetlink.

Patch 1 fixes a typo in existing schemas
Patch 2 contains the schema definition
Patches 3 & 4 update the schema documentation
Patches 5 - 9 extend ynl
Patches 10 - 12 add several netlink-raw specs

The netlink-raw schema is very similar to genetlink-legacy and I thought about making the changes there and symlinking to it. On balance I thought that might be problematic for accurate schema validation.

rtnetlink doesn't seem to fit into unified or directional message enumeration models. It seems like an 'explicit' model would be useful, to force the schema author to specify the message ids directly.

There is not yet support for notifications because ynl currently doesn't support defining 'event' properties on a 'do' operation. The message ids are shared so ops need to be both sync and async. I plan to look at this in a future patch.

The link and route messages contain different nested attributes dependent on the type of link or route. Decoding these will need some kind of attr-space selection that uses the value of another attribute as the selector key. These nested attributes have been left with type 'binary' for now.
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Add spec for rt route messagesDonald Hunter1-0/+327
Add schema for rt route with support for getroute, newroute and delroute.

Routes can be dumped with filter attributes like this:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_route.yaml \
    --dump getroute --json '{"rtm-family": 2, "rtm-table": 254}'

Signed-off-by: Donald Hunter <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Add spec for rt link messagesDonald Hunter1-0/+1432
Add schema for rt link with support for newlink, dellink, getlink, setlink and getstats.

A dummy link can be created like this:

sudo ./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_link.yaml \
    --do newlink --create \
    --json '{"ifname": "dummy0", "linkinfo": {"kind": "dummy"}}'

For example, offload stats can be fetched like this:

./tools/net/ynl/cli.py \
    --spec Documentation/netlink/specs/rt_link.yaml \
    --dump getstats --json '{ "filter-mask": 8 }'

Signed-off-by: Donald Hunter <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Add spec for rt addr messagesDonald Hunter1-0/+179
Add schema for rt addr with support for: - newaddr, deladdr, getaddr (dump) Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tools/net/ynl: Add support for create flagsDonald Hunter3-8/+22
Add support for using NLM_F_REPLACE, _EXCL, _CREATE and _APPEND flags in requests. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
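A minimal sketch of how such flags combine into the nlmsg_flags field of a request is shown below. The numeric values are the ones from the Linux netlink UAPI header; the request_flags helper and its string keys are hypothetical and are not claimed to match the ynl implementation.

```python
# Flag values from include/uapi/linux/netlink.h.
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
NLM_F_APPEND = 0x800

# Hypothetical option names for this example.
CREATE_FLAGS = {
    "replace": NLM_F_REPLACE,
    "excl": NLM_F_EXCL,
    "create": NLM_F_CREATE,
    "append": NLM_F_APPEND,
}


def request_flags(options):
    """OR the selected create flags into a basic request+ack flag word."""
    flags = NLM_F_REQUEST | NLM_F_ACK
    for name in options:
        flags |= CREATE_FLAGS[name]
    return flags
```

A "create if missing, fail if present" request would pass both "create" and "excl"; "create" together with "replace" gives create-or-overwrite semantics.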
2023-08-27tools/net/ynl: Implement nlattr array-nest decoding in ynlDonald Hunter1-0/+13
Add support for the 'array-nest' attribute type that is used by several netlink-raw families. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
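To make the 'array-nest' shape concrete: the outer attribute's payload is a sequence of attributes whose type acts as an array index, and each of those carries its own nested attributes. The sketch below decodes that shape from raw bytes. It is a simplified model with hypothetical helper names (parse_attrs, decode_array_nest, pack_attr), it assumes host byte order, and it returns raw payloads instead of applying a spec.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type
NLA_TYPE_MASK = 0x3FFF          # strips the NESTED / NET_BYTEORDER bits


def parse_attrs(buf: bytes):
    """Split a buffer into (type, payload) pairs, honouring 4-byte padding."""
    attrs = []
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        nla_len, nla_type = NLA_HDR.unpack_from(buf, offset)
        if nla_len < NLA_HDR.size:
            break  # malformed attribute, stop rather than loop forever
        payload = buf[offset + NLA_HDR.size:offset + nla_len]
        attrs.append((nla_type & NLA_TYPE_MASK, payload))
        offset += (nla_len + 3) & ~3  # advance to the next aligned attribute
    return attrs


def decode_array_nest(payload: bytes):
    """Return one list of (type, payload) pairs per array entry."""
    return [parse_attrs(entry) for _index, entry in parse_attrs(payload)]


def pack_attr(nla_type: int, payload: bytes) -> bytes:
    """Build a single padded attribute; handy for trying out the decoder."""
    nla_len = NLA_HDR.size + len(payload)
    padding = b"\0" * (((nla_len + 3) & ~3) - nla_len)
    return NLA_HDR.pack(nla_len, nla_type) + payload + padding
```

The array index is discarded here because the list position already conveys ordering; a decoder that needs the index can keep the first tuple element instead.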
2023-08-27tools/net/ynl: Add support for netlink-raw familiesDonald Hunter1-33/+91
Refactor the ynl code to encapsulate protocol specifics into NetlinkProtocol and GenlProtocol. Signed-off-by: Donald Hunter <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tools/net/ynl: Fix extack parsing with fixed header genlmsgDonald Hunter1-25/+40
Move decode_fixed_header into YnlFamily and add a _fixed_header_size method to allow extack decoding to skip the fixed header. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27tools/ynl: Add mcast-group schema parsing to ynlDonald Hunter1-0/+31
Add a SpecMcastGroup class to the nlspec lib. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Document the netlink-raw schema extensionsDonald Hunter2-0/+59
Add a doc page for netlink-raw that describes the schema attributes needed for netlink-raw. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Update genetlink-legacy documentationDonald Hunter3-13/+35
Add documentation for recently added genetlink-legacy schema attributes. Remove statements about 'work in progress' and 'todo'. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Add a schema for netlink-raw familiesDonald Hunter1-0/+410
This schema is largely a copy of the genetlink-legacy schema with the following modifications:

- change the schema id to netlink-raw
- add a top-level protonum property, e.g. 0 (for NETLINK_ROUTE)
- change the protocol enumeration to netlink-raw, removing the genetlink options.
- replace doc references to generic netlink with raw netlink
- add a value property to mcast-group definitions

Signed-off-by: Donald Hunter <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27doc/netlink: Fix typo in genetlink-* schemasDonald Hunter2-2/+2
Fix typo verion -> version in genetlink-c and genetlink-legacy. Signed-off-by: Donald Hunter <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27Merge branch 'devlink-mlx5-add-port-function-attributes-for-ipsec'Jakub Kicinski15-108/+898
Saeed Mahameed says: ==================== {devlink,mlx5}: Add port function attributes for ipsec From Dima: Introduce hypervisor-level control knobs to set the functionality of PCI VF devices passed through to guests. The administrator of a hypervisor host may choose to change the settings of a port function from the defaults configured by the device firmware. The software stack has two types of IPsec offload - crypto and packet. Specifically, the ip xfrm command has sub-commands for "state" and "policy" that have an "offload" parameter. With ip xfrm state, both crypto and packet offload types are supported, while ip xfrm policy can only be offloaded in packet mode. The series introduces two new boolean attributes of a port function: ipsec_crypto and ipsec_packet. The goal is to provide a similar level of granularity for controlling VF IPsec offload capabilities, which would be aligned with the software model. This will allow users to decide if they want both types of offload enabled for a VF, just one of them, or none at all (which is the default). At a high level, the difference between the two knobs is that with ipsec_crypto, only XFRM state can be offloaded. Specifically, only the crypto operation (Encrypt/Decrypt) is offloaded. With ipsec_packet, both XFRM state and policy can be offloaded. Furthermore, in addition to crypto operation offload, IPsec encapsulation is also offloaded. For XFRM state, choosing between crypto and packet offload types is possible. From the HW perspective, different resources may be required for each offload type. 
Examples of when a user prefers to enable IPsec packet offload for a VF when using switchdev mode:

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet disable

$ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet enable

This enables the corresponding IPsec capability of the function before it's enumerated, so when the driver reads the capability from the device firmware, it is enabled. The driver is then able to configure corresponding features and ops of the VF net device to support IPsec state and policy offloading.

v2: https://lore.kernel.org/netdev/[email protected]/
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27net/mlx5: Implement devlink port function cmds to control ipsec_packetDima Chumak6-4/+173
Implement devlink port function commands to enable / disable IPsec packet offloads. This is used to control the IPsec capability of the device. When ipsec_packet is enabled for a VF, it prevents adding IPsec packet offloads on the PF, because the two cannot be active simultaneously due to HW constraints. Conversely, if there are any active IPsec packet offloads on the PF, it's not allowed to enable ipsec_packet on a VF, until PF IPsec offloads are cleared. Signed-off-by: Dima Chumak <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27net/mlx5: Implement devlink port function cmds to control ipsec_cryptoDima Chumak7-1/+431
Implement devlink port function commands to enable / disable IPsec crypto offloads. This is used to control the IPsec capability of the device. When ipsec_crypto is enabled for a VF, it prevents adding IPsec crypto offloads on the PF, because the two cannot be active simultaneously due to HW constraints. Conversely, if there are any active IPsec crypto offloads on the PF, it's not allowed to enable ipsec_crypto on a VF, until PF IPsec offloads are cleared. Signed-off-by: Dima Chumak <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
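The mutual exclusion described above can be modelled with a small state holder. This is a conceptual sketch only: the IpsecOffloadGate class and its method names are hypothetical, and the real driver returns error codes under its own locking rather than booleans.

```python
class IpsecOffloadGate:
    """Toy model: PF offloads and VF capability cannot be active together."""

    def __init__(self):
        self.pf_offloads = 0     # number of active IPsec offloads on the PF
        self.vf_enabled = set()  # VFs with the capability enabled

    def enable_vf(self, vf: int) -> bool:
        # Refused while the PF still has active offloads.
        if self.pf_offloads:
            return False
        self.vf_enabled.add(vf)
        return True

    def disable_vf(self, vf: int) -> None:
        self.vf_enabled.discard(vf)

    def add_pf_offload(self) -> bool:
        # Refused while any VF has the capability enabled.
        if self.vf_enabled:
            return False
        self.pf_offloads += 1
        return True

    def del_pf_offload(self) -> None:
        if self.pf_offloads:
            self.pf_offloads -= 1
```

Both directions are checked at the point of change, so the model can never enter a state where the PF and a VF hold the resource simultaneously.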
2023-08-27net/mlx5: Provide an interface to block change of IPsec capabilitiesLeon Romanovsky4-1/+61
mlx5 HW can't perform IPsec offload operations on the PF and VFs at the same time. While the previous patches added devlink knobs to change IPsec capabilities dynamically, logic is needed to block such changes when IPsec is already configured. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27net/mlx5: Add IFC bits to support IPsec enable/disableLeon Romanovsky1-0/+3
Add hardware definitions that allow controlling IPsec capabilities. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27net/mlx5e: Rewrite IPsec vs. TC block interfaceLeon Romanovsky3-93/+38
In the commit 366e46242b8e ("net/mlx5e: Make IPsec offload work together with eswitch and TC"), new API to block IPsec vs. TC creation was introduced. Internally, that API used devlink lock to avoid races with userspace, but it is not really needed as dev->priv.eswitch is stable and can't be changed. So remove dependency on devlink lock and move block encap code back to its original place. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27net/mlx5: Drop extra layer of locks in IPsecLeon Romanovsky1-14/+4
There is no need to hold the devlink lock, as it adds nothing beyond the write mode_lock that is already held. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27devlink: Expose port function commands to control IPsec packet offloadsDima Chumak4-0/+97
Expose port function commands to enable / disable IPsec packet offloads. This is used to control the port IPsec capabilities.

When IPsec packet is disabled for a function of the port (default), the function cannot offload IPsec packet operations (encapsulation and XFRM policy offload). When enabled, IPsec packet operations can be offloaded by the function of the port, which includes crypto operation (Encrypt/Decrypt), IPsec encapsulation and XFRM state and policy offload.

Example of a PCI VF port which supports IPsec packet offloads:

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable ipsec_packet disable

$ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable ipsec_packet enable

Signed-off-by: Dima Chumak <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27devlink: Expose port function commands to control IPsec crypto offloadsDima Chumak4-0/+96
Expose port function commands to enable / disable IPsec crypto offloads. This is used to control the port IPsec capabilities.

When IPsec crypto is disabled for a function of the port (default), the function cannot offload any IPsec crypto operations (Encrypt/Decrypt and XFRM state offloading). When enabled, IPsec crypto operations can be offloaded by the function of the port.

Example of a PCI VF port which supports IPsec crypto offloads:

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable

$ devlink port function set pci/0000:06:00.0/1 ipsec_crypto enable

$ devlink port show pci/0000:06:00.0/1
    pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0
        function:
        hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto enable

Signed-off-by: Dima Chumak <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-27Linux 6.5Linus Torvalds1-1/+1
2023-08-27dt-bindings: PCI: qcom: Fix SDX65 compatibleKrzysztof Kozlowski1-5/+7
Commit c0aba9f32801 ("dt-bindings: PCI: qcom: Add SDX65 SoC") adding SDX65 was never tested and is clearly bogus. The qcom,sdx65-pcie-ep compatible is followed by a fallback in DTS, and there is no driver matched by this compatible. Driver matches by its fallback qcom,sdx55-pcie-ep. This also fixes dtbs_check warnings like: qcom-sdx65-mtp.dtb: pcie-ep@1c00000: compatible: ['qcom,sdx65-pcie-ep', 'qcom,sdx55-pcie-ep'] is too long [kwilczynski: commit log] Fixes: c0aba9f32801 ("dt-bindings: PCI: qcom: Add SDX65 SoC") Link: https://lore.kernel.org/linux-pci/[email protected] Signed-off-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: Krzysztof Wilczyński <[email protected]> Acked-by: Conor Dooley <[email protected]> Cc: [email protected]
2023-08-27ext4: fix slab-use-after-free in ext4_es_insert_extent()Baokun Li1-14/+30
Yikebaer reported an issue:
==================================================================
BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0
fs/ext4/extents_status.c:894
Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438

CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1
Call Trace:
 [...]
 kasan_report+0xba/0xf0 mm/kasan/report.c:588
 ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]

Allocated by task 8438:
 [...]
 kmem_cache_zalloc include/linux/slab.h:693 [inline]
 __es_alloc_extent fs/ext4/extents_status.c:469 [inline]
 ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]

Freed by task 8438:
 [...]
 kmem_cache_free+0xec/0x490 mm/slub.c:3823
 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline]
 __es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802
 ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882
 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
 ext4_zero_range fs/ext4/extents.c:4622 [inline]
 ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
 [...]
==================================================================

The flow of issue triggering is as follows:

1. remove es
      raw es               es  removed  es1
|-------------------| -> |----|.......|------|

2. insert es
  es   insert   es1      merge with es  es1     merge with es and free es1
|----|.......|------| -> |------------|------| -> |-------------------|

es merges with newes, then merges with es1, frees es1, then determines if es1->es_len is 0 and triggers a UAF.

The code flow is as follows:

ext4_es_insert_extent
  es1 = __es_alloc_extent(true);
  es2 = __es_alloc_extent(true);
  __es_remove_extent(inode, lblk, end, NULL, es1)
    __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree
  __es_insert_extent(inode, &newes, es2)
    ext4_es_try_to_merge_right
      ext4_es_free_extent(inode, es1) ---> es1 is freed
  if (es1 && !es1->es_len)
    // Trigger UAF by determining if es1 is used.

We determine whether es1 or es2 is used immediately after calling __es_remove_extent() or __es_insert_extent() to avoid triggering a UAF if es1 or es2 is freed.

Reported-by: Yikebaer Aizezi <[email protected]>
Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com
Fixes: 2a69c450083d ("ext4: using nofail preallocation in ext4_es_insert_extent()")
Cc: [email protected]
Signed-off-by: Baokun Li <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27libfs: remove redundant checks of s_encodingEric Biggers1-12/+2
Now that neither ext4 nor f2fs allows inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: remove redundant checks of s_encodingEric Biggers2-4/+4
Now that ext4 does not allow inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: reject casefold inode flag without casefold featureEric Biggers1-1/+4
It is invalid for the casefold inode flag to be set without the casefold superblock feature flag also being set. e2fsck already considers this case to be invalid and handles it by offering to clear the casefold flag on the inode. __ext4_iget() also already considered this to be invalid, sort of, but it only got so far as logging an error message; it didn't actually reject the inode. Make it reject the inode so that other code doesn't have to handle this case. This matches what f2fs does. Note: we could check 's_encoding != NULL' instead of ext4_has_feature_casefold(). This would make the check robust against the casefold feature being enabled by userspace writing to the page cache of the mounted block device. However, it's unsolvable in general for filesystems to be robust against concurrent writes to the page cache of the mounted block device. Though this very particular scenario involving the casefold feature is solvable, we should not pretend that we can support this model, so let's just check the casefold feature. tune2fs already forbids enabling casefold on a mounted filesystem. Signed-off-by: Eric Biggers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: use LIST_HEAD() to initialize the list_head in mballoc.cRuan Jinjie1-13/+5
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: do not mark inode dirty every time when appending using delallocLiu Song1-26/+62
In the delalloc append write scenario, if inode's i_size is extended due to buffer write, there are delalloc writes pending in the range up to i_size, and no need to touch i_disksize since writeback will push i_disksize up to i_size eventually. Offers significant performance improvement in high-frequency append write scenarios. I conducted tests in my 32-core environment by launching 32 concurrent threads to append write to the same file. Each write operation had a length of 1024 bytes and was repeated 100000 times. Without using this patch, the test was completed in 7705 ms. However, with this patch, the test was completed in 5066 ms, resulting in a performance improvement of 34%. Moreover, in test scenarios of Kafka version 2.6.2, using packet size of 2K, with this patch resulted in a 10% performance improvement. Signed-off-by: Liu Song <[email protected]> Suggested-by: Jan Kara <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
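The quoted 34% figure can be reproduced from the two timings. The short calculation below uses only the numbers stated in the message; the variable and function names are made up for this example.

```python
def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fraction of wall-clock time saved relative to the baseline."""
    return (before_ms - after_ms) / before_ms


THREADS = 32
WRITES_PER_THREAD = 100_000
WRITE_SIZE = 1024  # bytes per append

# Total payload appended during one run of the test.
TOTAL_BYTES = THREADS * WRITES_PER_THREAD * WRITE_SIZE

# 7705 ms without the patch, 5066 ms with it.
GAIN = relative_improvement(7705, 5066)
```

Rounded to whole percent, GAIN matches the 34% stated in the commit message.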
2023-08-27ext4: rename s_error_work to s_sb_upd_workTheodore Ts'o2-19/+22
The most common reason for s_error_work to be scheduled is now the periodic update of the superblock. So rename it to s_sb_upd_work. Also rename the function flush_stashed_error_work() to update_super_work(). Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: add periodic superblock update checkVitaliy Kuznetsov1-1/+61
This patch introduces a mechanism to periodically check and update the superblock within the ext4 file system. The main purpose of this patch is to keep the disk superblock up to date. The update will be performed if more than one hour has passed since the last update, and if more than 16MB of data have been written to disk. This check and update is performed within the ext4_journal_commit_callback function, ensuring that the superblock is written while the disk is active, rather than based on a timer that may trigger during disk idle periods. Discussion https://www.spinics.net/lists/linux-ext4/msg85865.html Signed-off-by: Vitaliy Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
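The update condition reads as a conjunction of a time threshold and a write-volume threshold. A minimal sketch of that predicate follows; the function name, parameter names and the use of plain seconds and bytes are assumptions for illustration, not the kernel's variables or units.

```python
UPDATE_INTERVAL_SECONDS = 60 * 60      # one hour
UPDATE_MIN_BYTES = 16 * 1024 * 1024    # 16MB


def should_update_superblock(now_s: int, last_update_s: int,
                             bytes_written_since_update: int) -> bool:
    """True only when both the time and the write-volume thresholds are exceeded."""
    elapsed = now_s - last_update_s
    return (elapsed > UPDATE_INTERVAL_SECONDS
            and bytes_written_since_update > UPDATE_MIN_BYTES)
```

Because the check is evaluated from the journal commit path rather than a timer, an idle disk never satisfies it: with no commits there is nothing to call the predicate.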
2023-08-27ext4: drop dio overwrite only flag and associated warningBrian Foster1-15/+10
The commit referenced below opened up concurrent unaligned dio under shared locking for pure overwrites. In doing so, it enabled use of the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected -EAGAIN returns as an extra precaution, since ext4 does not retry writes in such cases. The flag itself is advisory in this case since ext4 checks for unaligned I/Os and uses appropriate locking up front, rather than on a retry in response to -EAGAIN. As it turns out, the warning check is susceptible to false positives because there are scenarios where -EAGAIN can be expected from lower layers without necessarily having IOCB_NOWAIT set on the iocb. For example, one instance of the warning has been seen where io_uring sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on the bio. This results in -EAGAIN if the block layer is unable to allocate a request, etc. [Note that there is an outstanding patch to untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on IOCB_NOWAIT, which would also address this instance of the warning.] Another instance of the warning has been reproduced by syzbot. A dio write is interrupted down in __get_user_pages_locked() waiting on the mm lock and returns -EAGAIN up the stack. If the iomap dio iteration layer has made no progress on the write to this point, -EAGAIN returns up to the filesystem and triggers the warning. This use of the overwrite flag in ext4 is precautionary and half-baked. I.e., ext4 doesn't actually implement overwrite checking in the iomap callbacks when the flag is set, so the only extra verification it provides are i_size checks in the generic iomap dio layer. Combined with the tendency for false positives, the added verification is not worth the extra trouble. Remove the flag, associated warning, and update the comments to document when concurrent unaligned dio writes are allowed and why said flag is not used. 
Cc: [email protected] Reported-by: [email protected] Reported-by: Pengfei Xu <[email protected]> Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites") Signed-off-by: Brian Foster <[email protected]> Reviewed-by: Jan Kara <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: add correct group descriptors and reserved GDT blocks to system zoneWang Jianjian3-8/+17
When setup_system_zone is called, flex_bg is not yet initialized, so it is always 1. Use a new helper function, ext4_num_base_meta_blocks(), which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: [email protected] Signed-off-by: Wang Jianjian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: remove unused function declarationCai Xinchen1-6/+0
These functions have no implementation, so their declarations are useless. Remove them. Signed-off-by: Cai Xinchen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: mballoc: avoid garbage value from errSu Hui1-1/+1
clang's static analysis warning: fs/ext4/mballoc.c line 4178, column 6, Branch condition evaluates to a garbage value. err is uninitialized and will be evaluated when 'len <= 0' or when the loop is first entered while the condition "!ext4_sb_block_valid()" is true. Although this can't cause problems now, it's better to correct it. Signed-off-by: Su Hui <[email protected]> Reviewed-by: Nick Desaulniers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple()Lu Hongfei1-1/+1
Signed-off-by: Lu Hongfei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: change the type of blocksize in ext4_mb_init_cache()Lu Hongfei1-1/+1
The return value type of i_blocksize() is 'unsigned int', so the type of blocksize has been modified from 'int' to 'unsigned int' to ensure data type consistency. Signed-off-by: Lu Hongfei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]>
2023-08-27ext4: fix unattached inode after power cut with orphan file feature enabledZhihao Cheng1-0/+3
Running generic/475 (filesystem consistency tests after power cut) could easily trigger an unattached inode error while doing fsck:

  Unattached zero-length inode 39405. Clear? no
  Unattached inode 39405
  Connect to /lost+found? no

The above inconsistency is caused by the following process:

       P1                       P2
ext4_create
 inode = ext4_new_inode_start_handle  // itable records nlink=1
 ext4_add_nondir
   err = ext4_add_entry  // ENOSPC
    ext4_append
     ext4_bread
      ext4_getblk
       ext4_map_blocks // returns ENOSPC
   drop_nlink(inode) // won't be updated into disk inode
   ext4_orphan_add(handle, inode)
    ext4_orphan_file_add
 ext4_journal_stop(handle)
                      jbd2_journal_commit_transaction // commit success
              >> power cut <<
ext4_fill_super
 ext4_load_and_init_journal   // itable records nlink=1
 ext4_orphan_cleanup
  ext4_process_orphan
   if (inode->i_nlink)           // true, inode won't be deleted

Then, the allocated inode will be reserved on disk and corresponds to no dentries, so e2fsck reports an 'unattached inode' problem.

The problem won't happen if the orphan file feature is disabled, because ext4_orphan_add() will update the disk inode in orphan list mode. There are several places that do not update the disk inode while putting the inode into the orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout in ext4_rename(). Fix it by updating the inode on disk in all error branches of these places.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605
Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Zhihao Cheng <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>