Age | Commit message | Author | Files | Lines
2024-03-06 | ethtool: Add GTP RSS hash options to ethtool.h | Takeru Hayasaka | 1 | -0/+48

This is a patch that enables RSS functionality for GTP packets using ethtool.

A user can include the TEID and make RSS work for GTP-U over IPv4 by doing the following:

`ethtool -N ens3 rx-flow-hash gtpu4 sde`

In addition to gtpu(4|6), we now support gtpc(4|6), gtpc(4|6)t, gtpu(4|6)e, gtpu(4|6)u, and gtpu(4|6)d.

gtpc(4|6): Used for GTP-C in IPv4 and IPv6, where the GTP header format does not include a TEID.
gtpc(4|6)t: Used for GTP-C in IPv4 and IPv6, with a GTP header format that includes a TEID.
gtpu(4|6): Used for GTP-U in both IPv4 and IPv6 scenarios.
gtpu(4|6)e: Used for GTP-U with extension headers in both IPv4 and IPv6.
gtpu(4|6)u: Used when the PSC (PDU session container) in the GTP-U extension header includes Uplink, applicable to both IPv4 and IPv6.
gtpu(4|6)d: Used when the PSC in the GTP-U extension header includes Downlink, for both IPv4 and IPv6.

GTP generates a flow that includes an ID called the TEID to identify the tunnel. This tunnel is created for each UE (User Equipment). By performing RSS based on this flow, it is possible to apply RSS per communication unit from the UE. Without this, RSS would only be effective within the range of IP addresses. For instance, the PGW can only perform RSS within the IP range of the SGW. This is problematic from a load-distribution perspective, especially if there is a bias in the terminals connected to a particular base station. This patch solves that case.

Signed-off-by: Takeru Hayasaka <[email protected]>
Reviewed-by: Marcin Szycik <[email protected]>
Tested-by: Pucha Himasekhar Reddy <[email protected]> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
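As a rough illustration of how the `sde` argument in the example above selects hash fields, the following C sketch translates an rx-flow-hash letter string into a mask of `RXH_*` bits, in the spirit of what the ethtool utility does before issuing its request. The bit values are mirrored locally under hypothetical `SKETCH_` names so the sketch builds against older headers; treating the new TEID bit as `1 << 8` is an assumption to check against `include/uapi/linux/ethtool.h`. `parse_rxfh_fields()` is an illustrative name, not the ethtool source.

```c
#include <stdint.h>

/* RXH_* bits mirrored locally from include/uapi/linux/ethtool.h.
 * SKETCH_RXH_GTP_TEID is the bit this patch adds; its value here is an
 * assumption to verify against your kernel headers. */
#define SKETCH_RXH_L2DA      (1u << 1)
#define SKETCH_RXH_VLAN      (1u << 2)
#define SKETCH_RXH_L3_PROTO  (1u << 3)
#define SKETCH_RXH_IP_SRC    (1u << 4)
#define SKETCH_RXH_IP_DST    (1u << 5)
#define SKETCH_RXH_L4_B_0_1  (1u << 6)
#define SKETCH_RXH_L4_B_2_3  (1u << 7)
#define SKETCH_RXH_GTP_TEID  (1u << 8)

/* Translate a hash-field letter string such as "sde" into a bit mask.
 * Returns 0 and stores the mask on success, -1 on an unknown letter. */
static int parse_rxfh_fields(const char *s, uint32_t *mask)
{
    uint32_t m = 0;

    for (; *s; s++) {
        switch (*s) {
        case 'm': m |= SKETCH_RXH_L2DA; break;     /* L2 destination address */
        case 'v': m |= SKETCH_RXH_VLAN; break;     /* VLAN tag */
        case 't': m |= SKETCH_RXH_L3_PROTO; break; /* L3 protocol field */
        case 's': m |= SKETCH_RXH_IP_SRC; break;   /* IP source address */
        case 'd': m |= SKETCH_RXH_IP_DST; break;   /* IP destination address */
        case 'f': m |= SKETCH_RXH_L4_B_0_1; break; /* L4 header bytes 0-1 */
        case 'n': m |= SKETCH_RXH_L4_B_2_3; break; /* L4 header bytes 2-3 */
        case 'e': m |= SKETCH_RXH_GTP_TEID; break; /* GTP TEID, as in "sde" */
        default:
            return -1;
        }
    }
    *mask = m;
    return 0;
}
```

With these values, `parse_rxfh_fields("sde", &m)` yields `0x130`: IP source, IP destination and TEID, which is what spreads GTP-U traffic per tunnel rather than per SGW address.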
2024-03-06 | Merge tag 'vfs-6.8-release.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 4 | -55/+64

Pull vfs fixes from Christian Brauner:

 - Get rid of the copy_mc flag in iov_iter, which really only makes sense for the core dumping code, so move it out of the generic iov_iter code and make it coredump's problem. See the detailed commit description.

 - Revert "fs/aio: Make io_cancel() generate completions again"

   The initial fix here was predicated on the assumption that calling ki_cancel() didn't complete aio requests. However, that turned out to be wrong, since the two drivers that actually make use of this set a cancellation function that performs the cancellation correctly. So revert this change.

 - Ensure that the test for IOCB_AIO_RW always happens before the read from ki_ctx.

* tag 'vfs-6.8-release.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: get rid of 'copy_mc' flag
  fs/aio: Check IOCB_AIO_RW before the struct aio_kiocb conversion
  Revert "fs/aio: Make io_cancel() generate completions again"
2024-03-06 | Merge tag 'arm-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc | Linus Torvalds | 16 | -77/+35

Pull ARM SoC fixes from Arnd Bergmann:
 "These should be the final fixes for the soc tree for 6.8; as usual, they mostly deal with dts files:

   - Qualcomm fixes for pcie4 on sc8280xp, a revert of msm8996 mpm support, sm6115 interconnect and sm8650 gpio.

   - Two fixes for Tegra234 ethernet

   - A Makefile fix to actually build the allwinner based orange pi zero 2w device tree

   - Fixes for clocks and reset on imx8mp and a DSI display regression on imx7.

  The non-DT fixes are:

   - Firmware fixes addressing a kernel panic in op-tee and a minor regression in microchip/riscv.

   - A defconfig change to bring back backlight support after a Kconfig change"

* tag 'arm-fixes-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  firmware: microchip: Fix over-requested allocation size
  tee: optee: Fix kernel panic caused by incorrect error handling
  Revert "arm64: dts: qcom: msm8996: Hook up MPM"
  arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
  arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed
  arm64: dts: imx8mp: Fix LDB clocks property
  arm64: dts: imx8mp: Fix TC9595 reset GPIO on DH i.MX8M Plus DHCOM SoM
  MAINTAINERS: Use a proper mailinglist for NXP i.MX development
  ARM: dts: imx7: remove DSI port endpoints
  arm64: dts: allwinner: h616: Add Orange Pi Zero 2W to Makefile
  ARM: imx_v6_v7_defconfig: Restore CONFIG_BACKLIGHT_CLASS_DEVICE
  arm64: tegra: Fix Tegra234 MGBE power-domains
  arm64: tegra: Set the correct PHY mode for MGBE
  arm64: dts: qcom: sm6115: Fix missing interconnect-names
  arm64: dts: qcom: sm8650-mtp: add gpio74 as reserved gpio
  arm64: dts: qcom: sm8650-qrd: add gpio74 as reserved gpio
2024-03-06 | Merge tag 'v6.8-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 | Linus Torvalds | 2 | -19/+19

Pull crypto fixes from Herbert Xu:
 "Fix potential use-after-frees in rk3288 and sun8i-ce"

* tag 'v6.8-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: rk3288 - Fix use after free in unprepare
  crypto: sun8i-ce - Fix use after free in unprepare
2024-03-06 | inet: Add getsockopt support for IP_ROUTER_ALERT and IPV6_ROUTER_ALERT | Juntong Deng | 3 | -3/+17

Currently getsockopt does not support IP_ROUTER_ALERT and IPV6_ROUTER_ALERT, and we are unable to get the values of these two socket options through getsockopt.

This patch adds getsockopt support for IP_ROUTER_ALERT and IPV6_ROUTER_ALERT.

Signed-off-by: Juntong Deng <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
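A minimal sketch of reading the option from user space, assuming a kernel that includes this patch. `get_ip_router_alert()` is a hypothetical helper name; on kernels without the patch the call is expected to fail with `ENOPROTOOPT`, which the helper reports as a negative errno. Setting IP_ROUTER_ALERT is generally only accepted on raw sockets, so the sketch only reads it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read IP_ROUTER_ALERT from an IPv4 socket.
 * Returns the option value (0 or 1) on success, or -errno on failure;
 * kernels that lack getsockopt support for the option return -ENOPROTOOPT. */
static int get_ip_router_alert(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_IP, IP_ROUTER_ALERT, &val, &len) < 0)
        return -errno;
    return val;
}
```

The IPv6 counterpart would call `getsockopt()` with `IPPROTO_IPV6` and `IPV6_ROUTER_ALERT` on an `AF_INET6` socket in the same way.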
2024-03-06 | Merge branch 'ynl-small-recv' | David S. Miller | 2 | -7/+42

Jakub Kicinski says:

====================
tools: ynl: add --dbg-small-recv for easier kernel testing

When testing netlink dumps I usually hack some user space up to constrain its user space buffer size (iproute2, ethtool or ynl). Netlink will try to fill the messages up, so since these apps use large buffers by default, the dumps are rarely fragmented.

I was hoping to figure out a way to create a selftest for dump testing, but so far I have no idea how to do that in a useful and generic way.

Until someone does that, make manual dump testing easier with YNL. Create a special option for limiting the buffer size, so I don't have to make the same edits each time, and maybe others will benefit, too :)

Example:

  $ ./cli.py [...] --dbg-small-recv >/dev/null
  Recv: read 3712 bytes, 29 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 3968 bytes, 31 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 532 bytes, 5 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

Now let's make the DONE not fit in the last message:

  $ ./cli.py [...] --dbg-small-recv 4499 >/dev/null
  Recv: read 3712 bytes, 29 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 4480 bytes, 35 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 20 bytes, 1 messages
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

A real test would also have to check that the messages are complete and not duplicated. That part has to be done manually right now.

Note that the first message is always conservatively sized by the kernel. Still, I think this is good enough to be useful.

v2:
 - patch 2:
   - move the recv_size setting up
   - change the default to 0 so that cli.py doesn't have to worry about what the "unset" value is
v1: https://lore.kernel.org/all/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
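The byte counts in the examples above can be reproduced with a simplified packing model: 64 data messages of 128 bytes (29 + 31 + 4 in the first run), a 20-byte NLMSG_DONE, and a first read capped at 3712 bytes to stand in for the conservatively sized first message. The C sketch below is only that model, with hypothetical names; it is not the kernel's dump code. It shows why a 4499-byte buffer pushes DONE into a read of its own (4480 + 20 > 4499).

```c
#include <stddef.h>

/* Simplified model (an assumption, not the kernel algorithm) of a netlink
 * dump of fixed-size messages split across recv() calls. Each read packs
 * as many whole data messages as fit; NLMSG_DONE is appended only when no
 * data remains and it still fits, otherwise it arrives in a later read.
 * first_cap models the conservatively sized first skb.
 * Writes each read's byte count to out[] and returns the number of reads. */
static size_t model_dump_reads(size_t n_msgs, size_t msg_len, size_t done_len,
                               size_t first_cap, size_t buf_size,
                               size_t *out, size_t max_reads)
{
    size_t reads = 0;
    int done_sent = 0;

    while (!done_sent && reads < max_reads) {
        size_t cap = (reads == 0 && first_cap < buf_size) ? first_cap : buf_size;
        size_t fit = cap / msg_len;   /* whole messages that fit */
        size_t used;

        if (fit > n_msgs)
            fit = n_msgs;
        n_msgs -= fit;
        used = fit * msg_len;

        /* Piggyback DONE on this read only if the dump is finished. */
        if (n_msgs == 0 && used + done_len <= cap) {
            used += done_len;
            done_sent = 1;
        }
        out[reads++] = used;
    }
    return reads;
}
```

Running it with a 4000-byte buffer gives reads of 3712, 3968 and 532 bytes; with 4499 bytes it gives 3712, 4480 and 20, matching the two outputs quoted in the cover letter.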
2024-03-06 | tools: ynl: add --dbg-small-recv for easier kernel testing | Jakub Kicinski | 1 | -1/+6

Most "production" netlink clients use large buffers to make dumps efficient, which means that handling of dump continuation in the kernel is not very well tested.

Add an option for debugging / testing handling of dumps. It enables printing of extra netlink-level debug and lowers the recv() buffer size in one go. When used without any argument (--dbg-small-recv) it picks a very small default (4000); an explicit size can be set, too (--dbg-small-recv 5000).

Example:

  $ ./cli.py [...] --dbg-small-recv
  Recv: read 3712 bytes, 29 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 3968 bytes, 31 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  Recv: read 532 bytes, 5 messages
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
  [...]
     nl_len = 128 (112) nl_flags = 0x0 nl_type = 19
     nl_len = 20 (4) nl_flags = 0x2 nl_type = 3

(the [...] are edits to shorten the commit message).

Note that the first message of the dump is sized conservatively by the kernel.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: support debug printing messages | Jakub Kicinski | 1 | -0/+15

For manual debug, allow printing the netlink level messages to stderr.

Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: allow setting recv() size | Jakub Kicinski | 1 | -3/+18

Make the size of the buffer we use for recv() configurable. The details of the buffer sizing in netlink are somewhat arcane; we could spend a lot of time polishing this API. Let's just leave some hopefully helpful comments for now. This is a for-developers-only feature, anyway.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: move the new line in NlMsg __repr__ | Jakub Kicinski | 1 | -3/+3

We add the new line even if the message has no error or extack, which leads to print(nl_msg) ending with two new lines.

Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | Merge branch 'tools-ynl-make-clean' | David S. Miller | 4 | -8/+9

Jakub Kicinski says:

====================
tools: ynl: clean up make clean

The first change renames the clean target which removes build results to a more common name. The second one adds missing .PHONY targets. The third one ensures that clean deletes __pycache__.

v2: add patch 2
v1: https://lore.kernel.org/all/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: remove __pycache__ during clean | Jakub Kicinski | 1 | -0/+1

The build process uses Python to generate the user space code. Remove __pycache__ on make clean.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: add distclean to .PHONY in all makefiles | Jakub Kicinski | 3 | -3/+3

Donald points out that most YNL makefiles are missing distclean in .PHONY, even though generated/Makefile does list it.

Suggested-by: Donald Hunter <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | tools: ynl: rename make hardclean -> distclean | Jakub Kicinski | 4 | -5/+5

The make target to remove all generated files used to be called "hardclean" because it deleted files which were tracked by git. We no longer track generated user space files, so use the more common "distclean" name.

Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Donald Hunter <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | net/rds: fix WARNING in rds_conn_connect_if_down | Edward Adam Davis | 2 | -5/+4

If the connection isn't established yet, get_mr() will fail; trigger the connection after get_mr().

Fixes: 584a8279a44a ("RDS: RDMA: return appropriate error on rdma map failures")
Reported-and-tested-by: [email protected]
Signed-off-by: Edward Adam Davis <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue | David S. Miller | 12 | -201/+125

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-03-04 (ice)

This series contains updates to the ice driver only.

Jake changes the driver to use a relative VSI index for VF VSIs, as the VF driver has no direct use of the VSI number on ice hardware. He also reworks some Tx/Rx functions to clarify their uses, cleans up some style issues, and utilizes kernel helper functions.

Maciej removes a redundant call to disable Tx queues on ifdown and removes some unnecessary devm usages.
====================

Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | Merge branch 'ravb-cleanups' | David S. Miller | 2 | -147/+83

Niklas Söderlund says:

====================
ravb: Align Rx descriptor setup and maintenance

When RZ/G2L support was added, the Rx code path was split in two, one to support R-Car and one to support RZ/G2L. One reason for this is that R-Car uses the extended Rx descriptor format, while RZ/G2L uses the normal descriptor format.

In many respects this split is not needed, as the extended descriptor format is just a normal descriptor with extra metadata (timestamps) appended. And the R-Car SoCs can also use normal descriptors if hardware timestamps are not desired. This split has led to RZ/G2L gaining support for split descriptors in the Rx path while R-Car still lacks this.

This series is the first step in trying to merge the R-Car and RZ/G2L Rx paths so features and bugs corrected in one will benefit the other.

The first patch in the series clarifies that the driver now supports either normal or extended descriptors, not both at the same time, by grouping them in a union. This is the foundation that later patches will build on for aligning the two Rx paths.

Patches 2-5 deal with correcting small issues in the Rx frame and descriptor sizes that either were incorrect at the time they were added in 2017 (my bad) or were concepts built on top of this initial incorrect design.

Finally, patch 6 merges the R-Car and RZ/G2L code for Rx descriptor setup and maintenance.

When this work has landed I plan to follow up with more work aligning the rest of the Rx code paths and hopefully bring split descriptor support to the R-Car SoCs.
====================

Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | ravb: Unify Rx ring maintenance code paths | Niklas Söderlund | 2 | -106/+33

The R-Car and RZ/G2L Rx code paths were split into two separate implementations when support for RZ/G2L was added, due to the fact that R-Car uses the extended descriptor format while RZ/G2L uses normal descriptors. This has led to a duplication of Rx logic, with the only difference being the Rx descriptor types used. The implementation, however, neglects to take into account that extended descriptors are normal descriptors with additional metadata at the end to carry hardware timestamp information.

The hardware timestamp information is only consumed in the R-Car Rx loop, and all the maintenance code around the Rx ring can be shared between the two implementations if the difference in descriptor length is carefully considered.

This change merges the two implementations for Rx ring maintenance by adding a method to access both types of descriptors as normal descriptors. As this part covers all the fields needed for Rx ring maintenance, the only difference between using normal or extended descriptors is the size of the memory region to allocate/free and the step size between each descriptor in the ring.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
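A minimal C sketch of the idea, using hypothetical descriptor layouts rather than the driver's actual definitions: because the extended descriptor begins with the same fields as the normal one, a single accessor can index either ring when given the per-descriptor stride, and ring maintenance code can then work only with the normal-descriptor view.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical normal Rx descriptor: the fields ring maintenance needs. */
struct rx_desc {
    uint16_t ds_cc;   /* data size / control */
    uint8_t  msc;     /* status */
    uint8_t  die_dt;  /* descriptor type */
    uint32_t dptr;    /* DMA address of the data buffer */
};

/* Hypothetical extended descriptor: the normal one plus timestamp words. */
struct ex_rx_desc {
    struct rx_desc base;  /* common prefix, same layout as the normal one */
    uint32_t ts_n;
    uint32_t ts_sl;
    uint16_t ts_sh;
    uint16_t res;
};

/* View descriptor i of either ring type as a normal descriptor.
 * desc_size is sizeof() of whichever descriptor type the ring holds,
 * so only the stride differs between the two ring types. */
static struct rx_desc *rx_get_desc(void *ring, size_t desc_size, size_t i)
{
    return (struct rx_desc *)((uint8_t *)ring + i * desc_size);
}
```

Only the stride passed to `rx_get_desc()` and the allocation size (descriptor count times stride) differ between the two ring types; the timestamp fields stay private to the code that consumes them.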
2024-03-06 | ravb: Move maximum Rx descriptor data usage to info struct | Niklas Söderlund | 2 | -8/+9

To make it possible to merge the R-Car and RZ/G2L code paths, move the maximum usable size of a single Rx descriptor data slice into the hardware information instead of using two different defines in the two different code paths.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | ravb: Use the max frame size from hardware info for RZ/G2L | Niklas Söderlund | 2 | -3/+3

Remove the define describing the RZ/G2L maximum frame size and only use the information in the hardware information struct. This will make it easier to merge the R-Car and RZ/G2L code paths.

There is no functional change, as both the define and the maximum frame length in the hardware information are set to 8K.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | ravb: Create helper to allocate skb and align it | Niklas Söderlund | 2 | -20/+24

The EtherAVB device requires the SKB data to be aligned to 128 bytes. The alignment is done by allocating an skb 128 bytes larger than the maximum frame size supported by the device and adjusting the headroom to fit the requirement.

This code has been refactored a few times and small issues have been added along the way. The issues are not harmful but prevent merging parts of the Rx code, which have been split into two implementations with the addition of RZ/G2L support, a device that supports larger frame sizes.

This change removes the need for duplicated and somewhat inaccurate hardware alignment constraints stored in the hardware information struct by creating a helper to handle the allocation of an skb and the alignment of its data.

For the R-Car class of devices the maximum frame size is 4K and each descriptor is limited to 2K of data. The current implementation does not support split descriptors; this limits the frame size to 2K. The current hardware information, however, records the descriptor size as just under 2K due to a poor understanding of the device when larger MTUs were added.

For the RZ/G2L device the maximum frame size is 8K and each descriptor is limited to 4K of data. The current hardware information records this correctly, but it gets the alignment constraints wrong, as it just aligns the size to 128 and does not extend it by 128 bytes to allow the full frame to be stored. This works because the RZ/G2L device supports split descriptors and allocates each skb to 8K and aligns each 4K descriptor in this space.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
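The alignment arithmetic described above can be sketched in plain C. `SKETCH_ALIGN`, `align_headroom()` and `aligned_alloc_size()` are hypothetical names, and this is not the helper the patch adds; it only shows why over-allocating by the alignment leaves room for a full frame after the headroom is reserved.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALIGN 128u  /* data alignment the text says EtherAVB requires */

/* Bytes of headroom to reserve so that addr + headroom is a multiple of
 * SKETCH_ALIGN. Already-aligned addresses need 0; the worst case is 127. */
static size_t align_headroom(uintptr_t addr)
{
    return (SKETCH_ALIGN - (addr & (SKETCH_ALIGN - 1))) & (SKETCH_ALIGN - 1);
}

/* Buffer size that still fits max_frame bytes after the worst-case
 * headroom of SKETCH_ALIGN - 1 bytes, so "128 bytes larger" is enough. */
static size_t aligned_alloc_size(size_t max_frame)
{
    return max_frame + SKETCH_ALIGN - 1;
}
```

Aligning only the size (as the text describes for the old RZ/G2L constraints) does not give this guarantee: the reserved headroom is taken from the front of the buffer, so the frame itself can run past the end unless the allocation is extended.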
2024-03-06 | ravb: Make it clear the information relates to maximum frame size | Niklas Söderlund | 2 | -6/+7

The struct member rx_max_buf_size was added before split descriptor support was added. It is unclear if the value describes the full skb frame buffer or the data descriptor buffers, which can be combined into a single skb.

Rename it to make it clear it refers to the maximum frame size and can cover multiple descriptors.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | ravb: Group descriptor types used in Rx ring | Niklas Söderlund | 2 | -30/+33

The Rx ring can be made up of either normal or extended descriptors, not a mix of the two at the same time. Make this explicit by grouping the two variables in a rx_ring union.

The extension of the storage for normal descriptors from a single queue to NUM_RX_QUEUE queues has no practical effect, but it aids in making the code readable, as the code that uses it already piggybacks on other members of struct ravb_private that are arrays of max length NUM_RX_QUEUE, e.g. rx_desc_dma. This will also make further refactoring easier.

While at it, rename the normal descriptor Rx ring to make it clear it's not strictly related to the GbEthernet E-MAC IP found in RZ/G2L; normal descriptors could be used on R-Car SoCs too.

Signed-off-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Reviewed-by: Sergey Shtylyov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue | David S. Miller | 10 | -1374/+1182

From: Tony Nguyen <[email protected]>
To: [email protected], [email protected], [email protected], [email protected], [email protected]
Cc: Tony Nguyen <[email protected]>, [email protected]

Tony Nguyen says:

====================
idpf: refactor virtchnl messages

Alan Brady says:

The motivation for this series has two primary goals. We want to enable support of multiple simultaneous messages and make the channel more robust. The way it works right now, the driver can only send and receive a single message at a time, and if something goes really wrong, it can lead to data corruption and strange bugs.

To start the series, we introduce an idpf_virtchnl.h file. This reduces the burden on idpf.h, which is overloaded with struct and function declarations.

The conversion works by conceptualizing a send and receive as a "virtchnl transaction" (idpf_vc_xn) and introducing a "transaction manager" (idpf_vc_xn_manager). The vcxn_mngr will init a ring of transactions, and the driver pops free transactions from it, tracked by a bitmap, to follow in-flight messages. Instead of needing to handle a complicated send/recv for every message, the driver now just needs to fill out a xn_params struct and hand it over to idpf_vc_xn_exec, which will take care of all the messy bits. Once a message is sent and receives a reply, we leverage the completion API to signal that the received buffer is ready to be used (assuming success, or an error code otherwise).

At a low level, this implements the "sw cookie" field of the virtchnl message descriptor to enable this. We have 16 bits in which we can put whatever we want, and the recipient is required to apply the same cookie to the reply for that message. We use the first 8 bits as an index into the array of transactions to enable fast lookups, and we use the second 8 bits as a salt to make sure each cookie is unique for that message.

As transactions are received in arbitrary order, it's possible to reuse a transaction index, and the salt guards against index conflicts to make certain the lookup is correct. As a primitive example, say index 1 is used with salt 1. The message times out without receiving a reply, so index 1 is renewed to be ready for a new transaction; we report the timeout and send the message again. Since index 1 is free to be used again now, index 1 is sent again, but now the salt is 2. This time we do get a reply; however, it could be that the reply is _actually_ for the previous send (index 1 with salt 1). Without the salt we would have no way of knowing for sure if it's the correct reply, but with it we will know for certain.

Through this conversion we also get several other benefits. We can now more appropriately handle asynchronously sent messages by providing space for a callback to be defined. This notably allows us to handle MAC filter failures better; previously we could potentially have stale, failed filters in our list, which shouldn't really have a major impact but is obviously not correct. I also managed to remove significantly more lines than I added, which is a win in my book.

Additionally, this converts some variables to use auto-variables where appropriate. This makes the alloc paths much cleaner and less prone to memory leaks. We also fix a few virtchnl related bugs while we're here.
====================

Signed-off-by: David S. Miller <[email protected]>
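To make the cookie scheme concrete, here is a hedged C sketch of a transaction ring with a free bitmap and an index/salt cookie. It assumes the index occupies the low 8 bits and the salt the high 8 bits of the 16-bit cookie; all names (`xn_mgr`, `xn_start()`, `xn_match_reply()`, and so on) are hypothetical and are not the idpf driver's API.

```c
#include <stdbool.h>
#include <stdint.h>

#define XN_RING_LEN 256  /* an 8-bit index addresses at most 256 slots */

/* Hypothetical transaction slot: only what the cookie check needs. */
struct xn {
    uint8_t salt;
    bool in_use;
};

struct xn_mgr {
    struct xn ring[XN_RING_LEN];
    uint64_t free_bitmap[XN_RING_LEN / 64];  /* bit set => slot is free */
};

/* Assumed layout: salt in the high byte, index in the low byte. */
static uint16_t xn_cookie(uint8_t idx, uint8_t salt)
{
    return (uint16_t)(((uint16_t)salt << 8) | idx);
}

static void xn_mgr_init(struct xn_mgr *m)
{
    for (int i = 0; i < XN_RING_LEN; i++) {
        m->ring[i].salt = 0;
        m->ring[i].in_use = false;
    }
    for (int w = 0; w < XN_RING_LEN / 64; w++)
        m->free_bitmap[w] = ~0ULL;
}

/* Pop the lowest free slot, bump its salt, and return the cookie to place
 * in the message descriptor; -1 if every slot is in flight. */
static int xn_start(struct xn_mgr *m)
{
    for (int i = 0; i < XN_RING_LEN; i++) {
        uint64_t bit = 1ULL << (i % 64);

        if (m->free_bitmap[i / 64] & bit) {
            m->free_bitmap[i / 64] &= ~bit;
            m->ring[i].salt++;  /* wraps at 256, which is fine for a salt */
            m->ring[i].in_use = true;
            return xn_cookie((uint8_t)i, m->ring[i].salt);
        }
    }
    return -1;
}

/* Return a slot to the free bitmap (reply handled or timed out). */
static void xn_release(struct xn_mgr *m, uint8_t idx)
{
    m->ring[idx].in_use = false;
    m->free_bitmap[idx / 64] |= 1ULL << (idx % 64);
}

/* Match a reply cookie to its slot. A salt mismatch means the reply
 * belongs to an earlier use of the same index and must be dropped.
 * Returns the slot index or -1. */
static int xn_match_reply(const struct xn_mgr *m, uint16_t cookie)
{
    uint8_t idx = (uint8_t)(cookie & 0xff);
    uint8_t salt = (uint8_t)(cookie >> 8);

    if (!m->ring[idx].in_use || m->ring[idx].salt != salt)
        return -1;
    return idx;
}
```

Replaying the example above: after index 1 times out and is reused with salt 2, `xn_match_reply()` rejects a late reply carrying the cookie for (index 1, salt 1) and accepts the one for (index 1, salt 2).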
2024-03-06 | Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue | David S. Miller | 9 | -22/+15

Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2024-03-05 (idpf, ice, i40e, igc, e1000e)

This series contains updates to idpf, ice, i40e, igc and e1000e drivers.

Emil disables local BH on NAPI schedule for proper handling of softirqs on idpf.

Jake stops reporting of the virtchannel RSS option, which is unsupported on ice.

Rand Deeb adds a null check to prevent a possible null pointer dereference on ice.

Michal Schmidt moves DPLL mutex initialization to resolve uninitialized mutex usage for ice.

Jesse fixes incorrect variable usage for calculating Tx stats on ice.

Ivan Vecera corrects logic for firmware equals check on i40e.

Florian Kauer prevents memory corruption for XDP_REDIRECT on igc.

Sasha reverts an incorrect use of FIELD_GET which caused a regression for Wake on LAN on e1000e.
====================

Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | iov_iter: get rid of 'copy_mc' flag | Linus Torvalds | 3 | -42/+42

This flag is only set by one single user: the magical core dumping code that looks up user pages one by one, and then writes them out using their kernel addresses (by using a BVEC_ITER).

That actually ends up being a huge problem, because while we do use copy_mc_to_kernel() for this case and it is able to handle the possible machine checks involved, nothing else is really ready to handle the failures caused by the machine check.

In particular, as reported by Tong Tiangen, we don't actually support fault_in_iov_iter_readable() on a machine check area.

As a result, the usual logic for writing things to a file under a filesystem lock, which involves doing a copy with page faults disabled and then, if that fails, trying to fault pages in without holding the locks with fault_in_iov_iter_readable(), does not work at all.

We could decide to always just make the MC copy "succeed" (and fill the destination with zeroes), and that would then create a core dump file that just ignores any machine checks.

But honestly, this single special case has been problematic before, and means that all the normal iov_iter code ends up slightly more complex and slower.

See for example commit c9eec08bac96 ("iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()") where David Howells re-organized the code just to avoid having to check the 'copy_mc' flags inside the inner iov_iter loops.

So considering that we have exactly one user, and that one user is a non-critical special case that doesn't actually ever trigger in real life (Tong found this with manual error injection), the sane solution is to just decide that the onus of handling the machine check lies with that user instead.

Ergo, do the copy_mc_to_kernel() in the core dump logic itself, copying the user data to a stable kernel page before writing it out.

Fixes: f1982740f5e7 ("iov_iter: Convert iterate*() to inline funcs")
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Tong Tiangen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/all/[email protected]/
Tested-by: David Howells <[email protected]>
Reviewed-by: David Howells <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Reported-by: Tong Tiangen <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
2024-03-06 | Merge branch 'Improve packet offload for dual stack' | Steffen Klassert | 2 | -2/+8

Mike Yu says:

====================
In the XFRM stack, whether a packet is forwarded to the IPv4 or IPv6 stack depends on the family field of the matched SA. This does not completely work for IPsec packet offload in some scenarios, for example, sending an IPv6 packet that will be encrypted and encapsulated as an IPv4 packet in HW.

Here are the patches to make IPsec packet offload work in the mentioned scenario.
====================

Signed-off-by: Steffen Klassert <[email protected]>
2024-03-06 | Merge branch 'netlink-emsgsize' | David S. Miller | 4 | -19/+19

Jakub Kicinski says:

====================
netlink: handle EMSGSIZE errors in the core

Ido discovered some time back that we usually force NLMSG_DONE to be delivered in a separate recv() syscall, even if it would fit into the same skb as data messages. He made nexthop try to fit DONE with data in commit 8743aeff5bc4 ("nexthop: Fix infinite nexthop bucket dump when using maximum nexthop ID"), and nobody has complained so far.

We have since also tried to follow the same pattern in new genetlink families, but explaining to people, or even remembering the correct handling ourselves, is tedious.

Let the netlink socket layer consume -EMSGSIZE errors. Practically speaking, most families use this error code as "dump needs more space" anyway.

v2:
 - init err to 0 in last patch
v1: https://lore.kernel.org/all/[email protected]/
====================

Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | genetlink: fit NLMSG_DONE into same read() as families | Jakub Kicinski | 1 | -5/+7

Make sure ctrl_fill_info() returns sensible error codes and propagate them out to netlink core. Let netlink core decide when to return skb->len and when to treat the exit as an error. Netlink core does a better job at it; if we always return skb->len, the core doesn't know when we're done dumping and NLMSG_DONE ends up in a separate read().

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | netdev: let netlink core handle -EMSGSIZE errors | Jakub Kicinski | 2 | -14/+3

The previous change added -EMSGSIZE handling to af_netlink, so we don't have to hide these errors any longer.

Theoretically the error handling changes from:

  if (err == -EMSGSIZE)

to

  if (err == -EMSGSIZE && skb->len)

everywhere, but in practice it doesn't matter. All messages fit into NLMSG_GOODSIZE, so overflow of an empty skb cannot happen.

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2024-03-06 | netlink: handle EMSGSIZE errors in the core | Jakub Kicinski | 1 | -0/+9

Eric points out that our current suggested way of handling EMSGSIZE errors ((err == -EMSGSIZE) ? skb->len : err) will break if we didn't fit even a single object into the buffer provided by the user. This should not happen for well-behaved applications, but we can fix that, and free netlink families from dealing with it completely, by moving error handling into the core.

Let's assume from now on that all EMSGSIZE errors in dumps are because we ran out of skb space. Families can now propagate the errors that nla_put_*() etc. generate and not worry about any return value magic. If some family really wants to send EMSGSIZE to user space, assuming it generates the same error on the next dump iteration, the skb->len should be 0, and user space should still see the EMSGSIZE.

This should simplify families and prevent mistakes in return values which lead to DONE being forced into a separate recv() call, as discovered by Ido some time ago.

Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
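The rule described above can be modelled in a few lines of C. `core_classify()` and `old_family_ret()` are hypothetical names; the model only captures the stated rule (EMSGSIZE with data already in the skb means the skb is full, EMSGSIZE with an empty skb is a real error) and ignores everything else the netlink core does, such as skb allocation, locking, and writing NLMSG_DONE.

```c
#include <errno.h>
#include <stddef.h>

/* How the core treats one dump callback invocation in this model. */
enum dump_step { DUMP_MORE, DUMP_DONE, DUMP_ERROR };

/* cb_ret: what the family's dump callback returned.
 * skb_len: how many bytes the callback put into the skb. */
static enum dump_step core_classify(int cb_ret, size_t skb_len)
{
    /* EMSGSIZE with data already queued means "skb full, call again". */
    if (cb_ret == -EMSGSIZE && skb_len)
        cb_ret = (int)skb_len;
    if (cb_ret > 0)
        return DUMP_MORE;   /* send this skb, invoke the dump again */
    if (cb_ret == 0)
        return DUMP_DONE;   /* finished: NLMSG_DONE can share this skb */
    return DUMP_ERROR;      /* reported to user space, e.g. -EMSGSIZE */
}

/* The per-family idiom this commit makes unnecessary. */
static int old_family_ret(int err, size_t skb_len)
{
    return err == -EMSGSIZE ? (int)skb_len : err;
}
```

With the old idiom, an object too large for an empty skb turns `-EMSGSIZE` into 0, which reads as a clean end of dump; and a callback that always returns `skb->len` is always classified as having more to send, so DONE only arrives after one more call, in a separate read.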
2024-03-06 | Merge tag 'riscv-firmware-for-v6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux into arm/fixes | Arnd Bergmann | 1 | -1/+2

RISC-V firmware drivers for v6.9

A single minor fix for an oversized allocation due to sizeof() misuse by yours truly that came in since I sent my last fixes PR.

Signed-off-by: Conor Dooley <[email protected]>

* tag 'riscv-firmware-for-v6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
  firmware: microchip: Fix over-requested allocation size

Link: https://lore.kernel.org/r/20240305-vicinity-dumpling-8943ef26f004@spud
Signed-off-by: Arnd Bergmann <[email protected]>
2024-03-06 | Merge tag 'qcom-arm64-fixes-for-6.8-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into arm/fixes | Arnd Bergmann | 3 | -33/+10

A few more Qualcomm Arm64 DeviceTree fixes for v6.8

This reduces the link speed of the PCIe bus with the WiFi card connected on the Lenovo ThinkPad X13s and the Qualcomm Compute Reference Device, to avoid link errors and initialization issues reported by users.

It also reverts the enablement of MPM on MSM8996, which is reported to prevent boards on this platform from booting for some users.

* tag 'qcom-arm64-fixes-for-6.8-2' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
  Revert "arm64: dts: qcom: msm8996: Hook up MPM"
  arm64: dts: qcom: sc8280xp-x13s: limit pcie4 link speed
  arm64: dts: qcom: sc8280xp-crd: limit pcie4 link speed

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>
2024-03-05 | selftests: avoid using SKIP(exit()) in harness fixure setup | Jakub Kicinski | 2 | -4/+4
The selftest harness uses various exit codes to signal test results. Avoid calling exit() directly; otherwise, tests may get broken by harness refactoring (like the commit under Fixes). SKIP() will instruct the harness that the test shouldn't run. That used to not be the case, but it has been fixed. So just return; there is no need to exit. Note that for hmm-tests this actually changes the result from pass to skip, which seems fair: the test is skipped, after all. Reported-by: Mark Brown <[email protected]> Link: https://lore.kernel.org/all/[email protected] Fixes: a724707976b0 ("selftests: kselftest_harness: use KSFT_* exit codes") Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Mark Brown <[email protected]> Tested-by: Mark Brown <[email protected]> Reviewed-by: Przemek Kitszel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | Merge branch 'net-ethernet-rework-eee' | Jakub Kicinski | 6 | -54/+168
Oleksij Rempel says: ==================== net: ethernet: Rework EEE With Andrew's permission, I'll continue mainlining these patches: ============================================================== Most MAC drivers get EEE wrong. The API to the PHY is not very obvious, which is probably why. Rework the API, pushing most of the EEE handling into the phylib core, leaving the MAC drivers to just enable/disable support for EEE in their change_link callback. MAC drivers are now expected to indicate to phylib if they support EEE. This will allow future patches to configure the PHY to advertise no EEE link modes when EEE is not supported. The information could also be used to enable SmartEEE if the PHY supports it. With these changes, the uAPI configuration eee_enable becomes a global on/off. tx-lpi must also be enabled before EEE is enabled. This fits the discussion here: https://lore.kernel.org/netdev/[email protected]/T/ This patchset puts in place all the infrastructure, and converts one MAC driver to the new API. Following patchsets will convert other MAC drivers, extend support into phylink, and when all MAC drivers are converted to the new scheme, clean up some unneeded code. ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: fec: Fixup EEE | Andrew Lunn | 1 | -18/+5
The enabling/disabling of EEE in the MAC should happen as a result of auto-negotiation. So move the enable/disable into fec_enet_adjust_link(), which gets called by phylib when there is a change in link status. fec_enet_set_eee() now just stores away the LPI timer value. Everything else is passed to phylib, so it can correctly set up the PHY. fec_enet_get_eee() relies on phylib doing most of the work; the MAC driver just adds the LPI timer value. Call phy_support_eee() if the quirk is present to indicate the MAC actually supports EEE. Signed-off-by: Andrew Lunn <[email protected]> Tested-by: Oleksij Rempel <[email protected]> (On iMX8MP debix) Signed-off-by: Oleksij Rempel <[email protected]> Reviewed-by: Wei Fang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: fec: Move fec_enet_eee_mode_set() and helper earlier | Andrew Lunn | 1 | -37/+38
FEC is about to get its EEE code re-written. To allow this, move fec_enet_eee_mode_set() before fec_enet_adjust_link() which will need to call it. Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Reviewed-by: Wei Fang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: phy: Add phy_support_eee() indicating MAC support EEE | Andrew Lunn | 2 | -1/+30
In order for EEE to operate, both the MAC and the PHY need to support it, similar to how pause works. There are some exceptions: a number of PHYs have SmartEEE or AutoGrEEEn support in order to provide some EEE-like power savings with non-EEE-capable MACs. Copy the pause concept and add the call phy_support_eee(), which the MAC makes after connecting the PHY to indicate it supports EEE. phylib will then advertise EEE when auto-neg is performed. Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: phy: Immediately call adjust_link if only tx_lpi_enabled changes | Andrew Lunn | 2 | -5/+49
The MAC driver changes its EEE hardware configuration in its adjust_link callback. This is called when auto-neg completes. Disabling EEE via eee_enabled false will trigger an autoneg, and as a result the adjust_link callback will be called with phydev->enable_tx_lpi set to false. Similarly, setting eee_enabled to true together with a change of advertised link modes will result in a new autoneg and a call to the adjust_link callback. If set_eee is called with only a change to tx_lpi_enabled, which does not trigger an auto-neg, it is necessary to call the adjust_link callback so that the MAC is reconfigured to take this change into account. When setting phydev->enable_tx_lpi, take both eee_enabled and tx_lpi_enabled into account, so MAC drivers just need to act on phydev->enable_tx_lpi and not the whole EEE configuration. The same check should be done for tx_lpi_timer too. Signed-off-by: Andrew Lunn <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
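This commit describes two decisions, sketched below in userspace C under stated assumptions:

- how enable_tx_lpi combines the EEE configuration with the negotiation result;
- which set_eee changes go through autoneg and which need an immediate adjust_link call.

struct model_eee_cfg, the eee_active and adv_changed inputs, and all function names are hypothetical. The real logic lives in phylib and may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the EEE configuration phylib keeps for the MAC. */
struct model_eee_cfg {
	bool eee_enabled;
	bool tx_lpi_enabled;
	unsigned int tx_lpi_timer;
};

/* Both config knobs must be on, and autoneg must have resolved to EEE
 * (eee_active is a hypothetical input standing in for that result). */
static bool model_enable_tx_lpi(const struct model_eee_cfg *cfg, bool eee_active)
{
	return cfg->eee_enabled && cfg->tx_lpi_enabled && eee_active;
}

enum model_set_eee_action {
	MODEL_NOTHING,
	MODEL_RESTART_ANEG,	/* adjust_link runs when autoneg completes */
	MODEL_CALL_ADJUST_LINK,	/* no autoneg: reconfigure the MAC right away */
};

/* adv_changed: hypothetical flag, true if the advertised EEE modes differ. */
static enum model_set_eee_action
model_set_eee(const struct model_eee_cfg *cur, const struct model_eee_cfg *req,
	      bool adv_changed)
{
	assert(cur && req);
	if (cur->eee_enabled != req->eee_enabled ||
	    (req->eee_enabled && adv_changed))
		return MODEL_RESTART_ANEG;
	if (cur->tx_lpi_enabled != req->tx_lpi_enabled ||
	    cur->tx_lpi_timer != req->tx_lpi_timer)
		return MODEL_CALL_ADJUST_LINK;
	return MODEL_NOTHING;
}
```

Changing only tx_lpi_enabled or tx_lpi_timer maps to MODEL_CALL_ADJUST_LINK, which is the case the commit adds an immediate callback for.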
2024-03-05 | net: phy: Keep track of EEE configuration | Andrew Lunn | 2 | -2/+8
Have phylib keep track of the EEE configuration. This simplifies the MAC drivers, in that they don't need to store it. Future patches to phylib will also make use of this information to further simplify the MAC drivers. Reviewed-by: Russell King (Oracle) <[email protected]> Signed-off-by: Andrew Lunn <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: phy: Add phydev->enable_tx_lpi to simplify adjust link callbacks | Andrew Lunn | 2 | -0/+9
MAC drivers which support EEE need to know the results of the EEE auto-neg in order to program the hardware to perform EEE or not. The oddly named phy_init_eee() can be used to determine this: it returns 0 if EEE should be used, or a negative error code, e.g. -EPROTONOSUPPORT if the PHY does not support EEE or negotiation resulted in it not being used. However, many MAC drivers get this wrong. Add phydev->enable_tx_lpi, which indicates the result of the autoneg for EEE, including if EEE is administratively disabled with ethtool. The MAC driver can then access this in the same way as link speed and duplex in the adjust link callback. If enable_tx_lpi is true, the MAC should send low power indications and does not need to consider anything else with respect to EEE. Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Andrew Lunn <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
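With enable_tx_lpi available, a MAC driver's EEE handling in adjust_link can reduce to copying one flag, as in this sketch. It is a userspace model under stated assumptions. struct model_phydev and struct model_mac are hypothetical stand-ins, not struct phy_device. Stopping LPI when the link is down is this model's own choice, not something the commit specifies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of what the PHY reports to the MAC on link change. */
struct model_phydev {
	bool link;
	int speed;		/* Mb/s */
	bool full_duplex;
	bool enable_tx_lpi;	/* already folds in autoneg result and ethtool config */
};

/* Hypothetical MAC state that would be programmed into hardware. */
struct model_mac {
	int speed;
	bool full_duplex;
	bool lpi_on;
};

static void model_adjust_link(struct model_mac *mac, const struct model_phydev *phy)
{
	assert(mac && phy);
	if (!phy->link) {
		mac->lpi_on = false;	/* model choice: no link, no LPI */
		return;
	}
	mac->speed = phy->speed;
	mac->full_duplex = phy->full_duplex;
	/* No need to look at eee_enabled, tx_lpi_enabled or the LP advertisement. */
	mac->lpi_on = phy->enable_tx_lpi;
}
```

EEE is treated exactly like speed and duplex here: the MAC reads one precomputed field from the PHY state rather than re-deriving the negotiation outcome.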
2024-03-05 | net: add helpers for EEE configuration | Russell King | 1 | -0/+38
Add helpers that phylib and phylink can use to manage EEE configuration and determine whether the MAC should be permitted to use LPI based on that configuration. Signed-off-by: Russell King (Oracle) <[email protected]> Signed-off-by: Andrew Lunn <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: dsa: microchip: fix register write order in ksz8_ind_write8() | Tobias Jakobi (Compleo) | 1 | -2/+2
This bug was noticed while re-implementing parts of the kernel driver in userspace using spidev. The goal was to enable some of the errata workarounds that Microchip describes in their errata sheet [1]. Both the errata sheet and the regular datasheet of e.g. the KSZ8795 imply that you need to do this for indirect register accesses: - write a 16-bit value to a control register pair (this value consists of the indirect register table, and the offset inside the table) - either read or write an 8-bit value from the data storage register (indicated by REG_IND_BYTE in the kernel) The current implementation has the order swapped. It can be proven, by reading back some indirect register with known content (the EEE register modified in ksz8_handle_global_errata() is one of these), that this implementation does not work. Private discussion with Oleksij Rempel of Pengutronix has revealed that the workaround was apparently never tested on actual hardware. [1] https://ww1.microchip.com/downloads/aemDocuments/documents/OTH/ProductDocuments/Errata/KSZ87xx-Errata-DS80000687C.pdf Signed-off-by: Tobias Jakobi (Compleo) <[email protected]> Reviewed-by: Oleksij Rempel <[email protected]> Fixes: 7b6e6235b664 ("net: dsa: microchip: ksz8795: handle eee specif erratum") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
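The access sequence the commit describes can be modeled with a simulated register window: first select the table and offset through the 16-bit control register, then access the 8-bit data register. This is a sketch only. The bit layout of the control value and all model_* names are hypothetical, not taken from the KSZ8795 datasheet or the kernel driver. The model also omits any read/write command bits the real control register carries. It shows why the swapped order fails: the data write lands on whatever entry the control register selected previously.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical indirect-access window: ctrl selects (table, offset),
 * and the data register accesses the selected entry. */
struct model_switch {
	uint16_t ctrl;
	uint8_t table[4][256];
};

static uint16_t model_ctrl_value(uint8_t tbl, uint8_t off)
{
	assert(tbl < 4);
	return (uint16_t)((tbl << 8) | off);
}

static void model_write_ctrl(struct model_switch *sw, uint16_t val)
{
	sw->ctrl = val;
}

static uint8_t *model_selected(struct model_switch *sw)
{
	return &sw->table[(sw->ctrl >> 8) & 0x3][sw->ctrl & 0xff];
}

static void model_write_data(struct model_switch *sw, uint8_t val)
{
	*model_selected(sw) = val;
}

static uint8_t model_read_data(struct model_switch *sw)
{
	return *model_selected(sw);
}

/* Order described in the commit: control register first, then data. */
static void model_ind_write8(struct model_switch *sw, uint8_t tbl, uint8_t off,
			     uint8_t val)
{
	model_write_ctrl(sw, model_ctrl_value(tbl, off));
	model_write_data(sw, val);
}

static uint8_t model_ind_read8(struct model_switch *sw, uint8_t tbl, uint8_t off)
{
	model_write_ctrl(sw, model_ctrl_value(tbl, off));
	return model_read_data(sw);
}

/* Swapped order (the bug): data goes to the previously selected entry. */
static void model_ind_write8_swapped(struct model_switch *sw, uint8_t tbl,
				     uint8_t off, uint8_t val)
{
	model_write_data(sw, val);
	model_write_ctrl(sw, model_ctrl_value(tbl, off));
}
```

In the model, a swapped write of 0xCD to (2, 0x20) right after accessing (1, 0x10) clobbers (1, 0x10) and leaves (2, 0x20) at 0. A read-back check like the one described in the commit message would expose this.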
2024-03-05 | ethtool: ignore unused/unreliable fields in set_eee op | Heiner Kallweit | 1 | -5/+0
This function is used with the set_eee() ethtool operation. Certain fields of struct ethtool_keee are relevant only for the get_eee() operation. In addition, in the case of the ioctl interface, we have no guarantee that userspace sends sane values in struct ethtool_eee. Therefore, explicitly ignore all fields not needed for set_eee(). This protects against drivers trying to use unchecked and unreliable data, relying on specific userspace behavior. Note: Such unsafe driver behavior has been found and fixed in the tg3 driver. Signed-off-by: Heiner Kallweit <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
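Here is a sketch of the filtering idea in userspace C. When building the request for a set operation, copy only the fields a set consumes and leave the get-only ones zeroed, whatever user space sent.

struct model_eee is hypothetical and is not the layout of struct ethtool_keee. The split of fields is an assumption based on the attributes the ethtool netlink EEE_SET request accepts:

- settable: advertised modes, eee_enabled, tx_lpi_enabled, tx_lpi_timer;
- get-only: supported, lp_advertised, eee_active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flattened EEE parameters; not struct ethtool_keee. */
struct model_eee {
	uint32_t supported;	/* get-only */
	uint32_t advertised;	/* settable */
	uint32_t lp_advertised;	/* get-only */
	bool eee_active;	/* get-only */
	bool eee_enabled;	/* settable */
	bool tx_lpi_enabled;	/* settable */
	uint32_t tx_lpi_timer;	/* settable */
};

/* Build the request handed to a driver's set operation: get-only fields are
 * forced to zero so the driver cannot act on unchecked user input. */
static void model_eee_for_set(struct model_eee *dst, const struct model_eee *from_user)
{
	assert(dst != from_user);	/* memset below would wipe the source */
	memset(dst, 0, sizeof(*dst));
	dst->advertised = from_user->advertised;
	dst->eee_enabled = from_user->eee_enabled;
	dst->tx_lpi_enabled = from_user->tx_lpi_enabled;
	dst->tx_lpi_timer = from_user->tx_lpi_timer;
}
```

Zeroing first and then copying an explicit allow-list means a field added to the structure later stays ignored until someone deliberately decides a set should consume it.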
2024-03-05 | dpll: move all dpll<>netdev helpers to dpll code | Jakub Kicinski | 9 | -65/+64
Older versions of GCC really want to know the full definition of the type involved in rcu_assign_pointer(). struct dpll_pin is defined in a local header, so net/core can't reach it. Move all the netdev <> dpll code into dpll, where the type is known. Otherwise, we'd need multiple function calls to jump between the compilation units. This is the same problem the commit under Fixes was trying to address, but with rcu_assign_pointer() rather than rcu_dereference(). Some of the exports are not needed: the networking core can't be a module, so we only need exports for the helpers used by drivers. Reported-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/all/[email protected]/ Fixes: 640f41ed33b5 ("dpll: fix build failure due to rcu_dereference_check() on unknown type") Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | sock: Use unsafe_memcpy() for sock_copy() | Kees Cook | 2 | -4/+6
While testing for places where zero-sized destinations were still showing up in the kernel, sock_copy() and inet_reqsk_clone() were found, which use very specific memcpy() offsets both to skip a portion of struct sock and to copy beyond the end of it (since struct sock is really just a common header before the protocol-specific allocation). Instead of trying to unravel this historical lack of container_of(), just switch to unsafe_memcpy(), since that's effectively what was happening already (memcpy() wasn't checking 0-sized destinations while the code base was being converted away from fake flexible arrays). Avoid the following false positive warning with future changes to CONFIG_FORTIFY_SOURCE: memcpy: detected field-spanning write (size 3068) of destination "&nsk->__sk_common.skc_dontcopy_end" at net/core/sock.c:2057 (size 0) Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
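The copy pattern described above can be modeled in userspace C. It has two steps: skip a "don't copy" window in a common header, then copy past the header's end into the protocol-specific part.

This is a hypothetical sketch. struct model_sock and struct model_tcp_sock stand in for struct sock and a protocol socket. The window is delimited by offsetof() of ordinary fields rather than by the kernel's marker fields. The second memcpy() is the field-spanning write that the kernel now performs with unsafe_memcpy(). Plain memcpy() is used here because unsafe_memcpy() is a kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical common header standing in for struct sock. */
struct model_sock {
	int family;	/* copied */
	int refcnt;	/* start of the "don't copy" window */
	int hash_node;	/* still inside the window */
	int state;	/* first field copied after the window */
};

/* Hypothetical protocol socket: the common header comes first. */
struct model_tcp_sock {
	struct model_sock sk;
	int snd_cwnd;	/* lives past sizeof(struct model_sock) */
};

#define MODEL_DONTCOPY_BEGIN offsetof(struct model_sock, refcnt)
#define MODEL_DONTCOPY_END   offsetof(struct model_sock, state)

/*
 * Copy everything except [MODEL_DONTCOPY_BEGIN, MODEL_DONTCOPY_END).
 * obj_size is the size of the full protocol object, so the second copy
 * deliberately runs beyond the end of struct model_sock.
 */
static void model_sock_copy(struct model_sock *nsk, const struct model_sock *osk,
			    size_t obj_size)
{
	assert(obj_size >= sizeof(struct model_sock));

	memcpy(nsk, osk, MODEL_DONTCOPY_BEGIN);
	memcpy((char *)nsk + MODEL_DONTCOPY_END,
	       (const char *)osk + MODEL_DONTCOPY_END,
	       obj_size - MODEL_DONTCOPY_END);
}
```

The destination of the second copy is a header field, but the length covers the rest of the whole object. A bounds check based on that field's size would flag this write, and the kernel now opts out of that check with unsafe_memcpy().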
2024-03-05 | net: tap: Remove generic .ndo_get_stats64 | Breno Leitao | 1 | -1/+0
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is configured") moved the callback to dev_get_tstats64() to net core, so, unless the driver is doing some custom stats collection, it does not need to set .ndo_get_stats64. Since this driver now relies on NETDEV_PCPU_STAT_TSTATS, it doesn't need to set the generic dev_get_tstats64() .ndo_get_stats64 function pointer. Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | net: tuntap: Leverage core stats allocator | Breno Leitao | 1 | -9/+2
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and convert veth & vrf"), stats allocation can be done in the net core instead of in this driver. With this new approach, the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc.). This is now the core's responsibility. Remove the allocation in the tun/tap driver and leverage the network core allocation instead. Signed-off-by: Breno Leitao <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-03-05 | cpumap: Zero-initialise xdp_rxq_info struct before running XDP program | Toke Høiland-Jørgensen | 1 | -1/+1
When running an XDP program that is attached to a cpumap entry, we don't initialise the xdp_rxq_info data structure being used in the xdp_buff that backs the XDP program invocation. Tobias noticed that this leads to random values being returned as the xdp_md->rx_queue_index value for XDP programs running in a cpumap. This means we're basically returning the contents of the uninitialised memory, which is bad. Fix this by zero-initialising the rxq data structure before running the XDP program. Fixes: 9216477449f3 ("bpf: cpumap: Add the possibility to attach an eBPF program to cpumap") Reported-by: Tobias Böhm <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
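Here is a minimal userspace illustration of the fix. It assumes that only some fields of the per-invocation rxq structure are filled in before the program runs. Zero-initialising the stack variable guarantees that fields nobody sets, here queue_index, read as 0 rather than leftover stack contents.

The model_* types and functions are hypothetical stand-ins for xdp_rxq_info and xdp_buff. Kernel code usually spells the initializer "= {}", a GNU extension that C23 standardized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct xdp_rxq_info and struct xdp_buff. */
struct model_rxq_info {
	const void *dev;
	unsigned int queue_index;
};

struct model_xdp_buff {
	const struct model_rxq_info *rxq;
};

/* What a program would observe as xdp_md->rx_queue_index in this model. */
static unsigned int model_prog_rx_queue_index(const struct model_xdp_buff *xdp)
{
	assert(xdp->rxq != NULL);
	return xdp->rxq->queue_index;
}

/* Per-invocation setup: only dev is filled in; queue_index is never assigned. */
static unsigned int model_cpumap_run(const void *dev_rx)
{
	struct model_rxq_info rxq = { 0 };	/* the fix: no stale stack data */
	struct model_xdp_buff xdp = { .rxq = &rxq };

	rxq.dev = dev_rx;
	return model_prog_rx_queue_index(&xdp);
}
```

Without the initializer, the value read from rxq.queue_index in this model would be indeterminate, which matches the random rx_queue_index values reported in the commit.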
2024-03-05 | selftests/bpf: Fix up xdp bonding test wrt feature flags | Daniel Borkmann | 1 | -2/+2
Adjust the XDP feature flags for the bond device when no bond slave devices are attached. After 9b0ed890ac2a ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY"), the empty bond device must report 0 as flags instead of NETDEV_XDP_ACT_MASK.
  # ./vmtest.sh -- ./test_progs -t xdp_bond
  [...]
  [ 3.983311] bond1 (unregistering): (slave veth1_1): Releasing backup interface
  [ 3.995434] bond1 (unregistering): Released all slaves
  [ 4.022311] bond2: (slave veth2_1): Releasing backup interface
  #507/1 xdp_bonding/xdp_bonding_attach:OK
  #507/2 xdp_bonding/xdp_bonding_nested:OK
  #507/3 xdp_bonding/xdp_bonding_features:OK
  #507/4 xdp_bonding/xdp_bonding_roundrobin:OK
  #507/5 xdp_bonding/xdp_bonding_activebackup:OK
  #507/6 xdp_bonding/xdp_bonding_xor_layer2:OK
  #507/7 xdp_bonding/xdp_bonding_xor_layer23:OK
  #507/8 xdp_bonding/xdp_bonding_xor_layer34:OK
  #507/9 xdp_bonding/xdp_bonding_redirect_multi:OK
  #507 xdp_bonding:OK
  Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED
  [ 4.185255] bond2 (unregistering): Released all slaves
  [...]
Fixes: 9b0ed890ac2a ("bonding: do not report NETDEV_XDP_ACT_XSK_ZEROCOPY") Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Toke Høiland-Jørgensen <[email protected]> Message-ID: <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>