diff options
Diffstat (limited to 'Documentation/networking')
21 files changed, 1185 insertions, 141 deletions
diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.rst index 2e5a939d7e6f..e70a6520cc33 100644 --- a/Documentation/networking/6lowpan.txt +++ b/Documentation/networking/6lowpan.rst @@ -1,37 +1,40 @@ +.. SPDX-License-Identifier: GPL-2.0 -Netdev private dataroom for 6lowpan interfaces: +============================================== +Netdev private dataroom for 6lowpan interfaces +============================================== All 6lowpan able net devices, means all interfaces with ARPHRD_6LOWPAN, must have "struct lowpan_priv" placed at beginning of netdev_priv. -The priv_size of each interface should be calculate by: +The priv_size of each interface should be calculate by:: dev->priv_size = LOWPAN_PRIV_SIZE(LL_6LOWPAN_PRIV_DATA); Where LL_PRIV_6LOWPAN_DATA is sizeof linklayer 6lowpan private data struct. -To access the LL_PRIV_6LOWPAN_DATA structure you can cast: +To access the LL_PRIV_6LOWPAN_DATA structure you can cast:: lowpan_priv(dev)-priv; to your LL_6LOWPAN_PRIV_DATA structure. -Before registering the lowpan netdev interface you must run: +Before registering the lowpan netdev interface you must run:: lowpan_netdev_setup(dev, LOWPAN_LLTYPE_FOOBAR); wheres LOWPAN_LLTYPE_FOOBAR is a define for your 6LoWPAN linklayer type of enum lowpan_lltypes. -Example to evaluate the private usually you can do: +Example to evaluate the private usually you can do:: -static inline struct lowpan_priv_foobar * -lowpan_foobar_priv(struct net_device *dev) -{ + static inline struct lowpan_priv_foobar * + lowpan_foobar_priv(struct net_device *dev) + { return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv; -} + } -switch (dev->type) { -case ARPHRD_6LOWPAN: + switch (dev->type) { + case ARPHRD_6LOWPAN: lowpan_priv = lowpan_priv(dev); /* do great stuff which is ARPHRD_6LOWPAN related */ switch (lowpan_priv->lltype) { @@ -42,8 +45,8 @@ case ARPHRD_6LOWPAN: ... } break; -... -} + ... + } In case of generic 6lowpan branch ("net/6lowpan") you can remove the check on ARPHRD_6LOWPAN, because you can be sure that these function are called diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst new file mode 100644 index 000000000000..465a8b251bfe --- /dev/null +++ b/Documentation/networking/bareudp.rst @@ -0,0 +1,52 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +Bare UDP Tunnelling Module Documentation +======================================== + +There are various L3 encapsulation standards using UDP being discussed to +leverage the UDP based load balancing capability of different networks. +MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them. + +The Bareudp tunnel module provides a generic L3 encapsulation tunnelling +support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside +a UDP tunnel. + +Special Handling +---------------- +The bareudp device supports special handling for MPLS & IP as they can have +multiple ethertypes. +MPLS procotcol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast). +IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6). +This special handling can be enabled only for ethertypes ETH_P_IP & ETH_P_MPLS_UC +with a flag called multiproto mode. + +Usage +------ + +1) Device creation & deletion + + a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847. + + This creates a bareudp tunnel device which tunnels L3 traffic with ethertype + 0x8847 (MPLS traffic). The destination port of the UDP header will be set to + 6635.The device will listen on UDP port 6635 to receive traffic. + + b) ip link delete bareudp0 + +2) Device creation with multiple proto mode enabled + +There are two ways to create a bareudp device for MPLS & IP with multiproto mode +enabled. + + a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847 multiproto + + b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls + +3) Device Usage + +The bareudp device could be used along with OVS or flower filter in TC. +The OVS or TC flower layer must set the tunnel information in SKB dst field before +sending packet buffer to the bareudp device for transmission. On reception the +bareudp device extracts and stores the tunnel information in SKB dst field before +passing the packet buffer to the network stack. diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst index f575a49790e8..e9b65035cd47 100644 --- a/Documentation/networking/device_drivers/mellanox/mlx5.rst +++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst @@ -101,7 +101,7 @@ Enabling the driver and kconfig options **External options** ( Choose if the corresponding mlx5 feature is required ) - CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled -- CONFIG_VXLAN: When chosen, mlx5 vxaln support will be enabled. +- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled. - CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool). Devlink info diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/stmicro/stmmac.rst index c34bab3d2df0..5d46e5036129 100644 --- a/Documentation/networking/device_drivers/stmicro/stmmac.rst +++ b/Documentation/networking/device_drivers/stmicro/stmmac.rst @@ -32,7 +32,8 @@ is also supported. DesignWare(R) Cores Ethernet MAC 10/100/1000 Universal version 3.70a (and older) and DesignWare(R) Cores Ethernet Quality-of-Service version 4.0 (and upper) have been used for developing this driver as well as -DesignWare(R) Cores XGMAC - 10G Ethernet MAC. +DesignWare(R) Cores XGMAC - 10G Ethernet MAC and DesignWare(R) Cores +Enterprise MAC - 100G Ethernet MAC. This driver supports both the platform bus and PCI. @@ -48,6 +49,8 @@ Cores Ethernet Controllers and corresponding minimum and maximum versions: +-------------------------------+--------------+--------------+--------------+ | XGMAC - 10G Ethernet MAC | 2.10a | N/A | XGMAC2+ | +-------------------------------+--------------+--------------+--------------+ +| XLGMAC - 100G Ethernet MAC | 2.00a | N/A | XLGMAC2+ | ++-------------------------------+--------------+--------------+--------------+ For questions related to hardware requirements, refer to the documentation supplied with your Ethernet adapter. All hardware requirements listed apply @@ -57,7 +60,7 @@ Feature List ============ The following features are available in this driver: - - GMII/MII/RGMII/SGMII/RMII/XGMII Interface + - GMII/MII/RGMII/SGMII/RMII/XGMII/XLGMII Interface - Half-Duplex / Full-Duplex Operation - Energy Efficient Ethernet (EEE) - IEEE 802.3x PAUSE Packets (Flow Control) diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst index 82ef9ec46707..3dfd84ccb1c7 100644 --- a/Documentation/networking/devlink/bnxt.rst +++ b/Documentation/networking/devlink/bnxt.rst @@ -51,6 +51,9 @@ The ``bnxt_en`` driver reports the following versions * - Name - Type - Description + * - ``board.id`` + - fixed + - Part number identifying the board design * - ``asic.id`` - fixed - ASIC design identifier @@ -63,12 +66,15 @@ The ``bnxt_en`` driver reports the following versions * - ``fw`` - stored, running - Overall board firmware version - * - ``fw.app`` - - stored, running - - Data path firmware version * - ``fw.mgmt`` - stored, running - - Management firmware version + - NIC hardware resource management firmware version + * - ``fw.mgmt.api`` + - running + - Minimum firmware interface spec version supported between driver and firmware + * - ``fw.nsci`` + - stored, running + - General platform management firmware version * - ``fw.roce`` - stored, running - RoCE management firmware version diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst new file mode 100644 index 000000000000..40a87c0222cb --- /dev/null +++ b/Documentation/networking/devlink/devlink-flash.rst @@ -0,0 +1,93 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +.. _devlink_flash: + +============= +Devlink Flash +============= + +The ``devlink-flash`` API allows updating device firmware. It replaces the +older ``ethtool-flash`` mechanism, and doesn't require taking any +networking locks in the kernel to perform the flash update. Example use:: + + $ devlink dev flash pci/0000:05:00.0 file flash-boot.bin + +Note that the file name is a path relative to the firmware loading path +(usually ``/lib/firmware/``). Drivers may send status updates to inform +user space about the progress of the update operation. + +Firmware Loading +================ + +Devices which require firmware to operate usually store it in non-volatile +memory on the board, e.g. flash. Some devices store only basic firmware on +the board, and the driver loads the rest from disk during probing. +``devlink-info`` allows users to query firmware information (loaded +components and versions). + +In other cases the device can both store the image on the board, load from +disk, or automatically flash a new image from disk. The ``fw_load_policy`` +devlink parameter can be used to control this behavior +(:ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`). + +On-disk firmware files are usually stored in ``/lib/firmware/``. + +Firmware Version Management +=========================== + +Drivers are expected to implement ``devlink-flash`` and ``devlink-info`` +functionality, which together allow for implementing vendor-independent +automated firmware update facilities. + +``devlink-info`` exposes the ``driver`` name and three version groups +(``fixed``, ``running``, ``stored``). + +The ``driver`` attribute and ``fixed`` group identify the specific device +design, e.g. for looking up applicable firmware updates. This is why +``serial_number`` is not part of the ``fixed`` versions (even though it +is fixed) - ``fixed`` versions should identify the design, not a single +device. + +``running`` and ``stored`` firmware versions identify the firmware running +on the device, and firmware which will be activated after reboot or device +reset. + +The firmware update agent is supposed to be able to follow this simple +algorithm to update firmware contents, regardless of the device vendor: + +.. code-block:: sh + + # Get unique HW design identifier + $hw_id = devlink-dev-info['fixed'] + + # Find out which FW flash we want to use for this NIC + $want_flash_vers = some-db-backed.lookup($hw_id, 'flash') + + # Update flash if necessary + if $want_flash_vers != devlink-dev-info['stored']: + $file = some-db-backed.download($hw_id, 'flash') + devlink-dev-flash($file) + + # Find out the expected overall firmware versions + $want_fw_vers = some-db-backed.lookup($hw_id, 'all') + + # Update on-disk file if necessary + if $want_fw_vers != devlink-dev-info['running']: + $file = some-db-backed.download($hw_id, 'disk') + write($file, '/lib/firmware/') + + # Try device reset, if available + if $want_fw_vers != devlink-dev-info['running']: + devlink-reset() + + # Reboot, if reset wasn't enough + if $want_fw_vers != devlink-dev-info['running']: + reboot() + +Note that each reference to ``devlink-dev-info`` in this pseudo-code +is expected to fetch up-to-date information from the kernel. + +For the convenience of identifying firmware files some vendors add +``bundle_id`` information to the firmware versions. This meta-version covers +multiple per-component versions and can be used e.g. in firmware file names +(all component versions could get rather long.) diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst index 70981dd1b981..3fe11401b838 100644 --- a/Documentation/networking/devlink/devlink-info.rst +++ b/Documentation/networking/devlink/devlink-info.rst @@ -5,34 +5,119 @@ Devlink Info ============ The ``devlink-info`` mechanism enables device drivers to report device -information in a generic fashion. It is extensible, and enables exporting -even device or driver specific information. +(hardware and firmware) information in a standard, extensible fashion. -devlink supports representing the following types of versions +The original motivation for the ``devlink-info`` API was twofold: -.. list-table:: List of version types + - making it possible to automate device and firmware management in a fleet + of machines in a vendor-independent fashion (see also + :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`); + - name the per component FW versions (as opposed to the crowded ethtool + version string). + +``devlink-info`` supports reporting multiple types of objects. Reporting driver +versions is generally discouraged - here, and via any other Linux API. + +.. list-table:: List of top level info objects :widths: 5 95 - * - Type + * - Name - Description + * - ``driver`` + - Name of the currently used device driver, also available through sysfs. + + * - ``serial_number`` + - Serial number of the device. + + This is usually the serial number of the ASIC, also often available + in PCI config space of the device in the *Device Serial Number* + capability. + + The serial number should be unique per physical device. + Sometimes the serial number of the device is only 48 bits long (the + length of the Ethernet MAC address), and since PCI DSN is 64 bits long + devices pad or encode additional information into the serial number. + One example is adding port ID or PCI interface ID in the extra two bytes. + Drivers should make sure to strip or normalize any such padding + or interface ID, and report only the part of the serial number + which uniquely identifies the hardware. In other words serial number + reported for two ports of the same device or on two hosts of + a multi-host device should be identical. + + .. note:: ``devlink-info`` API should be extended with a new field + if devices want to report board/product serial number (often + reported in PCI *Vital Product Data* capability). + * - ``fixed`` - - Represents fixed versions, which cannot change. For example, + - Group for hardware identifiers, and versions of components + which are not field-updatable. + + Versions in this section identify the device design. For example, component identifiers or the board version reported in the PCI VPD. + Data in ``devlink-info`` should be broken into the smallest logical + components, e.g. PCI VPD may concatenate various information + to form the Part Number string, while in ``devlink-info`` all parts + should be reported as separate items. + + This group must not contain any frequently changing identifiers, + such as serial numbers. See + :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>` + to understand why. + * - ``running`` - - Represents the version of the currently running component. For - example the running version of firmware. These versions generally - only update after a reboot. + - Group for information about currently running software/firmware. + These versions often only update after a reboot, sometimes device reset. + * - ``stored`` - - Represents the version of a component as stored, such as after a - flash update. Stored values should update to reflect changes in the - flash even if a reboot has not yet occurred. + - Group for software/firmware versions in device flash. + + Stored values must update to reflect changes in the flash even + if reboot has not yet occurred. If device is not capable of updating + ``stored`` versions when new software is flashed, it must not report + them. + +Each version can be reported at most once in each version group. Firmware +components stored on the flash should feature in both the ``running`` and +``stored`` sections, if device is capable of reporting ``stored`` versions +(see :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`). +In case software/firmware components are loaded from the disk (e.g. +``/lib/firmware``) only the running version should be reported via +the kernel API. Generic Versions ================ It is expected that drivers use the following generic names for exporting -version information. Other information may be exposed using driver-specific -names, but these should be documented in the driver-specific file. +version information. If a generic name for a given component doesn't exist yet, +driver authors should consult existing driver-specific versions and attempt +reuse. As last resort, if a component is truly unique, using driver-specific +names is allowed, but these should be documented in the driver-specific file. + +All versions should try to use the following terminology: + +.. list-table:: List of common version suffixes + :widths: 10 90 + + * - Name + - Description + * - ``id``, ``revision`` + - Identifiers of designs and revision, mostly used for hardware versions. + + * - ``api`` + - Version of API between components. API items are usually of limited + value to the user, and can be inferred from other versions by the vendor, + so adding API versions is generally discouraged as noise. + + * - ``bundle_id`` + - Identifier of a distribution package which was flashed onto the device. + This is an attribute of a firmware package which covers multiple versions + for ease of managing firmware images (see + :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`). + + ``bundle_id`` can appear in both ``running`` and ``stored`` versions, + but it must not be reported if any of the components covered by the + ``bundle_id`` was changed and no longer matches the version from + the bundle. board.id -------- @@ -52,7 +137,7 @@ ASIC design identifier. asic.rev -------- -ASIC design revision. +ASIC design revision/stepping. board.manufacture ----------------- @@ -72,6 +157,12 @@ Control unit firmware version. This firmware is responsible for house keeping tasks, PHY control etc. but not the packet-by-packet data path operation. +fw.mgmt.api +----------- + +Firmware interface specification version of the software interfaces between +driver and firmware. + fw.app ------ @@ -91,10 +182,31 @@ Network Controller Sideband Interface. fw.psid ------- -Unique identifier of the firmware parameter set. +Unique identifier of the firmware parameter set. These are usually +parameters of a particular board, defined at manufacturing time. fw.roce ------- RoCE firmware version which is responsible for handling roce management. + +fw.bundle_id +------------ + +Unique identifier of the entire firmware bundle. + +Future work +=========== + +The following extensions could be useful: + + - product serial number - NIC boards often get labeled with a board serial + number rather than ASIC serial number; it'd be useful to add board serial + numbers to the API if they can be retrieved from the device; + + - on-disk firmware file names - drivers list the file names of firmware they + may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These, + however, are per module, rather than per device. It'd be useful to list + the names of firmware files the driver will try to load for a given device, + in order of priority. diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index da2f85c0fa21..d075fd090b3d 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -41,6 +41,8 @@ In order for ``driverinit`` parameters to take effect, the driver must support reloading via the ``devlink-reload`` command. This command will request a reload of the device driver. +.. _devlink_params_generic: + Generic configuration parameters ================================ The following is a list of generic configuration parameters that drivers may diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst index 8b46e8591fe0..04e04d1ff627 100644 --- a/Documentation/networking/devlink/devlink-region.rst +++ b/Documentation/networking/devlink/devlink-region.rst @@ -20,6 +20,11 @@ address regions that are otherwise inaccessible to the user. Regions may also be used to provide an additional way to debug complex error states, but see also :doc:`devlink-health` +Regions may optionally support capturing a snapshot on demand via the +``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow +requested snapshots must implement the ``.snapshot`` callback for the region +in its ``devlink_region_ops`` structure. + example usage ------------- @@ -29,8 +34,7 @@ example usage $ devlink region show [ DEV/REGION ] $ devlink region del DEV/REGION snapshot SNAPSHOT_ID $ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ] - $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] - address ADDRESS length length + $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length # Show all of the exposed regions with region sizes: $ devlink region show @@ -40,6 +44,9 @@ example usage # Delete a snapshot using: $ devlink region del pci/0000:00:05.0/cr-space snapshot 1 + # Request an immediate snapshot, if supported by the region + $ devlink region new pci/0000:00:05.0/cr-space snapshot 5 + # Dump a snapshot: $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 @@ -48,8 +55,7 @@ example usage 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5 # Read a specific part of a snapshot: - $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 - length 16 + $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 As regions are likely very device or driver specific, no generic regions are diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst index 47a429bb8658..fe089acb7783 100644 --- a/Documentation/networking/devlink/devlink-trap.rst +++ b/Documentation/networking/devlink/devlink-trap.rst @@ -238,6 +238,12 @@ be added to the following table: - ``drop`` - Traps NVE packets that the device decided to drop because their overlay source MAC is multicast + * - ``ingress_flow_action_drop`` + - ``drop`` + - Traps packets dropped during processing of ingress flow action drop + * - ``egress_flow_action_drop`` + - ``drop`` + - Traps packets dropped during processing of egress flow action drop Driver-specific Packet Traps ============================ @@ -251,6 +257,8 @@ drivers: * :doc:`netdevsim` * :doc:`mlxsw` +.. _Generic-Packet-Trap-Groups: + Generic Packet Trap Groups ========================== @@ -277,6 +285,35 @@ narrow. The description of these groups must be added to the following table: * - ``tunnel_drops`` - Contains packet traps for packets that were dropped by the device during tunnel encapsulation / decapsulation + * - ``acl_drops`` + - Contains packet traps for packets that were dropped by the device during + ACL processing + +Packet Trap Policers +==================== + +As previously explained, the underlying device can trap certain packets to the +CPU for processing. In most cases, the underlying device is capable of handling +packet rates that are several orders of magnitude higher compared to those that +can be handled by the CPU. + +Therefore, in order to prevent the underlying device from overwhelming the CPU, +devices usually include packet trap policers that are able to police the +trapped packets to rates that can be handled by the CPU. + +The ``devlink-trap`` mechanism allows capable device drivers to register their +supported packet trap policers with ``devlink``. The device driver can choose +to associate these policers with supported packet trap groups (see +:ref:`Generic-Packet-Trap-Groups`) during its initialization, thereby exposing +its default control plane policy to user space. + +Device drivers should allow user space to change the parameters of the policers +(e.g., rate, burst size) as well as the association between the policers and +trap groups by implementing the relevant callbacks. + +If possible, device drivers should implement a callback that allows user space +to retrieve the number of packets that were dropped by the policer because its +configured policy was violated. Testing ======= diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst new file mode 100644 index 000000000000..4574352d6ff4 --- /dev/null +++ b/Documentation/networking/devlink/ice.rst @@ -0,0 +1,96 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +ice devlink support +=================== + +This document describes the devlink features implemented by the ``ice`` +device driver. + +Info versions +============= + +The ``ice`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 5 90 + + * - Name + - Type + - Example + - Description + * - ``board.id`` + - fixed + - K65390-000 + - The Product Board Assembly (PBA) identifier of the board. + * - ``fw.mgmt`` + - running + - 2.1.7 + - 3-digit version number of the management firmware that controls the + PHY, link, etc. + * - ``fw.mgmt.api`` + - running + - 1.5 + - 2-digit version number of the API exported over the AdminQ by the + management firmware. Used by the driver to identify what commands + are supported. + * - ``fw.mgmt.build`` + - running + - 0x305d955f + - Unique identifier of the source for the management firmware. + * - ``fw.undi`` + - running + - 1.2581.0 + - Version of the Option ROM containing the UEFI driver. The version is + reported in ``major.minor.patch`` format. The major version is + incremented whenever a major breaking change occurs, or when the + minor version would overflow. The minor version is incremented for + non-breaking changes and reset to 1 when the major version is + incremented. The patch version is normally 0 but is incremented when + a fix is delivered as a patch against an older base Option ROM. + * - ``fw.psid.api`` + - running + - 0.80 + - Version defining the format of the flash contents. + * - ``fw.bundle_id`` + - running + - 0x80002ec0 + - Unique identifier of the firmware image file that was loaded onto + the device. Also referred to as the EETRACK identifier of the NVM. + * - ``fw.app.name`` + - running + - ICE OS Default Package + - The name of the DDP package that is active in the device. The DDP + package is loaded by the driver during initialization. Each + variation of the DDP package has a unique name. + * - ``fw.app`` + - running + - 1.3.1.0 + - The version of the DDP package that is active in the device. Note + that both the name (as reported by ``fw.app.name``) and version are + required to uniquely identify the package. + +Regions +======= + +The ``ice`` driver enables access to the contents of the Non Volatile Memory +flash chip via the ``nvm-flash`` region. + +Users can request an immediate capture of a snapshot via the +``DEVLINK_CMD_REGION_NEW`` + +.. code:: shell + + $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1 + $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1 + + $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1 + 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 + 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8 + 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc + 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5 + + $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16 + 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 + + $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1 diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 087ff54d53fc..c536db2cc0f9 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -16,6 +16,7 @@ general. devlink-dpipe devlink-health devlink-info + devlink-flash devlink-params devlink-region devlink-resource @@ -32,6 +33,7 @@ parameters, info versions, and other features it supports. bnxt ionic + ice mlx4 mlx5 mlxsw diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 629a6e69c036..4e4b97f7971a 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -37,6 +37,12 @@ parameters. * ``smfs`` Software managed flow steering. In SMFS mode, the HW steering entities are created and manage through the driver without firmware intervention. + * - ``fdb_large_groups`` + - u32 + - driverinit + - Control the number of large groups (size > 1) in the FDB table. + + * The default value is 15, and the range is between 1 and 1024. The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index f1f868479ceb..567326491f80 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -189,6 +189,21 @@ Userspace to kernel: ``ETHTOOL_MSG_DEBUG_SET`` set debugging settings ``ETHTOOL_MSG_WOL_GET`` get wake-on-lan settings ``ETHTOOL_MSG_WOL_SET`` set wake-on-lan settings + ``ETHTOOL_MSG_FEATURES_GET`` get device features + ``ETHTOOL_MSG_FEATURES_SET`` set device features + ``ETHTOOL_MSG_PRIVFLAGS_GET`` get private flags + ``ETHTOOL_MSG_PRIVFLAGS_SET`` set private flags + ``ETHTOOL_MSG_RINGS_GET`` get ring sizes + ``ETHTOOL_MSG_RINGS_SET`` set ring sizes + ``ETHTOOL_MSG_CHANNELS_GET`` get channel counts + ``ETHTOOL_MSG_CHANNELS_SET`` set channel counts + ``ETHTOOL_MSG_COALESCE_GET`` get coalescing parameters + ``ETHTOOL_MSG_COALESCE_SET`` set coalescing parameters + ``ETHTOOL_MSG_PAUSE_GET`` get pause parameters + ``ETHTOOL_MSG_PAUSE_SET`` set pause parameters + ``ETHTOOL_MSG_EEE_GET`` get EEE settings + ``ETHTOOL_MSG_EEE_SET`` set EEE settings + ``ETHTOOL_MSG_TSINFO_GET`` get timestamping info ===================================== ================================ Kernel to userspace: @@ -204,6 +219,22 @@ Kernel to userspace: ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification + ``ETHTOOL_MSG_FEATURES_GET_REPLY`` device features + ``ETHTOOL_MSG_FEATURES_SET_REPLY`` optional reply to FEATURES_SET + ``ETHTOOL_MSG_FEATURES_NTF`` netdev features notification + ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY`` private flags + ``ETHTOOL_MSG_PRIVFLAGS_NTF`` private flags + ``ETHTOOL_MSG_RINGS_GET_REPLY`` ring sizes + ``ETHTOOL_MSG_RINGS_NTF`` ring sizes + ``ETHTOOL_MSG_CHANNELS_GET_REPLY`` channel counts + ``ETHTOOL_MSG_CHANNELS_NTF`` channel counts + ``ETHTOOL_MSG_COALESCE_GET_REPLY`` coalescing parameters + ``ETHTOOL_MSG_COALESCE_NTF`` coalescing parameters + ``ETHTOOL_MSG_PAUSE_GET_REPLY`` pause parameters + ``ETHTOOL_MSG_PAUSE_NTF`` pause parameters + ``ETHTOOL_MSG_EEE_GET_REPLY`` EEE settings + ``ETHTOOL_MSG_EEE_NTF`` EEE settings + ``ETHTOOL_MSG_TSINFO_GET_REPLY`` timestamping info ===================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -521,6 +552,410 @@ Request contents: ``WAKE_MAGICSECURE`` mode. +FEATURES_GET +============ + +Gets netdev features like ``ETHTOOL_GFEATURES`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_FEATURES_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_FEATURES_HEADER`` nested reply header + ``ETHTOOL_A_FEATURES_HW`` bitset dev->hw_features + ``ETHTOOL_A_FEATURES_WANTED`` bitset dev->wanted_features + ``ETHTOOL_A_FEATURES_ACTIVE`` bitset dev->features + ``ETHTOOL_A_FEATURES_NOCHANGE`` bitset NETIF_F_NEVER_CHANGE + ==================================== ====== ========================== + +Bitmaps in kernel response have the same meaning as bitmaps used in ioctl +interference but attribute names are different (they are based on +corresponding members of struct net_device). Legacy "flags" are not provided, +if userspace needs them (most likely only ethtool for backward compatibility), +it can calculate their values from related feature bits itself. +ETHA_FEATURES_HW uses mask consisting of all features recognized by kernel (to +provide all names when using verbose bitmap format), the other three use no +mask (simple bit lists). + + +FEATURES_SET +============ + +Request to set netdev features like ``ETHTOOL_SFEATURES`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_FEATURES_HEADER`` nested request header + ``ETHTOOL_A_FEATURES_WANTED`` bitset requested features + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_FEATURES_HEADER`` nested reply header + ``ETHTOOL_A_FEATURES_WANTED`` bitset diff wanted vs. result + ``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active + ==================================== ====== ========================== + +Request constains only one bitset which can be either value/mask pair (request +to change specific feature bits and leave the rest) or only a value (request +to set all features to specified set). + +As request is subject to netdev_change_features() sanity checks, optional +kernel reply (can be suppressed by ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request +header) informs client about the actual result. ``ETHTOOL_A_FEATURES_WANTED`` +reports the difference between client request and actual result: mask consists +of bits which differ between requested features and result (dev->features +after the operation), value consists of values of these bits in the request +(i.e. negated values from resulting features). ``ETHTOOL_A_FEATURES_ACTIVE`` +reports the difference between old and new dev->features: mask consists of +bits which have changed, values are their values in new dev->features (after +the operation). + +``ETHTOOL_MSG_FEATURES_NTF`` notification is sent not only if device features +are modified using ``ETHTOOL_MSG_FEATURES_SET`` request or on of ethtool ioctl +request but also each time features are modified with netdev_update_features() +or netdev_change_features(). + + +PRIVFLAGS_GET +============= + +Gets private flags like ``ETHTOOL_GPFLAGS`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested reply header + ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags + ==================================== ====== ========================== + +``ETHTOOL_A_PRIVFLAGS_FLAGS`` is a bitset with values of device private flags. +These flags are defined by driver, their number and names (and also meaning) +are device dependent. For compact bitset format, names can be retrieved as +``ETH_SS_PRIV_FLAGS`` string set. If verbose bitset format is requested, +response uses all private flags supported by the device as mask so that client +gets the full information without having to fetch the string set with names. + + +PRIVFLAGS_SET +============= + +Sets or modifies values of device private flags like ``ETHTOOL_SPFLAGS`` +ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header + ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags + ==================================== ====== ========================== + +``ETHTOOL_A_PRIVFLAGS_FLAGS`` can either set the whole set of private flags or +modify only values of some of them. + + +RINGS_GET +========= + +Gets ring sizes like ``ETHTOOL_GRINGPARAM`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_RINGS_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_RINGS_HEADER`` nested reply header + ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring + ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring + ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring + ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring + ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring + ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring + ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring + ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring + ==================================== ====== ========================== + + +RINGS_SET +========= + +Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_RINGS_HEADER`` nested reply header + ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring + ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring + ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring + ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring + ==================================== ====== ========================== + +Kernel checks that requested ring sizes do not exceed limits reported by +driver. Driver may impose additional constraints and may not suspport all +attributes. + + +CHANNELS_GET +============ + +Gets channel counts like ``ETHTOOL_GCHANNELS`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_CHANNELS_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_CHANNELS_HEADER`` nested reply header + ``ETHTOOL_A_CHANNELS_RX_MAX`` u32 max receive channels + ``ETHTOOL_A_CHANNELS_TX_MAX`` u32 max transmit channels + ``ETHTOOL_A_CHANNELS_OTHER_MAX`` u32 max other channels + ``ETHTOOL_A_CHANNELS_COMBINED_MAX`` u32 max combined channels + ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count + ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count + ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count + ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count + ===================================== ====== ========================== + + +CHANNELS_SET +============ + +Sets channel counts like ``ETHTOOL_SCHANNELS`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_CHANNELS_HEADER`` nested request header + ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count + ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count + ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count + ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count + ===================================== ====== ========================== + +Kernel checks that requested channel counts do not exceed limits reported by +driver. Driver may impose additional constraints and may not suspport all +attributes. + + +COALESCE_GET +============ + +Gets coalescing parameters like ``ETHTOOL_GCOALESCE`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_COALESCE_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + =========================================== ====== ======================= + ``ETHTOOL_A_COALESCE_HEADER`` nested reply header + ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx + ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ + ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx + ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 IRQ packets, Tx in IRQ + ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update + ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce + ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce + ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate + ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx + ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx + ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate + ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx + ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx + ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval + =========================================== ====== ======================= + +Attributes are only included in reply if their value is not zero or the +corresponding bit in ``ethtool_ops::supported_coalesce_params`` is set (i.e. +they are declared as supported by driver). + + +COALESCE_SET +============ + +Sets coalescing parameters like ``ETHTOOL_SCOALESCE`` ioctl request. + +Request contents: + + =========================================== ====== ======================= + ``ETHTOOL_A_COALESCE_HEADER`` nested request header + ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx + ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ + ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx + ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 IRQ packets, Tx in IRQ + ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update + ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce + ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce + ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate + ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx + ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx + ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate + ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx + ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx + ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx + ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx + ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval + =========================================== ====== ======================= + +Request is rejected if it attributes declared as unsupported by driver (i.e. +such that the corresponding bit in ``ethtool_ops::supported_coalesce_params`` +is not set), regardless of their values. Driver may impose additional +constraints on coalescing parameters and their values. + + +PAUSE_GET +============ + +Gets channel counts like ``ETHTOOL_GPAUSE`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PAUSE_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PAUSE_HEADER`` nested request header + ``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation + ``ETHTOOL_A_PAUSE_RX`` bool receive pause frames + ``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames + ===================================== ====== ========================== + + +PAUSE_SET +============ + +Sets pause parameters like ``ETHTOOL_GPAUSEPARAM`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PAUSE_HEADER`` nested request header + ``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation + ``ETHTOOL_A_PAUSE_RX`` bool receive pause frames + ``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames + ===================================== ====== ========================== + + +EEE_GET +======= + +Gets channel counts like ``ETHTOOL_GEEE`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_EEE_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_EEE_HEADER`` nested request header + ``ETHTOOL_A_EEE_MODES_OURS`` bool supported/advertised modes + ``ETHTOOL_A_EEE_MODES_PEER`` bool peer advertised link modes + ``ETHTOOL_A_EEE_ACTIVE`` bool EEE is actively used + ``ETHTOOL_A_EEE_ENABLED`` bool EEE is enabled + ``ETHTOOL_A_EEE_TX_LPI_ENABLED`` bool Tx lpi enabled + ``ETHTOOL_A_EEE_TX_LPI_TIMER`` u32 Tx lpi timeout (in us) + ===================================== ====== ========================== + +In ``ETHTOOL_A_EEE_MODES_OURS``, mask consists of link modes for which EEE is +enabled, value of link modes for which EEE is advertised. Link modes for which +peer advertises EEE are listed in ``ETHTOOL_A_EEE_MODES_PEER`` (no mask). The +netlink interface allows reporting EEE status for all link modes but only +first 32 are provided by the ``ethtool_ops`` callback. + + +EEE_SET +======= + +Sets pause parameters like ``ETHTOOL_GEEEPARAM`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_EEE_HEADER`` nested request header + ``ETHTOOL_A_EEE_MODES_OURS`` bool advertised modes + ``ETHTOOL_A_EEE_ENABLED`` bool EEE is enabled + ``ETHTOOL_A_EEE_TX_LPI_ENABLED`` bool Tx lpi enabled + ``ETHTOOL_A_EEE_TX_LPI_TIMER`` u32 Tx lpi timeout (in us) + ===================================== ====== ========================== + +``ETHTOOL_A_EEE_MODES_OURS`` is used to either list link modes to advertise +EEE for (if there is no mask) or specify changes to the list (if there is +a mask). The netlink interface allows reporting EEE status for all link modes +but only first 32 can be set at the moment as that is what the ``ethtool_ops`` +callback supports. + + +TSINFO_GET +========== + +Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_TSINFO_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_TSINFO_HEADER`` nested request header + ``ETHTOOL_A_TSINFO_TIMESTAMPING`` bitset SO_TIMESTAMPING flags + ``ETHTOOL_A_TSINFO_TX_TYPES`` bitset supported Tx types + ``ETHTOOL_A_TSINFO_RX_FILTERS`` bitset supported Rx filters + ``ETHTOOL_A_TSINFO_PHC_INDEX`` u32 PTP hw clock index + ===================================== ====== ========================== + +``ETHTOOL_A_TSINFO_PHC_INDEX`` is absent if there is no associated PHC (there +is no special value for this case). The bitset attributes are omitted if they +would be empty (no bit set). + + Request translation =================== @@ -545,37 +980,37 @@ have their netlink replacement yet. ``ETHTOOL_GLINK`` ``ETHTOOL_MSG_LINKSTATE_GET`` ``ETHTOOL_GEEPROM`` n/a ``ETHTOOL_SEEPROM`` n/a - ``ETHTOOL_GCOALESCE`` n/a - ``ETHTOOL_SCOALESCE`` n/a - ``ETHTOOL_GRINGPARAM`` n/a - ``ETHTOOL_SRINGPARAM`` n/a - ``ETHTOOL_GPAUSEPARAM`` n/a - ``ETHTOOL_SPAUSEPARAM`` n/a - ``ETHTOOL_GRXCSUM`` n/a - ``ETHTOOL_SRXCSUM`` n/a - ``ETHTOOL_GTXCSUM`` n/a - ``ETHTOOL_STXCSUM`` n/a - ``ETHTOOL_GSG`` n/a - ``ETHTOOL_SSG`` n/a + ``ETHTOOL_GCOALESCE`` ``ETHTOOL_MSG_COALESCE_GET`` + ``ETHTOOL_SCOALESCE`` ``ETHTOOL_MSG_COALESCE_SET`` + ``ETHTOOL_GRINGPARAM`` ``ETHTOOL_MSG_RINGS_GET`` + ``ETHTOOL_SRINGPARAM`` ``ETHTOOL_MSG_RINGS_SET`` + ``ETHTOOL_GPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_GET`` + ``ETHTOOL_SPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_SET`` + ``ETHTOOL_GRXCSUM`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SRXCSUM`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GTXCSUM`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_STXCSUM`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GSG`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SSG`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_TEST`` n/a ``ETHTOOL_GSTRINGS`` ``ETHTOOL_MSG_STRSET_GET`` ``ETHTOOL_PHYS_ID`` n/a ``ETHTOOL_GSTATS`` n/a - ``ETHTOOL_GTSO`` n/a - ``ETHTOOL_STSO`` n/a + ``ETHTOOL_GTSO`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_STSO`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_GPERMADDR`` rtnetlink ``RTM_GETLINK`` - ``ETHTOOL_GUFO`` n/a - ``ETHTOOL_SUFO`` n/a - ``ETHTOOL_GGSO`` n/a - ``ETHTOOL_SGSO`` n/a - ``ETHTOOL_GFLAGS`` n/a - ``ETHTOOL_SFLAGS`` n/a - ``ETHTOOL_GPFLAGS`` n/a - ``ETHTOOL_SPFLAGS`` n/a + ``ETHTOOL_GUFO`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SUFO`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GGSO`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SGSO`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GFLAGS`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SFLAGS`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_GET`` + ``ETHTOOL_SPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_SET`` ``ETHTOOL_GRXFH`` n/a ``ETHTOOL_SRXFH`` n/a - ``ETHTOOL_GGRO`` n/a - ``ETHTOOL_SGRO`` n/a + ``ETHTOOL_GGRO`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SGRO`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_GRXRINGS`` n/a ``ETHTOOL_GRXCLSRLCNT`` n/a ``ETHTOOL_GRXCLSRULE`` n/a @@ -589,18 +1024,18 @@ have their netlink replacement yet. ``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET`` ``ETHTOOL_GRXFHINDIR`` n/a ``ETHTOOL_SRXFHINDIR`` n/a - ``ETHTOOL_GFEATURES`` n/a - ``ETHTOOL_SFEATURES`` n/a - ``ETHTOOL_GCHANNELS`` n/a - ``ETHTOOL_SCHANNELS`` n/a + ``ETHTOOL_GFEATURES`` ``ETHTOOL_MSG_FEATURES_GET`` + ``ETHTOOL_SFEATURES`` ``ETHTOOL_MSG_FEATURES_SET`` + ``ETHTOOL_GCHANNELS`` ``ETHTOOL_MSG_CHANNELS_GET`` + ``ETHTOOL_SCHANNELS`` ``ETHTOOL_MSG_CHANNELS_SET`` ``ETHTOOL_SET_DUMP`` n/a ``ETHTOOL_GET_DUMP_FLAG`` n/a ``ETHTOOL_GET_DUMP_DATA`` n/a - ``ETHTOOL_GET_TS_INFO`` n/a + ``ETHTOOL_GET_TS_INFO`` ``ETHTOOL_MSG_TSINFO_GET`` ``ETHTOOL_GMODULEINFO`` n/a ``ETHTOOL_GMODULEEEPROM`` n/a - ``ETHTOOL_GEEE`` n/a - ``ETHTOOL_SEEE`` n/a + ``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET`` + ``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET`` ``ETHTOOL_GRSSH`` n/a ``ETHTOOL_SRSSH`` n/a ``ETHTOOL_GTUNABLE`` n/a diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c4a328f2d57a..2f0f8b17dade 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -606,7 +606,7 @@ before a conversion to the new layout is being done behind the scenes! Currently, the classic BPF format is being used for JITing on most 32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, -sparc64, arm32, riscv (RV64G) perform JIT compilation from eBPF +sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF instruction set. Some core changes of the new internal format: diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d07d9855dcd3..6538ede29661 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -8,6 +8,7 @@ Contents: netdev-FAQ af_xdp + bareudp batman-adv can can_ucan_protocol @@ -21,6 +22,7 @@ Contents: z8530book msg_zerocopy failover + net_dim net_failover phy sfp-phylink @@ -33,6 +35,7 @@ Contents: tls tls-offload nfc + 6lowpan .. only:: subproject and html diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 5f53faff4e25..9375324aa8e1 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -812,7 +812,7 @@ tcp_limit_output_bytes - INTEGER tcp_challenge_ack_limit - INTEGER Limits number of Challenge ACK sent per second, as recommended in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) - Default: 100 + Default: 1000 tcp_rx_skb_cache - BOOLEAN Controls a per TCP socket cache of one skb, that might help @@ -958,6 +958,15 @@ ip_nonlocal_bind - BOOLEAN which can be quite useful - but may break some applications. Default: 0 +ip_autobind_reuse - BOOLEAN + By default, bind() does not select the ports automatically even if + the new socket and all sockets bound to the port have SO_REUSEADDR. + ip_autobind_reuse allows bind() to reuse the port and this is useful + when you use bind()+connect(), but may break some applications. + The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this + option should only be set by experts. + Default: 0 + ip_dynaddr - BOOLEAN If set non-zero, enables support for dynamic addresses. If set to a non-zero value larger than 1, a kernel log @@ -974,6 +983,13 @@ ip_early_demux - BOOLEAN reduces overall throughput, in such case you should disable it. Default: 1 +ping_group_range - 2 INTEGERS + Restrict ICMP_PROTO datagram sockets to users in the group range. + The default is "1 0", meaning, that nobody (not even root) may + create ping sockets. Setting it to "100 100" would grant permissions + to the single group. "0 4294967295" would enable it for the world, "100 + 4294967295" would enable it for the users, but not daemons. + tcp_early_demux - BOOLEAN Enable early demux for established TCP sockets. Default: 1 diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.rst index 9bdb7d5a3ba3..3bed9fd95336 100644 --- a/Documentation/networking/net_dim.txt +++ b/Documentation/networking/net_dim.rst @@ -1,28 +1,20 @@ +====================================================== Net DIM - Generic Network Dynamic Interrupt Moderation ====================================================== -Author: - Tal Gilboa <[email protected]> - - -Contents -========= +:Author: Tal Gilboa <[email protected]> -- Assumptions -- Introduction -- The Net DIM Algorithm -- Registering a Network Device to DIM -- Example +.. contents:: :depth: 2 -Part 0: Assumptions -====================== +Assumptions +=========== This document assumes the reader has basic knowledge in network drivers and in general interrupt moderation. -Part I: Introduction -====================== +Introduction +============ Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the interrupt moderation configuration of a channel in order to optimize packet @@ -41,14 +33,15 @@ number of wanted packets per event. The Net DIM algorithm ascribes importance to increase bandwidth over reducing interrupt rate. -Part II: The Net DIM Algorithm -=============================== +Net DIM Algorithm +================= Each iteration of the Net DIM algorithm follows these steps: -1. Calculates new data sample. -2. Compares it to previous sample. -3. Makes a decision - suggests interrupt moderation configuration fields. -4. Applies a schedule work function, which applies suggested configuration. + +#. Calculates new data sample. +#. Compares it to previous sample. +#. Makes a decision - suggests interrupt moderation configuration fields. +#. Applies a schedule work function, which applies suggested configuration. The first two steps are straightforward, both the new and the previous data are supplied by the driver registered to Net DIM. The previous data is the new data @@ -89,19 +82,21 @@ manoeuvre as it may provide partial data or ignore the algorithm suggestion under some conditions. -Part III: Registering a Network Device to DIM -============================================== +Registering a Network Device to DIM +=================================== -Net DIM API exposes the main function net_dim(struct dim *dim, -struct dim_sample end_sample). This function is the entry point to the Net +Net DIM API exposes the main function net_dim(). +This function is the entry point to the Net DIM algorithm and has to be called every time the driver would like to check if it should change interrupt moderation parameters. The driver should provide two -data structures: struct dim and struct dim_sample. Struct dim +data structures: :c:type:`struct dim <dim>` and +:c:type:`struct dim_sample <dim_sample>`. :c:type:`struct dim <dim>` describes the state of DIM for a specific object (RX queue, TX queue, other queues, etc.). This includes the current selected profile, previous data samples, the callback function provided by the driver and more. -Struct dim_sample describes a data sample, which will be compared to the -data sample stored in struct dim in order to decide on the algorithm's next +:c:type:`struct dim_sample <dim_sample>` describes a data sample, +which will be compared to the data sample stored in :c:type:`struct dim <dim>` +in order to decide on the algorithm's next step. The sample should include bytes, packets and interrupts, measured by the driver. @@ -110,9 +105,10 @@ main net_dim() function. The recommended method is to call net_dim() on each interrupt. Since Net DIM has a built-in moderation and it might decide to skip iterations under certain conditions, there is no need to moderate the net_dim() calls as well. As mentioned above, the driver needs to provide an object of type -struct dim to the net_dim() function call. It is advised for each entity -using Net DIM to hold a struct dim as part of its data structure and use it -as the main Net DIM API object. The struct dim_sample should hold the latest +:c:type:`struct dim <dim>` to the net_dim() function call. It is advised for +each entity using Net DIM to hold a :c:type:`struct dim <dim>` as part of its +data structure and use it as the main Net DIM API object. +The :c:type:`struct dim_sample <dim_sample>` should hold the latest bytes, packets and interrupts count. No need to perform any calculations, just include the raw data. @@ -124,19 +120,19 @@ the data flow. After the work is done, Net DIM algorithm needs to be set to the proper state in order to move to the next iteration. -Part IV: Example -================= +Example +======= The following code demonstrates how to register a driver to Net DIM. The actual usage is not complete but it should make the outline of the usage clear. -my_driver.c: +.. code-block:: c -#include <linux/dim.h> + #include <linux/dim.h> -/* Callback for net DIM to schedule on a decision to change moderation */ -void my_driver_do_dim_work(struct work_struct *work) -{ + /* Callback for net DIM to schedule on a decision to change moderation */ + void my_driver_do_dim_work(struct work_struct *work) + { /* Get struct dim from struct work_struct */ struct dim *dim = container_of(work, struct dim, work); @@ -145,11 +141,11 @@ void my_driver_do_dim_work(struct work_struct *work) /* Signal net DIM work is done and it should move to next iteration */ dim->state = DIM_START_MEASURE; -} + } -/* My driver's interrupt handler */ -int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...) -{ + /* My driver's interrupt handler */ + int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...) + { ... /* A struct to hold current measured data */ struct dim_sample dim_sample; @@ -162,13 +158,19 @@ int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...) /* Call net DIM */ net_dim(&my_entity->dim, dim_sample); ... -} + } -/* My entity's initialization function (my_entity was already allocated) */ -int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...) -{ + /* My entity's initialization function (my_entity was already allocated) */ + int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...) + { ... /* Initiate struct work_struct with my driver's callback function */ INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work); ... -} + } + +Dynamic Interrupt Moderation (DIM) library API +============================================== + +.. kernel-doc:: include/linux/dim.h + :internal: diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst new file mode 100644 index 000000000000..43088ddf95e4 --- /dev/null +++ b/Documentation/networking/page_pool.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Page Pool API +============= + +The page_pool allocator is optimized for the XDP mode that uses one frame +per-page, but it can fallback on the regular page allocator APIs. + +Basic use involves replacing alloc_pages() calls with the +page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages() +replacing dev_alloc_pages(). + +API keeps track of inflight pages, in order to let API user know +when it is safe to free a page_pool object. Thus, API users +must run page_pool_release_page() when a page is leaving the page_pool or +call page_pool_put_page() where appropriate in order to maintain correct +accounting. + +API user must call page_pool_put_page() once on a page, as it +will either recycle the page, or in case of refcnt > 1, it will +release the DMA mapping and inflight state accounting. + +Architecture overview +===================== + +.. code-block:: none + + +------------------+ + | Driver | + +------------------+ + ^ + | + | + | + v + +--------------------------------------------+ + | request memory | + +--------------------------------------------+ + ^ ^ + | | + | Pool empty | Pool has entries + | | + v v + +-----------------------+ +------------------------+ + | alloc (and map) pages | | get page from cache | + +-----------------------+ +------------------------+ + ^ ^ + | | + | cache available | No entries, refill + | | from ptr-ring + | | + v v + +-----------------+ +------------------+ + | Fast cache | | ptr-ring cache | + +-----------------+ +------------------+ + +API interface +============= +The number of pools created **must** match the number of hardware queues +unless hardware restrictions make that impossible. This would otherwise beat the +purpose of page pool, which is allocate pages fast from cache without locking. +This lockless guarantee naturally comes from running under a NAPI softirq. +The protection doesn't strictly have to be NAPI, any guarantee that allocating +a page will cause no race conditions is enough. + +* page_pool_create(): Create a pool. + * flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV + * order: 2^order pages on allocation + * pool_size: size of the ptr_ring + * nid: preferred NUMA node for allocation + * dev: struct device. Used on DMA operations + * dma_dir: DMA direction + * max_len: max DMA sync memory size + * offset: DMA address offset + +* page_pool_put_page(): The outcome of this depends on the page refcnt. If the + driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1 + the allocator owns the page and will try to recycle it in one of the pool + caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device + using dma_sync_single_range_for_device(). + +* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync + for the entire memory area configured in area pool->max_len. + +* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller + must guarantee safe context (e.g NAPI), since it will recycle the page + directly into the pool fast cache. + +* page_pool_release_page(): Unmap the page (if mapped) and account for it on + inflight counters. + +* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool + caches. + +* page_pool_get_dma_addr(): Retrieve the stored DMA address. + +* page_pool_get_dma_dir(): Retrieve the stored DMA direction. + +Coding examples +=============== + +Registration +------------ + +.. code-block:: c + + /* Page pool registration */ + struct page_pool_params pp_params = { 0 }; + struct xdp_rxq_info xdp_rxq; + int err; + + pp_params.order = 0; + /* internal DMA mapping in page_pool */ + pp_params.flags = PP_FLAG_DMA_MAP; + pp_params.pool_size = DESC_NUM; + pp_params.nid = NUMA_NO_NODE; + pp_params.dev = priv->dev; + pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; + page_pool = page_pool_create(&pp_params); + + err = xdp_rxq_info_reg(&xdp_rxq, ndev, 0); + if (err) + goto err_out; + + err = xdp_rxq_info_reg_mem_model(&xdp_rxq, MEM_TYPE_PAGE_POOL, page_pool); + if (err) + goto err_out; + +NAPI poller +----------- + + +.. code-block:: c + + /* NAPI Rx poller */ + enum dma_data_direction dma_dir; + + dma_dir = page_pool_get_dma_dir(dring->page_pool); + while (done < budget) { + if (some error) + page_pool_recycle_direct(page_pool, page); + if (packet_is_xdp) { + if XDP_DROP: + page_pool_recycle_direct(page_pool, page); + } else (packet_is_skb) { + page_pool_release_page(page_pool, page); + new_page = page_pool_dev_alloc_pages(page_pool); + } + } + +Driver unload +------------- + +.. code-block:: c + + /* Driver unload */ + page_pool_put_full_page(page_pool, page, false); + xdp_rxq_info_unreg(&xdp_rxq); diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst index d753a309f9d1..5aec7c8857d0 100644 --- a/Documentation/networking/sfp-phylink.rst +++ b/Documentation/networking/sfp-phylink.rst @@ -74,10 +74,13 @@ phylib to the sfp/phylink support. Please send patches to improve this documentation. 1. Optionally split the network driver's phylib update function into - three parts dealing with link-down, link-up and reconfiguring the - MAC settings. This can be done as a separate preparation commit. + two parts dealing with link-down and link-up. This can be done as + a separate preparation commit. - An example of this preparation can be found in git commit fc548b991fb0. + An older example of this preparation can be found in git commit + fc548b991fb0, although this was splitting into three parts; the + link-up part now includes configuring the MAC for the link settings. + Please see :c:func:`mac_link_up` for more information on this. 2. Replace:: @@ -135,27 +138,27 @@ this documentation. .. code-block:: c - static int foo_ethtool_set_link_ksettings(struct net_device *dev, - const struct ethtool_link_ksettings *cmd) - { - struct foo_priv *priv = netdev_priv(dev); - - return phylink_ethtool_ksettings_set(priv->phylink, cmd); - } - - static int foo_ethtool_get_link_ksettings(struct net_device *dev, - struct ethtool_link_ksettings *cmd) - { - struct foo_priv *priv = netdev_priv(dev); + static int foo_ethtool_set_link_ksettings(struct net_device *dev, + const struct ethtool_link_ksettings *cmd) + { + struct foo_priv *priv = netdev_priv(dev); + + return phylink_ethtool_ksettings_set(priv->phylink, cmd); + } - return phylink_ethtool_ksettings_get(priv->phylink, cmd); - } + static int foo_ethtool_get_link_ksettings(struct net_device *dev, + struct ethtool_link_ksettings *cmd) + { + struct foo_priv *priv = netdev_priv(dev); + + return phylink_ethtool_ksettings_get(priv->phylink, cmd); + } -7. Replace the call to: +7. Replace the call to:: phy_dev = of_phy_connect(dev, node, link_func, flags, phy_interface); - and associated code with a call to: + and associated code with a call to:: err = phylink_of_phy_connect(priv->phylink, node, flags); @@ -207,6 +210,14 @@ this documentation. using. This is particularly important for in-band negotiation methods such as 1000base-X and SGMII. + The :c:func:`mac_link_up` method is used to inform the MAC that the + link has come up. The call includes the negotiation mode and interface + for reference only. The finalised link parameters are also supplied + (speed, duplex and flow control/pause enablement settings) which + should be used to configure the MAC when the MAC and PCS are not + tightly integrated, or when the settings are not coming from in-band + negotiation. + The :c:func:`mac_config` method is used to update the MAC with the requested state, and must avoid unnecessarily taking the link down when making changes to the MAC configuration. This means the diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst index 38a4edc4522b..10e11099e74a 100644 --- a/Documentation/networking/snmp_counter.rst +++ b/Documentation/networking/snmp_counter.rst @@ -908,8 +908,8 @@ A TLP probe packet is sent. A packet loss is detected and recovered by TLP. -TCP Fast Open -============= +TCP Fast Open description +========================= TCP Fast Open is a technology which allows data transfer before the 3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a general description. |