aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-06-24gve: DQO: Add core netdev featuresBailey Forrest8-25/+260
Add napi netdev device registration, interrupt handling and initial tx and rx polling stubs. The stubs will be filled in follow-on patches. Also: - LRO feature advertisement and handling - Also update ethtool logic Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Update adminq commands to support DQO queuesBailey Forrest4-29/+64
DQO queue creation requires additional parameters: - TX completion/RX buffer queue size - TX completion/RX buffer queue address - TX/RX queue size - RX buffer size Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Add DQO fields for core data structuresBailey Forrest1-11/+251
- Add new DQO datapath structures: - `gve_rx_buf_queue_dqo` - `gve_rx_compl_queue_dqo` - `gve_rx_buf_state_dqo` - `gve_tx_desc_dqo` - `gve_tx_pending_packet_dqo` - Incorporate these into the existing ring data structures: - `gve_rx_ring` - `gve_tx_ring` Noteworthy mentions: - `gve_rx_buf_state` represents an RX buffer which was posted to HW. Each RX queue has an array of these objects and the index into the array is used as the buffer_id when posted to HW. - `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The completion_tag is the index into the array. - These two structures have links for linked lists which are represented by 16b indexes into a contiguous array of these structures. This reduces memory footprint compared to 64b pointers. - We use unions for the writeable datapath structures to reduce cache footprint. GQI specific members will renamed like DQO members in a future patch. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Add dqo descriptorsBailey Forrest2-1/+257
General description of rings and descriptors: TX ring is used for sending TX packet buffers to the NIC. It has the following descriptors: - `gve_tx_pkt_desc_dqo` - Data buffer descriptor - `gve_tx_tso_context_desc_dqo` - TSO context descriptor - `gve_tx_general_context_desc_dqo` - Generic metadata descriptor Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo` which represents the logical interpetation of the metadata bytes. It's helpful to define this structure because the metadata bytes exist in multiple descriptor types (including `gve_tx_tso_context_desc_dqo`), and the device requires same field has the same value in all descriptors. The TX completion ring is used to receive completions from the NIC. Having a separate ring allows for completions to be out of order. The completion descriptor `gve_tx_compl_desc` has several different types, most important are packet and descriptor completions. Descriptor completions are used to notify the driver when descriptors sent on the TX ring are done being consumed. The descriptor completion is only used to signal that space is cleared in the TX ring. A packet completion will be received when a packet transmitted on the TX queue is done being transmitted. In addition there are "miss" and "reinjection" completions. The device implements a "flow-miss model". Most packets will simply receive a packet completion. The flow-miss system may choose to process a packet based on its contents. A TX packet which experiences a flow miss would receive a miss completion followed by a later reinjection completion. The miss-completion is received when the packet starts to be processed by the flow-miss system and the reinjection completion is received when the flow-miss system completes processing the packet and sends it on the wire. The RX buffer ring is used to send buffers to HW via the `gve_rx_desc_dqo` descriptor. Received packets are put into the RX queue by the device, which populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors refer to buffers posted by the buffer queue. Received buffers may be returned out of order, such as when HW LRO is enabled. Important concepts: - "TX" and "RX buffer" queues, which send descriptors to the device, use MMIO doorbells to notify the device of new descriptors. - "RX" and "TX completion" queues, which receive descriptors from the device, use a "generation bit" to know when a descriptor was populated by the device. The driver initializes all bits with the "current generation". The device will populate received descriptors with the "next generation" which is inverted from the current generation. When the ring wraps, the current/next generation are swapped. - It's the driver's responsibility to ensure that the RX and TX completion queues are not overrun. This can be accomplished by limiting the number of descriptors posted to HW. - TX packets have a 16 bit completion_tag and RX buffers have a 16 bit buffer_id. These will be returned on the TX completion and RX queues respectively to let the driver know which packet/buffer was completed. Bitfields are used to describe descriptor fields. This notation is more concise and readable than shift-and-mask. It is possible because the driver is restricted to little endian platforms. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Add support for DQO RX PTYPE mapBailey Forrest4-2/+127
Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the packet. L3 and L4 types are necessary in order to set the hash and csum on RX SKBs correctly. DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map enables the device to tell the driver how to map from PTYPE index to L3/L4 type. The device doesn't provide any guarantees about the range of possible PTYPEs, so we just use a 1024 entry array to implement a fast mapping structure. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: adminq: DQO specific device descriptor logicBailey Forrest2-15/+55
- In addition to TX and RX queues, DQO has TX completion and RX buffer queues. - TX completions are received when the device has completed sending a packet on the wire. - RX buffers are posted on a separate queue form the RX completions. - DQO descriptor rings are allowed to be smaller than PAGE_SIZE. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Introduce per netdev `enum gve_queue_format`Bailey Forrest5-15/+37
The currently supported queue formats are: - GQI_RDA - GQI with raw DMA addressing - GQI_QPL - GQI with queue page list - DQO_RDA - DQO with raw DMA addressing The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we remove it in favor of just checking against GQI_RDA Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Introduce a new model for device optionsBailey Forrest2-43/+179
The current model uses an integer ID and a fixed size struct for the parameters of each device option. The new model allows the device option structs to grow in size over time. A driver may assume that changes to device option structs will always be appended. New device options will also generally have a `supported_features_mask` so that the driver knows which fields within a particular device option are enabled. `gve_device_option.feat_mask` is changed to `required_features_mask`, and it is a bitmask which must match the value expected by the driver. This gives the device the ability to break backwards compatibility with old drivers for certain features by blocking the old drivers from trying to use the feature. We maintain ABI compatibility with the old model for GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which does not support the new model. This patch introduces some new terminology: RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA mapped and read/updated by the device. QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped with the device for read/write and data is copied from/to SKBs. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Make gve_rx_slot_page_info.page_offset an absolute offsetBailey Forrest3-5/+5
Using `page_offset` like a boolean means a page may only be split into two sections. With page sizes larger than 4k, this can be very wasteful. Future commits in this patchset use `struct gve_rx_slot_page_info` in a way which supports a fixed buffer size and a variable page size. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: gve_rx_copy: Move padding to an argumentBailey Forrest3-5/+7
Future use cases will have a different padding value. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Move some static functions to a common fileBailey Forrest5-60/+94
These functions will be shared by the GQI and DQO variants of the GVNIC driver as of follow-up patches in this series. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24gve: Update GVE documentation to describe DQOBailey Forrest1-5/+48
DQO is a new descriptor format for our next generation virtual NIC. Signed-off-by: Bailey Forrest <[email protected]> Reviewed-by: Willem de Bruijn <[email protected]> Reviewed-by: Catherine Sullivan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24usbnet: add usbnet_event_names[] for keventYajun Deng1-2/+19
Modify the netdev_dbg content from int to char * in usbnet_defer_kevent(), this looks more readable. Signed-off-by: Yajun Deng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24Merge branch 'add-sparx5i-driver'David S. Miller21-84/+12318
Steen Hegelund says: ==================== Adding the Sparx5i Switch Driver This series provides the Microchip Sparx5i Switch Driver The SparX-5 Enterprise Ethernet switch family provides a rich set of Enterprise switching features such as advanced TCAM-based VLAN and QoS processing enabling delivery of differentiated services, and security through TCAMbased frame processing using versatile content aware processor (VCAP). IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6 (S,G) multicast groups. L3 security features include source guard and reverse path forwarding (uRPF) tasks. Additional L3 features include VRF-Lite and IP tunnels (IP over GRE/IP). The SparX-5 switch family features a highly flexible set of Ethernet ports with support for 10G and 25G aggregation links, QSGMII, USGMII, and USXGMII. The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53 CPU enabling full management of the switch and advanced Enterprise applications. The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching with 10G/25G aggregation links is required. The SparX-5 switch family consists of following SKUs: VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following primary port configurations. - 6 ×10G - 16 × 2.5G + 2 × 10G - 24 × 1G + 4 × 10G VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following primary port configurations. - 9 × 10G - 16 × 2.5G + 4 × 10G - 48 × 1G + 4 × 10G VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the following primary port configurations. - 12 × 10G - 6 x 10G + 2 x 25G - 16 × 2.5G + 8 × 10G - 48 × 1G + 8 × 10G VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the following primary port configurations. - 16 × 10G - 10 × 10G + 2 × 25G - 16 × 2.5G + 10 × 10G - 48 × 1G + 10 × 10G VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the following primary port configurations. - 20 × 10G - 8 × 25G In addition, the device supports one 10/100/1000/2500/5000 Mbps SGMII/SerDes node processor interface (NPI) Ethernet port. Time sensitive networking (TSN) is supported through a comprehensive set of features including frame preemption, cut-through, frame replication and elimination for reliability, enhanced scheduling: credit-based shaping, time-aware shaping, cyclic queuing, and forwarding, and per-stream policing and filtering. Together with IEEE 1588 and IEEE 802.1AS support, this guarantees low-latency deterministic networking for Industrial Ethernet. The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards. - PCB134 main networking features: - 12x SFP+ front 10G module slots (connected to Sparx5i through SFI). - 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high speed). - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port (on-board VSC8211 PHY connected to Sparx5i through SGMII). - PCB135 main networking features: - 48x1G (10/100/1000M) RJ45 front ports using 12xVSC8514 QuadPHY’s each connected to VSC7558 through QSGMII. - 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY each port connects to VSC7558 through SFI. - 4x SFP28 25G module slots on back connected to VSC7558 through SFI high speed. - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board VSC8211 PHY, which can be connected to VSC7558 NPI port through SGMII using a loopback add-on PCB) This series provides support for: - SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G devices and media types. - Port module configuration for 10M to 25G speeds with SGMII, QSGMII, 1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes. - SerDes configuration via the Sparx5i SerDes driver (see below). - Host mode providing register based injection and extraction. - Switch mode providing MAC/VLAN table learning and Layer2 switching offloaded to the Sparx5i switch. - STP state, VLAN support, host/bridge port mode, Forwarding DB, and configuration and statistics via ethtool. More support will be added at a later stage. The Sparx5i Chip Register Model can be browsed at this location: https://github.com/microchip-ung/sparx-5_reginfo and the datasheet is available here: https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf The series depends on the following series currently on their way into the kernel: - 25G Base-R phy mode Link: https://lore.kernel.org/r/[email protected]/ - Sparx5 Reset Driver Link: https://lore.kernel.org/r/[email protected]/ ChangeLog: v5: - cover letter - updated the description to match the latest data sheets - basic driver - added error message in case of reset controller error - port struct: replacing has_sfp with inband, adding pause_adv - host mode - port cleanup: unregisters netdevs and then removes phylink etc - checking for pause_adv when comparing port config changes - getting duplex and pause state in the link_up callback. - getting inband, autoneg and pause_adv config in the pcs_config callback. - port - use only the pause_adv bits when getting aneg status - use the inband state when updating the PCS and port config v4: - basic driver: Using devm_reset_control_get_optional_shared to get the reset control, and let the reset framework check if it is valid. - host mode (phylink): Use the PCS operations to get state and update configuration. Removed the setting of interface modes. Let phylink control this. Using the new 5gbase-r and 25gbase-r modes. Using a helper function to check if one of the 3 base-r modes has been selected. Currently it will not be possible to change the interface mode by changing the speed (e.g via ethtool). This will be added later. v3: - basic driver: - removed unneeded braces - release reference to ports node after use - use dev_err_probe to handle DEFER - update error value when bailing out (a few cases) - updated formatting of port struct and grouping of bool values - simplified the spx5_rmw and spx5_inst_rmw inline functions - host mode (netdev): - removed lockless flag - added port timer init - host mode (packet - manual injection): - updated error counters in error situations - implemented timer handling of watermark threshold: stop and restart netif queues. - fixed error message handling (rate limited) - fixed comment style error - used DIV_ROUND_UP macro - removed a debug message for open ports v2: - Updated bindings: - drop minItems for the reg property - Statistics implementation: - Reorganized statistics into ethtool groups: eth-phy, eth-mac, eth-ctrl, rmon as defined by the IEEE 802.3 categories and RFC 2819. - The remaining statistics are provided by the classic ethtool statistics command. - Hostmode support: - Removed netdev renaming - Validate ethernet address in sparx5_set_mac_address() ==================== Signed-off-by: David S. Miller <[email protected]>
2021-06-24arm64: dts: sparx5: Add the Sparx5 switch nodeSteen Hegelund3-84/+1112
This provides the configuration for the currently available evaluation boards PCB134 and PCB135. The series depends on the following series currently on its way into the kernel: - Sparx5 Reset Driver Link: https://lore.kernel.org/r/[email protected]/ Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add ethtool configuration and statistics supportSteen Hegelund5-1/+1248
This adds statistic counters for the network interfaces provided by the driver. It also adds CPU port counters (which are not exposed by ethtool). This also adds support for configuring the network interface parameters via ethtool: speed, duplex, aneg etc. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add calendar bandwidth allocation supportSteen Hegelund4-2/+609
This configures the Sparx5 calendars according to the bandwidth requested in the Device Tree nodes. It also checks if the total requested bandwidth is within the specs of the detected Sparx5 models limits. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add switching supportSteen Hegelund7-1/+544
This adds SwitchDev support by hardware offloading the software bridge. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add vlan supportSteen Hegelund4-4/+246
This adds Sparx5 VLAN support. Sparx5 has more VLAN features than provided here, but these will be added in later series. For now we only add the basic L2 features. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add mactable supportSteen Hegelund5-2/+565
This adds the Sparx5 MAC tables: listening for MAC table updates and updating on request. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add port module supportSteen Hegelund6-12/+1279
This add configuration of the Sparx5 port module instances. Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33 physical SerDes connections (S0 to S32). The 65th port (D64) is fixed allocated to SerDes0 (S0). The remaining 64 ports can in various multiplexing scenarios be connected to the remaining 32 SerDes using QSGMII, or USGMII or USXGMII extenders. 32 of the ports can have a 1:1 mapping to the 32 SerDes. Some additional ports (D65 to D69) are internal to the device and do not connect to port modules or SerDes macros. For example, internal ports are used for frame injection and extraction to the CPU queues. The 65 logical ports are split up into the following blocks. - 13 x 5G ports (D0-D11, D64) - 32 x 2G5 ports (D16-D47) - 12 x 10G ports (D12-D15, D48-D55) - 8 x 25G ports (D56-D63) Each logical port supports different line speeds, and depending on the speeds supported, different port modules (MAC+PCS) are needed. A port supporting 5 Gbps, 10 Gbps, or 25 Gbps as maximum line speed, will have a DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps (incl 5 Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. As well as, it will have a shadow DEV2G5 port module to support the lower speeds (10/100/1000/2500Mbps). When a port needs to operate at lower speed and the shadow DEV2G5 needs to be connected to its corresponding SerDes Not all interface modes are supported in this series, but will be added at a later stage. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add hostmode with phylink supportSteen Hegelund6-10/+841
This patch adds netdevs and phylink support for the ports in the switch. It also adds register based injection and extraction for these ports. Frame DMA support for injection and extraction will be added in a later series. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: sparx5: add the basic sparx5 driverSteen Hegelund7-0/+5680
This adds the Sparx5 basic SwitchDev driver framework with IO range mapping, switch device detection and core clock configuration. Support for ports, phylink, netdev, mactable etc. are in the following patches. Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Reviewed-by: Philipp Zabel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24dt-bindings: net: sparx5: Add sparx5-switch bindingsSteen Hegelund1-0/+226
Document the Sparx5 switch device driver bindings Signed-off-by: Steen Hegelund <[email protected]> Signed-off-by: Lars Povlsen <[email protected]> Signed-off-by: Bjarni Jonasson <[email protected]> Reviewed-by: Rob Herring <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: mdiobus: fix fwnode_mdbiobus_register() fallback caseMarcin Wojtas1-1/+1
The fallback case of fwnode_mdbiobus_register() (relevant for !CONFIG_FWNODE_MDIO) was defined with wrong argument name, causing a compilation error. Fix that. Signed-off-by: Marcin Wojtas <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: ip: avoid OOM kills with large UDP sends over loopbackJakub Kicinski2-29/+35
Dave observed number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has MTU of 64k kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse if the message length is <32k the allocation may trigger an OOM killer. This is entirely avoidable, we can use an skb with page frags. af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if underlying device supports SG, if extra allocator pressure causes regressions in real workloads we can switch to trying the large allocation first and falling back. v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2 Reported-by: Dave Jones <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24tools/testing: add a selftest for SO_NETNS_COOKIELorenz Bauer4-1/+64
Make sure that SO_NETNS_COOKIE returns a non-zero value, and that sockets from different namespaces have a distinct cookie value. Signed-off-by: Lorenz Bauer <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24net: retrieve netns cookie via getsocketoptMartynas Pumputis6-0/+17
It's getting more common to run nested container environments for testing cloud software. One of such examples is Kind [1] which runs a Kubernetes cluster in Docker containers on a single host. Each container acts as a Kubernetes node, and thus can run any Pod (aka container) inside the former. This approach simplifies testing a lot, as it eliminates complicated VM setups. Unfortunately, such a setup breaks some functionality when cgroupv2 BPF programs are used for load-balancing. The load-balancer BPF program needs to detect whether a request originates from the host netns or a container netns in order to allow some access, e.g. to a service via a loopback IP address. Typically, the programs detect this by comparing netns cookies with the one of the init ns via a call to bpf_get_netns_cookie(NULL). However, in nested environments the latter cannot be used given the Kubernetes node's netns is outside the init ns. To fix this, we need to pass the Kubernetes node netns cookie to the program in a different way: by extending getsockopt() with a SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes node netns can retrieve the cookie and pass it to the program instead. Thus, this is following up on Eric's commit 3d368ab87cf6 ("net: initialize net->net_cookie at netns setup") to allow retrieval via SO_NETNS_COOKIE. This is also in line in how we retrieve socket cookie via SO_COOKIE. [1] https://kind.sigs.k8s.io/ Signed-off-by: Lorenz Bauer <[email protected]> Signed-off-by: Martynas Pumputis <[email protected]> Cc: Eric Dumazet <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-06-24ti: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-8/+2
The cpsw driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Grygorii Strashko <[email protected]> Reviewed-by: Grygorii Strashko <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24stmmac: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-8/+2
The stmmac driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Giuseppe Cavallaro <[email protected]> Cc: Alexandre Torgue <[email protected]> Cc: Jose Abreu <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24netsec: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-3/+0
The netsec driver has a rcu_read_lock()/rcu_read_unlock() pair around the full RX loop, covering everything up to and including xdp_do_flush(). This is actually the correct behaviour, but because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), it is also technically redundant. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep the rcu_read_lock() around anymore, so let's just remove it. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Ilias Apalodimas <[email protected]> Cc: Jassi Brar <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24sfc: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-7/+2
The sfc driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Edward Cree <[email protected]> Cc: Martin Habets <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24qede: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-6/+0
The qede driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Ariel Elior <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24nfp: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-2/+0
The nfp driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. While this is not actually an issue for the nfp driver because it doesn't support XDP_REDIRECT (and thus doesn't call xdp_do_flush()), the rcu_read_lock() is still unneeded. And With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Simon Horman <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24mlx4: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-6/+2
The mlx4 driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Also switch the RCU dereferences in the driver loop itself to the _bh variants. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Tariq Toukan <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24marvell: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen2-6/+0
The mvneta and mvpp2 drivers have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Thomas Petazzoni <[email protected]> Cc: Russell King <[email protected]> Cc: Marcin Wojtas <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24intel: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen9-27/+3
The Intel drivers all have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: Jesper Dangaard Brouer <[email protected]> # i40e Cc: Jesse Brandeburg <[email protected]> Cc: Tony Nguyen <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24freescale: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen2-10/+1
The dpaa and dpaa2 drivers have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: Camelia Groza <[email protected]> Cc: Ioana Radulescu <[email protected]> Cc: Madalin Bucur <[email protected]> Cc: Ioana Ciornei <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24thunderx: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-2/+0
The thunderx driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Sunil Goutham <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24bnxt: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-2/+0
The bnxt driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Michael Chan <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24ena: Remove rcu_read_lock() around XDP program invocationToke Høiland-Jørgensen1-3/+0
The ena driver has rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Cc: Saeed Bishara <[email protected]> Cc: Guy Tzalik <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24bpf, sched: Remove unneeded rcu_read_lock() around BPF program invocationToke Høiland-Jørgensen2-5/+0
The rcu_read_lock() call in cls_bpf and act_bpf are redundant: on the TX side, there's already a call to rcu_read_lock_bh() in __dev_queue_xmit(), and on RX there's a covering rcu_read_lock() in netif_receive_skb{,_list}_internal(). With the previous patches we also amended the lockdep checks in the map code to not require any particular RCU flavour, so we can just get rid of the rcu_read_lock()s. Suggested-by: Daniel Borkmann <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24xdp: Add proper __rcu annotations to redirect map entriesToke Høiland-Jørgensen8-54/+83
XDP_REDIRECT works by a three-step process: the bpf_redirect() and bpf_redirect_map() helpers will lookup the target of the redirect and store it (along with some other metadata) in a per-CPU struct bpf_redirect_info. Next, when the program returns the XDP_REDIRECT return code, the driver will call xdp_do_redirect() which will use the information thus stored to actually enqueue the frame into a bulk queue structure (that differs slightly by map type, but shares the same principle). Finally, before exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will flush all the different bulk queues, thus completing the redirect. Pointers to the map entries will be kept around for this whole sequence of steps, protected by RCU. However, there is no top-level rcu_read_lock() in the core code; instead drivers add their own rcu_read_lock() around the XDP portions of the code, but somewhat inconsistently as Martin discovered[0]. However, things still work because everything happens inside a single NAPI poll sequence, which means it's between a pair of calls to local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could document this intention by using rcu_dereference_check() with rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and lockdep to verify that everything is done correctly. This patch does just that: we add an __rcu annotation to the map entry pointers and remove the various comments explaining the NAPI poll assurance strewn through devmap.c in favour of a longer explanation in filter.c. The goal is to have one coherent documentation of the entire flow, and rely on the RCU annotations as a "standard" way of communicating the flow in the map code (which can additionally be understood by sparse and lockdep). The RCU annotation replacements result in a fairly straight-forward replacement where READ_ONCE() becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer() and xchg() and cmpxchg() gets wrapped in the proper constructs to cast the pointer back and forth between __rcu and __kernel address space (for the benefit of sparse). The one complication is that xskmap has a few constructions where double-pointers are passed back and forth; these simply all gain __rcu annotations, and only the final reference/dereference to the inner-most pointer gets changed. With this, everything can be run through sparse without eliciting complaints, and lockdep can verify correctness even without the use of rcu_read_lock() in the drivers. Subsequent patches will clean these up from the drivers. [0] https://lore.kernel.org/bpf/[email protected]/ [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/ Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24bpf: Allow RCU-protected lookups to happen from bh contextToke Høiland-Jørgensen3-12/+21
XDP programs are called from a NAPI poll context, which means the RCU reference liveness is ensured by local_bh_disable(). Add rcu_read_lock_bh_held() as a condition to the RCU checks for map lookups so lockdep understands that the dereferences are safe from inside *either* an rcu_read_lock() section *or* a local_bh_disable() section. While both bh_disabled and rcu_read_lock() provide RCU protection, they are semantically distinct, so we need both conditions to prevent lockdep complaints. This change is done in preparation for removing the redundant rcu_read_lock()s from drivers. Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24doc: Give XDP as example of non-obvious RCU reader/updater pairingToke Høiland-Jørgensen1-2/+9
This commit gives an example of non-obvious RCU reader/updater pairing in the guise of the XDP feature in networking, which calls BPF programs from network-driver NAPI (softirq) context. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24doc: Clarify and expand RCU updaters and corresponding readersPaul E. McKenney1-21/+27
This commit clarifies which primitives readers can use given that the corresponding updaters have made a specific choice. This commit also adds this information for the various RCU Tasks flavors. While in the area, it removes a paragraph that no longer applies in any straightforward manner. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24rcu: Create an unrcu_pointer() to remove __rcu from a pointerPaul E. McKenney1-0/+14
The xchg() and cmpxchg() functions are sometimes used to carry out RCU updates. Unfortunately, this can result in sparse warnings for both the old-value and new-value arguments, as well as for the return value. The arguments can be dealt with using RCU_INITIALIZER(): old_p = xchg(&p, RCU_INITIALIZER(new_p)); But a sparse warning still remains due to assigning the __rcu pointer returned from xchg to the (most likely) non-__rcu pointer old_p. This commit therefore provides an unrcu_pointer() macro that strips the __rcu. This macro can be used as follows: old_p = unrcu_pointer(xchg(&p, RCU_INITIALIZER(new_p))); Reported-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Toke Høiland-Jørgensen <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2021-06-24iwlwifi: acpi: remove unused function iwl_acpi_eval_dsm_func()Kalle Valo1-36/+0
Stephen reported a warning: drivers/net/wireless/intel/iwlwifi/fw/acpi.c:720:12: warning: 'iwl_acpi_eval_dsm_func' defined but not used [-Wunused-function] The warning is correct and the function is not used anywhere, so let's just remove it. Reported-by: Stephen Rothwell <[email protected]> Fixes: 7119f02b5d34 ("iwlwifi: mvm: support BIOS enable/disable for 11ax in Russia") Signed-off-by: Kalle Valo <[email protected]> Acked-by: Luca Coelho <[email protected]> Signed-off-by: Kalle Valo <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-24rtw88: fix c2h memory leakPo-Hao Huang3-1/+13
Fix erroneous code that leads to unreferenced objects. During H2C operations, some functions returned without freeing the memory that only the function have access to. Release these objects when they're no longer needed to avoid potentially memory leaks. Signed-off-by: Po-Hao Huang <[email protected]> Signed-off-by: Ping-Ke Shih <[email protected]> Signed-off-by: Kalle Valo <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2021-06-24brcmfmac: support parse country code map from DTShawn Guo1-2/+55
With any regulatory domain requests coming from either user space or 802.11 IE (Information Element), the country is coded in ISO3166 standard. It needs to be translated to firmware country code and revision with the mapping info in settings->country_codes table. Support populate country_codes table by parsing the mapping from DT. The BRCMF_BUSTYPE_SDIO bus_type check gets separated from general DT validation, so that country code can be handled as general part rather than SDIO bus specific one. Signed-off-by: Shawn Guo <[email protected]> Reviewed-by: Arend van Spriel <[email protected]> Signed-off-by: Kalle Valo <[email protected]> Link: https://lore.kernel.org/r/[email protected]