aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-03-31bnxt_en: Update firmware interface to 1.9.1.15.Michael Chan4-103/+210
Minor changes, such as new extended port statistics. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31vlan: vlan_hw_filter_capable() can be staticWei Yongjun1-1/+1
Fixes the following sparse warning: net/8021q/vlan_core.c:168:6: warning: symbol 'vlan_hw_filter_capable' was not declared. Should it be static? Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge tag 'mlx5-updates-2018-03-30' of ↵David S. Miller12-447/+409
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2018-03-30 This series contains updates to mlx5 core and mlx5e netdev drivers. The main highlight of this series is the RX optimizations for striding RQ path, introduced by Tariq. First Four patches are trivial misc cleanups. - Spelling mistake fix - Dead code removal - Warning messages RX optimizations for striding RQ: 1) RX refactoring, cleanups and micro optimizations - MTU calculation simplifications, obsoletes some WQEs-to-packets translation functions and helps delete ~60 LOC. - Do not busy-wait a pending UMR completion. - post the new values of UMR WQE inline, instead of using a data pointer. - use pre-initialized structures to save calculations in datapath. 2) Use linear SKB in Striding RQ "build_skb", (Using linear SKB has many advantages): - Saves a memcpy of the headers. - No page-boundary checks in datapath. - No filler CQEs. - Significantly smaller CQ. - SKB data continuously resides in linear part, and not split to small amount (linear part) and large amount (fragment). This saves datapath cycles in driver and improves utilization of SKB fragments in GRO. - The fragments of a resulting GRO SKB follow the IP forwarding assumption of equal-size fragments. implementation details: HW writes the packets to the beginning of a stride, i.e. does not keep headroom. To overcome this we make sure we can extend backwards and use the last bytes of stride i-1. Extra care is needed for stride 0 as it has no preceding stride. We make sure headroom bytes are available by shifting the buffer pointer passed to HW by headroom bytes. This configuration now becomes default, whenever capable. Of course, this implies turning LRO off. Performance testing: ConnectX-5, single core, single RX ring, default MTU. UDP packet rate, early drop in TC layer: -------------------------------------------- | pkt size | before | after | ratio | -------------------------------------------- | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x | | 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x | | 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x | -------------------------------------------- TCP streams: ~20% gain 3) Support XDP over Striding RQ: Now that linear SKB is supported over Striding RQ, we can support XDP by setting stride size to PAGE_SIZE and headroom to XDP_PACKET_HEADROOM. Striding RQ is capable of a higher packet-rate than conventional RQ. Performance testing: ConnectX-5, 24 rings, default MTU. CQE compression ON (to reduce completions BW in PCI). XDP_DROP packet rate: -------------------------------------------------- | pkt size | XDP rate | 100GbE linerate | pct% | -------------------------------------------------- | 64byte | 126.2 Mpps | 148.0 Mpps | 85% | | 128byte | 80.0 Mpps | 84.8 Mpps | 94% | | 256byte | 42.7 Mpps | 42.7 Mpps | 100% | | 512byte | 23.4 Mpps | 23.4 Mpps | 100% | -------------------------------------------------- 4) Remove mlx5 page_ref bulking in Striding RQ and use page_ref_inc only when needed. Without this bulking, we have: - no atomic ops on WQE allocation or free - one atomic op per SKB - In the default MTU configuration (1500, stride size is 2K), the non-bulking method execute 2 atomic ops as before - For larger MTUs with stride size of 4K, non-bulking method executes only a single op. - For XDP (stride size of 4K, no SKBs), non-bulking have no atomic ops per packet at all. Performance testing: ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Single core packet rate (64 bytes). Early drop in TC: no degradation. XDP_DROP: before: 14,270,188 pps after: 20,503,603 pps, 43% improvement. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge tag 'rxrpc-next-20180330' of ↵David S. Miller20-71/+509
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs David Howells says: ==================== rxrpc: Fixes and more traces Here are some patches that add some more tracepoints to AF_RXRPC and fix some issues therein: (1) Fix the use of VERSION packets to keep firewall routes open. (2) Fix the incorrect current time usage in a tracepoint. (3) Fix Tx ring annotation corruption. (4) Fix accidental conversion of call-level abort into connection-level abort. (5) Fix calculation of resend time. (6) Remove a couple of unused variables. (7) Fix a bunch of checker warnings and an error. Note that not all warnings can be quashed as checker doesn't seem to correctly handle seqlocks. (8) Fix a potential race between call destruction and socket/net destruction. (9) Add a tracepoint to track rxrpc_local refcounting. (10) Fix an apparent leak of rxrpc_local objects. (11) Add a tracepoint to track rxrpc_peer refcounting. (12) Fix a leak of rxrpc_peer objects. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31hv_netvsc: Clean up extra parameter from rndis_filter_receive_data()Haiyang Zhang1-7/+9
The variables, msg and data, have the same value. This patch removes the extra one. Signed-off-by: Haiyang Zhang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31ethernet: hisilicon: hns: hns_dsaf_mac: Use generic eth_broadcast_addrJoe Perches1-4/+2
Rather than use an on-stack array to copy a broadcast address, use the generic eth_broadcast_addr function to save a trivial amount of object code. Signed-off-by: Joe Perches <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge branch 'net_rwsem-fixes'David S. Miller5-8/+8
Kirill Tkhai says: ==================== net_rwsem fixes there is wext_netdev_notifier_call()->wireless_nlevent_flush() netdevice notifier, which takes net_rwsem, so we can't take net_rwsem in {,un}register_netdevice_notifier(). Since {,un}register_netdevice_notifier() is executed under pernet_ops_rwsem, net_namespace_list can't change, while we holding it, so there is no need net_rwsem in these functions [1/2]. The same is in [2/2]. We make callers of __rtnl_link_unregister() take pernet_ops_rwsem, and close the race with setup_net() and cleanup_net(), so __rtnl_link_unregister() does not need it. This also fixes the problem of that __rtnl_link_unregister() does not see initializing and exiting nets. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: Do not take net_rwsem in __rtnl_link_unregister()Kirill Tkhai4-3/+8
This function calls call_netdevice_notifier(), which also may take net_rwsem. So, we can't use net_rwsem here. This patch makes callers of this functions take pernet_ops_rwsem, like register_netdevice_notifier() does. This will protect the modifications of net_namespace_list, and allows notifiers to take it (they won't have to care about context). Since __rtnl_link_unregister() is used on module load and unload (which are not frequent operations), this looks for me better, than make all call_netdevice_notifier() always executing in "protected net_namespace_list" context. Also, this fixes the problem we had a deal in 328fbe747ad4 "Close race between {un, }register_netdevice_notifier and ...", and guarantees __rtnl_link_unregister() does not skip exitting net. Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: Remove net_rwsem from {, un}register_netdevice_notifier()Kirill Tkhai1-5/+0
These functions take net_rwsem, while wireless_nlevent_flush() also takes it. But down_read() can't be taken recursive, because of rw_semaphore design, which prevents it to be occupied by only readers forever. Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(), net list can't change, so these down_read()/up_read() can be removed. Fixes: f0b07bb151b0 "net: Introduce net_rwsem to protect net_namespace_list" Signed-off-by: Kirill Tkhai <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: hns3: remove unnecessary pci_set_drvdata() and devm_kfree()Wei Yongjun1-4/+0
There is no need for explicit calls of devm_kfree(), as the allocated memory will be freed during driver's detach. The driver core clears the driver data to NULL after device_release. Thus, it is not needed to manually clear the device driver data to NULL. So remove the unnecessary pci_set_drvdata() and devm_kfree(). Signed-off-by: Wei Yongjun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31netdevsim: Change nsim_devlink_setup to return error to callerDavid Ahern3-8/+15
Change nsim_devlink_setup to return any error back to the caller and update nsim_init to handle it. Requested-by: Jakub Kicinski <[email protected]> Signed-off-by: David Ahern <[email protected]> Acked-by: Jakub Kicinski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge branch 'tipc-slim-down-name-table'David S. Miller11-699/+556
Jon Maloy says: ==================== tipc: slim down name table We clean up and improve the name binding table: - Replace the memory consuming 'sub_sequence/service range' array with an RB tree. - Introduce support for overlapping service sequences/ranges v2: #1: Fixed a missing initialization reported by David Miller #4: Obsoleted and replaced a few more macros to get a consistent terminology in the API. #5: Added new commit to fix a potential string overflow bug (it is still only in net-next) reported by Arnd Bergmann ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31tipc: avoid possible string overflowJon Maloy2-2/+3
gcc points out that the combined length of the fixed-length inputs to l->name is larger than the destination buffer size: net/tipc/link.c: In function 'tipc_link_create': net/tipc/link.c:465:26: error: '%s' directive writing up to 32 bytes into a region of size between 26 and 58 [-Werror=format-overflow=] sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); net/tipc/link.c:465:2: note: 'sprintf' output 11 or more bytes (assuming 75) into a destination of size 60 sprintf(l->name, "%s:%s-%s:unknown", self_str, if_name, peer_str); A detailed analysis reveals that the theoretical maximum length of a link name is: max self_str + 1 + max if_name + 1 + max peer_str + 1 + max if_name = 16 + 1 + 15 + 1 + 16 + 1 + 15 = 65 Since we also need space for a trailing zero we now set MAX_LINK_NAME to 68. Just to be on the safe side we also replace the sprintf() call with snprintf(). Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash values") Reported-by: Arnd Bergmann <[email protected]> Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31tipc: tipc: rename address types in user apiJon Maloy1-24/+33
The three address type structs in the user API have names that in reality reflect the specific, non-Linux environment where they were originally created. We now give them more intuitive names, in accordance with how TIPC is described in the current documentation. struct tipc_portid -> struct tipc_socket_addr struct tipc_name -> struct tipc_service_addr struct tipc_name_seq -> struct tipc_service_range To avoid confusion, we also update some commmets and macro names to match the new terminology. For compatibility, we add macros that map all old names to the new ones. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31tipc: permit overlapping service ranges in name tableJon Maloy7-111/+60
With the new RB tree structure for service ranges it becomes possible to solve an old problem; - we can now allow overlapping service ranges in the table. When inserting a new service range to the tree, we use 'lower' as primary key, and when necessary 'upper' as secondary key. Since there may now be multiple service ranges matching an indicated 'lower' value, we must also add the 'upper' value to the functions used for removing publications, so that the correct, corresponding range item can be found. These changes guarantee that a well-formed publication/withdrawal item from a peer node never will be rejected, and make it possible to eliminate the problematic backlog functionality we currently have for handling such cases. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31tipc: refactor name table translate functionJon Maloy1-36/+25
The function tipc_nametbl_translate() function is ugly and hard to follow. This can be improved somewhat by introducing a stack variable for holding the publication list to be used and re-ordering the if- clauses for selection of algorithm. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31tipc: replace name table service range array with rb treeJon Maloy6-568/+477
The current design of the binding table has an unnecessary memory consuming and complex data structure. It aggregates the service range items into an array, which is expanded by a factor two every time it becomes too small to hold a new item. Furthermore, the arrays never shrink when the number of ranges diminishes. We now replace this array with an RB tree that is holding the range items as tree nodes, each range directly holding a list of bindings. This, along with a few name changes, improves both readability and volume of the code, as well as reducing memory consumption and hopefully improving cache hit rate. Signed-off-by: Jon Maloy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge branch 'bridge-mtu'David S. Miller4-33/+25
Nikolay Aleksandrov says: ==================== net: bridge: MTU handling changes As previously discussed the recent changes break some setups and could lead to packet drops. Thus the first patch reverts the behaviour for the bridge to follow the minimum MTU but also keeps the ability to set the MTU to the maximum (out of all ports) if vlan filtering is enabled. Patch 02 is the bigger change in behaviour - we've always had trouble when configuring bridges and their MTU which is auto tuning on port events (add/del/changemtu), which means config software needs to chase it and fix it after each such event, after patch 02 we allow the user to configure any MTU (ETH_MIN/MAX limited) but once that is done the bridge stops auto tuning and relies on the user to keep the MTU correct. This should be compatible with cases that don't touch the MTU (or set it to the same value), while allowing to configure the MTU and not worry about it changing afterwards. The patches are intentionally split like this, so that if they get accepted and there are any complaints patch 02 can be reverted. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: bridge: disable bridge MTU auto tuning if it was set manuallyNikolay Aleksandrov4-20/+26
As Roopa noted today the biggest source of problems when configuring bridge and ports is that the bridge MTU keeps changing automatically on port events (add/del/changemtu). That leads to inconsistent behaviour and network config software needs to chase the MTU and fix it on each such event. Let's improve on that situation and allow for the user to set any MTU within ETH_MIN/MAX limits, but once manually configured it is the user's responsibility to keep it correct afterwards. In case the MTU isn't manually set - the behaviour reverts to the previous and the bridge follows the minimum MTU. Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: bridge: set min MTU on port events and allow user to set maxNikolay Aleksandrov4-32/+18
Recently the bridge was changed to automatically set maximum MTU on port events (add/del/changemtu) when vlan filtering is enabled, but that actually changes behaviour in a way which breaks some setups and can lead to packet drops. In order to still allow that maximum to be set while being compatible, we add the ability for the user to tune the bridge MTU up to the maximum when vlan filtering is enabled, but that has to be done explicitly and all port events (add/del/changemtu) lead to resetting that MTU to the minimum as before. Suggested-by: Roopa Prabhu <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge branch 'thunderx-DMAC-filtering'David S. Miller5-30/+374
Vadim Lomovtsev says: ==================== net: thunderx: implement DMAC filtering support By default CN88XX BGX accepts all incoming multicast and broadcast packets and filtering is disabled. The nic driver doesn't provide an ability to change such behaviour. This series is to implement DMAC filtering management for CN88XX nic driver allowing user to enable/disable filtering and configure specific MAC addresses to filter traffic. Changes from v1: build issues: - update code in order to address compiler warnings; checkpatch.pl reported issues: - update code in order to fit 80 symbols length; - update commit descriptions in order to fit 80 symbols length; ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add ndo_set_rx_mode callback implementation for VFVadim Lomovtsev1-1/+109
The ndo_set_rx_mode() is called from atomic context which causes messages response timeouts while VF to PF communication via MSIx. To get rid of that we're copy passed mc list, parse flags and queue handling of kernel request to ordered workqueue. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add workqueue control structures for handle ndo_set_rx_mode ↵Vadim Lomovtsev1-0/+17
request The kernel calls ndo_set_rx_mode() callback from atomic context which causes messaging timeouts between VF and PF (as they’re implemented via MSIx). So in order to handle ndo_set_rx_mode() we need to get rid of it. This commit implements necessary workqueue related structures to let VF queue kernel request processing in non-atomic context later. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add XCAST messages handlers for PFVadim Lomovtsev1-4/+41
This commit is to add message handling for ndo_set_rx_mode() callback at PF side. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add new messages for handle ndo_set_rx_mode callbackVadim Lomovtsev1-0/+12
The kernel calls ndo_set_rx_mode() callback supplying it will all necessary info, such as device state flags, multicast mac addresses list and so on. Since we have only 128 bits to communicate with PF we need to initiate several requests to PF with small/short operation each based on input data. So this commit implements following PF messages codes along with new data structures for them: NIC_MBOX_MSG_RESET_XCAST to flush all filters configured for this particular network interface (VF) NIC_MBOX_MSG_ADD_MCAST to add new MAC address to DMAC filter registers for this particular network interface (VF) NIC_MBOX_MSG_SET_XCAST to apply filtering configuration to filter control register Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add multicast filter management supportVadim Lomovtsev2-1/+153
The ThunderX NIC could be partitioned to up to 128 VFs and thus represented to system. Each VF is mapped to pair BGX:LMAC, and each of VF is configured by kernel individually. Eventually the bunch of VFs could be mapped onto same pair BGX:LMAC and thus could cause several multicast filtering configuration requests to LMAC with the same MAC addresses. This commit is to add ThunderX NIC BGX filtering manipulation routines. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: add MAC address filter tracking for LMACVadim Lomovtsev1-14/+30
The ThunderX NIC has two Ethernet Interfaces (BGX) each of them could has up to four Logical MACs configured. Each of BGX has 32 filters to be configured for filtering ingress packets. The number of filters available to particular LMAC is from 8 (if we have four LMACs configured per BGX) up to 32 (in case of only one LMAC is configured per BGX). At the same time the NIC could present up to 128 VFs to OS as network interfaces, each of them kernel will configure with set of MAC addresses for filtering. So to prevent dupes in BGX filter registers from different network interfaces it is required to cache and track all filter configuration requests prior to applying them onto BGX filter registers. This commit is to update LMAC structures with control fields to allocate/releasing filters tracking list along with implementing dmac array allocate/release per LMAC. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: thunderx: move filter register related macro into proper placeVadim Lomovtsev2-11/+13
The ThunderX NIC has set of registers which allows to configure filter policy for ingress packets. There are three possible regimes of filtering multicasts, broadcasts and unicasts: accept all, reject all and accept filter allowed only. Current implementation has enum with all of them and two generic macro for enabling filtering et all (CAM_ACCEPT) and enabling/disabling broadcast packets, which also should be corrected in order to represent register bits properly. All these values are private for driver and there is no need to ‘publish’ them via header file. This commit is to move filtering register manipulation values from header file into source with explicit assignment of exact register values to them to be used while register configuring. Signed-off-by: Vadim Lomovtsev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge branch 'meson8b'David S. Miller2-4/+6
Martin Blumenstingl says: ==================== Meson8m2 support for dwmac-meson8b The Meson8m2 SoC is an updated version of the Meson8 SoC. Some of the peripherals are shared with Meson8b (for example the watchdog registers and the internal temperature sensor calibration procedure). Meson8m2 also seems to include the same Gigabit MAC register layout as Meson8b. The registers in the Amlogic dwmac "glue" seem identical between Meson8b and Meson8m2. Manual testing seems to confirm this. To be extra-safe a new compatible string is added because there's no (public) documentation on the Meson8m2 SoC. This will allow us to implement any SoC-specific variations later on (if needed). ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31net: stmmac: dwmac-meson8b: Add support for the Meson8m2 SoCMartin Blumenstingl1-2/+3
The Meson8m2 SoC uses a similar (potentially even identical) register layout as the Meson8b and GXBB SoCs for the dwmac glue. Add a new compatible string and update the module description to indicate support for these SoCs. Signed-off-by: Martin Blumenstingl <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31dt-bindings: net: meson-dwmac: add support for the Meson8m2 SoCMartin Blumenstingl1-2/+3
The Meson8m2 SoC uses a similar (potentially even identical) register layout for the dwmac glue as Meson8b and GXBB. Unfortunately there is no documentation available. Testing shows that both, RMII and RGMII PHYs are working if they are configured as on Meson8b. Add a new compatible string to the documentation so differences (if there are any) between Meson8m2 and the other SoCs can be taken care of within the driver. Signed-off-by: Martin Blumenstingl <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-30Merge tag 'kbuild-fixes-v4.16-3' of ↵Linus Torvalds5-5/+12
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - fix missed rebuild of TRIM_UNUSED_KSYMS - fix rpm-pkg for GNU tar >= 1.29 - include scripts/dtc/include-prefixes/* to kernel header deb-pkg - add -no-integrated-as option ealier to fix building with Clang - fix netfilter Makefile for parallel building * tag 'kbuild-fixes-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: netfilter: nf_nat_snmp_basic: add correct dependency to Makefile kbuild: rpm-pkg: Support GNU tar >= 1.29 builddeb: Fix header package regarding dtc source links kbuild: set no-integrated-as before incl. arch Makefile kbuild: make scripts/adjust_autoksyms.sh robust against timestamp races
2018-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds76-384/+793
Pull networking fixes from David Miller: 1) Fix RCU locking in xfrm_local_error(), from Taehee Yoo. 2) Fix return value assignments and thus error checking in iwl_mvm_start_ap_ibss(), from Johannes Berg. 3) Don't count header length twice in vti4, from Stefano Brivio. 4) Fix deadlock in rt6_age_examine_exception, from Eric Dumazet. 5) Fix out-of-bounds access in nf_sk_lookup_slow{v4,v6}() from Subash Abhinov. 6) Check nladdr size in netlink_connect(), from Alexander Potapenko. 7) VF representor SQ numbers are 32 not 16 bits, in mlx5 driver, from Or Gerlitz. 8) Out of bounds read in skb_network_protocol(), from Eric Dumazet. 9) r8169 driver sets driver data pointer after register_netdev() which is too late. Fix from Heiner Kallweit. 10) Fix memory leak in mlx4 driver, from Moshe Shemesh. 11) The multi-VLAN decap fix added a regression when dealing with device that lack a MAC header, such as tun. Fix from Toshiaki Makita. 12) Fix integer overflow in dynamic interrupt coalescing code. From Tal Gilboa. 13) Use after free in vrf code, from David Ahern. 14) IPV6 route leak between VRFs fix, also from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (81 commits) net: mvneta: fix enable of all initialized RXQs net/ipv6: Fix route leaking between VRFs vrf: Fix use after free and double free in vrf_finish_output ipv6: sr: fix seg6 encap performances with TSO enabled net/dim: Fix int overflow vlan: Fix vlan insertion for packets without ethernet header net: Fix untag for vlan packets without ethernet header atm: iphase: fix spelling mistake: "Receiverd" -> "Received" vhost: validate log when IOTLB is enabled qede: Do not drop rx-checksum invalidated packets. hv_netvsc: enable multicast if necessary ip_tunnel: Resolve ipsec merge conflict properly. lan78xx: Crash in lan78xx_writ_reg (Workqueue: events lan78xx_deferred_multicast_write) qede: Fix barrier usage after tx doorbell write. vhost: correctly remove wait queue during poll failure net/mlx4_core: Fix memory leak while delete slave's resources net/mlx4_en: Fix mixed PFC and Global pause user control requests net/smc: use announced length in sock_recvmsg() llc: properly handle dev_queue_xmit() return value strparser: Fix sign of err codes ...
2018-03-31Merge branch 'bpf-cgroup-bind-connect'Daniel Borkmann33-132/+2314
Andrey Ignatov says: ==================== v2->v3: - rebase due to conflicts - fix ipv6=m build v1->v2: - support expected_attach_type at prog load time so that prog (incl. context accesses and calls to helpers) can be validated with regard to specific attach point it is supposed to be attached to. Later, at attach time, attach type is checked so that it must be same as at load time if it was provided - reworked hooks to rely on expected_attach_type, and reduced number of new prog types from 6 to just 1: BPF_PROG_TYPE_CGROUP_SOCK_ADDR - reused BPF_PROG_TYPE_CGROUP_SOCK for sys_bind post-hooks - add selftests for post-sys_bind hook For our container management we've been using complicated and fragile setup consisting of LD_PRELOAD wrapper intercepting bind and connect calls from all containerized applications. Unfortunately it doesn't work for apps that don't use glibc and changing all applications that run in the datacenter is not possible due to 3rd party code and libraries (despite being open source code) and sheer amount of legacy code that has to be rewritten (we're rewriting what we can in parallel) These applications are written without containers in mind and have builtin assumptions about network services. Like an application X expects to connect localhost:special_port and find service Y in there. To move application X and service Y into two different containers LD_PRELOAD approach is used to help one service connect to another without rewriting them. Moving these two applications into different L2 (netns) or L3 (vrf) network isolation scopes doesn't help to solve the problem, since applications need to see each other like they were running on the host without containers. So if app X and app Y would run in different netns something would need to punch a connectivity hole in those namespaces. That would be real layering violation (with corresponding network debugging pains), since clean l2, l3 abstraction would suddenly support something that breaks through the layers. Instead we used LD_PRELOAD (and now bpf programs) at bind/connect time to help applications discover and connect to each other. All applications are running in init_nens and there are no vrfs. After bind/connect the normal fib/neighbor core networking logic works as it should always do and the whole system is clean from network point of view and can be debugged with standard tools. We also considered resurrecting Hannes's afnetns work, but all hierarchical namespace abstraction don't work due to these builtin networking assumptions inside the apps. To run an application inside cgroup container that was not written with containers in mind we have to make an illusion of running in non-containerized environment. In some cases we remember the port and container id in the post-bind hook in a bpf map and when some other task in a different container is trying to connect to a service we need to know where this service is running. It can be remote and can be local. Both client and service may or may not be written with containers in mind and this sockaddr rewrite is providing connectivity and load balancing feature. BPF+cgroup looks to be the best solution for this problem. Hence we introduce 3 hooks: - at entry into sys_bind and sys_connect to let bpf prog look and modify 'struct sockaddr' provided by user space and fail bind/connect when appropriate - post sys_bind after port is allocated The approach works great and has zero overhead for anyone who doesn't use it and very low overhead when deployed. Different use case for this feature is to do low overhead firewall that doesn't need to inspect all packets and works at bind/connect time. ==================== Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31selftests/bpf: Selftest for sys_bind post-hooks.Andrey Ignatov3-1/+493
Add selftest for attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND`. The main things tested are: * prog load behaves as expected (valid/invalid accesses in prog); * prog attach behaves as expected (load- vs attach-time attach types); * `BPF_CGROUP_INET_SOCK_CREATE` can be attached in a backward compatible way; * post-hooks return expected result and errno. Example: # ./test_sock Test case: bind4 load with invalid access: src_ip6 .. [PASS] Test case: bind4 load with invalid access: mark .. [PASS] Test case: bind6 load with invalid access: src_ip4 .. [PASS] Test case: sock_create load with invalid access: src_port .. [PASS] Test case: sock_create load w/o expected_attach_type (compat mode) .. [PASS] Test case: sock_create load w/ expected_attach_type .. [PASS] Test case: attach type mismatch bind4 vs bind6 .. [PASS] Test case: attach type mismatch bind6 vs bind4 .. [PASS] Test case: attach type mismatch default vs bind4 .. [PASS] Test case: attach type mismatch bind6 vs sock_create .. [PASS] Test case: bind4 reject all .. [PASS] Test case: bind6 reject all .. [PASS] Test case: bind6 deny specific IP & port .. [PASS] Test case: bind4 allow specific IP & port .. [PASS] Test case: bind4 allow all .. [PASS] Test case: bind6 allow all .. [PASS] Summary: 16 PASSED, 0 FAILED Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31bpf: Post-hooks for sys_bindAndrey Ignatov6-30/+195
"Post-hooks" are hooks that are called right before returning from sys_bind. At this time IP and port are already allocated and no further changes to `struct sock` can happen before returning from sys_bind but BPF program has a chance to inspect the socket and change sys_bind result. Specifically it can e.g. inspect what port was allocated and if it doesn't satisfy some policy, BPF program can force sys_bind to fail and return EPERM to user. Another example of usage is recording the IP:port pair to some map to use it in later calls to sys_connect. E.g. if some TCP server inside cgroup was bound to some IP:port_n, it can be recorded to a map. And later when some TCP client inside same cgroup is trying to connect to 127.0.0.1:port_n, BPF hook for sys_connect can override the destination and connect application to IP:port_n instead of 127.0.0.1:port_n. That helps forcing all applications inside a cgroup to use desired IP and not break those applications if they e.g. use localhost to communicate between each other. == Implementation details == Post-hooks are implemented as two new attach types `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`. Separate attach types for IPv4 and IPv6 are introduced to avoid access to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from `inet6_bind()` since those fields might not make sense in such cases. Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31selftests/bpf: Selftest for sys_connect hooksAndrey Ignatov8-4/+284
Add selftest for BPF_CGROUP_INET4_CONNECT and BPF_CGROUP_INET6_CONNECT attach types. Try to connect(2) to specified IP:port and test that: * remote IP:port pair is overridden; * local end of connection is bound to specified IP. All combinations of IPv4/IPv6 and TCP/UDP are tested. Example: # tcpdump -pn -i lo -w connect.pcap 2>/dev/null & [1] 478 # strace -qqf -e connect -o connect.trace ./test_sock_addr.sh Wait for testing IPv4/IPv6 to become available ... OK Load bind4 with invalid type (can pollute stderr) ... REJECTED Load bind4 with valid type ... OK Attach bind4 with invalid type ... REJECTED Attach bind4 with valid type ... OK Load connect4 with invalid type (can pollute stderr) libbpf: load bpf \ program failed: Permission denied libbpf: -- BEGIN DUMP LOG --- libbpf: 0: (b7) r2 = 23569 1: (63) *(u32 *)(r1 +24) = r2 2: (b7) r2 = 16777343 3: (63) *(u32 *)(r1 +4) = r2 invalid bpf_context access off=4 size=4 [ 1518.404609] random: crng init done libbpf: -- END LOG -- libbpf: failed to load program 'cgroup/connect4' libbpf: failed to load object './connect4_prog.o' ... REJECTED Load connect4 with valid type ... OK Attach connect4 with invalid type ... REJECTED Attach connect4 with valid type ... OK Test case #1 (IPv4/TCP): Requested: bind(192.168.1.254, 4040) .. Actual: bind(127.0.0.1, 4444) Requested: connect(192.168.1.254, 4040) from (*, *) .. Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56068) Test case #2 (IPv4/UDP): Requested: bind(192.168.1.254, 4040) .. Actual: bind(127.0.0.1, 4444) Requested: connect(192.168.1.254, 4040) from (*, *) .. Actual: connect(127.0.0.1, 4444) from (127.0.0.4, 56447) Load bind6 with invalid type (can pollute stderr) ... REJECTED Load bind6 with valid type ... OK Attach bind6 with invalid type ... REJECTED Attach bind6 with valid type ... OK Load connect6 with invalid type (can pollute stderr) libbpf: load bpf \ program failed: Permission denied libbpf: -- BEGIN DUMP LOG --- libbpf: 0: (b7) r6 = 0 1: (63) *(u32 *)(r1 +12) = r6 invalid bpf_context access off=12 size=4 libbpf: -- END LOG -- libbpf: failed to load program 'cgroup/connect6' libbpf: failed to load object './connect6_prog.o' ... REJECTED Load connect6 with valid type ... OK Attach connect6 with invalid type ... REJECTED Attach connect6 with valid type ... OK Test case #3 (IPv6/TCP): Requested: bind(face:b00c:1234:5678::abcd, 6060) .. Actual: bind(::1, 6666) Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) Actual: connect(::1, 6666) from (::6, 37458) Test case #4 (IPv6/UDP): Requested: bind(face:b00c:1234:5678::abcd, 6060) .. Actual: bind(::1, 6666) Requested: connect(face:b00c:1234:5678::abcd, 6060) from (*, *) Actual: connect(::1, 6666) from (::6, 39315) ### SUCCESS # egrep 'connect\(.*AF_INET' connect.trace | \ > egrep -vw 'htons\(1025\)' | fold -b -s -w 72 502 connect(7, {sa_family=AF_INET, sin_port=htons(4040), sin_addr=inet_addr("192.168.1.254")}, 128) = 0 502 connect(8, {sa_family=AF_INET, sin_port=htons(4040), sin_addr=inet_addr("192.168.1.254")}, 128) = 0 502 connect(9, {sa_family=AF_INET6, sin6_port=htons(6060), inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0 502 connect(10, {sa_family=AF_INET6, sin6_port=htons(6060), inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 128) = 0 # fg tcpdump -pn -i lo -w connect.pcap 2> /dev/null # tcpdump -r connect.pcap -n tcp | cut -c 1-72 reading from file connect.pcap, link-type EN10MB (Ethernet) 17:57:40.383533 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [S], seq 1333 17:57:40.383566 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [S.], seq 112 17:57:40.383589 IP 127.0.0.4.56068 > 127.0.0.1.4444: Flags [.], ack 1, w 17:57:40.384578 IP 127.0.0.1.4444 > 127.0.0.4.56068: Flags [R.], seq 1, 17:57:40.403327 IP6 ::6.37458 > ::1.6666: Flags [S], seq 406513443, win 17:57:40.403357 IP6 ::1.6666 > ::6.37458: Flags [S.], seq 2448389240, ac 17:57:40.403376 IP6 ::6.37458 > ::1.6666: Flags [.], ack 1, win 342, opt 17:57:40.404263 IP6 ::1.6666 > ::6.37458: Flags [R.], seq 1, ack 1, win Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31bpf: Hooks for sys_connectAndrey Ignatov13-1/+202
== The problem == See description of the problem in the initial patch of this patch set. == The solution == The patch provides much more reliable in-kernel solution for the 2nd part of the problem: making outgoing connecttion from desired IP. It adds new attach types `BPF_CGROUP_INET4_CONNECT` and `BPF_CGROUP_INET6_CONNECT` for program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both source and destination of a connection at connect(2) time. Local end of connection can be bound to desired IP using newly introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though, and doesn't support binding to port, i.e. leverages `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this: * looking for a free port is expensive and can affect performance significantly; * there is no use-case for port. As for remote end (`struct sockaddr *` passed by user), both parts of it can be overridden, remote IP and remote port. It's useful if an application inside cgroup wants to connect to another application inside same cgroup or to itself, but knows nothing about IP assigned to the cgroup. Support is added for IPv4 and IPv6, for TCP and UDP. IPv4 and IPv6 have separate attach types for same reason as sys_bind hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. == Implementation notes == The patch introduces new field in `struct proto`: `pre_connect` that is a pointer to a function with same signature as `connect` but is called before it. The reason is in some cases BPF hooks should be called way before control is passed to `sk->sk_prot->connect`. Specifically `inet_dgram_connect` autobinds socket before calling `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since it'd cause double-bind. On the other hand `proto.pre_connect` provides a flexible way to add BPF hooks for connect only for necessary `proto` and call them at desired time before `connect`. Since `bpf_bind()` is allowed to bind only to IP and autobind in `inet_dgram_connect` binds only port there is no chance of double-bind. bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite of value of `bind_address_no_port` socket field. bpf_bind() sets `with_lock` to `false` when calling to __inet_bind() and __inet6_bind() since all call-sites, where bpf_bind() is called, already hold socket lock. Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31net: Introduce __inet_bind() and __inet6_bindAndrey Ignatov4-28/+52
Refactor `bind()` code to make it ready to be called from BPF helper function `bpf_bind()` (will be added soon). Implementation of `inet_bind()` and `inet6_bind()` is separated into `__inet_bind()` and `__inet6_bind()` correspondingly. These function can be used from both `sk_prot->bind` and `bpf_bind()` contexts. New functions have two additional arguments. `force_bind_address_no_port` forces binding to IP only w/o checking `inet_sock.bind_address_no_port` field. It'll allow to bind local end of a connection to desired IP in `bpf_bind()` w/o changing `bind_address_no_port` field of a socket. It's useful since `bpf_bind()` can return an error and we'd need to restore original value of `bind_address_no_port` in that case if we changed this before calling to the helper. `with_lock` specifies whether to lock socket when working with `struct sk` or not. The argument is set to `true` for `sk_prot->bind`, i.e. old behavior is preserved. But it will be set to `false` for `bpf_bind()` use-case. The reason is all call-sites, where `bpf_bind()` will be called, already hold that socket lock. Signed-off-by: Andrey Ignatov <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31selftests/bpf: Selftest for sys_bind hooksAndrey Ignatov4-1/+517
Add selftest to work with bpf_sock_addr context from `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` programs. Try to bind(2) on IP:port and apply: * loads to make sure context can be read correctly, including narrow loads (byte, half) for IP and full-size loads (word) for all fields; * stores to those fields allowed by verifier. All combination from IPv4/IPv6 and TCP/UDP are tested. Both scenarios are tested: * valid programs can be loaded and attached; * invalid programs can be neither loaded nor attached. Test passes when expected data can be read from context in the BPF-program, and after the call to bind(2) socket is bound to IP:port pair that was written by BPF-program to the context. Example: # ./test_sock_addr Attached bind4 program. Test case #1 (IPv4/TCP): Requested: bind(192.168.1.254, 4040) .. Actual: bind(127.0.0.1, 4444) Test case #2 (IPv4/UDP): Requested: bind(192.168.1.254, 4040) .. Actual: bind(127.0.0.1, 4444) Attached bind6 program. Test case #3 (IPv6/TCP): Requested: bind(face:b00c:1234:5678::abcd, 6060) .. Actual: bind(::1, 6666) Test case #4 (IPv6/UDP): Requested: bind(face:b00c:1234:5678::abcd, 6060) .. Actual: bind(::1, 6666) ### SUCCESS Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31bpf: Hooks for sys_bindAndrey Ignatov10-8/+366
== The problem == There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible. Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever TCP/UDP server calls `bind(2)`, the library replaces IP in sockaddr before passing arguments to syscall. When application calls `connect(2)` the library transparently binds the local end of connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty). Shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept. == The solution == The patch provides much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on desired IP. It does not depend on application environment and implementation details (whether glibc is used or not). It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`). The new program type is intended to be used with sockets (`struct sock`) in a cgroup and provided by user `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of newly added types. The new attach types provides hooks in `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override IP addresses and ports user program tries to bind to and apply such a program for whole cgroup. == Implementation notes == [1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for corresponding socket family. E.g. if user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields. [2] The write access to `struct bpf_sock_addr_kern` is implemented using special field as an additional "register". There are just two registers in `sock_addr_convert_ctx_access`: `src` with value to write and `dst` with pointer to context that can't be changed not to break later instructions. But the fields, allowed to write to, are not available directly and to access them address of corresponding pointer has to be loaded first. To get additional register the 1st not used by `src` and `dst` one is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load address of pointer field, and finally the register's content is restored from the temporary field after writing `src` value. Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31libbpf: Support expected_attach_type at prog loadAndrey Ignatov5-46/+133
Support setting `expected_attach_type` at prog load time in both `bpf/bpf.h` and `bpf/libbpf.h`. Since both headers already have API to load programs, new functions are added not to break backward compatibility for existing ones: * `bpf_load_program_xattr()` is added to `bpf/bpf.h`; * `bpf_prog_load_xattr()` is added to `bpf/libbpf.h`. Both new functions accept structures, `struct bpf_load_program_attr` and `struct bpf_prog_load_attr` correspondingly, where new fields can be added in the future w/o changing the API. Standard `_xattr` suffix is used to name the new API functions. Since `bpf_load_program_name()` is not used as heavily as `bpf_load_program()`, it was removed in favor of more generic `bpf_load_program_xattr()`. Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-31bpf: Check attach type at prog load timeAndrey Ignatov8-29/+88
== The problem == There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers. E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack. Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type. == The solution == Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly. Initially the idea was discussed by Alexei Starovoitov <[email protected]> and Daniel Borkmann <[email protected]> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2 Signed-off-by: Andrey Ignatov <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2018-03-30net/mlx5e: RX, Recycle buffer of UMR WQEsTariq Toukan1-2/+9
Upon a new UMR post, check if the WQE buffer contains a previous UMR WQE. If so, modify the dynamic fields instead of a whole WQE overwrite. This saves a memcpy. In current setting, after 2 WQ cycles (12 UMR posts), this will always be the case. No degradation sensed. Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Keep single pre-initialized UMR WQE per RQTariq Toukan3-19/+13
All UMR WQEs of an RQ share many common fields. We use pre-initialized structures to save calculations in datapath. One field (xlt_offset) was the only reason we saved a pre-initialized copy per WQE index. Here we remove its initialization (move its calculation to datapath), and reduce the number of copies to one-per-RQ. A very small datapath calculation is added, it occurs once per a MPWQE (i.e. once every 256KB), but reduces memory consumption and gives better cache utilization. Performance testing: Tested packet rate, no degradation sensed. Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Remove page_ref bulking in Striding RQTariq Toukan2-32/+16
When many packets reside on the same page, the bulking of page_ref modifications reduces the total number of atomic operations executed. Besides the necessary 2 operations on page alloc/free, we have the following extra ops per page: - one on WQE allocation (bump refcnt to maximum possible), - zero ops for SKBs, - one on WQE free, a constant of two operations in total, no matter how many packets/SKBs actually populate the page. Without this bulking, we have: - no ops on WQE allocation or free, - one op per SKB, Comparing the two methods when PAGE_SIZE is 4K: - As mentioned above, bulking method always executes 2 operations, not more, but not less. - In the default MTU configuration (1500, stride size is 2K), the non-bulking method execute 2 ops as well. - For larger MTUs with stride size of 4K, non-bulking method executes only a single op. - For XDP (stride size of 4K, no SKBs), non-bulking method executes no ops at all! Hence, to optimize the flows with linear SKB and XDP over Striding RQ, we here remove the page_ref bulking method. Performance testing: ConnectX-5, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. Single core packet rate (64 bytes). Early drop in TC: no degradation. XDP_DROP: before: 14,270,188 pps after: 20,503,603 pps, 43% improvement. Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Support XDP over Striding RQTariq Toukan3-8/+26
Add XDP support over Striding RQ. Now that linear SKB is supported over Striding RQ, we can support XDP by setting stride size to PAGE_SIZE and headroom to XDP_PACKET_HEADROOM. Upon a MPWQE free, do not release pages that are being XDP xmit, they will be released upon completions. Striding RQ is capable of a higher packet-rate than conventional RQ. A performance gain is expected for all cases that had a HW packet-rate bottleneck. This is the case whenever using many flows that distribute to many cores. Performance testing: ConnectX-5, 24 rings, default MTU. CQE compression ON (to reduce completions BW in PCI). XDP_DROP packet rate: -------------------------------------------------- | pkt size | XDP rate | 100GbE linerate | pct% | -------------------------------------------------- | 64byte | 126.2 Mpps | 148.0 Mpps | 85% | | 128byte | 80.0 Mpps | 84.8 Mpps | 94% | | 256byte | 42.7 Mpps | 42.7 Mpps | 100% | | 512byte | 23.4 Mpps | 23.4 Mpps | 100% | -------------------------------------------------- Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Refactor RQ XDP_TX indicationTariq Toukan2-6/+8
Make the xdp_xmit indication available for Striding RQ by taking it out of the type-specific union. This refactor is a preparation for a downstream patch that adds XDP support over Striding RQ. In addition, use a bitmap instead of a boolean for possible future flags. Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Use linear SKB in Striding RQTariq Toukan5-45/+153
Current Striding RQ HW feature utilizes the RX buffers so that there is no wasted room between the strides. This maximises the memory utilization. This prevents the use of build_skb() (which requires headroom and tailroom), and demands to memcpy the packets headers into the skb linear part. In this patch, whenever a set of conditions holds, we apply an RQ configuration that allows combining the use of linear SKB on top of a Striding RQ. To use build_skb() with Striding RQ, the following must hold: 1. packet does not cross a page boundary. 2. there is enough headroom and tailroom surrounding the packet. We can satisfy 1 and 2 by configuring: stride size = MTU + headroom + tailoom. This is possible only when: a. (MTU - headroom - tailoom) does not exceed PAGE_SIZE. b. HW LRO is turned off. Using linear SKB has many advantages: - Saves a memcpy of the headers. - No page-boundary checks in datapath. - No filler CQEs. - Significantly smaller CQ. - SKB data continuously resides in linear part, and not split to small amount (linear part) and large amount (fragment). This saves datapath cycles in driver and improves utilization of SKB fragments in GRO. - The fragments of a resulting GRO SKB follow the IP forwarding assumption of equal-size fragments. Some implementation details: HW writes the packets to the beginning of a stride, i.e. does not keep headroom. To overcome this we make sure we can extend backwards and use the last bytes of stride i-1. Extra care is needed for stride 0 as it has no preceding stride. We make sure headroom bytes are available by shifting the buffer pointer passed to HW by headroom bytes. This configuration now becomes default, whenever capable. Of course, this implies turning LRO off. Performance testing: ConnectX-5, single core, single RX ring, default MTU. UDP packet rate, early drop in TC layer: -------------------------------------------- | pkt size | before | after | ratio | -------------------------------------------- | 1500byte | 4.65 Mpps | 5.96 Mpps | 1.28x | | 500byte | 5.23 Mpps | 5.97 Mpps | 1.14x | | 64byte | 5.94 Mpps | 5.96 Mpps | 1.00x | -------------------------------------------- TCP streams: ~20% gain Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>
2018-03-30net/mlx5e: Use inline MTTs in UMR WQEsTariq Toukan3-88/+38
When modifying the page mapping of a HW memory region (via a UMR post), post the new values inlined in WQE, instead of using a data pointer. This is a micro-optimization, inline UMR WQEs of different rings scale better in HW. In addition, this obsoletes a few control flows and helps delete ~50 LOC. Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Saeed Mahameed <[email protected]>