path: root/net
Age | Commit message | Author | Files | Lines
2021-08-14 | net: bridge: vlan: dump mcast ctx querier state | Nikolay Aleksandrov | 1 | -1/+4
Use the new mcast querier state dump infrastructure and export vlans' mcast context querier state embedded in attribute BRIDGE_VLANDB_GOPTS_MCAST_QUERIER_STATE. Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | net: bridge: mcast: dump ipv6 querier state | Nikolay Aleksandrov | 1 | -4/+32
Add support for dumping global IPv6 querier state. We dump the state only if our own querier is enabled or another external querier has won the election. For the bridge global state we use a new attribute, IFLA_BR_MCAST_QUERIER_STATE, and embed the state inside. The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
 `[BRIDGE_QUERIER_IPV6_ADDRESS] - ip address of the querier
 `[BRIDGE_QUERIER_IPV6_PORT] - bridge port ifindex where the querier was seen (set only if external querier)
 `[BRIDGE_QUERIER_IPV6_OTHER_TIMER] - other querier timeout
IPv4 and IPv6 attributes are embedded at the same level of IFLA_BR_MCAST_QUERIER_STATE. If we didn't dump anything we cancel the nest and return.
Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
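The nesting described above can be modelled roughly as follows. This is a minimal sketch in Python, not the kernel code: the helper names `querier_attrs` and `dump_querier_state` are hypothetical, dictionaries stand in for netlink attributes, and it assumes that the timer attribute is emitted only for an external querier.

```python
from typing import Optional


def querier_attrs(prefix: str, own_enabled: bool, own_addr: str,
                  other: Optional[dict]) -> dict:
    """Attributes for one address family; empty when there is nothing to report.

    `other` describes an external querier that won the election, or None.
    """
    attrs = {}
    if other is not None:
        attrs[f"{prefix}_ADDRESS"] = other["addr"]
        # The port is set only for an external querier.
        attrs[f"{prefix}_PORT"] = other["port_ifindex"]
        # Assumption: the timer is reported only for an external querier.
        attrs[f"{prefix}_OTHER_TIMER"] = other["timer"]
    elif own_enabled:
        attrs[f"{prefix}_ADDRESS"] = own_addr
    return attrs


def dump_querier_state(ipv4: dict, ipv6: dict) -> Optional[dict]:
    """Embed both families at the same level; cancel the nest if it stays empty."""
    nest = {**ipv4, **ipv6}
    if not nest:
        return None  # models cancelling the nest and returning
    return {"IFLA_BR_MCAST_QUERIER_STATE": nest}
```

For example, an enabled own IPv4 querier plus an external IPv6 querier yields one nest holding a single IPv4 address attribute and three IPv6 attributes, while two empty families yield no nest at all.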
2021-08-14 | net: bridge: mcast: dump ipv4 querier state | Nikolay Aleksandrov | 3 | -1/+81
Add support for dumping global IPv4 querier state. We dump the state only if our own querier is enabled or another external querier has won the election. For the bridge global state we use a new attribute, IFLA_BR_MCAST_QUERIER_STATE, and embed the state inside. The structure is:
[IFLA_BR_MCAST_QUERIER_STATE]
 `[BRIDGE_QUERIER_IP_ADDRESS] - ip address of the querier
 `[BRIDGE_QUERIER_IP_PORT] - bridge port ifindex where the querier was seen (set only if external querier)
 `[BRIDGE_QUERIER_IP_OTHER_TIMER] - other querier timeout
Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | net: bridge: mcast: consolidate querier selection for ipv4 and ipv6 | Nikolay Aleksandrov | 1 | -38/+29
We can consolidate both functions as they share almost the same logic. This is easier to maintain and we have a single querier update function. Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | net: bridge: mcast: make sure querier port/address updates are consistent | Nikolay Aleksandrov | 2 | -21/+54
Use a sequence counter to make sure port/address updates can be read consistently without requiring the bridge multicast_lock. We need to zero out the port and address when the other querier has expired and we're about to select ourselves as querier. br_multicast_read_querier will be used later when dumping querier state. Updates are done only with the multicast spinlock and softirqs disabled, while reads are done from process context and from softirqs (due to notifications). Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
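The read-retry protocol behind a sequence counter can be sketched as below. This is an illustrative single-threaded Python model under stated assumptions, not the kernel's seqcount implementation: the class `QuerierState` and its fields are hypothetical, and a real implementation also needs memory barriers and writer serialization (here provided, per the message, by the multicast spinlock).

```python
class QuerierState:
    """Toy model of a sequence-counter protected (port, address) pair."""

    def __init__(self):
        self.seq = 0              # even: stable, odd: update in progress
        self.port_ifindex = 0
        self.addr = "0.0.0.0"

    def update(self, port_ifindex, addr):
        """Writer side; assumed to run with the writer lock held."""
        self.seq += 1             # begin write, counter becomes odd
        self.port_ifindex = port_ifindex
        self.addr = addr
        self.seq += 1             # end write, counter becomes even again

    def read(self):
        """Reader side; retries until it sees a consistent snapshot."""
        while True:
            start = self.seq
            if start % 2:         # writer active, try again
                continue
            snapshot = (self.port_ifindex, self.addr)
            if self.seq == start: # nothing changed while we copied
                return snapshot
```

The reader never takes the lock: it only compares the counter before and after copying, so a port that belongs to one update can never be paired with an address from another.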
2021-08-14 | net: bridge: mcast: record querier port device ifindex instead of pointer | Nikolay Aleksandrov | 2 | -8/+13
Currently when a querier port is detected its net_bridge_port pointer is recorded, but it's used only for comparisons, so a stale pointer is harmless. Dereferencing and using the port pointer would require proper accounting of its usage, which adds unnecessary complexity. To solve the problem we can store the netdevice ifindex instead of the port pointer and use it to retrieve the bridge port. This is best effort: the device must be validated as still being part of that bridge before use, but that is a small price to pay for avoiding querier reference counting for each port/vlan. Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
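The lookup-then-validate idea can be sketched as follows. This is a hypothetical Python model, not the bridge code: `querier_port`, the `devices` table, and the bridge names are all assumptions made for illustration.

```python
from typing import Optional


def querier_port(ifindex: int, devices: dict, bridge: str) -> Optional[str]:
    """Resolve a stored ifindex to a port name, or None if it is no longer valid.

    `devices` maps ifindex -> (device name, name of the bridge it is attached to or None).
    """
    entry = devices.get(ifindex)
    if entry is None:
        return None              # device disappeared since the ifindex was stored
    name, master = entry
    if master != bridge:
        return None              # device exists but is not a port of this bridge
    return name
```

Because only an integer is stored, nothing dangles when the port goes away; the cost is that every use has to tolerate a failed lookup.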
2021-08-14 | devlink: Clear whole devlink_flash_notify struct | Leon Romanovsky | 1 | -2/+2
The { 0 } initializer explicitly zeroes only the first field and leaves the remaining fields and any sub-fields to implicit initialization, which is error-prone for future devlink_flash_notify extensions. The {} empty initializer instead instructs the compiler to fully initialize the whole struct, including sub-fields. Fixes: 6700acc5f1fe ("devlink: collect flash notify params into a struct") Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | devlink: Use xarray to store devlink instances | Leon Romanovsky | 1 | -21/+49
We can use xarray instead of linearly organized linked lists for the devlink instances. This will let us revise the locking scheme in favour of internal xarray locking that protects database. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | devlink: Count struct devlink consumers | Leon Romanovsky | 1 | -35/+170
The struct devlink itself is protected by an internal lock and doesn't need the global lock during operation. That global lock protects the addition and removal of devlink instances on the global list used by all devlink consumers in the system. The future conversion of the linked list to an xarray will allow us to actually delete that lock, but first we need to count all struct devlink users. Reference counting gives us a way to ensure that no new user space command succeeds in grabbing a devlink instance that is about to be destroyed, which makes it safe to access the instance without the lock. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
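The "no new user once destruction has started" guarantee is commonly built on an increment-unless-zero operation. The sketch below is a simplified Python model with hypothetical names (`Instance`, `try_get`, `put`); it is not atomic and is not the devlink code, it only shows the state machine.

```python
class Instance:
    """Toy reference-counted object; a real version would use atomic operations."""

    def __init__(self):
        self.refcount = 1        # reference held by the registration itself

    def try_get(self) -> bool:
        """Take a reference unless the object is already on its way out."""
        if self.refcount == 0:
            return False
        self.refcount += 1
        return True

    def put(self) -> bool:
        """Drop a reference; returns True when the caller dropped the last one."""
        self.refcount -= 1
        return self.refcount == 0
```

A command handler would call `try_get()` before touching the instance and bail out when it fails; unregistration drops the initial reference and waits for the count to reach zero.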
2021-08-14 | devlink: Remove check of always valid devlink pointer | Leon Romanovsky | 1 | -56/+38
Devlink objects are accessible only after they were registered and have valid devlink_*->devlink pointers. Remove that check and simplify respective fill functions as an outcome of such change. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | devlink: Simplify devlink_pernet_pre_exit call | Leon Romanovsky | 1 | -10/+10
The devlink_pernet_pre_exit() will be called if a net namespace exits. That routine is relevant only for devlink instances that were first assigned to that namespace. This assignment is possible only with the following command: "devlink reload DEV netns ...", which already checks reload support. Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: backup flag from incoming MPJ ack option | Paolo Abeni | 1 | -2/+4
The parsed incoming backup flag is not propagated to the subflow itself, so the client may end up using it to send data. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/191 Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: add mibs for stale subflows processing | Paolo Abeni | 5 | -0/+8
This allows monitoring exceptional events like active backup scenarios. Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: faster active backup recovery | Paolo Abeni | 5 | -5/+88
The msk can use backup subflows to transmit in-sequence data only if there is no other active subflow. In an active backup scenario, the MPTCP connection can make forward progress only through MPTCP retransmissions, since rtx can pick backup subflows. This patch introduces a new flag for MPTCP subflows: if the underlying TCP connection has made no progress for a long time, and there are other less problematic subflows available, the given subflow becomes stale. Stale subflows are not considered active: if all non-backup subflows become stale, the MPTCP scheduler can pick backup subflows for plain transmissions. Stale subflows can return to the active state as soon as any reply from the peer is observed. Active backup scenarios can now leverage the available b/w with no restriction. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207 Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
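The selection rule described above can be sketched as a small filter. This is an illustrative Python model, not the MPTCP scheduler: `Subflow` and `usable_subflows` are hypothetical names, and skipping stale backup subflows is an assumption that the message does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Subflow:
    name: str
    backup: bool = False
    stale: bool = False


def usable_subflows(subflows):
    """Prefer non-backup subflows that are not stale; otherwise fall back to backups."""
    active = [s for s in subflows if not s.backup and not s.stale]
    if active:
        return active
    # Every non-backup subflow is stale (or there is none): backups may carry
    # plain transmissions. Assumption: a stale backup is skipped as well.
    return [s for s in subflows if s.backup and not s.stale]
```

When a stale subflow sees a reply from the peer, clearing its `stale` flag is enough to make it win over the backups again on the next call.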
2021-08-14 | mptcp: cleanup sysctl data and helpers | Paolo Abeni | 2 | -10/+10
Reorder the data in mptcp_pernet to avoid wasting space for no reason and constify the access helpers. No functional changes intended. Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: handle pending data on closed subflow | Paolo Abeni | 3 | -8/+82
The PM can close an active subflow, e.g. due to an ingress RM_ADDR option. Such a subflow could carry data still unacked at the MPTCP level, in both the write queue and the rtx_queue, which has never reached the other peer. Currently the mptcp-level retransmission will deliver such data, but at a very low rate (at most 1 DSM for each MPTCP rtx interval). We can speed up the recovery a lot, moving all the unacked data in the tcp write_queue, so that it will be pushed again via other subflows, at the speed allowed by them. Also make the new helper available for later patches. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207 Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: less aggressive retransmission strategy | Paolo Abeni | 3 | -10/+37
The current mptcp re-inject strategy is very aggressive: we have mptcp-level retransmissions even on single-subflow connections if the link in use is lossy. Let's be a little more conservative: we retransmit only if at least one subflow has empty write and rtx queues. Additionally, use the backup subflows only if the active subflows are stale (no progress in at least an rtx period), and ignore stale subflows for the rtx timeout update. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/207 Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-14 | mptcp: more accurate timeout | Paolo Abeni | 1 | -23/+37
As reported by Maxim, we have a lot of MPTCP-level retransmissions when multiple links with different latencies are in use. This patch refactors the mptcp-level timeout accounting so that the maximum of all the active subflow timeouts is used. To avoid traversing the subflow list multiple times, the update is performed inside the packet scheduler. Additionally clean up timeout handling a bit. Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
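Folding the timeout update into the scheduler pass might look like the sketch below. It is a hypothetical Python model: the function name, the tuple layout, and the `load` metric used to choose a subflow are assumptions for illustration, not the MPTCP scheduler's actual criteria.

```python
def pick_subflow_and_timeout(subflows, fallback_timeout):
    """One pass over the list does both jobs.

    `subflows` is an iterable of (name, active, timeout, load) tuples, where
    `load` is a made-up scheduling metric (lower is better).
    """
    best = None
    max_timeout = None
    for name, active, timeout, load in subflows:
        if not active:
            continue             # inactive subflows affect neither result
        if max_timeout is None or timeout > max_timeout:
            max_timeout = timeout
        if best is None or load < best[1]:
            best = (name, load)
    chosen = best[0] if best else None
    if max_timeout is None:
        max_timeout = fallback_timeout
    return chosen, max_timeout
```

Using the maximum rather than the timeout of the chosen subflow keeps a slow link from triggering MPTCP-level retransmissions merely because a faster link was scheduled last.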
2021-08-13 | net: 802: remove dead leftover after ipx driver removal | Lukas Bulwahn | 2 | -61/+0
Commit 7a2e838d28cf ("staging: ipx: delete it from the tree") removes the ipx driver and the config IPX. Since then, there is some dead leftover in ./net/802/, that was once used by the IPX driver, but has no other user. Remove this dead leftover. Signed-off-by: Lukas Bulwahn <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-13 | net: in_irq() cleanup | Changbin Du | 4 | -7/+7
Replace the obsolete and ambiguous macro in_irq() with the new macro in_hardirq(). Signed-off-by: Changbin Du <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-13 | af_unix: fix holding spinlock in oob handling | Rao Shoaib | 1 | -12/+24
syzkaller found that OOB code was holding spinlock while calling a function in which it could sleep. Reported-by: [email protected] Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Rao Shoaib <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-13 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 39 | -130/+244
Conflicts:

drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
  9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
  9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins")
  099fdeda659d ("bnxt_en: Event handler for PPS events")

kernel/bpf/helpers.c
include/linux/bpf-cgroup.h
  a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
  c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")

drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
  5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
  2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer")

MAINTAINERS
  7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
  7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")

Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-13 | mac80211: Use flex-array for radiotap header bitmap | Kees Cook | 2 | -4/+8
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. The it_present member of struct ieee80211_radiotap_header is treated as a flexible array (multiple u32s can be conditionally present). In order for memcpy() to reason (or really, not reason) about the size of operations against this struct, use of bytes beyond it_present needs to be treated as part of the flexible array. Add a trailing flexible array and initialize its initial index via pointer arithmetic. Cc: Johannes Berg <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2021-08-13 | mac80211: radiotap: Use BIT() instead of shifts | Kees Cook | 3 | -21/+21
IEEE80211_RADIOTAP_EXT has a value of 31, which means if shift was ever cast to 64-bit, the result would become sign-extended. As a matter of robustness, just replace all the open-coded shifts with BIT(). Suggested-by: David Sterba <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Cc: Johannes Berg <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
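The sign-extension hazard can be reproduced with plain integer arithmetic. The sketch below emulates C integer widths in Python, so the helper names are hypothetical; it assumes the usual C behaviour where `1 << 31` is computed as a signed 32-bit int, while the kernel's BIT(n) shifts an unsigned long.

```python
def sign_extend_32_to_64(value: int) -> int:
    """Model casting a signed 32-bit value to an unsigned 64-bit one."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:          # sign bit set: upper 32 bits become ones
        value |= 0xFFFFFFFF00000000
    return value


def bit(n: int) -> int:
    """Model BIT(n) on a 64-bit kernel: an unsigned shift, so no sign to extend."""
    return (1 << n) & 0xFFFFFFFFFFFFFFFF


IEEE80211_RADIOTAP_EXT = 31

# Open-coded shift with a signed int, then widened to 64 bits.
widened_shift = sign_extend_32_to_64(1 << IEEE80211_RADIOTAP_EXT)
# The BIT() form of the same flag.
widened_bit = bit(IEEE80211_RADIOTAP_EXT)
```

Here `widened_shift` ends up with all of bits 31 to 63 set, whereas `widened_bit` has only bit 31 set, which is the robustness argument the message makes.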
2021-08-13 | mac80211: Remove unnecessary variable and label | dingsenjie | 1 | -11/+4
The variable ret and the label are used only to return, so delete them and use a return statement instead of the goto statement. Signed-off-by: dingsenjie <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2021-08-13 | mac80211: include <linux/rbtree.h> | Johannes Berg | 1 | -0/+1
This is needed for the rbtree, and we shouldn't just rely on it getting included somewhere implicitly. Include it explicitly. Acked-by: Toke Høiland-Jørgensen <[email protected]> Link: https://lore.kernel.org/r/20210715180234.512d64dee655.Ia51c29a9fb1e651e06bc00eabec90974103d333e@changeid Signed-off-by: Johannes Berg <[email protected]>
2021-08-13 | mac80211: Fix monitor MTU limit so that A-MSDUs get through | Johan Almbladh | 1 | -2/+9
The maximum MTU was set to 2304, which is the maximum MSDU size. While this is valid for normal WLAN interfaces, it is too low for monitor interfaces. A monitor interface may receive and inject MPDU frames, and the maximum MPDU frame size is larger than 2304. The MPDU may also contain an A-MSDU frame, in which case the size may be much larger than the MTU limit. Since the maximum size of an A-MSDU depends on the PHY mode of the transmitting STA, it is not possible to set an exact MTU limit for a monitor interface. Now the maximum MTU for a monitor interface is unrestricted. Signed-off-by: Johan Almbladh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2021-08-13 | mac80211: remove unnecessary NULL check in ieee80211_register_hw() | Dan Carpenter | 1 | -1/+1
The address "&sband->iftype_data[i]" points to an array at the end of struct. It can't be NULL and so the check can be removed. Fixes: bac2fd3d7534 ("mac80211: remove use of ieee80211_get_he_sta_cap()") Signed-off-by: Dan Carpenter <[email protected]> Link: https://lore.kernel.org/r/YNmgHi7Rh3SISdog@mwanda Signed-off-by: Johannes Berg <[email protected]>
2021-08-13 | mac80211: Reject zero MAC address in sta_info_insert_check() | YueHaibing | 1 | -1/+1
As commit 52dba8d7d5ab ("mac80211: reject zero MAC address in add station") said, we don't consider all-zeroes to be a valid MAC address in most places, so also reject it here. Signed-off-by: YueHaibing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2021-08-12 | Merge tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan | Jakub Kicinski | 1 | -1/+6
Stefan Schmidt says:
====================
ieee802154 for net 2021-08-12
Mostly fixes coming from bot reports. Dongliang Mu tackled some syzkaller reports in hwsim again and Takeshi Misawa a memory leak in ieee802154 raw.
* tag 'ieee802154-for-davem-2021-08-12' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan:
  net: Fix memory leak in ieee802154_raw_deliver
  ieee802154: hwsim: fix GPF in hwsim_new_edge_nl
  ieee802154: hwsim: fix GPF in hwsim_set_edge_lqi
====================
Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-12 | vsock/virtio: avoid potential deadlock when vsock device remove | Longpeng(Mike) | 1 | -2/+5
There's a potential deadlock when removing the vsock device or processing the RESET event:

vsock_for_each_connected_socket:
    spin_lock_bh(&vsock_table_lock) ----------- (1)
    ...
    virtio_vsock_reset_sock:
        lock_sock(sk) --------------------- (2)
    ...
    spin_unlock_bh(&vsock_table_lock)

lock_sock() may voluntarily schedule when the 'sk' is owned by another thread at the same time, and we would receive the warning "scheduling while atomic". Even worse, if the next task (selected by the scheduler) tries to release a 'sk', it needs to take vsock_table_lock and the deadlock occurs, putting the system into a softlockup state.

Call trace:
 queued_spin_lock_slowpath
 vsock_remove_bound
 vsock_remove_sock
 virtio_transport_release
 __vsock_release
 vsock_release
 __sock_release
 sock_close
 __fput
 ____fput

So we should not take sk_lock in this case, just like the behavior in vhost_vsock or vmci.
Fixes: 0ea9e1d3a9e3 ("VSOCK: Introduce virtio_transport.ko") Cc: Stefan Hajnoczi <[email protected]> Signed-off-by: Longpeng(Mike) <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-12 | net: dsa: tag_8021q: don't broadcast during setup/teardown | Vladimir Oltean | 4 | -16/+26
Currently, on my board with multiple sja1105 switches in disjoint trees described in commit f66a6a69f97a ("net: dsa: permit cross-chip bridging between all trees in the system"), rebooting the board triggers the following benign warnings:

[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Basically switch 1 calls dsa_tag_8021q_unregister, and switch 1's TX and RX VLANs cannot be found on switch 2's CPU port. But why would switch 2 even attempt to delete switch 1's TX and RX tag_8021q VLANs from its CPU port? Well, because we use dsa_broadcast, and it is supposed that it had added those VLANs in the first place (because in dsa_port_tag_8021q_vlan_match, all CPU ports match regardless of their tree index or switch index). The two trees probe asynchronously, and when switch 1 probed, it called dsa_broadcast which did not notify the tree of switch 2, because that hadn't probed yet. But during unbind, switch 2's tree _is_ probed, so it _is_ notified of the deletion.

Before jumping to introduce a synchronization mechanism between the probing across disjoint switch trees, let's take a step back and see whether we _need_ to do that in the first place. The RX and TX VLANs of switch 1 would be needed on switch 2's CPU port only if switch 1 and 2 were part of a cross-chip bridge. And dsa_tag_8021q_bridge_join takes care precisely of that (but if probing was synchronous, the bridge_join would just end up bumping the VLANs' refcount, because they are already installed by the setup path).

Since by the time the ports are bridged, all DSA trees are already set up, and we don't need the tag_8021q VLANs of one switch installed on the other switches during probe time, the answer is that we don't need to fix the synchronization issue. So make the setup and teardown code paths call dsa_port_notify, which notifies only the local tree, and the bridge code paths call dsa_broadcast, which lets the other trees know as well.
Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-12 | net: dsa: print more information when a cross-chip notifier fails | Vladimir Oltean | 1 | -6/+12
Currently this error message does not say a lot:

[ 32.693498] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.699725] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.705931] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.712139] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.718347] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT
[ 32.724554] DSA: failed to notify tag_8021q VLAN deletion: -ENOENT

but in this form, it is immediately obvious (at least to me) what the problem is, even without further looking at the code:

[ 12.345566] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 1088 deletion: -ENOENT
[ 12.353804] sja1105 spi2.0: port 0 failed to notify tag_8021q VLAN 2112 deletion: -ENOENT
[ 12.362019] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 1089 deletion: -ENOENT
[ 12.370246] sja1105 spi2.0: port 1 failed to notify tag_8021q VLAN 2113 deletion: -ENOENT
[ 12.378466] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 1090 deletion: -ENOENT
[ 12.386683] sja1105 spi2.0: port 2 failed to notify tag_8021q VLAN 2114 deletion: -ENOENT

Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-12 | pktgen: Add output for imix results | Nick Richardson | 1 | -1/+25
The bps for imix mode is calculated by:
  sum(imix_entry.size) / time_elapsed
The actual counts of each imix_entry are displayed under the "Current:" section of the interface output in the following format:
  imix_size_counts: size_1,count_1 size_2,count_2 ... size_n,count_n
Example (count = 200000):
  imix_weights: 256,1 859,3 205,2
  imix_size_counts: 256,32082 859,99796 205,68122
  Result: OK: 17992362(c17964678+d27684) usec, 200000 (859byte,0frags)
    11115pps 47Mb/sec (47977140bps) errors: 0
Summary of changes:
  Calculate bps based on imix counters when in IMIX mode.
  Add output for IMIX counters.
Signed-off-by: Nick Richardson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
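The example figures can be cross-checked with a few lines of arithmetic. This sketch reads the formula as "total bytes sent across all imix entries, converted to bits, divided by elapsed time", which is an interpretation of the message rather than a copy of the pktgen code; `imix_bps` is a hypothetical helper.

```python
def imix_bps(size_counts, elapsed_usec):
    """Bits per second from per-size packet counts and an elapsed time in microseconds."""
    total_bytes = sum(size * count for size, count in size_counts)
    return total_bytes * 8 * 1_000_000 // elapsed_usec


# Values copied from the example output above.
counts = [(256, 32082), (859, 99796), (205, 68122)]
packets = sum(count for _, count in counts)
bps = imix_bps(counts, 17_992_362)
pps = packets * 1_000_000 // 17_992_362
```

The per-size counts add up to the 200000 packets requested, and the computed rate lands within a few bits per second of the reported 47977140bps; a small difference is plausible because the elapsed time shown is rounded to microseconds.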
2021-08-12 | pktgen: Add imix distribution bins | Nick Richardson | 1 | -0/+41
In order to represent the distribution of imix packet sizes, a pre-computed data structure is used. It features 100 (IMIX_PRECISION) "bins". Contiguous ranges of these bins represent the respective packet size of each imix entry. This is done to avoid the overhead of selecting the correct imix packet size based on the corresponding weights.
Example:
  imix_weights 40,7 576,4 1500,1
  total_weight = 7 + 4 + 1 = 12
  pkt_size 40 occurs 7/total_weight = 58% of the time
  pkt_size 576 occurs 4/total_weight = 33% of the time
  pkt_size 1500 occurs 1/total_weight = 9% of the time
We generate a random number in the range 0-99 and select the corresponding packet size based on the specified weights.
E.g. random number = 358723895 % 100 = 95
Selects the packet size corresponding to index:95 in the pre-computed imix_distribution array.
An example of the pre-computed array is below. The imix_distribution will look like the following:
  0 -> 0 (index of imix_entry.size == 40)
  1 -> 0 (index of imix_entry.size == 40)
  2 -> 0 (index of imix_entry.size == 40)
  [...] -> 0 (index of imix_entry.size == 40)
  57 -> 0 (index of imix_entry.size == 40)
  58 -> 1 (index of imix_entry.size == 576)
  [...] -> 1 (index of imix_entry.size == 576)
  90 -> 1 (index of imix_entry.size == 576)
  91 -> 2 (index of imix_entry.size == 1500)
  [...] -> 2 (index of imix_entry.size == 1500)
  99 -> 2 (index of imix_entry.size == 1500)
Create and use "bin" representation of the imix distribution.
Signed-off-by: Nick Richardson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
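One way to build such a bin table is sketched below. It is a Python model under stated assumptions, not the pktgen source: each entry gets `weight * IMIX_PRECISION // total_weight` bins (integer division), and the last entry absorbs whatever bins remain, which reproduces the 58/33/9 split in the example. The function name is chosen for illustration.

```python
IMIX_PRECISION = 100  # number of bins, as stated in the commit message


def fill_imix_distribution(entries):
    """Map each of the IMIX_PRECISION bins to the index of an imix entry.

    `entries` is a list of (packet_size, weight) pairs.
    """
    total_weight = sum(weight for _, weight in entries)
    distribution = []
    cumulative = 0
    for index, (_, weight) in enumerate(entries):
        if index == len(entries) - 1:
            cumulative = IMIX_PRECISION          # last entry takes the remainder
        else:
            cumulative += weight * IMIX_PRECISION // total_weight
        # Extend the table up to the new cumulative boundary.
        distribution.extend([index] * (cumulative - len(distribution)))
    return distribution


entries = [(40, 7), (576, 4), (1500, 1)]
dist = fill_imix_distribution(entries)
size_for_bin_65 = entries[dist[65]][0]   # a draw of 65 selects 576-byte packets
size_for_bin_95 = entries[dist[95]][0]   # a draw of 95 selects 1500-byte packets
```

Selecting a packet size is then a single array lookup on `random % IMIX_PRECISION`, which is the overhead saving the message refers to.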
2021-08-12 | pktgen: Parse internet mix (imix) input | Nick Richardson | 1 | -0/+96
Adds "imix_weights" command for specifying internet mix distribution. The command is in this format:
  "imix_weights size_1,weight_1 size_2,weight_2 ... size_n,weight_n"
where the probability that packet size_i is picked is:
  weight_i / (weight_1 + weight_2 + .. + weight_n)
The user may provide up to 100 imix entries (size_i,weight_i) in this command. The user specified imix entries will be displayed in the "Params" section of the interface output. Values of clone_skb > 0 are not supported in IMIX mode.
Summary of changes:
  Add flag for enabling internet mix mode.
  Add command (imix_weights) for internet mix input.
  Return -ENOTSUPP when clone_skb > 0 in IMIX mode.
  Display imix_weights in Params.
  Create data structures to store imix entries and distribution.
Signed-off-by: Nick Richardson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
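The argument format and the probability formula can be illustrated with a small parser. This is a hypothetical Python sketch, not the kernel's parser: the names `parse_imix_weights`, `pick_probability`, and `MAX_ENTRIES` are invented here, the limit of 100 is taken from the message, and rejecting non-positive sizes or weights is an assumption.

```python
MAX_ENTRIES = 100  # limit stated in the commit message; the name is made up


def parse_imix_weights(argument: str):
    """Parse "size_1,weight_1 size_2,weight_2 ..." into (size, weight) pairs."""
    entries = []
    for token in argument.split():
        size_text, weight_text = token.split(",")
        size, weight = int(size_text), int(weight_text)
        if size <= 0 or weight <= 0:   # assumption: such values are invalid
            raise ValueError(f"invalid imix entry: {token!r}")
        entries.append((size, weight))
    if not entries or len(entries) > MAX_ENTRIES:
        raise ValueError("expected between 1 and 100 imix entries")
    return entries


def pick_probability(entries, i: int) -> float:
    """weight_i / (weight_1 + ... + weight_n)"""
    total_weight = sum(weight for _, weight in entries)
    return entries[i][1] / total_weight
```

For instance, `parse_imix_weights("40,7 576,4 1500,1")` gives three entries, and the first one is picked with probability 7/12.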
2021-08-12 | Revert "tipc: Return the correct errno code" | Hoang Le | 1 | -3/+3
This reverts commit 0efea3c649f0 because of:
- The returned -ENOBUFS error is fine on socket buffer allocation.
- There is a side effect in the calling path tipc_node_xmit()->tipc_link_xmit() when checking the returned error code.
Fixes: 0efea3c649f0 ("tipc: Return the correct errno code") Acked-by: Jon Maloy <[email protected]> Signed-off-by: Hoang Le <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-11 | net: bridge: vlan: fix global vlan option range dumping | Nikolay Aleksandrov | 1 | -1/+2
When global vlan options are equal sequentially we compress them in a range to save space and reduce processing time. In order to have the proper range end id we need to update range_end if the options are equal otherwise we get ranges with the same end vlan id as the start. Fixes: 743a53d9636a ("net: bridge: vlan: add support for dumping global vlan options") Signed-off-by: Nikolay Aleksandrov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
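The compression and the role of the range end can be sketched as below. This is a Python illustration with hypothetical names, not the bridge code; it assumes that a range covers consecutive vlan ids and that options compare by plain equality.

```python
def compress_vlan_ranges(vlans):
    """Collapse runs of consecutive vlan ids with equal options into ranges.

    `vlans` is a list of (vid, options) sorted by vid; returns (start, end, options).
    """
    ranges = []
    range_start = range_end = None
    range_opts = None
    for vid, opts in vlans:
        if range_start is not None and vid == range_end + 1 and opts == range_opts:
            range_end = vid      # extend the range; skipping this leaves end == start
            continue
        if range_start is not None:
            ranges.append((range_start, range_end, range_opts))
        range_start = range_end = vid
        range_opts = opts
    if range_start is not None:
        ranges.append((range_start, range_end, range_opts))
    return ranges
```

If the `range_end = vid` update is left out, every emitted range reports the same id for start and end, which is the symptom described in the message.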
2021-08-11 | mctp: Specify route types, require rtm_type in RTM_*ROUTE messages | Jeremy Kerr | 1 | -5/+22
This change adds a 'type' attribute to routes, which can be parsed from a RTM_NEWROUTE message. This will help to distinguish local vs. peer routes in a future change. This means userspace will need to set a correct rtm_type in RTM_NEWROUTE and RTM_DELROUTE messages; we currently only accept RTN_UNICAST. Signed-off-by: Jeremy Kerr <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-11 | net: igmp: increase size of mr_ifc_count | Eric Dumazet | 1 | -1/+1
Some arches support cmpxchg() on 4-byte and 8-byte only. Increase mr_ifc_count width to 32bit to fix this problem. Fixes: 4a2b285e7e10 ("net: igmp: fix data-race in igmp_ifc_timer_expire()") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Guenter Roeck <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-11 | tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets | Neal Cardwell | 1 | -1/+1
Currently if BBR congestion control is initialized after more than 2B packets have been delivered, depending on the phase of the tp->delivered counter the tracking of BBR round trips can get stuck. The bug arises because if tp->delivered is between 2^31 and 2^32 at the time the BBR congestion control module is initialized, then the initialization of bbr->next_rtt_delivered to 0 will cause the logic to believe that the end of the round trip is still billions of packets in the future. More specifically, the following check will fail repeatedly: !before(rs->prior_delivered, bbr->next_rtt_delivered) and thus the connection will take up to 2B packets delivered before that check will pass and the connection will set: bbr->round_start = 1; This could cause many mechanisms in BBR to fail to trigger, for example bbr_check_full_bw_reached() would likely never exit STARTUP. This bug is 5 years old and has not been observed, and as a practical matter this would likely rarely trigger, since it would require transferring at least 2B packets, or likely more than 3 terabytes of data, before switching congestion control algorithms to BBR. This patch is a stable candidate for kernels as far back as v4.9, when tcp_bbr.c was added. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <[email protected]> Reviewed-by: Yuchung Cheng <[email protected]> Reviewed-by: Kevin Yang <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
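The wrap-around comparison can be checked numerically. The sketch below models the kernel's `before()` helper, which tests whether the signed 32-bit difference of two counters is negative, using masked Python integers. The example counter value and the idea of seeding next_rtt_delivered from the current delivered count are assumptions used for illustration; the message itself does not state how the initialization was changed.

```python
U32_MASK = 0xFFFFFFFF


def before(seq1: int, seq2: int) -> bool:
    """True if seq1 precedes seq2 in 32-bit wrapping arithmetic: (s32)(seq1 - seq2) < 0."""
    return ((seq1 - seq2) & U32_MASK) >= 0x80000000


def round_starts(prior_delivered: int, next_rtt_delivered: int) -> bool:
    """The check quoted above: !before(rs->prior_delivered, bbr->next_rtt_delivered)."""
    return not before(prior_delivered, next_rtt_delivered)


delivered = 3_000_000_000      # hypothetical counter value between 2^31 and 2^32

stuck = round_starts(delivered, 0)            # next_rtt_delivered initialized to 0
fixed = round_starts(delivered, delivered)    # initialized from the live counter
early = round_starts(1000, 0)                 # small counter: zero works fine
```

With the counter above 2^31, a zero next_rtt_delivered looks as if it lies in the future, so the check keeps failing; seeding it from the current counter makes the first round start immediately.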
2021-08-11 | net: linkwatch: fix failure to restore device state across suspend/resume | Willy Tarreau | 1 | -2/+3
After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed that my Ethernet port, to which a bond and a VLAN interface are attached, appeared to remain up after resuming from suspend with the cable unplugged (and that problem still persists with 5.10-LTS).

The sequence of events is the following:
- the network driver (e1000e here) prepares to suspend, calls e1000e_down() which calls netif_carrier_off() to signal that the link is going down.
- netif_carrier_off() adds a link_watch event to the list of events for this device
- the device is completely stopped.
- the machine suspends
- the cable is unplugged and the machine brought to another location
- the machine is resumed
- the queued linkwatch events are processed for the device
- the device doesn't yet have the __LINK_STATE_PRESENT bit and its events are silently dropped
- the device is resumed with its link down
- the upper VLAN and bond interfaces are never notified that the link had been turned down and remain up
- the only way to provoke a change is to physically connect the machine to a port and possibly unplug it.

The state after resume looks like this:

$ ip -br li | egrep 'bond|eth'
bond0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP>
eth0 DOWN e8:6a:64:64:64:64 <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP>
eth0.2@eth0 UP e8:6a:64:64:64:64 <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP>

Placing an explicit call to netdev_state_change() either in the suspend or the resume code in the NIC driver worked around this but the solution is not satisfying. The issue really is in link_watch, which loses events when it ought not to. The test for the device being present was added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") in 4.20 to avoid an access to devices that are not present.

Instead of dropping events, this patch proceeds slightly differently by postponing their handling so that they happen after the device is fully resumed.
Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev") Link: https://lists.openwall.net/netdev/2018/03/15/62 Cc: Heiner Kallweit <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Florian Fainelli <[email protected]> Signed-off-by: Willy Tarreau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2021-08-11  net: dsa: create a helper for locating EtherType DSA headers on TX  Vladimir Oltean  7  -17/+23
Create a helper, similar to the RX one, for locating the offset to the DSA header relative to skb->data, and make the existing EtherType header taggers use it. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
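As a rough illustration of the TX-side offset, here is a user-space sketch on a plain byte buffer. The function name etype_header_pos_tx() is hypothetical. The sketch assumes that on transmit the data pointer sits at the start of the destination MAC address and that the EtherType DSA header follows the two MAC addresses directly. Neither assumption is stated in the message above.

```c
#include <stdint.h>

#define ETH_ALEN 6	/* bytes in one MAC address */

/* Hypothetical user-space model: given a pointer to the start of the
 * frame (destination MAC), return where the EtherType DSA header begins,
 * assuming it sits right after the destination and source MAC addresses.
 */
static uint8_t *etype_header_pos_tx(uint8_t *data)
{
	return data + 2 * ETH_ALEN;
}
```

Under those assumptions the header starts 12 bytes into the frame, which is the slot a regular EtherType would otherwise occupy.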
2021-08-11  net: dsa: create a helper for locating EtherType DSA headers on RX  Vladimir Oltean  8  -26/+21
It seems that protocol tagging driver writers are always surprised about the formula they use to reach their EtherType header on RX, which becomes apparent from the fact that there are comments in multiple drivers that mention the same information. Create a helper that returns a void pointer to skb->data - 2, and centralize the explanation of why that is the case. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
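The "skb->data - 2" formula is easier to see on a flat buffer. The sketch below is a user-space model with a hypothetical function name, etype_header_pos_rx(). It assumes that by the time a tagger runs on receive, the data pointer has already been advanced past a full 14-byte Ethernet header, so the two bytes just behind it are the EtherType slot where the DSA header begins.

```c
#include <stdint.h>

#define ETH_HLEN 14	/* destination MAC + source MAC + EtherType */

/* Hypothetical user-space model: data points just past the Ethernet
 * header (assumed already consumed by the receive path), so the
 * EtherType DSA header starts two bytes earlier.
 */
static const uint8_t *etype_header_pos_rx(const uint8_t *data)
{
	return data - 2;
}
```

For a frame stored at `frame`, the receive-side pointer in this model is `frame + ETH_HLEN`, and the helper returns `frame + 12`, the first byte after the two MAC addresses.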
2021-08-11  net: dsa: create a helper which allocates space for EtherType DSA headers  Vladimir Oltean  8  -10/+38
Hide away the memmove used by DSA EtherType header taggers to shift the MAC SA and DA to the left to make room for the header, after they've called skb_push(). The call to skb_push() is still left explicit in drivers, to be symmetric with dsa_strip_etype_header, and because not all callers can be refactored to do it (for example, brcm_tag_xmit_ll has common code for a pre-Ethernet DSA tag and an EtherType DSA tag). Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
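The memmove described above can be sketched in user space as follows. This is an illustrative model on a byte buffer, not the kernel helper: alloc_etype_header() is a hypothetical name, and the caller is assumed to have already moved the data pointer back by len bytes (the job skb_push() does in the kernel).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6	/* bytes in one MAC address */

/* Hypothetical user-space model. data points at the new frame start,
 * len bytes before the original destination MAC. Shift the destination
 * and source MAC addresses left by len bytes, which opens a len-byte gap
 * between the source MAC and the original EtherType for the tag.
 */
static void alloc_etype_header(uint8_t *data, size_t len)
{
	memmove(data, data + len, 2 * ETH_ALEN);
}
```

After the call, bytes 0..11 from `data` hold the two MAC addresses, the next len bytes are free for the tagger to fill in, and the original EtherType stays where it was.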
2021-08-11  net: dsa: create a helper that strips EtherType DSA headers on RX  Vladimir Oltean  8  -28/+37
All header taggers open-code a memmove that is not all that obvious, and we can hide the details behind a helper function, since the only thing specific to the driver is the length of the header tag. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
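To make the memmove less mysterious, here is a user-space sketch of the strip operation on a flat buffer. The name strip_etype_header() is hypothetical. The sketch assumes the caller has already advanced the data pointer past the len-byte tag, so that data points just after the real EtherType and the stale MAC addresses lie ETH_HLEN + len bytes behind it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6	/* bytes in one MAC address */
#define ETH_HLEN 14	/* destination MAC + source MAC + EtherType */

/* Hypothetical user-space model. Move the two MAC addresses right by
 * len bytes so they sit directly in front of the real EtherType,
 * overwriting the tag. Only len is specific to the tagging protocol.
 */
static void strip_etype_header(uint8_t *data, size_t len)
{
	memmove(data - ETH_HLEN, data - ETH_HLEN - len, 2 * ETH_ALEN);
}
```

Once this returns, a normal untagged Ethernet header begins at `data - ETH_HLEN`. Source and destination ranges overlap whenever len is smaller than 12, which is why memmove rather than memcpy is needed.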
2021-08-11  devlink: Add APIs to publish, unpublish individual parameter  Parav Pandit  1  -0/+48
Enable drivers to publish/unpublish an individual parameter. Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-11  devlink: Add API to register and unregister single parameter  Parav Pandit  1  -0/+37
Currently device configuration parameters can only be registered as an array, so a constant array must be registered. A single driver supporting multiple devices, each with different capabilities, ends up registering all parameters even if a given device doesn't support some of them. One possible workaround is for the driver to register multiple single-entry arrays. A better option is to provide an API that lets a driver register/unregister a single parameter. This also helps in two further ways: (1) it reduces the memory used by devlink_param_entry by avoiding registration of parameters that the device does not support; (2) it avoids generating multiple parameter add, delete, publish, unpublish, and init value notifications for such unsupported parameters. Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-11  devlink: Create a helper function for one parameter registration  Parav Pandit  1  -6/+18
Create and use a helper function for one parameter registration. A subsequent patch will also reuse this for the driver-facing routine to register a single parameter. Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2021-08-11  devlink: Add new "enable_vnet" generic device param  Parav Pandit  1  -0/+5
Add a new generic device parameter to enable/disable creation of the VDPA net auxiliary device and associated device functionality in the devlink instance. A user who prefers to disable such functionality can do so as in the example below.

  $ devlink dev param set pci/0000:06:00.0 \
        name enable_vnet value false cmode driverinit
  $ devlink dev reload pci/0000:06:00.0

At this point the devlink instance does not create an auxiliary device for the VDPA net functionality.

Signed-off-by: Parav Pandit <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Signed-off-by: David S. Miller <[email protected]>