aboutsummaryrefslogtreecommitdiff
path: root/net/dsa/slave.c
AgeCommit message (Collapse)AuthorFilesLines
2021-06-29net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are ↵Vladimir Oltean1-5/+4
on the same dev When (a) "dev" is a bridge port which the DSA switch tree offloads, but is otherwise not a dsa slave (such as a LAG netdev), or (b) "dev" is the bridge net device itself then strange things happen to the dev_hold/dev_put pair: dsa_schedule_work() will still be called with a DSA port that offloads that netdev, but dev_hold() will be called on the non-DSA netdev. Then the "if" condition in dsa_slave_switchdev_event_work() does not pass, because "dev" is not a DSA netdev, so dev_put() is not called. This results in the simple fact that we have a reference counting mismatch on the "dev" net device. This can be seen when we add support for host addresses installed on the bridge net device. ip link add br1 type bridge ip link set br1 address 00:01:02:03:04:05 ip link set swp0 master br1 ip link del br1 [ 968.512278] unregister_netdevice: waiting for br1 to become free. Usage count = 5 It seems foolish to do penny pinching and not add the net_device pointer in the dsa_switchdev_event_work structure, so let's finally do that. As an added bonus, when we start offloading local entries pointing towards the bridge, these will now properly appear as 'offloaded' in 'bridge fdb' (this was not possible before, because 'dev' was assumed to only be a DSA net device): 00:01:02:03:04:05 dev br0 vlan 1 offload master br0 permanent 00:01:02:03:04:05 dev br0 offload master br0 permanent Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: include fdb entries pointing to bridge in the host fdb listVladimir Oltean1-2/+11
The bridge supports a legacy way of adding local (non-forwarded) FDB entries, which works on an individual port basis: bridge fdb add dev swp0 00:01:02:03:04:05 master local As well as a new way, added by Roopa Prabhu in commit 3741873b4f73 ("bridge: allow adding of fdb entries pointing to the bridge device"): bridge fdb add dev br0 00:01:02:03:04:05 self local The two commands are functionally equivalent, except that the first one produces an entry with fdb->dst == swp0, and the other an entry with fdb->dst == NULL. The confusing part, though, is that even if fdb->dst is swp0 for the 'local on port' entry, that destination is not used. Nonetheless, the idea is that the bridge has reference counting for local entries, and local entries pointing towards the bridge are still 'as local' as local entries for a port. The bridge adds the MAC addresses of the interfaces automatically as FDB entries with is_local=1. For the MAC address of the ports, fdb->dst will be equal to the port, and for the MAC address of the bridge, fdb->dst will point towards the bridge (i.e. be NULL). Therefore, if the MAC address of the bridge is not inherited from either of the physical ports, then we must explicitly catch local FDB entries emitted towards the br0, otherwise we'll miss the MAC address of the bridge (and, of course, any entry with 'bridge add dev br0 ... self local'). Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: include bridge addresses which are local in the host fdb listTobias Waldekranz1-5/+13
The bridge automatically creates local (not forwarded) fdb entries pointing towards physical ports with their interface MAC addresses. For switchdev, the significance of these fdb entries is the exact opposite of that of non-local entries: instead of sending these frame outwards, we must send them inwards (towards the host). NOTE: The bridge's own MAC address is also "local". If that address is not shared with any port, the bridge's MAC is not be added by this functionality - but the following commit takes care of that case. NOTE 2: We mark these addresses as host-filtered regardless of the value of ds->assisted_learning_on_cpu_port. This is because, as opposed to the speculative logic done for dynamic address learning on foreign interfaces, the local FDB entries are rather fixed, so there isn't any risk of them migrating from one bridge port to another. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: sync static FDB entries on foreign interfaces to hardwareVladimir Oltean1-4/+8
DSA is able to install FDB entries towards the CPU port for addresses which were dynamically learnt by the software bridge on foreign interfaces that are in the same bridge with a DSA switch interface. Since this behavior is opportunistic, it is guarded by the "assisted_learning_on_cpu_port" property which can be enabled by drivers and is not done automatically (since certain switches may support address learning of packets coming from the CPU port). But if those FDB entries added on the foreign interfaces are static (added by the user) instead of dynamically learnt, currently DSA does not do anything (and arguably it should). Because static FDB entries are not supposed to move on their own, there is no downside in reusing the "assisted_learning_on_cpu_port" logic to sync static FDB entries to the DSA CPU port unconditionally, even if assisted_learning_on_cpu_port is not requested by the driver. For example, this situation: br0 / \ swp0 dummy0 $ bridge fdb add 02:00:de:ad:00:01 dev dummy0 vlan 1 master static Results in DSA adding an entry in the hardware FDB, pointing this address towards the CPU port. The same is true for entries added to the bridge itself, e.g: $ bridge fdb add 02:00:de:ad:00:01 dev br0 vlan 1 self local (except that right now, DSA still ignores 'local' FDB entries, this will be changed in a later patch) Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: introduce a separate cross-chip notifier type for host FDBsVladimir Oltean1-5/+16
DSA treats some bridge FDB entries by trapping them to the CPU port. Currently, the only class of such entries are FDB addresses learnt by the software bridge on a foreign interface. However there are many more to be added: - FDB entries with the is_local flag (for termination) added by the bridge on the user ports (typically containing the MAC address of the bridge port) - FDB entries pointing towards the bridge net device (for termination). Typically these contain the MAC address of the bridge net device. - Static FDB entries installed on a foreign interface that is in the same bridge with a DSA user port. The reason why a separate cross-chip notifier for host FDBs is justified compared to normal FDBs is the same as in the case of host MDBs: the cross-chip notifier matching function in switch.c should avoid installing these entries on routing ports that route towards the targeted switch, but not towards the CPU. This is required in order to have proper support for H-like multi-chip topologies. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: introduce a separate cross-chip notifier type for host MDBsVladimir Oltean1-8/+2
Commit abd49535c380 ("net: dsa: execute dsa_switch_mdb_add only for routing port in cross-chip topologies") does a surprisingly good job even for the SWITCHDEV_OBJ_ID_HOST_MDB use case, where DSA simply translates a switchdev object received on dp into a cross-chip notifier for dp->cpu_dp. To visualize how that works, imagine the daisy chain topology below and consider a SWITCHDEV_OBJ_ID_HOST_MDB object emitted on sw2p0. How does the cross-chip notifier know to match on all the right ports (sw0p4, the dedicated CPU port, sw1p4, an upstream DSA link, and sw2p4, another upstream DSA link)? | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ user ] [ user ] [ user ] [ dsa ] [ cpu ] [ ] [ ] [ ] [ ] [ x ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] | +---------+ | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 [ user ] [ user ] [ user ] [ user ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] The answer is simple: the dedicated CPU port of sw2p0 is sw0p4, and dsa_routing_port returns the upstream port for all switches. That is fine, but there are other topologies where this does not work as well. There are trees with "H" topologies in the wild, where there are 2 or more switches with DSA links between them, but every switch has its dedicated CPU port. For these topologies, it seems stupid for the neighbor switches to install an MDB entry on the routing port, since these multicast addresses are fundamentally different than the usual ones we support (and that is the justification for this patch, to introduce the concept of a termination plane multicast MAC address, as opposed to a forwarding plane multicast MAC address). For example, when a SWITCHDEV_OBJ_ID_HOST_MDB would get added to sw0p0, without this patch, it would get treated as a regular port MDB on sw0p2 and it would match on the ports below (including the sw1p3 routing port). | | sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0 [ user ] [ user ] [ cpu ] [ dsa ] [ dsa ] [ cpu ] [ user ] [ user ] [ ] [ ] [ x ] [ ] ---- [ x ] [ ] [ ] [ ] With the patch, the host MDB notifier on sw0p0 matches only on the local switch, which is what we want for a termination plane address. | | sw0p0 sw0p1 sw0p2 sw0p3 sw1p3 sw1p2 sw1p1 sw1p0 [ user ] [ user ] [ cpu ] [ dsa ] [ dsa ] [ cpu ] [ user ] [ user ] [ ] [ ] [ x ] [ ] ---- [ ] [ ] [ ] [ ] Name this new matching function "dsa_switch_host_address_match" since we will be reusing it soon for host FDB entries as well. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29net: dsa: delete dsa_legacy_fdb_add and dsa_legacy_fdb_delVladimir Oltean1-23/+0
We want to add reference counting for FDB entries in cross-chip topologies, and in order for that to have any chance of working and not be unbalanced (leading to entries which are never deleted), we need to ensure that higher layers are sane, because if they aren't, it's garbage in, garbage out. For example, if we add a bridge FDB entry twice, the bridge properly errors out: $ bridge fdb add dev swp0 00:01:02:03:04:07 master static $ bridge fdb add dev swp0 00:01:02:03:04:07 master static RTNETLINK answers: File exists However, the same thing cannot be said about the bridge bypass operations: $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ bridge fdb add dev swp0 00:01:02:03:04:07 $ echo $? 0 But one 'bridge fdb del' is enough to remove the entry, no matter how many times it was added. The bridge bypass operations are impossible to maintain in these circumstances and lack of support for reference counting the cross-chip notifiers is holding us back from making further progress, so just drop support for them. The only way left for users to install static bridge FDB entries is the proper one, using the "master static" flags. With this change, rtnl_fdb_add() falls back to calling ndo_dflt_fdb_add() which uses the duplicate-exclusive variant of dev_uc_add(): dev_uc_add_excl(). Because DSA does not (yet) declare IFF_UNICAST_FLT, this results in us going to promiscuous mode: $ bridge fdb add dev swp0 00:01:02:03:04:05 [ 28.206743] device swp0 entered promiscuous mode $ bridge fdb add dev swp0 00:01:02:03:04:05 RTNETLINK answers: File exists So even if it does not completely fail, there is at least some indication that it is behaving differently from before, and closer to user space expectations, I would argue (the lack of a "local|static" specifier defaults to "local", or "host-only", so dev_uc_add() is a reasonable default implementation). If the generic implementation of .ndo_fdb_add provided by Vlad Yasevich is a proof of anything, it only proves that the implementation provided by DSA was always wrong, by not looking at "ndm->ndm_state & NUD_NOARP" (the "static" flag which means that the FDB entry points outwards) and "ndm->ndm_state & NUD_PERMANENT" (the "local" flag which means that the FDB entry points towards the host). It all used to mean the same thing to DSA. Update the documentation so that the users are not confused about what's going on. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: dsa: replay a deletion of switchdev objects for ports leaving a bridged LAGVladimir Oltean1-0/+55
When a DSA switch port leaves a bonding interface that is under a bridge, there might be dangling switchdev objects on that port left behind, because the bridge is not aware that its lower interface (the bond) changed state in any way. Call the bridge replay helpers with adding=false before changing dp->bridge_dev to NULL, because we need to simulate to dsa_slave_port_obj_del() that these notifications were emitted by the bridge. We add this hook to the NETDEV_PRECHANGEUPPER event handler, because we are calling into switchdev (and the __switchdev_handle_port_obj_del fanout helpers expect the upper/lower adjacency lists to still be valid) and PRECHANGEUPPER is the last moment in time when they still are. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: dsa: refactor the prechangeupper sanity checks into a dedicated functionVladimir Oltean1-15/+29
We need to add more logic to the DSA NETDEV_PRECHANGEUPPER event handler, more exactly we need to request an unsync of switchdev objects. In order to fit more code, refactor the existing logic into a helper. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: bridge: ignore switchdev events for LAG ports which didn't request replayVladimir Oltean1-0/+9
There is a slight inconvenience in the switchdev replay helpers added recently, and this is when: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 bridge vlan add dev bond0 vid 100 ip link set swp0 master bond0 ip link set swp1 master bond0 Since the underlying driver (currently only DSA) asks for a replay of VLANs when swp0 and swp1 join the LAG because it is bridged, what will happen is that DSA will try to react twice on the VLAN event for swp0. This is not really a huge problem right now, because most drivers accept duplicates since the bridge itself does, but it will become a problem when we add support for replaying switchdev object deletions. Let's fix this by adding a blank void *ctx in the replay helpers, which will be passed on by the bridge in the switchdev notifications. If the context is NULL, everything is the same as before. But if the context is populated with a valid pointer, the underlying switchdev driver (currently DSA) can use the pointer to 'see through' the bridge port (which in the example above is bond0) and 'know' that the event is only for a particular physical port offloading that bridge port, and not for all of them. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28net: switchdev: add a context void pointer to struct switchdev_notifier_infoVladimir Oltean1-3/+3
In the case where the driver asks for a replay of a certain type of event (port object or attribute) for a bridge port that is a LAG, it may do so because this port has just joined the LAG. But there might already be other switchdev ports in that LAG, and it is preferable that those preexisting switchdev ports do not act upon the replayed event. The solution is to add a context to switchdev events, which is NULL most of the time (when the bridge layer initiates the call) but which can be set to a value controlled by the switchdev driver when a replay is requested. The driver can then check the context to figure out if all ports within the LAG should act upon the switchdev event, or just the ones that match the context. We have to modify all switchdev_handle_* helper functions as well as the prototypes in the drivers that use these helpers too, because these helpers hide the underlying struct switchdev_notifier_info from us and there is no way to retrieve the context otherwise. The context structure will be populated and used in later patches. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: dsa: targeted MTU notifiers should only match on one portVladimir Oltean1-4/+5
dsa_slave_change_mtu() calls dsa_port_mtu_change() twice: - it sends a cross-chip notifier with the MTU of the CPU port which is used to update the DSA links. - it sends one targeted MTU notifier which is supposed to only match the user port on which we are changing the MTU. The "propagate_upstream" variable is used here to bypass the cross-chip notifier system from switch.c But due to a mistake, the second, targeted notifier matches not only on the user port, but also on the DSA link which is a member of the same switch, if that exists. And because the DSA links of the entire dst were programmed in a previous round to the largest_mtu via a "propagate_upstream == true" notification, then the dsa_port_mtu_change(propagate_upstream == false) call that is immediately upcoming will break the MTU on the one DSA link which is chip-wise local to the dp whose MTU is changing right now. Example given this daisy chain topology: sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ cpu ] [ user ] [ user ] [ dsa ] [ user ] [ x ] [ ] [ ] [ x ] [ ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] [ ] [ ] [ ] [ ] [ x ] ip link set sw0p1 mtu 9000 ip link set sw1p1 mtu 9000 # at this stage, sw0p1 and sw1p1 can talk # to one another using jumbo frames ip link set sw0p2 mtu 1500 # this programs the sw0p3 DSA link first to # the largest_mtu of 9000, then reprograms it to # 1500 with the "propagate_upstream == false" # notifier, breaking communication between # sw0p1 and sw1p1 To escape from this situation, make the targeted match really match on a single port - the user port, and rename the "propagate_upstream" variable to "targeted_match" to clarify the intention and avoid future issues. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21net: dsa: calculate the largest_mtu across all ports in the treeVladimir Oltean1-6/+7
If we have a cross-chip topology like this: sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 [ cpu ] [ user ] [ user ] [ dsa ] [ user ] | +---------+ | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 [ user ] [ user ] [ user ] [ dsa ] [ dsa ] and we issue the following commands: 1. ip link set sw0p1 mtu 1700 2. ip link set sw1p1 mtu 1600 we notice the following happening: Command 1. emits a non-targeted MTU notifier for the CPU port (sw0p0) with the largest_mtu calculated across switch 0, of 1700. This matches sw0p0, sw0p3 and sw1p4 (all CPU ports and DSA links). Then, it emits a targeted MTU notifier for the user port (sw0p1), again with MTU 1700 (this doesn't matter). Command 2. emits a non-targeted MTU notifier for the CPU port (sw0p0) with the largest_mtu calculated across switch 1, of 1600. This matches the same group of ports as above, and decreases the MTU for the CPU port and the DSA links from 1700 to 1600. As a result, the sw0p1 user port can no longer communicate with its CPU port at MTU 1700. To address this, we should calculate the largest_mtu across all switches that may share a CPU port, and only emit MTU notifiers with that value. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14net: dsa: dsa_slave_phy_connect(): extend phy's flags with port specific phy ↵Oleksij Rempel1-2/+5
flags The current get_phy_flags() is only processed when we connect to a PHY via a designed phy-handle property via phylink_of_phy_connect(), but if we fallback on the internal MDIO bus created by a switch and take the dsa_slave_phy_connect() path then we would not be processing that flag and using it at PHY connection time. Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11net: dsa: generalize overhead for taggers that use both headers and trailersVladimir Oltean1-6/+4
Some really really weird switches just couldn't decide whether to use a normal or a tail tagger, so they just did both. This creates problems for DSA, because we only have the concept of an 'overhead' which can be applied to the headroom or to the tailroom of the skb (like for example during the central TX reallocation procedure), depending on the value of bool tail_tag, but not to both. We need to generalize DSA to cater for these odd switches by transforming the 'overhead / tail_tag' pair into 'needed_headroom / needed_tailroom'. The DSA master's MTU is increased to account for both. The flow dissector code is modified such that it only calls the DSA adjustment callback if the tagger has a non-zero header length. Taggers are trivially modified to declare either needed_headroom or needed_tailroom, based on the tail_tag value that they currently declare. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-10net: dsa: fix error code getting shifted with 4 in dsa_slave_get_sset_countVladimir Oltean1-5/+7
DSA implements a bunch of 'standardized' ethtool statistics counters, namely tx_packets, tx_bytes, rx_packets, rx_bytes. So whatever the hardware driver returns in .get_sset_count(), we need to add 4 to that. That is ok, except that .get_sset_count() can return a negative error code, for example: b53_get_sset_count -> phy_ethtool_get_sset_count -> return -EIO -EIO is -5, and with 4 added to it, it becomes -1, aka -EPERM. One can imagine that certain error codes may even become positive, although based on code inspection I did not see instances of that. Check the error code first, if it is negative return it as-is. Based on a similar patch for dsa_master_get_strings from Dan Carpenter: https://patchwork.kernel.org/project/netdevbpf/patch/YJaSe3RPgn7gKxZv@mwanda/ Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27net: dsa: free skb->cb usage in core driverYangbo Lu1-1/+1
Free skb->cb usage in core driver and let device drivers decide to use or not. The reason having a DSA_SKB_CB(skb)->clone was because dsa_skb_tx_timestamp() which may set the clone pointer was called before p->xmit() which would use the clone if any, and the device driver has no way to initialize the clone pointer. This patch just put memset(skb->cb, 0, sizeof(skb->cb)) at beginning of dsa_slave_xmit(). Some new features in the future, like one-step timestamp may need more bytes of skb->cb to use in dsa_skb_tx_timestamp(), and p->xmit(). Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27net: dsa: no longer clone skb in core driverYangbo Lu1-11/+1
It was a waste to clone skb directly in dsa_skb_tx_timestamp(). For one-step timestamping, a clone was not needed. For any failure of port_txtstamp (this may usually happen), the skb clone had to be freed. So this patch moves skb cloning for tx timestamp out of dsa core, and let drivers clone skb in port_txtstamp if they really need. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27net: dsa: no longer identify PTP packet in core driverYangbo Lu1-10/+2
Move ptp_classify_raw out of dsa core driver for handling tx timestamp request. Let device drivers do this if they want. Not all drivers want to limit tx timestamping for only PTP packet. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-27net: dsa: check tx timestamp request in core driverYangbo Lu1-0/+3
Check tx timestamp request in core driver at very beginning of dsa_skb_tx_timestamp(), so that most skbs not requiring tx timestamp just return. And drop such checking in device drivers. Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20net: dsa: enable selftest support for all switches by defaultOleksij Rempel1-0/+21
Most of generic selftest should be able to work with probably all ethernet controllers. The DSA switches are not exception, so enable it by default at least for DSA. This patch was tested with SJA1105 and AR9331. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16net: bridge: switchdev: include local flag in FDB notificationsVladimir Oltean1-1/+1
As explained in bugfix commit 6ab4c3117aec ("net: bridge: don't notify switchdev for local FDB addresses") as well as in this discussion: https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/ the switchdev notifiers for FDB entries managed to have a zero-day bug, which was that drivers would not know what to do with local FDB entries, because they were not told that they are local. The bug fix was to simply not notify them of those addresses. Let us now add the 'is_local' bit to bridge FDB entries, and make all drivers ignore these entries by their own choice. Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13of: net: pass the dst buffer to of_get_mac_address()Michael Walle1-1/+1
of_get_mac_address() returns a "const void*" pointer to a MAC address. Lately, support to fetch the MAC address by an NVMEM provider was added. But this will only work with platform devices. It will not work with PCI devices (e.g. of an integrated root complex) and esp. not with DSA ports. There is an of_* variant of the nvmem binding which works without devices. The returned data of a nvmem_cell_read() has to be freed after use. On the other hand the return of_get_mac_address() points to some static data without a lifetime. The trick for now, was to allocate a device resource managed buffer which is then returned. This will only work if we have an actual device. Change it, so that the caller of of_get_mac_address() has to supply a buffer where the MAC address is written to. Unfortunately, this will touch all drivers which use the of_get_mac_address(). Usually the code looks like: const char *addr; addr = of_get_mac_address(np); if (!IS_ERR(addr)) ether_addr_copy(ndev->dev_addr, addr); This can then be simply rewritten as: of_get_mac_address(np, ndev->dev_addr); Sometimes is_valid_ether_addr() is used to test the MAC address. of_get_mac_address() already makes sure, it just returns a valid MAC address. Thus we can just test its return code. But we have to be careful if there are still other sources for the MAC address before the of_get_mac_address(). In this case we have to keep the is_valid_ether_addr() call. The following coccinelle patch was used to convert common cases to the new style. Afterwards, I've manually gone over the drivers and fixed the return code variable: either used a new one or if one was already available use that. Mansour Moufid, thanks for that coccinelle patch! <spml> @a@ identifier x; expression y, z; @@ - x = of_get_mac_address(y); + x = of_get_mac_address(y, z); <... - ether_addr_copy(z, x); ...> @@ identifier a.x; @@ - if (<+... x ...+>) {} @@ identifier a.x; @@ if (<+... x ...+>) { ... } - else {} @@ identifier a.x; expression e; @@ - if (<+... x ...+>@e) - {} - else + if (!(e)) {...} @@ expression x, y, z; @@ - x = of_get_mac_address(y, z); + of_get_mac_address(y, z); ... when != x </spml> All drivers, except drivers/net/ethernet/aeroflex/greth.c, were compile-time tested. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24dsa: slave: add support for TC_SETUP_FTPablo Neira Ayuso1-1/+19
The dsa infrastructure provides a well-defined hierarchy of devices, pass up the call to set up the flow block to the master device. From the software dataplane, the netfilter infrastructure uses the dsa slave devices to refer to the input and output device for the given skbuff. Similarly, the flowtable definition in the ruleset refers to the dsa slave port devices. This patch adds the glue code to call ndo_setup_tc with TC_SETUP_FT with the master device via the dsa slave devices. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24net: dsa: resolve forwarding path for dsa slave portsFelix Fietkau1-0/+16
Add .ndo_fill_forward_path for dsa slave port devices Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: dsa: sync up switchdev objects and port attributes when joining the bridgeVladimir Oltean1-2/+2
If we join an already-created bridge port, such as a bond master interface, then we can miss the initial switchdev notifications emitted by the bridge for this port, while it wasn't offloaded by anybody. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23net: dsa: pass extack to dsa_port_{bridge,lag}_joinVladimir Oltean1-2/+5
This is a pretty noisy change that was broken out of the larger change for replaying switchdev attributes and objects at bridge join time, which is when these extack objects are actually used. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-08net: dsa: fix switchdev objects on bridge master mistakenly being applied on ↵Vladimir Oltean1-16/+43
ports Tobias reports that after the blamed patch, VLAN objects being added to a bridge device are being added to all slave ports instead (swp2, swp3). ip link add br0 type bridge vlan_filtering 1 ip link set swp2 master br0 ip link set swp3 master br0 bridge vlan add dev br0 vid 100 self This is because the fix was too broad: we made dsa_port_offloads_netdev say "yes, I offload the br0 bridge" for all slave ports, but we didn't add the checks whether the switchdev object was in fact meant for the physical port or for the bridge itself. So we are reacting on events in a way in which we shouldn't. The reason why the fix was too broad is because the question itself, "does this DSA port offload this netdev", was too broad in the first place. The solution is to disambiguate the question and separate it into two different functions, one to be called for each switchdev attribute / object that has an orig_dev == net_bridge (dsa_port_offloads_bridge), and the other for orig_dev == net_bridge_port (*_offloads_bridge_port). In the case of VLAN objects on the bridge interface, this solves the problem because we know that VLAN objects are per bridge port and not per bridge. And when orig_dev is equal to the net_bridge, we offload it as a bridge, but not as a bridge port; that's how we are able to skip reacting on those events. Note that this is compatible with future plans to have explicit offloading of VLAN objects on the bridge interface as a bridge port (in DSA, this signifies that we should add that VLAN towards the CPU port). Fixes: 99b8202b179f ("net: dsa: fix SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING getting ignored") Reported-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Tested-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16net: dsa: add MRP supportHoratiu Vultur1-0/+22
Add support for offloading MRP in HW. Currently implement the switchdev calls 'SWITCHDEV_OBJ_ID_MRP', 'SWITCHDEV_OBJ_ID_RING_ROLE_MRP', to allow to create MRP instances and to set the role of these instances. Add DSA_NOTIFIER_MRP_ADD/DEL and DSA_NOTIFIER_MRP_ADD/DEL_RING_ROLE which calls to .port_mrp_add/del and .port_mrp_add/del_ring_role in the DSA driver for the switch. Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14net: dsa: propagate extack to .port_vlan_filteringVladimir Oltean1-1/+2
Some drivers can't dynamically change the VLAN filtering option, or impose some restrictions, it would be nice to propagate this info through netlink instead of printing it to a kernel log that might never be read. Also netlink extack includes the module that emitted the message, which means that it's easier to figure out which ones are driver-generated errors as opposed to command misuse. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14net: dsa: propagate extack to .port_vlan_addVladimir Oltean1-7/+18
Allow drivers to communicate their restrictions to user space directly, instead of printing to the kernel log. Where the conversion would have been lossy and things like VLAN ID could no longer be conveyed (due to the lack of support for printf format specifier in netlink extack), I chose to keep the messages in full form to the kernel log only, and leave it up to individual driver maintainers to move more messages to extack. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: dsa: act as passthrough for bridge port flagsVladimir Oltean1-3/+4
There are multiple ways in which a PORT_BRIDGE_FLAGS attribute can be expressed by the bridge through switchdev, and not all of them can be emulated by DSA mid-layer API at the same time. One possible configuration is when the bridge offloads the port flags using a mask that has a single bit set - therefore only one feature should change. However, DSA currently groups together unicast and multicast flooding in the .port_egress_floods method, which limits our options when we try to add support for turning off broadcast flooding: do we extend .port_egress_floods with a third parameter which b53 and mv88e6xxx will ignore? But that means that the DSA layer, which currently implements the PRE_BRIDGE_FLAGS attribute all by itself, will see that .port_egress_floods is implemented, and will report that all 3 types of flooding are supported - not necessarily true. Another configuration is when the user specifies more than one flag at the same time, in the same netlink message. If we were to create one individual function per offloadable bridge port flag, we would limit the expressiveness of the switch driver of refusing certain combinations of flag values. For example, a switch may not have an explicit knob for flooding of unknown multicast, just for flooding in general. In that case, the only correct thing to do is to allow changes to BR_FLOOD and BR_MCAST_FLOOD in tandem, and never allow mismatched values. But having a separate .port_set_unicast_flood and .port_set_multicast_flood would not allow the driver to possibly reject that. Also, DSA doesn't consider it necessary to inform the driver that a SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute was offloaded, because it just calls .port_egress_floods for the CPU port. When we'll add support for the plain SWITCHDEV_ATTR_ID_PORT_MROUTER, that will become a real problem because the flood settings will need to be held statefully in the DSA middle layer, otherwise changing the mrouter port attribute will impact the flooding attribute. And that's _assuming_ that the underlying hardware doesn't have anything else to do when a multicast router attaches to a port than flood unknown traffic to it. If it does, there will need to be a dedicated .port_set_mrouter anyway. So we need to let the DSA drivers see the exact form that the bridge passes this switchdev attribute in, otherwise we are standing in the way. Therefore we also need to use this form of language when communicating to the driver that it needs to configure its initial (before bridge join) and final (after bridge leave) port flags. The b53 and mv88e6xxx drivers are converted to the passthrough API and their implementation of .port_egress_floods is split into two: a function that configures unicast flooding and another for multicast. The mv88e6xxx implementation is quite hairy, and it turns out that the implementations of unknown unicast flooding are actually the same for 6185 and for 6352: behind the confusing names actually lie two individual bits: NO_UNKNOWN_MC -> FLOOD_UC = 0x4 = BIT(2) NO_UNKNOWN_UC -> FLOOD_MC = 0x8 = BIT(3) so there was no reason to entangle them in the first place. Whereas the 6185 writes to MV88E6185_PORT_CTL0_FORWARD_UNKNOWN of PORT_CTL0, which has the exact same bit index. I have left the implementations separate though, for the only reason that the names are different enough to confuse me, since I am not able to double-check with a user manual. The multicast flooding setting for 6185 is in a different register than for 6352 though. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12net: switchdev: propagate extack to port attributesVladimir Oltean1-1/+2
When a struct switchdev_attr is notified through switchdev, there is no way to report informational messages, unlike for struct switchdev_obj. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-11net: dsa: add support for offloading HSRGeorge McCollister1-0/+14
Add support for offloading of HSR/PRP (IEC 62439-3) tag insertion tag removal, duplicate generation and forwarding on DSA switches. Add DSA_NOTIFIER_HSR_JOIN and DSA_NOTIFIER_HSR_LEAVE which trigger calls to .port_hsr_join and .port_hsr_leave in the DSA driver for the switch. The DSA switch driver should then set netdev feature flags for the HSR/PRP operation that it offloads. NETIF_F_HW_HSR_TAG_INS NETIF_F_HW_HSR_TAG_RM NETIF_F_HW_HSR_FWD NETIF_F_HW_HSR_DUP Signed-off-by: George McCollister <george.mccollister@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-06net: dsa: make assisted_learning_on_cpu_port bypass offloaded LAG interfacesVladimir Oltean1-0/+8
Given the following topology, and focusing only on Box A: Box A +----------------------------------+ | Board 1 br0 | | +---------+ | | / \ | | | | | | | bond0 | | | +-----+ | |192.168.1.1 | / \ | | eno0 swp0 swp1 swp2 | +---|--------|-------|-------|-----+ | | | | +--------+ | | Cable | | Cable| |Cable Cable | | +--------+ | | | | | | +---|--------|-------|-------|-----+ | eno0 swp0 swp1 swp2 | |192.168.1.2 | \ / | | | +-----+ | | | bond0 | | | | | | \ / | | +---------+ | | Board 2 br0 | +----------------------------------+ Box B The assisted_learning_on_cpu_port logic will see that swp0 is bridged with a "foreign interface" (bond0) and will therefore install all addresses learnt by the software bridge towards bond0 (including the address of eno0 on Box B) as static addresses towards the CPU port. But that's not what we want - bond0 is not really a "foreign interface" but one we can offload including L2 forwarding from/towards it. So we need to refine our logic for assisted learning such that, whenever we see an address learnt on a non-DSA interface, we search through the tree for any port that offloads that non-DSA interface. Some confusion might arise as to why we search through the whole tree instead of just the local switch returned by dsa_slave_dev_lower_find. Or a different angle of the same confusion: why does dsa_slave_dev_lower_find(br_dev) return a single dp that's under br_dev instead of the whole list of bridged DSA ports? To answer the second question, it should be enough to install the static FDB entry on the CPU port of a single switch in the tree, because dsa_port_fdb_add uses DSA_NOTIFIER_FDB_ADD which ensures that all other switches in the tree get notified of that address, and add the entry themselves using dsa_towards_port(). This should help understand the answer to the first question: the port returned by dsa_slave_dev_lower_find may not be on the same switch as the ports that offload the LAG. Nonetheless, if the driver implements .crosschip_lag_join and .crosschip_bridge_join as mv88e6xxx does, there still isn't any reason for trapping addresses learnt on the remote LAG towards the CPU, and we should prevent that. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: dsa: automatically bring user ports down when master goes downVladimir Oltean1-0/+24
This is not fixing any actual bug that I know of, but having a DSA interface that is up even when its lower (master) interface is down is one of those things that just do not sound right. Yes, DSA checks if the master is up before actually bringing the user interface up, but nobody prevents bringing the master interface down immediately afterwards... Then the user ports would attempt dev_queue_xmit on an interface that is down, and wonder what's wrong. This patch prevents that from happening. NETDEV_GOING_DOWN is the notification emitted _before_ the master actually goes down, and we are protected by the rtnl_mutex, so all is well. For those of you reading this because you were doing switch testing such as latency measurements for autonomously forwarded traffic, and you needed a controlled environment with no extra packets sent by the network stack, this patch breaks that, because now the user ports go down too, which may shut down the PHY etc. But please don't do it like that, just do instead: tc qdisc add dev eno2 clsact tc filter add dev eno2 egress flower action drop Tested with two cascaded DSA switches: $ ip link set eno2 down sja1105 spi2.0 sw0p2: Link is Down mscc_felix 0000:00:00.5 swp0: Link is Down fsl_enetc 0000:00:00.2 eno2: Link is Down Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-06net: dsa: automatically bring up DSA master when opening user portVladimir Oltean1-2/+5
DSA wants the master interface to be open before the user port is due to historical reasons. The promiscuity of interfaces that are down used to have issues, as referenced Lennert Buytenhek in commit df02c6ff2e39 ("dsa: fix master interface allmulti/promisc handling"). The bugfix mentioned there, commit b6c40d68ff64 ("net: only invoke dev->change_rx_flags when device is UP"), was basically a "don't do that" approach to working around the promiscuity while down issue. Further work done by Vlad Yasevich in commit d2615bf45069 ("net: core: Always propagate flag changes to interfaces") has resolved the underlying issue, and it is strictly up to the DSA and 8021q drivers now, it is no longer mandated by the networking core that the master interface must be up when changing its promiscuity. From DSA's point of view, deciding to error out in dsa_slave_open because the master isn't up is (a) a bad user experience and (b) knocking at an open door. Even if there still was an issue with promiscuity while down, DSA could still just open the master and avoid it. Doing it this way has the additional benefit that user space can now remove DSA-specific workarounds, like systemd-networkd with BindCarrier: https://github.com/systemd/systemd/issues/7478 And we can finally remove one of the 2 bullets in the "Common pitfalls using DSA setups" chapter. Tested with two cascaded DSA switches: $ ip link set sw0p2 up fsl_enetc 0000:00:00.2 eno2: configuring for fixed/internal link mode fsl_enetc 0000:00:00.2 eno2: Link is Up - 1Gbps/Full - flow control rx/tx mscc_felix 0000:00:00.5 swp0: configuring for fixed/sgmii link mode mscc_felix 0000:00:00.5 swp0: Link is Up - 1Gbps/Full - flow control off 8021q: adding VLAN 0 to HW filter on device swp0 sja1105 spi2.0 sw0p2: configuring for phy/rgmii-id link mode IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready IPv6: ADDRCONF(NETDEV_CHANGE): swp0: link becomes ready Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: allow changing the tag protocol via the "tagging" device attributeVladimir Oltean1-12/+23
Currently DSA exposes the following sysfs: $ cat /sys/class/net/eno2/dsa/tagging ocelot which is a read-only device attribute, introduced in the kernel as commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"), and used by libpcap since its commit 993db3800d7d ("Add support for DSA link-layer types"). It would be nice if we could extend this device attribute by making it writable: $ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging This is useful with DSA switches that can make use of more than one tagging protocol. It may be useful in dsa_loop in the future too, to perform offline testing of various taggers, or for changing between dsa and edsa on Marvell switches, if that is desirable. In terms of implementation, drivers can support this feature by implementing .change_tag_protocol, which should always leave the switch in a consistent state: either with the new protocol if things went well, or with the old one if something failed. Teardown of the old protocol, if necessary, must be handled by the driver. Some things remain as before: - The .get_tag_protocol is currently only called at probe time, to load the initial tagging protocol driver. Nonetheless, new drivers should report the tagging protocol in current use now. - The driver should manage by itself the initial setup of tagging protocol, no later than the .setup() method, as well as destroying resources used by the last tagger in use, no earlier than the .teardown() method. For multi-switch DSA trees, error handling is a bit more complicated, since e.g. the 5th out of 7 switches may fail to change the tag protocol. When that happens, a revert to the original tag protocol is attempted, but that may fail too, leaving the tree in an inconsistent state despite each individual switch implementing .change_tag_protocol transactionally. Since the intersection between drivers that implement .change_tag_protocol and drivers that support D in DSA is currently the empty set, the possibility for this error to happen is ignored for now. Testing: $ insmod mscc_felix.ko [ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14 [ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517 $ insmod tag_ocelot.ko $ rmmod mscc_felix.ko $ insmod mscc_felix.ko [ 97.261724] libphy: VSC9959 internal MDIO bus: probed [ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0 [ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1 [ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2 [ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3 [ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode [ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0 [ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode [ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1 [ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode [ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx [ 98.527948] device eno2 entered promiscuous mode [ 98.544755] DSA: tree 0 setup $ ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1): 56 data bytes 64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms 64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms ^C - 10.0.0.1 ping statistics - 2 packets transmitted, 2 packets received, 0% packet loss round-trip min/avg/max = 0.754/1.545/2.337 ms $ cat /sys/class/net/eno2/dsa/tagging ocelot $ cat ./test_ocelot_8021q.sh #!/bin/bash ip link set swp0 down ip link set swp1 down ip link set swp2 down ip link set swp3 down ip link set swp5 down ip link set eno2 down echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging ip link set eno2 up ip link set swp0 up ip link set swp1 up ip link set swp2 up ip link set swp3 up ip link set swp5 up $ ./test_ocelot_8021q.sh ./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available $ rmmod tag_ocelot.ko rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable $ insmod tag_ocelot_8021q.ko $ ./test_ocelot_8021q.sh $ cat /sys/class/net/eno2/dsa/tagging ocelot-8021q $ rmmod tag_ocelot.ko $ rmmod tag_ocelot_8021q.ko rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable $ ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1): 56 data bytes 64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms 64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms 64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms $ rmmod mscc_felix.ko [ 645.544426] mscc_felix 0000:00:00.5: Link is Down [ 645.838608] DSA: tree 0 torn down $ rmmod tag_ocelot_8021q.ko Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15net: dsa: set configure_vlan_while_not_filtering to true by defaultVladimir Oltean1-3/+6
As explained in commit 54a0ed0df496 ("net: dsa: provide an option for drivers to always receive bridge VLANs"), DSA has historically been skipping VLAN switchdev operations when the bridge wasn't in vlan_filtering mode, but the reason why it was doing that has never been clear. So the configure_vlan_while_not_filtering option is there merely to preserve functionality for existing drivers. It isn't some behavior that drivers should opt into. Ideally, when all drivers leave this flag set, we can delete the dsa_port_skip_vlan_configuration() function. New drivers always seem to omit setting this flag, for some reason. So let's reverse the logic: the DSA core sets it by default to true before the .setup() callback, and legacy drivers can turn it off. This way, new drivers get the new behavior by default, unless they explicitly set the flag to false, which is more obvious during review. Remove the assignment from drivers which were setting it to true, and add the assignment to false for the drivers that didn't previously have it. This way, it should be easier to see how many we have left. The following drivers: lan9303, mv88e6060 were skipped from setting this flag to false, because they didn't have any VLAN offload ops in the first place. The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and therefore also inherits the configure_vlan_while_not_filtering=true behavior. Also, print a message through netlink extack every time a VLAN has been skipped. This is mildly annoying on purpose, so that (a) it is at least clear that VLANs are being skipped - the legacy behavior in itself is confusing, and the extack should be much more difficult to miss, unlike kernel logs - and (b) people have one more incentive to convert to the new behavior. No behavior change except for the added prints is intended at this time. $ ip link add br0 type bridge vlan_filtering 0 $ ip link set sw0p2 master br0 [ 60.315148] br0: port 1(sw0p2) entered blocking state [ 60.320350] br0: port 1(sw0p2) entered disabled state [ 60.327839] device sw0p2 entered promiscuous mode [ 60.334905] br0: port 1(sw0p2) entered blocking state [ 60.340142] br0: port 1(sw0p2) entered forwarding state Warning: dsa_core: skipping configuration of VLAN. # This was the pvid $ bridge vlan add dev sw0p2 vid 100 Warning: dsa_core: skipping configuration of VLAN. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14net: dsa: Link aggregation supportTobias Waldekranz1-7/+63
Monitor the following events and notify the driver when: - A DSA port joins/leaves a LAG. - A LAG, made up of DSA ports, joins/leaves a bridge. - A DSA port in a LAG is enabled/disabled (enabled meaning "distributing" in 802.3ad LACP terms). When a LAG joins a bridge, the DSA subsystem will treat that as each individual port joining the bridge. The driver may look at the port's LAG device pointer to see if it is associated with any LAG, if that is required. This is analogue to how switchdev events are replicated out to all lower devices when reaching e.g. a LAG. Drivers can optionally request that DSA maintain a linear mapping from a LAG ID to the corresponding netdev by setting ds->num_lag_ids to the desired size. In the event that the hardware is not capable of offloading a particular LAG for any reason (the typical case being use of exotic modes like broadcast), DSA will take a hands-off approach, allowing the LAG to be formed as a pure software construct. This is reported back through the extended ACK, but is otherwise transparent to the user. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14net: dsa: Don't offload port attributes on standalone portsTobias Waldekranz1-0/+3
In a situation where a standalone port is indirectly attached to a bridge (e.g. via a LAG) which is not offloaded, do not offload any port attributes either. The port should behave as a standard NIC. Previously, on mv88e6xxx, this meant that in the following setup: br0 / team0 / \ swp0 swp1 If vlan filtering was enabled on br0, swp0's and swp1's QMode was set to "secure". This caused all untagged packets to be dropped, as their default VID (0) was not loaded into the VTU. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12net: dsa: add optional stats64 supportOleksij Rempel1-1/+13
Allow DSA drivers to export stats64 Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11net: dsa: remove obsolete comments about switchdev transactionsVladimir Oltean1-5/+0
Now that all port object notifiers were converted to be non-transactional, we can remove the comments that say otherwise. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11net: switchdev: remove the transaction structure from port attributesVladimir Oltean1-10/+7
Since the introduction of the switchdev API, port attributes were transmitted to drivers for offloading using a two-step transactional model, with a prepare phase that was supposed to catch all errors, and a commit phase that was supposed to never fail. Some classes of failures can never be avoided, like hardware access, or memory allocation. In the latter case, merely attempting to move the memory allocation to the preparation phase makes it impossible to avoid memory leaks, since commit 91cf8eceffc1 ("switchdev: Remove unused transaction item queue") which has removed the unused mechanism of passing on the allocated memory between one phase and another. It is time we admit that separating the preparation from the commit phase is something that is best left for the driver to decide, and not something that should be baked into the API, especially since there are no switchdev callers that depend on this. This patch removes the struct switchdev_trans member from switchdev port attribute notifier structures, and converts drivers to not look at this member. In part, this patch contains a revert of my previous commit 2e554a7a5d8a ("net: dsa: propagate switchdev vlan_filtering prepare phase to drivers"). For the most part, the conversion was trivial except for: - Rocker's world implementation based on Broadcom OF-DPA had an odd implementation of ofdpa_port_attr_bridge_flags_set. The conversion was done mechanically, by pasting the implementation twice, then only keeping the code that would get executed during prepare phase on top, then only keeping the code that gets executed during the commit phase on bottom, then simplifying the resulting code until this was obtained. - DSA's offloading of STP state, bridge flags, VLAN filtering and multicast router could be converted right away. But the ageing time could not, so a shim was introduced and this was left for a further commit. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11net: switchdev: remove the transaction structure from port object notifiersVladimir Oltean1-25/+9
Since the introduction of the switchdev API, port objects were transmitted to drivers for offloading using a two-step transactional model, with a prepare phase that was supposed to catch all errors, and a commit phase that was supposed to never fail. Some classes of failures can never be avoided, like hardware access, or memory allocation. In the latter case, merely attempting to move the memory allocation to the preparation phase makes it impossible to avoid memory leaks, since commit 91cf8eceffc1 ("switchdev: Remove unused transaction item queue") which has removed the unused mechanism of passing on the allocated memory between one phase and another. It is time we admit that separating the preparation from the commit phase is something that is best left for the driver to decide, and not something that should be baked into the API, especially since there are no switchdev callers that depend on this. This patch removes the struct switchdev_trans member from switchdev port object notifier structures, and converts drivers to not look at this member. Where driver conversion is trivial (like in the case of the Marvell Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is done in this patch. Where driver conversion needs more attention (DSA, Mellanox Spectrum), the conversion is left for subsequent patches and here we only fake the prepare/commit phases at a lower level, just not in the switchdev notifier itself. Where the code has a natural structure that is best left alone as a preparation and a commit phase (as in the case of the Ocelot switch), that structure is left in place, just made to not depend upon the switchdev transactional model. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11net: switchdev: remove vid_begin -> vid_end range from VLAN objectsVladimir Oltean1-16/+7
The call path of a switchdev VLAN addition to the bridge looks something like this today: nbp_vlan_init | __br_vlan_set_default_pvid | | | | | br_afspec | | | | | | | v | | | br_process_vlan_info | | | | | | | v | | | br_vlan_info | | | / \ / | | / \ / | | / \ / | | / \ / v v v v v nbp_vlan_add br_vlan_add ------+ | ^ ^ | | | / | | | | / / / | \ br_vlan_get_master/ / v \ ^ / / br_vlan_add_existing \ | / / | \ | / / / \ | / / / \ | / / / \ | / / / v | | v / __vlan_add / / | / / | / v | / __vlan_vid_add | / \ | / v v v br_switchdev_port_vlan_add The ranges UAPI was introduced to the bridge in commit bdced7ef7838 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec) have always been passed one by one, through struct bridge_vlan_info tmp_vinfo, to br_vlan_info. So the range never went too far in depth. Then Scott Feldman introduced the switchdev_port_bridge_setlink function in commit 47f8328bb1a4 ("switchdev: add new switchdev bridge setlink"). That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made full use of the range. But switchdev_port_bridge_setlink was called like this: br_setlink -> br_afspec -> switchdev_port_bridge_setlink Basically, the switchdev and the bridge code were not tightly integrated. Then commit 41c498b9359e ("bridge: restore br_setlink back to original") came, and switchdev drivers were required to implement .ndo_bridge_setlink = switchdev_port_bridge_setlink for a while. In the meantime, commits such as 0944d6b5a2fa ("bridge: try switchdev op first in __vlan_vid_add/del") finally made switchdev penetrate the br_vlan_info() barrier and start to develop the call path we have today. But remember, br_vlan_info() still receives VLANs one by one. Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit 29ab586c3d83 ("net: switchdev: Remove bridge bypass support from switchdev") so that drivers would not implement .ndo_bridge_setlink any longer. The switchdev_port_bridge_setlink also got deleted. This refactoring removed the parallel bridge_setlink implementation from switchdev, and left the only switchdev VLAN objects to be the ones offloaded from __vlan_vid_add (basically RX filtering) and __vlan_add (the latter coming from commit 9c86ce2c1ae3 ("net: bridge: Notify about bridge VLANs")). That is to say, today the switchdev VLAN object ranges are not used in the kernel. Refactoring the above call path is a bit complicated, when the bridge VLAN call path is already a bit complicated. Let's go off and finish the job of commit 29ab586c3d83 by deleting the bogus iteration through the VLAN ranges from the drivers. Some aspects of this feature never made too much sense in the first place. For example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID flag supposed to mean, when a port can obviously have a single pvid? This particular configuration _is_ denied as of commit 6623c60dc28e ("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API perspective, the driver still has to play pretend, and only offload the vlan->vid_end as pvid. And the addition of a switchdev VLAN object can modify the flags of another, completely unrelated, switchdev VLAN object! (a VLAN that is PVID will invalidate the PVID flag from whatever other VLAN had previously been offloaded with switchdev and had that flag. Yet switchdev never notifies about that change, drivers are supposed to guess). Nonetheless, having a VLAN range in the API makes error handling look scarier than it really is - unwinding on errors and all of that. When in reality, no one really calls this API with more than one VLAN. It is all unnecessary complexity. And despite appearing pretentious (two-phase transactional model and all), the switchdev API is really sloppy because the VLAN addition and removal operations are not paired with one another (you can add a VLAN 100 times and delete it just once). The bridge notifies through switchdev of a VLAN addition not only when the flags of an existing VLAN change, but also when nothing changes. There are switchdev drivers out there who don't like adding a VLAN that has already been added, and those checks don't really belong at driver level. But the fact that the API contains ranges is yet another factor that prevents this from being addressed in the future. Of the existing switchdev pieces of hardware, it appears that only Mellanox Spectrum supports offloading more than one VLAN at a time, through mlxsw_sp_port_vlan_set. I have kept that code internal to the driver, because there is some more bookkeeping that makes use of it, but I deleted it from the switchdev API. But since the switchdev support for ranges has already been de facto deleted by a Mellanox employee and nobody noticed for 4 years, I'm going to assume it's not a biggie. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09net: dsa: dsa_legacy_fdb_{add,del} can be staticVladimir Oltean1-8/+8
Introduced in commit 37b8da1a3c68 ("net: dsa: Move FDB add/del implementation inside DSA") in net/dsa/legacy.c, these functions were moved again to slave.c as part of commit 2a93c1a3651f ("net: dsa: Allow compiling out legacy support"), before actually deleting net/dsa/slave.c in 93e86b3bc842 ("net: dsa: Remove legacy probing support"). Along with that movement there should have been a deletion of the prototypes from dsa_priv.h, they are not useful. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210108233054.1222278-1-olteanv@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: remove the DSA specific notifiersVladimir Oltean1-17/+0
This effectively reverts commit 60724d4bae14 ("net: dsa: Add support for DSA specific notifiers"). The reason is that since commit 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings"), it appears that there is a generic way to achieve the same purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was converted to use the generic notifiers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: export dsa_slave_dev_checkVladimir Oltean1-0/+1
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when they are enslaved to e.g. a bridge by calling netif_is_bridge_master(). Export this helper from DSA to get the equivalent functionality of determining whether the upper interface of a CHANGEUPPER notifier is a DSA switch interface or not. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge ↵Vladimir Oltean1-11/+55
neighbors Some DSA switches (and not only) cannot learn source MAC addresses from packets injected from the CPU. They only perform hardware address learning from inbound traffic. This can be problematic when we have a bridge spanning some DSA switch ports and some non-DSA ports (which we'll call "foreign interfaces" from DSA's perspective). There are 2 classes of problems created by the lack of learning on CPU-injected traffic: - excessive flooding, due to the fact that DSA treats those addresses as unknown - the risk of stale routes, which can lead to temporary packet loss To illustrate the second class, consider the following situation, which is common in production equipment (wireless access points, where there is a WLAN interface and an Ethernet switch, and these form a single bridging domain). AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ ^ | | | | | | | Client A Client B | | | +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 will know that Clients A and B are reachable via wlan0 - the hardware fdb of a DSA switch driver today is not kept in sync with the software entries on other bridge ports, so it will not know that clients A and B are reachable via the CPU port UNLESS the hardware switch itself performs SA learning from traffic injected from the CPU. Nonetheless, a substantial number of switches don't. - the hardware fdb of the DSA switch on AP 2 may autonomously learn that Client A and B are reachable through swp0. Therefore, the software br0 of AP 2 also may or may not learn this. In the example we're illustrating, some Ethernet traffic has been going on, and br0 from AP 2 has indeed learnt that it can reach Client B through swp0. One of the wireless clients, say Client B, disconnects from AP 1 and roams to AP 2. The topology now looks like this: AP 1: +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ | ^ | | | Client A | | | Client B | | | v +------------+ +------------+ +------------+ +------------+ +------------+ | swp0 | | swp1 | | swp2 | | swp3 | | wlan0 | +------------+ +------------+ +------------+ +------------+ +------------+ +------------------------------------------------------------------------+ | br0 | +------------------------------------------------------------------------+ AP 2 - br0 of AP 1 still knows that Client A is reachable via wlan0 (no change) - br0 of AP 1 will (possibly) know that Client B has left wlan0. There are cases where it might never find out though. Either way, DSA today does not process that notification in any way. - the hardware FDB of the DSA switch on AP 1 may learn autonomously that Client B can be reached via swp0, if it receives any packet with Client 1's source MAC address over Ethernet. - the hardware FDB of the DSA switch on AP 2 still thinks that Client B can be reached via swp0. It does not know that it has roamed to wlan0, because it doesn't perform SA learning from the CPU port. Now Client A contacts Client B. AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet segment. AP 2 sees a frame on swp0 and its fdb says that the destination is swp0. Hairpinning is disabled => drop. This problem comes from the fact that these switches have a 'blind spot' for addresses coming from software bridging. The generic solution is not to assume that hardware learning can be enabled somehow, but to listen to more bridge learning events. It turns out that the bridge driver does learn in software from all inbound frames, in __br_handle_local_finish. A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the addresses serviced by the bridge on 'foreign' interfaces. The software bridge also does the right thing on migration, by notifying that the old entry is deleted, so that does not need to be special-cased in DSA. When it is deleted, we just need to delete our static FDB entry towards the CPU too, and wait. The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE events received on its own interfaces, such as static FDB entries. Luckily we can change that, and DSA can listen to all switchdev FDB add/del events in the system and figure out if those events were emitted by a bridge that spans at least one of DSA's own ports. In case that is true, DSA will also offload that address towards its own CPU port, in the eventuality that there might be bridge clients attached to the DSA switch who want to talk to the station connected to the foreign interface. In terms of implementation, we need to keep the fdb_info->added_by_user check for the case where the switchdev event was targeted directly at a DSA switch port. But we don't need to look at that flag for snooped events. So the check is currently too late, we need to move it earlier. This also simplifies the code a bit, since we avoid uselessly allocating and freeing switchdev_work. We could probably do some improvements in the future. For example, multi-bridge support is rudimentary at the moment. If there are two bridges spanning a DSA switch's ports, and both of them need to service the same MAC address, then what will happen is that the migration of one of those stations will trigger the deletion of the FDB entry from the CPU port while it is still used by other bridge. That could be improved with reference counting but is left for another time. This behavior needs to be enabled at driver level by setting ds->assisted_learning_on_cpu_port = true. This is because we don't want to inflict a potential performance penalty (accesses through MDIO/I2C/SPI are expensive) to hardware that really doesn't need it because address learning on the CPU port works there. Reported-by: DENG Qingfang <dqfext@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jakub Kicinski <kuba@kernel.org>