aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2021-02-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski14-88/+154
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02Merge tag 'net-5.11-rc7' of ↵Linus Torvalds11-30/+106
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.11-rc7, including fixes from bpf and mac80211 trees. Current release - regressions: - ip_tunnel: fix mtu calculation - mlx5: fix function calculation for page trees Previous releases - regressions: - vsock: fix the race conditions in multi-transport support - neighbour: prevent a dead entry from updating gc_list - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add Previous releases - always broken: - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF cgroup getsockopt infra when user space is trying to race against optlen, from Loris Reiff. - bpf: add missing fput() in BPF inode storage map update helper - udp: ipv4: manipulate network header of NATed UDP GRO fraglist - mac80211: fix station rate table updates on assoc - r8169: work around RTL8125 UDP HW bug - igc: report speed and duplex as unknown when device is runtime suspended - rxrpc: fix deadlock around release of dst cached on udp tunnel" * tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits) net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary net: ipa: fix two format specifier errors net: ipa: use the right accessor in ipa_endpoint_status_skip() net: ipa: be explicit about endianness net: ipa: add a missing __iomem attribute net: ipa: pass correct dma_handle to dma_free_coherent() r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS net: mvpp2: TCAM entry enable should be written after SRAM data net: lapb: Copy the skb before sending a packet net/mlx5e: Release skb in case of failure in tc update skb net/mlx5e: Update max_opened_tc also when channels are closed net/mlx5: Fix leak upon failure of rule creation net/mlx5: Fix function calculation for page trees docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc udp: ipv4: manipulate network header of NATed UDP GRO fraglist net: ip_tunnel: fix mtu calculation vsock: fix the race conditions in multi-transport support net: sched: replaced invalid qdisc tree flush helper in qdisc_replace ibmvnic: device remove has higher precedence over reset ...
2021-02-02net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundaryAndreas Oetken1-1/+4
sup_multicast_addr is passed to ether_addr_equal for address comparison which casts the address inputs to u16 leading to an unaligned access. Aligning the sup_multicast_addr to u16 boundary fixes the issue. Signed-off-by: Andreas Oetken <andreas.oetken@siemens.com> Link: https://lore.kernel.org/r/20210202090304.2740471-1-ennoerlangen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGSSabyrzhan Tasbolatov1-0/+3
syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS control message is passed with user-controlled 0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition. The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400. So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10 is the closest limit, with 0x10 leftover. Same condition is currently done in rds_cmsg_rdma_args(). [1] WARNING: mm/page_alloc.c:5011 [..] Call Trace: alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267 alloc_pages include/linux/gfp.h:547 [inline] kmalloc_order+0x2e/0xb0 mm/slab_common.c:837 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853 kmalloc_array include/linux/slab.h:592 [inline] kcalloc include/linux/slab.h:621 [inline] rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568 rds_rm_size net/rds/send.c:928 [inline] Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02net: lapb: Copy the skb before sending a packetXie He1-1/+2
When sending a packet, we will prepend it with an LAPB header. This modifies the shared parts of a cloned skb, so we should copy the skb rather than just clone it, before we prepend the header. In "Documentation/networking/driver.rst" (the 2nd point), it states that drivers shouldn't modify the shared parts of a cloned skb when transmitting. The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called when an skb is being sent, clones the skb and sents the clone to AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte pseudo-header before handing over the skb to us, if we don't copy the skb before prepending the LAPB header, the first byte of the packets received on AF_PACKET sockets can be corrupted. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Link: https://lore.kernel.org/r/20210201055706.415842-1-xie.he.0141@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02Merge tag 'mac80211-for-net-2021-02-02' of ↵Jakub Kicinski2-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Two fixes: - station rate tables were not updated correctly after association, leading to bad configuration - rtl8723bs (staging) was initializing data incorrectly after the previous fix and needed to move the init later * tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip mac80211: fix station rate table updates on assoc ==================== Link: https://lore.kernel.org/r/20210202143505.37610-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01udp: ipv4: manipulate network header of NATed UDP GRO fraglistDongseok Yi2-6/+65
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT forwarding. Only the header of head_skb from ip_finish_output_gso -> skb_gso_segment is updated but following frag_skbs are not updated. A call path skb_mac_gso_segment -> inet_gso_segment -> udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list does not try to update UDP/IP header of the segment list but copy only the MAC header. Update port, addr and check of each skb of the segment list in __udp_gso_segment_list. It covers both SNAT and DNAT. Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) Signed-off-by: Dongseok Yi <dseok.yi@samsung.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Link: https://lore.kernel.org/r/1611962007-80092-1-git-send-email-dseok.yi@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01net: ip_tunnel: fix mtu calculationVadim Fedorenko1-9/+7
dev->hard_header_len for tunnel interface is set only when header_ops are set too and already contains full overhead of any tunnel encapsulation. That's why there is not need to use this overhead twice in mtu calc. Fixes: fdafed459998 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly") Reported-by: Slava Bacherikov <mail@slava.cc> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01vsock: fix the race conditions in multi-transport supportAlexander Popov1-5/+12
There are multiple similar bugs implicitly introduced by the commit c0cfa2d8a788fcf4 ("vsock: add multi-transports support") and commit 6a2c0962105ae8ce ("vsock: prevent transport modules unloading"). The bug pattern: [1] vsock_sock.transport pointer is copied to a local variable, [2] lock_sock() is called, [3] the local variable is used. VSOCK multi-transport support introduced the race condition: vsock_sock.transport value may change between [1] and [2]. Let's copy vsock_sock.transport pointer to local variables after the lock_sock() call. Fixes: c0cfa2d8a788fcf4 ("vsock: add multi-transports support") Signed-off-by: Alexander Popov <alex.popov@linux.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Link: https://lore.kernel.org/r/20210201084719.2257066-1-alex.popov@linux.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-01mac80211: fix station rate table updates on assocFelix Fietkau2-2/+6
If the driver uses .sta_add, station entries are only uploaded after the sta is in assoc state. Fix early station rate table updates by deferring them until the sta has been uploaded. Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210201083324.3134-1-nbd@nbd.name [use rcu_access_pointer() instead since we won't dereference here] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-31Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-58/+48
Pull NFS client fixes from Trond Myklebust: - SUNRPC: Handle 0 length opaque XDR object data properly - Fix a layout segment leak in pnfs_layout_process() - pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn - pNFS/NFSv4: Improve rejection of out-of-order layouts - pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() * tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Handle 0 length opaque XDR object data properly SUNRPC: Move simple_get_bytes and simple_get_netobj into private header pNFS/NFSv4: Improve rejection of out-of-order layouts pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process() pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
2021-01-30neighbour: Prevent a dead entry from updating gc_listChinmay Agarwal1-3/+4
Following race condition was detected: <CPU A, t0> - neigh_flush_dev() is under execution and calls neigh_mark_dead(n) marking the neighbour entry 'n' as dead. <CPU B, t1> - Executing: __netif_receive_skb() -> __netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process() calls __neigh_lookup() which takes a reference on neighbour entry 'n'. <CPU A, t2> - Moves further along neigh_flush_dev() and calls neigh_cleanup_and_release(n), but since reference count increased in t2, 'n' couldn't be destroyed. <CPU B, t3> - Moves further along, arp_process() and calls neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds the neighbour entry back in gc_list(neigh_mark_dead(), removed it earlier in t0 from gc_list) <CPU B, t4> - arp_process() finally calls neigh_release(n), destroying the neighbour entry. This leads to 'n' still being part of gc_list, but the actual neighbour structure has been freed. The situation can be prevented from happening if we disallow a dead entry to have any possibility of updating gc_list. This is what the patch intends to achieve. Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes") Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29rxrpc: Fix deadlock around release of dst cached on udp tunnelDavid Howells1-3/+3
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst from an incoming packet to get stolen and attached to the UDP socket from whence it is leaked when that socket is closed. When a network namespace is removed, the wait for dst records to be cleaned up happens before the cleanup of the rxrpc and UDP socket, meaning that the wait never finishes. Fix this by moving the rxrpc (and, by dependence, the afs) private per-network namespace registrations to the device group rather than subsys group. This allows cached rxrpc local endpoints to be cleared and their UDP sockets closed before we try waiting for the dst records. The symptom is that lines looking like the following: unregister_netdevice: waiting for lo to become free get emitted at regular intervals after running something like the referenced syzbot test. Thanks to Vadim for tracking this down and work out the fix. Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: bridge: add warning comments to avoid extending sysfsNikolay Aleksandrov2-0/+8
We're moving to netlink-only options, so add comments in the bridge's sysfs files to warn against adding any new sysfs entries. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: bridge: mcast: drop hosts limit sysfs supportNikolay Aleksandrov1-26/+0
We decided to stop adding new sysfs bridge options and continue with netlink only, so remove hosts limit sysfs support. Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: add a second tagger for Ocelot switches based on tag_8021qVladimir Oltean3-3/+87
There are use cases for which the existing tagger, based on the NPI (Node Processor Interface) functionality, is insufficient. Namely: - Frames injected through the NPI port bypass the frame analyzer, so no source address learning is performed, no TSN stream classification, etc. - Flow control is not functional over an NPI port (PAUSE frames are encapsulated in the same Extraction Frame Header as all other frames) - There can be at most one NPI port configured for an Ocelot switch. But in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI port is currently either disabled, or operated as a plain user port (albeit an internally-facing one). Having the ability to configure the two CPU ports symmetrically could pave the way for e.g. creating a LAG between them, to increase bandwidth seamlessly for the system. So there is a desire to have an alternative to the NPI mode. This change keeps the default tagger for the Seville and Felix switches as "ocelot", but it can be changed via the following device attribute: echo ocelot-8021q > /sys/class/<dsa-master>/dsa/tagging Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: allow changing the tag protocol via the "tagging" device attributeVladimir Oltean7-16/+217
Currently DSA exposes the following sysfs: $ cat /sys/class/net/eno2/dsa/tagging ocelot which is a read-only device attribute, introduced in the kernel as commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"), and used by libpcap since its commit 993db3800d7d ("Add support for DSA link-layer types"). It would be nice if we could extend this device attribute by making it writable: $ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging This is useful with DSA switches that can make use of more than one tagging protocol. It may be useful in dsa_loop in the future too, to perform offline testing of various taggers, or for changing between dsa and edsa on Marvell switches, if that is desirable. In terms of implementation, drivers can support this feature by implementing .change_tag_protocol, which should always leave the switch in a consistent state: either with the new protocol if things went well, or with the old one if something failed. Teardown of the old protocol, if necessary, must be handled by the driver. Some things remain as before: - The .get_tag_protocol is currently only called at probe time, to load the initial tagging protocol driver. Nonetheless, new drivers should report the tagging protocol in current use now. - The driver should manage by itself the initial setup of tagging protocol, no later than the .setup() method, as well as destroying resources used by the last tagger in use, no earlier than the .teardown() method. For multi-switch DSA trees, error handling is a bit more complicated, since e.g. the 5th out of 7 switches may fail to change the tag protocol. When that happens, a revert to the original tag protocol is attempted, but that may fail too, leaving the tree in an inconsistent state despite each individual switch implementing .change_tag_protocol transactionally. Since the intersection between drivers that implement .change_tag_protocol and drivers that support D in DSA is currently the empty set, the possibility for this error to happen is ignored for now. Testing: $ insmod mscc_felix.ko [ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14 [ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517 $ insmod tag_ocelot.ko $ rmmod mscc_felix.ko $ insmod mscc_felix.ko [ 97.261724] libphy: VSC9959 internal MDIO bus: probed [ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0 [ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1 [ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2 [ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3 [ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode [ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0 [ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode [ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1 [ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL) [ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode [ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx [ 98.527948] device eno2 entered promiscuous mode [ 98.544755] DSA: tree 0 setup $ ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1): 56 data bytes 64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms 64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms ^C - 10.0.0.1 ping statistics - 2 packets transmitted, 2 packets received, 0% packet loss round-trip min/avg/max = 0.754/1.545/2.337 ms $ cat /sys/class/net/eno2/dsa/tagging ocelot $ cat ./test_ocelot_8021q.sh #!/bin/bash ip link set swp0 down ip link set swp1 down ip link set swp2 down ip link set swp3 down ip link set swp5 down ip link set eno2 down echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging ip link set eno2 up ip link set swp0 up ip link set swp1 up ip link set swp2 up ip link set swp3 up ip link set swp5 up $ ./test_ocelot_8021q.sh ./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available $ rmmod tag_ocelot.ko rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable $ insmod tag_ocelot_8021q.ko $ ./test_ocelot_8021q.sh $ cat /sys/class/net/eno2/dsa/tagging ocelot-8021q $ rmmod tag_ocelot.ko $ rmmod tag_ocelot_8021q.ko rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable $ ping 10.0.0.1 PING 10.0.0.1 (10.0.0.1): 56 data bytes 64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms 64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms 64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms $ rmmod mscc_felix.ko [ 645.544426] mscc_felix 0000:00:00.5: Link is Down [ 645.838608] DSA: tree 0 torn down $ rmmod tag_ocelot_8021q.ko Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: keep a copy of the tagging protocol in the DSA switch treeVladimir Oltean1-12/+24
Cascading DSA switches can be done multiple ways. There is the brute force approach / tag stacking, where one upstream switch, located between leaf switches and the host Ethernet controller, will just happily transport the DSA header of those leaf switches as payload. For this kind of setups, DSA works without any special kind of treatment compared to a single switch - they just aren't aware of each other. Then there's the approach where the upstream switch understands the tags it transports from its leaves below, as it doesn't push a tag of its own, but it routes based on the source port & switch id information present in that tag (as opposed to DMAC & VID) and it strips the tag when egressing a front-facing port. Currently only Marvell implements the latter, and Marvell DSA trees contain only Marvell switches. So it is safe to say that DSA trees already have a single tag protocol shared by all switches, and in fact this is what makes the switches able to understand each other. This fact is also implied by the fact that currently, the tagging protocol is reported as part of a sysfs installed on the DSA master and not per port, so it must be the same for all the ports connected to that DSA master regardless of the switch that they belong to. It's time to make this official and enforce it (yes, this also means we won't have any "switch understands tag to some extent but is not able to speak it" hardware oddities that we'll support in the future). This is needed due to the imminent introduction of the dsa_switch_ops:: change_tag_protocol driver API. When that is introduced, we'll have to notify switches of the tagging protocol that they're configured to use. Currently the tag_ops structure pointer is held only for CPU ports. But there are switches which don't have CPU ports and nonetheless still need to be configured. These would be Marvell leaf switches whose upstream port is just a DSA link. How do we inform these of their tagging protocol setup/deletion? One answer to the above would be: iterate through the DSA switch tree's ports once, list the CPU ports, get their tag_ops, then iterate again now that we have it, and notify everybody of that tag_ops. But what to do if conflicts appear between one cpu_dp->tag_ops and another? There's no escaping the fact that conflict resolution needs to be done, so we can be upfront about it. Ease our work and just keep the master copy of the tag_ops inside the struct dsa_switch_tree. Reference counting is now moved to be per-tree too, instead of per-CPU port. There are many places in the data path that access master->dsa_ptr->tag_ops and we would introduce unnecessary performance penalty going through yet another indirection, so keep those right where they are. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: document the existing switch tree notifiers and add a new oneVladimir Oltean3-23/+58
The existence of dsa_broadcast has generated some confusion in the past: https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html So let's document the existing dsa_port_notify and dsa_broadcast functions and explain when each of them should be used. Also, in fact, the in-between function has always been there but was lacking a name, and is the main reason for this patch: dsa_tree_notify. Refactor dsa_broadcast to use it. This patch also moves dsa_broadcast (a top-level function) to dsa2.c, where it really belonged in the first place, but had no companion so it stood with dsa_port_notify. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLANVladimir Oltean1-2/+13
The sja1105 implementation can be blind about this, but the felix driver doesn't do exactly what it's being told, so it needs to know whether it is a TX or an RX VLAN, so it can install the appropriate type of TCAM rule. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: proc: speedup /proc/net/netstatEric Dumazet1-14/+36
Use cache friendly helpers to better use cpu caches while reading /proc/net/netstat Tested on a platform with 256 threads (AMD Rome) Before: 305 usec spent in netstat_seq_show() After: 130 usec spent in netstat_seq_show() Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210128162145.1703601-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: Remove redundant calls of sk_tx_queue_clear().Kuniyuki Iwashima1-1/+10
The commit 41b14fb8724d ("net: Do not clear the sock TX queue in sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in the commit e022f0b4a03f ("net: Introduce sk_tx_queue_mapping"). On the other hand, the original commit had already put sk_tx_queue_clear() in sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus sk_tx_queue_clear() is called twice in each path. If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it currently works well because (i) sk_tx_queue_mapping is defined between sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping. However, if we move sk_tx_queue_mapping out of the no copy area, it introduces a bug unintentionally. Therefore, this patch adds a compile-time check to take care of the order of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from sk_prot_alloc() so that it does the only allocation and its callers initialize fields. CC: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20210128150217.6060-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29ip_gre: add csum offload support for gre headerXin Long1-2/+13
This patch is to add csum offload support for gre header: On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set for inner proto, it will calculate the csum for outer proto, and inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL and csum_start/offset will be set for outer proto, and the outer csum will be offloaded later. On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is not set for inner proto and the hardware supports csum offload, CHECKSUM_PARTIAL and csum_start/offset will be set for outer proto, and outer csum will be offloaded later. Otherwise, it will do csum for outer proto by calling gso_make_checksum(). Note that SCTP has to do the csum by itself for non GSO path in sctp_packet_pack(), as gre_build_header() can't handle the csum with CHECKSUM_PARTIAL set for SCTP CRC csum offload. v1->v2: - remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM and it will always do checksum for SCTP in sctp_packet_pack() when it's not a GSO packet. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: support ip generic csum processing in skb_csum_hwoffload_helpXin Long1-1/+12
NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload for HW, which includes not only for TCP/UDP csum, but also for other protocols' csum like GRE's. However, in skb_csum_hwoffload_help() it only checks features against NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP packet and the features doesn't support NETIF_F_HW_CSUM, but supports NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW to do csum. This patch is to support ip generic csum processing by checking NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM) only for TCP and UDP. Note that we're using skb->csum_offset to check if it's a TCP/UDP proctol, this might be fragile. However, as Alex said, for now we only have a few L4 protocols that are requesting Tx csum offload, we'd better fix this until a new protocol comes with a same csum offset. v1->v2: - not extend skb->csum_not_inet, but use skb->csum_offset to tell if it's an UDP/TCP csum packet. v2->v3: - add a note in the changelog, as Willem suggested. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: atm: pppoatm: use new API for wakeup taskletEmil Renner Berthing1-4/+5
This converts the driver to use the new tasklet API introduced in commit 12cc923f1ccc ("tasklet: Introduce new initialization API") Signed-off-by: Emil Renner Berthing <kernel@esmil.dk> Link: https://lore.kernel.org/r/20210127173256.13954-2-kernel@esmil.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: atm: pppoatm: use tasklet_init to initialize wakeup taskletEmil Renner Berthing1-7/+3
Previously a temporary tasklet structure was initialized on the stack using DECLARE_TASKLET_OLD() and then copied over and modified. Nothing else in the kernel seems to use this pattern, so let's just call tasklet_init() like everyone else. Signed-off-by: Emil Renner Berthing <kernel@esmil.dk> Link: https://lore.kernel.org/r/20210127173256.13954-1-kernel@esmil.dk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: flow_offload: Add original direction flag to ct_metadataPaul Blakey1-0/+1
Give offloading drivers the direction of the offloaded ct flow, this will be used for matches on direction (ct_state +/-rpl). Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net/sched: cls_flower: Add match on the ct_state reply flagPaul Blakey1-2/+4
Add match on the ct_state reply flag. Example: $ tc filter add dev ens1f0_0 ingress prio 1 chain 1 proto ip flower \ ct_state +trk+est+rpl \ action mirred egress redirect dev ens1f0_1 $ tc filter add dev ens1f0_1 ingress prio 1 chain 1 proto ip flower \ ct_state +trk+est-rpl \ action mirred egress redirect dev ens1f0_0 Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29net: packet: make pkt_sk() inlineMenglong Dong1-1/+1
It's better make 'pkt_sk()' inline here, as non-inline function shouldn't occur in headers. Besides, this function is simple enough to be inline. Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Link: https://lore.kernel.org/r/20210127123302.29842-1-dong.menglong@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Extract a helper for validation of get/del RTNL requestsPetr Machata1-18/+25
Validation of messages for get / del of a next hop is the same as will be validation of messages for get of a resilient next hop group bucket. The difference is that policy for resilient next hop group buckets is a superset of that used for next-hop get. It is therefore possible to reuse the code that validates the nhmsg fields, extracts the next-hop ID, and validates that. To that end, extract from nh_valid_get_del_req() a helper __nh_valid_get_del_req() that does just that. Make the nlh argument const so that the function can be called from the dump context, which only has a const nlh. Propagate the constness to nh_valid_get_del_req(). Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Add a callback parameter to rtm_dump_walk_nexthops()Petr Machata1-10/+22
In order to allow different handling for next-hop tree dumper and for bucket dumper, parameterize the next-hop tree walker with a callback. Add rtm_dump_nexthop_cb() with just the bits relevant for next-hop tree dumping. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Extract a helper for walking the next-hop treePetr Machata1-19/+33
Extract from rtm_dump_nexthop() a helper to walk the next hop tree. A separate function for this will be reusable from the bucket dumper. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Strongly-type context of rtm_dump_nexthop()Petr Machata1-2/+16
The dump operations need to keep state from one invocation to another. A scratch area is dedicated for this purpose in the passed-in argument, cb, namely via two aliased arrays, struct netlink_callback.args and .ctx. Dumping of buckets will end up having to iterate over next hops as well, and it would be nice to be able to reuse the iteration logic with the NH dumper. The fact that the logic currently relies on fixed index to the .args array, and the indices would have to be coordinated between the two dumpers, makes this somewhat awkward. To make the access patters clearer, introduce a helper struct with a NH index, and instead of using the .args array directly, use it through this structure. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Extract a common helper for parsing dump attributesPetr Machata1-12/+19
Requests to dump nexthops have many attributes in common with those that requests to dump buckets of resilient NH groups will have. However, they have different policies. To allow reuse of this code, extract a policy-agnostic wrapper out of nh_valid_dump_req(), and convert this function into a thin wrapper around it. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Extract dump filtering parameters into a single structurePetr Machata1-20/+24
Requests to dump nexthops have many attributes in common with those that requests to dump buckets of resilient NH groups will have. In order to make reuse of this code simpler, convert the code to use a single structure with filtering configuration instead of passing around the parameters one by one. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Dispatch notifier init()/fini() by group typePetr Machata1-6/+19
After there are several next-hop group types, initialization and finalization of notifier type needs to reflect the actual type. Transform nh_notifier_grp_info_init() and _fini() to make extending them easier. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Use enum to encode notification typeIdo Schimmel1-6/+8
Currently there are only two types of in-kernel nexthop notification. The two are distinguished by the 'is_grp' boolean field in 'struct nh_notifier_info'. As more notification types are introduced for more next-hop group types, a boolean is not an easily extensible interface. Instead, convert it to an enum. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Assert the invariant that a NH group is of only one typePetr Machata1-2/+5
Most of the code that deals with nexthop groups relies on the fact that the group is of exactly one well-known type. Currently there is only one type, "mpath", but as more next-hop group types come, it becomes desirable to have a central place where the setting is validated. Introduce such place into nexthop_create_group(), such that the check is done before the code that relies on that invariant is invoked. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Introduce to struct nh_grp_entry a per-type unionPetr Machata1-2/+2
The values that a next-hop group needs to keep track of depend on the group type. Introduce a union to separate fields specific to the mpath groups from fields specific to other group types. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Dispatch nexthop_select_path() by group typePetr Machata1-6/+16
The logic for selecting path depends on the next-hop group type. Adapt the nexthop_select_path() to dispatch according to the group type. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28nexthop: Rename nexthop_free_mpathDavid Ahern1-2/+2
nexthop_free_mpath really should be nexthop_free_group. Rename it. Signed-off-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28net/af_iucv: build SG skbs for TRANS_HIPER socketsJulian Wiedmann1-2/+4
The TX path no longer falls apart when some of its SG skbs are later linearized by lower layers of the stack. So enable the use of SG skbs in iucv_sock_sendmsg() again. This effectively reverts commit dc5367bcc556 ("net/af_iucv: don't use paged skbs for TX on HiperSockets"). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28net/af_iucv: don't track individual TX skbs for TRANS_HIPER socketsJulian Wiedmann1-59/+21
Stop maintaining the skb_send_q list for TRANS_HIPER sockets. Not only is it extra overhead, but keeping around a list of skb clones means that we later also have to match the ->sk_txnotify() calls against these clones and free them accordingly. The current matching logic (comparing the skbs' shinfo location) is frustratingly fragile, and breaks if the skb's head is mangled in any sort of way while passing from dev_queue_xmit() to the device's HW queue. Also adjust the interface for ->sk_txnotify(), to make clear that we don't actually care about any skb internals. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28net/af_iucv: count packets in the xmit pathJulian Wiedmann1-6/+24
The TX code keeps track of all skbs that are in-flight but haven't actually been sent out yet. For native IUCV sockets that's not a huge deal, but with TRANS_HIPER sockets it would be much better if we didn't need to maintain a list of skb clones. Note that we actually only care about the _count_ of skbs in this stage of the TX pipeline. So as prep work for removing the skb tracking on TRANS_HIPER sockets, keep track of the skb count in a separate variable and pair any list {enqueue, unlink} with a count {increment, decrement}. Then replace all occurences where we currently look at the skb list's fill level. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28net/af_iucv: don't lookup the socket on TX notificationJulian Wiedmann1-12/+3
Whoever called iucv_sk(sk)->sk_txnotify() must already know that they're dealing with an af_iucv socket. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28net/af_iucv: remove WARN_ONCE on malformed RX packetsAlexander Egorenkov1-1/+0
syzbot reported the following finding: AF_IUCV failed to receive skb, len=0 WARNING: CPU: 0 PID: 522 at net/iucv/af_iucv.c:2039 afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039 CPU: 0 PID: 522 Comm: syz-executor091 Not tainted 5.10.0-rc1-syzkaller-07082-g55027a88ec9f #0 Hardware name: IBM 3906 M04 701 (KVM/Linux) Call Trace: [<00000000b87ea538>] afiucv_hs_rcv+0x178/0x190 net/iucv/af_iucv.c:2039 ([<00000000b87ea534>] afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039) [<00000000b796533e>] __netif_receive_skb_one_core+0x13e/0x188 net/core/dev.c:5315 [<00000000b79653ce>] __netif_receive_skb+0x46/0x1c0 net/core/dev.c:5429 [<00000000b79655fe>] netif_receive_skb_internal+0xb6/0x220 net/core/dev.c:5534 [<00000000b796ac3a>] netif_receive_skb+0x42/0x318 net/core/dev.c:5593 [<00000000b6fd45f4>] tun_rx_batched.isra.0+0x6fc/0x860 drivers/net/tun.c:1485 [<00000000b6fddc4e>] tun_get_user+0x1c26/0x27f0 drivers/net/tun.c:1939 [<00000000b6fe0f00>] tun_chr_write_iter+0x158/0x248 drivers/net/tun.c:1968 [<00000000b4f22bfa>] call_write_iter include/linux/fs.h:1887 [inline] [<00000000b4f22bfa>] new_sync_write+0x442/0x648 fs/read_write.c:518 [<00000000b4f238fe>] vfs_write.part.0+0x36e/0x5d8 fs/read_write.c:605 [<00000000b4f2984e>] vfs_write+0x10e/0x148 fs/read_write.c:615 [<00000000b4f29d0e>] ksys_write+0x166/0x290 fs/read_write.c:658 [<00000000b8dc4ab4>] system_call+0xe0/0x28c arch/s390/kernel/entry.S:415 Last Breaking-Event-Address: [<00000000b8dc64d4>] __s390_indirect_jump_r14+0x0/0xc Malformed RX packets shouldn't generate any warnings because debugging info already flows to dropmon via the kfree_skb(). Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com> Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski27-146/+315
drivers/net/can/dev.c b552766c872f ("can: dev: prevent potential information leak in can_fill_info()") 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") 0a042c6ec991 ("can: dev: move netlink related code into seperate file") Code move. drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down") 214baf22870c ("net/mlx5e: Support HTB offload") Adjacent code changes net/switchdev/switchdev.c 20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP") ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers") bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes") Transaction parameter gets dropped otherwise keep the fix. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28Merge tag 'mlx5-updates-2021-01-13' of ↵Jakub Kicinski1-30/+280
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5 subfunction support Parav Pandit says: This patchset introduces support for mlx5 subfunction (SF). A subfunction is a lightweight function that has a parent PCI function on which it is deployed. mlx5 subfunction has its own function capabilities and its own resources. This means a subfunction has its own dedicated queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from the parent PCI function. When subfunction is RDMA capable, it has its own QP1, GID table and rdma resources neither shared nor stolen from the parent PCI function. A subfunction has dedicated window in PCI BAR space that is not shared with the other subfunctions or parent PCI function. This ensures that all class devices of the subfunction accesses only assigned PCI BAR space. A Subfunction supports eswitch representation through which it supports tc offloads. User must configure eswitch to send/receive packets from/to subfunction port. Subfunctions share PCI level resources such as PCI MSI-X IRQs with their other subfunctions and/or with its parent PCI function. Subfunction support is discussed in detail in RFC [1] and [2]. RFC [1] and extension [2] describes requirements, design and proposed plumbing using devlink, auxiliary bus and sysfs for systemd/udev support. Functionality of this patchset is best explained using real examples further below. overview: -------- A subfunction can be created and deleted by a user using devlink port add/delete interface. A subfunction can be configured using devlink port function attribute before its activated. When a subfunction is activated, it results in an auxiliary device on the host PCI device where it is deployed. A driver binds to the auxiliary device that further creates supported class devices. example subfunction usage sequence: ----------------------------------- Change device to switchdev mode: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev Add a devlink port of subfunction flavour: $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 Configure mac address of the port function: $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88 Now activate the function: $ devlink port function set ens2f0npf0sf88 state active Now use the auxiliary device and class devices: $ devlink dev show pci/0000:06:00.0 auxiliary/mlx5_core.sf.4 $ ip link show 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff altname enp6s0f0np0 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff $ rdma dev show 43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112 44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112 After use inactivate the function: $ devlink port function set ens2f0npf0sf88 state inactive Now delete the subfunction port: $ devlink port del ens2f0npf0sf88 [1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/ [2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2 ================= * tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Add devlink subfunction port documentation devlink: Extend devlink port documentation for subfunctions devlink: Add devlink port documentation net/mlx5: SF, Port function state change support net/mlx5: SF, Add port add delete functionality net/mlx5: E-switch, Add eswitch helpers for SF vport net/mlx5: E-switch, Prepare eswitch to handle SF vport net/mlx5: SF, Add auxiliary device driver net/mlx5: SF, Add auxiliary device support net/mlx5: Introduce vhca state event notifier devlink: Support get and set state of port function devlink: Support add and delete devlink port devlink: Introduce PCI SF port flavour and port attribute devlink: Prepare code to fill multiple port function attributes ==================== Link: https://lore.kernel.org/r/20210122193658.282884-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28Merge tag 'net-5.11-rc6' of ↵Linus Torvalds21-77/+226
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes including fixes from can, xfrm, wireless, wireless-drivers and netfilter trees. Nothing scary, Intel WiFi-related fixes seemed most notable to the users. Current release - regressions: - dsa: microchip: ksz8795: fix KSZ8794 port map again to program the CPU port correctly Current release - new code bugs: - iwlwifi: pcie: reschedule in long-running memory reads Previous releases - regressions: - iwlwifi: dbg: don't try to overwrite read-only FW data - iwlwifi: provide gso_type to GSO packets - octeontx2: make sure the buffer is 128 byte aligned - tcp: make TCP_USER_TIMEOUT accurate for zero window probes - xfrm: fix wraparound in xfrm_policy_addr_delta() - xfrm: fix oops in xfrm_replay_advance_bmp due to a race between CPUs in presence of packet reorder - tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN - wext: fix NULL-ptr-dereference with cfg80211's lack of commit() Previous releases - always broken: - igc: fix link speed advertising - stmmac: configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing - team: protect features update by RCU to avoid deadlock - xfrm: fix disable_xfrm sysctl when used on xfrm interfaces themselves - fec: fix temporary RMII clock reset on link up - can: dev: prevent potential information leak in can_fill_info() Misc: - mrp: fix bad packing of MRP test packet structures - uapi: fix big endian definition of ipv6_rpl_sr_hdr - add David Ahern to IPv4/IPv6 maintainers" * tag 'net-5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits) rxrpc: Fix memory leak in rxrpc_lookup_local mlxsw: spectrum_span: Do not overwrite policer configuration selftests: forwarding: Specify interface when invoking mausezahn stmmac: intel: Configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing net: usb: cdc_ether: added support for Thales Cinterion PLSx3 modem family. ibmvnic: Ensure that CRQ entry read are correctly ordered MAINTAINERS: add missing header for bonding net: decnet: fix netdev refcount leaking on error path net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP can: dev: prevent potential information leak in can_fill_info() net: fec: Fix temporary RMII clock reset on link up net: lapb: Add locking to the lapb module team: protect features update by RCU to avoid deadlock MAINTAINERS: add David Ahern to IPv4/IPv6 maintainers net/mlx5: CT: Fix incorrect removal of tuple_nat_node from nat rhashtable net/mlx5e: Revert parameters on errors when changing MTU and LRO state without reset net/mlx5e: Revert parameters on errors when changing trust state without reset net/mlx5e: Correctly handle changing the number of queues when the interface is down net/mlx5e: Fix CT rule + encap slow path offload and deletion net/mlx5e: Disable hw-tc-offload when MLX5_CLS_ACT config is disabled ...
2021-01-28rxrpc: Fix memory leak in rxrpc_lookup_localTakeshi Misawa1-0/+1
Commit 9ebeddef58c4 ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record") Then release ref in __rxrpc_put_peer and rxrpc_put_peer_locked. struct rxrpc_peer *rxrpc_alloc_peer(struct rxrpc_local *local, gfp_t gfp) - peer->local = local; + peer->local = rxrpc_get_local(local); rxrpc_discard_prealloc also need ref release in discarding. syzbot report: BUG: memory leak unreferenced object 0xffff8881080ddc00 (size 256): comm "syz-executor339", pid 8462, jiffies 4294942238 (age 12.350s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 0a 00 00 00 00 c0 00 08 81 88 ff ff ................ backtrace: [<000000002b6e495f>] kmalloc include/linux/slab.h:552 [inline] [<000000002b6e495f>] kzalloc include/linux/slab.h:682 [inline] [<000000002b6e495f>] rxrpc_alloc_local net/rxrpc/local_object.c:79 [inline] [<000000002b6e495f>] rxrpc_lookup_local+0x1c1/0x760 net/rxrpc/local_object.c:244 [<000000006b43a77b>] rxrpc_bind+0x174/0x240 net/rxrpc/af_rxrpc.c:149 [<00000000fd447a55>] afs_open_socket+0xdb/0x200 fs/afs/rxrpc.c:64 [<000000007fd8867c>] afs_net_init+0x2b4/0x340 fs/afs/main.c:126 [<0000000063d80ec1>] ops_init+0x4e/0x190 net/core/net_namespace.c:152 [<00000000073c5efa>] setup_net+0xde/0x2d0 net/core/net_namespace.c:342 [<00000000a6744d5b>] copy_net_ns+0x19f/0x3e0 net/core/net_namespace.c:483 [<0000000017d3aec3>] create_new_namespaces+0x199/0x4f0 kernel/nsproxy.c:110 [<00000000186271ef>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<000000002de7bac4>] ksys_unshare+0x2fe/0x5c0 kernel/fork.c:2957 [<00000000349b12ba>] __do_sys_unshare kernel/fork.c:3025 [inline] [<00000000349b12ba>] __se_sys_unshare kernel/fork.c:3023 [inline] [<00000000349b12ba>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3023 [<000000006d178ef7>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 [<00000000637076d4>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 9ebeddef58c4 ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record") Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com> Reported-and-tested-by: syzbot+305326672fed51b205f7@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/161183091692.3506637.3206605651502458810.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>