aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2023-03-30net: dsa: fix db type confusion in host fdb/mdb add/delVladimir Oltean1-12/+12
We have the following code paths: Host FDB (unicast RX filtering): dsa_port_standalone_host_fdb_add() dsa_port_bridge_host_fdb_add() | | +--------------+ +------------+ | | v v dsa_port_host_fdb_add() dsa_port_standalone_host_fdb_del() dsa_port_bridge_host_fdb_del() | | +--------------+ +------------+ | | v v dsa_port_host_fdb_del() Host MDB (multicast RX filtering): dsa_port_standalone_host_mdb_add() dsa_port_bridge_host_mdb_add() | | +--------------+ +------------+ | | v v dsa_port_host_mdb_add() dsa_port_standalone_host_mdb_del() dsa_port_bridge_host_mdb_del() | | +--------------+ +------------+ | | v v dsa_port_host_mdb_del() The logic added by commit 5e8a1e03aa4d ("net: dsa: install secondary unicast and multicast addresses as host FDB/MDB") zeroes out db.bridge.num if the switch doesn't support ds->fdb_isolation (the majority doesn't). This is done for a reason explained in commit c26933639b54 ("net: dsa: request drivers to perform FDB isolation"). Taking a single code path as example - dsa_port_host_fdb_add() - the others are similar - the problem is that this function handles: - DSA_DB_PORT databases, when called from dsa_port_standalone_host_fdb_add() - DSA_DB_BRIDGE databases, when called from dsa_port_bridge_host_fdb_add() So, if dsa_port_host_fdb_add() were to make any change on the "bridge.num" attribute of the database, this would only be correct for a DSA_DB_BRIDGE, and a type confusion for a DSA_DB_PORT bridge. However, this bug is without consequences, for 2 reasons: - dsa_port_standalone_host_fdb_add() is only called from code which is (in)directly guarded by dsa_switch_supports_uc_filtering(ds), and that function only returns true if ds->fdb_isolation is set. So, the code only executed for DSA_DB_BRIDGE databases. - Even if the code was not dead for DSA_DB_PORT, we have the following memory layout: struct dsa_bridge { struct net_device *dev; unsigned int num; bool tx_fwd_offload; refcount_t refcount; }; struct dsa_db { enum dsa_db_type type; union { const struct dsa_port *dp; // DSA_DB_PORT struct dsa_lag lag; struct dsa_bridge bridge; // DSA_DB_BRIDGE }; }; So, the zeroization of dsa_db :: bridge :: num on a dsa_db structure of type DSA_DB_PORT would access memory which is unused, because we only use dsa_db :: dp for DSA_DB_PORT, and this is mapped at the same address with dsa_db :: dev for DSA_DB_BRIDGE, thanks to the union definition. It is correct to fix up dsa_db :: bridge :: num only from code paths that come from the bridge / switchdev, so move these there. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski7-29/+148
Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c 3fbe4d8c0e53 ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting") 924531326e2d ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow") Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-30Merge tag 'net-6.3-rc5' of ↵Linus Torvalds6-24/+143
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from CAN and WPAN. Still quite a few bugs from this release. This pull is a bit smaller because major subtrees went into the previous one. Or maybe people took spring break off? Current release - regressions: - phy: micrel: correct KSZ9131RNX EEE capabilities and advertisement Current release - new code bugs: - eth: wangxun: fix vector length of interrupt cause - vsock/loopback: consistently protect the packet queue with sk_buff_head.lock - virtio/vsock: fix header length on skb merging - wpan: ca8210: fix unsigned mac_len comparison with zero Previous releases - regressions: - eth: stmmac: don't reject VLANs when IFF_PROMISC is set - eth: smsc911x: avoid PHY being resumed when interface is not up - eth: mtk_eth_soc: fix tx throughput regression with direct 1G links - eth: bnx2x: use the right build_skb() helper after core rework - wwan: iosm: fix 7560 modem crash on use on unsupported channel Previous releases - always broken: - eth: sfc: don't overwrite offload features at NIC reset - eth: r8169: fix RTL8168H and RTL8107E rx crc error - can: j1939: prevent deadlock by moving j1939_sk_errqueue() - virt: vmxnet3: use GRO callback when UPT is enabled - virt: xen: don't do grant copy across page boundary - phy: dp83869: fix default value for tx-/rx-internal-delay - dsa: ksz8: fix multiple issues with ksz8_fdb_dump - eth: mvpp2: fix classification/RSS of VLAN and fragmented packets - eth: mtk_eth_soc: fix flow block refcounting logic Misc: - constify fwnode pointers in SFP handling" * tag 'net-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits) net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow net: ethernet: mtk_eth_soc: fix L2 offloading with DSA untag offload net: ethernet: mtk_eth_soc: fix flow block refcounting logic net: mvneta: fix potential double-frees in mvneta_txq_sw_deinit() net: dsa: sync unicast and multicast addresses for VLAN filters too net: dsa: mv88e6xxx: Enable IGMP snooping on user ports only xen/netback: use same error messages for same errors test/vsock: new skbuff appending test virtio/vsock: WARN_ONCE() for invalid state of socket virtio/vsock: fix header length on skb merging bnxt_en: Add missing 200G link speed reporting bnxt_en: Fix typo in PCI id to device description string mapping bnxt_en: Fix reporting of test result in ethtool selftest i40e: fix registers dump after run ethtool adapter self test bnx2x: use the right build_skb() helper net: ipa: compute DMA pool size properly net: wwan: iosm: fixes 7560 modem crash net: ethernet: mtk_eth_soc: fix tx throughput regression with direct 1G links ice: fix invalid check for empty list in ice_sched_assoc_vsi_to_agg() ice: add profile conflict check for AVF FDIR ...
2023-03-30netfilter: ctnetlink: Support offloaded conntrack entry deletionPaul Blakey1-8/+0
Currently, offloaded conntrack entries (flows) can only be deleted after they are removed from offload, which is either by timeout, tcp state change or tc ct rule deletion. This can cause issues for users wishing to manually delete or flush existing entries. Support deletion of offloaded conntrack entries. Example usage: # Delete all offloaded (and non offloaded) conntrack entries # whose source address is 1.2.3.4 $ conntrack -D -s 1.2.3.4 # Delete all entries $ conntrack -F Signed-off-by: Paul Blakey <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: Florian Westphal <[email protected]>
2023-03-30netfilter: nfnetlink_queue: enable classid socket info retrievalEric Sage1-0/+20
This enables associating a socket with a v1 net_cls cgroup. Useful for applying a per-cgroup policy when processing packets in userspace. Signed-off-by: Eric Sage <[email protected]> Signed-off-by: Florian Westphal <[email protected]>
2023-03-30netfilter: nfnetlink_log: remove rcu_bh usageFlorian Westphal1-13/+23
structure is free'd via call_rcu, so its safe to use rcu_read_lock only. While at it, skip rcu_read_lock for lookup from packet path, its always called with rcu held. Signed-off-by: Florian Westphal <[email protected]>
2023-03-30net: dsa: sync unicast and multicast addresses for VLAN filters tooVladimir Oltean1-5/+116
If certain conditions are met, DSA can install all necessary MAC addresses on the CPU ports as FDB entries and disable flooding towards the CPU (we call this RX filtering). There is one corner case where this does not work. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up ip link set swp0 master br0 && ip link set swp0 up ip link add link swp0 name swp0.100 type vlan id 100 ip link set swp0.100 up && ip addr add 192.168.100.1/24 dev swp0.100 Traffic through swp0.100 is broken, because the bridge turns on VLAN filtering in the swp0 port (causing RX packets to be classified to the FDB database corresponding to the VID from their 802.1Q header), and although the 8021q module does call dev_uc_add() towards the real device, that API is VLAN-unaware, so it only contains the MAC address, not the VID; and DSA's current implementation of ndo_set_rx_mode() is only for VID 0 (corresponding to FDB entries which are installed in an FDB database which is only hit when the port is VLAN-unaware). It's interesting to understand why the bridge does not turn on IFF_PROMISC for its swp0 bridge port, and it may appear at first glance that this is a regression caused by the logic in commit 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode."). After all, a bridge port needs to have IFF_PROMISC by its very nature - it needs to receive and forward frames with a MAC DA different from the bridge ports' MAC addresses. While that may be true, when the bridge is VLAN-aware *and* it has a single port, there is no real reason to enable promiscuity even if that is an automatic port, with flooding and learning (there is nowhere for packets to go except to the BR_FDB_LOCAL entries), and this is how the corner case appears. Adding a second automatic interface to the bridge would make swp0 promisc as well, and would mask the corner case. Given the dev_uc_add() / ndo_set_rx_mode() API is what it is (it doesn't pass a VLAN ID), the only way to address that problem is to install host FDB entries for the cartesian product of RX filtering MAC addresses and VLAN RX filters. Fixes: 7569459a52c9 ("net: dsa: manage flooding on the CPU ports") Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Simon Horman <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-30net: optimize ____napi_schedule() to avoid extra NET_RX_SOFTIRQEric Dumazet1-4/+18
____napi_schedule() adds a napi into current cpu softnet_data poll_list, then raises NET_RX_SOFTIRQ to make sure net_rx_action() will process it. Idea of this patch is to not raise NET_RX_SOFTIRQ when being called indirectly from net_rx_action(), because we can process poll_list from this point, without going to full softirq loop. This needs a change in net_rx_action() to make sure we restart its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ being raised. Signed-off-by: Eric Dumazet <[email protected]> Cc: Jason Xing <[email protected]> Reviewed-by: Jason Xing <[email protected]> Tested-by: Jason Xing <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-30net: optimize napi_schedule_rps()Eric Dumazet1-3/+5
Based on initial patch from Jason Xing. Idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps() when we queued a packet into another cpu backlog. We can do this only in the context of us being called indirectly from net_rx_action(), to have the guarantee our rps_ipi_list will be processed before we exit from net_rx_action(). Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Eric Dumazet <[email protected]> Cc: Jason Xing <[email protected]> Reviewed-by: Jason Xing <[email protected]> Tested-by: Jason Xing <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-30net: add softnet_data.in_net_rx_actionEric Dumazet1-0/+4
We want to make two optimizations in napi_schedule_rps() and ____napi_schedule() which require to know if these helpers are called from net_rx_action(), instead of being called from other contexts. sd.in_net_rx_action is only read/written by the owning cpu. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Tested-by: Jason Xing <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-30net: napi_schedule_rps() cleanupEric Dumazet1-6/+12
napi_schedule_rps() return value is ignored, remove it. Change the comment to clarify the intent. Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Jason Xing <[email protected]> Tested-by: Jason Xing <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-30wifi: nl80211: support advertising S1G capabilitiesKieran Frewen1-0/+10
Include S1G capabilities in netlink band info messages. Signed-off-by: Kieran Frewen <[email protected]> Co-developed-by: Gilad Itzkovitch <[email protected]> Signed-off-by: Gilad Itzkovitch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: S1G capabilities information element in probe requestKieran Frewen2-0/+24
Add the missing S1G capabilities information element to probe requests. Signed-off-by: Kieran Frewen <[email protected]> Co-developed-by: Gilad Itzkovitch <[email protected]> Signed-off-by: Gilad Itzkovitch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30mac80211: minstrel_ht: remove unused n_supported variableTom Rix1-6/+0
clang with W=1 reports net/mac80211/rc80211_minstrel_ht.c:1711:6: error: variable 'n_supported' set but not used [-Werror,-Wunused-but-set-variable] int n_supported = 0; ^ This variable is not used so remove it. Signed-off-by: Tom Rix <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix invalid drv_sta_pre_rcu_remove calls for non-uploaded staFelix Fietkau1-1/+2
Avoid potential data corruption issues caused by uninitialized driver private data structures. Reported-by: Brian Coverstone <[email protected]> Fixes: 6a9d1b91f34d ("mac80211: add pre-RCU-sync sta removal driver operation") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix flow dissection for forwarded packetsFelix Fietkau1-1/+1
Adjust the network header to point at the correct payload offset Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix mesh forwardingFelix Fietkau1-0/+3
Linearize packets (needed for forwarding A-MSDU subframes). Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix receiving mesh packets in forwarding=0 networksFelix Fietkau1-8/+8
When forwarding is set to 0, frames are typically sent with ttl=1. Move the ttl decrement check below the check for local receive in order to fix packet drops. Reported-by: Thomas Hühn <[email protected]> Reported-by: Nick Hainke <[email protected]> Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix the size calculation of ieee80211_ie_len_eht_cap()Ryder Lee1-1/+1
Here should return the size of ieee80211_eht_cap_elem_fixed, so fix it. Fixes: 820acc810fb6 ("mac80211: Add EHT capabilities to association/probe request") Signed-off-by: Ryder Lee <[email protected]> Link: https://lore.kernel.org/r/06c13635fc03bcff58a647b8e03e9f01a74294bd.1679935259.git.ryder.lee@mediatek.com Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: fix potential null pointer dereferenceFelix Fietkau1-2/+2
rx->sta->amsdu_mesh_control is being passed to ieee80211_amsdu_to_8023s without checking rx->sta. Since it doesn't make sense to accept A-MSDU packets without a sta, simply add a check earlier. Fixes: 6e4c0d0460bd ("wifi: mac80211: add a workaround for receiving non-standard mesh A-MSDU") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30wifi: mac80211: drop bogus static keywords in A-MSDU rxFelix Fietkau1-2/+2
These were unintentional copy&paste mistakes. Cc: [email protected] Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Signed-off-by: Felix Fietkau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2023-03-30virtio/vsock: WARN_ONCE() for invalid state of socketArseniy Krasnov1-0/+7
This adds WARN_ONCE() and return from stream dequeue callback when socket's queue is empty, but 'rx_bytes' still non-zero. This allows the detection of potential bugs due to packet merging (see previous patch). Signed-off-by: Arseniy Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-30virtio/vsock: fix header length on skb mergingArseniy Krasnov1-1/+1
This fixes appending newly arrived skbuff to the last skbuff of the socket's queue. Problem fires when we are trying to append data to skbuff which was already processed in dequeue callback at least once. Dequeue callback calls function 'skb_pull()' which changes 'skb->len'. In current implementation 'skb->len' is used to update length in header of the last skbuff after new data was copied to it. This is bug, because value in header is used to calculate 'rx_bytes'/'fwd_cnt' and thus must be not be changed during skbuff's lifetime. Bug starts to fire since: commit 077706165717 ("virtio/vsock: don't use skbuff state to account credit") It presents before, but didn't triggered due to a little bit buggy implementation of credit calculation logic. So use Fixes tag for it. Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit") Signed-off-by: Arseniy Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-29Merge tag 'ieee802154-for-net-2023-03-29' of ↵Jakub Kicinski1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan Stefan Schmidt says: ==================== ieee802154 for net 2023-03-29 Two small fixes this time. Dongliang Mu removed an unnecessary null pointer check. Harshit Mogalapalli fixed an int comparison unsigned against signed from a recent other fix in the ca8210 driver. * tag 'ieee802154-for-net-2023-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan: net: ieee802154: remove an unnecessary null pointer check ca8210: Fix unsigned mac_len comparison with zero in ca8210_skb_tx() ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-29bpf: allow a TCP CC to write app_limitedYixin Shen1-0/+3
A CC that implements tcp_congestion_ops.cong_control() should be able to write app_limited. A built-in CC or one from a kernel module is already able to write to this member of struct tcp_sock. For a BPF program, write access has not been allowed, yet. Signed-off-by: Yixin Shen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-03-29mptcp: do not fill info not used by the PM in usedMatthieu Baerts1-7/+13
Only the in-kernel PM uses the number of address and subflow limits allowed per connection. It then makes more sense not to display such info when other PMs are used not to confuse the userspace by showing limits not being used. While at it, we can get rid of the "val" variable and add indentations instead. It would have been good to have done this modification directly in commit 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") but as we change a bit the behaviour, it is fine not to backport it to stable. Acked-by: Paolo Abeni <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-29mptcp: simplify subflow_syn_recv_sock()Paolo Abeni1-28/+13
Postpone the msk cloning to the child process creation so that we can avoid a bunch of conditionals. Link: https://github.com/multipath-tcp/mptcp_net-next/issues/61 Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-29mptcp: avoid unneeded address copyPaolo Abeni1-2/+0
In the syn_recv fallback path, the msk is unused. We can skip setting the socket address. Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Matthieu Baerts <[email protected]> Signed-off-by: Matthieu Baerts <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-296lowpan: Remove redundant initialisation.Kuniyuki Iwashima1-1/+1
We'll call memset(&tmp, 0, sizeof(tmp)) later. Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-29ipv6: Remove in6addr_any alternatives.Kuniyuki Iwashima2-12/+9
Some code defines the IPv6 wildcard address as a local variable and use it with memcmp() or ipv6_addr_equal(). Let's use in6addr_any and ipv6_addr_any() instead. Signed-off-by: Kuniyuki Iwashima <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-29vsock: support sockmapBobby Eshleman6-6/+262
This patch adds sockmap support for vsock sockets. It is intended to be usable by all transports, but only the virtio and loopback transports are implemented. SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported. Signed-off-by: Bobby Eshleman <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-28net: dst: Switch to rcuref_t reference countingThomas Gleixner5-28/+12
Under high contention dst_entry::__refcnt becomes a significant bottleneck. atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into high retry rates on contention. Switch the reference count to rcuref_t which results in a significant performance gain. Rename the reference count member to __rcuref to reflect the change. The gain depends on the micro-architecture and the number of concurrent operations and has been measured in the range of +25% to +130% with a localhost memtier/memcached benchmark which amplifies the problem massively. Running the memtier/memcached benchmark over a real (1Gb) network connection the conversion on top of the false sharing fix for struct dst_entry::__refcnt results in a total gain in the 2%-5% range over the upstream baseline. Reported-by: Wangyang Guo <[email protected]> Reported-by: Arjan Van De Ven <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-28net: dst: Prevent false sharing vs. dst_entry:: __refcntWangyang Guo4-27/+27
dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo <[email protected]> Signed-off-by: Arjan van de Ven <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-28virtio/vsock: check argument to avoid no effect callArseniy Krasnov1-0/+6
Both of these functions have no effect when input argument is 0, so to avoid useless spinlock access, check argument before it. Signed-off-by: Arseniy Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-28virtio/vsock: allocate multiple skbuffs on txArseniy Krasnov1-14/+43
This adds small optimization for tx path: instead of allocating single skbuff on every call to transport, allocate multiple skbuff's until credit space allows, thus trying to send as much as possible data without return to af_vsock.c. Signed-off-by: Arseniy Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2023-03-28bpf: tcp: Use sock_gen_put instead of sock_put in bpf_iter_tcpMartin KaFai Lau1-2/+2
While reviewing the udp-iter batching patches, noticed the bpf_iter_tcp calling sock_put() is incorrect. It should call sock_gen_put instead because bpf_iter_tcp is iterating the ehash table which has the req sk and tw sk. This patch replaces all sock_put with sock_gen_put in the bpf_iter_tcp codepath. Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-03-28can: isotp: add module parameter for maximum pdu sizeOliver Hartkopp1-9/+56
With ISO 15765-2:2016 the PDU size is not limited to 2^12 - 1 (4095) bytes but can be represented as a 32 bit unsigned integer value which allows 2^32 - 1 bytes (~4GB). The use-cases like automotive unified diagnostic services (UDS) and flashing of ECUs still use the small static buffers which are provided at socket creation time. When a use-case requires to transfer PDUs up to 1025 kByte the maximum PDU size can now be extended by setting the module parameter max_pdu_size. The extended size buffers are only allocated on a per-socket/connection base when needed at run-time. changes since v2: https://lore.kernel.org/all/[email protected] - use ARRAY_SIZE() to reference DEFAULT_MAX_PDU_SIZE only at one place changes since v1: https://lore.kernel.org/all/[email protected] - limit the minimum 'max_pdu_size' to 4095 to maintain the classic behavior before ISO 15765-2:2016 Link: https://github.com/raspberrypi/linux/issues/5371 Signed-off-by: Oliver Hartkopp <[email protected]> Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Marc Kleine-Budde <[email protected]>
2023-03-27ethtool: Add support for configuring tx_push_buf_lenShay Agroskin2-3/+33
This attribute, which is part of ethtool's ring param configuration allows the user to specify the maximum number of the packet's payload that can be written directly to the device. Example usage: # ethtool -G [interface] tx-push-buf-len [number of bytes] Co-developed-by: Jakub Kicinski <[email protected]> Signed-off-by: Shay Agroskin <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-27net: introduce a config option to tweak MAX_SKB_FRAGSEric Dumazet2-2/+14
Currently, MAX_SKB_FRAGS value is 17. For standard tcp sendmsg() traffic, no big deal because tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per frag. But with zero copy, we use order-0 pages. For BIG TCP to show its full potential, we add a config option to be able to fit up to 45 segments per skb. This is also needed for BIG TCP rx zerocopy, as zerocopy currently does not support skbs with frag list. We have used MAX_SKB_FRAGS=45 value for years at Google before we deployed 4K MTU, with no adverse effect, other than a recent issue in mlx4, fixed in commit 26782aad00cc ("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS") Back then, goal was to be able to receive full size (64KB) GRO packets without the frag_list overhead. Note that /proc/sys/net/core/max_skb_frags can also be used to limit the number of fragments TCP can use in tx packets. By default we keep the old/legacy value of 17 until we get more coverage for the updated values. Sizes of struct skb_shared_info on 64bit arches MAX_SKB_FRAGS | sizeof(struct skb_shared_info): ============================================== 17 320 21 320+64 = 384 25 320+128 = 448 29 320+192 = 512 33 320+256 = 576 37 320+320 = 640 41 320+384 = 704 45 320+448 = 768 This inflation might cause problems for drivers assuming they could pack both the incoming packet (for MTU=1500) and skb_shared_info in half a page, using build_skb(). v3: fix build error when CONFIG_NET=n v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long" Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Jason Xing <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-27can: bcm: bcm_tx_setup(): fix KMSAN uninit-value in vfs_writeIvan Orlov1-6/+10
Syzkaller reported the following issue: ===================================================== BUG: KMSAN: uninit-value in aio_rw_done fs/aio.c:1520 [inline] BUG: KMSAN: uninit-value in aio_write+0x899/0x950 fs/aio.c:1600 aio_rw_done fs/aio.c:1520 [inline] aio_write+0x899/0x950 fs/aio.c:1600 io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019 __do_sys_io_submit fs/aio.c:2078 [inline] __se_sys_io_submit+0x293/0x770 fs/aio.c:2048 __x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was created at: slab_post_alloc_hook mm/slab.h:766 [inline] slab_alloc_node mm/slub.c:3452 [inline] __kmem_cache_alloc_node+0x71f/0xce0 mm/slub.c:3491 __do_kmalloc_node mm/slab_common.c:967 [inline] __kmalloc+0x11d/0x3b0 mm/slab_common.c:981 kmalloc_array include/linux/slab.h:636 [inline] bcm_tx_setup+0x80e/0x29d0 net/can/bcm.c:930 bcm_sendmsg+0x3a2/0xce0 net/can/bcm.c:1351 sock_sendmsg_nosec net/socket.c:714 [inline] sock_sendmsg net/socket.c:734 [inline] sock_write_iter+0x495/0x5e0 net/socket.c:1108 call_write_iter include/linux/fs.h:2189 [inline] aio_write+0x63a/0x950 fs/aio.c:1600 io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019 __do_sys_io_submit fs/aio.c:2078 [inline] __se_sys_io_submit+0x293/0x770 fs/aio.c:2048 __x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd CPU: 1 PID: 5034 Comm: syz-executor350 Not tainted 6.2.0-rc6-syzkaller-80422-geda666ff2276 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023 ===================================================== We can follow the call chain and find that 'bcm_tx_setup' function calls 'memcpy_from_msg' to copy some content to the newly allocated frame of 'op->frames'. After that the 'len' field of copied structure being compared with some constant value (64 or 8). However, if 'memcpy_from_msg' returns an error, we will compare some uninitialized memory. This triggers 'uninit-value' issue. This patch will add 'memcpy_from_msg' possible errors processing to avoid uninit-value issue. Tested via syzkaller Reported-by: [email protected] Link: https://syzkaller.appspot.com/bug?id=47f897f8ad958bbde5790ebf389b5e7e0a345089 Signed-off-by: Ivan Orlov <[email protected]> Fixes: 6f3b911d5f29b ("can: bcm: add support for CAN FD frames") Acked-by: Oliver Hartkopp <[email protected]> Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Marc Kleine-Budde <[email protected]>
2023-03-27can: j1939: prevent deadlock by moving j1939_sk_errqueue()Oleksij Rempel1-2/+6
This commit addresses a deadlock situation that can occur in certain scenarios, such as when running data TP/ETP transfer and subscribing to the error queue while receiving a net down event. The deadlock involves locks in the following order: 3 j1939_session_list_lock -> active_session_list_lock j1939_session_activate ... j1939_sk_queue_activate_next -> sk_session_queue_lock ... j1939_xtp_rx_eoma_one 2 j1939_sk_queue_drop_all -> sk_session_queue_lock ... j1939_sk_netdev_event_netdown -> j1939_socks_lock j1939_netdev_notify 1 j1939_sk_errqueue -> j1939_socks_lock __j1939_session_cancel -> active_session_list_lock j1939_tp_rxtimer CPU0 CPU1 ---- ---- lock(&priv->active_session_list_lock); lock(&jsk->sk_session_queue_lock); lock(&priv->active_session_list_lock); lock(&priv->j1939_socks_lock); The solution implemented in this commit is to move the j1939_sk_errqueue() call out of the active_session_list_lock context, thus preventing the deadlock situation. Reported-by: [email protected] Fixes: 5b9272e93f2e ("can: j1939: extend UAPI to notify about RX status") Co-developed-by: Hillf Danton <[email protected]> Signed-off-by: Hillf Danton <[email protected]> Signed-off-by: Oleksij Rempel <[email protected]> Link: https://lore.kernel.org/all/[email protected] Cc: [email protected] Signed-off-by: Marc Kleine-Budde <[email protected]>
2023-03-27dev_ioctl: fix a W=1 warningHeiner Kallweit1-0/+1
This fixes the following warning when compiled with GCC 12.2.0 and W=1. net/core/dev_ioctl.c:475: warning: Function parameter or member 'data' not described in 'dev_ioctl' Signed-off-by: Heiner Kallweit <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-27vsock/loopback: use only sk_buff_head.lock to protect the packet queueStefano Garzarella1-8/+2
pkt_list_lock was used before commit 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") to protect the packet queue. After that commit we switched to sk_buff and we are using sk_buff_head.lock in almost every place to protect the packet queue except in vsock_loopback_work() when we call skb_queue_splice_init(). As reported by syzbot, this caused unlocked concurrent access to the packet queue between vsock_loopback_work() and vsock_loopback_cancel_pkt() since it is not holding pkt_list_lock. With the introduction of sk_buff_head, pkt_list_lock is redundant and can cause confusion, so let's remove it and use sk_buff_head.lock everywhere to protect the packet queue access. Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Cc: [email protected] Reported-and-tested-by: [email protected] Signed-off-by: Stefano Garzarella <[email protected]> Reviewed-by: Bobby Eshleman <[email protected]> Reviewed-by: Arseniy Krasnov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2023-03-279p: Add additional debug flags and open modesEric Van Hensbergen1-4/+4
Add some additional debug flags to assist with debugging cache changes. Also add some additional open modes so we can track cache state in fids more directly. Signed-off-by: Eric Van Hensbergen <[email protected]>
2023-03-25xsk: allow remap of fill and/or completion ringsNuno Gonçalves1-3/+6
The remap of fill and completion rings was frowned upon as they control the usage of UMEM which does not support concurrent use. At the same time this would disallow the remap of these rings into another process. A possible use case is that the user wants to transfer the socket/ UMEM ownership to another process (via SYS_pidfd_getfd) and so would need to also remap these rings. This will have no impact on current usages and just relaxes the remap limitation. Signed-off-by: Nuno Gonçalves <[email protected]> Reviewed-by: Maciej Fijalkowski <[email protected]> Acked-by: Magnus Karlsson <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-25bpf: Use bpf_mem_cache_alloc/free in bpf_local_storage_elemMartin KaFai Lau1-1/+1
This patch uses bpf_mem_alloc for the task and cgroup local storage that the bpf prog can easily get a hold of the storage owner's PTR_TO_BTF_ID. eg. bpf_get_current_task_btf() can be used in some of the kmalloc code path which will cause deadlock/recursion. bpf_mem_cache_alloc is deadlock free and will solve a legit use case in [1]. For sk storage, its batch creation benchmark shows a few percent regression when the sk create/destroy batch size is larger than 32. The sk creation/destruction happens much more often and depends on external traffic. Considering it is hypothetical to be able to cause deadlock with sk storage, it can cross the bridge to use bpf_mem_alloc till a legit (ie. useful) use case comes up. For inode storage, bpf_local_storage_destroy() is called before waiting for a rcu gp and its memory cannot be reused immediately. inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp. A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc(). Only task and cgroup storage have 'bpf_ma == true' which means to use bpf_mem_cache_alloc/free(). This patch only changes selem to use bpf_mem_alloc for task and cgroup. The next patch will change the local_storage to use bpf_mem_alloc also for task and cgroup. Here is some more details on the changes: * memory allocation: After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because bpf_mem_cache_alloc() could return a reused selem. It is to keep the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data is zero-ed. SDATA(selem)->data is the visible part to the bpf prog. No need to use zero_map_value() to do the zeroing because bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using the selem before returning the selem through bpf_mem_cache_free(). For the internal fields of selem, they will be initialized when linking to the new smap and the new local_storage. When 'bpf_ma == false', nothing changes in this patch. It will stay with the bpf_map_kzalloc(). * memory free: The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle the bpf_ma == true case. For the common selem free path where its owner is also being destroyed, the mem is freed in bpf_local_storage_destroy(), the owner (task and cgroup) has gone through a rcu gp. The memory can be reused immediately, so bpf_local_storage_destroy() will call bpf_selem_free(..., reuse_now = true) which will do bpf_mem_cache_free() for immediate reuse consideration. An exception is the delete elem code path. The delete elem code path is called from the helper bpf_*_storage_delete() and the syscall bpf_map_delete_elem(). This path is an unusual case for local storage because the common use case is to have the local storage staying with its owner life time so that the bpf prog and the user space does not have to monitor the owner's destruction. For the delete elem path, the selem cannot be reused immediately because there could be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false) and it will wait for a rcu tasks trace gp before freeing the elem. The rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree(). When 'bpf_ma == false', it should be the same as before. __bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu(). A few words on the 'reuse_now == true'. When 'reuse_now == true', it is still racing with bpf_local_storage_map_free which is under rcu protection, so it still needs to wait for a rcu gp instead of kfree(). Otherwise, the selem may be reused by slab for a totally different struct while the bpf_local_storage_map_free() is still using it (as a rcu reader). For the inode case, there may be other rcu readers also. In short, when bpf_ma == false and reuse_now == true => vanilla rcu. [1]: https://lore.kernel.org/bpf/[email protected]/ Cc: Namhyung Kim <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-25bpf: Treat KF_RELEASE kfuncs as KF_TRUSTED_ARGSDavid Vernet1-0/+6
KF_RELEASE kfuncs are not currently treated as having KF_TRUSTED_ARGS, even though they have a superset of the requirements of KF_TRUSTED_ARGS. Like KF_TRUSTED_ARGS, KF_RELEASE kfuncs require a 0-offset argument, and don't allow NULL-able arguments. Unlike KF_TRUSTED_ARGS which require _either_ an argument with ref_obj_id > 0, _or_ (ref->type & BPF_REG_TRUSTED_MODIFIERS) (and no unsafe modifiers allowed), KF_RELEASE only allows for ref_obj_id > 0. Because KF_RELEASE today doesn't automatically imply KF_TRUSTED_ARGS, some of these requirements are enforced in different ways that can make the behavior of the verifier feel unpredictable. For example, a KF_RELEASE kfunc with a NULL-able argument will currently fail in the verifier with a message like, "arg#0 is ptr_or_null_ expected ptr_ or socket" rather than "Possibly NULL pointer passed to trusted arg0". Our intention is the same, but the semantics are different due to implemenetation details that kfunc authors and BPF program writers should not need to care about. Let's make the behavior of the verifier more consistent and intuitive by having KF_RELEASE kfuncs imply the presence of KF_TRUSTED_ARGS. Our eventual goal is to have all kfuncs assume KF_TRUSTED_ARGS by default anyways, so this takes us a step in that direction. Note that it does not make sense to assume KF_TRUSTED_ARGS for all KF_ACQUIRE kfuncs. KF_ACQUIRE kfuncs can have looser semantics than KF_RELEASE, with e.g. KF_RCU | KF_RET_NULL. We may want to have KF_ACQUIRE imply KF_TRUSTED_ARGS _unless_ KF_RCU is specified, but that can be left to another patch set, and there are no such subtleties to address for KF_RELEASE. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-25bpf: Remove now-unnecessary NULL checks for KF_RELEASE kfuncsDavid Vernet2-5/+0
Now that we're not invoking kfunc destructors when the kptr in a map was NULL, we no longer require NULL checks in many of our KF_RELEASE kfuncs. This patch removes those NULL checks. Signed-off-by: David Vernet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-25Merge tag 'nfsd-6.3-4' of ↵Linus Torvalds1-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix a crash when using NFS with krb5p * tag 'nfsd-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Fix a crash in gss_krb5_checksum()
2023-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski16-99/+206
Conflicts: drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 6e9d51b1a5cb ("net/mlx5e: Initialize link speed to zero") 1bffcea42926 ("net/mlx5e: Add devlink hairpin queues parameters") https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ Adjacent changes: drivers/net/phy/phy.c 323fe43cf9ae ("net: phy: Improved PHY error reporting in state machine") 4203d84032e2 ("net: phy: Ensure state transitions are processed from phy_stop()") Signed-off-by: Jakub Kicinski <[email protected]>