|
We have the following code paths:

Host FDB (unicast RX filtering):

dsa_port_standalone_host_fdb_add()    dsa_port_bridge_host_fdb_add()
                |                                    |
                +----------------+-------------------+
                                 |
                                 v
                      dsa_port_host_fdb_add()

dsa_port_standalone_host_fdb_del()    dsa_port_bridge_host_fdb_del()
                |                                    |
                +----------------+-------------------+
                                 |
                                 v
                      dsa_port_host_fdb_del()

Host MDB (multicast RX filtering):

dsa_port_standalone_host_mdb_add()    dsa_port_bridge_host_mdb_add()
                |                                    |
                +----------------+-------------------+
                                 |
                                 v
                      dsa_port_host_mdb_add()

dsa_port_standalone_host_mdb_del()    dsa_port_bridge_host_mdb_del()
                |                                    |
                +----------------+-------------------+
                                 |
                                 v
                      dsa_port_host_mdb_del()
The logic added by commit 5e8a1e03aa4d ("net: dsa: install secondary
unicast and multicast addresses as host FDB/MDB") zeroes out
db.bridge.num if the switch doesn't support ds->fdb_isolation
(the majority doesn't). This is done for a reason explained in commit
c26933639b54 ("net: dsa: request drivers to perform FDB isolation").
Taking a single code path as an example - dsa_port_host_fdb_add() - the
others are similar - the problem is that this function handles:

- DSA_DB_PORT databases, when called from
  dsa_port_standalone_host_fdb_add()
- DSA_DB_BRIDGE databases, when called from
  dsa_port_bridge_host_fdb_add()

So, if dsa_port_host_fdb_add() were to make any change to the
"bridge.num" attribute of the database, this would only be correct for a
DSA_DB_BRIDGE database, and a type confusion for a DSA_DB_PORT database.
However, this bug is without consequences, for two reasons:

- dsa_port_standalone_host_fdb_add() is only called from code which is
  (in)directly guarded by dsa_switch_supports_uc_filtering(ds), and that
  function only returns true if ds->fdb_isolation is set. So, the code
  is only executed for DSA_DB_BRIDGE databases.
- Even if the code were not dead for DSA_DB_PORT, we have the following
  memory layout:
struct dsa_bridge {
    struct net_device *dev;
    unsigned int num;
    bool tx_fwd_offload;
    refcount_t refcount;
};

struct dsa_db {
    enum dsa_db_type type;

    union {
        const struct dsa_port *dp; // DSA_DB_PORT
        struct dsa_lag lag;
        struct dsa_bridge bridge; // DSA_DB_BRIDGE
    };
};
So, the zeroization of dsa_db :: bridge :: num on a dsa_db structure of
type DSA_DB_PORT would access memory which is unused, because we only
use dsa_db :: dp for DSA_DB_PORT, and this is mapped at the same address
as dsa_db :: bridge :: dev for DSA_DB_BRIDGE, thanks to the union
definition.
It is correct to fix up dsa_db :: bridge :: num only from code paths
that come from the bridge / switchdev, so move these fixups there.
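A sketch of the resulting shape for one of the four callers
(simplified; the other three follow the same pattern):

  static int dsa_port_bridge_host_fdb_add(struct dsa_port *dp,
                                          const unsigned char *addr, u16 vid)
  {
      struct dsa_db db = {
          .type = DSA_DB_BRIDGE,
          .bridge = *dp->bridge,
      };

      /* the fixup now runs only where db is known to be DSA_DB_BRIDGE */
      if (!dp->ds->fdb_isolation)
          db.bridge.num = 0;

      return dsa_port_host_fdb_add(dp, addr, vid, db);
  }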
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Conflicts:
drivers/net/ethernet/mediatek/mtk_ppe.c
3fbe4d8c0e53 ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting")
924531326e2d ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow")
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from CAN and WPAN.
Still quite a few bugs from this release. This pull is a bit smaller
because major subtrees went into the previous one. Or maybe people
took spring break off?
Current release - regressions:
- phy: micrel: correct KSZ9131RNX EEE capabilities and advertisement
Current release - new code bugs:
- eth: wangxun: fix vector length of interrupt cause
- vsock/loopback: consistently protect the packet queue with
sk_buff_head.lock
- virtio/vsock: fix header length on skb merging
- wpan: ca8210: fix unsigned mac_len comparison with zero
Previous releases - regressions:
- eth: stmmac: don't reject VLANs when IFF_PROMISC is set
- eth: smsc911x: avoid PHY being resumed when interface is not up
- eth: mtk_eth_soc: fix tx throughput regression with direct 1G links
- eth: bnx2x: use the right build_skb() helper after core rework
- wwan: iosm: fix 7560 modem crash on use on unsupported channel
Previous releases - always broken:
- eth: sfc: don't overwrite offload features at NIC reset
- eth: r8169: fix RTL8168H and RTL8107E rx crc error
- can: j1939: prevent deadlock by moving j1939_sk_errqueue()
- virt: vmxnet3: use GRO callback when UPT is enabled
- virt: xen: don't do grant copy across page boundary
- phy: dp83869: fix default value for tx-/rx-internal-delay
- dsa: ksz8: fix multiple issues with ksz8_fdb_dump
- eth: mvpp2: fix classification/RSS of VLAN and fragmented packets
- eth: mtk_eth_soc: fix flow block refcounting logic
Misc:
- constify fwnode pointers in SFP handling"
* tag 'net-6.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (55 commits)
net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow
net: ethernet: mtk_eth_soc: fix L2 offloading with DSA untag offload
net: ethernet: mtk_eth_soc: fix flow block refcounting logic
net: mvneta: fix potential double-frees in mvneta_txq_sw_deinit()
net: dsa: sync unicast and multicast addresses for VLAN filters too
net: dsa: mv88e6xxx: Enable IGMP snooping on user ports only
xen/netback: use same error messages for same errors
test/vsock: new skbuff appending test
virtio/vsock: WARN_ONCE() for invalid state of socket
virtio/vsock: fix header length on skb merging
bnxt_en: Add missing 200G link speed reporting
bnxt_en: Fix typo in PCI id to device description string mapping
bnxt_en: Fix reporting of test result in ethtool selftest
i40e: fix registers dump after run ethtool adapter self test
bnx2x: use the right build_skb() helper
net: ipa: compute DMA pool size properly
net: wwan: iosm: fixes 7560 modem crash
net: ethernet: mtk_eth_soc: fix tx throughput regression with direct 1G links
ice: fix invalid check for empty list in ice_sched_assoc_vsi_to_agg()
ice: add profile conflict check for AVF FDIR
...
|
|
Currently, offloaded conntrack entries (flows) can only be deleted
after they are removed from offload, which happens either by timeout,
TCP state change, or tc ct rule deletion. This can cause issues for
users wishing to manually delete or flush existing entries.
Support deletion of offloaded conntrack entries.
Example usage:
# Delete all offloaded (and non offloaded) conntrack entries
# whose source address is 1.2.3.4
$ conntrack -D -s 1.2.3.4
# Delete all entries
$ conntrack -F
Signed-off-by: Paul Blakey <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Pablo Neira Ayuso <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
|
|
This enables associating a socket with a v1 net_cls cgroup. Useful for
applying a per-cgroup policy when processing packets in userspace.
Signed-off-by: Eric Sage <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
|
|
The structure is freed via call_rcu, so it's safe to use rcu_read_lock
only.
While at it, skip rcu_read_lock for lookups from the packet path; it's
always called with rcu held.
Signed-off-by: Florian Westphal <[email protected]>
|
|
If certain conditions are met, DSA can install all necessary MAC
addresses on the CPU ports as FDB entries and disable flooding towards
the CPU (we call this RX filtering).
There is one corner case where this does not work.
ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up
ip link set swp0 master br0 && ip link set swp0 up
ip link add link swp0 name swp0.100 type vlan id 100
ip link set swp0.100 up && ip addr add 192.168.100.1/24 dev swp0.100
Traffic through swp0.100 is broken, because the bridge turns on VLAN
filtering on the swp0 port, causing RX packets to be classified to the
FDB database corresponding to the VID from their 802.1Q header. And
although the 8021q module does call dev_uc_add() towards the real
device, that API is VLAN-unaware, so it only passes the MAC address,
not the VID. Finally, DSA's current implementation of ndo_set_rx_mode()
is only for VID 0 (corresponding to FDB entries which are installed in
an FDB database which is only hit when the port is VLAN-unaware).
It's interesting to understand why the bridge does not turn on
IFF_PROMISC for its swp0 bridge port, and it may appear at first glance
that this is a regression caused by the logic in commit 2796d0c648c9
("bridge: Automatically manage port promiscuous mode."). After all,
a bridge port needs to have IFF_PROMISC by its very nature - it needs to
receive and forward frames with a MAC DA different from the bridge
ports' MAC addresses.
While that may be true, when the bridge is VLAN-aware *and* it has a
single port, there is no real reason to enable promiscuity even if that
is an automatic port, with flooding and learning (there is nowhere for
packets to go except to the BR_FDB_LOCAL entries), and this is how the
corner case appears. Adding a second automatic interface to the bridge
would make swp0 promisc as well, and would mask the corner case.
Given the dev_uc_add() / ndo_set_rx_mode() API is what it is (it doesn't
pass a VLAN ID), the only way to address that problem is to install host
FDB entries for the cartesian product of RX filtering MAC addresses and
VLAN RX filters.
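A hedged sketch of the idea, using the 8021q vlan_for_each() helper;
the context structure and callback name are illustrative, not
necessarily the ones from the patch:

  /* illustrative context: one RX filtering address to replay into
   * every VLAN RX filter of the port
   */
  struct dsa_host_addr_ctx {
      struct dsa_port *dp;
      const unsigned char *addr;
  };

  static int dsa_host_fdb_add_vid(struct net_device *vdev, int vid, void *arg)
  {
      struct dsa_host_addr_ctx *ctx = arg;

      return dsa_port_bridge_host_fdb_add(ctx->dp, ctx->addr, vid);
  }

  /* from ndo_set_rx_mode(): install for VID 0, then for each VLAN */
  err = dsa_host_fdb_add_vid(dev, 0, &ctx);
  if (!err)
      err = vlan_for_each(dev, dsa_host_fdb_add_vid, &ctx);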
Fixes: 7569459a52c9 ("net: dsa: manage flooding on the CPU ports")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
____napi_schedule() adds a napi into the current cpu's softnet_data
poll_list, then raises NET_RX_SOFTIRQ to make sure net_rx_action() will
process it.
The idea of this patch is to not raise NET_RX_SOFTIRQ when being called
indirectly from net_rx_action(), because we can process the poll_list
from this point, without going through the full softirq loop.
This needs a change in net_rx_action() to make sure we restart
its main loop if sd->poll_list was updated without NET_RX_SOFTIRQ
being raised.
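A minimal sketch of the intended change, assuming the
sd->in_net_rx_action flag introduced earlier in this series:

  /* Sketch: when net_rx_action() is already running on this cpu, it
   * will notice the refilled poll_list and restart its loop, so the
   * softirq does not need to be raised here.
   */
  list_add_tail(&napi->poll_list, &sd->poll_list);
  if (!sd->in_net_rx_action)
      __raise_softirq_irqoff(NET_RX_SOFTIRQ);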
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jason Xing <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Based on an initial patch from Jason Xing.
The idea is to not raise NET_RX_SOFTIRQ from napi_schedule_rps()
when we queue a packet into another cpu's backlog.
We can do this only in the context of us being called indirectly
from net_rx_action(), to have the guarantee that our rps_ipi_list
will be processed before we exit from net_rx_action().
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jason Xing <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require knowing whether these helpers are
called from net_rx_action(), instead of being called from
other contexts.
sd.in_net_rx_action is only read/written by the owning cpu.
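A sketch of how the flag brackets the poll loop (heavily simplified
from net_rx_action()):

  static __latent_entropy void net_rx_action(struct softirq_action *h)
  {
      struct softnet_data *sd = this_cpu_ptr(&softnet_data);

      sd->in_net_rx_action = true;
      /* ... existing poll loop ... */
      sd->in_net_rx_action = false;
  }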
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
napi_schedule_rps() return value is ignored, remove it.
Change the comment to clarify the intent.
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Tested-by: Jason Xing <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
Include S1G capabilities in netlink band info messages.
Signed-off-by: Kieran Frewen <[email protected]>
Co-developed-by: Gilad Itzkovitch <[email protected]>
Signed-off-by: Gilad Itzkovitch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Add the missing S1G capabilities information element to probe requests.
Signed-off-by: Kieran Frewen <[email protected]>
Co-developed-by: Gilad Itzkovitch <[email protected]>
Signed-off-by: Gilad Itzkovitch <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
clang with W=1 reports
net/mac80211/rc80211_minstrel_ht.c:1711:6: error: variable
'n_supported' set but not used [-Werror,-Wunused-but-set-variable]
int n_supported = 0;
^
This variable is not used, so remove it.
Signed-off-by: Tom Rix <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Avoid potential data corruption issues caused by uninitialized driver
private data structures.
Reported-by: Brian Coverstone <[email protected]>
Fixes: 6a9d1b91f34d ("mac80211: add pre-RCU-sync sta removal driver operation")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Adjust the network header to point at the correct payload offset
Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
Linearize packets (needed for forwarding A-MSDU subframes).
Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
When forwarding is set to 0, frames are typically sent with ttl=1.
Move the ttl decrement check below the check for local receive in order to
fix packet drops.
Reported-by: Thomas Hühn <[email protected]>
Reported-by: Nick Hainke <[email protected]>
Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
This should return the size of ieee80211_eht_cap_elem_fixed, so fix it.
Fixes: 820acc810fb6 ("mac80211: Add EHT capabilities to association/probe request")
Signed-off-by: Ryder Lee <[email protected]>
Link: https://lore.kernel.org/r/06c13635fc03bcff58a647b8e03e9f01a74294bd.1679935259.git.ryder.lee@mediatek.com
Signed-off-by: Johannes Berg <[email protected]>
|
|
rx->sta->amsdu_mesh_control is being passed to ieee80211_amsdu_to_8023s
without checking rx->sta. Since it doesn't make sense to accept A-MSDU
packets without a sta, simply add a check earlier.
Fixes: 6e4c0d0460bd ("wifi: mac80211: add a workaround for receiving non-standard mesh A-MSDU")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
These were unintentional copy&paste mistakes.
Cc: [email protected]
Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces")
Signed-off-by: Felix Fietkau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
|
|
This adds WARN_ONCE() and return from the stream dequeue callback when
the socket's queue is empty, but 'rx_bytes' is still non-zero. This
allows the detection of potential bugs due to packet merging (see the
previous patch).
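A sketch of the guard in the dequeue path (field names follow struct
virtio_vsock_sock; treat the details as illustrative):

  spin_lock_bh(&vvs->rx_lock);

  if (WARN_ONCE(skb_queue_empty(&vvs->rx_queue) && vvs->rx_bytes,
                "rx_queue is empty, but rx_bytes is non-zero\n")) {
      spin_unlock_bh(&vvs->rx_lock);
      return err;    /* 'err' is the callback's error return */
  }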
Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
This fixes appending a newly arrived skbuff to the last skbuff of the
socket's queue. The problem fires when we are trying to append data to
an skbuff which was already processed in the dequeue callback at least
once. The dequeue callback calls 'skb_pull()', which changes 'skb->len'.
In the current implementation, 'skb->len' is used to update the length
in the header of the last skbuff after new data was copied to it. This
is a bug, because the value in the header is used to calculate
'rx_bytes'/'fwd_cnt' and thus must not be changed during the skbuff's
lifetime.
The bug starts to fire since:
commit 077706165717
("virtio/vsock: don't use skbuff state to account credit")
It was present before, but didn't trigger due to a slightly buggy
implementation of the credit calculation logic. So use the Fixes tag
for it.
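A hedged sketch of the corrected append path; the point is to grow the
header's recorded length rather than re-deriving it from 'skb->len':

  memcpy(skb_put(last_skb, skb->len), skb->data, skb->len);
  last_hdr->flags |= hdr->flags;
  /* do not use last_skb->len here: skb_pull() in a previous dequeue
   * pass may already have shrunk it
   */
  last_hdr->len = cpu_to_le32(le32_to_cpu(last_hdr->len) + skb->len);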
Fixes: 077706165717 ("virtio/vsock: don't use skbuff state to account credit")
Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan
Stefan Schmidt says:
====================
ieee802154 for net 2023-03-29
Two small fixes this time.
Dongliang Mu removed an unnecessary null pointer check.
Harshit Mogalapalli fixed an unsigned vs signed int comparison
introduced by a recent fix in the ca8210 driver.
* tag 'ieee802154-for-net-2023-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan:
net: ieee802154: remove an unnecessary null pointer check
ca8210: Fix unsigned mac_len comparison with zero in ca8210_skb_tx()
====================
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
A CC that implements tcp_congestion_ops.cong_control() should be able
to write app_limited. A built-in CC or one from a kernel module is
already able to write to this member of struct tcp_sock.
For a BPF program, write access has not been allowed yet.
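A sketch of the verifier-side change, following the existing pattern in
bpf_tcp_ca's btf_struct_access handler for writable tcp_sock members:

  /* inside the switch (off) over writable struct tcp_sock offsets */
  case offsetof(struct tcp_sock, app_limited):
      end = offsetofend(struct tcp_sock, app_limited);
      break;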
Signed-off-by: Yixin Shen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>
|
|
Only the in-kernel PM uses the per-connection limits on the number of
addresses and subflows.
It then makes more sense not to display such info when other PMs are
used, so as not to confuse userspace by showing limits that are not in
use.
While at it, we can get rid of the "val" variable and add indentation
instead.
It would have been good to have done this modification directly in
commit 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs")
but as we change the behaviour a bit, it is fine not to backport it to
stable.
Acked-by: Paolo Abeni <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Postpone the msk cloning to the child socket creation
so that we can avoid a bunch of conditionals.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/61
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In the syn_recv fallback path, the msk is unused. We can skip
setting the socket address.
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
We'll call memset(&tmp, 0, sizeof(tmp)) later.
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Some code defines the IPv6 wildcard address as a local variable and
uses it with memcmp() or ipv6_addr_equal().
Let's use in6addr_any and ipv6_addr_any() instead.
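An illustrative before/after of the pattern being cleaned up (the
field used here is just an example):

  /* before: open-coded wildcard address */
  struct in6_addr addr_any = IN6ADDR_ANY_INIT;

  if (!memcmp(&sk->sk_v6_rcv_saddr, &addr_any, sizeof(addr_any)))
      do_something();

  /* after: shared constant / helper */
  if (ipv6_addr_any(&sk->sk_v6_rcv_saddr))
      do_something();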
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
This patch adds sockmap support for vsock sockets. It is intended to be
usable by all transports, but only the virtio and loopback transports
are implemented.
SOCK_STREAM, SOCK_DGRAM, and SOCK_SEQPACKET are all supported.
Signed-off-by: Bobby Eshleman <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Under high contention dst_entry::__refcnt becomes a significant bottleneck.
atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into
high retry rates on contention.
Switch the reference count to rcuref_t which results in a significant
performance gain. Rename the reference count member to __rcuref to reflect
the change.
The gain depends on the micro-architecture and the number of concurrent
operations and has been measured in the range of +25% to +130% with a
localhost memtier/memcached benchmark which amplifies the problem
massively.
Running the memtier/memcached benchmark over a real (1Gb) network
connection the conversion on top of the false sharing fix for struct
dst_entry::__refcnt results in a total gain in the 2%-5% range over the
upstream baseline.
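A sketch of the core of the conversion, using the rcuref API:

  /* struct dst_entry: atomic_t __refcnt becomes rcuref_t __rcuref */
  static inline bool dst_hold_safe(struct dst_entry *dst)
  {
      /* rcuref_get() is an unconditional increment in the fast path,
       * instead of an atomic_inc_not_zero() cmpxchg() retry loop
       */
      return rcuref_get(&dst->__rcuref);
  }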
Reported-by: Wangyang Guo <[email protected]>
Reported-by: Arjan Van De Ven <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
dst_entry::__refcnt is highly contended in scenarios where many connections
happen from and to the same IP. The reference count is an atomic_t, so the
reference count operations have to take the cache-line exclusive.
Aside from the unavoidable reference count contention there is another
significant problem which is caused by that: false sharing.
perf top identified two affected read accesses: dst_entry::lwtstate and
rtable::rt_genid.
dst_entry::__refcnt is located at offset 64 of dst_entry, which puts it
into a separate cacheline vs. the read-mostly members located at the
beginning of the struct.
That prevents false sharing vs. the struct members in the first 64
bytes of the structure, but there is also
dst_entry::lwtstate
which is located after the reference count and in the same cache line. This
member is read after a reference count has been acquired.
struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a
size of 112 bytes, which means that the struct members of rtable which
follow the dst member share the same cache line as dst_entry::__refcnt.
Especially
rtable::rt_genid
is also read by the contexts which have a reference count acquired
already.
When dst_entry::__refcnt is incremented or decremented via an atomic
operation these read accesses stall. This was found when analysing the
memtier benchmark in 1:100 mode, which amplifies the problem extremely.
Move the rt[6i]_uncached[_list] members out of struct rtable and struct
rt6_info into struct dst_entry to provide padding and move the lwtstate
member after that so it ends up in the same cache line.
The resulting improvement depends on the micro-architecture and the number
of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached
benchmark.
[ tglx: Rearrange struct ]
Signed-off-by: Wangyang Guo <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Both of these functions have no effect when the input argument is 0, so
to avoid useless spinlock access, check the argument before taking the
lock.
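A sketch of the shape of the check, assuming the credit helpers in
virtio_transport_common.c:

  u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 credit)
  {
      u32 ret;

      /* nothing to claim: skip the spinlock entirely */
      if (!credit)
          return 0;

      spin_lock_bh(&vvs->tx_lock);
      ret = vvs->peer_buf_alloc - (vvs->tx_cnt - vvs->fwd_cnt);
      if (ret > credit)
          ret = credit;
      vvs->tx_cnt += ret;
      spin_unlock_bh(&vvs->tx_lock);

      return ret;
  }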
Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
This adds a small optimization for the tx path: instead of allocating a
single skbuff on every call to the transport, allocate multiple skbuffs
while credit space allows, thus trying to send as much data as possible
without returning to af_vsock.c.
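A hedged sketch of the batching loop; helper names and arguments are
illustrative:

  rest_len = pkt_len;
  do {
      skb_len = min_t(size_t, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);

      skb = virtio_transport_alloc_skb(info, skb_len, ...); /* illustrative */
      if (!skb)
          return -ENOMEM;

      ret = t_ops->send_pkt(skb);
      if (ret < 0)
          break;

      rest_len -= skb_len;
  } while (rest_len);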
Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
|
|
While reviewing the udp-iter batching patches, I noticed that
bpf_iter_tcp calling sock_put() is incorrect. It should call
sock_gen_put() instead, because bpf_iter_tcp is iterating the ehash
table, which has the req sk and tw sk. This patch replaces all
sock_put() calls with sock_gen_put() in the bpf_iter_tcp codepath.
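The substance of the change, sketched:

  /* sock_gen_put() dispatches on sk_state and correctly releases
   * TIME_WAIT and request socks, which plain sock_put() would mishandle
   */
  sock_gen_put(sk);    /* was: sock_put(sk); */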
Fixes: 04c7820b776f ("bpf: tcp: Bpf iter batching and lock_sock")
Signed-off-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
|
|
With ISO 15765-2:2016 the PDU size is not limited to 2^12 - 1 (4095)
bytes but can be represented as a 32 bit unsigned integer value which
allows 2^32 - 1 bytes (~4GB). The use-cases like automotive unified
diagnostic services (UDS) and flashing of ECUs still use the small
static buffers which are provided at socket creation time.
When a use-case requires transferring PDUs of up to 1025 kByte, the
maximum PDU size can now be extended by setting the module parameter
max_pdu_size. The extended size buffers are only allocated on a
per-socket/connection basis when needed at run-time.
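A sketch of the knob; the default shown is illustrative, the real
value lives in the patch:

  static unsigned int max_pdu_size __read_mostly = 8300; /* illustrative default */
  module_param(max_pdu_size, uint, 0444);
  MODULE_PARM_DESC(max_pdu_size, "maximum isotp pdu size (default 8300)");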
changes since v2: https://lore.kernel.org/all/[email protected]
- use ARRAY_SIZE() to reference DEFAULT_MAX_PDU_SIZE only at one place
changes since v1: https://lore.kernel.org/all/[email protected]
- limit the minimum 'max_pdu_size' to 4095 to maintain the classic
behavior before ISO 15765-2:2016
Link: https://github.com/raspberrypi/linux/issues/5371
Signed-off-by: Oliver Hartkopp <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
|
|
This attribute, which is part of ethtool's ring param configuration,
allows the user to specify the maximum number of bytes of the packet's
payload that can be written directly to the device.
Example usage:
# ethtool -G [interface] tx-push-buf-len [number of bytes]
Co-developed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Shay Agroskin <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Currently, the MAX_SKB_FRAGS value is 17.
For standard tcp sendmsg() traffic, this is no big deal because
tcp_sendmsg() attempts order-3 allocations, stuffing 32768 bytes per
frag.
But with zero copy, we use order-0 pages.
For BIG TCP to show its full potential, we add a config option
to be able to fit up to 45 segments per skb.
This is also needed for BIG TCP rx zerocopy, as zerocopy currently
does not support skbs with a frag list.
We have used the MAX_SKB_FRAGS=45 value for years at Google before
we deployed 4K MTU, with no adverse effect other than
a recent issue in mlx4, fixed in commit 26782aad00cc
("net/mlx4: MLX4_TX_BOUNCE_BUFFER_SIZE depends on MAX_SKB_FRAGS").
Back then, the goal was to be able to receive full size (64KB) GRO
packets without the frag_list overhead.
Note that /proc/sys/net/core/max_skb_frags can also be used to limit
the number of fragments TCP can use in tx packets.
By default we keep the old/legacy value of 17 until we get
more coverage for the updated values.
Sizes of struct skb_shared_info on 64bit arches:

MAX_SKB_FRAGS | sizeof(struct skb_shared_info)
==============================================
      17      | 320
      21      | 320 + 64  = 384
      25      | 320 + 128 = 448
      29      | 320 + 192 = 512
      33      | 320 + 256 = 576
      37      | 320 + 320 = 640
      41      | 320 + 384 = 704
      45      | 320 + 448 = 768
This inflation might cause problems for drivers assuming they could pack
both the incoming packet (for MTU=1500) and skb_shared_info in half a page,
using build_skb().
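A sketch of how the option plugs in, assuming a Kconfig symbol named
CONFIG_MAX_SKB_FRAGS with default 17 and range 17..45:

  /* include/linux/skbuff.h (sketch) */
  #define MAX_SKB_FRAGS    CONFIG_MAX_SKB_FRAGS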
v3: fix build error when CONFIG_NET=n
v2: fix two build errors assuming MAX_SKB_FRAGS was "unsigned long"
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
|
|
Syzkaller reported the following issue:
=====================================================
BUG: KMSAN: uninit-value in aio_rw_done fs/aio.c:1520 [inline]
BUG: KMSAN: uninit-value in aio_write+0x899/0x950 fs/aio.c:1600
aio_rw_done fs/aio.c:1520 [inline]
aio_write+0x899/0x950 fs/aio.c:1600
io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019
__do_sys_io_submit fs/aio.c:2078 [inline]
__se_sys_io_submit+0x293/0x770 fs/aio.c:2048
__x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Uninit was created at:
slab_post_alloc_hook mm/slab.h:766 [inline]
slab_alloc_node mm/slub.c:3452 [inline]
__kmem_cache_alloc_node+0x71f/0xce0 mm/slub.c:3491
__do_kmalloc_node mm/slab_common.c:967 [inline]
__kmalloc+0x11d/0x3b0 mm/slab_common.c:981
kmalloc_array include/linux/slab.h:636 [inline]
bcm_tx_setup+0x80e/0x29d0 net/can/bcm.c:930
bcm_sendmsg+0x3a2/0xce0 net/can/bcm.c:1351
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
sock_write_iter+0x495/0x5e0 net/socket.c:1108
call_write_iter include/linux/fs.h:2189 [inline]
aio_write+0x63a/0x950 fs/aio.c:1600
io_submit_one+0x1d1c/0x3bf0 fs/aio.c:2019
__do_sys_io_submit fs/aio.c:2078 [inline]
__se_sys_io_submit+0x293/0x770 fs/aio.c:2048
__x64_sys_io_submit+0x92/0xd0 fs/aio.c:2048
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
CPU: 1 PID: 5034 Comm: syz-executor350 Not tainted 6.2.0-rc6-syzkaller-80422-geda666ff2276 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
=====================================================
We can follow the call chain and find that the 'bcm_tx_setup' function
calls 'memcpy_from_msg' to copy some content to the newly allocated
frame of 'op->frames'. After that, the 'len' field of the copied
structure is compared with some constant value (64 or 8). However, if
'memcpy_from_msg' returns an error, we will compare some uninitialized
memory. This triggers the 'uninit-value' issue.
This patch adds processing of possible 'memcpy_from_msg' errors to
avoid the uninit-value issue.
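A sketch of the added error handling in bcm_tx_setup(); the unwind
label is illustrative:

  err = memcpy_from_msg((u8 *)cf, msg, op->cfsiz);
  if (err < 0)
      goto free_op;    /* illustrative label: free the partially set up op */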
Tested via syzkaller
Reported-by: [email protected]
Link: https://syzkaller.appspot.com/bug?id=47f897f8ad958bbde5790ebf389b5e7e0a345089
Signed-off-by: Ivan Orlov <[email protected]>
Fixes: 6f3b911d5f29b ("can: bcm: add support for CAN FD frames")
Acked-by: Oliver Hartkopp <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
|
|
This commit addresses a deadlock situation that can occur in certain
scenarios, such as when running data TP/ETP transfer and subscribing to
the error queue while receiving a net down event. The deadlock involves
locks in the following order:
3: j1939_session_list_lock -> active_session_list_lock
   j1939_session_activate
   ...
   j1939_sk_queue_activate_next -> sk_session_queue_lock
   ...
   j1939_xtp_rx_eoma_one

2: j1939_sk_queue_drop_all -> sk_session_queue_lock
   ...
   j1939_sk_netdev_event_netdown -> j1939_socks_lock
   j1939_netdev_notify

1: j1939_sk_errqueue -> j1939_socks_lock
   __j1939_session_cancel -> active_session_list_lock
   j1939_tp_rxtimer

       CPU0                                    CPU1
       ----                                    ----
  lock(&priv->active_session_list_lock);
                                          lock(&jsk->sk_session_queue_lock);
                                          lock(&priv->active_session_list_lock);
  lock(&priv->j1939_socks_lock);
The solution implemented in this commit is to move the
j1939_sk_errqueue() call out of the active_session_list_lock context,
thus preventing the deadlock situation.
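A hedged sketch of the reordering; the function names come from the
chains above, the surrounding code is heavily simplified:

  /* before: j1939_sk_errqueue() ran via __j1939_session_cancel() with
   * active_session_list_lock held
   */
  j1939_session_list_lock(priv);
  __j1939_session_cancel(session, err);    /* no longer calls errqueue */
  j1939_session_list_unlock(priv);

  /* after: notify once the lock has been dropped */
  j1939_sk_errqueue(session, J1939_ERRQUEUE_RX_ABORT);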
Reported-by: [email protected]
Fixes: 5b9272e93f2e ("can: j1939: extend UAPI to notify about RX status")
Co-developed-by: Hillf Danton <[email protected]>
Signed-off-by: Hillf Danton <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Cc: [email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
|
|
This fixes the following warning when compiled with GCC 12.2.0 and W=1.
net/core/dev_ioctl.c:475: warning: Function parameter or member 'data'
not described in 'dev_ioctl'
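The fix amounts to adding the missing kernel-doc line; the wording
below is illustrative:

  /**
   * dev_ioctl - network device ioctl
   * @net: the applicable net namespace
   * @cmd: command to issue
   * @ifr: pointer to a struct ifreq in user space
   * @data: data exchanged with userspace
   * @need_copyout: whether or not copy_to_user() should be called
   */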
Signed-off-by: Heiner Kallweit <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
pkt_list_lock was used before commit 71dc9ec9ac7d ("virtio/vsock:
replace virtio_vsock_pkt with sk_buff") to protect the packet queue.
After that commit we switched to sk_buff and we are using
sk_buff_head.lock in almost every place to protect the packet queue
except in vsock_loopback_work() when we call skb_queue_splice_init().
As reported by syzbot, this caused unlocked concurrent access to the
packet queue between vsock_loopback_work() and
vsock_loopback_cancel_pkt() since it is not holding pkt_list_lock.
With the introduction of sk_buff_head, pkt_list_lock is redundant and
can cause confusion, so let's remove it and use sk_buff_head.lock
everywhere to protect the packet queue access.
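The splice in vsock_loopback_work() then takes the queue's own lock,
essentially:

  spin_lock_bh(&vsock->pkt_queue.lock);
  skb_queue_splice_init(&vsock->pkt_queue, &pkts);
  spin_unlock_bh(&vsock->pkt_queue.lock);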
Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff")
Cc: [email protected]
Reported-and-tested-by: [email protected]
Signed-off-by: Stefano Garzarella <[email protected]>
Reviewed-by: Bobby Eshleman <[email protected]>
Reviewed-by: Arseniy Krasnov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add some additional debug flags to assist with debugging
cache changes. Also add some additional open modes so we
can track cache state in fids more directly.
Signed-off-by: Eric Van Hensbergen <[email protected]>
|
|
The remap of the fill and completion rings was frowned upon as they
control the usage of UMEM, which does not support concurrent use. At
the same time this would disallow the remap of these rings into another
process.
A possible use case is that the user wants to transfer the socket/UMEM
ownership to another process (via SYS_pidfd_getfd) and so would need to
also remap these rings.
This will have no impact on current usages and just relaxes the
remap limitation.
Signed-off-by: Nuno Gonçalves <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
This patch uses bpf_mem_alloc for the task and cgroup local storage,
where the bpf prog can easily get a hold of the storage owner's
PTR_TO_BTF_ID: e.g. bpf_get_current_task_btf() can be used in some of
the kmalloc code paths, which would cause deadlock/recursion.
bpf_mem_cache_alloc is deadlock free and will solve a legit use case
in [1].
For sk storage, its batch creation benchmark shows a few percent
regression when the sk create/destroy batch size is larger than 32.
The sk creation/destruction happens much more often and depends on
external traffic. Considering that causing a deadlock with sk storage
is hypothetical, sk storage can cross the bridge to bpf_mem_alloc when
a legit (i.e. useful) use case comes up.
For inode storage, bpf_local_storage_destroy() is called before
waiting for a rcu gp and its memory cannot be reused immediately.
inode stays with kmalloc/kfree after the rcu [or tasks_trace] gp.
A 'bool bpf_ma' argument is added to bpf_local_storage_map_alloc().
Only task and cgroup storage have 'bpf_ma == true' which
means to use bpf_mem_cache_alloc/free(). This patch only changes
selem to use bpf_mem_alloc for task and cgroup. The next patch
will change the local_storage to use bpf_mem_alloc also for
task and cgroup.
Here are some more details on the changes:
* memory allocation:
After bpf_mem_cache_alloc(), the SDATA(selem)->data is zero-ed because
bpf_mem_cache_alloc() could return a reused selem. It is to keep
the existing bpf_map_kzalloc() behavior. Only SDATA(selem)->data
is zero-ed. SDATA(selem)->data is the visible part to the bpf prog.
No need to use zero_map_value() to do the zeroing because
bpf_selem_free(..., reuse_now = true) ensures no bpf prog is using
the selem before returning the selem through bpf_mem_cache_free().
For the internal fields of selem, they will be initialized when
linking to the new smap and the new local_storage.
When 'bpf_ma == false', nothing changes in this patch. It will
stay with the bpf_map_kzalloc().
* memory free:
The bpf_selem_free() and bpf_selem_free_rcu() are modified to handle
the bpf_ma == true case.
For the common selem free path where its owner is also being destroyed,
the mem is freed in bpf_local_storage_destroy(), the owner (task
and cgroup) has gone through a rcu gp. The memory can be reused
immediately, so bpf_local_storage_destroy() will call
bpf_selem_free(..., reuse_now = true) which will do
bpf_mem_cache_free() for immediate reuse consideration.
An exception is the delete elem code path. The delete elem code path
is called from the helper bpf_*_storage_delete() and the syscall
bpf_map_delete_elem(). This path is an unusual case for local
storage because the common use case is to have the local storage
staying with its owner life time so that the bpf prog and the user
space does not have to monitor the owner's destruction. For the delete
elem path, the selem cannot be reused immediately because there could
be bpf prog using it. It will call bpf_selem_free(..., reuse_now = false)
and it will wait for a rcu tasks trace gp before freeing the elem. The
rcu callback is changed to do bpf_mem_cache_raw_free() instead of kfree().
When 'bpf_ma == false', it should be the same as before.
__bpf_selem_free() is added to do the kfree_rcu and call_tasks_trace_rcu().
A few words on the 'reuse_now == true'. When 'reuse_now == true',
it is still racing with bpf_local_storage_map_free which is under rcu
protection, so it still needs to wait for a rcu gp instead of kfree().
Otherwise, the selem may be reused by slab for a totally different struct
while the bpf_local_storage_map_free() is still using it (as a
rcu reader). For the inode case, there may be other rcu readers also.
In short, when bpf_ma == false and reuse_now == true => vanilla rcu.
[1]: https://lore.kernel.org/bpf/[email protected]/
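A simplified sketch of the resulting free-path dispatch; treat the
allocator handle and callback names as assumptions, not the exact ones
from the patch:

  void bpf_selem_free(struct bpf_local_storage_elem *selem,
                      struct bpf_local_storage_map *smap,
                      bool reuse_now)
  {
      if (!smap->bpf_ma) {
          /* vanilla path: kfree_rcu() / call_tasks_trace_rcu() */
          __bpf_selem_free(selem, reuse_now);
          return;
      }

      if (reuse_now)
          /* owner already went through a rcu gp: immediate reuse is ok */
          bpf_mem_cache_free(&smap->selem_ma, selem);  /* assumed handle */
      else
          /* delete-elem path: wait for a rcu tasks trace gp first */
          call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_trace_rcu);
  }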
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
KF_RELEASE kfuncs are not currently treated as having KF_TRUSTED_ARGS,
even though they have a superset of the requirements of KF_TRUSTED_ARGS.
Like KF_TRUSTED_ARGS, KF_RELEASE kfuncs require a 0-offset argument, and
don't allow NULL-able arguments. Unlike KF_TRUSTED_ARGS which require
_either_ an argument with ref_obj_id > 0, _or_ (ref->type &
BPF_REG_TRUSTED_MODIFIERS) (and no unsafe modifiers allowed), KF_RELEASE
only allows for ref_obj_id > 0. Because KF_RELEASE today doesn't
automatically imply KF_TRUSTED_ARGS, some of these requirements are
enforced in different ways that can make the behavior of the verifier
feel unpredictable. For example, a KF_RELEASE kfunc with a NULL-able
argument will currently fail in the verifier with a message like, "arg#0
is ptr_or_null_ expected ptr_ or socket" rather than "Possibly NULL
pointer passed to trusted arg0". Our intention is the same, but the
semantics are different due to implementation details that kfunc authors
and BPF program writers should not need to care about.
Let's make the behavior of the verifier more consistent and intuitive by
having KF_RELEASE kfuncs imply the presence of KF_TRUSTED_ARGS. Our
eventual goal is to have all kfuncs assume KF_TRUSTED_ARGS by default
anyways, so this takes us a step in that direction.
Note that it does not make sense to assume KF_TRUSTED_ARGS for all
KF_ACQUIRE kfuncs. KF_ACQUIRE kfuncs can have looser semantics than
KF_RELEASE, with e.g. KF_RCU | KF_RET_NULL. We may want to have
KF_ACQUIRE imply KF_TRUSTED_ARGS _unless_ KF_RCU is specified, but that
can be left to another patch set, and there are no such subtleties to
address for KF_RELEASE.
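The verifier-side essence is small; a sketch of the check that now
treats KF_RELEASE as implying trusted args:

  static bool is_kfunc_trusted_args(struct bpf_kfunc_call_arg_meta *meta)
  {
      /* KF_RELEASE is a superset of KF_TRUSTED_ARGS' requirements */
      return (meta->kfunc_flags & (KF_TRUSTED_ARGS | KF_RELEASE));
  }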
Signed-off-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Now that we're not invoking kfunc destructors when the kptr in a map was
NULL, we no longer require NULL checks in many of our KF_RELEASE kfuncs.
This patch removes those NULL checks.
Signed-off-by: David Vernet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fix from Chuck Lever:
- Fix a crash when using NFS with krb5p
* tag 'nfsd-6.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
SUNRPC: Fix a crash in gss_krb5_checksum()
|
|
Conflicts:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
6e9d51b1a5cb ("net/mlx5e: Initialize link speed to zero")
1bffcea42926 ("net/mlx5e: Add devlink hairpin queues parameters")
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
Adjacent changes:
drivers/net/phy/phy.c
323fe43cf9ae ("net: phy: Improved PHY error reporting in state machine")
4203d84032e2 ("net: phy: Ensure state transitions are processed from phy_stop()")
Signed-off-by: Jakub Kicinski <[email protected]>
|