aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2022-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-1/+5
tools/testing/selftests/net/.gitignore sort the net-next version and use it Signed-off-by: Jakub Kicinski <[email protected]>
2022-09-01rxrpc: Remove rxrpc_get_reply_time() which is no longer usedDavid Howells1-2/+0
Remove rxrpc_get_reply_time() as that is no longer used now that the call issue time is used instead of the reply time. Signed-off-by: David Howells <[email protected]>
2022-09-01rxrpc: Fix ICMP/ICMP6 error handlingDavid Howells1-0/+4
Because rxrpc pretends to be a tunnel on top of a UDP/UDP6 socket, allowing it to siphon off UDP packets early in the handling of received UDP packets thereby avoiding the packet going through the UDP receive queue, it doesn't get ICMP packets through the UDP ->sk_error_report() callback. In fact, it doesn't appear that there's any usable option for getting hold of ICMP packets. Fix this by adding a new UDP encap hook to distribute error messages for UDP tunnels. If the hook is set, then the tunnel driver will be able to see ICMP packets. The hook provides the offset into the packet of the UDP header of the original packet that caused the notification. An alternative would be to call the ->error_handler() hook - but that requires that the skbuff be cloned (as ip_icmp_error() or ipv6_cmp_error() do, though isn't really necessary or desirable in rxrpc's case is we want to parse them there and then, not queue them). Changes ======= ver #3) - Fixed an uninitialised variable. ver #2) - Fixed some missing CONFIG_AF_RXRPC_IPV6 conditionals. Fixes: 5271953cad31 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <[email protected]>
2022-08-31tcp: make global challenge ack rate limitation per net-ns and default disabledEric Dumazet1-0/+2
Because per host rate limiting has been proven problematic (side channel attacks can be based on it), per host rate limiting of challenge acks ideally should be per netns and turned off by default. This is a long due followup of following commits: 083ae308280d ("tcp: enable per-socket rate limiting of all 'challenge acks'") f2b2c582e824 ("tcp: mitigate ACK loops for connections as tcp_sock") 75ff39ccc1bd ("tcp: make challenge acks less predictable") Signed-off-by: Eric Dumazet <[email protected]> Cc: Jason Baron <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-31net: sched: gred/red: remove unused variables in struct red_statsZhengchao Shao1-1/+0
The variable "other" in the struct red_stats is not used. Remove it. Signed-off-by: Zhengchao Shao <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-31Bluetooth: Move hci_abort_conn to hci_conn.cBrian Gix1-0/+1
hci_abort_conn() is a wrapper around a number of DISCONNECT and CREATE_CONN_CANCEL commands that was being invoked from hci_request request queues, which are now deprecated. There are two versions: hci_abort_conn() which can be invoked from the hci_event thread, and hci_abort_conn_sync() which can be invoked within a hci_sync cmd chain. Signed-off-by: Brian Gix <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
2022-08-31netfilter: remove nf_conntrack_helper sysctl and modparam togglesPablo Neira Ayuso2-3/+0
__nf_ct_try_assign_helper() remains in place but it now requires a template to configure the helper. A toggle to disable automatic helper assignment was added by: a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment") in 2012 to address the issues described in "Secure use of iptables and connection tracking helpers". Automatic conntrack helper assignment was disabled by: 3bb398d925ec ("netfilter: nf_ct_helper: disable automatic helper assignment") back in 2016. This patch removes the sysctl and modparam toggles, users now have to rely on explicit conntrack helper configuration via ruleset. Update tools/testing/selftests/netfilter/nft_conntrack_helper.sh to check that auto-assignment does not happen anymore. Acked-by: Aaron Conole <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-08-30net: devlink: stub port params cmds for they are unused internallyJiri Pirko1-1/+0
Follow-up the removal of unused internal api of port params made by commit 42ded61aa75e ("devlink: Delete not used port parameters APIs") and stub the commands and add extack message to tell the user what is going on. If later on port params are needed, could be easily re-introduced, but until then it is a dead code. Signed-off-by: Jiri Pirko <[email protected]> Reviewed-by: Jakub Kicinski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-08-30netlink: add helpers for extack attr presence checkingJakub Kicinski1-0/+7
Being able to check attribute presence and set extack if not on one line is handy, add helpers. Reviewed-by: Johannes Berg <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2022-08-29genetlink: start to validate reserved header bytesJakub Kicinski1-0/+3
We had historically not checked that genlmsghdr.reserved is 0 on input which prevents us from using those precious bytes in the future. One use case would be to extend the cmd field, which is currently just 8 bits wide and 256 is not a lot of commands for some core families. To make sure that new families do the right thing by default put the onus of opting out of validation on existing families. Signed-off-by: Jakub Kicinski <[email protected]> Acked-by: Paul Moore <[email protected]> (NetLabel) Signed-off-by: David S. Miller <[email protected]>
2022-08-29xfrm: lwtunnel: add lwtunnel support for xfrm interfaces in collect_md modeEyal Birger1-0/+11
Allow specifying the xfrm interface if_id and link as part of a route metadata using the lwtunnel infrastructure. This allows for example using a single xfrm interface in collect_md mode as the target of multiple routes each specifying a different if_id. With the appropriate changes to iproute2, considering an xfrm device ipsec1 in collect_md mode one can for example add a route specifying an if_id like so: ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 In which case traffic routed to the device via this route would use if_id in the xfrm interface policy lookup. Or in the context of vrf, one can also specify the "link" property: ip route add <SUBNET> dev ipsec1 encap xfrm if_id 1 link_dev eth15 Note: LWT_XFRM_LINK uses NLA_U32 similar to IFLA_XFRM_LINK even though internally "link" is signed. This is consistent with other _LINK attributes in other devices as well as in bpf and should not have an effect as device indexes can't be negative. Reviewed-by: Nicolas Dichtel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Eyal Birger <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-08-29xfrm: interface: support collect metadata modeEyal Birger1-2/+9
This commit adds support for 'collect_md' mode on xfrm interfaces. Each net can have one collect_md device, created by providing the IFLA_XFRM_COLLECT_METADATA flag at creation. This device cannot be altered and has no if_id or link device attributes. On transmit to this device, the if_id is fetched from the attached dst metadata on the skb. If exists, the link property is also fetched from the metadata. The dst metadata type used is METADATA_XFRM which holds these properties. On the receive side, xfrmi_rcv_cb() populates a dst metadata for each packet received and attaches it to the skb. The if_id used in this case is fetched from the xfrm state, and the link is fetched from the incoming device. This information can later be used by upper layers such as tc, ebpf, and ip rules. Because the skb is scrubed in xfrmi_rcv_cb(), the attachment of the dst metadata is postponed until after scrubing. Similarly, xfrm_input() is adapted to avoid dropping metadata dsts by only dropping 'valid' (skb_valid_dst(skb) == true) dsts. Policy matching on packets arriving from collect_md xfrmi devices is done by using the xfrm state existing in the skb's sec_path. The xfrm_if_cb.decode_cb() interface implemented by xfrmi_decode_session() is changed to keep the details of the if_id extraction tucked away in xfrm_interface.c. Reviewed-by: Nicolas Dichtel <[email protected]> Reviewed-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: Eyal Birger <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-08-29net: allow storing xfrm interface metadata in metadata_dstEyal Birger1-0/+20
XFRM interfaces provide the association of various XFRM transformations to a netdevice using an 'if_id' identifier common to both the XFRM data structures (polcies, states) and the interface. The if_id is configured by the controlling entity (usually the IKE daemon) and can be used by the administrator to define logical relations between different connections. For example, different connections can share the if_id identifier so that they pass through the same interface, . However, currently it is not possible for connections using a different if_id to use the same interface while retaining the logical separation between them, without using additional criteria such as skb marks or different traffic selectors. When having a large number of connections, it is useful to have a the logical separation offered by the if_id identifier but use a single network interface. Similar to the way collect_md mode is used in IP tunnels. This patch attempts to enable different configuration mechanisms - such as ebpf programs, LWT encapsulations, and TC - to attach metadata to skbs which would carry the if_id. This way a single xfrm interface in collect_md mode can demux traffic based on this configuration on tx and provide this metadata on rx. The XFRM metadata is somewhat similar to ip tunnel metadata in that it has an "id", and shares similar configuration entities (bpf, tc, ...), however, it does not necessarily represent an IP tunnel or use other ip tunnel information, and also has an optional "link" property which can be used for affecting underlying routing decisions. Additional xfrm related criteria may also be added in the future. Therefore, a new metadata type is introduced, to be used in subsequent patches in the xfrm interface and configuration entities. Reviewed-by: Nikolay Aleksandrov <[email protected]> Reviewed-by: Nicolas Dichtel <[email protected]> Signed-off-by: Eyal Birger <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2022-08-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller1-1/+3
Daniel borkmann says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 11 non-merge commits during the last 14 day(s) which contain a total of 13 files changed, 61 insertions(+), 24 deletions(-). The main changes are: 1) Fix BPF verifier's precision tracking around BPF ring buffer, from Kumar Kartikeya Dwivedi. 2) Fix regression in tunnel key infra when passing FLOWI_FLAG_ANYSRC, from Eyal Birger. 3) Fix insufficient permissions for bpf_sys_bpf() helper, from YiFei Zhu. 4) Fix splat from hitting BUG when purging effective cgroup programs, from Pu Lehui. 5) Fix range tracking for array poke descriptors, from Daniel Borkmann. 6) Fix corrupted packets for XDP_SHARED_UMEM in aligned mode, from Magnus Karlsson. 7) Fix NULL pointer splat in BPF sockmap sk_msg_recvmsg(), from Liu Jian. 8) Add READ_ONCE() to bpf_jit_limit when reading from sysctl, from Kuniyuki Iwashima. 9) Add BPF selftest lru_bug check to s390x deny list, from Daniel Müller. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-08-26net: sched: remove unnecessary init of qdisc skb headZhengchao Shao1-7/+0
The memory allocated by using kzallloc_node and kcalloc has been cleared. Therefore, the structure members of the new qdisc are 0. So there's no need to explicitly assign a value of 0. Signed-off-by: Zhengchao Shao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-08-26Merge tag 'wireless-next-2022-08-26-v2' of ↵David S. Miller2-13/+39
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes berg says: ==================== Various updates: * rtw88: operation, locking, warning, and code style fixes * rtw89: small updates * cfg80211/mac80211: more EHT/MLO (802.11be, WiFi 7) work * brcmfmac: a couple of fixes * misc cleanups etc. ==================== Signed-off-by: David S. Miller <[email protected]>
2022-08-25Bluetooth: convert hci_update_adv_data to hci_syncBrian Gix1-0/+1
hci_update_adv_data() is called from hci_event and hci_core due to events from the controller. The prior function used the deprecated hci_request method, and the new one uses hci_sync.c Signed-off-by: Brian Gix <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
2022-08-25Bluetooth: move hci_get_random_address() to hci_syncBrian Gix1-0/+5
This function has no dependencies on the deprecated hci_request mechanism, so has been moved unchanged to hci_sync.c Signed-off-by: Brian Gix <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
2022-08-25Bluetooth: Move Adv Instance timer to hci_syncBrian Gix1-1/+2
The Advertising Instance expiration timer adv_instance_expire was handled with the deprecated hci_request mechanism, rather than it's replacement: hci_sync. Signed-off-by: Brian Gix <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
2022-08-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-3/+7
drivers/net/ethernet/mellanox/mlx5/core/en_fs.c 21234e3a84c7 ("net/mlx5e: Fix use after free in mlx5e_fs_init()") c7eafc5ed068 ("net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer") https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-25net: devlink: limit flash component name to match version returned by info_get()Jiri Pirko1-2/+1
Limit the acceptance of component name passed to cmd_flash_update() to match one of the versions returned by info_get(), marked by version type. This makes things clearer and enforces 1:1 mapping between exposed version and accepted flash component. Check VERSION_TYPE_COMPONENT version type during cmd_flash_update() execution by calling info_get() with different "req" context. That causes info_get() to lookup the component name instead of filling-up the netlink message. Remove "UPDATE_COMPONENT" flag which becomes used. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-25net: devlink: extend info_get() version put to indicate a flash componentJiri Pirko1-0/+16
Whenever the driver is called by his info_get() op, it may put multiple version names and values to the netlink message. Extend by additional helper devlink_info_version_running/stored_put_ext() that allows to specify a version type that indicates when particular version name represents a flash component. This is going to be used in follow-up patch calling info_get() during flash update command checking if version with this the version type exists. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-25net: sched: delete duplicate cleanup of backlog and qlenZhengchao Shao1-1/+0
qdisc_reset() is clearing qdisc->q.qlen and qdisc->qstats.backlog _after_ calling qdisc->ops->reset. There is no need to clear them again in the specific reset function. Signed-off-by: Zhengchao Shao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
2022-08-25wifi: cfg80211: Add link_id to cfg80211_ch_switch_started_notify()Veerendranath Jakkam1-1/+3
Add link_id parameter to cfg80211_ch_switch_started_notify() to allow driver to indicate on which link channel switch started on MLD. Send the data to userspace so it knows as well. Signed-off-by: Veerendranath Jakkam <[email protected]> Link: https://lore.kernel.org/r/[email protected] Link: https://lore.kernel.org/r/[email protected] [squash two patches] Signed-off-by: Johannes Berg <[email protected]>
2022-08-25wifi: mac80211: maintain link_id in link_staJohannes Berg1-0/+2
To helper drivers if they e.g. have a lookup of the link_sta pointer, add the link ID to the link_sta structure. Signed-off-by: Johannes Berg <[email protected]>
2022-08-25wifi: mac80211: add link information in ieee80211_rx_statusVasanthakumar Thiagarajan1-0/+5
In MLO, when the address translation from link to MLD is done in fw/hw, it is necessary to be able to have some information on the link on which the frame has been received. Extend the rx API to include link_id and a valid flag in ieee80211_rx_status. Also make chanes to mac80211 rx APIs to make use of the reported link_id after sanity checks. Signed-off-by: Vasanthakumar Thiagarajan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2022-08-25wifi: mac80211: properly implement MLO key handlingJohannes Berg1-0/+2
Implement key installation and lookup (on TX and RX) for MLO, so we can use multiple GTKs/IGTKs/BIGTKs. Co-authored-by: Ilan Peer <[email protected]> Signed-off-by: Ilan Peer <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2022-08-25wifi: cfg80211: Add link_id parameter to various key operations for MLOVeerendranath Jakkam1-12/+25
Add support for various key operations on MLD by adding new parameter link_id. Pass the link_id received from userspace to driver for add_key, get_key, del_key, set_default_key, set_default_mgmt_key and set_default_beacon_key to support configuring keys specific to each MLO link. Userspace must not specify link ID for MLO pairwise key since it is common for all the MLO links. Signed-off-by: Veerendranath Jakkam <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]>
2022-08-25wifi: cfg80211: add link id to txq paramsShaul Triebitz1-0/+2
The Tx queue parameters are per link, so add the link ID from nl80211 parameters to the API. While at it, lock the wdev when calling into the driver so it (and we) can check the link ID appropriately. Signed-off-by: Shaul Triebitz <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2022-08-25net: gro: skb_gro_header helper functionRichard Gobert1-15/+18
Introduce a simple helper function to replace a common pattern. When accessing the GRO header, we fetch the pointer from frag0, then test its validity and fetch it from the skb when necessary. This leads to the pattern skb_gro_header_fast -> skb_gro_header_hard -> skb_gro_header_slow recurring many times throughout GRO code. This patch replaces these patterns with a single inlined function call, improving code readability. Signed-off-by: Richard Gobert <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/20220823071034.GA56142@debian Signed-off-by: Paolo Abeni <[email protected]>
2022-08-24net: Add a bhash2 table hashed by port and addressJoanne Koong3-2/+95
The current bind hashtable (bhash) is hashed by port only. In the socket bind path, we have to check for bind conflicts by traversing the specified port's inet_bind_bucket while holding the hashbucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, the bind conflict check is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch adds a second bind table, bhash2, that hashes by port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6). Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the hashbucket spinlock. Please note a few things: * There can be the case where the a socket's address changes after it has been bound. There are two cases where this happens: 1) The case where there is a bind() call on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will assign the socket an address when it handles the connect() 2) In inet_sk_reselect_saddr(), which is called when rebuilding the sk header and a few pre-conditions are met (eg rerouting fails). In these two cases, we need to update the bhash2 table by removing the entry for the old address, and add a new entry reflecting the updated address. * The bhash2 table must have its own lock, even though concurrent accesses on the same port are protected by the bhash lock. Bhash2 must have its own lock to protect against cases where sockets on different ports hash to different bhash hashbuckets but to the same bhash2 hashbucket. This brings up a few stipulations: 1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock will always be acquired after the bhash lock and released before the bhash lock is released. 2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always acquired+released before another bhash2 lock is acquired+released. * The bhash table cannot be superseded by the bhash2 table because for bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket bound to that port must be checked for a potential conflict. The bhash table is the only source of port->socket associations. Signed-off-by: Joanne Koong <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski2-0/+4
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Fix crash with malformed ebtables blob which do not provide all entry points, from Florian Westphal. 2) Fix possible TCP connection clogging up with default 5-days timeout in conntrack, from Florian. 3) Fix crash in nf_tables tproxy with unsupported chains, also from Florian. 4) Do not allow to update implicit chains. 5) Make table handle allocation per-netns to fix data race. 6) Do not truncated payload length and offset, and checksum offset. Instead report EINVAl. 7) Enable chain stats update via static key iff no error occurs. 8) Restrict osf expression to ip, ip6 and inet families. 9) Restrict tunnel expression to netdev family. 10) Fix crash when trying to bind again an already bound chain. 11) Flowtable garbage collector might leave behind pending work to delete entries. This patch comes with a previous preparation patch as dependency. 12) Allow net.netfilter.nf_conntrack_frag6_high_thresh to be lowered, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_defrag_ipv6: allow nf_conntrack_frag6_high_thresh increases netfilter: flowtable: fix stuck flows on cleanup due to pending work netfilter: flowtable: add function to invoke garbage collection immediately netfilter: nf_tables: disallow binding to already bound chain netfilter: nft_tunnel: restrict it to netdev family netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families netfilter: nf_tables: do not leave chain stats enabled on error netfilter: nft_payload: do not truncate csum_offset and csum_type netfilter: nft_payload: report ERANGE for too long offset and length netfilter: nf_tables: make table handle allocation per-netns friendly netfilter: nf_tables: disallow updates of implicit chain netfilter: nft_tproxy: restrict to prerouting hook netfilter: conntrack: work around exceeded receive window netfilter: ebtables: reject blobs that don't provide all entry points ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-24netlink: fix some kernel-doc commentsZhengchao Shao1-1/+3
Modify the comment of input parameter of nlmsg_ and nla_ function. Signed-off-by: Zhengchao Shao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-24net: Fix a data-race around gro_normal_batch.Kuniyuki Iwashima1-1/+1
While reading gro_normal_batch, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs") Signed-off-by: Kuniyuki Iwashima <[email protected]> Acked-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-08-24net: Fix a data-race around sysctl_net_busy_poll.Kuniyuki Iwashima1-1/+1
While reading sysctl_net_busy_poll, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader. Fixes: 060212928670 ("net: add low latency socket poll") Signed-off-by: Kuniyuki Iwashima <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-08-24netfilter: flowtable: fix stuck flows on cleanup due to pending workPablo Neira Ayuso1-0/+2
To clear the flow table on flow table free, the following sequence normally happens in order: 1) gc_step work is stopped to disable any further stats/del requests. 2) All flow table entries are set to teardown state. 3) Run gc_step which will queue HW del work for each flow table entry. 4) Waiting for the above del work to finish (flush). 5) Run gc_step again, deleting all entries from the flow table. 6) Flow table is freed. But if a flow table entry already has pending HW stats or HW add work step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after freeing of the flow table. To fix the above, this patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Stack trace: [47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460 [47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704 [47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2 [47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) [47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table] [47773.889727] Call Trace: [47773.890214] dump_stack+0xbb/0x107 [47773.890818] print_address_description.constprop.0+0x18/0x140 [47773.892990] kasan_report.cold+0x7c/0xd8 [47773.894459] kasan_check_range+0x145/0x1a0 [47773.895174] down_read+0x99/0x460 [47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table] [47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table] [47773.913372] process_one_work+0x8ac/0x14e0 [47773.921325] [47773.921325] Allocated by task 592159: [47773.922031] kasan_save_stack+0x1b/0x40 [47773.922730] __kasan_kmalloc+0x7a/0x90 [47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct] [47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct] [47773.925207] tcf_action_init_1+0x45b/0x700 [47773.925987] tcf_action_init+0x453/0x6b0 [47773.926692] tcf_exts_validate+0x3d0/0x600 [47773.927419] fl_change+0x757/0x4a51 [cls_flower] [47773.928227] tc_new_tfilter+0x89a/0x2070 [47773.936652] [47773.936652] Freed by task 543704: [47773.937303] kasan_save_stack+0x1b/0x40 [47773.938039] kasan_set_track+0x1c/0x30 [47773.938731] kasan_set_free_info+0x20/0x30 [47773.939467] __kasan_slab_free+0xe7/0x120 [47773.940194] slab_free_freelist_hook+0x86/0x190 [47773.941038] kfree+0xce/0x3a0 [47773.941644] tcf_ct_flow_table_cleanup_work Original patch description and stack trace by Paul Blakey. Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey <[email protected]> Tested-by: Paul Blakey <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-08-24netfilter: flowtable: add function to invoke garbage collection immediatelyPablo Neira Ayuso1-0/+1
Expose nf_flow_table_gc_run() to force a garbage collector run from the offload infrastructure. Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-08-24netfilter: nf_tables: make table handle allocation per-netns friendlyPablo Neira Ayuso1-0/+1
mutex is per-netns, move table_netns to the pernet area. *read-write* to 0xffffffff883a01e8 of 8 bytes by task 6542 on cpu 0: nf_tables_newtable+0x6dc/0xc00 net/netfilter/nf_tables_api.c:1221 nfnetlink_rcv_batch net/netfilter/nfnetlink.c:513 [inline] nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline] nfnetlink_rcv+0xa6a/0x13a0 net/netfilter/nfnetlink.c:652 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x652/0x730 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x643/0x740 net/netlink/af_netlink.c:1921 Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-by: Abhishek Shah <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2022-08-23net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_masterVladimir Oltean1-0/+4
More logic will be added to dsa_tree_setup_master() and dsa_tree_teardown_master() in upcoming changes. Reduce the indentation by one level in these functions by introducing and using a dedicated iterator for CPU ports of a tree. Signed-off-by: Vladimir Oltean <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2022-08-23vsock: add API call for data readyArseniy Krasnov1-0/+1
This adds 'vsock_data_ready()' which must be called by transport to kick sleeping data readers. It checks for SO_RCVLOWAT value before waking user, thus preventing spurious wake ups. Based on 'tcp_data_ready()' logic. Signed-off-by: Arseniy Krasnov <[email protected]> Reviewed-by: Stefano Garzarella <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2022-08-23vsock: SO_RCVLOWAT transport set callbackArseniy Krasnov1-0/+1
This adds transport specific callback for SO_RCVLOWAT, because in some transports it may be difficult to know current available number of bytes ready to read. Thus, when SO_RCVLOWAT is set, transport may reject it. Signed-off-by: Arseniy Krasnov <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2022-08-22bonding: 3ad: make ad_ticks_per_sec a constJonathan Toppins1-1/+1
The value is only ever set once in bond_3ad_initialize and only ever read otherwise. There seems to be no reason to set the variable via bond_3ad_initialize when setting the global variable will do. Change ad_ticks_per_sec to a const to enforce its read-only usage. Signed-off-by: Jonathan Toppins <[email protected]> Acked-by: Jay Vosburgh <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-22Remove DECnet support from kernelStephen Hemminger7-954/+0
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <[email protected]> Acked-by: David Ahern <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2022-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-1/+27
No conflicts. Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-18net: macsec: Expose MACSEC_SALT_LEN definition to user spaceEmeel Hakim1-1/+0
Expose MACSEC_SALT_LEN definition to user space to be used in various user space applications such as iproute. Iproute will use this as part of adding macsec extended packet number support. Reviewed-by: Raed Salem <[email protected]> Reviewed-by: Sabrina Dubroca <[email protected]> Signed-off-by: Emeel Hakim <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-08-18bpf: Change bpf_setsockopt(SOL_IPV6) to reuse do_ipv6_setsockopt()Martin KaFai Lau2-0/+4
After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IPV6) and reuses the implementation in do_ipv6_setsockopt(). ipv6 could be compiled as a module. Like how other code solved it with stubs in ipv6_stubs.h, this patch adds the do_ipv6_setsockopt to the ipv6_bpf_stub. The current bpf_setsockopt(IPV6_TCLASS) does not take the INET_ECN_MASK into the account for tcp. The do_ipv6_setsockopt(IPV6_TCLASS) will handle it correctly. The existing optname white-list is refactored into a new function sol_ipv6_setsockopt(). After this last SOL_IPV6 dup code removal, the __bpf_setsockopt() is simplified enough that the extra "{ }" around the if statement can be removed. Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-08-18bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()Martin KaFai Lau1-0/+2
After the prep work in the previous patches, this patch removes the dup code from bpf_setsockopt(SOL_IP) and reuses the implementation in do_ip_setsockopt(). The existing optname white-list is refactored into a new function sol_ip_setsockopt(). NOTE, the current bpf_setsockopt(IP_TOS) is quite different from the the do_ip_setsockopt(IP_TOS). For example, it does not take the INET_ECN_MASK into the account for tcp and also does not adjust sk->sk_priority. It looks like the current bpf_setsockopt(IP_TOS) was referencing the IPV6_TCLASS implementation instead of IP_TOS. This patch tries to rectify that by using the do_ip_setsockopt(IP_TOS). While this is a behavior change, the do_ip_setsockopt(IP_TOS) behavior is arguably what the user is expecting. At least, the INET_ECN_MASK bits should be masked out for tcp. Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-08-18bpf: Change bpf_setsockopt(SOL_TCP) to reuse do_tcp_setsockopt()Martin KaFai Lau1-0/+2
After the prep work in the previous patches, this patch removes all the dup code from bpf_setsockopt(SOL_TCP) and reuses the do_tcp_setsockopt(). The existing optname white-list is refactored into a new function sol_tcp_setsockopt(). The sol_tcp_setsockopt() also calls the bpf_sol_tcp_setsockopt() to handle the TCP_BPF_XXX specific optnames. bpf_setsockopt(TCP_SAVE_SYN) now also allows a value 2 to save the eth header also and it comes for free from do_tcp_setsockopt(). Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-08-18bpf: Change bpf_setsockopt(SOL_SOCKET) to reuse sk_setsockopt()Martin KaFai Lau1-0/+2
After the prep work in the previous patches, this patch removes most of the dup code from bpf_setsockopt(SOL_SOCKET) and reuses them from sk_setsockopt(). The sock ptr test is added to the SO_RCVLOWAT because the sk->sk_socket could be NULL in some of the bpf hooks. The existing optname white-list is refactored into a new function sol_socket_setsockopt(). Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2022-08-18bpf: net: Consider has_current_bpf_ctx() when testing capable() in ↵Martin KaFai Lau1-0/+2
sk_setsockopt() When bpf program calling bpf_setsockopt(SOL_SOCKET), it could be run in softirq and doesn't make sense to do the capable check. There was a similar situation in bpf_setsockopt(TCP_CONGESTION). In commit 8d650cdedaab ("tcp: fix tcp_set_congestion_control() use from bpf hook"), tcp_set_congestion_control(..., cap_net_admin) was added to skip the cap check for bpf prog. This patch adds sockopt_ns_capable() and sockopt_capable() for the sk_setsockopt() to use. They will consider the has_current_bpf_ctx() before doing the ns_capable() and capable() test. They are in EXPORT_SYMBOL for the ipv6 module to use in a latter patch. Suggested-by: Stanislav Fomichev <[email protected]> Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>