path: root/net/core
Age         Commit message                                        (Author; files, lines)
2022-05-16  net: fix dev_fill_forward_path with pppoe + bridge  (Felix Fietkau; 1 file, -1/+1)

When calling dev_fill_forward_path on a pppoe device, the provided destination address is invalid. In order for the bridge fdb lookup to succeed, the pppoe code needs to update ctx->daddr to the correct value. Fix this by storing the address inside struct net_device_path_ctx.

Fixes: f6efc675c9dd ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: Felix Fietkau <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
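A minimal C sketch of the fix's shape (struct fields follow the upstream definition; the handler body is condensed, not the full patch):

    /* The walk context now carries the destination MAC, so a layer
     * such as pppoe can substitute the session peer address before
     * the bridge FDB lookup runs on the lower device. */
    struct net_device_path_ctx {
    	const struct net_device *dev;
    	u8			 daddr[ETH_ALEN];	/* added by this fix */
    	/* VLAN bookkeeping elided */
    };

    /* Condensed from the pppoe path resolution: */
    static int pppoe_fill_forward_path(struct net_device_path_ctx *ctx,
    				   struct net_device_path *path,
    				   const struct ppp_channel *chan)
    {
    	struct sock *sk = (struct sock *)chan->private;
    	struct pppox_sock *po = pppox_sk(sk);

    	path->type = DEV_PATH_PPPOE;
    	/* The fix: give the bridge FDB lookup the real peer address */
    	memcpy(ctx->daddr, po->pppoe_pa.remote, ETH_ALEN);
    	path->dev = ctx->dev;
    	ctx->dev = po->pppoe_dev;
    	return 0;
    }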
2022-05-16  net: call skb_defer_free_flush() before each napi_poll()  (Eric Dumazet; 1 file, -2/+3)

skb_defer_free_flush() can consume cpu cycles; it seems better to call it in the inner loop:

- Potentially frees page/skb that will be reallocated while hot.
- Account for the cpu cycles in the @time_limit determination.
- Keep softnet_data.defer_count small to reduce chances for skb_attempt_defer_free() to send an IPI.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
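The shape of the change in net_rx_action(), as a simplified sketch (locals, RPS handling and softirq bookkeeping elided):

    for (;;) {
    	struct napi_struct *n;

    	/* Moved into the inner loop: free deferred skbs while they
    	 * are still cache-hot, and let the cycles spent here count
    	 * against @time_limit. */
    	skb_defer_free_flush(sd);

    	if (list_empty(&list))
    		break;

    	n = list_first_entry(&list, struct napi_struct, poll_list);
    	budget -= napi_poll(n, &repoll);

    	/* Stop once the budget or the time limit is exhausted. */
    	if (unlikely(budget <= 0 ||
    		     time_after_eq(jiffies, time_limit)))
    		break;
    }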
2022-05-16  net: add skb_defer_max sysctl  (Eric Dumazet; 4 files, -7/+19)

commit 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists") added another per-cpu cache of skbs. It was expected to be small, and an IPI was forced whenever the list reached 128 skbs.

We might need to control the queue capacity and added latency more precisely. An IPI is now generated whenever the queue reaches half capacity. The default value of the new limit is 64.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-16  net: use napi_consume_skb() in skb_defer_free_flush()  (Eric Dumazet; 1 file, -1/+1)

skb_defer_free_flush() runs from softirq context; we have the opportunity to refill the napi_alloc_cache, and/or use kmem_cache_free_bulk() when this cache is full.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-16  net: fix possible race in skb_attempt_defer_free()  (Eric Dumazet; 2 files, -5/+7)

A cpu can observe sd->defer_count reaching 128 and call smp_call_function_single_async().

The problem is that the remote CPU can clear sd->defer_count before the IPI is run/acknowledged. Other cpus can queue more packets and also decide to call smp_call_function_single_async() while the pending IPI was not yet delivered.

This is a common issue with smp_call_function_single_async(): callers must ensure correct synchronization and serialization. I triggered this issue while experimenting with a smaller threshold.

Performing the call to smp_call_function_single_async() under sd->defer_lock protection did not solve the problem. Commit 5a18ceca6350 ("smp: Allow smp_call_function_single_async() to insert locked csd") replaced an informative WARN_ON_ONCE() with a return of -EBUSY, which is often ignored. The test for CSD_FLAG_LOCK presence is racy anyway.

Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
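A sketch of the synchronization the fix introduces (condensed): a per-cpu flag is claimed with cmpxchg() before sending the IPI and released by the IPI handler, so at most one IPI per softnet_data is ever in flight.

    /* Producer side, in skb_attempt_defer_free(): only the CPU that
     * wins the cmpxchg() may send the IPI. */
    if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
    	smp_call_function_single_async(cpu, &sd->defer_csd);

    /* IPI handler: re-arm only after the softirq has been raised. */
    static void trigger_rx_softirq(void *data)
    {
    	struct softnet_data *sd = data;

    	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
    	smp_store_release(&sd->defer_ipi_scheduled, 0);
    }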
2022-05-16  net: skb: check the boundary of drop reason in kfree_skb_reason()  (Menglong Dong; 1 file, -0/+2)

Sometimes, we may forget to reset the skb drop reason to NOT_SPECIFIED after we make it the return value of functions with return type enum skb_drop_reason, such as tcp_inbound_md5_hash. Therefore, its value can be SKB_NOT_DROPPED_YET (0), which is invalid for kfree_skb_reason(). So we check the range of the drop reason in kfree_skb_reason() with DEBUG_NET_WARN_ON_ONCE().

Reviewed-by: Jiang Biao <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Menglong Dong <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
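The resulting function, lightly condensed (DEBUG_NET_WARN_ON_ONCE() compiles away unless CONFIG_DEBUG_NET is set):

    void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
    {
    	if (unlikely(!skb_unref(skb)))
    		return;

    	/* The added check: 0 (SKB_NOT_DROPPED_YET) and anything past
    	 * SKB_DROP_REASON_MAX would mis-index the reason tables. */
    	DEBUG_NET_WARN_ON_ONCE(reason <= 0 || reason >= SKB_DROP_REASON_MAX);

    	trace_kfree_skb(skb, __builtin_return_address(0), reason);
    	__kfree_skb(skb);
    }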
2022-05-16  net: dm: check the boundary of skb drop reasons  (Menglong Dong; 1 file, -1/+1)

In net_dm_packet_trace_kfree_skb_hit(), 'reason' is reset to SKB_DROP_REASON_NOT_SPECIFIED if it is not smaller than SKB_DROP_REASON_MAX, but that check does not prevent it from being 0, which is invalid and can cause a NULL pointer dereference via drop_reasons. Therefore, reset it to SKB_DROP_REASON_NOT_SPECIFIED when 'reason <= 0'.

Reviewed-by: Jiang Biao <[email protected]>
Reviewed-by: Hao Peng <[email protected]>
Signed-off-by: Menglong Dong <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-16  net: core: add READ_ONCE/WRITE_ONCE annotations for sk->sk_bound_dev_if  (Eric Dumazet; 1 file, -4/+7)

sock_bindtoindex_locked() needs to use WRITE_ONCE(sk->sk_bound_dev_if, val), because other cpus/threads might locklessly read this field. sock_getbindtodevice() and sock_getsockopt() need READ_ONCE() because they run without the socket lock held.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
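The pattern, condensed from the patch (validation elided): the writer publishes with WRITE_ONCE() and lockless readers load with READ_ONCE(), preventing load/store tearing.

    /* Writer: socket lock held, but readers may not take it. */
    static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
    {
    	/* interface validation elided */
    	WRITE_ONCE(sk->sk_bound_dev_if, ifindex);
    	if (sk->sk_prot->rehash)
    		sk->sk_prot->rehash(sk);
    	sk_dst_reset(sk);
    	return 0;
    }

    /* Lockless reader, e.g. sock_getsockopt(SO_BINDTOIFINDEX): */
    	v.val = READ_ONCE(sk->sk_bound_dev_if);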
2022-05-16  net: allow gro_max_size to exceed 65536  (Alexander Duyck; 3 files, -9/+9)

Allow gro_max_size to take values larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value, and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size.

[edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-16  net: allow gso_max_size to exceed 65536  (Alexander Duyck; 3 files, -4/+19)

The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields.

With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE.

v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE.

Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
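The capping logic, roughly as it appears in the new sk_trim_gso_size() helper: only IPv6 TCP sockets (which can carry BIG TCP's hop-by-hop length option) keep a value above the legacy 64K ceiling.

    static void sk_trim_gso_size(struct sock *sk)
    {
    	if (sk->sk_gso_max_size <= GSO_LEGACY_MAX_SIZE)
    		return;
    #if IS_ENABLED(CONFIG_IPV6)
    	if (sk->sk_family == AF_INET6 &&
    	    sk_is_tcp(sk) &&
    	    !ipv6_addr_v4mapped(&sk->sk_v6_rcv_saddr))
    		return;		/* BIG TCP capable: keep the big size */
    #endif
    	sk->sk_gso_max_size = GSO_LEGACY_MAX_SIZE;
    }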
2022-05-16  net: add IFLA_TSO_{MAX_SIZE|SEGS} attributes  (Eric Dumazet; 1 file, -0/+6)

New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS are used to report to user-space the device TSO limits.

    ip -d link sh dev eth1
    ...
    tso_max_size 65536 tso_max_segs 65535

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-13  net: page_pool: add page allocation stats for two fast page allocate paths  (Jie Wang; 1 file, -1/+4)

Currently, when using page pool allocation stats to analyze an RX performance degradation problem, the stats only count pages allocated from page_pool_alloc_pages(). But nic drivers such as hns3 use page_pool_dev_alloc_frag() to allocate pages, so page stats in this API should also be counted.

Signed-off-by: Jie Wang <[email protected]>
Signed-off-by: Guangbin Huang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-12  net: update the register_netdevice() kdoc  (Jakub Kicinski; 1 file, -14/+6)

The BUGS section looks quite dated; the registration is under rtnl lock. Remove some obvious information while at it.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 1 file, -1/+3)

No conflicts.

Build issue in drivers/net/ethernet/sfc/ptp.c:
  54fccfdd7c66 ("sfc: efx_default_channel_type APIs can be static")
  49e6123c65da ("net: sfc: fix memory leak due to ptp channel")
  https://lore.kernel.org/all/[email protected]/

Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-12  rtnetlink: verify rate parameters for calls to ndo_set_vf_rate  (Bin Chen; 1 file, -10/+18)

When calling ndo_set_vf_rate() the max_tx_rate parameter may be zero, in which case the setting is cleared; otherwise it must be greater than or equal to min_tx_rate. Enforce this requirement on all calls to ndo_set_vf_rate() via a wrapper, which also only calls ndo_set_vf_rate() if defined by the driver.

Based on work by Jakub Kicinski <[email protected]>

Signed-off-by: Bin Chen <[email protected]>
Signed-off-by: Baowen Zheng <[email protected]>
Signed-off-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
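Condensed to a sketch, the wrapper that now fronts every call into the driver (name and exact placement per the patch may differ slightly):

    static int rtnl_set_vf_rate(struct net_device *dev, int vf,
    			    int min_tx_rate, int max_tx_rate)
    {
    	const struct net_device_ops *ops = dev->netdev_ops;

    	if (!ops->ndo_set_vf_rate)
    		return -EOPNOTSUPP;

    	/* max_tx_rate == 0 clears the setting; otherwise it must
    	 * not be below min_tx_rate. */
    	if (max_tx_rate && max_tx_rate < min_tx_rate)
    		return -EINVAL;

    	return ops->ndo_set_vf_rate(dev, vf, min_tx_rate, max_tx_rate);
    }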
2022-05-11  net: add more debug info in skb_checksum_help()  (Eric Dumazet; 1 file, -4/+6)

This is a followup of the previous patch. Dumping the stack trace is a good start, but printing basic skb information is probably better.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-11  net: remove two BUG() from skb_checksum_help()  (Eric Dumazet; 1 file, -2/+6)

I have a syzbot report that managed to get a crash in skb_checksum_help(). If syzbot can trigger these BUG(), it makes sense to replace them with more friendly WARN_ON_ONCE(), since skb_checksum_help() can instead return an error code. Note that syzbot will still crash there, until the real bug is fixed.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
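The combined effect of this entry and the one above, condensed to a sketch (exact warning text differs): each sanity check now dumps the offending skb once, warns, and returns an error instead of taking the machine down.

    offset = skb_checksum_start_offset(skb);
    ret = -EINVAL;
    if (unlikely(offset >= skb_headlen(skb))) {
    	DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false);	/* debug info */
    	WARN_ON_ONCE(true);				/* was: BUG() */
    	goto out;
    }
    offset += skb->csum_offset;
    if (unlikely(offset + sizeof(__sum16) > skb_headlen(skb))) {
    	DO_ONCE_LITE(skb_dump, KERN_ERR, skb, false);
    	WARN_ON_ONCE(true);				/* was: BUG() */
    	goto out;
    }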
2022-05-10  net: fix kdoc on __dev_queue_xmit()  (Jakub Kicinski; 1 file, -20/+15)

Commit c526fd8f9f4f21 ("net: inline dev_queue_xmit()") exported __dev_queue_xmit(); now it's being rendered in the html docs, triggering:

    Documentation/networking/kapi:92: net/core/dev.c:4101: WARNING: Missing matching underline for section title overline.

Reported-by: Stephen Rothwell <[email protected]>
Link: https://lore.kernel.org/linux-next/[email protected]/
Fixes: c526fd8f9f4f21 ("net: inline dev_queue_xmit()")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-10  bpf: Add source ip in "struct bpf_tunnel_key"  (Kaixi Fan; 1 file, -0/+9)

Add a tunnel source ip field in "struct bpf_tunnel_key". Add related code to set and get the tunnel source field.

Signed-off-by: Kaixi Fan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2022-05-10  bpf: Print some info if disabling bpf_jit_enable fails  (Tiezhu Yang; 1 file, -0/+6)

A user told me that bpf_jit_enable can be disabled on one system, but he failed to disable bpf_jit_enable on another system:

    # echo 0 > /proc/sys/net/core/bpf_jit_enable
    bash: echo: write error: Invalid argument

No useful info is available through the dmesg log; a quick analysis shows that the issue is related to CONFIG_BPF_JIT_ALWAYS_ON. When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and setting any other value than that will return failure.

It is better to print some info to tell the user why disabling bpf_jit_enable failed.

Signed-off-by: Tiezhu Yang <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
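Roughly what the patch adds to the sysctl handler (condensed; the surrounding handling of successful writes is elided): with CONFIG_BPF_JIT_ALWAYS_ON, extra1 == extra2 == SYSCTL_ONE, so a rejected write can be diagnosed by the min == max case.

    static int proc_dointvec_minmax_bpf_enable(struct ctl_table *table,
    					   int write, void *buffer,
    					   size_t *lenp, loff_t *ppos)
    {
    	int ret, jit_enable = *(int *)table->data;
    	int min = *(int *)table->extra1;
    	int max = *(int *)table->extra2;
    	struct ctl_table tmp = *table;

    	tmp.data = &jit_enable;
    	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
    	/* ... existing handling of successful writes elided ... */

    	if (write && ret && min == max)
    		pr_info_once("CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1.\n");

    	return ret;
    }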
2022-05-10  net: sysctl: Use SYSCTL_TWO instead of &two  (Tiezhu Yang; 1 file, -4/+3)

It is better to use SYSCTL_TWO instead of &two, and then we can remove the variable "two" in net/core/sysctl_net_core.c.

Signed-off-by: Tiezhu Yang <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
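The change is mechanical; for instance (entry condensed, assuming the bpf_jit_harden table entry as one of the &two users):

    {
    	.procname	= "bpf_jit_harden",
    	.data		= &bpf_jit_harden,
    	.maxlen		= sizeof(int),
    	.mode		= 0600,
    	.proc_handler	= proc_dointvec_minmax_bpf_restricted,
    	.extra1		= SYSCTL_ZERO,
    	.extra2		= SYSCTL_TWO,	/* was: &two */
    },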
2022-05-09  rtnetlink: add extack support in fdb del handlers  (Alaa Mohamed; 1 file, -2/+2)

Add extack support to .ndo_fdb_del in netdevice.h and all related methods.

Signed-off-by: Alaa Mohamed <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-09  net: fix wrong network header length  (Lina Wang; 1 file, -1/+3)

When clatd starts with ebpf offloading and NETIF_F_GRO_FRAGLIST is enabled, several skbs are gathered in skb_shinfo(skb)->frag_list. The first skb's ipv6 header is changed to ipv4 after bpf_skb_proto_6_to_4, and network_header/transport_header/mac_header are updated to match; but the other skbs in frag_list are not updated at all, and remain ipv6 packets.

udp_queue_rcv_skb calls skb_segment_list to traverse the other skbs in frag_list and make sure the right udp payload is delivered to user space. Unfortunately, the other skbs in frag_list, which are still ipv6 packets, are adjusted as if they matched the first skb and end up with a wrong transport header length.

For example: before bpf_skb_proto_6_to_4, the first skb and the other skbs in frag_list have the same network_header (24) and transport_header (64). After bpf_skb_proto_6_to_4, the ipv6 protocol has been changed to ipv4, so the first skb's network_header is 44 and transport_header is 64, while the other skbs in frag_list are unchanged. After skb_segment_list, the other skbs in frag_list have network_header 24 and transport_header 44, which is 20 bytes off from the original — the difference between the ipv6 and ipv4 header lengths. Just change transport_header to be the same as the original.

Actually, there are two possible fixes: one is traversing all skbs and changing every skb header in bpf_skb_proto_6_to_4; the other is modifying the frag_list skbs' headers in skb_segment_list. Considering efficiency, adopt the second one: when the first skb and the other skbs in frag_list have different network_header lengths, restore them to make sure the right udp payload is delivered to user space.

Signed-off-by: Lina Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-06  net: move netif_set_gso_max helpers  (Jakub Kicinski; 1 file, -0/+21)

These are now internal to the core; no need to expose them.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-06  net: make drivers set the TSO limit not the GSO limit  (Jakub Kicinski; 1 file, -2/+2)

Drivers should call the TSO setting helper; GSO is controllable by user space.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-06  net: don't allow user space to lift the device limits  (Jakub Kicinski; 2 files, -2/+37)

Up until commit 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") the gso_max_segs and gso_max_size of a device were not controlled from user space. The quoted commit added the ability to control them because of the following setup:

    netns A | netns B
     veth<->veth   eth0

If eth0 has TSO limitations and the user wants to efficiently forward traffic between eth0 and the veths, they should copy the TSO limitations of eth0 onto the veths. This would happen automatically for macvlans or ipvlan, but veth users are not so lucky (given the loose coupling).

Unfortunately the commit in question allowed users to also override the limits on real HW devices. It may be useful to control the max GSO size and someone may be using that ability (not that I know of any user), so create a separate set of knobs to reliably record the TSO limitations. Validate the user requests.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
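A condensed sketch of the new recording helper (per the series): dev->tso_max_size caps what user space may later request via gso_max_size, so the HW limit can no longer be lifted.

    /* Record the device's TSO ceiling; also lower the effective GSO
     * size if it currently exceeds the new hardware limit. */
    void netif_set_tso_max_size(struct net_device *dev, unsigned int size)
    {
    	dev->tso_max_size = min(GSO_MAX_SIZE, size);
    	if (size < READ_ONCE(dev->gso_max_size))
    		netif_set_gso_max_size(dev, size);
    }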
2022-05-06  net: add netif_inherit_tso_max()  (Jakub Kicinski; 1 file, -0/+12)

To make later patches smaller, create a helper for inheriting the TSO limitations of a lower device. The TSO in the name is not an accident; subsequent patches will replace GSO with TSO in more names.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
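In its eventual form, after the rest of the series lands, the helper reads roughly like this (sketch; at introduction it still operated on the gso_max fields):

    /* Inherit the lower device's TSO ceilings, never raising our own. */
    void netif_inherit_tso_max(struct net_device *to,
    			   const struct net_device *from)
    {
    	to->tso_max_size = min(from->tso_max_size, to->tso_max_size);
    	to->tso_max_segs = min(from->tso_max_segs, to->tso_max_segs);
    }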
2022-05-05  net: Make msg_zerocopy_alloc static  (David Ahern; 1 file, -2/+1)

msg_zerocopy_alloc is only used by msg_zerocopy_realloc; remove the export and make it static in skbuff.c.

Signed-off-by: David Ahern <[email protected]>
Acked-by: Jonathan Lemon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-05  net: align SO_RCVMARK required privileges with SO_MARK  (Eyal Birger; 1 file, -0/+6)

The commit referenced in the "Fixes" tag added the SO_RCVMARK socket option for receiving the skb mark in the ancillary data. Since this is a new capability, and exposes admin configured details regarding the underlying network setup to sockets, let's align the needed capabilities with those of SO_MARK.

Fixes: 6fd1d51cfa25 ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()")
Signed-off-by: Eyal Birger <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
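Condensed from the fix in sock_setsockopt(): the same capability gate that protects SO_MARK now fronts SO_RCVMARK.

    case SO_RCVMARK:
    	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) &&
    	    !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
    		ret = -EPERM;
    		break;
    	}
    	sock_valbool_flag(sk, SOCK_RCVMARK, valbool);
    	break;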
2022-05-05Revert "Merge branch 'mlxsw-line-card-model'"Jakub Kicinski1-298/+5
This reverts commit 5e927a9f4b9f29d78a7c7d66ea717bb5c8bbad8e, reversing changes made to cfc1d91a7d78cf9de25b043d81efcc16966d55b3. The discussion is still ongoing so let's remove the uAPI until the discussion settles. Link: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Ido Schimmel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-05  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 1 file, -5/+11)

tools/testing/selftests/net/forwarding/Makefile
  f62c5acc800e ("selftests/net/forwarding: add missing tests to Makefile")
  50fe062c806e ("selftests: forwarding: new test, verify host mdb entries")
  https://lore.kernel.org/all/[email protected]/

Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-04  tcp: resalt the secret every 10 seconds  (Eric Dumazet; 1 file, -3/+9)

In order to limit the ability for an observer to recognize the source ports sequence used to contact a set of destinations, we should periodically shuffle the secret. 10 seconds looks effective enough without causing particular issues.

Cc: Moshe Kol <[email protected]>
Cc: Yossi Gilad <[email protected]>
Cc: Amit Klein <[email protected]>
Cc: Jason A. Donenfeld <[email protected]>
Tested-by: Willy Tarreau <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
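The IPv4 variant after the change, condensed to a sketch: the secret itself stays put, but a coarse jiffies-derived value is folded into the hash input, so the port mapping re-shuffles every 10 seconds.

    /* 10s: effective shuffling without notable downsides. */
    #define EPHEMERAL_PORT_SHUFFLE_PERIOD (10 * HZ)

    u64 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
    {
    	net_secret_init();
    	return siphash_4u32((__force u32)saddr, (__force u32)daddr,
    			    (__force u16)dport,
    			    jiffies / EPHEMERAL_PORT_SHUFFLE_PERIOD,
    			    &net_secret);
    }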
2022-05-04  secure_seq: use the 64 bits of the siphash for port offset calculation  (Willy Tarreau; 1 file, -2/+2)

SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit 7cd23e5300c1 ("secure_seq: use SipHash in place of MD5"), but the output remained truncated to 32-bit only. In order to exploit more bits from the hash, let's make the functions return the full 64-bit of siphash_3u32(). We also make sure the port offset calculation in __inet_hash_connect() remains done on 32-bit to avoid the need for div_u64_rem() and an extra cost on 32-bit systems.

Cc: Jason A. Donenfeld <[email protected]>
Cc: Moshe Kol <[email protected]>
Cc: Yossi Gilad <[email protected]>
Cc: Amit Klein <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Willy Tarreau <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-04  memcg: accounting for objects allocated for new netdevice  (Vasily Averin; 1 file, -1/+1)

Creating a new netdevice allocates at least ~50Kb of memory for various kernel objects, but only ~5Kb of them are accounted to memcg. As a result, creating an unlimited number of netdevices inside a memcg-limited container does not fall within memcg restrictions, consumes a significant part of the host's memory, can cause a global OOM and lead to random kills of host processes.

The main consumers of non-accounted memory are:
  ~10Kb  80+ kernfs nodes
   ~6Kb  ipv6_add_dev() allocations
    6Kb  __register_sysctl_table() allocations
    4Kb  neigh_sysctl_register() allocations
    4Kb  __devinet_sysctl_register() allocations
    4Kb  __addrconf_sysctl_register() allocations

Accounting of these objects allows increasing the share of memcg-accounted memory up to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a typical VM with a default Fedora 35 kernel), and this should be enough to somehow protect the host from misuse inside a container. Other related objects are quite small and may not be taken into account, to minimize the expected performance degradation.

It should be separately mentioned that there are ~300 bytes of percpu allocation of struct ipstats_mib in snmp6_alloc_dev(); on huge multi-cpu nodes it can become the main consumer of memory.

This patch does not enable kernfs accounting, as it affects other parts of the kernel and should be discussed separately. However, even without kernfs, this patch significantly improves the current situation and allows taking into account more than half of all netdevice allocations.

Signed-off-by: Vasily Averin <[email protected]>
Acked-by: Luis Chamberlain <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
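The change itself is a flag swap at each allocation site; for example, condensed from one of the patched call sites (neigh_sysctl_register()):

    struct neigh_sysctl_table *t;

    /* GFP_KERNEL_ACCOUNT charges the allocation to the current
     * memcg, so containers pay for their own netdevices. */
    t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
    if (!t)
    	goto err;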
2022-05-03  netdev: reshuffle netif_napi_add() APIs to allow dropping weight  (Jakub Kicinski; 1 file, -3/+3)

Most drivers should not have to worry about selecting the right weight for their NAPI instances and pass NAPI_POLL_WEIGHT. It'd be best if we didn't require the argument at all and selected the default internally. This change prepares the ground for such reshuffling, allowing for a smooth transition.

The following API should remain after the next release cycle:
  netif_napi_add()
  netif_napi_add_weight()
  netif_napi_add_tx()
  netif_napi_add_tx_weight()

where the _weight() variants take an explicit weight argument. I opted for a _weight() suffix rather than a __ prefix, because we use __ in places to mean that the caller needs to also issue a synchronize_net() call.

Signed-off-by: Jakub Kicinski <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-03  net: sysctl: introduce sysctl SYSCTL_THREE  (Tonghao Zhang; 1 file, -2/+1)

This patch introduces SYSCTL_THREE.

KUnit:
[00:10:14] ================ sysctl_test (10 subtests) =================
[00:10:14] [PASSED] sysctl_test_api_dointvec_null_tbl_data
[00:10:14] [PASSED] sysctl_test_api_dointvec_table_maxlen_unset
[00:10:14] [PASSED] sysctl_test_api_dointvec_table_len_is_zero
[00:10:14] [PASSED] sysctl_test_api_dointvec_table_read_but_position_set
[00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_positive
[00:10:14] [PASSED] sysctl_test_dointvec_read_happy_single_negative
[00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_positive
[00:10:14] [PASSED] sysctl_test_dointvec_write_happy_single_negative
[00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_less_int_min
[00:10:14] [PASSED] sysctl_test_api_dointvec_write_single_greater_int_max
[00:10:14] =================== [PASSED] sysctl_test ===================

./run_kselftest.sh -c sysctl
...
ok 1 selftests: sysctl: sysctl.sh

Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: Julian Anastasov <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
Cc: Jozsef Kadlecsik <[email protected]>
Cc: Florian Westphal <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Lorenz Bauer <[email protected]>
Cc: Akhmat Karakotov <[email protected]>
Signed-off-by: Tonghao Zhang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
2022-05-03  net: sysctl: use shared sysctl macro  (Tonghao Zhang; 1 file, -6/+4)

This patch replaces &two, &four and &long_one with the shared SYSCTL_XXX macros.

Cc: Luis Chamberlain <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Iurii Zaikin <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Hideaki YOSHIFUJI <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: Julian Anastasov <[email protected]>
Cc: Pablo Neira Ayuso <[email protected]>
Cc: Jozsef Kadlecsik <[email protected]>
Cc: Florian Westphal <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Lorenz Bauer <[email protected]>
Cc: Akhmat Karakotov <[email protected]>
Signed-off-by: Tonghao Zhang <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
2022-05-02  tcp: optimise skb_zerocopy_iter_stream()  (Pavel Begunkov; 1 file, -2/+1)

It's expensive to make a copy of the 40B struct iov_iter, to the point it was taking 0.2-0.5% of all cycles in my tests. iov_iter_revert() should be fine, as it's a simple case without nested reverts/truncates.

Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/a7e1690c00c5dfe700c30eb9a8a81ec59f6545dd.1650884401.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <[email protected]>
2022-05-02  rtnl: move rtnl_newlink_create()  (Jakub Kicinski; 1 file, -91/+86)

Pure code move.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
2022-05-02  rtnl: split __rtnl_newlink() into two functions  (Jakub Kicinski; 1 file, -3/+20)

__rtnl_newlink() is 250 LoC, but has a few clear sections. Move the part which creates a new netdev to a separate function. For ease of review, the code will be moved in the next change.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
2022-05-02  rtnl: allocate more attr tables on the heap  (Jakub Kicinski; 1 file, -12/+18)

Commit a293974590cf ("rtnetlink: avoid frame size warning in rtnl_newlink()") moved to allocating the largest attribute array of rtnl_newlink() on the heap. Kalle reports the stack has grown above 1k again:

    net/core/rtnetlink.c:3557:1: error: the frame size of 1104 bytes is larger than 1024 bytes [-Werror=frame-larger-than=]

Move more attrs to the heap, wrapped in a struct. Don't bother with linkinfo; it's referenced a lot and we take its size, so it's awkward to move, plus it's small (6 elements).

Reported-by: Kalle Valo <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Tested-by: Kalle Valo <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
2022-05-01  sock: optimise sock_def_write_space barriers  (Pavel Begunkov; 1 file, -1/+25)

Now we have a separate path for sock_def_write_space() and can go one step further. When it's called from sock_wfree() we know that there is a preceding atomic for putting down ->sk_wmem_alloc. We can use it to replace smp_mb() with a less expensive smp_mb__after_atomic(). It also removes an extra RCU read lock/unlock as a small bonus.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-05-01  sock: optimise UDP sock_wfree() refcounting  (Pavel Begunkov; 1 file, -0/+14)

For non SOCK_USE_WRITE_QUEUE sockets, sock_wfree() (atomically) puts ->sk_wmem_alloc twice. It's needed to keep the socket alive while calling ->sk_write_space() after the first put.

However, some sockets, such as UDP, are freed by RCU (i.e. SOCK_RCU_FREE) and use the already RCU-safe sock_def_write_space(). Carve a fast path for such sockets: put down all refs in one go before calling sock_def_write_space(), but guard the socket from being freed by an RCU read section.

note: because TCP sockets are marked with SOCK_USE_WRITE_QUEUE it doesn't add extra checks in its path.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
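Condensed from the patch, the new branch in sock_wfree(): all references are dropped in a single atomic, while an RCU read section keeps the RCU-freed socket valid across the wakeup (the entry above then specializes the wakeup helper further).

    if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
    	if (sock_flag(sk, SOCK_RCU_FREE) &&
    	    sk->sk_write_space == sock_def_write_space) {
    		rcu_read_lock();
    		/* drop all wmem references in one go */
    		free = refcount_sub_and_test(len, &sk->sk_wmem_alloc);
    		sock_def_write_space(sk);
    		rcu_read_unlock();
    		if (free)
    			__sk_free(sk);
    		return;
    	}
    	/* slow path: two puts, socket kept alive by the first */
    	...
    }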
2022-05-01  sock: dedup sock_def_write_space wmem_alloc checks  (Pavel Begunkov; 1 file, -3/+2)

Except for minor rounding differences, the first ->sk_wmem_alloc test in sock_def_write_space() is a hand-coded version of sock_writeable(). Replace it with the helper, and also kill the following if duplicating the check.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-04-30  net: inline dev_queue_xmit()  (Pavel Begunkov; 1 file, -13/+2)

Inline dev_queue_xmit() and dev_queue_xmit_accel(); they both are small proxy functions doing nothing but redirecting the control flow to __dev_queue_xmit().

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
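The wrappers as they now live in netdevice.h — the exported symbol is __dev_queue_xmit():

    int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev);

    static inline int dev_queue_xmit(struct sk_buff *skb)
    {
    	return __dev_queue_xmit(skb, NULL);
    }

    static inline int dev_queue_xmit_accel(struct sk_buff *skb,
    				       struct net_device *sb_dev)
    {
    	return __dev_queue_xmit(skb, sb_dev);
    }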
2022-04-30  net: inline skb_zerocopy_iter_dgram  (Pavel Begunkov; 3 files, -24/+0)

skb_zerocopy_iter_dgram() is a small proxy function; inline it. For that, move __zerocopy_sg_from_iter into linux/skbuff.h.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-04-30  net: inline sock_alloc_send_skb  (Pavel Begunkov; 1 file, -7/+0)

sock_alloc_send_skb() is simple and just proxying to another function, so we can inline it and cut the associated overhead.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2022-04-28  bpf, sockmap: Call skb_linearize only when required in sk_psock_skb_ingress_enqueue  (Liu Jian; 1 file, -9/+13)

skb_to_sgvec fails only when the number of frag_list and frags entries exceeds MAX_MSG_FRAGS. Therefore, we can call skb_linearize only when the conversion fails.

Signed-off-by: Liu Jian <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
2022-04-28  net: SO_RCVMARK socket option for SO_MARK with recvmsg()  (Erin MacNeil; 1 file, -0/+7)

Adding a new socket option, SO_RCVMARK, to indicate that SO_MARK should be included in the ancillary data returned by recvmsg(). Renamed the sock_recv_ts_and_drops() function to sock_recv_cmsgs().

Signed-off-by: Erin MacNeil <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Acked-by: Marc Kleine-Budde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
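A minimal user-space sketch of the new option (error handling trimmed; the SO_RCVMARK fallback value assumes the uapi constant and may need checking against your headers). The mark arrives as a SOL_SOCKET/SO_MARK cmsg; note that after the privilege fix above, enabling it requires CAP_NET_RAW or CAP_NET_ADMIN.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef SO_RCVMARK
    #define SO_RCVMARK 75	/* assumed per asm-generic/socket.h */
    #endif

    static void read_with_mark(int fd)
    {
    	int one = 1;
    	char buf[2048], cbuf[CMSG_SPACE(sizeof(unsigned int))];
    	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    	struct msghdr msg = {
    		.msg_iov = &iov, .msg_iovlen = 1,
    		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    	};
    	struct cmsghdr *cm;

    	/* opt in to receiving the skb mark as ancillary data */
    	setsockopt(fd, SOL_SOCKET, SO_RCVMARK, &one, sizeof(one));

    	if (recvmsg(fd, &msg, 0) < 0)
    		return;

    	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
    		if (cm->cmsg_level == SOL_SOCKET &&
    		    cm->cmsg_type == SO_MARK) {
    			unsigned int mark;

    			memcpy(&mark, CMSG_DATA(cm), sizeof(mark));
    			printf("skb mark: %u\n", mark);
    		}
    	}
    }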
2022-04-28  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 2 files, -14/+9)

include/linux/netdevice.h
net/core/dev.c
  6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats")
  794c24e9921f ("net-core: rx_otherhost_dropped to core_stats")
  https://lore.kernel.org/all/[email protected]/

drivers/net/wan/cosa.c
  d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()")
  89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards")
  https://lore.kernel.org/all/[email protected]/

Signed-off-by: Jakub Kicinski <[email protected]>