aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234Thomas Gleixner2-26/+2
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not see http www gnu org licenses extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 503 file(s). Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Alexios Zavras <[email protected]> Reviewed-by: Allison Randal <[email protected]> Reviewed-by: Enrico Weigelt <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2019-06-19net: flow_offload: implement support for meta keyJiri Pirko1-0/+6
Implement support for previously added flow dissector meta key. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-19flow_dissector: add support for ingress ifindex dissectionJiri Pirko1-0/+9
Add new key meta that contains ingress ifindex value and add a function to dissect this from skb. The key and function is prepared to cover other potential skb metadata values dissection. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-18ip6_tunnel: allow not to count pkts on tstats by passing dev as NULLXin Long1-3/+6
A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL" is also needed by ip6_tunnel. Signed-off-by: Xin Long <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-18ipv6: Stop sending in-kernel notifications for each nexthopIdo Schimmel1-1/+0
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now ready to handle IPv6 multipath notifications. Therefore, stop ignoring such notifications in both drivers and stop sending notification for each added / deleted nexthop. v2: * Remove 'multipath_rt' from 'struct fib6_entry_notifier_info' Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-18ipv6: Extend notifier info for multipath routesIdo Schimmel1-0/+7
Extend the IPv6 FIB notifier info with number of sibling routes being notified. This will later allow listeners to process one notification for a multipath routes instead of N, where N is the number of nexthops. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-18netlink: Add field to skip in-kernel notificationsIdo Schimmel1-1/+3
The struct includes a 'skip_notify' flag that indicates if netlink notifications to user space should be suppressed. As explained in commit 3b1137fe7482 ("net: ipv6: Change notifications for multipath add to RTA_MULTIPATH"), this is useful to suppress per-nexthop RTM_NEWROUTE notifications when an IPv6 multipath route is added / deleted. Instead, one notification is sent for the entire multipath route. This concept is also useful for in-kernel notifications. Sending one in-kernel notification for the addition / deletion of an IPv6 multipath route - instead of one per-nexthop - provides a significant increase in the insertion / deletion rate to underlying devices. Add a 'skip_notify_kernel' flag to suppress in-kernel notifications. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-18netlink: Document all fields of 'struct nl_info'Ido Schimmel1-0/+2
Some fields were not documented. Add documentation. Signed-off-by: Ido Schimmel <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller21-154/+44
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <[email protected]>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds6-6/+29
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...
2019-06-17net: ipv4: move tcp_fastopen server side code to SipHash libraryArd Biesheuvel1-6/+4
Using a bare block cipher in non-crypto code is almost always a bad idea, not only for security reasons (and we've seen some examples of this in the kernel in the past), but also for performance reasons. In the TCP fastopen case, we call into the bare AES block cipher one or two times (depending on whether the connection is IPv4 or IPv6). On most systems, this results in a call chain such as crypto_cipher_encrypt_one(ctx, dst, src) crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...); aesni_encrypt kernel_fpu_begin(); aesni_enc(ctx, dst, src); // asm routine kernel_fpu_end(); It is highly unlikely that the use of special AES instructions has a benefit in this case, especially since we are doing the above twice for IPv6 connections, instead of using a transform which can process the entire input in one go. We could switch to the cbcmac(aes) shash, which would at least get rid of the duplicated overhead in *some* cases (i.e., today, only arm64 has an accelerated implementation of cbcmac(aes), while x86 will end up using the generic cbcmac template wrapping the AES-NI cipher, which basically ends up doing exactly the above). However, in the given context, it makes more sense to use a light-weight MAC algorithm that is more suitable for the purpose at hand, such as SipHash. Since the output size of SipHash already matches our chosen value for TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input sizes, this greatly simplifies the code as well. NOTE: Server farms backing a single server IP for load balancing purposes and sharing a single fastopen key will be adversely affected by this change unless all systems in the pool receive their kernel upgrades at the same time. Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-17netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXYFernando Fernandez Mancera2-11/+46
Add common functions into nf_synproxy_core.c to prepare for nftables support. The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new file nf_synproxy.h Signed-off-by: Fernando Fernandez Mancera <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-06-17netfilter: bridge: port sysctls to use brnf_netChristian Brauner1-1/+2
This ports the sysctls to use struct brnf_net. With this patch we make it possible to namespace the br_netfilter module in the following patch. Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-06-17netfilter: conntrack: small conntrack lookup optimizationFlorian Westphal1-4/+3
____nf_conntrack_find() performs checks on the conntrack objects in this order: 1. if (nf_ct_is_expired(ct)) This fetches ct->timeout, in third cache line. The hnnode that is used to store the list pointers resides in the first (origin) or second (reply tuple) cache lines. This test rarely passes, but its necessary to reap obsolete entries. 2. if (nf_ct_is_dying(ct)) This fetches ct->status, also in third cache line. The test is useless, and can be removed: Consider: cpu0 cpu1 ct = ____nf_conntrack_find() atomic_inc_not_zero(ct) -> ok nf_ct_key_equal -> ok is_dying -> DYING bit not set, ok set_bit(ct, DYING); ... unhash ... etc. return ct -> returning a ct with dying bit set, despite having a test for it. This (unlikely) case is fine - refcount prevents ct from getting free'd. 3. if (nf_ct_key_equal(h, tuple, zone, net)) nf_ct_key_equal checks in following order: 1. Tuple equal (first or second cacheline) 2. Zone equal (third cacheline) 3. confirmed bit set (->status, third cacheline) 4. net namespace match (third cacheline). Swapping "timeout" and "cpu" places timeout in the first cacheline. This has two advantages: 1. For a conntrack that won't even match the original tuple, we will now only fetch the first and maybe the second cacheline instead of always accessing the 3rd one as well. 2. in case of TCP ct->timeout changes frequently because we reduce/increase it when there are packets outstanding in the network. The first cacheline contains both the reference count and the ct spinlock, i.e. moving timeout there avoids writes to 3rd cacheline. The restart sequence in __nf_conntrack_find() is removed, if we found a candidate, but then fail to increment the refcount or discover the tuple has changed (object recycling), just pretend we did not find an entry. A second lookup won't find anything until another CPU adds a new conntrack with identical tuple into the hash table, which is very unlikely. We have the confirmation-time checks (when we hold hash lock) that deal with identical entries and even perform clash resolution in some cases. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2019-06-16net/mlx5: Declare more strictly devlink encap modeLeon Romanovsky1-2/+4
Devlink has UAPI declaration for encap mode, so there is no need to be loose on the data get/set by drivers. Update call sites to use enum devlink_eswitch_encap_mode instead of plain u8. Suggested-by: Parav Pandit <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Acked-by: Jiri Pirko <[email protected]> Reviewed-by: Parav Pandit <[email protected]> Reviewed-by: Petr Vorel <[email protected]>
2019-06-15tcp: add tcp_min_snd_mss sysctlEric Dumazet1-0/+1
Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <[email protected]> Suggested-by: Jonathan Looney <[email protected]> Acked-by: Neal Cardwell <[email protected]> Cc: Yuchung Cheng <[email protected]> Cc: Tyler Hicks <[email protected]> Cc: Bruce Curtis <[email protected]> Cc: Jonathan Lemon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-15tcp: limit payload size of sacked skbsEric Dumazet1-0/+2
Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Jonathan Looney <[email protected]> Acked-by: Neal Cardwell <[email protected]> Reviewed-by: Tyler Hicks <[email protected]> Cc: Yuchung Cheng <[email protected]> Cc: Bruce Curtis <[email protected]> Cc: Jonathan Lemon <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-15net: sched: remove NET_CLS_IND config optionJiri Pirko1-4/+1
This config option makes only couple of lines optional. Two small helpers and an int in couple of cls structs. Remove the config option and always compile this in. This saves the user from unexpected surprises when he adds a filter with ingress device match which is silently ignored in case the config option is not set. Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-14net: dsa: make cpu_dp non constVivien Didelot1-1/+1
A port may trigger operations on its dedicated CPU port, so using cpu_dp as const will raise warnings. Make cpu_dp non const. Signed-off-by: Vivien Didelot <[email protected]> Reviewed-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-14net: add high_order_alloc_disable sysctl/static keyEric Dumazet1-0/+2
>From linux-3.7, (commit 5640f7685831 "net: use a per task frag allocator") TCP sendmsg() has preferred using order-3 allocations. While it gives good results for most cases, we had reports that heavy uses of TCP over loopback were hitting a spinlock contention in page allocations/freeing. This commits adds a sysctl so that admins can opt-in for order-0 allocations. Hopefully mm layer might optimize order-3 allocations in the future since it could give us a nice boost (see 8 lines of following benchmark) The following benchmark shows a win when more than 8 TCP_STREAM threads are running (56 x86 cores server in my tests) for thr in {1..30} do sysctl -wq net.core.high_order_alloc_disable=0 T0=`./super_netperf $thr -H 127.0.0.1 -l 15` sysctl -wq net.core.high_order_alloc_disable=1 T1=`./super_netperf $thr -H 127.0.0.1 -l 15` echo $thr:$T0:$T1 done 1: 49979: 37267 2: 98745: 76286 3: 141088: 110051 4: 177414: 144772 5: 197587: 173563 6: 215377: 208448 7: 241061: 234087 8: 267155: 263373 9: 295069: 297402 10: 312393: 335213 11: 340462: 368778 12: 371366: 403954 13: 412344: 443713 14: 426617: 473580 15: 474418: 507861 16: 503261: 538539 17: 522331: 563096 18: 532409: 567084 19: 550824: 605240 20: 525493: 641988 21: 564574: 665843 22: 567349: 690868 23: 583846: 710917 24: 588715: 736306 25: 603212: 763494 26: 604083: 792654 27: 602241: 796450 28: 604291: 797993 29: 611610: 833249 30: 577356: 841062 Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-14tcp: add tcp_tx_skb_cache sysctlEric Dumazet1-1/+3
Feng Tang reported a performance regression after introduction of per TCP socket tx/rx caches, for TCP over loopback (netperf) There is high chance the regression is caused by a change on how well the 32 KB per-thread page (current->task_frag) can be recycled, and lack of pcp caches for order-3 pages. I could not reproduce the regression myself, cpus all being spinning on the mm spinlocks for page allocs/freeing, regardless of enabling or disabling the per tcp socket caches. It seems best to disable the feature by default, and let admins enabling it. MM layer either needs to provide scalable order-3 pages allocations, or could attempt a trylock on zone->lock if the caller only attempts to get a high-order page and is able to fallback to order-0 ones in case of pressure. Tests run on a 56 cores host (112 hyper threads) - 35.49% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath - 35.49% queued_spin_lock_slowpath - 18.18% get_page_from_freelist - __alloc_pages_nodemask - 18.18% alloc_pages_current skb_page_frag_refill sk_page_frag_refill tcp_sendmsg_locked tcp_sendmsg inet_sendmsg sock_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64_after_hwframe __libc_send + 17.31% __free_pages_ok + 31.43% swapper [kernel.vmlinux] [k] intel_idle + 9.12% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 6.53% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 0.69% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath + 0.68% netperf [kernel.vmlinux] [k] skb_release_data + 0.52% netperf [kernel.vmlinux] [k] tcp_sendmsg_locked 0.46% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Feng Tang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-14tcp: add tcp_rx_skb_cache sysctlEric Dumazet1-4/+2
Instead of relying on rps_needed, it is safer to use a separate static key, since we do not want to enable TCP rx_skb_cache by default. This feature can cause huge increase of memory usage on hosts with millions of sockets. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-14ipv4: tcp: fix ACK/RST sent with a transmit delayEric Dumazet2-4/+7
If we want to set a EDT time for the skb we want to send via ip_send_unicast_reply(), we have to pass a new parameter and initialize ipc.sockc.transmit_time with it. This fixes the EDT time for ACK/RST packets sent on behalf of a TIME_WAIT socket. Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-15bpf: net: Add SO_DETACH_REUSEPORT_BPFMartin KaFai Lau1-0/+2
There is SO_ATTACH_REUSEPORT_[CE]BPF but there is no DETACH. This patch adds SO_DETACH_REUSEPORT_BPF sockopt. The same sockopt can be used to undo both SO_ATTACH_REUSEPORT_[CE]BPF. reseport_detach_prog() is added and it is mostly a mirror of the existing reuseport_attach_prog(). The differences are, it does not call reuseport_alloc() and returns -ENOENT when there is no old prog. Cc: Craig Gallek <[email protected]> Signed-off-by: Martin KaFai Lau <[email protected]> Reviewed-by: Stanislav Fomichev <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
2019-06-14Merge tag 'mac80211-next-for-davem-2019-06-14' of ↵David S. Miller3-31/+92
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next Johannes Berg says: ==================== Many changes all over: * HE (802.11ax) work continues * WPA3 offloads * work on extended key ID handling continues * fixes to honour AP supported rates with auth/assoc frames * nl80211 netlink policy improvements to fix some issues with strict validation on new commands with old attrs ==================== Signed-off-by: David S. Miller <[email protected]>
2019-06-14Merge tag 'mac80211-for-davem-2019-06-14' of ↵David S. Miller1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Various fixes, all over: * a few memory leaks * fixes for management frame protection security and A2/A3 confusion (affecting TDLS as well) * build fix for certificates * etc. ==================== Signed-off-by: David S. Miller <[email protected]>
2019-06-14nl80211: send event when CMD_FRAME duration expiresJames Prestwood1-0/+10
cfg80211_remain_on_channel_expired is used to notify userspace when the remain on channel duration expired by sending an event. There is no such equivalent to CMD_FRAME, where if offchannel and a duration is provided, the card will go offchannel for that duration. Currently there is no way for userspace to tell when that duration expired apart from setting an independent timeout. This timeout is quite erroneous as the kernel may not immediately send out the frame because of scheduling or work queue delays. In testing, it was found this timeout had to be quite large to accomidate any potential delays. A better solution is to have the kernel send an event when this duration has expired. There is already NL80211_CMD_FRAME_WAIT_CANCEL which can be used to cancel a NL80211_CMD_FRAME offchannel. Using this command matches perfectly to how NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL works, where its both used to cancel and notify if the duration has expired. Signed-off-by: James Prestwood <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-14mac80211: call rate_control_send_low() internallyJohannes Berg1-23/+0
There's no rate control algorithm that *doesn't* want to call it internally, and calling it internally will let us modify its behaviour in the future. Signed-off-by: Johannes Berg <[email protected]>
2019-06-14cfg80211: Add a function to iterate all BSS entriesIlan Peer1-0/+20
Add a function that iterates over the BSS entries associated with a given wiphy and calls a callback for each iterated BSS. This can be used by drivers in various ways, e.g., to evaluate some property for all the BSSs in the medium. Signed-off-by: Ilan Peer <[email protected]> Signed-off-by: Luca Coelho <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-14mac80211: allow turning TWT responder support on and off via netlinkJohn Crispin2-0/+5
Allow the userland daemon to en/disable TWT support for an AP. Signed-off-by: Shashidhar Lakkavalli <[email protected]> Signed-off-by: John Crispin <[email protected]> [simplify parsing code] Signed-off-by: Johannes Berg <[email protected]>
2019-06-14mac80211: dynamically enable the TWT requester support on STA interfacesJohn Crispin1-0/+2
Turn TWT for STA interfaces when they associate and/or receive a beacon where the twt_responder bit has changed. Signed-off-by: Shashidhar Lakkavalli <[email protected]> Signed-off-by: John Crispin <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-14nl80211: require and validate vendor command policyJohannes Berg2-0/+17
Require that each vendor command give a policy of its sub-attributes in NL80211_ATTR_VENDOR_DATA, and then (stricly) check the contents, including the NLA_F_NESTED flag that we couldn't check on the outer layer because there we don't know yet. It is possible to use VENDOR_CMD_RAW_DATA for raw data, but then no nested data can be given (NLA_F_NESTED flag must be clear) and the data is just passed as is to the command. Signed-off-by: Johannes Berg <[email protected]>
2019-06-14mac80211: add ieee80211_get_he_iftype_cap() helperJohn Crispin1-4/+18
This function is similar to ieee80211_get_he_sta_cap() but allows passing the iftype. Also make ieee80211_get_he_sta_cap() use the new helper rather than duplicating the code. Signed-off-by: Shashidhar Lakkavalli <[email protected]> Signed-off-by: John Crispin <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-14nl80211: add support for SAE authentication offloadChung-Hsien Hsu1-0/+5
Let drivers advertise support for station-mode SAE authentication offload with a new NL80211_EXT_FEATURE_SAE_OFFLOAD flag. Signed-off-by: Chung-Hsien Hsu <[email protected]> Signed-off-by: Chi-Hsien Lin <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-14mac80211: AMPDU handling for Extended Key IDAlexander Wetzel1-0/+4
IEEE 802.11 - 2016 forbids mixing MPDUs with different keyIDs in one A-MPDU. Drivers supporting A-MPDUs and Extended Key ID must actively enforce that requirement due to the available two unicast keyIDs. Allow driver to signal mac80211 that they will not check the keyID in MPDUs when aggregating them and that they expect mac80211 to stop Tx aggregation when rekeying a connection using Extended Key ID. Signed-off-by: Alexander Wetzel <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2019-06-12tcp: add optional per socket transmit delayEric Dumazet1-0/+19
Adding delays to TCP flows is crucial for studying behavior of TCP stacks, including congestion control modules. Linux offers netem module, but it has unpractical constraints : - Need root access to change qdisc - Hard to setup on egress if combined with non trivial qdisc like FQ - Single delay for all flows. EDT (Earliest Departure Time) adoption in TCP stack allows us to enable a per socket delay at a very small cost. Networking tools can now establish thousands of flows, each of them with a different delay, simulating real world conditions. This requires FQ packet scheduler or a EDT-enabled NIC. This patchs adds TCP_TX_DELAY socket option, to set a delay in usec units. unsigned int tx_delay = 10000; /* 10 msec */ setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay)); Note that FQ packet scheduler limits might need some tweaking : man tc-fq PARAMETERS limit Hard limit on the real queue size. When this limit is reached, new packets are dropped. If the value is lowered, packets are dropped so that the new limit is met. Default is 10000 packets. flow_limit Hard limit on the maximum number of packets queued per flow. Default value is 100. Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc, so packets would be dropped if any of the previous limit is hit. Use of a jump label makes this support runtime-free, for hosts never using the option. Also note that TSQ (TCP Small Queues) limits are slightly changed with this patch : we need to account that skbs artificially delayed wont stop us providind more skbs to feed the pipe (netem uses skb_orphan_partial() for this purpose, but FQ can not use this trick) Because of that, using big delays might very well trigger old bugs in TSO auto defer logic and/or sndbuf limited detection. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-12vrf: Increment Icmp6InMsgs on the original netdevStephen Suryaputra1-0/+16
Get the ingress interface and increment ICMP counters based on that instead of skb->dev when the the dev is a VRF device. This is a follow up on the following message: https://www.spinics.net/lists/netdev/msg560268.html v2: Avoid changing skb->dev since it has unintended effect for local delivery (David Ahern). Signed-off-by: Stephen Suryaputra <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-12net: ethtool: Allow matching on vlan DEI bitMaxime Chevallier1-0/+1
Using ethtool, users can specify a classification action matching on the full vlan tag, which includes the DEI bit (also previously called CFI). However, when converting the ethool_flow_spec to a flow_rule, we use dissector keys to represent the matching patterns. Since the vlan dissector key doesn't include the DEI bit, this information was silently discarded when translating the ethtool flow spec in to a flow_rule. This commit adds the DEI bit into the vlan dissector key, and allows propagating the information to the driver when parsing the ethtool flow spec. Fixes: eca4205f9ec3 ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator") Reported-by: Michał Mirosław <[email protected]> Signed-off-by: Maxime Chevallier <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-11net/tls: add kernel-driven resync mechanism for TXJakub Kicinski1-0/+23
TLS offload drivers keep track of TCP seq numbers to make sure the packets are fed into the HW in order. When packets get dropped on the way through the stack, the driver will get out of sync and have to use fallback encryption, but unless TCP seq number is resynced it will never match the packets correctly (or even worse - use incorrect record sequence number after TCP seq wraps). Existing drivers (mlx5) feed the entire record on every out-of-order event, allowing FW/HW to always be in sync. This patch adds an alternative, more akin to the RX resync. When driver sees a frame which is past its expected sequence number the stream must have gotten out of order (if the sequence number is smaller than expected its likely a retransmission which doesn't require resync). Driver will ask the stack to perform TX sync before it submits the next full record, and fall back to software crypto until stack has performed the sync. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Dirk van der Merwe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-11net/tls: generalize the resync callbackJakub Kicinski1-2/+3
Currently only RX direction is ever resynced, however, TX may also get out of sequence if packets get dropped on the way to the driver. Rename the resync callback and add a direction parameter. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Dirk van der Merwe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-11net/tls: add kernel-driven TLS RX resyncJakub Kicinski1-2/+32
TLS offload device may lose sync with the TCP stream if packets arrive out of order. Drivers can currently request a resync at a specific TCP sequence number. When a record is found starting at that sequence number kernel will inform the device of the corresponding record number. This requires the device to constantly scan the stream for a known pattern (constant bytes of the header) after sync is lost. This patch adds an alternative approach which is entirely under the control of the kernel. Kernel tracks records it had to fully decrypt, even though TLS socket is in TLS_HW mode. If multiple records did not have any decrypted parts - it's a pretty strong indication that the device is out of sync. We choose the min number of fully encrypted records to be 2, which should hopefully be more than will get retransmitted at a time. After kernel decides the device is out of sync it schedules a resync request. If the TCP socket is empty the resync gets performed immediately. If socket is not empty we leave the record parser to resync when next record comes. Before resync in message parser we peek at the TCP socket and don't attempt the sync if the socket already has some of the next record queued. On resync failure (encrypted data continues to flow in) we retry with exponential backoff, up to once every 128 records (with a 16k record thats at most once every 2M of data). Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Dirk van der Merwe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-11net/tls: rename handle_device_resync()Jakub Kicinski1-1/+1
handle_device_resync() doesn't describe the function very well. The function checks if resync should be issued upon parsing of a new record. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Dirk van der Merwe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-11net/tls: pass record number as a byte arrayJakub Kicinski1-2/+3
TLS offload code casts record number to a u64. The buffer should be aligned to 8 bytes, but its actually a __be64, and the rest of the TLS code treats it as big int. Make the offload callbacks take a byte array, drivers can make the choice to do the ugly cast if they want to. Prepare for copying the record number onto the stack by defining a constant for max size of the byte array. Signed-off-by: Jakub Kicinski <[email protected]> Reviewed-by: Dirk van der Merwe <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-10bpf: Allow bpf_map_lookup_elem() on an xskmapJonathan Lemon1-2/+2
Currently, the AF_XDP code uses a separate map in order to determine if an xsk is bound to a queue. Instead of doing this, have bpf_map_lookup_elem() return a xdp_sock. Rearrange some xdp_sock members to eliminate structure holes. Remove selftest - will be added back in later patch. Signed-off-by: Jonathan Lemon <[email protected]> Acked-by: Martin KaFai Lau <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]>
2019-06-10ipv6: Allow routes to use nexthop objectsDavid Ahern1-0/+1
Add support for RTA_NH_ID attribute to allow a user to specify a nexthop id to use with a route. fc_nh_id is added to fib6_config to hold the value passed in the RTA_NH_ID attribute. If a nexthop id is given, the gateway, device, encap and multipath attributes can not be set. Update ip6_route_del to check metric and protocol before nexthop specs. If fc_nh_id is set, then it must match the id in the route entry. Since IPv6 allows delete of a cached entry (an exception), add ip6_del_cached_rt_nh to cycle through all of the fib6_nh in a fib entry if it is using a nexthop. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-10ipv4: Allow routes to use nexthop objectsDavid Ahern1-0/+1
Add support for RTA_NH_ID attribute to allow a user to specify a nexthop id to use with a route. fc_nh_id is added to fib_config to hold the value passed in the RTA_NH_ID attribute. If a nexthop id is given, the gateway, device, encap and multipath attributes can not be set. Update fib_nh_match to check ids on a route delete. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-10nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop structDavid Ahern1-0/+4
IPv6 has traditionally had a single fib6_nh per fib6_info. With nexthops we can have multiple fib6_nh associated with a fib6_info. Add a nexthop helper to invoke a callback for each fib6_nh in a 'struct nexthop'. If the callback returns non-0, the loop is stopped and the return value passed to the caller. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-09ipv6: tcp: send consistent autoflowlabel in TIME_WAIT stateEric Dumazet1-0/+1
In case autoflowlabel is in action, skb_get_hash_flowi6() derives a non zero skb->hash to the flowlabel. If skb->hash is zero, a flow dissection is performed. Since all TCP skbs sent from ESTABLISH state inherit their skb->hash from sk->sk_txhash, we better keep a copy of sk->sk_txhash into the TIME_WAIT socket. After this patch, ACK or RST packets sent on behalf of a TIME_WAIT socket have the flowlabel that was previously used by the flow. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-09net: hwbm: Make the hwbm_pool lock a mutexSebastian Andrzej Siewior1-3/+3
Based on review, `lock' is only acquired in hwbm_pool_add() which is invoked via ->probe(), ->resume() and ->ndo_change_mtu(). Based on this the lock can become a mutex and there is no need to disable interrupts during the procedure. Now that the lock is a mutex, hwbm_pool_add() no longer invokes hwbm_pool_refill() in an atomic context so we can pass GFP_KERNEL to hwbm_pool_refill() and remove the `gfp' argument from hwbm_pool_add(). Cc: Thomas Petazzoni <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2019-06-09nexthop: off by one in nexthop_mpath_select()Dan Carpenter1-1/+1
The nhg->nh_entries[] array is allocated in nexthop_grp_alloc() and it has nhg->num_nh elements so this check should be >= instead of >. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Dan Carpenter <[email protected]> Reviewed-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>