aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2018-01-08netfilter: move checksum indirection to struct nf_ipv6_opsPablo Neira Ayuso1-1/+0
We cannot make a direct call to nf_ip6_checksum() because that would result in autoloading the 'ipv6' module because of symbol dependencies. Therefore, define checksum indirection in nf_ipv6_ops where this really belongs to. For IPv4, we can indeed make a direct function call, which is faster, given IPv4 is built-in in the networking code by default. Still, CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline stub for IPv4 in such case. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: remove hooks from family definitionPablo Neira Ayuso2-11/+11
They don't belong to the family definition, move them to the filter chain type definition instead. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: remove multihook chains and familiesPablo Neira Ayuso2-2/+0
Since NFPROTO_INET is handled from the core, we don't need to maintain extra infrastructure in nf_tables to handle the double hook registration, one for IPv4 and another for IPv6. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables_inet: don't use multihook infrastructure anymorePablo Neira Ayuso1-2/+1
Use new native NFPROTO_INET support in netfilter core, this gets rid of ad-hoc code in the nf_tables API codebase. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables: explicit nft_set_pktinfo() call from hook pathPablo Neira Ayuso4-4/+8
Instead of calling this function from the family specific variant, this reduces the code size in the fast path for the netdev, bridge and inet families. After this change, we must call nft_set_pktinfo() upfront from the chain hook indirection. Before: text data bss dec hex filename 2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o After: text data bss dec hex filename 2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: nf_tables_arp: don't set forward chainPablo Neira Ayuso1-1/+0
46928a0b49f3 ("netfilter: nf_tables: remove multihook chains and families") already removed this, this is a leftover. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: core: only allow one nat hook per hook pointFlorian Westphal1-0/+4
The netfilter NAT core cannot deal with more than one NAT hook per hook location (prerouting, input ...), because the NAT hooks install a NAT null binding in case the iptables nat table (iptable_nat hooks) or the corresponding nftables chain (nft nat hooks) doesn't specify a nat transformation. Null bindings are needed to detect port collsisions between NAT-ed and non-NAT-ed connections. This causes nftables NAT rules to not work when iptable_nat module is loaded, and vice versa because nat binding has already been attached when the second nat hook is consulted. The netfilter core is not really the correct location to handle this (hooks are just hooks, the core has no notion of what kinds of side effects a hook implements), but its the only place where we can check for conflicts between both iptables hooks and nftables hooks without adding dependencies. So add nat annotation to hook_ops to describe those hooks that will add NAT bindings and then make core reject if such a hook already exists. The annotation fills a padding hole, in case further restrictions appar we might change this to a 'u8 type' instead of bool. iptables error if nft nat hook active: iptables -t nat -A POSTROUTING -j MASQUERADE iptables v1.4.21: can't initialize iptables table `nat': File exists Perhaps iptables or your kernel needs to be upgraded. nftables error if iptables nat table present: nft -f /etc/nftables/ipv4-nat /usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists table nat { ^^ Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: xtables: add and use xt_request_find_table_lockFlorian Westphal2-28/+24
currently we always return -ENOENT to userspace if we can't find a particular table, or if the table initialization fails. Followup patch will make nat table init fail in case nftables already registered a nat hook so this change makes xt_find_table_lock return an ERR_PTR to return the errno value reported from the table init function. Add xt_request_find_table_lock as try_then_request_module replacement and use it where needed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: don't allocate space for arp/bridge hooks unless neededFlorian Westphal1-0/+2
no need to define hook points if the family isn't supported. Because we need these hooks for either nftables, arp/ebtables or the 'call-iptables' hack we have in the bridge layer add two new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the users select them. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: timeouts can be constFlorian Westphal1-1/+1
Nowadays this is just the default template that is used when setting up the net namespace, so nothing writes to these locations. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: l4 protocol trackers can be constFlorian Westphal1-1/+1
previous patches removed all writes to these structs so we can now mark them as const. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08netfilter: conntrack: constify list of builtin trackersFlorian Westphal1-1/+1
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-01-08xfrm: Return error on unknown encap_type in init_stateHerbert Xu1-0/+1
Currently esp will happily create an xfrm state with an unknown encap type for IPv4, without setting the necessary state parameters. This patch fixes it by returning -EINVAL. There is a similar problem in IPv6 where if the mode is unknown we will skip initialisation while returning zero. However, this is harmless as the mode has already been checked further up the stack. This patch removes this anomaly by aligning the IPv6 behaviour with IPv4 and treating unknown modes (which cannot actually happen) as transport mode. Fixes: 38320c70d282 ("[IPSEC]: Use crypto_aead and authenc in ESP") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-01-05net: revert "Update RFS target at poll for tcp/udp"Soheil Hassas Yeganeh2-4/+0
On multi-threaded processes, one common architecture is to have one (or a small number of) threads polling sockets, and a considerably larger pool of threads reading form and writing to the sockets. When we set RPS core on tcp_poll() or udp_poll() we essentially steer all packets of all the polled FDs to one (or small number of) cores, creaing a bottleneck and/or RPS misprediction. Another common architecture is to shard FDs among threads pinned to cores. In such a setting, setting RPS core in tcp_poll() and udp_poll() is redundant because the RFS core is correctly set in recvmsg and sendmsg. Thus, revert the following commit: c3f1dbaf6e28 ("net: Update RFS target at poll for tcp/udp"). Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-05ip: do not set RFS core on error queue readsSoheil Hassas Yeganeh1-1/+2
We should only record RPS on normal reads and writes. In single threaded processes, all calls record the same state. In multi-threaded processes where a separate thread processes errors, the RFS table mispredicts. Note that, when CONFIG_RPS is disabled, sock_rps_record_flow is a noop and no branch is added as a result of this patch. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02net: tcp: Remove TCP probe moduleMasami Hiramatsu2-302/+0
Remove TCP probe module since jprobe has been deprecated. That function is now replaced by tcp/tcp_probe trace-event. You can use it via ftrace or perftools. Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02net: tcp: Add trace events for TCP congestion window tracingMasami Hiramatsu1-0/+3
This adds an event to trace TCP stat variables with slightly intrusive trace-event. This uses ftrace/perf event log buffer to trace those state, no needs to prepare own ring-buffer, nor custom user apps. User can use ftrace to trace this event as below; # cd /sys/kernel/debug/tracing # echo 1 > events/tcp/tcp_probe/enable (run workloads) # cat trace Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-02inet_diag: Add equal-operator for portsKristian Evensen1-0/+8
inet_diag currently provides less/greater than or equal operators for comparing ports when filtering sockets. An equal comparison can be performed by combining the two existing operators, or a user can for example request a port range and then do the final filtering in userspace. However, these approaches both have drawbacks. Implementing equal using LE/GE causes the size and complexity of a filter to grow quickly as the number of ports increase, while it on busy machines would be great if the kernel only returns information about relevant sockets. This patch introduces source and destination port equal operators. INET_DIAG_BC_S_EQ is used to match a source port, INET_DIAG_BC_D_EQ a destination port, and usage is the same as for the existing port operators. I.e., the port to match is stored in the no-member of the next inet_diag_bc_op-struct in the filter. Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+11
net/ipv6/ip6_gre.c is a case of parallel adds. include/trace/events/tcp.h is a little bit more tricky. The removal of in-trace-macro ifdefs in 'net' paralleled with moving show_tcp_state_name and friends over to include/trace/events/sock.h in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: do not allocate linear memory for zerocopy skbsWillem de Bruijn1-4/+7
Zerocopy payload is now always stored in frags, and space for headers is reversed, so this memory is unused. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: place all zerocopy payload in fragsWillem de Bruijn1-4/+5
This avoids an unnecessary copy of 1-2KB and improves tso_fragment, which has to fall back to tcp_fragment if skb->len != skb_data_len. It also avoids a surprising inconsistency in notifications: Zerocopy packets sent over loopback have their frags copied, so set SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently does not happen for small packets, because when all data fits in the linear fragment, data is not copied in skb_orphan_frags_rx. Reported-by: Tom Deseyn <tom.deseyn@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27tcp: push full zerocopy packetsWillem de Bruijn1-1/+3
Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the same for zerocopy frags as non-zerocopy frags and set the PSH bit. This improves GRO assembly. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27Merge branch 'master' of ↵David S. Miller3-69/+45
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-12-22 1) Separate ESP handling from segmentation for GRO packets. This unifies the IPsec GSO and non GSO codepath. 2) Add asynchronous callbacks for xfrm on layer 2. This adds the necessary infrastructure to core networking. 3) Allow to use the layer2 IPsec GSO codepath for software crypto, all infrastructure is there now. 4) Also allow IPsec GSO with software crypto for local sockets. 5) Don't require synchronous crypto fallback on IPsec offloading, it is not needed anymore. 6) Check for xdo_dev_state_free and only call it if implemented. From Shannon Nelson. 7) Check for the required add and delete functions when a driver registers xdo_dev_ops. From Shannon Nelson. 8) Define xfrmdev_ops only with offload config. From Shannon Nelson. 9) Update the xfrm stats documentation. From Shannon Nelson. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-27Merge branch 'master' of ↵David S. Miller1-1/+11
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2017-12-22 1) Check for valid id proto in validate_tmpl(), otherwise we may trigger a warning in xfrm_state_fini(). From Cong Wang. 2) Fix a typo on XFRMA_OUTPUT_MARK policy attribute. From Michal Kubecek. 3) Verify the state is valid when encap_type < 0, otherwise we may crash on IPsec GRO . From Aviv Heller. 4) Fix stack-out-of-bounds read on socket policy lookup. We access the flowi of the wrong address family in the IPv4 mapped IPv6 case, fix this by catching address family missmatches before we do the lookup. 5) fix xfrm_do_migrate() with AEAD to copy the geniv field too. Otherwise the state is not fully initialized and migration fails. From Antony Antony. 6) Fix stack-out-of-bounds with misconfigured transport mode policies. Our policy template validation is not strict enough. It is possible to configure policies with transport mode template where the address family of the template does not match the selectors address family. Fix this by refusing such a configuration, address family can not change on transport mode. 7) Fix a policy reference leak when reusing pcpu xdst entry. From Florian Westphal. 8) Reinject transport-mode packets through tasklet, otherwise it is possible to reate a recursion loop. From Herbert Xu. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26net: erspan: remove md NULL checkWilliam Tu1-5/+0
The 'md' is allocated from 'tun_dst = ip_tun_rx_dst' and since we've checked 'tun_dst', 'md' will never be NULL. The patch removes it at both ipv4 and ipv6 erspan. Fixes: afb4c97d90e6 ("ip6_gre: fix potential memory leak in ip6erspan_rcv") Fixes: 50670b6ee9bc ("ip_gre: fix potential memory leak in erspan_rcv") Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-26tcp: md5: Handle RCU dereference of md5sig_infoMat Martineau1-1/+1
Dereference tp->md5sig_info in tcp_v4_destroy_sock() the same way it is done in the adjacent call to tcp_clear_md5_list(). Resolves this sparse warning: net/ipv4/tcp_ipv4.c:1914:17: warning: incorrect type in argument 1 (different address spaces) net/ipv4/tcp_ipv4.c:1914:17: expected struct callback_head *head net/ipv4/tcp_ipv4.c:1914:17: got struct callback_head [noderef] <asn:4>*<noident> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-4/+14
Lots of overlapping changes. Also on the net-next side the XDP state management is handled more in the generic layers so undo the 'net' nfp fix which isn't applicable in net-next. Include a necessary change by Jakub Kicinski, with log message: ==================== cls_bpf no longer takes care of offload tracking. Make sure netdevsim performs necessary checks. This fixes a warning caused by TC trying to remove a filter it has not added. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20ipv4: Fix use-after-free when flushing FIB tablesIdo Schimmel1-2/+7
Since commit 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") the local table uses the same trie allocated for the main table when custom rules are not in use. When a net namespace is dismantled, the main table is flushed and freed (via an RCU callback) before the local table. In case the callback is invoked before the local table is iterated, a use-after-free can occur. Fix this by iterating over the FIB tables in reverse order, so that the main table is always freed after the local table. v3: Reworded comment according to Alex's suggestion. v2: Add a comment to make the fix more explicit per Dave's and Alex's feedback. Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20net: sock: replace sk_state_load with inet_sk_state_load and remove ↵Yafang Shao4-5/+5
sk_state_store sk_state_load is only used by AF_INET/AF_INET6, so rename it to inet_sk_state_load and move it into inet_sock.h. sk_state_store is removed as it is not used any more. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state ↵Yafang Shao4-9/+19
tracepoint As sk_state is a common field for struct sock, so the state transition tracepoint should not be a TCP specific feature. Currently it traces all AF_INET state transition, so I rename this tracepoint to inet_sock_set_state tracepoint with some minor changes and move it into trace/events/sock.h. We dont need to create a file named trace/events/inet_sock.h for this one single tracepoint. Two helpers are introduced to trace sk_state transition - void inet_sk_state_store(struct sock *sk, int newstate); - void inet_sk_set_state(struct sock *sk, int state); As trace header should not be included in other header files, so they are defined in sock.c. The protocol such as SCTP maybe compiled as a ko, hence export inet_sk_set_state(). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20ip_gre: fix potential memory leak in erspan_rcvHaishuang Yan1-1/+3
If md is NULL, tun_dst must be freed, otherwise it will cause memory leak. Fixes: 1a66a836da6 ("gre: add collect_md mode to ERSPAN tunnel") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20ip_gre: fix error path when erspan_rcv failedHaishuang Yan1-0/+2
When erspan_rcv call return PACKET_REJECT, we shoudn't call ipgre_rcv to process packets again, instead send icmp unreachable message in error path. Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Acked-by: William Tu <u9012063@gmail.com> Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-20esp: Don't require synchronous crypto fallback on offloading anymore.Steffen Klassert1-10/+2
We support asynchronous crypto on layer 2 ESP now. So no need to force synchronous crypto fallback on offloading anymore. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-20net: Add asynchronous callbacks for xfrm on layer 2.Steffen Klassert1-3/+21
This patch implements asynchronous crypto callbacks and a backlog handler that can be used when IPsec is done at layer 2 in the TX path. It also extends the skb validate functions so that we can update the driver transmit return codes based on async crypto operation or to indicate that we queued the packet in a backlog queue. Joint work with: Aviv Heller <avivh@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-20xfrm: Separate ESP handling from segmentation for GRO packets.Steffen Klassert2-56/+22
We change the ESP GSO handlers to only segment the packets. The ESP handling and encryption is defered to validate_xmit_xfrm() where this is done for non GRO packets too. This makes the code more robust and prepares for asynchronous crypto handling. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-19ipv4: fib: Fix metrics match when deleting a routePhil Sutter1-2/+6
The recently added fib_metrics_match() causes a regression for routes with both RTAX_FEATURES and RTAX_CC_ALGO if the latter has TCP_CONG_NEEDS_ECN flag set: | # ip link add d0 type dummy | # ip link set d0 up | # ip route add 172.29.29.0/24 dev d0 features ecn congctl dctcp | # ip route del 172.29.29.0/24 dev d0 features ecn congctl dctcp | RTNETLINK answers: No such process During route insertion, fib_convert_metrics() detects that the given CC algo requires ECN and hence sets DST_FEATURE_ECN_CA bit in RTAX_FEATURES. During route deletion though, fib_metrics_match() compares stored RTAX_FEATURES value with that from userspace (which obviously has no knowledge about DST_FEATURE_ECN_CA) and fails. Fixes: 5f9ae3d9e7e4a ("ipv4: do metrics match when looking up and deleting a route") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19ip_gre: remove the incorrect mtu limit for ipgre tapXin Long1-0/+1
ipgre tap driver calls ether_setup(), after commit 61e84623ace3 ("net: centralize net_device min/max MTU checking"), the range of mtu is [min_mtu, max_mtu], which is [68, 1500] by default. It causes the dev mtu of the ipgre tap device to not be greater than 1500, this limit value is not correct for ipgre tap device. Besides, it's .change_mtu already does the right check. So this patch is just to set max_mtu as 0, and leave the check to it's .change_mtu. Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking") Reported-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-19xfrm: Reinject transport-mode packets through taskletHerbert Xu1-1/+11
This is an old bugbear of mine: https://www.mail-archive.com/netdev@vger.kernel.org/msg03894.html By crafting special packets, it is possible to cause recursion in our kernel when processing transport-mode packets at levels that are only limited by packet size. The easiest one is with DNAT, but an even worse one is where UDP encapsulation is used in which case you just have to insert an UDP encapsulation header in between each level of recursion. This patch avoids this problem by reinjecting tranport-mode packets through a tasklet. Fixes: b05e106698d9 ("[IPV4/6]: Netfilter IPsec input hooks") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-12-18net: erspan: reload pointer after pskb_may_pullWilliam Tu1-1/+3
pskb_may_pull() can change skb->data, so we need to re-load pkt_md and ershdr at the right place. Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support") Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18net: erspan: fix wrong return valueWilliam Tu1-1/+1
If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM. Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support") Fixes: f551c91de262 ("net: erspan: introduce erspan v2 for ip_gre") Signed-off-by: William Tu <u9012063@gmail.com> Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller11-27/+59
Three sets of overlapping changes, two in the packet scheduler and one in the meson-gxl PHY driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15ip_gre: fix wrong return value of erspan_rcvHaishuang Yan1-1/+1
If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM. Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15net: erspan: introduce erspan v2 for ip_greWilliam Tu1-15/+90
The patch adds support for erspan version 2. Not all features are supported in this patch. The SGT (security group tag), GRA (timestamp granularity), FT (frame type) are set to fixed value. Only hardware ID and direction are configurable. Optional subheader is also not supported. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-15net: erspan: refactor existing erspan codeWilliam Tu1-10/+17
The patch refactors the existing erspan implementation in order to support erspan version 2, which has additional metadata. So, in stead of having one 'struct erspanhdr' holding erspan version 1, breaks it into 'struct erspan_base_hdr' and 'struct erspan_metadata'. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13tcp: refresh tcp_mstamp from timers callbacksEric Dumazet1-0/+2
Only the retransmit timer currently refreshes tcp_mstamp We should do the same for delayed acks and keepalives. Even if RFC 7323 does not request it, this is consistent to what linux did in the past, when TS values were based on jiffies. Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Mike Maloney <maloney@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Mike Maloney <maloney@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13tcp: fix potential underestimation on rcv_rttWei Wang1-4/+6
When ms timestamp is used, current logic uses 1us in tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us. This could cause rcv_rtt underestimation. Fix it by always using a min value of 1ms if ms timestamp is used. Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps") Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13tcp: pause Fast Open globally after third consecutive timeoutYuchung Cheng3-30/+22
Prior to this patch, active Fast Open is paused on a specific destination IP address if the previous connections to the IP address have experienced recurring timeouts . But recent experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla browsers indicate the isssue is often caused by broken middle-boxes sitting close to the client. Therefore it is much better user experience if Fast Open is disabled out-right globally to avoid experiencing further timeouts on connections toward other destinations. This patch changes the destination-IP disablement to global disablement if a connection experiencing recurring timeouts or aborts due to timeout. Repeated incidents would still exponentially increase the pause time, starting from an hour. This is extremely conservative but an unfortunate compromise to minimize bad experience due to broken middle-boxes. Reported-by: Dragana Damjanovic <ddamjanovic@mozilla.com> Reported-by: Patrick McManus <mcmanus@ducksong.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13tcp/dccp: avoid one atomic operation for timewait hashdanceEric Dumazet2-17/+17
First, rename __inet_twsk_hashdance() to inet_twsk_hashdance() Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead of 4, but adding a fat warning that we do not have the right to access tw anymore after inet_twsk_hashdance() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller3-3/+2
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The follow patchset contains Netfilter fixes for your net tree, they are: 1) Fix compilation warning in x_tables with clang due to useless redundant reassignment, from Colin Ian King. 2) Add bugtrap to net_exit to catch uninitialized lists, patch from Vasily Averin. 3) Fix out of bounds memory reads in H323 conntrack helper, this comes with an initial patch to remove replace the obscure CHECK_BOUND macro as a dependency. From Eric Sesterhenn. 4) Reduce retransmission timeout when window is 0 in TCP conntrack, from Florian Westphal. 6) ctnetlink clamp timeout to INT_MAX if timeout is too large, otherwise timeout wraps around and it results in killing the entry that is being added immediately. 7) Missing CAP_NET_ADMIN checks in cthelper and xt_osf, due to no netns support. From Kevin Cernekee. 8) Missing maximum number of instructions checks in xt_bpf, patch from Jann Horn. 9) With no CONFIG_PROC_FS ipt_CLUSTERIP compilation breaks, patch from Arnd Bergmann. 10) Missing netlink attribute policy in nftables exthdr, from Florian Westphal. 11) Enable conntrack with IPv6 MASQUERADE rules, as a357b3f80bc8 should have done in first place, from Konstantin Khlebnikov. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-13tcp: allow TLP in ECN CWRNeal Cardwell1-6/+3
This patch enables tail loss probe in cwnd reduction (CWR) state to detect potential losses. Prior to this patch, since the sender uses PRR to determine the cwnd in CWR state, the combination of CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer() defers sending wait for more ACKs. The ACKs may not come due to packet losses. Disallowing TLP when there is unused cwnd had the primary effect of disallowing TLP when there is TSO deferral, Nagle deferral, or we hit the rwin limit. Because basically every application write() or incoming ACK will cause us to run tcp_write_xmit() to see if we can send more, and then if we sent something we call tcp_schedule_loss_probe() to see if we should schedule a TLP. At that point, there are a few common reasons why some cwnd budget could still be unused: (a) rwin limit (b) nagle check (c) TSO deferral (d) TSQ For (d), after the next packet tx completion the TSQ mechanism will allow us to send more packets, so we don't really need a TLP (in practice it shouldn't matter whether we schedule one or not). But for (a), (b), (c) the sender won't send any more packets until it gets another ACK. But if the whole flight was lost, or all the ACKs were lost, then we won't get any more ACKs, and ideally we should schedule and send a TLP to get more feedback. In particular for a long time we have wanted some kind of timer for TSO deferral, and at least this would give us some kind of timer Reported-by: Steve Ibanez <sibanez@stanford.edu> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Nandita Dukkipati <nanditad@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>