aboutsummaryrefslogtreecommitdiff
path: root/include/net/tcp.h
AgeCommit message (Collapse)AuthorFilesLines
2017-11-01tcp: fix tcp_mtu_probe() vs highest_sackEric Dumazet1-3/+3
Based on SNMP values provided by Roman, Yuchung made the observation that some crashes in tcp_sacktag_walk() might be caused by MTU probing. Looking at tcp_mtu_probe(), I found that when a new skb was placed in front of the write queue, we were not updating tcp highest sack. If one skb is freed because all its content was copied to the new skb (for MTU probing), then tp->highest_sack could point to a now freed skb. Bad things would then happen, including infinite loops. This patch renames tcp_highest_sack_combine() and uses it from tcp_mtu_probe() to fix the bug. Note that I also removed one test against tp->sacked_out, since we want to replace tp->highest_sack regardless of whatever condition, since keeping a stale pointer to freed skb is a recipe for disaster. Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Alexei Starovoitov <[email protected]> Reported-by: Roman Gushchin <[email protected]> Reported-by: Oleksandr Natalenko <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+1
Several conflicts here. NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to nfp_fl_output() needed some adjustments because the code block is in an else block now. Parallel additions to net/pkt_cls.h and net/sch_generic.h A bug fix in __tcp_retransmit_skb() conflicted with some of the rbtree changes in net-next. The tc action RCU callback fixes in 'net' had some overlap with some of the recent tcf_block reworking. Signed-off-by: David S. Miller <[email protected]>
2017-10-29bpf: bpf_compute_data uses incorrect cb structureJohn Fastabend1-0/+1
SK_SKB program types use bpf_compute_data to store the end of the packet data. However, bpf_compute_data assumes the cb is stored in the qdisc layer format. But, for SK_SKB this is the wrong layer of the stack for this type. It happens to work (sort of!) because in most cases nothing happens to be overwritten today. This is very fragile and error prone. Fortunately, we have another hole in tcp_skb_cb we can use so lets put the data_end value there. Note, SK_SKB program types do not use data_meta, they are failed by sk_skb_is_valid_access(). Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: remove unnecessary includeAlexei Starovoitov1-3/+0
two extra #include are not necessary in tcp.h Remove them. Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops") Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_pacing_ca_ratioEric Dumazet1-2/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_pacing_ss_ratioEric Dumazet1-1/+0
Also remove an obsolete comment about TCP pacing. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_invalid_ratelimitEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_autocorkingEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_min_rtt_wlenEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_min_tso_segsEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_challenge_ack_limitEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_limit_output_bytesEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_workaround_signed_windowsEric Dumazet1-2/+2
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_tso_win_divisorEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_moderate_rcvbufEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-28tcp: Namespace-ify sysctl_tcp_nometrics_saveEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_frtoEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_adv_win_scaleEric Dumazet1-5/+4
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_app_winEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_dsackEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_max_reorderingEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: remove stale sysctl_tcp_reorderingEric Dumazet1-1/+0
This extern is no longer used. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_fackEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_abort_on_overflowEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_rfc1337Eric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_stdurgEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_retrans_collapseEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_slow_start_after_idleEric Dumazet1-2/+1
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_thin_linear_timeoutsEric Dumazet1-2/+0
Note that sysctl_tcp_thin_dupack was not used, I deleted it. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_recoveryEric Dumazet1-1/+1
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27tcp: Namespace-ify sysctl_tcp_early_retransEric Dumazet1-1/+0
Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-26tcp: TCP experimental option for SMCUrsula Braun1-0/+7
The SMC protocol [1] relies on the use of a new TCP experimental option [2, 3]. With this option, SMC capabilities are exchanged between peers during the TCP three way handshake. This patch adds support for this experimental option to TCP. References: [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609 [2] Shared Use of TCP Experimental Options RFC 6994: https://tools.ietf.org/rfc/rfc6994.txt [3] IANA ExID SMCR: http://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-exids Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-24tcp: Configure TFO without cookie per socket and/or per routeChristoph Paasch1-1/+2
We already allow to enable TFO without a cookie by using the fastopen-sysctl and setting it to TFO_SERVER_COOKIE_NOT_REQD (or TFO_CLIENT_NO_COOKIE). This is safe to do in certain environments where we know that there isn't a malicous host (aka., data-centers) or when the application-protocol already provides an authentication mechanism in the first flight of data. A server however might be providing multiple services or talking to both sides (public Internet and data-center). So, this server would want to enable cookie-less TFO for certain services and/or for connections that go to the data-center. This patch exposes a socket-option and a per-route attribute to enable such fine-grained configurations. Signed-off-by: Christoph Paasch <[email protected]> Reviewed-by: Yuchung Cheng <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-0/+5
There were quite a few overlapping sets of changes here. Daniel's bug fix for off-by-ones in the new BPF branch instructions, along with the added allowances for "data_end > ptr + x" forms collided with the metadata additions. Along with those three changes came veritifer test cases, which in their final form I tried to group together properly. If I had just trimmed GIT's conflict tags as-is, this would have split up the meta tests unnecessarily. In the socketmap code, a set of preemption disabling changes overlapped with the rename of bpf_compute_data_end() to bpf_compute_data_pointers(). Changes were made to the mv88e6060.c driver set addr method which got removed in net-next. The hyperv transport socket layer had a locking change in 'net' which overlapped with a change of socket state macro usage in 'net-next'. Signed-off-by: David S. Miller <[email protected]>
2017-10-20tcp: socket option to set TCP fast open keyYuchung Cheng1-2/+3
New socket option TCP_FASTOPEN_KEY to allow different keys per listener. The listener by default uses the global key until the socket option is set. The key is a 16 bytes long binary data. This option has no effect on regular non-listener TCP sockets. Signed-off-by: Yuchung Cheng <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Christoph Paasch <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-20bpf: avoid preempt enable/disable in sockmap using tcp_skb_cb regionJohn Fastabend1-0/+5
SK_SKB BPF programs are run from the socket/tcp context but early in the stack before much of the TCP metadata is needed in tcp_skb_cb. So we can use some unused fields to place BPF metadata needed for SK_SKB programs when implementing the redirect function. This allows us to drop the preempt disable logic. It does however require an API change so sk_redirect_map() has been updated to additionally provide ctx_ptr to skb. Note, we do however continue to disable/enable preemption around actual BPF program running to account for map updates. Signed-off-by: John Fastabend <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-12tcp: remove obsolete helpersEric Dumazet1-17/+0
Remove three inline helpers that are no longer needed. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-11tcp: fix tcp_unlink_write_queue()Eric Dumazet1-0/+1
Yury reported crash with this signature : [ 554.034021] [<ffff80003ccd5a58>] 0xffff80003ccd5a58 [ 554.034156] [<ffff00000888fd34>] skb_release_all+0x14/0x30 [ 554.034288] [<ffff00000888fd64>] __kfree_skb+0x14/0x28 [ 554.034409] [<ffff0000088ece6c>] tcp_sendmsg_locked+0x4dc/0xcc8 [ 554.034541] [<ffff0000088ed68c>] tcp_sendmsg+0x34/0x58 [ 554.034659] [<ffff000008919fd4>] inet_sendmsg+0x2c/0xf8 [ 554.034783] [<ffff0000088842e8>] sock_sendmsg+0x18/0x30 [ 554.034928] [<ffff0000088861fc>] SyS_sendto+0x84/0xf8 Problem is that skb->destructor contains garbage, and this is because I accidentally removed tcp_skb_tsorted_anchor_cleanup() from tcp_unlink_write_queue() This would trigger with a write(fd, <invalid_memory>, len) attempt, and we will add to packetdrill this capability to avoid future regressions. Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Reported-by: Yury Norov <[email protected]> Tested-by: Yury Norov <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-07tcp: implement rb-tree based retransmit queueEric Dumazet1-42/+47
Using a linear list to store all skbs in write queue has been okay for quite a while : O(N) is not too bad when N < 500. Things get messy when N is the order of 100,000 : Modern TCP stacks want 10Gbit+ of throughput even with 200 ms RTT flows. 40 ns per cache line miss means a full scan can use 4 ms, blowing away CPU caches. SACK processing often can use various hints to avoid parsing whole retransmit queue. But with high packet losses and/or high reordering, hints no longer work. Sender has to process thousands of unfriendly SACK, accumulating a huge socket backlog, burning a cpu and massively dropping packets. Using an rb-tree for retransmit queue has been avoided for years because it added complexity and overhead, but now is the time to be more resistant and say no to quadratic behavior. 1) RTX queue is no longer part of the write queue : already sent skbs are stored in one rb-tree. 2) Since reaching the head of write queue no longer needs sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue Tested: On receiver : netem on ingress : delay 150ms 200us loss 1 GRO disabled to force stress and SACK storms. for f in `seq 1 10` do ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1 done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}' Before patch : 323.87 351.48 339.59 338.62 306.72 204.07 304.93 291.88 202.47 176.88 2840 After patch: 1700.83 2207.98 2070.17 1544.26 2114.76 2124.89 1693.14 1080.91 2216.82 1299.94 18053 Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-07tcp: uninline tcp_write_queue_purge()Eric Dumazet1-14/+1
Since the upcoming rtx rbtree will add some extra code, it is time to not inline this fat function anymore. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-05tcp: new list for sent but unacked skbs for RACK recoveryEric Dumazet1-1/+23
This patch adds a new queue (list) that tracks the sent but not yet acked or SACKed skbs for a TCP connection. The list is chronologically ordered by skb->skb_mstamp (the head is the oldest sent skb). This list will be used to optimize TCP Rack recovery, which checks an skb's timestamp to judge if it has been lost and needs to be retransmitted. Since TCP write queue is ordered by sequence instead of sent time, RACK has to scan over the write queue to catch all eligible packets to detect lost retransmission, and iterates through SACKed skbs repeatedly. Special cares for rare events: 1. TCP repair fakes skb transmission so the send queue needs adjusted 2. SACK reneging would require re-inserting SACKed skbs into the send queue. For now I believe it's not worth the complexity to make RACK work perfectly on SACK reneging, so we do nothing here. 3. Fast Open: currently for non-TFO, send-queue correctly queues the pure SYN packet. For TFO which queues a pure SYN and then a data packet, send-queue only queues the data packet but not the pure SYN due to the structure of TFO code. This is okay because the SYN receiver would never respond with a SACK on a missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK). In order to not grow sk_buff, we use an union for the new list and _skb_refdst/destructor fields. This is a bit complicated because we need to make sure _skb_refdst and destructor are properly zeroed before skb is cloned/copied at transmit, and before being freed. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-05tcp: uniform the set up of sockets after successful connectionWei Wang1-0/+1
Currently in the TCP code, the initialization sequence for cached metrics, congestion control, BPF, etc, after successful connection is very inconsistent. This introduces inconsistent bevhavior and is prone to bugs. The current call sequence is as follows: (1) for active case (tcp_finish_connect() case): tcp_mtup_init(sk); icsk->icsk_af_ops->rebuild_header(sk); tcp_init_metrics(sk); tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); tcp_init_congestion_control(sk); tcp_init_buffer_space(sk); (2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case): icsk->icsk_af_ops->rebuild_header(sk); tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); tcp_init_congestion_control(sk); tcp_mtup_init(sk); tcp_init_buffer_space(sk); tcp_init_metrics(sk); (3) for TFO passive case (tcp_fastopen_create_child()): inet_csk(child)->icsk_af_ops->rebuild_header(child); tcp_init_congestion_control(child); tcp_mtup_init(child); tcp_init_metrics(child); tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); tcp_init_buffer_space(child); This commit uniforms the above functions to have the following sequence: tcp_mtup_init(sk); icsk->icsk_af_ops->rebuild_header(sk); tcp_init_metrics(sk); tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB); tcp_init_congestion_control(sk); tcp_init_buffer_space(sk); This sequence is the same as the (1) active case. We pick this sequence because this order correctly allows BPF to override the settings including congestion control module and initial cwnd, etc from the route, and then allows the CC module to see those settings. Suggested-by: Neal Cardwell <[email protected]> Tested-by: Neal Cardwell <[email protected]> Signed-off-by: Wei Wang <[email protected]> Acked-by: Neal Cardwell <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Just simple overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2017-10-01ipv4: Namespaceify tcp_fastopen_key knobHaishuang Yan1-3/+3
Different namespace application might require different tcp_fastopen_key independently of the host. David Miller pointed out there is a leak without releasing the context of tcp_fastopen_key during netns teardown. So add the release action in exit_batch path. Tested: 1. Container namespace: # cat /proc/sys/net/ipv4/tcp_fastopen_key: 2817fff2-f803cf97-eadfd1f3-78c0992b cookie key in tcp syn packets: Fast Open Cookie Kind: TCP Fast Open Cookie (34) Length: 10 Fast Open Cookie: 1e5dd82a8c492ca9 2. Host: # cat /proc/sys/net/ipv4/tcp_fastopen_key: 107d7c5f-68eb2ac7-02fb06e6-ed341702 cookie key in tcp syn packets: Fast Open Cookie Kind: TCP Fast Open Cookie (34) Length: 10 Fast Open Cookie: e213c02bf0afbc8a Signed-off-by: Haishuang Yan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-01ipv4: Remove the 'publish' logic in tcp_fastopen_init_key_onceHaishuang Yan1-1/+1
The 'publish' logic is not necessary after commit dfea2aa65424 ("tcp: Do not call tcp_fastopen_reset_cipher from interrupt context"), because in tcp_fastopen_cookie_gen,it wouldn't call tcp_fastopen_init_key_once. Signed-off-by: Haishuang Yan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-01ipv4: Namespaceify tcp_fastopen knobHaishuang Yan1-1/+0
Different namespace application might require enable TCP Fast Open feature independently of the host. This patch series continues making more of the TCP Fast Open related sysctl knobs be per net-namespace. Reported-by: Luca BRUNO <[email protected]> Signed-off-by: Haishuang Yan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-01IPv4: early demux can return an error codePaolo Abeni1-1/+1
Currently no error is emitted, but this infrastructure will used by the next patch to allow source address validation for mcast sockets. Since early demux can do a route lookup and an ipv4 route lookup can return an error code this is consistent with the current ipv4 route infrastructure. Signed-off-by: Paolo Abeni <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+0
2017-09-19net: sk_buff rbnode reorgEric Dumazet1-6/+0
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp Current uses (TCP receive ofo queue and netem) need to save/restore tstamp, while skb->dev is either NULL (TCP) or a constant for a given queue (netem). Since we plan using an RB tree for TCP retransmit queue to speedup SACK processing with large BDP, this patch exchanges skb->dev and skb->tstamp. This saves some overhead in both TCP and netem. v2: removes the swtstamp field from struct tcp_skb_cb Signed-off-by: Eric Dumazet <[email protected]> Cc: Soheil Hassas Yeganeh <[email protected]> Cc: Wei Wang <[email protected]> Cc: Willem de Bruijn <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-09-18tcp: remove two unused functionsYuchung Cheng1-1/+0
remove tcp_may_send_now and tcp_snd_test that are no longer used Fixes: 840a3cbe8969 ("tcp: remove forward retransmit feature") Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>