From 3a91d29f20276fa7cd4d0c9c7f3e78b30708159d Mon Sep 17 00:00:00 2001
From: Koichiro Den <den@klaipeden.com>
Date: Sun, 22 Oct 2017 13:13:16 +0900
Subject: tcp: do tcp_mstamp_refresh before retransmits on TSQ handler

When retransmission on TSQ handler was introduced in the commit
f9616c35a0d7 ("tcp: implement TSQ for retransmits"), the retransmitted
skbs' timestamps were updated on the actual transmission. In the later
commit 385e20706fac ("tcp: use tp->tcp_mstamp in output path"), it stops
being done so. In the commit, the comment says "We try to refresh
tp->tcp_mstamp only when necessary", and at present tcp_tsq_handler and
tcp_v4_mtu_reduced applies to this. About the latter, it's okay since
it's rare enough.

About the former, even though possible retransmissions on the tasklet
comes just after the destructor run in NET_RX softirq handling, the time
between them could be nonnegligibly large to the extent that
tcp_rack_advance or rto rearming be affected if other (remaining) RX,
BLOCK and (preceding) TASKLET sofirq handlings are unexpectedly heavy.

So in the same way as tcp_write_timer_handler does, doing tcp_mstamp_refresh
ensures the accuracy of algorithms relying on it.

Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Koichiro Den <den@klaipeden.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

(limited to 'net/ipv4/tcp_output.c')

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0bc9e46a5369..973befc36fd4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -739,8 +739,10 @@ static void tcp_tsq_handler(struct sock *sk)
 		struct tcp_sock *tp = tcp_sk(sk);
 
 		if (tp->lost_out > tp->retrans_out &&
-		    tp->snd_cwnd > tcp_packets_in_flight(tp))
+		    tp->snd_cwnd > tcp_packets_in_flight(tp)) {
+			tcp_mstamp_refresh(tp);
 			tcp_xmit_retransmit_queue(sk);
+		}
 
 		tcp_write_xmit(sk, tcp_current_mss(sk), tp->nonagle,
 			       0, GFP_ATOMIC);
-- 
cgit 


From 5889e2c0e441d84060e66211ed5c4517ca591167 Mon Sep 17 00:00:00 2001
From: Yousuk Seung <ysseung@google.com>
Date: Tue, 24 Oct 2017 16:44:42 -0700
Subject: tcp: call tcp_rate_skb_sent() when retransmit with unaligned
 skb->data

Current implementation calls tcp_rate_skb_sent() when tcp_transmit_skb()
is called when it clones skb only. Not calling tcp_rate_skb_sent() is OK
for all such code paths except from __tcp_retransmit_skb() which happens
when skb->data address is not aligned. This may rarely happen e.g. when
small amount of data is sent initially and the receiver partially acks
odd number of bytes for some reason, possibly malicious.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

(limited to 'net/ipv4/tcp_output.c')

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 973befc36fd4..1151870018e3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2843,8 +2843,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
 		nskb = __pskb_copy(skb, MAX_TCP_HEADER, GFP_ATOMIC);
 		err = nskb ? tcp_transmit_skb(sk, nskb, 0, GFP_ATOMIC) :
 			     -ENOBUFS;
-		if (!err)
+		if (!err) {
 			skb->skb_mstamp = tp->tcp_mstamp;
+			tcp_rate_skb_sent(sk, skb);
+		}
 	} else {
 		err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
 	}
-- 
cgit 


From ee1836aec4f5a977c1699a311db4d9027ef21ac8 Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 26 Oct 2017 21:21:40 -0700
Subject: tcp: refresh tp timestamp before tcp_mtu_probe()

In the unlikely event tcp_mtu_probe() is sending a packet, we
want tp->tcp_mstamp being as accurate as possible.

This means we need to call tcp_mstamp_refresh() a bit earlier in
tcp_write_xmit().

Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'net/ipv4/tcp_output.c')

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1151870018e3..ae60dd3faed0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2239,6 +2239,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 	sent_pkts = 0;
 
+	tcp_mstamp_refresh(tp);
 	if (!push_one) {
 		/* Do MTU probing. */
 		result = tcp_mtu_probe(sk);
@@ -2250,7 +2251,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	}
 
 	max_segs = tcp_tso_segs(sk, mss_now);
-	tcp_mstamp_refresh(tp);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
-- 
cgit 


From 2b7cda9c35d3b940eb9ce74b30bbd5eb30db493d Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 30 Oct 2017 23:08:20 -0700
Subject: tcp: fix tcp_mtu_probe() vs highest_sack

Based on SNMP values provided by Roman, Yuchung made the observation
that some crashes in tcp_sacktag_walk() might be caused by MTU probing.

Looking at tcp_mtu_probe(), I found that when a new skb was placed
in front of the write queue, we were not updating tcp highest sack.

If one skb is freed because all its content was copied to the new skb
(for MTU probing), then tp->highest_sack could point to a now freed skb.

Bad things would then happen, including infinite loops.

This patch renames tcp_highest_sack_combine() and uses it
from tcp_mtu_probe() to fix the bug.

Note that I also removed one test against tp->sacked_out,
since we want to replace tp->highest_sack regardless of whatever
condition, since keeping a stale pointer to freed skb is a recipe
for disaster.

Fixes: a47e5a988a57 ("[TCP]: Convert highest_sack to sk_buff to allow direct access")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Roman Gushchin <guro@fb.com>
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/tcp.h     | 6 +++---
 net/ipv4/tcp_output.c | 3 ++-
 2 files changed, 5 insertions(+), 4 deletions(-)

(limited to 'net/ipv4/tcp_output.c')

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 33599d17522d..e6d0002a1b0b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1771,12 +1771,12 @@ static inline void tcp_highest_sack_reset(struct sock *sk)
 	tcp_sk(sk)->highest_sack = tcp_write_queue_head(sk);
 }
 
-/* Called when old skb is about to be deleted (to be combined with new skb) */
-static inline void tcp_highest_sack_combine(struct sock *sk,
+/* Called when old skb is about to be deleted and replaced by new skb */
+static inline void tcp_highest_sack_replace(struct sock *sk,
 					    struct sk_buff *old,
 					    struct sk_buff *new)
 {
-	if (tcp_sk(sk)->sacked_out && (old == tcp_sk(sk)->highest_sack))
+	if (old == tcp_highest_sack(sk))
 		tcp_sk(sk)->highest_sack = new;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index ae60dd3faed0..823003eef3a2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2062,6 +2062,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	nskb->ip_summed = skb->ip_summed;
 
 	tcp_insert_write_queue_before(nskb, skb, sk);
+	tcp_highest_sack_replace(sk, skb, nskb);
 
 	len = 0;
 	tcp_for_write_queue_from_safe(skb, next, sk) {
@@ -2665,7 +2666,7 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 		else if (!skb_shift(skb, next_skb, next_skb_size))
 			return false;
 	}
-	tcp_highest_sack_combine(sk, next_skb, skb);
+	tcp_highest_sack_replace(sk, next_skb, skb);
 
 	tcp_unlink_write_queue(next_skb, sk);
 
-- 
cgit 


From 3b11775033dc87c3d161996c54507b15ba26414a Mon Sep 17 00:00:00 2001
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 2 Nov 2017 12:30:25 -0700
Subject: tcp: do not mangle skb->cb[] in tcp_make_synack()

Christoph Paasch sent a patch to address the following issue :

tcp_make_synack() is leaving some TCP private info in skb->cb[],
then send the packet by other means than tcp_transmit_skb()

tcp_transmit_skb() makes sure to clear skb->cb[] to not confuse
IPv4/IPV6 stacks, but we have no such cleanup for SYNACK.

tcp_make_synack() should not use tcp_init_nondata_skb() :

tcp_init_nondata_skb() really should be limited to skbs put in write/rtx
queues (the ones that are only sent via tcp_transmit_skb())

This patch fixes the issue and should even save few cpu cycles ;)

Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/tcp_output.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

(limited to 'net/ipv4/tcp_output.c')

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 823003eef3a2..478909f4694d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3180,13 +3180,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	th->source = htons(ireq->ir_num);
 	th->dest = ireq->ir_rmt_port;
 	skb->mark = ireq->ir_mark;
-	/* Setting of flags are superfluous here for callers (and ECE is
-	 * not even correctly set)
-	 */
-	tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
-			     TCPHDR_SYN | TCPHDR_ACK);
-
-	th->seq = htonl(TCP_SKB_CB(skb)->seq);
+	skb->ip_summed = CHECKSUM_PARTIAL;
+	th->seq = htonl(tcp_rsk(req)->snt_isn);
 	/* XXX data is queued and acked as is. No buffer/window check */
 	th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
 
-- 
cgit