aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4/tcp_input.c
AgeCommit message (Collapse)AuthorFilesLines
2007-04-25[TCP] FRTO: frto_counter modulo-op converted to two assignmentsIlpo Järvinen1-2/+2
Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP]: Don't enter to fast recovery while using FRTOIlpo Järvinen1-0/+4
Because TCP is not in Loss state during FRTO recovery, fast recovery could be triggered by accident. Non-SACK FRTO is more robust than not yet included SACK-enhanced version (that can receiver high number of duplicate ACKs with SACK blocks during FRTO), at least with unidirectional transfers, but under extraordinary patterns fast recovery can be incorrectly triggered, e.g., Data loss+ACK losses => cumulative ACK with enough SACK blocks to meet sacked_out >= dupthresh condition). Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Response should reset also snd_cwnd_cntIlpo Järvinen1-0/+1
Since purpose is to reduce CWND, we prevent immediate growth. This is not a major issue nor is "the correct way" specified anywhere. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: fixes fallback to conventional recoveryIlpo Järvinen1-5/+9
The FRTO detection did not care how ACK pattern affects to cwnd calculation of the conventional recovery. This caused incorrect setting of cwnd when the fallback becames necessary. The knowledge tcp_process_frto() has about the incoming ACK is now passed on to tcp_enter_frto_loss() in allowed_segments parameter that gives the number of segments that must be added to packets-in-flight while calculating the new cwnd. Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK detection because RFC4138 states (in Section 2.2): If the first acknowledgment after the RTO retransmission does not acknowledge all of the data that was retransmitted in step 1, the TCP sender reverts to the conventional RTO recovery. Otherwise, a malicious receiver acknowledging partial segments could cause the sender to declare the timeout spurious in a case where data was lost. If the next ACK after RTO is duplicate, we do not retransmit anything, which is equal to what conservative conventional recovery does in such case. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Ignore some uninteresting ACKsIlpo Järvinen1-3/+10
Handles RFC4138 shortcoming (in step 2); it should also have case c) which ignores ACKs that are not duplicates nor advance window (opposite dir data, winupdate). Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Use Disorder state during operation instead of OpenIlpo Järvinen1-3/+3
Retransmission counter assumptions are to be changed. Forcing reason to do this exist: Using sysctl in check would be racy as soon as FRTO starts to ignore some ACKs (doing that in the following patches). Userspace may disable it at any moment giving nice oops if timing is right. frto_counter would be inaccessible from userspace, but with SACK enhanced FRTO retrans_out can include other than head, and possibly leaving it non-zero after spurious RTO, boom again. Luckily, solution seems rather simple: never go directly to Open state but use Disorder instead. This does not really change much, since TCP could anyway change its state to Disorder during FRTO using path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when a SACK block makes ACK dubious). Besides, Disorder seems to be the state where TCP should be if not recovering (in Recovery or Loss state) while having some retransmissions in-flight (see tcp_try_to_open), which is exactly what happens with FRTO. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthreshIlpo Järvinen1-6/+14
In case a latency spike causes more than one RTO, the later should not cause the already reduced ssthresh to propagate into the prior_ssthresh since FRTO declares all such RTOs spurious at once or none of them. In treating of ssthresh, we mimic what tcp_enter_loss() does. The previous state (in frto_counter) must be available until we have checked it in tcp_enter_frto(), and also ACK information flag in process_frto(). Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Comment cleanup & improvementIlpo Järvinen1-17/+32
Moved comments out from the body of process_frto() to the head (preferred way; see Documentation/CodingStyle). Bonus: it's much easier to read in this compacted form. FRTO algorithm and implementation is described in greater detail. For interested reader, more information is available in RFC4138. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.cIlpo Järvinen1-0/+13
In addition, removed inline. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Separated response from FRTO detection algorithmIlpo Järvinen1-6/+10
FRTO spurious RTO detection algorithm (RFC4138) does not include response to a detected spurious RTO but can use different response algorithms. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-04-25[TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bitIlpo Järvinen1-1/+1
FRTO was slightly too brave... Should only clear TCPCB_SACKED_RETRANS bit. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-02-10[NET] IPV4: Fix whitespace errors.YOSHIFUJI Hideaki1-58/+58
Signed-off-by: YOSHIFUJI Hideaki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-02-08[TCP]: Check num sacks in SACK fast pathBaruch Even1-0/+5
We clear the unused parts of the SACK cache, This prevents us from mistakenly taking the cache data if the old data in the SACK cache is the same as the data in the SACK block. This assumes that we never receive an empty SACK block with start and end both at zero. Signed-off-by: Baruch Even <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-02-08[TCP]: Seperate DSACK from SACK fast pathBaruch Even1-35/+31
Move DSACK code outside the SACK fast-path checking code. If the DSACK determined that the information was too old we stayed with a partial cache copied. Most likely this matters very little since the next packet will not be DSACK and we will find it in the cache. but it's still not good form and there is little reason to couple the two checks. Since the SACK receive cache doesn't need the data to be in host order we also remove the ntohl in the checking loop. Signed-off-by: Baruch Even <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-02-08[TCP]: Advance fast path pointer for first block onlyBaruch Even1-10/+24
Only advance the SACK fast-path pointer for the first block, the fast-path assumes that only the first block advances next time so we should not move the cached skb for the next sack blocks. Signed-off-by: Baruch Even <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-01-25[TCP]: Fix sorting of SACK blocks.Baruch Even1-4/+5
The sorting of SACK blocks actually munges them rather than sort, causing the TCP stack to ignore some SACK information and breaking the assumption of ordered SACK blocks after sorting. The sort takes the data from a second buffer which isn't moved causing subsequent data moves to occur from the wrong location. The fix is to use a temporary buffer as a normal sort does. Signed-off-By: Baruch Even <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2007-01-23[TCP]: skb is unexpectedly freed.Masayuki Nakagawa1-2/+4
I encountered a kernel panic with my test program, which is a very simple IPv6 client-server program. The server side sets IPV6_RECVPKTINFO on a listening socket, and the client side just sends a message to the server. Then the kernel panic occurs on the server. (If you need the test program, please let me know. I can provide it.) This problem happens because a skb is forcibly freed in tcp_rcv_state_process(). When a socket in listening state(TCP_LISTEN) receives a syn packet, then tcp_v6_conn_request() will be called from tcp_rcv_state_process(). If the tcp_v6_conn_request() successfully returns, the skb would be discarded by __kfree_skb(). However, in case of a listening socket which was already set IPV6_RECVPKTINFO, an address of the skb will be stored in treq->pktopts and a ref count of the skb will be incremented in tcp_v6_conn_request(). But, even if the skb is still in use, the skb will be freed. Then someone still using the freed skb will cause the kernel panic. I suggest to use kfree_skb() instead of __kfree_skb(). Signed-off-by: Masayuki Nakagawa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-12-07[NET]: Memory barrier cleanupsRalf Baechle1-2/+2
I believe all the below memory barriers only matter on SMP so therefore the smp_* variant of the barrier should be used. I'm wondering if the barrier in net/ipv4/inet_timewait_sock.c should be dropped entirely. schedule_work's implementation currently implies a memory barrier and I think sane semantics of schedule_work() should imply a memory barrier, as needed so the caller shouldn't have to worry. It's not quite obvious why the barrier in net/packet/af_packet.c is needed; maybe it should be implied through flush_dcache_page? Signed-off-by: Ralf Baechle <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-12-02[NET]: Annotate __skb_checksum_complete() and friends.Al Viro1-2/+2
Signed-off-by: Al Viro <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-12-02[TCP]: MD5 Signature Option (RFC2385) support.YOSHIFUJI Hideaki1-0/+8
Based on implementation by Rick Payne. Signed-off-by: YOSHIFUJI Hideaki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-12-02SELinux: Return correct context for SO_PEERSECVenkat Yekkirala1-0/+2
Fix SO_PEERSEC for tcp sockets to return the security context of the peer (as represented by the SA from the peer) as opposed to the SA used by the local/source socket. Signed-off-by: Venkat Yekkirala <[email protected]> Signed-off-by: James Morris <[email protected]>
2006-10-04[TCP]: Kill warning in tcp_clean_rtx_queue().David S. Miller1-1/+1
GCC can't tell we always initialize 'tv' in all the cases we actually use it, so explicitly set it up with zeros. Signed-off-by: David S. Miller <[email protected]>
2006-09-28[TCP]: Fix and simplify microsecond rtt samplingJohn Heffner1-8/+8
This changes the microsecond RTT sampling so that samples are taken in the same way that RTT samples are taken for the RTO calculator: on the last segment acknowledged, and only when the segment hasn't been retransmitted. Signed-off-by: John Heffner <[email protected]> Acked-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-28[TCP] net/ipv4/tcp_input.c: trivial annotationsAl Viro1-7/+7
Signed-off-by: Al Viro <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-28[TCP]: struct tcp_sack_block annotationsAl Viro1-1/+1
Some of the instances of tcp_sack_block are host-endian, some - net-endian. Define struct tcp_sack_block_wire identical to struct tcp_sack_block with u32 replaced with __be32; annotate uses of tcp_sack_block replacing net-endian ones with tcp_sack_block_wire. Change is obviously safe since for cc(1) __be32 is typedefed to u32. Signed-off-by: Al Viro <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-22[TCP]: Send ACKs each 2nd received segment.Alexey Kuznetsov1-0/+2
It does not affect either mss-sized connections (obviously) or connections controlled by Nagle (because there is only one small segment in flight). The idea is to record the fact that a small segment arrives on a connection, where one small segment has already been received and still not-ACKed. In this case ACK is forced after tcp_recvmsg() drains receive buffer. In other words, it is a "soft" each-2nd-segment ACK, which is enough to preserve ACK clock even when ABC is enabled. Signed-off-by: Alexey Kuznetsov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-22[TCP]: Fix rcv mss estimate for LROHerbert Xu1-1/+1
By passing a Linux-generated TSO packet straight back into Linux, Xen becomes our first LRO user :) Unfortunately, there is at least one spot in our stack that needs to be changed to cope with this. The receive MSS estimate is computed from the raw packet size. This is broken if the packet is GSO/LRO. Fortunately the real MSS can be found in gso_size so we simply need to use that if it is non-zero. Real LRO NICs should of course set the gso_size field in future. Signed-off-by: Herbert Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-22[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostlyBrian Haley1-18/+18
Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly. Couldn't actually measure any performance increase while testing (.3% I consider noise), but seems like the right thing to do. Signed-off-by: Brian Haley <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-09-17[TCP]: Turn ABC off.Stephen Hemminger1-1/+1
Turn Appropriate Byte Count off by default because it unfairly penalizes applications that do small writes. Add better documentation to describe what it is so users will understand why they might want to turn it on. Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-08-29[TCP]: Two RFC3465 Appropriate Byte Count fixes.Daikichi Osuga1-2/+7
1) fix slow start after retransmit timeout 2) fix case of L=2*SMSS acked bytes comparison Signed-off-by: Daikichi Osuga <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-08-04[TCP]: Fixes IW > 2 cases when TCP is application limitedIlpo Järvinen1-1/+2
Whenever a transfer is application limited, we are allowed at least initial window worth of data per window unless cwnd is previously less than that. Signed-off-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel1-1/+0
Signed-off-by: Jörn Engel <[email protected]> Signed-off-by: Adrian Bunk <[email protected]>
2006-06-29[NET]: Add ECN support for TSOMichael Chan1-4/+0
In the current TSO implementation, NETIF_F_TSO and ECN cannot be turned on together in a TCP connection. The problem is that most hardware that supports TSO does not handle CWR correctly if it is set in the TSO packet. Correct handling requires CWR to be set in the first packet only if it is set in the TSO header. This patch adds the ability to turn on NETIF_F_TSO and ECN using GSO if necessary to handle TSO packets with CWR set. Hardware that handles CWR correctly can turn on NETIF_F_TSO_ECN in the dev-> features flag. All TSO packets with CWR set will have the SKB_GSO_TCPV4_ECN set. If the output device does not have the NETIF_F_TSO_ECN feature set, GSO will split the packet up correctly with CWR only set in the first segment. With help from Herbert Xu <[email protected]>. Since ECN can always be enabled with TSO, the SOCK_NO_LARGESEND sock flag is completely removed. Signed-off-by: Michael Chan <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-06-23[NET]: Merge TSO/UFO fields in sk_buffHerbert Xu1-1/+1
Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not going to scale if we add any more segmentation methods (e.g., DCCP). So let's merge them. They were used to tell the protocol of a packet. This function has been subsumed by the new gso_type field. This is essentially a set of netdev feature bits (shifted by 16 bits) that are required to process a specific skb. As such it's easy to tell whether a given device can process a GSO skb: you just have to and the gso_type field and the netdev's features field. I've made gso_type a conjunction. The idea is that you have a base type (e.g., SKB_GSO_TCPV4) that can be modified further to support new features. For example, if we add a hardware TSO type that supports ECN, they would declare NETIF_F_TSO | NETIF_F_TSO_ECN. All TSO packets with CWR set would have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO packets would be SKB_GSO_TCPV4. This means that only the CWR packets need to be emulated in software. Signed-off-by: Herbert Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-06-17[TCP]: Minimum congestion window consolidation.Stephen Hemminger1-2/+11
Many of the TCP congestion methods all just use ssthresh as the minimum congestion window on decrease. Rather than duplicating the code, just have that be the default if that handle in the ops structure is not set. Minor behaviour change to TCP compound. It probably wants to use this (ssthresh) as lower bound, rather than ssthresh/2 because the latter causes undershoot on loss. Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-06-17[TCP]: tcp_rcv_rtt_measure_ts() call in pure-ACK path is superfluousDavid S. Miller1-2/+0
We only want to take receive RTT mesaurements for data bearing frames, here in the header prediction fast path for a pure-sender, we know that we have a pure-ACK and thus the checks in tcp_rcv_rtt_mesaure_ts() will not pass. Signed-off-by: David S. Miller <[email protected]>
2006-06-17[I/OAT]: TCP recv offload to I/OATChris Leech1-7/+67
Locks down user pages and sets up for DMA in tcp_recvmsg, then calls dma_async_try_early_copy in tcp_v4_do_rcv Signed-off-by: Chris Leech <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-06-11[TCP]: continued: reno sacked_out count fixAki M Nyrhinen1-3/+1
From: Aki M Nyrhinen <[email protected]> IMHO the current fix to the problem (in_flight underflow in reno) is incorrect. it treats the symptons but ignores the problem. the problem is timing out packets other than the head packet when we don't have sack. i try to explain (sorry if explaining the obvious). with sack, scanning the retransmit queue for timed out packets is fine because we know which packets in our retransmit queue have been acked by the receiver. without sack, we know only how many packets in our retransmit queue the receiver has acknowledged, but no idea which packets. think of a "typical" slow-start overshoot case, where for example every third packet in a window get lost because a router buffer gets full. with sack, we check for timeouts on those every third packet (as the rest have been sacked). the packet counting works out and if there is no reordering, we'll retransmit exactly the packets that were lost. without sack, however, we check for timeout on every packet and end up retransmitting consecutive packets in the retransmit queue. in our slow-start example, 2/3 of those retransmissions are unnecessary. these unnecessary retransmissions eat the congestion window and evetually prevent fast recovery from continuing, if enough packets were lost. Signed-off-by: David S. Miller <[email protected]>
2006-05-16[TCP]: reno sacked_out count fixAngelo P. Castellani1-0/+2
From: "Angelo P. Castellani" <[email protected]> Using NewReno, if a sk_buff is timed out and is accounted as lost_out, it should also be removed from the sacked_out. This is necessary because recovery using NewReno fast retransmit could take up to a lot RTTs and the sk_buff RTO can expire without actually being really lost. left_out = sacked_out + lost_out in_flight = packets_out - left_out + retrans_out Using NewReno without this patch, on very large network losses, left_out becames bigger than packets_out + retrans_out (!!). For this reason unsigned integer in_flight overflows to 2^32 - something. Signed-off-by: David S. Miller <[email protected]>
2006-04-14[IPV4]: Possible cleanups.Adrian Bunk1-1/+0
This patch contains the following possible cleanups: - make the following needlessly global function static: - arp.c: arp_rcv() - remove the following unused EXPORT_SYMBOL's: - devinet.c: devinet_ioctl - fib_frontend.c: ip_rt_ioctl - inet_hashtables.c: inet_bind_bucket_create - inet_hashtables.c: inet_bind_hash - tcp_input.c: sysctl_tcp_abc - tcp_ipv4.c: sysctl_tcp_tw_reuse - tcp_output.c: sysctl_tcp_mtu_probing - tcp_output.c: sysctl_tcp_base_mss Signed-off-by: Adrian Bunk <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-03-20[TCP] mtu probing: move tcp-specific data out of inet_connection_sockJohn Heffner1-2/+2
This moves some TCP-specific MTU probing state out of inet_connection_sock back to tcp_sock. Signed-off-by: John Heffner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-03-20[TCP]: MTU probingJohn Heffner1-0/+49
Implementation of packetization layer path mtu discovery for TCP, based on the internet-draft currently found at <http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-05.txt>. Signed-off-by: John Heffner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-02-09[TCP]: rcvbuf lock when tcp_moderate_rcvbuf enabledJohn Heffner1-1/+2
The rcvbuf lock should probably be honored here. Signed-off-by: John Heffner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-01-09[NET]: Change some "if (x) BUG();" to "BUG_ON(x);"Kris Katterjohn1-1/+1
This changes some simple "if (x) BUG();" statements to "BUG_ON(x);" Signed-off-by: Kris Katterjohn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-01-03[TCP]: less inline'sStephen Hemminger1-21/+61
TCP inline usage cleanup: * get rid of inline in several places * replace __inline__ with inline where possible * move functions used in one file out of tcp.h * let compiler decide on used once cases On x86_64: text data bss dec hex filename 3594701 648348 567400 4810449 4966d1 vmlinux.orig 3593133 648580 567400 4809113 496199 vmlinux On sparc64: text data bss dec hex filename 2538278 406152 530392 3474822 350586 vmlinux.ORIG 2536382 406384 530392 3473158 34ff06 vmlinux Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-01-03[IP_SOCKGLUE]: Remove most of the tcp specific callsArnaldo Carvalho de Melo1-6/+4
As DCCP needs to be called in the same spots. Now we have a member in inet_sock (is_icsk), set at sock creation time from struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and DCCP) to see if a struct sock instance is a inet_connection_sock for places like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if sk_type was SOCK_STREAM, that is insufficient because we now use the same code for DCCP, that has sk_type SOCK_DCCP. Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2006-01-03[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_opsArnaldo Carvalho de Melo1-5/+6
And move it to struct inet_connection_sock. DCCP will use it in the upcoming changesets. Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2005-11-15[TCP]: More spelling fixes.Stephen Hemminger1-2/+2
From Joe Perches Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2005-11-10[TCP]: speed up SACK processingStephen Hemminger1-15/+129
Use "hints" to speed up the SACK processing. Various forms of this have been used by TCP developers (Web100, STCP, BIC) to avoid the 2x linear search of outstanding segments. Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2005-11-10[TCP]: spelling fixesStephen Hemminger1-20/+20
Minor spelling fixes for TCP code. Signed-off-by: Stephen Hemminger <[email protected]> Signed-off-by: David S. Miller <[email protected]>