path: root/net/ipv4
Age | Commit message | Author | Files | Lines
2011-03-04 | ipv4: Remove flowi from struct rtable. | David S. Miller | 4 | -83/+131
The only necessary parts are the src/dst addresses, the interface indexes, the TOS, and the mark. The rest is unnecessary bloat, which amounts to nearly 50 bytes on 64-bit. Signed-off-by: David S. Miller <[email protected]>
2011-03-04 | ipv4: Set rt->rt_iif more sanely on output routes. | David S. Miller | 1 | -1/+1
rt->rt_iif is only ever inspected on input routes, for example DCCP uses this to populate a route lookup flow key when generating replies to another packet. Therefore, setting it to anything other than zero on output routes makes no sense. Signed-off-by: David S. Miller <[email protected]>
2011-03-04 | ipv4: Get peer more cheaply in rt_init_metrics(). | David S. Miller | 1 | -2/+2
We know this is a new route object, so doing atomics and stuff makes no sense at all. Signed-off-by: David S. Miller <[email protected]>
2011-03-04 | ipv4: Optimize flow initialization in output route lookup. | David S. Miller | 1 | -8/+10
We burn a lot of useless cycles, cpu store buffer traffic, and memory operations memset()'ing the on-stack flow used to perform output route lookups in __ip_route_output_key().

Only the first half of the flow object members even matter for output route lookups in this context, specifically:

FIB rules matching cares about: dst, src, tos, iif, oif, mark
FIB trie lookup cares about: dst
FIB semantic match cares about: tos, scope, oif

Therefore only initialize these specific members and elide the memset entirely. On Niagara2 this kills about 300 cycles from the output route lookup path.

Likely, we can take things further, since all callers of output route lookups essentially throw away the on-stack flow they use. So they don't care if we use it as a scratch-pad to compute the final flow key.

Signed-off-by: David S. Miller <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
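Illustration (not part of the commit): a minimal user-space sketch of the idea described above, assuming a made-up `struct flow_key` whose layout, field names, and helper functions are hypothetical and not the kernel's `struct flowi`. It contrasts clearing the whole object with storing only the members the lookup reads.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified flow key. Only the members the commit message
 * lists (dst, src, tos, scope, iif, oif, mark) are modelled individually. */
struct flow_key {
    uint32_t dst;
    uint32_t src;
    uint32_t mark;
    int      oif;
    int      iif;
    uint8_t  tos;
    uint8_t  scope;
    uint8_t  rest[64];   /* stand-in for members the lookup never reads */
};

/* Baseline: zero the whole object, then fill in the lookup key. */
static void flow_init_memset(struct flow_key *fl, uint32_t dst, uint32_t src,
                             uint8_t tos, uint8_t scope, int iif, int oif,
                             uint32_t mark)
{
    memset(fl, 0, sizeof(*fl));
    fl->dst = dst;
    fl->src = src;
    fl->tos = tos;
    fl->scope = scope;
    fl->iif = iif;
    fl->oif = oif;
    fl->mark = mark;
}

/* Optimized: store only the members the lookup reads. "rest" is left
 * untouched, so no stores are spent on bytes nobody will look at. */
static void flow_init_minimal(struct flow_key *fl, uint32_t dst, uint32_t src,
                              uint8_t tos, uint8_t scope, int iif, int oif,
                              uint32_t mark)
{
    fl->dst = dst;
    fl->src = src;
    fl->tos = tos;
    fl->scope = scope;
    fl->iif = iif;
    fl->oif = oif;
    fl->mark = mark;
}

/* Compare only the members that take part in the lookup. */
static int flow_lookup_key_equal(const struct flow_key *a,
                                 const struct flow_key *b)
{
    return a->dst == b->dst && a->src == b->src && a->tos == b->tos &&
           a->scope == b->scope && a->iif == b->iif && a->oif == b->oif &&
           a->mark == b->mark;
}
```

Both initializers produce the same lookup key; the second is only safe as long as no consumer reads the members that were skipped, which is the condition the commit message argues for.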
2011-03-04 | inetpeer: seqlock optimization | Eric Dumazet | 1 | -40/+35
David noticed:

------------------
Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0.

If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us.

This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2.

I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles.

This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing.
-----------------

One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base.

After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redoing the lookup is only necessary in this case.

Note: Add one private rcu_deref_locked() macro to place in one spot the access to spinlock included in seqlock.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
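Illustration (not part of the commit): a single-threaded sketch of the sequence-counter check described above, written with C11 atomics. All names (`struct peer_base`, `lookup`, `racing_insert_42`) are hypothetical, a flat array stands in for the AVL tree, and the `interleave` callback merely simulates a writer running during the lookup. It is not a thread-safe seqlock implementation and does not use the kernel's seqlock API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8

struct peer_base {
    atomic_uint seq;             /* bumped twice per write: odd while writing */
    int         keys[MAX_PEERS]; /* flat array standing in for the AVL tree */
    size_t      nkeys;
    unsigned    locked_lookups;  /* how often the slow path ran (demo only) */
};

static bool find_key(const struct peer_base *b, int key)
{
    for (size_t i = 0; i < b->nkeys; i++)
        if (b->keys[i] == key)
            return true;
    return false;
}

/* Writer side: the sequence is odd for the duration of the update. */
static bool writer_insert(struct peer_base *b, int key)
{
    if (b->nkeys == MAX_PEERS)
        return false;
    atomic_fetch_add(&b->seq, 1);
    b->keys[b->nkeys++] = key;
    atomic_fetch_add(&b->seq, 1);
    return true;
}

/* Stand-in for a concurrent writer that inserts key 42 mid-lookup. */
static void racing_insert_42(struct peer_base *b)
{
    writer_insert(b, 42);
}

/* Reader side: a miss is trusted only if no writer ran during the lookup.
 * "interleave" (may be NULL) simulates writer activity between the lookup
 * and the sequence re-check. */
static bool lookup(struct peer_base *b, int key,
                   void (*interleave)(struct peer_base *))
{
    unsigned start = atomic_load(&b->seq);
    bool found = find_key(b, key);

    if (interleave)
        interleave(b);
    if (found)
        return true;
    if (!(start & 1u) && atomic_load(&b->seq) == start)
        return false;            /* stable miss: no lock needed */

    b->locked_lookups++;         /* a real implementation takes the lock here */
    return find_key(b, key);
}
```

A miss with an unchanged, even sequence number returns immediately; only a miss that overlapped a write pays for the second, locked lookup.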
2011-03-03 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller | 2 | -3/+4
Conflicts: drivers/net/bnx2x/bnx2x.h
2011-03-03 | ipv4: Fix __ip_dev_find() to use ifa_local instead of ifa_address. | David S. Miller | 1 | -2/+2
Reported-by: Stephen Hemminger <[email protected]> Reported-by: Julian Anastasov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-03-03 | ipv4: Fix crash in dst_release when udp_sendmsg route lookup fails. | David S. Miller | 1 | -0/+1
As reported by Eric:

[11483.697233] IP: [<c12b0638>] dst_release+0x18/0x60
...
[11483.697741] Call Trace:
[11483.697764] [<c12fc9d2>] udp_sendmsg+0x282/0x6e0
[11483.697790] [<c12a1c01>] ? memcpy_toiovec+0x51/0x70
[11483.697818] [<c12dbd90>] ? ip_generic_getfrag+0x0/0xb0

The pointer passed to dst_release() is -EINVAL, that's because we leave an error pointer in the local variable "rt" by accident.

NULL it out to fix the bug.

Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
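Illustration (not part of the commit): a user-space sketch of the bug class described above. `ERR_PTR`, `PTR_ERR`, and `IS_ERR` are re-created here in the style of the kernel's error-pointer helpers; `struct route`, `route_lookup`, `route_release`, and `send_datagram` are hypothetical stand-ins, not the actual udp_sendmsg() code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space re-creation of the error-pointer convention: small negative
 * errno values are encoded in the top of the pointer range. */
static inline void *ERR_PTR(intptr_t err) { return (void *)err; }
static inline intptr_t PTR_ERR(const void *p) { return (intptr_t)p; }
static inline int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct route { int refcnt; };

static struct route the_route = { .refcnt = 0 };

/* Hypothetical lookup: returns a referenced route or an encoded errno. */
static struct route *route_lookup(int fail)
{
    if (fail)
        return ERR_PTR(-EINVAL);
    the_route.refcnt++;
    return &the_route;
}

/* NULL is a no-op; any other value is dereferenced, so an error pointer
 * passed in here would fault. */
static void route_release(struct route *rt)
{
    if (rt)
        rt->refcnt--;
}

static int send_datagram(int fail_lookup)
{
    struct route *rt = route_lookup(fail_lookup);
    int err = 0;

    if (IS_ERR(rt)) {
        err = (int)PTR_ERR(rt);
        rt = NULL;   /* the fix: do not carry the error pointer to "out" */
        goto out;
    }
    /* ... build and transmit the packet ... */
out:
    route_release(rt);
    return err;
}
```

Without the `rt = NULL` assignment, the shared `out` path would hand the encoded -EINVAL to the release function, which only guards against NULL.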
2011-03-02 | ipv4: ip_route_output_key() is better as an inline. | David S. Miller | 1 | -6/+0
This avoids a stack frame at zero cost. Signed-off-by: David S. Miller <[email protected]>
2011-03-02 | ipv4: Make output route lookup return rtable directly. | David S. Miller | 17 | -138/+160
Instead of on the stack. Signed-off-by: David S. Miller <[email protected]>
2011-03-02 | xfrm: Return dst directly from xfrm_lookup() | David S. Miller | 3 | -25/+24
Instead of on the stack. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | inet: Replace left-over references to inet->cork | Herbert Xu | 1 | -2/+2
The patch to replace inet->cork with cork left out two spots in __ip_append_data that can result in bogus packet construction. Signed-off-by: Herbert Xu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | ipv4: Make icmp route lookup code a bit clearer. | David S. Miller | 1 | -79/+96
The route lookup code in icmp_send() is slightly tricky as a result of having to handle all of the requirements of RFC 4301 host relookups. Pull the route resolution into a separate function, so that the error handling and route reference counting is hopefully easier to see and contained wholly within this new routine. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | xfrm: Handle blackhole route creation via afinfo. | David S. Miller | 2 | -13/+8
That way we don't have to potentially do this in every xfrm_lookup() caller. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | xfrm: Kill XFRM_LOOKUP_WAIT flag. | David S. Miller | 1 | -3/+1
This can be determined from the flow flags instead. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | ipv4: Kill can_sleep arg to ip_route_output_flow() | David S. Miller | 6 | -8/+9
This boolean state is now available in the flow flags. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | net: Add FLOWI_FLAG_CAN_SLEEP. | David S. Miller | 2 | -3/+6
And set it in contexts where the route resolution can sleep. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | ipv4: Make final arg to ip_route_output_flow to be boolean "can_sleep" | David S. Miller | 6 | -8/+8
Since that is what the current vague "flags" argument means. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | ipv4: Can final ip_route_connect() arg to boolean "can_sleep". | David S. Miller | 3 | -3/+3
Since that's what the current vague "flags" thing means. Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | udp: Add lockless transmit path | Herbert Xu | 1 | -1/+14
The UDP transmit path has been running under the socket lock for a long time because of the corking feature. This means that transmitting to the same socket in multiple threads does not scale at all. However, as most users don't actually use corking, the locking can be removed in the common case. This patch creates a lockless fast path where corking is not used. Please note that this does create a slight inaccuracy in the enforcement of socket send buffer limits. In particular, we may exceed the socket limit by up to (number of CPUs) * (packet size) because of the way the limit is computed. As the primary purpose of socket buffers is to indicate congestion, this should not be a great problem for now. Signed-off-by: Herbert Xu <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | udp: Switch to ip_finish_skb | Herbert Xu | 1 | -33/+50
This patch converts UDP to use the new ip_finish_skb API. This then allows us to more easily use ip_make_skb, which allows UDP to run without a socket lock. Signed-off-by: Herbert Xu <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | inet: Add ip_make_skb and ip_finish_skb | Herbert Xu | 1 | -14/+51
This patch adds the helper ip_make_skb which is like ip_append_data and ip_push_pending_frames all rolled into one, except that it does not send the skb produced. The sending part is carried out by ip_send_skb, which the transport protocol can call after it has tweaked the skb.

It is meant to be called in cases where corking is not used, and should have a one-to-one correspondence to sendmsg.

This patch also adds the helper ip_finish_skb which is meant to replace ip_push_pending_frames when corking is required. Previously the protocol stack would peek at the socket write queue and add its header to the first packet. With ip_finish_skb, the protocol stack can directly operate on the final skb instead, just like the non-corking case with ip_make_skb.

Signed-off-by: Herbert Xu <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | inet: Remove explicit write references to sk/inet in ip_append_data | Herbert Xu | 1 | -98/+140
In order to allow simultaneous calls to ip_append_data on the same socket, it must not modify any shared state in sk or inet (other than those that are designed to allow that such as atomic counters). This patch abstracts out write references to sk and inet_sk in ip_append_data and its friends so that we may use the underlying code in parallel. Signed-off-by: Herbert Xu <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-03-01 | inet: Remove unused sk_sndmsg_* from UFO | Herbert Xu | 1 | -1/+0
UFO doesn't really use the sk_sndmsg_* parameters so touching them is pointless. It can't use them anyway since the whole point of UFO is to use the original pages without copying. Signed-off-by: Herbert Xu <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-24 | ipv4: Rearrange how ip_route_newports() gets port keys. | David S. Miller | 1 | -1/+5
ip_route_newports() is the only place in the entire kernel that cares about the port members in the routing cache entry's lookup flow key.

Therefore the only reason we store an entire flow inside of the struct rtentry is for this one special case.

Rewrite ip_route_newports() such that:

1) The caller passes in the original port values, so we don't need to use the rth->fl.fl_ip_{s,d}port values to remember them.

2) The lookup flow is constructed by hand instead of being copied from the routing cache entry's flow.

Signed-off-by: David S. Miller <[email protected]>
2011-02-23 | xfrm: Const'ify address arguments to ->dst_lookup() | David S. Miller | 1 | -2/+2
Signed-off-by: David S. Miller <[email protected]>
2011-02-23 | xfrm: Const'ify tmpl and address arguments to ->init_temprop() | David S. Miller | 1 | -2/+2
Signed-off-by: David S. Miller <[email protected]>
2011-02-22 | xfrm: Mark flowi arg to ->init_tempsel() const. | David S. Miller | 1 | -1/+1
Signed-off-by: David S. Miller <[email protected]>
2011-02-22 | xfrm: Mark flowi arg to ->fill_dst() const. | David S. Miller | 1 | -1/+1
Signed-off-by: David S. Miller <[email protected]>
2011-02-22 | xfrm: Mark flowi arg to ->get_tos() const. | David S. Miller | 1 | -1/+1
Signed-off-by: David S. Miller <[email protected]>
2011-02-21 | tcp: undo_retrans counter fixes | Yuchung Cheng | 2 | -3/+4
Fix a bug that undo_retrans is incorrectly decremented when undo_marker is not set or undo_retrans is already 0. This happens when sender receives more DSACK ACKs than packets retransmitted during the current undo phase. This may also happen when sender receives DSACK after the undo operation is completed or cancelled. Fix another bug that undo_retrans is incorrectly incremented when sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case is rare but not impossible. Signed-off-by: Yuchung Cheng <[email protected]> Acked-by: Ilpo Järvinen <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-20 | tcp: Remove debug macro of TCP_CHECK_TIMER | Shan Wei | 3 | -17/+0
Now, TCP_CHECK_TIMER is not used for debugging; it does nothing. It has been there for several years, maybe 6 years. Remove it to keep the code clearer. Signed-off-by: Shan Wei <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-19 | Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 | David S. Miller | 4 | -10/+24
Conflicts: Documentation/feature-removal-schedule.txt drivers/net/e1000e/netdev.c net/xfrm/xfrm_policy.c
2011-02-19 | tcp: fix inet_twsk_deschedule() | Eric Dumazet | 1 | -0/+2
Eric W. Biederman reported a lockdep splat in inet_twsk_deschedule().

This is caused by inet_twsk_purge(), run from process context, and commit 575f4cd5a5b6394577 (net: Use rcu lookups in inet_twsk_purge.) removed the BH disabling that was necessary.

Add the BH disabling but fine grained, right before calling inet_twsk_deschedule(), instead of around the whole function.

With help from Linus Torvalds and Eric W. Biederman

Reported-by: Eric W. Biederman <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
CC: Daniel Lezcano <[email protected]>
CC: Pavel Emelyanov <[email protected]>
CC: Arnaldo Carvalho de Melo <[email protected]>
CC: stable <[email protected]> (# 2.6.33+)
Signed-off-by: David S. Miller <[email protected]>
2011-02-18 | ipv4: Implement __ip_dev_find using new interface address hash. | David S. Miller | 2 | -40/+33
Much quicker than going through the FIB tables. Signed-off-by: David S. Miller <[email protected]>
2011-02-18 | ipv4: Add hash table of interface addresses. | David S. Miller | 1 | -0/+45
This will be used to optimize __ip_dev_find() and friends. With help from Eric Dumazet. Signed-off-by: David S. Miller <[email protected]>
2011-02-18 | net: provide default_advmss() methods to blackhole dst_ops | Eric Dumazet | 1 | -0/+1
Commit 0dbaee3b37e118a (net: Abstract default ADVMSS behind an accessor.) introduced a possible crash in tcp_connect_init(), when dst->default_advmss() is called from dst_metric_advmss() Reported-by: George Spelvin <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-17 | ipv4: Use const'ify fib_result deep in the route call chains. | David S. Miller | 2 | -16/+18
The only troublesome bit here is __mkroute_output which wants to override res->fi and res->type; compute those in local variables instead. Signed-off-by: David S. Miller <[email protected]>
2011-02-17 | ipv4: Avoid use of signed integers in fib_trie code. | David S. Miller | 1 | -5/+5
GCC emits all kinds of crazy zero extensions when we go from signed int, to unsigned short, etc. etc.

This transformation has to be legal because:

1) In tkey_extract_bits() and mask_pfx(), the values are used to perform shifts, on which negative values are undefined by C.

2) In fib_table_lookup() we perform comparisons with unsigned values, constants, and additions. None of which should encounter negative values.

Signed-off-by: David S. Miller <[email protected]>
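Illustration (not part of the commit): a sketch of why unsigned parameters fit the shift helpers named above. The function names follow the commit message, but the bodies, the added range guards, and `KEYLENGTH` as a 32-bit key width are assumptions made for this example, not a quotation of the fib_trie source.

```c
#include <stdint.h>

#define KEYLENGTH 32u          /* assumed key width in bits */

typedef uint32_t t_key;

/* Extract "bits" bits of "a", starting "offset" bits from the most
 * significant end. With unsigned parameters a negative shift count cannot
 * be expressed, so only the upper bound needs guarding. */
static inline t_key tkey_extract_bits(t_key a, unsigned int offset,
                                      unsigned int bits)
{
    if (offset >= KEYLENGTH || bits == 0 || bits > KEYLENGTH)
        return 0;
    return (t_key)(a << offset) >> (KEYLENGTH - bits);
}

/* Keep the top "l" bits of "k" and clear the rest (prefix mask). */
static inline t_key mask_pfx(t_key k, unsigned int l)
{
    if (l == 0)
        return 0;
    if (l >= KEYLENGTH)
        return k;
    return k >> (KEYLENGTH - l) << (KEYLENGTH - l);
}
```

For the key 0xC0A80164 (192.168.1.100), extracting 8 bits at offset 8 gives 0xA8, and masking to a 24-bit prefix gives 0xC0A80100.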
2011-02-17 | net: Add initial_ref arg to dst_alloc(). | David S. Miller | 1 | -5/+2
This allows avoiding multiple writes to the initial __refcnt. The simplest cases of wanting an initial reference of "1" in ipv4 and ipv6 have been converted, the rest have been left alone and kept at the existing "0". Signed-off-by: David S. Miller <[email protected]>
2011-02-17 | ipv4: Consolidate ipv4 dst allocation logic. | David S. Miller | 1 | -31/+21
This also allows us to combine all the dst->flags settings and avoid read/modify/write sequences to this struct member. Signed-off-by: David S. Miller <[email protected]>
2011-02-17 | ipv4: Move rcu_read_{lock,unlock}() into ip_route_output_slow(). | David S. Miller | 1 | -7/+6
Simplifies tail of __ip_route_output_key(). Signed-off-by: David S. Miller <[email protected]>
2011-02-17 | ipv4: Simplify output route creation call sequence. | David S. Miller | 1 | -35/+23
There's a lot of redundancy and unnecessary stack frames in the output route creation path.

1) Make __mkroute_output() return error pointers.

2) Eliminate ip_mkroute_output() entirely, made possible by #1.

3) Call __mkroute_output() directly and handle the returned error pointers in ip_route_output_slow().

Signed-off-by: David S. Miller <[email protected]>
2011-02-14 | ipv4: Cache learned redirect information in inetpeer. | David S. Miller | 1 | -94/+42
Note that we do not generate the redirect netevent any longer, because we don't create a new cached route. Instead, once the new neighbour is bound to the cached route, we emit a neigh update event instead. Signed-off-by: David S. Miller <[email protected]>
2011-02-14 | ipv4: Cache learned PMTU information in inetpeer. | David S. Miller | 1 | -174/+86
The general idea is that if we learn new PMTU information, we bump the peer genid.

This triggers the dst_ops->check() code to validate and if necessary propagate the new PMTU value into the metrics.

Learned PMTU information self-expires.

This means that it is not necessary to kill a cached route entry just because the PMTU information is too old.

As a consequence:

1) When the path appears unreachable (dst_ops->link_failure or dst_ops->negative_advice) we unwind the PMTU state if it is out of date, instead of killing the cached route.

   A redirected route will still be invalidated in these situations.

2) rt_check_expire(), rt_worker_func(), et al. are no longer necessary at all.

Signed-off-by: David S. Miller <[email protected]>
2011-02-14 | arp_notify: unconditionally send gratuitous ARP for NETDEV_NOTIFY_PEERS. | Ian Campbell | 1 | -10/+20
NETDEV_NOTIFY_PEERS is an explicit request by the driver to send a link notification while NETDEV_UP/NETDEV_CHANGEADDR generate link notifications as a sort of side effect.

In the latter cases the sysctl option is present because link notification events can have undesired effects, e.g. if the link is flapping. I don't think this applies in the case of an explicit request from a driver.

This patch makes NETDEV_NOTIFY_PEERS unconditional; if preferred we could add a new sysctl for this case which defaults to on.

This change causes Xen post-migration ARP notifications (which cause switches to relearn their MAC tables etc) to be sent by default.

Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2011-02-14 | ipv4: fix rcu lock imbalance in fib_select_default() | Eric Dumazet | 1 | -1/+1
Commit 0c838ff1ade7 (ipv4: Consolidate all default route selection implementations.) forgot to remove one rcu_read_unlock() from fib_select_default(). Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-11 | ip_gre: Add IPPROTO_GRE to flowi in ipgre_tunnel_xmit | Steffen Klassert | 1 | -0/+1
Commit 5811662b15db018c740c57d037523683fd3e6123 ("net: use the macros defined for the members of flowi") accidentally removed the setting of IPPROTO_GRE from the struct flowi in ipgre_tunnel_xmit. This patch restores it. Signed-off-by: Steffen Klassert <[email protected]> Acked-by: Changli Gao <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2011-02-10 | inet: Create a mechanism for upward inetpeer propagation into routes. | David S. Miller | 1 | -1/+18
If we didn't have a routing cache, we would not be able to properly propagate certain kinds of dynamic path attributes, for example PMTU information and redirects.

The reason is that if we didn't have a routing cache, then there would be no way to look up all of the active cached routes hanging off of sockets, tunnels, IPSEC bundles, etc.

Consider the case where we created a cached route, but no inetpeer entry existed and also we were not asked to pre-COW the route metrics and therefore did not force the creation of a new inetpeer entry.

If we later get a PMTU message, or a redirect, and store this information in a new inetpeer entry, there is no way to teach that cached route about the newly existing inetpeer entry.

The facilities implemented here handle this problem.

First we create a generation ID. When we create a cached route of any kind, we remember the generation ID at the time of attachment. Any time we force-create an inetpeer entry in response to new path information, we bump that generation ID.

The dst_ops->check() callback is where the knowledge of this event is propagated. If the global generation ID does not equal the one stored in the cached route, and the cached route has not attached to an inetpeer yet, we look it up and attach if one is found. Now that we've updated the cached route's information, we update the route's generation ID too.

This clears the way for implementing PMTU and redirects directly in the inetpeer cache. There is absolutely no need to consult cached route information in order to maintain this information.

At this point nothing bumps the inetpeer genids, that comes in the later changes which handle PMTUs and redirects using inetpeers.

Signed-off-by: David S. Miller <[email protected]>
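Illustration (not part of the commit): a minimal sketch of the generation-ID scheme described above. Every name here (`peer_genid`, `struct cached_route`, `peer_force_create`, `route_check`) is hypothetical, and a small fixed array stands in for the inetpeer tree; the kernel's data structures and the real dst_ops->check() are more involved.

```c
#include <stddef.h>
#include <stdint.h>

struct peer {
    uint32_t addr;
    uint32_t pmtu;
};

struct cached_route {
    uint32_t     dst;
    struct peer *peer;   /* NULL until a peer entry is attached */
    unsigned     genid;  /* global generation seen at last (re)validation */
};

#define MAX_PEERS 4

static unsigned    peer_genid;            /* global generation ID */
static struct peer peer_table[MAX_PEERS]; /* stand-in for the peer tree */
static size_t      peer_count;

static struct peer *peer_lookup(uint32_t addr)
{
    for (size_t i = 0; i < peer_count; i++)
        if (peer_table[i].addr == addr)
            return &peer_table[i];
    return NULL;
}

/* Called when new path information forces an entry into existence.
 * Creating an entry bumps the generation so cached routes revalidate. */
static struct peer *peer_force_create(uint32_t addr)
{
    struct peer *p = peer_lookup(addr);

    if (p)
        return p;
    if (peer_count == MAX_PEERS)
        return NULL;
    p = &peer_table[peer_count++];
    p->addr = addr;
    p->pmtu = 0;
    peer_genid++;
    return p;
}

/* Route creation: attach a peer if one already exists, and remember the
 * generation at the time of attachment. */
static void route_init(struct cached_route *rt, uint32_t dst)
{
    rt->dst = dst;
    rt->peer = peer_lookup(dst);
    rt->genid = peer_genid;
}

/* Plays the role of the check callback: on a generation mismatch, attach
 * a peer if the route has none yet, then record the new generation. */
static void route_check(struct cached_route *rt)
{
    if (rt->genid == peer_genid)
        return;
    if (!rt->peer)
        rt->peer = peer_lookup(rt->dst);
    rt->genid = peer_genid;
}
```

The comparison in `route_check` is a single integer test in the common case, so routes created before a peer entry existed pick it up lazily, the next time they are validated.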
2011-02-10 | inetpeer: Add redirect and PMTU discovery cached info. | David S. Miller | 1 | -0/+2
Validity of the cached PMTU information is indicated by its expiration value being non-zero, just as per dst->expires. The scheme we will use is that we will remember the pre-ICMP value held in the metrics or route entry, and then at expiration time we will restore that value. In this way PMTU expiration does not kill off the cached route as is done currently. Redirect information is permanent, or at least until another redirect is received. Signed-off-by: David S. Miller <[email protected]>
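Illustration (not part of the commit): a sketch of the remember-and-restore scheme described above, assuming a hypothetical `struct peer_pmtu` and a caller-supplied clock value. Field and function names are chosen for this example only.

```c
#include <stdint.h>

struct peer_pmtu {
    uint32_t pmtu_learned;  /* value learned from the ICMP message */
    uint32_t pmtu_orig;     /* pre-ICMP value to restore at expiry */
    uint64_t pmtu_expires;  /* 0 means no valid learned PMTU is cached */
};

/* Record a learned PMTU. The pre-ICMP MTU is saved only when no learned
 * value is currently active, so repeated updates keep the true original. */
static void pmtu_learn(struct peer_pmtu *p, uint32_t *route_mtu,
                       uint32_t new_mtu, uint64_t now, uint64_t lifetime)
{
    if (p->pmtu_expires == 0)
        p->pmtu_orig = *route_mtu;
    p->pmtu_learned = new_mtu;
    p->pmtu_expires = now + lifetime;
    if (p->pmtu_expires == 0)   /* 0 is reserved for "not valid" */
        p->pmtu_expires = 1;
    *route_mtu = new_mtu;
}

/* At expiry, restore the saved value instead of discarding the route. */
static void pmtu_check_expiry(struct peer_pmtu *p, uint32_t *route_mtu,
                              uint64_t now)
{
    if (p->pmtu_expires != 0 && now >= p->pmtu_expires) {
        *route_mtu = p->pmtu_orig;
        p->pmtu_expires = 0;
    }
}
```

Because the original value is kept alongside the learned one, expiry is a plain assignment on the existing route rather than a teardown and a fresh lookup.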