path: root/net/ipv6
Age  Commit message  Author  Files  Lines
2015-10-22  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec  David S. Miller  2  -8/+16
Steffen Klassert says: ==================== pull request (net): ipsec 2015-10-22 1) Fix IPsec pre-encap fragmentation for GSO packets. From Herbert Xu. 2) Fix some header checks in _decode_session6. We skip the header information if the data pointer already points behind the header in question for some protocols. This is because pskb_may_pull is called with a negative value that is converted to unsigned int in this case. Skipping the header information can lead to incorrect policy lookups. From Mathias Krause. 3) Allow changing the replay threshold and expiry timer of a state without having to set other attributes like replay counter and byte lifetime. Changing these other attributes may break the SA. From Michael Rossberg. 4) Fix pmtu discovery for local generated packets. We may fail dispatch to the inner address family. As a result, the local error handler is not called and the mtu value is not reported back to userspace. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <[email protected]>
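The signed-to-unsigned conversion behind item 2 can be illustrated with a small model. This is only a sketch: it assumes a 32-bit unsigned int, and `may_pull` is a hypothetical stand-in for the length check, not the kernel's pskb_may_pull() implementation. The offsets are made-up example values.

```python
UINT_MAX = 0xFFFFFFFF  # assumes a 32-bit unsigned int


def to_unsigned_int(value: int) -> int:
    """Model C's conversion of a signed int to a 32-bit unsigned int."""
    return value & UINT_MAX


def may_pull(available: int, length: int) -> bool:
    """Hypothetical stand-in for a pull check: succeed only if `length`
    bytes, interpreted as unsigned, are available."""
    return to_unsigned_int(length) <= available


# Example values (assumptions): the data pointer is already 8 bytes past
# the end of the header, so the computed length is negative.
header_end_offset = 40
data_offset = 48
needed = header_end_offset - data_offset  # -8

print(to_unsigned_int(needed))  # 4294967288
print(may_pull(1500, needed))   # False
```

Because the negative length becomes a very large unsigned value, the check fails and the header is skipped, which matches the symptom described in the pull request.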
2015-10-22  net: ipv6: Dont add RT6_LOOKUP_F_IFACE flag if saddr set  David Ahern  1  -2/+4
741a11d9e410 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set") adds the RT6_LOOKUP_F_IFACE flag to make device index mismatch fatal if oif is given. Hajime reported that this change breaks the Mobile IPv6 use case that wants to force the message through one interface yet use the source address from another interface. Handle this case by only adding the flag if oif is set and saddr is not set. Fixes: 741a11d9e410 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set") Cc: Hajime Tazaki <[email protected]> Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
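The rule described above (add the flag only if oif is set and saddr is not set) can be sketched as a small predicate. The flag value and the function name `lookup_flags` are assumptions made for illustration, and the sketch models only the oif/saddr condition from the commit message, not any other conditions the kernel code may apply.

```python
RT6_LOOKUP_F_IFACE = 0x1  # illustrative value, assumed for this sketch


def lookup_flags(oif: int, saddr_is_any: bool, flags: int = 0) -> int:
    """Add the strict-interface flag only when an output interface is
    given and no source address has been chosen."""
    if oif and saddr_is_any:
        flags |= RT6_LOOKUP_F_IFACE
    return flags


# oif given, no source address: device mismatch is fatal.
print(lookup_flags(oif=3, saddr_is_any=True))   # 1
# Mobile IPv6 case: oif given, source address from another interface.
print(lookup_flags(oif=3, saddr_is_any=False))  # 0
```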
2015-10-21  ipv6: gro: support sit protocol  Eric Dumazet  1  -0/+12
Tom Herbert added SIT support to GRO with commit 19424e052fb4 ("sit: Add gro callbacks to sit_offload"), later reverted by Herbert Xu. The problem arose because Tom's patch was building GRO packets without proper metadata: if packets were locally delivered, we would not care, but if packets needed to be forwarded, the GSO engine was not able to segment individual segments. With the following patch, we correctly set skb->encapsulation and inner network header. We also update gso_type.

Tested:

Server:
netserver
modprobe dummy
ifconfig dummy0 8.0.0.1 netmask 255.255.255.0 up
arp -s 8.0.0.100 4e:32:51:04:47:e5
iptables -I INPUT -s 10.246.7.151 -j TEE --gateway 8.0.0.100
ifconfig sixtofour0
sixtofour0 Link encap:IPv6-in-IPv4
          inet6 addr: 2002:af6:798::1/128 Scope:Global
          inet6 addr: 2002:af6:798::/128 Scope:Global
          UP RUNNING NOARP  MTU:1480  Metric:1
          RX packets:411169 errors:0 dropped:0 overruns:0 frame:0
          TX packets:409414 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:20319631739 (20.3 GB)  TX bytes:29529556 (29.5 MB)

Client:
netperf -H 2002:af6:798::1 -l 1000 &

Checked on the server that traffic was copied on dummy0 and verified that segments were properly rebuilt, with proper IP headers, TCP checksums... tcpdump on eth0 shows proper GRO aggregation takes place. Signed-off-by: Eric Dumazet <[email protected]> Acked-by: Tom Herbert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-21  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf  David S. Miller  1  -0/+1
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains four Netfilter fixes for net, they are: 1) Fix Kconfig dependencies of new nf_dup_ipv4 and nf_dup_ipv6. 2) Remove bogus test nh_scope in IPv4 rpfilter match that is breaking --accept-local, from Xin Long. 3) Wait for RCU grace period after dropping the pending packets in the nfqueue, from Florian Westphal. 4) Fix sleeping allocation while holding spin_lock_bh, from Nikolay Borisov. ==================== Signed-off-by: David S. Miller <[email protected]>
2015-10-21  netlink: Rightsize IFLA_AF_SPEC size calculation  Arad, Ronen  1  -1/+2
if_nlmsg_size() overestimates the minimum allocation size of netlink dump request (when called from rtnl_calcit()) or the size of the message (when called from rtnl_getlink()). This is because ext_filter_mask is not supported by rtnl_link_get_af_size() and rtnl_link_get_size(). The over-estimation is significant when at least one netdev has many VLANs configured (8 bytes for each configured VLAN). This patch-set "rightsizes" the protocol specific attribute size calculation by propagating ext_filter_mask to rtnl_link_get_af_size() and adding it as an argument to the get_link_af_size op in rtnl_af_ops. Bridge module already used filtering aware sizing for notifications. br_get_link_af_size_filtered() is consistent with the modified get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops. br_get_link_af_size() becomes unused and is thus removed. Signed-off-by: Ronen Arad <[email protected]> Acked-by: Sridhar Samudrala <[email protected]> Signed-off-by: David S. Miller <[email protected]>
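To get a feel for the scale of the over-estimation, the 8 bytes per VLAN figure from the message can be multiplied out. This is plain arithmetic on that quoted figure; the function name `vlan_overestimate` and the VLAN count in the example are assumptions chosen for illustration.

```python
BYTES_PER_VLAN = 8  # figure quoted in the commit message


def vlan_overestimate(num_vlans: int) -> int:
    """Extra bytes attributed to VLAN info when filtering is ignored."""
    return num_vlans * BYTES_PER_VLAN


# Example: a port with 4094 VLANs configured (assumed count).
print(vlan_overestimate(4094))  # 32752
```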
2015-10-21  net: Really fix vti6 with oif in dst lookups  David Ahern  2  -1/+5
6e28b000825d ("net: Fix vti use case with oif in dst lookups for IPv6") is missing the checks on FLOWI_FLAG_SKIP_NH_OIF. Add them. Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Cc: Steffen Klassert <[email protected]> Signed-off-by: David Ahern <[email protected]> Acked-by: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  David S. Miller  3  -26/+28
Conflicts: drivers/net/usb/asix_common.c net/ipv4/inet_connection_sock.c net/switchdev/switchdev.c In the inet_connection_sock.c case the request socket hashing scheme is completely different in net-next. The other two conflicts were overlapping changes. Signed-off-by: David S. Miller <[email protected]>
2015-10-19  xfrm: Fix pmtu discovery for local generated packets.  Steffen Klassert  1  -0/+1
Commit 044a832a777 ("xfrm: Fix local error reporting crash with interfamily tunnels") moved the setting of skb->protocol behind the last access of the inner mode family to fix an interfamily crash. Unfortunately now skb->protocol might not be set at all, so we fail dispatch to the inner address family. As a result, the local error handler is not called and the mtu value is not reported back to userspace. We fix this by setting skb->protocol on message size errors before we call xfrm_local_error. Fixes: 044a832a7779c ("xfrm: Fix local error reporting crash with interfamily tunnels") Signed-off-by: Steffen Klassert <[email protected]>
2015-10-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  David S. Miller  9  -35/+21
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for your net-next tree. Most relevantly, updates for the nfnetlink_log to integrate with conntrack, fixes for cttimeout and improvements for nf_queue core, they are: 1) Remove useless ifdef around static inline function in IPVS, from Eric W. Biederman. 2) Simplify the conntrack support for nfnetlink_queue: Merge nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back to nfnetlink_queue.c. 3) Use y2038 safe timestamp from nfnetlink_queue. 4) Get rid of dead function definition in nf_conntrack, from Flavio Leitner. 5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA. This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that controls enabling both nfqueue and nflog integration with conntrack. The userspace application can request this via NFULNL_CFG_F_CONNTRACK configuration flag. 6) Remove unused netns variables in IPVS, from Eric W. Biederman and Simon Horman. 7) Don't put back the refcount on the cttimeout object from xt_CT on success. 8) Fix crash on cttimeout policy object removal. We have to flush out the cttimeout extension area of the conntrack so as not to refer to a nonexistent object that was just removed. 9) Make sure rcu_callback completes before nfnetlink_cttimeout module removal. 10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann. 11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is requested. Again from Ken-ichirou MATSUZAWA. 12) Don't use pointer to previous hook when reinjecting traffic via nf_queue with NF_REPEAT verdict since it may be already gone. This also avoids a dead loop if the userspace application keeps returning NF_REPEAT. 13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris. 14) Consolidate logger instance existence check in nfulnl_recv_config(). 15) Fix broken atomicity when applying configuration updates to logger instances in nfnetlink_log. 16) Get rid of the .owner attribute in our hook object. We don't need this anymore since we're dropping pending packets that have escaped from the kernel when unregistering the hook. Patch from Florian Westphal. 17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian. 18) Use static inline function instead of macros to define NF_HOOK() and NF_HOOK_COND() when no netfilter support is on, from Arnd Bergmann. ==================== Signed-off-by: David S. Miller <[email protected]>
2015-10-18  tcp: do not set queue_mapping on SYNACK  Eric Dumazet  1  -2/+0
At the time of commit fff326990789 ("tcp: reflect SYN queue_mapping into SYNACK packets") we had few ways to cope with SYN floods. We no longer need to reflect incoming skb queue mappings, and instead can pick a TX queue based on cpu cooking the SYNACK, with normal XPS affinities. Note that all SYNACK retransmits were picking TX queue 0; this is no longer a win given that SYNACK rtx are now distributed on all cpus. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-17  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  Pablo Neira Ayuso  24  -106/+157
This merge resolves conflicts with 75aec9df3a78 ("bridge: Remove br_nf_push_frag_xmit_sk") as part of Eric Biederman's effort to improve netns support in the network stack that reached upstream via David's net-next tree. Signed-off-by: Pablo Neira Ayuso <[email protected]> Conflicts: net/bridge/br_netfilter_hooks.c
2015-10-16  netfilter: remove hook owner refcounting  Florian Westphal  4  -14/+0
Since commit 8405a8fff3f8 ("netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook") all pending queued entries are discarded. So we can simply remove all of the owner handling: when a module is removed it also needs to unregister all its hooks. Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-16  tcp/dccp: add inet_csk_reqsk_queue_drop_and_put() helper  Eric Dumazet  1  -1/+1
Let's reduce the confusion about inet_csk_reqsk_queue_drop() : In many cases we also need to release reference on request socket, so add a helper to do this, reducing code size and complexity. Fixes: 4bdc3d66147b ("tcp/dccp: fix behavior of stale SYN_RECV request sockets") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-16  ipv6: Initialize rt6_info properly in ip6_blackhole_route()  Martin KaFai Lau  1  -15/+5
ip6_blackhole_route() does not initialize the newly allocated rt6_info properly. This patch: 1. Call rt6_info_init() to initialize rt6i_siblings and rt6i_uncached 2. The current rt->dst._metrics init code is incorrect: - 'rt->dst._metrics = ort->dst._metrics' is not always safe - Not sure what dst_copy_metrics() is trying to do here considering ip6_rt_blackhole_cow_metrics() always returns NULL Fix: - Always do dst_copy_metrics() - Replace ip6_rt_blackhole_cow_metrics() with dst_cow_metrics_generic() 3. Mask out the RTF_PCPU bit from the newly allocated blackhole route. This bug triggers an oops (reported by Phil Sutter) in rt6_get_cookie(). It is because RTF_PCPU is set while rt->dst.from is NULL. Fixes: d52d3997f843 ("ipv6: Create percpu rt6_info") Signed-off-by: Martin KaFai Lau <[email protected]> Reported-by: Phil Sutter <[email protected]> Tested-by: Phil Sutter <[email protected]> Cc: Hannes Frederic Sowa <[email protected]> Cc: Julian Anastasov <[email protected]> Cc: Phil Sutter <[email protected]> Cc: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
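Step 3 above, masking one bit out of a copied flags word, can be sketched as follows. The numeric value given for RTF_PCPU and the helper name `blackhole_flags` are assumptions for illustration only; the point is the bit-clearing operation itself.

```python
RTF_PCPU = 0x40000000  # assumed value, used here only for illustration


def blackhole_flags(original_flags: int) -> int:
    """Copy the original route flags but clear the per-cpu bit."""
    return original_flags & ~RTF_PCPU


flags = RTF_PCPU | 0x3              # per-cpu bit plus two unrelated bits
print(hex(blackhole_flags(flags)))  # 0x3
```

Flags that do not have the bit set pass through unchanged, so the mask is safe to apply unconditionally.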
2015-10-16  ipv6: Move common init code for rt6_info to a new function rt6_info_init()  Martin KaFai Lau  1  -6/+11
Introduce rt6_info_init() to do the common init work for 'struct rt6_info' (after calling dst_alloc). It is prep work to fix the rt6_info init logic in ip6_blackhole_route(). Signed-off-by: Martin KaFai Lau <[email protected]> Cc: Hannes Frederic Sowa <[email protected]> Cc: Julian Anastasov <[email protected]> Cc: Phil Sutter <[email protected]> Cc: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-14  netfilter: ipv6: pointer cast layout  Ian Morris  1  -1/+1
Correct whitespace layout of a pointer casting. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-14  netfilter: ip6_tables: improve if statements  Ian Morris  1  -3/+3
Correct whitespace layout of if statements. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  tcp/dccp: fix behavior of stale SYN_RECV request sockets  Eric Dumazet  1  -1/+6
When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets become stale, meaning 3WHS can not complete. But current behavior is wrong: incoming packets finding such stale sockets are dropped. We need instead to clean up the request socket and perform another lookup: - Incoming ACK will give a RST answer, - SYN rtx might find another listener if available. - We expedite cleanup of request sockets and old listener socket. Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-13  netfilter: ip6_tables: ternary operator layout  Ian Morris  1  -2/+2
Correct whitespace layout of ternary operators in the netfilter-ipv6 code. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  netfilter: ipv6: whitespace around operators  Ian Morris  3  -5/+5
This patch cleanses whitespace around arithmetical operators. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  netfilter: ipv6: code indentation  Ian Morris  3  -6/+6
Use tabs instead of spaces to indent code. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  netfilter: ip6_tables: function definition layout  Ian Morris  1  -3/+3
Use tabs instead of spaces to indent second line of parameters in function definitions. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  netfilter: ip6_tables: label placement  Ian Morris  1  -1/+1
Whitespace cleansing: Labels should not be indented. No changes detected by objdiff. Signed-off-by: Ian Morris <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2015-10-13  net: Add VRF support to IPv6 stack  David Ahern  5  -14/+63
As with IPv4, support for VRFs is added to the IPv6 stack by replacing hardcoded table ids with possibly device-specific ones and manipulating the oif in the flowi6. The flow flags are used to skip the oif compare in nexthop lookups if the device is enslaved to a VRF via the L3 master device. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-13  net: Export fib6_get_table and nd_tbl  David Ahern  2  -0/+2
Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-13  ipv6: Don't call with rt6_uncached_list_flush_dev  Eric W. Biederman  1  -5/+7
As originally written rt6_uncached_list_flush_dev makes no sense when called with dev == NULL as it attempts to flush all uncached routes regardless of network namespace when dev == NULL. Which is simply incorrect behavior. Furthermore at the point rt6_ifdown is called with dev == NULL no more network devices exist in the network namespace so even if the code in rt6_uncached_list_flush_dev were to attempt something sensible it would be meaningless. Therefore remove support in rt6_uncached_list_flush_dev for handling network devices where dev == NULL, and only call rt6_uncached_list_flush_dev when rt6_ifdown is called with a network device. Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister") Signed-off-by: "Eric W. Biederman" <[email protected]> Reviewed-by: Martin KaFai Lau <[email protected]> Tested-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12  ipv6 route: use err pointers instead of returning pointer by reference  Roopa Prabhu  1  -15/+17
This patch makes ip6_route_info_create return an err pointer instead of returning the rt pointer by reference, as suggested by Dave. Signed-off-by: Roopa Prabhu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12  ipv6: Pass struct net into nf_ct_frag6_gather  Eric W. Biederman  2  -4/+3
The function nf_ct_frag6_gather is called on both the input and the output paths of the networking stack. In particular ipv6_defrag, which calls nf_ct_frag6_gather, is called from both the PRE_ROUTING chain on input and the LOCAL_OUT chain on output. The addition of a net parameter makes it explicit which network namespace the packets are being reassembled in, and removes the need for nf_ct_frag6_gather to guess. Signed-off-by: "Eric W. Biederman" <[email protected]> Acked-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12  net: shrink struct sock and request_sock by 8 bytes  Eric Dumazet  2  -3/+3
One 32bit hole is following skc_refcnt, use it. skc_incoming_cpu can also be a union for request_sock rcv_wnd. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-12  net: SO_INCOMING_CPU setsockopt() support  Eric Dumazet  2  -4/+9
SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command to fetch the incoming cpu handling a particular TCP flow after accept(). This commit adds setsockopt() support and extends SO_REUSEPORT selection logic: if a TCP listener or UDP socket has this option set, a packet is delivered to this socket only if the CPU handling the packet matches the specified one. This allows building very efficient TCP servers, using one listener per RX queue, as the associated TCP listener should only accept flows handled in softirq by the same cpu. This provides optimal NUMA behavior and keeps cpu caches hot. Note that __inet_lookup_listener() still has to iterate over the list of all listeners. Following patch puts sk_refcnt in a different cache line to let this iteration hit only shared and read mostly cache lines. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
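The "one listener per RX queue" idea can be sketched from userspace as below. This is a hedged illustration, not a tested server design: it assumes a Linux kernel that accepts setsockopt() for SO_INCOMING_CPU, falls back to the numeric value 49 (the value on common Linux architectures, stated here as an assumption) when the Python `socket` module lacks the constant, and uses the hypothetical helper names `make_listener` and `listeners_per_cpu`.

```python
import socket

# Use the module constant when this Python version provides it; the
# fallback value 49 is an assumption valid on common Linux architectures.
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def make_listener(port: int, cpu: int) -> socket.socket:
    """Create one IPv6 TCP listener that shares `port` via SO_REUSEPORT
    and asks the kernel to prefer flows handled on `cpu`."""
    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
    sock.bind(("::", port))
    sock.listen()
    return sock


def listeners_per_cpu(port: int, cpus: list[int]) -> dict[int, socket.socket]:
    """Open one listener per CPU; each would typically be served by a
    worker pinned to that CPU."""
    return {cpu: make_listener(port, cpu) for cpu in cpus}
```

A caller would then run something like `listeners_per_cpu(8080, [0, 1, 2, 3])` and hand each socket to a worker pinned to the matching CPU; the port and CPU list are placeholders.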
2015-10-11  ipv6: drop frames with attached skb->sk in forwarding  Hannes Frederic Sowa  1  -0/+3
This is a clone of commit 2ab957492d13b ("ip_forward: Drop frames with attached skb->sk") for ipv6. This commit has exactly the same reasons as the above mentioned commit, namely to prevent panics during netfilter reload or a misconfigured stack. Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-11  ipv6: gre: setup default multicast routes over PtP links  Hannes Frederic Sowa  1  -0/+2
GRE point-to-point interfaces should also support ipv6 multicast. Setting up default multicast routes on interface creation was forgotten. Add it. Bugzilla: <https://bugzilla.kernel.org/show_bug.cgi?id=103231> Cc: Julien Muchembled <[email protected]> Cc: Eric Dumazet <[email protected]> Cc: Nicolas Dumazet <[email protected]> Signed-off-by: Hannes Frederic Sowa <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  dst: Pass net into dst->output  Eric W. Biederman  4  -14/+11
The network namespace is already passed into dst_output; pass it into dst->output, lwt->output and friends. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  ipv4, ipv6: Pass net into ip_local_out and ip6_local_out  Eric W. Biederman  5  -6/+5
Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  ipv4, ipv6: Pass net into __ip_local_out and __ip6_local_out  Eric W. Biederman  1  -3/+2
Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  ipv6: Merge ip6_local_out and ip6_local_out_sk  Eric W. Biederman  5  -11/+5
Stop hiding the sk parameter with an inline helper function and make all of the callers pass it, so that it is clear what the function is doing. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  ipv6: Merge __ip6_local_out and __ip6_local_out_sk  Eric W. Biederman  3  -9/+4
Only __ip6_local_out_sk has callers, so rename __ip6_local_out_sk to __ip6_local_out and remove the previous __ip6_local_out. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  dst: Pass a sk into .local_out  Eric W. Biederman  3  -3/+3
For consistency with the other similar methods in the kernel, pass a struct sock into the dst_ops .local_out method. Simplifying the socket passing case is needed as a prequel to passing a struct net reference into .local_out. Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-08  net: Pass net into dst_output and remove dst_output_okfn  Eric W. Biederman  8  -11/+12
Replace dst_output_okfn with dst_output Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-07  net: Fix vti use case with oif in dst lookups for IPv6  David Ahern  1  -0/+1
It occurred to me yesterday that 741a11d9e4103 ("net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set") means that xfrm6_dst_lookup needs the FLOWI_FLAG_SKIP_NH_OIF flag set. This latest commit causes the oif to be considered in lookups which is known to break vti. This explains why 58189ca7b274 did not the IPv6 change at the time it was submitted. Fixes: 42a7b32b73d6 ("xfrm: Add oif to dst lookups") Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-05  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next  David S. Miller  2  -11/+16
Eric W. Biederman says: ==================== net: Pass net through ip fragmentation This is the next installment of my work to pass struct net through the output path so the code does not need to guess how to figure out which network namespace it is in, and ultimately routes can have output devices in another network namespace. This round focuses on passing net through ip fragmentation, which we seem to call from about everywhere. That is the main ip output paths, the bridge netfilter code, and openvswitch. This has to happen at once across the tree as function pointers are involved. First some prep work is done, then ipv4 and ipv6 are converted and then temporary helper functions are removed. ==================== Acked-by: Nicolas Dichtel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-05  ipv6: use ktime_t for internal timestamps  Arnd Bergmann  1  -9/+7
The ipv6 mip6 implementation is one of only a few users of the skb_get_timestamp() function in the kernel, which is both unsafe on 32-bit architectures because of the 2038 overflow, and slightly less efficient than the skb_get_ktime() based approach. This converts the function call and the mip6_report_rate_limiter structure that stores the time stamp, eliminating all uses of timeval in the ipv6 code. Signed-off-by: Arnd Bergmann <[email protected]> Cc: Alexey Kuznetsov <[email protected]> Cc: James Morris <[email protected]> Cc: Hideaki YOSHIFUJI <[email protected]> Cc: Patrick McHardy <[email protected]> Signed-off-by: David S. Miller <[email protected]>
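The 2038 overflow mentioned above comes from storing seconds in a signed 32-bit field. A minimal model of that wrap-around, assuming a 32-bit seconds counter (the helper `wrap_int32` is hypothetical and only mimics the storage width):

```python
import datetime

INT32_MAX = 2**31 - 1


def wrap_int32(seconds: int) -> int:
    """Model storing a seconds count in a signed 32-bit field."""
    return (seconds + 2**31) % 2**32 - 2**31


# Last second that still fits in a signed 32-bit counter.
last_ok = datetime.datetime.fromtimestamp(INT32_MAX, tz=datetime.timezone.utc)
print(last_ok.isoformat())        # 2038-01-19T03:14:07+00:00

# One second later the stored value wraps to a large negative number.
print(wrap_int32(INT32_MAX + 1))  # -2147483648
```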
2015-10-05  tcp: avoid two atomic ops for syncookies  Eric Dumazet  1  -1/+1
inet_reqsk_alloc() is used to allocate a temporary request in order to generate a SYNACK with a cookie. Then later, syncookie validation also uses a temporary request. These paths already took a reference on listener refcount, we can avoid a couple of atomic operations. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp: do not lock listener to process SYN packets  Eric Dumazet  1  -2/+9
Everything should now be ready to finally allow SYN packets processing without holding listener lock.

Tested: 3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

Next bottleneck is the refcount taken on listener, that could be avoided if we remove SLAB_DESTROY_BY_RCU strict semantic for listeners, and use regular RCU.

13.18%  [kernel]  [k] __inet_lookup_listener
 9.61%  [kernel]  [k] tcp_conn_request
 8.16%  [kernel]  [k] sha_transform
 5.30%  [kernel]  [k] inet_reqsk_alloc
 4.22%  [kernel]  [k] sock_put
 3.74%  [kernel]  [k] tcp_make_synack
 2.88%  [kernel]  [k] ipt_do_table
 2.56%  [kernel]  [k] memcpy_erms
 2.53%  [kernel]  [k] sock_wfree
 2.40%  [kernel]  [k] tcp_v4_rcv
 2.08%  [kernel]  [k] fib_table_lookup
 1.84%  [kernel]  [k] tcp_openreq_init_rwin

Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp: attach SYNACK messages to request sockets instead of listener  Eric Dumazet  1  -2/+3
If a listen backlog is very big (to avoid syncookies), then the listener sk->sk_wmem_alloc is the main source of false sharing, as we need to touch it twice per SYNACK re-transmit and TX completion. (One SYN packet takes listener lock once, but up to 6 SYNACK are generated) By attaching the skb to the request socket, we remove this source of contention.

Tested:

listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.

20.29%  [kernel]  [k] queued_spin_lock_slowpath
10.06%  [kernel]  [k] __inet_lookup_established
 5.12%  [kernel]  [k] reqsk_timer_handler
 3.22%  [kernel]  [k] get_next_timer_interrupt
 3.00%  [kernel]  [k] tcp_make_synack
 2.77%  [kernel]  [k] ipt_do_table
 2.70%  [kernel]  [k] run_timer_softirq
 2.50%  [kernel]  [k] ip_finish_output
 2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp/dccp: install syn_recv requests into ehash table  Eric Dumazet  2  -113/+36
In this patch, we insert request sockets into TCP/DCCP regular ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of using the per listener hash table. ACK packets find SYN_RECV pseudo sockets without having to find and lock the listener. In nominal conditions, this halves pressure on listener lock. Note that this will allow for SO_REUSEPORT refinements, so that we can select a listener using cpu/numa affinities instead of the prior 'consistent hash', since only SYN packets will apply this selection logic. We will shrink listen_sock in the following patch to ease code review. Signed-off-by: Eric Dumazet <[email protected]> Cc: Ying Cai <[email protected]> Cc: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument  Eric Dumazet  1  -1/+1
This is no longer used. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp: get_openreq[46]() changes  Eric Dumazet  1  -3/+4
When request sockets are no longer in a per listener hash table but on regular TCP ehash, we need to access listener uid through req->rsk_listener. get_openreq6() also gets a const for its request socket argument. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2015-10-03  tcp: cleanup tcp_v[46]_inbound_md5_hash()  Eric Dumazet  1  -4/+6
We'll soon have to call tcp_v[46]_inbound_md5_hash() twice. Also add const attribute to the socket, as it might be the unlocked listener for SYN packets. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>