aboutsummaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2015-12-25ip_tunnel: Move stats update to iptunnel_xmit()Pravin B Shelar1-5/+2
By moving stats update into iptunnel_xmit(), we can simplify iptunnel_xmit() usage. With this change there is no need to call another function (iptunnel_xmit_stats()) to update stats in tunnel xmit code path. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22tcp: honour SO_BINDTODEVICE for TW_RST case tooFlorian Westphal1-2/+4
Hannes points out that when we generate tcp reset for timewait sockets we pretend we found no socket and pass NULL sk to tcp_vX_send_reset(). Make it cope with inet tw sockets and then provide tw sk. This makes RSTs appear on correct interface when SO_BINDTODEVICE is used. Packetdrill test case: // want default route to be used, we rely on BINDTODEVICE `ip route del 192.0.2.0/24 via 192.168.0.2 dev tun0` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 // test case still works due to BINDTODEVICE 0.001 setsockopt(3, SOL_SOCKET, SO_BINDTODEVICE, "tun0", 4) = 0 0.100...0.200 connect(3, ..., ...) = 0 0.100 > S 0:0(0) <mss 1460,sackOK,nop,nop> 0.200 < S. 0:0(0) ack 1 win 32792 <mss 1460,sackOK,nop,nop> 0.200 > . 1:1(0) ack 1 0.210 close(3) = 0 0.210 > F. 1:1(0) ack 1 win 29200 0.300 < . 1:1(0) ack 2 win 46 // more data while in FIN_WAIT2, expect RST 1.300 < P. 1:1001(1000) ack 1 win 46 // fails without this change -- default route is used 1.301 > R 1:1(0) win 0 Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-22tcp: send_reset: test for non-NULL sk firstFlorian Westphal1-3/+3
tcp_md5_do_lookup requires a full socket, so once we extend _send_reset() to also accept timewait socket we would have to change if (!sk && hash_location) to something like if ((!sk || !sk_fullsock(sk)) && hash_location) { ... } else { (sk && sk_fullsock(sk)) tcp_md5_do_lookup() } Switch the two branches: check if we have a socket first, then fall back to a listener lookup if we saw a md5 option (hash_location). Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller3-114/+74
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains the first batch of Netfilter updates for the upcoming 4.5 kernel. This batch contains userspace netfilter header compilation fixes, support for packet mangling in nf_tables, the new tracing infrastructure for nf_tables and cgroup2 support for iptables. More specifically, they are: 1) Two patches to include dependencies in our netfilter userspace headers to resolve compilation problems, from Mikko Rapeli. 2) Four comestic cleanup patches for the ebtables codebase, from Ian Morris. 3) Remove duplicate include in the netfilter reject infrastructure, from Stephen Hemminger. 4) Two patches to simplify the netfilter defragmentation code for IPv6, patch from Florian Westphal. 5) Fix root ownership of /proc/net netfilter for unpriviledged net namespaces, from Philip Whineray. 6) Get rid of unused fields in struct nft_pktinfo, from Florian Westphal. 7) Add mangling support to our nf_tables payload expression, from Patrick McHardy. 8) Introduce a new netlink-based tracing infrastructure for nf_tables, from Florian Westphal. 9) Change setter functions in nfnetlink_log to be void, from Rami Rosen. 10) Add netns support to the cttimeout infrastructure. 11) Add cgroup2 support to iptables, from Tejun Heo. 12) Introduce nfnl_dereference_protected() in nfnetlink, from Florian. 13) Add support for mangling pkttype in the nf_tables meta expression, also from Florian. BTW, I need that you pull net into net-next, I have another batch that requires changes that I don't yet see in net. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18net: Allow accepted sockets to be bound to l3mdev domainDavid Ahern1-2/+2
Allow accepted sockets to derive their sk_bound_dev_if setting from the l3mdev domain in which the packets originated. A sysctl setting is added to control the behavior which is similar to sk_mark and sysctl_tcp_fwmark_accept. This effectively allow a process to have a "VRF-global" listen socket, with child sockets bound to the VRF device in which the packet originated. A similar behavior can be achieved using sk_mark, but a solution using marks is incomplete as it does not handle duplicate addresses in different L3 domains/VRFs. Allowing sockets to inherit the sk_bound_dev_if from l3mdev domain provides a complete solution. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18ipv6: addrconf: use stable address generator for ARPHRD_NONEBjørn Mork1-6/+39
Add a new address generator mode, using the stable address generator with an automatically generated secret. This is intended as a default address generator mode for device types with no EUI64 implementation. The new generator is used for ARPHRD_NONE interfaces initially, adding default IPv6 autoconf support to e.g. tun interfaces. If the addrgenmode is set to 'random', either by default or manually, and no stable secret is available, then a random secret is used as input for the stable-privacy address generator. The secret can be read and modified like manually configured secrets, using the proc interface. Modifying the secret will change the addrgen mode to 'stable-privacy' to indicate that it operates on a known secret. Existing behaviour of the 'stable-privacy' mode is kept unchanged. If a known secret is available when the device is created, then the mode will default to 'stable-privacy' as before. The mode can be manually set to 'random' but it will behave exactly like 'stable-privacy' in this case. The secret will not change. Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-18ila: add NETFILTER dependencyArnd Bergmann1-0/+1
The recently added generic ILA translation facility fails to build when CONFIG_NETFILTER is disabled: net/ipv6/ila/ila_xlat.c:229:20: warning: 'struct nf_hook_state' declared inside parameter list net/ipv6/ila/ila_xlat.c:235:27: error: array type has incomplete element type 'struct nf_hook_ops' static struct nf_hook_ops ila_nf_hook_ops[] __read_mostly = { This adds an explicit Kconfig dependency to avoid that case. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 7f00feaf1076 ("ila: Add generic ILA translation facility") Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller5-8/+15
Conflicts: drivers/net/geneve.c Here we had an overlapping change, where in 'net' the extraneous stats bump was being removed whilst in 'net-next' the final argument to udp_tunnel6_xmit_skb() was being changed. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17ipv6: add IPV6_HDRINCL option for raw socketsHannes Frederic Sowa1-4/+16
Same as in Windows, we miss IPV6_HDRINCL for SOL_IPV6 and SOL_RAW. The SOL_IP/IP_HDRINCL is not available for IPv6 sockets. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-17ipv6: allow routes to be configured with expire valuesXin Long1-0/+10
Add the support for adding expire value to routes, requested by Tom Gundersen <teg@jklm.no> for systemd-networkd, and NetworkManager wants it too. implement it by adding the new RTNETLINK attribute RTA_EXPIRES. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15ipv6: automatically enable stable privacy mode if stable_secret setHannes Frederic Sowa1-0/+6
Bjørn reported that while we switch all interfaces to privacy stable mode when setting the secret, we don't set this mode for new interfaces. This does not make sense, so change this behaviour. Fixes: 622c81d57b392cc ("ipv6: generation of stable privacy addresses for link-local and autoconf") Reported-by: Bjørn Mork <bjorn@mork.no> Cc: Bjørn Mork <bjorn@mork.no> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net: diag: Support destroying TCP sockets.Lorenzo Colitti1-0/+1
This implements SOCK_DESTROY for TCP sockets. It causes all blocking calls on the socket to fail fast with ECONNABORTED and causes a protocol close of the socket. It informs the other end of the connection by sending a RST, i.e., initiating a TCP ABORT as per RFC 793. ECONNABORTED was chosen for consistency with FreeBSD. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15ila: Add generic ILA translation facilityTom Herbert4-1/+691
This patch implements an ILA tanslation table. This table can be configured with identifier to locator mappings, and can be be queried to resolve a mapping. Queries can be parameterized based on interface, direction (incoming or outoing), and matching locator. The table is implemented using rhashtable and is configured via netlink (through "ip ila .." in iproute). The table may be used as alternative means to do do ILA tanslations other than the lw tunnels Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15ila: Create net/ipv6/ila directoryTom Herbert5-81/+152
Create ila directory in preparation for supporting other hooks in the kernel than LWT for doing ILA. This includes: - Moving ila.c to ila/ila_lwt.c - Splitting out some common functions into ila_common.c Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUMTom Herbert2-2/+3
These netif flags are unnecessary convolutions. It is more straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM, and NETIF_F_IPV6_CSUM directly. This patch also: - Cleans up can_checksum_protocol - Simplifies netdev_intersect_features Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14net: fix IP early demux racesEric Dumazet1-2/+1
David Wilder reported crashes caused by dst reuse. <quote David> I am seeing a crash on a distro V4.2.3 kernel caused by a double release of a dst_entry. In ipv4_dst_destroy() the call to list_empty() finds a poisoned next pointer, indicating the dst_entry has already been removed from the list and freed. The crash occurs 18 to 24 hours into a run of a network stress exerciser. </quote> Thanks to his detailed report and analysis, we were able to understand the core issue. IP early demux can associate a dst to skb, after a lookup in TCP/UDP sockets. When socket cache is not properly set, we want to store into sk->sk_dst_cache the dst for future IP early demux lookups, by acquiring a stable refcount on the dst. Problem is this acquisition is simply using an atomic_inc(), which works well, unless the dst was queued for destruction from dst_release() noticing dst refcount went to zero, if DST_NOCACHE was set on dst. We need to make sure current refcount is not zero before incrementing it, or risk double free as David reported. This patch, being a stable candidate, adds two new helpers, and use them only from IP early demux problematic paths. It might be possible to merge in net-next skb_dst_force() and skb_dst_force_safe(), but I prefer having the smallest patch for stable kernels : Maybe some skb_dst_force() callers do not expect skb->dst can suddenly be cleared. Can probably be backported back to linux-3.6 kernels Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14ipv6: addrconf: drop ieee802154 specific thingsAlexander Aring1-5/+3
This patch removes ARPHRD_IEEE802154 from addrconf handling. In the earlier days of 802.15.4 6LoWPAN, the interface type was ARPHRD_IEEE802154 which introduced several issues, because 802.15.4 interfaces used the same type. Since commit 965e613d299c ("ieee802154: 6lowpan: fix ARPHRD to ARPHRD_6LOWPAN") we use ARPHRD_6LOWPAN for 6LoWPAN interfaces. This patch will remove ARPHRD_IEEE802154 which is currently deadcode, because ARPHRD_IEEE802154 doesn't reach the minimum 1280 MTU of IPv6. Also we use 6LoWPAN EUI64 specific defines instead using link-layer constanst from 802.15.4 link-layer header. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14net: add validation for the socket syscall protocol argumentHannes Frederic Sowa1-0/+3
郭永刚 reported that one could simply crash the kernel as root by using a simple program: int socket_fd; struct sockaddr_in addr; addr.sin_port = 0; addr.sin_addr.s_addr = INADDR_ANY; addr.sin_family = 10; socket_fd = socket(10,3,0x40000000); connect(socket_fd , &addr,16); AF_INET, AF_INET6 sockets actually only support 8-bit protocol identifiers. inet_sock's skc_protocol field thus is sized accordingly, thus larger protocol identifiers simply cut off the higher bits and store a zero in the protocol fields. This could lead to e.g. NULL function pointer because as a result of the cut off inet_num is zero and we call down to inet_autobind, which is NULL for raw sockets. kernel: Call Trace: kernel: [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70 kernel: [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80 kernel: [<ffffffff81645069>] SYSC_connect+0xd9/0x110 kernel: [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80 kernel: [<ffffffff810236d8>] ? syscall_trace_enter_phase2+0x108/0x200 kernel: [<ffffffff81645e0e>] SyS_connect+0xe/0x10 kernel: [<ffffffff81779515>] tracesys_phase2+0x84/0x89 I found no particular commit which introduced this problem. CVE: CVE-2015-8543 Cc: Cong Wang <cwang@twopensource.com> Reported-by: 郭永刚 <guoyonggang@360.cn> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-14Merge branch 'master' of ↵Pablo Neira Ayuso18-95/+125
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Resolve conflict between commit 264640fc2c5f4f ("ipv6: distinguish frag queues by device for multicast and link-local packets") from the net tree and commit 029f7f3b8701c ("netfilter: ipv6: nf_defrag: avoid/free clone operations") from the nf-next tree. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Conflicts: net/ipv6/netfilter/nf_conntrack_reasm.c
2015-12-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-0/+1
Pablo Neira Ayuso says: ==================== netfilter fixes for net The following patchset contains Netfilter fixes for you net tree, specifically for nf_tables and nfnetlink_queue, they are: 1) Avoid a compilation warning in nfnetlink_queue that was introduced in the previous merge window with the simplification of the conntrack integration, from Arnd Bergmann. 2) nfnetlink_queue is leaking the pernet subsystem registration from a failure path, patch from Nikolay Borisov. 3) Pass down netns pointer to batch callback in nfnetlink, this is the largest patch and it is not a bugfix but it is a dependency to resolve a splat in the correct way. 4) Fix a splat due to incorrect socket memory accounting with nfnetlink skbuff clones. 5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6. 6) Traverse the nftables commit list in reverse order from the commit path, otherwise we crash when the user applies an incremental update via 'nft -f' that deletes an object that was just introduced in this batch, from Xin Long. Regarding the compilation warning fix, many people have sent us (and keep sending us) patches to address this, that's why I'm including this batch even if this is not critical. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-10netfilter: nf_dup: add missing dependencies with NF_CONNTRACKPablo Neira Ayuso1-0/+1
CONFIG_NF_CONNTRACK=m CONFIG_NF_DUP_IPV4=y results in: net/built-in.o: In function `nf_dup_ipv4': >> (.text+0xd434f): undefined reference to `nf_conntrack_untracked' Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-09netfilter: ipv6: nf_defrag: fix NULL deref panicFlorian Westphal1-5/+8
Valdis reports NULL deref in nf_ct_frag6_gather. Problem is bogus use of skb_queue_walk() -- we miss first skb in the list since we start with head->next instead of head. In case the element we're looking for was head->next we won't find a result and then trip over NULL iter. (defrag uses plain NULL-terminated list rather than one terminated by head-of-list-pointer, which is what skb_queue_walk expects). Fixes: 029f7f3b8701cc7a ("netfilter: ipv6: nf_defrag: avoid/free clone operations") Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-12-05ipv6: keep existing flags when setting IFA_F_OPTIMISTICBjørn Mork1-1/+1
Commit 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") failed to update the setting of the IFA_F_OPTIMISTIC flag, causing the IFA_F_STABLE_PRIVACY flag to be lost if IFA_F_OPTIMISTIC is set. Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05ipv6: Only act upon NETDEV_*_TYPE_CHANGE if we have ipv6 addressesAndrew Lunn1-1/+2
An interface changing type may not have IPv6 addresses. Don't call the address configuration type change in this case. Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-04gre6: allow to update all parameters via rtnlNicolas Dichtel1-5/+3
Parameters were updated only if the kernel was unable to find the tunnel with the new parameters, ie only if core pamareters were updated (keys, addr, link, type). Now it's possible to update ttl, hoplimit, flowinfo and flags. Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller17-88/+102
Conflicts: drivers/net/ethernet/renesas/ravb_main.c kernel/bpf/syscall.c net/ipv4/ipmr.c All three conflicts were cases of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03net: ipv6: restrict hop_limit sysctl setting to range [1; 255]Phil Sutter1-1/+15
Setting a value bigger than 255 resulted in using only the lower eight bits of that value as it is assigned to the u8 header field. To avoid this unexpected result, reject such values. Setting a value of zero is technically possible, but hosts receiving such a packet have to treat it like hop_limit was set to one, according to RFC2460. Therefore I don't see a use-case for that. Setting a route's hop_limit to zero in iproute2 means to use the sysctl default, which is not the case here: Setting e.g. net.conf.eth0.hop_limit=0 will not make the kernel use net.conf.all.hop_limit for outgoing packets on eth0. To avoid these kinds of confusion, reject zero. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03ipv6: kill sk_dst_lockEric Dumazet4-26/+4
While testing the np->opt RCU conversion, I found that UDP/IPv6 was using a mixture of xchg() and sk_dst_lock to protect concurrent changes to sk->sk_dst_cache, leading to possible corruptions and crashes. ip6_sk_dst_lookup_flow() uses sk_dst_check() anyway, so the simplest way to fix the mess is to remove sk_dst_lock completely, as we did for IPv4. __ip6_dst_store() and ip6_dst_store() share same implementation. sk_setup_caps() being called with socket lock being held or not, we have to use sk_dst_set() instead of __sk_dst_set() Note that I had to move the "np->dst_cookie = rt6_get_cookie(rt);" in ip6_dst_store() before the sk_setup_caps(sk, dst) call. This is because ip6_dst_store() can be called from process context, without any lock held. As soon as the dst is installed in sk->sk_dst_cache, dst can be freed from another cpu doing a concurrent ip6_dst_store() Doing the dst dereference before doing the install is needed to make sure no use after free would trigger. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02tcp: suppress too verbose messages in tcp_send_ack()Eric Dumazet1-3/+3
If tcp_send_ack() can not allocate skb, we properly handle this and setup a timer to try later. Use __GFP_NOWARN to avoid polluting syslog in the case host is under memory pressure, so that pertinent messages are not lost under a flood of useless information. sk_gfp_atomic() can use its gfp_mask argument (all callers currently were using GFP_ATOMIC before this patch) We rename sk_gfp_atomic() to sk_gfp_mask() to clearly express this function now takes into account its second argument (gfp_mask) Note that when tcp_transmit_skb() is called with clone_it set to false, we do not attempt memory allocations, so can pass a 0 gfp_mask, which most compilers can emit faster than a non zero or constant value. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02ipv6: add complete rcu protection around np->optEric Dumazet9-36/+74
This patch addresses multiple problems : UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions while socket is not locked : Other threads can change np->opt concurrently. Dmitry posted a syzkaller (http://github.com/google/syzkaller) program desmonstrating use-after-free. Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock() and dccp_v6_request_recv_sock() also need to use RCU protection to dereference np->opt once (before calling ipv6_dup_options()) This patch adds full RCU protection to np->opt Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01Revert "ipv6: ndisc: inherit metadata dst when creating ndisc requests"Nicolas Dichtel3-9/+5
This reverts commit ab450605b35caa768ca33e86db9403229bf42be4. In IPv6, we cannot inherit the dst of the original dst. ndisc packets are IPv6 packets and may take another route than the original packet. This patch breaks the following scenario: a packet comes from eth0 and is forwarded through vxlan1. The encapsulated packet triggers an NS which cannot be sent because of the wrong route. CC: Jiri Benc <jbenc@redhat.com> CC: Thomas Graf <tgraf@suug.ch> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-30net: remove unnecessary mroute.h includesNikolay Aleksandrov1-1/+0
It looks like many files are including mroute.h unnecessarily, so remove the include. Most importantly remove it from ipv6. CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24net: ipmr, ip6mr: fix vif/tunnel failure race conditionNikolay Aleksandrov1-4/+0
Since (at least) commit b17a7c179dd3 ("[NET]: Do sysfs registration as part of register_netdevice."), netdev_run_todo() deals only with unregistration, so we don't need to do the rtnl_unlock/lock cycle to finish registration when failing pimreg or dvmrp device creation. In fact that opens a race condition where someone can delete the device while rtnl is unlocked because it's fully registered. The problem gets worse when netlink support is introduced as there are more points of entry that can cause it and it also makes reusing that code correctly impossible. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-24ipv6: distinguish frag queues by device for multicast and link-local packetsMichal Kubeček2-5/+10
If a fragmented multicast packet is received on an ethernet device which has an active macvlan on top of it, each fragment is duplicated and received both on the underlying device and the macvlan. If some fragments for macvlan are processed before the whole packet for the underlying device is reassembled, the "overlapping fragments" test in ip6_frag_queue() discards the whole fragment queue. To resolve this, add device ifindex to the search key and require it to match reassembling multicast packets and packets to link-local addresses. Note: similar patch has been already submitted by Yoshifuji Hideaki in http://patchwork.ozlabs.org/patch/220979/ but got lost and forgotten for some reason. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-23netfilter: ipv6: avoid nf_iterate recursionFlorian Westphal2-49/+36
The previous patch changed nf_ct_frag6_gather() to morph reassembled skb with the previous one. This means that the return value is always NULL or the skb argument. So change it to an err value. Instead of invoking NF_HOOK recursively with threshold to skip already-called hooks we can now just return NF_ACCEPT to move on to the next hook except for -EINPROGRESS (which means skb has been queued for reassembly), in which case we return NF_STOLEN. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-23netfilter: ipv6: nf_defrag: avoid/free clone operationsFlorian Westphal2-71/+40
commit 6aafeef03b9d9ecf ("netfilter: push reasm skb through instead of original frag skbs") changed ipv6 defrag to not use the original skbs anymore. So rather than keeping the original skbs around just to discard them afterwards just use the original skbs directly for the fraglist of the newly assembled skb and remove the extra clone/free operations. The skb that completes the fragment queue is morphed into a the reassembled one instead, just like ipv4 defrag. openvswitch doesn't need any additional skb_morph magic anymore to deal with this situation so just remove that. A followup patch can then also remove the NF_HOOK (re)invocation in the ipv6 netfilter defrag hook. Cc: Joe Stringer <joestringer@nicira.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-23netfilter: remove duplicate includestephen hemminger1-1/+0
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-11-22net: ip6mr: fix static mfc/dev leaks on table destructionNikolay Aleksandrov1-7/+8
Similar to ipv4, when destroying an mrt table the static mfc entries and the static devices are kept, which leads to devices that can never be destroyed (because of refcnt taken) and leaked memory. Make sure that everything is cleaned up on netns destruction. Fixes: 8229efdaef1e ("netns: ip6mr: enable namespace support in ipv6 multicast forwarding code") CC: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-22net: IPv6 fib lookup tracepointDavid Ahern1-0/+10
Add tracepoint to show fib6 table lookups and result. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18net/ip6_tunnel: fix dst leakPaolo Abeni1-1/+1
the commit cdf3464e6c6b ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel") introduced percpu storage for ip6_tunnel dst cache, but while clearing such cache it used raw_cpu_ptr to walk the per cpu entries, so cached dst on non current cpu are not actually reset. This patch replaces raw_cpu_ptr with per_cpu_ptr, properly cleaning such storage. Fixes: cdf3464e6c6b ("ipv6: Fix dst_entry refcnt bugs in ip6_tunnel") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16snmp: Remove duplicate OUTMCAST stat incrementNeil Horman1-2/+0
the OUTMCAST stat is double incremented, getting bumped once in the mcast code itself, and again in the common ip output path. Remove the mcast bump, as its not needed Validated by the reporter, with good results Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Claus Jensen <claus.jensen@microsemi.com> CC: Claus Jensen <claus.jensen@microsemi.com> CC: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15tcp: ensure proper barriers in lockless contextsEric Dumazet1-4/+15
Some functions access TCP sockets without holding a lock and might output non consistent data, depending on compiler and or architecture. tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ... Introduce sk_state_load() and sk_state_store() to fix the issues, and more clearly document where this lack of locking is happening. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15ipv6: Check rt->dst.from for the DST_NOCACHE routeMartin KaFai Lau1-1/+2
All DST_NOCACHE rt6_info used to have rt->dst.from set to its parent. After commit 8e3d5be73681 ("ipv6: Avoid double dst_free"), DST_NOCACHE is also set to rt6_info which does not have a parent (i.e. rt->dst.from is NULL). This patch catches the rt->dst.from == NULL case. Fixes: 8e3d5be73681 ("ipv6: Avoid double dst_free") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15ipv6: Check expire on DST_NOCACHE routeMartin KaFai Lau1-1/+10
Since the expires of the DST_NOCACHE rt can be set during the ip6_rt_update_pmtu(), we also need to consider the expires value when doing ip6_dst_check(). This patches creates __rt6_check_expired() to only check the expire value (if one exists) of the current rt. In rt6_dst_from_check(), it adds __rt6_check_expired() as one of the condition check. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-15ipv6: Avoid creating RTF_CACHE from a rt that is not managed by fib6 treeMartin KaFai Lau1-1/+7
The original bug report: https://bugzilla.redhat.com/show_bug.cgi?id=1272571 The setup has a IPv4 GRE tunnel running in a IPSec. The bug happens when ndisc starts sending router solicitation at the gre interface. The simplified oops stack is like: __lock_acquire+0x1b2/0x1c30 lock_acquire+0xb9/0x140 _raw_write_lock_bh+0x3f/0x50 __ip6_ins_rt+0x2e/0x60 ip6_ins_rt+0x49/0x50 ~~~~~~~~ __ip6_rt_update_pmtu.part.54+0x145/0x250 ip6_rt_update_pmtu+0x2e/0x40 ~~~~~~~~ ip_tunnel_xmit+0x1f1/0xf40 __gre_xmit+0x7a/0x90 ipgre_xmit+0x15a/0x220 dev_hard_start_xmit+0x2bd/0x480 __dev_queue_xmit+0x696/0x730 dev_queue_xmit+0x10/0x20 neigh_direct_output+0x11/0x20 ip6_finish_output2+0x21f/0x770 ip6_finish_output+0xa7/0x1d0 ip6_output+0x56/0x190 ~~~~~~~~ ndisc_send_skb+0x1d9/0x400 ndisc_send_rs+0x88/0xc0 ~~~~~~~~ The rt passed to ip6_rt_update_pmtu() is created by icmp6_dst_alloc() and it is not managed by the fib6 tree, so its rt6i_table == NULL. When __ip6_rt_update_pmtu() creates a RTF_CACHE clone, the newly created clone also has rt6i_table == NULL and it causes the ip6_ins_rt() oops. During pmtu update, we only want to create a RTF_CACHE clone from a rt which is currently managed (or owned) by the fib6 tree. It means either rt->rt6i_node != NULL or rt is a RTF_PCPU clone. It is worth to note that rt6i_table may not be NULL even it is not (yet) managed by the fib6 tree (e.g. addrconf_dst_alloc()). Hence, rt6i_node is a better check instead of rt6i_table. Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reported-by: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu> Cc: Chris Siebenmann <cks-rhbugzilla@cs.toronto.edu> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05tcp: use correct req pointer in tcp_move_syn() callsEric Dumazet1-1/+1
I mistakenly took wrong request sock pointer when calling tcp_move_syn() @req_unhash is either a copy of @req, or a NULL value for FastOpen connexions (as we do not expect to unhash the temporary request sock from ehash table) Fixes: 805c4bc05705 ("tcp: fix req->saved_syn race") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-05tcp: fix req->saved_syn raceEric Dumazet1-8/+12
For the reasons explained in commit ce1050089c96 ("tcp/dccp: fix ireq->pktopts race"), we need to make sure we do not access req->saved_syn unless we own the request sock. This fixes races for listeners using TCP_SAVE_SYN option. Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04ipv6: clean up dev_snmp6 proc entry when we fail to initialize inet6_devSabrina Dubroca1-0/+1
In ipv6_add_dev, when addrconf_sysctl_register fails, we do not clean up the dev_snmp6 entry that we have already registered for this device. Call snmp6_unregister_dev in this case. Fixes: a317a2f19da7d ("ipv6: fail early when creating netdev named all or default") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-25/+16
Minor overlapping changes in net/ipv4/ipmr.c, in 'net' we were fixing the "BH-ness" of the counter bumps whilst in 'net-next' the functions were modified to take an explicit 'net' parameter. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03ipv6: fix tunnel error handlingMichal Kubeček1-1/+11
Both tunnel6_protocol and tunnel46_protocol share the same error handler, tunnel6_err(), which traverses through tunnel6_handlers list. For ipip6 tunnels, we need to traverse tunnel46_handlers as we do e.g. in tunnel46_rcv(). Current code can generate an ICMPv6 error message with an IPv4 packet embedded in it. Fixes: 73d605d1abbd ("[IPSEC]: changing API of xfrm6_tunnel_register") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>