aboutsummaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2018-03-25ipv6: the entire IPv6 header chain must fit the first fragmentPaolo Abeni1-4/+9
While building ipv6 datagram we currently allow arbitrary large extheaders, even beyond pmtu size. The syzbot has found a way to exploit the above to trigger the following splat: kernel BUG at ./include/linux/skbuff.h:2073! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 4230 Comm: syzkaller672661 Not tainted 4.16.0-rc2+ #326 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__skb_pull include/linux/skbuff.h:2073 [inline] RIP: 0010:__ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP: 0018:ffff8801bc18f0f0 EFLAGS: 00010293 RAX: ffff8801b17400c0 RBX: 0000000000000738 RCX: ffffffff84f01828 RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b415ac18 RBP: ffff8801bc18f360 R08: ffff8801b4576844 R09: 0000000000000000 R10: ffff8801bc18f380 R11: ffffed00367aee4e R12: 00000000000000d6 R13: ffff8801b415a740 R14: dffffc0000000000 R15: ffff8801b45767c0 FS: 0000000001535880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002000b000 CR3: 00000001b4123001 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6_finish_skb include/net/ipv6.h:969 [inline] udp_v6_push_pending_frames+0x269/0x3b0 net/ipv6/udp.c:1073 udpv6_sendmsg+0x2a96/0x3400 net/ipv6/udp.c:1343 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg+0xca/0x110 net/socket.c:640 ___sys_sendmsg+0x320/0x8b0 net/socket.c:2046 __sys_sendmmsg+0x1ee/0x620 net/socket.c:2136 SYSC_sendmmsg net/socket.c:2167 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2162 do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x4404c9 RSP: 002b:00007ffdce35f948 EFLAGS: 00000217 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004404c9 RDX: 0000000000000003 RSI: 0000000020001f00 RDI: 0000000000000003 RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8 R10: 0000000020000080 R11: 0000000000000217 R12: 0000000000401df0 R13: 0000000000401e80 R14: 0000000000000000 R15: 0000000000000000 Code: ff e8 1d 5e b9 fc e9 15 e9 ff ff e8 13 5e b9 fc e9 44 e8 ff ff e8 29 5e b9 fc e9 c0 e6 ff ff e8 3f f3 80 fc 0f 0b e8 38 f3 80 fc <0f> 0b 49 8d 87 80 00 00 00 4d 8d 87 84 00 00 00 48 89 85 20 fe RIP: __skb_pull include/linux/skbuff.h:2073 [inline] RSP: ffff8801bc18f0f0 RIP: __ip6_make_skb+0x1ac8/0x2190 net/ipv6/ip6_output.c:1636 RSP: ffff8801bc18f0f0 As stated by RFC 7112 section 5: When a host fragments an IPv6 datagram, it MUST include the entire IPv6 Header Chain in the First Fragment. So this patch addresses the issue dropping datagrams with excessive extheader length. It also updates the error path to report to the calling socket nonnegative pmtu values. The issue apparently predates git history. v1 -> v2: cleanup error path, as per Eric's suggestion Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+91e6f9932ff122fa4410@syzkaller.appspotmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-25net/ipv4: disable SMC TCP option with SYN CookiesHans Wippel1-0/+2
Currently, the SMC experimental TCP option in a SYN packet is lost on the server side when SYN Cookies are active. However, the corresponding SYNACK sent back to the client contains the SMC option. This causes an inconsistent view of the SMC capabilities on the client and server. This patch disables the SMC option in the SYNACK when SYN Cookies are active to avoid this issue. Fixes: 60e2a7780793b ("tcp: TCP experimental option for SMC") Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller1-2/+4
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for your net tree, they are: 1) Don't pick fixed hash implementation for NFT_SET_EVAL sets, otherwise userspace hits EOPNOTSUPP with valid rules using the meter statement, from Florian Westphal. 2) If you send a batch that flushes the existing ruleset (that contains a NAT chain) and the new ruleset definition comes with a new NAT chain, don't bogusly hit EBUSY. Also from Florian. 3) Missing netlink policy attribute validation, from Florian. 4) Detach conntrack template from skbuff if IP_NODEFRAG is set on, from Paolo Abeni. 5) Cache device names in flowtable object, otherwise we may end up walking over devices going aways given no rtnl_lock is held. 6) Fix incorrect net_device ingress with ingress hooks. 7) Fix crash when trying to read more data than available in UDP packets from the nf_socket infrastructure, from Subash. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-24netfilter: nf_socket: Fix out of bounds access in nf_sk_lookup_slow_v{4,6}Subash Abhinov Kasiviswanathan1-2/+4
skb_header_pointer will copy data into a buffer if data is non linear, otherwise it will return a pointer in the linear section of the data. nf_sk_lookup_slow_v{4,6} always copies data of size udphdr but later accesses memory within the size of tcphdr (th->doff) in case of TCP packets. This causes a crash when running with KASAN with the following call stack - BUG: KASAN: stack-out-of-bounds in xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178 Read of size 2 at addr ffffffe3d417a87c by task syz-executor/28971 CPU: 2 PID: 28971 Comm: syz-executor Tainted: G B W O 4.9.65+ #1 Call trace: [<ffffff9467e8d390>] dump_backtrace+0x0/0x428 arch/arm64/kernel/traps.c:76 [<ffffff9467e8d7e0>] show_stack+0x28/0x38 arch/arm64/kernel/traps.c:226 [<ffffff946842d9b8>] __dump_stack lib/dump_stack.c:15 [inline] [<ffffff946842d9b8>] dump_stack+0xd4/0x124 lib/dump_stack.c:51 [<ffffff946811d4b0>] print_address_description+0x68/0x258 mm/kasan/report.c:248 [<ffffff946811d8c8>] kasan_report_error mm/kasan/report.c:347 [inline] [<ffffff946811d8c8>] kasan_report.part.2+0x228/0x2f0 mm/kasan/report.c:371 [<ffffff946811df44>] kasan_report+0x5c/0x70 mm/kasan/report.c:372 [<ffffff946811bebc>] check_memory_region_inline mm/kasan/kasan.c:308 [inline] [<ffffff946811bebc>] __asan_load2+0x84/0x98 mm/kasan/kasan.c:739 [<ffffff94694d6f04>] __tcp_hdrlen include/linux/tcp.h:35 [inline] [<ffffff94694d6f04>] xt_socket_lookup_slow_v4+0x524/0x718 net/netfilter/xt_socket.c:178 Fix this by copying data into appropriate size headers based on protocol. Fixes: a583636a83ea ("inet: refactor inet[6]_lookup functions to take skb") Signed-off-by: Tejaswi Tanikella <tejaswit@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-23ipv6: fix possible deadlock in rt6_age_examine_exception()Eric Dumazet1-6/+7
syzbot reported a LOCKDEP splat [1] in rt6_age_examine_exception() rt6_age_examine_exception() is called while rt6_exception_lock is held. This lock is the lower one in the lock hierarchy, thus we can not call dst_neigh_lookup() function, as it can fallback to neigh_create() We should instead do a pure RCU lookup. As a bonus we avoid a pair of atomic operations on neigh refcount. [1] WARNING: possible circular locking dependency detected 4.16.0-rc4+ #277 Not tainted syz-executor7/4015 is trying to acquire lock: (&ndev->lock){++--}, at: [<00000000416dce19>] __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 but task is already holding lock: (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&tbl->lock){++-.}: __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __neigh_create+0x87e/0x1d90 net/core/neighbour.c:528 neigh_create include/net/neighbour.h:315 [inline] ip6_neigh_lookup+0x9a7/0xba0 net/ipv6/route.c:228 dst_neigh_lookup include/net/dst.h:405 [inline] rt6_age_examine_exception net/ipv6/route.c:1609 [inline] rt6_age_exceptions+0x381/0x660 net/ipv6/route.c:1645 fib6_age+0xfb/0x140 net/ipv6/ip6_fib.c:2033 fib6_clean_node+0x389/0x580 net/ipv6/ip6_fib.c:1919 fib6_walk_continue+0x46c/0x8a0 net/ipv6/ip6_fib.c:1845 fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1893 fib6_clean_tree+0x1e6/0x340 net/ipv6/ip6_fib.c:1970 __fib6_clean_all+0x1f4/0x3a0 net/ipv6/ip6_fib.c:1986 fib6_clean_all net/ipv6/ip6_fib.c:1997 [inline] fib6_run_gc+0x16b/0x3c0 net/ipv6/ip6_fib.c:2053 ndisc_netdev_event+0x3c2/0x4a0 net/ipv6/ndisc.c:1781 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #2 (rt6_exception_lock){+.-.}: __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:315 [inline] rt6_flush_exceptions+0x21/0x210 net/ipv6/route.c:1367 fib6_del_route net/ipv6/ip6_fib.c:1677 [inline] fib6_del+0x624/0x12c0 net/ipv6/ip6_fib.c:1761 __ip6_del_rt+0xc7/0x120 net/ipv6/route.c:2980 ip6_del_rt+0x132/0x1a0 net/ipv6/route.c:2993 __ipv6_dev_ac_dec+0x3b1/0x600 net/ipv6/anycast.c:332 ipv6_dev_ac_dec net/ipv6/anycast.c:345 [inline] ipv6_sock_ac_close+0x2b4/0x3e0 net/ipv6/anycast.c:200 inet6_release+0x48/0x70 net/ipv6/af_inet6.c:433 sock_release+0x8d/0x1e0 net/socket.c:594 sock_close+0x16/0x20 net/socket.c:1149 __fput+0x327/0x7e0 fs/file_table.c:209 ____fput+0x15/0x20 fs/file_table.c:243 task_work_run+0x199/0x270 kernel/task_work.c:113 exit_task_work include/linux/task_work.h:22 [inline] do_exit+0x9bb/0x1ad0 kernel/exit.c:865 do_group_exit+0x149/0x400 kernel/exit.c:968 get_signal+0x73a/0x16d0 kernel/signal.c:2469 do_signal+0x90/0x1e90 arch/x86/kernel/signal.c:809 exit_to_usermode_loop+0x258/0x2f0 arch/x86/entry/common.c:162 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline] syscall_return_slowpath arch/x86/entry/common.c:265 [inline] do_syscall_64+0x6ec/0x940 arch/x86/entry/common.c:292 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #1 (&(&tb->tb6_lock)->rlock){+.-.}: __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168 spin_lock_bh include/linux/spinlock.h:315 [inline] __ip6_ins_rt+0x56/0x90 net/ipv6/route.c:1007 ip6_route_add+0x141/0x190 net/ipv6/route.c:2955 addrconf_prefix_route+0x44f/0x620 net/ipv6/addrconf.c:2359 fixup_permanent_addr net/ipv6/addrconf.c:3368 [inline] addrconf_permanent_addr net/ipv6/addrconf.c:3391 [inline] addrconf_notify+0x1ad2/0x2310 net/ipv6/addrconf.c:3460 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x15d/0x430 net/core/dev.c:6958 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 do_setlink+0xa22/0x3bb0 net/core/rtnetlink.c:2357 rtnl_newlink+0xf37/0x1a50 net/core/rtnetlink.c:2965 rtnetlink_rcv_msg+0x57f/0xb10 net/core/rtnetlink.c:4641 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4659 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline] netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg+0xca/0x110 net/socket.c:639 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047 __sys_sendmsg+0xe5/0x210 net/socket.c:2081 SYSC_sendmsg net/socket.c:2092 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2088 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #0 (&ndev->lock){++--}: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961 pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392 pneigh_ifdown net/core/neighbour.c:695 [inline] neigh_ifdown+0x149/0x250 net/core/neighbour.c:294 rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874 addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633 addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 other info that might help us debug this: Chain exists of: &ndev->lock --> rt6_exception_lock --> &tbl->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&tbl->lock); lock(rt6_exception_lock); lock(&tbl->lock); lock(&ndev->lock); *** DEADLOCK *** 2 locks held by syz-executor7/4015: #0: (rtnl_mutex){+.+.}, at: [<00000000a2f16daa>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (&tbl->lock){++-.}, at: [<00000000b5cb1d65>] neigh_ifdown+0x3d/0x250 net/core/neighbour.c:292 stack backtrace: CPU: 0 PID: 4015 Comm: syz-executor7 Not tainted 4.16.0-rc4+ #277 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_circular_bug.isra.38+0x2cd/0x2dc kernel/locking/lockdep.c:1223 check_prev_add kernel/locking/lockdep.c:1863 [inline] check_prevs_add kernel/locking/lockdep.c:1976 [inline] validate_chain kernel/locking/lockdep.c:2417 [inline] __lock_acquire+0x30a8/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __raw_write_lock_bh include/linux/rwlock_api_smp.h:203 [inline] _raw_write_lock_bh+0x31/0x40 kernel/locking/spinlock.c:312 __ipv6_dev_mc_dec+0x45/0x350 net/ipv6/mcast.c:928 ipv6_dev_mc_dec+0x110/0x1f0 net/ipv6/mcast.c:961 pndisc_destructor+0x21a/0x340 net/ipv6/ndisc.c:392 pneigh_ifdown net/core/neighbour.c:695 [inline] neigh_ifdown+0x149/0x250 net/core/neighbour.c:294 rt6_disable_ip+0x537/0x700 net/ipv6/route.c:3874 addrconf_ifdown+0x14b/0x14f0 net/ipv6/addrconf.c:3633 addrconf_notify+0x5f8/0x2310 net/ipv6/addrconf.c:3557 notifier_call_chain+0x136/0x2c0 kernel/notifier.c:93 __raw_notifier_call_chain kernel/notifier.c:394 [inline] raw_notifier_call_chain+0x2d/0x40 kernel/notifier.c:401 call_netdevice_notifiers_info+0x32/0x70 net/core/dev.c:1707 call_netdevice_notifiers net/core/dev.c:1725 [inline] __dev_notify_flags+0x262/0x430 net/core/dev.c:6960 dev_change_flags+0xf5/0x140 net/core/dev.c:6994 devinet_ioctl+0x126a/0x1ac0 net/ipv4/devinet.c:1080 inet_ioctl+0x184/0x310 net/ipv4/af_inet.c:919 packet_ioctl+0x1ff/0x310 net/packet/af_packet.c:4066 sock_do_ioctl+0xef/0x390 net/socket.c:957 sock_ioctl+0x36b/0x610 net/socket.c:1081 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: c757faa8bfa2 ("ipv6: prepare fib6_age() for exception table") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller7-44/+79
Fun set of conflict resolutions here... For the mac80211 stuff, these were fortunately just parallel adds. Trivially resolved. In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the function phy_disable_interrupts() earlier in the file, whilst in 'net-next' the phy_error() call from this function was removed. In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the 'rt_table_id' member of rtable collided with a bug fix in 'net' that added a new struct member "rt_mtu_locked" which needs to be copied over here. The mlxsw driver conflict consisted of net-next separating the span code and definitions into separate files, whilst a 'net' bug fix made some changes to that moved code. The mlx5 infiniband conflict resolution was quite non-trivial, the RDMA tree's merge commit was used as a guide here, and here are their notes: ==================== Due to bug fixes found by the syzkaller bot and taken into the for-rc branch after development for the 4.17 merge window had already started being taken into the for-next branch, there were fairly non-trivial merge issues that would need to be resolved between the for-rc branch and the for-next branch. This merge resolves those conflicts and provides a unified base upon which ongoing development for 4.17 can be based. Conflicts: drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524 (IB/mlx5: Fix cleanup order on unload) added to for-rc and commit b5ca15ad7e61 (IB/mlx5: Add proper representors support) add as part of the devel cycle both needed to modify the init/de-init functions used by mlx5. To support the new representors, the new functions added by the cleanup patch needed to be made non-static, and the init/de-init list added by the representors patch needed to be modified to match the init/de-init list changes made by the cleanup patch. Updates: drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function prototypes added by representors patch to reflect new function names as changed by cleanup patch drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init stage list to match new order from cleanup patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22gre: fix TUNNEL_SEQ bit check on sequence numberingColin Ian King1-1/+1
The current logic of flags | TUNNEL_SEQ is always non-zero and hence sequence numbers are always incremented no matter the setting of the TUNNEL_SEQ bit. Fix this by using & instead of |. Detected by CoverityScan, CID#1466039 ("Operands don't affect result") Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net/ipv6: Handle onlink flag with multipath routesDavid Ahern1-0/+1
For multipath routes the ONLINK flag can be specified per nexthop in rtnh_flags or globally in rtm_flags. Update ip6_route_multipath_add to consider the ONLINK setting coming from rtnh_flags. Each loop over nexthops the config for the sibling route is initialized to the global config and then per nexthop settings overlayed. The flag is 'or'ed into fib6_config to handle the ONLINK flag coming from either rtm_flags or rtnh_flags. Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22ipv6: sr: fix NULL pointer dereference when setting encap source addressDavid Lebrun1-2/+3
When using seg6 in encap mode, we call ipv6_dev_get_saddr() to set the source address of the outer IPv6 header, in case none was specified. Using skb->dev can lead to BUG() when it is in an inconsistent state. This patch uses the net_device attached to the skb's dst instead. [940807.667429] BUG: unable to handle kernel NULL pointer dereference at 000000000000047c [940807.762427] IP: ipv6_dev_get_saddr+0x8b/0x1d0 [940807.815725] PGD 0 P4D 0 [940807.847173] Oops: 0000 [#1] SMP PTI [940807.890073] Modules linked in: [940807.927765] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G W 4.16.0-rc1-seg6bpf+ #2 [940808.028988] Hardware name: HP ProLiant DL120 G6/ProLiant DL120 G6, BIOS O26 09/06/2010 [940808.128128] RIP: 0010:ipv6_dev_get_saddr+0x8b/0x1d0 [940808.187667] RSP: 0018:ffff88043fd836b0 EFLAGS: 00010206 [940808.251366] RAX: 0000000000000005 RBX: ffff88042cb1c860 RCX: 00000000000000fe [940808.338025] RDX: 00000000000002c0 RSI: ffff88042cb1c860 RDI: 0000000000004500 [940808.424683] RBP: ffff88043fd83740 R08: 0000000000000000 R09: ffffffffffffffff [940808.511342] R10: 0000000000000040 R11: 0000000000000000 R12: ffff88042cb1c850 [940808.598012] R13: ffffffff8208e380 R14: ffff88042ac8da00 R15: 0000000000000002 [940808.684675] FS: 0000000000000000(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000 [940808.783036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [940808.852975] CR2: 000000000000047c CR3: 00000004255fe000 CR4: 00000000000006e0 [940808.939634] Call Trace: [940808.970041] <IRQ> [940808.995250] ? ip6t_do_table+0x265/0x640 [940809.043341] seg6_do_srh_encap+0x28f/0x300 [940809.093516] ? seg6_do_srh+0x1a0/0x210 [940809.139528] seg6_do_srh+0x1a0/0x210 [940809.183462] seg6_output+0x28/0x1e0 [940809.226358] lwtunnel_output+0x3f/0x70 [940809.272370] ip6_xmit+0x2b8/0x530 [940809.313185] ? ac6_proc_exit+0x20/0x20 [940809.359197] inet6_csk_xmit+0x7d/0xc0 [940809.404173] tcp_transmit_skb+0x548/0x9a0 [940809.453304] __tcp_retransmit_skb+0x1a8/0x7a0 [940809.506603] ? ip6_default_advmss+0x40/0x40 [940809.557824] ? tcp_current_mss+0x24/0x90 [940809.605925] tcp_retransmit_skb+0xd/0x80 [940809.654016] tcp_xmit_retransmit_queue.part.17+0xf9/0x210 [940809.719797] tcp_ack+0xa47/0x1110 [940809.760612] tcp_rcv_established+0x13c/0x570 [940809.812865] tcp_v6_do_rcv+0x151/0x3d0 [940809.858879] tcp_v6_rcv+0xa5c/0xb10 [940809.901770] ? seg6_output+0xdd/0x1e0 [940809.946745] ip6_input_finish+0xbb/0x460 [940809.994837] ip6_input+0x74/0x80 [940810.034612] ? ip6_rcv_finish+0xb0/0xb0 [940810.081663] ipv6_rcv+0x31c/0x4c0 ... Fixes: 6c8702c60b886 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Reported-by: Tom Herbert <tom@quantonium.net> Signed-off-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel stateDavid Lebrun1-1/+1
The seg6_build_state() function is called with RCU read lock held, so we cannot use GFP_KERNEL. This patch uses GFP_ATOMIC instead. [ 92.770271] ============================= [ 92.770628] WARNING: suspicious RCU usage [ 92.770921] 4.16.0-rc4+ #12 Not tainted [ 92.771277] ----------------------------- [ 92.771585] ./include/linux/rcupdate.h:302 Illegal context switch in RCU read-side critical section! [ 92.772279] [ 92.772279] other info that might help us debug this: [ 92.772279] [ 92.773067] [ 92.773067] rcu_scheduler_active = 2, debug_locks = 1 [ 92.773514] 2 locks held by ip/2413: [ 92.773765] #0: (rtnl_mutex){+.+.}, at: [<00000000e5461720>] rtnetlink_rcv_msg+0x441/0x4d0 [ 92.774377] #1: (rcu_read_lock){....}, at: [<00000000df4f161e>] lwtunnel_build_state+0x59/0x210 [ 92.775065] [ 92.775065] stack backtrace: [ 92.775371] CPU: 0 PID: 2413 Comm: ip Not tainted 4.16.0-rc4+ #12 [ 92.775791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014 [ 92.776608] Call Trace: [ 92.776852] dump_stack+0x7d/0xbc [ 92.777130] __schedule+0x133/0xf00 [ 92.777393] ? unwind_get_return_address_ptr+0x50/0x50 [ 92.777783] ? __sched_text_start+0x8/0x8 [ 92.778073] ? rcu_is_watching+0x19/0x30 [ 92.778383] ? kernel_text_address+0x49/0x60 [ 92.778800] ? __kernel_text_address+0x9/0x30 [ 92.779241] ? unwind_get_return_address+0x29/0x40 [ 92.779727] ? pcpu_alloc+0x102/0x8f0 [ 92.780101] _cond_resched+0x23/0x50 [ 92.780459] __mutex_lock+0xbd/0xad0 [ 92.780818] ? pcpu_alloc+0x102/0x8f0 [ 92.781194] ? seg6_build_state+0x11d/0x240 [ 92.781611] ? save_stack+0x9b/0xb0 [ 92.781965] ? __ww_mutex_wakeup_for_backoff+0xf0/0xf0 [ 92.782480] ? seg6_build_state+0x11d/0x240 [ 92.782925] ? lwtunnel_build_state+0x1bd/0x210 [ 92.783393] ? ip6_route_info_create+0x687/0x1640 [ 92.783846] ? ip6_route_add+0x74/0x110 [ 92.784236] ? inet6_rtm_newroute+0x8a/0xd0 Fixes: 6c8702c60b886 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Signed-off-by: David Lebrun <dlebrun@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Convert nf_ct_net_opsKirill Tkhai1-0/+1
These pernet_operations register and unregister sysctl. Also, there is inet_frags_exit_net() called in exit method, which has to be safe after a560002437d3 "net: Fix hlist corruptions in inet_evict_bucket()". Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-20ipv6: old_dport should be a __be16 in __ip6_datagram_connect()Stefano Brivio1-1/+1
Fixes: 2f987a76a977 ("net: ipv6: keep sk status consistent after datagram connect failure") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-20netfilter: ctnetlink: synproxy supportPablo Neira Ayuso1-1/+7
This patch exposes synproxy information per-conntrack. Moreover, send sequence adjustment events once server sends us the SYN,ACK packet, so we can synchronize the sequence adjustment too for packets going as reply from the server, as part of the synproxy logic. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-19vti6: Fix dev->max_mtu settingStefano Brivio1-1/+1
We shouldn't allow a tunnel to have IP_MAX_MTU as MTU, because another IPv6 header is going on top of our packets. Without this patch, we might end up building packets bigger than IP_MAX_MTU. Fixes: b96f9afee4eb ("ipv4/6: use core net MTU range checking") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19vti6: Keep set MTU on link creation or change, validate itStefano Brivio1-8/+16
In vti6_link_config(), if MTU is already given on link creation or change, validate and use it instead of recomputing it. To do that, we need to propagate the knowledge that MTU was set by userspace all the way down to vti6_link_config(). To keep this simple, vti6_dev_init() sets the new 'keep_mtu' argument of vti6_link_config() to true: on initialization, we don't have convenient access to netlink attributes there, but we will anyway check whether dev->mtu is set in vti6_link_config(). If it's non-zero, it was set to the value of the IFLA_MTU attribute during creation. Otherwise, determine a reasonable value. Fixes: ed1efb2aefbb ("ipv6: Add support for IPsec virtual tunnel interfaces") Fixes: 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-19vti6: Properly adjust vti6 MTU from MTU of lower deviceStefano Brivio1-4/+6
If a lower device is found, we don't need to subtract LL_MAX_HEADER to calculate our MTU: just use its MTU, the link layer headers are already taken into account by it. If the lower device is not found, start from ETH_DATA_LEN instead, and only in this case subtract a worst-case LL_MAX_HEADER. We then need to subtract our additional IPv6 header from the calculation. While at it, note that vti6 doesn't have a hardware header, so it doesn't need to set dev->hard_header_len. And as vti6_link_config() now always sets the MTU, there's no need to set a default value in vti6_dev_setup(). This makes the behaviour consistent with IPv4 vti, after commit a32452366b72 ("vti4: Don't count header length twice."), which was accidentally reverted by merge commit f895f0cfbb77 ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec"). While commit 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device") improved on the original situation, this was still not ideal. As reported in that commit message itself, if we start from an underlying veth MTU of 9000, we end up with an MTU of 8832, that is, 9000 - LL_MAX_HEADER - sizeof(ipv6hdr). This should simply be 8880, or 9000 - sizeof(ipv6hdr) instead: we found the lower device (veth) and we know we don't have any additional link layer header, so there's no need to subtract an hypothetical worst-case number. Fixes: 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-16udp: Move the udp sysctl to namespace.Tonghao Zhang1-26/+26
This patch moves the udp_rmem_min, udp_wmem_min to namespace and init the udp_l3mdev_accept explicitly. The udp_rmem_min/udp_wmem_min affect udp rx/tx queue, with this patch namespaces can set them differently. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16net/ipv6: Add l3mdev check to ipv6_chk_addr_and_flagsDavid Ahern1-0/+15
Lookup the L3 master device for the passed in device. Only consider addresses on netdev's with the same master device. If the device is not enslaved or is NULL, then the l3mdev is NULL which means only devices not enslaved (ie, in the default domain) are considered. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16net/ipv6: Change address check to always take a device argumentDavid Ahern6-17/+41
ipv6_chk_addr_and_flags determines if an address is a local address and optionally if it is an address on a specific device. For example, it is called by ip6_route_info_create to determine if a given gateway address is a local address. The address check currently does not consider L3 domains and as a result does not allow a route to be added in one VRF if the nexthop points to an address in a second VRF. e.g., $ ip route add 2001:db8:1::/64 vrf r2 via 2001:db8:102::23 Error: Invalid gateway address. where 2001:db8:102::23 is an address on an interface in vrf r1. ipv6_chk_addr_and_flags needs to allow callers to always pass in a device with a separate argument to not limit the address to the specific device. The device is used used to determine the L3 domain of interest. To that end add an argument to skip the device check and update callers to always pass a device where possible and use the new argument to mean any address in the domain. Update a handful of users of ipv6_chk_addr with a NULL dev argument. This patch handles the change to these callers without adding the domain check. ip6_validate_gw needs to handle 2 cases - one where the device is given as part of the nexthop spec and the other where the device is resolved. There is at least 1 VRF case where deferring the check to only after the route lookup has resolved the device fails with an unintuitive error "RTNETLINK answers: No route to host" as opposed to the preferred "Error: Gateway can not be a local address." The 'no route to host' error is because of the fallback to a full lookup. The check is done twice to avoid this error. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-16net/ipv6: Refactor gateway validation on route addDavid Ahern1-54/+66
Move gateway validation code from ip6_route_info_create into ip6_validate_gw. Code move plus adjustments to handle the potential reset of dev and idev and to make checkpatch happy. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-13Merge branch 'master' of ↵David S. Miller3-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2018-03-13 1) Refuse to insert 32 bit userspace socket policies on 64 bit systems like we do it for standard policies. We don't have a compat layer, so inserting socket policies from 32 bit userspace will lead to a broken configuration. 2) Make the policy hold queue work without the flowcache. Dummy bundles are not chached anymore, so we need to generate a new one on each lookup as long as the SAs are not yet in place. 3) Fix the validation of the esn replay attribute. The The sanity check in verify_replay() is bypassed if the XFRM_STATE_ESN flag is not set. Fix this by doing the sanity check uncoditionally. From Florian Westphal. 4) After most of the dst_entry garbage collection code is removed, we may leak xfrm_dst entries as they are neither cached nor tracked somewhere. Fix this by reusing the 'uncached_list' to track xfrm_dst entries too. From Xin Long. 5) Fix a rcu_read_lock/rcu_read_unlock imbalance in xfrm_get_tos() From Xin Long. 6) Fix an infinite loop in xfrm_get_dst_nexthop. On transport mode we fetch the child dst_entry after we continue, so this pointer is never updated. Fix this by fetching it before we continue. 7) Fix ESN sequence number gap after IPsec GSO packets. We accidentally increment the sequence number counter on the xfrm_state by one packet too much in the ESN case. Fix this by setting the sequence number to the correct value. 8) Reset the ethernet protocol after decapsulation only if a mac header was set. Otherwise it breaks configurations with TUN devices. From Yossi Kuperman. 9) Fix __this_cpu_read() usage in preemptible code. Use this_cpu_read() instead in ipcomp_alloc_tfms(). From Greg Hackmann. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12net: ipv6: keep sk status consistent after datagram connect failurePaolo Abeni1-7/+14
On unsuccesful ip6_datagram_connect(), if the failure is caused by ip6_datagram_dst_update(), the sk peer information are cleared, but the sk->sk_state is preserved. If the socket was already in an established status, the overall sk status is inconsistent and fouls later checks in datagram code. Fix this saving the old peer information and restoring them in case of failure. This also aligns ipv6 datagram connect() behavior with ipv4. v1 -> v2: - added missing Fixes tag Fixes: 85cb73ff9b74 ("net: ipv6: reset daddr and dport in sk if connect() fails") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-12ipv6: Use ip6_multipath_hash_policy() in rt6_multipath_hash().David S. Miller1-1/+1
Make use of the new helper. Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09ip6erspan: make sure enough headroom at xmit.William Tu1-0/+3
The patch adds skb_cow_header() to ensure enough headroom at ip6erspan_tunnel_xmit before pushing the erspan header to the skb. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09ip6erspan: improve error handling for erspan version number.William Tu1-0/+2
When users fill in incorrect erspan version number through the struct erspan_metadata uapi, current code skips pushing the erspan header but continue pushing the gre header, which is incorrect. The patch fixes it by returning error. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09ip6gre: add erspan v2 to tunnel lookupWilliam Tu1-1/+2
The patch adds the erspan v2 proto in ip6gre_tunnel_lookup so the erspan v2 tunnel can be found correctly. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09net: do not create fallback tunnels for non-default namespacesEric Dumazet3-2/+9
fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0, ip6tnl0, ip6gre0) are automatically created when the corresponding module is loaded. These tunnels are also automatically created when a new network namespace is created, at a great cost. In many cases, netns are used for isolation purposes, and these extra network devices are a waste of resources. We are using thousands of netns per host, and hit the netns creation/delete bottleneck a lot. (Many thanks to Kirill for recent work on this) Add a new sysctl so that we can opt-out from this automatic creation. Note that these tunnels are still created for the initial namespace, to be the least intrusive for typical setups. Tested: lpk43:~# cat add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m37.521s user 0m0.886s sys 7m7.084s lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net lpk43:~# time ./add_del_unshare.sh real 0m4.761s user 0m0.851s sys 1m8.343s lpk43:~# Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-09ipv6: fix access to non-linear packet in ndisc_fill_redirect_hdr_option()Lorenzo Bianconi1-1/+2
Fix the following slab-out-of-bounds kasan report in ndisc_fill_redirect_hdr_option when the incoming ipv6 packet is not linear and the accessed data are not in the linear data region of orig_skb. [ 1503.122508] ================================================================== [ 1503.122832] BUG: KASAN: slab-out-of-bounds in ndisc_send_redirect+0x94e/0x990 [ 1503.123036] Read of size 1184 at addr ffff8800298ab6b0 by task netperf/1932 [ 1503.123220] CPU: 0 PID: 1932 Comm: netperf Not tainted 4.16.0-rc2+ #124 [ 1503.123347] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 1503.123527] Call Trace: [ 1503.123579] <IRQ> [ 1503.123638] print_address_description+0x6e/0x280 [ 1503.123849] kasan_report+0x233/0x350 [ 1503.123946] memcpy+0x1f/0x50 [ 1503.124037] ndisc_send_redirect+0x94e/0x990 [ 1503.125150] ip6_forward+0x1242/0x13b0 [...] [ 1503.153890] Allocated by task 1932: [ 1503.153982] kasan_kmalloc+0x9f/0xd0 [ 1503.154074] __kmalloc_track_caller+0xb5/0x160 [ 1503.154198] __kmalloc_reserve.isra.41+0x24/0x70 [ 1503.154324] __alloc_skb+0x130/0x3e0 [ 1503.154415] sctp_packet_transmit+0x21a/0x1810 [ 1503.154533] sctp_outq_flush+0xc14/0x1db0 [ 1503.154624] sctp_do_sm+0x34e/0x2740 [ 1503.154715] sctp_primitive_SEND+0x57/0x70 [ 1503.154807] sctp_sendmsg+0xaa6/0x1b10 [ 1503.154897] sock_sendmsg+0x68/0x80 [ 1503.154987] ___sys_sendmsg+0x431/0x4b0 [ 1503.155078] __sys_sendmsg+0xa4/0x130 [ 1503.155168] do_syscall_64+0x171/0x3f0 [ 1503.155259] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 1503.155436] Freed by task 1932: [ 1503.155527] __kasan_slab_free+0x134/0x180 [ 1503.155618] kfree+0xbc/0x180 [ 1503.155709] skb_release_data+0x27f/0x2c0 [ 1503.155800] consume_skb+0x94/0xe0 [ 1503.155889] sctp_chunk_put+0x1aa/0x1f0 [ 1503.155979] sctp_inq_pop+0x2f8/0x6e0 [ 1503.156070] sctp_assoc_bh_rcv+0x6a/0x230 [ 1503.156164] sctp_inq_push+0x117/0x150 [ 1503.156255] sctp_backlog_rcv+0xdf/0x4a0 [ 1503.156346] __release_sock+0x142/0x250 [ 1503.156436] release_sock+0x80/0x180 [ 1503.156526] sctp_sendmsg+0xbb0/0x1b10 [ 1503.156617] sock_sendmsg+0x68/0x80 [ 1503.156708] ___sys_sendmsg+0x431/0x4b0 [ 1503.156799] __sys_sendmsg+0xa4/0x130 [ 1503.156889] do_syscall_64+0x171/0x3f0 [ 1503.156980] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 1503.157158] The buggy address belongs to the object at ffff8800298ab600 which belongs to the cache kmalloc-1024 of size 1024 [ 1503.157444] The buggy address is located 176 bytes inside of 1024-byte region [ffff8800298ab600, ffff8800298aba00) [ 1503.157702] The buggy address belongs to the page: [ 1503.157820] page:ffffea0000a62a00 count:1 mapcount:0 mapping:0000000000000000 index:0x0 compound_mapcount: 0 [ 1503.158053] flags: 0x4000000000008100(slab|head) [ 1503.158171] raw: 4000000000008100 0000000000000000 0000000000000000 00000001800e000e [ 1503.158350] raw: dead000000000100 dead000000000200 ffff880036002600 0000000000000000 [ 1503.158523] page dumped because: kasan: bad access detected [ 1503.158698] Memory state around the buggy address: [ 1503.158816] ffff8800298ab900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 1503.158988] ffff8800298ab980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 1503.159165] >ffff8800298aba00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 1503.159338] ^ [ 1503.159436] ffff8800298aba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1503.159610] ffff8800298abb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 1503.159785] ================================================================== [ 1503.159964] Disabling lock debugging due to kernel taint The test scenario to trigger the issue consists of 4 devices: - H0: data sender, connected to LAN0 - H1: data receiver, connected to LAN1 - GW0 and GW1: routers between LAN0 and LAN1. Both of them have an ethernet connection on LAN0 and LAN1 On H{0,1} set GW0 as default gateway while on GW0 set GW1 as next hop for data from LAN0 to LAN1. Moreover create an ip6ip6 tunnel between H0 and H1 and send 3 concurrent data streams (TCP/UDP/SCTP) from H0 to H1 through ip6ip6 tunnel (send buffer size is set to 16K). While data streams are active flush the route cache on HA multiple times. I have not been able to identify a given commit that introduced the issue since, using the reproducer described above, the kasan report has been triggered from 4.14 and I have not gone back further. Reported-by: Jianlin Shi <jishi@redhat.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-08net: Convet ipv6_net_opsKirill Tkhai1-0/+1
These pernet_operations are similar to ipv4_net_ops. They are safe to be async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-08net: Convert ip6 tables pernet_operationsKirill Tkhai5-0/+5
The pernet_operations: ip6table_filter_net_ops ip6table_mangle_net_ops ip6table_nat_net_ops ip6table_raw_net_ops ip6table_security_net_ops have exit methods, which call ip6t_unregister_table(). ip6table_filter_net_ops has init method registering filter table. Since there must not be in-flight ipv6 packets at the time of pernet_operations execution and since pernet_operations don't send ipv6 packets each other, these pernet_operations are safe to be async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07ip6mr: remove synchronize_rcu() in favor of SOCK_RCU_FREEEric Dumazet1-1/+5
Kirill found that recently added synchronize_rcu() call in ip6mr_sk_done() was slowing down netns dismantle and posted a patch to use it only if the socket was found. I instead suggested to get rid of this call, and use instead SOCK_RCU_FREE We might later change IPv4 side to use the same technique and unify both stacks. IPv4 does not use synchronize_rcu() but has a call_rcu() that could be replaced by SOCK_RCU_FREE. Tested: time for i in {1..1000}; do unshare -n /bin/false;done Before : real 7m18.911s After : real 10.187s Fixes: 8571ab479a6e ("ip6mr: Make mroute_sk rcu-based") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Yuval Mintz <yuvalm@mellanox.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routesStefano Brivio1-29/+42
Currently, administrative MTU changes on a given netdevice are not reflected on route exceptions for MTU-less routes, with a set PMTU value, for that device: # ip -6 route get 2001:db8::b 2001:db8::b from :: dev vti_a proto kernel src 2001:db8::a metric 256 pref medium # ping6 -c 1 -q -s10000 2001:db8::b > /dev/null # ip netns exec a ip -6 route get 2001:db8::b 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0 cache expires 571sec mtu 4926 pref medium # ip link set dev vti_a mtu 3000 # ip -6 route get 2001:db8::b 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0 cache expires 571sec mtu 4926 pref medium # ip link set dev vti_a mtu 9000 # ip -6 route get 2001:db8::b 2001:db8::b from :: dev vti_a src 2001:db8::a metric 0 cache expires 571sec mtu 4926 pref medium The first issue is that since commit fb56be83e43d ("net-ipv6: on device mtu change do not add mtu to mtu-less routes") we don't call rt6_exceptions_update_pmtu() from rt6_mtu_change_route(), which handles administrative MTU changes, if the regular route is MTU-less. However, PMTU exceptions should be always updated, as long as RTAX_MTU is not locked. Keep the check for MTU-less main route, as introduced by that commit, but, for exceptions, call rt6_exceptions_update_pmtu() regardless of that check. Once that is fixed, one problem remains: MTU changes are not reflected if the new MTU is higher than the previous one, because rt6_exceptions_update_pmtu() doesn't allow that. We should instead allow PMTU increase if the old PMTU matches the local MTU, as that implies that the old MTU was the lowest in the path, and PMTU discovery might lead to different results. The existing check in rt6_mtu_change_route() correctly took that case into account (for regular routes only), so factor it out and re-use it also in rt6_exceptions_update_pmtu(). While at it, fix comments style and grammar, and try to be a bit more descriptive. Reported-by: Xiumei Mu <xmu@redhat.com> Fixes: fb56be83e43d ("net-ipv6: on device mtu change do not add mtu to mtu-less routes") Fixes: f5bbe7ee79c2 ("ipv6: prepare rt6_mtu_change() for exception table") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07ipv6: ndisc: use true and false for boolean valuesGustavo A. R. Silva1-1/+1
Assign true or false to boolean variables instead of an integer value. This issue was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-07xfrm: Verify MAC header exists before overwriting eth_hdr(skb)->h_protoYossi Kuperman1-1/+2
Artem Savkov reported that commit 5efec5c655dd leads to a packet loss under IPSec configuration. It appears that his setup consists of a TUN device, which does not have a MAC header. Make sure MAC header exists. Note: TUN device sets a MAC header pointer, although it does not have one. Fixes: 5efec5c655dd ("xfrm: Fix eth_hdr(skb)->h_proto to reflect inner IP version") Reported-by: Artem Savkov <artem.savkov@gmail.com> Tested-by: Artem Savkov <artem.savkov@gmail.com> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-03-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller9-25/+29
All of the conflicts were cases of overlapping changes. In net/core/devlink.c, we have to make care that the resouce size_params have become a struct member rather than a pointer to such an object. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05netfilter: x_tables: ensure last rule in base chain matches underflow/policyFlorian Westphal1-1/+16
Harmless from kernel point of view, but again iptables assumes that this is true when decoding ruleset coming from kernel. If a (syzkaller generated) ruleset doesn't have the underflow/policy stored as the last rule in the base chain, then iptables will abort() because it doesn't find the chain policy. libiptc assumes that the policy is the last rule in the basechain, which is only true for iptables-generated rulesets. Unfortunately this needs code duplication -- the functions need the struct layout of the rule head, but that is different for ip/ip6/arptables. NB: pr_warn could be pr_debug but in case this break rulesets somehow its useful to know why blob was rejected. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05netfilter: compat: prepare xt_compat_init_offsets to return errorsFlorian Westphal1-3/+7
should have no impact, function still always returns 0. This patch is only to ease review. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05netfilter: x_tables: add counters allocation wrapperFlorian Westphal1-1/+1
allows to have size checks in a single spot. This is supposed to reduce oom situations when fuzz-testing xtables. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05netfilter: x_tables: move hook entry checks into coreFlorian Westphal1-10/+3
Allow followup patch to change on location instead of three. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05netfilter: x_tables: check standard verdicts in coreFlorian Westphal1-5/+0
Userspace must provide a valid verdict to the standard target. The verdict can be either a jump (signed int > 0), or a return code. Allowed return codes are either RETURN (pop from stack), NF_ACCEPT, DROP and QUEUE (latter is allowed for legacy reasons). Jump offsets (verdict > 0) are checked in more detail later on when loop-detection is performed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05netfilter: unlock xt_table earlier in __do_replaceXin Long1-1/+2
Now it's doing cleanup_entry for oldinfo under the xt_table lock, but it's not really necessary. After the replacement job is done in xt_replace_table, oldinfo is not used elsewhere any more, and it can be freed without xt_table lock safely. The important thing is that rtnl_lock is called in some xt_target destroy, which means rtnl_lock, a big lock is used in xt_table lock, a smaller one. It usually could be the reason why a dead lock may happen. Besides, all xt_target/match checkentry is called out of xt_table lock. It's better also to move all cleanup_entry calling out of xt_table lock, just as do_replace_finish does for ebtables. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-03-05net: Convert arp_tables_net_ops and ip6_tables_net_opsKirill Tkhai1-0/+1
These pernet_operations call xt_proto_init() and xt_proto_fini(), which just register and unregister /proc entries. They are safe to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-05net: Convert log pernet_operationsKirill Tkhai1-0/+1
These pernet_operations use nf_log_set() and nf_log_unset() in their methods: nf_log_bridge_net_ops nf_log_arp_net_ops nf_log_ipv4_net_ops nf_log_ipv6_net_ops nf_log_netdev_net_ops Nobody can send such a packet to a net before it's became registered, nobody can send a packet after all netdevices are unregistered. So, these pernet_operations are able to be marked as async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04gre: add sequence number for collect md mode.William Tu1-5/+8
Currently GRE sequence number can only be used in native tunnel mode. This patch adds sequence number support for gre collect metadata mode. RFC2890 defines GRE sequence number to be specific to the traffic flow identified by the key. However, this patch does not implement per-key seqno. The sequence number is shared in the same tunnel device. That is, different tunnel keys using the same collect_md tunnel share single sequence number. Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net: xfrm: use skb_gso_validate_network_len() to check gso sizesDaniel Axtens1-1/+1
Replace skb_gso_network_seglen() with skb_gso_validate_network_len(), as it considers the GSO_BY_FRAGS case. Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net: rename skb_gso_validate_mtu -> skb_gso_validate_network_lenDaniel Axtens2-2/+2
If you take a GSO skb, and split it into packets, will the network length (L3 headers + L4 headers + payload) of those packets be small enough to fit within a given MTU? skb_gso_validate_mtu gives you the answer to that question. However, we recently added to add a way to validate the MAC length of a split GSO skb (L2+L3+L4+payload), and the names get confusing, so rename skb_gso_validate_mtu to skb_gso_validate_network_len Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net/ipv6: Add support for path selection using hash of 5-tupleDavid Ahern3-17/+80
Some operators prefer IPv6 path selection to use a standard 5-tuple hash rather than just an L3 hash with the flow the label. To that end add support to IPv6 for multipath hash policy similar to bf4e0a3db97eb ("net: ipv4: add support for ECMP hash policy choice"). The default is still L3 which covers source and destination addresses along with flow label and IPv6 protocol. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net/ipv6: Pass skb to route lookupDavid Ahern12-44/+65
IPv6 does path selection for multipath routes deep in the lookup functions. The next patch adds L4 hash option and needs the skb for the forward path. To get the skb to the relevant FIB lookup functions it needs to go through the fib rules layer, so add a lookup_data argument to the fib_lookup_arg struct. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net/ipv6: Make rt6_multipath_hash similar to fib_multipath_hashDavid Ahern1-4/+10
Make rt6_multipath_hash more of a direct parallel to fib_multipath_hash and reduce stack and overhead in the process: get_hash_from_flowi6 is just a wrapper around __get_hash_from_flowi6 with another stack allocation for flow_keys. Move setting the addresses, protocol and label into rt6_multipath_hash and allow it to make the call to flow_hash_from_keys. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-04net: Align ip_multipath_l3_keys and ip6_multipath_l3_keysDavid Ahern1-2/+2
Symmetry is good and allows easy comparison that ipv4 and ipv6 are doing the same thing. To that end, change ip_multipath_l3_keys to set addresses at the end after the icmp compares, and move the initialization of ipv6 flow keys to rt6_multipath_hash. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>