aboutsummaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)AuthorFilesLines
2018-04-19net/ipv6: Rename addrconf_dst_allocDavid Ahern3-43/+43
addrconf_dst_alloc now returns a fib6_info. Update the name and its users to reflect the change. Rename only; no functional change intended. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-19net/ipv6: Rename fib6_info struct elementsDavid Ahern5-238/+238
Change the prefix for fib6_info struct elements from rt6i_ to fib6_. rt6i_pcpu and rt6i_exception_bucket are left as is given that they point to rt6_info entries. Rename only; not functional change intended. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-19netfilter: nf_tables: NAT chain and extensions require NF_TABLESPablo Neira Ayuso1-27/+28
Move these options inside the scope of the 'if' NF_TABLES and NF_TABLES_IPV6 dependencies. This patch fixes: net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain' net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type' net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init': >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type' that happens with: CONFIG_NF_TABLES=m CONFIG_NFT_CHAIN_NAT_IPV6=y Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type") Reported-by: kbuild test robot <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]>
2018-04-18ipv6: frags: fix a lockdep false positiveEric Dumazet1-11/+12
lockdep does not know that the locks used by IPv4 defrag and IPv6 reassembly units are of different classes. It complains because of following chains : 1) sch_direct_xmit() (lock txq->_xmit_lock) dev_hard_start_xmit() xmit_one() dev_queue_xmit_nit() packet_rcv_fanout() ip_check_defrag() ip_defrag() spin_lock() (lock frag queue spinlock) 2) ip6_input_finish() ipv6_frag_rcv() (lock frag queue spinlock) ip6_frag_queue() icmpv6_param_prob() (lock txq->_xmit_lock at some point) We could add lockdep annotations, but we also can make sure IPv6 calls icmpv6_param_prob() only after the release of the frag queue spinlock, since this naturally makes frag queue spinlock a leaf in lock hierarchy. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Remove unused code and variables for rt6_infoDavid Ahern3-49/+2
Drop unneeded elements from rt6_info struct and rearrange layout to something more relevant for the data path. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Flip FIB entries to fib6_infoDavid Ahern5-200/+201
Convert all code paths referencing a FIB entry from rt6_info to fib6_info. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: separate handling of FIB entries from dst based routesDavid Ahern6-147/+113
Last step before flipping the data type for FIB entries: - use fib6_info_alloc to create FIB entries in ip6_route_info_create and addrconf_dst_alloc - use fib6_info_release in place of dst_release, ip6_rt_put and rt6_release - remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt - when purging routes, drop per-cpu routes - replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release - use rt->from since it points to the FIB entry - drop references to exception bucket, fib6_metrics and per-cpu from dst entries (those are relevant for fib entries only) Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: introduce fib6_info struct and helpersDavid Ahern1-0/+60
Add fib6_info struct and alloc, destroy, hold and release helpers. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Cleanup exception and cache route handlingDavid Ahern2-77/+81
IPv6 FIB will only contain FIB entries with exception routes added to the FIB entry. Once this transformation is complete, FIB lookups will return a fib6_info with the lookup functions still returning a dst based rt6_info. The current code uses rt6_info for both paths and overloads the rt6_info variable usually called 'rt'. This patch introduces a new 'f6i' variable name for the result of the FIB lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a fib6_info in a later patch which is why it is introduced as f6i now; avoids the additional churn in the later patch. In addition, remove RTF_CACHE and dst checks from fib6 add and delete since they can not happen now and will never happen after the data type flip. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Add gfp_flags to route add functionsDavid Ahern3-25/+34
Most FIB entries can be added using memory allocated with GFP_KERNEL. Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that can be reached from the packet path (e.g., ndisc and autoconfig) or atomic notifiers use GFP_ATOMIC; paths from user context (adding addresses and routes) use GFP_KERNEL. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Create a neigh_lookup for FIB entriesDavid Ahern2-15/+26
The router discovery code has a FIB entry and wants to validate the gateway has a neighbor entry. Refactor the existing dst_neigh_lookup for IPv6 and create a new function that takes the gateway and device and returns a neighbor entry. Use the new function in ndisc_router_discovery to validate the gateway. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Move dst flags to booleans in fib entriesDavid Ahern2-7/+26
Continuing to wean FIB paths off of dst_entry, use a bool to hold requests for certain dst settings. Add a helper to convert the flags to DST flags when a FIB entry is converted to a dst_entry. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Add rt6_info create function for ip6_pol_route_lookupDavid Ahern1-4/+25
ip6_pol_route_lookup is the lookup function for ip6_route_lookup and rt6_lookup. At the moment it returns either a reference to a FIB entry or a cached exception. To move FIB entries to a separate struct, this lookup function needs to convert FIB entries to an rt6_info that is returned to the caller. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Add fib6_null_entryDavid Ahern2-32/+56
ip6_null_entry will stay a dst based return for lookups that fail to match an entry. Add a new fib6_null_entry which constitutes the root node and leafs for fibs. Replace existing references to ip6_null_entry with the new fib6_null_entry when dealing with FIBs. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: move expires into rt6_infoDavid Ahern4-17/+19
Add expires to rt6_info for FIB entries, and add fib6 helpers to manage it. Data path use of dst.expires remains. The transition is fairly straightforward: when working with fib entries, rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and rt6_check_expired becomes fib6_check_expired, where the fib6 versions are added by this patch. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: move metrics from dst to rt6_infoDavid Ahern3-210/+123
Similar to IPv4, add fib metrics to the fib struct, which at the moment is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics into dst by reference using refcount. To make the transition: - add dst_metrics to rt6_info. Default to dst_default_metrics if no metrics are passed during route add. No need for a separate pmtu entry; it can reference the MTU slot in fib6_metrics - ip6_convert_metrics allocates memory in the FIB entry and uses ip_metrics_convert to copy from netlink attribute to metrics entry - the convert metrics call is done in ip6_route_info_create simplifying the route add path + fib6_commit_metrics and fib6_copy_metrics and the temporary mx6_config are no longer needed - add fib6_metric_set helper to change the value of a metric in the fib entry since dst_metric_set can no longer be used - cow_metrics for IPv6 can drop to dst_cow_metrics_generic - rt6_dst_from_metrics_check is no longer needed - rt6_fill_node needs the FIB entry and dst as separate arguments to keep compatibility with existing output. Current dst address is renamed to dest. (to be consistent with IPv4 rt6_fill_node really should be split into 2 functions similar to fib_dump_info and rt_fill_info) - rt6_fill_node no longer needs the temporary metrics variable Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Defer initialization of dst to data pathDavid Ahern1-41/+74
Defer setting dst input, output and error until fib entry is copied. The reject path from ip6_route_info_create is moved to a new function ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type to dst error. The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code from addrconf_dst_alloc and the non-reject path of ip6_route_info_create. The dst output function is always ip6_output and the input function is either ip6_input (local routes), ip6_mc_input (multicast routes) or ip6_forward (anything else). A couple of places using dst.error are updated to look at rt6i_flags. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Move nexthop data to fib6_nhDavid Ahern3-76/+94
Introduce fib6_nh structure and move nexthop related data from rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or lwtstate from a FIB lookup perspective are converted to use fib6_nh; datapath references to dst version are left as is. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Save route type in rt6_infoDavid Ahern2-26/+22
The RTN_ type for IPv6 FIB entries is currently embedded in rt6i_flags and dst.error. Since dst is going to be removed, it can no longer be relied on for FIB dumps so save the route type as fib6_type. fc_type is set in current users based on the algorithm in rt6_fill_node: - rt6i_flags contains RTF_LOCAL: fc_type = RTN_LOCAL - rt6i_flags contains RTF_ANYCAST: fc_type = RTN_ANYCAST - else fc_type = RTN_UNICAST Similarly, fib6_type is set in the rt6_info templates based on the RTF_REJECT section of rt6_fill_node converting dst.error to RTN type. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Move support functions up in route.cDavid Ahern1-60/+59
Code move only. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Pass net namespace to route functionsDavid Ahern4-50/+59
Pass network namespace reference into route add, delete and get functions. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Pass net to fib6_update_sernumDavid Ahern2-7/+6
Pass net namespace to fib6_update_sernum. It can not be marked const as fib6_new_sernum will change ipv6.fib6_sernum. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17ipv6: send netlink notifications for manually configured addressesLorenzo Bianconi1-8/+5
Send a netlink notification when userspace adds a manually configured address if DAD is enabled and optimistic flag isn't set. Moreover send RTM_DELADDR notifications for tentative addresses. Some userspace applications (e.g. NetworkManager) are interested in addr netlink events albeit the address is still in tentative state, however events are not sent if DAD process is not completed. If the address is added and immediately removed userspace listeners are not notified. This behaviour can be easily reproduced by using veth interfaces: $ ip -b - <<EOF > link add dev vm1 type veth peer name vm2 > link set dev vm1 up > link set dev vm2 up > addr add 2001:db8:a:b:1:2:3:4/64 dev vm1 > addr del 2001:db8:a:b:1:2:3:4/64 dev vm1 EOF This patch reverts the behaviour introduced by the commit f784ad3d79e5 ("ipv6: do not send RTM_DELADDR for tentative addresses") Suggested-by: Thomas Haller <[email protected]> Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17ipv6: Count interface receive statistics on the ingress netdevStephen Suryaputra5-50/+34
The statistics such as InHdrErrors should be counted on the ingress netdev rather than on the dev from the dst, which is the egress. Signed-off-by: Stephen Suryaputra <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-17net/ipv6: Make __inet6_bind staticDavid Ahern1-27/+26
BPF core gets access to __inet6_bind via ipv6_bpf_stub_impl, so it is not invoked directly outside of af_inet6.c. Make it static and move inet6_bind after to avoid forward declaration. Signed-off-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-16tcp: implement mmap() for zero copy receiveEric Dumazet1-1/+1
Some networks can make sure TCP payload can exactly fit 4KB pages, with well chosen MSS/MTU and architectures. Implement mmap() system call so that applications can avoid copying data without complex splice() games. Note that a successful mmap( X bytes) on TCP socket is consuming bytes, as if recvmsg() has been done. (tp->copied += X) Only PROT_READ mappings are accepted, as skb page frags are fundamentally shared and read only. If tcp_mmap() finds data that is not a full page, or a patch of urgent data, -EINVAL is returned, no bytes are consumed. Application must fallback to recvmsg() to read the problematic sequence. mmap() wont block, regardless of socket being in blocking or non-blocking mode. If not enough bytes are in receive queue, mmap() would return -EAGAIN, or -EIO if socket is in a state where no other bytes can be added into receive queue. An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD) to efficiently use mmap() On the sender side, MSG_EOR might help to clearly separate unaligned headers and 4K-aligned chunks if necessary. Tested: mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch. MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header) Without mmap() (tcp_mmap -s) received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit, cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit, cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit, cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit, cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches With mmap() on receiver (tcp_mmap -s -z) received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit, cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit, cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit, cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit, cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-16tcp: fix SO_RCVLOWAT and RCVBUF autotuningEric Dumazet1-0/+1
Applications might use SO_RCVLOWAT on TCP socket hoping to receive one [E]POLLIN event only when a given amount of bytes are ready in socket receive queue. Problem is that receive autotuning is not aware of this constraint, meaning sk_rcvbuf might be too small to allow all bytes to be stored. Add a new (struct proto_ops)->set_rcvlowat method so that a protocol can override the default setsockopt(SO_RCVLOWAT) behavior. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-16ipv6: remove unnecessary check in addrconf_prefix_rcv_add_addr()Lorenzo Bianconi1-2/+1
Remove unnecessary check on update_lft variable in addrconf_prefix_rcv_add_addr routine since it is always set to 0. Moreover remove update_lft re-initialization to 0 Signed-off-by: Lorenzo Bianconi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-16xfrm: Fix warning in xfrm6_tunnel_net_exit.Steffen Klassert1-0/+3
We need to make sure that all states are really deleted before we check that the state lists are empty. Otherwise we trigger a warning. Fixes: baeb0dbbb5659 ("xfrm6_tunnel: exit_net cleanup check added") Reported-and-tested-by:[email protected] Signed-off-by: Steffen Klassert <[email protected]>
2018-04-05net/ipv6: Increment OUTxxx counters after netfilter hookJeff Barnhill1-2/+5
At the end of ip6_forward(), IPSTATS_MIB_OUTFORWDATAGRAMS and IPSTATS_MIB_OUTOCTETS are incremented immediately before the NF_HOOK call for NFPROTO_IPV6 / NF_INET_FORWARD. As a result, these counters get incremented regardless of whether or not the netfilter hook allows the packet to continue being processed. This change increments the counters in ip6_forward_finish() so that it will not happen if the netfilter hook chooses to terminate the packet, which is similar to how IPv4 works. Signed-off-by: Jeff Barnhill <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-05vti6: better validate user provided tunnel namesEric Dumazet1-2/+5
Use valid_name() to make sure user does not provide illegal device name. Fixes: ed1efb2aefbb ("ipv6: Add support for IPsec virtual tunnel interfaces") Signed-off-by: Eric Dumazet <[email protected]> Cc: Steffen Klassert <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-05ip6_tunnel: better validate user provided tunnel namesEric Dumazet1-4/+7
Use valid_name() to make sure user does not provide illegal device name. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-05ip6_gre: better validate user provided tunnel namesEric Dumazet1-3/+5
Use dev_valid_name() to make sure user does not provide illegal device name. syzbot caught the following bug : BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline] BUG: KASAN: stack-out-of-bounds in ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339 Write of size 20 at addr ffff8801afb9f7b8 by task syzkaller851048/4466 CPU: 1 PID: 4466 Comm: syzkaller851048 Not tainted 4.16.0+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x1b9/0x29f lib/dump_stack.c:53 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 memcpy+0x37/0x50 mm/kasan/kasan.c:303 strlcpy include/linux/string.h:300 [inline] ip6gre_tunnel_locate+0x334/0x860 net/ipv6/ip6_gre.c:339 ip6gre_tunnel_ioctl+0x69d/0x12e0 net/ipv6/ip6_gre.c:1195 dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334 dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525 sock_ioctl+0x47e/0x680 net/socket.c:1015 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:500 [inline] do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701 SYSC_ioctl fs/ioctl.c:708 [inline] SyS_ioctl+0x24/0x30 fs/ioctl.c:706 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-05ipv6: sit: better validate user provided tunnel namesEric Dumazet1-3/+5
Use dev_valid_name() to make sure user does not provide illegal device name. syzbot caught the following bug : BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline] BUG: KASAN: stack-out-of-bounds in ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254 Write of size 33 at addr ffff8801b64076d8 by task syzkaller932654/4453 CPU: 0 PID: 4453 Comm: syzkaller932654 Not tainted 4.16.0+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x1b9/0x29f lib/dump_stack.c:53 print_address_description+0x6c/0x20b mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412 check_memory_region_inline mm/kasan/kasan.c:260 [inline] check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267 memcpy+0x37/0x50 mm/kasan/kasan.c:303 strlcpy include/linux/string.h:300 [inline] ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254 ipip6_tunnel_ioctl+0xe71/0x241b net/ipv6/sit.c:1221 dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334 dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525 sock_ioctl+0x47e/0x680 net/socket.c:1015 vfs_ioctl fs/ioctl.c:46 [inline] file_ioctl fs/ioctl.c:500 [inline] do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701 SYSC_ioctl fs/ioctl.c:708 [inline] SyS_ioctl+0x24/0x30 fs/ioctl.c:706 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04inet: frags: fix ip6frag_low_thresh boundaryEric Dumazet2-4/+0
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, since linker might place next to it a non zero value preventing a change to ip6frag_low_thresh. ip6frag_low_thresh is not used anymore in the kernel, but we do not want to prematuraly break user scripts wanting to change it. Since specifying a minimal value of 0 for proc_doulongvec_minmax() is moot, let's remove these zero values in all defrag units. Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04net: avoid unneeded atomic operation in ip*_append_data()Paolo Abeni1-1/+2
After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()"), when transmitting sub MTU datagram, an addtional, unneeded atomic operation is performed in ip*_append_data() to update wmem_alloc: in the above condition the delta is 0. The above cause small but measurable performance regression in UDP xmit tput test with packet size below MTU. This change avoids such overhead updating wmem_alloc only if wmem_alloc_delta is non zero. The error path is left intentionally unmodified: it's a slow path and simplicity is preferred to performances. Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()") Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()") Signed-off-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04ipv6: udp: set dst cache for a connected sk if current not validAlexey Kodanev1-19/+2
A new RTF_CACHE route can be created between ip6_sk_dst_lookup_flow() and ip6_dst_store() calls in udpv6_sendmsg(), when datagram sending results to ICMPV6_PKT_TOOBIG error: udp_v6_send_skb(), for example with vti6 tunnel: vti6_xmit(), get ICMPV6_PKT_TOOBIG error skb_dst_update_pmtu(), can create a RTF_CACHE clone icmpv6_send() ... udpv6_err() ip6_sk_update_pmtu() ip6_update_pmtu(), can create a RTF_CACHE clone ... ip6_datagram_dst_update() ip6_dst_store() And after commit 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update"), the UDPv6 error handler can update socket's dst cache, but it can happen before the update in the end of udpv6_sendmsg(), preventing getting the new dst cache on the next udpv6_sendmsg() calls. In order to fix it, save dst in a connected socket only if the current socket's dst cache is invalid. The previous patch prepared ip6_sk_dst_lookup_flow() to do that with the new argument, and this patch enables it in udpv6_sendmsg(). Fixes: 33c162a980fe ("ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update") Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Signed-off-by: Alexey Kodanev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()Alexey Kodanev1-5/+5
This should make it consistent with ip6_sk_dst_lookup_flow() that is accepting the new 'connected' parameter of type bool. Signed-off-by: Alexey Kodanev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()Alexey Kodanev3-5/+14
Add 'connected' parameter to ip6_sk_dst_lookup_flow() and update the cache only if ip6_sk_dst_check() returns NULL and a socket is connected. The function is used as before, the new behavior for UDP sockets in udpv6_sendmsg() will be enabled in the next patch. Signed-off-by: Alexey Kodanev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-04ipv6: add a wrapper for ip6_dst_store() with flowi6 checksAlexey Kodanev2-8/+18
Move commonly used pattern of ip6_dst_store() usage to a separate function - ip6_sk_dst_store_flow(), which will check the addresses for equality using the flow information, before saving them. There is no functional changes in this patch. In addition, it will be used in the next patch, in ip6_sk_dst_lookup_flow(). Signed-off-by: Alexey Kodanev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-02ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_threshEric Dumazet1-1/+1
I forgot to change ip6frag_low_thresh proc_handler from proc_dointvec_minmax to proc_doulongvec_minmax Fixes: 3e67f106f619 ("inet: frags: break the 2GB limit for frags storage") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Maciej Żenczykowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller6-34/+55
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <[email protected]>
2018-04-01ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()Eric Dumazet1-5/+12
While testing my inet defrag changes, I found that the senders could spend ~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for every fragment they cook, competing with TX completion of prior skbs possibly happening on another cpus. The solution to this problem is to use alloc_skb() instead of sock_wmalloc() and manually perform a single sk_wmem_alloc change. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-01ip6_gre: remove redundant 'tunnel' setting in ip6erspan_tap_init()Alexey Kodanev1-1/+0
'tunnel' was already set at the start of ip6erspan_tap_init(). Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Alexey Kodanev <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller3-18/+84
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-03-31 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add raw BPF tracepoint API in order to have a BPF program type that can access kernel internal arguments of the tracepoints in their raw form similar to kprobes based BPF programs. This infrastructure also adds a new BPF_RAW_TRACEPOINT_OPEN command to BPF syscall which returns an anon-inode backed fd for the tracepoint object that allows for automatic detach of the BPF program resp. unregistering of the tracepoint probe on fd release, from Alexei. 2) Add new BPF cgroup hooks at bind() and connect() entry in order to allow BPF programs to reject, inspect or modify user space passed struct sockaddr, and as well a hook at post bind time once the port has been allocated. They are used in FB's container management engine for implementing policy, replacing fragile LD_PRELOAD wrapper intercepting bind() and connect() calls that only works in limited scenarios like glibc based apps but not for other runtimes in containerized applications, from Andrey. 3) BPF_F_INGRESS flag support has been added to sockmap programs for their redirect helper call bringing it in line with cls_bpf based programs. Support is added for both variants of sockmap programs, meaning for tx ULP hooks as well as recv skb hooks, from John. 4) Various improvements on BPF side for the nfp driver, besides others this work adds BPF map update and delete helper call support from the datapath, JITing of 32 and 64 bit XADD instructions as well as offload support of bpf_get_prandom_u32() call. Initial implementation of nfp packet cache has been tackled that optimizes memory access (see merge commit for further details), from Jakub and Jiong. 5) Removal of struct bpf_verifier_env argument from the print_bpf_insn() API has been done in order to prepare to use print_bpf_insn() soon out of perf tool directly. This makes the print_bpf_insn() API more generic and pushes the env into private data. bpftool is adjusted as well with the print_bpf_insn() argument removal, from Jiri. 6) Couple of cleanups and prep work for the upcoming BTF (BPF Type Format). The latter will reuse the current BPF verifier log as well, thus bpf_verifier_log() is further generalized, from Martin. 7) For bpf_getsockopt() and bpf_setsockopt() helpers, IPv4 IP_TOS read and write support has been added in similar fashion to existing IPv6 IPV6_TCLASS socket option we already have, from Nikita. 8) Fixes in recent sockmap scatterlist API usage, which did not use sg_init_table() for initialization thus triggering a BUG_ON() in scatterlist API when CONFIG_DEBUG_SG was enabled. This adds and uses a small helper sg_init_marker() to properly handle the affected cases, from Prashant. 9) Let the BPF core follow IDR code convention and therefore use the idr_preload() and idr_preload_end() helpers, which would also help idr_alloc_cyclic() under GFP_ATOMIC to better succeed under memory pressure, from Shaohua. 10) Last but not least, a spelling fix in an error message for the BPF cookie UID helper under BPF sample code, from Colin. ==================== Signed-off-by: David S. Miller <[email protected]>
2018-03-31inet: frags: get rid of nf_ct_frag6_skb_cb/NFCT_FRAG6_CBEric Dumazet1-18/+11
nf_ct_frag6_queue() uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[] we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31ipv6: frags: get rid of ip6frag_skb_cb/FRAG6_CBEric Dumazet1-18/+12
ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that we could use two cache lines per skb when finding the insertion point, if for some reason inet6_skb_parm size is increased in the future. By using skb->ip_defrag_offset instead of skb->cb[], we pack all the fields in a single cache line, matching what we did for IPv4. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31ipv6: frags: rewrite ip6_expire_frag_queue()Eric Dumazet1-8/+16
Make it similar to IPv4 ip_expire(), and release the lock before calling icmp functions. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31inet: frags: break the 2GB limit for frags storageEric Dumazet3-9/+9
Some users are willing to provision huge amounts of memory to be able to perform reassembly reasonnably well under pressure. Current memory tracking is using one atomic_t and integers. Switch to atomic_long_t so that 64bit arches can use more than 2GB, without any cost for 32bit arches. Note that this patch avoids an overflow error, if high_thresh was set to ~2GB, since this test in inet_frag_alloc() was never true : if (... || frag_mem_limit(nf) > nf->high_thresh) Tested: $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh <frag DDOS> $ grep FRAG /proc/net/sockstat FRAG: inuse 14705885 memory 16000002880 $ nstat -n ; sleep 1 ; nstat | grep Reas IpReasmReqds 3317150 0.0 IpReasmFails 3317112 0.0 Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-03-31inet: frags: remove inet_frag_maybe_warn_overflow()Eric Dumazet2-6/+4
This function is obsolete, after rhashtable addition to inet defrag. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>