path: root/net
Age         Commit message  [Author, files changed, lines -/+]
2018-05-02  net/smc: determine vlan_id of stacked net_device  [Ursula Braun, 1 file, -3/+23]
An SMC link group is bound to a specific vlan_id. Its link uses the RoCE-GIDs established for the specific vlan_id. This patch makes sure the appropriate vlan_id is determined for stacked scenarios like for instance a master bonding device with vlan devices enslaved. Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  net/smc: handle ioctls SIOCINQ, SIOCOUTQ, and SIOCOUTQNSD  [Ursula Braun, 1 file, -3/+30]
SIOCINQ returns the amount of unread data in the RMB. SIOCOUTQ returns the amount of unsent or unacked sent data in the send buffer. SIOCOUTQNSD returns the amount of data prepared for sending, but not yet sent. Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
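As a rough user-space sketch (not part of the patch), these counters can be queried like any other socket ioctl; `fd` is assumed to be an already connected AF_SMC socket, and SIOCOUTQNSD may be missing from older installed copies of <linux/sockios.h>:

  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/sockios.h>      /* SIOCINQ, SIOCOUTQ, SIOCOUTQNSD */

  /* Print the three queue counters for a connected SMC socket 'fd'. */
  static void print_smc_queues(int fd)
  {
          int inq = 0, outq = 0, outqnsd = 0;

          if (ioctl(fd, SIOCINQ, &inq) == 0)
                  printf("unread data in RMB:         %d bytes\n", inq);
          if (ioctl(fd, SIOCOUTQ, &outq) == 0)
                  printf("unsent or unacked data:     %d bytes\n", outq);
          if (ioctl(fd, SIOCOUTQNSD, &outqnsd) == 0)
                  printf("prepared but not yet sent:  %d bytes\n", outqnsd);
  }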
2018-05-02  net/smc: ipv6 support for smc_diag.c  [Karsten Graul, 1 file, -9/+30]
Update smc_diag.c to support ipv6 addresses on the diagnosis interface. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  net/smc: periodic testlink support  [Karsten Graul, 6 files, -3/+75]
Add periodic LLC testlink support to ensure the link is still active. The interval time is initialized using the value of sysctl_tcp_keepalive_time. Signed-off-by: Karsten Graul <[email protected]> Signed-off-by: Ursula Braun <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  net/smc: restrict non-blocking connect finish  [Ursula Braun, 1 file, -6/+8]
The smc_poll code tries to finish connect() if the socket is in state SMC_INIT and polling of the internal CLC-socket returns with EPOLLOUT. This makes sense for a select/poll call following a connect call, but not without a preceding connect(). With this patch smc_poll starts the connect logic only if the CLC-socket is no longer in its initial state TCP_CLOSE. In addition, a poll error on the internal CLC-socket is always propagated to the SMC socket. With this patch the code path mentioned by syzbot https://syzkaller.appspot.com/bug?extid=03faa2dc16b8b64be396 is no longer possible. Signed-off-by: Ursula Braun <[email protected]> Reported-by: [email protected] Signed-off-by: David S. Miller <[email protected]>
2018-05-02  sctp: fix the issue that the cookie-ack with auth can't get processed  [Xin Long, 1 file, -1/+1]
When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp processes auth chunk first, then continues to the next chunk in this packet if chunk_end + chunk_hdr size < skb_tail_pointer(). Otherwise, it will go to the next packet or discard this chunk. However, it missed the fact that cookie-ack chunk's size is equal to chunk_hdr size, which couldn't match that check, and thus this chunk would not get processed. This patch fixes it by changing the check to chunk_end + chunk_hdr size <= skb_tail_pointer(). Fixes: 26b87c788100 ("net: sctp: fix remote memory pressure from excessive queueing") Signed-off-by: Xin Long <[email protected]> Acked-by: Neil Horman <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
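A toy user-space demonstration of the off-by-one (the sizes below are made up for illustration and are not the kernel's): a header-only chunk ends exactly at the packet tail, so only the inclusive comparison lets it be processed.

  #include <stdio.h>

  int main(void)
  {
          /* Toy numbers: a header-only COOKIE-ACK sitting at the end of the packet. */
          unsigned long chunk_hdr_size = 4;   /* size of the chunk header        */
          unsigned long chunk_end      = 100; /* where the previous chunk ended  */
          unsigned long skb_tail       = 104; /* end of the packet data          */

          /* Old check: strictly-less-than rejects a chunk whose header ends
           * exactly at the tail, so the COOKIE-ACK was never processed.
           */
          printf("old check (<):  %s\n",
                 chunk_end + chunk_hdr_size <  skb_tail ? "process" : "skip");

          /* Fixed check: the inclusive comparison accepts it. */
          printf("new check (<=): %s\n",
                 chunk_end + chunk_hdr_size <= skb_tail ? "process" : "skip");
          return 0;
  }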
2018-05-02  sctp: use the old asoc when making the cookie-ack chunk in dupcook_d  [Xin Long, 1 file, -1/+1]
When processing a duplicate cookie-echo chunk, for case 'D', sctp will not process the param from this chunk. It means the old asoc has nothing to be updated, and the new temp asoc doesn't have the complete info. So there's no reason to use the new asoc when creating the cookie-ack chunk. Otherwise, like when auth is enabled for cookie-ack, the chunk can not be set with auth, and it will definitely be dropped by the peer. This issue has been there since the very beginning, and we fix it by using the old asoc instead. Signed-off-by: Xin Long <[email protected]> Acked-by: Neil Horman <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  sctp: init active key for the new asoc in dupcook_a and dupcook_b  [Xin Long, 1 file, -0/+6]
When processing a duplicate cookie-echo chunk, for case 'A' and 'B', after sctp_process_init for the new asoc, if auth is enabled for the cookie-ack chunk, the active key should also be initialized. Otherwise, the cookie-ack chunk made later can not be set with auth shkey properly, and a crash can even be caused by this, as after Commit 1b1e0bc99474 ("sctp: add refcnt support for sh_key"), sctp needs to hold the shkey when making control chunks. Fixes: 1b1e0bc99474 ("sctp: add refcnt support for sh_key") Reported-by: Jianwen Ji <[email protected]> Signed-off-by: Xin Long <[email protected]> Acked-by: Neil Horman <[email protected]> Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  tcp_bbr: fix to zero idle_restart only upon S/ACKed data  [Neal Cardwell, 1 file, -1/+3]
Previously the bbr->idle_restart tracking was zeroing out the bbr->idle_restart bit upon ACKs that did not SACK or ACK anything, e.g. receiving incoming data or receiver window updates. In such situations BBR would forget that this was a restart-from-idle situation, and if the min_rtt had expired it would unnecessarily enter PROBE_RTT (even though we were actually restarting from idle but had merely forgotten that fact). The fix is simple: we need to remember we are restarting from idle until we receive a S/ACK for some data (a S/ACK for the first flight of data we send as we are restarting). This commit is a stable candidate for kernels back as far as 4.9. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Priyaranjan Jha <[email protected]> Signed-off-by: Yousuk Seung <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-02  udp: Complement partial checksum for GSO packet  [Sean Tranchetti, 1 file, -0/+1]
Using the udp_v4_check() function to calculate the pseudo-header checksum for the newly segmented UDP packets results in assigning the complement of the value to the UDP header checksum field. Always undo the complement of the partial checksum value in order to match the case where GSO is not used on the UDP transmit path. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Sean Tranchetti <[email protected]> Signed-off-by: Subash Abhinov Kasiviswanathan <[email protected]> Acked-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/tls: Don't recursively call push_record during tls_write_space callbacks  [Dave Watson, 1 file, -0/+7]
It is reported that in some cases, write_space may be called in do_tcp_sendpages, such that we recursively invoke do_tcp_sendpages again:

  [ 660.468802] ? do_tcp_sendpages+0x8d/0x580
  [ 660.468826] ? tls_push_sg+0x74/0x130 [tls]
  [ 660.468852] ? tls_push_record+0x24a/0x390 [tls]
  [ 660.468880] ? tls_write_space+0x6a/0x80 [tls]
  ...

tls_push_sg already does a loop over all sending sg's, so ignore any tls_write_space notifications until we are done sending. We then have to call the previous write_space to wake up poll() waiters after we are done with the send loop. Reported-by: Andre Tomt <[email protected]> Signed-off-by: Dave Watson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
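The re-entrancy guard described above can be sketched roughly as follows; the type and field names are hypothetical and only illustrate the pattern, not the actual tls_context fields:

  /* Re-entrancy guard pattern with made-up names; the real patch keeps an
   * equivalent flag in the TLS context and checks it in tls_write_space().
   */
  struct toy_tls_ctx {
          int in_send_loop;                     /* set while pushing records    */
          void (*saved_write_space)(void *sk);  /* previous sk_write_space      */
  };

  static void toy_push_pending(struct toy_tls_ctx *ctx, void *sk)
  {
          ctx->in_send_loop = 1;
          /* ... loop over the pending scatter-gather list; the TCP send path
           *     may fire write_space notifications while we are in here ...   */
          ctx->in_send_loop = 0;
  }

  static void toy_write_space(struct toy_tls_ctx *ctx, void *sk)
  {
          if (ctx->in_send_loop)
                  return;                       /* ignore mid-send callbacks    */
          toy_push_pending(ctx, sk);
          ctx->saved_write_space(sk);           /* wake ordinary poll() waiters */
  }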
2018-05-01  tcp: send in-queue bytes in cmsg upon read  [Soheil Hassas Yeganeh, 1 file, -4/+39]
Applications with many concurrent connections, high variance in receive queue length and tight memory bounds cannot allocate worst-case buffer size to drain sockets. Knowing the size of the receive queue, applications can optimize how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly available through ioctl(FIONREAD/SIOCINQ) and can be approximated using getsockopt(MEMINFO) (rmem_alloc includes skb overheads in addition to application data). But both of these options add an extra syscall per recvmsg. Moreover, ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket option is set, recvmsg() relays the number of bytes available on the socket for reading to the application via the TCP_CM_INQ control message. Calculate the number of bytes after releasing the socket lock to include the processed backlog, if any. To avoid an extra branch in the hot path of recvmsg() for this new control message, move all cmsg processing inside an existing branch for processing receive timestamps.

Since the socket lock is not held when calculating the size of the receive queue, TCP_INQ is a hint. For example, it can overestimate the queue size by one byte if a FIN is received. With this method, applications can start reading from the socket using a small buffer, and then use larger buffers based on the remaining data when needed.

V3 change-log: As suggested by David Miller, added loads with barrier to check whether we have multiple threads calling recvmsg in parallel. When that happens we lock the socket to calculate inq.
V4 change-log: Removed inline from a static function.

Signed-off-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: Yuchung Cheng <[email protected]> Signed-off-by: Willem de Bruijn <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Neal Cardwell <[email protected]> Suggested-by: David Miller <[email protected]> Signed-off-by: David S. Miller <[email protected]>
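A possible user-space consumer, assuming a connected TCP socket `fd`; the TCP_INQ/TCP_CM_INQ values come from the uapi added by this patch and are defined below only as a fallback for older installed headers:

  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef TCP_INQ                 /* uapi values from this patch; older headers
                                   * on the build host may lack them            */
  #define TCP_INQ    36
  #define TCP_CM_INQ TCP_INQ
  #endif

  /* Read from connected TCP socket 'fd' and report, via *inq, how many bytes
   * were still queued for reading when recvmsg() returned.
   */
  static ssize_t recv_with_inq(int fd, void *buf, size_t len, int *inq)
  {
          char cbuf[CMSG_SPACE(sizeof(int))];
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          struct msghdr msg = {
                  .msg_iov        = &iov,
                  .msg_iovlen     = 1,
                  .msg_control    = cbuf,
                  .msg_controllen = sizeof(cbuf),
          };
          int one = 1;
          struct cmsghdr *cm;
          ssize_t n;

          /* Normally done once at socket setup; shown here for completeness. */
          setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));

          n = recvmsg(fd, &msg, 0);
          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm))
                  if (cm->cmsg_level == IPPROTO_TCP && cm->cmsg_type == TCP_CM_INQ)
                          memcpy(inq, CMSG_DATA(cm), sizeof(*inq));
          return n;
  }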
2018-05-01  net: core: Inline netdev_features_size_check()  [Florian Fainelli, 1 file, -1/+2]
We do not require this inline function to be used in multiple different locations, just inline it where it gets used in register_netdevice(). Suggested-by: David Miller <[email protected]> Suggested-by: Stephen Hemminger <[email protected]> Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  ipv6: Allow non-gateway ECMP for IPv6  [Thomas Winter, 1 file, -3/+0]
It is valid to have static routes where the nexthop is an interface rather than an address, such as with tunnels. For IPv4 it was possible to use ECMP on these routes, but not for IPv6. Signed-off-by: Thomas Winter <[email protected]> Cc: David Ahern <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: Alexey Kuznetsov <[email protected]> Cc: Hideaki YOSHIFUJI <[email protected]> Acked-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  udp: disable gso with no_check_tx  [Willem de Bruijn, 2 files, -0/+8]
Syzbot managed to send a udp gso packet without checksum offload into the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This triggered the skb_warn_bad_offload.

  RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
  skb_gso_segment include/linux/netdevice.h:4038 [inline]
  validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
  __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3618

UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to catch and fail in this case. After the integrity checks jump directly to the CHECKSUM_PARTIAL case to avoid reading the no_check_tx flags again (a TOCTTOU race). Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  ethtool: fix a potential missing-check bug  [Wenwen Wang, 1 file, -0/+5]
In ethtool_get_rxnfc(), the object "info" is first copied from user space. If the FLOW_RSS flag is set in the flow_type member of "info" (and cmd is ETHTOOL_GRXFH), info needs to be copied again from user space because FLOW_RSS is newer and has a new definition, as mentioned in the comment. However, given that the user data resides in user space, a malicious user can race to change the data after the first copy. By doing so, the user can inject inconsistent data. For example, in the second copy, the FLOW_RSS flag could be cleared in the flow_type field of "info". In the following execution, "info" will be used in the function ops->get_rxnfc(). Such inconsistent data can potentially lead to unexpected information leakage, since ops->get_rxnfc() will prepare various types of data according to flow_type, and the prepared data is eventually copied to user space. This inconsistent data may also cause undefined behavior depending on how ops->get_rxnfc() is implemented. This patch simply re-verifies the flow_type field of "info" after the second copy. If the value is not as expected, an error code is returned. Signed-off-by: Wenwen Wang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
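The hardening boils down to re-checking, after the second copy, any field the first copy was trusted for. A kernel-style sketch of that double-fetch pattern, with a made-up structure and flag (not the actual ethtool diff):

  #include <linux/types.h>
  #include <linux/errno.h>
  #include <linux/uaccess.h>

  /* Illustrative structure; the real code uses struct ethtool_rxnfc. */
  struct toy_rxnfc {
          __u32 cmd;
          __u32 flow_type;
  };

  #define TOY_FLOW_RSS (1U << 31)         /* stand-in for the FLOW_RSS flag */

  static int toy_get_rxnfc(void __user *useraddr)
  {
          struct toy_rxnfc info;

          if (copy_from_user(&info, useraddr, sizeof(info)))
                  return -EFAULT;

          if (info.flow_type & TOY_FLOW_RSS) {
                  /* Newer, larger layout: fetch the structure again. */
                  if (copy_from_user(&info, useraddr, sizeof(info)))
                          return -EFAULT;
                  /* The user may have flipped the flag between the two copies;
                   * reject the request rather than act on inconsistent data.
                   */
                  if (!(info.flow_type & TOY_FLOW_RSS))
                          return -EINVAL;
          }
          /* ... hand 'info' to the driver's ->get_rxnfc() ... */
          return 0;
  }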
2018-05-01  cls_flower: Support multiple masks per priority  [Paul Blakey, 1 file, -101/+174]
Currently flower doesn't support inserting filters with different masks on a single priority, even if the actual flows (key + mask) inserted aren't overlapping, as with the use case of offloading openvswitch datapath flows. Instead one must go up one level and assign different priorities for each mask, which creates different flower instances. This patch opens flower up to supporting more than one mask per priority within a single flower instance. It does so by adding another hash table on top of the existing one, which stores the different masks and the filters that share them. The user is left with the responsibility of ensuring non-overlapping flows; otherwise precedence is not guaranteed. Signed-off-by: Paul Blakey <[email protected]> Signed-off-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  tcp: fix TCP_REPAIR_QUEUE bound checking  [Eric Dumazet, 1 file, -1/+1]
syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out() with the following repro:

  socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
  setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0
  setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0
  bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242
  setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0
  writev(3, [{"\270", 1}], 1) = 1
  setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0
  writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144

The 3rd system call looks odd:

  setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0

This patch makes sure bound checking is using an unsigned compare. Fixes: ee9952831cfd ("tcp: Initial repair mode") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Cc: Pavel Emelyanov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
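A minimal user-space illustration of why the unsigned compare matters for the queue index; the bound constant is a local copy for the demo and the check below only mirrors the idea of the fix, it is not the kernel code:

  #include <stdio.h>

  #define TCP_QUEUES_NR 3   /* local copy of the uapi queue-index bound */

  int main(void)
  {
          int val = -1;     /* the value the syzbot repro passed to setsockopt() */

          /* Signed compare: -1 < 3 holds, so the bogus queue index is accepted. */
          printf("signed check:   %s\n",
                 val < TCP_QUEUES_NR ? "accepted" : "rejected");

          /* Unsigned compare: (unsigned int)-1 is UINT_MAX, so it is rejected. */
          printf("unsigned check: %s\n",
                 (unsigned int)val < TCP_QUEUES_NR ? "accepted" : "rejected");
          return 0;
  }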
2018-05-01  ipv6: fix uninit-value in ip6_multipath_l3_keys()  [Eric Dumazet, 1 file, -1/+6]
syzbot/KMSAN reported an uninit-value in ip6_multipath_l3_keys(), root caused to a bad assumption that the ICMP header is already pulled into skb->head. ip_multipath_l3_keys() does the correct thing, so it is an IPv6-only bug.

  BUG: KMSAN: uninit-value in ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
  BUG: KMSAN: uninit-value in rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
  CPU: 0 PID: 4507 Comm: syz-executor661 Not tainted 4.16.0+ #87
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:17 [inline]
   dump_stack+0x185/0x1d0 lib/dump_stack.c:53
   kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
   __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
   ip6_multipath_l3_keys net/ipv6/route.c:1830 [inline]
   rt6_multipath_hash+0x5c4/0x640 net/ipv6/route.c:1858
   ip6_route_input+0x65a/0x920 net/ipv6/route.c:1884
   ip6_rcv_finish+0x413/0x6e0 net/ipv6/ip6_input.c:69
   NF_HOOK include/linux/netfilter.h:288 [inline]
   ipv6_rcv+0x1e16/0x2340 net/ipv6/ip6_input.c:208
   __netif_receive_skb_core+0x47df/0x4a90 net/core/dev.c:4562
   __netif_receive_skb net/core/dev.c:4627 [inline]
   netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
   netif_receive_skb+0x230/0x240 net/core/dev.c:4725
   tun_rx_batched drivers/net/tun.c:1555 [inline]
   tun_get_user+0x740f/0x7c60 drivers/net/tun.c:1962
   tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
   call_write_iter include/linux/fs.h:1782 [inline]
   new_sync_write fs/read_write.c:469 [inline]
   __vfs_write+0x7fb/0x9f0 fs/read_write.c:482
   vfs_write+0x463/0x8d0 fs/read_write.c:544
   SYSC_write+0x172/0x360 fs/read_write.c:589
   SyS_write+0x55/0x80 fs/read_write.c:581
   do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 23aebdacb05d ("ipv6: Compute multipath hash for ICMP errors from offending packet") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Cc: Jakub Sitnicki <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  sctp: add sctp_make_op_error_limited and reuse inner functions  [Marcelo Ricardo Leitner, 1 file, -84/+46]
The idea is quite similar to the old functions, but note that the _fixed function wasn't "fixed" in the sense of generating a packet with a fixed size; rather, the size was limited/bounded to the PMTU. Also, now with sctp_mtu_payload(), we have a more accurate limit. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  sctp: allow sctp_init_cause to return errors  [Marcelo Ricardo Leitner, 1 file, -3/+9]
And do so if the skb doesn't have enough space for the payload. This is a preparation for the next patch. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/tls: Add generic NIC offload infrastructure  [Ilya Lesokhin, 5 files, -3/+1265]
This patch adds a generic infrastructure to offload TLS crypto to a network device. It enables the kernel TLS socket to skip encryption and authentication operations on the transmit side of the data path, leaving those computationally expensive operations to the NIC.

The NIC offload infrastructure builds TLS records and pushes them to the TCP layer just like the SW KTLS implementation, using the same API. TCP segmentation is mostly unaffected. Currently the only exception is that we prevent mixed SKBs where only part of the payload requires offload. In the future we are likely to add a similar restriction following a change cipher spec record.

The notable differences between SW KTLS and NIC offloaded TLS implementations are as follows:
1. The offloaded implementation builds "plaintext TLS records"; those records contain plaintext instead of ciphertext and placeholder bytes instead of authentication tags.
2. The offloaded implementation maintains a mapping from TCP sequence number to TLS records. Thus given a TCP SKB sent from a NIC offloaded TLS socket, we can use the tls NIC offload infrastructure to obtain enough context to encrypt the payload of the SKB. A TLS record is released when the last byte of the record is ack'ed; this is done through the new icsk_clean_acked callback.

The infrastructure should be extendable to support various NIC offload implementations. However it is currently written with the implementation below in mind:

The NIC assumes that packets from each offloaded stream are sent as plaintext and in-order. It keeps track of the TLS records in the TCP stream. When a packet marked for offload is transmitted, the NIC encrypts the payload in-place and puts authentication tags in the relevant placeholders.

The responsibility for handling out-of-order packets (i.e. TCP retransmission, qdisc drops) falls on the netdev driver. The netdev driver keeps track of the expected TCP SN from the NIC's perspective. If the next packet to transmit matches the expected TCP SN, the driver advances the expected TCP SN and transmits the packet with the TLS offload indication. If the next packet to transmit does not match the expected TCP SN, the driver calls the TLS layer to obtain the TLS record that includes the TCP sequence number of the packet to be transmitted. Using this TLS record, the driver posts a work entry on the transmit queue to reconstruct the NIC TLS state required for the offload of the out-of-order packet. It updates the expected TCP SN accordingly and transmits the now in-order packet. The same queue is used for packet transmission and TLS context reconstruction to avoid the need for flushing the transmit queue before issuing the context reconstruction request.

Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net/tls: Split conf to rx + tx  [Boris Pismenny, 2 files, -111/+130]
In TLS inline crypto, we can have one direction in software and another in hardware. Thus, we split the TLS configuration to separate structures for receive and transmit. Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Add TLS TX offload features  [Ilya Lesokhin, 1 file, -0/+1]
This patch adds a netdev feature to configure TLS TX offloads. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Add Software fallback infrastructure for socket dependent offloads  [Ilya Lesokhin, 2 files, -0/+7]
With socket dependent offloads we rely on the netdev to transform the transmitted packets before sending them to the wire. When a packet from an offloaded socket is rerouted to a different device we need to detect it and do the transformation in software. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  net: Rename and export copy_skb_header  [Ilya Lesokhin, 1 file, -4/+5]
copy_skb_header is renamed to skb_copy_header and exported. Exposing this function give more flexibility in copying SKBs. skb_copy and skb_copy_expand do not give enough control over which parts are copied. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-05-01  tcp: Add clean acked data hook  [Ilya Lesokhin, 1 file, -0/+25]
Called when a TCP segment is acknowledged. Could be used by application protocols who hold additional metadata associated with the stream data. This is required by TLS device offload to release metadata associated with acknowledged TLS records. Signed-off-by: Ilya Lesokhin <[email protected]> Signed-off-by: Boris Pismenny <[email protected]> Signed-off-by: Aviad Yehezkel <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-30  net: bridge: Publish bridge accessor functions  [Petr Machata, 3 files, -0/+72]
Add a couple new functions to allow querying FDB and vlan settings of a bridge. Signed-off-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-30  ipv6: sr: extract the right key values for "seg6_make_flowlabel"  [Ahmed Abdelsalam, 1 file, -1/+1]
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the flowlabel from a given skb. It relies on skb_get_hash(), which eventually calls __skb_flow_dissect() to extract the flow_keys struct values from the skb. In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(), skb_reset_network_header(), and skb_mac_header_rebuild() will result in a flow_keys struct with all key values set to zero.

This patch calls seg6_make_flowlabel() before resetting the headers of the skb to get the right key values. The extracted key values are based on the type of the inner packet, as follows:
1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of the inner packet.
2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port.
3) L2 traffic: depends on what kind of traffic is carried in the L2 frame. IPv6 and IPv4 traffic work as discussed in 1) and 2).

Here is a hex dump of struct flow_keys for IPv4 and IPv6 traffic:

  10.100.1.100: 47302 > 30.0.0.2: 5001
  00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
  00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
  00000020: 0a 64 01 64

  fc00:a1:a > b2::2
  00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
  00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
  00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
  00000030: 00 00 00 00 00 00 00 00 00 00 00 0a

Signed-off-by: Ahmed Abdelsalam <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-30  erspan: auto detect truncated packets.  [William Tu, 2 files, -0/+12]
Currently the truncated bit is set only when the mirrored packet is larger than the mtu. For certain cases, the packet might already have been truncated before being sent to the erspan tunnel. In this case, the patch detects whether the IP header's total length is larger than the actual skb->len. If true, this indicates that the mirrored packet is truncated, and the erspan truncate bit is set. I tested the patch using the bpf_skb_change_tail helper function to shrink the packet size and send it to the erspan tunnel. Reported-by: Xiaoyan Jin <[email protected]> Signed-off-by: William Tu <[email protected]> Signed-off-by: David S. Miller <[email protected]>
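The detection idea described above can be sketched roughly as follows for the IPv4 case; this is an illustration of the comparison, not the exact erspan diff:

  #include <linux/types.h>
  #include <linux/ip.h>
  #include <linux/skbuff.h>

  /* If the length the inner IPv4 header claims exceeds what the skb actually
   * holds, the mirrored packet was truncated before it reached the tunnel.
   */
  static bool toy_mirrored_pkt_truncated(const struct sk_buff *skb)
  {
          const struct iphdr *iph = ip_hdr(skb);

          return ntohs(iph->tot_len) > skb->len;
  }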
2018-04-29  net: core: Assert the size of netdev_features_t  [Florian Fainelli, 1 file, -0/+1]
We have about 53 netdev_features_t bits defined and counting; add a build-time check to catch when a u64 type will not be enough and we will have to convert that to a bitmap. This is done in register_netdevice() for convenience. Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
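Such a check amounts to a one-line BUILD_BUG_ON; a sketch of the idea (the helper name here is made up, and per the related inlining commit above the real check ends up directly in register_netdevice()):

  #include <linux/bug.h>
  #include <linux/bitops.h>
  #include <linux/netdev_features.h>

  /* Fail the build once the number of feature bits no longer fits in the
   * u64-backed netdev_features_t.
   */
  static inline void toy_netdev_features_size_check(void)
  {
          BUILD_BUG_ON(sizeof(netdev_features_t) * BITS_PER_BYTE < NETDEV_FEATURE_COUNT);
  }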
2018-04-29  net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash  [Alexander Duyck, 1 file, -6/+4]
I am dropping the export of __skb_tx_hash as after my patches nobody is using it outside of the net/core/dev.c file. In addition I am renaming and repurposing it to just be a static declaration of skb_tx_hash since that was the only user for it at this point. By doing this the compiler can inline it into __netdev_pick_tx as that will improve performance. Signed-off-by: Alexander Duyck <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-29  tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive  [Eric Dumazet, 3 files, -90/+108]
When adding the tcp mmap() implementation, I forgot that the socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in the tcp mmap() handler, we have to split the operation in two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking.

2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfer of pages from skbs to one VMA. This operation only uses down_read(&current->mm->mmap_sem) after holding the TCP lock, thus solving the lockdep issue.

This new implementation was suggested by Andy Lutomirski with great details. Benefits are:

- Better scalability, in case multiple threads reuse VMAs (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, the kernel can report how many bytes were successfully mapped, and how many bytes should be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed while mmap_sem has been released.

A following patch makes the change in the tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...). Note that memcg might require additional changes.

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Suggested-by: Andy Lutomirski <[email protected]> Cc: [email protected] Acked-by: Soheil Hassas Yeganeh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
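A rough user-space sketch of the two phases on a connected socket `fd`; the struct layout shown is the initial uapi from this series (later kernels extend it), and the fallback definitions are only for headers that predate it:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/tcp.h>

  #ifndef TCP_ZEROCOPY_RECEIVE            /* fallback if installed headers
                                           * predate this series              */
  #define TCP_ZEROCOPY_RECEIVE 35
  struct tcp_zerocopy_receive {           /* initial layout; later extended   */
          uint64_t address;
          uint32_t length;
          uint32_t recv_skip_hint;
  };
  #endif

  /* Map up to 'size' bytes of queued payload from connected TCP socket 'fd'. */
  static int zerocopy_read(int fd, size_t size)
  {
          struct tcp_zerocopy_receive zc;
          socklen_t zc_len = sizeof(zc);
          void *addr;

          /* Phase 1: reserve VMA space, no TCP locking involved. */
          addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
          if (addr == MAP_FAILED)
                  return -1;

          /* Phase 2: ask TCP to map received pages into that VMA. */
          memset(&zc, 0, sizeof(zc));
          zc.address = (uint64_t)(unsigned long)addr;
          zc.length  = size;
          if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) == 0)
                  printf("mapped %u bytes, %u bytes left for regular recv()\n",
                         zc.length, zc.recv_skip_hint);

          munmap(addr, size);
          return 0;
  }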
2018-04-29  bridge: check iface upper dev when setting master via ioctl  [Hangbin Liu, 1 file, -2/+2]
When we set a bond slave's master to bridge via ioctl, we only check the IFF_BRIDGE_PORT flag. Although we will find the slave's real master at netdev_master_upper_dev_link() later, it already does some settings and allocates some resources. It would be better to return as early as possible. v1 -> v2: use netdev_master_upper_dev_get() instead of netdev_has_any_upper_dev() to check if we have a master, because not all upper devs are masters, e.g. vlan device. Reported-by: [email protected] Signed-off-by: Hangbin Liu <[email protected]> Acked-by: Nikolay Aleksandrov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  udp: remove stray export symbol  [Willem de Bruijn, 1 file, -1/+0]
UDP GSO needs to export __udp_gso_segment to call it from ipv6. I accidentally exported static ipv4 function __udp4_gso_segment. Remove that EXPORT_SYMBOL_GPL. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Willem de Bruijn <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  net: support compat 64-bit time in {s,g}etsockopt  [Lance Richardson, 1 file, -2/+4]
For the x32 ABI, struct timeval has two 64-bit fields. However the kernel currently interprets the user-space values used for the SO_RCVTIMEO and SO_SNDTIMEO socket options as having a pair of 32-bit fields. When the seconds portion of the requested timeout is less than 2**32, the seconds portion of the effective timeout is correct but the microseconds portion is zero. When the seconds portion of the requested timeout is zero and the microseconds portion is non-zero, the kernel interprets the timeout as zero (never timeout). Fix by using 64-bit time for SO_RCVTIMEO/SO_SNDTIMEO as required for the ABI.

The code included below demonstrates the problem.

Results before patch:

  $ gcc -m64 -Wall -O2 -o socktmo socktmo.c && ./socktmo
  recv time: 2.008181 seconds
  send time: 2.015985 seconds
  $ gcc -m32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
  recv time: 2.016763 seconds
  send time: 2.016062 seconds
  $ gcc -mx32 -Wall -O2 -o socktmo socktmo.c && ./socktmo
  recv time: 1.007239 seconds
  send time: 1.023890 seconds

Results after patch:

  $ gcc -m64 -O2 -Wall -o socktmo socktmo.c && ./socktmo
  recv time: 2.010062 seconds
  send time: 2.015836 seconds
  $ gcc -m32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
  recv time: 2.013974 seconds
  send time: 2.015981 seconds
  $ gcc -mx32 -O2 -Wall -o socktmo socktmo.c && ./socktmo
  recv time: 2.030257 seconds
  send time: 2.013383 seconds

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/time.h>

  void checkrc(char *str, int rc)
  {
          if (rc >= 0)
                  return;
          perror(str);
          exit(1);
  }

  static char buf[1024];

  int main(int argc, char **argv)
  {
          int rc;
          int socks[2];
          struct timeval tv;
          struct timeval start, end, delta;

          rc = socketpair(AF_UNIX, SOCK_STREAM, 0, socks);
          checkrc("socketpair", rc);

          /* set timeout to 1.999999 seconds */
          tv.tv_sec = 1;
          tv.tv_usec = 999999;
          rc = setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
          rc = setsockopt(socks[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
          checkrc("setsockopt", rc);

          /* measure actual receive timeout */
          gettimeofday(&start, NULL);
          rc = recv(socks[0], buf, sizeof buf, 0);
          gettimeofday(&end, NULL);
          timersub(&end, &start, &delta);
          printf("recv time: %ld.%06ld seconds\n",
                 (long)delta.tv_sec, (long)delta.tv_usec);

          /* fill send buffer */
          do {
                  rc = send(socks[0], buf, sizeof buf, 0);
          } while (rc > 0);

          /* measure actual send timeout */
          gettimeofday(&start, NULL);
          rc = send(socks[0], buf, sizeof buf, 0);
          gettimeofday(&end, NULL);
          timersub(&end, &start, &delta);
          printf("send time: %ld.%06ld seconds\n",
                 (long)delta.tv_sec, (long)delta.tv_usec);
          exit(0);
  }

Fixes: 515c7af85ed9 ("x32: Use compat shims for {g,s}etsockopt") Reported-by: Gopal RajagopalSai <[email protected]> Signed-off-by: Lance Richardson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  net: qrtr: Expose tunneling endpoint to user space  [Bjorn Andersson, 3 files, -0/+170]
This implements a misc character device named "qrtr-tun" for the purpose of allowing user space applications to implement endpoints in the qrtr network. This allows more advanced (and dynamic) testing of the qrtr code as well as opens up the ability of tunneling qrtr over a network or USB link. Signed-off-by: Bjorn Andersson <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: allow unsetting sockopt MAXSEG  [Marcelo Ricardo Leitner, 1 file, -7/+0]
RFC 6458 Section 8.1.16 says that setting MAXSEG to 0 means that the user is not limiting it, not that it should be set to the *current* maximum, as we are doing. This patch thus allows setting it to 0, effectively removing the user limit. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
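For example, a user-space caller could clear the limit like this (a sketch; it assumes the lksctp-tools <netinet/sctp.h> header and uses the struct sctp_assoc_value form from RFC 6458):

  #include <string.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <netinet/sctp.h>       /* struct sctp_assoc_value, SCTP_MAXSEG */

  /* Remove the user-imposed fragmentation limit on SCTP socket 'fd';
   * with this patch a value of 0 means "no user limit" (RFC 6458 8.1.16).
   */
  static int sctp_clear_maxseg(int fd)
  {
          struct sctp_assoc_value av;

          memset(&av, 0, sizeof(av));
          av.assoc_id    = 0;     /* one-to-one socket / future associations */
          av.assoc_value = 0;     /* 0 == let the PMTU decide                */

          return setsockopt(fd, IPPROTO_SCTP, SCTP_MAXSEG, &av, sizeof(av));
  }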
2018-04-27  sctp: consider idata chunks when setting SCTP_MAXSEG  [Marcelo Ricardo Leitner, 1 file, -3/+6]
When setting the SCTP_MAXSEG sock option, it should consider which kind of data chunk is being used if the asoc is already available, so that the limit better reflects reality. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: honor PMTU_DISABLED when handling icmp  [Marcelo Ricardo Leitner, 1 file, -3/+5]
sctp_sendmsg() could trigger PMTU updates even when PMTU_DISABLED was set, as pmtu_pending could be set unconditionally during icmp handling if the socket was in use by the application. This patch fixes it by checking for PMTU_DISABLED when handling such deferred updates. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: re-use sctp_transport_pmtu in sctp_transport_route  [Marcelo Ricardo Leitner, 2 files, -22/+19]
sctp_transport_route currently is very similar to sctp_transport_pmtu plus a few other bits. This patch reuses sctp_transport_pmtu in sctp_transport_route and removes the duplicated code. Also, as all calls to sctp_transport_route were forcing the dst release before calling it, let's just include such release too. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: remove sctp_transport_pmtu_check  [Marcelo Ricardo Leitner, 1 file, -3/+0]
We are now keeping the MTU information synced between asoc, transport and dst, which makes the check at sctp_packet_config() not needed anymore. As that was the sole caller of this function, let's remove it. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: introduce sctp_dst_mtu  [Marcelo Ricardo Leitner, 2 files, -7/+5]
Which makes sure that the MTU respects the minimum value of SCTP_DEFAULT_MINSEGMENT and that it is correctly aligned. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: remove sctp_assoc_pending_pmtu  [Marcelo Ricardo Leitner, 1 file, -2/+4]
No need for this helper. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: introduce sctp_assoc_update_frag_point  [Marcelo Ricardo Leitner, 3 files, -19/+19]
and avoid the open-coded versions of it. Now sctp_datamsg_from_user can just re-use asoc->frag_point as it will always be updated. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: introduce sctp_mtu_payload  [Marcelo Ricardo Leitner, 2 files, -20/+12]
When given a MTU, this function calculates how much payload we can carry on it. Without a MTU, it calculates the amount of header overhead we have. So that when we have extra overhead, like the one added for IP options on SELinux patches, it is easier to handle it. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
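As a back-of-the-envelope illustration of that calculation (the overhead constants below are just the common IPv4, SCTP common header and DATA chunk header sizes; the kernel helper additionally accounts for IP options, AUTH overhead and I-DATA chunk headers):

  #include <stdio.h>

  /* Rough model of "how much payload fits in this MTU", rounded down to a
   * 4-byte boundary as SCTP chunks must be.
   */
  static unsigned int toy_mtu_payload(unsigned int mtu)
  {
          const unsigned int overhead = 20 /* IPv4 */ + 12 /* SCTP */ + 16 /* DATA */;

          if (mtu <= overhead)
                  return 0;
          return (mtu - overhead) & ~3U;
  }

  int main(void)
  {
          printf("payload for a 1500-byte MTU: %u bytes\n", toy_mtu_payload(1500));
          return 0;
  }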
2018-04-27  sctp: introduce sctp_assoc_set_pmtu  [Marcelo Ricardo Leitner, 2 files, -9/+14]
All changes to asoc PMTU should now go through this wrapper, making it easier to track them and to do other actions upon it. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: remove an if() that is always true  [Marcelo Ricardo Leitner, 1 file, -4/+2]
As noticed by Xin Long, the if() here is always true as PMTU can never be 0. Reported-by: Xin Long <[email protected]> Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  sctp: move transport pathmtu calc away of sctp_assoc_add_peer  [Marcelo Ricardo Leitner, 2 files, -10/+7]
There was only one case that sctp_assoc_add_peer couldn't handle, which is when SPP_PMTUD_DISABLE is set and pathmtu not initialized. So add this situation to sctp_transport_route and reuse what was already in there. Signed-off-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-27  tcp: remove mss check in tcp_select_initial_window()  [Wei Wang, 1 file, -5/+3]
In tcp_select_initial_window(), we only set rcv_wnd to tcp_default_init_rwnd() if current mss > (1 << wscale). Otherwise, rcv_wnd is kept at the full receive space of the socket which is a value way larger than tcp_default_init_rwnd(). With larger initial rcv_wnd value, receive buffer autotuning logic takes longer to kick in and increase the receive buffer. In a TCP throughput test where receiver has rmem[2] set to 125MB (wscale is 11), we see the connection gets recvbuf limited at the beginning of the connection and gets less throughput overall. Signed-off-by: Wei Wang <[email protected]> Acked-by: Eric Dumazet <[email protected]> Acked-by: Soheil Hassas Yeganeh <[email protected]> Acked-by: Yuchung Cheng <[email protected]> Acked-by: Neal Cardwell <[email protected]> Signed-off-by: David S. Miller <[email protected]>