aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2020-05-20batadv_socket_read(): get rid of pointless access_ok()Al Viro1-3/+0
address is passed only to copy_to_user() Signed-off-by: Al Viro <[email protected]>
2020-05-20get rid of compat_mc_setsockopt()Al Viro1-90/+0
not used anymore Signed-off-by: Al Viro <[email protected]>
2020-05-20handle the group_source_req options directlyAl Viro2-4/+42
Native ->setsockopt() handling of these options (MCAST_..._SOURCE_GROUP and MCAST_{,UN}BLOCK_SOURCE) consists of copyin + call of a helper that does the actual work. The only change needed for ->compat_setsockopt() is a slightly different copyin - the helpers can be reused as-is. Signed-off-by: Al Viro <[email protected]>
2020-05-20ipv6: take handling of group_source_req options into a helperAl Viro1-29/+36
Signed-off-by: Al Viro <[email protected]>
2020-05-20ipv4: take handling of group_source_req options into a helperAl Viro1-39/+44
Signed-off-by: Al Viro <[email protected]>
2020-05-20ipv[46]: do compat setsockopt for MCAST_{JOIN,LEAVE}_GROUP directlyAl Viro2-0/+59
direct parallel to the way these two are handled in the native ->setsockopt() instances - the helpers that do the real work are already separated and can be reused as-is in this case. Signed-off-by: Al Viro <[email protected]>
2020-05-20ipv6: do compat setsockopt for MCAST_MSFILTER directlyAl Viro1-1/+47
similar to the ipv4 counterpart of that patch - the same trick used to align the tail array properly. Signed-off-by: Al Viro <[email protected]>
2020-05-20ip6_mc_msfilter(): pass the address list separatelyAl Viro2-4/+5
that way we'll be able to reuse it for compat case Signed-off-by: Al Viro <[email protected]>
2020-05-20ipv4: do compat setsockopt for MCAST_MSFILTER directlyAl Viro1-1/+47
Parallel to what the native setsockopt() does, except that unlike the native setsockopt() we do not use memdup_user() - we want the sockaddr_storage fields properly aligned, so we allocate 4 bytes more and copy compat_group_filter at the offset 4, which yields the proper alignments. Signed-off-by: Al Viro <[email protected]>
2020-05-20set_mcast_msfilter(): take the guts of setsockopt(MCAST_MSFILTER) into a helperAl Viro1-33/+40
Signed-off-by: Al Viro <[email protected]>
2020-05-20get rid of compat_mc_getsockopt()Al Viro3-85/+79
now we can do MCAST_MSFILTER in compat ->getsockopt() without playing silly buggers with copying things back and forth. We can form a native struct group_filter (sans the variable-length tail) on stack, pass that + pointer to the tail of original request to the helper doing the bulk of the work, then do the rest of copyout - same as the native getsockopt() does. Signed-off-by: Al Viro <[email protected]>
2020-05-20ip*_mc_gsfget(): lift copyout of struct group_filter into callersAl Viro4-29/+36
pass the userland pointer to the array in its tail, so that part gets copied out by our functions; copyout of everything else is done in the callers. Rationale: reuse for compat; the array is the same in native and compat, the layout of parts before it is different for compat. Signed-off-by: Al Viro <[email protected]>
2020-05-20compat_ip{,v6}_setsockopt(): enumerate MCAST_... options explicitlyAl Viro2-2/+18
We want to check if optname is among the MCAST_... ones; do that as an explicit switch. Signed-off-by: Al Viro <[email protected]>
2020-05-20lift compat definitions of mcast [sg]etsockopt requests into net/compat.hAl Viro1-25/+0
We want to get rid of compat_mc_[sg]etsockopt() and to have that stuff handled without compat_alloc_user_space(), extra copying through userland, etc. To do that we'll need ipv4 and ipv6 instances of ->compat_[sg]etsockopt() to manipulate the 32bit variants of mcast requests, so we need to move the definitions of those out of net/compat.c and into a public header. This patch just does a mechanical move to include/net/compat.h Signed-off-by: Al Viro <[email protected]>
2020-05-20SUNRPC: Clean up request deferral tracepointsChuck Lever1-6/+6
- Rename these so they are easy to enable and search for as a set - Move the tracepoints to get a more accurate sense of control flow - Tracepoints should not fire on xprt shutdown - Display memory address in case data structure had been corrupted - Abandon dprintk in these paths I haven't ever gotten one of these tracepoints to trigger. I wonder if we should simply remove them. Signed-off-by: Chuck Lever <[email protected]>
2020-05-20SUNRPC: Restructure svc_udp_recvfrom()Chuck Lever1-25/+35
Clean up. At this point, we are not ready yet to support bio_vecs in the UDP transport implementation. However, we can clean up svc_udp_recvfrom() to match the tracing and straight-lining recently changes made in svc_tcp_recvfrom(). Signed-off-by: Chuck Lever <[email protected]>
2020-05-20SUNRPC: Refactor svc_recvfrom()Chuck Lever1-41/+68
This function is not currently "generic" so remove the documenting comment and rename it appropriately. Its internals are converted to use bio_vecs for reading from the transport socket. In existing typical sunrpc uses of bio_vecs, the bio_vec array is allocated dynamically. Here, instead, an array of bio_vecs is added to svc_rqst. The lifetime of this array can be greater than one call to xpo_recvfrom(): - Multiple calls to xpo_recvfrom() might be needed to read an RPC message completely. - At some later point, rq_arg.bvecs will point to this array and it will carry the received message into svc_process(). I also expect that a future optimization will remove either the rq_vec or rq_pages array in favor of rq_bvec, thus conserving the size of struct svc_rqst. Signed-off-by: Chuck Lever <[email protected]>
2020-05-20rds: fix crash in rds_info_getsockopt()John Hubbard1-1/+2
The conversion to pin_user_pages() had a bug: it overlooked the case of allocation of pages failing. Fix that by restoring an equivalent check. Reported-by: [email protected] Fixes: dbfe7d74376e ("rds: convert get_user_pages() --> pin_user_pages()") Cc: David S. Miller <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: John Hubbard <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-20SUNRPC: Clean up svc_release_skb() functionsChuck Lever1-8/+15
Rename these functions using the convention used for other xpo method entry points. Signed-off-by: Chuck Lever <[email protected]>
2020-05-20SUNRPC: Refactor recvfrom path dealing with incomplete TCP receivesChuck Lever1-20/+19
Clean up: move exception processing out of the main path. Signed-off-by: Chuck Lever <[email protected]>
2020-05-20SUNRPC: Replace dprintk() call sites in TCP receive pathChuck Lever1-14/+2
Signed-off-by: Chuck Lever <[email protected]>
2020-05-20fs: make the pipe_buf_operations ->confirm operation optionalChristoph Hellwig1-1/+0
Just return 0 for success if it is not present. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-05-20fs: make the pipe_buf_operations ->steal operation optionalChristoph Hellwig1-7/+0
Just return 1 for failure if it is not present. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-05-20Bluetooth: Fix assuming EIR flags can result in SSP authenticationLuiz Augusto von Dentz1-2/+0
EIR flags should just hint if SSP may be supported but we shall verify this with use of the actual features as the SSP bits may be disabled in the lower layers which would result in legacy authentication to be used. Signed-off-by: Luiz Augusto von Dentz <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
2020-05-20rxrpc: Fix ack discardDavid Howells1-4/+26
The Rx protocol has a "previousPacket" field in it that is not handled in the same way by all protocol implementations. Sometimes it contains the serial number of the last DATA packet received, sometimes the sequence number of the last DATA packet received and sometimes the highest sequence number so far received. AF_RXRPC is using this to weed out ACKs that are out of date (it's possible for ACK packets to get reordered on the wire), but this does not work with OpenAFS which will just stick the sequence number of the last packet seen into previousPacket. The issue being seen is that big AFS FS.StoreData RPC (eg. of ~256MiB) are timing out when partly sent. A trace was captured, with an additional tracepoint to show ACKs being discarded in rxrpc_input_ack(). Here's an excerpt showing the problem. 52873.203230: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 0002449c q=00024499 fl=09 A DATA packet with sequence number 00024499 has been transmitted (the "q=" field). ... 52873.243296: rxrpc_rx_ack: c=000004ae 00012a2b DLY r=00024499 f=00024497 p=00024496 n=0 52873.243376: rxrpc_rx_ack: c=000004ae 00012a2c IDL r=0002449b f=00024499 p=00024498 n=0 52873.243383: rxrpc_rx_ack: c=000004ae 00012a2d OOS r=0002449d f=00024499 p=0002449a n=2 The Out-Of-Sequence ACK indicates that the server didn't see DATA sequence number 00024499, but did see seq 0002449a (previousPacket, shown as "p=", skipped the number, but firstPacket, "f=", which shows the bottom of the window is set at that point). 52873.252663: rxrpc_retransmit: c=000004ae q=24499 a=02 xp=14581537 52873.252664: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244bc q=00024499 fl=0b *RETRANS* The packet has been retransmitted. Retransmission recurs until the peer says it got the packet. 52873.271013: rxrpc_rx_ack: c=000004ae 00012a31 OOS r=000244a1 f=00024499 p=0002449e n=6 More OOS ACKs indicate that the other packets that are already in the transmission pipeline are being received. The specific-ACK list is up to 6 ACKs and NAKs. ... 52873.284792: rxrpc_rx_ack: c=000004ae 00012a49 OOS r=000244b9 f=00024499 p=000244b6 n=30 52873.284802: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=63505500 52873.284804: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c2 q=00024499 fl=0b *RETRANS* 52873.287468: rxrpc_rx_ack: c=000004ae 00012a4a OOS r=000244ba f=00024499 p=000244b7 n=31 52873.287478: rxrpc_rx_ack: c=000004ae 00012a4b OOS r=000244bb f=00024499 p=000244b8 n=32 At this point, the server's receive window is full (n=32) with presumably 1 NAK'd packet and 31 ACK'd packets. We can't transmit any more packets. 52873.287488: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=61327980 52873.287489: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c3 q=00024499 fl=0b *RETRANS* 52873.293850: rxrpc_rx_ack: c=000004ae 00012a4c DLY r=000244bc f=000244a0 p=00024499 n=25 And now we've received an ACK indicating that a DATA retransmission was received. 7 packets have been processed (the occupied part of the window moved, as indicated by f= and n=). 52873.293853: rxrpc_rx_discard_ack: c=000004ae r=00012a4c 000244a0<00024499 00024499<000244b8 However, the DLY ACK gets discarded because its previousPacket has gone backwards (from p=000244b8, in the ACK at 52873.287478 to p=00024499 in the ACK at 52873.293850). We then end up in a continuous cycle of retransmit/discard. kafs fails to update its window because it's discarding the ACKs and can't transmit an extra packet that would clear the issue because the window is full. OpenAFS doesn't change the previousPacket value in the ACKs because no new DATA packets are received with a different previousPacket number. Fix this by altering the discard check to only discard an ACK based on previousPacket if there was no advance in the firstPacket. This allows us to transmit a new packet which will cause previousPacket to advance in the next ACK. The check, however, needs to allow for the possibility that previousPacket may actually have had the serial number placed in it instead - in which case it will go outside the window and we should ignore it. Fixes: 1a2391c30c0b ("rxrpc: Fix detection of out of order acks") Reported-by: Dave Botsch <[email protected]> Signed-off-by: David Howells <[email protected]>
2020-05-20rxrpc: Trace discarded ACKsDavid Howells1-2/+10
Add a tracepoint to track received ACKs that are discarded due to being outside of the Tx window. Signed-off-by: David Howells <[email protected]>
2020-05-20Bluetooth: Consolidate encryption handling in hci_encrypt_cfmLuiz Augusto von Dentz1-25/+3
This makes hci_encrypt_cfm calls hci_connect_cfm in case the connection state is BT_CONFIG so callers don't have to check the state. Signed-off-by: Luiz Augusto von Dentz <[email protected]> Signed-off-by: Marcel Holtmann <[email protected]>
2020-05-19net: unexport skb_gro_receive()Eric Dumazet1-2/+0
skb_gro_receive() used to be used by SCTP, it is no longer the case. skb_gro_receive_list() is in the same category : never used from modules. Signed-off-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19sctp: Don't add the shutdown timer if its already been addedNeil Horman1-3/+11
This BUG halt was reported a while back, but the patch somehow got missed: PID: 2879 TASK: c16adaa0 CPU: 1 COMMAND: "sctpn" #0 [f418dd28] crash_kexec at c04a7d8c #1 [f418dd7c] oops_end at c0863e02 #2 [f418dd90] do_invalid_op at c040aaca #3 [f418de28] error_code (via invalid_op) at c08631a5 EAX: f34baac0 EBX: 00000090 ECX: f418deb0 EDX: f5542950 EBP: 00000000 DS: 007b ESI: f34ba800 ES: 007b EDI: f418dea0 GS: 00e0 CS: 0060 EIP: c046fa5e ERR: ffffffff EFLAGS: 00010286 #4 [f418de5c] add_timer at c046fa5e #5 [f418de68] sctp_do_sm at f8db8c77 [sctp] #6 [f418df30] sctp_primitive_SHUTDOWN at f8dcc1b5 [sctp] #7 [f418df48] inet_shutdown at c080baf9 #8 [f418df5c] sys_shutdown at c079eedf #9 [f418df70] sys_socketcall at c079fe88 EAX: ffffffda EBX: 0000000d ECX: bfceea90 EDX: 0937af98 DS: 007b ESI: 0000000c ES: 007b EDI: b7150ae4 SS: 007b ESP: bfceea7c EBP: bfceeaa8 GS: 0033 CS: 0073 EIP: b775c424 ERR: 00000066 EFLAGS: 00000282 It appears that the side effect that starts the shutdown timer was processed multiple times, which can happen as multiple paths can trigger it. This of course leads to the BUG halt in add_timer getting called. Fix seems pretty straightforward, just check before the timer is added if its already been started. If it has mod the timer instead to min(current expiration, new expiration) Its been tested but not confirmed to fix the problem, as the issue has only occured in production environments where test kernels are enjoined from being installed. It appears to be a sane fix to me though. Also, recentely, Jere found a reproducer posted on list to confirm that this resolves the issues Signed-off-by: Neil Horman <[email protected]> CC: Vlad Yasevich <[email protected]> CC: "David S. Miller" <[email protected]> CC: [email protected] CC: [email protected] CC: [email protected] Acked-by: Marcelo Ricardo Leitner <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19ipv6: use ->ndo_tunnel_ctl in addrconf_set_dstaddrChristoph Hellwig1-7/+2
Use the new ->ndo_tunnel_ctl instead of overriding the address limit and using ->ndo_do_ioctl just to do a pointless user copy. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19ipv6: streamline addrconf_set_dstaddrChristoph Hellwig1-49/+38
Factor out a addrconf_set_sit_dstaddr helper for the actual work if we found a SIT device, and only hold the rtnl lock around the device lookup and that new helper, as there is no point in holding it over a copy_from_user call. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19ipv6: stub out even more of addrconf_set_dstaddr if SIT is disabledChristoph Hellwig1-2/+3
There is no point in copying the structure from userspace or looking up a device if SIT support is not disabled and we'll eventually return -ENODEV anyway. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19sit: impement ->ndo_tunnel_ctlChristoph Hellwig1-39/+34
Implement the ->ndo_tunnel_ctl method, and use ip_tunnel_ioctl to handle userspace requests for the SIOCGETTUNNEL, SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19sit: refactor ipip6_tunnel_ioctlChristoph Hellwig1-158/+210
Split the ioctl handler into one function per command instead of having a all the logic sit in one giant switch statement. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19impr: use ->ndo_tunnel_ctl in ipmr_new_tunnelChristoph Hellwig1-11/+3
Use the new ->ndo_tunnel_ctl instead of overriding the address limit and using ->ndo_do_ioctl just to do a pointless user copy. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net: add a new ndo_tunnel_ioctl methodChristoph Hellwig4-62/+51
This method is used to properly allow kernel callers of the IPv4 route management ioctls. The exsting ip_tunnel_ioctl helper is renamed to ip_tunnel_ctl to better reflect that it doesn't directly implement ioctls touching user memory, and is used for the guts of ndo_tunnel_ctl implementations. A new ip_tunnel_ioctl helper is added that can be wired up directly to the ndo_do_ioctl method and takes care of the copy to and from userspace. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19ipv4: consolidate the VIFF_TUNNEL handling in ipmr_new_tunnelChristoph Hellwig1-40/+13
Also move the dev_set_allmulti call and the error handling into the ioctl helper. This allows reusing already looked up tunnel_dev pointer and the set up argument structure for the deletion in the error handler. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19ipv4: streamline ipmr_new_tunnelChristoph Hellwig1-37/+36
Reduce a few level of indentation to simplify the function. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19__netif_receive_skb_core: pass skb by referenceBoris Sukholitko1-5/+15
__netif_receive_skb_core may change the skb pointer passed into it (e.g. in rx_handler). The original skb may be freed as a result of this operation. The callers of __netif_receive_skb_core may further process original skb by using pt_prev pointer returned by __netif_receive_skb_core thus leading to unpleasant effects. The solution is to pass skb by reference into __netif_receive_skb_core. v2: Added Fixes tag and comment regarding ppt_prev and skb invariant. Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup") Signed-off-by: Boris Sukholitko <[email protected]> Acked-by: Edward Cree <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net: inet_csk: Fix so_reuseport bind-address cache in tb->fast*Martin KaFai Lau1-19/+24
The commit 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk") added a bind-address cache in tb->fast*. The tb->fast* caches the address of a sk which has successfully been binded with SO_REUSEPORT ON. The idea is to avoid the expensive conflict search in inet_csk_bind_conflict(). There is an issue with wildcard matching where sk_reuseport_match() should have returned false but it is currently returning true. It ends up hiding bind conflict. For example, bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */ bind("[::2]:443"); /* with SO_REUSEPORT. Succeed. */ bind("[::]:443"); /* with SO_REUSEPORT. Still Succeed where it shouldn't */ The last bind("[::]:443") with SO_REUSEPORT on should have failed because it should have a conflict with the very first bind("[::1]:443") which has SO_REUSEPORT off. However, the address "[::2]" is cached in tb->fast* in the second bind. In the last bind, the sk_reuseport_match() returns true because the binding sk's wildcard addr "[::]" matches with the "[::2]" cached in tb->fast*. The correct bind conflict is reported by removing the second bind such that tb->fast* cache is not involved and forces the bind("[::]:443") to go through the inet_csk_bind_conflict(): bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */ bind("[::]:443"); /* with SO_REUSEPORT. -EADDRINUSE */ The expected behavior for sk_reuseport_match() is, it should only allow the "cached" tb->fast* address to be used as a wildcard match but not the address of the binding sk. To do that, the current "bool match_wildcard" arg is split into "bool match_sk1_wildcard" and "bool match_sk2_wildcard". This change only affects the sk_reuseport_match() which is only used by inet_csk (e.g. TCP). The other use cases are calling inet_rcv_saddr_equal() and this patch makes it pass the same "match_wildcard" arg twice to the "ipv[46]_rcv_saddr_equal(..., match_wildcard, match_wildcard)". Cc: Josef Bacik <[email protected]> Fixes: 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk") Signed-off-by: Martin KaFai Lau <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net/af_iucv: clean up function prototypesJulian Wiedmann1-57/+51
Remove a bunch of forward declarations (trivially shifting code around where needed), and make a few functions static. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net/af_iucv: remove a redundant zero initializationJulian Wiedmann1-1/+0
txmsg is declared as {0}, no need to clear individual fields later on. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net/af_iucv: replace open-coded U16_MAXJulian Wiedmann1-1/+2
Improve the readability of a range check. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net/af_iucv: remove pm supportJulian Wiedmann1-140/+1
commit 394216275c7d ("s390: remove broken hibernate / power management support") removed support for ARCH_HIBERNATION_POSSIBLE from s390. So drop the unused pm ops from the s390-only af_iucv socket code. Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19net/iucv: remove pm supportJulian Wiedmann1-188/+0
commit 394216275c7d ("s390: remove broken hibernate / power management support") removed support for ARCH_HIBERNATION_POSSIBLE from s390. So drop the unused pm ops from the s390-only iucv bus driver. CC: Hendrik Brueckner <[email protected]> Signed-off-by: Julian Wiedmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19mptcp: use rightmost 64 bits in ADD_ADDR HMACTodd Malsbary1-2/+2
This changes the HMAC used in the ADD_ADDR option from the leftmost 64 bits to the rightmost 64 bits as described in RFC 8684, section 3.4.1. This issue was discovered while adding support to packetdrill for the ADD_ADDR v1 option. Fixes: 3df523ab582c ("mptcp: Add ADD_ADDR handling") Signed-off-by: Todd Malsbary <[email protected]> Acked-by: Christoph Paasch <[email protected]> Reviewed-by: Matthieu Baerts <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2020-05-19bpf: Add get{peer, sock}name attach types for sock_addrDaniel Borkmann3-5/+16
As stated in 983695fa6765 ("bpf: fix unconnected udp hooks"), the objective for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be transparent to applications. In Cilium we make use of these hooks [0] in order to enable E-W load balancing for existing Kubernetes service types for all Cilium managed nodes in the cluster. Those backends can be local or remote. The main advantage of this approach is that it operates as close as possible to the socket, and therefore allows to avoid packet-based NAT given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses. This also allows to expose NodePort services on loopback addresses in the host namespace, for example. As another advantage, this also efficiently blocks bind requests for applications in the host namespace for exposed ports. However, one missing item is that we also need to perform reverse xlation for inet{,6}_getname() hooks such that we can return the service IP/port tuple back to the application instead of the remote peer address. The vast majority of applications does not bother about getpeername(), but in a few occasions we've seen breakage when validating the peer's address since it returns unexpectedly the backend tuple instead of the service one. Therefore, this trivial patch allows to customise and adds a getpeername() as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order to address this situation. Simple example: # ./cilium/cilium service list ID Frontend Service Type Backend 1 1.2.3.4:80 ClusterIP 1 => 10.0.0.10:80 Before; curl's verbose output example, no getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] After; with getpeername() reverse xlation: # curl --verbose 1.2.3.4 * Rebuilt URL to: 1.2.3.4/ * Trying 1.2.3.4... * TCP_NODELAY set * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0) > GET / HTTP/1.1 > Host: 1.2.3.4 > User-Agent: curl/7.58.0 > Accept: */* [...] Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed peer to the context similar as in inet{,6}_getname() fashion, but API-wise this is suboptimal as it always enforces programs having to test for ctx->peer which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split. Similarly, the checked return code is on tnum_range(1, 1), but if a use case comes up in future, it can easily be changed to return an error code instead. Helper and ctx member access is the same as with connect/sendmsg/etc hooks. [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: Andrey Ignatov <[email protected]> Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19bpf: Fix too large copy from user in bpf_test_initJesper Dangaard Brouer1-3/+5
Commit bc56c919fce7 ("bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().") recently changed bpf_prog_test_run_xdp() to use larger frames for XDP in order to test tail growing frames (via bpf_xdp_adjust_tail) and to have memory backing frame better resemble drivers. The commit contains a bug, as it tries to copy the max data size from userspace, instead of the size provided by userspace. This cause XDP unit tests to fail sporadically with EFAULT, an unfortunate behavior. The fix is to only copy the size specified by userspace. Fixes: bc56c919fce7 ("bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().") Signed-off-by: Jesper Dangaard Brouer <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/158980712729.256597.6115007718472928659.stgit@firesoul
2020-05-19proc: proc_pid_ns takes super_block as an argumentAlexey Gladkov1-1/+1
syzbot found that touch /proc/testfile causes NULL pointer dereference at tomoyo_get_local_path() because inode of the dentry is NULL. Before c59f415a7cb6, Tomoyo received pid_ns from proc's s_fs_info directly. Since proc_pid_ns() can only work with inode, using it in the tomoyo_get_local_path() was wrong. To avoid creating more functions for getting proc_ns, change the argument type of the proc_pid_ns() function. Then, Tomoyo can use the existing super_block to get pid_ns. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Reported-by: [email protected] Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock") Signed-off-by: Alexey Gladkov <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2020-05-18ipv4,appletalk: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctlChristoph Hellwig3-73/+76
To prepare removing the global routing_ioctl hack start lifting the code into the ipv4 and appletalk ->compat_ioctl handlers. Unlike the existing handler we don't bother copying in the name - there are no compat issues for char arrays. Signed-off-by: Christoph Hellwig <[email protected]> Signed-off-by: David S. Miller <[email protected]>