path: root/net
2017-04-26  tcp: memset ca_priv data to 0 properly  (Wei Wang; 1 file, -8/+3)

Always zero out ca_priv data in tcp_assign_congestion_control() so that ca_priv data is cleared out during socket creation. Also always zero out ca_priv data in tcp_reinit_congestion_control() so that when the cc algorithm is changed, ca_priv data is cleared out as well.

We should zero out ca_priv data even in the TCP_CLOSE state, because a user could call connect() with AF_UNSPEC to disconnect the socket, leave it in TCP_CLOSE state, and later call setsockopt() to switch the cc algorithm on this socket.

Fixes: 2b0a8c9ee ("tcp: add CDG congestion control")
Reported-by: Andrey Konovalov <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Acked-by: Yuchung Cheng <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
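A minimal sketch of the pattern described above; the exact placement inside tcp_assign_congestion_control() is an assumption based on the message:

    static void tcp_assign_congestion_control(struct sock *sk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            ...
            /* Clear per-algorithm private state on every assignment so a
             * socket reused via connect(AF_UNSPEC) followed by
             * setsockopt(TCP_CONGESTION) never sees stale ca_priv data. */
            memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
    }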
2017-04-26  ipv6: check skb->protocol before lookup for nexthop  (WANG Cong; 1 file, -16/+18)

Andrey reported an out-of-bounds access in ip6_tnl_xmit(). It happens because we use an IPv4 dst in ip6_tnl_xmit() and cast an IPv4 neigh key as an IPv6 address:

    neigh = dst_neigh_lookup(skb_dst(skb), &ipv6_hdr(skb)->daddr);
    if (!neigh)
            goto tx_err_link_failure;
    addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE
    addr_type = ipv6_addr_type(addr6);
    if (addr_type == IPV6_ADDR_ANY)
            addr6 = &ipv6_hdr(skb)->daddr;
    memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));

Also, the network header of the skb at this point should still be IPv4 for 4in6 tunnels; we should not simply treat it as an IPv6 header.

This patch fixes it by checking whether skb->protocol is ETH_P_IPV6: if it is, we are safe to do the nexthop lookup using skb_dst() and ipv6_hdr(skb)->daddr; if not (i.e. IPv4), we have no clue which destination address to pick here, so we must rely on callers to fill it in from the tunnel config, and simply fall back to ip6_route_output() to make the decision.

Fixes: ea3dc9601bda ("ip6_tunnel: Add support for wildcard tunnel endpoints.")
Reported-by: Andrey Konovalov <[email protected]>
Tested-by: Andrey Konovalov <[email protected]>
Cc: Steffen Klassert <[email protected]>
Signed-off-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
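A sketch of the fixed control flow, assuming the shape described above (only the IPv6 branch trusts the skb's dst and header):

    if (skb->protocol == htons(ETH_P_IPV6)) {
            struct neighbour *neigh;
            const struct in6_addr *addr6;
            int addr_type;

            /* Safe: the network header really is IPv6 here. */
            neigh = dst_neigh_lookup(skb_dst(skb), &ipv6_hdr(skb)->daddr);
            if (!neigh)
                    goto tx_err_link_failure;

            addr6 = (const struct in6_addr *)&neigh->primary_key;
            addr_type = ipv6_addr_type(addr6);
            if (addr_type == IPV6_ADDR_ANY)
                    addr6 = &ipv6_hdr(skb)->daddr;
            memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
            neigh_release(neigh);
    }
    /* else: IPv4 payload. Leave fl6->daddr to the tunnel config and let
     * ip6_route_output() make the routing decision. */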
2017-04-26  net: core: Prevent from dereferencing null pointer when releasing SKB  (Myungho Jung; 1 file, -0/+3)

Added a NULL check to make __dev_kfree_skb_irq() consistent with the kfree family of functions.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289
Signed-off-by: Myungho Jung <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
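A minimal sketch of the added check (the function lives in net/core/dev.c; surrounding logic elided):

    void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason)
    {
            /* Tolerate NULL, like kfree()/kfree_skb() do, instead of
             * crashing on the refcount access further down. */
            if (unlikely(!skb))
                    return;
            ...
    }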
2017-04-26  tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps  (Eric Dumazet; 2 files, -12/+18)

Some devices or distributions use HZ=100 or HZ=250, and TCP receive buffer autotuning behaves poorly with this choice: since autotuning only runs every 4 ms or 10 ms, short-distance flows get their receive buffer tuned to a very high value, but only after an initial period where it was frozen at its (too small) initial value.

With the tp->tcp_mstamp introduction, we can switch to high resolution timestamps almost for free (at the expense of 8 additional bytes per TCP structure).

Note that some TCP stacks use usec TCP timestamps, where this patch makes even more sense: many TCP flows have < 500 usec RTT. Hopefully this finer TS option can be standardized soon.

Tested: HZ=100 kernel

    ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

Peer without patch:

    lpaa24:~# ss -tmi dst lpaa23
    ...
    skmem:(r0,rb8388608,...) rcv_rtt:10 rcv_space:3210000 minrtt:0.017

Peer with the patch:

    lpaa23:~# ss -tmi dst lpaa24
    ...
    skmem:(r0,rb428800,...) rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see a saner RCVBUF and more precise rcv_rtt information.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: remove ack_time from struct tcp_sacktag_state  (Eric Dumazet; 1 file, -4/+0)

It is no longer needed; everything uses tp->tcp_mstamp instead.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()  (Eric Dumazet; 1 file, -1/+1)

A following patch will remove ack_time from struct tcp_sacktag_state; the same info is now found in tp->tcp_mstamp.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_rack_advance()  (Eric Dumazet; 2 files, -7/+4)

No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_rate_gen()  (Eric Dumazet; 2 files, -5/+5)

No longer needed, since tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_fastretrans_alert()  (Eric Dumazet; 1 file, -8/+4)

Not used anymore now that tp->tcp_mstamp holds the information. This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_rack_identify_loss()  (Eric Dumazet; 1 file, -5/+4)

Not used anymore now that tp->tcp_mstamp holds the information.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_rack_mark_lost()  (Eric Dumazet; 2 files, -2/+2)

This is no longer used, since tcp_rack_detect_loss() takes the timestamp from tp->tcp_mstamp.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: do not pass timestamp to tcp_rack_detect_loss()  (Eric Dumazet; 1 file, -7/+4)

We can use tp->tcp_mstamp, as it contains a recent timestamp. This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout().

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
2017-04-26  tcp: add tp->tcp_mstamp field  (Eric Dumazet; 1 file, -0/+3)

We want to use precise timestamps in the TCP stack, but we do not want to call possibly expensive kernel time services too often. tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific skb_mstamp_get() calls, and removing ack_time from struct tcp_sacktag_state.

Signed-off-by: Eric Dumazet <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
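A minimal sketch of the intended usage; showing the update site in tcp_ack() is an assumption based on the description above:

    static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            /* One (possibly expensive) kernel time call per incoming
             * packet; everything downstream reads the cached value. */
            skb_mstamp_get(&tp->tcp_mstamp);
            ...
    }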
2017-04-26  Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', 'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess  (Al Viro; 104 files, -551/+904)
2017-04-26  xfrm: do the garbage collection after flushing policy  (Xin Long; 1 file, -0/+4)

Now xfrm garbage collection can be triggered by 'ip xfrm policy del'. There is no reason not to do it after flushing policies as well, especially considering that 'garbage collection deferred' is only triggered when gc_thresh is reached.

It's no good that the policy is gone while the xdst is still held. Worse, xdst->route/orig_dst is also held and cannot be released, even if the orig_dst has already expired.

This patch does the garbage collection if any policy was removed in xfrm_policy_flush.

Signed-off-by: Xin Long <[email protected]>
Signed-off-by: Steffen Klassert <[email protected]>
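A sketch of the change, assuming the flush loop counts removed policies in a local cnt variable:

    int xfrm_policy_flush(struct net *net, u8 type, bool task_valid)
    {
            int cnt = 0;
            ...     /* walk and unlink policies, incrementing cnt */

            /* New: if anything was removed, collect stale bundles now
             * instead of waiting for gc_thresh to trigger it. */
            if (cnt)
                    xfrm_garbage_collect(net);
            ...
    }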
2017-04-26  netfilter: don't attach a nat extension by default  (Florian Westphal; 3 files, -12/+2)

Nowadays the nat extension only stores the interface index (used to purge connections that got masqueraded when their interface goes down) and pptp nat information.

Previous patches moved nf_ct_nat_ext_add to those places that need it.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  netfilter: pptp: attach nat extension when needed  (Florian Westphal; 2 files, -6/+31)

Make sure the nat extension gets added if the master conntrack is subject to NAT. This will be required once the nat core stops adding it by default.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  netfilter: masquerade: attach nat extension if not present  (Florian Westphal; 2 files, -3/+7)

Currently the nat extension is always attached as soon as the nat module is loaded. However, most NAT uses do not need the nat extension anymore. Prepare to remove the add-nat-by-default behavior by making those places that need the extension attach it if it is not present yet.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
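A sketch of the lazy-attach pattern at one such call site, using the nf_ct_nat_ext_add() helper named in the neighboring entry (exact context is an assumption):

    /* Attach the nat extension on first use instead of relying on the
     * core to have added it at conntrack creation time. */
    struct nf_conn_nat *nat = nf_ct_nat_ext_add(ct);

    if (nat == NULL)
            return NF_ACCEPT;       /* allocation failed; skip NAT */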
2017-04-26  netfilter: conntrack: handle initial extension alloc via krealloc  (Florian Westphal; 1 file, -36/+15)

krealloc(NULL, ..) is the same as kmalloc(), so we can avoid special-casing the initial allocation after the prealloc removal (we had to use ->alloc_len as the initial allocation size).

This also means we no longer zero the preallocated memory, only offsets[]. Existing code makes sure the newly used extension space gets zeroed out.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
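The key property, sketched against the extension-add path (the exact shape of the surrounding code is an assumption):

    old = ct->ext;
    /* krealloc(NULL, len, gfp) behaves exactly like kmalloc(len, gfp),
     * so the very first extension needs no special case anymore. */
    new = krealloc(old, newlen, gfp);
    if (!new)
            return NULL;
    if (!old)       /* first extension for this conntrack */
            memset(new->offset, 0, sizeof(new->offset));
    /* Zero only the newly used region; nothing else is cleared. */
    memset((void *)new + newoff, 0, newlen - newoff);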
2017-04-26  netfilter: conntrack: mark extension structs as const  (Florian Westphal; 8 files, -9/+9)

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  netfilter: conntrack: remove prealloc support  (Florian Westphal; 2 files, -46/+4)

It was used by the nat extension, but since commit 7c9664351980 ("netfilter: move nat hlist_head to nf_conn") it is only needed for connections that use the MASQUERADE target or a nat helper. Also, it seems a lot easier to preallocate a fixed size instead.

With default settings, conntrack first adds the ecache extension (the sysctl defaults to 1), so we get 40 (ct extension header) + 24 (ecache) = 64 bytes on x86_64 for the initial allocation.

Followup patches can constify the extension structs and avoid the initial zeroing of the entire extension area.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  netfilter: SYNPROXY: Return NF_STOLEN instead of NF_DROP during handshaking  (Gao Feng; 2 files, -13/+28)

The current SYNPROXY code returns NF_DROP during normal TCP handshaking, which is not friendly to the caller, because nf_hook_slow treats NF_DROP as an error and returns -EPERM. As a result, the top-level caller may think it has hit an error.

For example, the following code is from cfv_rx_poll():

    err = netif_receive_skb(skb);
    if (unlikely(err)) {
            ++cfv->ndev->stats.rx_dropped;
    } else {
            ++cfv->ndev->stats.rx_packets;
            cfv->ndev->stats.rx_bytes += skb_len;
    }

When SYNPROXY returns NF_DROP, netif_receive_skb returns -EPERM. As a result, the cfv driver treats it as an error and increases the rx_dropped counter.

So use NF_STOLEN instead of NF_DROP, because no error actually occurred, and free the skb directly.

Signed-off-by: Gao Feng <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
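A sketch of the substitution in one handshake path; the helper name follows the SYNPROXY target code, but the exact context is an assumption:

    /* SYN from client: we answer with a cooked SYN/ACK ourselves, so
     * the packet is fully consumed; there is no error to report up. */
    synproxy_send_client_synack(net, skb, th, &opts);
    consume_skb(skb);
    return NF_STOLEN;       /* previously: return NF_DROP; */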
2017-04-26  ebtables: remove nf_hook_register usage  (Florian Westphal; 4 files, -49/+46)

Similar to ip_register_table, pass nf_hook_ops to ebt_register_table(). This allows hook registration to be handled via pernet_ops as well, and lets us avoid the legacy register_hook api.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  netfilter: decnet: only register hooks in init namespace  (Florian Westphal; 1 file, -2/+2)

It looks like decnet isn't namespacified in the first place, so restrict hook registration to the initial namespace. This prepares for the eventual removal of the legacy nf_register_hook() api.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
2017-04-26  ipvs: convert to use pernet nf_hook api  (Florian Westphal; 1 file, -10/+9)

nf_(un)register_hooks has to maintain an internal hook list in order to add/remove those hooks from net namespaces as they are added/deleted. ipvs already uses pernet_ops, so we can switch to the (more recent) pernet hook api instead.

Compile-tested only.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
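A sketch of the conversion; the hooks array and callback names are assumptions, but nf_register_net_hooks()/nf_unregister_net_hooks() are the pernet api referred to above:

    static int __net_init ipvs_net_init(struct net *net)
    {
            /* Per-netns registration replaces a global
             * nf_register_hooks() list that had to be replayed for
             * every new namespace. */
            return nf_register_net_hooks(net, ip_vs_ops,
                                         ARRAY_SIZE(ip_vs_ops));
    }

    static void __net_exit ipvs_net_exit(struct net *net)
    {
            nf_unregister_net_hooks(net, ip_vs_ops, ARRAY_SIZE(ip_vs_ops));
    }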
2017-04-26  netfilter: synproxy: only register hooks when needed  (Florian Westphal; 2 files, -68/+78)

Defer registration of the synproxy hooks until the first SYNPROXY rule is added. This also means we only register hooks in namespaces that need them.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
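A sketch of the deferred registration from the rule checkentry path; the hook_use counter and ops array names are hypothetical stand-ins for whatever the patch actually tracks:

    static int synproxy_tg4_check(const struct xt_tgchk_param *par)
    {
            struct synproxy_net *snet = synproxy_pernet(par->net);
            int err;

            /* Register the hooks only when the first SYNPROXY rule
             * appears in this namespace. */
            if (snet->hook_use++ == 0) {
                    err = nf_register_net_hooks(par->net, ipv4_synproxy_ops,
                                                ARRAY_SIZE(ipv4_synproxy_ops));
                    if (err) {
                            snet->hook_use--;
                            return err;
                    }
            }
            ...
    }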
2017-04-25  Merge tag 'nfs-rdma-4.12-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma  (Trond Myklebust; 6 files, -122/+295)

NFS: NFS over RDMA Client Side Changes

New Features:
- Break RDMA connections after a connection timeout
- Support for unloading the underlying device driver

Bugfixes and cleanups:
- Mark the receive workqueue as "read-mostly"
- Silence warnings caused by ENOBUFS
- Update a comment in xdr_init_decode_pages()
- Remove rpcrdma_buffer->rb_pool
2017-04-25  svcrdma: Clean out old XDR encoders  (Chuck Lever; 1 file, -70/+0)

Clean up: these have been replaced and are no longer used.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Remove the req_map cache  (Chuck Lever; 2 files, -152/+0)

req_maps are no longer used by the send path and can thus be removed.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Remove unused RDMA Write completion handler  (Chuck Lever; 1 file, -18/+0)

Clean up: all RDMA Write completions are now handled by svc_rdma_wc_write_ctx.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Reduce size of sge array in struct svc_rdma_op_ctxt  (Chuck Lever; 1 file, -3/+3)

The sge array in struct svc_rdma_op_ctxt is no longer used for sending RDMA Write WRs. It need only accommodate the construction of Send and Receive WRs. The maximum inline size is the largest payload it needs to handle now.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Clean up RPC-over-RDMA backchannel reply processing  (Chuck Lever; 2 files, -16/+29)

Replace C structure-based XDR decoding with pointer arithmetic. Pointer arithmetic is considered more portable.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Report Write/Reply chunk overruns  (Chuck Lever; 1 file, -2/+56)

Observed at Connectathon 2017.

If a client has underestimated the size of a Write or Reply chunk, the Linux server writes as much payload data as it can, then it recognizes there was a problem and closes the connection without sending the transport header. This creates a couple of problems:

- The client never receives indication of the server-side failure, so it continues to retransmit the bad RPC. Forward progress on the transport is blocked.

- The reply payload pages are not moved out of the svc_rqst, thus they can be released by the RPC server before the RDMA Writes have completed.

The new rdma_rw-ized helpers return a distinct error code when a Write/Reply chunk overrun occurs, so it's now easy for the caller (svc_rdma_sendto) to recognize this case. Instead of dropping the connection, post an RDMA_ERROR message. The client now sees an RDMA_ERROR and can properly terminate the RPC transaction.

As part of the new logic, set up the same delayed release for these payload pages as would have occurred in the normal case.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Clean up RDMA_ERROR path  (Chuck Lever; 3 files, -63/+51)

Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can be refactored to reduce code duplication and remove C structure-based XDR encoding. It is also relocated to the source file that contains its only caller.

This is a refactoring change only.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Use rdma_rw API in RPC reply path  (Chuck Lever; 3 files, -354/+350)

The current svcrdma sendto code path posts one RDMA Write WR at a time. Each of these Writes typically carries a small number of pages (for instance, up to 30 pages for mlx4 devices). That means a 1MB NFS READ reply requires 9 ib_post_send() calls for the Write WRs, and one for the Send WR carrying the actual RPC Reply message.

Instead, use the new rdma_rw API. The details of Write WR chain construction and memory registration are taken care of in the RDMA core. svcrdma can focus on the details of the RPC-over-RDMA protocol. This gives three main benefits:

1. All Write WRs for one RDMA segment are posted in a single chain. As few as one ib_post_send() for each Write chunk.

2. The Write path can now use FRWR to register the Write buffers. If the device's maximum page list depth is large, this means a single Write WR is needed for each RPC's Write chunk data.

3. The new code introduces support for RPCs that carry both a Write list and a Reply chunk. This combination can be used for an NFSv4 READ where the data payload is large, and thus is removed from the Payload Stream, but the Payload Stream is still larger than the inline threshold.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Introduce local rdma_rw API helpers  (Chuck Lever; 4 files, -1/+518)

The plan is to replace the local bespoke code that constructs and posts RDMA Read and Write Work Requests with calls to the rdma_rw API. This shares code with other RDMA-enabled ULPs; the rdma_rw core manages the gory details of buffer registration and posting Work Requests.

Some design notes:

o The structure of RPC-over-RDMA transport headers is flexible, allowing multiple segments per Reply with arbitrary alignment, each with a unique R_key. Write and Send WRs continue to be built and posted in separate code paths. However, one whole chunk (with one or more RDMA segments apiece) gets exactly one ib_post_send and one work completion.

o svc_xprt reference counting is modified, since a chain of rdma_rw_ctx structs generates one completion, no matter how many Write WRs are posted.

o The current code builds the transport header as it is constructing Write WRs. I've replaced that with marshaling of transport header data items in a separate step. This is because the exact structure of client-provided segments may not align with the components of the server's reply xdr_buf, or the pages in the page list. Thus parts of each client-provided segment may be written at different points in the send path.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
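For orientation, a sketch of the central rdma_rw calls such helpers wrap; the ctxt field names here are hypothetical, and the signatures reflect include/rdma/rw.h as of this era:

    /* Map and register one chunk segment for RDMA Write. */
    ret = rdma_rw_ctx_init(&ctxt->rw_ctx, qp, port_num,
                           ctxt->rw_sg_table.sgl, sg_cnt, 0 /* sg_offset */,
                           remote_addr, rkey, DMA_TO_DEVICE);
    if (ret < 0)
            goto out_err;

    /* Post the whole Write WR chain with one call; cqe fires once
     * when the entire chain completes. */
    ret = rdma_rw_ctx_post(&ctxt->rw_ctx, qp, port_num, cqe, NULL);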
2017-04-25  svcrdma: Clean up svc_rdma_get_inv_rkey()  (Chuck Lever; 1 file, -22/+17)

Replace C structure-based XDR decoding with more portable code that instead uses pointer arithmetic.

This is a refactoring change only.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
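To illustrate the two styles with the fixed RPC-over-RDMA header words (xid, vers, credits, proc); this is a generic hedged example, not the exact patch:

    __be32 *p = (__be32 *)rdma_hdr;     /* hypothetical buffer pointer */
    u32 xid, vers, credits, proc;

    /* Pointer arithmetic: one XDR word at a time, no layout
     * assumptions beyond the wire format itself. */
    xid     = be32_to_cpup(p++);
    vers    = be32_to_cpup(p++);
    credits = be32_to_cpup(p++);
    proc    = be32_to_cpup(p++);

    /* versus struct-based decoding, which leans on C struct layout:
     *   struct rpcrdma_msg *rmsgp = (struct rpcrdma_msg *)rdma_hdr;
     *   xid = be32_to_cpu(rmsgp->rm_xid);
     */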
2017-04-25  svcrdma: Add helper to save pages under I/O  (Chuck Lever; 1 file, -13/+18)

Clean up: extract the logic to save pages under I/O into a helper to add a big documenting comment without adding clutter in the send path.

This is a refactoring change only.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT  (Chuck Lever; 2 files, -3/+1)

The Send Queue depth is temporarily reduced to 1 SQE per credit. The new rdma_rw API does an internal computation, during QP creation, to increase the depth of the Send Queue to handle RDMA Read and Write operations.

This change has to come before the NFSD code paths are updated to use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases the size of the SQ too much, resulting in memory allocation failures during QP creation.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Add svc_rdma_map_reply_hdr()  (Chuck Lever; 2 files, -38/+59)

Introduce a helper to DMA-map a reply's transport header before sending it. This will in part replace the map vector cache.

Signed-off-by: Chuck Lever <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  svcrdma: Move send_wr to svc_rdma_op_ctxt  (Chuck Lever; 2 files, -35/+40)

Clean up: move the ib_send_wr off the stack, and move common code to post a Send Work Request into a helper.

This is a refactoring change only.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
2017-04-25  xprtrdma: Remove rpcrdma_buffer::rb_pool  (Chuck Lever; 1 file, -1/+0)

Since commit 1e465fd4ff47 ("xprtrdma: Replace send and receive arrays"), this field is no longer used.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  sunrpc: Fix xdr_init_decode_pages() documenting comment  (Chuck Lever; 1 file, -1/+1)

Clean up.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Squelch ENOBUFS warnings  (Chuck Lever; 1 file, -3/+5)

When ro_map is out of buffers, that's not a permanent error, so don't report a problem.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Annotate receive workqueue  (Chuck Lever; 1 file, -1/+1)

Micro-optimize the receive workqueue by marking its anchor "read-mostly."

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Revert commit d0f36c46deea  (Chuck Lever; 1 file, -26/+7)

Device removal is now adequately supported. Pinning the underlying device driver to prevent removal while an NFS mount is active is no longer necessary.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Restore transport after device removal  (Chuck Lever; 1 file, -0/+48)

After a device removal, enable the transport connect worker to restore normal operation if there is another device with connectivity to the server.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Refactor rpcrdma_ep_connect  (Chuck Lever; 1 file, -46/+63)

I'm about to add another arm to

    if (ep->rep_connected != 0)

It will be cleaner to use a switch statement here. We'll be looking for a couple of specific errnos, or "anything else," basically to sort out the difference between a normal reconnect and recovery from device removal.

This is a refactoring change only.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Support unplugging an HCA from under an NFS mount  (Chuck Lever; 3 files, -9/+101)

The device driver for the underlying physical device associated with an RPC-over-RDMA transport can be removed while RPC-over-RDMA transports are still in use (ie, while NFS filesystems are still mounted and active). The IB core performs a connection event upcall to request that consumers free all RDMA resources associated with a transport.

There may be pending RPCs when this occurs. Care must be taken to release associated resources without leaving references that can trigger a subsequent crash if a signal or soft timeout occurs. We rely on the caller of the transport's ->close method to ensure that the previous RPC task has invoked xprt_release but the transport remains write-locked.

A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close is invoked, it destroys the transport's H/W resources, then wakes the upcall, which completes and allows the core driver unload to continue.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
2017-04-25  xprtrdma: Use same device when mapping or syncing DMA buffers  (Chuck Lever; 3 files, -9/+14)

When the underlying device driver is reloaded, ia->ri_device will be replaced. All cached copies of that device pointer have to be updated as well.

Commit 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and Receive buffers") added the rg_device field to each regbuf. As part of handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on all regbufs for a transport.

Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after the driver has been reloaded should reinitialize rg_device correctly for every case except rpcrdma_wc_receive, which still uses rpcrdma_rep::rr_device.

Ensure the same device that was used to map a Receive buffer is also used to sync it in rpcrdma_wc_receive, by using rg_device there instead of rr_device. This is the only use of rr_device, so it can be removed.

The use of regbufs in the send path is also updated, for completeness.

Fixes: 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and ... ")
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
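A sketch of the receive-completion change; the rdmab_device() accessor name on the regbuf is an assumption based on the rg_device field named above:

    static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
    {
            ...
            /* Sync with the device that actually mapped this buffer
             * (rg_device), not a possibly stale cached rr_device. */
            ib_dma_sync_single_for_cpu(rdmab_device(rep->rr_rdmabuf),
                                       rdmab_addr(rep->rr_rdmabuf),
                                       wc->byte_len, DMA_FROM_DEVICE);
            ...
    }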