aboutsummaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)AuthorFilesLines
2015-11-02xprtrdma: Pre-allocate Work Requests for backchannelChuck Lever3-2/+26
Pre-allocate extra send and receive Work Requests needed to handle backchannel receives and sends. The transport doesn't know how many extra WRs to pre-allocate until the xprt_setup_backchannel() call, but that's long after the WRs are allocated during forechannel setup. So, use a fixed value for now. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffersChuck Lever5-12/+309
xprtrdma's backward direction send and receive buffers are the same size as the forechannel's inline threshold, and must be pre- registered. The consumer has no control over which receive buffer the adapter chooses to catch an incoming backwards-direction call. Any receive buffer can be used for either a forward reply or a backward call. Thus both types of RPC message must all be the same size. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02SUNRPC: Abstract backchannel operationsChuck Lever2-2/+27
xprt_{setup,destroy}_backchannel() won't be adequate for RPC/RMDA bi-direction. In particular, receive buffers have to be pre- registered and posted in order to receive incoming backchannel requests. Add a virtual function call to allow the insertion of appropriate backchannel setup and destruction methods for each transport. In addition, freeing a backchannel request is a little different for RPC/RDMA. Introduce an rpc_xprt_op to handle the difference. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Saving IRQs no longer needed for rb_lockChuck Lever1-14/+10
Now that RPC replies are processed in a workqueue, there's no need to disable IRQs when managing send and receive buffers. This saves noticeable overhead per RPC. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Remove reply taskletChuck Lever1-43/+0
Clean up: The reply tasklet is no longer used. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Use workqueue to process RPC/RDMA repliesChuck Lever4-18/+65
The reply tasklet is fast, but it's single threaded. After reply traffic saturates a single CPU, there's no more reply processing capacity. Replace the tasklet with a workqueue to spread reply handling across all CPUs. This also moves RPC/RDMA reply handling out of the soft IRQ context and into a context that allows sleeps. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Replace send and receive arraysChuck Lever2-91/+73
The rb_send_bufs and rb_recv_bufs arrays are used to implement a pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep structs. Replace those arrays with free lists. To allow more than 512 RPCs in-flight at once, each of these arrays would be larger than a page (assuming 8-byte addresses and 4KB pages). Allowing up to 64K in-flight RPCs (as TCP now does), each buffer array would have to be 128 pages. That's an order-6 allocation. (Not that we're going there.) A list is easier to expand dynamically. Instead of allocating a larger array of pointers and copying the existing pointers to the new array, simply append more buffers to each list. This also makes it simpler to manage receive buffers that might catch backwards-direction calls, or to post receive buffers in bulk to amortize the overhead of ib_post_recv. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Refactor reply handler error handlingChuck Lever3-40/+53
Clean up: The error cases in rpcrdma_reply_handler() almost never execute. Ensure the compiler places them out of the hot path. No behavior change expected. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Prevent loss of completion signalsChuck Lever2-41/+38
Commit 8301a2c047cc ("xprtrdma: Limit work done by completion handler") was supposed to prevent xprtrdma's upcall handlers from starving other softIRQ work by letting them return to the provider before all CQEs have been polled. The logic assumes the provider will call the upcall handler again immediately if the CQ is re-armed while there are still queued CQEs. This assumption is invalid. The IBTA spec says that after a CQ is armed, the hardware must interrupt only when a new CQE is inserted. xprtrdma can't rely on the provider calling again, even though some providers do. Therefore, leaving CQEs on queue makes sense only when there is another mechanism that ensures all remaining CQEs are consumed in a timely fashion. xprtrdma does not have such a mechanism. If a CQE remains queued, the transport can wait forever to send the next RPC. Finally, move the wcs array back onto the stack to ensure that the poll array is always local to the CPU where the completion upcall is running. Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...") Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Re-arm after missed eventsChuck Lever1-56/+10
ib_req_notify_cq(IB_CQ_REPORT_MISSED_EVENTS) returns a positive value if WCs were added to a CQ after the last completion upcall but before the CQ has been re-armed. Commit 7f23f6f6e388 ("xprtrmda: Reduce lock contention in completion handlers") assumed that when ib_req_notify_cq() returned a positive RC, the CQ had also been successfully re-armed, making it safe to return control to the provider without losing any completion signals. That is an invalid assumption. Change both completion handlers to continue polling while ib_req_notify_cq() returns a positive value. Fixes: 7f23f6f6e388 ("xprtrmda: Reduce lock contention in ...") Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: Enable swap-on-NFS/RDMAChuck Lever1-1/+1
After adding a swapfile on an NFS/RDMA mount and removing the normal swap partition, I was able to push the NFS client well into swap without any issue. I forgot to swapoff the NFS file before rebooting. This pinned the NFS mount and the IB core and provider, causing shutdown to hang. I think this is expected and safe behavior. Probably shutdown scripts should "swapoff -a" before unmounting any filesystems. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-By: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-11-02xprtrdma: don't log warnings for flushed completionsSteve Wise1-2/+5
Unsignaled send WRs can get flushed as part of normal unmount, so don't log them as warnings. Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-10-28svcrdma: Port to new memory registration APISagi Grimberg2-57/+53
Instead of maintaining a fastreg page list, keep an sg table and convert an array of pages to a sg list. Then call ib_map_mr_sg and construct ib_reg_wr. Signed-off-by: Sagi Grimberg <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Tested-by: Steve Wise <[email protected]> Tested-by: Selvin Xavier <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-10-28xprtrdma: Port to new memory registration APISagi Grimberg2-52/+69
Instead of maintaining a fastreg page list, keep an sg table and convert an array of pages to a sg list. Then call ib_map_mr_sg and construct ib_reg_wr. Signed-off-by: Sagi Grimberg <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Tested-by: Steve Wise <[email protected]> Tested-by: Selvin Xavier <[email protected]> Reviewed-by: Chuck Lever <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-10-28Merge branch 'wr-cleanup' into k.o/for-4.4Doug Ledford3-55/+56
2015-10-28Merge branch 'wr-cleanup' of git://git.infradead.org/users/hch/rdma into ↵Doug Ledford7-95/+61
wr-cleanup Signed-off-by: Doug Ledford <[email protected]> Conflicts: drivers/infiniband/ulp/isert/ib_isert.c - Commit 4366b19ca5eb (iser-target: Change the recv buffers posting logic) changed the logic in isert_put_datain() and had to be hand merged
2015-10-28IB/cma: Add support for network namespacesGuy Shapiro2-3/+4
Add support for network namespaces in the ib_cma module. This is accomplished by: 1. Adding network namespace parameter for rdma_create_id. This parameter is used to populate the network namespace field in rdma_id_private. rdma_create_id keeps a reference on the network namespace. 2. Using the network namespace from the rdma_id instead of init_net inside of ib_cma, when listening on an ID and when looking for an ID for an incoming request. 3. Decrementing the reference count for the appropriate network namespace when calling rdma_destroy_id. In order to preserve the current behavior init_net is passed when calling from other modules. Signed-off-by: Guy Shapiro <[email protected]> Signed-off-by: Haggai Eran <[email protected]> Signed-off-by: Yotam Kenneth <[email protected]> Signed-off-by: Shachar Raindel <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-10-23sunrpc/cache: make cache flushing more reliable.Neil Brown1-13/+40
The caches used to store sunrpc authentication information can be flushed by writing a timestamp to a file in /proc. This timestamp has a one-second resolution and any entry in cache that was last_refreshed *before* that time is treated as expired. This is problematic as it is not possible to reliably flush the cache without interrupting NFS service. If the current time is written to the "flush" file, any entry that was added since the current second started will still be treated as valid. If one second beyond than the current time is written to the file then no entries can be valid until the second ticks over. This will mean that no NFS request will be handled for up to 1 second. To resolve this issue we make two changes: 1/ treat an entry as expired if the timestamp when it was last_refreshed is before *or the same as* the expiry time. This means that current code which writes out the current time will now flush the cache reliably. 2/ when a new entry in added to the cache - set the last_refresh timestamp to 1 second *beyond* the current flush time, when that not in the past. This ensures that newly added entries will always be valid. Now that we have a very reliable way to flush the cache, and also since we are using "since-boot" timestamps which are monotonic, change cache_purge() to set the smallest future flush_time which will work, and leave it there: don't revert to '1'. Also disable the setting of the 'flush_time' far into the future. That has never been useful and is now awkward as it would cause last_refresh times to be strange. Finally: if a request is made to set the 'flush_time' to the current second, assume the intent is to flush the cache and advance it, if necessary, to 1 second beyond the current 'flush_time' so that all active entries will be deemed to be expired. As part of this we need to add a 'cache_detail' arg to cache_init() and cache_fresh_locked() so they can find the current ->flush_time. Signed-off-by: NeilBrown <[email protected]> Reported-by: Olaf Kirch <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-10-23sunrpc: avoid warning in gss_key_timeoutArnd Bergmann1-7/+6
The gss_key_timeout() function causes a harmless warning in some configurations, e.g. ARM imx_v6_v7_defconfig with gcc-5.2, if the compiler cannot figure out the state of the 'expire' variable across an rcu_read_unlock(): net/sunrpc/auth_gss/auth_gss.c: In function 'gss_key_timeout': net/sunrpc/auth_gss/auth_gss.c:1422:211: warning: 'expire' may be used uninitialized in this function [-Wmaybe-uninitialized] To avoid this warning without adding a bogus initialization, this rewrites the function so the comparison is done inside of the critical section. As a side-effect, it also becomes slightly easier to understand because the implementation now more closely resembles the comment above it. Signed-off-by: Arnd Bergmann <[email protected]> Fixes: c5e6aecd034e7 ("sunrpc: fix RCU handling of gc_ctx field") Signed-off-by: J. Bruce Fields <[email protected]>
2015-10-23SUNRPC: Use MSG_SENDPAGE_NOTLAST when calling sendpage()Trond Myklebust2-2/+4
If we're sending more pages via kernel_sendpage(), then set MSG_SENDPAGE_NOTLAST. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-10-21Merge branch 'k.o/for-4.3-v1' into k.o/for-4.4Doug Ledford5-40/+5
Pick up the late fixes from the 4.3 cycle so we have them in our next branch.
2015-10-15Merge tag 'for-linus' of ↵Linus Torvalds1-5/+3
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma updates from Doug Ledford: "We have four batched up patches for the current rc kernel. Two of them are small fixes that are obvious. One of them is larger than I would like for a late stage rc pull, but we found an issue in the namespace lookup code related to RoCE and this works around the issue for now (we allow a lookup with a namespace to succeed on RoCE since RoCE namespaces aren't implemented yet). This will go away in 4.4 when we put in support for namespaces in RoCE devices. The last one is large in terms of lines, but is all legal and no functional changes. Cisco needed to update their files to be more specific about their license. They had intended the files to be dual licensed as GPL/BSD all along, and specified that in their module license tag, but their file headers were not up to par. They contacted all of the contributors to get agreement and then submitted a patch to update the license headers in the files. Summary: - Work around connection namespace lookup bug related to RoCE - Change usnic license to Dual GPL/BSD (was intended to be that way all along, but wasn't clear, permission from contributors was chased down) - Fix an issue between NFSoRDMA and mlx5 that could cause an oops - Fix leak of sendonly multicast groups" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: IB/ipoib: For sendonly join free the multicast group on leave IB/cma: Accept connection without a valid netdev on RoCE xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg usnic: add missing clauses to BSD license
2015-10-13Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-1/+1
Pull nfsd fixes from Bruce Fields: "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol bug" * tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux: svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE nfsd/blocklayout: accept any minlength
2015-10-12svcrdma: Fix NFS server crash triggered by 1MB NFS WRITEChuck Lever1-1/+1
Now that the NFS server advertises a maximum payload size of 1MB for RPC/RDMA again, it crashes in svc_process_common() when NFS client sends a 1MB NFS WRITE on an NFS/RDMA mount. The server has set up a 259 element array of struct page pointers in rq_pages[] for each incoming request. The last element of the array is NULL. When an incoming request has been completely received, rdma_read_complete() attempts to set the starting page of the incoming page vector: rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count]; and the page to use for the reply: rqstp->rq_respages = &rqstp->rq_arg.pages[page_no]; But the value of page_no has already accounted for head->hdr_count. Thus rq_respages now points past the end of the incoming pages. For NFS WRITE operations smaller than the maximum, this is harmless. But when the NFS WRITE operation is as large as the server's max payload size, rq_respages now points at the last entry in rq_pages, which is NULL. Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .') BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Reviewed-by: Steve Wise <[email protected]> Reviewed-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-10-09Merge tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-2/+4
Pull nfsd bugfix from Bruce Fields: "Just one RDMA bugfix" * tag 'nfsd-4.3-1' of git://linux-nfs.org/~bfields/linux: svcrdma: handle rdma read with a non-zero initial page offset
2015-10-08SUNRPC: Use MSG_SENDPAGE_NOTLAST in xs_send_pagedata()Trond Myklebust1-1/+3
If we're sending more than one page via kernel_sendpage(), then set MSG_SENDPAGE_NOTLAST between the pages so that we don't send suboptimal frames (see commit 2f5338442425 and commit 35f9c09fe9c7). Signed-off-by: Trond Myklebust <[email protected]>
2015-10-08SUNRPC: Move AF_LOCAL receive data path into a workqueue contextTrond Myklebust1-30/+42
Now that we've done it for TCP and UDP, let's convert AF_LOCAL as well. Signed-off-by: Trond Myklebust <[email protected]>
2015-10-08SUNRPC: Move UDP receive data path into a workqueue contextTrond Myklebust1-23/+61
Now that we've done it for TCP, let's convert UDP as well. Signed-off-by: Trond Myklebust <[email protected]>
2015-10-08SUNRPC: Move TCP receive data path into a workqueue contextTrond Myklebust1-15/+36
Stream protocols such as TCP can often build up a backlog of data to be read due to ordering. Combine this with the fact that some workloads such as NFS read()-intensive workloads need to receive a lot of data per RPC call, and it turns out that receiving the data from inside a softirq context can cause starvation. The following patch moves the TCP data receive into a workqueue context. We still end up calling tcp_read_sock(), but we do so from a process context, meaning that softirqs are enabled for most of the time. With this patch, I see a doubling of read bandwidth when running a multi-threaded iozone workload between a virtual client and server setup. Signed-off-by: Trond Myklebust <[email protected]>
2015-10-08SUNRPC: Refactor TCP receiveTrond Myklebust1-15/+29
Move the TCP data receive loop out of xs_tcp_data_ready(). Doing so will allow us to move the data receive out of the softirq context in a set of followup patches. Signed-off-by: Trond Myklebust <[email protected]>
2015-10-08IB: split struct ib_send_wrChristoph Hellwig3-55/+56
This patch split up struct ib_send_wr so that all non-trivial verbs use their own structure which embedds struct ib_send_wr. This dramaticly shrinks the size of a WR for most common operations: sizeof(struct ib_send_wr) (old): 96 sizeof(struct ib_send_wr): 48 sizeof(struct ib_rdma_wr): 64 sizeof(struct ib_atomic_wr): 96 sizeof(struct ib_ud_wr): 88 sizeof(struct ib_fast_reg_wr): 88 sizeof(struct ib_bind_mw_wr): 96 sizeof(struct ib_sig_handover_wr): 80 And with Sagi's pending MR rework the fast registration WR will also be down to a reasonable size: sizeof(struct ib_fastreg_wr): 64 Signed-off-by: Christoph Hellwig <[email protected]> Reviewed-by: Bart Van Assche <[email protected]> [srp, srpt] Reviewed-by: Chuck Lever <[email protected]> [sunrpc] Tested-by: Haggai Eran <[email protected]> Tested-by: Sagi Grimberg <[email protected]> Tested-by: Steve Wise <[email protected]>
2015-10-06xprtrdma: Don't require LOCAL_DMA_LKEY support for fastregSagi Grimberg1-5/+3
There is no need to require LOCAL_DMA_LKEY support as the PD allocation makes sure that there is a local_dma_lkey. Also correctly set a return value in error path. This caused a NULL pointer dereference in mlx5 which removed the support for LOCAL_DMA_LKEY. Fixes: bb6c96d72879 ("xprtrdma: Replace global lkey with lkey local to PD") Signed-off-by: Sagi Grimberg <[email protected]> Reviewed-by: Chuck Lever <[email protected]> Acked-by: Anna Schumaker <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-10-02Merge tag 'nfs-rdma-for-4.3-2' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust2-4/+7
NFS: NFSoRDMA bugfix Fixes a use-after-free bug. Signed-off-by: Anna Schumaker <[email protected]>
2015-10-01Merge tag 'for-linus' of ↵Linus Torvalds5-35/+2
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: - Fixes for mlx5 related issues - Fixes for ipoib multicast handling * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: IB/ipoib: increase the max mcast backlog queue IB/ipoib: Make sendonly multicast joins create the mcast group IB/ipoib: Expire sendonly multicast joins IB/mlx5: Remove pa_lkey usages IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY IB/iser: Add module parameter for always register memory xprtrdma: Replace global lkey with lkey local to PD
2015-09-29svcrdma: handle rdma read with a non-zero initial page offsetSteve Wise1-2/+4
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions were not taking into account the initial page_offset when determining the rdma read length. This resulted in a read who's starting address and length exceeded the base/bounds of the frmr. The server gets an async error from the rdma device and kills the connection, and the client then reconnects and resends. This repeats indefinitely, and the application hangs. Most work loads don't tickle this bug apparently, but one test hit it every time: building the linux kernel on a 16 core node with 'make -j 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA. This bug seems to only be tripped with devices having small fastreg page list depths. I didn't see it with mlx4, for instance. Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic') Signed-off-by: Steve Wise <[email protected]> Tested-by: Chuck Lever <[email protected]> Cc: [email protected] Signed-off-by: J. Bruce Fields <[email protected]>
2015-09-28xprtrdma: disconnect and flush cqs before freeing buffersSteve Wise2-4/+7
Otherwise a FRMR completion can cause a touch-after-free crash. In xprt_rdma_destroy(), call rpcrdma_buffer_destroy() only after calling rpcrdma_ep_destroy(). In rpcrdma_ep_destroy(), disconnect the cm_id first which should flush the qp, then drain the cqs, then destroy the qp, and finally destroy the cqs. Signed-off-by: Steve Wise <[email protected]> Tested-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-09-25Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-12/+21
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - fix v4.2 SEEK on files over 2 gigs - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O. - Fix recovery of recalled read delegations Bugfixes: - Fix a case where NFSv4 fails to send CLOSE after a server reboot - Fix sunrpc to wait for connections to complete before retrying - Fix sunrpc races between transport connect/disconnect and shutdown - Fix an infinite loop when layoutget fail with BAD_STATEID - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set - Fix layoutreturn/close ordering issues" * tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS41: make close wait for layoutreturn NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget NFSv4: Recovery of recalled read delegations is broken NFS: Fix an infinite loop when layoutget fail with BAD_STATEID NFS: Do cleanup before resetting pageio read/write to mds SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose SUNRPC: Lock the transport layer on shutdown nfs/filelayout: Fix NULL reference caused by double freeing of fh_array SUNRPC: Ensure that we wait for connections to complete before retrying SUNRPC: drop null test before destroy functions nfs: fix v4.2 SEEK on files over 2 gigs SUNRPC: Fix races between socket connection and destroy code nfs: fix pg_test page count calculation Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-25xprtrdma: Replace global lkey with lkey local to PDChuck Lever5-35/+2
The core API has changed so that devices that do not have a global DMA lkey automatically create an mr, per-PD, and make that lkey available. The global DMA lkey interface is going away in favor of the per-PD DMA lkey. The per-PD DMA lkey is always available. Convert xprtrdma to use the device's per-PD DMA lkey for regbufs, no matter which memory registration scheme is in use. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Sagi Grimberg <[email protected]> Cc: linux-nfs <[email protected]> Acked-by: Anna Schumaker <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-09-22userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to ↵Andrea Arcangeli1-1/+1
__wake_up_locked_key" This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts fs/userfaultfd.c to use the old version of that function. It didn't look robust to call __wake_up_common with "nr == 1" when we absolutely require wakeall semantics, but we've full control of what we insert in the two waitqueue heads of the blocked userfaults. No exclusive waitqueue risks to be inserted into those two waitqueue heads so we can as well stick to "nr == 1" of the old code and we can rely purely on the fact no waitqueue inserted in one of the two waitqueue heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set. Signed-off-by: Andrea Arcangeli <[email protected]> Cc: Dr. David Alan Gilbert <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Thierry Reding <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-09-19SUNRPC: xs_sock_mark_closed() does not need to trigger socket autocloseTrond Myklebust1-1/+0
Under all conditions, it should be quite sufficient just to mark the socket as disconnected. It will then be closed by the transport shutdown or reconnect code. Signed-off-by: Trond Myklebust <[email protected]>
2015-09-19SUNRPC: Lock the transport layer on shutdownTrond Myklebust1-0/+6
Avoid all races with the connect/disconnect handlers by taking the transport lock. Reported-by:"Suzuki K. Poulose" <[email protected]> Acked-by: Jeff Layton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-09-17SUNRPC: Ensure that we wait for connections to complete before retryingTrond Myklebust1-3/+8
Commit 718ba5b87343, moved the responsibility for unlocking the socket to xs_tcp_setup_socket, meaning that the socket will be unlocked before we know that it has finished trying to connect. The following patch is based on an initial patch by Russell King to ensure that we delay clearing the XPRT_CONNECTING flag until we either know that we failed to initiate a connection attempt, or the connection attempt itself failed. Fixes: 718ba5b87343 ("SUNRPC: Add helpers to prevent socket create from racing") Reported-by: Russell King <[email protected]> Reported-by: Russell King <[email protected]> Tested-by: Russell King <[email protected]> Tested-by: Benjamin Coddington <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-09-17SUNRPC: drop null test before destroy functionsJulia Lawall1-8/+4
Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-09-17SUNRPC: Fix races between socket connection and destroy codeTrond Myklebust1-0/+3
When we're destroying the socket transport, we need to ensure that we cancel any existing delayed connection attempts, and order them w.r.t. the call to xs_close(). Reported-by:"Suzuki K. Poulose" <[email protected]> Acked-by: Jeff Layton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-09-09Merge tag 'for-linus' of ↵Linus Torvalds4-17/+13
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull inifiniband/rdma updates from Doug Ledford: "This is a fairly sizeable set of changes. I've put them through a decent amount of testing prior to sending the pull request due to that. There are still a few fixups that I know are coming, but I wanted to go ahead and get the big, sizable chunk into your hands sooner rather than waiting for those last few fixups. Of note is the fact that this creates what is intended to be a temporary area in the drivers/staging tree specifically for some cleanups and additions that are coming for the RDMA stack. We deprecated two drivers (ipath and amso1100) and are waiting to hear back if we can deprecate another one (ehca). We also put Intel's new hfi1 driver into this area because it needs to be refactored and a transfer library created out of the factored out code, and then it and the qib driver and the soft-roce driver should all be modified to use that library. I expect drivers/staging/rdma to be around for three or four kernel releases and then to go away as all of the work is completed and final deletions of deprecated drivers are done. Summary of changes for 4.3: - Create drivers/staging/rdma - Move amso1100 driver to staging/rdma and schedule for deletion - Move ipath driver to staging/rdma and schedule for deletion - Add hfi1 driver to staging/rdma and set TODO for move to regular tree - Initial support for namespaces to be used on RDMA devices - Add RoCE GID table handling to the RDMA core caching code - Infrastructure to support handling of devices with differing read and write scatter gather capabilities - Various iSER updates - Kill off unsafe usage of global mr registrations - Update SRP driver - Misc mlx4 driver updates - Support for the mr_alloc verb - Support for a netlink interface between kernel and user space cache daemon to speed path record queries and route resolution - Ininitial support for safe hot removal of verbs devices" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits) IB/ipoib: Suppress warning for send only join failures IB/ipoib: Clean up send-only multicast joins IB/srp: Fix possible protection fault IB/core: Move SM class defines from ib_mad.h to ib_smi.h IB/core: Remove unnecessary defines from ib_mad.h IB/hfi1: Add PSM2 user space header to header_install IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY mlx5: Fix incorrect wc pkey_index assignment for GSI messages IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow IB/uverbs: reject invalid or unknown opcodes IB/cxgb4: Fix if statement in pick_local_ip6adddrs IB/sa: Fix rdma netlink message flags IB/ucma: HW Device hot-removal support IB/mlx4_ib: Disassociate support IB/uverbs: Enable device removal when there are active user space applications IB/uverbs: Explicitly pass ib_dev to uverbs commands IB/uverbs: Fix race between ib_uverbs_open and remove_one IB/uverbs: Fix reference counting usage of event files IB/core: Make ib_dealloc_pd return void IB/srp: Create an insecure all physical rkey only if needed ...
2015-09-07Merge tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds9-316/+288
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix atomicity of pNFS commit list updates - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY) - nfs_set_pgio_error sometimes misses errors - Fix a thinko in xs_connect() - Fix borkage in _same_data_server_addrs_locked() - Fix a NULL pointer dereference of migration recovery ops for v4.2 client - Don't let the ctime override attribute barriers. - Revert "NFSv4: Remove incorrect check in can_open_delegated()" - Ensure flexfiles pNFS driver updates the inode after write finishes - flexfiles must not pollute the attribute cache with attrbutes from the DS - Fix a protocol error in layoutreturn - Fix a protocol issue with NFSv4.1 CLOSE stateids Bugfixes + cleanups - pNFS blocks bugfixes from Christoph - Various cleanups from Anna - More fixes for delegation corner cases - Don't fsync twice for O_SYNC/IS_SYNC files - Fix pNFS and flexfiles layoutstats bugs - pnfs/flexfiles: avoid duplicate tracking of mirror data - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation issues - pnfs/flexfiles: error handling retries a layoutget before fallback to MDS Features: - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from Kinglong - More RDMA client transport improvements from Chuck - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr() verbs from the SUNRPC, Lustre and core infiniband tree. - Optimise away the close-to-open getattr if there is no cached data" * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits) NFSv4: Respect the server imposed limit on how many changes we may cache NFSv4: Express delegation limit in units of pages Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files" NFS: Optimise away the close-to-open getattr if there is no cached data NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error() nfs: Remove unneeded checking of the return value from scnprintf nfs: Fix truncated client owner id without proto type NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid() NFSv4.1/flexfiles: Fix freeing of mirrors NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file NFSv4.1/pnfs: Handle LAYOUTGET return values correctly NFSv4.1/pnfs: Don't ask for a read layout for an empty file. NFSv4.1: Fix a protocol issue with CLOSE stateids NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors SUNRPC: Prevent SYN+SYNACK+RST storms SUNRPC: xs_reset_transport must mark the connection as disconnected NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload ...
2015-09-05Merge tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linuxLinus Torvalds6-150/+197
Pull nfsd updates from Bruce Fields: "Nothing major, but: - Add Jeff Layton as an nfsd co-maintainer: no change to existing practice, just an acknowledgement of the status quo. - Two patches ("nfsd: ensure that...") for a race overlooked by the state locking rewrite, causing a crash noticed by multiple users. - Lots of smaller bugfixes all over from Kinglong Mee. - From Jeff, some cleanup of server rpc code in preparation for possible shift of nfsd threads to workqueues" * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits) nfsd: deal with DELEGRETURN racing with CB_RECALL nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case nfsd: ensure that delegation stateid hash references are only put once nfsd: ensure that the ol stateid hash reference is only put once net: sunrpc: fix tracepoint Warning: unknown op '->' nfsd: allow more than one laundry job to run at a time nfsd: don't WARN/backtrace for invalid container deployment. fs: fix fs/locks.c kernel-doc warning nfsd: Add Jeff Layton as co-maintainer NFSD: Return word2 bitmask if setting security label in OPEN/CREATE NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1 nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL. nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug NFSD: Store parent's stat in a separate value nfsd: Fix two typos in comments lockd: NLM grace period shouldn't block NFSv4 opens nfsd: include linux/nfs4.h in export.h sunrpc: Switch to using hash list instead single list sunrpc/nfsd: Remove redundant code by exports seq_operations functions sunrpc: Store cache_detail in seq_file's private directly ...
2015-09-04userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_keyAndrea Arcangeli1-1/+1
userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead of the current hardcoded 1 (that would wake just the first waitqueue in the head list). Signed-off-by: Andrea Arcangeli <[email protected]> Acked-by: Pavel Emelyanov <[email protected]> Cc: Sanidhya Kashyap <[email protected]> Cc: [email protected] Cc: "Kirill A. Shutemov" <[email protected]> Cc: Andres Lagar-Cavilla <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Peter Feiner <[email protected]> Cc: "Dr. David Alan Gilbert" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: "Huangpeng (Peter)" <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2015-08-30IB/core: Make ib_dealloc_pd return voidJason Gunthorpe1-1/+1
The majority of callers never check the return value, and even if they did, they can't do anything about a failure. All possible failure cases represent a bug in the caller, so just WARN_ON inside the function instead. This fixes a few random errors: net/rd/iw.c infinite loops while it fails. (racing with EBUSY?) This also lays the ground work to get rid of error return from the drivers. Most drivers do not error, the few that do are broken since it cannot be handled. Since uverbs can legitimately make use of EBUSY, open code the check. Signed-off-by: Jason Gunthorpe <[email protected]> Reviewed-by: Chuck Lever <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-08-30svcrdma: limit FRMR page list lengths to device maxSteve Wise1-2/+4
Svcrdma was incorrectly allocating fastreg MRs and page lists using RPCSVC_MAXPAGES, which can exceed the device capabilities. So limit the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len. Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Doug Ledford <[email protected]>