aboutsummaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)AuthorFilesLines
2015-08-30xprtrdma, svcrdma: Convert to ib_alloc_mrSagi Grimberg2-4/+4
Signed-off-by: Sagi Grimberg <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-08-29SUNRPC: Prevent SYN+SYNACK+RST stormsTrond Myklebust1-0/+2
Add a shutdown() call before we release the socket in order to ensure the reset is sent before we try to reconnect. Signed-off-by: Trond Myklebust <[email protected]>
2015-08-29SUNRPC: xs_reset_transport must mark the connection as disconnectedTrond Myklebust1-0/+1
In case the reconnection attempt fails. Cc: [email protected] Signed-off-by: Trond Myklebust <[email protected]>
2015-08-28svcrdma: Use max_sge_rd for destination read depthsSteve Wise2-11/+5
Signed-off-by: Steve Wise <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
2015-08-19SUNRPC: Allow sockets to do GFP_NOIO allocationsTrond Myklebust1-3/+3
Follow up to commit c4a7ca774949 ("SUNRPC: Allow waiting on memory allocation"). Allows the RPC socket code to do non-IO blocking. Signed-off-by: Trond Myklebust <[email protected]>
2015-08-17Merge branch 'bugfixes'Trond Myklebust1-4/+5
* bugfixes: SUNRPC: Fix a thinko in xs_connect() NFSv4.1/pNFS: Fix borken function _same_data_server_addrs_locked() NFS: nfs_set_pgio_error sometimes misses errors
2015-08-17Merge tag 'nfs-rdma-for-4.3' of git://git.linux-nfs.org/projects/anna/nfs-rdmaTrond Myklebust7-308/+276
NFS: NFS over RDMA Client Side Changes These patches improve both client performance and scalability, most notably by increasing the maixmum allowed rsize and wsize and by increasing the number of RDMA "credits". There are also several bugfixes, such as correcting how WRITE compounds are encoded and fixing large NFS symlink operations. Signed-off-by: Anna Schumaker <[email protected]>
2015-08-17SUNRPC: Fix a thinko in xs_connect()Trond Myklebust1-4/+5
It is rather pointless to test the value of transport->inet after calling xs_reset_transport(), since it will always be zero, and so we will never see any exponential back off behaviour. Also don't force early connections for SOFTCONN tasks. If the server disconnects us, we should respect the exponential backoff. Cc: [email protected] # 4.0+ Signed-off-by: Trond Myklebust <[email protected]>
2015-08-13sunrpc: Switch to using hash list instead single listKinglong Mee1-29/+31
Switch using list_head for cache_head in cache_detail, it is useful of remove an cache_head entry directly from cache_detail. v8, using hash list, not head list Signed-off-by: Kinglong Mee <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-13sunrpc/nfsd: Remove redundant code by exports seq_operations functionsKinglong Mee1-6/+9
Nfsd has implement a site of seq_operations functions as sunrpc's cache. Just exports sunrpc's codes, and remove nfsd's redundant codes. v8, same as v6 Signed-off-by: Kinglong Mee <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-13sunrpc: Store cache_detail in seq_file's private directlyKinglong Mee1-15/+13
Cleanup. Just store cache_detail in seq_file's private, an allocated handle is redundant. v8, same as v6. Signed-off-by: Kinglong Mee <[email protected]> Reviewed-by: NeilBrown <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-12sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytesJeff Layton1-1/+1
The current limit of 32 bytes artificially limits the name string that we end up stuffing into NFSv4.x client ID blobs. If you have multiple hosts with long hostnames that only differ near the end, then this can cause NFSv4 client ID collisions. Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use that as the limit instead. Also, use XDR_QUADLEN to specify the slack length, just for clarity and in case someone in the future changes this to something not evenly divisible by 4. Reported-by: Michael Skralivetsky <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-08-10nfsd/sunrpc: factor svc_rqst allocation and freeing from sv_nrthreads ↵Jeff Layton1-18/+36
refcounting In later patches, we'll want to be able to allocate and free svc_rqst structures without monkeying with the serv->sv_nrthreads refcount. Factor those pieces out of their respective functions. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10nfsd/sunrpc: move pool_mode definitions into svc.hJeff Layton1-24/+7
In later patches, we're going to need to allow code external to svc.c to figure out what pool_mode is in use. Move these definitions into svc.h to prepare for that. Also, make the svc_pool_map object available and exported so that other modules can peek in there to get insight into what pool mode is in use. Likewise, export svc_pool_map_get/put function to make it safe to do so. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10nfsd/sunrpc: turn enqueueing a svc_xprt into a svc_serv operationJeff Layton1-5/+5
For now, all services use svc_xprt_do_enqueue, but once we add workqueue-based service support, we'll need to do something different. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10nfsd/sunrpc: move sv_module parm into sv_opsJeff Layton1-5/+3
...not technically an operation, but it's more convenient and cleaner to pass the module pointer in this struct. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10nfsd/sunrpc: move sv_function into sv_opsJeff Layton1-5/+3
Since we now have a container for holding svc_serv operations, move the sv_function into it as well. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10nfsd/sunrpc: add a new svc_serv_ops struct and move sv_shutdown into itJeff Layton1-9/+9
In later patches we'll need to abstract out more operations on a per-service level, besides sv_shutdown and sv_function. Declare a new svc_serv_ops struct to hold these operations, and move sv_shutdown into this struct. Signed-off-by: Shirley Ma <[email protected]> Acked-by: Jeff Layton <[email protected]> Tested-by: Shirley Ma <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-10svcrdma: Change maximum server payload back to RPCSVC_MAXPAYLOADChuck Lever2-2/+1
Both commit 0380a3f375 ("svcrdma: Add a separate "max data segs" macro for svcrdma") and commit 7e5be28827bf ("svcrdma: advertise the correct max payload") are incorrect. This commit reverts both changes, restoring the server's maximum payload size to 1MB. Commit 7e5be28827bf based the server's maximum payload on the _client's_ RPCRDMA_MAX_DATA_SEGS value. That was wrong. Commit 0380a3f375 tried to fix this so that the client maximum payload size could be raised without affecting the server, but managed to confuse matters more on the server side. More importantly, limiting the advertised maximum payload size was meant to be a workaround, not the actual fix. We need to revisit https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 A Linux client on a platform with 64KB pages can overrun and crash an x86_64 NFS/RDMA server when the r/wsize is 1MB. An x86/64 Linux client seems to work fine using 1MB reads and writes when the Linux server's maximum payload size is restored to 1MB. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270 Fixes: 0380a3f375 ("svcrdma: Add a separate "max data segs" macro") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-08-05xprtrdma: take HCA driver refcount at clientDevesh Sharma1-8/+29
This is a rework of the following patch sent almost a year back: http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html In presence of active mount if someone tries to rmmod vendor-driver, the command remains stuck forever waiting for destruction of all rdma-cm-id. in worst case client can crash during shutdown with active mounts. The existing code assumes that ia->ri_id->device cannot change during the lifetime of a transport. xprtrdma do not have support for DEVICE_REMOVAL event either. Lifting that assumption and adding support for DEVICE_REMOVAL event is a long chain of work, and is in plan. The community decided that preventing the hang right now is more important than waiting for architectural changes. Thus, this patch introduces a temporary workaround to acquire HCA driver module reference count during the mount of a nfs-rdma mount point. Signed-off-by: Devesh Sharma <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Count RDMA_NOMSG type callsChuck Lever3-2/+5
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG calls so administrators can tell if they happen to be used more than expected. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Clean up xprt_rdma_print_stats()Chuck Lever1-25/+23
checkpatch.pl complained about the seq_printf() format string split across lines and the use of %Lu. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Fix large NFS SYMLINK callsChuck Lever1-9/+16
Repair how rpcrdma_marshal_req() chooses which RDMA message type to use for large non-WRITE operations so that it picks RDMA_NOMSG in the correct situations, and sets up the marshaling logic to SEND only the RPC/RDMA header. Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS server XDR decoder for NFSv2 SYMLINK does not handle having the pathname argument arrive in a separate buffer. The decoder could be fixed, but this is simpler and RDMA_NOMSG can be used in a variety of other situations. Ensure that the Linux client continues to use "RDMA_MSG + read list" when sending large NFSv3 SYMLINK requests, which is more efficient than using RDMA_NOMSG. Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG + read list" just like NFSv3 (see Section 5 of RFC 5667). Before, these did not work at all. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Fix XDR tail buffer marshallingChuck Lever1-2/+42
Currently xprtrdma appends an extra chunk element to the RPC/RDMA read chunk list of each NFSv4 WRITE compound. The extra element contains the final GETATTR operation in the compound. The result is an extra RDMA READ operation to transfer a very short piece of each NFS WRITE compound (typically 16 bytes). This is inefficient. It is also incorrect. The client is sending the trailing GETATTR at the same Position as the preceding WRITE data payload. Whether or not RFC 5667 allows the GETATTR to appear in a read chunk, RFC 5666 requires that these two separate RPC arguments appear at two distinct Positions. It can also be argued that the GETATTR operation is not bulk data, and therefore RFC 5667 forbids its appearance in a read chunk at all. Although RFC 5667 is not precise about when using a read list with NFSv4 COMPOUND is allowed, the intent is that only data arguments not touched by NFS (ie, read and write payloads) are to be sent using RDMA READ or WRITE. The NFS client constructs GETATTR arguments itself, and therefore is required to send the trailing GETATTR operation as additional inline content, not as a data payload. NB: This change is not backwards compatible. Some older servers do not accept inline content following the read list. The Linux NFS server should handle this content correctly as of commit a97c331f9aa9 ("svcrdma: Handle additional inline content"). Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Don't provide a reply chunk when expecting a short replyChuck Lever1-12/+1
Currently Linux always offers a reply chunk, even when the reply can be sent inline (ie. is smaller than 1KB). On the client, registering a memory region can be expensive. A server may choose not to use the reply chunk, wasting the cost of the registration. This is a change only for RPC replies smaller than 1KB which the server constructs in the RPC reply send buffer. Because the elements of the reply must be XDR encoded, a copy-free data transfer has no benefit in this case. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Sagi Grimberg <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Always provide a write list when sending NFS READChuck Lever1-17/+4
The client has been setting up a reply chunk for NFS READs that are smaller than the inline threshold. This is not efficient: both the server and client CPUs have to copy the reply's data payload into and out of the memory region that is then transferred via RDMA. Using the write list, the data payload is moved by the device and no extra data copying is necessary. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Reviewed-By: Sagi Grimberg <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Account for RPC/RDMA header size when deciding to inlineChuck Lever1-2/+27
When the size of the RPC message is near the inline threshold (1KB), the client would allow messages to be sent that were a few bytes too large. When marshaling RPC/RDMA requests, ensure the combined size of RPC/RDMA header and RPC header do not exceed the inline threshold. Endpoints typically reject RPC/RDMA messages that exceed the size of their receive buffers. The two server implementations I test with (Linux and Solaris) use receive buffers that are larger than the client’s inline threshold. Thus so far this has been benign, observed only by code inspection. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Remove logic that constructs RDMA_MSGP type callsChuck Lever3-107/+51
RDMA_MSGP type calls insert a zero pad in the middle of the RPC message to align the RPC request's data payload to the server's alignment preferences. A server can then "page flip" the payload into place to avoid a data copy in certain circumstances. However: 1. The client has to have a priori knowledge of the server's preferred alignment 2. Requests eligible for RDMA_MSGP are requests that are small enough to have been sent inline, and convey a data payload at the _end_ of the RPC message Today 1. is done with a sysctl, and is a global setting that is copied during mount. Linux does not support CCP to query the server's preferences (RFC 5666, Section 6). A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4 compound fits bullet 2. Thus the Linux client currently leaves RDMA_MSGP disabled. The Linux server handles RDMA_MSGP, but does not use any special page flipping, so it confers no benefit. Clean up the marshaling code by removing the logic that constructs RDMA_MSGP type calls. This also reduces the maximum send iovec size from four to just two elements. /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and thus is left in place. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Clean up rpcrdma_ia_open()Chuck Lever5-45/+67
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which is different for each registration method, to the .ro_open functions. This is refactoring only. No behavior change is expected. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Remove last ib_reg_phys_mr() call siteChuck Lever2-82/+21
All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr() fails, rpcrdma_ia_open() fails and no transport is created. Therefore execution never reaches the ib_reg_phys_mr() call site in rpcrdma_register_internal(), so it can be removed. The remaining logic in rpcrdma_{de}register_internal() is folded into rpcrdma_{alloc,free}_regbuf(). This is clean up only. No behavior change is expected. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Reviewed-By: Sagi Grimberg <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Don't fall back to PHYSICAL memory registrationChuck Lever1-1/+1
PHYSICAL memory registration uses a single rkey for all of the client's memory, thus is insecure. It is still useful in some cases for testing. Retain the ability to select PHYSICAL memory registration capability via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it if the HCA does not support FRWR or FMR. This means amso1100 no longer works out of the box with NFS/RDMA. When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before performing NFS/RDMA mounts. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Raise maximum payload size to one megabyteChuck Lever1-2/+1
The point of larger rsize and wsize is to reduce the per-byte cost of memory registration and deregistration. Modern HCAs can typically handle a megabyte or more with a single registration operation. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Reviewed-By: Sagi Grimberg <[email protected]> Tested-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-08-05xprtrdma: Make xprt_setup_rdma() agnostic to family of server addressChuck Lever1-17/+11
In particular, recognize when an IPv6 connection is bound. Signed-off-by: Chuck Lever <[email protected]> Tested-by: Devesh Sharma <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2015-07-28Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds3-13/+23
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a situation where the client uses the wrong (zero) stateid. - Fix a memory leak in nfs_do_recoalesce Bugfixes: - Plug a memory leak when ->prepare_layoutcommit fails - Fix an Oops in the NFSv4 open code - Fix a backchannel deadlock - Fix a livelock in sunrpc when sendmsg fails due to low memory availability - Don't revalidate the mapping if both size and change attr are up to date - Ensure we don't miss a file extension when doing pNFS - Several fixes to handle NFSv4.1 sequence operation status bits correctly - Several pNFS layout return bugfixes" * tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) nfs: Fix an oops caused by using other thread's stack space in ASYNC mode nfs: plug memory leak when ->prepare_layoutcommit fails SUNRPC: Report TCP errors to the caller sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable. NFSv4.2: handle NFS-specific llseek errors NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce() NFS: Fix a memory leak in nfs_do_recoalesce NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised NFS: Don't revalidate the mapping if both size and change attr are up to date NFSv4/pnfs: Ensure we don't miss a file extension NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked SUNRPC: xprt_complete_bc_request must also decrement the free slot count SUNRPC: Fix a backchannel deadlock pNFS: Don't throw out valid layout segments pNFS: pnfs_roc_drain() fix a race with open pNFS: Fix races between return-on-close and layoutreturn. pNFS: pnfs_roc_drain should return 'true' when sleeping pNFS: Layoutreturn must invalidate all existing layout segments. ...
2015-07-27SUNRPC: Report TCP errors to the callerTrond Myklebust1-7/+6
Signed-off-by: Trond Myklebust <[email protected]>
2015-07-27sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable.NeilBrown1-0/+9
The networking layer does not reliably report the distinction between a non-block write failing because: 1/ the queue is too full already and 2/ a memory allocation attempt failed. The distinction is important because in the first case it is appropriate to retry as soon as the socket reports that it is writable, and in the second case a small delay is required as the socket will most likely report as writable but kmalloc could still fail. sk_stream_wait_memory() exhibits this distinction nicely, setting 'vm_wait' if a small wait is needed. However in the non-blocking case it always returns -EAGAIN no matter the cause of the failure. This -EAGAIN call get all the way to sunrpc. The sunrpc layer expects EAGAIN to indicate the first cause, and ENOBUFS to indicate the second. Various documentation suggests that this is not unreasonable, but does not guarantee the desired error codes. The result of getting -EAGAIN when -ENOBUFS is expected is that the send is tried again in a tight loop and soft lockups are reported. so: add tests after calls to xs_sendpages() to translate -EAGAIN into -ENOBUFS if the socket is writable. This cannot happen inside xs_sendpages() as the test for "is socket writable" is different between TCP and UDP. With this change, the tight loop retrying xs_sendpages() becomes a loop which only retries every 250ms, and so will not trigger a soft-lockup warning. It is possible that the write did fail because the queue was too full and by the time xs_sendpages() completed, the queue was writable again. In this case an extra 250ms delay is inserted that isn't really needed. This circumstance suggests a degree of congestion so a delay is not necessarily a bad thing, and it can only cause a single 250ms delay, not a series of them. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-07-22SUNRPC: xprt_complete_bc_request must also decrement the free slot countTrond Myklebust1-1/+1
Calling xprt_complete_bc_request() effectively causes the slot to be allocated, so it needs to decrement the backchannel free slot count as well. Fixes: 0d2a970d0ae5 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <[email protected]>
2015-07-22SUNRPC: Fix a backchannel deadlockTrond Myklebust1-2/+2
xprt_alloc_bc_request() cannot call xprt_free_bc_request() without deadlocking, since it already holds the xprt->bc_pa_lock. Reported-by: Chuck Lever <[email protected]> Fixes: 0d2a970d0ae55 ("SUNRPC: Fix a backchannel race") Signed-off-by: Trond Myklebust <[email protected]>
2015-07-20svcrdma: Remove svc_rdma_fastreg()Chuck Lever1-34/+0
Commit 0bf4828983df ("svcrdma: refactor marshalling logic") removed the last call site for svc_rdma_fastreg(). Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-07-20svcrdma: Clean up svc_rdma_get_reply_array()Chuck Lever1-0/+73
Kernel coding conventions frown upon having large nontrivial functions in header files, and the preference these days is to allow the compiler to make inlining decisions if possible. As these functions are re-homed into a .c file, be sure that comparisons with fields in struct rpcrdma_msg are with be32 constants. This is a refactoring change; no behavior change is intended. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-07-20svcrdma: Fix send_reply() scatter/gather set-upChuck Lever1-1/+9
The Linux NFS server returns garbage in the data payload of inline NFS/RDMA READ replies. These are READs of under 1000 bytes or so where the client has not provided either a reply chunk or a write list. The NFS server delivers the data payload for an NFS READ reply to the transport in an xdr_buf page list. If the NFS client did not provide a reply chunk or a write list, send_reply() is supposed to set up a separate sge for the page containing the READ data, and another sge for XDR padding if needed, then post all of the sges via a single SEND Work Request. The problem is send_reply() does not advance through the xdr_buf when setting up scatter/gather entries for SEND WR. It always calls dma_map_xdr with xdr_off set to zero. When there's more than one sge, dma_map_xdr() sets up the SEND sge's so they all point to the xdr_buf's head. The current Linux NFS/RDMA client always provides a reply chunk or a write list when performing an NFS READ over RDMA. Therefore, it does not exercise this particular case. The Linux server has never had to use more than one extra sge for building RPC/RDMA replies with a Linux client. However, an NFS/RDMA client _is_ allowed to send small NFS READs without setting up a write list or reply chunk. The NFS READ reply fits entirely within the inline reply buffer in this case. This is perhaps a more efficient way of performing NFS READs that the Linux NFS/RDMA client may some day adopt. Fixes: b432e6b3d9c1 ('svcrdma: Change DMA mapping logic to . . .') BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=285 Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-07-20NFS/RDMA Release resources in svcrdma when device is removedShirley Ma1-0/+1
When removing underlying RDMA device, the rmmod will hang forever if there are any outstanding NFS/RDMA client mounts. The outstanding NFS/RDMA counts could also prevent the server from shutting down. Further debugging shows that the existing connections are not teared down and resource are not released when receiving RDMA_CM_EVENT_DEVICE_REMOVAL event. It seems the original code missing svc_xprt_put() in RDMA_CM_EVENT_REMOVAL event handler thus svc_xprt_free is never invoked to release the existing connection resources. The patch has been passed removing, adding device back and forth without stopping NFS/RDMA service. This will also allow a device to be unplugged and swapped out without shutting down NFS service. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=252 Signed-off-by: Shirley Ma <[email protected]> Reviewed-by: Chuck Lever <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-07-03SUNRPC: Don't confuse ENOBUFS with a write_space issueTrond Myklebust1-1/+2
ENOBUFS means that memory allocations are failing due to an actual low memory situation. It should not be confused with being out of socket buffer space. Handle the problem by just punting to the delay in call_status. Reported-by: Neil Brown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2015-07-03SUNRPC: Don't reencode message if transmission failed with ENOBUFSTrond Myklebust1-2/+3
If we're running out of buffer memory when transmitting data, then we want to just delay for a moment, and then continue transmitting the remainder of the message. Signed-off-by: Trond Myklebust <[email protected]>
2015-07-02Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds15-532/+751
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix a crash in the NFSv4 file locking code. - Fix an fsync() regression, where we were failing to retry I/O in some circumstances. - Fix an infinite loop in NFSv4.0 OPEN stateid recovery - Fix a memory leak when an attempted pnfs fails. - Fix a memory leak in the backchannel code - Large hostnames were not supported correctly in NFSv4.1 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O. - Fix a couple of credential issues in pNFS/flexfiles Bugfixes + cleanups: - Open flag sanity checks in the NFSv4 atomic open codepath - More NFSv4 delegation related bugfixes - Various NFSv4.1 backchannel bugfixes and cleanups - Fix the NFS swap socket code - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code - Fix a UDP transport deadlock issue Features: - More RDMA client transport improvements - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles" * tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits) nfs: Remove invalid tk_pid from debug message nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh nfs: Drop bad comment in nfs41_walk_client_list() nfs: Remove unneeded micro checking of CONFIG_PROC_FS nfs: Don't setting FILE_CREATED flags always nfs: Use remove_proc_subtree() instead remove_proc_entry() nfs: Remove unused argument in nfs_server_set_fsinfo() nfs: Fix a memory leak when meeting an unsupported state protect nfs: take extra reference to fl->fl_file when running a LOCKU operation NFSv4: When returning a delegation, don't reclaim an incompatible open mode. NFSv4.2: LAYOUTSTATS is optional to implement NFSv4.2: Fix up a decoding error in layoutstats pNFS/flexfiles: Fix the reset of struct pgio_header when resending pNFS/flexfiles: Turn off layoutcommit for servers that don't need it pnfs/flexfiles: protect ktime manipulation with mirror lock nfs: provide pnfs_report_layoutstat when NFS42 is disabled nfs: verify open flags before allowing open nfs: always update creds in mirror, even when we have an already connected ds nfs: fix potential credential leak in ff_layout_update_mirror_cred pnfs/flexfiles: report layoutstat regularly ...
2015-07-01Merge tag 'modules-next-for-linus' of ↵Linus Torvalds2-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module updates from Rusty Russell: "Main excitement here is Peter Zijlstra's lockless rbtree optimization to speed module address lookup. He found some abusers of the module lock doing that too. A little bit of parameter work here too; including Dan Streetman's breaking up the big param mutex so writing a parameter can load another module (yeah, really). Unfortunately that broke the usual suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were appended too" * tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits) modules: only use mod->param_lock if CONFIG_MODULES param: fix module param locks when !CONFIG_SYSFS. rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() module: add per-module param_lock module: make perm const params: suppress unused variable error, warn once just in case code changes. modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'. kernel/module.c: avoid ifdefs for sig_enforce declaration kernel/workqueue.c: remove ifdefs over wq_power_efficient kernel/params.c: export param_ops_bool_enable_only kernel/params.c: generalize bool_enable_only kernel/module.c: use generic module param operaters for sig_enforce kernel/params: constify struct kernel_param_ops uses sysfs: tightened sysfs permission checks module: Rework module_addr_{min,max} module: Use __module_address() for module_address_lookup() module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING module: Optimize __module_address() using a latched RB-tree rbtree: Implement generic latch_tree seqlock: Introduce raw_read_seqcount_latch() ...
2015-06-27Merge branch 'for-4.2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds13-198/+129
Pull nfsd updates from Bruce Fields: "A relatively quiet cycle, with a mix of cleanup and smaller bugfixes" * 'for-4.2' of git://linux-nfs.org/~bfields/linux: (24 commits) sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key() nfsd: wrap too long lines in nfsd4_encode_read nfsd: fput rd_file from XDR encode context nfsd: take struct file setup fully into nfs4_preprocess_stateid_op nfsd: refactor nfs4_preprocess_stateid_op nfsd: clean up raparams handling nfsd: use swap() in sort_pacl_range() rpcrdma: Merge svcrdma and xprtrdma modules into one svcrdma: Add a separate "max data segs macro for svcrdma svcrdma: Replace GFP_KERNEL in a loop with GFP_NOFAIL svcrdma: Keep rpcrdma_msg fields in network byte-order svcrdma: Fix byte-swapping in svc_rdma_sendto.c nfsd: Update callback sequnce id only CB_SEQUENCE success nfsd: Reset cb_status in nfsd4_cb_prepare() at retrying svcrdma: Remove svc_rdma_xdr_decode_deferred_req() SUNRPC: Move EXPORT_SYMBOL for svc_process uapi/nfs: Add NFSv4.1 ACL definitions nfsd: Remove dead declarations nfsd: work around a gcc-5.1 warning nfsd: Checking for acl support does not require fetching any acls ...
2015-06-22sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key()Fabian Frederick1-6/+2
Don't opencode sg_init_one() Signed-off-by: Fabian Frederick <[email protected]> Signed-off-by: J. Bruce Fields <[email protected]>
2015-06-20SUNRPC: Set the TCP user timeout option on client socketsTrond Myklebust1-0/+7
Use the TCP_USER_TIMEOUT socket option to advertise to the server how long we will keep the connection open if there is unacknowledged data. See RFC5482. Signed-off-by: Trond Myklebust <[email protected]>
2015-06-19SUNRPC: Ensure we release the TCP socket once it has been closedTrond Myklebust2-19/+23
This fixes a regression introduced by commit caf4ccd4e88cf2 ("SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release"). Prior to that commit, the autoclose feature would ensure that an idle connection would result in the socket being both disconnected and released, whereas now only gets disconnected. While the current behaviour is harmless, it does leave the port bound until either RPC traffic resumes or the RPC client is shut down. Reported-by: Steven Rostedt <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>