aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2017-11-17pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-closeTrond Myklebust3-2/+40
If our layoutreturn on close operation returns an NFS4ERR_OLD_STATEID, then try to update the stateid and retry. We know that there should be no further LAYOUTGET requests being launched. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4: Don't try to CLOSE if the stateid 'other' field has changedTrond Myklebust3-12/+13
If the stateid is no longer recognised on the server, either due to a restart, or due to a competing CLOSE call, then we do not have to retry. Any open contexts that triggered a reopen of the file, will also act as triggers for any CLOSE for the updated stateids. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.Trond Myklebust5-2/+65
If we're racing with an OPEN, then retry the operation instead of declaring it a success. Signed-off-by: Trond Myklebust <[email protected]> [Andrew W Elble: Fix a typo in nfs4_refresh_open_stateid] Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFS: Fix a typo in nfs_rename()Trond Myklebust1-1/+1
On successful rename, the "old_dentry" is retained and is attached to the "new_dir", so we need to call nfs_set_verifier() accordingly. Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4: Fix open create exclusive when the server rebootsTrond Myklebust1-15/+26
If the server that does not implement NFSv4.1 persistent session semantics reboots while we are performing an exclusive create, then the return value of NFS4ERR_DELAY when we replay the open during the grace period causes us to lose the verifier. When the grace period expires, and we present a new verifier, the server will then correctly reply NFS4ERR_EXIST. This commit ensures that we always present the same verifier when replaying the OPEN. Reported-by: Tigran Mkrtchyan <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4: Add a tracepoint to document open stateid updatesTrond Myklebust2-0/+5
Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4: Fix OPEN / CLOSE raceTrond Myklebust3-35/+123
Ben Coddington has noted the following race between OPEN and CLOSE on a single client. Process 1 Process 2 Server ========= ========= ====== 1) OPEN file 2) OPEN file 3) Process OPEN (1) seqid=1 4) Process OPEN (2) seqid=2 5) Reply OPEN (2) 6) Receive reply (2) 7) new stateid, seqid=2 8) CLOSE file, using stateid w/ seqid=2 9) Reply OPEN (1) 10( Process CLOSE (8) 11) Reply CLOSE (8) 12) Forget stateid file closed 13) Receive reply (7) 14) Forget stateid file closed. 15) Receive reply (1). 16) New stateid seqid=1 is really the same stateid that was closed. IOW: the reply to the first OPEN is delayed. Since "Process 2" does not wait before closing the file, and it does not cache the closed stateid, then when the delayed reply is finally received, it is treated as setting up a new stateid by the client. The fix is to ensure that the client processes the OPEN and CLOSE calls in the same order in which the server processed them. This commit ensures that we examine the seqid of the stateid returned by OPEN. If it is a new stateid, we assume the seqid must be equal to the value 1, and that each state transition increments the seqid value by 1 (See RFC7530, Section 9.1.4.2, and RFC5661, Section 8.2.2). If the tracker sees that an OPEN returns with a seqid that is greater than the cached seqid + 1, then it bumps a flag to ensure that the caller waits for the RPCs carrying the missing seqids to complete. Note that there can still be pathologies where the server crashes before it can even send us the missing seqids. Since the OPEN call is still holding a slot when it waits here, that could cause the recovery to stall forever. To avoid that, we time out after a 5 second wait. Reported-by: Benjamin Coddington <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17sunrpc: Add rpc_request static trace pointChuck Lever2-2/+31
Display information about the RPC procedure being requested in the trace log. This sometimes critical information cannot always be derived from other RPC trace entries. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17sunrpc: Fix rpc_task_begin trace pointChuck Lever1-2/+1
The rpc_task_begin trace point always display a task ID of zero. Move the trace point call site so that it picks up the new task ID. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17SUNRPC: Fix parsing failure in trace points with XIDsChuck Lever1-24/+25
mount.nf-11159 8.... 905.248380: xprt_transmit: [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049 mount.nf-11159 8.... 905.248381: rpc_task_sleep: task:6210@1 flags=0e80 state=0005 status=0 timeout=60000 queue=xprt_pending kworker/-1591 1.... 905.248419: xprt_lookup_rqst: [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049 kworker/-1591 1.... 905.248423: xprt_complete_rqst: [FAILED TO PARSE] xid=351291440 status=24 addr=192.168.2.5 port=20049 Byte swapping is not available during trace-cmd report. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17net: sunrpc: mark expected switch fall-throughsGustavo A. R. Silva3-0/+16
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFS: Fix bool initialization/comparisonThomas Meyer4-8/+8
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFS: Avoid RCU usage in tracepointsAnna Schumaker1-18/+6
There isn't an obvious way to acquire and release the RCU lock during a tracepoint, so we can't use the rpc_peeraddr2str() function here. Instead, rely on the client's cl_hostname, which should have similar enough information without needing an rcu_dereference(). Reported-by: Dave Jones <[email protected]> Cc: [email protected] # v3.12 Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Update copyright noticesChuck Lever5-0/+5
Credit work contributed by Oracle engineers since 2014. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Remove include for linux/prefetch.hChuck Lever1-1/+0
Clean up. This include should have been removed by commit 23826c7aeac7 ("xprtrdma: Serialize credit accounting again"). Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17rpcrdma: Remove C structure definitions of XDR data itemsChuck Lever3-68/+3
Clean up: C-structure style XDR encoding and decoding logic has been replaced over the past several merge windows on both the client and server. These data structures are no longer used. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE modeChuck Lever1-1/+1
Lift the Send and LocalInv completion handlers out of soft IRQ mode to make room for other work. Also, move the Send CQ to a different CPU than the CPU where the Receive CQ is running, for improved scalability. Signed-off-by: Chuck Lever <[email protected]> Reviewed-by: Devesh Sharma <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17Merge branch 'overlayfs-linus' of ↵Linus Torvalds10-376/+576
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: - Report constant st_ino values across copy-up even if underlying layers are on different filesystems, but using different st_dev values for each layer. Ideally we'd report the same st_dev across the overlay, and it's possible to do for filesystems that use only 32bits for st_ino by unifying the inum space. It would be nice if it wasn't a choice of 32 or 64, rather filesystems could report their current maximum (that could change on resize, so it wouldn't be set in stone). - miscellaneus fixes and a cleanup of ovl_fill_super(), that was long overdue. - created a path_put_init() helper that clears out the pointers after putting the ref. I think this could be useful elsewhere, so added it to <linux/path.h> * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (30 commits) ovl: remove unneeded arg from ovl_verify_origin() ovl: Put upperdentry if ovl_check_origin() fails ovl: rename ufs to ofs ovl: clean up getting lower layers ovl: clean up workdir creation ovl: clean up getting upper layer ovl: move ovl_get_workdir() and ovl_get_lower_layers() ovl: reduce the number of arguments for ovl_workdir_create() ovl: change order of setup in ovl_fill_super() ovl: factor out ovl_free_fs() helper ovl: grab reference to workbasedir early ovl: split out ovl_get_indexdir() from ovl_fill_super() ovl: split out ovl_get_lower_layers() from ovl_fill_super() ovl: split out ovl_get_workdir() from ovl_fill_super() ovl: split out ovl_get_upper() from ovl_fill_super() ovl: split out ovl_get_lowerstack() from ovl_fill_super() ovl: split out ovl_get_workpath() from ovl_fill_super() ovl: split out ovl_get_upperpath() from ovl_fill_super() ovl: use path_put_init() in error paths for ovl_fill_super() vfs: add path_put_init() ...
2017-11-17Merge tag 'locks-v4.15-1' of ↵Linus Torvalds10-20/+10
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking update from Jeff Layton: "A couple of fixes for a patch that went into v4.14, and the bug report just came in a few days ago.. It passes my (minimal) testing, and has been in linux-next for a few days now. I also would like to get my address changed in MAINTAINERS to clear that hurdle" * tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall fcntl: don't leak fd reference when fixup_compat_flock fails MAINTAINERS: s/[email protected]/[email protected]/
2017-11-17Merge branch 'work.cramfs' of ↵Linus Torvalds6-69/+584
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull cramfs updates from Al Viro: "Nicolas Pitre's cramfs work" * 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cramfs: rehabilitate it cramfs: add mmap support cramfs: implement uncompressed and arbitrary data block positioning cramfs: direct memory access support
2017-11-17Merge branch 'work.misc' of ↵Linus Torvalds16-52/+60
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, really no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: grab the lock instead of blocking in __fd_install during resizing vfs: stop clearing close on exec when closing a fd include/linux/fs.h: fix comment about struct address_space fs: make fiemap work from compat_ioctl coda: fix 'kernel memory exposure attempt' in fsync pstore: remove unneeded unlikely() vfs: remove unneeded unlikely() stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case make vfs_ustat() static do_handle_open() should be static elf_fdpic: fix unused variable warning fold destroy_super() into __put_super() new helper: destroy_unused_super() fix address space warnings in ipc/ acct.h: get rid of detritus
2017-11-17Merge branch 'work.get_user_pages_fast' of ↵Linus Torvalds8-36/+20
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull get_user_pages_fast() conversion from Al Viro: "A bunch of places switched to get_user_pages_fast()" * 'work.get_user_pages_fast' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph: use get_user_pages_fast() pvr2fs: use get_user_pages_fast() atomisp: use get_user_pages_fast() st: use get_user_pages_fast() via_dmablit(): use get_user_pages_fast() fsl_hypervisor: switch to get_user_pages_fast() rapidio: switch to get_user_pages_fast() vchiq_2835_arm: switch to get_user_pages_fast()
2017-11-17Merge branch 'work.iov_iter' of ↵Linus Torvalds15-440/+202
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: - bio_{map,copy}_user_iov() series; those are cleanups - fixes from the same pile went into mainline (and stable) in late September. - fs/iomap.c iov_iter-related fixes - new primitive - iov_iter_for_each_range(), which applies a function to kernel-mapped segments of an iov_iter. Usable for kvec and bvec ones, the latter does kmap()/kunmap() around the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the latter is not hard to fix if the need ever appears, the former is by design. Another related primitive will have to wait for the next cycle - it passes page + offset + size instead of pointer + size, and that one will be usable for everything _except_ kvec. Unfortunately, that one didn't get exposure in -next yet, so... - a bit more lustre iov_iter work, including a use case for iov_iter_for_each_range() (checksum calculation) - vhost/scsi leak fix in failure exit - misc cleanups and detritectomy... * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits) iomap_dio_actor(): fix iov_iter bugs switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range() lustre: switch struct ksock_conn to iov_iter vhost/scsi: switch to iov_iter_get_pages() fix a page leak in vhost_scsi_iov_to_sgl() error recovery new primitive: iov_iter_for_each_range() lnet_return_rx_credits_locked: don't abuse list_entry xen: don't open-code iov_iter_kvec() orangefs: remove detritus from struct orangefs_kiocb_s kill iov_shorten() bio_alloc_map_data(): do bmd->iter setup right there bio_copy_user_iov(): saner bio size calculation bio_map_user_iov(): get rid of copying iov_iter bio_copy_from_iter(): get rid of copying iov_iter move more stuff down into bio_copy_user_iov() blk_rq_map_user_iov(): move iov_iter_advance() down bio_map_user_iov(): get rid of the iov_for_each() bio_map_user_iov(): move alignment check into the main loop don't rely upon subsequent bio_add_pc_page() calls failing ... and with iov_iter_get_pages_alloc() it becomes even simpler ...
2017-11-17Revert "drm/radeon: dont switch vt on suspend"Alex Deucher1-1/+0
Fixes distorted colors on some cards on resume from suspend. This reverts commit b9729b17a414f99c61f4db9ac9f9ed987fa0cbfe. Bug: https://bugs.freedesktop.org/show_bug.cgi?id=98832 Bug: https://bugs.freedesktop.org/show_bug.cgi?id=99163 Bug: https://bugzilla.kernel.org/show_bug.cgi?id=107001 Reviewed-by: Michel Dänzer <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected]
2017-11-17drm/amd/amdgpu: fix over-bound accessing in amdgpu_cs_wait_any_fenceRoger He1-1/+1
Fixes an oops in amdgpu_cs_wait_any_fence. Reviewed-by: Christian König <[email protected]> Reviewed-by: Chunming Zhou <[email protected]> Signed-off-by: Roger He <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
2017-11-17Merge branch 'misc.compat' of ↵Linus Torvalds32-821/+508
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull compat and uaccess updates from Al Viro: - {get,put}_compat_sigset() series - assorted compat ioctl stuff - more set_fs() elimination - a few more timespec64 conversions - several removals of pointless access_ok() in places where it was followed only by non-__ variants of primitives * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits) coredump: call do_unlinkat directly instead of sys_unlink fs: expose do_unlinkat for built-in callers ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs() ipmi: get rid of pointless access_ok() pi433: sanitize ioctl cxlflash: get rid of pointless access_ok() mtdchar: get rid of pointless access_ok() r128: switch compat ioctls to drm_ioctl_kernel() selection: get rid of field-by-field copyin VT_RESIZEX: get rid of field-by-field copyin i2c compat ioctls: move to ->compat_ioctl() sched_rr_get_interval(): move compat to native, get rid of set_fs() mips: switch to {get,put}_compat_sigset() sparc: switch to {get,put}_compat_sigset() s390: switch to {get,put}_compat_sigset() ppc: switch to {get,put}_compat_sigset() parisc: switch to {get,put}_compat_sigset() get_compat_sigset() get rid of {get,put}_compat_itimerspec() io_getevents: Use timespec64 to represent timeouts ...
2017-11-17Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds19-166/+311
Pull more block layer updates from Jens Axboe: "A followup pull request, with some parts that either needed a bit more testing before going in, merge sync, or just later arriving fixes. This contains: - Timer related updates from Kees. These were purposefully delayed since I didn't want to pull in a later v4.14-rc tag to my block tree. - ide-cd prep sense buffer fix from Bart. Also delayed, as not to clash with the late fix we put into 4.14-rc. - Small BFQ updates series from Luca and Paolo. - Single nvmet fix from James, fixing a non-functional case there. - Bio fast clone fix from Michael, which made bcache return the wrong data for some cases. - Legacy IO path regression hang fix from Ming" * 'for-linus' of git://git.kernel.dk/linux-block: bio: ensure __bio_clone_fast copies bi_partno nvmet_fc: fix better length checking block: wake up all tasks blocked in get_request() block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP block, bfq: update blkio stats outside the scheduler lock block, bfq: add missing invocations of bfqg_stats_update_io_add/remove doc, block, bfq: update max IOPS sustainable with BFQ ide: Make ide_cdrom_prep_fs() initialize the sense buffer pointer md: Convert timers to use timer_setup() block: swim3: Convert timers to use timer_setup() block/aoe: Convert timers to use timer_setup() amifloppy: Convert timers to use timer_setup() block/floppy: Convert callback to pass timer_list
2017-11-17fs, nfs: convert nfs_client.cl_count from atomic_t to refcount_tElena Reshetova7-32/+33
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_client.cl_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert nfs_lock_context.count from atomic_t to refcount_tElena Reshetova2-7/+8
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_lock_context.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert nfs4_lock_state.ls_count from atomic_t to refcount_tElena Reshetova3-8/+8
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_lock_state.ls_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert nfs_cache_defer_req.count from atomic_t to refcount_tElena Reshetova2-4/+4
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_cache_defer_req.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert nfs4_ff_layout_mirror.ref from atomic_t to refcount_tElena Reshetova2-5/+6
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_ff_layout_mirror.ref is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert pnfs_layout_hdr.plh_refcount from atomic_t to refcount_tElena Reshetova2-7/+7
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable pnfs_layout_hdr.plh_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert pnfs_layout_segment.pls_refcount from atomic_t to refcount_tElena Reshetova2-8/+8
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Hans Liljestrand <[email protected]> Signed-off-by: Kees Cook <[email protected]> Signed-off-by: David Windsor <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17fs, nfs: convert nfs4_pnfs_ds.ds_count from atomic_t to refcount_tElena Reshetova2-6/+7
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_pnfs_ds.ds_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17NFSv4.1: Fix up replays of interrupted requestsTrond Myklebust2-47/+103
If the previous request on a slot was interrupted before it was processed by the server, then our slot sequence number may be out of whack, and so we try the next operation using the old sequence number. The problem with this, is that not all servers check to see that the client is replaying the same operations as previously when they decide to go to the replay cache, and so instead of the expected error of NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could (if the operations match up) be mistaken by the client for a new reply. To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op in order to resync our slot sequence number. Cc: Olga Kornievskaia <[email protected]> [[email protected]: fix an Oops] Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Remove atomic send completion countingChuck Lever3-33/+0
The sendctx circular queue now guarantees that xprtrdma cannot overflow the Send Queue, so remove the remaining bits of the original Send WQE counting mechanism. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: RPC completion should wait for Send completionChuck Lever4-4/+34
When an RPC Call includes a file data payload, that payload can come from pages in the page cache, or a user buffer (for direct I/O). If the payload can fit inline, xprtrdma includes it in the Send using a scatter-gather technique. xprtrdma mustn't allow the RPC consumer to re-use the memory where that payload resides before the Send completes. Otherwise, the new contents of that memory would be exposed by an HCA retransmit of the Send operation. So, block RPC completion on Send completion, but only in the case where a separate file data payload is part of the Send. This prevents the reuse of that memory while it is still part of a Send operation without an undue cost to other cases. Waiting is avoided in the common case because typically the Send will have completed long before the RPC Reply arrives. These days, an RPC timeout will trigger a disconnect, which tears down the QP. The disconnect flushes all waiting Sends. This bounds the amount of time the reply handler has to wait for a Send completion. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Refactor rpcrdma_deferred_completionChuck Lever3-13/+22
Invoke a common routine for releasing hardware resources (for example, invalidating MRs). This needs to be done whether an RPC Reply has arrived or the RPC was terminated early. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Add a field of bit flags to struct rpcrdma_reqChuck Lever4-4/+8
We have one boolean flag in rpcrdma_req today. I'd like to add more flags, so convert that boolean to a bit flag. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Add data structure to manage RDMA Send argumentsChuck Lever4-32/+247
Problem statement: Recently Sagi Grimberg <[email protected]> observed that kernel RDMA- enabled storage initiators don't handle delayed Send completion correctly. If Send completion is delayed beyond the end of a ULP transaction, the ULP may release resources that are still being used by the HCA to complete a long-running Send operation. This is a common design trait amongst our initiators. Most Send operations are faster than the ULP transaction they are part of. Waiting for a completion for these is typically unnecessary. Infrequently, a network partition or some other problem crops up where an ordering problem can occur. In NFS parlance, the RPC Reply arrives and completes the RPC, but the HCA is still retrying the Send WR that conveyed the RPC Call. In this case, the HCA can try to use memory that has been invalidated or DMA unmapped, and the connection is lost. If that memory has been re-used for something else (possibly not related to NFS), and the Send retransmission exposes that data on the wire. Thus we cannot assume that it is safe to release Send-related resources just because a ULP reply has arrived. After some analysis, we have determined that the completion housekeeping will not be difficult for xprtrdma: - Inline Send buffers are registered via the local DMA key, and are already left DMA mapped for the lifetime of a transport connection, thus no additional handling is necessary for those - Gathered Sends involving page cache pages _will_ need to DMA unmap those pages after the Send completes. But like inline send buffers, they are registered via the local DMA key, and thus will not need to be invalidated In addition, RPC completion will need to wait for Send completion in the latter case. However, nearly always, the Send that conveys the RPC Call will have completed long before the RPC Reply arrives, and thus no additional latency will be accrued. Design notes: In this patch, the rpcrdma_sendctx object is introduced, and a lock-free circular queue is added to manage a set of them per transport. The RPC client's send path already prevents sending more than one RPC Call at the same time. This allows us to treat the consumer side of the queue (rpcrdma_sendctx_get_locked) as if there is a single consumer thread. The producer side of the queue (rpcrdma_sendctx_put_locked) is invoked only from the Send completion handler, which is a single thread of execution (soft IRQ). The only care that needs to be taken is with the tail index, which is shared between the producer and consumer. Only the producer updates the tail index. The consumer compares the head with the tail to ensure that the a sendctx that is in use is never handed out again (or, expressed more conventionally, the queue is empty). When the sendctx queue empties completely, there are enough Sends outstanding that posting more Send operations can result in a Send Queue overflow. In this case, the ULP is told to wait and try again. This introduces strong Send Queue accounting to xprtrdma. As a final touch, Jason Gunthorpe <[email protected]> suggested a mechanism that does not require signaling every Send. We signal once every N Sends, and perform SGE unmapping of N Send operations during that one completion. Reported-by: Sagi Grimberg <[email protected]> Suggested-by: Jason Gunthorpe <[email protected]> Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: "Unoptimize" rpcrdma_prepare_hdr_sge()Chuck Lever1-7/+5
Commit 655fec6987be ("xprtrdma: Use gathered Send for large inline messages") assumed that, since the zeroeth element of the Send SGE array always pointed to req->rl_rdmabuf, it needed to be initialized just once. This was a valid assumption because the Send SGE array and rl_rdmabuf both live in the same rpcrdma_req. In a subsequent patch, the Send SGE array will be separated from the rpcrdma_req, so the zeroeth element of the SGE array needs to be initialized every time. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Change return value of rpcrdma_prepare_send_sges()Chuck Lever3-24/+38
Clean up: Make rpcrdma_prepare_send_sges() return a negative errno instead of a bool. Soon callers will want distinct treatments of different types of failures. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Fix error handling in rpcrdma_prepare_msg_sges()Chuck Lever1-14/+24
When this function fails, it needs to undo the DMA mappings it's done so far. Otherwise these are leaked. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges()Chuck Lever1-1/+1
Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs it added, plus one for the header SGE just mapped in rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an assumption about when these functions are called. Instead, maintain a running count that both functions can update with just the number of SGEs they have added to the SGE array. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Decode credits field in rpcrdma_reply_handlerChuck Lever3-27/+14
We need to decode and save the incoming rdma_credits field _after_ we know that the direction of the message is "forward direction Reply". Otherwise, the credits value in reverse direction Calls is also used to update the forward direction credits. It is safe to decode the rdma_credits field in rpcrdma_reply_handler now that rpcrdma_reply_handler is single-threaded. Receives complete in the same order as they were sent on the NFS server. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completionChuck Lever3-23/+36
I noticed that the soft IRQ thread looked pretty busy under heavy I/O workloads. perf suggested one area that was expensive was the queue_work() call in rpcrdma_wc_receive. That gave me some ideas. Instead of scheduling a separate worker to process RPC Replies, promote the Receive completion handler to IB_POLL_WORKQUEUE, and invoke rpcrdma_reply_handler directly. Note that the poll workqueue is single-threaded. In order to keep memory invalidation from serializing all RPC Replies, handle any necessary invalidation tasks in a separate multi-threaded workqueue. This provides a two-tier scheme, similar to OS I/O interrupt handlers: A fast interrupt handler that schedules the slow handler and re-enables the interrupt, and a slower handler that is invoked for any needed heavy lifting. Benefits include: - One less context switch for RPCs that don't register memory - Receive completion handling is moved out of soft IRQ context to make room for other users of soft IRQ - The same CPU core now DMA syncs and XDR decodes the Receive buffer Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Refactor rpcrdma_reply_handler some moreChuck Lever2-57/+69
Clean up: I'd like to be able to invoke the tail of rpcrdma_reply_handler in two different places. Split the tail out into its own helper function. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Move decoded header fields into rpcrdma_repChuck Lever2-19/+20
Clean up: Make it easier to pass the decoded XID, vers, credits, and proc fields around by moving these variables into struct rpcrdma_rep. Note: the credits field will be handled in a subsequent patch. Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2017-11-17xprtrdma: Throw away reply when version is unrecognizedChuck Lever1-9/+8
A reply with an unrecognized value in the version field means the transport header is potentially garbled and therefore all the fields are untrustworthy. Fixes: 59aa1f9a3cce3 ("xprtrdma: Properly handle RDMA_ERROR ... ") Signed-off-by: Chuck Lever <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>