aboutsummaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)AuthorFilesLines
2022-03-22NFS: Don't deadlock when cookie hashes collideTrond Myklebust1-11/+18
In the very rare case where the readdir reply contains multiple cookies that map to the same hash value, we can end up deadlocking waiting for a page lock that we already hold. In this case we should fail the page lock by using grab_cache_page_nowait(). Signed-off-by: Trond Myklebust <[email protected]>
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds2-23/+4
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-21NFSv4.1 provide mount option to toggle trunking discoveryOlga Kornievskaia2-1/+10
Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to toggle whether or not the client will engage in actively discovery of trunking locations. v2 make notrunkdiscovery default Signed-off-by: Olga Kornievskaia <[email protected]> Fixes: 1976b2b31462 ("NFSv4.1 query for fs_location attr on a new file system") Signed-off-by: Trond Myklebust <[email protected]>
2022-03-18fscache: export fscache_end_operation()Jeffle Xu1-8/+0
Export fscache_end_operation() to avoid code duplication. Besides, considering the paired fscache_begin_read_operation() is already exported, it shall make sense to also export fscache_end_operation(). Signed-off-by: Jeffle Xu <[email protected]> Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] Link: https://lore.kernel.org/r/[email protected]/ # Jeffle's v4 Link: https://lore.kernel.org/r/164622971432.3564931.12184135678781328146.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/164678190346.1200972.7453733431978569479.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/164692888334.2099075.5166283293894267365.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/[email protected]/ # v5
2022-03-15fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folioMatthew Wilcox (Oracle)1-1/+1
These filesystems use __set_page_dirty_nobuffers() either directly or with a very thin wrapper; convert them en masse. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2022-03-15nfs: Convert from launder_page to launder_folioMatthew Wilcox (Oracle)1-7/+7
We don't need to use page_file_mapping() here because launder_folio is never called for swap cache pages. We also don't need to cast an loff_t in order to print it. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2022-03-15nfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)2-12/+12
Print the folio index instead of the pointer, since this is more useful. We also don't need to use page_file_mapping() as we do not invalidate swapcache pages. Since this is the only caller of nfs_wb_page_cancel(), convert it to nfs_wb_folio_cancel(). Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Tested-by: Damien Le Moal <[email protected]> Acked-by: Damien Le Moal <[email protected]> Tested-by: Mike Marshall <[email protected]> # orangefs Tested-by: David Howells <[email protected]> # afs
2022-03-13NFS: swap-out must always use STABLE writes.NeilBrown1-4/+6
The commit handling code is not safe against memory-pressure deadlocks when writing to swap. In particular, nfs_commitdata_alloc() blocks indefinitely waiting for memory, and this can consume all available workqueue threads. swap-out most likely uses STABLE writes anyway as COND_STABLE indicates that a stable write should be used if the write fits in a single request, and it normally does. However if we ever swap with a small wsize, or gather unusually large numbers of pages for a single write, this might change. For safety, make it explicit in the code that direct writes used for swap must always use FLUSH_STABLE. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: swap IO handling is slightly different for O_DIRECT IONeilBrown2-16/+30
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted. We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw() 2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw(). Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFSv4: keep state manager thread active if swap is enabledNeilBrown4-10/+66
If we are swapping over NFSv4, we may not be able to allocate memory to start the state-manager thread at the time when we need it. So keep it always running when swap is enabled, and just signal it to start. This requires updating and testing the cl_swapper count on the root rpc_clnt after following all ->cl_parent links. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOCNeilBrown1-0/+2
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC to be used for some allocations. This is needed in some cases, but not in all where it is currently provided, and in some where it isn't provided. Currently *all* tasks associated with a rpc_client on which swap is enabled get the flag and hence some GFP_MEMALLOC support. GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it. However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does need it. xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks, then it blocks all other queued tasks. So this allocation needs GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used for any swap writes. Similarly, if the transport is not connected, that will block all requests including swap writes, so memory allocations should get GFP_MEMALLOC if swap writes are possible. So with this patch: 1/ we ONLY set RPC_TASK_SWAPPER for swap writes. 2/ __rpc_execute() sets PF_MEMALLOC while handling any task with RPC_TASK_SWAPPER set, or when handling any task that holds the XPRT_LOCKED lock on an xprt used for swap. This removes the need for the RPC_IS_SWAPPER() test in ->buf_alloc handlers. 3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking any task to a swapper xprt. __rpc_execute() will clear it. 3/ PF_MEMALLOC is set for all the connect workers. Reviewed-by: Chuck Lever <[email protected]> (for xprtrdma parts) Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDSNeilBrown1-4/+0
NFS_RPC_SWAPFLAGS is only used for READ requests. It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to requests. This is not needed for swap READ - though it is for writes where it is set via a different mechanism. RPC_TASK_ROOTCREDS causes the 'machine' credential to be used. This is not needed as the root credential is saved when the swap file is opened, and this is used for all IO. So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of RPC_TASK_ROOTCREDS, that isn't needed either. Remove both. Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: remove IS_SWAPFILE hackNeilBrown1-5/+0
This code is pointless as IS_SWAPFILE is always defined. So remove it. Suggested-by: Mark Hemment <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHEDave Wysochanski1-10/+0
The fscache cookie APIs including fscache_acquire_cookie() and fscache_relinquish_cookie() now have very good tracing. Thus, there is no real need for dfprintks in the NFS fscache interface. The NFS fscache interface has removed all dfprintks so remove the NFSDBG_FSCACHE defines. Signed-off-by: Dave Wysochanski <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: Replace dfprintks with tracepoints in fscache read and write page functionsDave Wysochanski2-18/+102
Most of fscache and other NFS IO paths are now using tracepoints. Remove the dfprintks in the NFS fscache read/write page functions and replace with tracepoints at the begin and end of the functions. Signed-off-by: Dave Wysochanski <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: Rename fscache read and write pages functionsDave Wysochanski3-22/+15
Rename NFS fscache functions in a more consistent fashion to better reflect when we read from and write to fscache. Signed-off-by: Dave Wysochanski <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: Cleanup usage of nfs_inode in fscache interfaceDave Wysochanski2-15/+13
A number of places in the fscache interface used nfs_inode when inode could be used, simplifying the code. Signed-off-by: Dave Wysochanski <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFSv4.1 restrict GETATTR fs_location query to the main transportOlga Kornievskaia1-2/+13
In the presence of trunking transports, it's helpful to make sure that during the migration event, the GETATTR for fs_location attribute happens on the main transport. Signed-off-by: Olga Kornievskaia <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-13NFS: remove unneeded check in decode_devicenotify_args()Alexey Khoroshilov1-4/+0
[You don't often get email from [email protected]. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.] Overflow check in not needed anymore after we switch to kmalloc_array(). Signed-off-by: Alexey Khoroshilov <[email protected]> Fixes: a4f743a6bb20 ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()") Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Cache all entries in the readdirplus replyTrond Myklebust1-14/+26
Even if we're not able to cache all the entries in the readdir buffer, let's ensure that we do prime the dcache. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Optimise away the previous cookie fieldTrond Myklebust4-15/+14
Replace the 'previous cookie' field in struct nfs_entry with the array->last_cookie. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Fix up forced readdirplusTrond Myklebust2-17/+40
Avoid clearing the entire readdir page cache if we're just doing forced readdirplus for the 'ls -l' heuristic. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Convert readdir page cache to use a cookie based indexTrond Myklebust2-84/+69
Instead of using a linear index to address the pages, use the cookie of the first entry, since that is what we use to match the page anyway. This allows us to avoid re-reading the entire cache on a seekdir() type of operation. The latter is very common when re-exporting NFS, and is a major performance drain. The change does affect our duplicate cookie detection, since we can no longer rely on the page index as a linear offset for detecting whether we looped backwards. However since we no longer do a linear search through all the pages on each call to nfs_readdir(), this is less of a concern than it was previously. The other downside is that invalidate_mapping_pages() no longer can use the page index to avoid clearing pages that have been read. A subsequent patch will restore the functionality this provides to the 'ls -l' heuristic. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Clean up page array initialisation/freeTrond Myklebust1-10/+6
Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Trace effects of the readdirplus heuristicTrond Myklebust2-1/+60
Enable tracking of when the readdirplus heuristic causes a page cache invalidation. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Trace effects of readdirplus on the dcacheTrond Myklebust2-0/+8
Trace the effects of readdirplus on attribute and dentry revalidation. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Add basic readdir tracingTrond Myklebust2-1/+80
Add tracing to track how often the client goes to the server for updated readdir information. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Don't request readdirplus when revalidation was forcedTrond Myklebust1-10/+16
If the revalidation was forced, due to the presence of a LOOKUP_EXCL or a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help when we're doing a path component lookup. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Readdirplus can't help lookup for case insensitive filesystemsTrond Myklebust1-0/+2
If the filesystem is case insensitive, then readdirplus can't help with cache misses, since it won't return case folded variants of the filename. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFSv4: Ask for a full XDR buffer of readdir goodnessTrond Myklebust2-6/+7
Instead of pretending that we know the ratio of directory info vs readdirplus attribute info, just set the 'dircount' field to the same value as the 'maxcount' field. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Don't ask for readdirplus unless it can help nfs_getattr()Trond Myklebust1-20/+25
If attribute caching is turned off, then use of readdirplus is not going to help stat() performance. Readdirplus also doesn't help if a file is being written to, since we will have to flush those writes in order to sync the mtime/ctime. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Improve heuristic for readdirplusTrond Myklebust4-34/+55
The heuristic for readdirplus is designed to try to detect 'ls -l' and similar patterns. It does so by looking for cache hit/miss patterns in both the attribute cache and in the dcache of the files in a given directory, and then sets a flag for the readdirplus code to interpret. The problem with this approach is that a single attribute or dcache miss can cause the NFS code to force a refresh of the attributes for the entire set of files contained in the directory. To be able to make a more nuanced decision, let's sample the number of hits and misses in the set of open directory descriptors. That allows us to set thresholds at which we start preferring READDIRPLUS over regular READDIR, or at which we start to force a re-read of the remaining readdir cache using READDIRPLUS. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Reduce use of uncached readdirTrond Myklebust1-20/+3
When reading a very large directory, we want to try to keep the page cache up to date if doing so is inexpensive. With the change to allow readdir to continue reading even when the cache is incomplete, we no longer need to fall back to uncached readdir in order to scale to large directories. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Simplify nfs_readdir_xdr_to_array()Trond Myklebust1-18/+11
Recent changes to readdir mean that we can cope with partially filled page cache entries, so we no longer need to rely on looping in nfs_readdir_xdr_to_array(). Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: If the cookie verifier changes, we must invalidate the page cacheTrond Myklebust1-1/+6
Ensure that if the cookie verifier changes when we use the zero-valued cookie, then we invalidate any cached pages. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Adjust the amount of readahead performed by NFS readdirTrond Myklebust1-1/+52
The current NFS readdir code will always try to maximise the amount of readahead it performs on the assumption that we can cache anything that isn't immediately read by the process. There are several cases where this assumption breaks down, including when the 'ls -l' heuristic kicks in to try to force use of readdirplus as a batch replacement for lookup/getattr. This patch therefore tries to tone down the amount of readahead we perform, and adjust it to try to match the amount of data being requested by user space. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Don't advance the page pointer unless the page is fullTrond Myklebust1-10/+22
When we hit the end of the data in the readdir page, we don't want to start filling a new page, unless this one is full. Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Don't re-read the entire page cache to find the next cookieTrond Myklebust1-3/+7
If the page cache entry that was last read gets invalidated for some reason, then make sure we can re-create it on the next call to readdir. This, combined with the cache page validation, allows us to reuse the cached value of page-index on successive calls to nfs_readdir. Credit is due to Benjamin Coddington for showing that the concept works, and that it allows for improved cache sharing between processes even in the case where pages are lost due to LRU or active invalidation. Suggested-by: Benjamin Coddington <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-03-02NFS: Store the change attribute in the directory page cacheTrond Myklebust1-31/+37
Use the change attribute and the first cookie in a directory page cache entry to validate that the page is up to date. Suggested-by: Benjamin Coddington <[email protected]> Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28NFSD: Move svc_serv_ops::svo_function into struct svc_servChuck Lever1-32/+11
Hoist svo_function back into svc_serv and remove struct svc_serv_ops, since the struct is now devoid of fields. Signed-off-by: Chuck Lever <[email protected]>
2022-02-28NFSD: Remove svc_serv_ops::svo_moduleChuck Lever2-6/+2
struct svc_serv_ops is about to be removed. Neil Brown says: > I suspect svo_module can go as well - I don't think the thread is > ever the thing that primarily keeps a module active. A random sample of kthread_create() callers shows sunrpc is the only one that manages module reference count in this way. Suggested-by: Neil Brown <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
2022-02-28SUNRPC: Remove svc_shutdown_net()Chuck Lever1-1/+1
Clean up: svc_shutdown_net() now does nothing but call svc_close_net(). Replace all external call sites. svc_close_net() is renamed to be the inverse of svc_xprt_create(). Signed-off-by: Chuck Lever <[email protected]>
2022-02-28SUNRPC: Rename svc_create_xprt()Chuck Lever1-6/+6
Clean up: Use the "svc_xprt_<task>" function naming convention as is used for other external APIs. Signed-off-by: Chuck Lever <[email protected]>
2022-02-28SUNRPC: Remove the .svo_enqueue_xprt methodChuck Lever1-2/+0
We have never been able to track down and address the underlying cause of the performance issues with workqueue-based service support. svo_enqueue_xprt is called multiple times per RPC, so it adds instruction path length, but always ends up at the same function: svc_xprt_do_enqueue(). We do not anticipate needing this flexibility for dynamic nfsd thread management support. As a micro-optimization, remove .svo_enqueue_xprt because Spectre/Meltdown makes virtual function calls more costly. This change essentially reverts commit b9e13cdfac70 ("nfsd/sunrpc: turn enqueueing a svc_xprt into a svc_serv operation"). Signed-off-by: Chuck Lever <[email protected]>
2022-02-28NFS: Calculate page offsets algorithmicallyTrond Myklebust1-5/+13
Instead of relying on counting the page offsets as we walk through the page cache, switch to calculating them algorithmically. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28NFS: Use kzalloc() to avoid initialising the nfs_open_dir_contextTrond Myklebust1-7/+4
Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28NFS: Initialise the readdir verifier as best we can in nfs_opendir()Trond Myklebust1-0/+1
For the purpose of ensuring that opendir() followed by seekdir() work as correctly as possible, try to initialise the readdir verifier in nfs_opendir(). Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28NFS: Trace lookup revalidation failureTrond Myklebust1-12/+5
Enable tracing of lookup revalidation failures. Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28NFS: Return valid errors from nfs2/3_decode_dirent()Trond Myklebust2-16/+7
Valid return values for decode_dirent() callback functions are: 0: Success -EBADCOOKIE: End of directory -EAGAIN: End of xdr_stream All errors need to map into one of those three values. Fixes: 573c4e1ef53a ("NFS: Simplify ->decode_dirent() calling sequence") Signed-off-by: Trond Myklebust <[email protected]>
2022-02-28Revert "NFSv4: use unique client identifiers in network namespaces"Trond Myklebust1-14/+0
This reverts commit 50c790a0b69bdc420f00f30bdf348d6c90194c78. The functionality is believed to be capable of causing regressions in existing setups, so the author has requested that it be reverted. Signed-off-by: Trond Myklebust <[email protected]>