Age | Commit message (Collapse) | Author | Files | Lines |
|
In the very rare case where the readdir reply contains multiple cookies
that map to the same hash value, we can end up deadlocking waiting for a
page lock that we already hold. In this case we should fail the page
lock by using grab_cache_page_nowait().
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Pull block updates from Jens Axboe:
- BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)
- blk-rq-qos completion fix (Tejun)
- blk-cgroup merge fix (Tejun)
- Add offline error return value to distinguish it from an IO error on
the device (Song)
- IO stats fixes (Zhang, Christoph)
- blkcg refcount fixes (Ming, Yu)
- Fix for indefinite dispatch loop softlockup (Shin'ichiro)
- blk-mq hardware queue management improvements (Ming)
- sbitmap dead code removal (Ming, John)
- Plugging merge improvements (me)
- Show blk-crypto capabilities in sysfs (Eric)
- Multiple delayed queue run improvement (David)
- Block throttling fixes (Ming)
- Start deprecating auto module loading based on dev_t (Christoph)
- bio allocation improvements (Christoph, Chaitanya)
- Get rid of bio_devname (Christoph)
- bio clone improvements (Christoph)
- Block plugging improvements (Christoph)
- Get rid of genhd.h header (Christoph)
- Ensure drivers use appropriate flush helpers (Christoph)
- Refcounting improvements (Christoph)
- Queue initialization and teardown improvements (Ming, Christoph)
- Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng,
Lukas, Nian, Yang, Eric, Chengming)
* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
block: cancel all throttled bios in del_gendisk()
block: let blkcg_gq grab request queue's refcnt
block: avoid use-after-free on throttle data
block: limit request dispatch loop duration
block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
sr: simplify the local variable initialization in sr_block_open()
block: don't merge across cgroup boundaries if blkcg is enabled
block: fix rq-qos breakage from skipping rq_qos_done_bio()
block: flush plug based on hardware and software queue order
block: ensure plug merging checks the correct queue at least once
block: move rq_qos_exit() into disk_release()
block: do more work in elevator_exit
block: move blk_exit_queue into disk_release
block: move q_usage_counter release into blk_queue_release
block: don't remove hctx debugfs dir from blk_mq_exit_queue
block: move blkcg initialization/destroy into disk allocation/release handler
sr: implement ->free_disk to simplify refcounting
sd: implement ->free_disk to simplify refcounting
sd: delay calling free_opal_dev
sd: call sd_zbc_release_disk before releasing the scsi_device reference
...
|
|
Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
toggle whether or not the client will engage in actively discovery
of trunking locations.
v2 make notrunkdiscovery default
Signed-off-by: Olga Kornievskaia <[email protected]>
Fixes: 1976b2b31462 ("NFSv4.1 query for fs_location attr on a new file system")
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Export fscache_end_operation() to avoid code duplication.
Besides, considering the paired fscache_begin_read_operation() is
already exported, it shall make sense to also export
fscache_end_operation().
Signed-off-by: Jeffle Xu <[email protected]>
Signed-off-by: David Howells <[email protected]>
Reviewed-by: Jeff Layton <[email protected]>
cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]/ # Jeffle's v4
Link: https://lore.kernel.org/r/164622971432.3564931.12184135678781328146.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/164678190346.1200972.7453733431978569479.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/164692888334.2099075.5166283293894267365.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/[email protected]/ # v5
|
|
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Tested-by: Mike Marshall <[email protected]> # orangefs
Tested-by: David Howells <[email protected]> # afs
|
|
We don't need to use page_file_mapping() here because launder_folio
is never called for swap cache pages. We also don't need to
cast an loff_t in order to print it.
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Tested-by: Mike Marshall <[email protected]> # orangefs
Tested-by: David Howells <[email protected]> # afs
|
|
Print the folio index instead of the pointer, since this is more
useful. We also don't need to use page_file_mapping() as we do not
invalidate swapcache pages. Since this is the only caller of
nfs_wb_page_cancel(), convert it to nfs_wb_folio_cancel().
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Acked-by: Damien Le Moal <[email protected]>
Tested-by: Mike Marshall <[email protected]> # orangefs
Tested-by: David Howells <[email protected]> # afs
|
|
The commit handling code is not safe against memory-pressure deadlocks
when writing to swap. In particular, nfs_commitdata_alloc() blocks
indefinitely waiting for memory, and this can consume all available
workqueue threads.
swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
that a stable write should be used if the write fits in a single
request, and it normally does. However if we ever swap with a small
wsize, or gather unusually large numbers of pages for a single write,
this might change.
For safety, make it explicit in the code that direct writes used for swap
must always use FLUSH_STABLE.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
possible deadlocks with "fs_reclaim". These deadlocks could, I believe,
eventuate if a buffered read on the swapfile was attempted.
We don't need coherence with the page cache for a swap file, and
buffered writes are forbidden anyway. There is no other need for
i_rwsem during direct IO. So never take it for swap_rw()
2/ generic_write_checks() explicitly forbids writes to swap, and
performs checks that are not needed for swap. So bypass it
for swap_rw().
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
If we are swapping over NFSv4, we may not be able to allocate memory to
start the state-manager thread at the time when we need it.
So keep it always running when swap is enabled, and just signal it to
start.
This requires updating and testing the cl_swapper count on the root
rpc_clnt after following all ->cl_parent links.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
rpc tasks can be marked as RPC_TASK_SWAPPER. This causes GFP_MEMALLOC
to be used for some allocations. This is needed in some cases, but not
in all where it is currently provided, and in some where it isn't
provided.
Currently *all* tasks associated with a rpc_client on which swap is
enabled get the flag and hence some GFP_MEMALLOC support.
GFP_MEMALLOC is provided for ->buf_alloc() but only swap-writes need it.
However xdr_alloc_bvec does not get GFP_MEMALLOC - though it often does
need it.
xdr_alloc_bvec is called while the XPRT_LOCK is held. If this blocks,
then it blocks all other queued tasks. So this allocation needs
GFP_MEMALLOC for *all* requests, not just writes, when the xprt is used
for any swap writes.
Similarly, if the transport is not connected, that will block all
requests including swap writes, so memory allocations should get
GFP_MEMALLOC if swap writes are possible.
So with this patch:
1/ we ONLY set RPC_TASK_SWAPPER for swap writes.
2/ __rpc_execute() sets PF_MEMALLOC while handling any task
with RPC_TASK_SWAPPER set, or when handling any task that
holds the XPRT_LOCKED lock on an xprt used for swap.
This removes the need for the RPC_IS_SWAPPER() test
in ->buf_alloc handlers.
3/ xprt_prepare_transmit() sets PF_MEMALLOC after locking
any task to a swapper xprt. __rpc_execute() will clear it.
3/ PF_MEMALLOC is set for all the connect workers.
Reviewed-by: Chuck Lever <[email protected]> (for xprtrdma parts)
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
NFS_RPC_SWAPFLAGS is only used for READ requests.
It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to
requests. This is not needed for swap READ - though it is for writes
where it is set via a different mechanism.
RPC_TASK_ROOTCREDS causes the 'machine' credential to be used.
This is not needed as the root credential is saved when the swap file is
opened, and this is used for all IO.
So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of
RPC_TASK_ROOTCREDS, that isn't needed either.
Remove both.
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
This code is pointless as IS_SWAPFILE is always defined.
So remove it.
Suggested-by: Mark Hemment <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
The fscache cookie APIs including fscache_acquire_cookie() and
fscache_relinquish_cookie() now have very good tracing. Thus,
there is no real need for dfprintks in the NFS fscache interface.
The NFS fscache interface has removed all dfprintks so remove the
NFSDBG_FSCACHE defines.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Most of fscache and other NFS IO paths are now using tracepoints.
Remove the dfprintks in the NFS fscache read/write page functions
and replace with tracepoints at the begin and end of the functions.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Rename NFS fscache functions in a more consistent fashion
to better reflect when we read from and write to fscache.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
A number of places in the fscache interface used nfs_inode when inode could
be used, simplifying the code.
Signed-off-by: Dave Wysochanski <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
In the presence of trunking transports, it's helpful to make sure
that during the migration event, the GETATTR for fs_location attribute
happens on the main transport.
Signed-off-by: Olga Kornievskaia <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
[You don't often get email from [email protected]. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]
Overflow check in not needed anymore after we switch to kmalloc_array().
Signed-off-by: Alexey Khoroshilov <[email protected]>
Fixes: a4f743a6bb20 ("NFSv4.1: Convert open-coded array allocation calls to kmalloc_array()")
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Even if we're not able to cache all the entries in the readdir buffer,
let's ensure that we do prime the dcache.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Replace the 'previous cookie' field in struct nfs_entry with the
array->last_cookie.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Avoid clearing the entire readdir page cache if we're just doing forced
readdirplus for the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Instead of using a linear index to address the pages, use the cookie of
the first entry, since that is what we use to match the page anyway.
This allows us to avoid re-reading the entire cache on a seekdir() type
of operation. The latter is very common when re-exporting NFS, and is a
major performance drain.
The change does affect our duplicate cookie detection, since we can no
longer rely on the page index as a linear offset for detecting whether
we looped backwards. However since we no longer do a linear search
through all the pages on each call to nfs_readdir(), this is less of a
concern than it was previously.
The other downside is that invalidate_mapping_pages() no longer can use
the page index to avoid clearing pages that have been read. A subsequent
patch will restore the functionality this provides to the 'ls -l'
heuristic.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Enable tracking of when the readdirplus heuristic causes a page cache
invalidation.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Trace the effects of readdirplus on attribute and dentry revalidation.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Add tracing to track how often the client goes to the server for updated
readdir information.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
If the revalidation was forced, due to the presence of a LOOKUP_EXCL or
a LOOKUP_REVAL flag, then readdirplus won't help. It also can't help
when we're doing a path component lookup.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
If the filesystem is case insensitive, then readdirplus can't help with
cache misses, since it won't return case folded variants of the filename.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Instead of pretending that we know the ratio of directory info vs
readdirplus attribute info, just set the 'dircount' field to the same
value as the 'maxcount' field.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
If attribute caching is turned off, then use of readdirplus is not going
to help stat() performance.
Readdirplus also doesn't help if a file is being written to, since we
will have to flush those writes in order to sync the mtime/ctime.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
The heuristic for readdirplus is designed to try to detect 'ls -l' and
similar patterns. It does so by looking for cache hit/miss patterns in
both the attribute cache and in the dcache of the files in a given
directory, and then sets a flag for the readdirplus code to interpret.
The problem with this approach is that a single attribute or dcache miss
can cause the NFS code to force a refresh of the attributes for the
entire set of files contained in the directory.
To be able to make a more nuanced decision, let's sample the number of
hits and misses in the set of open directory descriptors. That allows us
to set thresholds at which we start preferring READDIRPLUS over regular
READDIR, or at which we start to force a re-read of the remaining
readdir cache using READDIRPLUS.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
When reading a very large directory, we want to try to keep the page
cache up to date if doing so is inexpensive. With the change to allow
readdir to continue reading even when the cache is incomplete, we no
longer need to fall back to uncached readdir in order to scale to large
directories.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Recent changes to readdir mean that we can cope with partially filled
page cache entries, so we no longer need to rely on looping in
nfs_readdir_xdr_to_array().
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Ensure that if the cookie verifier changes when we use the zero-valued
cookie, then we invalidate any cached pages.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
The current NFS readdir code will always try to maximise the amount of
readahead it performs on the assumption that we can cache anything that
isn't immediately read by the process.
There are several cases where this assumption breaks down, including
when the 'ls -l' heuristic kicks in to try to force use of readdirplus
as a batch replacement for lookup/getattr.
This patch therefore tries to tone down the amount of readahead we
perform, and adjust it to try to match the amount of data being
requested by user space.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
When we hit the end of the data in the readdir page, we don't want to
start filling a new page, unless this one is full.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
If the page cache entry that was last read gets invalidated for some
reason, then make sure we can re-create it on the next call to readdir.
This, combined with the cache page validation, allows us to reuse the
cached value of page-index on successive calls to nfs_readdir.
Credit is due to Benjamin Coddington for showing that the concept works,
and that it allows for improved cache sharing between processes even in
the case where pages are lost due to LRU or active invalidation.
Suggested-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Use the change attribute and the first cookie in a directory page cache
entry to validate that the page is up to date.
Suggested-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Hoist svo_function back into svc_serv and remove struct
svc_serv_ops, since the struct is now devoid of fields.
Signed-off-by: Chuck Lever <[email protected]>
|
|
struct svc_serv_ops is about to be removed.
Neil Brown says:
> I suspect svo_module can go as well - I don't think the thread is
> ever the thing that primarily keeps a module active.
A random sample of kthread_create() callers shows sunrpc is the only
one that manages module reference count in this way.
Suggested-by: Neil Brown <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
|
|
Clean up: svc_shutdown_net() now does nothing but call
svc_close_net(). Replace all external call sites.
svc_close_net() is renamed to be the inverse of svc_xprt_create().
Signed-off-by: Chuck Lever <[email protected]>
|
|
Clean up: Use the "svc_xprt_<task>" function naming convention as
is used for other external APIs.
Signed-off-by: Chuck Lever <[email protected]>
|
|
We have never been able to track down and address the underlying
cause of the performance issues with workqueue-based service
support. svo_enqueue_xprt is called multiple times per RPC, so
it adds instruction path length, but always ends up at the same
function: svc_xprt_do_enqueue(). We do not anticipate needing
this flexibility for dynamic nfsd thread management support.
As a micro-optimization, remove .svo_enqueue_xprt because
Spectre/Meltdown makes virtual function calls more costly.
This change essentially reverts commit b9e13cdfac70 ("nfsd/sunrpc:
turn enqueueing a svc_xprt into a svc_serv operation").
Signed-off-by: Chuck Lever <[email protected]>
|
|
Instead of relying on counting the page offsets as we walk through the
page cache, switch to calculating them algorithmically.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Signed-off-by: Trond Myklebust <[email protected]>
|
|
For the purpose of ensuring that opendir() followed by seekdir() work as
correctly as possible, try to initialise the readdir verifier in
nfs_opendir().
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Enable tracing of lookup revalidation failures.
Signed-off-by: Trond Myklebust <[email protected]>
|
|
Valid return values for decode_dirent() callback functions are:
0: Success
-EBADCOOKIE: End of directory
-EAGAIN: End of xdr_stream
All errors need to map into one of those three values.
Fixes: 573c4e1ef53a ("NFS: Simplify ->decode_dirent() calling sequence")
Signed-off-by: Trond Myklebust <[email protected]>
|
|
This reverts commit 50c790a0b69bdc420f00f30bdf348d6c90194c78.
The functionality is believed to be capable of causing regressions in
existing setups, so the author has requested that it be reverted.
Signed-off-by: Trond Myklebust <[email protected]>
|